Multi-threaded processors
Hung-Wei Tseng x Dean Tullsen

OoO SuperScalar

• Fetch instructions into the instruction window
• Register renaming to eliminate false dependencies
• Schedule an instruction to the execution stage (issue) whenever all of its data inputs are ready
• Put the instruction in the reorder buffer, and commit it only if (1) it is not mis-predicted and (2) all instructions prior to it have committed (a code sketch of this discipline follows)
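A minimal code sketch of this issue/commit discipline. It is entirely illustrative: the structures and the three-instruction trace are invented for this example, not taken from any real pipeline.

/* Minimal sketch of out-of-order issue with in-order commit.
 * Illustrative only; names and structure are invented for this example. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    const char *name;
    int  pending_srcs;   /* source operands not yet produced */
    bool issued, done, mispredicted;
} RobEntry;

int main(void) {
    /* The add depends on both loads; only the loads can issue first. */
    RobEntry rob[] = {
        {"lw  $t1,0($a0)",  0, false, false, false},
        {"lw  $t2,4($a0)",  0, false, false, false},
        {"add $v0,$t1,$t2", 2, false, false, false},
    };
    int n = 3, head = 0;

    for (int cycle = 1; head < n; cycle++) {
        /* Issue: any instruction whose inputs are all ready may go,
         * regardless of program order. */
        for (int i = 0; i < n; i++)
            if (!rob[i].issued && rob[i].pending_srcs == 0) {
                rob[i].issued = true;
                printf("cycle %d: issue  %s\n", cycle, rob[i].name);
            }
        /* Pretend issued work finishes and wakes up dependents
         * (one completion per cycle, with a crude wakeup, for brevity). */
        for (int i = 0; i < n; i++)
            if (rob[i].issued && !rob[i].done) {
                rob[i].done = true;
                for (int j = i + 1; j < n; j++)
                    if (rob[j].pending_srcs > 0) rob[j].pending_srcs--;
                break;
            }
        /* Commit: only the oldest, finished, correctly-predicted
         * instruction may retire, keeping architectural state in order. */
        while (head < n && rob[head].done && !rob[head].mispredicted) {
            printf("cycle %d: commit %s\n", cycle, rob[head].name);
            head++;
        }
    }
    return 0;
}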

2 INTEL® 64 AND IA-32 PROCESSOR ARCHITECTURES

Intel SkyLake architecture

2.1 THE SKYLAKE MICROARCHITECTURE
The Skylake microarchitecture builds on the successes of the Haswell and Broadwell microarchitectures. The basic pipeline functionality of the Skylake microarchitecture is depicted in Figure 2-1.

[Figure 2-1. CPU Core Pipeline Functionality of the Skylake Microarchitecture: the front end (32K L1 instruction cache, BPU, legacy decode pipeline at 5 uops/cycle, decoded icache/DSB at 6 uops/cycle, MSROM at 4 uops/cycle) feeds the Instruction Decode Queue (IDQ, or micro-op queue); Allocate/Rename/Retire/Move Elimination/Zero Idiom; a scheduler issues to ports 0/1/5/6 (integer ALUs, vector FMA/MUL/ALU/shuffle, branches) and ports 2/3/4/7 (load/store address, store data), backed by a 32K L1 data cache and a 256K unified L2 cache.]

The Skylake microarchitecture offers the following enhancements:
• Larger internal buffers to enable deeper OOO execution and higher cache bandwidth.
• Improved front end throughput.
• Improved branch predictor.
• Improved divider throughput and latency.
• Lower power consumption.
• Improved SMT performance with Hyper-Threading Technology.
• Balanced floating-point ADD, MUL, FMA throughput and latency.

The microarchitecture supports flexible integration of multiple processor cores with a shared uncore subsystem consisting of a number of components including a ring interconnect to multiple slices of L3 (an off-die L4 is optional), processor graphics, integrated memory controller, interconnect fabrics, etc. A four-core configuration can be supported similar to the arrangement shown in Figure 2-3.

3 Dynamic execution with register renaming

• Consider the following dynamic instructions (your code looks like this when performing "linked list" traversal):

1: lw $t1, 0($a0)
2: lw $a0, 4($a0)
3: add $v0, $v0, $t1
4: bne $a0, $zero, LOOP
5: lw $t1, 0($a0)
6: lw $t2, 4($a0)
7: add $v0, $v0, $t1
8: bne $t2, $zero, LOOP

• Assume a processor with unlimited issue width and physical registers that can fetch up to 4 instructions per cycle and takes 2 cycles to execute a memory instruction. How many cycles does it take to issue all instructions?

A. 1  B. 2  C. 3  D. 4  E. 5

cycle #1: 1, 2
cycle #2: wasted (instructions 3-6 all wait on the 2-cycle loads)
cycle #3: 3, 4, 5, 6
cycle #4: wasted
cycle #5: 7, 8

4 Outline

• Simultaneous multithreading
• Chip multiprocessor
• Parallel programming

5 Simultaneous Multi-Threading (SMT)

6 Simultaneous Multi-Threading (SMT)
• Fetch instructions from different threads/processes to fill the underutilized parts of the pipeline
• Exploit "thread-level parallelism" (TLP) to solve the problem of insufficient ILP in a single thread
• Keep separate architectural state for each thread:
  • PC
  • Register files
  • Reorder buffer
• Create an illusion of multiple processors for OSes
• The rest of the superscalar processor hardware is shared
• Invented by Dean Tullsen, now a professor in UCSD CSE! You may take his CSE148 in Spring 2015

7 Simplified SMT-OOO pipeline

[Diagram: four per-thread front ends (Instruction Fetch: T0-T3) feed shared instruction decode, register renaming logic, scheduling, execution units, and the data cache; each thread retires through its own reorder buffer (ROB: T0-T3).]
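As a data-layout sketch of the diagram above (sizes and names are illustrative, not from any real design), the state that must be duplicated per hardware thread is small compared with everything that is shared:

/* Sketch of SMT state partitioning; sizes and names are illustrative. */
#include <stdio.h>

#define HW_THREADS  4
#define ARCH_REGS   32
#define ROB_ENTRIES 64

typedef struct {                       /* duplicated per hardware thread */
    unsigned long pc;                  /* program counter                */
    long architectural_regs[ARCH_REGS];
    int  rob_entries[ROB_ENTRIES];     /* this thread's reorder buffer   */
} ThreadContext;

typedef struct {                       /* one physical core */
    ThreadContext threads[HW_THREADS]; /* the per-thread state           */
    /* Everything else -- fetch/decode logic, physical register file,
     * scheduler, execution units, caches -- is shared among threads. */
} SmtCore;

int main(void) {
    printf("per-thread state: %zu bytes\n", sizeof(ThreadContext));
    return 0;
}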

8 Simultaneous Multi-Threading (SMT)

• Fetch 2 instructions from each thread/process at each cycle to fill the underutilized parts of the pipeline
• Issue width is still 2; commit width is still 4

T1 1: lw $t1, 0($a0)        IF ID Ren Sch EXE MEM C
T1 2: lw $a0, 0($t1)        IF ID Ren Sch Sch Sch EXE MEM C
T2 1: sll $t0, $a1, 2       IF ID Ren Sch EXE MEM C
T2 2: add $t1, $a0, $t0     IF ID Ren Sch Sch EXE C
T1 3: addi $a1, $a1, -1     IF ID Ren Sch EXE C C C
T1 4: bne $a1, $zero, LOOP  IF ID Ren Sch Sch EXE C C
T2 3: lw $v0, 0($t1)        IF ID Ren Sch Sch Sch EXE MEM C
T2 4: addi $t1, $t1, 4      IF ID Ren Sch Sch Sch EXE C C
T2 5: add $v0, $v0, $t2     IF ID Ren Sch Sch Sch Sch EXE C
T2 6: jr $ra                IF ID Ren Sch Sch Sch EXE C C

The processor can execute 6 instructions before the bne is resolved.

9 Simultaneous Multithreading

• SMT helps cover the long memory latency problem
• But SMT is still a "superscalar" processor
  • Power consumption / hardware complexity can still be high
  • Think about the Pentium 4

10 SMT

• Improves the throughput of execution
• May increase the latency of a single thread
• Less branch penalty per thread
• Increases hardware utilization
• Simple hardware design: only need to duplicate PC/register files
• Real cases:
  • Intel Hyper-Threading (supports up to two threads per core)
  • Intel Pentium 4, Intel Atom, Intel Core i7
  • AMD Zen

11 SMT
• How many of the following about SMT are correct?
  • SMT makes processors with deep pipelines more tolerant of mis-predicted branches
  • SMT can improve the throughput of a single-threaded application (it may actually hurt, because the thread shares resources with other threads)
  • SMT processors can better utilize hardware during cache misses compared with superscalar processors with the same issue width
  • SMT processors can have higher cache miss rates compared with superscalar processors with the same cache sizes when executing the same set of applications
A. 0  B. 1  C. 2  D. 3  E. 4

12 Chip multiprocessor (CMP)

13 A wide-issue processor or multiple narrower-issue processors
• What can you do within a 21 mm × 21 mm area?

[Two die floorplans, drawn to the same 21 mm × 21 mm scale. Figure 2: Floorplan for the six-issue dynamic superscalar microprocessor (one wide core: instruction fetch, 32 KB instruction and data caches, TLB, instruction decode & rename, reorder buffer, instruction queues, out-of-order logic, integer and floating-point units, external interface, on-chip L2). Figure 3: Floorplan for the four-way single-chip multiprocessor (four 2-issue processors, each with private 8 KB I- and D-caches, sharing the external interface and the on-chip L2).]

A 6-issue superscalar processor     4 x 2-issue superscalar processors
3 integer ALUs                      4*1 integer ALUs
3 floating-point ALUs               4*1 floating-point ALUs
3 load/store units                  4*1 load/store units

You will have more ALUs if you choose this (the multiprocessor)!

From the paper page reproduced on the slide (de-duplicated):

"…prevents the instruction fetch mechanism from becoming a bottleneck since the 6-way execution engine requires a much higher instruction fetch bandwidth than the 2-way processors used in the MP architecture.

The on-chip memory hierarchy is similar to the Alpha 21164 — a small, fast level one (L1) cache backed up by a large on-chip level two (L2) cache. The wide issue width requires the L1 cache to support wide instruction fetches from the instruction cache and multiple loads from the data cache during each cycle. The two-way set associative 32 KB L1 data cache is banked eight ways into eight small, single-ported, independent 4 KB cache banks, each of which handles one access every 2 ns processor cycle. However, the additional overhead of the bank control logic and crossbar required to arbitrate between the multiple requests sharing the 8 data cache banks adds another cycle to the latency of the L1 cache, and increases the area by 25%. Therefore, our modeled L1 cache has a hit time of 2 cycles. Backing up the 32 KB L1 caches is a large, unified, 256 KB L2 cache that takes 4 cycles to access. These latencies are simple extensions of the times obtained for the L1 caches of current Alpha microprocessors [4], using a 0.25 µm process technology.

4.2 4 x 2-way Superscalar Multiprocessor Architecture

The MP architecture is made up of four 2-way superscalar processors interconnected by a crossbar that allows the processors to share the L2 cache. On the die, the four processors are arranged in a grid with the L2 cache at one end, as shown in Figure 3. Internally, each of the processors has a register renaming buffer that is much more limited than the one in the 6-way architecture, since each CPU only has an 8-entry instruction buffer. We also quartered the size of the branch prediction mechanisms in the fetch units, to 512 BTB entries and 8 call-return stack entries. After the area adjustments caused by these factors are accounted for, each of the four processors is less than one-fourth the size of the 6-way SS processor, as shown in Table 3. The number of execution units actually increases in the MP because the 6-way processor had three units of each type, while the 4-way MP must have four — one for each CPU. On the other hand, the issue logic becomes dramatically smaller, due to the decrease in instruction buffer ports and the smaller number of entries in each instruction buffer. The scaling factors of these two units balance each other out, leaving the entire processor very close to one-fourth of the size of the 6-way processor.

The on-chip cache hierarchy of the multiprocessor is significantly different from the cache hierarchy of the 6-way superscalar processor. Each of the 4 processors has its own single-banked and single-ported 8 KB instruction and data caches that can both be accessed in a single 2 ns cycle. Since each cache can only be accessed by a single processor with a single load/store unit, no additional overhead is incurred to handle arbitration among independent memory-access units. However, since the four processors now share a single L2 cache, that cache requires an extra cycle of latency during every access to allow time for interprocessor arbitration and crossbar delay. We model this additional L2 delay by penalizing the MP an additional cycle on every L2 cache access, resulting in a 5 cycle L2 hit time.

5 Simulation Methodology

Accurately evaluating the performance of the two microarchitectures requires a way of simulating the environment in which we would expect these architectures to be used in real systems. In this section we describe the simulation environment and the applications used in this study.

5.1 Simulation Environment

We execute the applications in the SimOS simulation environment [18]. SimOS models the CPUs, memory hierarchy and I/O devices."

14 Die photo of a CMP processor

15 CMP advantages

• How many of the following are advantages of CMP over a traditional superscalar processor?
  • CMP can provide better energy efficiency within the same area
  • CMP can deliver better instruction throughput within the same die area (chip size)
  • CMP can achieve better ILP for each running thread
  • CMP can improve the performance of a single-threaded application without modifying code
A. 0  B. 1  C. 2  D. 3  E. 4

16 CMP vs. SMT

• Assume both application X and application Y have a similar instruction mix, say 60% ALU, 20% load/store, and 20% branches. Consider two processors:

P1: CMP with a 2-issue pipeline on each core. Each core has a private 32KB L1 D-cache.

P2: SMT with a 4-issue pipeline and a 64KB L1 D-cache.

Which one do you think is better?  A. P1  B. P2

17 Speedup a single application on multi-threaded processors

18 Parallel programming

• To exploit CMP/SMT parallelism you need to break your computation into multiple "processes" or multiple "threads"
• Processes (in OS/software systems)
  • Separate programs actually running (not sitting idle) on your computer at the same time
  • Each process has its own virtual memory space, and you must explicitly exchange data using inter-process communication APIs
• Threads (in OS/software systems)
  • Independent portions of your program that can run in parallel
  • All threads share the same virtual memory space
• We will refer to these collectively as "threads"
• A typical user system might have 1-8 actively running threads
• Servers can have more if needed (the sysadmins will hopefully configure it that way)

19 Create threads/processes

• This is the only way we can improve a single application's performance on CMP/SMT
• You can use fork() to create a child process (CSE120)
• Or you can use pthreads or OpenMP to compose multi-threaded programs, as in the examples below
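For the OpenMP route, here is a minimal sketch of a parallel matrix multiplication. The names N, A, B, C, and parallel_mm are placeholders invented for this example, not from the course demo; the pthread version from the slide follows it.

/* OpenMP version of a parallel matrix multiplication: the pragma
 * spawns a team of threads and joins them at the end of the loop.
 * Build with: gcc -fopenmp mm.c */
#include <omp.h>

#define N 512
static double A[N][N], B[N][N], C[N][N];

void parallel_mm(void)
{
    #pragma omp parallel for          /* one row range per thread */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}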

/* Do matrix multiplication */
for (i = 0; i < NUM_OF_THREADS; i++) {
    tids[i] = i;
    /* Spawn a thread */
    pthread_create(&thread[i], NULL, threaded_blockmm, &tids[i]);
}
for (i = 0; i < NUM_OF_THREADS; i++)
    pthread_join(thread[i], NULL);   /* Synchronize: wait for a thread to terminate */

20 Is it easy?

• Threads are hard to find, and some algorithms are hard to parallelize
  • Finite state machines
  • N-body
  • Linear programming
  • Circuit Value Problem
  • LZW data compression
• Data sharing among threads
  • Needs architectural or API support
  • You need to use locks to control race conditions
  • Hard to debug!

21 Supporting shared memory model

• Provide a single memory space that all processors can share
• All threads within the same program share the same address space
• Threads communicate with each other using shared variables in memory
• Provide the same memory abstraction as single-threaded programming

22 Simple idea...

• Connect all processors and shared memory to a bus
• Processor speed will be limited because all devices on a bus must run at the same speed

Core 0 Core 1 Core 2 Core 3

Bus

Shared $

23 Memory hierarchy on CMP

• Each processor has its own local cache

Core 0    Core 1
Local $   Local $
        Bus
Local $   Local $    Shared $
Core 2    Core 3

24 Cache on Multiprocessor

• Coherency
  • Guarantees all processors see the same value for a variable/memory address in the system when the processors need the value at the same time
  • Defines what value should be seen
• Consistency
  • All threads see the changes of data in the same order
  • Defines when a memory operation should be done

25 Simple cache coherency protocol

• Snooping protocol
  • Each processor broadcasts/listens to cache misses
• State associated with each block (cache line):
  • Invalid
    • The data in this block is invalid
  • Shared
    • The processor can read the data
    • The data may also exist in other processors' caches
  • Exclusive
    • The processor has full permission for the data
    • The processor is the only one that has the up-to-date data
(A code sketch of these states and transitions follows.)
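The per-block state machine described above can be written out directly. This is a sketch of an MSI-style protocol consistent with the slides; the event names and write_back flag are illustrative, not from any real implementation.

/* Sketch of the per-block snooping state machine (MSI-style). */
typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;
typedef enum {
    PR_READ, PR_WRITE,             /* requests from the local processor */
    BUS_READ_MISS, BUS_WRITE_MISS  /* misses snooped on the bus         */
} Event;

BlockState next_state(BlockState s, Event e, int *write_back)
{
    *write_back = 0;               /* set when dirty data must be written back */
    switch (s) {
    case INVALID:
        if (e == PR_READ)  return SHARED;     /* read miss: fetch data      */
        if (e == PR_WRITE) return EXCLUSIVE;  /* write miss: take ownership */
        return INVALID;
    case SHARED:
        if (e == PR_WRITE)       return EXCLUSIVE; /* upgrade to write      */
        if (e == BUS_WRITE_MISS) return INVALID;   /* another core writes   */
        return SHARED;                             /* read hit/miss: stay   */
    case EXCLUSIVE:
        if (e == BUS_READ_MISS)  { *write_back = 1; return SHARED;  }
        if (e == BUS_WRITE_MISS) { *write_back = 1; return INVALID; }
        return EXCLUSIVE;                          /* local hits: stay      */
    }
    return s;
}

For example, a bus write miss snooped while a block is Exclusive forces a write-back and invalidates the local copy, which is exactly the sequence worked through on the next slides.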

26 Cache coherency practice

• What happens when core 0 modifies 0x1000?

Core 0 Core 1 Core 2 Core 3

Local $ state of 0x1000 in each core (before → after):
Core 0: Shared → Excl.
Core 1: Shared → Invalid
Core 2: Shared → Invalid
Core 3: Shared → Invalid

Write miss 0x1000 Bus

Shared $ 27 Cache coherency practice

• Then, what happens when core 2 reads 0x1000?

Core 0 Core 1 Core 2 Core 3

Local $ state of 0x1000 in each core (before → after):
Core 0: Excl. → Shared
Core 1: Invalid
Core 2: Invalid → Shared
Core 3: Invalid

Bus: read miss 0x1000 (core 2) → write back 0x1000 (core 0) → fetch 0x1000 (core 2)

Shared $

28 Simple cache coherency protocol

[State machine, redrawn as a transition list:]
• Invalid → Shared: read miss (processor)
• Invalid → Exclusive: write miss (processor)
• Invalid → Invalid: read/write miss (bus)
• Shared → Shared: read miss/hit
• Shared → Exclusive: write request (processor)
• Shared → Invalid: write miss (bus)
• Exclusive → Shared: read miss (bus), write back data
• Exclusive → Invalid: write miss (bus), write back data
• Exclusive → Exclusive: read/write hit

29 Cache coherency

• Assuming that we are running the following code on a CMP with some cache coherency protocol, which output is NOT possible? (a is initialized to 0)

thread 1:                thread 2:
while(1)                 while(1)
    printf("%d ", a);        a++;

A. 0 1 2 3 4 5 6 7 8 9
B. 1 2 5 9 3 6 8 10 12 13
C. 1 1 1 1 1 1 1 1 1 1
D. 1 1 1 1 1 1 1 1 1 100

30 It’s show time!

• Demo!

thread 1:                thread 2:
while(1)                 while(1)
    printf("%d ", a);        a++;
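A runnable reconstruction of what the demo program might look like (my sketch; the actual demo code is not on the slide). Depending on scheduling and coherence traffic, many output interleavings are possible.

/* Two threads share the counter a: one increments it, one prints it.
 * Build: gcc -pthread demo.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

volatile int a = 0;               /* shared variable */

void *printer(void *arg) {
    while (1) {
        printf("%d ", a);
        fflush(stdout);
        usleep(100000);           /* print ~10 times a second */
    }
    return NULL;
}

void *incrementer(void *arg) {
    while (1)
        a++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, printer, NULL);
    pthread_create(&t2, NULL, incrementer, NULL);
    sleep(2);                     /* run the race briefly, then exit */
    return 0;
}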

31 Cache coherency practice

• Now, what happens when core 2 writes 0x1004, which belongs to the same block as 0x1000?
• Then, if core 0 accesses 0x1000, it will be a miss!

Core 0 Core 1 Core 2 Core 3

Local $ state of the block containing 0x1000/0x1004 (before → after):
Core 0: Shared → Invalid
Core 1: Invalid
Core 2: Shared → Excl.
Core 3: Invalid

Write miss 0x1004 Bus

Shared $ 32 4C model

• 3Cs:
  • Compulsory, conflict, capacity
• Coherency miss:
  • A "block" invalidated because of sharing among processors
• True sharing
  • Processor A modifies X; processor B also wants to access X
• False sharing
  • Processor A modifies X; processor B wants to access Y. However, Y is invalidated because X and Y are in the same block! (see the sketch below)
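A small sketch of false sharing, assuming 64-byte cache blocks (the padding trick and iteration count are illustrative):

/* x and y are logically independent, but without the padding they sit
 * in the same cache block, so the two threads keep invalidating each
 * other's copy (false sharing). Build: gcc -pthread false_sharing.c */
#include <pthread.h>

#define ITERS 100000000L

struct {
    long x;                       /* only thread A touches x */
    /* char pad[64 - sizeof(long)];   uncomment: x and y move into
                                      different 64-byte blocks */
    long y;                       /* only thread B touches y */
} shared;

void *thread_a(void *arg) {
    for (long i = 0; i < ITERS; i++) shared.x++;
    return NULL;
}

void *thread_b(void *arg) {
    for (long i = 0; i < ITERS; i++) shared.y++;
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}

With the padding commented out, both counters contend for the same block even though no data is truly shared; timing the two versions shows the cost of the extra coherence misses.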

33 Hard to debug

/* Thread 1 is main()'s spin loop; thread 2 runs modifyloop(). */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

int loop;

void* modifyloop(void *x)       /* thread 2 */
{
    sleep(1);
    printf("Please input a number:\n");
    scanf("%d", &loop);
    return NULL;
}

int main()                      /* thread 1 */
{
    pthread_t thread;
    loop = 1;
    pthread_create(&thread, NULL, modifyloop, NULL);
    while(loop == 1) {          /* with optimization, the compiler may cache
                                   loop in a register, so this spin loop never
                                   sees thread 2's update; declaring loop
                                   volatile is the usual fix */
        continue;
    }
    pthread_join(thread, NULL);
    fprintf(stderr, "User input: %d\n", loop);
    return 0;
}

34 Performance of multi-threaded programs

• Multi-threaded block algorithm for matrix multiplication
• Demo!

35 Announcement

• CAPE (course evaluation)
  • Drop one more lowest quiz grade (currently 2) if the response rate is higher than 60%
• Homework #5 due next Tuesday
• Special office hour: 2p-3p this Friday @ CSE 3208

36 Hardware complexity of wide-issue processors
• Multi-ported register file ~ IW⁴
• IQ size ~ IW⁴
• Bypass logic ~ IW⁴
• Wiring delay

37 Chip Multiprocessor (CMP)

• Multiple processors on a single die!
• Increasing the frequency increases power consumption cubically (see the worked model below)
  • Doubling the frequency increases power by 8x; doubling the number of cores increases power by only 2x
• But process technology (Moore's law) allows us to cram more cores into a single chip!
• Instead of building a wide-issue processor, we can have multiple narrower-issue processors
  • e.g., one 4-issue vs. two 2-issue processors
• Now commonplace
• Improves the throughput of applications
• You may combine CMP and SMT together, like the Core i7
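The cubic claim follows from the first-order dynamic power model, assuming the supply voltage is scaled roughly in proportion to frequency:

\[ P_{\mathrm{dyn}} \propto C V^2 f, \qquad V \propto f \;\Rightarrow\; P_{\mathrm{dyn}} \propto f^3 \]

Doubling f therefore costs about 2^3 = 8x the power, while doubling the number of cores roughly doubles the switched capacitance C at fixed f and V, costing only about 2x.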
