A Superscalar Out-of-Order x86 Soft Processor for FPGA

Henry Wong University of Toronto, [email protected]

June 5, 2019 EE380

Hi!

● CPU architect, Intel Hillsboro
● Ph.D., University of Toronto
● Today: x86 OoO processor for FPGA (Ph.D. work)
  – Motivation
  – High-level design and results
  – Pipeline details and some circuits

FPGA: Field-Programmable Gate Array

● Is a digital circuit (logic gates and wires)
● Is field-programmable (at power-on, not in the fab)
● Pre-fab everything you’ll ever need
  – 20× area, 20× delay cost
  – Circuit building blocks are somewhat bigger than logic gates

[Figure: array of 6-LUT logic blocks]
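To make "somewhat bigger than logic gates" concrete, a 6-LUT can be modelled as a 64-entry truth table indexed by its six inputs. The sketch below is illustrative Python rather than vendor tooling, and the names `make_lut6`, `eval_lut6`, and `mux4` are hypothetical. It shows that a 4-to-1 multiplexer (four data inputs plus two selects) fits in a single 6-LUT.

```python
def make_lut6(func):
    """Build the 64-bit configuration word (truth table) for a 6-input function."""
    table = 0
    for index in range(64):
        inputs = [(index >> bit) & 1 for bit in range(6)]
        if func(*inputs):
            table |= 1 << index
    return table


def eval_lut6(table, inputs):
    """Look up one output bit: the six inputs form the table index."""
    index = sum(value << bit for bit, value in enumerate(inputs))
    return (table >> index) & 1


def mux4(d0, d1, d2, d3, s0, s1):
    """4-to-1 multiplexer: four data inputs plus two select inputs."""
    return (d0, d1, d2, d3)[s0 | (s1 << 1)]


# Any function of six inputs, including this mux, is one table lookup.
MUX4_TABLE = make_lut6(mux4)
```

Any 6-input function costs the same one LUT, which is why circuits in this talk are counted in LUT levels rather than gate delays.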

FPGA Soft Processors

● FPGA systems often have software components
  – Often running on a soft processor
● Need more performance?
  – Parallel code and hardware accelerators need effort
  – Less effort if soft processors got faster



Current Soft Processors

● CPUs from the FPGA vendors: Small, in-order, 1-way
  – Altera Nios II/f: 1100 ALUT (0.2% of Stratix IV), 240 MHz
  – Xilinx MicroBlaze: 2100 LUT (0.3% of Virtex-7), 246 MHz
● Some other examples:

Processor          Pipeline          ALUT     Frequency (Stratix IV)
Leon 3             In-order 1-way    5600     150 MHz
OpenRISC OR1200    In-order 1-way    3500     130 MHz
SPREX              In-order 1-way    1800     150 MHz (Stratix III)
BOOM (RISC-V)      OoO 2-way         ?        50 MHz (Zynq-7000)
OPA (RISC-V)       OoO 3-way         12600    215 MHz


Faster Soft Processors

● Faster means extracting parallelism
  – Thread: Multi-core, multi-thread
  – Data: Vectors, SIMD
  – Instruction: Pipelining, out-of-order ← Least user effort
● Challenge: What microarchitecture and circuits?
  – Increase instructions per clock cycle (IPC)
  – Without decreasing clock cycles per second (MHz)
  – And at an affordable FPGA hardware cost
● First OoO hard processors: ~1.5× IPC, ~1.0× frequency

ISA: Why not x86

● A few x86-specific features are really hard
  – Length decoding
  – Self-modifying code
  – Two destination operands (MUL, DIV)
  – ...but most hard features are not x86-specific
● Existing ISA: Can’t shift unwanted features to software
● RISC-V: didn’t exist in 2009
  – But is RISC even the right clean-slate design?


ISA: Why x86

● It’s easy to implement!
● The alternative:
  – Benchmarks
    ● “Way too much of my time is wasted on corraling benchmarks” ― Chris Celio, on porting SPEC to RISC-V, 2015
    ● “I’ll just use the binaries” ― me
  – OS, compiler, development tools
  – Web browser, JavaScript JIT, ...
  – Software in unexpected places: VGA BIOS
    ● Alpha uses x86 emulation in firmware

… I still can’t do any of this

Soft Processor Goals

● Performance: Two-issue, out-of-order
  – Reasonable per-clock performance (~2× vs. Nios II/f)
  – 300 MHz (25% higher than Nios II/f on Stratix IV)
● No software rewrite: 32-bit x86 instruction set
  – Binary compatible: OS, dev tools, user programs
  – Both system and user mode
● Tested with 16 OSes and >50 user benchmarks

How are FPGAs different from CMOS?

[Figure: bar chart of FPGA area and delay ratios vs. custom CMOS, on a log axis marked 2.2, 22, and 220. Circuits shown: multiplexer, multi-write port RAM (4r2w), content-addressable memory (CAM), adder, multiplier, SRAM (1rw). Whole processors: 22×.]

● Conclusion: ~Conventional processor microarchitecture, but try to avoid expensive circuits
  – Physical register file: Reduce SRAM ports, CAM size
  – Low-associativity caches: Reduce multiplexers, CAMs

Soft Processor Design Methodology

1. Bochs behavioural simulator
2. New detailed CPU pipeline model
3. Verify, optimize microarchitecture
4. Circuit design, optimization

Our Processor’s Microarchitecture

● Fetch: 8b 06 03 46 04 89 46 08
● Length decode: 8b 06 03 46 04 89 46 08
● Decode [1]
● Rename
● Schedule [2]
● Execute [3]
● Commit
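The Fetch and Length decode steps above start from the same eight bytes, and length decoding finds the instruction boundaries inside them. Below is a minimal sketch of that step in Python, under loud assumptions: it knows only the three one-byte opcodes that appear in the example (8b, 03, 89), assumes 32-bit addressing, and ignores prefixes and immediates. The function names are hypothetical and this is not the processor's actual decoder.

```python
# Opcodes from the slide's byte string; each is followed by a ModRM byte.
MODRM_OPCODES = {
    0x8B: "mov r32, r/m32",
    0x03: "add r32, r/m32",
    0x89: "mov r/m32, r32",
}


def modrm_length(code, pos):
    """Bytes used by ModRM plus optional SIB and displacement (32-bit addressing)."""
    modrm = code[pos]
    mod, rm = modrm >> 6, modrm & 7
    length = 1
    if mod == 3:                      # register operand: nothing follows
        return length
    if rm == 4:                       # SIB byte present
        sib = code[pos + 1]
        length += 1
        if mod == 0 and (sib & 7) == 5:
            length += 4               # SIB with no base register: disp32
    if mod == 0 and rm == 5:
        length += 4                   # absolute address: disp32
    elif mod == 1:
        length += 1                   # disp8
    elif mod == 2:
        length += 4                   # disp32
    return length


def split_instructions(code):
    """Split a byte string into instructions (toy subset, no prefixes)."""
    pos, instructions = 0, []
    while pos < len(code):
        if code[pos] not in MODRM_OPCODES:
            raise ValueError(f"opcode {code[pos]:#04x} not in this toy subset")
        length = 1 + modrm_length(code, pos + 1)
        instructions.append(code[pos:pos + length])
        pos += length
    return instructions


if __name__ == "__main__":
    for insn in split_instructions(bytes.fromhex("8b06034604894608")):
        print(insn.hex(" "), "->", MODRM_OPCODES[insn[0]])
```

On the slide's bytes this yields three instructions of 2, 3, and 3 bytes. Even this toy shows why length decoding is serial by nature: the start of each instruction depends on the length of the one before it.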

Main challenges (numbers match the markers on the pipeline stages):
1. Circuit complexity vs. support common case fast, worst-case correct
2. Circuit complexity vs. capacity (IPC)
3. Circuit complexity vs. pipeline latency (IPC)

Processor Area and Frequency

Component                Estimated area (ALM)   Frequency (MHz)
Decode                   * 6 000                247
Renaming                 1 900                  317
Scheduler                * 4 000                275
Register file            2 400                  260
Execution                2 300                  240
Memory system (Logic)    9 000                  200
Memory system (Caches)   5 000
Commit + ROB             * 2 000
Microcode                * 500
Total                    28 700                 200

* Area estimate for partial or unimplemented circuit

Slide annotations: “OoO stuff”, “Optimize more?”, and “7% of Stratix IV” (total area).

● Compare to Nios II/f with MMU and 32K L1I + 32K L1D
  – 4400 ALM (6.5×), 245 MHz (0.82×)

Per-clock performance (SPECint2000)

[Figure: bar chart of relative runtime in cycles on SPECint2000, normalized to this work; longer bars are slower; axis from 0.0 to 3.0]

Processor            Clock        Relative runtime (cycles)
VIA C3               550 MHz      2.68
Nios II/f            100 MHz      2.73
Pentium              200 MHz      2.06
Atom (Bonnell)       1600 MHz     1.63
AMD K6               166 MHz      1.46
Pentium 4            2800 MHz     1.58
ARM Cortex-A9        800 MHz      1.42
Pentium Pro          233 MHz      1.26
This work            ~200 MHz     1.00
Atom (Silvermont)    2400 MHz     0.99
VIA Nano             1000 MHz     0.87
Opteron K8           2800 MHz     0.91
AMD Piledriver       3500 MHz     0.68
Core 2 Q9550         3400 MHz     0.56
Haswell              4300 MHz     0.44

● Nios II/f: 2.73×
  – Wall-clock: 2.23×
● Pentium Pro (1995): 1.26×
  – Also 8 KB/256 KB cache
  – 3-way OoO
● Atom Silvermont (2013): 0.99×
  – Also 2-way OoO
  – 32 KB/2 MB cache
● Large performance increases vs. Nios II/f
● Comparable per-clock performance to similar x86 microarchitectures

Summary 1

● Designed microarchitecture and circuits for a superscalar out-of-order x86 soft processor
  – Both user and system modes
● Area: 6.5× Nios II/f, but affordable
  – 7% Stratix IV or 1.3% Stratix 10
● Performance: 2.2× Nios II/f on SPECint2000
  – Per-clock: ~2.7×
  – Frequency: ~0.8×
● Out-of-order increases soft processor performance without rewriting software
● x86 is feasible on FPGA
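The headline ratios in this summary follow from figures quoted earlier in the talk: wall-clock speedup is the per-clock speedup times the frequency ratio. The short Python check below uses only numbers from the slides (2.73× per-clock, ~200 MHz vs. 245 MHz, 28 700 vs. 4400 ALM). The function name is a hypothetical helper.

```python
def wall_clock_speedup(per_clock_speedup, freq_new_mhz, freq_old_mhz):
    """Speedup in seconds = speedup in cycles x ratio of clock frequencies."""
    return per_clock_speedup * freq_new_mhz / freq_old_mhz


# Figures quoted on the slides (this work vs. Nios II/f).
PER_CLOCK = 2.73            # relative runtime cycles of Nios II/f
FREQ_THIS_WORK_MHZ = 200
FREQ_NIOS_MHZ = 245
AREA_THIS_WORK_ALM = 28700
AREA_NIOS_ALM = 4400

frequency_ratio = FREQ_THIS_WORK_MHZ / FREQ_NIOS_MHZ      # about 0.82
speedup = wall_clock_speedup(PER_CLOCK, FREQ_THIS_WORK_MHZ, FREQ_NIOS_MHZ)
area_ratio = AREA_THIS_WORK_ALM / AREA_NIOS_ALM

if __name__ == "__main__":
    print(f"frequency ratio {frequency_ratio:.2f}")
    print(f"wall-clock speedup {speedup:.2f}")
    print(f"area ratio {area_ratio:.1f}")
```

The per-clock gain more than offsets the lower clock frequency, which is where the 2.2× wall-clock figure comes from.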

Part 2: Pipeline Details

● Sketch of interesting circuits at each stage
● Timing budget: ~5 LUT levels (< 3.5 ns)
● Many circuits designed bottom-up
  – LUT granularity

Pipeline stages: Fetch → Len. decode → Decode → Rename → Schedule → Execute → Memory → Commit

Front end: Fetch-decode

ICache → Bytes → Instructions → Micro-ops → Renamer

● Fetch bandwidth
  – 3.4 B per instruction
  – Fetch 8 B/cycle
● Length decode
  – Prefix bytes uncommon
  – Fast decode up to 1 prefix
● Decode into micro-ops
  – 1 is common case
  – Dual-issue up to 2, single-issue up to 4

[Figures: frequency and cumulative frequency histograms of (a) instruction length in bytes, with regions marked 8 B/cycle and multi-cycle; (b) number of prefix bytes per instruction (0.86 have none, 0.14 have one); (c) micro-ops per instruction, with regions marked 2-issue, 1-issue, and multi-cycle]

x86: Worst case is complex, but common case isn’t too bad

Register Renamer

● Read: Map logical register numbers to physical
  – ~14 sources/clk
● Write: Update mapping table
  – ~4 destinations/clk
● Allows 1w-port reg. files
  – Each ALU “owns” an RF
● Committed register mapping table: used for recovery from misspeculation

[Figure: speculative register mapping table (eax, ecx, edx, ebx, esp, ebp, esi, edi, eflags, fpucw, fpusw, tmp0, tmp1) pointing into physical register files PRF A, PRF B, and PRF C; a committed register mapping table with the same entries; the tables are copied on a pipeline flush]

Map logical reg → physical reg, 2 uops/clock
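The read, write, and recovery behaviour of the renamer can be sketched in software. The Python below is a simplified model under stated assumptions: a single dict stands in for the speculative mapping table, the free list is a plain list, and the names `rename_group` and `flush` are hypothetical. It renames a group of two uops in one step, forwarding the first uop's new destination to the second, and restores the speculative table from the committed one on a flush.

```python
def rename_group(rat, free_list, uops):
    """Rename up to two uops in one step.

    rat:       dict, logical register name -> physical register number
    free_list: list of unused physical register numbers
    uops:      list of (dst, [src, ...]) using logical names; dst may be None
    """
    before = dict(rat)          # all table reads see the state before this group
    allocated = []              # physical destination chosen for each uop
    renamed = []
    for index, (dst, srcs) in enumerate(uops):
        phys_srcs = []
        for src in srcs:
            phys = before[src]
            # Forwarding: the newest older uop in the group that writes src wins.
            for older in range(index):
                if uops[older][0] == src:
                    phys = allocated[older]
            phys_srcs.append(phys)
        phys_dst = free_list.pop(0) if dst is not None else None
        allocated.append(phys_dst)
        renamed.append((phys_dst, phys_srcs))
    # Write stage: update the table; the youngest writer of a register wins.
    for (dst, _), phys_dst in zip(uops, allocated):
        if dst is not None:
            rat[dst] = phys_dst
    return renamed


def flush(speculative_rat, committed_rat):
    """Misspeculation recovery: copy the committed table over the speculative one."""
    speculative_rat.clear()
    speculative_rat.update(committed_rat)
```

For example, renaming `eax ← f(esi)` followed by `eax ← g(eax, esi)` gives the second uop the first uop's new physical register as its source. Reclaiming physical registers after a flush is left out of this sketch.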

Renamer Circuit

[Figure: renamer circuit block diagram. Inputs: μop 0 and μop 1 from Decode, String-op (× 2), or Microcode (× 2), with a Pause? control. Blocks: Choose Operands, forwarding and mux select, the mapping tables (RATs), GPR forwarding, SREG forwarding. Outputs: ALUop0/ALUop1 [dst,src,src] and AGUop0/AGUop1 [dst,src,src,src,seg].]

RAT           Size          Physical registers   Ports
GPR RAT       13 x 8-bit    64                   6r 2w
SREG RAT      12 x 4-bit    16                   4r 2w
OCZAPS        1 x 5-bit     16                   1r 1w
RF_is_zero    1 x 1-bit                          1r 1w

● Stage 1: Pick two uops; find where each operand comes from
● Stage 2: A bunch of read muxes; write destination regs to RAT
● 317 MHz, 1900 ALMs
● x86: Few registers: small FF-based RAT. But ≥3 register types.

Scheduling: Track dependencies

● Pick a ready operation, execute, and wake up dependents
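The pick, execute, and wake-up loop can be written out as a small simulation. This is a minimal sketch assuming a single issue port and fixed per-operation latencies; it is not the 4-way matrix scheduler described later, and the names are hypothetical. Operations are held in program order, the oldest ready one issues each cycle, and its result clears dependencies once its latency has elapsed.

```python
def schedule(ops, latency):
    """Issue at most one op per cycle, oldest ready first.

    ops:     dict in program order, name -> set of producer names
    latency: dict, name -> cycles until the result can wake up dependents
    Returns a list of (cycle, name) in issue order.
    """
    waiting = {name: set(deps) for name, deps in ops.items()}
    in_flight = []              # (cycle the result is ready, producer name)
    issued = []
    cycle = 0
    while waiting:
        # Wake up: results that are ready clear the matching dependencies.
        done = [name for ready_at, name in in_flight if ready_at <= cycle]
        in_flight = [(r, n) for r, n in in_flight if r > cycle]
        for deps in waiting.values():
            deps.difference_update(done)
        # Select: pick the oldest op with no outstanding dependencies.
        ready = next((n for n, deps in waiting.items() if not deps), None)
        if ready is not None:
            del waiting[ready]
            issued.append((cycle, ready))
            in_flight.append((cycle + latency[ready], ready))
        elif not in_flight:
            raise ValueError("remaining ops depend on results that never arrive")
        cycle += 1
    return issued


if __name__ == "__main__":
    program = {"load": set(), "add": {"load"}, "inc": set(), "store": {"add"}}
    cycles = {"load": 3, "add": 1, "inc": 1, "store": 1}
    print(schedule(program, cycles))
```

With a 3-cycle load, the independent `inc` issues while `add` is still waiting for the load, which is the out-of-order behaviour the scheduler exists to provide.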


Scheduler Size

[Figure: IPC (axis 0 to 0.9) vs. scheduler capacity (0 to 64 entries); the chosen design point is marked at 32 entries, 275 MHz]

● Can trade capacity (area and frequency) for IPC

Scheduler Circuit

● 4-way distributed matrix scheduler
● 32-entry (10, 10, 7, 5)
  – 275 MHz
● Comparison:
  – Pentium Pro: 20
  – Haswell: 60
  – Ryzen: 84 (6×14)
  – Skylake: 97

Scheduler Picker: Bit Scan

[Figure: a row of ready bits (input), ordered by age with newer and older ends marked, is scanned for the first ready entry; the bit scan output marks that entry]

● Pick first ready instruction to execute
● Logarithmic depth: radix-6
  – Han-Carlson prefix tree
  – Very difficult to code: Synthesizer makes it linear depth again
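The bit scan described above can be expressed two ways in software. The sketch below assumes bit 0 is the oldest entry and a 32-entry ready vector. The first function is the usual lowest-set-bit identity. The second computes the same grant with a parallel-prefix OR in log2(32) = 5 doubling steps. That is a simple radix-2 doubling scheme swapped in for illustration, not the radix-6 Han-Carlson tree used in the actual circuit.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1


def pick_oldest(ready):
    """Reference: one-hot grant for the lowest set bit (bit 0 = oldest)."""
    return ready & -ready


def pick_oldest_prefix(ready):
    """Same grant via a parallel-prefix OR with log2(WIDTH) doubling steps."""
    seen = ready
    shift = 1
    while shift < WIDTH:
        # After this step, seen[i] covers twice as many older positions.
        seen |= (seen << shift) & MASK
        shift *= 2
    # seen[i] is now the OR of ready[0..i]; shift by one to exclude entry i itself.
    older_ready = (seen << 1) & MASK
    return ready & ~older_ready & MASK
```

Each doubling step depends only on the previous one, so the depth grows with the logarithm of the number of entries rather than linearly, which is the property the slide says synthesis tools tend to undo.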

Execution

● Three different execution units
  – Complex, 3+ cycle
  – Simple, 1 cycle
  – Address generation
● Latency vs. delay circuit design problem


Execution Circuits: Simple ALU

● Three parts:
  – Shifter
  – Adder
  – Bitwise logic
● We’ll look at shifter and adder circuits


Execution Circuit: Shifter

● Minimal 32-bit shifter: Needs 3 LUT levels (4-to-1 mux per level)
● We used a rotate + mask circuit: Almost 3 LUT levels
  – Left and right. 32-, 16-, and 8-bit operands
  – Rotate, shift, arithmetic shift
  – Rotate-through-carry-by-1
  – Byte swap (aa bb cc dd → dd cc bb aa)
  – Sign extension
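The rotate + mask idea can be illustrated in a few lines: every shift is a rotation followed by a mask that clears (or sign-fills) the bits that wrapped around. The Python below is a sketch for 32-bit operands only, with hypothetical function names. It does not cover the 16- and 8-bit modes, rotate-through-carry, or byte swap, and it says nothing about how the real circuit maps onto LUTs.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit value left by n (0..31)."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32


def shl32(x, n):
    """Logical shift left: rotate left, then clear the n low bits that wrapped."""
    return rotl32(x, n) & (MASK32 << n) & MASK32


def shr32(x, n):
    """Logical shift right: rotate right by n, then clear the n high bits."""
    return rotl32(x, (32 - n) & 31) & (MASK32 >> n)


def sar32(x, n):
    """Arithmetic shift right: as shr32, but fill the high bits with the sign."""
    keep = MASK32 >> n
    fill = (MASK32 & ~keep) if x >> 31 else 0
    return (rotl32(x, (32 - n) & 31) & keep) | fill
```

One rotator shared by all operations, plus a mask chosen per operation, is what lets a single datapath cover rotates, logical shifts, and arithmetic shifts.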

● 2.9 ns, 54% faster (and 46% smaller) than HDL synthesis

Execution Circuit: Adder

● FPGAs have carry chains: Can’t improve on it by much
● Condition codes: ZF means “is the result zero?”
  – 32/16/8-bit NOR gate is 3 LUT levels plus the adder...
● Computing a + b = K does not need addition!
  – ZF: 3 LUT levels in parallel with adder
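One well-known way to test a + b = K without a carry chain is to compare, bit by bit, the carry each position would need against the carry the position below would produce. The slides do not spell out the equations used in the actual circuit, so the Python below is a sketch of that general identity with hypothetical function names, not a description of the author's logic.

```python
def sum_equals(a, b, k, width=32):
    """True when (a + b) mod 2**width == k, without propagating a carry."""
    mask = (1 << width) - 1
    # Carry-in that each bit position would need for its sum bit to match k.
    needed_carry = (a ^ b ^ k) & mask
    # Carry-out each position would produce if its sum bit does match k.
    produced_carry = ((a & b) | ((a | b) & ~k)) & mask
    # The sum matches k exactly when every needed carry-in is the carry-out
    # of the position below (and the carry into bit 0 is zero).
    return needed_carry == ((produced_carry << 1) & mask)


def zero_flag_after_add(a, b, width=32):
    """ZF for an addition: is a + b == 0 (mod 2**width)?"""
    return sum_equals(a, b, 0, width)
```

Every bit of the comparison depends only on bits i and i-1 of the inputs, so the test reduces to a wide AND that can run in parallel with the adder instead of after it.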

● 2.3 ns, 24% faster, +55% area (+30 ALM) vs. HDL synthesis

Memory System Microarchitecture

● Memory operations
  – Store: mov [ecx], eax
  – Load: mov eax, [ecx]
● Caches
  – For performance (Instruction and data)
  – For OS support (TLB and page walking)

Basic Cache Trade-offs

[Figure: relative IPC on Dhrystone and SPECint 2000 as the L1 cache gets bigger, with a 256 KB L2 cache and with no L2 cache]

● Bigger cache → higher IPC
  – Sensitivity varies with workload
● L1 caches need to be small (We chose 8 KB)
● L2 cache (256 KB) mostly makes up for this

More Memory System Trade-offs

[Figure: relative IPC on Dhrystone and SPECint 2000 for three memory system designs: blocking in-order, non-blocking in-order, and non-blocking out-of-order; bar labels include 1.01, 1.08, 1.21, and 1.32]

● Multiple in-flight misses
● Out-of-order
  – Memory dependence speculation

L1 Memory System

● TLB lookup
● Cache tag compare
● Cache data rotate (32 B)
● Long critical path for direct implementation

What happens to a load: simplified

99 Fetch Len. decode Decode Rename Schedule Execute Memory Commit What happens to a load: simplified

100 Fetch Len. decode Decode Rename Schedule Execute Memory Commit What happens to a load: simplified

101 Fetch Len. decode Decode Rename Schedule Execute Memory Commit What happens to a load: simplified

102 Fetch Len. decode Decode Rename Schedule Execute Memory Commit What happens to a load: simplified

103 Fetch Len. decode Decode Rename Schedule Execute Memory Commit What happens to a load: simplified

104 Fetch Len. decode Decode Rename Schedule Execute Memory Commit What happens to a load: simplified

Possible outcomes for a load:

● Other paging features:
  – Page faults
  – TLB Accessed/Dirty bits
  – Uncacheable accesses
● Replay accesses that fail
● Also:
  – Split-cacheline/page
  – Stores
  – L2 and cache coherence
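The load behaviour summarised on this slide can be sketched as a small behavioural model. This is only an illustration under stated assumptions: the four outcomes, the `ToyL1` class, and the "fix up, then replay" policy are hypothetical stand-ins, since the slide's figure of outcomes is not reproduced in the text. It is not the processor's actual design.

```python
from enum import Enum, auto

PAGE_SHIFT = 12  # 4 KiB pages, as on x86
LINE_SHIFT = 5   # 32-byte cache lines (assumed from the 32 B rotate)


class Outcome(Enum):
    # Hypothetical outcome set; the slide's own list is in a figure.
    DONE = auto()
    TLB_MISS = auto()
    CACHE_MISS = auto()
    PAGE_FAULT = auto()


class ToyL1:
    """Hypothetical behavioural model of a TLB plus L1 cache."""

    def __init__(self, page_table):
        self.page_table = page_table  # virtual page number -> physical page number
        self.tlb = {}                 # translations currently cached
        self.lines = set()            # physical line numbers present in the cache

    def try_load(self, vaddr):
        """One attempt at a load. A failed attempt fixes the cause so that
        a later replay of the same load can make progress."""
        vpn = vaddr >> PAGE_SHIFT
        if vpn not in self.tlb:
            if vpn not in self.page_table:
                return Outcome.PAGE_FAULT
            self.tlb[vpn] = self.page_table[vpn]  # stands in for a page walk
            return Outcome.TLB_MISS
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        paddr = (self.tlb[vpn] << PAGE_SHIFT) | offset
        line = paddr >> LINE_SHIFT
        if line not in self.lines:
            self.lines.add(line)                  # stands in for an L2 fill
            return Outcome.CACHE_MISS
        return Outcome.DONE


def run_load(l1, vaddr):
    """Replay the load until it completes or faults; return every outcome seen."""
    history = []
    while True:
        outcome = l1.try_load(vaddr)
        history.append(outcome)
        if outcome in (Outcome.DONE, Outcome.PAGE_FAULT):
            return history
```

In this sketch a cold load to a mapped page is attempted three times (TLB miss, cache miss, done), and a second load to the same line completes on its first attempt.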

Design a circuit to implement this microarchitecture...

● Long critical path for direct implementation
  – TLB lookup
  – Cache tag compare
  – Cache data rotate (32 B)
● Shannon expansion: “TLB hit way 1?”
  – TLB lookup
  – 2-to-1 mux (1 bit)
  – 2-to-1 mux (32 bit)
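The Shannon expansion listed above can be illustrated with a toy software model. The sketch assumes a 2-way TLB whose two candidate entries have already been read, a single cache tag and data word, and hypothetical names (`TlbEntry`, `direct`, `shannon`). It shows only the logical restructuring, not the actual circuit: both answers to "TLB hit way 1?" are precomputed, so the late TLB comparison only drives 2-to-1 muxes.

```python
from collections import namedtuple
from itertools import product

TlbEntry = namedtuple("TlbEntry", "vpn ppn")  # hypothetical entry layout


def direct(vpn, way0, way1, cache_tag, cache_word):
    """Serial form: translate first, then compare the tag, then gate the data."""
    tlb_hit_way1 = way1.vpn == vpn
    tlb_hit = tlb_hit_way1 or way0.vpn == vpn
    ppn = way1.ppn if tlb_hit_way1 else way0.ppn
    cache_hit = ppn == cache_tag
    data = cache_word if cache_hit else 0
    return tlb_hit, cache_hit, data


def shannon(vpn, way0, way1, cache_tag, cache_word):
    """Expanded form: compute the result for both answers to
    "TLB hit way 1?" in parallel with the TLB compare, then select."""
    # Cofactors: neither depends on the outcome of the TLB compare.
    hit_if_way0 = way0.ppn == cache_tag
    hit_if_way1 = way1.ppn == cache_tag
    data_if_way0 = cache_word if hit_if_way0 else 0
    data_if_way1 = cache_word if hit_if_way1 else 0
    # Late-arriving select signal.
    tlb_hit_way1 = way1.vpn == vpn
    tlb_hit = tlb_hit_way1 or way0.vpn == vpn
    cache_hit = hit_if_way1 if tlb_hit_way1 else hit_if_way0  # 2-to-1 mux (1 bit)
    data = data_if_way1 if tlb_hit_way1 else data_if_way0     # 2-to-1 mux (data)
    return tlb_hit, cache_hit, data


def shannon_matches_direct():
    """Exhaustively compare both forms over a small value range."""
    for vpn, v0, p0, v1, p1, tag in product(range(3), repeat=6):
        args = (vpn, TlbEntry(v0, p0), TlbEntry(v1, p1), tag, 0xABCD)
        if direct(*args) != shannon(*args):
            return False
    return True
```

As in the serial form, `cache_hit` and `data` are meaningful only when `tlb_hit` is true. In this model a TLB miss would be handled by replaying the access.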

120 Memory System: L1 Circuit

● Final design: many paths are near-critical, so delays are balanced
  – Critical portions hand-mapped to LUTs (blue boxes)
  – Many critical paths have logic depth of 5 LUTs
● 4.3 ns (~230 MHz)
  – Nios II/f: TLB+cache circuit is ~6.0 ns (~1.5 stages)
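The timing figures above follow from simple arithmetic, sketched below. The 240 MHz Nios II/f clock is the figure quoted earlier in the talk. The helper names are hypothetical.

```python
def period_to_mhz(period_ns):
    """Clock frequency in MHz for a given critical-path delay in ns."""
    return 1000.0 / period_ns


def stages_needed(delay_ns, clock_mhz):
    """How many clock periods a circuit of the given delay spans."""
    return delay_ns * clock_mhz / 1000.0


l1_mhz = period_to_mhz(4.3)             # about 232.6 MHz, quoted as ~230 MHz
nios_stages = stages_needed(6.0, 240.0) # 1.44, quoted as ~1.5 stages
```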

122 Summary 2

● Out-of-order soft processors are useful and feasible
  – Area: 6.5× Nios II/f, but affordable
  – Performance: 2.2× Nios II/f on SPECint2000
● ...if you pay attention to circuit design
  – Adapt microarchitecture to suit circuits
  – Or push circuits to make microarchitecture feasible
  – LUT-level design is often useful
● Avoiding (n)optimization has been painful

123 Future Work

● Hardware not yet functional
  – Some missing pieces
● More optimization
  – Frequency
● Add FPU

124 End

125 Thoughts on Complex Instructions

● There is little distinction between modern RISC and CISC
  – Abandoned “RISC” features
    ● Branch delay slot
    ● Register windows
  – “CISC” features in modern “RISC”
    ● Variable-length instructions
    ● Instructions with more than one micro-op
    ● Unaligned memory access
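To make "instructions with more than one micro-op" concrete, here is a minimal sketch of cracking a two-operand ALU instruction with a memory destination into load, ALU, and store micro-ops. The `MicroOp` record, the temporary register names, and the `crack` function are hypothetical and are not taken from the processor described in this talk.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class MicroOp:
    op: str
    dst: Optional[str]      # None for micro-ops that write no register
    srcs: Tuple[str, ...]


def is_mem(operand):
    """A bracketed operand such as "[ebx]" denotes memory addressed by a register."""
    return operand.startswith("[") and operand.endswith("]")


def crack(mnemonic, dst, src):
    """Crack `mnemonic dst, src` into register-only micro-ops."""
    uops = []
    rhs = src
    if is_mem(src):
        uops.append(MicroOp("load", "tmp0", (src[1:-1],)))
        rhs = "tmp0"
    if is_mem(dst):
        addr = dst[1:-1]
        # Read-modify-write: load the old value, operate, store the result.
        uops.append(MicroOp("load", "tmp1", (addr,)))
        uops.append(MicroOp(mnemonic, "tmp1", ("tmp1", rhs)))
        uops.append(MicroOp("store", None, (addr, "tmp1")))
    else:
        uops.append(MicroOp(mnemonic, dst, (dst, rhs)))
    return uops
```

For example, `crack("add", "[ebx]", "eax")` yields three micro-ops (load, add, store), while `crack("add", "eax", "ebx")` yields one.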

126 Complex Instructions 2

● It’s easier to lower than raise abstraction level
  – It’s easy to compile C to instructions.
  – It’s easy to crack complex instructions into many micro-ops.
● We’ve made a lot of instruction-level promises. Breaking them needs ISA changes
  – Fused Multiply-Add: Can’t synthesize from MUL and ADD
  – REP MOVS: Must maintain memory ordering
  – Binary translation is really hard: Need to keep low-level promises
● Can we really execute 5, 10, 20 IPC while keeping all instruction-level ISA promises?
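The fused multiply-add point can be checked numerically. The sketch below assumes IEEE 754 double precision and models the fused operation with exact rational arithmetic followed by a single rounding. The function names and the chosen operands are illustrative only.

```python
from fractions import Fraction


def mul_then_add(a, b, c):
    """Two roundings: the product is rounded to a double before the add."""
    product = a * b
    return product + c


def fused_multiply_add(a, b, c):
    """One rounding: form a*b + c exactly, then round once to a double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


# a*a is exactly 1 + 2**-26 + 2**-54; the last term is lost when the
# product is rounded to a double, but survives in the fused form.
a = 1.0 + 2.0 ** -27
c = -(1.0 + 2.0 ** -26)

unfused = mul_then_add(a, a, c)      # 0.0
fused = fused_multiply_add(a, a, c)  # 2**-54
```

Because the two results differ, a separate MUL followed by ADD cannot reproduce the fused instruction's result in general.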

127 Simulator ISA Support

128 Publications

● Comparing FPGA to custom CMOS
  – Comparing FPGA vs. custom CMOS and the impact on processor microarchitecture (FPGA 2011)
  – Quantifying the gap between FPGA and custom CMOS to aid microarchitectural design (TVLSI 2013)
● Out-of-order memory system
  – Efficient methods for out-of-order load/store execution for high-performance soft processors (FPT 2013)
  – Microarchitecture and circuits for a 200 MHz Out-of-order soft processor memory system (TRETS 2016)
● Out-of-order instruction scheduling
  – High performance instruction scheduling circuits for out-of-order soft processors (FCCM 2016)
  – High performance instruction scheduling circuits for superscalar out-of-order soft processors (TRETS 2017)
