Design of Parallel and High-Performance Computing Fall 2017 Lecture: Coherence & Memory Models

Motivational video: https://www.youtube.com/watch?v=zJybFF6PqEQ

Instructor: Torsten Hoefler & Markus Püschel TAs: Salvatore Di Girolamo Sorry for missing last week’s lecture!

2 To talk about some silliness in MPI-3

3 Urban and Climate Change: How Do We Meet the Challenges?

Wednesday, November 8, 2017, 15:00–19:30, ETH Zurich, Main Building

Information and registration: http://www.c2sm.ethz.ch/events/eth-klimarunde-2017.html

A more complete view

6 Architecture Developments

<1999: distributed memory machines communicating through messages

’00-’05: large cache-coherent multicore machines communicating through coherent memory access and messages

’06-’12: large cache-coherent multicore machines communicating through coherent memory access and remote direct memory access

’13-’20: coherent and non-coherent manycore accelerators and multicores communicating through memory access and remote direct memory access

>2020: largely non-coherent accelerators and multicores communicating through remote direct memory access

7 Sources: various vendors vs. Physics

 Physics (technological constraints) . Cost of data movement . Capacity of DRAM cells . Clock frequencies (constrained by end of Dennard scaling) . Speed of Light . Melting point of silicon

 Computer Architecture (design of the machine) . ISA / Multithreading . SIMD widths

“Computer architecture, like other architecture, is the art of determining the needs of the user of a structure and then designing to meet those needs as effectively as possible within economic and technological constraints.” – Fred Brooks (IBM, 1962)

Have converted many former “power” problems into “cost” problems

8 Credit: John Shalf (LBNL) Low-Power Design Principles (2005)

[Figure: relative core sizes of Power5, Intel Core2, Intel Atom, and Tensilica XTensa]

 Cubic power improvement with lower clock rate due to V2F

 Slower clock rates enable use of simpler cores

 Simpler cores use less area (lower leakage) and reduce cost

 Tailor design to application to REDUCE WASTE

9 Credit: John Shalf (LBNL) Low-Power Design Principles (2005)

 Power5 (server) . 120W@1900MHz . Baseline

 Intel Core2 sc (laptop) . 15W@1000MHz . 4x more FLOPs/watt than baseline

 Intel Atom (handhelds) . 0.625W@800MHz . 80x more

 GPU Core or XTensa/Embedded . 0.09W@600MHz . 400x more (80x-120x sustained)

10 Credit: John Shalf (LBNL) Low-Power Design Principles (2005)

 Power5 (server) . 120W@1900MHz . Baseline

 Intel Core2 sc (laptop) . 15W@1000MHz . 4x more FLOPs/watt than baseline

 Intel Atom (handhelds) . 0.625W@800MHz . 80x more

 GPU Core or XTensa/Embedded . 0.09W@600MHz . 400x more (80x-120x sustained)

Even if each simple core is 1/4th as computationally efficient as complex core, you can fit hundreds of them on a single chip and still be 100x more power efficient.

11 Credit: John Shalf (LBNL)

Heterogeneous Future (LOCs and TOCs)

[Figure: a few big cores alongside many tiny cores (0.2 mm x 0.23 mm each) – lots of them!]

Latency Optimized Core (LOC): most energy efficient if you don’t have lots of parallelism. Throughput Optimized Core (TOC): most energy efficient if you DO have a lot of parallelism!

12 Credit: John Shalf (LBNL) Data movement – the wires

 Energy Efficiency of copper wire: . Power = Frequency * Length / cross-section-area


. Wire efficiency does not improve as feature size shrinks (Photonics could break through the bandwidth-distance limit)

 Energy Efficiency of a Transistor: . Power = V2 * Frequency * Capacitance . Capacitance ~= Area of Transistor . Transistor efficiency improves as you shrink it

 Net result is that moving data on wires is starting to cost more energy than computing on said data (interest in Silicon Photonics) 13 Credit: John Shalf (LBNL) Pin Limits

 Moore’s law doesn’t apply to adding pins to package . 30%+ per year nominal Moore’s Law . Pins grow at ~1.5-3% per year at best

 4000 pins is an aggressive pin package . Half of those would need to be for power and ground . Of the remaining 2k pins, run as differential pairs . Beyond 15Gbps per pin power/complexity costs hurt! . 10Gbps * 1k pins is ~1.2TBytes/sec

 2.5D integration increases pin density . But it’s a one-time boost (how much headroom?) . 4TB/sec? (maybe 8TB/s with single wire signaling?)

14 Credit: John Shalf (LBNL) A 45nm 1.3GHz 16.7 Double-Precision GFLOPS/W RISC-V Processor with Vector Accelerators – Die Photos (3 classes of cores)

[Slide shows the first page of Lee, Waterman, Avizienis, Cook, Sun, Stojanović, Asanović (UC Berkeley / MIT): “A 45nm 1.3GHz 16.7 Double-Precision GFLOPS/W RISC-V Processor with Vector Accelerators”, including the backside chip micrograph and the Rocket scalar core + Hwacha vector accelerator block diagram.]

15 Credit: John Shalf (LBNL) Strip down to the core

[Same paper page, zoomed in on the 2.7 mm x 4.5 mm dual-core RISC-V region: two Rocket scalar cores with Hwacha vector accelerators, their L1 caches, the coherence hub, and the 1 MB SRAM array.]

16 Credit: John Shalf (LBNL) Actual Size

[The same 2.7 mm x 4.5 mm dual-core RISC-V chip region shown at actual size.]

17 Credit: John Shalf (LBNL) Basic Stats

Core Energy/Area estimates (wire energy assumption for 22nm: 100 fJ/bit per mm, i.e., ~6 pJ per mm for a 64-bit operand, ~120 pJ for 20 mm):

 Big core . Area: 12.25 mm2 . Power: 2.5W . Clock: 2.4 GHz . E/op: 651 pJ

 RISC-V Rocket scalar core . Area: 0.6 mm2 . Power: 0.3W (<0.2W) . Clock: 1.3 GHz . E/op: 150 (75) pJ

 Tiny core (0.2 mm x 0.23 mm) . Area: 0.046 mm2 . Power: 0.025W . Clock: 1.0 GHz . E/op: 22 pJ

18 Credit: John Shalf (LBNL) When does data movement dominate?

Core Energy/Area est. vs. Data Movement Cost (see the quick arithmetic check below):

 Big core (E/op: 651 pJ) . Compute op == data movement energy @ 108mm . Energy ratio for 20mm: 0.2x

 RISC-V Rocket core (E/op: 150 (75) pJ) . Compute op == data movement energy @ 12mm . Energy ratio for 20mm: 1.6x

 Tiny core (E/op: 22 pJ) . Compute op == data movement energy @ 3.6mm . Energy ratio for 20mm: 5.5x
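A quick arithmetic check (my own calculation, assuming the previous slide’s 22nm wire energy of 100 fJ/bit per mm, i.e., roughly 6 pJ per mm for a 64-bit operand) reproduces the three ratios:

```latex
E_{20\,\mathrm{mm}} \approx 6\,\mathrm{pJ/mm} \times 20\,\mathrm{mm} = 120\,\mathrm{pJ}, \qquad
\frac{120}{651} \approx 0.2, \quad \frac{120}{75} \approx 1.6, \quad \frac{120}{22} \approx 5.5 .
```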

19 Credit: John Shalf (LBNL) DPHPC Overview

20 Goals of this lecture

 Memory Trends

 Cache Coherence

 Memory Consistency

21 Memory – CPU gap widens

 Measure processor speed as “throughput” . FLOP/s, IOP/s, … . Moore’s law - ~60% growth per year

Source: Jack Dongarra

 Today’s architectures . POWER8: 338 dp GFLOP/s – 230 GB/s memory bw . Broadwell i7-5775C: 883 GFLOP/s – ~50 GB/s memory bw . Trend: memory performance grows 10% per year

Source: John McCalpin 22 Issues (AMD Interlagos as Example)

 How to measure bandwidth? . Data sheet (often peak performance, may include overheads) Frequency times bus width: 51 GiB/s . Microbenchmark performance Stride 1 access (32 MiB): 32 GiB/s Random access (8 B out of 32 MiB): 241 MiB/s Why? . Application performance As observed (performance counters) Somewhere in between stride 1 and random access
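A minimal sketch of such microbenchmarks (my own illustration, not the code behind the Interlagos numbers; working-set size, iteration counts, and the timer are arbitrary choices). The sequential loop approximates the stride-1 case; the dependent pointer chase is the usual way to realize random access, and dividing its time by the number of loads also yields the latency figures quoted below:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (32UL * 1024 * 1024 / sizeof(size_t))   /* 32 MiB working set */

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    size_t *buf = malloc(N * sizeof(size_t));
    if (!buf) return 1;

    /* Stride-1 bandwidth: stream through the buffer sequentially. */
    for (size_t i = 0; i < N; i++) buf[i] = i;
    double t0 = now_sec();
    size_t sum = 0;
    for (size_t i = 0; i < N; i++) sum += buf[i];
    double t1 = now_sec();
    printf("stride-1 read: %.1f GiB/s (checksum %zu)\n",
           (double)(N * sizeof(size_t)) / (t1 - t0) / (1UL << 30), sum);

    /* Random access / latency: pointer chase through a single random cycle
       (Sattolo's algorithm), so every load depends on the previous one and
       cannot be overlapped or usefully prefetched. */
    for (size_t i = 0; i < N; i++) buf[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;            /* j < i: single cycle */
        size_t tmp = buf[i]; buf[i] = buf[j]; buf[j] = tmp;
    }
    const size_t steps = 20u * 1000 * 1000;
    size_t idx = 0;
    t0 = now_sec();
    for (size_t s = 0; s < steps; s++) idx = buf[idx];
    t1 = now_sec();
    printf("random pointer chase: %.1f ns per load (end %zu)\n",
           (t1 - t0) / steps * 1e9, idx);

    free(buf);
    return 0;
}
```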

 How to measure Latency? . Data sheet (often optimistic, or not provided) <100ns . Random pointer chase 110 ns with one core, 258 ns with 32 cores! 23 Conjecture: Buffering is a must!

 Two most common examples:

 Write Buffers . Delayed write back saves memory bandwidth . Data is often overwritten or re-read

 Caching . Directory of recently used locations . Stored as blocks (cache lines)

24 Cache Coherence

 Different caches may have a copy of the same memory location!

 Cache coherence . Manages existence of multiple copies

 Cache architectures . Multi level caches . Shared vs. private (partitioned) . Inclusive vs. exclusive . Write back vs. write through . Victim cache to reduce conflict misses . …

25 Exclusive Hierarchical Caches

26 Shared Hierarchical Caches

27 Shared Hierarchical Caches with MT

28 Caching Strategies (repeat)

 Remember: . Write Back? . Write Through?

 Cache coherence requirements A memory system is coherent if it guarantees the following: . Write propagation (updates are eventually visible to all readers) . Write serialization (writes to the same location must be observed in order) Everything else: memory model issues (later)

29 Write Through Cache

1. CPU0 reads X from memory • loads X=0 into its cache

2. CPU1 reads X from memory • loads X=0 into its cache

3. CPU0 writes X=1 • stores X=1 in its cache • stores X=1 in memory

4. CPU1 reads X from its cache • loads X=0 from its cache

Incoherent value for X on CPU1

CPU1 may wait for update!

Requires write propagation!

30 Write Back Cache

1. CPU0 reads X from memory • loads X=0 into its cache

2. CPU1 reads X from memory • loads X=0 into its cache

3. CPU0 writes X=1 • stores X=1 in its cache

4. CPU1 writes X =2 • stores X=2 in its cache

5. CPU1 writes back cache line • stores X=2 in memory

6. CPU0 writes back cache line • stores X=1 in memory

Later store X=2 from CPU1 lost

Requires write serialization!

31 A simple (?) example

 Assume C99: struct twoint { int a; int b; };

 Two threads: . Initially: a=b=0 . Thread 0: write 1 to a . Thread 1: write 1 to b

 Assume non-coherent write back cache . What may end up in main memory?
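A sketch of this setup with POSIX threads (my own illustration of the scenario above; on coherent hardware the final memory contents are always a=1, b=1, the question is what a non-coherent write-back cache could do):

```c
#include <stdio.h>
#include <pthread.h>

struct twoint { int a; int b; };   /* a and b typically share one cache line */

static struct twoint x = {0, 0};

static void *writer_a(void *arg) { (void)arg; x.a = 1; return NULL; }  /* Thread 0 */
static void *writer_b(void *arg) { (void)arg; x.b = 1; return NULL; }  /* Thread 1 */

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, writer_a, NULL);
    pthread_create(&t1, NULL, writer_b, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    /* With coherent caches, main memory ends up with a=1, b=1.  With a
       non-coherent write-back cache, whichever core writes back its stale
       copy of the whole line last can overwrite the other core's update,
       so memory may hold a=1,b=0 or a=0,b=1. */
    printf("a=%d b=%d\n", x.a, x.b);
    return 0;
}
```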

32 Cache Coherence Protocol

 Programmer can hardly deal with unpredictable behavior!

 Cache controller maintains data integrity . All writes to different locations are visible

Fundamental Mechanisms

 Snooping . Shared bus or (broadcast) network

 Directory-based . Record information necessary to maintain coherence: E.g., owner and state of a line etc.

33 Fundamental CC mechanisms

 Snooping . Shared bus or (broadcast) network . Cache controller “snoops” all transactions . Monitors and changes the state of the cache’s data . Works at small scale, challenging at large-scale E.g., Intel Broadwell

 Directory-based Source: Intel . Record information necessary to maintain coherence E.g., owner and state of a line etc. . Central/Distributed directory for cache line ownership . Scalable but more complex/expensive E.g., Intel KNC/KNL

34 Cache Coherence Parameters

 Concerns/Goals . Performance . Implementation cost (chip space, more important: dynamic energy) . Correctness . (Memory model side effects)

 Issues . Detection (when does a controller need to act) . Enforcement (how does a controller guarantee coherence) . Precision of block sharing (per block, per sub-block?) . Block size (cache line size?)

35 An Engineering Approach: Empirical start

 Problem 1: stale reads . Cache 1 holds value that was already modified in cache 2 . Solution: Disallow this state Invalidate all remote copies before allowing a write to complete

 Problem 2: lost update . Incorrect write back of modified line writes main memory in different order from the order of the write operations or overwrites neighboring data . Solution: Disallow more than one modified copy

36 Invalidation vs. update (I)

 Invalidation-based: . Each write to a shared line has to invalidate copies in remote caches . Simple implementation for bus-based systems: Each cache snoops Invalidate lines written by other CPUs Signal sharing for cache lines in local cache to other caches

 Update-based: . Local write updates copies in remote caches Can update all CPUs at once Multiple writes cause multiple updates (more traffic)

37 Invalidation vs. update (II)

 Invalidation-based: . Only write misses hit the bus (works with write-back caches) . Subsequent writes to the same cache line are local .  Good for multiple writes to the same line (in the same cache)

 Update-based: . All sharers continue to hit cache line after one core writes Implicit assumption: shared lines are accessed often . Supports producer-consumer pattern well . Many (local) writes may waste bandwidth!

 Hybrid forms are possible!

38 MESI Cache Coherence

 Most common hardware implementation of discussed requirements aka. “Illinois protocol” Each line has one of the following states (in a cache):

 Modified (M) . Local copy has been modified, no copies in other caches . Memory is stale

 Exclusive (E) . No copies in other caches . Memory is up to date

 Shared (S) . Unmodified copies may exist in other caches . Memory is up to date

 Invalid (I)

. Line is not in cache 39 Terminology

 Clean line: . Content of cache line and main memory is identical (also: memory is up to date) . Can be evicted without write-back

 Dirty line: . Content of cache line and main memory differ (also: memory is stale) . Needs to be written back eventually Time depends on protocol details

 Bus transaction: . A signal on the bus that can be observed by all caches . Usually blocking

 Local read/write: . A load/store operation originating at a core connected to the cache

40 Transitions in response to local reads

 State is M . No bus transaction

 State is E . No bus transaction

 State is S . No bus transaction

 State is I . Generate bus read request (BusRd) May force other cache operations (see later) . Other cache(s) signal “sharing” if they hold a copy . If shared was signaled, go to state S . Otherwise, go to state E

 After update: return read value

41 Transitions in response to local writes

 State is M . No bus transaction

 State is E . No bus transaction . Go to state M

 State is S . Line already local & clean . There may be other copies . Generate bus read request for upgrade to exclusive (BusRdX*) . Go to state M

 State is I . Generate bus read request for exclusive ownership (BusRdX) . Go to state M

42 Transitions in response to snooped BusRd

 State is M . Write cache line back to main memory . Signal “shared” . Go to state S (or E)

 State is E . Signal “shared” . Go to state S

 State is S . Signal “shared”

 State is I . Ignore

43 Transitions in response to snooped BusRdX

 State is M . Write cache line back to memory . Discard line and go to I

 State is E . Discard line and go to I

 State is S . Discard line and go to I

 State is I . Ignore

 BusRdX* is handled like BusRdX!
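The transition rules of the last four slides, condensed into a small state-machine sketch in C (my own simplification for illustration; data transfer and the wiring of the “shared” signal are left out):

```c
#include <stdio.h>
#include <stdbool.h>

typedef enum { M, E, S, I } mesi_t;
typedef enum { NONE, BUS_RD, BUS_RDX, BUS_RDX_STAR } bus_req_t;

static const char *state_name[] = { "M", "E", "S", "I" };

/* Local read: only an I-state line needs a bus transaction (BusRd);
   the line becomes S if another cache signals "shared", else E. */
static mesi_t local_read(mesi_t st, bool others_share, bus_req_t *req) {
    *req = NONE;
    if (st == I) {
        *req = BUS_RD;
        return others_share ? S : E;
    }
    return st;                       /* M, E, S: hit, no bus transaction */
}

/* Local write: the end state is always M; S needs an upgrade (BusRdX*),
   I needs a read for exclusive ownership (BusRdX), M/E stay silent. */
static mesi_t local_write(mesi_t st, bus_req_t *req) {
    *req = (st == S) ? BUS_RDX_STAR : (st == I) ? BUS_RDX : NONE;
    return M;
}

/* Snooped BusRd: M writes back and, like E and S, signals "shared";
   all valid states end up in S, I ignores the request. */
static mesi_t snoop_busrd(mesi_t st, bool *writeback, bool *shared) {
    *writeback = (st == M);
    *shared = (st != I);
    return (st == I) ? I : S;
}

/* Snooped BusRdX (and BusRdX*): M writes back first, then every valid
   state discards its copy. */
static mesi_t snoop_busrdx(mesi_t st, bool *writeback) {
    *writeback = (st == M);
    return I;
}

int main(void) {
    bus_req_t req;
    bool wb, shared;
    mesi_t st = I;
    st = local_read(st, /*others_share=*/false, &req);   /* I -> E, BusRd   */
    st = local_write(st, &req);                          /* E -> M, silent  */
    st = snoop_busrd(st, &wb, &shared);                  /* M -> S, wr back */
    st = snoop_busrdx(st, &wb);                          /* S -> I          */
    printf("final state: %s\n", state_name[st]);
    return 0;
}
```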

44 MESI State Diagram (FSM)

Source: Wikipedia 45 Small Exercise

 Initially: all in I state

Action       P1 state   P2 state   P3 state   Bus action   Data from
P1 reads x
P2 reads x
P1 writes x
P1 reads x
P3 writes x

46 Small Exercise

 Initially: all in I state

Action       P1 state   P2 state   P3 state   Bus action   Data from
P1 reads x   E          I          I          BusRd        Memory
P2 reads x   S          S          I          BusRd        Cache
P1 writes x  M          I          I          BusRdX*      Cache
P1 reads x   M          I          I          -            Cache
P3 writes x  I          I          M          BusRdX       Memory

47 Optimizations?

 Class question: what could be optimized in the MESI protocol to make a system faster?

48 Related Protocols: MOESI (AMD)

 Extended MESI protocol

 Cache-to-cache transfer of modified cache lines . Cache in M or O state always transfers cache line to requesting cache . No need to contact (slow) main memory

 Avoids write back when another process accesses cache line . Good when cache-to-cache performance is higher than cache-to-memory E.g., shared last level cache!

49 MOESI State Diagram

Source: AMD64 Architecture Programmer’s Manual 50 Related Protocols: MOESI (AMD)

 Modified (M): Modified Exclusive . No copies in other caches, local copy dirty . Memory is stale, cache supplies copy (reply to BusRd*)

 Owner (O): Modified Shared . Exclusive right to make changes . Other S copies may exist (“dirty sharing”) . Memory is stale, cache supplies copy (reply to BusRd*)

 Exclusive (E): . Same as MESI (one local copy, up to date memory)

 Shared (S): . Unmodified copy may exist in other caches . Memory is up to date unless an O copy exists in another cache

 Invalid (I): . Same as MESI 51 Related Protocols: MESIF (Intel)

 Modified (M): Modified Exclusive . No copies in other caches, local copy dirty . Memory is stale, cache supplies copy (reply to BusRd*)

 Exclusive (E): . Same as MESI (one local copy, up to date memory)

 Shared (S): . Unmodified copy may exist in other caches . Memory is up to date

 Invalid (I): . Same as MESI

 Forward (F): . Special form of S state, other caches may have line in S . Most recent requester of line is in F state . Cache acts as responder for requests to this line 52 Multi-level caches

 Most systems have multi-level caches . Problem: only “last level cache” is connected to bus or network . Yet, snoop requests are relevant for inner-levels of cache (L1) . Modifications of L1 data may not be visible at L2 (and thus the bus)

 L1/L2 modifications . On BusRd check if line is in M state in L1 It may be in E or S in L2! . On BusRdX(*) send invalidations to L1 . Everything else can be handled in L2

 If L1 is write through, L2 could “remember” state of L1 cache line . May increase traffic though

53 Directory-based cache coherence

 Snooping does not scale . Bus transactions must be globally visible . Implies broadcast

 Typical solution: tree-based (hierarchical) snooping . Root becomes a bottleneck

 Directory-based schemes are more scalable . Directory (entry for each CL) keeps track of all owning caches . Point-to-point update to involved processors No broadcast Can use specialized (high-bandwidth) network, e.g., HT, QPI …

54 Basic Scheme

 System with N processors Pi

 For each memory block (size: cache line) maintain a directory entry . N presence bits

. Set if block in cache of Pi . 1 dirty bit

 First proposed by Censier and Feautrier (1978)

55 Directory-based CC: Read miss

 Pi intends to read, misses

 If dirty bit (in directory) is off . Read from main memory . Set presence[i] . Supply data to reader

 If dirty bit is on

. Recall cache line from Pj (determine by presence[]) . Update memory . Unset dirty bit, block shared . Set presence[i] . Supply data to reader

56 Directory-based CC: Write miss

 Pi intends to write, misses

 If dirty bit (in directory) is off

. Send invalidations to all processors Pj with presence[j] turned on . Unset presence bit for all processors . Set dirty bit

. Set presence[i], owner Pi

 If dirty bit is on

. Recall cache line from owner Pj . Update memory . Unset presence[j] . Set presence[i], dirty bit remains set . Supply data to writer
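The two miss handlers above, written out as a sketch of the Censier/Feautrier-style directory (my own illustration; recall_line, invalidate, and supply_data stand in for the actual protocol messages, and NPROC is an arbitrary choice):

```c
#include <stdio.h>
#include <stdbool.h>

#define NPROC 64                       /* number of processors P0..P63 */

typedef struct {
    bool present[NPROC];               /* presence bit per processor   */
    bool dirty;                        /* single dirty bit per block   */
} dir_entry_t;

/* Stand-ins for the real protocol messages. */
static void recall_line(int owner)  { printf("recall modified line from P%d\n", owner); }
static void invalidate(int sharer)  { printf("invalidate copy at P%d\n", sharer); }
static void supply_data(int req)    { printf("supply data to P%d\n", req); }

/* Pi reads and misses. */
static void read_miss(dir_entry_t *d, int i) {
    if (d->dirty) {                              /* exactly one owner Pj */
        for (int j = 0; j < NPROC; j++)
            if (d->present[j]) recall_line(j);   /* write back, update memory */
        d->dirty = false;                        /* block is now shared */
    }
    d->present[i] = true;
    supply_data(i);
}

/* Pi writes and misses. */
static void write_miss(dir_entry_t *d, int i) {
    for (int j = 0; j < NPROC; j++) {
        if (!d->present[j]) continue;
        if (d->dirty) recall_line(j);            /* recall from owner, update memory */
        else          invalidate(j);             /* invalidate all sharers */
        d->present[j] = false;
    }
    d->dirty = true;                             /* dirty bit set, Pi is the owner */
    d->present[i] = true;
    supply_data(i);
}

int main(void) {
    dir_entry_t d = { .present = { false }, .dirty = false };
    read_miss(&d, 0);     /* P0 reads: clean miss, memory supplies data   */
    write_miss(&d, 1);    /* P1 writes: invalidate P0, P1 becomes owner   */
    read_miss(&d, 2);     /* P2 reads: recall dirty line from P1, share   */
    return 0;
}
```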

57 Discussion

 Scaling of memory bandwidth . No centralized memory

 Directory-based approaches scale with restrictions . Require presence bit for each cache . Number of bits determined at design time . Directory requires memory (size scales linearly) . Shared vs. distributed directory

 Software-emulation . Distributed shared memory (DSM) . Emulate cache coherence in software (e.g., TreadMarks) . Often on a per-page basis, utilizes memory virtualization and paging

59 Open Problems (for projects or theses)

 Tune algorithms to cache-coherence schemes . What is the optimal algorithm for a given scheme? . Parameterize for an architecture

 Measure and classify hardware . Read Maranget et al. “A Tutorial Introduction to the ARM and POWER Relaxed Memory Models” and have fun! . RDMA consistency is barely understood! . GPU memories are not well understood! Huge potential for new insights!

 Can we program (easily) without cache coherence? . How to fix the problems with inconsistent values? . Compiler support (issues with arrays)?

60 Case Study: Intel Xeon Phi

61 Communication?

62 Ramos, Hoefler: “Modeling Communication in Cache-Coherent SMP Systems - A Case-Study with Xeon Phi”, HPDC’13. Invalid read: RI = 278 ns, Local read: RL = 8.6 ns, Remote read: RR = 235 ns

63 Inspired by Molka et al.: “Memory performance and cache coherency effects on an Intel Nehalem multiprocessor system” Single-Line Ping Pong

 Prediction for both in E state: 479 ns . Measurement: 497 ns (O=18)
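A sketch of a single-cache-line ping-pong in C with POSIX threads (my own illustration of the kind of benchmark behind these numbers, not the code from the paper; in a real measurement the two threads would additionally be pinned to different cores, which is omitted here):

```c
#include <stdio.h>
#include <pthread.h>
#include <time.h>

#define ROUNDS 1000000L

/* One cache line bounces between two cores: each side spins until the flag
   holds its expected value, then hands it to the other side.  Plain volatile
   accesses to an aligned long are used for simplicity (GCC/Clang attribute). */
static volatile long flag __attribute__((aligned(64))) = 0;

static void *pong(void *arg) {
    (void)arg;
    for (long r = 0; r < ROUNDS; r++) {
        while (flag != 2 * r + 1) ;    /* wait for ping */
        flag = 2 * r + 2;              /* reply: invalidates the peer's copy */
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, pong, NULL);

    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (long r = 0; r < ROUNDS; r++) {
        flag = 2 * r + 1;              /* send ping */
        while (flag != 2 * r + 2) ;    /* wait for pong */
    }
    clock_gettime(CLOCK_MONOTONIC, &b);
    pthread_join(t, NULL);

    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    printf("round trip: %.0f ns\n", ns / ROUNDS);
    return 0;
}
```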

64 Multi-Line Ping Pong

 More complex due to prefetch

[Model annotations: number of CLs, amortization of startup, startup overhead, asymptotic fetch latency for each cache line (optimal prefetch!)]

65 Multi-Line Ping Pong

 E state: . o=76 ns . q=1,521ns . p=1,096ns

 I state: . o=95ns . q=2,750ns . p=2,017ns

66 DTD Contention

 E state: . a=0ns . b=320ns . c=56.2ns

67