Design of Parallel and High-Performance Computing Fall 2017 Lecture: Cache Coherence & Memory Models
Motivational video: https://www.youtube.com/watch?v=zJybFF6PqEQ
Instructor: Torsten Hoefler & Markus Püschel TAs: Salvatore Di Girolamo
Sorry for missing last week's lecture!
2 (I was away to talk about some silliness in MPI-3)
3 City and Climate Change: How Do We Meet the Challenges?
Wednesday, November 8, 2017, 15:00 – 19:30, ETH Zürich, main building
Information and registration: http://www.c2sm.ethz.ch/events/eth-klimarunde-2017.html
A more complete view
6 Architecture Developments
. <1999: distributed memory machines communicating through messages
. ’00-’05: large cache-coherent multicore machines communicating through coherent memory access and messages
. ’06-’12: large cache-coherent multicore machines communicating through coherent memory access and remote direct memory access
. ’13-’20: coherent and non-coherent manycore accelerators and multicores communicating through memory access and remote direct memory access
. >2020: largely non-coherent accelerators and multicores communicating through remote direct memory access
7 Sources: various vendors Computer Architecture vs. Physics
Physics (technological constraints) . Cost of data movement . Capacity of DRAM cells . Clock frequencies (constrained by end of Dennard scaling) . Speed of Light . Melting point of silicon
Computer Architecture (design of the machine) . Power management . ISA / Multithreading . SIMD widths “Computer architecture, like other architecture, is the art of determining the needs of the user of a structure and then designing to meet those needs as effectively as possible within economic and technological constraints.” – Fred Brooks (IBM, 1962)
Have converted many former “power” problems into “cost” problems
8 Credit: John Shalf (LBNL) Low-Power Design Principles (2005)
[Figure: power vs. clock rate for Tensilica XTensa, Intel Atom, Intel Core2, and Power 5]
. Cubic power improvement with lower clock rate due to V^2*F
. Slower clock rates enable use of simpler cores
. Simpler cores use less area (lower leakage) and reduce cost
. Tailor design to application to REDUCE WASTE
9 Credit: John Shalf (LBNL)
10 Credit: John Shalf (LBNL) Low-Power Design Principles (2005)
Power5 (server) . 120W@1900MHz . Baseline
Intel Core2 sc (laptop) . 15W@1000MHz . 4x more FLOPs/watt than baseline
Intel Atom (handhelds) . 0.625W@800MHz . 80x more
GPU Core or XTensa/Embedded . 0.09W@600MHz . 400x more (80x-120x sustained)
Even if each simple core is 1/4th as computationally efficient as a complex core, you can fit hundreds of them on a single chip and still be 100x more power efficient.
11 Credit: John Shalf (LBNL)
Heterogeneous Future (LOCs and TOCs)
[Figure: a few big cores next to tiny 0.23mm x 0.2mm cores; lots of them!]
Latency Optimized Core (LOC): most energy efficient if you don’t have lots of parallelism
Throughput Optimized Core (TOC): most energy efficient if you DO have a lot of parallelism!
12 Credit: John Shalf (LBNL) Data movement – the wires
Energy Efficiency of a copper wire: . Power = Frequency * Length / cross-section-area . Wire efficiency does not improve as feature size shrinks
Energy Efficiency of a Transistor: . Power = V^2 * Frequency * Capacitance . Capacitance ~= Area of Transistor . Transistor efficiency improves as you shrink it
Photonics could break through the bandwidth-distance limit
Net result is that moving data on wires is starting to cost more energy than computing on said data (interest in Silicon Photonics) 13 Credit: John Shalf (LBNL) Pin Limits
Moore’s law doesn’t apply to adding pins to a package . 30%+ per year nominal Moore’s Law . Pins grow at ~1.5-3% per year at best
A 4000-pin package is aggressive . Half of those would need to be for power and ground . Of the remaining 2k pins, run as differential pairs (~1k signal pairs) . Beyond 15Gbps per pin, power/complexity costs hurt! . 10Gbps * 1k pins is ~1.2TBytes/sec
2.5D integration gets a boost in pin density . But it’s a one-time boost (how much headroom?) . 4TB/sec? (maybe 8TB/s with single-wire signaling?)
14 Credit: John Shalf (LBNL) Die Photos (3 classes of cores)
[Figure: title page and backside die micrograph of Lee et al., “A 45nm 1.3GHz 16.7 Double-Precision GFLOPS/W RISC-V Processor with Vector Accelerators” (UC Berkeley / MIT): a 64-bit dual-core RISC-V processor with Hwacha vector accelerators, fabricated in a 45nm SOI process; maximum clock 1.3GHz at 1.2V, peak energy efficiency 16.7 double-precision GFLOPS/W at 0.65V.]
15 Credit: John Shalf (LBNL) Strip down to the core
[Figure: the same paper page repeated, zooming in on the 2.7mm x 4.5mm dual-core RISC-V + Hwacha region of the die.]
16 Credit: John Shalf (LBNL) Actual Size
[Figure: the page at actual size; a single Rocket scalar core occupies only 0.23mm x 0.2mm.]
17 Credit: John Shalf (LBNL) Basic Stats
Core Energy/Area est. (big core, 2.7mm x 4.5mm) . Area: 12.25 mm2 . Power: 2.5W . Clock: 2.4 GHz . E/op: 651 pJ
Wire Energy Assumptions for 22nm: 100 fJ/bit per mm
Moving a 64-bit operand: 1mm = ~6 pJ, 20mm = ~120 pJ
RISC-V core + vector accelerator (1.2mm x 0.5mm) . Area: 0.6 mm2 . Power: 0.3W (<0.2W) . Clock: 1.3 GHz . E/op: 150 (75) pJ
RISC-V scalar core (0.23mm x 0.2mm) . Area: 0.046 mm2 . Power: 0.025W . Clock: 1.0 GHz . E/op: 22 pJ
18 Credit: John Shalf (LBNL) When does data movement dominate?
Core Energy/Area est. vs. Data Movement Cost:
Big core (651 pJ/op) . Compute op == data movement energy @ 108mm . Energy ratio for 20mm: 0.2x
RISC-V core + vector accelerator (150/75 pJ/op) . Compute op == data movement energy @ 12mm . Energy ratio for 20mm: 1.6x
RISC-V scalar core (22 pJ/op) . Compute op == data movement energy @ 3.6mm . Energy ratio for 20mm: 5.5x
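These break-even distances follow directly from the wire-energy assumption above (a worked check, using ~6 pJ/mm for a 64-bit operand): 651 pJ / 6 pJ/mm ≈ 108mm and 120 pJ / 651 pJ ≈ 0.2x; 75 pJ / 6 pJ/mm ≈ 12mm and 120/75 = 1.6x; 22 pJ / 6 pJ/mm ≈ 3.7mm and 120/22 ≈ 5.5x.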
19 Credit: John Shalf (LBNL) DPHPC Overview
20 Goals of this lecture
Memory Trends
Cache Coherence
Memory Consistency
21 Memory – CPU gap widens
Measure processor speed as “throughput” . FLOP/s, IOP/s, … . Moore’s law: ~60% growth per year
Source: Jack Dongarra
Today’s architectures . POWER8: 338 dp GFLOP/s, 230 GB/s memory bw . Broadwell i7-5775C: 883 GFLOP/s, ~50 GB/s memory bw . Trend: memory performance grows ~10% per year
Source: John McCalpin 22 Issues (AMD Interlagos as Example)
How to measure bandwidth? . Data sheet (often peak performance, may include overheads): frequency times bus width: 51 GiB/s . Microbenchmark performance: stride-1 access (32 MiB): 32 GiB/s; random access (8 B out of 32 MiB): 241 MiB/s. Why? . Application performance: as observed (performance counters), somewhere in between stride-1 and random access
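For reference, the stride-1 case can be reproduced with a few lines of C (a minimal sketch: the 32 MiB buffer matches the example above, while REPS and the single-threaded read loop are illustrative choices; the sum is printed so the compiler cannot remove the loop):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define BYTES (32u << 20)   /* 32 MiB working set, as in the numbers above */
    #define REPS  100           /* illustrative repetition count */

    int main(void) {
        size_t n = BYTES / sizeof(double);
        double *a = malloc(BYTES);
        for (size_t i = 0; i < n; i++) a[i] = 1.0;   /* touch all pages first */

        struct timespec t0, t1;
        double sum = 0.0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < REPS; r++)
            for (size_t i = 0; i < n; i++)           /* stride-1 read stream */
                sum += a[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("read bandwidth: %.2f GiB/s (sum=%g)\n",
               (double)BYTES * REPS / s / (1u << 30), sum);
        free(a);
        return 0;
    }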
How to measure latency? . Data sheet (often optimistic, or not provided): <100ns . Random pointer chase: 110 ns with one core, 258 ns with 32 cores!
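The random pointer chase can be sketched as follows (a minimal sketch, assuming a 32 MiB chain as above; linking the entries in a shuffled cyclic order defeats the hardware prefetcher, so every iteration pays the full load-to-use latency):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N     (1u << 22)   /* 4M pointers * 8 B = 32 MiB, as above */
    #define ITERS (1u << 24)   /* illustrative iteration count */

    int main(void) {
        void **chain = malloc(N * sizeof(void *));
        size_t *order = malloc(N * sizeof(size_t));
        for (size_t i = 0; i < N; i++) order[i] = i;
        /* Fisher-Yates shuffle: a random visit order defeats the prefetcher. */
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }
        /* Linking successive entries of the shuffled order yields one N-cycle. */
        for (size_t i = 0; i < N; i++)
            chain[order[i]] = &chain[order[(i + 1) % N]];

        struct timespec t0, t1;
        void **p = &chain[order[0]];
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < ITERS; i++)
            p = (void **)*p;              /* each load depends on the previous */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("avg load-to-use latency: %.1f ns (p=%p)\n", ns / ITERS, (void *)p);
        free(chain); free(order);
        return 0;
    }

23 Conjecture: Buffering is a must!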
Two most common examples:
Write Buffers . Delayed write back saves memory bandwidth . Data is often overwritten or re-read
Caching . Directory of recently used locations . Stored as blocks (cache lines)
24 Cache Coherence
Different caches may have a copy of the same memory location!
Cache coherence . Manages existence of multiple copies
Cache architectures . Multi level caches . Shared vs. private (partitioned) . Inclusive vs. exclusive . Write back vs. write through . Victim cache to reduce conflict misses . …
25 Exclusive Hierarchical Caches
26 Shared Hierarchical Caches
27 Shared Hierarchical Caches with MT
28 Caching Strategies (repeat)
Remember: . Write Back? . Write Through?
Cache coherence requirements: a memory system is coherent if it guarantees the following: . Write propagation (updates are eventually visible to all readers) . Write serialization (writes to the same location must be observed in order) Everything else: memory model issues (later)
29 Write Through Cache
1. CPU0 reads X from memory • loads X=0 into its cache
2. CPU1 reads X from memory • loads X=0 into its cache
3. CPU0 writes X=1 • stores X=1 in its cache • stores X=1 in memory
4. CPU1 reads X from its cache • loads X=0 from its cache
Incoherent value for X on CPU1
CPU1 may wait for update!
Requires write propagation!
30 Write Back Cache
1. CPU0 reads X from memory • loads X=0 into its cache
2. CPU1 reads X from memory • loads X=0 into its cache
3. CPU0 writes X=1 • stores X=1 in its cache
4. CPU1 writes X =2 • stores X=2 in its cache
5. CPU1 writes back cache line • stores X=2 in in memory
6. CPU0 writes back cache line • stores X=1 in memory
Later store X=2 from CPU1 lost
Requires write serialization!
31 A simple (?) example
Assume C99: struct twoint { int a; int b; };
Two threads: . Initially: a=b=0 . Thread 0: write 1 to a . Thread 1: write 1 to b
Assume a non-coherent write-back cache . What may end up in main memory?
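A minimal sketch of this scenario with POSIX threads (thread names are illustrative; the point is that a and b share one cache line, so without coherence either core’s whole-line write-back can overwrite the other’s update):

    #include <pthread.h>
    #include <stdio.h>

    struct twoint { int a; int b; };   /* both fields fall into one cache line */
    static struct twoint x = {0, 0};

    static void *thread0(void *arg) { x.a = 1; return NULL; }
    static void *thread1(void *arg) { x.b = 1; return NULL; }

    int main(void) {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, thread0, NULL);
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        /* With coherent caches the only outcome is a=1, b=1. Without
           coherence, each core may write back its own stale copy of the
           whole line, so memory may see {1,1}, {1,0}, {0,1}, or {0,0}. */
        printf("a=%d b=%d\n", x.a, x.b);
        return 0;
    }

32 Cache Coherence Protocol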
32 Cache Coherence Protocol
Programmer can hardly deal with unpredictable behavior!
Cache controller maintains data integrity . All writes to different locations are visible
Fundamental Mechanisms
Snooping . Shared bus or (broadcast) network
Directory-based . Record information necessary to maintain coherence: E.g., owner and state of a line etc.
33 Fundamental CC mechanisms
Snooping . Shared bus or (broadcast) network . Cache controller “snoops” all transactions . Monitors and changes the state of the cache’s data . Works at small scale, challenging at large scale (e.g., Intel Broadwell)
Directory-based . Record information necessary to maintain coherence, e.g., owner and state of a line . Central/distributed directory for cache line ownership . Scalable but more complex/expensive (e.g., Intel Xeon Phi KNC/KNL) (Figure source: Intel)
34 Cache Coherence Parameters
Concerns/Goals . Performance . Implementation cost (chip space, more important: dynamic energy) . Correctness . (Memory model side effects)
Issues . Detection (when does a controller need to act) . Enforcement (how does a controller guarantee coherence) . Precision of block sharing (per block, per sub-block?) . Block size (cache line size?)
35 An Engineering Approach: Empirical start
Problem 1: stale reads . Cache 1 holds a value that was already modified in cache 2 . Solution: disallow this state; invalidate all remote copies before allowing a write to complete
Problem 2: lost update . Incorrect write-back of a modified line writes main memory in a different order from the order of the write operations, or overwrites neighboring data . Solution: disallow more than one modified copy
36 Invalidation vs. update (I)
Invalidation-based: . Each write to a shared line has to invalidate copies in remote caches . Simple implementation for bus-based systems: each cache snoops; invalidate lines written by other CPUs; signal sharing for cache lines in the local cache to other caches
Update-based: . A local write updates copies in remote caches . Can update all CPUs at once . Multiple writes cause multiple updates (more traffic)
37 Invalidation vs. update (II)
Invalidation-based: . Only write misses hit the bus (works with write-back caches) . Subsequent writes to the same cache line are local . Good for multiple writes to the same line (in the same cache)
Update-based: . All sharers continue to hit cache line after one core writes Implicit assumption: shared lines are accessed often . Supports producer-consumer pattern well . Many (local) writes may waste bandwidth!
Hybrid forms are possible!
38 MESI Cache Coherence
Most common hardware implementation of the discussed requirements, aka the “Illinois protocol”. Each line has one of the following states (in a cache):
Modified (M) . Local copy has been modified, no copies in other caches . Memory is stale
Exclusive (E) . No copies in other caches . Memory is up to date
Shared (S) . Unmodified copies may exist in other caches . Memory is up to date
Invalid (I)
. Line is not in cache 39 Terminology
Clean line: . Content of cache line and main memory is identical (also: memory is up to date) . Can be evicted without write-back
Dirty line: . Content of cache line and main memory differ (also: memory is stale) . Needs to be written back eventually Time depends on protocol details
Bus transaction: . A signal on the bus that can be observed by all caches . Usually blocking
Local read/write: . A load/store operation originating at a core connected to the cache
40 Transitions in response to local reads
State is M . No bus transaction
State is E . No bus transaction
State is S . No bus transaction
State is I . Generate bus read request (BusRd) May force other cache operations (see later) . Other cache(s) signal “sharing” if they hold a copy . If shared was signaled, go to state S . Otherwise, go to state E
After update: return read value
41 Transitions in response to local writes
State is M . No bus transaction
State is E . No bus transaction . Go to state M
State is S . Line already local & clean . There may be other copies . Generate bus read request for upgrade to exclusive (BusRdX*) . Go to state M
State is I . Generate bus read request for exclusive ownership (BusRdX) . Go to state M
42 Transitions in response to snooped BusRd
State is M . Write cache line back to main memory . Signal “shared” . Go to state S
State is E . Signal “shared” . Go to state S
State is S . Signal “shared”
State is I . Ignore
43 Transitions in response to snooped BusRdX
State is M . Write cache line back to memory . Discard line and go to I
State is E . Discard line and go to I
State is S . Discard line and go to I
State is I . Ignore
BusRdX* is handled like BusRdX!
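The four transition tables above can be condensed into a single transition function (a minimal sketch for one line in one cache; the “shared”-signal wiring on snoops and the data transfers themselves are omitted):

    #include <stdbool.h>

    /* One cache line in one cache; asserting "shared" on snoops and the
       actual data movement are left out. A snooped BusRdX* is treated
       exactly like a snooped BusRdX, as stated above. */
    typedef enum { M, E, S, I } mesi_t;
    typedef enum { LocalRead, LocalWrite, SnoopBusRd, SnoopBusRdX } event_t;

    mesi_t mesi_next(mesi_t s, event_t ev, bool shared_signal,
                     bool *busrd, bool *busrdx, bool *writeback) {
        *busrd = *busrdx = *writeback = false;
        switch (ev) {
        case LocalRead:
            if (s == I) {                /* miss: fetch the line */
                *busrd = true;           /* BusRd; others may signal "shared" */
                return shared_signal ? S : E;
            }
            return s;                    /* M/E/S: hit, no bus transaction */
        case LocalWrite:
            if (s == S) *busrdx = true;  /* BusRdX*: upgrade, line is clean */
            if (s == I) *busrdx = true;  /* BusRdX: obtain exclusive ownership */
            return M;                    /* E goes to M silently, M stays M */
        case SnoopBusRd:                 /* another cache reads the line */
            if (s == M) { *writeback = true; return S; }
            if (s == E) return S;
            return s;                    /* S stays S, I ignores */
        case SnoopBusRdX:                /* another cache wants ownership */
            if (s == M) *writeback = true;
            return I;                    /* M/E/S discard the line; I ignores */
        }
        return s;
    }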
44 MESI State Diagram (FSM)
Source: Wikipedia 45 Small Exercise
Initially: all in I state
Action       P1 state  P2 state  P3 state  Bus action  Data from
P1 reads x
P2 reads x
P1 writes x
P1 reads x
P3 writes x
46 Small Exercise
Initially: all in I state
Action       P1 state  P2 state  P3 state  Bus action  Data from
P1 reads x   E         I         I         BusRd       Memory
P2 reads x   S         S         I         BusRd       Cache
P1 writes x  M         I         I         BusRdX*     Cache
P1 reads x   M         I         I         -           Cache
P3 writes x  I         I         M         BusRdX      Memory
47 Optimizations?
Class question: what could be optimized in the MESI protocol to make a system faster?
48 Related Protocols: MOESI (AMD)
Extended MESI protocol
Cache-to-cache transfer of modified cache lines . Cache in M or O state always transfers cache line to requesting cache . No need to contact (slow) main memory
Avoids write back when another process accesses cache line . Good when cache-to-cache performance is higher than cache-to-memory E.g., shared last level cache!
49 MOESI State Diagram
Source: AMD64 Architecture Programmer’s Manual 50 Related Protocols: MOESI (AMD)
Modified (M): Modified Exclusive . No copies in other caches, local copy dirty . Memory is stale, cache supplies copy (reply to BusRd*)
Owner (O): Modified Shared . Exclusive right to make changes . Other S copies may exist (“dirty sharing”) . Memory is stale, cache supplies copy (reply to BusRd*)
Exclusive (E): . Same as MESI (one local copy, up to date memory)
Shared (S): . Unmodified copy may exist in other caches . Memory is up to date unless an O copy exists in another cache
Invalid (I): . Same as MESI 51 Related Protocols: MESIF (Intel)
Modified (M): Modified Exclusive . No copies in other caches, local copy dirty . Memory is stale, cache supplies copy (reply to BusRd*)
Exclusive (E): . Same as MESI (one local copy, up to date memory)
Shared (S): . Unmodified copy may exist in other caches . Memory is up to date
Invalid (I): . Same as MESI
Forward (F): . Special form of S state, other caches may have line in S . Most recent requester of line is in F state . Cache acts as responder for requests to this line 52 Multi-level caches
Most systems have multi-level caches . Problem: only “last level cache” is connected to bus or network . Yet, snoop requests are relevant for inner-levels of cache (L1) . Modifications of L1 data may not be visible at L2 (and thus the bus)
L1/L2 modifications . On BusRd check if line is in M state in L1 It may be in E or S in L2! . On BusRdX(*) send invalidations to L1 . Everything else can be handled in L2
If L1 is write through, L2 could “remember” state of L1 cache line . May increase traffic though
53 Directory-based cache coherence
Snooping does not scale . Bus transactions must be globally visible . Implies broadcast
Typical solution: tree-based (hierarchical) snooping . Root becomes a bottleneck
Directory-based schemes are more scalable . Directory (entry for each CL) keeps track of all owning caches . Point-to-point update to involved processors No broadcast Can use specialized (high-bandwidth) network, e.g., HT, QPI …
54 Basic Scheme
System with N processors Pi
For each memory block (size: cache line) maintain a directory entry . N presence bits
. Set if block in cache of Pi . 1 dirty bit
First proposed by Censier and Feautrier (1978)
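Such an entry can be written down directly (a minimal sketch assuming N <= 64 processors, so the presence bits fit in one 64-bit word; the layout is illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    /* One directory entry per memory block (cache-line sized).
       Assumption: N <= 64 processors, so presence bits fit in one word. */
    typedef struct {
        uint64_t presence;   /* bit i set => block is cached by Pi */
        bool     dirty;      /* set => exactly one cache holds a modified copy */
    } dir_entry_t;

    static inline bool present(const dir_entry_t *e, int i) {
        return (e->presence >> i) & 1;
    }
    static inline void set_present(dir_entry_t *e, int i) {
        e->presence |= (1ULL << i);
    }
    static inline void clear_present(dir_entry_t *e, int i) {
        e->presence &= ~(1ULL << i);
    }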
55 Directory-based CC: Read miss
Pi intends to read, misses
If dirty bit (in directory) is off . Read from main memory . Set presence[i] . Supply data to reader
If dirty bit is on
. Recall cache line from Pj (determine by presence[]) . Update memory . Unset dirty bit, block shared . Set presence[i] . Supply data to reader
56 Directory-based CC: Write miss
Pi intends to write, misses
If dirty bit (in directory) is off
. Send invalidations to all processors Pj with presence[j] turned on . Unset presence bit for all processors . Set dirty bit
. Set presence[i], owner Pi
If dirty bit is on
. Recall cache line from owner Pj . Update memory . Unset presence[j] . Set presence[i], dirty bit remains set . Supply data to writer
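The read- and write-miss handlers from the last two slides then map onto this entry as follows (a sketch building on the dir_entry_t above; recall(), invalidate(), supply(), and owner() stand in for the actual point-to-point messages and owner lookup and are assumed helpers, not a real API):

    extern void recall(int p);       /* assumed: fetch dirty line from Pp, update memory */
    extern void invalidate(int p);   /* assumed: invalidate Pp's copy */
    extern void supply(int p);       /* assumed: send the line's data to Pp */

    static inline int owner(const dir_entry_t *e) {
        return __builtin_ctzll(e->presence);   /* the single presence bit set */
    }

    void read_miss(dir_entry_t *e, int i) {
        if (e->dirty) {              /* dirty bit on: recall from owner Pj */
            recall(owner(e));        /* memory is updated by the recall */
            e->dirty = false;        /* block becomes shared */
        }                            /* dirty bit off: read from main memory */
        set_present(e, i);
        supply(i);                   /* supply data to the reader */
    }

    void write_miss(dir_entry_t *e, int i) {
        if (e->dirty) {              /* dirty bit on: recall from owner Pj */
            int j = owner(e);
            recall(j);               /* memory is updated by the recall */
            clear_present(e, j);     /* unset presence[j]; dirty stays set */
        } else {                     /* dirty bit off: invalidate all sharers */
            for (int j = 0; j < 64; j++)
                if (present(e, j)) { invalidate(j); clear_present(e, j); }
        }
        e->dirty = true;             /* Pi becomes the owner */
        set_present(e, i);
        supply(i);                   /* supply data to the writer */
    }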
57 Discussion
Scaling of memory bandwidth . No centralized memory
Directory-based approaches scale with restrictions . Require presence bit for each cache . Number of bits determined at design time . Directory requires memory (size scales linearly) . Shared vs. distributed directory
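A quick estimate of that linear memory scaling (assuming 64-byte cache lines, i.e., 512 bits per block): a directory entry costs N+1 bits per block, so the overhead is (N+1)/512 of memory capacity, e.g., about 13% for N = 64 processors.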
Software-emulation . Distributed shared memory (DSM) . Emulate cache coherence in software (e.g., TreadMarks) . Often on a per-page basis, utilizes memory virtualization and paging
59 Open Problems (for projects or theses)
Tune algorithms to cache-coherence schemes . What is the optimal parallel algorithm for a given scheme? . Parameterize for an architecture
Measure and classify hardware . Read Maranget et al. “A Tutorial Introduction to the ARM and POWER Relaxed Memory Models” and have fun! . RDMA consistency is barely understood! . GPU memories are not well understood! Huge potential for new insights!
Can we program (easily) without cache coherence? . How to fix the problems with inconsistent values? . Compiler support (issues with arrays)?
60 Case Study: Intel Xeon Phi
61 Communication?
62 Ramos, Hoefler: “Modeling Communication in Cache-Coherent SMP Systems - A Case-Study with Xeon Phi”, HPDC’13
Invalid read: RI = 278 ns
Local read: RL = 8.6 ns
Remote read: RR = 235 ns
63 Inspired by Molka et al.: “Memory performance and cache coherency effects on an Intel Nehalem multiprocessor system” Single-Line Ping Pong
Prediction for both in E state: 479 ns . Measurement: 497 ns (O=18)
64 Multi-Line Ping Pong
More complex due to prefetch
[Figure: multi-line ping-pong model annotated with: number of CLs; amortization of startup; startup overhead; asymptotic fetch latency for each cache line (optimal prefetch!)]
65 Multi-Line Ping Pong
E state: . o=76 ns . q=1,521ns . p=1,096ns
I state: . o=95ns . q=2,750ns . p=2,017ns
66 DTD Contention
E state: . a=0ns . b=320ns . c=56.2ns
67