why I love memory latency: high throughput through latency masking hardware multithreading

WALID NAJJAR, UC RIVERSIDE

the team

the cheerleaders: Walid A. Najjar, Vassilis J. Tsotras, Vagelis Papalexakis, Daniel Wong

the players: • Robert J. Halstead (Xilinx) • Ildar Absalyamov (Google) • Prerna Budhkar (Intel) • Skyler Windh (Micron) • Vassileos Zois (UCR) • Bashar Roumanous (UCR) • Mohammadreza Rezvani (UCR)

outline

✦ background ✦ filament execution model ✦ application - SpMV ✦ application - Selection ✦ application - Group by aggregation ✦ conclusion

current DRAM trends

Growing memory capacity and low cost have created a niche for in-memory solutions.

Low-latency memory hardware is more expensive (39x) and has lower capacity than commonly used DDRx DRAM.

Though memory performance is improving, it is not likely to match CPU performance anytime soon.

[figure: peak CPU performance vs. peak memory performance, clock (MHz), 1990-2016]

Need alternative techniques to mitigate memory latency!

K. K. Chang. Understanding and Improving Latency of DRAM-Based Memory Systems. PhD thesis, Carnegie Mellon University, 2017.

latency mitigation - aka caching

Intel Core i7-5960X: Launch Date Q3'14; Clock Freq. 3.0 GHz; Cores / Threads 8 / 16; L3 Cache 20 MB; Memory Channels 4; Memory Bandwidth 68 GB/s

๏ caches > 80% of CPU area
๏ big energy sink!
๏ application must have temporal and/or spatial locality
๏ up to 7 levels of storage hierarchy (register file, L1, L2, L3, memory, disk cache, disk)

caches everywhere! reduces average memory latency; assumes locality

where cycles go - cache misses

[DaMoN'13] Tözün, P. et al. OLTP in wonderland: where do cache misses come from in major OLTP components?
[TODS'07] Chen, S. et al. Improving Hash Join Performance through Prefetching.

big data analytics

FPL 2019 - © W. Najjar 6 big data analytics

Yale is 80 - © W. Najjar 7 big data analytics

BIG data (x petaB)

Yale is 80 - © W. Najjar 7 big data analytics

BIG data (x petaB) +

Yale is 80 - © W. Najjar 7 big data analytics

BIG data hyper (x petaB) + dimensional

Yale is 80 - © W. Najjar 7 big data analytics

BIG data hyper (x petaB) + dimensional

Yale is 80 - © W. Najjar 7 big data analytics

BIG data hyper (x petaB) + dimensional

no data marshaling

Yale is 80 - © W. Najjar 7 big data analytics

BIG data hyper (x petaB) + dimensional

no data marshaling

Yale is 80 - © W. Najjar 7 big data analytics

BIG data hyper (x petaB) + dimensional

no data sparse data marshaling

Yale is 80 - © W. Najjar 7 big data analytics

BIG data hyper (x petaB) + dimensional

no data sparse data marshaling

Yale is 80 - © W. Najjar 7 big data analytics

BIG data hyper (x petaB) + dimensional

no data sparse data marshaling

no locality!

Yale is 80 - © W. Najjar 7 You have to know the past to understand the present - Carl Sagan

the latency problem & solutions

✦ latency mitigation, aka "caching" ❖ effective only with spatial/temporal locality

✦ streaming: latency incorporated into the pipeline ❖ requires spatial locality

✦ latency masking, aka hardware multithreading ❖ fast context switch after memory access ❖ small context state ❖ concurrency == outstanding memory requests

Burton Smith (1941 - 2018)

❖ main architect of: Denelcor HEP (1982); Horizon (1985-88) @ IDA SRC; Tera MTA (1999), later Cray XMT

❖ latency masking = do useful work while waiting for memory = multithreading ❖ switch to a ready process on long-latency I/O operations ❖ => higher throughput ❖ 50 years ago it was I/O latency => multi-programming

❖ https://www.microsoft.com/en-us/research/people/burtons/

Tera MTA/Cray XMT

✦ barrel processor ✦ 128 independent threads/processor ✦ 2 inst/cycle issued: 1 ALU, 1 memory ✦ 1 full/empty bit per word ensures correct ordering of memory accesses ✦ longest memory latency: 128 cycles ✦ randomized memory: reduces bank conflicts ✦ processor & memory nodes on a 3D torus interconnection network ✦ 4096 nodes: sparsely populated

barrel processor

✦ a CPU that switches between threads of execution on every cycle ✦ also known as "interleaved" or "fine-grained" temporal multithreading ✦ each thread is assigned its own program counter and other hardware registers (architectural state) ✦ guarantees each thread will execute one instruction every n cycles ✦ an n-way barrel processor acts like n separate processors, ✦ each running at ~1/n the original speed

[figure: per-thread states rotate through the data path in a fixed thread execution order, each issuing to MEMORY in turn]
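To make the rotation concrete, here is a minimal software sketch of an n-way barrel in C++ (C++ is used for all sketches below); the ThreadContext fields and the toy "instruction" are illustrative stand-ins, not details of any of the machines above.

```cpp
// Minimal sketch of barrel (fine-grained temporal) multithreading:
// N thread contexts rotate through one execution pipe, one instruction
// per context per cycle, so each thread runs at ~1/N speed.
#include <array>
#include <cstdio>

struct ThreadContext {      // per-thread architectural state
    int pc = 0;             // program counter
    long acc = 0;           // stand-in for the register file
    bool done = false;
};

constexpr int N = 4;        // n-way barrel

int main() {
    std::array<ThreadContext, N> ctx;
    int live = N;
    for (long cycle = 0; live > 0; ++cycle) {
        ThreadContext& t = ctx[cycle % N];  // rotate contexts every cycle
        if (t.done) continue;
        t.acc += t.pc;                      // "execute" one toy instruction
        if (++t.pc == 10) { t.done = true; --live; }
    }
    for (int i = 0; i < N; ++i)
        std::printf("thread %d: acc = %ld\n", i, ctx[i].acc);
}
```

Because the pipe never sees two consecutive instructions from the same thread, stalls within one thread are hidden by the other n-1 contexts.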

outline

✦ background ✦ filament execution model ✦ application - SpMV ✦ application - Selection ✦ application - Group by aggregation ✦ conclusion

what if we take it a step further?

✦ trade bandwidth for latency ✦ custom or semi-custom processor/accelerator ✦ no need for large register files and thread state ➡ smaller data path ➡ more processors ➡ more threads per processor (engine or core) ➡ much higher throughput! ✦ suited for large-scale irregular applications: ➡ data analytics, databases, sparse linear algebra: SpMV, SpMM, graph algorithms, bioinformatics

hardware multithreading

✦ call them filaments, to distinguish from "threads"
✦ an executing filament is preempted once it accesses memory: the request is issued and the filament joins the waiting queue
✦ on the memory response, the filament moves back to the ready queue
✦ the ready & waiting filament queues fit 1000s of filament states
✦ massive parallelism: ≥ 4,000 waiting filaments

[figure: a core with a custom data path; a ready filament enters the data path, issues a request to MEMORY, waits as a waiting filament, and returns to the ready queue on the response]

filament data path

[figure: the DATA PATH reads input data from and writes output data to MEMORY; a filament that issues a read enters the wait queue; filaments terminate at the end of the data path]

filament data path example

A[i] += B[i] * C[D[i]]; % typical of sparse linear algebra

the filament walks the data path one memory access at a time: generate i → fetch D[i] → fetch C[D[i]] → fetch B[i] → write A[i]; after each request it sits in the wait queue until the response returns (a software sketch follows)
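A minimal software model of this split-phase execution, assuming a fixed memory latency and FIFO ready/wait queues; the Stage enum, the Request struct, and all constants are illustrative assumptions, not the talk's hardware design.

```cpp
// Sketch of split-phase filament execution for A[i] += B[i] * C[D[i]].
// A filament carries only its index, its next stage, and the values fetched
// so far; after every memory access it parks in the wait queue.
#include <cstdio>
#include <deque>
#include <vector>

enum Stage { FETCH_D, FETCH_C, FETCH_B, WRITE_A, DONE };

struct Filament {
    int i;          // row index this filament owns
    Stage stage;    // next data-path stage (the whole context!)
    long d, c, b;   // values returned by memory so far
};

struct Request { Filament f; long ready_at; };  // models memory latency

int main() {
    const int n = 8;
    const long LATENCY = 100;                   // cycles per memory round trip
    std::vector<long> A(n, 0), B(n), C(n), D(n);
    for (int i = 0; i < n; ++i) { B[i] = i; C[i] = 2 * i; D[i] = (n - 1) - i; }

    std::deque<Filament> ready;
    std::deque<Request> waiting;                // in-flight memory requests
    for (int i = 0; i < n; ++i) ready.push_back({i, FETCH_D, 0, 0, 0});

    for (long cycle = 0; !ready.empty() || !waiting.empty(); ++cycle) {
        // a memory response moves its filament back to the ready queue
        if (!waiting.empty() && waiting.front().ready_at <= cycle) {
            ready.push_back(waiting.front().f);
            waiting.pop_front();
        }
        if (ready.empty()) continue;            // every filament is waiting
        Filament f = ready.front(); ready.pop_front();
        switch (f.stage) {                      // one data-path step, then park
        case FETCH_D: f.d = D[f.i]; f.stage = FETCH_C; break;
        case FETCH_C: f.c = C[f.d]; f.stage = FETCH_B; break;
        case FETCH_B: f.b = B[f.i]; f.stage = WRITE_A; break;
        case WRITE_A: A[f.i] += f.b * f.c; f.stage = DONE; break;
        case DONE:    break;
        }
        if (f.stage != DONE) waiting.push_back({f, cycle + LATENCY});
        else std::printf("A[%d] = %ld\n", f.i, A[f.i]);
    }
}
```

The data path never stalls on any single filament: while one request is in flight the engine keeps consuming other ready filaments, so the length of the wait queue equals the number of outstanding memory requests.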

a very important and often forgotten law: Little's law

a queueing system with arrival rate λ and service time S:
❖ N = customers in the system, w = wait time
❖ holds for any distribution of λ and S

N = λ·w

applied to a memory pipeline:
❖ L = pipeline latency ❖ B = memory bandwidth ❖ C = parallelism

C = L·B

high bandwidth + hyper-pipelining —> massive parallelism
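A worked instance, assuming a round-trip memory latency of 100 ns (an illustrative figure, not from the talk) with the HC-2ex's 76.8 GB/s bandwidth and 8-byte requests from the slide below:

```latex
C = L \cdot B
  = 100\,\mathrm{ns} \times \frac{76.8\ \mathrm{GB/s}}{8\ \mathrm{B/request}}
  = (100 \times 10^{-9}\,\mathrm{s}) \times (9.6 \times 10^{9}\ \mathrm{requests/s})
  = 960\ \text{outstanding requests}
```

In other words, keeping such a memory system busy takes on the order of a thousand concurrent filaments, which is exactly what the small filament state makes affordable.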

regular vs. irregular applications

✦ Regular Applications ❖ have good temporal and spatial locality ❖ caches are well suited for these types of applications

✦ Irregular Applications ❖ have poor temporal and spatial locality ❖ caches are NOT well suited for these types of applications ❖ multithreading is an alternative • can mask long memory latencies • needs high bandwidth to support a large number of concurrent threads

✦ Won't increasing the cache size mitigate latency and improve performance?

taxonomy of irregular applications

[figure: a spectrum of irregularity — dense matrix, sparse matrix, fixed degree BFS, variable degree BFS]

Convey HC-2ex architecture

Convey HC-2ex: Clock Freq. 150 MHz; Memory Controllers 8; Memory Channels 16 / FPGA; Memory Bandwidth 76.8 GB/s; Outstanding Requests ~500 / channel; Data Width 8 bytes

‣ multi-bank memory architecture
‣ randomized memory allocation

outline

✦ background ✦ filament execution model ✦ application - SpMV ✦ application - Selection ✦ application - Group by aggregation ✦ conclusion

selection query

✦ selection is a database operator ‣ selects all rows (tuples) in a relation (table) ‣ that satisfy a set of conditions ‣ expressed as a logical expression on the attributes ‣ attributes (columns) are elements of a tuple ‣ e.g. ((a1 > 10) ∧ (a2 > 0)) ∨ ((a10 < 100) ∧ (a13 < 0)) ✦ a very common database operator ‣ poor locality ‣ relation can be stored row- or column-wise

selection example

relation example (row: a1, a2, a3, a4, a5, a6):
0: 5, -70, CA, 1992, 0, 12
1: 10, 70, CO, 1990, 1, 4
2: 15, -10, AZ, 2001, 1, 24
3: 20, 3, NV, 1989, 0, 18

query example:
query A: (a5=1) ∨ (a3=NV) ∨ [(a1<20)∧(a2>0)]
query B: (a2<0) ∧ (a3=CA) ∧ (a5=0)

qualifying rows (row: qA, qB):
0: N, Y
1: Y, N
2: Y, N
3: N, N

selection example (2)

the predicate control block (PCB) encodes the selection logic:
‣ early termination (T or F)
‣ only the required attributes are fetched (no caching)

qA predicate control block (condition → action if T / action if F):
a5 = 1 → Terminate T / eval a3
a3 = NV → Terminate T / eval a1
a1 < 20 → eval a2 / Terminate F
a2 > 0 → Terminate T / Terminate F

qB predicate control block:
a2 < 0 → eval a3 / Terminate F
a3 = CA → eval a5 / Terminate F
a5 = 0 → Terminate T / Terminate F
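A minimal sketch of PCB evaluation with early termination, shown for query B; the Predicate struct, the -1/-2 terminate encodings, and the numeric encoding of CA are illustrative assumptions.

```cpp
// Walk a predicate control block: fetch only the attribute the current
// predicate needs, then either jump to the next predicate or terminate.
#include <cstdio>
#include <vector>

enum Op { LT, GT, EQ };

struct Predicate {
    int attr;       // column to fetch (fetched only if this predicate is reached)
    Op  op;
    int constant;
    int if_true;    // next PCB index, or -1 = Terminate T, -2 = Terminate F
    int if_false;
};

bool eval_pcb(const std::vector<Predicate>& pcb, const int* row) {
    int idx = 0;                        // the filament's entire state
    while (true) {
        const Predicate& p = pcb[idx];
        int v = row[p.attr];            // in hardware: a memory fetch (filament waits)
        bool t = (p.op == LT) ? v < p.constant
               : (p.op == GT) ? v > p.constant
                              : v == p.constant;
        int next = t ? p.if_true : p.if_false;
        if (next == -1) return true;    // Terminate T
        if (next == -2) return false;   // Terminate F
        idx = next;
    }
}

int main() {
    // query B: (a2 < 0) AND (a3 = CA) AND (a5 = 0); attributes a1..a6 are
    // columns 0..5 and CA is encoded as 0 for this sketch
    std::vector<Predicate> qB = {
        {1, LT, 0, 1, -2},    // a2 < 0 ? next : Terminate F
        {2, EQ, 0, 2, -2},    // a3 = CA ? next : Terminate F
        {4, EQ, 0, -1, -2},   // a5 = 0 ? Terminate T : Terminate F
    };
    int row0[] = {5, -70, /*CA*/ 0, 1992, 0, 12};   // row 0 of the example relation
    std::printf("row 0 qualifies for qB: %s\n", eval_pcb(qB, row0) ? "Y" : "N");
}
```

Note how a false first predicate terminates the filament after a single attribute fetch — the early-termination behavior that keeps memory traffic proportional to selectivity rather than row width.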

selection CDFG

explanations:
• PCB: predicate control block (stored locally on the FPGA)
• the state of each filament is the current index into the PCB
• multiple queries can be executed simultaneously on the same relation; the state then becomes the pair (PCB #, index)
• data is returned from memory in the order fetched, on one channel

[figure: per-row flow — initialize a filament for each row; fetch the offset of the 1st attribute from the PCB; fetch the attribute from memory; wait queue; evaluate the condition; on Continue, fetch the offset of the next attribute from the PCB and the next attribute from memory; on Terminate T, write i as a qualifying row; on Terminate F, terminate the filament]

experimental evaluation

‣ each row consists of 8 fixed-size 64-bit columns
‣ query selectivity was varied from 0% to 100% (0%: no row qualifies, 100%: all rows qualify)
‣ dataset varied from 8M to 128M rows
‣ results were obtained on the following platforms:

Device: Make & Model / Clock (MHz) / Cores / Memory Size (GB) / Memory Bandwidth (GB/s)
CPU: Intel Xeon E5-2643 / 3,300 / 8 / 128 / 51.2
GPU: Nvidia Titan X / 1,500 / 3,584 / 12 / 480.0
FPGA: Virtex6-760 / 150 / 64 / 64 / 76.8

bandwidth utilization - selection

[figure: bandwidth utilization (%) vs. selectivity (0%, 10%, 50%, 100%) for k = 1..8 predicates, on FILAMENT/MTP, CPU scalar, CPU SIMD, and GPU]

filament has the highest bandwidth utilization

predicate evaluation/sec

[figure: billion predicates/sec (log scale) vs. selectivity (0%, 10%, 50%, 100%) for k = 1..8 predicates plus the theoretical peak, on FILAMENT/MTP, CPU scalar, CPU SIMD, and GPU]

peak = theoretical max throughput

runtime (msec)

[figure: runtime (ms) vs. selectivity (0%, 10%, 50%, 100%) for k = 1..8 predicates, on FILAMENT/MTP, CPU scalar, CPU SIMD, and GPU]

at only 150 MHz

TPC-H benchmark

‣ we use query Q6 for our evaluation because all select conditions apply to the columns of one table and it has 5 predicates
‣ query Q6: SELECT * FROM lineitem WHERE l_shipdate >= date '1995-01-01' AND l_shipdate < …

[figure: Q6 throughput (billion tuples/s) and bandwidth (GB/s) / throughput (MTuples/s) per platform — CPU Scalar, CPU SIMD, MTP, GPU]

TPC-H Evaluation

‣ Memory Fetches: number of columns fetched for evaluation
‣ Avg Evaluations per Row: predicate comparisons per row
‣ Effective Bandwidth: total size of the "lineitem" table processed per unit time / theoretical peak bandwidth, as prescribed in [1]
‣ Peak Bandwidth Utilization: actual amount of data fetched per unit time / theoretical peak bandwidth

(CPU Scalar / CPU SIMD / GPU / Filament)
Memory Fetches: 100% / 18.75% / 18.75% / 11.4%
Avg Evaluations (per Row): 3 / 3 / 3 / 1.83
Effective Bandwidth: 0.47 / 3.44 / 2.01 / 6.13
Peak Bandwidth Utilization: 47.6% / 61.2% / 36.6% / 70.3%

[1] B. Sukhwani, et al. Database Analytics Acceleration Using FPGAs. Parallel Architectures and Compilation Techniques, 2012.

outline

✦ background ✦ filament execution model ✦ application - SpMV ✦ application - Selection ✦ application - Group by aggregation ✦ conclusion

synchronization in irregular applications

[figure: taxonomy revisited — inter-thread synchronization: not needed for dense matrix and sparse matrix; needed for fixed degree BFS and variable degree BFS]

synchronizing between threads

✦ non-deterministic # of threads ❖ requires inter-thread synchronization ➡ the only FPGA platform that supports in-memory synchronization: Convey MX ➡ heavy overhead

✦ move the synchronization on-chip ❖ use a CAM to cache recently seen nodes: reduces the number of memory accesses ❖ map threads (nodes) to engines deterministically ❖ deal with load balancing

CAMs as synchronizing caches

✦ the CAM caches the addresses of all pending memory accesses ❖ all accesses are checked in the CAM first • on a miss, the address is inserted into the CAM • on a hit, the access is deferred • read hits get the returned value • write hits are deferred => strict serialization

✦ some writes need not be done in memory ❖ e.g. aggregation: increment the count locally ==> reduced traffic to memory ❖ very effective on some workloads

✦ each CAM guards a slice of the address space ❖ partitioning of the address space ❖ deal with load balancing
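A behavioral sketch of a CAM acting as a synchronizing cache for aggregation updates; a real CAM matches all entries combinationally, so the map below only models hit/miss/merge behavior, and all names are illustrative.

```cpp
#include <cstdio>
#include <unordered_map>

// Behavioral model of a CAM used as a synchronizing cache for group-by counts.
// A real CAM holds at most CAM_SLOTS addresses (a full CAM stalls the
// requesting filament); this sketch assumes the pending set stays within it.
constexpr std::size_t CAM_SLOTS = 128;

struct SyncCam {
    std::unordered_map<long, long> pending;   // locked address -> merged increments

    // true  => hit: the update was absorbed on-chip, no memory traffic
    // false => miss: address locked, caller issues the memory fetch
    bool update(long addr) {
        auto it = pending.find(addr);
        if (it != pending.end()) { ++it->second; return true; }
        pending.emplace(addr, 0);
        return false;
    }

    // On the memory response: unlock and return the increments accumulated
    // while the fetch was in flight (applied in a single write-back).
    long release(long addr) {
        long merged = pending.at(addr);
        pending.erase(addr);
        return merged;
    }
};

int main() {
    SyncCam cam;
    long addr = 0x1000;       // hash-table address of some key (illustrative)
    cam.update(addr);         // miss: fetch issued, address locked
    cam.update(addr);         // hit: +1 merged on-chip
    cam.update(addr);         // hit: +1 merged on-chip
    std::printf("write-back adds %ld extra counts\n", cam.release(addr));  // 2
}
```

With three updates to the same key, only one fetch and one write-back reach memory; the other two increments are absorbed on-chip.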

in-memory aggregation

✦ a crucial operation for any OLAP workload
✦ aggregation is challenging to implement in hardware logic ❖ not all tuples will create a new node (e.g. update an existing node) ❖ every tuple reads and writes to the table
✦ Content Addressable Memories (CAMs) ❖ used to enforce memory locks and merge jobs together ❖ splits nodes among multiple hash tables, which need to be merged

Relation R (Key, Value): (1, 123), (1, 314), (2, 42), (3, 7), (2, 96) → Agg Relation (Key, Count): (1, 2), (2, 2), (3, 1)

Halstead et al. FPGA-based Multithreading for In-Memory Hash Joins, in 7th Biennial Conf. on Innovative Data Systems Research (CIDR'15), Jan. 4-7, 2015, Asilomar, CA.

thread control flow graph

two CAMs: ✦ one used to synchronize the access to memory ✦ one to cache the value of the aggregate

aggregation engine

✦ CAM: filter keys ❖ jobs with the same key are merged
✦ CAM: HT lock ❖ locks individual hash table locations
✦ linked list ❖ read through to find a match ❖ update/insert the node
✦ the design is limited by the 128 CAM locations
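A plain software sketch of the linked-list step (read through the bucket's chain, update on a match, otherwise insert at the head), reproducing the Agg Relation of the earlier example; in the design every pointer chase is a filament memory access guarded by the HT-lock CAM.

```cpp
// Per-engine hash-table update for group-by COUNT: read through the chain,
// update the node if the key exists, else insert a new node at the head.
#include <cstdio>

struct Node { long key; long count; Node* next; };

constexpr int BUCKETS = 1024;
Node* table[BUCKETS] = {};

void upsert(long key) {
    Node*& head = table[key % BUCKETS];   // illustrative hash: key mod BUCKETS
    for (Node* n = head; n; n = n->next)
        if (n->key == key) { ++n->count; return; }   // update existing node
    head = new Node{key, 1, head};        // insert: new head of the chain
}

int main() {
    long keys[] = {1, 1, 2, 3, 2};        // keys of Relation R from the example
    for (long k : keys) upsert(k);
    for (long k : {1L, 2L, 3L})
        for (Node* n = table[k % BUCKETS]; n; n = n->next)
            if (n->key == k) std::printf("key %ld -> count %ld\n", k, n->count);
}
```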

merging operation

✦ mapping of nodes to engines is done statically ❖ bandwidth between FPGAs is not sufficient ❖ may become the design bottleneck

✦ relation keys can be in multiple nodes in separate hash tables ❖ keys are not duplicated within a hash table

✦ after the job is completed ❖ the multiple hash tables must be merged: a streaming operation ❖ can be done on the FPGA: ➡ stream the list chains for each hash table bucket ➡ merge matching keys together

concurrent FPGA engines

✦ target platform: Convey HC-2ex ❖ 4 Xilinx Virtex-6 FPGAs × 16 memory channels/FPGA
✦ two designs ❖ replicated cores (4 channels/core, 4 cores/FPGA) ❖ multiplexed cores (2.5 channels/core, 6 cores/FPGA) - write channel shared by 2 cores

[figure: replicated engine design vs. multiplexed engine design — engines attached to the 16 main-memory channels]

fine v/s coarse-grain locking

[figure: hash table buckets, each a linked list of (key, count) nodes]

✦ coarse-grain ❖ lock on a hash table entry ❖ fewer locks ==> less parallelism ❖ lower pressure on the synchronizing CAM

✦ fine-grain ❖ lock on a linked list entry ❖ more parallel filaments ❖ higher pressure on the CAM

who wins?

experimental evaluation FGL v/s CGL

[figure: throughput (MTuples/s/GB/s, log scale) vs. relation cardinality (2¹⁰ to 2²²), coarse-grain vs. fine-grain locking, on uniform, heavy hitter, and moving cluster datasets]

✦ fine-grain: ~4x higher throughput
✦ note: throughput is constant v/s relation cardinality

comparison to software

✦ 5 aggregation algorithms ❖ Independent Tables — hardware-oblivious [VLDB'07] ❖ Shared Table ❖ Partition & Aggregate — hardware-conscious [VLDB'07] ❖ Hybrid Table — hybrid [DAMON'11] ❖ Partition with Local Aggregation Table (PLAT)

[ICDE'13] Balkesen, C. et al. Main-memory Hash Joins on Multi-core CPUs: Tuning to the underlying hardware.
[VLDB'07] J. Cieslewicz et al. Adaptive Aggregation on Chip Multiprocessors.
[DAMON'11] Y. Ye et al. Scalable aggregation on multicore processors.

experimental evaluation

hardware region: FPGA board Virtex-6 760; # FPGAs 2; Clock Freq. 150 MHz; Engines per FPGA 4 / 3; Memory Channels 32; Memory Bandwidth 38.4 GB/s (total)

software region: CPU Intel Xeon E5-2643; # CPUs 1; Cores / Threads 4 / 8; Clock Freq. 3.3 GHz; L3 Cache 10 MB; Memory Bandwidth 51.2 GB/s (total)

✦ software approaches ❖ Individual Table ❖ Shared Table (atomic) ❖ Hybrid Table ❖ Partitioning ❖ PLAT
✦ datasets ❖ Uniform, Zipf 0.5, Heavy Hitter, Moving Cluster, Self Similar
✦ 8M to 64M tuples/benchmark
✦ 1K to 4M cardinality

uniform workload

[figure: throughput (MTuples/sec) vs. relation cardinality (2¹⁰ to 2²²); FGL CAM64/CAM128 FPGA and multiplexed FPGA vs. Shared Table, Independent Tables, Hybrid Aggregation, Partition & Aggregate, and PLAT; filament ≈ 5x CPU]

Uniform - 256M tuples

heavy-hitter workload

[figure: throughput (MTuples/sec) vs. relation cardinality (2¹⁰ to 2²²), same legend; filament ≈ 10x CPU]

Heavy Hitter - 256M tuples

self-similar workload

[figure: throughput (MTuples/sec) vs. relation cardinality (2¹⁰ to 2²²), same legend; filament ≈ 10x CPU]

Self Similar - 256M tuples

bandwidth utilization

✦ the FPGA implementation achieves much higher memory bandwidth utilization than software

[figure: ratio of avg/peak bandwidth vs. relation cardinality (2¹⁰ to 2²²), software vs. FPGA, for 16M/32M/64M/128M-tuple datasets]

outline

✦ background ✦ filament execution model ✦ application - SpMV ✦ application - Selection ✦ application - Group by aggregation ✦ conclusion

other results, past and on-going

✦ Big data is: BIG and multi-dimensional ❖ access patterns to the data are extremely irregular ❖ no ideal storage scheme - consistently poor locality

✦Other applications ❖ string matching, hash join, group by aggregation, selection, outer joins, sorting

✦Challenges ❖ inter-filament synchronization ❖ dynamic load re-balancing (at run time)

publications

http://www.cs.ucr.edu/~najjar/publications.html

✦ P. Budhkar, I. Absalyamov, V. Zois, S. Windh, W. A. Najjar and V. J. Tsotras. Accelerating In-Memory Database Selections Using Latency Masking Hardware Threads, in ACM TACO, 2019.
✦ I. Absalyamov, R. J. Halstead, P. Budhkar, W. A. Najjar, S. Windh, V. J. Tsotras. FPGA-Accelerated Group-by Aggregation Using Synchronizing Caches. 12th Int. Workshop on Data Management on New Hardware (DaMoN 2016), co-located with ACM SIGMOD/PODS, San Francisco, USA, June 27, 2016.
✦ S. Windh, P. Budhkar and W. A. Najjar. CAMs as Synchronizing Caches for Multithreaded Irregular Applications on FPGAs. ICCAD 2015: 331-336.
✦ R. J. Halstead, I. Absalyamov, W. A. Najjar and V. J. Tsotras. FPGA-based Multithreading for In-Memory Hash Joins, in 7th Biennial Conference on Innovative Data Systems Research (CIDR'15), January 4-7, 2015, Asilomar, California.
✦ E. B. Fernandez, J. Villarreal, S. Lonardi, and W. A. Najjar. FHAST: FPGA-based Acceleration of Bowtie in Hardware, in IEEE/ACM Trans. on Computational Biology and Bioinformatics, #99, Feb. 2015.
✦ R. J. Halstead, W. A. Najjar and O. Huseini. SpVM Acceleration with Latency Masking Threads on FPGAs. Technical Report UCR-CSE-2014-04001.

conclusion

✦ advantages of filament execution ‣ very small state allows ‣ fast context switching between filaments ‣ small storage for the waiting queues ‣ masking the latency of memory and non-volatile storage ‣ no reliance on locality, no caches ‣ lower energy consumption ✦ limitations: custom or semi-custom accelerators

thank you!

questions?
