Why I Love Memory Latency: High Throughput through Latency Masking Hardware Multithreading
Walid Najjar, UC Riverside

the team
the cheerleaders: Walid A. Najjar, Vassilis J. Tsotras, Vagelis Papalexakis, Daniel Wong
the players: • Robert J. Halstead (Xilinx) • Ildar Absalyamov (Google) • Prerna Budhkar (Intel) • Skyler Windh (Micron) • Vasileios Zois (UCR) • Bashar Roumanous (UCR) • Mohammadreza Rezvani (UCR)
Yale is 80 - © W. Najjar

outline

✦ background
✦ filament execution model
✦ application - SpMV
✦ application - Selection
✦ application - Group by aggregation
✦ conclusion
FPL 2019 - © W. Najjar

current DRAM trends

Growing memory capacity and low cost have created a niche for in-memory solutions.
Low-latency memory hardware is more expensive (39x) and has lower capacity than commonly used DDRx DRAM.
Though memory performance is improving, it is not likely to match CPU performance anytime soon.

[chart: peak CPU performance vs. peak memory performance (clock, MHz), 1990-2016]

Need alternative techniques to mitigate memory latency!
K. K. Chang. Understanding and Improving Latency of DRAM-Based Memory Systems. PhD thesis, Carnegie Mellon University, 2017.
latency mitigation - aka caching

Intel Core i7-5960X
Launch Date: Q3'14
Clock Freq.: 3.0 GHz
Cores / Threads: 8 / 16
L3 Cache: 20 MB
Memory Channels: 4
Memory Bandwidth: 68 GB/s

๏ caches > 80% of CPU area
๏ big energy sink!
๏ application must have temporal and/or spatial locality
๏ up to 7 levels of storage hierarchy (register file, L1, L2, L3, memory, disk cache, disk)

caches everywhere! reduces average memory latency; assumes locality

where cycles go - cache misses
[DaMoN'13] Tözün, P. et al. OLTP in wonderland: where do cache misses come from in major OLTP components?
[TODS'07] Chen, S. et al. Improving Hash Join Performance through Prefetching
FPL 2019 - © W. Najjar 6 big data analytics
Yale is 80 - © W. Najjar 7 big data analytics
BIG data (x petaB)
Yale is 80 - © W. Najjar 7 big data analytics
BIG data (x petaB) +
Yale is 80 - © W. Najjar 7 big data analytics
BIG data hyper (x petaB) + dimensional
Yale is 80 - © W. Najjar 7 big data analytics
BIG data hyper (x petaB) + dimensional
Yale is 80 - © W. Najjar 7 big data analytics
BIG data hyper (x petaB) + dimensional
no data marshaling
Yale is 80 - © W. Najjar 7 big data analytics
BIG data hyper (x petaB) + dimensional
no data marshaling
Yale is 80 - © W. Najjar 7 big data analytics
BIG data hyper (x petaB) + dimensional
no data sparse data marshaling
Yale is 80 - © W. Najjar 7 big data analytics
BIG data hyper (x petaB) + dimensional
no data sparse data marshaling
Yale is 80 - © W. Najjar 7 big data analytics
BIG data hyper (x petaB) + dimensional
no data sparse data marshaling
no locality!
You have to know the past to understand the present. - Carl Sagan
the latency problem & solutions

✦ latency mitigation, aka "caching"
❖ effective only with spatial/temporal locality
✦ streaming: latency incorporated into the pipeline
❖ requires spatial locality
✦ latency masking, aka hardware multithreading
❖ fast context switch after a memory access
❖ small thread context
❖ concurrency == outstanding memory requests
Burton Smith (1941 - 2018)

❖ main architect of:
❖ Denelcor HEP (1982)
❖ Horizon (1985-88) @ IDA SRC
❖ Tera MTA (1999), later Cray XMT

❖ latency masking = do useful work while waiting for memory = multithreading
❖ switch to a ready process on long-latency I/O operations => higher throughput
❖ 50 years ago it was I/O latency => multi-programming

https://www.microsoft.com/en-us/research/people/burtons/
Tera MTA/Cray XMT

✦ supercomputer, barrel processor
✦ 128 independent threads/processor
✦ 2 inst/cycle issued: 1 ALU, 1 memory
✦ 1 full/empty bit per word ensures correct ordering of memory accesses
✦ longest memory latency: 128 cycles
✦ randomized memory: reduces bank conflicts
✦ processor & memory nodes on a 3D torus interconnection network
✦ 4096 nodes: sparsely populated
barrel processor

✦ a CPU that switches between threads of execution on every cycle
✦ also known as "interleaved" or "fine-grained" temporal multithreading
✦ each thread is assigned its own program counter and other hardware registers (architectural state)
✦ guarantees each thread will execute one instruction every n cycles
✦ an n-way barrel processor acts like n separate processors, each running at ~1/n the original speed

[diagram: per-thread states cycling in execution order through the data path to MEMORY]
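The issue policy above can be sketched in a few lines of Python. This is my illustrative simulation, not any real machine's microarchitecture: n thread contexts, one instruction issued per cycle, threads switched on every cycle in strict round-robin order.

```python
# A minimal sketch (assumed, illustrative) of barrel-processor issue:
# n hardware thread contexts, one instruction per cycle, round-robin.

from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    pc: int = 0                                 # per-thread program counter
    trace: list = field(default_factory=list)   # cycles on which this thread issued

def run_barrel(n_threads: int, cycles: int):
    threads = [ThreadContext() for _ in range(n_threads)]
    for cycle in range(cycles):
        t = threads[cycle % n_threads]   # switch threads on *every* cycle
        t.pc += 1                        # issue exactly one instruction
        t.trace.append(cycle)
    return threads

threads = run_barrel(n_threads=4, cycles=16)
for i, t in enumerate(threads):
    print(f"thread {i}: {t.pc} instructions, issued on cycles {t.trace}")
```

Each of the 4 threads issues on every 4th cycle, i.e. 4 processors each at 1/4 speed, and a thread that just issued a memory access has n-1 cycles of other threads' work to hide it behind.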
outline

✦ background
✦ filament execution model
✦ application - SpMV
✦ application - Selection
✦ application - Group by aggregation
✦ conclusion
what if we take it a step further?

✦ trade bandwidth for latency
✦ custom or semi-custom processor/accelerator
✦ no need for large register files and thread state
➡ smaller data path
➡ more processors
➡ more threads per processor (engine or core)
➡ much higher throughput!
✦ suited for large-scale irregular applications:
➡ data analytics, databases, sparse linear algebra: SpMV, SpMM, graph algorithms, bioinformatics
hardware multithreading

✦ call them filaments, to distinguish them from "threads"
✦ an executing filament is preempted once it accesses memory: the request goes to memory and the filament moves to the waiting queue; on the memory response, the filament moves back to the ready queue
✦ ready & waiting filament queues fit 1000s of filament states
✦ custom data path
✦ massive parallelism: ≥ 4,000 waiting filaments

[diagram: ready filaments enter the custom data path in the core; a memory request parks the filament in the waiting queue; the memory response returns it to the ready queue]

filament data path
[diagram: the filament DATA PATH reads input data from and writes output data to MEMORY; filaments park in a wait queue between request and response, until filament termination]

filament data path example
A[i] += B[i] * C[D[i]]; % typical of sparse linear algebra

[diagram: the DATA PATH generates index i, fetches D[i], then C[D[i]], then B[i] from MEMORY, parking the filament in the wait queue at each fetch; it then accumulates the product (∑) and writes A[i]]
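The fetch sequence above can be modeled in software. The sketch below is my behavioral illustration (not the FPGA datapath): each filament is a generator that yields a memory request at every fetch, parks in a wait queue for an assumed fixed latency, and is resumed with the response; the datapath advances one ready filament per cycle.

```python
# Illustrative split-phase execution of A[i] += B[i] * C[D[i]]
# with filaments parked in a wait queue during each memory access.

from collections import deque

LATENCY = 8  # assumed memory latency in cycles (illustrative)

def filament(i, mem):
    # Each yield issues a memory request and parks the filament;
    # the scheduler resumes it with the response.
    d = yield ('D', i)       # fetch D[i]
    c = yield ('C', d)       # fetch C[D[i]]
    b = yield ('B', i)       # fetch B[i]
    mem['A'][i] += b * c     # write A[i], then terminate

def run(mem, n):
    ready = deque((filament(i, mem), None) for i in range(n))  # (filament, response)
    waiting = []                                  # (wakeup_cycle, filament, value)
    cycle = 0
    while ready or waiting:
        # memory responses arrive: move filaments back to the ready queue
        arrived = [w for w in waiting if w[0] <= cycle]
        waiting = [w for w in waiting if w[0] > cycle]
        ready.extend((f, v) for _, f, v in arrived)
        if ready:                                 # one filament per cycle in the datapath
            f, v = ready.popleft()
            try:
                array, idx = f.send(v)            # filament issues its next memory request...
                waiting.append((cycle + LATENCY, f, mem[array][idx]))
            except StopIteration:                 # ...or finishes its work and terminates
                pass
        cycle += 1
    return cycle

mem = {'A': [0.0] * 4, 'B': [1.0, 2.0, 3.0, 4.0],
       'C': [10.0, 20.0, 30.0], 'D': [2, 0, 1, 2]}
total_cycles = run(mem, 4)
print(mem['A'])   # [30.0, 20.0, 60.0, 120.0]
```

With many filaments in flight, the LATENCY-cycle fetches of different filaments overlap, which is exactly the latency-masking point: the datapath stays busy even though every individual filament spends most of its life waiting.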
a very important and often forgotten Little's law

a queueing system:
❖ N = customers in the system, λ = arrival rate, w = wait time
❖ N = λ·w, ∀ distributions of λ and the service time S

applied to a memory pipeline:
❖ L = pipeline latency
❖ B = memory bandwidth
❖ C = parallelism
❖ C = L·B

high bandwidth + hyper-pipelining —> massive parallelism
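A back-of-the-envelope application of C = L·B, using the Convey HC-2ex figures quoted later in the talk (76.8 GB/s peak bandwidth, 8-byte data bus); the 500 ns round-trip memory latency is my assumed illustrative value, not a measured one:

```python
# Little's law, C = L * B: how many requests must be in flight
# to keep the memory bandwidth saturated?

bandwidth_bytes = 76.8e9    # B: peak memory bandwidth (bytes/s)
request_size    = 8         # bytes per memory request (8-byte data bus)
latency_s       = 500e-9    # L: assumed memory round-trip latency

requests_per_s = bandwidth_bytes / request_size    # arrival rate λ = 9.6e9 req/s
concurrency    = requests_per_s * latency_s        # N = λ · w  (Little's law)
print(f"outstanding requests needed to saturate bandwidth: {concurrency:.0f}")
```

Under these assumptions the answer is about 4,800 outstanding requests, the same order of magnitude as the "≥ 4,000 waiting filaments" claimed for the filament model.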
regular vs. irregular applications

✦ Regular Applications
❖ have good temporal and spatial locality
❖ caches are well suited for these types of applications

✦ Irregular Applications
❖ have poor temporal and spatial locality
❖ caches are NOT well suited for these types of applications
❖ multithreading is an alternative
• can mask long memory latencies
• needs high bandwidth to support a large number of concurrent threads

✦ Won't increasing the cache size mitigate latency and improve performance?
taxonomy of irregular applications

‣ dense matrix
‣ sparse matrix
‣ fixed degree BFS
‣ variable degree BFS
Convey HC-2ex architecture

Convey HC-2ex
Clock Freq.: 150 MHz
Memory Controllers: 8
Memory Channels: 16 / FPGA
Memory Bandwidth: 76.8 GB/s
Outstanding Requests: ~500 requests / channel
Data bus width: 8 bytes

‣ multi-bank memory architecture
‣ randomized memory allocation

outline

✦ background
✦ filament execution model
✦ application - SpMV
✦ application - Selection
✦ application - Group by aggregation
✦ conclusion
selection query

✦ selection is a database operator
‣ selects all rows (tuples) in a relation (table)
‣ that satisfy a set of conditions
‣ expressed as a logical expression on the attributes
‣ attributes (columns) are elements of a tuple
‣ e.g. ((a1 > 10) ∧ (a2 > 0)) ∨ ((a10 < 100) ∧ (a13 < 0))
✦ a very common database operator
‣ poor locality
‣ relation can be stored row- or column-wise
selection example

relation example:
row  a1  a2   a3  a4    a5  a6
0    5   -70  CA  1992  0   12
1    10  70   CO  1990  1   4
2    15  -10  AZ  2001  1   24
3    20  3    NV  1989  0   18

query example:
query A: (a5=1) ∨ (a3=NV) ∨ [(a1<20) ∧ (a2>0)]
query B: (a2<0) ∧ (a3=CA) ∧ (a5=0)

qualifying rows:
row  qA  qB
0    N   Y
1    Y   N
2    Y   N
3    Y   N
selection example (2)

a predicate control block (PCB) encodes the selection logic:
‣ early termination (T or F)
‣ only the required attributes are fetched (no caching)

qA predicate control block — query A: (a5=1) ∨ (a3=NV) ∨ [(a1<20) ∧ (a2>0)]:
condition              action
attribute  op  const   if T         if F
a5         =   1       Terminate T  a3
a3         =   NV      Terminate T  a1
a1         <   20      a2           Terminate F
a2         >   0       Terminate T  Terminate F

qB predicate control block — query B: (a2<0) ∧ (a3=CA) ∧ (a5=0):
condition              action
attribute  op  const   if T         if F
a2         <   0       a3           Terminate F
a3         =   CA      a5           Terminate F
a5         =   0       Terminate T  Terminate F
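A PCB walk can be expressed as a tiny interpreter. The sketch below is my software illustration of the encoding, not the FPGA design: each entry names the attribute to compare and where to go on true/false, so a filament fetches only the attributes it needs and terminates early.

```python
# Illustrative interpreter for a predicate control block (PCB):
# on_true / on_false are either the next PCB index or a terminal True/False.

import operator

OPS = {'<': operator.lt, '>': operator.gt, '=': operator.eq}

# qB from the slide: (a2 < 0) AND (a3 = 'CA') AND (a5 = 0)
PCB_B = [
    # attribute, op, constant, on_true, on_false
    ('a2', '<', 0,    1,    False),
    ('a3', '=', 'CA', 2,    False),
    ('a5', '=', 0,    True, False),
]

def select(pcb, row):
    step = 0
    while not isinstance(step, bool):          # walk until early termination
        attr, op, const, on_t, on_f = pcb[step]
        step = on_t if OPS[op](row[attr], const) else on_f
    return step

relation = [
    {'a1': 5,  'a2': -70, 'a3': 'CA', 'a4': 1992, 'a5': 0, 'a6': 12},
    {'a1': 10, 'a2': 70,  'a3': 'CO', 'a4': 1990, 'a5': 1, 'a6': 4},
    {'a1': 15, 'a2': -10, 'a3': 'AZ', 'a4': 2001, 'a5': 1, 'a6': 24},
    {'a1': 20, 'a2': 3,   'a3': 'NV', 'a4': 1989, 'a5': 0, 'a6': 18},
]
print([select(PCB_B, r) for r in relation])   # [True, False, False, False]
```

The result matches the qB column of the qualifying-rows table (Y, N, N, N), and rows 1-3 terminate after fetching only one attribute each, which is the point of the encoding.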
selection CDFG

new jobs queue: row i, row i+1, ...
• PCB: predicate control block (stored locally on the FPGA)
• initialize filament: fetch the offset of the 1st attribute from the PCB, then fetch the 1st attribute from memory
• the state of each filament is the current index into the PCB
• multiple queries can be executed simultaneously on the same relation; the state then becomes the pair (PCB #, index)
• data is returned from memory in the order fetched, on any one channel
• when data returns from memory, evaluate the condition on the attribute:
  - Continue: fetch the offset of the next attribute from the PCB, then fetch the next attribute from memory
  - Terminate T: write i as a qualifying row
experimental evaluation

‣ each row consists of 8 fixed-size 64-bit columns
‣ query selectivity was varied from 0% to 100% (0%: no row qualifies; 100%: all rows qualify)
‣ dataset size varied from 8M to 128M rows
‣ results were obtained on the following platforms:

Device  Make & Model        Clock (MHz)  Cores  Memory Size (GB)  Memory Bandwidth (GB/s)
CPU     Intel Xeon E5-2643  3,300        8      128               51.2
GPU     Nvidia Titan X      1,500        3,584  12                480.0
FPGA    Virtex6 - 760       150          64     64                76.8
bandwidth utilization - selection

[chart: bandwidth utilization (%) vs. selectivity (0%, 10%, 50%, 100%) for k=1..8 predicates, comparing FILAMENT MTP, CPU SCALAR, CPU SIMD, and GPU]

filament has the highest bandwidth utilization
predicate evaluations/sec

[chart: billion predicates/sec (log scale) vs. selectivity (0%, 10%, 50%, 100%) for k=1..8 predicates plus theoretical peak, comparing FILAMENT MTP, CPU SCALAR, CPU SIMD, and GPU]

peak = theoretical max throughput
runtime (msec)

[chart: runtime (ms) vs. selectivity (0%, 10%, 50%, 100%) for k=1..8 predicates, comparing FILAMENT MTP, CPU SCALAR, CPU SIMD, and GPU]

at only 150 MHz
TPC-H benchmark

‣ we chose query Q6 for our evaluation because all select conditions apply to the columns of one table, and it has 5 predicates
‣ query Q6: SELECT * FROM lineitem WHERE l_shipdate >= date '1995-01-01' AND l_shipdate ... [remaining predicates truncated in the slide text]

[chart: throughput (billion tuples/s) and bandwidth (GB/s) for CPU Scalar, CPU SIMD, MTP, and GPU]

TPC-H Evaluation

‣ Memory Fetches: number of columns fetched for evaluation
‣ Avg Evaluations per Row: predicate comparisons per row
‣ Effective Bandwidth: total size of the "lineitem" table processed per unit time / theoretical peak bandwidth, as prescribed in [1]
‣ Peak Bandwidth Utilization: actual amount of data fetched per unit time / theoretical peak bandwidth

                             CPU Scalar  CPU SIMD  GPU     Filament
Memory Fetches               100%        18.75%    18.75%  11.4%
Avg Evaluations (per Row)    3           3         3       1.83
Effective Bandwidth Speedup  0.47        3.44      2.01    6.13
Peak Bandwidth Utilization   47.6%       61.2%     36.6%   70.3%

[1] B. Sukhwani, et al. Database Analytics Acceleration Using FPGAs. Parallel Architectures and Compilation Techniques, 2012.

outline

✦ background
✦ filament execution model
✦ application - SpMV
✦ application - Selection
✦ application - Group by aggregation
✦ conclusion

synchronization in irregular applications

inter-thread synchronization:
‣ not needed: dense matrix, sparse matrix
‣ needed: fixed degree BFS, variable degree BFS
synchronizing between threads

✦ non-deterministic # of threads
❖ requires inter-thread synchronization
➡ the only FPGA platform that supports in-memory synchronization: Convey MX
➡ heavy overhead
✦ move the synchronization on-chip
❖ use a CAM to cache recently seen nodes: reduces the number of memory accesses
❖ map threads (nodes) to engines deterministically
❖ deal with load balancing

CAMs as synchronizing caches

✦ the CAM caches the address of all pending memory accesses
❖ all accesses are checked in the CAM first
• on a miss, the address is inserted into the CAM
• on a hit, the access is deferred
• read hits get the returned value
• write hits are deferred => strict serialization
✦ some writes need not be done in memory
❖ e.g. aggregation: increment the count locally ==> reduced traffic to memory
❖ very effective on some workloads
✦ each CAM guards a slice of the address space
❖ partitioning of the address space
❖ deal with load balancing

in-memory aggregation

✦ a crucial operation for any OLAP workload
✦ aggregation is challenging to implement in hardware logic
❖ not all tuples will create a new node (e.g. update an existing node)
❖ every tuple reads and writes to the table
✦ Content Addressable Memories (CAMs)
❖ used to enforce memory locks and merge jobs together
❖ split nodes among multiple hash tables, which then need to be merged

Relation R (Key, Value): (1, 123), (1, 314), (2, 42), (3, 7), (2, 96) → Agg Relation (Key, Count): (1, 2), (2, 2), (3, 1)

Halstead et al. FPGA-based Multithreading for In-Memory Hash Joins, in 7th Biennial Conf. on Innovative Data Systems Research (CIDR'15), Jan. 4-7, 2015, Asilomar, CA.

thread control flow graph

two CAMs:
✦ one used to synchronize the access to memory
✦ one to cache the value of the aggregate

aggregation engine

✦ CAM: Filter Keys
❖ jobs with the same key are merged
✦ CAM: HT Lock
❖ locks individual hash table locations
✦ Linked List
❖ read through to find a match
❖ update/insert a node
✦ the design is limited by the 128 CAM locations

merging operation

✦ mapping of nodes to engines is done statically
❖ bandwidth between FPGAs is not sufficient
❖ may become the design bottleneck
✦ relation keys can be in multiple nodes in separate hash tables
❖ keys are not duplicated within a hash table
✦ after the job is completed
❖ the multiple hash tables must be merged: a streaming operation
❖ can be done on the FPGA:
➡ stream list chains for each hash table bucket
➡ merge matching keys together
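The "increment the count locally" idea above can be sketched in software. This is my behavioral illustration of the assumed semantics, not the RTL: an update to a key already pending in the CAM is merged on-chip, so only CAM misses (and evictions) generate memory traffic.

```python
# Illustrative model of a CAM used as a synchronizing cache for
# group-by count aggregation: hits merge locally, misses go to memory.

class SynchronizingCAM:
    def __init__(self, capacity=128):          # 128 locations, as in the talk
        self.capacity = capacity
        self.pending = {}                      # key -> locally merged count
        self.memory_ops = 0                    # operations actually sent to memory

    def increment(self, key):
        if key in self.pending:                # CAM hit: merge locally, no memory op
            self.pending[key] += 1
        else:                                  # CAM miss: allocate an entry, go to memory
            if len(self.pending) >= self.capacity:
                self.evict()
            self.pending[key] = 1
            self.memory_ops += 1

    def evict(self):
        # write one merged count back to the in-memory hash table (simplified FIFO choice)
        key = next(iter(self.pending))
        del self.pending[key]
        self.memory_ops += 1

# relation R from the slide: keys 1, 1, 2, 3, 2  ->  counts {1: 2, 2: 2, 3: 1}
cam = SynchronizingCAM()
for key in [1, 1, 2, 3, 2]:
    cam.increment(key)
print(cam.pending, cam.memory_ops)   # {1: 2, 2: 2, 3: 1} 3
```

Five tuples produce only three memory operations instead of a read-modify-write per tuple; skewed key distributions (e.g. the heavy-hitter workload later) merge even more aggressively, which is why the technique is "very effective on some workloads".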
concurrent FPGA engines

✦ target platform: Convey HC-2ex
❖ 4 Xilinx Virtex 6 FPGAs x 16 memory channels/FPGA
✦ two designs
❖ replicated cores (4 channels/core, 4 cores/FPGA)
❖ multiplexed cores (2.5 channels/core, 6 cores/FPGA) - write channel shared by 2 cores

[diagram: replicated vs. multiplexed engine designs, each connected to main memory through 16 memory channels]

fine v/s coarse-grain locking

[diagram: hash table buckets pointing to linked lists of (key, count) nodes]

✦ coarse-grain locking
❖ lock on a hash table entry
❖ fewer locks ==> less parallelism
❖ lower pressure on the synchronizing CAM
✦ fine-grain locking
❖ lock on a linked list entry
❖ more parallel filaments
❖ higher pressure on the CAM

who wins?
experimental evaluation: FGL v/s CGL

[chart: throughput (MTuples/s per GB/s, log scale) vs. relation cardinality (2¹⁰ to 2²²) for coarse-grain and fine-grain locking on the uniform, heavy-hitter, and moving-cluster datasets]

✦ fine-grain locking: ~4x higher throughput
✦ note: constant throughput v/s relation cardinality

comparison to software

✦ 5 aggregation algorithms
❖ Independent Tables - hardware-oblivious [VLDB'07]
❖ Shared Table - hardware-oblivious [VLDB'07]
❖ Partition & Aggregate - hardware-conscious [VLDB'07]
❖ Hybrid Table - hybrid [DAMON'11]
❖ Partition with Local Aggregation Table (PLAT) [ICDE'13]

Balkesen, C. et al. Main-memory Hash Joins on Multi-core CPUs: Tuning to the underlying hardware.
[VLDB'07] J. Cieslewicz et al. Adaptive Aggregation on Chip Multiprocessors.
[DAMON'11] Y. Ye et al. Scalable aggregation on multicore processors.

experimental evaluation

✦ software approaches: Individual Table, Shared Table (atomic), Hybrid Table, Partitioning, PLAT
✦ datasets: Uniform, Zipf 0.5, Heavy Hitter, Moving Cluster, Self Similar
✦ 8M to 64M tuples/benchmark
✦ 1K to 4M cardinality

Hardware Region: FPGA board Virtex-6 760; # FPGAs: 2; clock freq.: 150 MHz; engines per FPGA: 4 / 3; memory channels: 32; memory bandwidth: 38.4 GB/s (total)
Software Region: CPU Intel Xeon E5-2643; # CPUs: 1; cores / threads: 4 / 8; clock freq.: 3.3 GHz; L3 cache: 10 MB; memory bandwidth: 51.2 GB/s (total)

uniform workload

[chart: throughput (MTuples/sec) vs. relation cardinality (2¹⁰ to 2²²), uniform dataset, 256M tuples; the filament designs (FGL CAM64/CAM128 FPGA, multiplexed FPGA) reach ~5x the best CPU approach (Shared Table, Independent Tables, Hybrid Aggregation, Partition & Aggregate, PLAT)]
heavy-hitter workload

[chart: throughput (MTuples/sec) vs. relation cardinality (2¹⁰ to 2²²), heavy-hitter dataset, 256M tuples; filament ~10x over the CPU approaches]

self-similar workload

[chart: throughput (MTuples/sec) vs. relation cardinality (2¹⁰ to 2²²), self-similar dataset, 256M tuples; filament ~10x over the CPU approaches]

bandwidth utilization

✦ the FPGA implementation achieves much higher memory bandwidth utilization

[chart: ratio of average/peak bandwidth vs. relation cardinality (2¹⁰ to 2²²) for 16M-128M tuples, software vs. FPGA]

outline

✦ background
✦ filament execution model
✦ application - SpMV
✦ application - Selection
✦ application - Group by aggregation
✦ conclusion

other results, past and on-going

✦ big data is: BIG and multi-dimensional
❖ access patterns to the data are extremely irregular
❖ no ideal storage scheme - constant poor locality
✦ other applications
❖ string matching, hash join, group-by aggregation, selection, outer joins, sorting
✦ challenges
❖ inter-filament synchronization
❖ dynamic load re-balancing (at run time)

publications
http://www.cs.ucr.edu/~najjar/publications.html

✦ P. Budhkar, I. Absalyamov, V. Zois, S. Windh, W. A. Najjar and V. J. Tsotras. Accelerating In-Memory Database Selections Using Latency Masking Hardware Threads, in ACM TACO, 2019.
✦ I. Absalyamov, R. J. Halstead, P. Budhkar, W. A. Najjar, S. Windh, V. J. Tsotras. FPGA-Accelerated Group-by Aggregation Using Synchronizing Caches. 12th Int. Workshop on Data Management on New Hardware (DaMoN 2016), co-located with ACM SIGMOD/PODS, San Francisco, USA, June 27, 2016.
✦ S. Windh, P. Budhkar and W. A. Najjar. CAMs as Synchronizing Caches for Multithreaded Irregular Applications on FPGAs. ICCAD 2015: 331-336.
✦ R. J. Halstead, I. Absalyamov, W. A. Najjar and V. J. Tsotras. FPGA-based Multithreading for In-Memory Hash Joins, in 7th Biennial Conference on Innovative Data Systems Research (CIDR'15), January 4-7, 2015, Asilomar, California.
✦ E. B. Fernandez, J. Villarreal, S. Lonardi, and W. A. Najjar. FHAST: FPGA-based acceleration of Bowtie in hardware, in IEEE/ACM Trans. on Computational Biology and Bioinformatics, #99, Feb. 2015.
✦ R. J. Halstead, W. A. Najjar and O. Huseini. SpMV Acceleration with Latency Masking Threads on FPGAs. Technical Report UCR-CSE-2014-04001.

conclusion

✦ advantages of filament execution
‣ very small state allows:
‣ fast context switching between filaments
‣ small storage for the waiting queues
‣ masking the latency of memory and non-volatile storage
‣ no reliance on locality, no caches
‣ lower energy consumption
✦ limitations: custom or semi-custom accelerators

thank you! questions?