FPL 2019: Why I Love Latency
why I love memory latency
high throughput through latency masking hardware multithreading
WALID NAJJAR, UC RIVERSIDE

the team
the cheerleaders: Walid A. Najjar, Vassilis J. Tsotras, Vagelis Papalexakis, Daniel Wong
the players:
• Robert J. Halstead (Xilinx)
• Ildar Absalyamov (Google)
• Prerna Budhkar (Intel)
• Skyler Windh (Micron)
• Vassileos Zois (UCR)
• Bashar Roumanous (UCR)
• Mohammadreza Rezvani (UCR)

outline
✦ background
✦ filament execution model
✦ application - SpMV
✦ application - Selection
✦ application - Group by aggregation
✦ conclusion

current DRAM trends
Growing memory capacity and low cost have created a niche for in-memory solutions.
Low-latency memory hardware is more expensive (39x) and has lower capacity than commonly used DDRx DRAM.
Though memory performance is improving, it is not likely to match CPU performance anytime soon.
Need alternative techniques to mitigate memory latency!
[Figure: peak CPU performance vs. peak memory performance, clock (MHz), 1990-2016]
K. K. Chang. Understanding and Improving Latency of DRAM-Based Memory Systems. PhD thesis, Carnegie Mellon University, 2017.

latency mitigation - aka caching
Intel Core i7-5960X:
  Launch Date: Q3'14
  Clock Freq.: 3.0 GHz
  Cores / Threads: 8 / 16
  L3 Cache: 20 MB
  Memory Channels: 4
  Memory Bandwidth: 68 GB/s
๏ caches take > 80% of CPU area
๏ big energy sink!
๏ application must have temporal and/or spatial locality
๏ up to 7 levels of storage hierarchy (register file, L1, L2, L3, memory, disk cache, disk)
caches everywhere! they reduce average memory latency, but assume locality

where cycles go - cache misses
[DaMoN'13] Tözün, P. et al. OLTP in Wonderland: Where Do Cache Misses Come From in Major OLTP Components?
[TODS'07] Chen, S. et al. Improving Hash Join Performance through Prefetching.
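To make the locality assumption above concrete, here is a minimal sketch (array sizes, values, and function names are illustrative, not from the talk) contrasting a unit-stride scan, which the cache hierarchy handles well, with the data-dependent gather typical of the workloads discussed next, where almost every access misses:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Unit-stride scan: consecutive addresses, so prefetchers and the cache
    // hierarchy work exactly as intended.
    double sum_sequential(const std::vector<double>& a) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i)
            s += a[i];                      // high spatial locality
        return s;
    }

    // Gather: the index array drives accesses to essentially random locations,
    // so most accesses miss in every cache level -- the pattern of the sparse,
    // irregular workloads discussed next.
    double sum_gather(const std::vector<double>& a,
                      const std::vector<std::uint32_t>& idx) {
        double s = 0.0;
        for (std::size_t i = 0; i < idx.size(); ++i)
            s += a[idx[i]];                 // data-dependent address, no locality
        return s;
    }

    int main() {
        std::vector<double> a(1 << 20, 1.0);
        std::vector<std::uint32_t> idx(a.size());
        for (std::size_t i = 0; i < idx.size(); ++i)
            idx[i] = static_cast<std::uint32_t>((i * 2654435761u) % a.size());  // scrambled indices
        std::printf("%g %g\n", sum_sequential(a), sum_gather(a, idx));
        return 0;
    }

The second loop is the pattern the rest of the talk targets: when the index stream is effectively random, adding more cache only burns area and energy, which is what motivates masking latency instead of mitigating it.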
big data analytics
BIG data (x petaB) + hyper-dimensional + sparse data + no data marshaling => no locality!

"You have to know the past to understand the present" - Carl Sagan

the latency problem & solutions
✦ latency mitigation, aka "caching"
❖ effective only with spatial/temporal locality
✦ streaming: latency incorporated into the pipeline
❖ requires spatial locality
✦ latency masking, aka hardware multithreading
❖ fast context switch after a memory access
❖ small thread context
❖ concurrency == outstanding memory requests

Burton Smith (1941 - 2018)
❖ main architect of:
  ❖ Denelcor HEP (1982)
  ❖ Horizon (1985-88) @ IDA SRC
  ❖ Tera MTA (1999), later Cray XMT
❖ latency masking = do useful work while waiting for memory = multithreading
❖ switch to a ready process on long-latency I/O operations => higher throughput
❖ 50 years ago it was I/O latency => multi-programming
❖ https://www.microsoft.com/en-us/research/people/burtons/

Tera MTA / Cray XMT
✦ supercomputer, barrel processor
✦ 128 independent threads per processor
✦ 2 instructions issued per cycle: 1 ALU, 1 memory
✦ 1 full/empty bit per word ensures correct ordering of memory accesses
✦ longest memory latency: 128 cycles
✦ randomized memory: reduces bank conflicts
✦ processor & memory nodes on a 3D torus interconnection network
✦ 4096 nodes: sparsely populated
[Figure: per-thread registers feeding the datapath]

barrel processor
✦ a CPU that switches between threads of execution on every cycle
✦ also known as "interleaved" or "fine-grained" temporal multithreading
✦ each thread is assigned its own program counter and other hardware registers (architectural state)
✦ guarantees each thread will execute one instruction every n cycles
✦ an n-way barrel processor acts like n separate processors, each running at ~1/n the original speed
[Figure: thread states issued to memory in a fixed execution order]
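The interleaving is easy to see in a toy model. The sketch below is my own construction, not the MTA's actual microarchitecture: it issues one instruction per cycle from a rotation of hardware threads and parks a thread for a fixed memory latency after it issues. With the MTA's numbers, 128 threads against a 128-cycle worst-case latency, the issue slot never goes idle:

    #include <cstdio>
    #include <vector>

    // Toy model of an n-way barrel processor (the scheduling idea only, not the
    // MTA's actual microarchitecture). One instruction is issued per cycle from a
    // rotation of hardware threads; a thread that issues a memory reference is
    // parked for kMemLatency cycles before it may issue again.
    int main() {
        const int  kThreads    = 128;     // MTA: 128 hardware threads per processor
        const int  kMemLatency = 128;     // MTA: worst-case memory latency in cycles
        const long kCycles     = 100000;

        std::vector<long> ready_at(kThreads, 0);  // cycle at which each thread may issue again
        long issued = 0;
        int  next   = 0;                          // rotation pointer

        for (long cycle = 0; cycle < kCycles; ++cycle) {
            for (int k = 0; k < kThreads; ++k) {  // find the next ready thread in rotation order
                int t = (next + k) % kThreads;
                if (ready_at[t] <= cycle) {
                    ready_at[t] = cycle + kMemLatency;  // worst case: every instruction touches memory
                    next = (t + 1) % kThreads;
                    ++issued;
                    break;
                }
            }
            // if no thread was ready this cycle, the issue slot simply goes idle
        }
        std::printf("issue-slot utilization: %.1f%%\n", 100.0 * issued / kCycles);
        // With kThreads >= kMemLatency this prints ~100%: 128 in-flight requests
        // fully mask a 128-cycle latency. Halve kThreads and it drops to ~50%.
        return 0;
    }

This is Little's law in disguise: sustained throughput is roughly outstanding requests divided by latency, so what bounds performance is the concurrency the machine can keep in flight, not the latency of any single access.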
what if we take it a step further?
✦ trade bandwidth for latency
✦ custom or semi-custom processor/accelerator
✦ no need for large register files and thread state
➡ smaller data path
➡ more processors
➡ more threads per processor (engine or core)
➡ much higher throughput!
✦ suited for large-scale irregular applications:
➡ data analytics, databases, sparse linear algebra (SpMV, SpMM), graph algorithms, bioinformatics

hardware multithreading
❖ call them filaments, to distinguish them from "threads"
❖ an executing filament is preempted as soon as it accesses memory: the request goes out to memory, the filament moves to the waiting queue, and when the response returns it moves back to the ready queue and re-enters the core's custom data path
❖ the ready and waiting filament queues fit 1000s of filament states
❖ massive parallelism: ≥ 4,000 waiting filaments
[Figure: core with a custom data path; ready filaments enter the data path, a memory request moves a filament to the waiting queue, the memory response moves it back to ready]

filament data path
[Figure: the data path reads input data from memory and writes output data back; a filament that issues a read sits in the wait queue until its data returns, then re-enters the data path; finished filaments terminate]

filament data path example
A[i] += B[i] * C[D[i]]; % typical of sparse linear algebra
The filament for index i proceeds as: generate i, fetch D[i], fetch C[D[i]], fetch B[i], then multiply, accumulate (∑), and write A[i]. After each fetch the filament parks in the wait queue while other ready filaments use the data path.
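Below is a software sketch of that flow, assuming nothing about the actual hardware: one filament per index i, a small explicit state record instead of a full thread context, and split-phase memory accesses that park the filament in a wait queue between stages. All names here (Filament, stage) and the toy sizes are illustrative:

    #include <cstdint>
    #include <cstdio>
    #include <queue>
    #include <vector>

    // One filament per index i. Its entire context is a few words: the index,
    // the current stage, and the values fetched so far -- far smaller than a
    // conventional thread context, which is why thousands fit on chip.
    struct Filament {
        std::uint32_t i;
        int stage = 0;            // 0: fetch D[i], 1: fetch C[D[i]], 2: fetch B[i], 3: accumulate & write
        std::uint32_t d = 0;      // D[i]
        double c = 0.0, b = 0.0;  // C[D[i]], B[i]
    };

    int main() {
        // Illustrative data; sizes and values are arbitrary.
        const std::uint32_t N = 8, M = 16;
        std::vector<std::uint32_t> D(N);
        std::vector<double> A(N, 0.0), B(N), C(M);
        for (std::uint32_t i = 0; i < N; ++i) { D[i] = (7 * i) % M; B[i] = i + 1.0; }
        for (std::uint32_t j = 0; j < M; ++j) C[j] = 0.5 * j;

        // "generate i": every index enters the ready queue as a fresh filament.
        std::queue<Filament> ready, waiting;
        for (std::uint32_t i = 0; i < N; ++i) ready.push({i});

        // The real engine overlaps the wait queue with in-flight memory requests;
        // here a "memory response" is modeled as simply moving a filament back to ready.
        while (!ready.empty() || !waiting.empty()) {
            if (ready.empty()) { ready.push(waiting.front()); waiting.pop(); continue; }
            Filament f = ready.front(); ready.pop();
            switch (f.stage) {
                case 0: f.d = D[f.i]; f.stage = 1; waiting.push(f); break;  // issue fetch of D[i], park
                case 1: f.c = C[f.d]; f.stage = 2; waiting.push(f); break;  // issue fetch of C[D[i]], park
                case 2: f.b = B[f.i]; f.stage = 3; waiting.push(f); break;  // issue fetch of B[i], park
                case 3: A[f.i] += f.b * f.c; break;                         // accumulate, write A[i], terminate
            }
        }
        for (std::uint32_t i = 0; i < N; ++i) std::printf("A[%u] = %g\n", i, A[i]);
        return 0;
    }

Because each filament carries only a few words of state, thousands of them fit on chip, which is what lets the engine keep thousands of memory requests outstanding at once.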