A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores

Marc Tchiboukdjian, Vincent Danjean, Thierry Gautier, Fabien Le Mentec, Bruno Raffin

Shared Cache of Multicore Processors

Three cache organizations:
1. One core with two cache levels: Core, L1, L2.
2. Multiple cores with private caches: each core has its own L1 and L2.
3. Multiple cores with private caches and one shared cache: each core has its own L1 and L2, and all cores share an L3.

With a shared cache, the sequential execution has an unfair advantage: it gets the whole shared cache to itself. Can we still get linear speedup?

Scheduling for Efficient Shared Cache Usage

Schedule the computation so that shared cache misses do not increase,
- with a work stealing scheduler allowing efficient dynamic load balancing,
- for parallel loops: no dependencies between tasks.

Target machine: Xeon Nehalem E5530 (2.4 GHz), 4 cores, each with a 32 KB L1 and a 256 KB L2, sharing an 8 MB L3.

Overview: Cache Efficient Work Stealing Scheduling for Parallel Loops

1. Standard Schedulers for Parallel Loops
2. New Scheduler Optimized for Shared Cache
3. Efficient Implementation of the Scheduler
4. Experiments
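As a quick aside (not from the talk): the cache hierarchy above can be checked on a given Linux machine. A minimal sketch, assuming glibc, whose _SC_LEVEL*_CACHE_SIZE sysconf names are nonstandard extensions:

    #include <stdio.h>
    #include <unistd.h>

    /* Print the cache sizes glibc reports; sysconf returns 0 or -1
       when a level is unknown on the running machine. */
    int main(void) {
        printf("L1 data cache: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
        printf("L2 cache:      %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
        printf("L3 cache:      %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
        printf("online cores:  %ld\n",       sysconf(_SC_NPROCESSORS_ONLN));
        return 0;
    }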
1. Standard Schedulers for Parallel Loops

Parallel Loops

Examples of parallel loops:
- OpenMP: #pragma omp parallel for
- TBB: parallel_for
- Cilk: parallel for
- Parallel STL: for_each or transform

Problem characteristics:
- Schedule n iterations on p cores.
- Iterations can be processed independently.
- The time to process one iteration can vary.

Static Scheduling of Parallel Loops

- Allocate n/p iterations to each core (e.g., OpenMP static scheduling).

Characteristics:
- Low overhead mechanism.
- Bad load balancing if the workload is irregular.

Dynamic Scheduling of Parallel Loops

OpenMP dynamic scheduling:
- Allocate iterations in chunks of size g.
- All chunks are stored in a centralized list.
- Each thread removes a chunk from the list and processes it.

Characteristics:
- Good load balancing.
- Contention on the list.
- Chunk creation overhead.

Scheduling Parallel Loops with Work Stealing

Work stealing:
- Each thread has its own list of tasks (= chunks).
- If its list is empty, a thread steals tasks from a randomly selected list.
- A binary tree of tasks minimizes the number of steals: one steal transfers half of the victim's iterations.

Characteristics:
- Good load balancing.
- Reduced contention.
- Task creation overhead.

Example of recursive splitting: [0, 127] splits into [0, 63] and [64, 127]; while the owner runs [0, 15] (with [16, 31] and [32, 63] ready), a thief steals [64, 127] and splits it in turn into [64, 95] and [96, 127], then [64, 79] and [80, 95]. (Slide legend: terminated, ready, and running tasks.)

Parallel Loops with Xkaapi

- Xkaapi is a work stealing library.
- Tasks are created only on a steal, which reduces task creation overhead.
- Cooperative stealing: the victim stops working to answer work requests.
- The victim can answer multiple requests at a time.

Pseudocode (cleaned from the slide):

    typedef struct {
        InputIterator ibeg;
        InputIterator iend;
    } Work_t; // Task

    void parallel_for(...) {
        while (iend != ibeg)
            do_work(ibeg++);
        // no more work -> become a thief
    }

    void splitter(num_req) {
        // divide the victim's remaining range among victim + num_req thieves
        size      = victim.iend - victim.ibeg;
        bloc      = size / (num_req + 1);
        local_end = victim.iend;
        while (num_req > 0) {
            thief->iend = local_end;
            thief->ibeg = local_end - bloc;
            local_end  -= bloc;
            --num_req;
        }
        victim.iend = local_end; // the victim keeps the front of the range
        // victim + thieves -> parallel_for
    }

Characteristics:
- Good load balancing.
- Low overhead mechanism.
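The splitter's arithmetic can be checked outside any runtime. A self-contained sketch with plain integer ranges standing in for Xkaapi's iterators (range_t, split, and the driver are illustrative names, not library API):

    #include <stdio.h>

    /* Toy stand-in for the slide's Work_t: a half-open range [ibeg, iend). */
    typedef struct { long ibeg, iend; } range_t;

    /* Mirror the slide's splitter: the victim and num_req thieves each get
       size / (num_req + 1) iterations, carved off the end of the range. */
    static void split(range_t *victim, range_t *thieves, int num_req) {
        long size      = victim->iend - victim->ibeg;
        long bloc      = size / (num_req + 1);
        long local_end = victim->iend;
        for (int i = 0; i < num_req; i++) {
            thieves[i].iend = local_end;
            thieves[i].ibeg = local_end - bloc;
            local_end      -= bloc;
        }
        victim->iend = local_end; /* victim keeps the front of the range */
    }

    int main(void) {
        range_t victim = { 0, 128 };
        range_t thieves[3];
        split(&victim, thieves, 3);
        printf("victim: [%ld, %ld)\n", victim.ibeg, victim.iend);
        for (int i = 0; i < 3; i++)
            printf("thief %d: [%ld, %ld)\n", i, thieves[i].ibeg, thieves[i].iend);
        return 0;
    }

For a victim holding [0, 128) and three requests, each party ends up with 32 iterations: the victim keeps [0, 32) and the thieves receive [96, 128), [64, 96), and [32, 64).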
2. New Scheduler Optimized for Shared Cache

Reuse Distances [Beyls and D'Hollander 01]

- The reuse distance of an access is the number of distinct elements accessed since the previous access to the same element.
- For a first access, the reuse distance is infinite.
- On a fully associative LRU cache of size C: reuse distance $\le C$ $\Rightarrow$ hit; reuse distance $> C$ $\Rightarrow$ miss.
- Let $h_d$ be the number of accesses with reuse distance $d$.
- Number of cache misses: $M(C) = \sum_{d=C+1}^{\infty} h_d$.

Example (cache of size C = 2; cache contents shown before each access):

    element access   A   B   A     B     B     C     B     A     C
    reuse distance   ∞   ∞   2     2     1     ∞     2     3     3
    cache content    ∅   A   A,B   A,B   A,B   A,B   B,C   B,C   A,B

Shared Cache Misses with the Classic Schedule

- Cores work on elements far away from each other in the iteration space.
- The good temporal locality of the sequential algorithm then means the cores work on distinct data.
- To one access by a core correspond p − 1 accesses to distinct data by the other cores, so every reuse distance is multiplied by p: $h^{par}_{p \cdot d} = h^{seq}_d$.
- Number of cache misses:
  $M_{par}(C) = \sum_{d=C+1}^{\infty} h^{par}_d = \sum_{d=C+1}^{\infty} h^{seq}_{d/p} = \sum_{d=C/p+1}^{\infty} h^{seq}_d = M_{seq}(C/p)$
- The shared cache behaves as if it were p times smaller: all sequential accesses with reuse distance in (C/p, C] become additional misses.

A Shared Cache Aware Schedule

- Make the cores work at distance at most m in the iteration space.
- Let r(m) be the number of accesses to distinct elements in a window of size m.
- Each reuse distance grows by at most r(m): $h^{win}_{d+r(m)} = h^{seq}_d$.
- Number of cache misses:
  $M_{win}(C) \le \sum_{d=C+1-r(m)}^{\infty} h^{seq}_d = M_{seq}(C) + \sum_{d=C+1-r(m)}^{C} h^{seq}_d$

Small overhead over the sequential execution: good sequential locality keeps r(m) small and makes $h^{seq}_d$ small for large d, so the additional term is small.

3. Efficient Implementation of the Scheduler

Implementing the Shared Cache Aware Schedule

Using a standard parallel for loop: StaticWindow
- Divide the iterations into n/m chunks of size m.
- Each chunk is processed in parallel with a standard parallel for.
- Two versions: pthread and Xkaapi.

Pthread version:
- Each thread processes 1/p of a chunk.
- All threads wait on a barrier after each chunk.

Xkaapi version:
- Each chunk is processed in parallel with a work-stealing parallel for.

Optimized implementation using Xkaapi: SlidingWindow
- Processing iteration i enables iteration i + m.
- The master thread works at the beginning of the sequence.
- On a steal, the master can give work
  - in the interval [ibeg, iend), like the other workers, or
  - in the interval [ilast, ibeg + m), enabled since the last steal.
- Layout along the sequence: processed elements | master work [ibeg, iend) | stolen work, ending at ilast | remaining elements, with the m-size window starting at ibeg.

    typedef struct {
        InputIterator ibeg;
        InputIterator iend;
    } Work_t; // Task

    typedef struct {
        InputIterator ibeg;
        InputIterator iend;
        InputIterator ilast; // [ilast, ibeg + m) enabled since the last steal
    } Master_Work_t; // Master Task
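How the master uses its extra interval when answering a steal is not spelled out beyond the bullets above; the following is a hypothetical sketch (plain long ranges, invented names, edge cases omitted), not Xkaapi's actual code:

    /* Hypothetical illustration of the SlidingWindow idea. */
    typedef struct { long ibeg, iend; } range_t;

    typedef struct {
        long ibeg, iend;  /* master's own work, like any worker  */
        long ilast;       /* end of the region handed out so far */
        long m;           /* window size                         */
    } master_t;

    /* Answer one steal request from the master's state: prefer the
       iterations enabled since the last steal, otherwise split the
       master's own range like a regular victim would. */
    static range_t master_answer_steal(master_t *w) {
        range_t r;
        long window_end = w->ibeg + w->m; /* nothing beyond this is enabled */
        if (w->ilast < window_end) {
            /* hand out the newly enabled interval [ilast, ibeg + m) */
            r.ibeg   = w->ilast;
            r.iend   = window_end;
            w->ilast = window_end;
        } else {
            /* split own range [ibeg, iend), giving away the back half */
            long bloc = (w->iend - w->ibeg) / 2;
            r.ibeg  = w->iend - bloc;
            r.iend  = w->iend;
            w->iend = r.ibeg;
        }
        return r;
    }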
4. Experiments

Application: Isosurface Extraction

[Figure: number of cache misses (×10^8, roughly 0.4 to 1.2) as a function of the cache size C, comparing M_seq(C/p) with M_seq(C); vertical marks at 2 MB and 8 MB.]

Isosurface extraction:
- A common scientific visualization filter.
- Memory bounded.

Algorithm:
- Iterate through all the cells in the mesh.
- Interpolate the surface inside each cell.
- Cells close in the mesh share points.
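To tie the application back to Section 3, here is a minimal sketch of the StaticWindow pthread scheme driving such a loop; process_cell is a hypothetical stand-in for the per-cell interpolation, and P and M are arbitrary choices:

    #include <pthread.h>
    #include <stdio.h>

    #define P 4        /* worker threads, matching the 4-core machine    */
    #define M 4096     /* window size m, to be tuned to the shared cache */

    enum { N = 1 << 20 };               /* total number of cells */
    static double cells[N];             /* one result per cell   */
    static pthread_barrier_t bar;

    /* Hypothetical per-cell work; the real filter interpolates a surface. */
    static void process_cell(long i) { cells[i] = 0.5 * (double)i; }

    /* StaticWindow, pthread flavor: thread tid processes 1/P of every
       M-size chunk, then waits on a barrier so all threads stay inside
       the same window of the iteration space. */
    static void *worker(void *arg) {
        long tid = (long)arg;
        for (long beg = 0; beg < N; beg += M) {
            long end   = beg + M < N ? beg + M : N;
            long share = (end - beg + P - 1) / P;
            long lo    = beg + tid * share;
            long hi    = lo + share < end ? lo + share : end;
            for (long i = lo; i < hi; i++)
                process_cell(i);
            pthread_barrier_wait(&bar);
        }
        return NULL;
    }

    int main(void) {
        pthread_t th[P];
        pthread_barrier_init(&bar, NULL, P);
        for (long t = 0; t < P; t++)
            pthread_create(&th[t], NULL, worker, (void *)t);
        for (long t = 0; t < P; t++)
            pthread_join(th[t], NULL);
        pthread_barrier_destroy(&bar);
        printf("cells[N-1] = %f\n", cells[N - 1]);
        return 0;
    }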