Subutai: Distributed Synchronization Primitives for Legacy and Novel Parallel Applications
Rodrigo Cadore Cataldo
To cite this version:
Rodrigo Cadore Cataldo. Subutai: Distributed synchronization primitives for legacy and novel parallel applications. Distributed, Parallel, and Cluster Computing [cs.DC]. Université de Bretagne Sud; Pontifícia Universidade Católica do Rio Grande do Sul, 2019. English. NNT: 2019LORIS541. tel-02865408.

HAL Id: tel-02865408
https://tel.archives-ouvertes.fr/tel-02865408
Submitted on 11 Jun 2020

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

DOCTORAL THESIS OF UNIVERSITÉ BRETAGNE SUD
COMUE UNIVERSITÉ BRETAGNE LOIRE
École Doctorale No. 601: Mathematics and Information and Communication Sciences and Technologies
Specialty: Electronics
By Rodrigo Cadore CATALDO
"SUBUTAI: Distributed synchronization primitives for legacy and novel parallel applications"
Thesis presented and defended in Porto Alegre on 16 December 2019
Research unit: Lab-STICC
Thesis No. 541

Reviewers (rapporteurs avant soutenance):
Eduardo Augusto BEZERRA, Professor, Universidade Federal de Santa Catarina (UFSC)
Marcio KREUTZ, Professor, Universidade Federal do Rio Grande do Norte (UFRN)

Jury (composition du jury):
Kevin MARTIN, Associate Professor (Maître de conférences), Université Bretagne Sud
Avelino Francisco ZORZO, Professor, Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS)
Jean-Philippe DIGUET, Research Director (DR), CNRS, thesis supervisor
César Augusto MISSIO MARCON, Professor, Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), thesis co-supervisor and jury president

List of Figures

Figure 1.1 – The speedup of a parallel application is severely limited by how much of the application can be parallelized [Wik19].
Figure 1.2 – The essential approaches to parallelizing code. Thick lines depict locks and the flash symbol denotes a device interrupt [Kå05].
Figure 1.3 – (a) RCU usage over the years [MBWW17]; (b) RCU and locking usage over the years [MW08].
Figure 1.4 – Snippet of the locking scheme for the TTY subsystem [BC19a] [BC19b].
Figure 1.5 – Execution of reader-writer lock and RCU. Green represents up-to-date data [McK19a].
Figure 1.6 – Cost of each production step of multiple generations of transistor technology [Spe19].
Figure 1.7 – Subutai components are highlighted in red in the computing stack. Subutai only requires changes in (i) the on-chip NI, (ii) the OS NI driver, and (iii) the PThreads implementation. Additionally, a new scheduling policy (in blue) is explored in this work as an optional optimization.
Figure 1.8 – Synchronization acceleration with the Subutai solution for different scenarios.
Figure 2.1 – Atomic increment scalability on a Nehalem Intel processor [McK19a].
Figure 2.2 – Throughput of different atomic operations on a single memory position [DGT13].
Figure 2.3 – Reordering example from the Intel Manual [Int17] [Pre19b].
Figure 2.4 – The net effect of using lock-based procedures: implicit memory barriers (represented as blue fences) (based on [Pre19a]).
Figure 2.5 – Scalability of the reader-writer lock mechanism. A parameter simulates the critical section length in instructions, ranging from one thousand (1K on the graph) to 100 million (100M on the graph) instructions [McK19a].
Figure 3.1 – The fork-join model of OpenMP [Bar19a].
Figure 3.2 – RCU deletion example [McK19a].
Figure 3.3 – Milliseconds to copy a 10 MB file in two threads on a 2 GHz Pentium 4 Xeon (lower is better) [Boe07].
Figure 3.4 – Delays to complete the barrier release process for a 16-thread application on a 16-core system (a) without and (b) with optimization [FPMR18].
Figure 3.5 – Normalized execution time of eight applications from the STAMP benchmark from 1 up to 8 threads; AVG represents the average execution time of all applications [YHLR13].
Figure 3.6 – Comparison of different synchronization schemes for PhysicsSolver [YHLR13].
Figure 3.7 – Simulation time comparison [PKMS17].
Figure 3.8 – Ideas behind the CASPAR architecture [GMT16].
Figure 3.9 – Experimental results for the CASPAR architecture design [GMT16].
Figure 3.10 – Three barrier synchronization topologies [AFA+12].
Figure 3.11 – Gather phase for (a) CBarrier, (b) GBarrier, and (c) TBarrier [AFA+12].
Figure 3.12 – (a) Optimal frequencies and (b) area overhead running at 600 MHz for all analyzed topologies [AFA+12].
Figure 3.13 – The barrier cost of (a) SW and (b) HW implementations in cycles [AFA+12].
Figure 3.14 – Barrier overhead for varied-sized parallel workloads [AFA+12].
Figure 3.15 – Target architecture with Notifying Memories [MRSD16].
Figure 3.16 – Classification of packets transported after 10 decoded frames of ice [MRSD16].
Figure 4.1 – The Subutai solution; Subutai components are highlighted in red.
Figure 4.2 – Schematic of the target architecture for Subutai; the target architecture comprises 64 cores.
Figure 4.3 – Schematic representation of Subutai-HW and the NI. The hardware elements required by the Subutai-HW implementation are highlighted in orange.
Figure 4.4 – Subutai-HW control structure.
Figure 4.5 – Subutai-HW queue structure.
Figure 4.6 – Subutai's packet format.
Figure 4.7 – Futex kernel implementation [Har19].
Figure 4.8 – 2-bit saturating counter for branch prediction [Dia19].
Figure 4.9 – Communication flow of Subutai.
Figure 5.1 – The Subutai extensions are highlighted in blue.
Figure 5.2 – Core scheduler procedure switch_to steps for two Linux kernel versions [Mai19].
Figure 5.3 – Overall time spent in the critical sections for all the work-related mutexes of Bodytrack on an RR scheduler.
Figure 5.4 – Overall time spent in mutex queues for all the work-related mutexes of Bodytrack on an RR scheduler.
Figure 5.5 – Total execution time spent in critical sections for all the work-related mutexes of Bodytrack on an RR scheduler.
Figure 5.6 – Overall time spent in critical sections for all the work-related mutexes of Bodytrack on an RR and CSA-enabled scheduler.
Figure 6.1 – Experimental setup for Subutai evaluation.
Figure 6.2 – Maximum speedup measured for a 48-core system for each benchmark, region, and input set combination. Full and ROI inputs represent the entire application and the parallel portion, respectively [SR15].
Figure 6.3 – Bodytrack's output.
Figure 6.4 – Bodytrack's core workflow.
Figure 6.5 – The execution time in seconds (10⁹ ns) of our application set on SW-only and Subutai solutions for 16 threads. Due to the order of magnitude, Subutai-HW and NoC latencies are not visible.
Figure 6.6 – The execution time in seconds (10⁹ ns) of our application set on SW-only and Subutai solutions for 32 threads. Due to the order of magnitude, Subutai-HW and NoC latencies are not visible.
Figure 6.7 – The execution time in seconds (10⁹ ns) of our application set on SW-only and Subutai solutions for 64 cores. Due to the order of magnitude, Subutai-HW and NoC latencies are not visible.
Figure 6.8 – Relative time for every thread to reach the barrier localS; (a) is a plot of all barrier calls during the application execution, while (b) is a snippet of 50 barrier calls for better visualization.
Figure 6.9 – Relative time for every thread to reach all barrier calls of workDoneBarrier.
Figure 6.10 – Comparison of latency for the procedure pthread_barrier_wait for SW-only and Subutai. The results are for T7 for a subset of barrier calls (400 up to 450).
Figure 6.11 – Comparison of latency for the procedure pthread_barrier_wait for SW-only and Subutai. The results are for T7 for all barrier calls of poolReadyBarrier.
Figure 6.12 – Time spent, in 10⁶ ns, on the mutex queue of Bodytrack and x264. (a) and (b) are plots of time spent (Y-axis) by the number of threads on the queue (X-axis) for x264 and Bodytrack, respectively; (c) and (d) are the total time spent for the same application set.
Figure 6.13 – Execution in seconds (10⁹ ns) for multiple application sets (lower is better) (Exec = Execution).
Figure 6.14 – Execution in seconds (10⁹ ns) for multiple application sets (lower is better) (Exec = Execution).
Figure 6.15 – Latency reduction in nanoseconds of the total execution time for three PARSEC applications.
Figure 7.1 – Polynomial regression and recorded data for the barrier latencies of a worker thread of Bodytrack.
Figure 7.2 – Barrier latencies from Bodytrack for (a) recorded data, (b) single-gaussian, and (c) double-gaussian models.
Figure 7.3 – A zoom into some of the first 10 barriers of Bodytrack from Figure 7.2.

List of Tables

Table 1.1 – Breakdown of RCU usage by kernel subsystems [MBWW17].
Table 2.1 – Overview of some memory barriers [Int17] [Lea19] [SPA93].
Table 2.2 – Memory barriers used by some PThread implementations for mutexes.