Performance Optimization Strategies for Transactional Memory Applications
Dissertation approved by the Department of Informatics of the Karlsruhe Institute of Technology (KIT) for the academic degree of a Doktor der Ingenieurwissenschaften, submitted by Martin Otto Schindewolf from Eschwege. Date of the oral examination: 19 April 2013. First reviewer: Prof. Dr. Wolfgang Karl. Second reviewer: Prof. Dr. Albert Cohen. KIT – University of the State of Baden-Württemberg and National Research Center of the Helmholtz Association. www.kit.edu

I hereby truthfully declare that I have prepared this work independently, apart from the assistance already known to the advisor, that I have listed all aids used completely and precisely, and that I have marked everything that was taken from the works of others, whether unchanged or modified. Karlsruhe, 4 March 2013. Martin Schindewolf

Abstract

Transactional Memory (TM) has been proposed as an architectural extension to enable lock-free data structures. With the ubiquity of multi-core systems, the idea of TM gains new momentum. TM was invented to simplify the synchronization of parallel threads in a shared memory system. TM features optimistic concurrency, as opposed to the pessimistic concurrency of traditional locking. The optimistic approach lets two transactions execute in parallel and assumes that there is no data race. In case of a data race, e.g., when both transactions write to the same address, the conflict must be detected and resolved. Therefore, a TM run time system monitors shared memory accesses inside transactions. TM systems can be implemented in software (STM), in hardware (HTM), or as a hybrid combination of both.

Most research on TM focuses on language extensions, compiler support, and the optimization of algorithmic details of TM systems. Each of the resulting TM systems has a different performance characteristic and comes with a number of tunable parameters. This optimization space overwhelms the application developer, who has been the target audience for the invention of TM. In contrast to other research activities, this thesis proposes TM tools that are candidates to bridge the gap between the application developer and the designer of the TM system. These tools provide insights into the run time behavior of the TM application and guide the application developer in optimizing it. In contrast to related work presenting tools for a specific TM system coupled with a programming language, this thesis presents solutions that cover multiple TM systems (software, hardware, and hybrid TM) and account for different parameter settings in each of them. Moreover, the proposed methods and strategies use information from all layers of the TM software stack. Therefore, static information, information about the run time behavior, and expert-level knowledge need to be extracted to develop these new methods and strategies for the optimization of TM applications. Extracting and using this information poses the following challenges.

The first challenge is to capture the genuine behavior of the TM application at run time. This is especially important because transactions are sensitive to artificially introduced delays of a thread, which may themselves introduce TM conflicts, so that the recorded behavior is biased by the tracing machinery.
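As a concrete illustration of the two synchronization models contrasted above, the following minimal sketch protects a shared counter once with a lock and once with a transaction. It uses GCC's transactional memory language extension (compile with g++ -fgnu-tm -pthread, executed by libitm) purely as one readily available front end; the counter and the thread setup are illustrative assumptions and not code from this thesis.

    // Minimal sketch: a lock-protected and a transactional update of a counter.
    #include <mutex>
    #include <thread>
    #include <vector>

    static long counter_locked = 0;
    static long counter_tm = 0;
    static std::mutex counter_mutex;

    void increment_locked() {
        // Pessimistic concurrency: every thread serializes on the lock, even
        // if no other thread touches the counter at the same time.
        std::lock_guard<std::mutex> guard(counter_mutex);
        ++counter_locked;
    }

    void increment_tm() {
        // Optimistic concurrency: the TM run time system monitors the accesses
        // inside the transaction and rolls back only if a conflict occurs.
        __transaction_atomic {
            ++counter_tm;
        }
    }

    int main() {
        std::vector<std::thread> threads;
        for (int i = 0; i < 4; ++i)
            threads.emplace_back([] { increment_locked(); increment_tm(); });
        for (std::thread& t : threads)
            t.join();
        return 0;
    }

The lock-based variant always serializes the threads, whereas the transactional variant only pays for synchronization when two transactions actually race on the same address.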
To keep this probe effect small, a lightweight trace generation scheme needs to be developed and evaluated. This thesis presents similar solutions for STM and hybrid TM that generate event traces. For STM, logging frequently occurring events, such as TM loads and stores, quickly saturates the write bandwidth of the hard disk. Thus, the amount of data written to the hard disk must be reduced. Therefore, we research how these events can be compressed online without disturbing the application. A multi-threaded trace compression scheme is designed, implemented, and evaluated. We enhance the TinySTM, a word-based open source STM, with tracing facilities. Due to the lightweight and compressing tracing scheme, the resulting TracingTinySTM generates event traces that capture the genuine TM application's behavior. For an FPGA-based hybrid TM system, called TMbox, a low-overhead tracing solution is shown. Each processing element is enhanced with an event generation unit for monitoring. When an instruction of interest is encountered, a corresponding time-stamped event is generated and passed on to the log unit. This unit uses idle times on the bus to transfer the event to the memory of the host machine. Although all methods are tailored to and applied in the context of TM, they are also applicable in other contexts, for example the monitoring of arbitrary events in an FPGA-based processor prototype or the logging and compression of events in a math library.

The second challenge arises with the release of IBM's new Blue Gene/Q architecture with HTM support. We establish a set of best practices for the new architecture so that the application developer can exploit its full potential, including the TM subsystem. For this purpose, we introduce a new benchmark, CLOMP-TM, whose parameters allow us to explore varying transaction granularities and conflict rates, coupled with typical computational kernels found in scientific codes. This enables new insights through evaluating the TM system and comparing its performance with other OpenMP synchronization primitives. Further, the BG/Q architecture also requires tool support to enable optimizations of scientific applications. A trace-based solution that captures the run time behavior on a per-event basis is not feasible for this proprietary HTM system, which requires a complete software stack including compiler and run time system. Thus, we design, implement, and evaluate three different tools for TM that enable programmers to explore the subtleties of TM execution and contribute the following to the state of the art:

• the first tool that profiles applications using MPI and OpenMP with TM on BG/Q,
• a tracing tool for TM that enables in-depth inspection of thread-level execution and utilization of the architecture through visualization with the state-of-the-art visualization tool Vampir,
• a tool that measures the overheads associated with TM, designed to dissect these overheads and direct optimization efforts for the TM stack.

These tools uncover the subtle interaction of the TM system and the prefetching on BG/Q and help to study the implications for designing applications and choosing the TM mode of execution. Moreover, these tools make it possible to obtain a comprehensive understanding of the performance of the synchronization mechanisms in LULESH, a Lagrange hydrodynamics proxy application, and to find the cause for the missing performance with TM.
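To make the kind of comparison CLOMP-TM performs more tangible, the following sketch updates randomly chosen bins of a shared array under an OpenMP critical section, under an OpenMP atomic update, and under a transaction. The array size, update count, and access pattern are illustrative assumptions; the transactional variant is shown with the tm_atomic pragma of the IBM XL compiler for BG/Q and is commented out so that the sketch still compiles with other OpenMP compilers.

    // Illustrative comparison of synchronization primitives for scattered
    // bin updates. Compile with an OpenMP-capable compiler, e.g. g++ -fopenmp.
    #include <omp.h>
    #include <cstdio>

    const int N = 1024;          // number of shared bins
    const int UPDATES = 100000;  // updates per thread

    double bins[N];

    int main() {
        #pragma omp parallel
        {
            unsigned seed = 17u * (omp_get_thread_num() + 1);
            for (int i = 0; i < UPDATES; ++i) {
                seed = seed * 1103515245u + 12345u;   // cheap thread-local PRNG
                int idx = seed % N;

                // Variant 1: pessimistic, coarse-grained critical section.
                #pragma omp critical
                bins[idx] += 1.0;

                // Variant 2: atomic update of a single memory location.
                #pragma omp atomic
                bins[idx] += 1.0;

                // Variant 3: optimistic transaction; with the IBM XL compiler
                // for BG/Q this would read (commented out here):
                //   #pragma tm_atomic
                //   { bins[idx] += 1.0; }
            }
        }
        std::printf("bins[0] = %.0f\n", bins[0]);
        return 0;
    }

With small, rarely conflicting updates such as these, the critical section pays its full serialization cost on every update, while the atomic and transactional variants only pay when threads actually collide on the same bin.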
The third challenge is to correlate the gathered TM characteristics with microarchitecture events for the optimization of TM applications using STMs. Although this correlation is not new by itself, it is extremely helpful and has not been researched extensively in this context. Besides choosing well-suited parameters to monitor the microarchitecture, the correct interpretation of the obtained values is of key importance. To simplify the optimization for humans, the Visualization Of TM Applications (VisOTMA) framework is developed. In a post-processing step, it visualizes these traces and enables the programmer to identify frequently conflicting transactions and the corresponding values of the microarchitecture parameters.

The fourth challenge is to visualize this aggregated data in a way that lets an inexperienced as well as an experienced programmer identify pathological execution patterns. So far, optimizations have been carried out by TM experts with excellent knowledge of the underlying TM system. An inexperienced programmer does not have this knowledge and may not be willing to acquire it. Thus, the inexperienced programmer follows a trial-and-error strategy to optimize the TM application. To speed up this process, we invent EigenOpt, an exploration tool based on EigenBench, as part of the VisOTMA framework. With the help of EigenOpt, any programmer can capture the TM characteristics of the application in terms of EigenBench parameters. These parameters, combined with EigenBench, can then be used directly to explore the space of possible optimizations. With this tool, unrewarding optimization directions can be excluded without modifying the application. In this thesis, we research how to identify and avoid optimization attempts with diminishing returns. This speeds up the optimization process for an inexperienced programmer and also yields new insights for an experienced one.

The fifth challenge is to detect and exploit a potential phase behavior of TM applications and to integrate this analysis into the VisOTMA framework. If the TM application shows periods of high and low conflict probability, this behavior can be detected and exploited. Exploiting these phases is motivated by the different proposed TM designs: optimistic conflict detection schemes detect conflicts at commit time, whereas pessimistic schemes check for conflicts at encounter time. In the optimistic case, a conflict that occurs early in the execution of a transaction is only noticed at commit time, so the work performed after the conflicting access is wasted.
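To make the notion of phases concrete, the following sketch shows one simple way a trace post-processing step could classify fixed-size intervals of a transaction trace as high-conflict or low-conflict phases. The interval length, the abort-rate threshold, and the input format are illustrative assumptions and do not describe the phase analysis implemented in the VisOTMA framework.

    // Hypothetical phase detection over a transaction trace: cut the trace
    // into fixed-size intervals, compute each interval's abort rate, and
    // merge adjacent intervals with the same classification into one phase.
    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    struct Phase {
        std::size_t first_interval;
        std::size_t last_interval;
        bool high_conflict;
    };

    // 'aborted' holds one flag per transaction in trace order: true = aborted.
    std::vector<Phase> detect_phases(const std::vector<bool>& aborted,
                                     std::size_t interval = 1000,
                                     double threshold = 0.10) {
        std::vector<Phase> phases;
        for (std::size_t start = 0; start < aborted.size(); start += interval) {
            std::size_t end = std::min(start + interval, aborted.size());
            std::size_t aborts = 0;
            for (std::size_t i = start; i < end; ++i)
                aborts += aborted[i] ? 1 : 0;
            bool high = static_cast<double>(aborts) / (end - start) > threshold;
            std::size_t idx = start / interval;
            if (!phases.empty() && phases.back().high_conflict == high)
                phases.back().last_interval = idx;   // extend the running phase
            else
                phases.push_back({idx, idx, high});  // a new phase begins here
        }
        return phases;
    }

    int main() {
        // Synthetic trace: a calm period followed by a highly conflicting one.
        std::vector<bool> trace(4000, false);
        for (std::size_t i = 2000; i < 4000; i += 2) trace[i] = true;

        for (const Phase& p : detect_phases(trace))
            std::printf("intervals %zu-%zu: %s conflict probability\n",
                        p.first_interval, p.last_interval,
                        p.high_conflict ? "high" : "low");
        return 0;
    }

A phase boundary detected in this way is exactly the point at which switching the conflict detection scheme, or another TM parameter, could pay off.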