UC Santa Cruz UC Santa Cruz Electronic Theses and Dissertations
Title: Section Based Program Analysis to Reduce Overhead of Detecting Unsynchronized Thread Communication
Permalink: https://escholarship.org/uc/item/8vx8d8wv
Author: Das, Madan Mohan
Publication Date: 2015
Peer reviewed | Thesis/dissertation
eScholarship.org, powered by the California Digital Library, University of California

UNIVERSITY OF CALIFORNIA
SANTA CRUZ

SECTION BASED PROGRAM ANALYSIS TO REDUCE OVERHEAD OF DETECTING UNSYNCHRONIZED THREAD COMMUNICATION

A dissertation submitted in partial satisfaction of the requirements for the degree of

DOCTOR OF PHILOSOPHY
in
COMPUTER ENGINEERING

by

Madan Mohan Das

March 2015

The Dissertation of Madan Mohan Das is approved:
Professor Jose Renau, Chair
Professor Cormac Flanagan
Professor Anujan Varma

Tyrus Miller
Vice Provost and Dean of Graduate Studies

Copyright © by Madan Mohan Das 2015

Table of Contents

List of Figures
List of Tables
Abstract
Acknowledgments
1 Introduction
  1.1 Contributions
  1.2 Thesis Organization
2 Related Work
  2.1 Static Race Detection
  2.2 Dynamic Race Detection
  2.3 Deterministic Runtime Systems
  2.4 Software Transactional Memory (STM)
  2.5 Data Flow Analysis
  2.6 Pointer Analysis Background
    2.6.1 Flow sensitive vs. insensitive pointer analysis
    2.6.2 Context sensitive vs. insensitive pointer analysis
    2.6.3 Dynamic vs. Static pointer analysis
  2.7 Some important Flow insensitive methods
    2.7.1 Andersen’s flow insensitive analysis
    2.7.2 Steensgaard’s algorithm
3 Finding Disjoint Thread Sections
  3.1 Terminologies
  3.2 SBPA Pointer Analysis Framework
    3.2.1 Perform modref analysis per section
    3.2.2 Adaptive non-unification for points-to sets of function arguments
    3.2.3 Field sensitivity for array elements
  3.3 Constructing the Reduced ICFG
  3.4 Single-Threaded Thread Sections (Single-TS)
  3.5 Disjoint Thread Sections (Disjoint-TS)
  3.6 Overall Instrumentation Flow
4 Programmer Annotations and MTROM
  4.1 Marking Parallel Code Sections
  4.2 Multi Thread Read Only Memory
5 Loop Invariant Log Motion
  5.1 Scalar Loop Invariant Log Motion (SLILM)
  5.2 Vector Loop Invariant Log Motion (VLILM)
  5.3 Result of Applying LILM
6 Experimental Results
  6.1 Experimental Setup
  6.2 Results
    6.2.1 Overall Results
    6.2.2 Analysis of Reduction in Instrumentation
    6.2.3 Benchmark Insights
    6.2.4 Compilation Overhead
7 A case study with ThreadSanitizer
8 Improving Static Race Precision with SBPA
  8.1 Methodology
  8.2 An Example Case
  8.3 Results
9 Conclusion
10 Future Work
  10.1 Symbolic Array Partitioning
  10.2 Generalized SBPA and MTROM
    10.2.1 Extension of SBPA to Tree of Threads
    10.2.2 MTROM for Tree of Threads and Dynamic MTROM
  10.3 Hardware Implementation of MTROM
Bibliography

List of Figures

1.1 Two threads T1 and T2 execute four different code sections (CS1, CS2, CS3, CS4) separated by a barrier B. The barrier implicitly creates two thread sections (TS1, TS2) that cannot execute simultaneously. WX and RX represent write and read of some memory location X. Since X is not modified anywhere in TS2, RX does not need to be instrumented.
1.2 A typical threaded section of program
3.1 A code snippet showing the call of a function f, that has any of the thread creation, join or synchronization directive. The resulting changes in RICFG are shown on the right.
3.2 A program comprised of 4 code sections in 3 thread sections
3.3 A decomposition of threaded section into smaller, disjoint parallel segments
3.4 Two threads executing 5 code sections in 3 disjoint thread sections.
6.1 SBPA compiler pass detected 63% of all memory accesses at run-time as non-conflicting.
Excluding the improvements from 'Directives' yields 51% accesses proven as non-conflicting.
6.2 SBPA identified 80% of the dynamic memory accesses executed in single threaded mode.
6.3 68% of the loads are detected as non-conflicting, with a few applications reaching 100%.
6.4 61% of the stores do not need to be tracked by tools like data-race detectors which is much better than 38% with Base. For STMs that may have restarts, 52% of the stores can be proven as safe.
6.5 Compilation times of Base and SBPA normalized against compile time when no optimization is applied. On average, Base took 75% and SBPA took 46% of unoptimized compile time.
7.1 SBPA yields over 2 times speedup compared to Tsan, with some applications achieving over 30 times speedup.
7.2 ThreadSanitizer speed-ups in the different modes described earlier. SBPA, combined with Directives, speeds up ThreadSanitizer execution by a factor of 2.74.
8.1 Static race detection flow with SBPA. Static races detected are classified and further validated by dynamic race detection.
8.2 Percentages of total read and write lines identified as non-racy by Base and SBPA techniques.
8.3 Percentages of read lines identified as non-racy by Base and SBPA techniques.
8.4 Percentages of write lines identified as non-racy by Base and SBPA techniques.
8.5 Percentages of read and write instructions identified as non-racy by Base and SBPA techniques.
8.6 Percentages of read instructions identified as non-racy by Base and SBPA techniques.
8.7 Percentages of write instructions identified as non-racy by Base and SBPA techniques.
10.1 A hierarchy of threads in a program, represented as a tree

List of Tables

5.1 Total access logging reduction improvement with LILM
5.2 Load logging reduction with LILM.
6.1 Benchmarks used, abbreviation shown in parentheses.
7.1 Runtimes (in seconds) of executables compiled for ThreadSanitizer dynamic race detection with 2 threads.
8.1 Total racy read and write lines, and total program lines for the benchmarks studied.

Abstract

Section Based Program Analysis to Reduce Overhead of Detecting Unsynchronized Thread Communication

by

Madan Mohan Das

Most systems that test and verify parallel programs, such as data race detectors and software transactional memory systems, require instrumenting loads and stores in an application. This can cause very significant runtime and memory overhead compared to executing uninstrumented code. Multithreaded programming typically allows any thread to perform loads and stores at any location in the process’s address space independently. Most of these unsynchronized memory accesses are non-conflicting in nature; that is, the values read from or written to memory are only used by a single thread.

We propose Section-Based Program Analysis (SBPA), a novel way to decompose the program into disjoint sections to identify non-conflicting loads and stores during program compilation. We combine SBPA with improved context-sensitive alias analysis, loop-specific optimizations, and a few user directives to further increase its effectiveness. We implemented SBPA for a deterministic execution runtime environment, and were able to eliminate 63% of dynamic memory access instrumentations. We also integrated SBPA with ThreadSanitizer, a state-of-the-art dynamic race detector, and achieved a speed-up of 2.74 times on a geometric mean basis. Lastly, we show that SBPA is also effective in static race detection.

Acknowledgments

First, I sincerely thank my adviser, Professor Jose Renau, for his ideas, vision, constant support, and feedback on my research over the last many years. Without his help I would not have been able to reach this milestone in my life. I would also like to thank Professor Cormac Flanagan and Professor Anujan Varma for agreeing to review this work and for providing valuable feedback. This thesis would not have been complete without their input.
I also thank Gabriel Southern, my fellow researcher and co-author in my publications, for his great effort and support for the work presented in this thesis, and for bringing positive thoughts during times of difficulty. I also sincerely thank the other MASC lab students for their valuable input and feedback during the tenure of my studies. Finally, I thank my wife, son, and daughter for being supportive of my decision to pursue this degree and for understanding my inability at times to carry out family obligations.

Chapter 1

Introduction

Since the beginning of the computer era, researchers have explored several parallel execution models and architectures. Significant progress has been achieved in parallelizing the execution of multiple programs on the same system, and in distributed multi-processing, where the outcome of the program depends on the solution of well-segregated sub-problems. Parallel programming in the context of a single program with shared memory, however, remains difficult for programmers to adopt. This is most conspicuous in the fact that even today, most programmers first think of their programs as single-threaded applications and often develop them as such, and only then consider parallelizing the program as an afterthought.

The reasons for such behavior are multiple. Most prominent among them is that software programmers find it difficult to visualize the parallel execution of many different program fragments. Second, the potentially exponential number of possible interleavings of program fragment executions makes it almost impossible to contemplate and debug the unpredictable bugs, or as many have called them, the “heisenbugs”.