INVESTIGATING TOOLS AND TECHNIQUES FOR IMPROVING SOFTWARE PERFORMANCE ON MULTIPROCESSOR COMPUTER SYSTEMS

A thesis submitted in fulfilment of the requirements for the degree of
MASTER OF SCIENCE
of
RHODES UNIVERSITY

by

WAIDE BARRINGTON TRISTRAM

Grahamstown, South Africa
March 2011

Abstract

The availability of modern commodity multicore processors and multiprocessor computer systems has resulted in the widespread adoption of parallel computers in a variety of environments, ranging from home computers to, in particular, workstation and server environments. Unfortunately, parallel programming is harder and requires more expertise than traditional sequential programming. The variety of tools and parallel programming models available to the programmer further complicates the issue. The primary goal of this research was to identify and describe a selection of parallel programming tools and techniques to aid novice parallel programmers in the process of developing efficient parallel C/C++ programs for the Linux platform. This was achieved by highlighting and describing the key concepts and hardware factors that affect parallel programming, providing a brief survey of commonly available software development tools and parallel programming models and libraries, and presenting structured approaches to software performance tuning and parallel programming. Finally, the performance of several parallel programming models and libraries was investigated, along with the programming effort required to implement solutions using the respective models.

A quantitative research methodology was applied to the investigation of the performance and programming effort associated with the selected parallel programming models and libraries, which included automatic parallelisation by the compiler, Boost Threads, Cilk Plus, OpenMP, POSIX threads (Pthreads), and Threading Building Blocks (TBB). Additionally, the performance of the GNU C/C++ and Intel C/C++ compilers was examined. The results revealed that the choice of parallel programming model or library depends on the type of problem being solved and that there is no overall best choice for all classes of problem. However, the results also indicate that parallel programming models with higher levels of abstraction require less programming effort and provide performance similar to that of explicit threading models. The principal conclusion was that problem analysis and parallel design are important factors in the selection of a parallel programming model and tools, but that models with higher levels of abstraction, such as OpenMP and Threading Building Blocks, are favoured.
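To make the abstraction claim concrete, the sketch below (written for this summary, not taken from the thesis) implements the same parallel loop twice: once with explicit POSIX threads, where the programmer must partition the iteration space and create and join the worker threads by hand, and once with a single OpenMP directive. All names (scale_chunk, NTHREADS, the array sizes) and the compile line are illustrative assumptions.

/* Illustrative sketch only (not code from the thesis): the same parallel
 * loop written with explicit Pthreads and with an OpenMP directive.
 * Compile with something like: gcc -std=c99 -O2 -fopenmp -pthread scale.c
 * Error handling is omitted for brevity. */
#include <pthread.h>
#include <stddef.h>

#define N 1000000
#define NTHREADS 4

static double a[N], b[N];

/* Pthreads version: the programmer partitions the work and manages
 * thread creation and joining explicitly. */
struct range { size_t lo, hi; };

static void *scale_chunk(void *arg)
{
    struct range *r = arg;
    for (size_t i = r->lo; i < r->hi; i++)
        b[i] = 2.0 * a[i];
    return NULL;
}

static void scale_pthreads(void)
{
    pthread_t tid[NTHREADS];
    struct range r[NTHREADS];
    size_t chunk = N / NTHREADS;

    for (int t = 0; t < NTHREADS; t++) {
        r[t].lo = (size_t)t * chunk;
        /* Last thread absorbs any remainder of the iteration space. */
        r[t].hi = (t == NTHREADS - 1) ? N : r[t].lo + chunk;
        pthread_create(&tid[t], NULL, scale_chunk, &r[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
}

/* OpenMP version: one directive expresses the same computation; the
 * runtime handles partitioning and thread management. */
static void scale_openmp(void)
{
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        b[i] = 2.0 * a[i];
}

int main(void)
{
    for (long i = 0; i < N; i++)
        a[i] = (double)i;
    scale_pthreads();
    scale_openmp();
    return 0;
}

Both functions compute the same result; the difference lies in the amount and complexity of code the programmer must write, which is the kind of programming effort the thesis quantifies with code metrics.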
ACM Computing Classification System

Thesis classification under the ACM Computing Classification System (1998 version, valid through 2011) [2]:

D.1.3 [Concurrent Programming]: Parallel programming
D.2.5 [Testing and Debugging]: Debugging aids
D.2.8 [Metrics]: Performance measures, Product metrics
D.3.3 [Language Constructs and Features]: Concurrent programming structures

General Terms: Design, Experimentation, Measurement, Performance

Acknowledgements

I acknowledge the financial and technical support of Telkom SA, Business Connexion, Comverse SA, Verso Technologies, Stortech, Tellabs, Amatole Telecom Services, Mars Technologies, Bright Ideas Projects 39, and THRIP through the Telkom Centre of Excellence in the Department of Computer Science at Rhodes University. In particular, I extend thanks to Telkom for the generous Centre of Excellence bursary allocated to me during the course of my research.

I am also deeply grateful to my supervisor, Dr Karen Bradshaw, for her exceptional support, advice, and patience. I could not have asked for a better supervisor to guide me through my research.

A special thanks goes to my partner, Lisa, for her love, support, and patience, particularly during the final stages of my thesis write-up. Had it not been for her exceptional organisational skills and forward planning, I would have had a significantly harder time finalising this thesis.

Many thanks to my family and friends for their constant support and encouragement.

Contents

1 Introduction
  1.1 Problem Statement and Research Goals
  1.2 Thesis Organisation
2 Background Work
  2.1 Introduction
  2.2 Parallel Programming Terms and Concepts
    2.2.1 Serial Computing
    2.2.2 Parallel Computing
    2.2.3 Tasks
    2.2.4 Processes and Threads
    2.2.5 Locks
    2.2.6 Critical Sections, Barriers and Synchronisation
    2.2.7 Race Conditions
    2.2.8 Deadlock
    2.2.9 Starvation and Livelock
  2.3 Parallel Performance
    2.3.1 Speedup
    2.3.2 Amdahl’s Law
    2.3.3 Gustafson’s Law
    2.3.4 Sequential Algorithms versus Parallel Algorithms
  2.4 Computer Organisation
    2.4.1 Processor Organisation
    2.4.2 Computer Memory
    2.4.3 Memory Organisation
  2.5 Mutexes, Semaphores and Barriers
    2.5.1 Properties of Concurrent Operations and Locks
    2.5.2 Atomic Operations
    2.5.3 Locking Strategies
    2.5.4 Types of Locks
  2.6 Software Metrics
    2.6.1 Code Size and Complexity
    2.6.2 Concluding Remarks
  2.7 Summary
3 Parallel Programming Tools
  3.1 Introduction
  3.2 C and C++
    3.2.1 C++0x
  3.3 Concurrent Programming Environments
    3.3.1 POSIX Threads
    3.3.2 Boost Threads Library
    3.3.3 Threading Building Blocks
    3.3.4 OpenMP
    3.3.5 Cilk, Cilk++, and Cilk Plus
    3.3.6 Parallel Performance Libraries
  3.4 Compilers
    3.4.1 GNU Compiler Collection
    3.4.2 Intel C and C++ Compilers
    3.4.3 Low Level Virtual Machine (LLVM)
  3.5 Integrated Development Environments
    3.5.1 Eclipse
    3.5.2 NetBeans IDE
  3.6 Debugging and Profiling Tools
    3.6.1 GNU Debugger (GDB)
    3.6.2 Intel Debugger (IDB)
    3.6.3 Intel Inspector XE 2011
    3.6.4 GNU Profiler (gprof)
    3.6.5 OProfile
    3.6.6 Valgrind
    3.6.7 AMD CodeAnalyst for Linux
    3.6.8 Intel VTune Amplifier XE 2011 for Linux
4 Parallel Programming Techniques
  4.1 Introduction
  4.2 Performance Analysis and Tuning
  4.3 The Tuning Process
  4.4 Sequential Optimisations
    4.4.1 Loop Optimisations
    4.4.2 Aligning, Padding, and Sorting Data Structures
    4.4.3 Avoid Branching
    4.4.4 Vectorisation
  4.5 Parallel Programming Patterns
    4.5.1 Identifying Concurrency
    4.5.2 Algorithm Structure
    4.5.3 Supporting Structures
    4.5.4 Implementation Mechanisms
  4.6 Parallel Optimisations
    4.6.1 Data Sharing
    4.6.2 Reduce Lock Contention
    4.6.3 Load Balancing
5 Methodology
  5.1 Research Design
  5.2 System Specifications and Software Versions
  5.3 Automated Benchmarking Tool
    5.3.1 Extensions to the Benchmarking Tool
6 Implementation
  6.1 Introduction
  6.2 Matrix Multiplication
    6.2.1 Initial Profiling and Analysis
    6.2.2 Sequential Optimisations
    6.2.3 Parallel Implementations
  6.3 Mandelbrot Set Algorithm
    6.3.1 Initial Profiling and Analysis
    6.3.2 Sequential Optimisations
    6.3.3 Parallel Implementation
  6.4 Deduplication Kernel
    6.4.1 Initial Profiling and Analysis
    6.4.2 Sequential Optimisations
    6.4.3 Parallel Implementation
7 Results
  7.1 Matrix Multiplication
    7.1.1 Runtime Performance
    7.1.2 Code Metrics