Minimizing Startup Costs for Performance-Critical Threading

Minimizing Startup Costs for Performance-Critical Threading Anthony M. Castaldo R. Clint Whaley Department of Computer Science Department of Computer Science University of Texas at San Antonio University of Texas at San Antonio San Antonio, TX 78249 San Antonio, TX 78249 Email : [email protected] Email : [email protected] Abstract—Using the well-known ATLAS and LAPACK dense on several eight core systems running a standard Linux OS, linear algebra libraries, we demonstrate that the parallel manage- ATLAS produced alarmingly poor parallel performance even ment overhead (PMO) can grow with problem size on even stati- on compute bound, highly parallelizable problems such as cally scheduled parallel programs with minimal task interaction. Therefore, the widely held view that these thread management matrix multiply. issues can be ignored in such computationally intensive libraries is wrong, and leads to substantial slowdown on today’s machines. Dense linear algebra libraries like ATLAS and LAPACK [2] We survey several methods for reducing this overhead, the are almost ideal targets for parallelism: the problems are best of which we have not seen in the literature. Finally, we regular and often easily decomposed into subproblems of equal demonstrate that by applying these techniques at the kernel level, performance in applications such as LU and QR factorizations complexity, minimizing any need for dynamic task scheduling, can be improved by almost 40% for small problems, and as load balancing or coordination. Many have high data reuse much as 15% for large O(N 3) computations. These techniques and therefore require relatively modest data movement. Until are completely general, and should yield significant speedup in recently, ATLAS achieved good parallel speedup using simple almost any performance-critical operation. We then show that the distribution and straightforward threading approaches. This lion’s share of the remaining parallel inefficiency comes from bus contention, and, in the future work section, outline some simple threading approach failed to deliver good performance promising avenues for further improvement. on commodity eight core systems, and thus it became necessary to investigate what had gone wrong. I. INTRODUCTION Long-running architectural trends have signaled the end of In the course of this investigation, we have developed two sustained increases in serial performance due to increasing measurements which we believe help us understand parallel clock rate and instruction level parallelism (ILP), but have not behavior. The first of these is Parallel Management Overhead (PMO). Because our problems are statically partitioned and changed Moore’s law [5]. Therefore, since architects are faced 2 3 with an ever-increasing circuit budget which can no longer be compute intensive (O(N ) or O(N )), PMO should grow only leveraged for meaningful serial performance improvements, with O(t) (the number of threads). For eight threads it should they have increasingly turned to supplying additional cores be a constant. It is not. Not only is PMO a major factor in our within a single physical package. Today it is difficult to buy lack of parallel efficiency, it grows with problem size, even on even a laptop chip that has less than two cores, and 4 cores is very large problems. common on the desktop. This trend is expected to continue, with some architects predicting even desktop chips with huge Outline: Section I-A introduces necessary terminology and numbers of simplified cores, as in the IBM Cell [7] and Intel defines our timing measurements, while §I-B discusses our Larrabee [13] architectures. timing methodology. In §II we survey techniques for managing thread startup and shutdown, show that PMO is a significant As commodity OSes (i.e. OSes not written specifically for cost that can grow with problem size rather than t, and HPC) are used on increasingly parallel machines, previously introduce our technique for reducing PMO to a small constant reasonable assumptions may become untenable. In particular, on t. In §III we provide a quantitative comparison of these we can no longer assume the hardware, OS or compilers techniques, and show they are important even in very large are highly tuned to exploit multiple cores or efficiently share operations that can be perfectly partitioned statically. In §IV common resources. Such assumptions were built into our own we show how these relatively simple changes to a kernel ATLAS [18], [17], [16] (Automatically Tuned Linear Algebra library such as the BLAS [4] (Basic Linear Algebra Subpro- Package) software (eg., we assumed that the OS would sched- grams) can deliver substantial parallel speedup for higher level ule threads to separate cores whenever possible). However, applications such as the QR and LU factorizations found in 1This work was supported in part by National Science Foundation CRI LAPACK. Finally, in §VI we discuss future work, and offer grant CNS-0551504 our summary and conclusions in §VII. A. Terminology and Definitions B. Experimental Methodology We refer to one serial execution engine/CPU as a core, with When performing serial optimization on a kernel with no multiple cores sharing a physical package (or just package). A system calls, we often report the best achieved performance physical package is the component plugged into a motherboard over several trials [15]. This will not work for parallel times: socket, whether comprised of one actual chip (as with recent parallel times are strongly affected by system states during the AMD systems) or multiple chips wired together (as with recent call and vary widely based on unknown starting conditions. Intel systems). If we select the best result out of a large pool of results, When discussing the problem, we may refer to the full we cannot distinguish between an algorithm that achieves the problem, which is the size of the entire problem to be solved. optimal startup time by chance once in a thousand trials from The partitioned problem is the problem size given to each core one that achieves it every time by design. An average of many after decomposition for parallel computation (in this paper we samples will make that distinction. consider only problems that can be simply divided so that each In our timings, the sample count varies to keep experiment runtimes reasonable: for the rank-K experiments, we used 200 core has a static partition of equal size). 3 We directly measure several important times. The Full trials. For the O(N ) factorizations, we used 200 trials for Serial Time (FST) is the elapsed time when solving the full N ≤ 3000, and 50 trials for larger problems. Since the BLAS problem in serial. The Partitioned Serial Time (PST) is elapsed are typically called with cold caches, we flush all processors’ time when solving the partitioned problem serially. caches between timing invocations, as described in [15]. All timings used ATLAS3.9.4. We changed the threaded The Full Parallel Time (FPT) is the elapsed time when routines as discussed and modified our timers to more thor- solving the full problem using the parallel algorithm and oughly flush all core caches. We timed on two commodity plat- multiple cores. forms, both of which have 8 cores in two physical packages. Finally, the per-core time (PCT) is the elapsed time each The OS is critically important, in that it determines scheduling core spends computing on its section of the partitioned prob- and degree of threading support. Linux is a system where the lem. Thus PCT does not include any thread management programmer can manually control the thread affinity, and thus overhead (eg., signalling other threads or waiting on mutex avoid having the system schedule competing threads on the or condition variables). On a problem requiring no shared same processor despite having unloaded processors (without resources, therefore, PCT would always equal PST, but we will affinity, this occurs on both Linux and OS X). OS X possesses see that it does not for our present parallel implementations. no way to restrict the set of cores that a thread can run on. Using these directly measured times we define two quan- Even within a given OS, scheduler differences may change tities we believe illuminate the major causes of slowdown in timing significantly, so we provide kernel version information our parallel algorithms. These indirect measures led us to the here. Our two platforms were: improved algorithms provided here, and so we believe this is (1) 2.1Ghz AMD Opteron 2352 running Fedora 8 Linux a contribution that may benefit other researchers as well. We 2.6.25.14-69 and gcc 4.2.1 (this system is abbreviated as Opt), mentioned PMO earlier, this is the full parallel time minus the (2) 2.5Ghz Intel E5420 Core2 Xeon running Fedora 9 Linux maximum PCT. 2.6.25.11-97 and gcc 4.3.0 (C2). In an ideal system, this time would be zero, meaning the Each physical package on the Opt consists of one chip, parallel algorithm ran only as long as required to solve the which has a built-in memory controller. The physical packages partitioned problem on the slowest processor. A non-zero PMO of the Intel processors contain two chips, and all cores share represents time doing non-computational activities, such as an off-chip memory controller. starting/killing threads, waiting on mutexes, etc. Therefore, the We will survey several operations in order to show the major goal of this paper is to reduce PMO to a small constant generality of these techniques. Our main operation will be on t, which is its expected value for statically distributed the rank-K update, which is the main performance kernel of problems with minimal interactions. the LAPACK library. The rank-K update is a matrix multiply Low PMO by itself is not enough to ensure efficient where the dimension common to both the input matrices (the K parallel performance, since it says nothing about how long the dimension) has been restricted to some small value for cache- computation itself takes.

Minimizing Startup Costs for Performance-Critical Threading

Slicing (Draft)

Assessing Gains from Parallel Computation on a Supercomputer

Instruction Level Parallelism Example

18-447 Lecture 21: Parallelism – ILP to Multicores Parallel

Performance of Physics-Driven Procedural Animation of Character Locomotion for Bipedal and Quadrupedal Gait

Easy Dataflow Programming in Clusters with UPC++ Depspawn

Econstor Wirtschaft Leibniz Information Centre Make Your Publications Visible

Extracting Parallelism from Legacy Sequential Code Using Transactional Memory

Optimization Techniques for Efficient HTA Programs

Introduction to MPI

Performance Loss Between Concept and Keyboard

FDTD) Algorithms on a Selection of High Performance Multiprocessor Computing Systems