Minimizing Startup Costs for Performance-Critical Threading

Anthony M. Castaldo, Department of Computer Science, University of Texas at San Antonio, San Antonio, TX 78249. Email: [email protected]
R. Clint Whaley, Department of Computer Science, University of Texas at San Antonio, San Antonio, TX 78249. Email: [email protected]

Abstract—Using the well-known ATLAS and LAPACK dense linear algebra libraries, we demonstrate that the parallel management overhead (PMO) can grow with problem size even on statically scheduled parallel programs with minimal task interaction. Therefore, the widely held view that these management issues can be ignored in such computationally intensive libraries is wrong, and leads to substantial slowdown on today's machines. We survey several methods for reducing this overhead, the best of which we have not seen in the literature. Finally, we demonstrate that by applying these techniques at the kernel level, performance in applications such as LU and QR factorizations can be improved by almost 40% for small problems, and by as much as 15% for large O(N^3) computations. These techniques are completely general, and should be significant in almost any performance-critical operation. We then show that the lion's share of the remaining parallel inefficiency comes from bus contention, and, in the future work section, outline some promising avenues for further improvement.

This work was supported in part by National Science Foundation CRI grant CNS-0551504.

I. INTRODUCTION

Long-running architectural trends have signaled the end of sustained increases in serial performance due to increasing clock rate and instruction level parallelism (ILP), but have not changed Moore's law [5]. Therefore, since architects are faced with an ever-increasing circuit budget which can no longer be leveraged for meaningful serial performance improvements, they have increasingly turned to supplying additional cores within a single physical package. Today it is difficult to buy even a laptop chip that has fewer than two cores, and four cores are common on the desktop. This trend is expected to continue, with some architects predicting even desktop chips with huge numbers of simplified cores, as in the IBM Cell [7] and Intel Larrabee [13] architectures.

As commodity OSes (i.e., OSes not written specifically for HPC) are used on increasingly parallel machines, previously reasonable assumptions may become untenable. In particular, we can no longer assume the hardware, OS or compilers are highly tuned to exploit multiple cores or efficiently share common resources. Such assumptions were built into our own ATLAS [18], [17], [16] (Automatically Tuned Linear Algebra Software) package (e.g., we assumed that the OS would schedule threads to separate cores whenever possible). However, on several eight-core systems running a standard Linux OS, ATLAS produced alarmingly poor parallel performance even on compute-bound, highly parallelizable problems such as matrix multiply.

Dense linear algebra libraries like ATLAS and LAPACK [2] are almost ideal targets for parallelism: the problems are regular and often easily decomposed into subproblems of equal complexity, minimizing any need for dynamic task scheduling, load balancing or coordination. Many have high data reuse and therefore require relatively modest data movement. Until recently, ATLAS achieved good parallel speedup using simple distribution and straightforward threading approaches. This simple threading approach failed to deliver good performance on commodity eight-core systems, and thus it became necessary to investigate what had gone wrong.

In the course of this investigation, we have developed two measurements which we believe help us understand parallel behavior. The first of these is Parallel Management Overhead (PMO). Because our problems are statically partitioned and compute intensive (O(N^2) or O(N^3)), PMO should grow only with O(t) (the number of threads). For eight threads it should be a constant. It is not. Not only is PMO a major factor in our lack of parallel efficiency, it grows with problem size, even on very large problems.

Outline: Section I-A introduces necessary terminology and defines our timing measurements, while §I-B discusses our timing methodology. In §II we survey techniques for managing thread startup and shutdown, show that PMO is a significant cost that can grow with problem size rather than t, and introduce our technique for reducing PMO to a small constant on t. In §III we provide a quantitative comparison of these techniques, and show they are important even in very large operations that can be perfectly partitioned statically. In §IV we show how these relatively simple changes to a kernel library such as the BLAS [4] (Basic Linear Algebra Subprograms) can deliver substantial parallel speedup for higher-level applications such as the QR and LU factorizations found in LAPACK. Finally, in §VI we discuss future work, and offer our summary and conclusions in §VII.
A. Terminology and Definitions

We refer to one serial execution engine/CPU as a core, with multiple cores sharing a physical package (or just package). A physical package is the component plugged into a motherboard socket, whether comprised of one actual chip (as with recent AMD systems) or multiple chips wired together (as with recent Intel systems).

When discussing the problem, we may refer to the full problem, which is the size of the entire problem to be solved. The partitioned problem is the problem size given to each core after decomposition for parallel computation (in this paper we consider only problems that can be simply divided so that each core has a static partition of equal size).

We directly measure several important times. The Full Serial Time (FST) is the elapsed time when solving the full problem in serial. The Partitioned Serial Time (PST) is the elapsed time when solving the partitioned problem serially. The Full Parallel Time (FPT) is the elapsed time when solving the full problem using the parallel algorithm and multiple cores. Finally, the per-core time (PCT) is the elapsed time each core spends computing on its section of the partitioned problem. Thus PCT does not include any thread management overhead (e.g., signalling other threads or waiting on mutex or condition variables). On a problem requiring no shared resources, therefore, PCT would always equal PST, but we will see that it does not for our present parallel implementations.

Using these directly measured times we define two quantities we believe illuminate the major causes of slowdown in our parallel algorithms. These indirect measures led us to the improved algorithms provided here, and so we believe this is a contribution that may benefit other researchers as well. We mentioned PMO earlier; this is the full parallel time minus the maximum PCT. In an ideal system, this time would be zero, meaning the parallel algorithm ran only as long as required to solve the partitioned problem on the slowest processor. A non-zero PMO represents time spent doing non-computational activities, such as starting/killing threads, waiting on mutexes, etc. Therefore, the major goal of this paper is to reduce PMO to a small constant on t, which is its expected value for statically distributed problems with minimal interactions.

Low PMO by itself is not enough to ensure efficient parallel performance, since it says nothing about how long the computation itself takes. We capture this information with a measure we call Partitioned Computational Expansion (PCE), which is the average of the per-core compute time on a given problem divided by the average partitioned serial time for that same problem. In an ideal parallel machine with no shared resources, this value would be one. In practice cores share some resources (caches, buses, TLB, etc.) and interfere with each other. The most obvious example is when multiple cores attempt to access memory on a shared bus, causing additional stalls and making PCE > 1.
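Stated compactly (our notation for the definitions just given; PCT_i is the per-core time of core i, and overbars denote averages over cores and timing samples):

    \mathrm{PMO} = \mathrm{FPT} - \max_{0 \le i < t} \mathrm{PCT}_i ,
    \qquad
    \mathrm{PCE} = \overline{\mathrm{PCT}} \,/\, \overline{\mathrm{PST}}

In an ideal run, PMO = 0 and PCE = 1.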
B. Experimental Methodology

When performing serial optimization on a kernel with no system calls, we often report the best achieved performance over several trials [15]. This will not work for parallel times: parallel times are strongly affected by system states during the call and vary widely based on unknown starting conditions. If we select the best result out of a large pool of results, we cannot distinguish between an algorithm that achieves the optimal startup time by chance once in a thousand trials and one that achieves it every time by design. An average of many samples will make that distinction.

In our timings, the sample count varies to keep experiment runtimes reasonable: for the rank-K experiments, we used 200 trials. For the O(N^3) factorizations, we used 200 trials for N ≤ 3000, and 50 trials for larger problems. Since the BLAS are typically called with cold caches, we flush all processors' caches between timing invocations, as described in [15]. All timings used ATLAS 3.9.4. We changed the threaded routines as discussed, and modified our timers to more thoroughly flush all core caches.

We timed on two commodity platforms, both of which have 8 cores in two physical packages. The OS is critically important, in that it determines scheduling and the degree of threading support. Linux is a system where the programmer can manually control thread affinity, and thus avoid having the system schedule competing threads on the same processor despite having unloaded processors (without affinity, this occurs on both Linux and OS X). OS X possesses no way to restrict the set of cores that a thread can run on. Even within a given OS, scheduler differences may change timing significantly, so we provide kernel version information here. Our two platforms were:
(1) 2.1GHz AMD Opteron 2352 running Fedora 8 Linux 2.6.25.14-69 and gcc 4.2.1 (this system is abbreviated as Opt);
(2) 2.5GHz Intel E5420 Core2 Xeon running Fedora 9 Linux 2.6.25.11-97 and gcc 4.3.0 (C2).

Each physical package on the Opt consists of one chip, which has a built-in memory controller. The physical packages of the Intel processors contain two chips, and all cores share an off-chip memory controller.

We will survey several operations in order to show the generality of these techniques. Our main operation will be the rank-K update, which is the main performance kernel of the LAPACK library. The rank-K update is a matrix multiply where the dimension common to both input matrices (the K dimension) has been restricted to some small value for cache-blocking purposes. On the Core2-based platforms this value will be 56, and it is 40 for the AMD machine (these values come from ATLAS's tuned GEMM, which is the BLAS routine providing GEneral Matrix Multiply).
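As an illustration only (this is not ATLAS's internal kernel interface), the rank-K update corresponds to a standard GEMM call in which the common dimension K is pinned to the cache-blocking value; the wrapper name and the scalar choices below are ours:

    #include <cblas.h>

    /* Rank-K update of an MxN matrix C: C <- alpha*A*B + beta*C, where A is
     * MxK and B is KxN, with K fixed at the kernel's cache-blocking value
     * (56 on the Core2 platforms, 40 on the Opteron, per the text above).  */
    void rank_k_update(int M, int N, int K,
                       const double *A, const double *B, double *C)
    {
        /* Column-major, no transposes; alpha = -1, beta = 1 is typical of the
         * trailing-matrix updates in blocked LU/QR, but any scalars fit.     */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K, -1.0, A, M, B, K, 1.0, C, M);
    }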
II. THREAD MANAGEMENT TECHNIQUES

We are aware of two basic approaches for handling threading. We call the simplest approach Launch and Join (abbreviated LJ); in pthreads this is accomplished by having the master create t worker threads to perform the computation, and then make t calls to pthread_join to wait on all thread completion (a slight variant of this approach is to launch t-1 threads, and have the master process perform one unit of computation). In the simplest LJ approach, the master process creates all t threads, and then joins them, for an O(t) theoretical cost. This simplest algorithm is what ATLAS (and quite a few software packages) use. A more complicated LJ approach is to have the created threads spawn threads as well, which can reduce the theoretical cost to O(log_2(t)). We implemented both, but because the cost of initiating thread creation is much lower than the time it takes to actually begin executing a created thread, the simpler O(t) algorithm outperforms the O(log_2(t)) algorithm for t = 8. Therefore, our LJ times use the simple O(t) algorithm. Note that launch and join is also the paradigm that OpenMP presents to the programmer; we will see, however, that it is not necessarily the paradigm actually used by particular OpenMP implementations.

If we assume that creating and destroying a thread is relatively expensive, a second approach suggests itself: create the t threads just one time (for a library, on the first call), and keep the threads around for the entire execution (for a library, they can be killed using the ANSI C standard atexit function). We call this approach Persistent Worker threads (abbreviated as PW). There are many ways to implement PW; our workers each have an individual interface area that includes a condition variable, which is signalled by the master to have that thread do the unit of work specified by its interface area.

Both of these approaches can be augmented with processor affinity. Many OSes (including Linux and Solaris, but not OS X) allow the user to restrict the range of processors a thread can be scheduled on using OS-specific calls. This can make a big difference: when left alone, the OS often schedules two or more threads to a single processor, so that some cores go unused while others delay the computation due to competing threads. The OS may eventually notice this bad scheduling and move threads around, but this has a cost beyond the initial lack of parallelism: any cache that was warm (and much of linear algebra is rich in cache reuse) is now cold, thus causing increasing bus traffic as threads are migrated across cores. Therefore, the obvious extension is to create t threads, each of which can be scheduled only on a unique core. To indicate that a given approach has affinity, we suffix it with an 'A'.
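For concreteness, a minimal pthreads sketch of launch-and-join with affinity (LJA) follows; the partitioning helper do_partition is hypothetical, and pthread_attr_setaffinity_np is the Linux-specific affinity call alluded to above:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    #define NTHR 8   /* t: number of cores/threads used in these experiments */

    /* Stand-in for one core's share of the computation (hypothetical). */
    static void do_partition(int rank) { (void)rank; }

    static void *worker(void *arg)
    {
        do_partition((int)(long)arg);
        return 0;
    }

    /* Launch and Join with affinity (LJA): the master creates t workers,
     * each restricted to its own core, then joins them all: O(t) cost.   */
    void lja_run(void)
    {
        pthread_t tid[NTHR];
        for (int i = 0; i < NTHR; i++) {
            pthread_attr_t attr;
            cpu_set_t cpus;
            pthread_attr_init(&attr);
            CPU_ZERO(&cpus);
            CPU_SET(i, &cpus);          /* thread i may run only on core i */
            pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);
            pthread_create(&tid[i], &attr, worker, (void *)(long)i);
            pthread_attr_destroy(&attr);
        }
        for (int i = 0; i < NTHR; i++)
            pthread_join(tid[i], NULL);
    }

Note that nothing in this sketch controls where the master itself runs, nor the order in which workers begin executing relative to it; that gap is exactly what the master-last technique introduced below addresses.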
These techniques are the ones we found in the literature, and it was assumed they would suffice to reduce our overhead to a constant on t, but this did not prove to be the case, as shown in Figure 1(a). This figure shows the PMO for rank-K update in microseconds as a function of problem size on the C2 platform for launch and join both with and without affinity. Since all of these problem sizes use t = 8, and the cost is O(t), we expect the overhead to be flat across the graph, but instead it rises essentially linearly with N (the rank-K computation is O(N^2)). This linear rise in cost with problem size holds true for both architectures and for persistent worker threads as well. We examined detailed logs of individual runs, and found a large variance in overheads, depending on where the master process was in relation to the spawn.

Fig. 1. Parallel Management Overhead in rank-K update for launch and join and persistent worker on a 2.5GHz C2: (a) with & w/o affinity; (b) master-last thread assignment.

Essentially, when the master was on the same processor as one of the early-started threads, that thread would compete with the master for execution, and seriously extend the time it took to get all threads working (Linux perplexingly seems to preferentially launch new threads on the master core). The same problem was observed with PW: if we start a worker thread on the master core, it competes with the master for memory and time slices, delaying the startup.

This led us to a new technique, which we call Master Last (suffix: ML), which ensures that, whether using LJ or PW, you call for threads to begin executing on all other cores before starting the computation on the core that the master is executing on. This is shown in Figure 1(b) for both launch and join and persistent workers, and we see that the PMO is now independent of problem size, and orders of magnitude less.

We can also see that it is much quicker to signal t condition variables (as in PWAML) than to create and join t threads from scratch (as in LJAML). In §III we will quantify how important these overhead savings are in terms of total runtime, so that a judgment can be made as to whether PWAML is worth the extra complexity.

Note that master last is important because of the relatively poor scheduling job being done by the OS: an OS that scheduled threads to non-master cores first would achieve master last implicitly. As a practical matter, OS X appears to almost never achieve implicit master last, but an older version of Linux which we were running on the Opt before a system update achieved master last most of the time. Therefore, master last appears to have become important due to OS scheduling algorithms that are probably more geared to desktop than HPC use.

In the remaining PMO timings, we wish to quantify the contribution of each of these techniques, which we do with the easy-to-understand launch and join paradigm, giving us three different LJ timings: LJ, LJA (LJ with affinity only) and LJAML (LJ with affinity and master last). Finally, we will show the performance for the lowest-overhead technique we have yet discovered, persistent worker threads with affinity and master last (PWAML).
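To make the persistent-worker variant concrete, here is a greatly simplified sketch of a PWAML-style driver. The data structures and helper names are hypothetical (ATLAS's actual interface areas and data distribution are more involved), and the master is assumed to already be pinned to core 0. Each worker is created once, pinned to its own core, and sleeps on a private condition variable; on every call the master signals all other cores first and only then begins computing on its own core, which is the master-last ordering described above:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    #define NTHR 8                      /* t: one thread per core, master = core 0 */

    typedef struct {                    /* per-worker "interface area" */
        pthread_mutex_t mut;
        pthread_cond_t  go, done;
        int have_work, finished;
        int rank;                       /* which partition to compute */
    } iface_t;

    static iface_t   ifc[NTHR];
    static pthread_t tid[NTHR];
    static int       started = 0;

    static void do_partition(int rank) { (void)rank; /* core's share of the work */ }

    static void *worker(void *arg)      /* persistent worker: sleeps between calls */
    {
        iface_t *w = arg;
        for (;;) {
            pthread_mutex_lock(&w->mut);
            while (!w->have_work)
                pthread_cond_wait(&w->go, &w->mut);
            w->have_work = 0;
            pthread_mutex_unlock(&w->mut);

            do_partition(w->rank);      /* compute, then report completion */

            pthread_mutex_lock(&w->mut);
            w->finished = 1;
            pthread_cond_signal(&w->done);
            pthread_mutex_unlock(&w->mut);
        }
        return 0;
    }

    static void start_workers(void)     /* one-time creation, pinned to cores 1..t-1 */
    {
        for (int i = 1; i < NTHR; i++) {
            pthread_attr_t attr;
            cpu_set_t cpus;
            pthread_mutex_init(&ifc[i].mut, 0);
            pthread_cond_init(&ifc[i].go, 0);
            pthread_cond_init(&ifc[i].done, 0);
            ifc[i].rank = i;
            pthread_attr_init(&attr);
            CPU_ZERO(&cpus); CPU_SET(i, &cpus);
            pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);
            pthread_create(&tid[i], &attr, worker, &ifc[i]);
            pthread_attr_destroy(&attr);
        }
        started = 1;   /* a library could register an atexit() handler here to kill the pool */
    }

    void pwaml_run(void)                /* called once per parallel operation */
    {
        if (!started) start_workers();
        for (int i = 1; i < NTHR; i++) {           /* signal all other cores first ... */
            pthread_mutex_lock(&ifc[i].mut);
            ifc[i].have_work = 1;
            pthread_cond_signal(&ifc[i].go);
            pthread_mutex_unlock(&ifc[i].mut);
        }
        do_partition(0);                            /* ... the master's core starts last */
        for (int i = 1; i < NTHR; i++) {            /* then wait for the workers */
            pthread_mutex_lock(&ifc[i].mut);
            while (!ifc[i].finished)
                pthread_cond_wait(&ifc[i].done, &ifc[i].mut);
            ifc[i].finished = 0;
            pthread_mutex_unlock(&ifc[i].mut);
        }
    }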

III. IMPACT OF THESE TECHNIQUES ON RUNTIME

We have shown that the master last technique can reduce the PMO to a small constant on t, which only matters if PMO is an important component of the total runtime. Obviously this will be strongly affected by the problem size, since work done is O(N^2), while PMO either remains constant or appears to rise with N. Figure 2 shows the PMO of rank-K update as a percent of the total runtime on both the C2 and Opt platforms.

Fig. 2. PMO of rank-K update as a % of runtime for surveyed techniques: (a) on 2.5GHz C2; (b) on 2.1GHz Opt.

Fig. 3. Contribution to rank-K speedup by technique: (a) C2 benefit vs. LJ; (b) Opt benefit vs. LJ.

On both platforms, we see that for small problems the overhead is actually the dominant cost of the algorithm, but for the master last algorithms the PMO cost drops precipitously. Algorithms without ML decline much less quickly, as the O(N^2) cost slowly dominates the startup time. From this, we would expect something similar on full matrix multiply, but with the O(N^3) computation dominating more quickly. We see that LJAML is almost as good as PWAML towards the end of the curve, but that the PMO advantage of PWAML is still fairly important in the medium-range problems (e.g., N ≤ 2000).

So far, we have been focusing completely on overhead, but this can be misleading, in that a given technique might reduce PMO and yet make PCE worse, thus making the algorithm slower despite a lower PMO. Our goal is actually to increase parallel speedup, and we need to show that these methods do that. This is shown in Figure 3, which also quantifies how much each technique contributes to final speedup.

In this chart, the average speed of LJ is taken as 100%, and we plot the speed of all others relative to it. Specifically, the Y axis is computed as (LJ time / method time) × 100.0. The first thing to notice is the sheer magnitude of the speedups from these techniques: for small problems the best technique is over 220% of the speed of the naïve technique, and even for very large problems the advantage is still over 5%!

The second lesson from these graphs concerns the relative importance of each technique: we see that all techniques have the greatest impact for small and midrange problem sizes, where the O(N^2) computation is not so dominant. Affinity is important on both machines, and while its impact initially goes down with problem size, it then levels off; we therefore conclude that affinity is important regardless of problem size (this is confirmed even with full GEMM timings, where we have O(N^3) computations). PW is particularly helpful on the C2, but on both machines its advantage looks likely to go away asymptotically. Finally, master last is extremely important on both systems, particularly for midrange problems. It is worth noting that what we are calling "midrange" problems are actually quite large; for problems roughly in the range 2000 ≤ N ≤ 4000, master last accounts for over half of the available advantage over LJ!

We did test the performance of PWA, i.e. persistent worker threads with affinity but not master last. For clarity we left it off the charts. PWA performance is about halfway between LJAML and PWAML: PWA is about 25% slower than PWAML when N ≤ 2000, and on larger problems the deficit narrows to about 4% asymptotically. This gap is comparable to the difference between LJA and LJAML. Thus in both PW and LJ implementations, master last delivers significant improvement.

IV. LAPACK PERFORMANCE

It is possible to speed up a compute kernel and yet have little effect on application performance. To show that these techniques make a meaningful contribution to the performance of higher-level codes, Figures 4 and 5 show the usual % speedups over LJ using various threaded GEMM implementations in two commonly-used LAPACK routines, QR and LU factorization. In parenthesis on the X-axis, we show the parallel speedup (i.e. FST/FPT) achieved by our base case of LJ. The timed factorizations use several different BLAS routines, but we have changed only the way GEMM operates. Therefore, these charts have all non-GEMM threading done using ATLAS's original implementation, which is essentially simple LJ with a more complicated data distribution. Since our techniques supply substantial speedups even when applied to only one BLAS routine, we can expect even greater application performance once we have applied these techniques to all the BLAS.

Note that when using a threaded kernel in an application, there are two general concerns that need to be addressed: at what stage do you use additional cores (i.e. where is the crossover point at which it is more efficient to add another core rather than doing the operation serially), and what data distribution is used in doling out work to the threads. These issues are operation-specific, and strongly affect PCE, so optimizing them is part of our future work. In these timings, we chose a simple crossover point (each thread needed to have at least a full block of GEMM to do), and we distributed GEMM's output matrix over a 4x2 process grid. In addition to our normal algorithms (LJA, LJAML, PWAML), we also chart the performance achieved when gcc's OpenMP implementation is used instead of pthreads in GEMM (using the same data distribution and crossover point).

Figure 4 shows the performance achieved by these techniques on both platforms for the QR factorization. QR is statically blocked, which means that for a given problem size it will select a particular K to call GEMM with in the rank-K matrix shape.

Fig. 4. Statically blocked QR factorization % speedup vs. LJ: (a) 2.5GHz C2; (b) 2.1GHz Opt. The number in parenthesis on the X axis is the parallel speedup achieved by our basic LJ.

Figure 4(a) shows the performance of these methods on the C2, and we see that they are ranked in performance as predicted by our earlier rank-K timings: PWAML, LJAML, LJA, followed by LJ. OpenMP does better than straight LJ for large problems, and worse for small problems (this is why it only appears on the graph around N = 1800). QR performance is improved substantially across the entire range of problems, with almost 40% improvement on the small end and a roughly 12% asymptotic improvement. We note that the most critical technique is affinity, closely followed by master last.

The performance curves are quite a bit messier for QR on the Opt, as we see in Figure 4(b). These curves do not follow the performance rankings suggested by our earlier rank-40 Opt GEMM timings. We will see a similar discrepancy on the LU timings, and discuss possible reasons for this in Section IV-A. Performance gains are even more substantial on the Opt, with our best case showing a roughly 32% speedup, and with even the asymptotic speedup well above 24%. We see that affinity is absolutely critical on QR/Opt, and that master last is important across the entire range of problem sizes.
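Returning to the crossover point and data distribution used in these timings, the sketch below illustrates one reading of that policy. The helper names, the interpretation of "a full block of GEMM" as an NB x NB block of the output matrix, and the orientation of the 4x2 grid are all our assumptions, not ATLAS's actual code:

    /* Hypothetical illustration of the crossover + 4x2 distribution policy. */
    typedef struct { int m0, m, n0, n; } chunk_t;   /* sub-panel of the MxN output C */

    /* Crossover: use at most as many threads as there are full NB x NB blocks
     * of C, capped at the number of cores (8 here), so every thread has at
     * least one full block of GEMM to do.                                     */
    static int choose_nthreads(int M, int N, int NB, int ncores)
    {
        int nblks = (M / NB) * (N / NB);
        return nblks < 1 ? 1 : (nblks < ncores ? nblks : ncores);
    }

    /* Static 4x2 grid over C: rows split 4 ways, columns split 2 ways,
     * so thread (i,j) owns one rectangular panel of the output.          */
    static chunk_t chunk_for(int rank, int M, int N)
    {
        const int P = 4, Q = 2;
        int i = rank % P, j = rank / P;
        chunk_t c;
        c.m0 = (M * i) / P;  c.m = (M * (i + 1)) / P - c.m0;
        c.n0 = (N * j) / Q;  c.n = (N * (j + 1)) / Q - c.n0;
        return c;
    }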
Further, since the only difference is in distribution, the fact that our simple 4x2 distribution is better for LU suggests that even further improvement can be had from PWAML/LJAML by tuning the data distribution, as proposed in §VI. There are some interesting observations to be drawn from this data. The first concerns OpenMP, which we added to our timings in order to see if it is reasonable to skip the complexity of (a) 2.5Ghz C2 applying these techniques in pthreads, and rely on a higher- level system like OpenMP instead. At this time, the answer appears to be a decided no. On no machine or factorization does OpenMP run at the speed of our improved methods, and at times it can be significantly slower than using straight LJ. When examining the various techniques, we see that affinity is the most important property for a thread to have for asymptotically large problems, but for the majority of the range, master last provides the most speedup.

Figure 5 shows the speedup numbers using ATLAS's recursive LU, and again LJAML beats PWAML on both machines. Therefore, we see that these techniques are of modest help for very large LU problems (roughly 6% improvement on the C2, and 9% on the Opt). Our greatest improvement of around 18% (22%) is seen in midsize (2000 ≤ N ≤ 3500) problems on the C2 (Opt). By comparing with the LJA line, we see that the lion's share of this performance advantage comes from master last for small and medium size problems, and for large problems on the Opt. Affinity is important for all problem sizes, and is critical for large LU problems on the C2.

Fig. 5. Recursive LU factorization % speedup vs. LJ: (a) 2.5GHz C2; (b) 2.1GHz Opt. The number in parenthesis on the X axis is the parallel speedup achieved by our basic LJ.

We notice that OpenMP is not competitive on either platform for our recursive LU. For the C2, it appears untenable: OpenMP never achieves more than 82% of the performance of simple LJ, and thus fails to show up on our chart at all! Just as with QR, OpenMP does much better on the Opt, where for LU it is able to run slightly faster than LJ. Note that tuning the crossover and distribution shapes might improve OpenMP performance (as it should for all techniques).

A. Why Does PWAML Lose To LJAML?

The most startling result for LU on both machines is that PWAML is decidedly inferior to LJAML for all but the smallest of problems. Similarly for QR: on the Opt, PWAML loses to LJAML. This is counter-intuitive. On both machines, GEMM is consistently faster using PWAML. It is true that QR and LU use more of the level 3 BLAS than just GEMM, and for our experiments we only coded PWAML or LJAML for the GEMM implementation. Thus, our non-GEMM level 3 routines were parallelized by ATLAS the "old" way, namely using LJ without affinity.

Our assumption was that any processing time spent by QR or LU in the non-GEMM portions would be equal under PWAML and LJAML, and so the net effect would be a faster QR or LU when using PWAML, due to its faster GEMMs. That assumption is false. In initial experiments in which all the level 3 BLAS routines for LU or QR have been rewritten to use persistent worker threads, the PWAML performance deficit vanished completely. Thus the problem is in using PWAML for GEMM and LJ (without affinity) for the rest of the level 3 routines. Apparently, in this mixed case, either the LJ threads are running slower, or the PWAML threads are running slower, or both.

In our implementation, our PWAML worker threads wait on pthread condition variables to be notified of work. We assume they will only consume cycles when their condition variables are signalled, and this never happens during the LJ portions of the algorithm. Nevertheless, the worker threads and condition variables are still in OS structures, and it is possible that for some unknown reason the OS expends overhead cycles to maintain them, and this somehow degrades the LJ threads' performance. Alternatively, it is possible that the LJ threads' performance is not degraded at all, but that the OS, in the process of launching, running and joining them, for some reason deprioritizes our idle persistent worker threads, or creates conditions that affect the startup or operational overhead of the worker threads. This would mean the launching and joining of LJ threads has some lasting effect on how the OS handles the PWAML worker threads, and that effect is felt even after the temporary LJ thread has terminated.

If true, this is a disconcerting development for any system that uses worker thread pools. Even if the thread pools are shown to improve performance when tested alone, they may actually reduce performance when unrelated applications are threading using launch and join operations, or using their own thread pool. Verifying or falsifying these speculations is a major objective of our future work, as described in Section VI.

We must also point out that in the OpenMP implementation we used for our experiments, OpenMP presents an LJ interface to programmers but internally implements a thread pool approach. Our OpenMP timings were exceptionally slow, but it is possible that OpenMP performance on QR and LU was degraded by the same mechanisms responsible for degrading our PWAML performance on these operations, and would improve if all parallel operations were done with OpenMP.

Persistent worker threads can dramatically reduce overhead and improve performance, but the factorization results imply that these performance gains, although real, currently remain fragile. For libraries it becomes important to measure performance in the final use environment. To fully exploit persistent worker threads, we need a greater understanding of both PCE and the unexpected interaction between PW and LJ threads.
V. RELATED WORK

We have found no work in our area that directly addresses overhead issues due to threading, but [1] discusses using process (rather than thread) affinity for performance improvement. There are numerous papers that address these issues in some way outside our field; we cite only a representative handful of these here. Several papers discuss the advantage of process affinity in at least a cursory way, including [3], [10]. The mention of persistent worker threads most closely allied to our area can be found in papers on optimizing OpenMP libraries, as in [11], [8].

Threading overhead issues are widely discussed in research on web servers and services [6], [12], [9]. They have standardized some terminology: our launch and join paradigm roughly corresponds to their thread-per-request, while our persistent worker threads roughly correspond to their thread pool. We have kept our idiosyncratic naming strategy, since the web services terminology has built-in assumptions that are not true for HPC (e.g., that you have multiple requests at once, or that the thread pool consists of far more threads than the number of cores). We have found no mention of master last in any publication.
VI. FUTURE WORK

Operating systems like OS X lack the ability to fix processor affinity, and the scheduler seems to migrate threads frequently; other means must therefore be developed to capture at least some of the benefit of affinity and master last. If such techniques are not found, then OSes without affinity can expect substantial performance loss when compared to OSes that support affinity. This loss is presently around 40%, but it should rise as we address PCE (several PCE techniques will need affinity for optimal application).

With PMO addressed, the bottleneck becomes PCE on both platforms. Experiments on the C2 (results on the Opt are similar) show that two rank-K threads can be running on each package (four total) with PCE remaining extremely low (around 1.04). Three threads running on one package, with the other package idle, exhibit a PCE around 1.15, and adding a fourth thread to that package raises the PCE to 1.45. Thus it seems there is contention within a package when more than two of the four cores are working. In addition, we see that if we load both packages there is additional PCE, rising to 1.80 if all eight cores are busy (this would suggest a maximum rank-K speedup of around 4.4 for an 8 processor system).

To narrow down the cause of PCE we wrote a special timer program that can call the GEMM kernel in three ways: (1) with array (and thus cache) access changing in the same way it would during the innermost loop of a standard rank-K update, (2) with array access changing as it would in the innermost loop of a large square GEMM, and (3) with all array data preloaded to the L1 cache. Our experiments with this framework confirm that there is no PCE when using data preloaded to the L1, and PCE numbers similar to what we observed in our rank-K timings when the kernel is called in way (1), and in our square GEMM timings when it is called in way (2). Thus we have learned PCE is primarily due to cache effects. We plan to use this framework to investigate how much PCE is due to shared caches versus main memory bus contention.

With the knowledge that PCE comes from memory usage, we can suggest promising avenues of investigation to overcome it. Most are operation-specific, unlike the techniques we discussed here. We will try to improve our data distribution to encourage both additional private cache reuse and cache reuse among cores sharing a cache. Also, kernels can be written with varying prefetch strategies, so that bus load is distributed more evenly in time. We will try more explicit bus management through the use of gating mechanisms (semaphores, mutexes), and try coding explicitly for the bus with strategies like block fetch [14].

We have already begun one modified version of the ATLAS library that will use strictly persistent worker threads for all parallel operations, and a second modified version that will use strictly OpenMP. This will allow us to make a more thorough comparison of the threading strategies.

We also want to pin down the exact cause of slowdown with mixed strategies. Our general approach will be to isolate aspects of the mixing, to determine precisely what causes the added overhead. For example, we can create several persistent worker threads that remain idle, and see if their mere presence impacts LJ performance. Or we might create two sets of persistent worker threads and see if alternating between them (i.e. so only one set is in use at a time) produces significantly worse performance than using only one set. Whatever the cause, we hope to devise an approach that eliminates this interference, so HPC libraries can gain the performance advantages of PW threads without the risk of damaging the performance of other threading packages that might be sharing the platform.
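The maximum-speedup estimate quoted earlier in this section (about 4.4 with all eight cores busy) follows directly from the definition of PCE, under the simplifying assumption that the work partitions perfectly, so that FST ≈ t · PST, and that the remaining PMO is negligible:

    \mathrm{speedup} = \frac{\mathrm{FST}}{\mathrm{FPT}}
    \approx \frac{t \cdot \mathrm{PST}}{\mathrm{PCE} \cdot \mathrm{PST}}
    = \frac{t}{\mathrm{PCE}} = \frac{8}{1.80} \approx 4.4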
VII. SUMMARY AND CONCLUSIONS

We have introduced two measures, PMO and PCE, which can help illuminate the major causes of parallel slowdown. We have discussed a variety of techniques for reducing thread management overhead which should be usable for almost any threading application, the most efficient of which (master last) we have not seen in the literature.

We have shown that the conventional wisdom that overhead is unimportant when doing large compute-intensive operations needs to be re-examined. We have quantified the relative contribution of these techniques to the performance of one of the most widely used kernels in linear algebra. Finally, we have shown that these techniques provide substantial speedups for LAPACK factorizations on the high-end systems of today, and are therefore likely to be critical for even the desktop machines of tomorrow.

We conclude that all HPC threading should employ affinity and master last. Choosing between PW and LJ is less clear-cut, and will depend on the problem being examined, the sophistication of PCE control, and perhaps whether PW and LJ threads must share the machine. Our initial results suggest that persistent workers/thread pools (which are widely used) may cause problems when more than one library is independently threading (this is, of course, quite common for large applications).
We also note that at the present time, developers should not rely on OpenMP to automatically apply these techniques.

REFERENCES

[1] S. Alam, R. Barrett, J. Kuehn, P. Roth, and J. Vetter. Characterization of scientific workloads on systems with multi-core processors. In IEEE International Symposium on Workload Characterization (IISWC), pages 225-236, 2006.
[2] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, PA, 3rd edition, 1999.
[3] Y. Chen, E. Li, J. Li, and Y. Zhang. Accelerating video feature extractions in CBVIR on multi-core systems. Intel Technology Journal, 11(4):349-360, November 2007.
[4] J. Dongarra, J. Du Croz, I. Duff, and S. Hammarling. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software, 16(1):1-17, 1990.
[5] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Inc., San Francisco, California, 4th edition, 2007.
[6] J. Hu, I. Pyarali, and D. Schmidt. Measuring the impact of event dispatching and concurrency models on web server performance over high-speed networks. In Global Telecommunications Conference (GLOBECOM '97), IEEE, volume 3, pages 1924-1931, November 1997.
[7] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. Introduction to the Cell multiprocessor. IBM Journal of Research and Development, 49(4/5), 2005.
[8] K. Kusano, S. Satoh, and M. Sato. Performance evaluation of the Omni OpenMP compiler. Lecture Notes in Computer Science, 1940:403-414, 2000.
[9] Y. Ling, T. Mullen, and X. Lin. Analysis of optimal thread pool size. SIGOPS Operating Systems Review, 34(2):42-55, 2000.
[10] H. Löf and S. Holmgren. affinity-on-next-touch: increasing the performance of an industrial PDE solver on a cc-NUMA system. In ICS '05: Proceedings of the 19th Annual International Conference on Supercomputing, pages 387-392, New York, NY, USA, 2005. ACM.
[11] D. Novillo. OpenMP and automatic parallelization in GCC. In 2006 GCC Summit, June 2006.
[12] D. C. Schmidt. Evaluating architectures for multithreaded object request brokers. Communications of the ACM, 41(10):54-60, 1998.
[13] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan. Larrabee: a many-core x86 architecture for visual computing. ACM Transactions on Graphics, 27(3):1-15, August 2008.
[14] M. Wall. Using Block Prefetch for Optimized Memory Performance. Technical report, Advanced Micro Devices, 2002.
[15] R. C. Whaley and A. M. Castaldo. Achieving accurate and context-sensitive timing for code optimization. Software: Practice and Experience, 38(15):1621-1642, 2008.
[16] R. C. Whaley and A. Petitet. ATLAS homepage. http://math-atlas.sourceforge.net/.
[17] R. C. Whaley and A. Petitet. Minimizing development and maintenance costs in supporting persistently optimized BLAS. Software: Practice and Experience, 35(2):101-121, February 2005. http://www.cs.utsa.edu/~whaley/papers/spercw04.ps.
[18] R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001.