
The following paper was originally published in the Proceedings of the 2nd USENIX Windows NT Symposium, Seattle, Washington, August 3–4, 1998.

A System for Structured High-Performance Multithreaded Programming in Windows NT

John Thornley, K. Mani Chandy, and Hiroshi Ishii California Institute of Technology

For more information about the USENIX Association, contact: Phone: 510-528-8649; FAX: 510-548-5738; Email: [email protected]; WWW: http://www.usenix.org/

John Thornley, K. Mani Chandy, and Hiroshi Ishii
Computer Science Department, California Institute of Technology
Pasadena, CA 91125, U.S.A.
{john-t, mani, hishii}@cs.caltech.edu

Abstract

With the advent of inexpensive multiprocessor PCs, multithreading is poised to play an important role in computationally intensive business and personal computing applications, as well as in science and engineering. However, the difficulty of multithreaded programming remains a major obstacle. Windows NT support for threads is well suited to systems programming, but is too unstructured when multithreading is used for the purpose of speeding up program execution. In this paper, we describe a system for structured multithreaded programming. Thread creation operations are multithreaded variants of blocks and loops, and synchronization objects are based on Boolean flags and integer counters. With this system, most multithreaded program development can be performed using traditional sequential methods and tools. The system is integrated with Windows NT and Microsoft Developer Studio Visual C++. We are developing a variety of applications in collaboration with other researchers, to demonstrate the power of structured multithreaded programming on commodity multiprocessors running Windows NT. In one benchmark application (aircraft route optimization), we achieved better performance on a quad-processor Pentium Pro system than the best results reported on expensive supercomputers.

1. Introduction

In the past, high-performance multithreading has been synonymous with either scientific supercomputing, real-time control, or achieving high throughput on multi-user servers. The idea of dividing a computationally intensive program into multiple concurrent threads to speed up execution on multiprocessor computers is well established. However, this kind of high-performance multithreading has made very little impact in mainstream business and personal computing, or even in most areas of science and engineering. Part of the reason has been the rarity and high cost of multiprocessor computer systems. Another part of the reason is the difficulty of developing multithreaded programs, as compared to equivalent sequential programs.

The recent advent of inexpensive multiprocessor PCs and commodity OS support for lightweight threads opens the door to an important role for high-performance multithreading in all areas of computing. Dual- and quad-processor PCs are now available for a few thousand dollars. Mass-produced multiprocessors with 8, 16, and more processors will soon follow. Windows NT [2] can already support hundreds of fine-grained threads with low overhead, even on single-processor machines, and future releases will be even more efficient. Examples of applications that could use multithreading to improve performance include spreadsheets, CAD/CAM, three-dimensional rendering, photo/video editing, voice recognition, games, simulation, and resource management.

The biggest obstacle that remains is the difficulty of developing efficient multithreaded programs. General-purpose thread libraries, including the Win32 API [1][8] supported by Windows NT, are well suited to systems programming applications of threads, e.g., control systems, database systems, and distributed systems. However, the interface provided by general-purpose thread libraries is less well suited to applications where threads are used for the purpose of speeding up program execution. General-purpose thread management is unstructured, and synchronization operations are complex and error-prone. The unpredictable interactions of multiple threads introduce many problems (e.g., race conditions and deadlock) that do not occur in sequential programming. In many regards, general-purpose thread libraries are the assembly language of high-performance multithreaded programming.

In this paper, we describe our ongoing research to develop a system for structured high-performance multithreaded programming on top of the general-purpose thread support provided by operating systems such as Windows NT. The key attributes of our system are as follows:

• Structured thread creation constructs are based on sequential blocks and for loops.
• Structured synchronization constructs are based on Boolean flags and integer counters.
• Subject to a few simple rules, multithreaded execution is deterministic and produces the same results as sequential execution.
• Lock synchronization is provided for nondeterministic algorithms.
• Barrier synchronization is provided for efficiency.

Our system is supported at two levels: (i) Sthreads, a structured thread library, and (ii) Multithreaded C, a set of pragmas transformed into Sthreads calls by a preprocessor. Sthreads is easily implemented as a thin layer on top of general-purpose thread libraries. The Multithreaded C preprocessor is a simple source-to-source transformation tool that involves no complex program analysis. Therefore, the entire system is highly portable.

One of the major strengths of our system is that much of the development of a multithreaded application can be performed using ordinary sequential methods and tools. The Multithreaded C preprocessor is integrated with Microsoft Developer Studio Visual C++ [7]. Applications can be built either as multithreaded applications (as indicated by the pragmas) or as sequential applications (by ignoring the pragmas). For deterministic applications, most development, testing, and debugging can be performed using the sequential version of the application. Absence of race conditions and deadlock can be verified in the context of sequential execution. Nondeterministic applications can usually be developed with large deterministic components.
The focus of our work is on commodity multiprocessors and operating systems, in particular multiprocessor PCs and Windows NT. However, our programming system is portable across platforms ranging from low-end PCs to high-end workstations and supercomputers. For this reason, the value of our work is not restricted to developers of applications for commodity systems. Our portable system for high-performance multithreaded programming also allows high-end applications to be developed, tested, and debugged on accessible low-end platforms.

The remainder of this paper is organized as follows: in Section 2, we discuss the interface and performance of Windows NT support for multithreading; in Section 3, we describe our structured multithreaded programming system; in Section 4, we report in some detail on one particular application (aircraft route optimization) that we have developed using our system; in Section 5, we give a brief outline of several other applications that we are developing; in Section 6, we compare our system with related work; and in Section 7, we summarize and conclude.

2. Windows NT Multithreading

In this section, we describe the interface and performance of standard Windows NT thread support. Since our emphasis is on commodity systems and applications, Windows NT is the ideal platform on top of which to build our system for structured multithreaded programming. The following are particularly important to us: (i) the Windows NT thread interface provides the functionality that we require for our system, and (ii) the Windows NT thread implementation efficiently supports large numbers of lightweight threads.

2.1. Windows NT Thread Interface

Windows NT implements the Win32 thread API. This interface provides all the functionality that is needed for high-performance multithreaded programming. However, because the scope of the Win32 thread API is general-purpose, the interface is more complicated and less structured than is desirable when threads are used for the purpose of speeding up program execution. For this reason, we have built a less general, less complicated, and more structured layer on top of the functionality provided by the Win32 thread API.

The Win32 thread API is typical of other general-purpose thread libraries, e.g., Pthreads [6] and Solaris threads [4]. It provides a set of function calls for creating and terminating threads, suspending and resuming threads, synchronizing threads, and controlling the scheduling of threads using priorities. The interface contains a large number of constants and types, a large number of functions, and many optional arguments. Although all these operations have important uses in general-purpose multithreading, only a structured subset of this functionality is required for our purpose.

As with other general-purpose thread libraries, Win32 thread creation is unstructured. A thread is created by passing a function pointer and an argument pointer to a CreateThread call. The new thread executes the given function with the given argument. The thread can be created either runnable or suspended. After creation, there is no special relationship or synchronization between the created thread and the creating thread. For example, the created thread may outlive the creating thread, causing problems if the created thread references variables in the creating thread. Many unstructured operations are permitted on threads. For example, one thread can arbitrarily suspend, resume, or terminate the execution of another thread.

The Win32 thread API provides a large range of synchronization operations. A thread can synchronize on the termination of another thread, or on the termination of one or all of a group of other threads. Semaphore, mutex, critical-section, and event objects, and interlocked operations allow many other forms of synchronization between threads. There are many options associated with these synchronization operations, particularly with operations on event objects. All of these synchronization operations almost inevitably introduce nondeterminacy to a program. Nondeterminacy is an implicit part of most systems programming applications of threads, but is best avoided if possible in other applications, because of the difficulty it adds to testing, debugging, and performance prediction.

To summarize, Windows NT supports a rich but complex set of general-purpose thread management and synchronization operations. We implement a simpler layer for structured multithreaded programming on top of the Windows NT thread interface.
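As a concrete illustration of this unstructured style, the following minimal sketch shows a Win32 thread created with CreateThread and joined by waiting on its handle. The sketch is ours, not taken from the paper; the Worker function and WorkArgs record are illustrative.

    #include <windows.h>

    /* Argument record passed to the thread through a single untyped pointer. */
    typedef struct {
      int first, last;              /* range of work assigned to the thread */
    } WorkArgs;

    /* Win32 thread functions have this fixed signature. */
    DWORD WINAPI Worker(LPVOID param)
    {
      WorkArgs *args = (WorkArgs *) param;
      int i;
      for (i = args->first; i < args->last; i++) {
        /* ... perform one unit of work ... */
      }
      return 0;
    }

    void RunWorker(void)
    {
      WorkArgs args = { 0, 100 };
      DWORD id;
      /* Unstructured creation: the child thread has no special
         relationship to its creator and must be joined explicitly. */
      HANDLE thread = CreateThread(NULL, 0, Worker, &args, 0, &id);
      if (thread != NULL) {
        WaitForSingleObject(thread, INFINITE);  /* synchronize on termination */
        CloseHandle(thread);
      }
    }

Note that nothing in the interface ties the lifetime of the child to the scope of the variables it references; the explicit wait is what prevents args from disappearing under the running thread.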
2.2. Windows NT Thread Performance

Lightweight multithreading is an integral part of our programming model. If multithreaded applications are to make an impact in commodity software, they must be able to execute efficiently on systems with differing numbers of processors, and dynamically adapt to varying background conditions. The best way to achieve this is to build applications with large numbers of dynamically created lightweight threads that can take advantage of whatever processing resources become available during execution. The traditional scientific supercomputing model of one static thread per processor on an otherwise unloaded system is not sufficient in the commodity software domain.

We have performed experiments that demonstrate that Windows NT supports large numbers of lightweight threads efficiently. Specifically, on single-processor, dual-processor, and quad-processor Pentium Pro systems running Windows NT, we have demonstrated the following:

• Thread creation and termination overheads are low.
• Synchronization overheads are low.
• Hundreds of threads can be supported without significant performance degradation.
• On single-processor systems, multithreaded programs with many threads can run as fast as equivalent sequential programs.
• On multiprocessor systems, multithreaded programs with many threads can achieve good speedups over equivalent sequential programs.

We have developed many small and large programs which demonstrate that multithreaded Windows NT applications can execute efficiently on both single-processor and multiprocessor systems, without any kind of reconfiguration. We have found that good speedups are maintained with much larger numbers of threads than processors.

As a simple example, Figure 1 shows the speedup of multithreaded matrix multiplication over sequential matrix multiplication for 1000-by-1000 element matrices on a 200 MHz quad-processor Pentium Pro system running Windows NT Server 4.0. A straightforward three-nested-loops algorithm is used. Sequential matrix multiplication takes 50 seconds. In the multithreaded version, the iterations in the outer loop are evenly partitioned among the threads. (The intention of this example is to demonstrate Windows NT multithreaded performance characteristics, not optimal matrix multiplication algorithms.) Near-perfect speedups are achieved and are maintained with large numbers of threads.

[Figure 1: Speedup of multithreaded matrix multiplication over sequential matrix multiplication on one to four processors of a quad-processor Pentium Pro system. The plot shows speedup over sequential execution (0.0 to 4.0) against the number of threads (0 to 100), with one curve each for 1, 2, 3, and 4 processors.]

To summarize, we have found that Windows NT efficiently supports large numbers of lightweight threads. Our structured multithreaded programming system is implemented as a very thin layer on top of Windows NT support for threads.
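The measured program itself is not listed in the paper. As an indication of its structure, the following sketch of ours shows how the outer loop of a straightforward three-nested-loops matrix multiplication can be partitioned among threads using the Multithreaded C notation introduced in Section 3; the function name and argument layout are illustrative.

    /* C = A x B for N-by-N matrices; each thread computes a block of rows. */
    void MatrixMultiply(
        float A[N][N], float B[N][N], float C[N][N], int numThreads)
    {
      int i;
      #pragma multithreadable \
          mapping(blocked(numThreads))
      for (i = 0; i < N; i++) {     /* outer-loop iterations are independent */
        int j, k;
        for (j = 0; j < N; j++) {
          float sum = 0.0;
          for (k = 0; k < N; k++)
            sum = sum + A[i][k] * B[k][j];
          C[i][j] = sum;
        }
      }
    }

With the pragma ignored this is an ordinary sequential program; with the pragma active, the N outer iterations are divided into numThreads contiguous blocks, one per thread.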
3. A System for Structured Multithreaded Programming

In this section, we describe the features of our structured multithreaded programming system. Since it is beyond the scope of this short paper to describe all the details of our system, we aim to give an overview of the fundamental constructs and development methods.

3.1. Overview

We are developing a two-level approach to structured multithreaded programming in ANSI C [3]. The higher level is a pragma-based notation (Multithreaded C), and the lower level is a structured thread library (Sthreads). In both Multithreaded C and Sthreads, thread creation constructs are multithreaded variants of sequential block and for loop constructs. With Multithreaded C, the constructs are supported as pragma annotations to a sequential program. With Sthreads, exactly the same constructs are supported as library calls. At both levels, synchronization objects and operations are supported as Sthreads library calls.

The Sthreads library is implemented as a very thin layer on top of the Windows NT thread interface. Multithreaded C is implemented as a portable source-to-source preprocessor that directly transforms annotated blocks and for loops into equivalent calls to the Sthreads library. The programmer has the option of either using the pragmas and preprocessor or making Sthreads calls directly. The Sthreads library and Multithreaded C preprocessor are integrated with Microsoft Developer Studio Visual C++. Building a project automatically invokes the preprocessor where necessary and links with the Sthreads library.

3.2. Multithreadable Blocks and for Loops

Multithreadable blocks and for loops are ordinary blocks and for loops for which multithreaded execution is equivalent to sequential execution. In Multithreaded C, multithreadable blocks and for loops are expressed using the multithreadable pragma. It is obvious that the pragma can be applied to blocks and for loops in which the statements or iterations are independent of each other. As a simple example, consider the following program to sum the elements of a two-dimensional array:

    void SumElements(
        float A[N][N], float *sum, int numThreads)
    {
      int i;
      float rowSum[N];

      #pragma multithreadable \
          mapping(blocked(numThreads))
      for (i = 0; i < N; i++) {
        int j;
        rowSum[i] = 0.0;
        for (j = 0; j < N; j++)
          rowSum[i] = rowSum[i] + A[i][j];
      }
      *sum = 0.0;
      for (i = 0; i < N; i++)
        *sum = *sum + rowSum[i];
    }

Multithreaded execution of the for loop is equivalent to sequential execution because the iterations all modify different rowSum[i] and j variables. The arguments following the pragma indicate that multithreaded execution should assign iterations to numThreads different threads using a blocked mapping. There is a rich set of options that control the mapping of iterations to threads.

3.3. Synchronization Using Flags and Counters

Flags and counters are provided to express deterministic synchronization within multithreadable blocks and for loops. Flags support Set and Check operations. Initially, a flag is not set. A Set operation on a flag atomically sets the flag. A Check operation on a flag suspends until the flag is set. Once a flag is set, it remains set. Counters support Increment and Check operations. A counter has an initial value of zero. An Increment operation on a counter atomically increments the value of the counter. A Check operation on a counter suspends until the value of the counter reaches a given value. The value of a counter cannot be decremented. The only way to test the value of a flag or counter is with a Check operation. As a simple example, consider the following program to sum the elements of a two-dimensional array:

    void SumElements(
        float A[N][N], float *sum, int numThreads)
    {
      int i;
      SthreadCounter counter;

      SthreadCounterInitialize(&counter);
      #pragma multithreadable \
          mapping(blocked(numThreads))
      for (i = 0; i < N; i++) {
        int j;
        float rowSum;
        rowSum = 0.0;
        for (j = 0; j < N; j++)
          rowSum = rowSum + A[i][j];
        SthreadCounterCheck(&counter, i);
        *sum = *sum + rowSum;
        SthreadCounterIncrement(&counter, 1);
      }
      SthreadCounterFinalize(&counter);
    }

Without the counter operations, multithreaded execution of the for loop would not be equivalent to sequential execution, because the iterations all modify the same *sum variable. However, the counter operations ensure that multithreaded execution is equivalent to sequential execution. In sequential execution, the iterations are executed in increasing order and the SthreadCounterCheck operations succeed without suspending. In multithreaded execution, the counter operations ensure that the operations on *sum occur atomically and in the same order as in sequential execution. Iteration i suspends at the SthreadCounterCheck operation until iteration i - 1 has executed the SthreadCounterIncrement operation.
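Flags play the analogous role when one statement of a multithreadable block produces a value that another statement consumes. The following sketch is ours: the flag function names (SthreadFlagInitialize, SthreadFlagSet, SthreadFlagCheck, SthreadFlagFinalize) are assumed by analogy with the counter operations above, and Produce and Consume are hypothetical.

    void ProduceAndConsume(float *result)
    {
      float x;
      SthreadFlag ready;

      SthreadFlagInitialize(&ready);
      #pragma multithreadable
      {
        {                            /* first statement: producer  */
          x = Produce();
          SthreadFlagSet(&ready);    /* x is now safe to read      */
        }
        {                            /* second statement: consumer */
          SthreadFlagCheck(&ready);  /* suspends until x is ready  */
          *result = Consume(x);
        }
      }
      SthreadFlagFinalize(&ready);
    }

In sequential execution the producer statement runs first, so the Check succeeds without suspending, which is exactly the usage rule stated in Section 3.4 below.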
3.4. Sequential Development Methods

The equivalence of multithreaded and sequential execution of multithreadable blocks and for loops allows multithreaded programs to be developed using ordinary sequential methods and tools. The multithreadable pragma is an assertion by the programmer that the block or for loop can be executed in a multithreaded manner without changing the results of the program. It is not a directive that the block or for loop must be executed in a multithreaded manner. The Multithreaded C preprocessor has two modes: sequential mode, in which the multithreadable pragma is ignored, and multithreaded mode, in which the multithreadable pragma is transformed into Sthreads calls. Programs can be developed, tested, and debugged in sequential mode, then executed in multithreaded mode for performance. In addition, performance analysis and tuning can often be performed in sequential mode.

The advantages of this approach to multithreaded programming are clear. However, the programmer is responsible for correct use of the multithreadable pragma. Fortunately, the rules for correct use of the pragma are straightforward and can be stated entirely in terms of sequential execution. In sequential execution, accesses to shared variables must be separated by Set and Check operations or Increment and Check operations, and the Check operations must not suspend. In multithreaded execution, the synchronization operations will then ensure that accesses to shared variables occur atomically and in the same order as in sequential execution. Therefore, multithreaded execution will be equivalent to sequential execution.

3.5. Determinacy

Determinacy of results is an important consequence of the equivalence of multithreaded and sequential execution. If sequential execution is deterministic (which is usually the case), multithreaded execution will also be deterministic. Determinacy is usually desirable, since program development and debugging can be difficult when different runs produce different results. In many other multithreaded programming systems, determinacy is difficult to ensure. For example, locks, semaphores, and many-to-one message passing almost always introduce race conditions and hence nondeterminacy. However, nondeterminacy is important for efficiency in some algorithms, e.g., branch-and-bound algorithms.

3.6. Multithreaded Blocks and for Loops

Multithreaded blocks and for loops are blocks and for loops that must be executed in a multithreaded manner. Multithreaded execution is not necessarily equivalent to sequential execution. In Multithreaded C, multithreaded blocks and for loops are expressed using the multithreaded pragma. Unlike the multithreadable pragma, the multithreaded pragma is transformed into Sthreads calls by the Multithreaded C preprocessor in both sequential and multithreaded mode.
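A typical candidate for the multithreaded pragma is a block whose statements must run concurrently to make progress at all, such as a producer and consumer communicating through a bounded buffer; running such a block sequentially could suspend forever, which is why the pragma is expanded in both modes. The following sketch is ours; BoundedQueue, ProduceItems, and ConsumeItems are hypothetical.

    typedef struct BoundedQueue BoundedQueue;  /* hypothetical bounded buffer       */
    void ProduceItems(BoundedQueue *q);        /* suspends when the buffer is full  */
    void ConsumeItems(BoundedQueue *q);        /* suspends when the buffer is empty */

    void RunPipeline(BoundedQueue *q)
    {
      /* The two statements must execute concurrently: the producer can
         fill the buffer and suspend until the consumer drains it, so the
         preprocessor creates threads here in both of its modes. */
      #pragma multithreaded
      {
        ProduceItems(q);
        ConsumeItems(q);
      }
    }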
3.7. Synchronization Using Locks

Locks are provided to express nondeterministic synchronization, usually mutual exclusion, within multithreaded blocks and for loops. Our locks support the usual Acquire and Release operations. The order in which concurrent Acquire operations succeed is nondeterministic. Therefore, there is very little use for locks within multithreadable blocks and for loops. As a simple example, consider the following program to sum the elements of a two-dimensional array:

    void SumElements(
        float A[N][N], float *sum, int numThreads)
    {
      int i;
      SthreadLock lock;

      SthreadLockInitialize(&lock);
      #pragma multithreaded \
          mapping(blocked(numThreads))
      for (i = 0; i < N; i++) {
        int j;
        float rowSum;
        rowSum = 0.0;
        for (j = 0; j < N; j++)
          rowSum = rowSum + A[i][j];
        SthreadLockAcquire(&lock);
        *sum = *sum + rowSum;
        SthreadLockRelease(&lock);
      }
      SthreadLockFinalize(&lock);
    }

Like the counter operations in the program in Section 3.3, the lock operations in this program ensure that the operations on *sum occur atomically. However, unlike the counter operations, the lock operations do not ensure that the operations on *sum occur in the same order as in sequential execution, or even in the same order each time the program is executed. Therefore, since floating-point addition is not associative, the program may produce different results each time it is executed. However, because execution order is less restricted, this program allows more concurrency than the program in Section 3.3. This is an example of the commonly occurring tradeoff between determinacy and efficiency.

3.8. Synchronization Using Barriers

Barriers are provided to express collective synchronization of a group of threads in cases when thread termination and recreation is too expensive. Our barriers support the usual Pass operation. All the threads in a group must enter the Pass operation before any of the threads in the group are allowed to leave the Pass operation. In current systems, the cost of N threads executing a Pass operation is less than the cost of creating and terminating N threads. Therefore, a typical use of barriers is to replace a sequence of multithreadable loops with a single multithreaded loop containing a sequence of barrier Pass operations. However, with modern lightweight thread systems such as Windows NT, we are discovering that barriers are required for efficiency in very few circumstances.
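For the cases where barriers do pay off, the pattern looks like the following sketch of ours: an iterative computation creates one thread per block once, and separates the steps with Pass operations instead of terminating and recreating the threads each step. The barrier function names (SthreadBarrierInitialize taking a group size, SthreadBarrierPass, SthreadBarrierFinalize) and the NewValue update rule are assumed for illustration.

    /* numSteps relaxation sweeps over u[]: one multithreaded loop with
       barrier Pass operations replaces numSteps multithreadable loops. */
    void Relax(float u[N], float v[N], int numSteps, int numThreads)
    {
      int t;
      SthreadBarrier barrier;

      SthreadBarrierInitialize(&barrier, numThreads);
      #pragma multithreaded
      for (t = 0; t < numThreads; t++) {
        int step, i;
        int first = t * N / numThreads;       /* block owned by thread t */
        int last = (t + 1) * N / numThreads;
        for (step = 0; step < numSteps; step++) {
          for (i = first; i < last; i++)
            v[i] = NewValue(u, i);            /* hypothetical update rule */
          SthreadBarrierPass(&barrier);       /* wait: all blocks updated */
          for (i = first; i < last; i++)
            u[i] = v[i];
          SthreadBarrierPass(&barrier);       /* wait: all blocks copied  */
        }
      }
      SthreadBarrierFinalize(&barrier);
    }

The two Pass operations per step implement the usual double-buffer discipline: no thread reads u[] for the next step until every thread has finished writing it.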
3.9. The Sthreads Interface

The examples that we have given so far are expressed using the Multithreaded C pragma notation. As we described previously, there is a direct correspondence between the pragma notation for thread creation and the Sthreads library functions that support thread creation. As a simple example, the following is the program from Section 3.3 implemented using Sthreads:

    typedef struct {
      float (*A)[N];
      float *sum;
      SthreadCounter *counter;
    } LoopArgs;

    void LoopBody(
        int i, int notused1, int notused2, LoopArgs *args)
    {
      int j;
      float rowSum;
      rowSum = 0.0;
      for (j = 0; j < N; j++)
        rowSum = rowSum + (args->A)[i][j];
      SthreadCounterCheck(args->counter, i);
      *(args->sum) = *(args->sum) + rowSum;
      SthreadCounterIncrement(args->counter, 1);
    }

    void SumElements(
        float A[N][N], float *sum, int numThreads)
    {
      SthreadCounter counter;
      LoopArgs args;

      SthreadCounterInitialize(&counter);
      args.A = A;
      args.sum = sum;
      args.counter = &counter;
      SthreadRegularForLoop(
          (void (*)(int, int, int, void *)) LoopBody,
          (void *) &args,
          0, STHREAD_CONDITION_LT, N, 1, 1,
          STHREAD_MAPPING_BLOCKED, numThreads,
          STHREAD_PRIORITY_PARENT,
          STHREAD_STACK_SIZE_DEFAULT);
      SthreadCounterFinalize(&counter);
    }

Although this program is syntactically more complicated than the Multithreaded C version, it is considerably less complicated than the same program expressed using Windows NT threads. The mechanics of creating threads, assigning iterations to threads, and waiting for thread termination is handled within the Sthreads library call.

3.10. Performance Issues

In our multithreaded programming system, obtaining good performance is under the control of the programmer. The following issues must be taken into account:

• Multithreading overheads: Threading and synchronization operations are time consuming.
• Load balancing: Enough concurrency must be expressed to keep the processors busy.
• Memory contention: Locality in memory access patterns prevents memory contention.

The key tradeoff is granularity. If multithreading is too fine-grained, the overheads will swamp the useful computation. If multithreading is too coarse-grained, the processors will spend too much time idle. Fortunately, Windows NT supports lightweight threads with low overheads.

3.11. Implementation and Performance on Windows NT

Sthreads for Windows NT is implemented as a very thin layer on top of the Win32 thread API. As a consequence, there is essentially no performance overhead associated with using Sthreads or Multithreaded C, as compared to using Win32 threads directly.

Multithreadable blocks and for loops are implemented as a sequence of CreateThread calls followed by a WaitForSingleObject call on an event. Terminating threads perform an InterlockedDecrement call on a counter, and the last thread to terminate performs a SetEvent call on the event. Flags are implemented directly as Win32 events. Counters are implemented as linked lists of Win32 events, with one event for every value on which some thread is waiting. Locks are implemented directly as Win32 critical sections. Barriers are implemented as a pair of Win32 events and a Win32 critical section.

On a single-processor 200 MHz Pentium Pro system running Windows NT Server 4.0, the time to create and terminate each thread in a multithreadable block or for loop is approximately 500 microseconds. The time to initialize and finalize a flag, counter, lock, or barrier is between 5 and 20 microseconds. The time to perform a synchronization operation on a flag, counter, lock, or barrier is less than 10 microseconds.
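The fork-join mechanism just described can be pictured with the following sketch, which is ours rather than the actual Sthreads source; the Join and WorkerArgs records and the fixed thread bound are illustrative.

    #include <windows.h>

    #define MAX_THREADS 64           /* illustrative fixed bound */

    /* Shared fork-join state: a live-thread count and a completion event. */
    typedef struct {
      LONG remaining;
      HANDLE done;                   /* manual-reset Win32 event */
    } Join;

    typedef struct {
      Join *join;
      int index;                     /* which block of iterations to run */
    } WorkerArgs;

    DWORD WINAPI Worker(LPVOID param)
    {
      WorkerArgs *args = (WorkerArgs *) param;
      /* ... execute the iterations assigned to args->index ... */
      /* The last thread to terminate signals the waiting parent. */
      if (InterlockedDecrement(&args->join->remaining) == 0)
        SetEvent(args->join->done);
      return 0;
    }

    void RunBlock(int numThreads)
    {
      WorkerArgs args[MAX_THREADS];
      HANDLE handles[MAX_THREADS];
      Join join;
      int t;
      DWORD id;

      join.remaining = numThreads;
      join.done = CreateEvent(NULL, TRUE, FALSE, NULL);
      for (t = 0; t < numThreads; t++) {
        args[t].join = &join;
        args[t].index = t;
        handles[t] = CreateThread(NULL, 0, Worker, &args[t], 0, &id);
      }
      WaitForSingleObject(join.done, INFINITE);  /* join on the whole group */
      for (t = 0; t < numThreads; t++)
        CloseHandle(handles[t]);
      CloseHandle(join.done);
    }

Waiting on a single event rather than on numThreads handles keeps the parent's cost independent of the group size.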
3.12. Current Status

The Sthreads implementation for Windows NT is complete and robust. Over the last year, we have developed many different multithreaded applications using Sthreads on quad-processor Pentium Pro systems running Windows NT. This academic year, we taught an undergraduate class at Caltech on multithreaded programming, with Sthreads used for all the homework assignments. We are in the process of implementing Sthreads on top of Pthreads for a variety of platforms, including Sun UltraSPARC, SGI Origin, and HP Exemplar. We have developed a comprehensive, platform-independent test suite for Sthreads and a timing suite to compare the performance of different Sthreads implementations.

At the time of writing, the Multithreaded C preprocessor is still under development. We hope to make a prototype preprocessor available by the fourth quarter of 1998. In the meanwhile, we are developing multithreaded applications by making Sthreads calls directly. Because of the direct correspondence between the pragma annotations and Sthreads calls, the design and development of algorithms and programs is the same in Sthreads and Multithreaded C. The only difference is in the syntax.

4. An Example Application: Aircraft Route Optimization

In this section, we report on one example application that we have developed. The Aircraft Route Optimization Problem is part of the U.S. Air Force Rome Laboratory C3I Parallel Benchmark Suite [5]. For this application, we achieved better performance using Sthreads on a quad-processor Pentium Pro system running Windows NT than the best reported results for message-passing programs running on expensive Cray and SGI supercomputers with up to 64 processors. The flexibility of shared-memory, lightweight multithreading, and sequential development methods allowed us to develop a much more sophisticated and efficient algorithm than would be possible on a message-passing supercomputer.

4.1. The C3I Parallel Benchmark Suite

The U.S. Air Force Rome Laboratory C3I Parallel Benchmark Suite consists of eight problems chosen to represent the essential elements of real C3I (Command, Control, Communication, and Intelligence) applications. Each problem consists of the following:

• A problem description giving the inputs and required outputs.
• An efficient sequential program (written in C) to solve the problem.
• The benchmark input data.
• A correctness test for the benchmark output data.

For some of the problems, a parallel message-passing program is also provided. Rome Laboratory maintains a publicly accessible database of reported performance results.

The C3I Parallel Benchmark Suite provides a good framework for evaluating our structured multithreaded programming system. The problems are computationally intensive and involve a variety of complex algorithms and data structures. The sequential program provides us with a good starting point and a fair basis for performance comparison. The performance database allows us to compare our results with those of other researchers. For these reasons, we are developing multithreaded solutions to several of the C3I Parallel Benchmark Suite problems.

4.2. The Aircraft Route Optimization Problem

The objective in the Aircraft Route Optimization Problem is to find the lowest-risk path for an aircraft from an origin point to a set of destination points in the airspace over an uneven terrain. The risk associated with each transition in the airspace is determined by its proximity to a set of threats. The problem involves realistic constraints on aircraft speed and maneuverability. The aircraft is also constrained to fly above the underlying terrain and beneath a given ceiling altitude.

The problem is essentially the single-source, multiple-destination shortest path problem with a large, sparsely connected graph. The airspace for the benchmark is 100 km by 100 km in area and 10 km in altitude, discretized at 1 km intervals. The 100,000 positions in space correspond to 2,600,000 nodes in the graph, since each position can be reached from 26 different directions. Because of aircraft speed and maneuverability constraints, each node is connected to only nine or ten geographically adjacent nodes. Therefore, the graph consists of approximately 2.6 million nodes and 26 million edges.

4.3. Sequential Algorithm

The sequential algorithm to solve the Aircraft Route Optimization Problem is based on a queue of nodes. Initially the queue is empty except for the origin node. At each step, one node is removed from the queue. Valid transitions from this source node to all adjacent destination nodes are considered. For each destination node, if the path to the node via the source node is shorter than the current shortest path to the node, the path to the node is updated and the node is added to the queue. The algorithm continues until the queue is empty, at which stage the shortest paths to all reachable nodes have been computed.

The queue is ordered on path length so that shorter paths are expanded before longer paths. This has a significant effect on performance. Without ordering, longer paths are expanded, then discarded when shorter paths to the same points are expanded later in the computation. However, whether the queue is ordered, partially ordered, or unordered does not affect the results of the algorithm.
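In outline, the core loop of this sequential algorithm looks as follows. The sketch is ours: the graph-access and priority-queue functions and the pathLength array are hypothetical stand-ins for the benchmark's data structures, and pathLength is assumed initialized to infinity for every node except the origin.

    typedef int Node;

    /* Hypothetical graph access and path-length-ordered queue. */
    extern int    EdgeCount(Node n);
    extern Node   Adjacent(Node n, int k);
    extern double EdgeCost(Node from, Node to);
    extern int    QueueEmpty(void);
    extern Node   QueueRemoveShortest(void);
    extern void   QueueInsert(Node n, double length);
    extern double pathLength[];    /* best known path length per node */

    void ShortestPaths(Node origin)
    {
      pathLength[origin] = 0.0;
      QueueInsert(origin, 0.0);
      while (!QueueEmpty()) {
        Node source = QueueRemoveShortest();
        int k;
        for (k = 0; k < EdgeCount(source); k++) {
          Node dest = Adjacent(source, k);
          double viaSource = pathLength[source] + EdgeCost(source, dest);
          if (viaSource < pathLength[dest]) {
            pathLength[dest] = viaSource;  /* shorter path found       */
            QueueInsert(dest, viaSource);  /* re-expand the node later */
          }
        }
      }
    }

Because a node is reinserted whenever its path improves, the result is the same whether QueueRemoveShortest returns the true minimum or only an approximation, which is the property the multithreaded algorithm below exploits.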
4.4. Multithreaded Algorithm

The most straightforward approach to obtaining parallelism in the Aircraft Route Optimization Problem is to geographically partition the airspace into blocks, with one thread (or process) responsible for each block. Each thread runs the sequential algorithm on its own block using its own local queue and periodically exchanges boundary values with neighboring blocks. This approach is particularly appealing on distributed-memory, message-passing platforms, because memory can be permanently distributed according to the partitioning. If the threads execute a reasonably large number of iterations between boundary exchanges, good load balance can be achieved.

The problem with this algorithm is that, as the number of blocks/threads is increased, the total amount of computation also increases. Therefore, any speedup is based on an increasingly inefficient underlying algorithm. At any time, the local queues in most blocks contain paths that are too long and are irrelevant to the actual shortest paths. The processors are kept busy performing computation that is later discarded. At any given time, it is only productive to work on an irregular and unpredictable subset of the graph. However, irregular and adaptive blocking schemes do not solve the problem, since there is usually equal work available in all blocks. The issue is the distinction between productive and unproductive work.

Our solution is to statically partition the airspace into a large number of blocks and to use a much smaller number of threads. A measure of the average path length is maintained with each local queue. At each step, the blocks with local queues containing the shortest paths are assigned to the threads. Therefore, the subset of blocks that are active and the assignment of blocks to threads change dynamically throughout program execution. This algorithm takes advantage of the symmetric multiprocessor model, in which all threads can access the entire memory space with uniform cost. It also takes advantage of the lightweight multithreading model to achieve good load balance, since the workload within each thread at each step is highly variable.

The ability to develop, test, and debug using sequential methods was crucial in the development of this sophisticated multithreaded algorithm. The entire program was tested and debugged in sequential mode before multithreaded execution was attempted. In particular, development of the complex boundary exchange and queue update algorithms would have been considerably more difficult in multithreaded mode.

The ability to analyze and tune performance using sequential methods was also very important. Good performance depended on exposing enough parallelism without significantly increasing the total amount of computation. We determined efficient values for the number of blocks, the number of threads, and the number of iterations between boundary exchanges by measuring computation times and operation counts of the multithreaded program running in sequential mode. This detailed analysis would have been very difficult to perform in multithreaded mode. We avoided memory contention in multithreaded mode by avoiding cache misses in sequential mode. The analysis of memory access patterns in sequential mode is much simpler than in multithreaded mode.

4.5. Performance

Other researchers have developed and reported results for message-passing solutions to the Aircraft Route Optimization Problem running on Cray T3D and SGI Power Challenge supercomputers with 16 and 64 processors. We developed our program using Sthreads on a quad-processor 200 MHz Pentium Pro system running Windows NT Server 4.0. (One Pentium Pro processor is approximately the same speed as one SGI Power Challenge processor and twice the speed of one Cray T3D processor.) As shown in Figure 2, our results are better than the results reported on the expensive supercomputers. The reason is the combination of the shared-memory model supported by the Pentium Pro, the lightweight multithreading model supported by Windows NT, and the structured multithreaded programming system with sequential development methods supported by Sthreads.
[Figure 2: Aircraft Route Optimization execution times. Sthreads on a quad-processor Pentium Pro outperforms message-passing on supercomputers for the Aircraft Route Optimization Problem. The bar chart plots execution time in seconds (0 to 35) for four configurations: Cray T3D, 64 processors, MPI; SGI Power Challenge, 16 processors, MPI; Cray T3D, 64 processors, PVM; and Pentium Pro, 4 processors, Sthreads.]

These results should not be misinterpreted as meaning that commodity multiprocessors are now as powerful as supercomputers, or that structured multithreading can obtain super-linear speedups from a small number of processors. For the Aircraft Route Optimization Problem, we obtain approximately three-fold speedup over sequential execution on four processors. However, unlike the message-passing approach, the speedups are obtained from a multithreaded algorithm that adds no significant overheads to the sequential algorithm.

This example is intended as an interesting case study that highlights the strengths of shared-memory, lightweight threads, and structured multithreaded programming. Clearly, there are many other applications for which message-passing supercomputers would outperform a quad-processor Pentium Pro system. Nonetheless, it is significant that inexpensive commodity multiprocessors are now a practical consideration in many areas that were previously the sole domain of supercomputers.

5. Additional Applications under Development

We are in the process of developing a number of other multithreaded applications to demonstrate the power of structured multithreaded programming on commodity multiprocessors running Windows NT. These applications include the following:

• Volume Rendering: Volume rendering computes an image of a three-dimensional data volume. The image may or may not be realistic, depending on the nature of the data. We obtain better than 3.5-fold speedups on quad-processor Pentium Pro systems.
• Terrain Masking: Terrain masking computes the altitudes at which an aircraft will be visible to ground-based radar systems in an uneven terrain. This problem has obvious applications in military and civil aviation. Terrain masking is part of the C3I Parallel Benchmark Suite. We obtain 3-fold speedups on quad-processor Pentium Pro systems.
• Threat Analysis: Threat analysis is a military application that performs a time-stepped simulation of the trajectories of incoming threats and analyzes options for intercepting the threats. Threat analysis is part of the C3I Parallel Benchmark Suite. We obtain almost 4-fold speedups on quad-processor Pentium Pro systems.
• Protein Folding: Protein folding takes a known protein chain and computes the most likely three-dimensional structures for the molecule. This problem is of vital interest to biochemists involved in drug design. We obtain 3-fold speedups on quad-processor Pentium Pro systems.
• Molecular Dynamics Simulation: We are developing a multithreaded implementation of an existing molecular dynamics simulation program (MPSim) that uses the multipole method. MPSim consists of more than 100 source files and over 100,000 lines of code. We obtain better than 3.5-fold speedups on quad-processor Pentium Pro systems for simulations involving up to half a million atoms.

All of these applications are extremely computationally intensive. Depending on the input data, the problems may require many hours or days to solve on fast single-processor machines. For this reason, much of the previous work on these problems has been with expensive supercomputers. The computational challenge associated with these problems is not about to disappear with the next generation of fast processors.

6. Comparison with Related Work

Over the years, hundreds of different systems for multithreaded programming have been proposed and implemented. These systems include thread libraries, concurrent languages, sequential languages with pragmas, data parallel languages, automatic parallelizing compilers, and parallel dataflow languages. We have attempted to combine the best attributes of these other approaches into one simple, timely, and highly accessible multithreaded programming system.

The contribution of our system is not in the individual constructs, but in their combination and context. Multithreaded block and for loop constructs date back to the 1960s and have been incorporated in many forms in concurrent languages, pragma-based systems, and data parallel languages. Synchronization flags and counters are derived from concepts originally explored in the context of parallel dataflow languages. Locks and barriers are standard synchronization constructs in many thread libraries and concurrent languages.

An important difference between our system and others is the combination of the following: (i) the emphasis on practical development of sophisticated multithreaded programs using sequential methods, (ii) the ability to ensure deterministic execution, but express nondeterminacy where necessary, and (iii) the minimalist integration of these ideas with popular languages, operating systems, and tools. We believe that this is the right approach to making multithreading practical for a wide range of applications running on modern multiprocessor platforms.

7. Conclusion

We are developing a system for structured high-performance multithreaded programming on top of the support that Windows NT provides for lightweight multithreading. Our system consists of a structured thread library and a preprocessor integrated with Microsoft Developer Studio Visual C++. An important attribute of our system is the ability to develop multithreaded programs using traditional sequential methods and tools.

We have developed multithreaded applications in a wide range of areas such as optimization, graphics, simulation, and applied science. The performance of these applications on multiprocessor Pentium Pro systems has been excellent. Our experience developing these applications has convinced us that high-performance multithreading is ready to enter the mainstream of computing.

Availability

Information on obtaining the multithreaded programming system described in this paper can be found at http://threads.cs.caltech.edu/ or can be requested from [email protected].

Acknowledgments

The applications described in this paper are being developed by the following Caltech students: Eric Bogs, Marrq Ellenbecker, Yuanshan Guo, Maria Hui, Lei Jin, Autumn Looijen, Scott Mandelsohn, Bradley Marker, Jason Roth, Sean Suchter, and Louis Thomas. We are developing the applications in collaboration with the following Caltech faculty, staff, and students: Prof. Peter Schröder in the Computer Science Department, Dr. Jerry Solomon and David Liney in the Computational Biology Department, and Prof. William Goddard III, Dr. Tahir Cagin, and Hao Li in the Materials and Process Simulation Center.

Thanks to Intel Corporation for donating multiprocessor Pentium Pro computer systems. Thanks also to Microsoft Corporation for donating software and documentation. Special thanks to David Ladd of Microsoft University Research Programs for being so responsive to our requests. The design of our multithreaded programming system greatly benefited from discussions with Tim Mattson of Intel Corporation. This research was funded in part by the NSF Center for Research on Parallel Computation under Cooperative Agreement No. CCR-9120008 and by the Army Research Laboratory under Contract No. DAHC94-96-C-0010.

References

[1] Jim Beveridge and Robert Wiener. Multithreading Applications in Win32. Addison-Wesley, Reading, Massachusetts, 1997.
[2] Helen Custer. Inside Windows NT. Microsoft Press, Redmond, Washington, 1993.
[3] Brian W. Kernighan and Dennis M. Ritchie. The C Programming Language. Prentice Hall, Englewood Cliffs, New Jersey, second edition, 1988.
[4] Steve Kleiman, Devang Shah, and Bart Smaalders. Programming with Threads. Prentice Hall, Upper Saddle River, New Jersey, 1996.
[5] Richard C. Metzger, Brian VanVoorst, Luiz S. Pires, Rakesh Jha, Wing Au, Minesh Amin, David A. Castanon, and Vipin Kumar. The C3I Parallel Benchmark Suite - Introduction and Preliminary Results. Supercomputing '96, Pittsburgh, Pennsylvania, November 17-22, 1996.
[6] Bradford Nichols, Dick Buttlar, and Jacqueline Proulx Farrell. Pthreads Programming. O'Reilly, Sebastopol, California, 1996.
[7] David J. Kruglinski. Inside Visual C++. Microsoft Press, Redmond, Washington, fourth edition, 1997.
[8] Thuan Q. Pham and Pankaj K. Garg. Multithreaded Programming with Windows NT. Prentice Hall, Upper Saddle River, New Jersey, 1996.