Qthreads: an API for Programming with Millions of Lightweight Threads

Qthreads: An API for Programming with Millions of Lightweight Threads Kyle B. Wheeler Richard C. Murphy Douglas Thain University of Notre Dame Sandia National Laboratories∗ University of Notre Dame South Bend, Indiana, USA Albuquerque, New Mexico, USA South Bend, Indiana, USA [email protected] [email protected] [email protected] Abstract world of the very small, where processors extract parallelism from serial instruction streams. Recent hardware ar- Large scale hardware-supported multithreading, an at- chitectural research has investigated lightweight threading tractive means of increasing computational power, benefits and programmer-defined large scale shared-memory paral- significantly from low per-thread costs. Hardware support lelism. The lightweight threading concept allows exposure for lightweight threads is a developing area of research. of greater potential parallelism, increasing performance via Each architecture with such support provides a unique in- greater hardware parallelism. The Cray XMT [7], with the terface, hindering development for them and comparisons Threadstorm CPU architecture, avoids memory dependency between them. A portable abstraction that provides basic stalls by switching among 128 concurrent threads. XMT lightweight thread control and synchronization primitives systems support between over 8000 processors. To maxi- is needed. Such an abstraction would assist in exploring mize throughput, the programmer must provide at least 128 both the architectural needs of large scale threading and threads per processor, or over 1,024,000 threads. the semantic power of existing languages. Managing thread Taking advantage of large-scale parallel systems with resources is a problem that must be addressed if massive current parallel programming APIs requires significant parallelism is to be popularized. The qthread abstraction computational and memory overhead. For example, stan- enables development of large-scale multithreading applica- dard POSIX threads must be able to receive signals, which tions on commodity architectures. This paper introduces either requires an OS representation of every thread or re- the qthread API and its Unix implementation, discusses requires user-level signal multiplexering [14]. Threads in a source management, and presents performance results from large-scale multithreading context often need only a few the HPCCG benchmark. bytes of stack (if any) and do not require the ability to receive signals. Some architectures, such as the Processor-in- Memory (PIM) designs [5, 18, 23], suggest threads that are 1. Introduction merely a few instructions included in the thread’s context. While hardware-based lightweight threading constructs Lightweight threading primitives, crucial to large scale are important developments, the methods for exposing such multithreading, are typically either platform dependent or parallelism to the programmer are platform-specific and compiler-dependent. Generic programmer-visible multi- typically rely either on custom compilers [3, 4, 7, 10, 11, threading interfaces, such as pthreads, were designed for 26], entirely new languages [1, 6, 8, 13], or have archi- “reasonable” numbers of threads—less than one hundred or tectural limitations that cannot scale to millions of threads so. In large-scale multithreading situations, the features and [14, 22]. This makes useful comparisons between archi- guarantees provided by these interfaces prevent them from tectures difficult. With a standard way of expressing par- scaling to “unreasonable” numbers of threads (a million or allelism that can be used with existing compilers, compar- more), necessary for multithreaded teraflop-scale problems. ing cross-platform algorithms becomes convenient. For ex- Parallel execution has largely existed two different ample, the MPI standard allows a programmer to create a worlds: the world of the very large, where program- parallel application that is portable to any system provid- mers explicitly create parallel threads of execution, and the ing an MPI library, and different systems can be compared with the same code on each system. Development and study ∗ Sandia is a multiprogram laboratory operated by Sandia Corporation, of large-scale multithreaded applications is limited because a Lockheed Martin Company, for the United States Department of En- ergy’s National Nuclear Security Administration under contract DE-AC04- of the platform-specific nature of the available interfaces. 94AL85000. Having a portable large-scale multithreading interface allows application development on commodity hardware that The equal distribution solution is also simple: divide can exploit the resources available on large-scale systems. the array equally among all of the available processors and Lightweight threading requires a lightweight synchro- spawn a thread for each. Each thread must sum its segment nization model [9]. The model used by the Cray XMT and serially and return the result. The parent thread sums the PIM designs, pioneered by the Denelcor HEP [15], uses ful- return values. This technique is efficient because it matches l/empty bits (FEBs). This technique marks each word in the needed parallelism to the available parallelism, and the memory with a “full” or “empty” state, allows programs processors do minimal communication. However, equal to wait for either state, and makes the state change atomi- distribution is not particularly tolerant of other load imbal- cally with the word’s contents. This technique can be imple- ances: execution is as slow as the slowest thread. mented directly in hardware, as it is in the XMT. Alterna- The lagging-loop model relies upon arbitrary workload tives include ADA-like protected records [24] and fork-sync divisions. It breaks the array into small chunks and spawns [4], which lack a clear hardware analog. a thread for each chunk. Each thread sums its chunk and This paper discusses programming models’ impact then waits for the preceding thread (if any) to return an an- on efficient multithreading and the resource management swer before combining the sum and returning its own total. necessary for those models. It introduces the qthread Eventually the parent thread will do the same with the last lightweight threading API and its Unix implementation. chunk. This model is more efficient than the tree model, The API is designed to be a lightweight threading standard and the number of threads depends on the chunk size. The for current and future architectures. The Unix implementa- increased number of threads makes it more tolerant of load tion is a proof of concept that provides a basis for develop- imbalances, but has more overhead. ing applications for large scale multithreaded architectures. 2.2. Handling resource exhaustion 2. Recursive threaded programming models and resource management These methods differ in the way resources must be man- aged to guarantee forward progress. Whenever new thread Managing large numbers of threads requires managing is requested, one of four things can be done: per-thread resources, even if those requirements are low. 1. Execute the function inline. This management can affect whether multithreaded appli- 2. Create a new thread. cations run to completion and whether they execute faster 3. Create an record that will become a thread later. than an equivalent serial implementation. The worst conse- 4. Block until sufficient resources are available. quence of poor management is deadlock: if more threads are needed than resources are available, and reclaiming In a large enough parallel program, eventually the resources thread resources depends on spawning more threads, the will run out. Requests for new threads must either block un- system cannot make forward progress. til resources become available or must fail and let the program handle the problem. 2.1. Parallel programming models Blocking to wait for resources to become available af- fects each parallel model differently. The lagging loop An illustrative example of the effect the programming method works well with blocking requests, because the model on resource management is the trivial problem of spawned threads don’t rely on spawning more threads. summing integers in an array. A serial solution is trivial: When these threads complete, their resources may be start at the beginning, and tally each sequential number un- reused, and deadlock is easily avoided. The equal distri- til the end of the array. There are at least three parallel ex- bution method has a similar advantage. However, because ecution models that could compute the sum. They can be it avoids using more than the minimum number of threads, referred to as the recursive tree model, the equal distribu- it does not cope as well with load imbalances. tion model, and the lagging-loop model. The recursive tree method gathers a lot of state quickly A recursive tree solution to summing the numbers in an and slowly releases it, making the method particularly sus- array is simple to program: divide the array in half, and ceptible to resource exhaustion deadlock, where all run- spawn two threads to sum up both halves. Each thread does ning threads are blocked spawning more threads. In order the same until its array has only one value, whereupon the to guarantee forward progress, resources must be reserved thread returns that value. Thread that spawned threads wait when threads spawn and threads must execute serially when for their children to return and return the sum of their values. reservation fails. The minimum state that

Qthreads: an API for Programming with Millions of Lightweight Threads

The Essentials of Stackless Python Tuesday, 10 July 2007 10:00 (30 Minutes)

Method, 224 Vs. Typing Module Types

Python Concurrency

Workload Management and Application Placement for the Cray Linux Environment™

Thread-Level Parallelism II

Introducing the Cray XMT

Uwsgi Documentation Release 1.9

Goless Documentation Release 0.6.0

Continuations and Stackless Python

The Gemini Network

Threading and GUI Issues for R

Expert Python Programming Third Edition