Unsynchronized Techniques for Approximate Parallel Computations

Martin C. Rinard
MIT EECS and CSAIL
[email protected]

Abstract

We present techniques for obtaining acceptable unsynchronized parallel computations. Even though these techniques may generate interactions (such as data races) that the current rigid value system condemns as incorrect, they are engineered to 1) preserve key data consistency constraints while 2) producing a result that is accurate enough, often enough. And because they contain no synchronization, they eliminate the synchronization overhead, parallelism reduction, and failure propagation typically associated with synchronization mechanisms.

We also identify a class of computations that interact well with our techniques. These computations combine multiple contributions to obtain a composite result. Redundancy in the contributions enables the unsynchronized computation to produce an acceptably accurate result even though it may, in effect, drop contributions. Thus our unsynchronized techniques (either directly or indirectly) are often acceptably accurate for the same reason as other successful approximate computing techniques such as task skipping and loop perforation.

1. Introduction

To eliminate potentially undesirable interactions (such as data races) that may occur when multiple parallel threads access shared data, developers often augment parallel programs with synchronization. The disadvantages of standard synchronization mechanisms include:

• Synchronization Overhead: The time overhead of executing the synchronization primitives and the space overhead of the state required to implement the primitives.

• Parallelism Reduction: Synchronization can reduce parallelism by forcing one or more threads to wait for other threads.

• Failure Propagation: With standard synchronization mechanisms, threads often wait for another thread to perform a synchronization operation (such as releasing a lock or hitting a barrier) before they can proceed. If the thread does not perform the operation (because of an infinite loop, a crash, or a coding error that omits the operation), the waiting threads can hang indefinitely.

Now, it may not be immediately clear that it is possible to eliminate potentially undesirable interactions in parallel computations without incurring one or more of these disadvantages. We don’t even try — we instead develop approaches in which such interactions may indeed take place as long as they occur infrequently enough for the computation to produce an acceptably accurate result with high enough likelihood. In other words, instead of attempting to deliver a perfect/correct computation, we only aspire to deliver a computation that is accurate enough, often enough.

This change in perspective calls into question the legitimacy of the current rigid value system, which classifies all data races as undesirable [4]. It also opens up new regions of the engineering space and enables us to develop new mechanisms and techniques that deliver advantages unobtainable with previous approaches. Armed with this new perspective, we develop the following concepts, mechanisms, and techniques:

• Integrity and Accuracy Instead of Correctness: The field of program analysis and verification has traditionally focused on obtaining correct programs that always satisfy traditional correctness properties. With this approach, each program is characterized either as correct or incorrect.

We see this classification as counterproductive — there are many incorrect programs that provide worthwhile results for their users. In fact, it is common knowledge that essentially every large software system has errors and is therefore incorrect. One may very reasonably question the utility of a conceptual framework that condemns essentially every successfully deployed artifact as incorrect.

We therefore focus instead on acceptability, in particular two kinds of acceptability: integrity and accuracy. Integrity constraints capture properties that the program must satisfy to execute successfully to produce a well-formed result. Accuracy constraints characterize how accurate the result must be. To date we have largely focused on obtaining programs that always satisfy their integrity constraints but only satisfy their accuracy constraints with high likelihood. But we anticipate that other combinations (for example, programs that satisfy the integrity constraints only with high likelihood) will be appropriate for many usage contexts.

• Semantic Data Races: Data races have traditionally been defined in terms of unsynchronized reads and writes to shared data [4]. This approach misses the point in at least two ways: 1) the absence of data races does not ensure that the computation satisfies any particular useful property (note that inserting a lock acquire and release around every access to shared data eliminates data races but does not change the semantics; see the sketch at the end of this section) and 2) the presence of data races does not ensure that the computation has any undesirable properties. The problem is that the current concept of data races focuses on a narrow aspect of the program’s internal execution that may or may not be relevant to its semantics.

We instead propose to characterize interactions in terms of the acceptability properties of the shared data that they may or may not preserve. If an interaction violates a specified property, we characterize the interaction as a semantic data race with respect to the property that it may violate. With this perspective, synchronization becomes necessary only to the extent that it is required to preserve essential acceptability properties.

Prioritizing different acceptability properties makes it possible to classify potential interactions between parallel threads according to the properties that they may or may not violate. This characterization, in turn, induces a trade-off space for synchronization mechanisms. An expensive synchronization mechanism, for example, may preserve most or all of the properties. A less expensive synchronization mechanism, on the other hand, may preserve fewer properties. The trade-off between the desirability of the properties and the cost of the synchronization mechanisms required to preserve these properties determines the best synchronization mechanism for any given context.

• Synchronization-Free Updates to Shared Data: Developers have traditionally used mutual exclusion synchronization to make updates to shared data execute atomically. We have identified a class of data structures and updates that can tolerate unsynchronized updates (even when the lack of synchronization may cause updates from different parallel threads to interfere). These data structures and updates are characterized by:

– No Synchronization, No Atomicity: The updates operate with no synchronization and provide no atomicity guarantees. There is therefore no synchronization overhead, no parallelism reduction, and parallel threads continue to execute regardless of what happens in other threads.

– Preservation of Key Consistency Properties: The data structures come with a set of standard consistency properties. The data structure clients, however, can often use the data structures successfully even if some of these consistency properties do not hold. Even though the unsynchronized updates may violate some of these properties, they preserve the key properties that the clients require to execute successfully (i.e., without crashing) to produce a result that is accurate enough, often enough.

The updates we consider in this paper have the property that even if an update may perform multiple writes to shared data, each individual write preserves the key consistency properties. This feature ensures that these properties hold in all possible executions. In general, however, we believe it will also be productive to consider updates that use data structure repair to repair temporary violations of the key consistency properties or updates that ensure the key properties hold only with high likelihood.

– Acceptably Infrequent Interactions: In theory, it may be possible for non-atomic interactions to occur frequently enough to produce unacceptably inaccurate results. In practice, however, such potentially undesirable interactions occur infrequently enough so that the result is acceptably accurate.

– Internal Integrity: The unsynchronized updates never crash or perform an illegal operation such as an out of bounds access.
• Barrier Elimination: We also discuss a mechanism that eliminates barrier idling — no thread ever waits at a barrier [23]. The mechanism is simple. The computation contains a pool of threads that are cooperating to execute a group of tasks that comprise a parallel computation phase. The standard barrier synchronization mechanism would force all threads to wait until all tasks have been completed. Our alternate mechanism instead has all threads proceed on to the next phase (potentially dropping the task they are working on) as soon as there are too few tasks to keep all threads busy. The net effect is 1) to eliminate barrier idling (when threads wait at a barrier for other threads to finish executing their tasks) by 2) dropping or delaying a subset of the tasks in the computational phase to produce 3) an approximate but accurate enough result.

We note that both our synchronization-free data structures and barrier elimination mechanisms exploit the fact that many programs can produce acceptably accurate results even though they execute only a subset of the originally specified computation. This property is the key to the success of such fundamental approximate computing techniques as task skipping [22, 23] and loop perforation [15, 16, 25, 27]. Our synchronization elimination mechanisms deliver another form of approximate computing.
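To make the semantic data race discussion concrete, consider the following sketch (illustrative code, not one of the data structures we evaluate; it uses C++11 threads and a variable named total for exposition). Every individual access to the shared variable is protected by a lock, so by the traditional definition the code contains no data races:

#include <mutex>

std::mutex m;
double total = 0.0;

void add(double value) {
  double tmp;
  {
    std::lock_guard<std::mutex> g(m);  // locked read of the shared data
    tmp = total;
  }
  tmp = tmp + value;  // local computation between the two critical sections
  {
    std::lock_guard<std::mutex> g(m);  // locked write of the shared data
    total = tmp;
  }
}

Two concurrent add operations can still interleave their read, compute, and write steps and drop one contribution. The code is data-race-free but semantically unchanged, which is precisely why we characterize interactions by the acceptability properties they may violate rather than by the presence or absence of unsynchronized accesses.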

2. Unsynchronized Data Structure Updates

We next present several data structures with corresponding synchronization-free updates. We characterize the effect of these updates according to the consistency properties they are guaranteed to preserve and the consistency properties their interactions may violate. We identify two kinds of consistency properties:

• History-Sensitive: Consistency properties that depend on the history of operations applied to the data structure. For example, a history-sensitive property of a binary search tree might state that each element inserted into the tree is present in the tree.

• Instantaneous: Consistency properties that depend only on the current state of the data structure. For example, an instantaneous property of a binary search tree might state that all elements to the left of a given element are less than that element, while all elements to the right are greater than the element.

We note that there is a technical detail associated with data structure consistency properties in the presence of parallelism — namely, that the consistency properties are most naturally stated with respect to a quiescent data structure state in which there are no ongoing operations. In a parallel computation such a state may never exist (i.e., the data structure may always have at least one ongoing operation). Strictly speaking, it may therefore be necessary to reason about data structure states that may result if all operations are allowed to complete with no new operations allowed to start.

2.1 An Accumulator

We start by considering a simple accumulator with a single add operation:

double accumulator;

void add(double value) {
  accumulator = accumulator + value;
}

The add operation reads the accumulator, adds value, then stores the sum back into the accumulator. Note that if two threads simultaneously perform an add operation, it is possible for both threads to read the accumulator, add their values, then write the sum back into the accumulator. The net result is that the accumulator includes only one, but not both, of the values. This has traditionally been considered to be a classical data race and concurrency error.

From our perspective, this interaction violates a history-sensitive property, namely that the accumulator include all of the values from the add operations performed on it in the past. The interaction is therefore a semantic data race because it violates this history-sensitive property.

But on machines with atomic reads and writes, the interaction does not violate the instantaneous property that the accumulator contains a number. If the client of the accumulator does not require the sum to be exact, and if the interaction above occurs infrequently enough, the unsynchronized add operation presented above may provide an acceptably accurate result (even if the accumulator does not contain some of the added values) [13–15].
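The following small harness illustrates this behavior (the harness is for exposition only; it uses C++11 threads and illustrative parameters, and it assumes the machine model above in which reads and writes of doubles are atomic, an assumption that a strict C++ implementation would require std::atomic to guarantee):

#include <cstdio>
#include <thread>
#include <vector>

const int kThreads = 8;              // illustrative parameters
const int kAddsPerThread = 1000000;

double accumulator = 0.0;

void add(double value) {
  accumulator = accumulator + value;  // unsynchronized read-modify-write
}

int main() {
  std::vector<std::thread> workers;
  for (int t = 0; t < kThreads; t++) {
    workers.emplace_back([] {
      for (int i = 0; i < kAddsPerThread; i++) add(1.0);
    });
  }
  for (auto &w : workers) w.join();
  // The shortfall counts the contributions dropped by interactions.
  std::printf("expected %d, got %.0f\n", kThreads * kAddsPerThread, accumulator);
  return 0;
}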

This example illustrates a basic pattern that is repeated in more complex data structures such as lists, trees, and graphs. Many data structures contain operations that conceptually add elements into the data structure. It is often possible to obtain unsynchronized updates that may drop added elements (thereby violating the history-sensitive property that the data structure contains all of the added elements) but otherwise leave the data structure consistent (i.e., the data structure respects all instantaneous consistency properties). For example, standard sequential (unsynchronized) implementations of operations that add elements to list, tree, and graph data structures, when run in parallel without synchronization, may drop elements but preserve the instantaneous consistency properties of the data structures.

2.2 A Growable Array

We next consider a growable array data structure with an add operation:

class array {
public:
  double *values;
  int next;
  int length;
};

struct array *current;

void add(double value) {
  array *c = current;
  int l = c->length;
  int n = c->next;
  if (l <= n) {
    // grow: copy the existing elements into a larger array
    double *v = new double[l*2];
    for (int i = 0; i < l; i++) {
      v[i] = c->values[i];
    }
    array *a = new array;
    a->values = v;
    a->length = l*2;
    a->next = n;
    current = a;  // single write that commits the grow operation
    c = a;
  }
  c->values[n] = value;
  c->next = n+1;  // single write that commits the insertion
}

A key element of the acceptability of this data structure in the presence of parallel unsynchronized updates is that each write preserves the instantaneous consistency properties of the underlying data structure. This element ensures that the data structure always preserves these consistency properties regardless of the way the parallel writes interleave. In this case the key consistency property is that current->next must always be less than current->length.

This key element ensures the internal integrity of the add operation. To avoid out of bounds accesses, it must always be the case that n is less than the length of c->values when the statement c->values[n] = value executes. We ensure that this property is true for every array that current references by setting c->length to the length of c->values when c is created (it is critical that this initialization takes place without interference from other threads), then never changing c->length.

Note that replacing the statement c->values[n] = value with c->values[c->next] = value produces an add operation that may violate internal integrity with an out of bounds access — it is possible for another add operation to change c->next in between the check to see if c->next is in bounds and the use of c->next to insert the next value into the array.

There are many ways this unsynchronized data structure can drop added values, for example:

• Simultaneous Insertion: Two add operations can retrieve the same c->next and insert their two values into the same array element.

• Simultaneous Grow and Insertion: Two add operations may determine that there is no room in the current array. Both then start to execute the code that grows the array. One finishes, other threads insert their values into the new array, then the second grow code finishes and installs a new version of the array without the inserted elements.

If these kinds of interactions occur infrequently enough, the dropped elements can have an acceptable effect on the accuracy.

2.3 Single Write Commits

We note that, conceptually, each modification to the growable array commits with a single write — either at the statement current = a (this statement commits the array grow operation) or at the statement c->next = n+1 (this statement commits the insertion of value into the array). Unsynchronized updates with this property tend to work well — if modifications require multiple writes to commit simultaneously, it may be difficult to order the writes in such a way that each write preserves the relevant consistency properties. In this case, it may be possible for operations to interleave in a way that produces an inconsistent data structure with some modifications from one operation and others from another operation.
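As an illustration of the multiple-write problem (a constructed example, not one of our data structures), consider a running mean that maintains a sum and a count that must commit together:

double sum = 0.0;
int count = 0;

void add(double value) {
  // Two separate commit writes: interleaved add operations can lose
  // the update to one field but not the other, leaving a sum and
  // count pair that no sequential execution could have produced.
  sum = sum + value;
  count = count + 1;
}

There is no order for these two writes in which each individual write preserves the consistency of the sum/count pair. The growable array avoids this problem because each modification becomes visible through exactly one store.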
2.4 Final Check

We call the time between when the add operation decides to update the data structure and the time when the update has actually taken place the window of vulnerability (because this is the time when other operations can interfere and cause elements to be dropped). One way to make the window of vulnerability smaller is to recheck the original condition just before performing the update. If the original condition still holds, the update proceeds. Otherwise, the operation retries in the new state.

The following code uses a final check to make the window of vulnerability smaller:

class array {
public:
  double *values;
  int next;
  int length;
};

struct array *current;

void add(double value) {
  array *c = current;
  int l = c->length;
  int n = c->next;
  while (l <= n) {
    double *v = new double[l*2];
    for (int i = 0; i < l; i++) {
      v[i] = c->values[i];
    }
    array *a = new array;
    a->values = v;
    a->length = l*2;
    a->next = n;
    // final check
    if (current == c) {
      current = a;
      c = a;
      l = c->length;  // pick up the new length so the loop can exit
    } else {
      // another thread changed current; retry in the new state
      c = current;
      l = c->length;
      n = c->next;
    }
  }
  c->values[n] = value;
  c->next = n+1;
}
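One can measure how often these interactions actually drop elements by driving either version of add from multiple threads and comparing the number of committed elements against the number of attempted insertions. The harness below is a sketch under the same atomic read/write machine model as before (the parameters and initial array size are illustrative):

#include <cstdio>
#include <thread>
#include <vector>

// assumes the array class, current, and add() defined above

const int kThreads = 8;
const int kAddsPerThread = 100000;

int main() {
  current = new array;
  current->values = new double[1024];
  current->length = 1024;
  current->next = 0;

  std::vector<std::thread> workers;
  for (int t = 0; t < kThreads; t++) {
    workers.emplace_back([] {
      for (int i = 0; i < kAddsPerThread; i++) add((double) i);
    });
  }
  for (auto &w : workers) w.join();

  int attempted = kThreads * kAddsPerThread;
  // Elements lost to simultaneous insertions or grows show up as
  // the difference between attempted and committed insertions.
  std::printf("attempted %d, committed %d, dropped %d\n",
              attempted, current->next, attempted - current->next);
  return 0;
}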

2.5 Parallel Space Subdivision Tree Construction

We have used the ideas presented above, along with data structure repair, to obtain a synchronization-free parallel space subdivision tree construction algorithm for the Barnes-Hut N-body simulation algorithm [2, 21]. Eliminating synchronization improves the performance of the parallel algorithm by approximately 20% with negligible impact on the overall accuracy of the simulation. The use of final checks to reduce the sizes of various windows of vulnerability significantly reduces the number of dropped bodies (each operation in the tree construction algorithm inserts a body into the space subdivision tree under construction).

3. Connections With Other Techniques

Task skipping [22, 23] and loop perforation [15, 16, 25, 27] provide benefits such as increased robustness and fault tolerance, improved performance and energy consumption, and the elimination of idle time in parallel computations, all by dropping subcomputations.

Many data structures exist to collect contributions from multiple subcomputations, with each update integrating a contribution into the data structure. Interactions between unsynchronized updates may drop contributions, with the same effect on the result as dropping the subcomputations that produced the contributions.

Other data structures exist to organize the computation, with each element in the data structure corresponding to one or more subcomputations. The elements in the Barnes-Hut space subdivision tree, for example, correspond to computations that calculate part of the force acting on each body in the simulation. If an unsynchronized update drops one or more elements, the net effect is often to drop the corresponding subcomputation.

Unsynchronized updates therefore often have, indirectly, the same effect that task skipping and loop perforation have directly — in some cases the unsynchronized update drops a contribution from a complete subcomputation, in others a dropped element corresponds to a dropped subcomputation. Task skipping, loop perforation, and unsynchronized updates therefore often share the same underlying reason behind their success.

It is also possible to exploit this phenomenon to obtain a parallelizing compiler that generates code with data races that may (acceptably) drop updates to shared data [13, 14]. Again, the reason is similar — dropping updates has the same end effect as dropping the subcomputation that generated the update.

We note that all of these mechanisms tend to work well with computations that generate many contributions, then combine the contributions to obtain a composite result. This broad category of computations includes many modern search, financial, and scientific computations. Because of redundancy in the many contributions to the result, the computations can typically generate an acceptably accurate result with a subset of the contributions. Many of the computations that successfully tolerate loop perforation, task skipping, and unsynchronized updates tend to exhibit this pattern.
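Loop perforation is easy to state in code. The sketch below (an illustrative example with an arbitrary perforation rate, not taken from the cited systems) skips half of the iterations that generate contributions and rescales; the redundancy among contributions is what keeps the result acceptably accurate:

// Original loop: combine the contributions of all n items.
double total(const double *contribution, int n) {
  double sum = 0.0;
  for (int i = 0; i < n; i++) {
    sum += contribution[i];
  }
  return sum;
}

// Perforated loop: execute every other iteration, then extrapolate
// by the perforation factor to compensate for skipped contributions.
double total_perforated(const double *contribution, int n) {
  double sum = 0.0;
  for (int i = 0; i < n; i += 2) {
    sum += contribution[i];
  }
  return 2.0 * sum;
}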
4. Performance Consequences

One of the primary performance consequences of unsynchronized updates is the elimination of the performance overhead from synchronization operations. The overall amortized performance impact depends on 1) how frequently the computation performs updates, and 2) how efficiently the underlying computational platform implements synchronization primitives.

The bottom line is that it should be possible to engineer a computational platform with negligible synchronization overhead. Whether vendors will choose to build such a platform depends on whether the performance benefits justify the engineering cost. In one plausible scenario, the cost of synchronization in typical systems will not justify maximal engineering effort — synchronization will be infrequent enough so that vendors will provide reasonably but not maximally efficient synchronization primitives (as is the case with current platforms). The following analysis assumes the computational platform is multithreaded and implements a shared address space (but similar issues arise on other platforms).

Mutual exclusion synchronization is typically tied to data access. In general, local data access is fast, while remote data access is slow. Standard shared address space caching protocols obtain exclusive local access to data when it is written (anticipating further accesses that should be made local for optimal performance). Similarly, synchronization primitives obtain local access to the synchronization state.

A synchronized update to locally available data typically proceeds as follows: 1) a local synchronization operation (typically a lock acquire), 2) local accesses to shared data, then 3) a local synchronization operation (typically a lock release). Because the synchronization operations are local, they can be engineered to execute essentially as quickly as the local data accesses. So the overhead is determined by the size of the local data accesses (which are often large enough to profitably amortize the synchronization overhead) and how often the computation executes the update.

A synchronized update to remote data typically proceeds as follows: 1) a remote synchronization operation (which obtains exclusive local access to the synchronization state), 2) remote access to shared data (which obtains exclusive local access to the updated data structure state), and 3) a local synchronization operation. Once again, the synchronization overhead is determined by the size of shared data accesses and how often the computation executes the update.

It is possible to eliminate much of the synchronization overhead in the remote case by combining the synchronization operation with the remote access of both the synchronization state and shared data state. Optimistic synchronization primitives (such as load linked/store conditional) implement this combination. It is also possible to imagine other mechanisms (such as prefetching the shared data state along with the synchronization state at the first synchronization operation).
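As a sketch of how an optimistic primitive combines the synchronization with the data access (C++ exposes compare-and-swap rather than load linked/store conditional, so the sketch uses std::atomic as a stand-in):

#include <atomic>

std::atomic<double> accumulator{0.0};

void add(double value) {
  double old = accumulator.load();
  // compare_exchange_weak commits old + value only if accumulator
  // still holds old; on failure it reloads the current value into
  // old, and the operation retries with the fresh value.
  while (!accumulator.compare_exchange_weak(old, old + value)) {
    // retry with the updated value of old
  }
}

Unlike the unsynchronized add operation of Section 2.1, this version never drops contributions; it instead pays for exactness with retries under contention.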

A final consideration is how efficiently the synchronization primitives are engineered. Because such primitives occur fairly infrequently in current and envisioned multithreaded computations, they tend to be less intensively engineered than more common primitives (such as loads, stores, and arithmetic primitives). In summary:

• It is possible to implement computational platforms that provide synchronization with little or no overhead.

• But current usage patterns and incentives do not support the engineering of such platforms.

• So there are likely to be some performance gains available from unsynchronized updates. We would expect the 20% performance gain we observed for the unsynchronized parallel Barnes-Hut space subdivision tree construction algorithm [21] to be generally representative of the benefits available for computations that perform frequent updates to potentially shared data.

• Of course, if the underlying synchronization primitives are not implemented reasonably efficiently (for example, if they perform a remote operation even when the shared data is available locally), then much more substantial gains should be possible.

5. Eliminating Barrier Synchronization

We now turn our attention to eliminating barrier synchronization. The basic idea is simple — ensure that threads can proceed beyond the barrier without waiting for other threads to continue. If the other threads drop their tasks, the effect is to terminate the computational phase (the barrier marks the end of this phase) early. If the other threads continue to execute, the phases overlap. To avoid having the threads become completely out of phase, it may be desirable to occasionally resynchronize.

We would expect this approach to work well when one phase produces data (typically by combining multiple contributions into a composite result) that the next phase consumes. In this case we again see that the effect of the synchronization elimination is to drop some of the subcomputations. If the parallel tasks from adjacent phases overlap, the subcomputations are not dropped but are simply late. In this case the second phase may observe a fully computed result if the subcomputations arrive from the completion of tasks in the first phase by the time it accesses the result.

In most cases the computational phases are large enough to make the amortized overhead of the barrier operation itself negligible — the performance impact comes from load imbalances as threads wait at the barrier for other threads to complete and hit the barrier.
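A minimal sketch of this mechanism follows (our published mechanism [23] differs in detail; the shared task counter, the cutoff policy, and the names here are illustrative). Threads claim tasks from a shared counter and simply move on to the next phase, dropping the remaining tasks, once fewer tasks remain than there are threads to run them:

#include <atomic>

const int kNumThreads = 8;
const int kNumTasks = 1024;
std::atomic<int> next_task{0};

void execute_task(int t) { /* the work for task t */ }
void run_next_phase()    { /* the computation after the eliminated barrier */ }

void worker() {
  int t;
  // Claim tasks until too few remain to keep every thread busy;
  // the unclaimed tail of the phase is simply dropped.
  while ((t = next_task.fetch_add(1)) < kNumTasks - kNumThreads) {
    execute_task(t);
  }
  // Proceed immediately: no thread ever idles at a phase boundary.
  run_next_phase();
}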
6. Related Work

Early work on synchronization elimination focused on lock coarsening — replacing repeated adjacent lock acquires and releases with a single acquire and release [7, 8, 10, 11], potentially with dynamic feedback to select the optimal lock granularity as the computation executes [9, 12]. The goal is to eliminate the overhead associated with the lock acquire and release primitives. The transformation may also reduce the available concurrency by increasing the size of the region over which the lock is held. A similar approach can reduce synchronization overhead in sequential executions of concurrent object-oriented programs [17]. Unlike the synchronization removal techniques discussed in this paper, lock coarsening preserves the semantics of the original program.

Early Java implementations had significant synchronization overhead. This overhead was caused by the combination of 1) many Java data structures were thread-safe and always synchronized their operations and 2) the Java implementation of the synchronization primitives was fairly inefficient. In response, researchers developed escape analyses that located objects accessed by only a single thread [1, 3, 5, 6, 24, 26]. The compiler then removed the synchronization otherwise associated with such objects. Java implementations have since evolved to have more efficient synchronization primitives and to give the developer explicit control over which data structures are synchronized.

Many of the computations that work well with unsynchronized updates and the elimination of barrier synchronization combine multiple contributions to obtain a result. In many cases it is possible to improve the performance of such updates by using optimistic synchronization primitives such as load linked/store conditional [18, 20]. It is also possible to detect heavily contended data by observing contention for the lock that makes updates to that data atomic. Threads then eliminate the contention by performing unsynchronized updates on local replicas, then combining the replicas at the end of the computational phase to obtain the result [13, 14, 19]. The elimination of remote accesses and synchronization overhead can make this technique extremely effective in enabling efficient parallel execution.

7. Conclusion

Synchronized parallel computations can suffer from synchronization overhead, parallelism reduction, and failure propagation when a thread fails to execute a synchronization operation. We present a set of unsynchronized techniques that eliminate these effects. These techniques are at the forefront of the emerging field of approximate computing, which exploits the freedom that many programs have to deliver results that are only accurate enough, often enough. The success of these techniques casts doubt on the legitimacy of the current rigid value system [4], which condemns these techniques as unsound and incorrect even though they produce successful executions.

References

[1] J. Aldrich, C. Chambers, E. Sirer, and S. Eggers. Static analyses for eliminating unnecessary synchronization from Java programs. In Proceedings of the 6th International Static Analysis Symposium, September 1999.

[2] J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324(4):446–449, 1986.

[3] B. Blanchet. Escape analysis for object-oriented languages: Application to Java. In Proceedings of the 14th Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications, Denver, CO, November 1999.

[4] Hans-Juergen Boehm and Sarita V. Adve. You don’t know jack about shared variables or memory models. Commun. ACM, 55(2):48–54, 2012.

[5] J. Bogda and U. Hoelzle. Removing unnecessary synchronization in Java. In Proceedings of the 14th Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications, Denver, CO, November 1999.

[6] J. Choi, M. Gupta, M. Serrano, V. Sreedhar, and S. Midkiff. Escape analysis for Java. In Proceedings of the 14th Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications, Denver, CO, November 1999.

[7] P. Diniz and M. Rinard. Synchronization transformations for parallel computing. Concurrency Practice & Experience, 11(13):773–802, November 1999.

[8] P. Diniz and M. Rinard. Lock coarsening: Eliminating lock overhead in automatically parallelized object-based programs. In Languages and Compilers for Parallel Computing, Ninth International Workshop, pages 285–299, San Jose, CA, August 1996. Springer-Verlag.

[9] P. Diniz and M. Rinard. Dynamic feedback: An effective technique for adaptive computing. In Proceedings of the SIGPLAN ’97 Conference on Program Language Design and Implementation, pages 71–84, Las Vegas, NV, June 1997.

[10] P. Diniz and M. Rinard. Synchronization transformations for parallel computing. In Proceedings of the 24th Annual ACM Symposium on the Principles of Programming Languages, pages 187–200, Paris, France, January 1997.

[11] P. Diniz and M. Rinard. Lock coarsening: Eliminating lock overhead in automatically parallelized object-based programs. Journal of Parallel and Distributed Computing, 49(2):218–244, March 1998.

[12] P. Diniz and M. C. Rinard. Eliminating synchronization overhead in automatically parallelized programs using dynamic feedback. ACM Transactions on Computer Systems, 17(2):89–132, May 1999.

[13] S. Misailovic, D. Kim, and M. Rinard. Parallelizing sequential programs with statistical accuracy tests. ACM Transactions on Embedded Computing Systems. To appear.

[14] S. Misailovic, D. Kim, and M. Rinard. Parallelizing sequential programs with statistical accuracy tests. Technical Report MIT-CSAIL-TR-2010-038, MIT, August 2010.

[15] Sasa Misailovic, Daniel M. Roy, and Martin C. Rinard. Probabilistically accurate program transformations. In SAS, pages 316–333, 2011.

[16] Sasa Misailovic, Stelios Sidiroglou, Henry Hoffmann, and Martin C. Rinard. Quality of service profiling. In ICSE (1), pages 25–34, 2010.

[17] J. Plevyak, X. Zhang, and A. Chien. Obtaining sequential efficiency for concurrent object-oriented languages. In Proceedings of the 22nd Annual ACM Symposium on the Principles of Programming Languages, San Francisco, CA, January 1995.

[18] M. Rinard. Effective fine-grain synchronization for automatically parallelized programs using optimistic synchronization primitives. In Proceedings of the 6th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 112–123, Las Vegas, NV, June 1997.

[19] M. Rinard and P. Diniz. Eliminating synchronization bottlenecks in object-based programs using adaptive replication. In 1999 ACM International Conference on Supercomputing, pages 83–92, Rhodes, Greece, June 1999.

[20] M. C. Rinard. Effective fine-grain synchronization for automatically parallelized programs using optimistic synchronization primitives. ACM Transactions on Computer Systems, 17(4):337–371, November 1999.

[21] Martin Rinard. A lossy, synchronization-free, race-full, but still acceptably accurate parallel space-subdivision tree construction algorithm. Technical Report MIT-CSAIL-TR-2012-005, MIT, February 2012.

[22] Martin C. Rinard. Probabilistic accuracy bounds for fault-tolerant computations that discard tasks. In ICS, pages 324–334, 2006.

[23] Martin C. Rinard. Using early phase termination to eliminate load imbalances at barrier synchronization points. In OOPSLA, pages 369–386, 2007.

[24] A. Salcianu and M. Rinard. Pointer and escape analysis for multithreaded programs. In Proceedings of the 8th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Snowbird, UT, June 2001.

[25] Stelios Sidiroglou-Douskos, Sasa Misailovic, Henry Hoffmann, and Martin C. Rinard. Managing performance vs. accuracy trade-offs with loop perforation. In SIGSOFT FSE, pages 124–134, 2011.

[26] J. Whaley and M. Rinard. Compositional pointer and escape analysis for Java programs. In Proceedings of the 14th Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications, Denver, CO, November 1999.

[27] Zeyuan Allen Zhu, Sasa Misailovic, Jonathan A. Kelner, and Martin C. Rinard. Randomized accuracy-aware program transformations for efficient approximate computations. In POPL, pages 441–454, 2012.
