Speculative Decoupled Software Pipelining

Neil Vachharajani Ram Rangan† Easwaran Raman Matthew J. Bridges Guilherme Ottoni David I. August

Department of Computer Science Princeton University Princeton, NJ 08540 {nvachhar,ram,eraman,mbridges,ottoni,august}@princeton.edu

Abstract

In recent years, microprocessor manufacturers have shifted their focus from single-core to multi-core processors. To avoid burdening programmers with the responsibility of parallelizing their applications, some researchers have advocated automatic thread extraction. A recently proposed technique, Decoupled Software Pipelining (DSWP), has demonstrated promise by partitioning loops into long-running, fine-grained threads organized into a pipeline. Using a pipeline organization and execution decoupled by inter-core communication queues, DSWP offers increased execution efficiency that is largely independent of inter-core communication latency.

This paper proposes adding speculation to DSWP and evaluates an automatic approach for its implementation. By speculating past infrequent dependences, the benefit of DSWP is increased by making it applicable to more loops, facilitating better balanced threads, and enabling parallelized loops to be run on more cores. Unlike prior speculative threading proposals, speculative DSWP focuses on breaking dependence recurrences. By speculatively breaking these recurrences, instructions that were formerly restricted to a single thread to ensure decoupling are now free to span multiple threads. Using an initial automatic compiler implementation and a validated processor model, this paper demonstrates significant gains using speculation for 4-core chip multiprocessor models running a variety of codes.

† Currently with the Performance and Tools Group, IBM Austin Research Laboratory, [email protected].

1 Introduction

For years, steadily increasing clock speeds and uniprocessor microarchitectural improvements reliably enhanced performance for a wide range of applications. Although this approach has recently faltered, the exponential growth in transistor count remains strong, leading microprocessor manufacturers to add value by producing chips that incorporate multiple processors. Multi-core processors can improve system throughput and speed up multithreaded applications, but single-threaded applications receive no benefits.

While the task of producing multithreaded code could be left to the programmer, there are several disadvantages to this approach. First, writing multithreaded codes is inherently more difficult than writing single-threaded codes. Multithreaded programming requires programmers to reason about concurrent accesses to shared data and to insert sufficient synchronization to ensure proper behavior while still permitting enough parallelism to improve performance. Active research in automatic tools to identify deadlock, livelock, and race conditions [3, 4, 5, 14, 21] in multithreaded programs is a testament to the difficulty of this task. Second, there are many legacy applications that are single-threaded. Even if the source code for these applications is available, it would take enormous programming effort to translate these programs into well-performing parallel versions.

A promising alternative approach for producing multithreaded codes is to let the compiler automatically convert single-threaded applications into multithreaded ones. This approach is attractive as it takes the burden of writing multithreaded code off the programmer. Additionally, it allows the compiler to automatically adjust the amount and type of parallelism extracted depending on the underlying architecture, just as instruction-level parallelism (ILP) optimizations relieved the programmer of the burden of targeting complex single-threaded architectures. Unfortunately, compilers have been unable to extract thread-level parallelism (TLP) despite the pressing need. While success of this type has not been achieved yet, progress has been made. Techniques dedicated to parallelizing scientific and numerical applications, such as DOALL and DOACROSS, are used routinely in such domains with good results [10, 25]. Such techniques perform well on counted loops manipulating very regular and analyzable structures consisting mostly of predictable array accesses. Since these techniques were originally proposed for scientific applications, they generally do not handle loops with arbitrary control flow or unpredictable data access patterns that are the norm for general-purpose applications.

Because dependences tend to be the limiting factor in extracting parallelism, speculative techniques, loosely classified as thread-level speculation (TLS), have dominated the literature [1, 2, 8, 9, 11, 12, 15, 22, 23, 26, 28]. Speculating dependences that prohibit DOALL or DOACROSS parallelization increases the amount of parallelism that can be extracted. Unfortunately, speculating enough dependences to create DOALL parallelization often leads to excessive misspeculation. Additionally, as will be discussed in Section 2, core-to-core communication latency combined with the communication pattern exhibited by DOACROSS parallelization often negates the parallelism benefits offered by both speculative and non-speculative DOACROSS parallelization.

Decoupled Software Pipelining (DSWP) [16, 20] approaches the problem differently. Rather than partitioning a loop by placing distinct iterations in different threads, DSWP partitions the loop body into a pipeline of threads, ensuring that critical path dependences are kept thread-local. The parallelization is tolerant of both variable latency within each thread and long communication latencies between threads. Since the existing DSWP algorithm is non-speculative, it must respect all dependences in the loop. Unfortunately, this means many loops cannot be parallelized with DSWP.

In this paper, we present Speculative Decoupled Software Pipelining (SpecDSWP) and an initial automatic compiler implementation of it. SpecDSWP leverages the latency-tolerant pipeline of threads characteristic of DSWP and combines it with the power of speculation to break dependence recurrences that inhibit DSWP parallelization. Like DSWP, SpecDSWP exploits the fine-grained pipeline parallelism hidden in many applications to extract long-running, concurrently executing threads, and can do so on more loops than DSWP. The speculative, decoupled threads produced by SpecDSWP increase execution efficiency and may also mitigate design complexity by reducing the need for low-latency inter-core communication. Additionally, since effectively extracting fine-grained pipeline parallelism often requires in-depth knowledge of many microarchitectural details, the compiler's automatic application of SpecDSWP frees the programmer from difficult and even counter-productive involvement at this level.

Using an initial automatic compiler implementation and a validated processor model, this paper demonstrates that SpecDSWP provides significant performance gains for a multi-core processor running a variety of codes. Ottoni et al. demonstrated DSWP's initial promise by applying it to 9 key application loops [16]. Here, we extend their work by extracting significant speedup, 40% on average, from four additional key application loops using only control and silent store speculation.

In summary, the contributions of this paper are:
• A new technique, Speculative Decoupled Software Pipelining, that can extract parallel threads from previously unparallelizable loops.
• A compiler implementation of SpecDSWP.
• An evaluation of several general-purpose applications on a cycle-accurate hardware simulator.

The rest of the paper is organized as follows. Section 2 examines existing speculative and non-speculative parallelization techniques to put this work in context. Section 3 provides an overview of the original non-speculative DSWP technique, which is followed by a discussion of the speculative DSWP technique in Section 4. Sections 5 and 6 detail how dependences are chosen for speculation and how misspeculation is handled. Section 7 provides experimental results, and Section 8 concludes the paper.

2 Motivation

Three primary non-speculative loop parallelization techniques exist: DOALL [10], DOACROSS [10], and DSWP [16, 20]. Of these techniques, only DOALL parallelization yields speedup proportional to the number of cores available to run the code. Unfortunately, in general-purpose codes, DOALL is often inapplicable. For example, consider the code shown in Figure 1. In the figure, program dependence graph (PDG) edges that participate in dependence recurrences are shown as dashed lines. Since the statements on lines 3, 5, and 6 are each part of a dependence recurrence, consecutive loop iterations cannot be independently executed in separate threads. This makes DOALL inapplicable.

DOACROSS and DSWP, however, are both able to parallelize the loop. Figures 2(a) and 2(b) show the parallel execution schedules for DOACROSS and DSWP respectively. These figures use the same notation as Figure 1, except the nodes are numbered with both static instruction numbers and loop iteration numbers. After an initial pipeline fill time, both DOACROSS and DSWP complete one iteration every other cycle. This provides a speedup of 2 over single-threaded execution. DOACROSS and DSWP differ, however, in how this parallelism is achieved. DOACROSS alternately schedules entire loop iterations on processor cores. DSWP, on the other hand, partitions the loop code, and each core is responsible for a particular piece of the loop across all the iterations.

The organization used by DOACROSS forces it to communicate dependences that participate in recurrences (dashed lines in the figure) from core to core. This puts communication latency on the critical path. DSWP's organization allows it to keep these dependences thread-local (in fact the algorithm requires it), thus avoiding communication latency on the critical path.

Figure 1. Parallelizable loop: (a) loop, (b) PDG.

Figure 3. Speculatively parallelizable loop: (a) loop, (b) PDG.
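For reference, the following sketch reconstructs the loops these two figures depict, based on the surrounding discussion. The statement text of lines 4 and 5 and the threshold name T are assumptions made for illustration; doit() stands for the per-node cost computation. Line numbers are kept because the text refers to them.

    /* Figure 1 (sketch): traverse a list and accumulate a cost. */
    1  cost = 0;
    2  node = list->head;
    3  while (node) {
    4    ncost = doit(node);     /* assumed statement text */
    5    cost += ncost;
    6    node = node->next;
    7  }

    /* Figure 3 (sketch): identical, except the loop may exit early once the
     * accumulated cost exceeds a threshold T (assumed name). */
    3  while (cost < T && node) {

In Figure 1, lines 3 and 6 form the pointer-chasing recurrence and line 5 forms the cost recurrence. In Figure 3, the early-exit test makes every statement control-dependent on line 3 while the cost feeds back into the test, merging all of the statements into a single recurrence.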


Figure 2. DOACROSS and DSWP schedules: (a) DOACROSS and (b) DSWP with a communication latency of 1 cycle; (c) DOACROSS and (d) DSWP with a communication latency of 2 cycles.

Figure 4. TLS and SpecDSWP schedules: (a) TLS, (b) SpecDSWP.

Figures 2(c) and 2(d) show the execution schedules if communication latency were increased by one cycle. Notice that DSWP still completes one iteration every two cycles; only its pipe-fill time increased. DOACROSS, however, now only completes one iteration every three cycles. While not shown in the figure, by using decoupling queues, DSWP is also tolerant of variable latency within each thread. Often, this increases overall memory-level parallelism (MLP) by overlapping misses in different threads. These features make DSWP a promising approach to automatic thread extraction.

Unfortunately, neither DOACROSS nor DSWP is able to parallelize all loops. For example, consider the loop shown in Figure 3, which is identical to the loop in Figure 1 except that this loop can exit early if the computed cost exceeds a threshold. Since all the loop statements participate in a single dependence recurrence (they form a single strongly-connected component in the dependence graph), DSWP is unable to parallelize the loop. For similar reasons, a DOACROSS parallelization would obtain no speedup; no iteration can begin before its predecessor has finished.

In response to this, there have been many proposals for thread-level speculation (TLS) techniques which speculatively break various loop dependences [1, 2, 8, 9, 11, 12, 15, 22, 23, 26, 28]. Once these dependences are broken, DOACROSS and sometimes even DOALL parallelization is possible. In the example, if TLS speculatively breaks the loop exit control dependences (the dependences originating from statement 3), then the execution schedule shown in Figure 4(a) is possible. This parallelization offers a speedup of 4 over single-threaded execution. Notice, however, that just as for non-speculative DOACROSS parallelization, TLS must communicate dependences that participate in recurrences. Just as before, this places communication latency on the critical path. For many codes, this fact coupled with the non-unit communication latency typical of multi-core architectures negates most, if not all, of the expected parallelism.

The Speculative DSWP¹ technique described in this paper can be considered the DSWP analogue of TLS. Just as adding speculation to DOALL and DOACROSS expanded their applicability, adding speculation to DSWP allows it to parallelize more loops. Figure 4(b) shows the execution schedule achieved by applying SpecDSWP to the loop in Figure 3. Similar to the TLS parallelization, SpecDSWP offers a speedup of 4 over single-threaded execution. However, SpecDSWP retains the pipelined organization of DSWP. Notice, in the execution schedule, that all dependence recurrences are kept thread-local. This assures that, in the absence of misspeculation, SpecDSWP's performance is independent of communication latency and tolerant of intra-thread memory stalls. The remainder of this paper will review DSWP and will then describe Speculative DSWP in detail.

¹ Speculative Decoupled Software Pipelining and Speculative Pipelining [18] have similar names, but the two techniques are different. The latter is a prepass applied to code before applying TLS. This prepass could also be applied before DSWP or Speculative DSWP to improve performance.
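To make the pipeline organization concrete, here is one way DSWP might split the loop of Figure 1 into a two-stage pipeline. This is an illustrative partition, not necessarily the one depicted in Figure 2(b); produce() and consume() stand for the inter-core queue operations, and the NULL end-of-list convention is assumed. The pointer-chasing recurrence (lines 3 and 6) stays in the first thread and the cost recurrence (line 5) stays in the second, so data only flows forward through the pipeline.

    /* Stage 1: list traversal (node recurrence kept thread-local) */
    node = list->head;
    while (node) {
      produce(node);        /* forward the node pointer to stage 2 */
      node = node->next;
    }
    produce(NULL);          /* assumed convention: signal end of list */

    /* Stage 2: cost computation and accumulation (cost recurrence kept thread-local) */
    cost = 0;
    node = consume();
    while (node) {
      cost += doit(node);
      node = consume();
    }

Because no value ever flows backward, a stall in one stage delays the other only when the connecting queue fills up or drains empty.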


Figure 5. DSWP transformation example.

Figure 6. Speculative DSWP example.

3 Decoupled Software Pipelining

This section describes the DSWP transformation, from which the Speculative DSWP algorithm is built. DSWP is a non-speculative pipelined multithreading transformation that parallelizes a loop by partitioning the loop body into stages of a pipeline. Like conventional software pipelining (SWP), each stage of the decoupled software pipeline operates on a different iteration of the loop, with earlier stages operating on later iterations. DSWP differs from conventional SWP in three principal ways. First, DSWP relies on TLP, rather than ILP, to run the stages of the pipeline in parallel. Each pipeline stage executes within a thread and communicates to neighboring stages via communication queues. Second, since each pipeline stage is run in a separate thread and has an independent flow of control, DSWP can parallelize loops with complex control flow. Third, since inter-thread communication is buffered by a queue, the pipeline stages are decoupled and insulated from stalls in other stages. Variability in the execution time of one stage does not affect surrounding stages provided sufficient data has been buffered in queues for later stages and sufficient space is available in queues fed by earlier stages [20].

The DSWP algorithm partitions operations into pipeline stages and has three main steps (see [16] for a detailed discussion). Figure 5 illustrates each of these steps on the loop from Figure 1. First, the program dependence graph (PDG) is constructed for the loop being parallelized. The PDG contains all register, memory, and control dependences present in the loop.² Second, all dependence recurrences are found in the PDG by identifying its strongly-connected components (SCCs). To ensure that there are no cyclic cross-thread dependences after partitioning, the SCCs will be the minimum scheduling units. Lastly, each SCC is allocated to a thread while ensuring that no cyclic dependences are formed between the threads. SCCs are partitioned among the desired number of threads according to a heuristic that tries to balance the execution time of each thread [16].

² Register anti- and output-dependences are ignored since each thread will have an independent set of registers.

4 Speculative DSWP

Despite the success of DSWP, the non-speculative transformation does not leverage the fact that many dependences are easily predictable or manifest themselves infrequently. If these dependences were speculatively ignored, large dependence recurrences (SCCs) may be split into smaller ones with more balanced performance characteristics. As the example from Section 2 showed, these smaller recurrences provide DSWP with more scheduling freedom, leading to greater applicability, scalability, and performance.

4.1 Execution Model

Before describing the compiler transformation used by speculative DSWP, this section outlines the basic execution paradigm and hardware support necessary. SpecDSWP transforms a loop into a pipeline of threads with each thread's loop body consisting of a portion of the original loop body. SpecDSWP speculates certain dependences to ensure no dependence flows between a later thread and an earlier thread. In the absence of misspeculation, SpecDSWP achieves decoupled, pipelined multithreaded execution like DSWP.

To manage misspeculation recovery, threads created by SpecDSWP conceptually checkpoint architectural state each time they initiate a loop iteration. When misspeculation is detected, each thread is resteered to its recovery code, and jointly the threads are responsible for restoring state to the values checkpointed at the beginning of the misspeculated iteration and re-executing the iteration non-speculatively. Our current SpecDSWP implementation uses software to detect misspeculation and recover register state. It, however, relies on hardware versioned memory (see Section 6.1) to rollback the effects of speculative stores.

4.2 Compiler Transformation Overview

To generate parallel code, a Speculative DSWP compiler should follow the steps below.
1. Build the PDG for the loop to be parallelized.
2. Select the dependence edges to speculate.
3. Remove the selected edges from the PDG.
4. Apply the DSWP transformation [16] using the PDG with speculated edges removed.
5. Insert code necessary to detect misspeculation.
6. Insert code to recover from misspeculation.

Recall the loop from Figure 3(a). Applying step 1 yields the PDG shown in Figure 3(b). In step 2, the compiler would select the loop exit control dependences to be speculated. Step 3 would produce the PDG in Figure 6. After steps 4 and 5, the code in Figure 7 would be produced. The next two sections will provide more details on the SpecDSWP transformation. Steps 2 and 5 are described in Section 5, and step 6 is described in Section 6.
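For concreteness, the following sketch reconstructs the four threads of Figure 7 from the description above and from Section 5.1. The exact statement text, the cost threshold name T, and the split of one statement across lines 3 and 4 of thread 2 are assumptions; produce() and consume() stand for the synchronization-array queue operations, each tied to a specific queue that is omitted here. Line-number comments follow the original figure, since the text refers to line 7 of Figure 7(d).

    /* (a) Thread 1: list traversal; speculates the loop never exits. */
    node = list->head;            /* 1 */
    produce(node);                /* 2 */
    while (TRUE) {                /* 3 */
      node = node->next;          /* 4 */
      produce(node);              /* 5 */
    }                             /* 6 */

    /* (b) Thread 2: per-node cost computation. */
    node = consume();             /* 1 */
    while (TRUE) {                /* 2 */
      ncost = doit(node);         /* 3-4 */
      produce(ncost);             /* 5 */
      node = consume();           /* 6 */
    }                             /* 7 */

    /* (c) Thread 3: cost accumulation. */
    cost = 0;                     /* 1 */
    produce(cost);                /* 2 */
    while (TRUE) {                /* 3 */
      ncost = consume();          /* 4 */
      cost += ncost;              /* 5 */
      produce(cost);              /* 6 */
    }                             /* 7 */

    /* (d) Thread 4: non-speculative check of the original exit condition. */
    node = consume();             /* 1 */
    cost = consume();             /* 2 */
    while (cost < T && node) {    /* 3: assumed form of the original test */
      cost = consume();           /* 4 */
      node = consume();           /* 5 */
    }                             /* 6 */
    FLAG_MISSPECULATION();        /* 7: reached only when the loop truly exits */

Threads 1 through 3 replace the loop exit with while(TRUE), so they run ahead speculatively; thread 4 keeps the original test and flags misspeculation when the speculation that the loop never exits finally fails.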


Figure 7. The code from Figure 3(a) after SpecDSWP is applied: (a) Thread 1, (b) Thread 2, (c) Thread 3, (d) Thread 4.

5 Selecting Edges to Speculate

Since the scheduling freedom enjoyed by SpecDSWP, and consequently its potential for speedup, is determined by the number and performance (i.e., schedule height) of a loop's dependence recurrences, SpecDSWP should try to speculate dependences which break recurrences. However, since speculating one dependence alone may not break a recurrence, SpecDSWP ideally would simultaneously consider speculating all sets of dependences. Since exponentially many dependence sets exist, SpecDSWP uses a heuristic solution. First, SpecDSWP provisionally speculates all dependences which are highly predictable. Next, the partitioning heuristic allocates instructions to threads using a PDG with the provisionally speculated edges removed. Once the partitioning is complete, SpecDSWP identifies the set of provisionally speculated edges that are cross-thread and that originate from a thread later in the pipeline to an earlier thread. These dependences are the only edges that need to be speculated to ensure acyclic communication between the threads. Consequently, only these edges are speculated. All other provisionally speculated dependences are not speculated and are added back to the PDG before invoking the non-speculative DSWP transformation (step 4 from Section 4.2).

A SpecDSWP implementation can speculate any dependence that has an appropriate misspeculation detection mechanism and, for value speculation, that also has an appropriate value predictor. Each misspeculation detection mechanism and value predictor can be implemented either in software, hardware, or a hybrid of the two. Our current implementation relies solely on software for value prediction and misspeculation detection. The remainder of this section will detail the speculation carried out by our SpecDSWP compiler.

5.1 Biased Branches

As the example presented earlier (Figures 3, 6, and 7) showed, speculating biased branches can break dependence recurrences. Recall from the example that the loop terminating branch (statement 3) was biased provided that the loop ran for many iterations. Consequently, the compiler can speculatively break the control dependences between the branch and other instructions. Figures 3(b) and 6 show the dependence graph before and after speculation, respectively. This speculation is realized by inserting an unconditional branch in each thread that was dependent on the branch. The while(TRUE) statements in Figures 7(a), 7(b), and 7(c) are the unconditional branches. Misspeculation detection is achieved by making the taken (fall through) path of the speculated branch be code that flags misspeculation assuming the branch was predicted not taken (taken). In the example, the not-predicted path is the loop exit, so misspeculation is flagged there (line 7 of Figure 7(d)).

Our SpecDSWP compiler assumes that all branches that exit the loop being SpecDSWPed are biased and speculates that the loop will not terminate. For other branches, if their bias exceeds a threshold, the branch is speculated. Finally, inner loop exit branches are handled specially. If all inner loop exits are speculatively removed, each time the inner loop is invoked it will eventually flag misspeculation. Since our misspeculation recovery model forces rollback to the beginning of a loop iteration for the loop being SpecDSWPed, this is tantamount to speculating that the inner loop will not be invoked. Consequently, if the compiler does not speculate that an inner loop will never be invoked, it preserves some exit from the inner loop.

5.2 Infrequent Basic Block Speculation

While speculating biased branches is successful in the previous example, consider the code shown in Figure 8. Assume, in this example, that our branch bias threshold is 95%. This implies that neither operation A nor B will be speculated. Further, notice that, despite not being the target of a sufficiently biased branch, operation C only has a 1% chance of execution. Since its execution is very unlikely, SpecDSWP will speculatively break the dependences between operation C and D. Speculating these dependences breaks the dependence recurrence between operations C, D, and E. Note, in this example, even if branches A and B were sufficiently biased, breaking the dependences from A to B or from B to C would not have broken the dependence recurrence between C, D, and E.

Figure 8. Infrequent basic block speculation: (a) CFG, (b) PDG. Operations A: if f(x); B: if f(y); C: p=f(q); D: p=g(p); E: q=h(p). Each branch is 90% biased, so the block containing C executes only about 1% of the time.

In general, SpecDSWP will break all dependences originating from operations in infrequently executed basic blocks. Just as for biased branches, SpecDSWP uses a threshold parameter to determine if a basic block is infrequently executed. To detect misspeculation, operations in the speculated basic block are replaced with code that flags misspeculation.

5.3 Silent Stores

In addition to control speculation, our SpecDSWP compiler speculates memory flow dependences originating at frequently silent stores [13]. To enable this speculation, the compiler first profiles the application to identify silent stores. Then, the compiler transforms a silent store into a hammock that first loads the value at the address given by the store, compares this value to the value about to be stored, and only if the two differ performs the store. After this transformation, the biased branch speculation mechanism described above is applied. The compiler will predict that the store will never occur, and if it does occur, misspeculation will be flagged.

5.4 False Memory Dependences

Finally, our compiler disregards all loop-carried memory anti- and output-dependences. Since our recovery mechanism relies on versioned memory (see Section 6.1) to recover from misspeculation, each loop iteration can explicitly load from and store to a particular version of the memory. Consequently, loop-carried false memory dependences no longer need to be respected. Since the compiler is removing edges from the PDG, this is similar to speculation, but since these dependences truly do not need to be respected, there is no need to detect misspeculation or initiate recovery.
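A minimal sketch of the silent-store transformation described in Section 5.3 (the pointer name and value are illustrative): the store is guarded by a load-and-compare hammock, and the resulting heavily biased branch is then handled by the mechanism of Section 5.1.

    /* Original store, profiled to be silent (it almost always rewrites the
     * value already in memory): */
    *p = v;

    /* After the hammock transformation: */
    if (*p != v)    /* biased branch: almost never taken */
      *p = v;

    /* The biased-branch speculation of Section 5.1 then predicts the store
     * never happens; the thread holding the non-speculative check flags
     * misspeculation if the store does execute. */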
6 Misspeculation Recovery

Recall from Section 4.1 that SpecDSWP handles misspeculation at the iteration level. When thread j in a pipeline of T threads detects misspeculation, several actions must be taken to recover. In this discussion, assume that, when misspeculation is detected, thread j is executing iteration n_j. These are the actions that need to be performed:
1. The first step in recovery is waiting for all threads to complete iteration n_j - 1.
2. Second, speculative state must be discarded and non-speculative state must be restored. This includes reverting the effects of speculative stores to memory, speculative writes to registers, as well as speculative produces to cross-thread communication queues. Since values produced to communication queues in a particular iteration are consumed in the same iteration, it is safe to flush the communication queues (i.e., after all threads have reached iteration n_j, there are no non-speculative values in the queues).
3. Third, the misspeculated iteration must be re-executed. Since the speculative code is deterministic, re-executing it will again result in misspeculation. Consequently, a non-speculative version of the iteration must be executed.
4. Finally, speculative execution can recommence from iteration n_j + 1.

To orchestrate this recovery process, SpecDSWP relies on an additional commit thread, which receives state checkpoints and status messages from each of the worker threads. Pseudo-code for a worker thread and the commit thread is shown in Figure 9.

To support rolling back of speculative memory updates, the worker threads and the commit thread rely on versioned memory. Consequently, during normal speculative execution, the first action taken in each loop iteration is to advance to the next sequential memory version. Details regarding how versioned memory works are described in Section 6.1.

After checkpointing memory state, each worker thread collects the set of live registers that need to be checkpointed and sends them to the commit thread. The commit thread receives these registers and locally buffers them. Section 6.2 details which registers need to be checkpointed.

Next, each worker thread executes the portion of the original loop iteration allocated to it. Execution of the loop iteration generates a status: the loop iteration completed normally, misspeculation was detected, or the loop iteration exited the loop. The worker thread sends this status to the commit thread, and then, based on the status, either continues speculative execution, waits to be redirected to recovery code, or exits the loop normally.

The commit thread collects the status messages sent by each thread and takes appropriate action.

    /* (a) Worker Thread */
    while (true) {
      move_to_next_memory_version();
      produce_register_checkpoint(commit_thread);
      status = execute_loop_iteration();
      produce(commit_thread, status);
      if (status == EXIT)
        break;
      else if (status == MISSPEC)
        wait_for_resteer();
      else if (status == OK)
        continue;

    recovery:
      produce_resteer_ack(commit_thread);
      flush_queues();
      regs = consume_register_checkpoint(commit_thread);
      restore_registers(regs);

      if (MULTI_THREAD_RECOVERY)
        execute_synchronized_loop_iteration();
    }

    /* (b) Commit Thread */
    do {
      move_to_next_memory_version();
      regs = consume_register_checkpoints(threads);

      status = poll_worker_statuses(threads);

      if (status == MISSPEC) {
        resteer_threads(threads);
        consume_resteer_acks(threads);
        rollback_memory();

        if (SINGLE_THREAD_RECOVERY)
          regs = execute_full_loop_iteration(regs);

        produce_register_checkpoints(threads, regs);
      } else if (status == OK || status == EXIT)
        commit_memory();
    } while (status != EXIT);

Figure 9. Pseudo-Code for (a) Worker and (b) Commit threads
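The pseudo-code above does not specify how poll_worker_statuses combines the per-worker messages. One plausible policy, consistent with the protocol described in this section, is sketched here; the function body and its array-based interface are assumptions and the name is hypothetical.

    /* Hypothetical status-combining policy for the commit thread:
     * misspeculation anywhere dominates; otherwise any loop exit ends the
     * loop; otherwise the iteration is OK. */
    enum status { OK, EXIT, MISSPEC };

    enum status combine_statuses(const enum status *worker, int nworkers) {
      enum status combined = OK;
      for (int i = 0; i < nworkers; i++) {
        if (worker[i] == MISSPEC)
          return MISSPEC;
        if (worker[i] == EXIT)
          combined = EXIT;
      }
      return combined;
    }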

If all worker threads successfully completed an iteration, then the current memory version is committed, and the register checkpoint is discarded.³ If any thread detected misspeculation, the commit thread initiates recovery.

³ Since the register checkpoint is saved in virtual registers in the commit thread, no explicit action is required to discard the register checkpoint.

By collecting a status message from all worker threads each iteration, it is guaranteed that no worker thread is in an earlier iteration than the commit thread. Consequently, when recovery is initiated, step 1 is already complete. Recovery thus begins by asynchronously resteering all worker threads to thread-local recovery code. Once each thread acknowledges that it has been resteered, step 2 begins. The resteer acknowledgment prevents worker threads from speculatively modifying memory or producing values once memory state has been recovered or queues flushed. To recover state, the commit thread informs the versioned memory to discard all memory writes to the current or later versions. Additionally, the worker threads flush all queues used by the thread.

Lastly, the misspeculated iteration is re-executed. Two possibilities exist for this re-execution: the worker threads can run a non-speculative multithreaded version of the loop body (this version will have cyclic cross-thread dependences), or one thread (usually the commit thread) can run the original single-threaded loop body. In the first case, the register checkpoints collected at the beginning of the iteration are sent back to each thread, and the iteration is re-executed. In the second case, the commit thread will use the register checkpoint to execute the single-threaded loop body. The register values after the iteration has been re-executed will then be distributed back to the worker threads. The tradeoffs between single-threaded and multi-threaded recovery are discussed in Section 6.3.

Note that the commit thread incurs overhead that scales with the number of worker threads. While this code is very lightweight, there is a point at which using additional worker threads will result in worse performance, since one iteration in the commit thread takes longer than any worker thread. However, various solutions to this problem exist. First, the commit thread code is parallelizable; additional threads can be used to reduce the latency of committing a loop iteration. Second, the problem can be mitigated by unrolling the original loop. This effectively increases the amount of time the commit thread has to complete its bookkeeping. This can potentially increase the misspeculation penalty, but provided misspeculation is infrequent, the tradeoff should favor additional worker threads.

6.1 Versioned Memory

Speculative DSWP leverages versioned memory to allow for speculative writes to memory. Like transactional memories or memory subsystems used in TLS architectures [6, 7, 19], SpecDSWP's versioned memory buffers speculative stores. One version in the versioned memory roughly corresponds to a transaction in a transactional memory. If a memory version is committed, then the stores buffered in that version are allowed to update non-speculative architectural state. Otherwise, if the version is rolled back, the buffered stores are discarded.

SpecDSWP's versioned memory differs from transactional memories in three principal ways. First, the versioned memory does not provide any mechanism for detecting conflicting memory accesses across two versions; any cross-version dependences that need to be enforced must explicitly be synchronized in the software. Second, many threads (and cores) can simultaneously read from and update the same memory version. Third, reads to addresses not explicitly written in a particular version are obtained from the latest version where an explicit write has taken place even if that version has not yet been committed.
The architectural semantics of the versioned memory are as follows. Note that this is a description of the semantics, not an implementation, of versioned memory. Each thread maintains a current version number. All threads in a process share a committed version number. Memory is addressed with (version number, address) pairs. Memory cells (i.e., a particular address in a given version) that have never been written are initialized with a sentinel value, ⊥. Stores from a particular thread update the specified address in the current version of the thread. Loads from a thread read from the specified address in the current version of the thread. If ⊥ is read, the load will read from previous versions until a non-⊥ value is found. If, in the committed version number, ⊥ is found, then the result of the load is undefined. Each thread can increment its current version number, update the process's committed version number, and discard all versions past the committed version number. The rollback operation resets all addresses in all versions past the committed version to ⊥ and asynchronously updates all threads' current version number to the committed version number.

Notice, in addition to buffering speculative stores, the versioned memory semantics allow anti- (WAR) and output- (WAW) memory dependences to be safely ignored provided the source of the dependence executes in a different memory version than the destination of the dependence.

Implementation details of versioned memory are beyond the scope of this paper. However, observe that the relation to transactional memories suggests that using caches with augmented tags and coherence protocols is a viable implementation strategy. Garzarán et al. explore a vast design space [6] and many of the designs they discuss could be modified to implement versioned memory.
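The semantics above can be captured in a small executable model (a sketch only; it says nothing about the hardware implementation). A cell is a (version, address) pair, a cell that was never written plays the role of ⊥, and loads walk back through earlier versions. Capacity handling, concurrency, and the per-thread version counters are simplified away.

    #include <stdint.h>
    #include <stdbool.h>

    #define MAX_CELLS 4096

    typedef struct { uint32_t version; uintptr_t addr; uint64_t value; bool valid; } Cell;

    static Cell     cells[MAX_CELLS];
    static uint32_t committed_version = 0;    /* shared by all threads in the process */

    static Cell *find(uint32_t version, uintptr_t addr) {
      for (int i = 0; i < MAX_CELLS; i++)
        if (cells[i].valid && cells[i].version == version && cells[i].addr == addr)
          return &cells[i];
      return NULL;                            /* behaves as the sentinel (bottom) */
    }

    /* Store: update (current version of the storing thread, addr). */
    void vm_store(uint32_t cur_version, uintptr_t addr, uint64_t value) {
      Cell *c = find(cur_version, addr);
      if (!c)
        for (int i = 0; i < MAX_CELLS && !c; i++)
          if (!cells[i].valid) c = &cells[i];
      if (!c) return;                         /* model capacity exceeded */
      *c = (Cell){ cur_version, addr, value, true };
    }

    /* Load: read (current version, addr); on bottom, fall back to earlier
     * versions.  Reading bottom at the committed version is undefined in the
     * real semantics; the model simply returns 0.  Assumes
     * cur_version >= committed_version. */
    uint64_t vm_load(uint32_t cur_version, uintptr_t addr) {
      for (uint32_t v = cur_version; ; v--) {
        Cell *c = find(v, addr);
        if (c) return c->value;
        if (v == committed_version) return 0;
      }
    }

    /* Rollback: reset every version past the committed one to bottom. */
    void vm_rollback(void) {
      for (int i = 0; i < MAX_CELLS; i++)
        if (cells[i].valid && cells[i].version > committed_version)
          cells[i].valid = false;
    }

    /* Commit: advance the process-wide committed version number. */
    void vm_commit(uint32_t new_committed) { committed_version = new_committed; }

Because a write in version v+1 never disturbs the cell for (v, addr), a WAR or WAW dependence whose source and destination run in different versions needs no ordering, which is exactly the property exploited in Section 5.4.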
Implementation details of versioned memory are beyond On the other hand, single-threaded recovery executes the the scope of this paper. However, observe that the relation original single-threaded loop body in the commit thread. to transactional memories suggests that using caches with For the recovery iteration, any potential benefit of thread augmented tags and coherence protocols is a viable imple- parallelism is lost. However, the cost of communication on mentation strategy. Garzaran´ et al. explore a vast design the critical path often outweighs the benefits of parallelism, space [6] and many of the designs they discuss could be potentially making this faster than multithreaded recov- modified to implement versioned memory. ery. Single-threaded recovery also allows worker threads 6.2 Saving Register State to overlap some of their recovery with the execution of the non-speculative iteration. As will be discussed in Sec- Since recovery occurs at iteration boundaries, each tion 7.4, our experiments have shown that the time taken to thread need only checkpoint those registers which are live flush queues and restore certain machine specific registers into its loop header, since these are the only registers nec- can sometimes dominate the recovery time. Unfortunately, essary to run the loop and code following the loop. (Callee- single-threaded recovery suffers from cold microarchitec- saved registers must be handled specially, but due to space tural structures. In particular, the core running the com- limitations the details have been omitted.) In addition to mit thread does not have any application data loaded into high-level compiler temporaries (virtual registers), it is nec- its caches. These tradeoffs make it unclear which recovery essary to checkpoint machine-specific registers which may style is superior. be live into the loop header such as the stack pointer, global pointer, or register window position. 7 Evaluation The set of registers to be checkpointed can be optimized This section evaluates our initial implementation of by recognizing that registers that are live into the loop en- Speculative DSWP targeting a quad-core chip multipro- try, but that are not modified in the loop (loop invariants cessor. Speculative DSWP was implemented in the VE- in the speculative code), need not be checkpointed each it- LOCITY backend [24]. VELOCITY eration, since their values are constant for the loop execu- contains an aggressive classical optimizer, which includes tion. Instead, these registers can be checkpointed once per global versions of partial redundancy elimination, dead and # of Average # of # of Max % of % of Loop Iterations SCCs Comm. Saved # of Speculation Iters. with Benchmark Function Runtime Ops. per Invocation Non-Spec. Spec. Ops. Regs Versions Types Misspec. 164.gzip deflate 53% 586 24444826.0 9 336 677 3 - CF 12.4% 256.bzip2 fullGtU 38% 99 123.4 2 51 17 25 32 CF 0.8% 181.mcf primal net simplex 75% 254 17428.8 3 78 112 10 22 CF, SS 0.2% 052.alvinn main 98% 181 30.0 57 63 82 9 2 None 0.0% mpeg2enc dist1 56% 136 6.3 1 101 4 10 20 CF 15.9% CF = Control Flow, SS = Silent Store Table 2. Benchmark Details

150 unreachable code elimination, copy and constant propa- All Iters. (Loop) gation, reverse copy propagation, algebraic simplification, Misspec-free Iters. (Loop) 100 All Iters. (Bench) , , and redundant load and Misspec-free Iters. (Bench) peedup store elimination. VELOCITY also performs an aggressive S 50

inlining, superblock formation, superblock-based schedul- cent er ing, and . Speculative DSWP was applied P 0 to benchmarks from the SPEC-CPU2000, SPEC-CPU92, and Mediabench suites. We report results for all benchmark 164.gzip 256.bzip2 181.mcf 052.alvinn mpeg2enc GeoMean loops in which the non-speculative DSWP transformation found no suitable partition, but SpecDSWP found a suit- Figure 10. Speedup vs. single threaded code. able partition, and the loop accounted for more than a third of the program runtime. erations in called procedures). Table 2 also presents the To evaluate the performance of SpecDSWP, we used a number of SCCs in the PDG for each loop both before and validated cycle-accurate Itanium 2 processor performance after speculating dependences. Notice that for all the bench- model (IPC accurate to within 6% of real hardware for marks (except 052.alvinn), speculation dramatically in- benchmarks measured [17]) to build a multi-core model creases the number of SCCs in the PDG. This increase in comprising four Itanium 2 cores connected by a synchro- the number of SCCs directly translates into scheduling free- nization array [20]. The models were built using the Lib- dom for the partitioning heuristic. erty Simulation Environment [27]. Table 1 provides details All loops were compiled to generate 3 worker threads about the simulation model. and a commit thread, for a total of 4 threads. Figure 10 The synchronization array (SA) in the model works as a shows the speedup over single-threaded execution for these set of low-latency queues. The Itanium ISA was extended parallelizations. For each loop, the loop speedup and the with produce and consume instructions for inter-thread benchmark speedup (assuming only the given loop is par- communication. As long as the SA queue is not empty, allelized) are shown. Speedup is further broken down a consume and its dependent instructions can execute in into speedup obtained ignoring iterations that misspecu- back-to-back cycles. The ISA was additionally augmented lated (i.e., assuming these iterations were not executed) with a resteer instruction that interrupts the execution and speedup for all iterations. All benchmarks, except of the specified core and restarts its execution at the given 164.gzip, exhibit considerable loop speedup ranging address. Finally, our model includes a versioned mem- from 1.24x to 2.24x. The average speedup over all paral- ory which supports 32 simultaneous outstanding versions. lelized loops was 1.40x. The rest of this section examines Speculative state is buffered both in the private L1 and L2 the results of several of benchmarks in more detail to high- caches and in the shared L3 cache. In our experiments, light interesting features of SpecDSWP. speculative state never exceeded the size of the L3 cache. 7.2 A Closer Look: Communication Cost The detail of the validated Itanium 2 model prevented whole program simulation. Instead, detailed simulation was The slowdown seen in 164.gzip can be attributed to restricted to the loops in question in each benchmark. The a poor partition, large amount of communication, and the remaining sections were functionally simulated keeping the high rate of misspeculation. While SpecDSWP (and DSWP caches and branch predictors warm. in general) prevents communication latency from appear- ing on the critical path, the effects of communication in- 7.1 Experimental Results structions within a core are significant. 
Table 2 presents compilation and profile statistics for the chosen loops. These loops account for between 38% and 98% of the total benchmark execution time, and the loops contain between 99 and 586 static operations (excluding operations in called procedures). Table 2 also presents the number of SCCs in the PDG for each loop both before and after speculating dependences. Notice that for all the benchmarks (except 052.alvinn), speculation dramatically increases the number of SCCs in the PDG. This increase in the number of SCCs directly translates into scheduling freedom for the partitioning heuristic.

All loops were compiled to generate 3 worker threads and a commit thread, for a total of 4 threads. Figure 10 shows the speedup over single-threaded execution for these parallelizations. For each loop, the loop speedup and the benchmark speedup (assuming only the given loop is parallelized) are shown. Speedup is further broken down into speedup obtained ignoring iterations that misspeculated (i.e., assuming these iterations were not executed) and speedup for all iterations. All benchmarks, except 164.gzip, exhibit considerable loop speedup ranging from 1.24x to 2.24x. The average speedup over all parallelized loops was 1.40x. The rest of this section examines the results of several of the benchmarks in more detail to highlight interesting features of SpecDSWP.

7.2 A Closer Look: Communication Cost

The slowdown seen in 164.gzip can be attributed to a poor partition, a large amount of communication, and the high rate of misspeculation. While SpecDSWP (and DSWP in general) prevents communication latency from appearing on the critical path, the effects of communication instructions within a core are significant. In 164.gzip, the schedule height of one iteration of the slowest thread increased (compared to the schedule height of a single-threaded loop iteration) due to resource contention between original loop instructions and communication instructions. This increase was not offset by the reduction in schedule height due to parallelization. Improvements in the partitioning heuristic and additional optimizations may be able to improve the performance of SpecDSWP on this benchmark.

7.3 A Closer Look: Versioned Memory

While all parallelized loops rely on versioned memory to rollback speculative stores, the loops from 181.mcf and 052.alvinn benefit from the effective privatization provided by versioned memory. In 181.mcf, execution of the function refresh_potential could be overlapped with other code within the primal_net_simplex loop by speculating that refresh_potential would not change a node's potential (using silent store speculation). The versioned memory allows subsequent iterations of the loop to modify the tree data structure used by refresh_potential without interfering with refresh_potential's computation. Due to the complexity of the left-child, right-sibling tree data structure, it is unlikely that compiler privatization could achieve the effects of the versioned memory.

In 052.alvinn, versioned memory privatized several arrays. The parallelized loop contains several inner loops. Each inner loop scans an input array and produces an output array that feeds the subsequent inner loop. While the loop is not DOALL (dependences exist between invocations of the inner loops), each inner loop can be put in its own thread. Each stage in the SpecDSWP pipeline, therefore, is one of the inner loops. However, privatization is needed to allow later invocations of an early inner loop to execute before earlier invocations of a later inner loop have finished. Note that 052.alvinn did not require any true speculation. False dependences broken by the versioned memory were sufficient to achieve parallelism. Consequently, in addition to its role in SpecDSWP, versioned memory also has potential in improving parallelism for non-speculative DSWP.

7.4 A Closer Look: Misspeculation

As Figure 10 shows, of the loops that were parallelized, only mpeg2enc observed a significant performance loss due to misspeculation. This performance loss is directly attributable to the delay in detecting misspeculation, the significant misspeculation rate, and the cost of recovery. Only loop exits were speculated to achieve the parallelization in mpeg2enc. However, as Table 2 shows, mpeg2enc has only a few iterations per invocation. This translates to significant misspeculation due to loop exit speculation.

Despite the significant misspeculation, the parallelization benefit outweighs the recovery cost. However, optimization may be able to mitigate the deleterious effects of misspeculation. As SpecDSWP is currently implemented, upon misspeculation, state is rolled back to the state at the beginning of the iteration, and the iteration is re-executed non-speculatively. However, in the case of a loop exit misspeculation, observe that no values have been incorrectly computed; only additional work has been done. Consequently, provided no live-out values have been overwritten, it is only necessary to squash the speculative work; the iteration does not have to be rolled back and re-executed. With single-threaded recovery, the savings can be significant. For example, in mpeg2enc, on average a single-threaded loop iteration takes 120 cycles. The parallelized loop iteration, conversely, takes about 53 cycles. After one misspeculation, due to a cold branch predictor, the recovery iteration took 179 cycles to execute. Reclaiming these 179 cycles would significantly close the performance gap between the iterations without misspeculation and all the iterations. We are currently working to extend our SpecDSWP implementation to recognize these situations and to avoid unnecessarily rolling back state and re-executing the last loop iteration.

Given the deleterious effects of cold architectural state on recovery code and the significant parallelism present in this loop, it is natural to ask if multithreaded recovery would fare better. Our initial experiments with multithreaded recovery in mpeg2enc showed that all the parallelization speedup is lost due to recovery overhead. In particular, on Itanium, it was necessary to recover the state of the register stack engine (RSE) after misspeculation. Manipulating certain RSE control registers (e.g., CFM) requires a function call. The overhead of this recovery code was significant in comparison to the execution time of one loop iteration, significantly slowing recovery. By moving to single-threaded recovery, the compiler was able to overlap this recovery in the worker threads with the execution of the recovery iteration in the commit thread. In our experience this balance yielded better performing recovery code.

8 Summary

This paper introduced Speculative DSWP, a speculative extension to the pipelined multithreading technique DSWP. Contrary to existing thread-level speculation techniques, the parallelism extracted is neither DOALL nor DOACROSS. Instead, by combining speculation and pipeline parallelism, SpecDSWP is able to offer significant speedup on a variety of applications even in the presence of long inter-core communication latency. Our proof-of-concept implementation extracted an average of 40% loop speedup on the applications explored with support for only control and silent store speculation. As the implementation is refined and support for more types of speculation, such as memory alias or value speculation, is added, we expect to be able to target more loops and offer still more performance.

Acknowledgments

We thank the entire Liberty Research Group for their support and feedback during this work. Additionally, we thank the anonymous reviewers for their insightful comments. The authors acknowledge the support of the GSRC Focus Center, one of five research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program. This work has been supported by Intel Corporation. Opinions, findings, conclusions, and recommendations expressed throughout this work are not necessarily the views of our sponsors.
References

[1] H. Akkary and M. A. Driscoll. A dynamic multithreading processor. In Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, pages 226–236, Los Alamitos, CA, USA, 1998. IEEE Computer Society Press.
[2] A. Bhowmik and M. Franklin. A general compiler framework for speculative multithreading. In Proceedings of the 14th ACM Symposium on Parallel Algorithms and Architectures, pages 99–108, 2002.
[3] J. C. Corbett. Evaluating deadlock detection methods for concurrent software. IEEE Transactions on Software Engineering, 22(3):161–180, 1996.
[4] C. Demartini, R. Iosif, and R. Sisto. A deadlock detection tool for concurrent Java programs. Software: Practice and Experience, 29(7):577–603, 1999.
[5] P. A. Emrath and D. A. Padua. Automatic detection of nondeterminacy in parallel programs. In Proceedings of the 1988 ACM SIGPLAN and SIGOPS Workshop on Parallel and Distributed Debugging, pages 89–99, New York, NY, USA, 1988. ACM Press.
[6] M. J. Garzarán, M. Prvulovic, J. M. Llabería, V. Viñals, L. Rauchwerger, and J. Torrellas. Tradeoffs in buffering speculative memory state for thread-level speculation in multiprocessors. ACM Transactions on Architecture and Code Optimization, 2(3):247–279, 2005.
[7] L. Hammond, B. D. Carlstrom, V. Wong, M. Chen, C. Kozyrakis, and K. Olukotun. Transactional coherence and consistency: Simplifying parallel hardware and software. IEEE Micro, 24(6), Nov-Dec 2004.
[8] L. Hammond, B. A. Hubbert, M. Siu, M. K. Prabhu, M. Chen, and K. Olukotun. The Stanford Hydra CMP. IEEE Micro, 20(2):71–84, 2000.
[9] T. A. Johnson, R. Eigenmann, and T. N. Vijaykumar. Min-cut program decomposition for thread-level speculation. In Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, pages 59–70, 2004.
[10] K. Kennedy and J. R. Allen. Optimizing compilers for modern architectures: a dependence-based approach. Morgan Kaufmann Publishers Inc., 2002.
[11] D. Kim and D. Yeung. A study of source-level compiler algorithms for automatic construction of pre-execution code. ACM Transactions on Computer Systems, 22(3):326–379, 2004.
[12] V. Krishnan and J. Torrellas. A chip-multiprocessor architecture with speculative multithreading. IEEE Transactions on Computers, 48(9):866–880, 1999.
[13] K. M. Lepak and M. H. Lipasti. Silent stores for free. In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, pages 22–31, New York, NY, USA, 2000. ACM Press.
[14] G. R. Luecke, Y. Zou, J. Coyle, J. Hoekstra, and M. Kraeva. Deadlock detection in MPI programs. Concurrency and Computation: Practice and Experience, 14(11):911–932, 2002.
[15] P. Marcuello and A. González. Clustered speculative multithreaded processors. In Proceedings of the 13th International Conference on Supercomputing, pages 365–372, New York, NY, USA, 1999. ACM Press.
[16] G. Ottoni, R. Rangan, A. Stoler, and D. I. August. Automatic thread extraction with decoupled software pipelining. In Proceedings of the 38th IEEE/ACM International Symposium on Microarchitecture, November 2005.
[17] D. A. Penry, M. Vachharajani, and D. I. August. Rapid development of a flexible validated processor model. In Proceedings of the 2005 Workshop on Modeling, Benchmarking, and Simulation, June 2005.
[18] M. K. Prabhu and K. Olukotun. Using thread-level speculation to simplify manual parallelization. In Proceedings of the Ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 1–12, New York, NY, USA, 2003. ACM Press.
[19] R. Rajwar and J. R. Goodman. Transactional execution: Toward reliable, high-performance multithreading. IEEE Micro, 23(6):117–125, Nov-Dec 2003.
[20] R. Rangan, N. Vachharajani, M. Vachharajani, and D. I. August. Decoupled software pipelining with the synchronization array. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, pages 177–188, September 2004.
[21] S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T. Anderson. Eraser: A dynamic data race detector for multithreaded programs. ACM Transactions on Computer Systems, 15(4):391–411, 1997.
[22] G. S. Sohi, S. Breach, and T. N. Vijaykumar. Multiscalar processors. In Proceedings of the 22nd International Symposium on Computer Architecture, June 1995.
[23] J. G. Steffan, C. Colohan, A. Zhai, and T. C. Mowry. The STAMPede approach to thread-level speculation. ACM Transactions on Computer Systems, 23(3):253–300, 2005.
[24] S. Triantafyllis, M. J. Bridges, E. Raman, G. Ottoni, and D. I. August. A framework for unrestricted whole-program optimization. In ACM SIGPLAN 2006 Conference on Programming Language Design and Implementation, pages 61–71, June 2006.
[25] R. Triolet, F. Irigoin, and P. Feautrier. Direct parallelization of call statements. In Proceedings of the 1986 SIGPLAN Symposium on Compiler Construction, pages 176–185, New York, NY, USA, 1986. ACM Press.
[26] J.-Y. Tsai, J. Huang, C. Amlo, D. J. Lilja, and P.-C. Yew. The superthreaded processor architecture. IEEE Transactions on Computers, 48(9):881–902, 1999.
[27] M. Vachharajani, N. Vachharajani, D. A. Penry, J. A. Blome, and D. I. August. Microarchitectural exploration with Liberty. In Proceedings of the 35th International Symposium on Microarchitecture, pages 271–282, November 2002.
[28] C. Zilles and G. Sohi. Master/slave speculative parallelization. In Proceedings of the 35th Annual International Symposium on Microarchitecture, pages 85–96, Los Alamitos, CA, USA, 2002. IEEE Computer Society Press.