Cache Aware Optimization of Stream Programs

Janis Sermulins William Thies Rodric Rabbah Saman Amarasinghe Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology {janiss, thies, rabbah, saman}@csail.mit.edu

Abstract

Effective use of the memory hierarchy is critical for achieving high performance on embedded systems. We focus on the class of streaming applications, which is increasingly prevalent in the embedded domain. We exploit the widespread parallelism and regular communication patterns in stream programs to formulate a set of cache aware optimizations that automatically improve instruction and data locality. Our work is in the context of the Synchronous Dataflow model, in which a program is described as a graph of independent actors that communicate over channels. The communication rates between actors are known at compile time, allowing the compiler to statically model the caching behavior.

We present three cache aware optimizations: 1) execution scaling, which judiciously repeats actor executions to improve instruction locality, 2) cache aware fusion, which combines adjacent actors while respecting instruction cache constraints, and 3) scalar replacement, which converts certain data buffers into a sequence of scalar variables that can be register allocated. The optimizations are founded upon a simple and intuitive model that quantifies the temporal locality for a sequence of actor executions. Our implementation of cache aware optimizations in the StreamIt compiler yields a 249% average speedup (over unoptimized code) for our streaming benchmark suite on a StrongARM 1110 processor. The optimizations also yield a 154% speedup on a Pentium 3 and a 152% speedup on an Itanium 2.

Categories and Subject Descriptors D.3.4 [Programming Languages]: Processors—Optimization; code generation; compilers; D.3.2 [Programming Languages]: Language Classifications—Concurrent, distributed, and parallel languages; Data-flow languages

General Terms Languages, Design, Performance

Keywords Stream Programming, StreamIt, Synchronous Dataflow, Cache, Cache Optimizations, Fusion, Embedded

1. Introduction

Efficiency and high performance are of central importance within the embedded domain. As processor speeds continue to increase, the memory bottleneck remains a primary impediment to attaining performance. Current practices for hiding memory latency are invariably expensive and complex. For example, superscalar processors resort to out-of-order execution to hide the latency of cache misses. This results in large power expenditures (unfit for embedded systems) and also increases the cost of the system. Compilers have also employed computation and data reordering to improve locality, but this requires a heroic analysis due to the obscured parallelism and communication patterns in traditional languages such as C.

For performance-critical programs, the complexity inevitably propagates all the way to the application developer. Programs are written to explicitly manage parallelism and to reorder the computation so that the instruction and data working sets fit within the cache. For example, the inputs and outputs of a procedure might be arrays that are specifically designed to fit within the data cache on a given architecture; loop bodies are written at a level of granularity that matches the instruction cache. While manual tuning can be effective, the end solutions are not portable. They are also exceedingly difficult to understand, modify, and debug.

The recent emergence of streaming applications represents an opportunity to mitigate these problems using simple transformations in the compiler. Stream programs are rich with parallelism and regular communication patterns that can be exploited by the compiler to automatically tune memory performance. Streaming codes encompass a broad spectrum of applications, including embedded communications processing, multimedia encoding and playback, compression, and encryption. They also range to server applications, such as HDTV editing and hyper-spectral imaging. It is natural to express a stream program as a high-level graph of independent components, or actors. Actors communicate using explicit FIFO channels and can execute whenever a sufficient number of items are available on their input channels. In a stream graph, actors can be freely combined and reordered to improve caching behavior as long as there are sufficient inputs to complete each execution. Such transformations can serve to automate tedious approaches that are performed manually using today's languages; they are too complex to perform automatically in hardware or in the most aggressive of C compilers.

This paper presents three simple cache aware optimizations for stream programs: (i) execution scaling, (ii) cache aware fusion, and (iii) scalar replacement. These optimizations represent a unified approach that simultaneously considers the instruction and data working sets. We also develop a simple quantitative model of caching behavior for streaming workloads, providing a foundation to reason about the transformations. Our work is done in the context of the Synchronous Dataflow [13] model of computation, in which each actor in the stream graph has a known input and output rate. This is a popular model for a broad range of signal processing and embedded applications.

Execution scaling is a transformation that improves instruction locality by executing each actor in the stream graph multiple times before moving on to the next actor. As a given actor usually fits within the cache, the repeated executions serve to amortize the cost of loading the actor from off-chip memory. However, as our cache model will show, actors should not be scaled excessively, as their outputs will eventually overflow the data cache. We present a simple and effective algorithm for calculating a scaling factor that respects both instruction and data constraints.

Prior to execution scaling, cache aware fusion combines adjacent actors into a single function. This allows the compiler to optimize across actor boundaries. Our algorithm is cache aware in that it never fuses a pair of actors that will result in an overflow of the instruction cache.

As actors are fused together, new buffer management strategies become possible. The most aggressive of these, termed scalar replacement, serves to replace an array with a series of local scalar variables.
Unlike array references, scalar variables can be register allocated, leading to large performance gains. We also develop a new buffer management strategy (called "copy-shift") that extends scalar replacement to sliding-window computations, a domain where complex indexing expressions typically hinder compiler analysis.

Our cache aware optimizations are implemented as part of StreamIt, a language and compiler infrastructure for stream programming [21]. We evaluate the optimizations on three architectures. The StrongARM 1110 represents our primary target; it is an embedded processor without a secondary cache. Our other targets are the Pentium 3 (a superscalar) and the Itanium 2 (a VLIW processor). We find that execution scaling, cache aware fusion, and scalar replacement each offer significant performance gains, and the most consistent speedups result when all are applied together. Compared to unoptimized StreamIt code, our cache optimizations yield a 249% speedup on the StrongARM, a 154% speedup on the Pentium 3, and a 152% speedup on Itanium 2. These numbers represent averages over our streaming benchmark suite.

This paper is organized as follows. Section 2 gives background information on the StreamIt language. Section 3 lays the foundation for our approach by developing a quantitative model of caching behavior for any sequence of actor executions. Section 4 describes execution scaling and cache aware scheduling. Section 5 evaluates buffer management strategies, including scalar replacement. Section 6 contains our experimental evaluation of these techniques in the StreamIt compiler. Finally, Section 7 describes related work and Section 8 concludes the paper.

2. StreamIt

StreamIt is an architecture independent language that is designed for stream programming. In StreamIt, programs are represented as graphs where nodes represent computation and edges represent FIFO-ordered communication of data over tapes.

Hierarchical Streams In StreamIt, the basic programmable unit (i.e., an actor) is a filter. Each filter contains a work function that executes atomically, popping (i.e., reading) a fixed number of items from the filter's input tape and pushing (i.e., writing) a fixed number of items to the filter's output tape. A filter may also peek at a given index on its input tape without consuming the item; this makes it simple to represent computation over a sliding window. The push, pop, and peek rates are declared as part of the work function, thereby enabling the compiler to construct a static schedule of filter executions. An example implementation of a Finite Impulse Response (FIR) filter appears in Figure 1.

float->float filter FIR_Filter (int N, float[] weights) {
  work push 1 pop 1 peek N {
    float sum = 0;
    for (int i = 0; i < N; i++) {
      sum += peek(i) * weights[i];
    }
    pop();
    push(sum);
  }
}

Figure 1. StreamIt code for an FIR filter.

The work function is invoked (fired) whenever there is sufficient data on the input tape. For the FIR example in Figure 1, the filter requires at least N elements before it can execute. The value of N is known at compile time when the filter is constructed. A filter is akin to a class in object oriented programming with the work function serving as the main method. The parameters to a filter (e.g., N and weights) are equivalent to parameters passed to a class constructor.

In StreamIt, the application developer focuses on the hierarchical assembly of the stream graph and its communication topology, rather than on the explicit management of the data buffers between filters. StreamIt provides three hierarchical structures for composing filters into larger stream graphs (see Figure 2). The pipeline construct composes streams in sequence, with the output of one connected to the input of the next. An example of a pipeline appears in Figure 3.

Figure 2. Hierarchical streams in StreamIt: (a) pipeline, (b) splitjoin, (c) feedback loop.

float -> float pipeline Main() {
  add Source();   // code for Source not shown
  add FIR();
  add Output();   // code for Output not shown
}

Figure 3. Example pipeline with FIR filter (Source, FIR, Output).

The splitjoin construct distributes data to a set of parallel streams, which are then joined together in a roundrobin fashion. In a splitjoin, the splitter performs the data scattering, and the joiner performs the gathering. A splitter is a specialized filter with a single input and multiple output channels. On every execution step, it can distribute its output to any one of its children in either a duplicate or a roundrobin manner. For the former, incoming data are replicated to every sibling connected to the splitter. For the latter, data are scattered in a roundrobin manner, with each item sent to exactly one child stream, in order. The splitter type and the weights for distributing data to child streams are declared as part of the syntax (e.g., split duplicate or split roundrobin(w1, . . . , wn)). The splitter counterpart is the joiner. It is a specialized filter with multiple input channels but only one output channel. The joiner gathers data from its predecessors in a roundrobin manner (declared as part of the syntax) to produce a single output stream. StreamIt also provides a feedback loop construct for introducing cycles in the graph.

Execution Model As noted earlier, an actor (i.e., a filter, splitter, or joiner) executes whenever there are enough data items on its input tape. In StreamIt, actors have two epochs of execution: one for initialization, and one for the steady state. The initialization primes the input tapes to allow filters with peeking to execute the very first instance of their work functions. A steady state is an execution that does not change the buffering in the channels: the number of items on each channel after the execution is the same as it was before the execution. Every valid stream graph has a steady state [13], and within a steady state, there are often many possibilities for interleaving actor executions. An example of a steady state for the pipeline in Figure 4 requires filter A to fire 4 times, B 6 times, C 9 times, and D 3 times.

Figure 4. Example pipeline: A (pop 1, push 3), B (pop 2, push 3), C (pop 2, push 1), D (pop 3, push 1).

Compilation Process The StreamIt compiler derives the initialization and steady state schedules [10] and outputs a C program that includes the initialization and work functions, as well as a driver to execute each of the two schedules. Our compilation process allows the StreamIt compiler to focus on high level optimizations, and relies on existing compilers to perform machine-specific optimizations such as register allocation, instruction scheduling, and code generation—this two-step approach affords us a great deal of portability (e.g., code generated from the StreamIt compiler is compiled and run on three different machines as reported in Section 6).

3. Cache Model for Streaming

From a caching point of view, it is intuitively clear that once an actor's instruction working set is fetched into the cache, we can maximize instruction locality by running the actor as many times as possible. This of course assumes that the total code size for all actors in the steady state exceeds the capacity of the instruction cache. For our benchmarks, the total code size for a steady state ranges from 2 Kb to over 135 Kb (and commonly exceeds 16 Kb). Thus, while individual actors may have a small instruction footprint, the total footprint of the actors in a steady state exceeds a typical instruction cache size. From these observations, it is evident that we must scale the execution of actors in the steady state in order to improve temporal locality. In other words, rather than running an actor n times per steady state, we scale it to run m × n times. We term m the scaling factor.

The obvious question is: to what extent can we scale the execution of actors in the steady state? The answer is non-trivial because scaling, while beneficial to the instruction cache behavior, may overburden the data cache as the buffers between actors may grow to prohibitively large sizes that degrade the data cache behavior. Specifically, if a buffer overflows the cache, then producer-consumer locality is lost.

In this section we describe a simple and intuitive cache model to estimate the instruction and data cache miss rates for a steady state sequence of actor firings. The model serves as a foundation for reasoning about the cache aware optimizations introduced in this paper. We develop the model first for the instruction cache, and then generalize it to account for the data cache.

3.1 Instruction Cache

A steady state execution is a sequence of actor firings S = (a_1, . . . , a_n), and a program execution corresponds to one or more repetitions of the steady state. We use the notation S[i] to refer to the actor a that is fired at logical time i, and |S| to denote the length of the sequence.

Our cache model is simple in that it considers each actor in the steady state sequence, and determines whether one or more misses are bound to occur. The miss determination is based on the instruction reuse distance (IRD), which is equal to the number of unique instructions that are referenced between two executions of the actor under consideration (as they appear in the schedule). The steady state is a compact representation of the whole program execution, and thus, we simply account for the misses within a steady state, and generalize the result to the whole program. Within a steady state, an actor is charged a miss penalty if and only if the number of referenced instructions since the last execution (of the same actor) is greater than the instruction cache capacity.

Formally, let phase(S, i) for 1 ≤ i ≤ |S| represent a subsequence of k elements of S:

$$\mathrm{phase}(S, i) = (S[i], S[i+1], \ldots, S[i+k-1])$$

where k ∈ [1, |S|] is the smallest integer such that S[i + k] = S[i]. In other words, a phase is a subsequence of S that starts with the specified actor (S[i]) and ends before the next occurrence of the same actor (i.e., there are no intervening occurrences of S[i] in the phase). Note that because the steady state execution is cyclic, the construction of the subsequence is allowed to wrap around the steady state (in other words, the subsequence is formed from a new sequence S' = S|S, where | represents concatenation). For example, the steady state S1 = (AABB) has phase(S1, 1) = (A), phase(S1, 2) = (ABB), phase(S1, 3) = (B), and phase(S1, 4) = (BAA).

Let I(a) denote the code size of the work function for actor a. Then the instruction reuse distance is

$$\mathrm{IRD}(S, i) = \sum_{a} I(a)$$

where the sum is over all distinct actors a occurring in phase(S, i). We can then determine if a specific actor will result in an instruction cache miss (on its next firing) by evaluating the following step function:

$$\mathrm{IMISS}(S, i) = \begin{cases} 0 & \text{if } \mathrm{IRD}(S, i) \le C_I \quad \text{(hit: no cache refill)} \\ 1 & \text{otherwise} \quad \text{(miss: some cache refill)} \end{cases} \qquad (1)$$

In the equation, C_I represents the instruction cache size. Using Equation 1, we can estimate the instruction miss rate (IMR) of a steady state as:

$$\mathrm{IMR}(S) = \frac{1}{|S|} \sum_{i=1}^{|S|} \mathrm{IMISS}(S, i). \qquad (2)$$

The cache model allows us to rank the quality of an execution ordering: schedules that boost temporal locality result in miss rates closer to zero, and schedules that do not exploit temporal locality result in miss rates closer to one.

For example, in the steady state S1 = (AABB), assume that the combined instruction working sets exceed the instruction cache, i.e., I(A) + I(B) > C_I. Then, we expect to suffer a miss at the start of every steady state because the phase that precedes the execution of A (at S1[1]) is phase(S1, 2) with an instruction reuse distance greater than the cache size (IRD(S1, 2) > C_I). Similarly, there is a miss predicted for the first occurrence of actor B since phase(S1, 4) = (BAA) and IRD(S1, 4) > C_I. Thus, IMR(S1) = 2/4, whereas for the following variant S2 = (ABAB), IMR(S2) = 1. In the case of S2, we know that since the combined instruction working sets of the actors exceed the cache size, when actor B is fired following A, it evicts part of actor A's instruction working set. Hence when we transition back to fire actor A, we have to refetch certain instructions, but in the process, we replace parts of actor B's working set. In terms of our model, IRD(S2, i) > C_I for every actor in the sequence, i.e., 1 ≤ i ≤ |S2|.
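The model is easy to evaluate mechanically. The following self-contained C sketch computes IRD and IMR from Equations 1 and 2 for a steady state given per-actor code sizes; the actors, the code sizes (12 Kb and 10 Kb), and the 16 Kb cache are hypothetical values chosen so that I(A) + I(B) > C_I, and the program reproduces IMR(AABB) = 0.5 and IMR(ABAB) = 1 from the example above.

#include <stdio.h>
#include <string.h>

/* Hypothetical per-actor work-function code sizes, in bytes (illustrative). */
static int code_size(char actor) { return actor == 'A' ? 12000 : 10000; }

/* IRD(S,i): total code size of the distinct actors in phase(S,i), i.e. from
 * position i up to (but not including) the next firing of S[i], wrapping
 * around the cyclic steady state. */
static int ird(const char *S, int i) {
    int n = (int)strlen(S);
    int seen[256] = {0}, total = 0;
    for (int k = 0; k < n; k++) {
        char a = S[(i + k) % n];
        if (k > 0 && a == S[i]) break;   /* next occurrence of S[i] ends the phase */
        if (!seen[(unsigned char)a]) { seen[(unsigned char)a] = 1; total += code_size(a); }
    }
    return total;
}

/* IMR(S) from Equations 1 and 2, for an instruction cache of CI bytes. */
static double imr(const char *S, int CI) {
    int n = (int)strlen(S), misses = 0;
    for (int i = 0; i < n; i++)
        if (ird(S, i) > CI) misses++;
    return (double)misses / n;
}

int main(void) {
    int CI = 16384;                                  /* 16 Kb instruction cache */
    printf("IMR(AABB) = %.2f\n", imr("AABB", CI));   /* 0.50 */
    printf("IMR(ABAB) = %.2f\n", imr("ABAB", CI));   /* 1.00 */
    return 0;
}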

Note that the amount of refill is proportional to the number of cache lines that are replaced when swapping actors, and as such, we may wish to adjust our cache miss step function (IMISS). One simple variation is to allow for some partial replacement without unduly penalizing the overall value of the metric. Namely, we can allow the constant C_I to be some fraction greater than the actual cache size. Alternatively, we can use a more complicated miss function with a more uniform probability distribution.

Temporal Locality In our model, the concept of improving temporal instruction locality translates to deriving a steady state where, in the best case, every actor has only one phase that is longer than unit-length. For example, a permutation of the actors in S2 (where all phases are of length two) that improves temporal locality will result in S1, which we have shown has a relatively lower miss rate.

Execution Scaling Another approach to improving temporal locality is to scale the execution of the actors in the steady state. Scaling increases the number of consecutive firings of the same actor. In our model, a scaled steady state has a greater number of unit-length phases (i.e., a phase of length one and the shortest possible reuse distance).

We represent a scaled execution of the steady state as S^m = (a_1^m, . . . , a_n^m): the steady state S is scaled by m, which translates to m − 1 additional firings of every actor. For example, scaling S1 = (AABB) by a factor of two results in S1^2 = (AAAABBBB), and scaling S2 = (ABAB) by the same amount results in S2^2 = (AABBAABB).

From Equation 1, we observe that unit-length phases do not increase the instruction miss rate as long as the size of the actor's instruction working set is smaller than the cache size; we assume this is always the case. Therefore, scaling has the effect of preserving the pattern of miss occurrences while also lengthening the steady state. Mathematically, we can substitute into Equation 2:

$$\mathrm{IMR}(S^m) = \frac{1}{|S^m|} \sum_{i=1}^{|S^m|} \mathrm{IMISS}(S^m, i) = \frac{1}{m \times |S|} \sum_{i=1}^{|S^m|} \mathrm{IMISS}(S^m, i) = \frac{1}{m \times |S|} \sum_{i=1}^{|S|} \mathrm{IMISS}(S, i). \qquad (3)$$

The last step is possible because IMISS is zero for m − 1 out of m executions of every scaled actor. The result is that the miss rate is inversely proportional to the scaling factor.

In Figure 5 we show a representative curve relating the scaling factor to overall performance. The data corresponds to a coarse-grained implementation of a Fast Fourier Transform (FFT) running on a Pentium 3 architecture. The x-axis represents the scaling factors (with increasing values from left to right). The y-axis represents the execution time and is an indirect indicator of the miss rate (the two measures are positively correlated). The execution time improves in accord with our model: the running time is shortened as the scaling factor grows larger. There is however an eventual degradation, and as the sequel will show, it is attributed to the data cache performance.

Figure 5. Impact of execution scaling on performance (execution time in seconds vs. scaling factor, FFT on a Pentium 3).
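Operationally, execution scaling only reorders the same work-function calls. The sketch below (plain C with stub work functions; the scaling factor m = 8 is arbitrary) contrasts an unscaled steady-state driver for S = (AABB) with its scaled counterpart S^m.

#include <stdio.h>

static void work_A(void) { /* actor A's work function (stub) */ }
static void work_B(void) { /* actor B's work function (stub) */ }

int main(void) {
    const int STEADY_STATES = 1000;
    const int m = 8;   /* scaling factor, e.g. chosen by the heuristic of Figure 6 */

    /* Unscaled steady state S = (A A B B): the schedule alternates between the
     * two actors, so each iteration may refetch both instruction working sets. */
    for (int s = 0; s < STEADY_STATES; s++) {
        work_A(); work_A();
        work_B(); work_B();
    }

    /* Scaled steady state S^m: the cost of loading each actor's instructions
     * is amortized over m times more consecutive firings, at the price of
     * m-times-larger data buffers between A and B. */
    for (int s = 0; s < STEADY_STATES / m; s++) {
        for (int k = 0; k < 2 * m; k++) work_A();
        for (int k = 0; k < 2 * m; k++) work_B();
    }
    return 0;
}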

3.2 Data Cache

The results in Figure 5 show that scaling can reduce the running time of a program, but ultimately, it degrades performance. In this section, we provide a basic analytical model that helps in reasoning about the relationship between scaling and the data cache miss rate.

We distinguish between two types of data working sets. The static data working set of an actor represents state, e.g., weights in the FIR example (Figure 1). The dynamic data working set is the data generated by the work function and pushed onto the output channel. Both of these working sets impact the data cache behavior of an actor.

Intuitively, the presence of state suggests that it is prudent to maximize that working set's temporal locality. In this case, scaling positively improves the data cache performance. To see that this is true, we can define a data miss rate (DMR) based on a derivation similar to that for the instruction miss rate, replacing C_I with C_D in Equation 1, and I(a) with State(a) when calculating the reuse distance. Here, C_D represents the data cache size, and State(a) represents the total size of the static data in the specified actor.

Execution scaling however also increases the I/O requirements of a scaled actor. Let pop and push denote the declared pop and push rates of an actor, respectively. The scaling of an actor by a factor m therefore increases the pop rate to m × pop and the push rate to m × push. Combined, we represent the dynamic data working set of an actor as IO(a, m) = m × (pop + push). Therefore, we measure the data reuse distance (DRD) of an execution S with scaling factor m as follows:

$$\mathrm{DRD}(S^m, i) = \sum_{a} \bigl( \mathrm{State}(a) + \mathrm{IO}(a, m) \bigr)$$

where the sum is over all distinct actors a occurring in phase(S^m, i). While this simple measure double-counts data that are both produced and consumed within a phase, such duplication could be roughly accounted for by setting IO'(a, m) = IO(a, m)/2.

We can determine if a specific work function will result in a data cache miss (on its next firing) by evaluating the following step function:

$$\mathrm{DMISS}(S^m, i) = \begin{cases} 0 & \text{if } \mathrm{DRD}(S^m, i) \le C_D \quad \text{(hit: no cache refill)} \\ 1 & \text{otherwise} \quad \text{(miss: some cache refill)} \end{cases} \qquad (4)$$

Finally, to model the data miss rate (DMR):

$$\mathrm{DMR}(S^m) = \frac{1}{|S^m|} \sum_{i=1}^{|S^m|} \mathrm{DMISS}(S^m, i). \qquad (5)$$

It is evident from Equation 5 that scaling can lead to lower data miss rates, as the coefficient 1/|S^m| = 1/(m × |S|) is inversely proportional to m. However, as the scaling factor m grows larger, more of the DMISS values transition from 0 to 1 (they increase monotonically with the I/O rate, which is proportional to m). For sufficiently large m, DMR(S^m) = 1. Thus, scaling must be performed in moderation to avoid negatively impacting the data locality.

Note that in order to generalize the data miss rate equation so that it properly accounts for the dynamic working set, we must consider the amount of data reuse within a phase. This is because any actor that fires within phase(S, i) might consume some or all of the data generated by S[i]. The current model is simplistic, and leads to exaggerated I/O requirements for a phase. We also do not model the effects of cache conflicts, and take an "atomic" view of cache misses (i.e., either the entire working set hits or misses).

4. Cache Optimizations

In this section we describe two cache aware optimizations that are geared toward improving the cache behavior of streaming programs. First, we describe execution scaling, which scales a steady state to improve instruction locality, subject to the data working set constraints of the actors in the stream graph. Second, we describe cache aware fusion, which performs a series of granularity adjustments to the actors in the steady state. The fusion serves to (i) reduce the overhead of switching between actors, (ii) create coarser grained actors for execution scaling, and (iii) enable novel buffer management techniques between fused actors (see Section 5).

// Returns a scaling factor for steady state S
// - c is the data cache size
// - α is the fraction of c dedicated for I/O
// - p is the desired percentile of all actors to be
//   satisfied by the chosen scaling factor (0 < p ≤ 1)
calculateScalingFactor(S, c, α, p) {
  create array M of size |S|
  for i = 1 to |S| {
    a = S[i]
    // calculate effective cache size
    c' = α × (c − State(a))
    // calculate scaling factor for a such
    // that I/O requirements are close to c'
    M[i] = round(c' / IO(a, 1))
  }
  sort M into ascending numerical order
  i = ⌊(1 − p) × |S|⌋
  return M[i]
}

Figure 6. Our heuristic for calculating the scaling factor.
4.1 Execution Scaling

We have already alluded to execution scaling in previous sections. As the instruction cache model shows, increasing the number of consecutive firings of the same actor leads to lower instruction cache miss rates. However, scaling increases the data buffers that are maintained between actors. Thus it is prudent that we account for the data working set requirements as we scale a steady state.

Our approach is to scale the entire steady state by a single scaling factor, with the constraint that only a small percentage of the actors overflow the data cache. Our two-staged algorithm is outlined in Figure 6.

First, the algorithm calculates the largest possible scaling factor for every actor that appears in the steady state. To do this, it calculates the amount of data produced by each actor firing and divides the available data cache size by this data production rate. In addition, the algorithm can toggle the effective cache size to account for various eviction policies.

Second, it chooses the largest factor that allows a fraction p of the steady state actors to be scaled safely (i.e., the cache is adequate for their I/O requirements). For example, the algorithm might calculate mA = 10, mB = 20, mC = 30, and mD = 40, for four actors in some steady state. That is, scaling actor A beyond 10 consecutive iterations will cause its dynamic I/O requirements to exceed the data cache. Therefore, the largest m that allows p = 90% of the actors to be scaled without violating the cache constraints is 10. Similarly, to allow for the safe scaling of p = 75% of the actors, the largest factor we can choose is 20.

In our implementation, we use a 90-10 heuristic. In other words, we set p = 90%. We empirically determined this value via a series of experiments using our benchmark suite; see [18] for detailed results.

Note that our algorithm adjusts the effective cache size that is reserved for an actor's dynamic working set (i.e., data accessed via pop and push). This adjustment allows us to control the fraction of the cache that is used for reading and writing data—and affords some flexibility in targeting various cache organizations. For example, architectures with highly associative and multilevel caches may benefit from scaling up the effective cache size (i.e., α > 1), whereas a direct mapped cache that is more prone to conflicts may benefit from scaling down the cache (i.e., α < 1). In our implementation, we found α = 2/3 to work well. However, we note that the optimal choice for the effective cache size is a complex function of the underlying cache organization and possibly the application as well; this is an interesting issue that warrants further investigation.
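For concreteness, the heuristic of Figure 6 can be transcribed into C as follows. This is a sketch, not the compiler's implementation: the Actor struct, the byte-based rates, and the values in main are assumptions, but the α scaling of the effective cache size and the (1 − p) percentile selection follow the pseudocode.

#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int state_bytes;   /* State(a): static data such as FIR weights */
    int pop_bytes;     /* bytes consumed per firing                 */
    int push_bytes;    /* bytes produced per firing                 */
} Actor;

static int cmp_int(const void *x, const void *y) {
    return *(const int *)x - *(const int *)y;
}

/* Figure 6: pick one scaling factor for the whole steady state such that a
 * fraction p of the actors keep their I/O within the effective data cache. */
static int calculate_scaling_factor(const Actor *S, int n,
                                    int cache_bytes, double alpha, double p) {
    int *M = malloc(n * sizeof *M);
    for (int i = 0; i < n; i++) {
        double c_eff = alpha * (cache_bytes - S[i].state_bytes);  /* effective cache   */
        double io1   = S[i].pop_bytes + S[i].push_bytes;          /* IO(a, 1)          */
        int m = (int)(c_eff / io1 + 0.5);                         /* round(c'/IO(a,1)) */
        M[i] = m > 1 ? m : 1;
    }
    qsort(M, n, sizeof *M, cmp_int);
    int idx = (int)((1.0 - p) * n);      /* (1 - p) percentile of the sorted factors */
    int result = M[idx];
    free(M);
    return result;
}

int main(void) {
    /* Four hypothetical, stateless actors whose per-actor factors come out to
     * 10, 20, 30, and 40 with a 15 Kb data cache and alpha = 2/3. */
    Actor S[4] = {
        { 0, 512, 512 }, { 0, 256, 256 }, { 0, 171, 170 }, { 0, 128, 128 },
    };
    int m = calculate_scaling_factor(S, 4, 15 * 1024, 2.0 / 3.0, 0.90);
    printf("chosen scaling factor m = %d\n", m);   /* prints 10 */
    return 0;
}

Run as is, this selects m = 10 for p = 90% (all four actors must fit) and would select 20 for p = 75%, matching the worked example above.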
4.2 Cache Aware Fusion

In StreamIt, the granularity of actors is determined by the application developer, according to the most natural representation of an algorithm. When compiling to a cache-based architecture, the presence of a large number of actors exacerbates the transition overhead between work functions. It is the role of the compiler to adjust the granularity of the stream graph to mitigate the execution overhead.

In this section we describe an actor coarsening technique we refer to as cache aware fusion (CAF). When two actors are fused, they form a new actor whose work function is equivalent to its constituents. For example, let an actor A fire n times, and an actor B fire 2n times per steady state: S_n = (A^n B^n B^n). Fusing A and B results in an actor F that is equivalent to one firing of A and two firings of B; F fires n times per steady state (S_n = (F^n)). In other terms, the work function for actor F inlines the work functions of A and B.

When two actors are fused, their executions are scaled such that the output rate of one actor matches the input rate of the next. In the example, A and B represent a producer-consumer pair of filters within a pipeline, with filter A pushing two items per firing, and B popping one item per firing. The fusion implicitly scales the execution of B so that it runs twice for every firing of A.

Fusion also reduces the overhead of switching between work functions. In our infrastructure, the steady state is a loop that invokes the work functions via method calls. Thus, every pair of fused actors eliminates a method call (per invocation of the actors). The impact on performance can be significant, but not only because method calls are removed: the fusion of two actors also enables the compiler to optimize across function boundaries. In particular, for actors that exchange only a few data items, the compiler can allocate the data streams to registers. The data channel between fused actors is subject to special buffer management techniques as described in the next section.

void->void pipeline BufferTest {
  add Source();
  add FIR();
}

void->float filter Source {
  work push 1 {
    push( ... );
  }
}

float->void filter FIR {
  int PEEK = 4;
  work pop 1 peek PEEK {
    float result = 0;
    for (int i=1; i ...
  }
}

Figure 7. StreamIt source for the synthetic peeking benchmark used in Section 5 (listing abridged).

void->void filter BufferTest {
  int PEEK = 4;
  float[4] BUFFER;
  int push_index = 0;
  int pop_index = 0;

  prework {
    for (int i=0; i ...
  }

  work {
    // run Source
    BUFFER[push_index] = ... ;
    ...
  }
}

Figure 8. Fused version of the pipeline in Figure 7, using a circular buffer (listing abridged).
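Returning to the producer-consumer example above (A pushes two items per firing, B pops one), the following minimal C sketch shows the shape of a fused work function; the names work_A, work_B, work_F and their trivial bodies are placeholders, not compiler output. The two-element channel becomes a local array that scalar replacement and register allocation can then eliminate.

#include <stdio.h>

/* Producer A: pushes 2 items per firing (stub computation). */
static void work_A(float out[2]) { out[0] = 1.0f; out[1] = 2.0f; }

/* Consumer B: pops 1 item per firing. */
static float acc = 0;
static void work_B(float in) { acc += in; }

/* Fused actor F = one firing of A + two firings of B.  The intermediate
 * channel shrinks to a 2-element local array, which scalar replacement can
 * further split into registers. */
static void work_F(void) {
    float chan[2];
    work_A(chan);
    work_B(chan[0]);
    work_B(chan[1]);
}

int main(void) {
    for (int i = 0; i < 4; i++) work_F();   /* the steady state fires F n times */
    printf("%f\n", acc);
    return 0;
}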

Figure 9. Performance of buffer management strategies on a StrongARM (execution time in seconds per 10^7 outputs vs. peek rate, for modulation, copy-shift, and copy-shift + scaling).

Figure 10. Performance of buffer management strategies on a Pentium 3.

Figure 11. Performance of buffer management strategies on an Itanium 2.

Figure 12. Copy-shift buffer management: the fused filter keeps a float[3] BUFFER of live items and copies them into a local float[4] TEMP_BUFFER at the start of each execution (code listing).

Figure 13. Copy-shift with scalar replacement: the temporary buffer is replaced by scalars TEMP_BUFFER_0 through TEMP_BUFFER_3 (code listing).

Figure 14. Copy-shift with execution scaling: a float[32] TEMP_BUFFER and a loop that runs the Source 16 times per buffer shift (code listing).

5. Buffer Management

The fused filter of Figure 8 contains a prework function that fills the channel with initial items, as well as a work function that represents the steady-state execution.

5.1 Modulation

The modulation scheme uses a traditional circular-buffer approach. Three variables are introduced: a BUFFER to hold all items transferred between the actors, a push_index to indicate the buffer location that will be written next, and a pop_index to indicate the buffer location that will be read next (i.e., the location corresponding to peek(0)). The communication primitives are translated as follows:

push(val); ==> BUFFER[push_index] = val;
               push_index = (push_index + 1) % BUF_SIZE;

pop();     ==> pop_index = (pop_index + 1) % BUF_SIZE;

peek(i)    ==> BUFFER[(pop_index + i) % BUF_SIZE]

The StreamIt compiler converts the modulo operations to bitwise-and operations by scaling the buffer to a power of two. Note that if there are no peek operations, then the buffer will be empty following each execution of the downstream actor. In this case, the indices can be reset to zero at the start of each execution and the modulo operations can be eliminated. However, in our example the FIR actor performs peeking, so the modulo operations are needed.

Experimental setup. Figures 9, 10 and 11 illustrate the performance of various buffer management strategies on a 137 MHz StrongARM 1110, a 600 MHz Pentium 3 and a 1.3 GHz Itanium 2, respectively. The figures illustrate the execution time per 10^7 outputs for the synthetic benchmark (Figure 7) across a range of PEEK values. To ensure a fair comparison with the scalar replacement optimization (Section 5.3), all loops in the original actor are fully unrolled.

Evaluation. The time required for the modulation strategy increases linearly with the peek rate. This is expected, as there is a constant overhead per peek operation. Relative to the other strategies, modulation performs noticeably better on the Itanium 2. We attribute this to the six integer units on the Itanium 2; since there is not much additional work in this benchmark, it can likely process the modulo operations in parallel with other operations using software pipelining.
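As a runnable illustration of the modulation scheme (not the generated code itself), the following C sketch applies the translation above to the Source/FIR example with PEEK = 4, using a power-of-two buffer so that the modulo becomes a bitwise AND, as the StreamIt compiler does.

#include <stdio.h>

#define BUF_SIZE 8                 /* power of two, >= peek rate */
#define MASK     (BUF_SIZE - 1)
#define PEEK     4

static float BUFFER[BUF_SIZE];
static int push_index = 0, pop_index = 0;

static void push(float v) { BUFFER[push_index] = v; push_index = (push_index + 1) & MASK; }
static float peek(int i)  { return BUFFER[(pop_index + i) & MASK]; }
static void pop(void)     { pop_index = (pop_index + 1) & MASK; }

int main(void) {
    /* prework: prime the channel so the peeking consumer can fire */
    for (int i = 0; i < PEEK - 1; i++) push(0.0f);

    for (int iter = 0; iter < 16; iter++) {
        push((float)iter);                 /* run Source: push 1 item */
        float result = 0;                  /* run FIR: peek 4, pop 1  */
        for (int i = 0; i < PEEK; i++) result += peek(i);
        pop();
        printf("%.1f\n", result);
    }
    return 0;
}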
ing to dramatically decrease the overhead associated with copying However, the cost of this strategy comes in the copying op- the buffer contents on each iteration. In any actor, the number of erations: at the start of each execution, (peek − pop) items are items inspected on one execution and saved for the next execution copied from the persistent BUFFER to the beginning of a local is (peek − pop). This represents the number of items copied by the TEMP BUFFER. Subsequent operations reference TEMP BUFFER, copy-shift scheme. However, this cost can be amortized by scal- and the live items are copied back to the BUFFER upon comple- ing the number of executions of the downstream actor in the fused tion. While these two variables could also be combined into a sin- code. By enclosing the body of the actor in a loop, the peek and gle buffer, keeping them separate results in a smaller live data set pop rates can be made arbitrarily large, while (peek −pop) remains when the actor is not executing. constant. The communication primitives are translated as follows: Thus, execution scaling reduces the fraction of time spent copy- ing to an arbitrarily small level. In our study, we scale the execu- push(val); ==> TEMP_BUFFER[push_index] = val; 1 tions of an actor until (peek − pop) ≤ 4 pop. In the synthetic push_index = push_index + 1; benchmark, this implies that each actor body executes 16 times be- pop(); ==> pop_index = pop_index + 1; fore the buffer contents are shifted. The code resulting from this transformation is shown in Figure 14. peek(i) ==> TEMP_BUFFER[pop_index + i] Note that due to the large loops introduced by execution scaling, it cannot be used in combination with scalar replacement. If the Compared to the modulation scheme, the copy-shift strategy refer- loops were unrolled to resolve the array indices, there could be a ences the TEMP BUFFER and does not perform modulo operations. negative impact on the instruction cache. Evaluation. As shown in Figures 9, 10 and 11, the unoptimized Evaluation. As shown in Figures 9 and 10, the copy-shift ap- copy-shift strategy is the slowest strategy that we evaluate. Though proach with execution scaling performs significantly better than the cost per peek operation is smaller than the modulation scheme, modulation on the StrongARM and Pentium 3. At the maximum the copying overhead per iteration also grows with the peek rate peek rate of 128, the StrongARM exhibits a 51% speedup while and cancels out any savings; overall, copy-shift performs from 2× the Pentium 3 shows a 48%. These speedups make sense, as each to 3× slower than modulation. The following sections describe peek operation is cheaper due to the eliminated modulo operations optimizations that can justify taking the copy-shift approach. (implemented as bitwise-and in the modulation scheme), while the overhead from copying is reduced to a fraction of the original copy- 5.3 Copy-Shift with Scalar Replacement shift approach. The first optimization enabled by the copy-shift scheme is dubbed The gains are less substantial on Itanium 2, where the speedup scalar replacement. In contrast to the modulation scheme, the copy- at peek = 128 is only 5% (Figure 11). We attribute this to the rela- shift approach can result in array operations that access the same tively high performance of the modulation approach on Itanium 2; location on every execution of the actor. 
5.4 Copy-Shift with Execution Scaling

A final optimization of the copy-shift strategy uses execution scaling to dramatically decrease the overhead associated with copying the buffer contents on each iteration. In any actor, the number of items inspected on one execution and saved for the next execution is (peek − pop). This represents the number of items copied by the copy-shift scheme. However, this cost can be amortized by scaling the number of executions of the downstream actor in the fused code. By enclosing the body of the actor in a loop, the peek and pop rates can be made arbitrarily large, while (peek − pop) remains constant.

Thus, execution scaling reduces the fraction of time spent copying to an arbitrarily small level. In our study, we scale the executions of an actor until (peek − pop) ≤ (1/4) pop. In the synthetic benchmark, this implies that each actor body executes 16 times before the buffer contents are shifted. The code resulting from this transformation is shown in Figure 14.

Note that due to the large loops introduced by execution scaling, it cannot be used in combination with scalar replacement. If the loops were unrolled to resolve the array indices, there could be a negative impact on the instruction cache.

Evaluation. As shown in Figures 9 and 10, the copy-shift approach with execution scaling performs significantly better than modulation on the StrongARM and Pentium 3. At the maximum peek rate of 128, the StrongARM exhibits a 51% speedup while the Pentium 3 shows a 48% speedup. These speedups make sense, as each peek operation is cheaper due to the eliminated modulo operations (implemented as bitwise-and in the modulation scheme), while the overhead from copying is reduced to a fraction of the original copy-shift approach.

The gains are less substantial on Itanium 2, where the speedup at peek = 128 is only 5% (Figure 11). We attribute this to the relatively high performance of the modulation approach on Itanium 2; due to the six parallel integer units, the bitwise-and operation may not increase the critical path. This balance might be different in programs with higher integer workloads within the actors. Still, copy-shift with execution scaling does no worse than modulation (on any architecture), and execution scaling always offers a large speedup over the original copy-shift approach.

5.5 Summary

We conclude that copy-shift with execution scaling is the best buffer management strategy for actors that utilize peeking. This is somewhat surprising because the unoptimized copy-shift strategy has large overheads that result in a slowdown relative to a circular buffer with modulation. However, by leveraging the flexibility of the parallel stream graph to perform execution scaling, the overheads are amortized. Compared to a plain circular buffer strategy, there are significant improvements on StrongARM (51% speedup) and Pentium 3 (48% speedup), and respectable performance (5% speedup) on Itanium 2.
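Finally, a C sketch of copy-shift with execution scaling for the same example: the actor body is enclosed in a loop that runs 16 times per invocation, so the (peek − pop) = 3 live items are shifted once per 16 outputs instead of once per output. The buffer sizes and the factor of 16 follow the text; the code itself is illustrative, not the compiler's output.

#include <stdio.h>

#define PEEK   4
#define POP    1
#define SCALE  16                               /* executions per buffer shift */

static float BUFFER[PEEK - POP];                /* 3 live items between invocations */

static void scaled_iteration(int base) {
    float TEMP[PEEK - POP + SCALE];             /* 19 items: live window + new inputs */
    int i;
    for (i = 0; i < PEEK - POP; i++)            /* copy live items to the front */
        TEMP[i] = BUFFER[i];

    for (int k = 0; k < SCALE; k++) {
        TEMP[PEEK - POP + k] = (float)(base + k);     /* run Source */
        float result = 0;                             /* run FIR over a 4-item window */
        for (i = 0; i < PEEK; i++) result += TEMP[k + i];
        printf("%.1f\n", result);
    }

    for (i = 0; i < PEEK - POP; i++)            /* shift live items back once */
        BUFFER[i] = TEMP[SCALE + i];
}

int main(void) {
    for (int i = 0; i < PEEK - POP; i++) BUFFER[i] = 0.0f;   /* prework */
    for (int iter = 0; iter < 3; iter++)
        scaled_iteration(iter * SCALE);
    return 0;
}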

Benchmark     Description                                           # of Actors
bitonic       bitonic sort of 64 integers                           972
fir           finite impulse response (128 taps)                    132
fft-fine      fine grained 64-way FFT                               267
fft-coarse    coarse grained 64-way FFT                             26
3gpp          3GPP Radio Access Protocol                            105
beamformer    beamformer with 64 channels and 1 beam                197
matmult       matrix multiplication                                 48
fmradio       FM Radio with 10-way equalizer                        49
filterbank    filterbank program (8 bands, 32 taps / filter)        53
filterbank2   independent filterbank (3 bands, 100 taps / filter)   37
ofdm          Orthogonal Frequency Division Multiplexor [20]        16

Table 1. Evaluation benchmark suite.

Figure 15. Summary of results on a StrongARM (execution time normalized to unoptimized StreamIt; bars: Fuse All and CAF+scaling+buffer).

6. Experimental Evaluation

In this section we evaluate the merits of the proposed cache aware optimizations and buffer management strategies. We use three different architectures: a 137 MHz StrongARM 1110, a 600 MHz Pentium 3 and a 1.3 GHz Itanium 2. The StrongARM results reflect performance for an embedded target; it has a 16 Kb L1 instruction cache, an 8 Kb L1 data cache, and no L2 cache. The StrongARM also has a separate 512-byte minicache (not targeted by our optimizations). The Pentium 3 and Itanium 2 reflect desktop performance; they have a 16 Kb L1 instruction cache, 16 Kb L1 data cache, and 256 Kb shared L2 cache.

Our benchmark suite (see Table 1) consists of 11 StreamIt applications. They are compiled with the StreamIt compiler which applies the optimizations described in this paper, as well as aggressive loop unrolling (by a factor of 128 for all benchmarks) to facilitate scalar replacement (Section 5). The StreamIt compiler outputs a functionally equivalent C program that is compiled with gcc (v3.4,

-O3) for the StrongARM and for the Pentium 3 and with ecc (v7.0,



 r r t  c fi e e  io k 2 m e n n s e ul n k  g

-O3) for the Itanium 2. Each benchmark is then run five times, and fi r m d a n fd a ea

 

t- a r tm ra b a o

 

o r  ff c fo a m e rb om t- m m f lt e e the median user time is recorded. ff a fi lt g e fi As the StrongARM does not have a floating point unit, we con- b verted all of our floating point applications (i.e., every application Figure 16. Summary of results on a Pentium 3 except for bitonic) to operate on integers rather than floats. In practice, a detailed precision analysis is needed in converting such 1.2 applications to fixed-point. However, as the control flow within Fuse All CAF+scaling+buffer these applications is very static, we are able to preserve the com- 1 ) t putation pattern for the sake of benchmarking by simply replacing I m a e r

every floating point type with an integer type. t 0.8 S

We also made an additional modification in compiling to the d e e z i m ti m StrongARM: our execution scaling heuristic scales actors until their ti

p 0.6 tion no output fills 100% of the data cache, rather than 2/3 of the data cache u u c

e x to

as described in Section 4. This modification accounts for the 32- e d

e 0.4 z i way set-associative L1 data cache in the StrongARM. Due to the l a m high degree of associativity, there is a smaller chance that the actor r no outputs will repeatedly evict the state variables of the actor, thereby ( 0.2 making it worthwhile to further fill the data cache. This observation

yields up to 25% improvement on some benchmarks. 0



 r r t

c fi e e  io k 2 m e n n s e ul n k $ g

fi r m d a n fd a ea

  

 t- a r tm ra b a o

" #

o r ! ff c fo a m e rb om Overall Speedup The overall speedups offered by our techniques t- m m f lt e e ff a fi lt g e fi are illustrated in Figure 15 (StrongARM), Figure 16 (Pentium 3), b and 16 (Itanium 2). These graphs have two bars: one for “full fu- Figure 17. Summary of results on an Itanium 2 sion”, in which all actors are fused into a single function (with scalar replacement), and one labeled CAF+scaling+buffer, representing all of our optimizations (cache aware fusion, execu- tion scaling, and buffer optimizations) applied together. We in- StreamIt, in which no actors are fused (but there is still unrolling clude the comparison to full fusion because it represents a sim- by 128). On the right of each graph, both arithmetic and geometric ple approach for eliminating function call overhead and optimiz- means are shown for the benchmarks. We usually speak in terms of ing across actor boundaries; while this fusion strategy utilizes our the arithmetic means as they are more intuitive (and also yield more scalar replacement optimization, it remains oblivious to instruc- conservative speedups), though we refer to the geometric mean for tion and data locality. Performance is normalized to unoptimized an unambiguous reference on close results. On StrongARM (Figure 15), our cache optimizations offer a 1.2 249% average speedup over the baseline and a 162% average Scaling CAF CAF+scaling CAF+scaling+buffer speedup over full fusion. Scalar replacement is responsible for 1 ) t I

much of the gains that both strategies offer over the baseline. Cache m a e r optimizations always perform better than the baseline, and they t S

0.8 d e perform better than full fusion in all cases except for 3gpp, where e z i m ti m

ti they yield a 45% slowdown. This slowdown is due to conservative n p o

o 0.6 ti n u u code size estimation: the compiler predicts that the fused version c

e o x t

e of 3gpp will not fit into the instruction cache, thereby prevent- d e

z 0.4 i l

ing fusion. However, due to optimizations by gcc, the final code a m r size is smaller than expected and does fit within the cache. While o n such inaccuracies could be improved by adding feedback between ( 0.2 the output of gcc and our code estimation, each fusion possibility would need to be evaluated separately as the fusion boundary af- 0

r r t n £ ic i e e ¢ p l io k 2

n f n s e u n k m ge a ¡

fects the impact of low-level optimizations (and thus the final code fi r m d a n fd a e

¥ ¦ o - a m a ¤ ft o 3 r t r rb a o b f c fo a m e rb om t- m m f lt e a ff a fi lt ge size). e fi The speedups offered by cache optimizations over a full fusion b strategy are more modest for the desktop processors: 34% aver- Figure 18. Impact of each optimization on a StrongARM. age speedup on Pentium 3 (Figure 16) and essentially zero speedup

(6% by the arithmetic mean, -8% by the geometric mean) on Ita- 1.2 nium 2 (Figure 17). Out of the 11 benchmarks, cache optimizations Scaling CAF CAF+scaling CAF+scaling+buffer perform as well or better than full fusion for 7 benchmarks on the 1 ) t

Pentium 3 and 5 benchmarks on the Itanium 2. Performance on I m a

any architecture is a tradeoff between two factors: 1) the benefit of e r t 0.8 S data and instruction locality, and 2) the benefit of fusion, which re- d e e z i m

duces memory accesses due to improved register allocation across ti m

ti n p

o 0.6 o actor boundaries. Compared to the StrongARM, the Pentium 3 and ti n u u c

e o

Itanium 2 offer an L2 cache (as well as a larger L1 data cache), x t

e d

e 0.4 thereby lessening the impact of locality-enhancing cache optimiza- z i l a

tions. However, the fusion benefit remains a significant factor; for m r o n example, using VTune on the Pentium 3, we measured that full ( 0.2 fusion offers a 50% reduction in memory accesses over the cache- optimized version. This effect may be pronounced on the Itanium 2 0

due to the larger number of registers on that architecture (128 gen- r r t n

c i e e © p l o k 2

i f n s e u i n k m ge a ¨

§ n fi r m d a n fd a e

o - a m a ft o 3 r t r rb a o eral, 128 floating point). While fusion benefits are also present on b f c fo a m e rb om t- m m f lt e a ff a fi lt ge e fi the StrongARM, cache optimizations are more important on that b processor due to the large penalty for cache misses. Figure 19. Impact of each optimization on a Pentium 3. The cache aware fusion algorithm can be adjusted to account for the caching hierarchy on desktop machines. Specifically, if cache 1.2 aware fusion is modified to allow fused actors to consume up to one Scaling CAF CAF+scaling CAF+scaling+buffer half of the L2 cache (rather than L1 cache), then the performance 1

of the cache optimizations is closer or equal to full fusion, for cases ) t I m it was previously trailing on the Pentium 3 or the Itanium 2 (data a e r t 0.8 S

not shown). The only benchmark that is negatively impacted by this d e e z

ofdm i change is , where two large actors are fused despite a very low m ti m

ti n p communication to computation ratio, thereby lessening the impact o 0.6 o ti n u u c

of eliminated memory accesses while nonetheless worsening the e o x t

e d instruction locality. e 0.4 z i l a m r o n Impact of Each Optimization To better understand the overall ( 0.2 speedups, we assess the individual performance impact of execu- tion scaling, cache aware fusion, and buffer management optimiza- 0

r r t n  ic i e e  p l io k 2

tions. These results are illustrated in Figures 18, 19 and 20 for the n f n s e u n k m ge a   i r d d o -f a m m a a n f a e ft o 3 r t r rb a o b f c fo a m e rb ver om t- m m f lt e a StrongARM, Pentium 3, and Itanium 2, respectively. There are four ff a fi lt ge e fi bars per benchmark; as in the previous analysis, all performance is b normalized to unoptimized StreamIt (with 128-way unrolling). The Figure 20. Impact of each optimization on an Itanium 2. first bar, labeled scaling, applies only execution scaling. The second bar, labeled CAF, applies only cache aware fusion, includ- ing scalar replacement across the fused actors. The third bar, la- Execution scaling improves performance over unoptimized beled CAF+scaling, first applies cache aware fusion with scalar StreamIt, with average speedups of 145% for StrongARM, 58% replacement and then applies execution-scaling to the granularity- for Pentium 3, and 57% for Itanium 2. The 90-10 heuristic works adjusted actors. The fourth bar, labeled CAF+scaling+buffer, quite well (see [18] for full details) and there is only one in- additionally applies buffer management optimizations (detailed stance where scaling results in a performance degradation. The later); this bar is equivalent to the “best” cache-optimized per- granularity-adjusted 3gpp on StrongARM has a 17% slowdown formance illustrated in Figures 15, 16, and 17. due to scaling (compare CAF to CAF+scaling in Figure 15). This is possibly due to items in flight between the granularity- 7. Related Work adjusted actors overwriting the state of an executing actor in the There is a large body of literature on scheduling synchronous data cache. Since StrongARM has no L2 cache then such eviction dataflow (SDF) graphs to optimize various metrics [4, 5]. The can be quite expensive. work most closely related to ours is a recent study by Kohli [11] Independently, cache aware fusion also improves performance on cache aware scheduling of SDF graphs, implemented as part by 84% on StrongARM, 101% on Pentium 3 and a 103% on the Ita- of the Ptolemy framework for simulating heterogeneous embedded fil- nium 2. cache aware fusion degrades performance only for systems [12]. Kohli develops a Cache Aware Scheduling (CAS) ter bank2 - (by 6% on StrongARM). When we combine cache heuristic for an embedded target with a software-managed scratch- aware fusion with execution scaling, the performance consistently pad instruction cache. His algorithm greedily decides how many CAF+scaling improves. The speedup of over baseline is 241% times to execute a given actor based on estimates of the data cache on StrongARM, 146% on Pentium 3 and 144% on Itanium 2. and instruction cache penalties associated with switching to the However, after coarsening the actors with cache aware fusion, next actor. In contrast, our algorithm considers the buffering re- scaling results in less additional speedup than it did relative to the quirements of all filters in a given container and increases the mul- CAF+scaling CAF baseline. The speedup of over is 86% for tiplicity so long as 90% of buffers are contained within the data StrongARM, 22% for Pentium 3 and only 20% for Itanium 2. This cache. Kohli does not consider buffer management strategies, and is because some actors are implicitly scaled by fusion to match the evaluation is limited to one 6-filter pipeline and an assortment input/output rates of successive actors within a fused block. of random SDF graphs. 
An empirical comparison of our heuristics ofmd Note that the benchmark does not benefit from fusion on a common architectural target would be an interesting direction ofmd or scaling. This is because has few actors, some of which for future work. consume and produce a total of 16-66 Kb data; consequently, exe- It is recognized that there is a tradeoff between code size and cution scaling does not apply. Also there is limited opportunity to buffer size when determining an SDF schedule. Most techniques ofmd fuse actors within , as there are actors that have an instruc- to date have focused on “single appearance schedules” in which tion size of 9 Kb and fusing them with other actors would exceed each filter appears at only one position in the loop nest denoting the instruction cache. the schedule. Such schedules guarantee minimal code size and fa- CAF+scaling+buffer The last bar, , illustrates the benefit cilitate the inlining of filters. There are a number of approaches to of buffer management optimizations for filters that peek. As de- minimizing the buffer requirements for single-appearance sched- tailed in Section 5, such filters demand specialized buffer man- ules (see [4] for a review). While it has been shown that obtain- agement, as they reuse items on the input tape across succes- ing the minimal memory requirements for general graphs is NP- sive iterations. In our benchmark suite, peeking occurs only in the complete [3], there are two complimentary heuristics, APGAN fmradio filterbank filter- last four applications (i.e., , , (Pairwise Grouping of Adjacent Nodes) and RPMC (Recursive Par- bank2 ofdm CAF+scaling and ). Thus, the bar is equivalent titioning by Minimum Cuts), that have been shown to be effective CAF+scaling+buffer bitonic to for all benchmarks from when applied together [3]. Buffer merging[15, 16] represents an- matmult through . other technique for decreasing buffer sizes, which could be inte- The buffer optimization applied in Figures 15, 16, and 17 is grated with our approach in the future. cutpeek termed . In this optimization, the cache aware fusion Govindarajan et al. develop a linear programming framework algorithm is modified so that two adjacent filters are never fused for determining the “rate-optimal schedule” with the minimal if the downstream filter performs any peeking (i.e., it has peek > memory requirement [7]. A rate-optimal schedule is one that takes pop). Following execution scaling, the live items are copied to the advantage of parallel resources to execute the graph with the max- start of the buffer once per scaled execution (using the copy-shift imal throughput. However, the technique is specific to rate-optimal strategy, Section 5), thereby reducing the overhead of copying live schedules and can result in a code size explosion, as the same node items. Optimized buffer management offers the largest gains for is potentially executed in many different contexts. filterbank2 : 45% on StrongARM, 38% on Pentium 3, and The work described above is related to ours in that minimizing 36% on Itanium 2. This is due to a large peek rate (100 items) buffer requirements can also improve caching behavior. However, in addition a large scaling factor of 300 that amortizes the cost of our goal is different in that we aim to improve spatial and temporal copying items. locality instead of simply decreasing the size of the live data set. 
While we found that cutpeek is, on average, the best buffer management strategy across the three platforms, we also evaluated several others. For fmradio, up to a 5% improvement is offered by a strategy called peekscale, in which (prior to other optimizations) filters with peek > pop are scaled to perform several executions at once. As scaling is performed, the pop and peek rates increase but peek − pop remains constant; scaling continues until the pop rate is at least 4 ∗ (peek − pop). Like cutpeek, this serves to amortize the overhead of copy-shift, but it can also hamper the execution scaling optimization. Because certain other actors in the graph must fire more frequently to compensate for the scaled peeking filter, there is less room for global execution scaling. The only other case where peekscale is the best strategy is for ofdm on StrongARM, where there is a 7% improvement over cutpeek. Finally, on the filterbank benchmark, the highest performance results from unoptimized copy-shift buffer management. We also evaluated buffer management that uses modulo indexing, but this did not yield the best performance for any benchmark or platform.
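As a concrete reading of the stopping rule above, a peekscale factor could be computed as in the following sketch. The function name and the closed-form expression are our own illustration, under the assumption that scaling stops as soon as the scaled pop rate reaches 4 ∗ (peek − pop).

    /* Sketch: choose the smallest factor k such that k*pop >= 4*(peek - pop).
       Scaling a filter to fire k times at once multiplies its pop rate by k
       while peek - pop stays constant. */
    #include <stdio.h>

    static int peekscale_factor(int peek, int pop) {
        if (pop <= 0 || peek <= pop) return 1;       /* no peeking: leave as is */
        int excess = peek - pop;                     /* invariant under scaling */
        int k = (4 * excess + pop - 1) / pop;        /* ceil(4 * excess / pop)  */
        return k > 1 ? k : 1;
    }

    int main(void) {
        /* A filter with peek = 100 and pop = 1 would be scaled 396-fold,
           so that it pops 396 items and peeks 396 + 99 items per firing. */
        printf("%d\n", peekscale_factor(100, 1));
        return 0;
    }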
7. Related Work

There is a large body of literature on scheduling synchronous dataflow (SDF) graphs to optimize various metrics [4, 5]. The work most closely related to ours is a recent study by Kohli [11] on cache aware scheduling of SDF graphs, implemented as part of the Ptolemy framework for simulating heterogeneous embedded systems [12]. Kohli develops a Cache Aware Scheduling (CAS) heuristic for an embedded target with a software-managed scratchpad instruction cache. His algorithm greedily decides how many times to execute a given actor based on estimates of the data cache and instruction cache penalties associated with switching to the next actor. In contrast, our algorithm considers the buffering requirements of all filters in a given container and increases the multiplicity so long as 90% of buffers are contained within the data cache. Kohli does not consider buffer management strategies, and the evaluation is limited to one 6-filter pipeline and an assortment of random SDF graphs. An empirical comparison of our heuristics on a common architectural target would be an interesting direction for future work.

It is recognized that there is a tradeoff between code size and buffer size when determining an SDF schedule. Most techniques to date have focused on "single appearance schedules," in which each filter appears at only one position in the loop nest denoting the schedule. Such schedules guarantee minimal code size and facilitate the inlining of filters. There are a number of approaches to minimizing the buffer requirements for single appearance schedules (see [4] for a review). While it has been shown that obtaining the minimal memory requirements for general graphs is NP-complete [3], there are two complementary heuristics, APGAN (Pairwise Grouping of Adjacent Nodes) and RPMC (Recursive Partitioning by Minimum Cuts), that have been shown to be effective when applied together [3]. Buffer merging [15, 16] represents another technique for decreasing buffer sizes, which could be integrated with our approach in the future.

Govindarajan et al. develop a linear programming framework for determining the "rate-optimal schedule" with the minimal memory requirement [7]. A rate-optimal schedule is one that takes advantage of parallel resources to execute the graph with the maximal throughput. However, the technique is specific to rate-optimal schedules and can result in a code size explosion, as the same node is potentially executed in many different contexts.

The work described above is related to ours in that minimizing buffer requirements can also improve caching behavior. However, our goal is different in that we aim to improve spatial and temporal locality instead of simply decreasing the size of the live data set. In fact, our scaling transformation actually increases the size of the data buffers, leading to higher performance across our benchmark suite. Our transformations also take into account the sizes of the instruction and data caches to select an appropriate scaling and partitioning for the stream graph.

Proebsting and Watterson [17] give a fusion algorithm that interleaves the control flow graphs of adjacent filters. However, their algorithm only supports synchronous get and put operations; StreamIt's peek operation necessitates buffer management between filters.

There are a large number of stream programming languages; see [19] for a review. The Brook language [6] extends C to include data-parallel kernels and multi-dimensional streams that can be manipulated via predefined operators. Synchronous languages such as Esterel [2] and LUSTRE [8] also target the embedded domain, but they are more control-oriented than StreamIt and are less amenable to compile-time optimizations. Benveniste et al. [1] provide an overview of dataflow synchronous languages. Sisal (Streams and Iteration in a Single Assignment Language) is a high-performance, implicitly parallel functional language [9]. We are not aware of any cache aware optimizations in these stream languages.

There is a large body of work covering cache miss equations, and an equally large body of work concerned with analytical models for reasoning about data reuse distances and cache behavior. The model introduced in this paper is loosely based on the notion of stack reuse distances [14]. Our model is especially tailored to streaming computations, and it is unique in leveraging the concept of a steady-state execution.

8. Conclusions

In this paper, we present a simple yet highly effective methodology for running streaming programs on common cache-based architectures. The work shows that we can exploit the abundant parallelism in streaming codes to improve the behavior of the cache hierarchy and deliver significant performance improvements on existing machines. We evaluate our methodology on three architectures: the embedded StrongARM processor, the superscalar Pentium 3, and the VLIW Itanium 2.

Our optimizations are simple but yield surprisingly large savings in performance. First, we introduce execution scaling, which increases the execution frequency of every actor in the stream graph. Intuitively, scaling improves instruction locality because it increases the amount of reuse from the instruction cache. Concomitantly, it leads to performance gains of 145% on the StrongARM, 58% on the Pentium 3, and 57% on the Itanium 2. It is worth noting that execution scaling is a departure from past optimization strategies for streaming programs, which try to minimize buffer requirements; our scaling transformation actually increases the size of the data buffers between actors.

We also showed that scaling presents a tradeoff between instruction locality and data locality. Using a simple cache model, we show how scaling impacts the instruction and data caches. The model serves to motivate a heuristic for calculating the scaling factor. The heuristic, which accounts for the sizes of the instruction and data caches, works quite well in practice.

This paper also introduces cache aware fusion. The fusion optimization helps to reduce the data requirements of a program, and can reduce the total number of memory requests by as much as 50%. The fusion of actors in the stream graph is carried out judiciously, and our algorithms are sensitive to the instruction and data working sets of the coarsened execution units, as well as to their impact on the cache hierarchy. Cache aware fusion leads to performance gains of 84% on the StrongARM, 101% on the Pentium 3, and 103% on the Itanium 2.

Cache aware fusion also enables a set of novel buffer management strategies for handling the streams of data between actors. Due to the static nature of stream programs, we can often apply scalar replacement to remove array references, so that communication between fused actors is done through registers rather than through memory. To allow scalar replacement to apply to sliding-window computations, we introduce a copy-shift buffer management policy that outperforms a traditional rotating buffer.
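The following sketch illustrates the effect of scalar replacement on a pair of fused filters. The work functions and rates are hypothetical; the point is that the one-item channel between the filters becomes a scalar that the register allocator can keep out of memory.

    /* Illustrative sketch of scalar replacement across two fused filters
       (hypothetical work functions; not the compiler's generated code). */
    #include <stdio.h>

    static float produce(int i)     { return 0.25f * (float)i; }
    static float transform(float x) { return 2.0f * x + 1.0f; }

    /* Fused, but still communicating through a memory buffer. */
    static float fused_with_buffer(int n) {
        float buf[1], sum = 0.0f;
        for (int i = 0; i < n; i++) {
            buf[0] = produce(i);          /* push one item */
            sum += transform(buf[0]);     /* pop one item  */
        }
        return sum;
    }

    /* Fused with scalar replacement: the buffer element becomes a scalar. */
    static float fused_scalar_replaced(int n) {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            float t = produce(i);         /* value can live in a register */
            sum += transform(t);
        }
        return sum;
    }

    int main(void) {
        printf("%f %f\n", fused_with_buffer(100), fused_scalar_replaced(100));
        return 0;
    }

Intuitively, a rotating buffer indexes elements relative to a runtime head pointer, which blocks this promotion; copy-shift keeps offsets fixed at compile time and thereby preserves the opportunity for sliding-window filters.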
In summary, the application of execution scaling, cache aware fusion, and our new buffer management strategies can improve the performance of StreamIt programs by 249% on the StrongARM, 154% on the Pentium 3, and 152% on the Itanium 2.

9. Acknowledgments

We thank Michael Gordon and Jasper Lin for their helpful comments as well as their contributions to the StreamIt infrastructure. We also thank Ken Steele for providing an iPAQ for our experiments. The StreamIt project is supported by DARPA grants PCA-F29601-03-2-0065 and HPCA/PERCS-W0133890; NSF awards CNS-0305453 and EIA-0071841; and the MIT Oxygen Alliance.

References

[1] A. Benveniste, P. Caspi, P. L. Guernic, and N. Halbwachs. Data-Flow Synchronous Languages. In REX School/Symposium, pages 1–45, 1993.
[2] G. Berry and G. Gonthier. The Esterel Synchronous Programming Language: Design, Semantics, Implementation. Science of Computer Programming, 19(2), 1992.
[3] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee. APGAN and RPMC: Complementary Heuristics for Translating DSP Block Diagrams into Efficient Software Implementations. Journal of Design Automation for Embedded Systems, pages 33–60, January 1997.
[4] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee. Synthesis of embedded software from synchronous dataflow specifications. Journal of VLSI Signal Processing Systems, 21(2), June 1999.
[5] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee. Software Synthesis from Dataflow Graphs. Kluwer Academic Publishers, 1996.
[6] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: Stream Computing on Graphics Hardware. In SIGGRAPH, 2004.
[7] R. Govindarajan, G. Gao, and P. Desai. Minimizing memory requirements in rate-optimal schedules. In Proceedings of the 1994 Int. Conference on Application Specific Array Processors, pages 75–86, August 1994.
[8] N. Halbwachs, P. Caspi, P. Raymond, and D. Pilaud. The synchronous data-flow programming language LUSTRE. Proc. of the IEEE, 79(9), 1991.
[9] J. Gaudiot, W. Bohm, T. DeBoni, J. Feo, and P. Miller. The Sisal Model of Functional Programming and its Implementation. In Proc. of the Second Aizu International Symposium on Parallel Algorithms/Architectures Synthesis, 1997.
[10] M. Karczmarek, W. Thies, and S. Amarasinghe. Phased scheduling of stream programs. In LCTES, 2003.
[11] S. Kohli. Cache aware scheduling of synchronous dataflow programs. Master's Report, Technical Memorandum UCB/ERL M04/03, UC Berkeley, 2004.
[12] E. A. Lee. Overview of the Ptolemy Project. Technical Memorandum UCB/ERL M03/25, UC Berkeley, 2003.
[13] E. A. Lee and D. G. Messerschmitt. Static scheduling of synchronous data flow programs for digital signal processing. IEEE Transactions on Computers, January 1987.
[14] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger. Evaluation techniques for storage hierarchies. IBM Systems Journal, 1970.
[15] P. K. Murthy and S. S. Bhattacharyya. A Buffer Merging Technique for Reducing Memory Requirements of Synchronous Dataflow Specifications. In International Symposium on System Synthesis, 1999.
[16] P. K. Murthy and S. S. Bhattacharyya. Buffer Merging: A Powerful Technique for Reducing Memory Requirements of Synchronous Dataflow Specifications. Technical report, Inst. for Adv. Computer Studies, UMD College Park, 2000.
[17] T. A. Proebsting and S. A. Watterson. Filter Fusion. In POPL, 1996.
[18] J. Sermulins. Master's thesis, MIT CSAIL, 2005. In preparation.
[19] R. Stephens. A Survey of Stream Processing. Acta Informatica, 34(7), 1997.
[20] D. Tennenhouse and V. Bose. The SpectrumWare Approach to Wireless Signal Processing. Wireless Networks, 1999.
[21] W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A Language for Streaming Applications. In Proc. of the Int. Conf. on Compiler Construction (CC), 2002.