http://www.diva-portal.org

Postprint

This is the accepted version of a paper presented at CGO 2017, February 4–8, Austin, TX.

Citation for the original published paper:

Tran, K-A., Carlson, T E., Koukos, K., Själander, M., Spiliopoulos, V. et al. (2017) Clairvoyance: Look-ahead compile-time scheduling. In: Proc. 15th International Symposium on Code Generation and Optimization (pp. 171-184). Piscataway, NJ: IEEE Press

N.B. When citing this work, cite the original published paper.

Permanent link to this version: http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-316480

Clairvoyance: Look-Ahead Compile-Time Scheduling

Kim-Anh Tran∗ Trevor E. Carlson∗ Konstantinos Koukos∗ Magnus Själander∗,† Vasileios Spiliopoulos∗ Stefanos Kaxiras∗ Alexandra Jimborean∗

∗Uppsala University, Sweden †Norwegian University of Science and Technology, Norway

[email protected][email protected]

Abstract

To enhance the performance of memory-bound applications, hardware designs have been developed to hide memory latency, such as the out-of-order (OoO) execution engine, at the price of increased energy consumption. Contemporary processor cores span a wide range of performance and energy efficiency options: from fast and power-hungry OoO processors to efficient, but slower in-order processors. The more memory-bound an application is, the more aggressive the OoO execution engine has to be to hide memory latency. This proposal targets the middle ground, as seen in a simple OoO core, which strikes a good balance between performance and energy efficiency and currently dominates the market for mobile, hand-held devices and high-end embedded systems.

We show that these simple, more energy-efficient OoO cores, equipped with the appropriate compile-time support, considerably boost the performance of single-threaded execution and reach new levels of performance for memory-bound applications. Clairvoyance generates code that is able to hide memory latency and better utilize the OoO engine, thus delivering higher performance at lower energy. To this end, Clairvoyance overcomes restrictions which rendered conventional compile-time techniques impractical: (i) statically unknown dependencies, (ii) insufficient independent instructions, and (iii) register pressure. Thus, Clairvoyance achieves a geomean execution time improvement of 7% for memory-bound applications with a conservative approach and 13% with a speculative but safe approach, on top of standard O3 optimizations, while maintaining compute-bound applications' high performance.

1. Introduction

Computer architects of the past have steadily improved performance at the cost of radically increased design complexity and wasteful energy consumption [1–3]. Today, power is not only a limiting factor for performance; given the prevalence of mobile devices, embedded systems, and the Internet of Things, energy efficiency becomes increasingly important for battery lifetime [4].

Highly efficient designs are needed to provide a good balance between performance and power utilization, and the answer lies in simple, limited out-of-order (OoO) execution cores like those found in the HPE Moonshot m400 [5] and the AMD A1100 Series processors [6]. Yet, the effectiveness of moderately-aggressive OoO processors is limited when executing memory-bound applications, as they are unable to match the performance of the high-end devices, which use additional hardware to hide memory latency.

This work aims to improve the performance of highly energy-efficient, limited OoO processors, with the help of advanced compilation techniques. The static code transformations are specially designed to hide the penalty of last-level cache misses and to better utilize the hardware resources.

One primary cause for slowdown is last-level cache (LLC) misses, which, with conventional compilation techniques, result in a sub-optimal utilization of the limited OoO engine that may stall the core for an extended period of time. Our method identifies potentially critical memory instructions through advanced static analysis and hoists them earlier in the program's execution, even across loop iteration boundaries, to increase memory-level parallelism (MLP). We overlap the outstanding misses with useful computation to hide their latency and thus increase instruction-level parallelism (ILP). There are a number of challenges that need to be met to accomplish this goal.

1. Finding enough independent instructions: A last-level cache miss can cost hundreds of cycles [7]. Conventional instruction schedulers operate on the basic-block level, limiting their reach and, therefore, the number of independent instructions that can be scheduled in order to hide long latencies. More sophisticated techniques (such as [8, 9]) schedule across basic-block boundaries, but instruction reordering is severely restricted in general-purpose applications when pointer aliasing and loop-carried dependencies cannot be resolved at compile-time. Solutions are needed that can cope with statically unknown dependencies in order to effectively increase the reach of the compiler while ensuring correctness.

2. Chains of dependent long latency instructions are serialized: Dependence chains of long latency instructions

978-1-5090-4931-8/17 © 2017 IEEE — CGO 2017, Austin, USA

Accepted for publication by IEEE. © 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

would normally serialize, as the evaluation of one long latency instruction is required to execute another (dependent) long latency instruction. This prevents parallel accesses to memory and may stall a limited OoO core. Novel methods are required to increase memory-level parallelism and to hide latency, which is particularly challenging in tight loops and codes with numerous (known and unknown) dependencies.

3. Increased register pressure: Separating loads and their uses in order to overlap outstanding loads with useful computation increases register pressure. This causes additional register spilling and increases the dynamic instruction count. Controlling register pressure, especially in tight loops, is crucial.

Contributions: Clairvoyance looks ahead, reschedules long latency loads, and thus improves MLP and ILP. It goes beyond static and software pipelining techniques, and optimizes general-purpose applications, which contain large numbers of indirect memory accesses, pointers, and complex control-flow. While previous compile-time techniques are inefficient or simply inapplicable to such applications, we provide solutions to well-known problems, such as:

1. Identifying potentially delinquent loads at compile-time;
2. Overcoming scheduling limitations of statically unknown memory dependencies;
3. Reordering chains of dependent memory operations;
4. Reordering across multiple branches and loop iterations, without speculation or hardware support;
5. Controlling register pressure.

Clairvoyance code runs on real hardware prevalent in mobile, hand-held devices and in high-end embedded systems and delivers high performance, thus alleviating the need for power-hungry hardware complexity. In short, Clairvoyance increases the performance of single-threaded execution by up to 43% for memory-bound applications (13% geomean improvement) on top of standard O3 optimizations, on hardware platforms which yield a good balance between performance and energy efficiency.

2. The Clairvoyance Compiler

This section outlines the general code transformation performed by Clairvoyance, while each subsection describes the additional optimizations which make Clairvoyance feasible in practice. Clairvoyance builds upon techniques such as software pipelining [9, 10], program slicing [11], and decoupled access-execute [12–14] and generates code that exhibits improved memory-level parallelism (MLP) and instruction-level parallelism (ILP). For this, Clairvoyance prioritizes the execution of critical instructions, namely loads, and identifies independent instructions that can be interleaved between loads and their uses.

Figure 1 shows the basic Clairvoyance transformation, which is used as a running example throughout the paper. The transformation is divided into two steps:

Loop Unrolling: To expose more instructions for reordering, we unroll the loop by a loop unroll factor count_unroll = 2^n with n = {0, 1, 2, 3, 4}. Higher unroll counts significantly increase code size and register pressure. In our examples, we set n = 1 for the sake of simplicity.

Access-Execute Phase Creation: Clairvoyance hoists all load instructions along with their requirements (control flow and address computation instructions) to the beginning of the loop. The group of hoisted instructions is referred to as the Access phase. The respective uses of the hoisted loads and the remaining instructions are sunk in a so-called Execute phase.

Access phases represent the program slice of the critical loads, whereas Execute phases contain the remaining instructions (and guarding conditionals). When we unroll the loop, we keep non-statically analyzable exit blocks. All exit blocks (including goto blocks) in Access are redirected to Execute, from where they will exit the loop after completing all computation. The algorithm is listed in Algorithm 1 and proceeds by unrolling the original loop and creating a copy of that loop (the Access phase, Line 3). Critical loads are identified (FindLoads, Line 4) together with their program slices (instructions required to compute the target address of the load and control instructions required to reach the load, Lines 5-9). Instructions which do not belong to the program slice of the critical loads are filtered out of Access (Line 10), and instructions hoisted to Access are removed from Execute (Line 11). The uses of the removed instructions are replaced with their corresponding clone from Access. Finally, Access and Execute are combined into one loop (Line 12).

Input: Loop L, unroll count count_unroll
Output: Clairvoyance loop L_clairvoyance
 1 begin
 2   L_unrolled ← Unroll(L, count_unroll)
 3   L_access ← Copy(L_unrolled)
 4   hoist_list ← FindLoads(L_access)
 5   to_keep ← ∅
 6   for load in hoist_list do
 7     requirements ← FindRequirements(load)
 8     to_keep ← Union(to_keep, requirements)
 9   end
10   L_access ← RemoveUnlisted(L_access, to_keep)
11   L_execute ← ReplaceListed(L_access, L_unrolled)
12   L_clairvoyance ← Combine(L_access, L_unrolled)
13   return L_clairvoyance
14 end

Algorithm 1: Basic Clairvoyance algorithm. The Access phase is built from a copy of the unrolled loop. The Execute phase is the unrolled loop itself, while all already computed values in Access are reused in Execute.

Figure 1. The basic Clairvoyance transformation. The original loop is first unrolled by count_unroll, which increases the number of instructions per loop iteration. Then, for each iteration, Clairvoyance hoists all (critical) loads and sinks their uses to create a memory-bound Access phase and a compute-bound Execute phase.

This code transformation faces the same challenges as typical software pipelining or global instruction scheduling: (i) selecting the loads of interest statically; (ii) disambiguating pointers to reason about reordering memory instructions; (iii) finding sufficient independent instructions in applications with entangled dependencies; (iv) reducing the instruction count overhead (e.g., stemming from partly duplicating control-flow instructions); and (v) overcoming register pressure caused by unrolling and separating loads from their uses. Each of these challenges and our solutions are detailed in the following subsections.

2.1 Identifying Critical Loads

Problem: Selecting the right loads to be hoisted is essential in order to avoid code bloat and register pressure and to ensure that long latency memory operations overlap with independent instructions.

Solution: We develop a metric, called indirection count, based on the number of memory accesses required to compute the memory address (indirections) [14] and the number of memory accesses required to reach the load. For example, x[y[z[i]]] has an indirection count of two, as it requires two loads to compute the address. The latter interpretation of indirection count is dependent on the control flow graph (CFG): if a load is guarded by two if-conditions that in turn require one load each, then the indirection count for the CFG dependencies is also two. Figure 2 shows an example of load selection with indirection counts. A high value of indirection indicates the difficulty of predicting and prefetching the load in hardware, signaling an increased likelihood that the load will incur a cache miss. For each value of this metric, a different code version is generated (i.e., hoisting all loads that have an indirection count less than or equal to a certain threshold). We restrict the total number of generated versions to a fixed value to control code size increase. Runtime version selection (orthogonal to this proposal) can be achieved with dedicated tools such as Protean code [15] or VMAD [16, 17].

Figure 2. Selection of loads based on an indirection count count_indir. The Clairvoyance code for count_indir = 0 (left) and count_indir = 1 (right).

2.2 Handling Unknown Dependencies

Problem: Hoisting load operations above preceding stores is correct if and only if all read-after-write (RAW) dependencies are respected. When aliasing information is not known at compile-time, detecting dependencies (or guaranteeing the lack of dependencies) is impossible, which either prevents reordering or requires speculation and/or hardware support. However, speculation typically introduces considerable overhead by squashing already executed instructions and requiring expensive recovery mechanisms.

Solution: We propose a lightweight solution for handling statically known and unknown dependencies, which ensures correctness and efficiency. Clairvoyance embraces safe speculation, which brings the benefits of going beyond conservative compilation, without sacrificing simplicity and lightness.

We propose a hybrid model to hide the latency of delinquent loads even when dependencies with preceding stores are unknown (i.e., may-alias). Thus, loads free of dependencies are hoisted to Access and the value is used in Execute, while loads that may alias with stores are prefetched in Access and safely loaded and used in their original position in Execute. May-aliases, however, are an opportunity, since in practice may-aliases rarely materialize into real aliasing at runtime [18]. Prefetching in the case of doubt is powerful: (1) if the prefetch does not alias with later stores, data will have been correctly prefetched; (2) if aliasing does occur, the prefetched data becomes overwritten and correctness is ensured by loading the data in the original program order. Figure 3 shows an example in which an unsafe load is turned into a prefetch-load pair.

Figure 3. Handling of may-aliasing loads. Loads that may alias with any preceding store operation are not safe to hoist. Instead, we prefetch the unsafe load.

The proposed solution is safe. In addition to this solution, we will analyze variations of this solution that showcase the potential of Clairvoyance when assuming a stronger alias analysis. These more speculative variations are allowed to hoist whole chains of may-aliasing loads and will be introduced during the experimental setup in Section 3.

2.3 Handling Chains of Dependent Loads

Problem: When a long latency load depends on another long latency memory operation, Clairvoyance cannot simply hoist both load operations into Access. If it did, the processor might stall, simply because the second critical load represents a use of the first long latency load. As an example, in Figure 4 we need to load the branch predicate t1 before we can load t2 (control dependency). If t1 is not cached, an access to t1 will stall the processor if the out-of-order engine cannot reach ahead far enough to find independent instructions and hide the load's latency.

Solution: We propose to build multiple Access phases, by splitting dependent load chains into chains of dependent Access phases. As a consequence, loads and their uses within an Access phase are separated as much as possible, enabling more instructions to be scheduled in between. By the time the dependent load is executed, the data of the previous load may already be available for use.

Each phase contains only independent loads, thus increasing the separation between loads and their uses. In Figure 4 we separate the loads into two Access phases. For the sake of simplicity, this example uses count_unroll = 2, hence there are only two independent loads to collect into the first Access phase and four into the second Access phase.

Figure 4. Splitting up dependent load chains. Clairvoyance creates one Access phase for each set of independent loads (and their requirements), which increases the distance between loads and their uses.

The algorithm to decide how to distribute the loads into multiple Access phases is shown in Algorithm 2. The compiler first collects all target loads in remaining_loads, while the distribution of loads per phase, phase_loads, is initialized to the empty set. As long as the loads have not yet been distributed (Line 4), a new phase is created (Line 5) and populated with loads whose control-requirements (Line 8) and data-requirements (Line 9) do not match any of the loads that have not yet been distributed in a preceding Access phase (Lines 10 and 11-14). Loads distributed in the current phase are removed from remaining_loads only at the end (Line 15), ensuring that no dependent loads are distributed to the same Access phase. The newly created set of loads, phase, is added to the list of phases (Line 16) and the algorithm continues until all critical loads have been distributed. Next, we generate each Access phase by following Algorithm 1 corresponding to a set of loads from the list phase_loads.

2.4 Overcoming Instruction Count Overhead

Problem: The control-flow graph is partially duplicated in Access and Execute phases, which, on one hand, enables instruction reordering beyond basic block boundaries but, on the other hand, introduces overhead. As an example, the branch using predicate t1 (left of Figure 5) is duplicated in each Access phase, significantly increasing the overhead in

the case of multi-Access phases. Branch duplication not only complicates branch prediction but also increases instruction overhead, thus hurting performance.

Input: Set of loads
Output: List of sets phase_loads
 1 begin
 2   remaining_loads ← loads
 3   phase_loads ← []
 4   while remaining_loads ≠ ∅ do
 5     phase ← ∅
 6     for ld in remaining_loads do
 7       reqs ← ∅
 8       FindCFGRequirements(ld, reqs)
 9       FindDataRequirements(ld, reqs)
10       is_independent ← Intersection(reqs, remaining_loads) == ∅
11       if is_independent then
12         phase ← phase + ld
13       end
14     end
15     remaining_loads ← remaining_loads \ phase
16     phase_loads ← phase_loads + phase
17   end
18   return phase_loads
19 end

Algorithm 2: Separating loads for multiple Access phases.

Solution: To overcome this limitation, Clairvoyance generates an optimized version where selected branches are clustered at the beginning of a loop. If the respective branch predicates evaluate to true, Clairvoyance can then execute a version in which their respective basic blocks are merged. The right of Figure 5 shows the transformed loop, which checks t1 and t2; if both predicates are true (i.e., both branches are taken), execution continues with the optimized version, in which the duplicated branch is eliminated. If t1 or t2 is false, then a decoupled unrolled version is executed.

Figure 5. Early evaluation of branches enables the elimination of duplicated branches in Clairvoyance mode. Relevant branches are evaluated at the beginning of the loop. If the evaluated branches are taken, the optimized Clairvoyance code with merged basic blocks is executed; otherwise, the decoupled unrolled code (with branch duplication) is executed.

The branches selected for clustering affect how often the optimized version will be executed. If we select all branches, the probability of all of them evaluating to true shrinks. Deciding the optimal combination of branches is a trade-off between branch duplication and the ratio of executing the optimized vs. the unoptimized version. As a heuristic, we only cluster branches if they statically have a probability above a given threshold. See Section 4 for more details.

2.5 Overcoming Register Pressure

Problem: Early execution of loads stretches registers' live ranges, which increases register pressure. Register pressure is problematic for two reasons: first, spilling a value represents an immediate use of the long latency load, which may stall the processor (assuming that Clairvoyance targets critical loads, whose latency cannot be easily hidden by a limited OoO engine); second, spill code increases the number of instructions and stack accesses, which hurts performance.

Solution: The Clairvoyance approach for selecting the loads to be hoisted to Access and for transforming the code naturally reduces register pressure. First, the compiler identifies potentially critical loads, which significantly reduces the number of instructions hoisted to Access phases. Second, critical loads that entail memory dependencies are prefetched instead of being hoisted, which further reduces the number of registers allocated in the Access phase. Third, multi-Access phases represent consumers of prior Access phases, releasing register pressure. Fourth, merging branches and consuming the branch predicate early releases the allocated registers. Furthermore, should register pressure become a bottleneck, one can decide to combine prefetching and instruction reordering (i.e., prefetch rather than hoist some of the critical loads), thus turning long latencies into short latencies, which can be hidden easily without increasing register pressure.

In a nutshell, each of the optimizations mentioned above, designed to overcome the typical problems of global instruction scheduling and software pipelining, contributes to reduced register pressure.

2.6 Heuristic to Disable Clairvoyance Transformations

Clairvoyance may cause performance degradation despite the efforts to reduce the overhead. This is the case for loops with long latency loads guarded by many nested if-else branches.

We define a simple heuristic to decide when the overhead of branches may outweigh the benefits, namely, if the number of targeted loads is low in comparison to the number of branches. To this end, we use a metric which accounts for the number of loads to be hoisted and the number of branches required to reach the loads: loads / branches < 0.7, and disable Clairvoyance transformations if the condition is met.

2.7 Parameter Selection: Unroll Count and Indirection

We rely on state-of-the-art runtime version selectors to select the best performing version. In addition, simple static heuristics are used to simplify the configuration selection: small loops with few loads profit from a high unroll count to increase MLP; loops containing a high number of nested branches should have a low unroll and indirection count to reduce instruction count overhead; loops with large basic blocks containing both loads and computation may profit from a hybrid model using loads and prefetches to balance register pressure and instruction count overhead.

2.8 Limitations

Currently, Clairvoyance relies on the LLVM loop unrolling, which is limited to inner-most loops. To tackle outer loops, standard techniques such as unroll-and-jam are required. Unroll-and-jam refers to partially unrolling one or more loops higher in the nest than the innermost loop, and then fusing ("jamming") the resulting loops back together.

3. Experimental Setup

Our transformation is implemented as a separate compilation pass in LLVM 3.8 [20]. We evaluate a range of C/C++ benchmarks from the SPEC CPU2006 [21] and NAS benchmark [22–24] suites on an APM X-Gene processor [25]; see Table 1 for the architectural specifications. The remaining benchmarks were not included due to difficulties in compilation with LLVM or simply because they were entirely compute-bound.

Processor     APM X-Gene - AArch64 Octa-A57
Core Count    8
ROB size      128 micro-ops [19]
Issue Width   8 [19]
L1 D-Cache    32 KB / 5-6 cycles, depending on access complexity
L2 Cache      256 KB / 13 cycles latency
L3 Cache      8 MB / 90 cycles latency
RAM           32 GB / 89 cycles + 83 ns (for a random RAM page)

Table 1. Architectural specifications of the APM X-Gene.

Benchmark        Function(s)
401.bzip2        BZ2_compressBlock
403.gcc          reg_is_remote_constant_p
429.mcf          primal_bea_mpp
433.milc         mult_su3_na
444.namd         calc_pair_energy_fullelect, calc_pair_energy,
                 calc_pair_energy_merge_fullelect, calc_pair_fullelect
445.gobmk        dfa_matchpat_loop, incremental_order_moves
450.soplex       entered
456.hmmer        P7Viterbi
458.sjeng        std_eval
462.libquantum   quantum_toffoli, quantum_sigma_x, quantum_cnot
464.h264ref      SetupFastFullPelSearch
470.lbm          LBM_performStreamCollide
471.omnetpp      shiftup
473.astar        makebound2
482.sphinx3      mgau_eval
CG               conj_grad
LU               buts
UA               diffusion

Table 2. Modified functions.

Clairvoyance targets loops in the most time-intensive functions (listed in Table 2), such that the benefits are reflected in the application's total execution time. For SPEC, the selection was made based on previous studies [26], while for NAS we identified the target functions using Valgrind [27].

In Section 2.4 we introduced an optimization to merge basic blocks if the static branch prediction indicates a probability above a certain threshold. For the following evaluation, we cluster branches only if the probability is above 90%.

3.1 Evaluating LLVM, DAE and Clairvoyance

We compare our techniques to Software Decoupled Access-Execute (DAE) [13, 14] and the LLVM standard instruction schedulers list-ilp (prioritizes ILP), list-burr (prioritizes register pressure) and list-hybrid (balances ILP and register pressure). DAE reduces the energy consumption by creating a duplicated loop that prefetches data ahead of time, while running at low frequency and maintaining its original performance. We further attempt to compare Clairvoyance against software pipelining and evaluate a target-independent, readily available software pipelining pass [28]. The pass fails to pipeline the targeted loops (all except one fail) due to the high complexity (control-flow and memory dependencies). LLVM's software pipeliner is not readily applicable for the target architecture, and could thus not be evaluated in this work. In the following, we will evaluate three techniques:

LLVM-SCHED    LLVM's best-performing scheduling technique (one of list-ilp, list-burr, and list-hybrid).
DAE           Best performing DAE version.
CLAIRVOYANCE  Best performing Clairvoyance version.

3.2 A Study on Speculation Levels

For Clairvoyance we evaluate a number of versions that vary in their speculative nature. Consv is a conservative version

which only hoists safe loads. In case of a chain of dependent loads, it turns the first unsafe load into a prefetch and does not target the remaining loads. Spec-safe is a speculative but safe version. It hoists safe loads, but unlike the consv version, in case of a chain of dependent loads, spec-safe duplicates unsafe loads in Access such that it is able to reach the entire chain of dependent loads. Then it turns the last unsafe load of each chain into a prefetch, and reloads the unsafe loads in Execute. Spec is a speculative but unsafe version which hoists all safe and unsafe loads and reuses them in Execute. Finally, multi-spec-safe and multi-spec represent the multiple-access versions of the previous two.

Name             Description
Consv            Conservative, only hoists safe loads
Spec-safe        Speculative (but safe), hoists may-aliasing load
                 chains, but safely reloads them in Execute
Spec             Speculative (unsafe), hoists may-aliasing load
                 chains and reuses all data in Execute
Multi-spec-safe  Multi-access version of spec-safe
Multi-spec       Multi-access version of spec

Table 3. Clairvoyance evaluated versions.

The exploration of different speculation levels is a study to give an overview of Clairvoyance's performance assuming increasingly accurate alias analysis. The conservative consv version shows what we can safely transform at the moment, while spec indicates a perfect alias analyzer. We expect that state-of-the-art pointer analyses [29] approach the accuracy of spec. Spec-safe demonstrates the effect of combining both prefetches and loads. A better pointer analysis would enable Clairvoyance to safely load more values, and consequently we would have to cope with increased register pressure. To this end, spec-safe is a version that balances between loads and prefetches, and thus between register spills and increased instruction count overhead.

The speculative but safe versions (spec-safe, multi-spec-safe) may cause a segmentation fault in Access when speculatively accessing memory locations to compute the target address of the prefetch. Since our transformation ensures that only safely loaded values are reused in Execute, segmentation faults that are triggered during an Access can be safely caught and ignored, for example by overwriting the segmentation fault handler. During code generation we can differentiate between speculative loads (loads hoisted above may-aliasing stores) and non-speculative loads (no-aliasing loads). If the address of a speculative load matches the address of the segmentation fault, the fault handler ignores it; otherwise, the fault is exposed to the user. In practice, however, none of the analyzed benchmarks caused such a fault.

Spec and multi-spec are intended as an oracle with perfect alias analysis. We do not implement a correction mechanism, as it would require alternative execution paths and is beyond the goal of this proposal. The results for spec, despite their speculative nature, are verified at runtime as being correct.

4. Evaluation

In this section, we first compare different versions of Clairvoyance, starting with the conservative approach and gradually increasing the speculation level. Next we discuss the performance and energy improvements of the applications compiled with Clairvoyance.

4.1 Comparing Clairvoyance's Speculation Levels

Figure 6 compares the normalized runtimes of all Clairvoyance versions across all benchmarks. For the majority of workloads, the different degrees of speculation do not play a major role for the final performance. For hmmer and libquantum we observe a significant difference between the more conservative versions (consv, spec-safe, multi-spec-safe) and the speculative ones (spec, multi-spec). Hmmer is a compute-bound benchmark whose workload fits in the cache; therefore, there is little expected improvement. Furthermore, the target loop consists of one main basic block that contains a large number of loads interleaved with store instructions. The prefetches added by the conservative versions are not separated enough from their actual uses and thus translate to pure instruction count overhead, especially for a compute-bound application. Since the speculative versions only reorder the instructions, there is no additional overhead. This also applies to libquantum: libquantum consists of very small and tight loops, such that any added instruction count overhead quickly outweighs the benefits of Clairvoyance.

On the other hand, there are workloads that benefit from a less aggressive hoisting of loads, such as lbm, which shows best results with spec-safe and multi-spec-safe. Separating a high number of potentially delinquent loads from their uses can increase register pressure significantly. Since spec-safe and its multiple-access version multi-spec-safe use a combination of reordering loads and prefetches, these versions introduce less register pressure compared to spec.

4.2 Understanding Clairvoyance Best Versions

Benchmark        Version          Unroll  Indir
429.mcf          consv            8       0
433.milc         multi-spec-safe  2       0
450.soplex       spec             2       0
462.libquantum   spec             4       1
470.lbm          multi-spec-safe  16      1
471.omnetpp      Disabled
473.astar        Disabled
CG               spec             4       1

Table 4. Best performing versions for memory-bound benchmarks [30].

We categorize the benchmarks into memory-bound applications (mcf, milc, soplex, libquantum, lbm, omnetpp,



astar, CG) and compute-bound applications (bzip2, gcc, namd, gobmk, hmmer, sjeng, h264ref, LU, UA) [30]. Table 4 lists the best performing Clairvoyance version for each memory-bound benchmark. Typically, the best performing versions rely on a high unroll count and a low indirection count. The branch-merging optimization that allows for a higher unroll count is particularly successful for mcf, as the branch operations connecting the unrolled iterations are merged, showing low overhead across loop iterations. As the memory-bound applications contain a high number of long-latency loads which can be hoisted to the Access phase, we are able to improve MLP while hiding the increased instruction count overhead. Clairvoyance was disabled for omnetpp and astar by the heuristic that prevents generating heavy-weight Access phases that may hurt performance.

Figure 6. Normalized total runtime w.r.t. original execution (-O3), for all Clairvoyance versions.

For compute-bound benchmarks the best performing versions have a low unroll count and a low indirection count, yielding versions that are very similar to the original. This is expected, as Clairvoyance cannot help if the entire workload fits in the cache. However, if applied to compute-bound benchmarks, Clairvoyance will reorder instructions, hiding even L1 cache latency.

4.3 Runtime and Energy

Figure 7 compares the normalized runtimes when applying Clairvoyance and state-of-the-art techniques designed to hide memory latency: DAE and the optimal LLVM instruction scheduler selected for each particular benchmark. Clairvoyance-consv shows the performance achieved with the most conservative version, while Clairvoyance-best shows the performance achieved by the best Clairvoyance version (which may be consv or any of the speculative versions spec-safe, multi-spec-safe, spec, multi-spec). The baseline represents the original code compiled with -O3 using the default LLVM instruction scheduler. Measurements were performed by executing the benchmarks until completion. We attempted a comparison with available software pipeliners [20, 28], which are either not available for our target machine or cannot transform our target loops.

For memory-bound applications we observe a geomean improvement of 7% with Clairvoyance-consv and 13% with Clairvoyance-best, outperforming both DAE and the LLVM instruction schedulers. The best performing applications are mcf (both Clairvoyance versions) and lbm (with Clairvoyance-best), which show considerable improvements in the total benchmark runtime (43% and 31%, respectively). These are workloads with few branches and very "condensed" long-latency loads (few loads responsible for most of the LLC misses).

DAE is competitive to Clairvoyance, but fails to achieve the same performance for mcf. An analysis of the generated code suggests that DAE fails to identify the correct set of delinquent loads. Benchmarks with small and tight loops such as libquantum suffer from the additional instruction count overhead introduced by DAE, which duplicates target loops in order to prefetch data in advance. A slight overhead is observed with Clairvoyance-consv for tight loops, due to partial instruction duplication, but this limitation would be alleviated by a more precise pointer analysis, as indicated by Clairvoyance-best.

We further observe that astar suffers from performance losses when applying DAE. Astar has multiple nested if-then-else branches, which are duplicated in Access and thus hurt performance. In contrast, our simple heuristic disables the Clairvoyance optimization for loops with a high number of nested branches, and therefore avoids degrading performance. For the compute-bound applications, both the conservative and best versions of Clairvoyance preserve the O3 performance, on par with the standard LLVM instruction schedulers, except for hmmer, where Clairvoyance-consv introduces an overhead. Again, a precise pointer analysis could alleviate this overhead and allow Clairvoyance to hide L1 latency, as in the case of h264ref.

Figure 8 shows per-loop runtimes, normalized to the original loop execution. Highly memory-bound benchmarks show significant speed-ups: mcf 68%, milc 20%, and lbm 31%. Clairvoyance-consv introduces a small overhead for libquantum, which is corrected by Clairvoyance-best (assuming a more precise pointer analysis). As mentioned previously, Clairvoyance was disabled for omnetpp and astar.



Figure 7. Normalized total runtime w.r.t. original execution (-O3) for the best version of DAE, the LLVM schedulers, and the conservative and the best version of Clairvoyance, categorized into memory-bound (left) and compute-bound (right) benchmarks.

Figure 8. Normalized runtime per target loop w.r.t. original loop execution (-O3) for the conservative and the best version of Clairvoyance, for memory-bound benchmarks.

Overall, Clairvoyance-consv improves per-loop runtime by 15%, approaching the performance of Clairvoyance-best (20%).

We collect power numbers using measurement techniques similar to Spiliopoulos et al. [31]. Figure 9 shows the normalized energy consumption for all memory-bound benchmarks. The results align with the corresponding runtime trends: benchmarks such as mcf and lbm profit the most, with an energy reduction of up to 25%. For memory-bound benchmarks, we achieve a geomean improvement of 5%. By overlapping outstanding loads we increase MLP, which in turn results in shorter runtimes and thus lower total energy consumption.

Figure 9. Normalized energy across all memory-bound benchmarks for the conservative and the best version of Clairvoyance.

5. Related Work

Hiding long latencies of memory accesses to deliver high performance has been a monumental task for compilers. Early approaches relied on compile-time instruction schedulers [32–36] to increase instruction-level parallelism (ILP) and hide memory latency by performing local or global scheduling. Local scheduling operates within basic-block boundaries and is the most commonly adopted algorithm in mainstream compilers. Global scheduling moves instructions across basic blocks and can operate on cyclic or acyclic control-flow graphs. One of the most advanced forms of static instruction scheduling is modulo scheduling [8, 9], also known as software pipelining, which interleaves different iterations of a loop.

Clairvoyance overcomes challenges that led static instruction schedulers to generate suboptimal code: (1) Clairvoyance identifies potential long-latency loads to compensate for the lack of dynamic information; (2) Clairvoyance combines prefetching with safe reordering of accesses to address the problem of statically unknown memory dependencies; (3) Clairvoyance performs advanced code transformations of the control-flow graph, making Clairvoyance applicable to general-purpose applications, which were until now not amenable to software pipelining. We emphasize that off-the-shelf software pipelining is tailored for independent loop iterations and is readily applicable to statically analyzable code, but it cannot handle complex control flow, statically unknown dependencies, etc. Furthermore, among the main limitations of software pipelining are the prologues and epilogues, and high register pressure, typically addressed with hardware support.

Clairvoyance advances the state-of-the-art by demonstrating the efficiency of these code transformations on codes that abound in indirect memory accesses, pointers, entangled dependencies, and complex, data-dependent control flow.

Typically, instruction scheduling and register allocation are two opposing forces [37–39]. Previous work attempts to

provide register-pressure-sensitive instruction scheduling, in order to balance instruction-level parallelism, latency, and spilling. Chen et al. [40] propose code reorganization to maximize ILP with a limited number of registers, by first applying a greedy superblock scheduler and then pushing over-hoisted instructions back. Yet, such instruction schedulers consider simple code transformations and compromise on other optimizations to reduce register pressure. Clairvoyance naturally relieves register pressure by precisely increasing the live span of certain loads only, by combining instruction reordering with prefetching, and by merging branches.

Hardware architectures such as Very Long Instruction Word (VLIW) and EPIC [41, 42] identify independent instructions suitable for reordering, but require significant hardware support such as predicated execution, speculative loads, verification of speculation, delayed exception handling, memory disambiguation, etc. In contrast, Clairvoyance is readily applicable on contemporary, commodity hardware. Clairvoyance decouples the loop, rather than simply reordering instructions; it generates optimized code that can reach delinquent loads without speculation or hardware support for predicated execution, and handles memory and control dependencies purely in software. Clairvoyance provides solutions that can re-enable decades of research on compiler techniques for VLIW-like and EPIC-like architectures.

Software prefetching [43] instructions, when executed timely, may transform long latencies into short latencies. Clairvoyance attempts to fully hide memory latency with independent instructions (ILP), and to cluster memory operations together and increase MLP by decoupling the loop. Software Decoupled Access-Execute (DAE) [13, 14] targets reducing energy expenditure using DVFS while maintaining performance, whereas Clairvoyance focuses on increasing performance. DAE generates Access-Execute phases that merely prefetch data and duplicate a significant part of the original loop (control instructions and address computation). Clairvoyance's contribution consists in finding the right balance between code duplication and instruction reordering, to achieve high degrees of ILP and MLP without the added register pressure. DAE uses heuristics to identify the loads to be prefetched, which take memory dependencies into consideration. In addition, Clairvoyance combines information about memory and control dependencies, which increases the accuracy and effectiveness of the long-latency load identification.

Helper threads [44–46] attempt to hide memory latency by warming up the cache using a prefetching thread. Clairvoyance uses a single thread of execution, reuses values already loaded in registers (between Access and Execute phases), and resorts to prefetching only as a mechanism to safely handle unknown loop-carried dependencies.

Software-hardware co-designs such as control-flow decoupling (CFD) [47] prioritize the evaluation of data-dependent branch conditions, and support a similar strategy for splitting load-use chains as our multi-access phases (however, their multi-level decoupling is done manually [48]). Contrary to Clairvoyance, CFD requires hardware support to ensure low-overhead communication between the decoupled phases. A software-only version, Data-flow Decoupling (DFD), relies on prefetch instructions and ensures communication between phases by means of caches, akin to DAE [13, 14], using code duplication. As the CFD solution is not entirely automatic and requires manual intervention, Clairvoyance provides the missing compiler support and is readily applicable to decouple the CFG and hoist branch evaluation, in lieu of long-latency loads. Moreover, Clairvoyance provides software solutions to replace the hardware support for efficient communication between the decoupled phases. CFD makes use of decoupled producer phases for branches, similar to Clairvoyance's multi-access phases, but low-overhead communication is achieved with hardware support.

6. Conclusion

Improving the performance, and therefore the energy efficiency, of today's power-limited, modern processors is extremely important given the end of Dennard scaling. While aggressive out-of-order designs achieve high performance, they do so at a high cost.

Instead of improving performance with aggressive out-of-order processors, limited, efficient out-of-order processors can be used. Unfortunately, the reach of these efficient processors – as measured by the number of dynamic instructions that they can track before becoming stalled – tends to be much less than in aggressive cores. This limits the performance of the more efficient out-of-order processors for memory-intensive applications with high-latency, distant independent instructions.

In this work, we propose a new technique to improve a processor's performance by increasing both memory- and instruction-level parallelism, and therefore the amount of useful work that is done by the core. The Clairvoyance compiler techniques introduced by this work overcome limitations imposed by may-alias loads, reordering dependent memory operations across loop iterations, and controlling register pressure. Using these techniques, we achieve performance improvements of up to 43% (a 7% geomean improvement for memory-bound applications with a conservative approach and 13% with a speculative but safe approach) on real hardware. The use of Clairvoyance enables optimizations that move beyond standard instruction reordering to achieve energy efficiency and overall higher performance in the presence of long-latency loads.

Acknowledgments

We would like to thank Andreas Scherman for his contribution to the branch clustering strategy. This work is supported, in part, by the Swedish Research Council UPMARC Linnaeus Centre and by the Swedish VR (grant no. 2010-4741).

References

[1] J.-L. Baer and T.-F. Chen, "Effective hardware-based data prefetching for high-performance processors," Transactions on Computers, vol. 44, no. 5, pp. 609–623, 1995.
[2] A. Seznec, S. Jourdan, P. Sainrat, and P. Michaud, "Multiple-block ahead branch predictors," Operating Systems Review, vol. 30, no. 5, pp. 116–127, 1996.
[3] N. Prémillieu and A. Seznec, "Efficient out-of-order execution of guarded ISAs," Transactions on Architecture and Code Optimization, vol. 11, no. 4, pp. 41:1–41:21, 2014.
[4] M. Själander, M. Martonosi, and S. Kaxiras, Power-Efficient Computer Architectures: Recent Advances. Synthesis Lectures on Computer Architecture, 2014.
[5] Hewlett Packard Enterprise, "HPE ProLiant m400 server cartridge." Online: http://www8.hp.com/us/en/products/proliant-servers/product-detail.html?oid=7398907; accessed 2016.
[6] AMD, "AMD Opteron A-series processors." Online: http://www.amd.com/en-us/products/server/opteron-a-series; accessed 2016.
[7] Y. Chou, B. Fahs, and S. Abraham, "Microarchitecture optimizations for exploiting memory-level parallelism," in Proceedings of the Annual International Symposium on Computer Architecture, pp. 76–87, 2004.
[8] A. Aiken, A. Nicolau, and S. Novack, "Resource-constrained software pipelining," Transactions on Parallel and Distributed Systems, pp. 274–290, 1995.
[9] M. Lam, "Software pipelining: An effective scheduling technique for VLIW machines," in Proceedings of the Conference on Programming Language Design and Implementation, pp. 318–328, 1988.
[10] J. Huang, A. Raman, T. B. Jablin, Y. Zhang, T.-H. Hung, and D. I. August, "Decoupled software pipelining creates parallelization opportunities," in Proceedings of the International Symposium on Code Generation and Optimization, pp. 121–130, 2010.
[11] M. Weiser, "Program slicing," in Proceedings of the International Conference on Software Engineering, pp. 439–449, 1981.
[12] K. Koukos, D. Black-Schaffer, V. Spiliopoulos, and S. Kaxiras, "Towards more efficient execution: A decoupled access-execute approach," in Proceedings of the International Conference on Supercomputing, pp. 253–262, 2013.
[13] A. Jimborean, K. Koukos, V. Spiliopoulos, D. Black-Schaffer, and S. Kaxiras, "Fix the code. Don't tweak the hardware: A new compiler approach to voltage-frequency scaling," in Proceedings of the International Symposium on Code Generation and Optimization, pp. 262–272, 2014.
[14] K. Koukos, P. Ekemark, G. Zacharopoulos, V. Spiliopoulos, S. Kaxiras, and A. Jimborean, "Multiversioned decoupled access-execute: The key to energy-efficient compilation of general-purpose programs," in Proceedings of the International Conference on Compiler Construction, pp. 121–131, 2016.
[15] M. A. Laurenzano, Y. Zhang, L. Tang, and J. Mars, "Protean code: Achieving near-free online code transformations for warehouse scale computers," in Proceedings of the Annual International Symposium on Microarchitecture, pp. 558–570, 2014.
[16] A. Jimborean, M. Herrmann, V. Loechner, and P. Clauss, "VMAD: A virtual machine for advanced dynamic analysis of programs," in Proceedings of the International Symposium on Performance Analysis of Systems and Software, pp. 125–126, 2011.
[17] A. Jimborean, L. Mastrangelo, V. Loechner, and P. Clauss, "VMAD: An advanced dynamic program analysis and instrumentation framework," in Proceedings of the International Conference on Compiler Construction, pp. 220–239, 2012.
[18] B. Hackett and A. Aiken, "How is aliasing used in systems software?," in Proceedings of the Symposium on the Foundations of Software Engineering, pp. 69–80, 2006.
[19] AnandTech, "ARM A53/A57/T760 investigated – Samsung Galaxy Note 4 Exynos review." Online: http://www.anandtech.com/show/8718/the-samsung-galaxy-note-4-exynos-review/5; accessed 2016.
[20] C. Lattner and V. Adve, "LLVM: A compilation framework for lifelong program analysis and transformation," in Proceedings of the International Symposium on Code Generation and Optimization, pp. 75–88, 2004.
[21] J. L. Henning, "SPEC CPU2006 benchmark descriptions," Computer Architecture News, vol. 34, no. 4, pp. 1–17, 2006.
[22] NASA, "NAS parallel benchmarks." Online: https://www.nas.nasa.gov/assets/pdf/techreports/1999/nas-99-011.pdf; accessed 07-September-2016, 1999.
[23] S. Seo, G. Jo, and J. Lee, "Performance characterization of the NAS parallel benchmarks in OpenCL," in Proceedings of the International Symposium on Workload Characterization, pp. 137–148, 2011.
[24] S. Seo, J. Kim, G. Jo, J. Lee, J. Nah, and J. Lee, "SNU NPB suite." Online: http://aces.snu.ac.kr/software/snu-npb/; accessed 07-September-2016, 2013.
[25] "APM X-Gene1 specification." Online: http://www.7-cpu.com/cpu/X-Gene.html; accessed 07-September-2016, 2016.
[26] "SPEC CPU2006 function profile." Online: http://hpc.cs.tsinghua.edu.cn/research/cluster/SPEC2006Characterization/fprof.html; accessed 2016.
[27] N. Nethercote and J. Seward, "Valgrind: A framework for heavyweight dynamic binary instrumentation," in Proceedings of the Conference on Programming Language Design and Implementation, pp. 89–100, 2007.
[28] R. Jordans and H. Corporaal, "High-level software-pipelining in LLVM," in Proceedings of the International Workshop on Software and Compilers for Embedded Systems, pp. 97–100, 2015.
[29] Y. Sui, P. Di, and J. Xue, "Sparse flow-sensitive pointer analysis for multithreaded programs," in Proceedings of the International Symposium on Code Generation and Optimization, pp. 160–170, 2016.
[30] A. Jaleel, "Memory characterization of workloads using instrumentation-driven simulation." Online: http://www.jaleels.org/ajaleel/publications/SPECanalysis.pdf; accessed 07-September-2016, 2010.
[31] V. Spiliopoulos, A. Sembrant, and S. Kaxiras, "Power-Sleuth: A tool for investigating your program's power behavior," in Proceedings of the International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, pp. 241–250, 2012.
[32] P. G. Lowney, S. M. Freudenberger, T. J. Karzes, W. D. Lichtenstein, R. P. Nix, J. S. O'Donnell, and J. C. Ruttenberg, "The Multiflow trace scheduling compiler," The Journal of Supercomputing, vol. 7, pp. 51–142, 1993.
[33] W. Havanki, S. Banerjia, and T. Conte, "Treegion scheduling for wide issue processors," in Proceedings of the International Symposium on High-Performance Computer Architecture, pp. 266–276, 1998.
[34] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann, "Effective compiler support for predicated execution using the hyperblock," in Proceedings of the Annual International Symposium on Microarchitecture, pp. 45–54, 1992.
[35] W. W. Hwu, S. A. Mahlke, W. Y. Chen, P. P. Chang, N. J. Warter, R. A. Bringmann, R. G. Ouellette, R. E. Hank, T. Kiyohara, G. E. Haab, J. G. Holm, and D. M. Lavery, "The superblock: An effective technique for VLIW and superscalar compilation," The Journal of Supercomputing, vol. 7, pp. 229–248, 1993.
[36] C. Young and M. D. Smith, "Better global scheduling using path profiles," in Proceedings of the Annual International Symposium on Microarchitecture, pp. 115–123, 1998.
[37] J. R. Goodman and W.-C. Hsu, "Code scheduling and register allocation in large basic blocks," in Proceedings of the International Conference on Supercomputing, pp. 442–452, 1988.
[38] D. G. Bradlee, S. J. Eggers, and R. R. Henry, "Integrating register allocation and instruction scheduling for RISCs," in Proceedings of Architectural Support for Programming Languages and Operating Systems, pp. 122–131, 1991.
[39] P. P. Chang, D. M. Lavery, S. A. Mahlke, W. Y. Chen, and W. W. Hwu, "The importance of prepass code scheduling for superscalar and superpipelined processors," Transactions on Computers, vol. 44, pp. 353–370, 1995.
[40] G. Chen, Effective Instruction Scheduling with Limited Registers. PhD thesis, Harvard University, Cambridge, MA, USA, 2001.
[41] J. L. Hennessy and D. A. Patterson, Computer Architecture, Fifth Edition: A Quantitative Approach, Appendix H: Hardware and Software for VLIW and EPIC. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 5th ed., 2011.
[42] J. Kim, R. M. Rabbah, K. V. Palem, and W.-F. Wong, "Adaptive compiler directed prefetching for EPIC processors," in Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, pp. 495–501, 2004.
[43] M. Khan, M. A. Laurenzano, J. Mars, E. Hagersten, and D. Black-Schaffer, "AREP: Adaptive resource efficient prefetching for maximizing multicore performance," in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pp. 367–378, 2015.
[44] M. Kamruzzaman, S. Swanson, and D. M. Tullsen, "Inter-core prefetching for multicore processors using migrating helper threads," Computer Architecture News, pp. 393–404, 2011.
[45] C. G. Quiñones, C. Madriles, J. Sánchez, P. Marcuello, A. González, and D. M. Tullsen, "Mitosis compiler: An infrastructure for speculative threading based on pre-computation slices," in Proceedings of the Conference on Programming Language Design and Implementation, pp. 269–279, 2005.
[46] W. Zhang, D. M. Tullsen, and B. Calder, "Accelerating and adapting precomputation threads for efficient prefetching," in Proceedings of the International Symposium on High-Performance Computer Architecture, pp. 85–95, 2007.
[47] R. Sheikh, J. Tuck, and E. Rotenberg, "Control-flow decoupling: An approach for timely, non-speculative branching," Transactions on Computers, vol. 64, no. 8, pp. 2182–2203, 2015.
[48] R. Sheikh, Control-Flow Decoupling: An Approach for Timely, Non-speculative Branching. PhD thesis, North Carolina State University, Department of Electrical and Computer Engineering, North Carolina, USA, 2013.

A. Artifact Description

A.1 Abstract

This artifact contains the source code to LLVM 3.8, the Clairvoyance passes, and the scripts to compile and run the benchmarks presented in our paper. To get started, we provide a Vagrantfile that will set up and install the required software. We conducted our experiments on an APM X-Gene Octa-A57.

A.2 Description

A.2.1 Check-List (Artifact Meta Information)
• Program: C/C++ code
• Compilation: make
• Transformations: access-execute phase generation, lowering uses, branch clustering, multiple access phases
• Binary: O3, LLVM-SCHED, Clairvoyance
• Data set: NAS: C data set, SPEC CPU 2006: ref
• Run-time environment: Vagrant, VirtualBox
• Hardware: APM X-Gene AArch64 Octa-A57
• Execution: Experiments use perf
• Output: perf output
• Experiment workflow: build project, build binaries, run binaries, compare runtime
• Publicly available?: Yes

A.2.2 How Delivered

Clone: https://github.com/ktran/clairvoyance. The repository structure is shown in Table 5 and contains:
• a vagrant configuration file (Vagrantfile)
• a setup script (setup.sh) that will be used by vagrant
• the compiler (clairvoyance/compiler)
• the experiments directory (clairvoyance/experiments) for building and running binaries from our paper

Table 5. The artifact folder structure.
Name          Directory Path                           Description
clairvoyance  clairvoyance                             root directory
experiments   clairvoyance/experiments                 files to build and run all benchmarks
sources       clairvoyance/experiments/swoop/sources   benchmark sources and scripts to build
runs          clairvoyance/experiments/swoop/runs      scripts to run and parse outputs

A.2.3 Hardware Dependencies

We evaluated on an APM X-Gene Octa-A57. Clairvoyance is designed to improve the performance of limited out-of-order cores; therefore, we recommend similar platforms for the evaluation. For building LLVM, provide a minimum of 4GB of RAM to the virtual machine (the default setup).

A.2.4 Software Dependencies

Vagrant (tested: 1.8.5) and VirtualBox (tested: 5.1.8).

A.2.5 Datasets

NAS benchmarks: "C" data set size; SPEC CPU 2006: ref.

A.3 Installation

Currently, vagrant is set up such that it will use 4 GB of RAM and 2 CPUs. If you have more resources, you can change the Vagrantfile by modifying these lines:

vb.memory = 4096
vb.cpus = 2

Table 6 contains the most important vagrant commands. In this step we will use vagrant in order to install the operating system and the required packages for the project. In the following you might need sudo rights whenever using vagrant:

$ cd clairvoyance
$ (sudo) vagrant up

Table 6. Vagrant commands.
vagrant up       starting up the virtual machine
vagrant halt     shutting down the virtual machine
vagrant ssh      ssh to your virtual machine
vagrant destroy  destroys the virtual machine

Warning! If vagrant up does not work, see Section A.7. Never run vagrant init, as this will overwrite the Vagrantfile. Next, ssh into the machine and install the project. Running make will compile the project (LLVM, Clairvoyance):

$ (sudo) vagrant ssh
$ cd /var/www/clairvoyance/compiler
$ make -j 2

A.4 Experiment Workflow

For benchmark configuration changes see Section A.6. This section only explains how to build and run the pre-configured benchmarks.

A.4.1 Building the Binaries

Build the pre-configured binaries by running make:

$ cd /var/www/clairvoyance/experiments/swoop/sources
$ make -j 2

A.4.2 Running the Binaries

The run_experiments script in experiments/swoop/runs can be used to run the binaries of the provided benchmarks. Table 7 shows the options and their semantics. Here, we run each configured binary once with the reference input:

$ cd /var/www/clairvoyance/experiments/swoop/runs
$ ./run_experiments.py -i ref -r 1

Table 7. Options for the run_experiments script.
-i [ref, train, test]  Select the input (all the same for NAS)
-r $r                  Number of times the benchmark should be run (default 1)
-n                     Dry-run

This script will start the experiments, collect statistics, take the average (if repeat > 1), and write everything to the test directory. Each binary has a dedicated directory. The statistics are in stderr.txt, the output is written to stdout.txt, and if the output differs from the reference output, the directory will also contain an error file. We have configured the scripts such that they run the original (O3) and up to two Clairvoyance versions. CG takes approximately 15 minutes, LU 50, and UA 25 minutes per version (on our machine). This script may thus run for 270 minutes.

A.5 Evaluation and Expected Results

For CG, UA, and LU, alias analysis can disambiguate the memory accesses in the three NAS benchmarks; therefore no speculation is required (the versions are similar or equivalent).

A.5.1 Parsing the Output

Parse the output that was gathered in the test directory:

$ cd /var/www/clairvoyance/experiments/swoop/runs
$ ./parse_experiments.py -d test

This will read the stderr.txt files and generate a csv file test/results.csv which summarizes the gathered statistics.

A.5.2 Plotting the Results

Run the plot_experiments.py script (in the runs directory). Pass the test/results.csv file and choose one of three options: --print-best (print the best speculative versions), --plot-version-comparison (plot the best speculative versions), and --plot-best (compare the LLVM schedulers, DAE (not described here), and Clairvoyance):

$ ./plot_experiments.py \
    -i test/results.csv --plot-best \
    --inputset ref

A.5.3 Expected Results

We expect that Clairvoyance will reduce the total execution time of CG (memory-bound), but that the runtimes of UA and LU (compute-bound or medium memory-bound) are less affected. As the memory- vs compute-bound characteristics are target dependent, the results may vary on different hardware.

A.6 Experiment Customization

A.6.1 Compiling Existing Benchmarks

In order to compile another version of CG, UA, or LU, follow step 3 in Section A.6.2. For changing the running scripts, look at the files jobs.py (which benchmarks to run) and benchmarks.py (which Clairvoyance parameters to use) in experiments/swoop/runs/.

A.6.2 Compiling Different Benchmarks

1. Add the Source Files. Copy the sources/CG directory and rename it to your benchmark. Then, (i) replace the source files in the new directory with your own source files, and (ii) adapt the Makefile in that directory: change the flags and source files, and set the benchmark name (it should be different from any source file name).

2. Mark the Loops. Mark the loops that should be transformed using Clairvoyance with a pragma:

#pragma clang loop vectorize_width(1337)
for (int i = 0; i < N; ++i) {
    // some code
}

Clairvoyance only transforms innermost loops. It is best to select loops that have hard-to-predict data accesses.

3. Determine the Versions to Build. The Makefile.target file in experiments/swoop/source/common/SWOOP/ determines which versions are built. In order to change the unroll count and indirection count, modify the UNROLL_COUNT and INDIR_COUNT variables.

4. Add Benchmark to Makefile. Add your benchmark name to the BENCHMARKS variable in sources/Makefile.

A.6.3 Running and Evaluating New Benchmarks

All scripts are already set up for SPEC CPU 2006. For new benchmarks it might be easier to run perf directly:

$ perf stat -r repeat -e \
    cycles,instructions benchmark

A.7 Notes

Box Could Not Be Found. On OSX, you might need to run:

$ sudo rm /opt/vagrant/embedded/bin/curl

SWOOP. Clairvoyance generates binaries that are called SWOOP (SW-OoO execution). All numbers are normalized to O3 (the plots require O3 runs).

DAE. If you wish to evaluate DAE, please go to https://github.com/etascale/daedal.

Artifact Evaluation. For submission and reviewing guidelines, see http://ctuning.org/ae/artifacts.html.
