Approximate Loop Unrolling Marcelino Rodriguez-Cancio, Benoit Combemale, Benoit Baudry

Approximate Loop Unrolling Marcelino Rodriguez-Cancio, Benoit Combemale, Benoit Baudry To cite this version: Marcelino Rodriguez-Cancio, Benoit Combemale, Benoit Baudry. Approximate Loop Unrolling. CF 2019 - ACM International Conference on Computing Frontiers, Apr 2019, Alghero, Sardinia, Italy. pp.94-105, 10.1145/3310273.3323841. hal-02407868 HAL Id: hal-02407868 https://hal.inria.fr/hal-02407868 Submitted on 12 Dec 2019 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Approximate Loop Unrolling Marcelino Rodriguez-Cancio Benoit Combemale Benoit Baudry Vanderbilt University, USA University of Toulouse / Inria, France KTH, Sweden [email protected] [email protected] [email protected] Abstract series, sound, video, and images are frequently represented We introduce Approximate Unrolling, a compiler loop opti- as arrays where contiguous slots contain similar values. As mization that reduces execution time and energy consump- a consequence of this neighboring similarity, computations tion, exploiting code regions that can endure some approxi- producing such data are usually locally smooth functions. mation and still produce acceptable results. Specifically, this Our technique exploits this observation: it searches for loops work focuses on counted loops that map a function over where functions are mapped to contiguous array slots in the elements of an array. Approximate Unrolling transforms each iteration. If the function’s computation is expensive, loops similarly to Loop Unrolling. However, unlike its ex- we substitute it by less costly interpolations of the values act counterpart, our optimization does not unroll loops by assigned to nearby array values. In exchange for losses in adding exact copies of the loop’s body. Instead, it adds code accuracy, we obtain a faster and less energy consuming loop. that interpolates the results of previous iterations. Approximate Unrolling combines static analysis and code transformations in order to improve performance in ex- CCS Concepts • Software and its engineering → Just- change for small accuracy losses. Static analysis has two in-time compilers. objectives: determine if the loop’s structure fits our trans- Keywords approximate computing; compiler optimizations formations (counted loops mapping computations to con- secutive arrays slots); estimate if the transformation could ACM Reference Format: actually provide some performance improvements (compu- Marcelino Rodriguez-Cancio, Benoit Combemale, and Benoit Baudry. tation mapped to the array elements is expensive). The code 2019. Approximate Loop Unrolling. In Proceedings of the 16th confer- is transformed using two possible strategies. One that we ence on Computing Frontiers (CF ’19), April 30-May 2, 2019, Alghero, Italy. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/ call Nearest Neighbour (NN), which approximates the value 3310273.3323841 of one array slot by copying the value of the preceding slot. The other strategy, called Linear Interpolation(LI) transforms 1 Introduction the code so the value of a slot is the average of the previous and next slot. A key novel aspect of these static transfor- Approximate computing exploits the fact that some computa- mations is that they ensure accuracy for non-approximated tions can be made less precise and still produce good results iteration of the loop, avoiding error to accumulate. This is a [25]. Previous works in the field have proven that when small unique feature that is not found on other approximate loop reductions in accuracy are acceptable, significant improve- optimization techniques. ments in execution times and energy consumption can be We have implemented a prototype of Approximate Un- achieved [40]. Opportunities for approximation exist across rolling, inside the OpenJDK C2 compiler. This design choice the whole stack and previous works have explored this con- prevents phase ordering problems and allows us to reuse cept in both hardware [6, 33, 36] and software [3, 28, 30, 34]. internal representations that we need for our analysis. An In this work, we describe Approximate Unrolling, a novel implementation in a production-ready compiler is also key compiler loop optimization that uses the ideas of approxi- to assess the relevance of our optimization on already highly mate computing to reduce execution times and energy con- optimized code, such as the one produced by the JVM. sumption of loops. The key motivation of Approximate Un- We experimented with this implementation using a four rolling relies on the following observation: data such as time real-world Java libraries. The objectives of our experiments Permission to make digital or hard copies of all or part of this work for are to determine whether Approximate Unrolling is able to personal or classroom use is granted without fee provided that copies are not provide a good performance vs. accuracy trade-off and to made or distributed for profit or commercial advantage and that copies bear compare Approximate Unrolling with Loop Perforation [34], this notice and the full citation on the first page. Copyrights for components the state of the art approximate computing technique for of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to loop optimization. redistribute to lists, requires prior specific permission and/or a fee. Request Our results show that Approximate Unrolling is able to permissions from [email protected]. reduce the execution time and CPU energy consumption of CF ’19, April 30-May 2, 2019, Alghero, Italy the x86 code generated around 50% to 110% while keeping © 2019 Association for Computing Machinery. the accuracy to acceptable levels. Compared with Loop Per- ACM ISBN 978-1-4503-6685-4/19/05...$15.00 foration, Approximate Unrolling preserved accuracy better https://doi.org/10.1145/3310273.3323841 CF ’19, April 30-May 2, 2019, Alghero, Italy Marcelino Rodriguez-Cancio, Benoit Combemale, and Benoit Baudry in 76% of the cases and raised less fatal exceptions than Loop body, and (iii) the indexing expression of the array assign- Perforation (2 vs. 7). ment is value-dependent on the loop’s induction variable. The contributions of this work are: (i) Approximate Un- The example at Listing1 is indeed a ‘for’ loop where rolling, an approximate loop optimization. (ii) An efficient (i) the induction variable ‘i’ is incremented by a constant implementation of Approximate Unrolling inside the Open- stride, i.e. ‘i++’; (ii) ‘A[i]’ is an array assignment inside JDK Hotspot VM (iii) An empirical assessment of the effec- the loop’s body and (iii) the indexing expression of ‘A[i]’ tiveness of Approximate Unrolling to trade accuracy for time is value-dependent on the induction variable ‘i’. and energy savings. (iv) An empirical demonstration of Ap- proximate Unrolling’s novel contribution to the state of the for (i=0;i<N-1;i+= 2){ art, as an effective complement to Loop Perforation. phase = phase + phaseInc; double v = Math.sin(phase) * ampl; 2 Approximate Unrolling A[i] = v; In this section we describe the main contribution of our phase = phase + phaseInc; paper. We use the example of Listing1 to illustrate the kind //v1=Math.sin(phase)*ampl;REMOVED of loops targeted by the optimization as well as the process A[i + 1] = A[i]; } for the approximate transformation. // Guard Loopg for (j = i; j < N; j ++) { for ( int i= 0; i <N;i++) { phase = phase + phaseInc; phase = phase+phaseInc; double v = Math.sin(phase) * ampl; double v = Math.sin(phase)*ampl; A[i] = v;} A[i] = v; } Listing 3. After unrolling, our optimization approximates Listing 1. Illustrative Example: A loop mapping a sine wave the unrolled iteration. The loop of Listing2, transformed into an array. using nearest neighborg interpolation. for ( int i=0;i<N;i+= 2){ //Initial iteration peeled: phase = phase + phaseInc; phase = phase + phaseInc; double v = Math.sin(phase) * ampl; double v0 = Math.sin(phase) * ampl; A[i] = v; A[0] = v0; int i = 1; //UNROLLEDITERATION: i f ( N > 2 ) phase = phase + phaseInc; for (i=2;i<N-1;i+=2){ double v1 = Math.sin(phase)*ampl; phase = phase + phaseInc; A[i + 1] = v1; } //v=Math.sin(phase)*ampl;REMOVED Listing 2. Approximate Unrolling first unroll the loop as phase=phase+phaseInc;//Exact iteration: Loop Unrolling would normally do. In this listing, the double v = Math.sin(phase)*ampl; Illustrative Example is unrolled using Loop Unrolling. A[i] = v; //Approximate iteration: A[i - 1] = (A[i]+A[i-2])*0.5f; } Approximate Unrolling replaces the unrolled iteration’s for (j=i; j<N; j++) {//Guard Loop original instructions by others that interpolate results. We phase = phase + phaseInc; perform this transformation in such a way that errors do not double v = Math.sin(phase) * ampl; accumulate as the loop runs. We currently implement two A[i] = v;} interpolation strategies: Linear Interpolation (LI) and Nearest Listing 4. The loop of Listing2 transformed by Approximate Neighbor (NN). A policy determines which strategy, if any, Unrolling using Linear Interpolation. is likely to optimize each loop. In the following paragraphs, we discuss each step. Like normal unrolling, Approximate Unrolling can unroll Step 2. Approximation Policy: In compiler optimiza- more than one loop iteration. For simplicity, we only show tion is often the case that is possible to predict whether the the transformation where 1 iteration out of 2 is unrolled/ap- optimization will have a beneficial impact on speed. This es- proximated. timate is called policy in C2’s terms. Approximate Unrolling Step 1.

Load more