Manual Micro-Optimizations in C++ an Investigation of Four Micro-Optimizations and Their Usefulness

Technology and Society Computer Science and Media Technology Bachelor thesis 15 credits, first cycle Manual micro-optimizations in C++ An investigation of four micro-optimizations and their usefulness Manuella mikrooptimeringar i C++ En undersökning av fyra mikrooptimeringar och deras nytta Viktor Ekström Degree: Bachelor 180hp Supervisor: Olle Lindeberg Main field: Computer Science Examiner: Farid Naisan Programme: Application Developer Final seminar: 4/6 2019 Abstract Optimization is essential for utilizing the full potential of the computer. There are several different approaches to optimization, including so-called micro-optimizations. Micro-optimizations are local adjustments that do not change an algorithm. This study investigates four micro-optimizations: loop interchange, loop unrolling, cache loop end value, and iterator incrementation, to see when they provide performance benefit in C++. This is investigated through an experiment, where the running time of test cases with and without micro-optimization is measured and then compared between cases. Measurements are made on two compilers. Results show several circumstances where micro-optimizations provide benefit. However, value can vary greatly depending on the compiler even when using the same code. A micro-optimization providing benefit with one compiler may be detrimental to performance with another compiler. This shows that understanding the compiler, and measuring performance, remains important. 1 2 Sammanfattning Optimering är oumbärligt för att utnyttja datorns fulla potential. Det finns flera olika infallsvinklar till optimering, däribland så kallade mikrooptimeringar. Mikrooptimeringar är lokala ändringar som inte ändrar på någon algoritm. Denna studie undersöker fyra mikrooptimeringar: loop interchange, loop unrolling, cache loop end value, och iterator-inkrementering, för att se när de bidrar med prestandaförbättring i C++. Detta undersöks genom experiment, där körtiden för testfall med och utan mikrooptimeringar mäts och sedan jämförs. Mätningar görs på två kompilatorer. Resultatet visar flera situationer där mikrooptimeringar bidrar med prestandaförbättringar. Värdet kan däremot variera beroende på kompilator även när samma kod används. En mikrooptimering som bidrar med prestandaförbättring med en kompilatorn kan ge prestandaförsämring med en annan kompilator. Detta visar att kompilatorkännedom, och att mäta, är fortsatt viktigt. 3 4 Table of Contents 1 Introduction 7 1.1 Background 7 1.1.1 Interchange 9 1.1.2 Unrolling 9 1.1.3 Cache loop end value 10 1.1.4 Iterator incrementation 10 1.2 Research question 11 2 Method 12 2.1 Method description 12 2.2. Method discussion 13 3 Results 14 3.1 Interchange 15 3.2 Unrolling 16 3.3 Cache loop end value 17 3.4 Iterator incrementation 19 4 Analysis 20 4.1 Interchange 20 4.1.1 int[][] 21 4.1.2 int** 21 4.1.3 vector<vector<int>> 21 4.1.4 int[col * row] 22 4.2 Unrolling 23 4.2.1 int[] 23 4.2.2 vector<int> 23 4.2.3 vector<float> 24 4.2.4 vector<float> * 0.5 24 4.3 Cache loop end value 24 4.3.1 Short string, subscript 24 4.3.2 Long string, subscript 25 4.3.3 Long string, iterator 25 4.3.4 vector<int>, subscript 25 4.3.5 vector<int>, iterator 26 4.4 Iterator incrementation 26 4.4.1 std::vector<int>, Vector<int>, Vector<float> 26 5 4.4.2 Vector<float> DLL 26 5 Discussion 27 5.1 Interchange 27 5.2 Unrolling 27 5.3 Cache loop end value 28 5.4 Iterator incrementation 29 6 Conclusion 30 References 31 6 1 Introduction Modern computers are fast, and tend to become faster each year. Modern computers are however not fast enough for programmers to completely disregard how efficient their programs are. Spending some effort to maximise efficiency can provide numerous benefits. The program may run faster or be smaller, making time and space for other processes, or save time for the user. The possibility of using less powerful, but cheaper, hardware is a monetary gain, and with lower electricity consumption, it is also good for the environment. The key to efficiency is optimization. There are many approaches to optimization, including (at least in a C++ context, which this study is interested in): using a better compiler, a better algorithm, a better library or a better data structure [1]. These optimization approaches imply large changes, but they can also provide large benefits. Of course, the optimization approach of least effort is to simply let the compiler handle it, which is possibly also the most important optimization of them all. Another approach is micro-optimizations, defined in this study as local adjustments that does not alter any algorithm. These are small changes, that tend to be easy to perform, that provide some benefit. This study investigates when micro-optimizations are beneficial to perform on compiler optimized code. 1.1 Background For all the potential benefits involved, optimization is a fairly opinionated field. Donald Knuth implored in 1974 that optimization has its time and place, and that premature optimization is the root of all evil, creating needlessly complicated code [2]. Knuth is widely quoted even today [3], and commonly so to dismiss curiosity regarding optimization. Michael A. Jackson presented two rules in 1975: Do not optimize, and do not optimize yet, because in the process of making it small or fast, or both, it is liable to also make it bad in various ways [4]. These opinions are from an era when assembly code reigned (meaning everything was more complicated from the start, with or without optimization) and the validity of applying these quotes to the sphere of modern high-level programming languages can be questioned. Kernighan and Plauger appealed in 1974 to not “sacrifice clarity for small gains in ‘efficiency’”, and to let your compiler do the simple optimizations [5]. Wulf et al. argued that efficiency is important, but also saw compiler optimization as a way of having both convenient programming and efficient programs [6]. The importance of compiler optimization has been known and studied for a long time by now. McKeeman claims that the barrier to good compiler optimization is cost in time and space [7]. Stewart and White show that optimization improves performance in applications, stemming from improvements like path length reduction, efficient instruction selection, pipeline scheduling, and memory penalty minimization [8]. A hardware aware compiler, with the proper support for the features of the hardware, is important for performance [8]. 7 Scientific investigations of the benefits of manual micro-optimizations seem uncommon based on the literature search performed for this study. Unpublished opinions are still to be found on the web, where writers may also do some investigation. Scientific studies of micro-optimizations tend to focus on one in particular and some specific aspect of their application, most common as a compiler optimization. Dependence analysis, the analysis of which statements are dependent on which statements, is an important part of understanding and performing loop transformations, such as interchange [9]. Song and Kavi show that detecting and removing anti-dependencies, making loops appear cleaner, can expose further opportunities for analysis and optimization [10]. Sarkar show that performance improvements can be gained by treating the loop unrolling factor selection for perfectly nested loops as an optimization problem [11]. Carr et al. show that unroll-and-jam, or outer-loop unrolling, can improve pipeline scheduling, a desired outcome for regular unrolling as well [12]. Huang and Leng have created a generalized method for unrolling not just for loops but all loop constructs, with modest speed increase but easy to perform [13]. Vectorization, operations on several array elements at once, is possible only under certain circumstances, but will increase performance thanks to the use of SIMD (Single Instruction Multiple Data) instructions [14]. It can be achieved by letting the compiler handle it, which is the easiest, but Pohl et al. show that the best performance comes from manual implementation, for example through the use of intrinsics [15]. There are many studies comparing performance between different compilers. Jayaseelan et al. show that the performance of integer programs is highly sensitive to the choice of compiler [16]. Gurumani and Milenkovic compare Intel C++ and Microsoft Visual C++ compilers in a benchmark test and find that Intel C++ perform better in every case [17]. A similar study by Prakash and Peng, with the same compilers but with a later version of the benchmark, show that Intel C++ mostly outperforms Microsoft Visual C++ [18]. Calder et al. compare the performance between GNU and DEC C++ compilers, but without much discovered difference [19]. Calder et al. do however show that object-oriented C++ programs force the compiler to do more optimization compared to that of a C program [19]. Some of this is due to a larger number of indirect function calls [19], which is the subject of several studies, aiming to improve performance, for example by Mccandless and Gregg, and Joao et al. [20], [21]. Karna and Zou show that GCC is in general the most efficient C compiler on Fedora OS [22], which is a very specific thing to investigate, but likely of great interest to C programmers using Fedora. While compiler comparisons are always interesting, any single study risks becoming irrelevant with each compiler update. This is a good argument to always keep doing them. Spång and Hakuni Persson investigate four types of C++ micro-optimizations: loop interchange, loop unrolling, caching of the loop end value, and proper iterator incrementation, on two compilers, GNU G++ and Clang, to see what benefits can be gained from their use [23]. Their findings show both benefits and drawbacks. Interchange and caching the loop end value are found to be beneficial, and unrolling and iterator incrementation are found to be either detrimental or causing no change.

Manual Micro-Optimizations in C++ an Investigation of Four Micro-Optimizations and Their Usefulness

Loop Pipelining with Resource and Timing Constraints

Redundancy Elimination Common Subexpression Elimination

Polyhedral Compilation As a Design Pattern for Compiler Construction

Research of Register Pressure Aware Loop Unrolling Optimizations for Compiler

CS153: Compilers Lecture 19: Optimization

Fast-Path Loop Unrolling of Non-Counted Loops to Enable Subsequent Compiler Optimizations∗

A Hierarchical Region-Based Static Single Assignment Form

The Effect of C Ode Expanding Optimizations on Instruction Cache Design

A Tiling Perspective for Register Optimization Fabrice Rastello, Sadayappan Ponnuswany, Duco Van Amstel

Graduate Computer Architecture Chapter

The Effect of Code Expanding Optimizations on Instruction Cache Design

Loop Transformations and Parallelization