
Register Allocation Deconstructed

David Ryan Koes, Carnegie Mellon University, Pittsburgh, PA
Seth Copen Goldstein, Carnegie Mellon University, Pittsburgh, PA

Abstract

Register allocation is a fundamental part of any optimizing compiler. Effectively managing the limited register resources of the constrained architectures commonly found in embedded systems is essential in order to maximize code quality. In this paper we deconstruct the register allocation problem into distinct components: coalescing, spilling, move insertion, and assignment. Using an optimal register allocation framework, we empirically evaluate the importance of each of the components, the impact of component integration, and the effectiveness of existing heuristics. We evaluate code quality both in terms of code performance and code size and consider four distinct instruction set architectures: ARM, Thumb, x86, and x86-64. The results of our investigation reveal general principles for register allocation design.

Categories and Subject Descriptors D.3.4 [Programming Languages]: Processors - Code generation, Compilers, Optimization

General Terms Algorithm, Design, Languages, Performance

Keywords Register Allocation

1. Introduction

Register allocation is a critical part of any optimizing compiler. The register allocator is responsible for finding a desirable assignment of program variables to hardware registers and memory locations. The quality of register allocation has a substantial impact upon final code quality, both when optimizing for performance and when optimizing for code size. When optimizing for performance, an effective register allocation minimizes memory traffic and decreases the sensitivity of a program to the processor-memory gap. When optimizing for code size, an effective register allocation minimizes the amount of code introduced by effectively using addressing modes and managing data movement instructions. Despite decades of study, register allocation remains a complex and challenging problem for which no completely satisfactory solutions exist.

In this paper we deconstruct the register allocation problem. We decompose the register allocation problem into distinct components: coalescing, spilling, move insertion, and assignment. The coalescing component of register allocation attempts to eliminate existing move instructions by allocating the operands of the move instruction to identical locations. The spilling component attempts to minimize the impact of accessing variables in memory. The move insertion component attempts to insert move instructions, splitting the live ranges of variables, with the goal of achieving a net improvement in code quality by improving the results of the other components. The assignment component assigns those program variables that aren't in memory to specific hardware registers. Existing register allocators take different approaches in solving these various components. For example, the central focus of a graph-coloring based register allocator is to solve the assignment problem, while the spilling and coalescing components are integrated as extensions to this assignment-focused model.

In this paper we use an optimal register allocation framework to study each component of register allocation. We evaluate both the individual impact of each component and the synergistic impact of fully integrating components upon code quality. We consider both code performance and code size metrics of code quality and target four distinct instruction set architectures: ARM, Thumb, x86, and x86-64.

What is the greatest benefit that can be realized by developing a fully integrated allocation algorithm? Conversely, what is the maximum penalty incurred if the register allocation problem is decomposed into (hopefully) easier to solve subproblems? What is the individual impact of each component of register allocation on code quality? How far from optimal are existing heuristics and where are the most fruitful areas for researchers to focus their attention? The goal of this study is to answer these questions and derive general principles for register allocation design.

The main contributions of this paper are:

• The first comprehensive empirical investigation of the importance, impact, and interaction of the various components of register allocation.
• A set of design principles extrapolated from this investigation useful in guiding the construction of future register allocators.

In the next section we further describe the register allocation problem and describe previous work. In Section 3 we describe the optimal register allocation framework we use. Section 4 contains the methodology we use to perform our study. In Section 5 we present and analyze the results of our investigation, and in Section 6 we extrapolate several general guidelines of register allocation design from the results of our study.

2. Background

Register allocation is an extensively studied problem. Graph-coloring based algorithms are the dominant technique for performing global register allocation. These approaches construct an interference graph and attempt to find a coloring of the graph where the colors correspond to registers [6, 9]. The basic graph-coloring algorithm has been extended in attempts to integrate additional components of register allocation, such as spill code generation [2, 3, 7], move insertion [10], and coalescing [7, 9, 12, 24]. However, at its core the graph-coloring approach is focused on solving the register assignment problem and treats the spilling problem as a secondary concern.

Linear scan allocators [25, 26, 29, 30] are an alternative to graph-coloring approaches. Linear scan allocators were originally developed for dynamic and just-in-time compilation. Initial linear scan allocators were extremely fast and scalable, but produced low quality code. Recently, the extended linear scan algorithm [26] has been shown to produce higher quality code than a graph-coloring allocator while maintaining the scalability advantages of linear scan algorithms.

Most register allocation algorithms attempt to integrate the various components (move insertion, coalescing, spill code generation, and register assignment) of register allocation into a single algorithm. However, several approaches separate spill code generation and register assignment [1, 22, 26]. In particular, Appel and George [1] use an integer linear programming (ILP) formulation to solve the spill code optimization problem and then use heuristics to find a valid assignment. Their framework views assignment as a coalescing problem. Parallel move instructions are inserted at every program point, making the assignment problem trivial; the challenge becomes eliminating as many move instructions as possible by assigning the operands of the move to the same register.

It has been shown that inserting moves at every program point is unnecessary; it is sufficient that the program be in SSA form and the assignment problem becomes polynomial [5, 8, 15, 23]. Additional work has embraced the coalescing view of register assignment [4, 13, 16, 31, 32]. These approaches perform spill code generation as a separate phase, insert sufficient moves to render the assignment problem trivial (or expect the code to be in SSA form), and then attempt to remove as many move instructions as possible. Although these approaches primarily use coalescing algorithms, the general problem they are solving is the register assignment problem. In this paper we distinguish between the register assignment problem, which may generate move instructions, and the coalescing problem, which removes existing move instructions. We view coalescing based register assignment techniques as an integration of these two components.

Several register allocators that solve all or part of the register allocation problem optimally have been implemented. Optimal techniques, primarily based on ILP representations, have been presented that solve the register allocation problem in its entirety [11, 20], the spill code optimization problem [1], and the integrated assignment-coalescing problem [1, 13]. However, no study has used an optimal register allocation framework to thoroughly analyze the individual and collective impact of the various components of register allocation.

3. Optimal Register Allocation Framework

The heart of our investigation of register allocation is an optimal register allocation framework derived from [19]. This framework models the register allocation problem using multi-commodity network flow (MCNF) and exactly represents the spill code optimization, move insertion, and register assignment components of register allocation. We extend the model to exactly represent the coalescing component by adding side constraints. These side constraints, which do not fit directly into the network flow formulation, model the benefit of allocating the two operands of a move instruction to the same location. We formulate the resulting model of register allocation using an integer linear programming (ILP) model. The ILP is then solved using ILOG CPLEX 10.0 [17]. It is important to note that this optimal register allocation framework is not practical for production use; finding an optimal solution for a single allocation problem can take as long as several days. However, because the results are optimal, this framework is an ideal tool for investigating the essential features of register allocation.

We consider a register allocation to be optimal if, for a given code quality metric, the allocation has the best code quality that is achievable by assigning registers and inserting loads, stores, and moves into a given instruction stream. Our model is essentially optimal, but there are a few minor limitations. One such limitation is an implicit assumption that it is never beneficial for the same value to be in two different registers at the same program point. For architectures with uniform register sets, this assumption holds as long as copy propagation has been applied to the code. In addition, we do not consider the impact of increasing the size of the stack (to hold spilled values) on code quality. Finally, note that we consider instruction scheduling to be beyond the scope of the register allocator and do not reorder instructions.

Our model correctly models the persistence of a value in memory. That is, a value need only be stored to memory once with no cost incurred by future evictions. Our model takes a very fine-grained view of spilling; this is not a "spill-everywhere" approach. In our full model we allow moves between registers at every program point. In order to simplify the model, we only permit load and store instructions to be inserted at basic block boundaries and before/after instructions that use/define the corresponding variable. This simplification does not affect the optimality of the result. Constant values may be rematerialized instead of being spilled to memory.

We consider two code quality metrics: code size and code performance. Optimizing for code size is straightforward as the immediate impact of any transformation can be exactly determined at compile time. The complex nature of modern architectures makes an exact model of performance unattainable. Instead we model code performance using a weighted sum of loads, stores, and move instructions where instructions that are judged to execute more frequently have a proportionally higher weight.

We use our full model of register allocation to find an optimal allocation. This optimal allocation serves as a baseline in our investigation of the importance of the different components of register allocation. In order to perform our investigation, it is necessary to modify the model so that we can analyze each component separately as well as investigate the impact of integrating different components. We consider four distinct components of register allocation: coalescing, spilling, move insertion, and assignment. Next, we describe how we modify the model to investigate the various configurations of these four components (Table 1).

Table 1. The components of register allocation and the various configurations evaluated in this study.

Move Insertion
  full: move instructions may be inserted at any program point
  limited: move instructions may be inserted only at the entry and exit of basic blocks
  none: no register-to-register move instructions are generated by the allocator

Coalescing
  integrated optimal: the benefits of move coalescing are exactly represented as part of an ILP model of register allocation
  integrated optimal ignoring uncoalescable: the benefits of coalescing only those move instructions that can be identified as coalescable prior to register allocation are incorporated into the ILP model
  separate optimal: an ILP model is used to eliminate the maximum number of coalescable moves prior to register allocation
  separate aggressive: a greedy heuristic aggressively eliminates coalescable moves prior to register allocation
  none: no coalescing is performed

Spilling
  integrated optimal: the costs of spill code generation are exactly represented as part of an ILP model of register allocation
  separate optimal: the spill code generation problem (reducing max liveness to meet register availability) is solved optimally as a standalone problem
  separate heuristic: the spill code generation problem is solved as a standalone problem using a heuristic algorithm

Assignment
  integrated optimal: the costs of register assignment are exactly represented as part of an ILP model of register allocation
  graph heuristic: a graph-coloring based heuristic is used to assign registers to the results of spill code generation; move instructions may be inserted to improve colorability
  linear scan heuristic: a linear scan based heuristic is used to assign registers to the results of spill code generation; move instructions may be inserted to improve colorability

3.1 Move Insertion

Inserting move instructions during register allocation is typically done to split the live range of a variable in order to avoid spilling the variable, to take advantage of some register preference, or to make the assignment problem easier. As an entirely separate phase, move insertion is pointless since inserting moves can only increase code size and decrease performance. Instead, the nature of move insertion necessitates that it be integrated with at least a register assignment phase. Therefore we do not attempt to analyze the separability of move insertion. Instead we consider the impact move insertion has on code quality by both disabling and limiting the ability of our full model to insert moves. With move insertion disabled in the full model, the optimal register allocator never generates register to register moves. If a variable is loaded into or defined in a register, it stays in that register until the variable is no longer live or the variable is evicted to memory. As a middle ground between full move insertion (where register to register moves can be inserted at any program point) and no move insertion we also evaluate limiting move insertion to basic block boundaries. That is, the register allocator will only generate register to register moves at the entries and exits of basic blocks.

3.2 Optimal Coalescing

Move coalescing occurs when the register allocator can allocate the source and destination operand of an existing move instruction to the same location, resulting in the elimination of the move instruction. As a separate phase, the coalescing problem is to eliminate as many coalescable moves as possible. A move is coalescable if the source operand does not interfere with the destination operand. For instance, the source operand cannot be live after the instruction. When performed independently from register assignment, move coalescing merges the live ranges of the two operands of the move instruction into a single new variable.

There are three apparent disadvantages to performing coalescing as a separate pass prior to assignment. The first is that overly aggressive coalescing may make the register assignment problem harder. However, if the assigner effectively utilizes move insertion, this drawback can be overcome. The second disadvantage of decoupling coalescing from the rest of the allocator is the inability to remove uncoalescable moves. For example, in the context of the full register allocation problem, it may be possible to remove a move instruction where the source is live after the instruction if the source value is available in memory. In this case, future users of the source operand value will have to first load the value from memory. Finally, standalone coalescing cannot eliminate move instructions where one operand is a preallocated machine register; these move instructions must be removed by the register assignment phase.

In order to investigate the impact of coalescing on register allocation, we formulate the standalone coalescing problem as an ILP using a graph-coloring representation similar to [12] where we do not limit the number of colors. This pass identifies the maximum number of coalescable move instructions and then removes these instructions by assigning the operands to identical virtual registers. The resulting instruction stream is passed to a version of our full model that is missing the side constraints that represent the benefits of coalescing. In addition, in order to evaluate the benefit of an integrated allocator being able to remove uncoalescable moves, we consider a version of the full model where coalescing side constraints are only added for those moves that are coalescable.

3.3 Optimal Spilling

Optimal spill code generation minimizes the impact of memory accesses generated by the register allocator on code quality. The goal of spill code generation when run as a separate standalone pass is to legally modify the input instruction stream so that at every program point sufficient registers are available to hold all allocable live variables. The spill code generator reduces the number of allocable variables at a program point by spilling (moving to memory) a variable over some range of program points. The result of spill code generation is provably assignable [16] as long as the assigner is capable of inserting move and swap instructions.

In order to represent the standalone optimal spilling problem we use a simplified version of our full model. In the MCNF based model of register allocation used in the full model, each variable is a flow through a network of nodes where each node represents memory or a specific register. In order to represent the spilling problem, we simplify this network so that instead of having a unit capacity node for each register, we have a single node for each register class with capacity equal to the class size. For example, if there are eight integer registers, these will constitute a single register class that is represented by a node with capacity eight. The resulting model is substantially smaller and less complex than the full model. It does not model register to register moves or coalescing constraints and has considerably fewer nodes and edges. We optimistically model register usage preferences. That is, if there is some benefit to allocating a variable to a specific register at a program point, then in the simplified model, that variable achieves the same benefit by being allocated to the corresponding register class.

3.4 Optimal Assignment

Given the results of spill code generation, optimal register assignment finds an assignment of registers to variables that maximizes code quality. The assigner may have to insert move or swap instructions in order to generate a legal assignment. Since in our optimal framework we allow for only the generation of move, load, and store instructions, we implement swaps using an additional memory location. The register assigner maximizes code quality not only by minimizing the impact of inserted move and swap instructions, but also by exploiting any register preferences. For example, if a variable is moved into a hardware register, allocating that variable to that hardware register will eliminate the corresponding move instruction.

We implement optimal assignment as a standalone problem by constraining the full model to observe the results of spill code generation. For example, if spill code generation creates a load of a variable at a particular program point, then in the constrained full model we force that variable to be loaded by constraining the corresponding variables in the ILP appropriately.

3.5 Heuristics

In addition to developing optimal algorithms for the components of register allocation we have implemented several heuristic algorithms within the parameters of our framework. We have implemented an aggressive coalescer that simply considers all coalescable move instructions in program order and greedily coalesces. Our heuristic spiller is an implementation of the heuristic solver of [19] applied to the simplified, spill code optimization, version of the MCNF based model. We also use this heuristic solver to solve the full model (ignoring side constraints).

We implement two assignment heuristics to represent the two most prevalent approaches: a linear scan based algorithm similar to [26] and an algorithm that uses graph simplification [9] to find an initial partial assignment that is finalized with a move insertion pass. Note that the assignment heuristics assume that spill code generation has been performed (the maximum number of live variables does not exceed the number of available registers at any point). Therefore, the assignment heuristics do not insert any spill code to reduce register pressure. However, move or swap instructions may have to be inserted at basic block boundaries in order to find a valid assignment. In our implementation, swap instructions are implemented using a temporary memory location. Both heuristics attempt to exploit register preferences by assigning a variable its preferred register if it is available.

4. Methodology

We have implemented our optimal register allocation framework within the LLVM 2.4 [21] compiler infrastructure. We target four distinct instruction set architectures:

• x86 The venerable x86 32-bit instruction set [18] has variable length instructions, supports direct access to memory in most instructions, and has a limited register set (8 integer, 8 floating point). We consider this ISA to be representative of CISC architectures with limited register resources.
• x86-64 The Intel x86 64-bit instruction set [18] also has variable length instructions and support for memory operands, but has an extended register set (16 integer, 16 floating point). We consider this ISA to be representative of CISC architectures with sufficient register resources.
• ARM The ARM instruction set [27] has four-byte RISC-like instructions and supports 16 integer registers. We target an ARMv6 core with floating point. We consider this ISA to be representative of RISC architectures with sufficient register resources.
• Thumb The Thumb instruction set [27] is an alternative instruction set for ARM processors optimized for code size. It has two-byte RISC-like instructions and can only efficiently access 8 integer registers. For our investigation we restrict the allocator to only allocate to these 8 efficiently accessed registers to make the Thumb target representative of architectures with limited register sets. We target an ARMv6 core and do not generate Thumb-2 instructions. We consider this ISA to be representative of RISC architectures with limited register resources.

We believe that these four architectures, in addition to constituting a substantial portion of market share in the desktop, server, and embedded spaces, are also generally representative of most existing architectures. Our observations on the impact of register allocation on code size and performance can therefore be readily extrapolated to similar architectures.

4.1 Code Size Evaluation

The immediate impact of register allocation on code size is straightforward to measure since the compiler can exactly calculate the size of a function. Since we are concerned with the direct impact register allocation has on code size and wish to avoid any potential noise from downstream optimizations, all reported code size results reference the code size immediately after performing register allocation. We only report the size of code; constants and other data are ignored.

Since code size is principally of interest to the embedded community, we use MiBench [14], a commercially representative embedded benchmark suite, in our evaluations. We omit five of the 20 benchmarks (mad, tiff, sphinx, ghostscript, and ispell) due to dependencies on libraries not available on our non-native ARM and Thumb targets. In order to generate results in a reasonable timeframe, we impose a time limit of 10 minutes per function upon the CPLEX solver. In comparing allocators, we consider only those functions for which an optimal solution is found for all allocator configurations and architectures under consideration. The imposition of a time limit introduces a self-selecting bias to our results. However, we find that this time limit is sufficient to select more than 70% of the functions in the considered benchmarks. Furthermore, we observed no significant qualitative change in results as we increased the time limit.

In reporting code size results we compare the total code size over all functions relative to the fully optimal result. That is, a result ratio of one is the best possible result and larger ratios represent a code size increase.

4.2 Performance Evaluation

Code performance is the predominant code quality metric in the desktop and server spaces. We evaluate code performance on the highly relevant x86 and x86-64 targets. We run our performance experiments on an Intel Core 2 Quad (Q6600) processor running at 2.4GHz with 4GB of main memory. We evaluate performance using a subset of the SPEC2006 [28] benchmark suite. Unfortunately, it isn't feasible to solve every function of every benchmark to optimality. In order to make the evaluation practical we consider only those C/C++ benchmarks where, based on profile data, 85% of the time is spent in 10 or fewer functions. When compiling these benchmarks we find an optimal register allocation only for these key functions. We further limit the number of benchmarks by only considering those benchmarks where an optimal solution could be obtained for every key function for all configurations of the allocator within a time limit of 1000 minutes.
These selection criteria result in a subset of six benchmarks for x86 (429.mcf, 433.milc, 462.libquantum, 470.lbm, 473.astar, and 482.sphinx) and four benchmarks for x86-64 (433.milc, 462.libquantum, 473.astar, and 482.sphinx). Note that since different benchmark sets are used for the two architectures, they are not directly comparable.

When optimizing for performance, our register allocation model requires some estimate of the execution frequencies of every basic block. We extract exact execution frequencies from a profile run of the train data set and use these exact frequencies within the model. Since we are interested in finding an optimal allocation, we eliminate any noise from differences in data sets by also using the train data set to compute our performance numbers.

In reporting code performance results, unless stated otherwise, we compare a geometric mean across the selected benchmarks of the execution time increase relative to the fully optimal model. Since the fully optimal result is only optimal with respect to a weighted sum of executed loads, stores, and moves, the result is not necessarily optimal with respect to all aspects of processor performance. Furthermore, post-allocation optimizations may add noise. As a result, it is possible to observe a performance ratio less than one, indicating a performance improvement relative to the optimal allocation. In addition to reporting execution time, we use performance counters to collect dynamic load, store, and instruction counts. These counts are representative of the simple code performance metric used by our optimal model. Indeed, there are cases where the optimal solution according to our performance metric does not yield the fastest code, implying there are important aspects of processor performance that are not captured by our metric.

5. Analysis

In this section we use our optimal allocation framework to empirically evaluate the importance and impact of the various components of register allocation. We investigate the impact of solving these components separately, but optimally, examine the individual contribution of each component to the overall code quality, and evaluate the performance of heuristics relative to an optimal allocator.

5.1 Move Insertion

Code size and performance results for two different move insertion strategies are shown in Figure 1. In both cases, any pre-existing move instructions are aggressively coalesced prior to register allocation. That is, the allocator cannot simply split a live range by choosing not to coalesce; a move must be inserted. Surprisingly, disabling move insertion altogether, which should necessitate the introduction of additional spill code, has only a marginal impact on code quality. Code size grows by less than 0.25% on all targets. The limited move insertion strategy, which only allows move instructions to be inserted by the allocator at basic block boundaries, performs even better, demonstrating minuscule or nonexistent increases in code size.

Both strategies result in essentially equal performance compared to the optimal allocator. Disabling move insertion does result in more memory operations, but these extra memory operations do not have any meaningful effect on performance. In contrast, the limited move insertion strategy is virtually indistinguishable from the optimal allocator.

Several register allocators utilize move insertion in order to generate a better allocation [1, 19, 20] or to simplify the assignment problem [13, 16, 31, 32]. The results of our empirical study strongly suggest that supporting move insertion has a surprisingly minimal impact on final code quality and that supporting full move insertion, where move instructions can be inserted at every program point, is not necessary to achieve high quality code.

5.2 Coalescing

Code size and performance results for four different coalescing strategies are shown in Figure 2. After performing coalescing using one of the shown strategies, both spill code optimization and register assignment are solved optimally as a single integrated pass. As expected, not performing any coalescing results in significant increases in code size as well as significantly increasing the number of instructions executed and degrading performance.

The three remaining coalescing strategies have remarkably little impact on code quality. Code size grows by less than 0.15% on all targets and the difference in performance is negligible. There is a small code size advantage of optimal coalescing over aggressive coalescing, but in practice the greedy heuristic aggressive coalescer usually eliminates the same number of move instructions as the optimal coalescer. When the fully integrated optimal allocator is configured to ignore any potential benefit from eliminating uncoalescable move instructions (that may become coalescable with the introduction of spill code), a small code size increase is observed. In all cases, the integrated coalescer slightly improves upon the optimal allocator, implying that the decision of which move instructions are coalesced, not just the number of coalesced move instructions, can influence the final code quality.

These results suggest that, when viewed as a move elimination problem distinct from register assignment, coalescing can be effectively implemented as a separate pass using a simple greedy heuristic. However, when code size is of paramount importance and the target is register limited, improving coalescing and integrating coalescing into the spill code optimizer may yield a small improvement.

5.3 Spilling

The impact on code size and performance of both optimal and heuristic spill code optimization performed as a separate pass is shown in Figure 3. After executing the spill code optimization pass, coalescing and register assignment are per-

[Figure 1 plots: (a) code size relative to optimal on x86, x86-64, ARM, and Thumb; (b) performance relative to the optimal allocator on x86 and x86-64. Series: No Move Insertion, Limited Move Insertion.]

Figure 1. The impact of move insertion on code quality when targeting code size (a) and code performance (b). Taller bars indicate larger/slower code. Error bars are shown for execution time measurements.
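The bars in these figures are ratios against the optimal allocation, and the optimal model scores code by a weighted sum of executed loads, stores, and moves. The sketch below shows one way such a metric and ratio could be computed. The class name `DynamicCounts`, the function names, and the unit default weights are illustrative assumptions and are not taken from the framework itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DynamicCounts:
    """Executed (dynamic) counts of allocator-related instructions."""
    loads: int
    stores: int
    moves: int


def weighted_cost(counts, w_load=1.0, w_store=1.0, w_move=1.0):
    """Weighted sum of executed loads, stores, and moves.

    The default weights of 1.0 are placeholders (an assumption); a real
    model would choose weights that reflect the target processor.
    """
    return (w_load * counts.loads
            + w_store * counts.stores
            + w_move * counts.moves)


def ratio_to_optimal(measured, optimal):
    """Ratio above 1 means larger/slower than optimal; below 1 means better."""
    if optimal <= 0:
        raise ValueError("optimal baseline must be positive")
    return measured / optimal


if __name__ == "__main__":
    optimal = DynamicCounts(loads=100, stores=50, moves=40)
    heuristic = DynamicCounts(loads=120, stores=60, moves=30)
    print(ratio_to_optimal(weighted_cost(heuristic), weighted_cost(optimal)))
```

Because the metric only counts loads, stores, and moves, a measured execution-time ratio can fall below one even when the weighted-cost ratio does not, which is the effect the text attributes to unmodeled aspects of processor performance.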


Figure 2. The impact of coalescing on code quality when targeting code size (a) and code performance (b). Taller bars indicate larger/slower code. Error bars are shown for execution time measurements.
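As a rough illustration of what a greedy aggressive coalescer does, the following sketch merges the operands of each move whenever the two operands (and everything already merged with them) do not interfere. The paper does not give the order in which moves are visited or the data structures used, so the function name `aggressive_coalesce`, the input format, and the first-come visiting order are all assumptions.

```python
def aggressive_coalesce(moves, interference):
    """Greedily merge move-related variables that do not interfere.

    moves: iterable of (dst, src) pairs, visited in the given order.
    interference: iterable of (a, b) pairs of variables that are live at
        the same time and therefore cannot share a location.
    Returns (rep, remaining): rep maps each move-related variable to the
    representative of its merged class; remaining lists the moves that
    could not be eliminated.
    """
    adj = {}
    for a, b in interference:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    rep = {}      # variable -> representative of its class
    members = {}  # representative -> set of variables in the class

    def find(v):
        if v not in rep:
            rep[v] = v
            members[v] = {v}
        return rep[v]

    remaining = []
    for dst, src in moves:
        a, b = find(dst), find(src)
        if a == b:
            continue  # operands already share a location; move is eliminated
        # Two classes conflict if any member of one interferes with any
        # member of the other.
        conflict = any(adj.get(x, set()) & members[b] for x in members[a])
        if conflict:
            remaining.append((dst, src))
            continue
        for v in members[b]:
            rep[v] = a
        members[a] |= members.pop(b)
    return rep, remaining


if __name__ == "__main__":
    rep, left = aggressive_coalesce(
        moves=[("a", "b"), ("b", "c"), ("c", "d")],
        interference=[("a", "c")],
    )
    print(rep, left)
```

The sketch ignores register pressure entirely, which is what makes the coalescing "aggressive": any move whose operands can legally share a location is eliminated, and spilling and assignment are left to deal with the resulting live ranges.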


Figure 3. The impact of spill code optimization on code quality when targeting code size (a) and code performance (b). Taller bars indicate larger/slower code. Error bars are shown for execution time measurements.

Figure 4. The impact of register assignment on code quality when targeting code size (a) and code performance (b). Taller bars indicate larger/slower code. Error bars are shown for execution time measurements.
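Section 5.4 describes the graph-based assigner as first using graph simplification to find a partial coloring of the interference graph, leaving some variables uncolored. A minimal sketch of that first pass, in the general style of Chaitin/Briggs simplification, might look as follows. The function name `simplify_and_color`, the adjacency-dict input, and the choice to optimistically remove the highest-degree node when no low-degree node exists are assumptions; the subsequent splitting of uncolored variables into colorable live ranges is not shown.

```python
def simplify_and_color(graph, k):
    """Partial coloring of an interference graph by simplification.

    graph: dict mapping each node to the set of its neighbors (symmetric).
    k: number of available registers (colors 0..k-1).
    Returns (coloring, uncolored): a dict node -> color for the nodes that
    received a register, and a list of nodes left uncolored.
    """
    degree = {n: len(adj) for n, adj in graph.items()}
    removed = set()
    stack = []

    # Simplify: repeatedly remove a node and push it on the stack.
    while len(removed) < len(graph):
        candidates = [n for n in graph if n not in removed]
        low = [n for n in candidates if degree[n] < k]
        if low:
            node = low[0]  # fewer than k live neighbors: always colorable
        else:
            # Assumption: optimistically remove the highest-degree node.
            node = max(candidates, key=lambda n: degree[n])
        removed.add(node)
        stack.append(node)
        for m in graph[node]:
            if m not in removed:
                degree[m] -= 1

    # Select: pop nodes and give each the lowest color unused by neighbors.
    coloring, uncolored = {}, []
    while stack:
        node = stack.pop()
        used = {coloring[m] for m in graph[node] if m in coloring}
        free = [c for c in range(k) if c not in used]
        if free:
            coloring[node] = free[0]
        else:
            uncolored.append(node)
    return coloring, uncolored


if __name__ == "__main__":
    triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
    print(simplify_and_color(triangle, 3))
    print(simplify_and_color(triangle, 2))
```

When the returned `uncolored` list is empty, no move or swap instructions would be needed, which corresponds to the common case the paper reports for x86.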


Figure 5. The effectiveness of heuristic allocators when targeting code size (a) and code performance (b). Taller bars indicate larger/slower code. Error bars are shown for execution time measurements.

formed as a single optimal pass. Particularly for the register limited architectures, there is a definite degradation of code quality when a heuristic spiller is used. The register limited x86 architecture exhibits a 20% increase in dynamic memory operations that results in a 7.5% increase in execution time. Interestingly, although the number of dynamic memory operations increases substantially on the x86-64 architecture, this increase does not correspond to a significant increase in execution time.

Performing spill code optimization separately, but optimally, has no discernible impact on performance; however, there are notable increases in code size, ranging from an increase of 0.22% on x86 to 1.1% on Thumb.

These results suggest that, when optimizing for performance, spill code optimization can be implemented as a separate pass without loss of code quality. However, when optimizing for code size, spill code optimization is not so readily decoupled from the rest of register allocation. When targeting architectures with sufficient register sets, a heuristic spill code optimizer can approach the quality of an optimal optimizer, but there remains room for improvement, particularly when targeting register-limited architectures.

5.4 Assignment

Code size and performance results for three different assignment algorithms are shown in Figure 4. Prior to assignment, aggressive coalescing and optimal spill code generation are performed. The optimal assigner does not explicitly optimize for coalescing, but does explicitly optimize for register preferences. Unsurprisingly, given the spilling results, there does not seem to be any penalty for performing assignment as a separate pass when optimizing for performance.
However, an increase in code size is observed. Both heuristic assigners generate poorer quality code than the optimal assigner, but the graph-based assigner generates substantially better code than the scan-based assigner.

The graph-based assigner was especially effective when optimizing for performance. On the x86 architecture, the graph assigner resulted in a small increase in the number of memory operations and instructions executed, while on the x86-64 architecture only a slight increase in the total number of instructions executed was observed. Memory operations are introduced by the assigner when a swap operation is needed. The greater number of registers on the x86-64 architecture apparently made such operations less necessary and only simple move instructions were generated.

The graph-based assigner first uses graph simplification to find a partial coloring of the interference graph. If any variables remain uncolored after this pass, then these variables are split into colorable live ranges. If the first pass succeeds in coloring every variable, no move or swap instructions are generated; otherwise, move and swap instructions are generated only for the uncolored variables. In practice, the first pass often successfully colors the interference graph. For example, on x86, more than 75% of the compiled functions have their interference graph fully colored, while more than 90% of the functions end up with one or fewer uncolored nodes. As a result, few move and swap instructions are generated. Since swap instructions are implemented using a temporary memory location in our framework, the graph-based assigner is especially effective when optimizing for performance.

The results of our empirical study suggest that there remains ample room for improving assignment algorithms and that a poor register assignment can negatively impact performance (especially if a register swap requires a memory access). More advanced techniques such as graph recoloring [16] may narrow the gap between heuristic and optimal solutions. However, in order to achieve maximum code quality, particularly when optimizing for code size, some integration between the spill code optimizer and the assigner is necessary.

5.5 Heuristic Comparison

Code size and performance results for three different purely heuristic register allocators are shown in Figure 5. All three heuristics first perform aggressive coalescing, since coalescing has been shown to be highly separable and aggressive coalescing is very effective while being compile-time efficient. We consider two heuristics that treat spill code optimization and assignment separately. They both use the same heuristic spill code optimizer, but use our two different assignment algorithms. The third heuristic is an integrated heuristic that attempts to solve the spill code optimization and assignment problems simultaneously. Unsurprisingly, when optimizing for code size the graph-based assigner outperforms the scan-based assigner and both are outperformed by the integrated heuristic. However, there remains a significant gap between the optimal allocation and the best heuristic solution.

In all cases the heuristics increase the number of executed memory operations and instructions. On the x86 architecture, increases in execution time are also observed. Interestingly, these increases do not translate into performance degradations on the x86-64 architecture.

6. Conclusions

In this paper we have presented a first-of-its-kind, comprehensive empirical investigation of the importance, impact, and interaction of the various components of register allocation. Extrapolating from the results of our empirical study, we conclude with a set of design principles useful in guiding the construction of future register allocators and empirically justifying the design of current allocators:

• The ability to insert moves (that is, split live ranges or undo coalescing) has surprisingly little impact on code quality. What benefit there is can primarily be obtained by allowing move insertions only at basic block boundaries. Thus, register allocators need not be designed to explicitly optimize move insertions, although allowing insertions may simplify the design of algorithms for other components.

• Coalescing, when viewed as a move elimination problem separate from register assignment, is highly separable: it can be performed as a standalone pass without materially degrading code quality. Furthermore, a simple greedy heuristic is nearly as effective as an optimal algorithm.

• When optimizing for performance on a modern processor, spill code optimization is of paramount importance. Furthermore, the various components of register allocation can be treated separately without significantly impacting performance. Therefore, when targeting processor performance, new register allocator designs should focus on solving the spill code optimization problem, as the coalescing, move insertion, and register assignment problems are adequately solved using existing heuristics.

• Both spill code optimization and register assignment are important contributors when optimizing for code size. If code size is of paramount importance, then integrating these two phases can result in an additional small improvement. Therefore, when targeting code size, new register allocator designs should focus on solving both the spill code optimization and register assignment problems, possibly in an integrated framework.

Acknowledgments

We would like to thank Mike Ryan and Pillai Padmanabhan for their help and support using the Open Cirrus testbed at Intel Research Pittsburgh. This research was sponsored in part by the National Science Foundation under grant CCF-0702640.

References

[1] A. W. Appel and L. George. Optimal spilling for CISC machines with few registers. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 243–253. ACM Press, 2001.

[2] P. Bergner, P. Dahl, D. Engebretsen, and M. O'Keefe. Spill code minimization via interference region spilling. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 287–295, New York, NY, USA, 1997. ACM Press.

[3] D. Bernstein, M. Golumbic, Y. Mansour, R. Pinter, D. Goldin, H. Krawczyk, and I. Nahshon. Spill code minimization techniques for optimizing compliers. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 258–263. ACM Press, 1989.

[4] F. Bouchez. A Study of Spilling and Coalescing in Register Allocation as Two Separate Phases. PhD thesis, École Normale Supérieure de Lyon, France, December 2008.

[5] F. Bouchez, A. Darte, and F. Rastello. Register allocation: What does the NP-completeness proof of Chaitin et al. really prove. In Workshop on Duplicating, Deconstructing, and Debunking, 2006.

[6] P. Briggs. Register allocation via graph coloring. PhD thesis, Rice University, Houston, TX, USA, 1992.

[7] P. Briggs, K. D. Cooper, and L. Torczon. Improvements to graph coloring register allocation. ACM Trans. Program. Lang. Syst., 16(3):428–455, 1994.

[8] P. Brisk, F. Dabiri, J. Macbeth, and M. Sarrafzadeh. Polynomial time graph coloring register allocation. In 14th Internat. Workshop on Logic and Synthesis. ACM Press, 2005.

[9] G. J. Chaitin. Register allocation & spilling via graph coloring. In Proceedings of the SIGPLAN Symposium on Compiler Construction, pages 98–101. ACM Press, 1982.

[10] K. D. Cooper and L. T. Simpson. Live range splitting in a graph coloring register allocator. In Proceedings of the 7th International Conference on Compiler Construction, pages 174–187, London, UK, 1998. Springer-Verlag.

[11] C. Fu, K. Wilken, and D. Goodwin. A faster optimal register allocator. The Journal of Instruction-Level Parallelism, 7:1–31, January 2005.

[12] L. George and A. W. Appel. Iterated register coalescing. ACM Trans. Program. Lang. Syst., 18(3):300–324, 1996.

[13] D. Grund and S. Hack. A Fast Cutting-Plane Algorithm for Optimal Coalescing. In S. Krishnamurthi and M. Odersky, editors, Compiler Construction, Lecture Notes In Computer Science. Springer, March 2007. Braga, Portugal.

[14] M. Guthaus, J. Ringenberg, D. Ernst, T. Austin, T. Mudge, and R. Brown. Mibench: A free, commercially representative embedded benchmark suite. IEEE International Workshop on Workload Characterization, pages 3–14, December 2001.

[15] S. Hack and G. Goos. Optimal register allocation for ssa-form programs in polynomial time. Information Processing Letters, 98(4):150–155, May 2006.

[16] S. Hack and G. Goos. Register Coalescing by Graph Recoloring. In Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 227–237, New York, NY, USA, 2008. ACM.

[17] ILOG CPLEX. http://www.ilog.com/products/cplex.

[18] Intel 64 and ia-32 architectures software developer's manual. http://www.intel.com/products/processor/manuals/.

[19] D. R. Koes and S. C. Goldstein. A global progressive register allocator. In Proceedings of the 2006 ACM SIGPLAN conference on Programming Language Design and Implementation, pages 204–215, New York, NY, USA, 2006. ACM Press.

[20] T. Kong and K. D. Wilken. Precise register allocation for irregular architectures. In Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture, pages 297–307. IEEE Computer Society Press, 1998.

[21] The LLVM compiler infrastructure. http://llvm.org.

[22] C. R. Morgan. Building an Optimizing Compiler. Butterworth, 1998.

[23] J. Palsberg. Register allocation via coloring of chordal graphs. In Proceedings of the thirteenth Australasian symposium on Theory of Computing, pages 3–3, Darlinghurst, Australia, 2007. Australian Computer Society, Inc.

[24] J. Park and S.-M. Moon. Optimistic register coalescing. ACM Trans. Program. Lang. Syst., 26(4):735–765, 2004.

[25] M. Poletto and V. Sarkar. Linear scan register allocation. ACM Trans. Program. Lang. Syst., 21(5):895–913, 1999.

[26] V. Sarkar and R. Barik. Extended linear scan: an alternate foundation for global register allocation. In Proceedings of the 2007 International Conference on Compiler Construction, March 2007.

[27] D. Seal. ARM Architecture Reference Manual. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2000.

[28] SPEC CPU2006 benchmark suite. http://www.spec.org.

[29] O. Traub, G. Holloway, and M. D. Smith. Quality and speed in linear-scan register allocation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 142–151, New York, NY, USA, 1998. ACM Press.

[30] C. Wimmer and H. Mössenböck. Optimized interval splitting in a linear scan register allocator. In Proceedings of the ACM/USENIX International Conference on Virtual Execution Environments, pages 132–141, New York, NY, USA, 2005. ACM Press.

[31] T. Zeitlhofer and B. Wess. Optimum register assignment for heterogeneous register-set architectures. In Proceedings of the 2003 International Symposium on Circuits and Systems, volume 3, pages III–252–III–244, May 2003.

[32] T. Zeitlhofer and B. Wess. A comparison of graph coloring heuristics for register allocation based on coalescing in interval graphs. Proceedings of the 2004 International Symposium on Circuits and Systems, 4:IV–529–32 Vol.4, May 2004.