
Reprinted from the Proceedings of the GCC Developers’ Summit

June 17th–19th, 2008
Ottawa, Ontario, Canada

Conference Organizers

Andrew J. Hutton, Steamballoon, Inc., Linux Symposium, Thin Lines Mountaineering
C. Craig Ross, Linux Symposium

Review Committee

Andrew J. Hutton, Steamballoon, Inc., Linux Symposium, Thin Lines Mountaineering
Ben Elliston, IBM
Janis Johnson, IBM
Mark Mitchell, CodeSourcery
Toshi Morita
Diego Novillo, Google
Gerald Pfeifer, Novell
Ian Lance Taylor, Google
C. Craig Ross, Linux Symposium

Proceedings Formatting Team

John W. Lockhart, Red Hat, Inc.

Authors retain copyright to all submitted papers, but have granted unlimited redistribution rights to all as a condition of submission.

A Superoptimizer Analysis of Multiway Branch Code Generation

Roger Anthony Sayle OpenEye Scientific [email protected]

Abstract

The high level of abstraction of the multiway branch, exemplified by C/C++'s switch statement, Pascal/Ada's case, and Fortran's computed GOTO and SELECT, allows a large degree of freedom in its implementation. Typically compilers generate code selected from a few common patterns, such as jump tables and/or trees of conditional branches. However, for most switch statements there exists a vast number of possible implementations.

To assess the relative merits of these alternatives, and evaluate the selection strategies used by different compilers, a switch-statement "superoptimizer" has been developed. This exhaustively enumerates code sequences implementing a specified multiway branch. Each sequence is evaluated against a simple cost model for code size, average and worst-case performance (optionally using profile information). Backend timings for several architectures are parameterized by microbenchmarking. The results and insights from applying this superoptimizer to a corpus of "real-world" examples will be discussed.

1 Introduction

Although numerous optimizing compilers were implemented prior to its publication, the first paper to describe the issues of code generation for multiway branches was published by Arthur Sale in 1981 [Sale81]. This now-classic paper compared several implementation strategies for the University of Tasmania's Pascal compiler for the Burroughs B6700/7700 series. Ultimately this evaluation recommended the use of either a table jump or a balanced search tree of comparisons. It is interesting, and perhaps slightly disappointing, that over a quarter of a century later the switch code generation in the GNU GCC compiler is little changed from Sale's original exposition.

Over the last 25 years, microprocessor architecture has advanced considerably, the cycle penalties of memory latency and branch misprediction have changed, and several novel implementations have been proposed.

2 Definitions

A switch statement (or multiway branch) is a construct found in most high-level languages for selecting from one of several possible blocks of code or branch destinations depending on the value of an index expression. It can be considered a generalization of the if statement that conditionally selects between two possible blocks of code depending upon the value of a Boolean expression. In this work, we restrict the index expression to be of an integral type.

Mathematically, a switch statement behaves like a function or mapping, F : Z → M, that for each element of the set Z, the set of all K-bit values where K is a positive integer, selects a single outcome from the set M of destination labels. The function F is surjective, such that |M| ≤ |Z| = 2^K.

The semantics of Pascal's and Algol's multiway branch, the case statement, permit that any index value not handled by the case statement may result in undefined behavior. We do not consider such partial functions, and instead restrict F to be a total function, even though this may rule out some of the potential optimizations allowed in those languages. The function F is made total by mapping any value z ∈ Z that is not in the original domain to a unique default label, m_d ∈ M.

By this definition any pure const function of a single integer argument, such as factorial or Fibonacci, may be represented or canonicalized using a single switch statement. This ability can be generalized using a currying-like approach such that any pure const function can be represented by nested switch statements, or a single switch statement with a sufficiently large K.
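As a small illustration of this canonicalization, the sketch below writes factorial as a single total switch statement. It assumes an 8-bit index (K = 8) and 32-bit results; the function name and the choice of 0 as the default outcome are made here purely for illustration and do not come from the paper.

```c
#include <stdint.h>

/* Hypothetical example: factorial as one total switch statement.
   Only arguments 0..12 have a factorial that fits in 32 bits; every
   other 8-bit value is mapped to a single default outcome (0 here,
   standing in for the default label m_d). */
uint32_t factorial_switch(uint8_t x)
{
    switch (x) {
    case 0:
    case 1:  return 1u;
    case 2:  return 2u;
    case 3:  return 6u;
    case 4:  return 24u;
    case 5:  return 120u;
    case 6:  return 720u;
    case 7:  return 5040u;
    case 8:  return 40320u;
    case 9:  return 362880u;
    case 10: return 3628800u;
    case 11: return 39916800u;
    case 12: return 479001600u;
    default: return 0u;   /* default outcome for all remaining values */
    }
}
```

In the terms defined above, the thirteen listed values form the case set V, and the remaining 243 values of Z all map to the default.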


When |M| = 1, the switch statement F is said to be unconditional. When |M| = 2, then F is said to be binary.

In practice, we are mostly interested in switch statements where the size of the codomain M is small, with only a few labels, |M| ≪ |Z|. In such cases, a significant fraction of the elements of Z map to a default label m_d ∈ M, and it makes sense to talk of the set of case values V ⊆ Z, defined as the values that don't map to m_d, i.e. v ∈ V iff F(v) ≠ m_d. We can thus represent F as a set of ordered pairs F_V ⊆ V × (M − {m_d}) such that for all (v_i, m_i) ∈ F_V, every v_i is associated with a single label m_i, such that F(v_i) = m_i.

Additionally, when interpreted as either signed or unsigned integers, sequentially consecutive values in Z often map to the same destination label. Thus the switch statement F can also be efficiently represented by the set of ordered triples F_R ⊆ V × V × (M − {m_d}) such that every (l_i, h_i, m_i) ∈ F_R represents a contiguous case range, where l_i ≤ h_i and for all v_i such that l_i ≤ v_i ≤ h_i, we have F(v_i) = m_i and the case ranges are maximal, i.e. F(l_i − 1) ≠ m_i and F(h_i + 1) ≠ m_i.

We shall refer to the number of case values, |F_V|, and the number of case ranges, |F_R|.

3 Switch Lowering

This section lists several possible compilation strategies for implementing a multiway branch. The first few are well known and commonly employed by existing compilers, whilst the later strategies are perhaps more novel. For each implementation strategy, the high-level C (possibly using GCC extensions) and the equivalent x86 assembler are given. We use the conventions that the index variable is x, the default label is called L0, and the case target labels are L1 ... Ln. In this work, we assume the target labels are opaque, and have no useful mathematical relationships (though some early compilers could rearrange code to simplify the task of switch implementation).

3.1 Unconditional Branch

The simplest of switch statement implementations and one of the fundamental building blocks is the unconditional jump. This is used for switch statements that contain only a default outcome, or during switch lowering after all other cases have been handled.

    goto L0;

    jmp L0

Unconditional jump instructions typically have low overhead on modern processors, where branch prediction is unnecessary and the instruction prefetch machinery can avoid stalling the execution pipeline. Additionally, block reordering can frequently eliminate the jump altogether, by placing the destination basic block at the site of the jump (if it has only a single incoming edge) or by duplicating the basic block otherwise (if it has more than one incoming edge) [Mueller02].

3.2 Sequential Tests

The simplest universal implementation strategy for switch statements is to check for equality against each case value sequentially.

    if (x == 1)
      goto L1;
    if (x == 0)
      goto L2;

    cmpl $1, %eax
    je L1
    testl %eax, %eax
    je L2

On many architectures, the performance of integer comparisons may be dependent on the value being compared. As shown in the assembly example above, the x86 provides special instructions for comparison against zero. On MIPS, comparisons against zero and one are cheaper than comparisons against signed 16-bit constants, which are cheaper than general 32-bit constants.

3.3 Jump Tables

A common method for implementing a switch statement is via an indirect jump (also known as a table jump). This uses the switch variable to index an array of destination addresses, and then branches to that location.

In the general case, to keep the size of the jump table reasonable, the minimum case value needs to be subtracted, and the result compared against the table range (as an array bounds check) before indexing the jump table. Fortunately, the minimum case value is often one or two, allowing the subtraction to be omitted for a minor increase in the size of the jump table [Sayle01].
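The padded variant mentioned in the last sentence can be sketched in portable C. This is only an illustration under stated assumptions: function pointers stand in for the opaque label addresses, the case values are assumed to be 1, 2 and 3, and all names (target_fn, dispatch_padded, and the handlers) are hypothetical.

```c
/* Each handler stands in for one destination label. */
typedef int (*target_fn)(void);

static int target_default(void) { return 0; }  /* plays the role of L0 */
static int target_one(void)     { return 1; }  /* L1 */
static int target_two(void)     { return 2; }  /* L2 */
static int target_three(void)   { return 3; }  /* L3 */

/* Case values 1..3 dispatched without subtracting the minimum value.
   Slot 0 is padding that points at the default target: it costs one
   extra table entry but allows indexing with x directly. */
int dispatch_padded(unsigned int x)
{
    static const target_fn table[4] = {
        target_default, target_one, target_two, target_three
    };
    if (x > 3)                 /* single unsigned bounds check */
        return target_default();
    return table[x]();         /* indirect call in place of a table jump */
}
```

The general form, with the subtraction retained, is the one shown in the paper's own example that follows.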

    unsigned int t = x - 10;
    if (t > 5)
      goto L0;
    static const void *T1[6] =
      { &&L1, &&L2, &&L3,
        &&L4, &&L5, &&L6 };
    goto *T1[t];

    subl $10, %eax
    cmpl $5, %eax
    ja L0
    jmp *T1(,%eax,4)
    T1: .long L1, L2, L3
        .long L4, L5, L6

An obvious free parameter and difference between compilers is the density threshold used to decide whether to use a table jump or not. When the case values are sequentially consecutive values that each jump to distinct labels, the decision to use a table jump is obvious. However, as the density (approximately the number of case values divided by the difference between their upper and lower bounds) decreases, the memory to performance trade-off eventually reaches a critical limit where the additional cost in bytes provides insufficient benefit in clock cycles.

A less obvious difference is the lack of uniformity in the way that switch table density is calculated. One obvious approach is to use the number of case values, |F_V|, as the numerator. However, as mentioned in the next section, ranges of consecutive case values that branch to the same label can be efficiently implemented as range tests, whose cost is independent of the number of case values they cover. Hence, density may correlate better with the number of case ranges than with the number of case values. The approach currently used by GCC is to count non-singleton case ranges as twice as expensive as singleton case ranges. This provides a better measure of density for comparison against other implementation strategies. In practice, the real non-singleton to singleton cost ratio is less than two, even approaching one for switches consisting of consecutive adjoining ranges.

3.4 Range Tests

When two or more consecutive integer values branch to the same target label, it is possible to test for a range of values rather than each value independently. One obvious possibility is via a pair of conditional branches such as (x>=lo) && (x<=hi) where lo and hi are the lower and upper bounds of the range respectively. An often more efficient way of doing this is to subtract the lower bound, and then perform an unsigned comparison against the integer constant hi-lo, which achieves the same thing with a single comparison. Obviously, if lo is zero the subtraction can be omitted. Warren [Warren02] also mentions the efficient use of shifts to perform range tests for suitable high and low bounds.

    if (x < 8)
      goto L1;

    unsigned int t = x >> 3;
    if (t == 0)
      goto L1;

3.5 Balanced Binary Trees

A more efficient use of compare and branch comparisons than the sequential testing described above is to perform a binary search on the target case values. By comparing against a (median) pivot value, the case values can be partitioned into those values less than or greater than the pivot. This reduces the time to perform a multiway branch from O(n) with the sequential method to O(log n), where n is the number of case values.

Each comparison in such a binary search refines the upper or lower bounds of the remaining partitions. Whilst many compilers perform only signed or only unsigned comparisons depending upon the type of the switch expression, it is potentially beneficial to use both forms in a single search. Another more commonly implemented improvement is to use the established upper and lower bounds to avoid comparisons. For example, if a binary search has already confirmed that the index value is greater than three, but less than five, there is no point in explicitly comparing against the value four. A naïve implementation of binary tree lowering using these bounds may inadvertently emit conditional branches to unconditional branches and similar inefficient idioms. These may be avoided either by using the usual CFG jump optimizations as a clean-up step, or by improvements in the binary tree lowering code.

Depending upon the architecture, it may be beneficial (or not) to perform three-way branches at each partition step, reusing the condition codes (or register) from a single comparison against an integer constant for both an equality test and an ordering test. Currently, GCC always performs three-way branches, whilst LLVM always performs two-way (binary) branching. Even with three-way branching there is the decision of whether to perform the equality/inequality comparison before or after the ordering comparison. Unless the pivot value occurs frequently, it is theoretically better to perform the partitioning first and incur the cost of the equality comparison on only one of the following paths.

On many processors there is a significant asymmetry in the cycle counts for taken vs. not-taken branches (occasionally even when the outcome is correctly predicted). This difference means that a perfectly balanced binary tree will not have the best worst-case behavior. Instead, appropriately skewed binary trees can lead to better performance. This is one reason why the sequential testing strategy described above is sometimes competitive for switches with four or more values.

    switch (x) {
    case 100: goto L1;
    case 200: goto L2;
    case 300: goto L3;
    case 400: goto L4;
    }

    if (x == 200)
      goto L2;
    if (x > 200) {
      if (x == 300)
        goto L3;
      if (x == 400)
        goto L4;
    }
    else {
      if (x == 100)
        goto L1;
    }

    cmpl $200, %eax
    je L2
    jbe LT1
    cmpl $300, %eax
    je L3
    cmpl $400, %eax
    je L4
    jmp L0
    LT1: cmpl $100, %eax
    je L1

An often overlooked aspect of performing binary comparisons, both in sequential comparisons and in binary search trees, is that it can sometimes be advantageous to compare against integer values (and ranges) not in the case set V, i.e. those that map to the default label m_d. For example, when ranges on both adjoining sides of a 'gap' branch to the same destination, testing for the gap allows the adjoining ranges to be merged. Likewise for binary switch tables, it can occasionally be easier to identify the (complementary) default values than the (positive) case values.

3.6 Bit Tests

A relatively recent innovation in switch statement implementation is the use of bit tests [Sayle03a]. This technique allows several case values that share the same label to be tested simultaneously by representing each case by its own bit.

This technique requires that the range of case values not be larger than the number of bits in a machine word. Like jump tables, it is often necessary to perform a subtraction followed by a bounds check to eliminate values outside of the allowed range.

    if (x > 8)
      goto L0;
    unsigned int t = 1 << x;
    // cases 1, 4, and 8
    if ((t & 274) != 0)
      goto L1;
    // cases 0, 2, and 5
    if ((t & 37) != 0)
      goto L2;
    goto L0;

    cmpl $8, %eax
    ja L0
    movl %eax, %ecx
    movl $1, %eax
    sall %cl, %eax
    testl $274, %eax
    jne L1
    testb $37, %al
    jne L2

The performance of bit testing can be improved by testing the most frequent bit patterns (masks) first. If all cases are equally probable, this equates to testing the masks that have more bits set before those that have a few bits (or just a single one) set.
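The constants 274 and 37 in the example above are simply the masks with one bit set per case value. The following sketch shows how such masks can be derived and used; it returns a small integer in place of jumping to a label, and the names build_mask and bit_test_dispatch are hypothetical helpers introduced here, not part of any compiler.

```c
#include <stddef.h>

/* Build the mask that has bit v set for every case value v in the list.
   Assumes every value is smaller than the width of unsigned int. */
unsigned int build_mask(const unsigned int *cases, size_t count)
{
    unsigned int mask = 0;
    for (size_t i = 0; i < count; i++)
        mask |= 1u << cases[i];
    return mask;
}

/* Returns 1 for cases {1, 4, 8}, 2 for cases {0, 2, 5}, and 0 for
   everything else (the default outcome). */
int bit_test_dispatch(unsigned int x)
{
    if (x > 8)                  /* bounds check: only bits 0..8 are used */
        return 0;
    unsigned int t = 1u << x;
    if ((t & 274u) != 0)        /* 274 = 2^1 + 2^4 + 2^8 */
        return 1;
    if ((t & 37u) != 0)         /* 37 = 2^0 + 2^2 + 2^5 */
        return 2;
    return 0;
}
```

Because each group of cases collapses to one test, the cost depends on the number of distinct labels rather than on the number of case values.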

3.7 Tabular Methods

For completeness, we briefly describe the applicability of table-driven methods for implementing switch statements. In these techniques, the task of implementing a multiway branch is reduced to a static search problem. A good review of tabular methods, including linear search, binary search, and multiplicative binary search, is given by Spuler [Spuler94]. The major utility of tabular approaches is in reducing code size, especially when the table search implementation can be shared between multiple switches and placed in the compiler's run-time library. In the Java virtual machine, the lookupswitch bytecode provides a tabular search implementation.

    static const int T1[4] =
      { 4, 6, 9, 11 };
    static const void *T2[4] =
      { &&L1, &&L2, &&L3, &&L4 };
    for (int i = 0; i < 4; i++)
      if (x == T1[i])
        goto *T2[i];

3.8 Decrement Chains

An interesting implementation technique implemented by some compilers, but not previously described in the literature, is the use of decrement (or subtraction) chains. On some architectures, a decrement instruction followed by a test for zero is more convenient than a comparison against an arbitrary constant. On i386 this leads to dense code, and on some RISC architectures this provides a useful instruction for the branch delay slot.

    x--;
    if (x == 0)
      goto L1;
    x--;
    if (x == 0)
      goto L2;
    x--;
    if (x == 0)
      goto L3;

    decl %eax
    je L1
    decl %eax
    je L2
    decl %eax
    je L3

On MIPS, comparisons against one and zero are cheaper than other values that require loading of integer constants into registers. Hence on this target, it is convenient for dense sequential cases to place the constant two into a register, then repeatedly compare against zero and one between subtractions of two.

This strategy may potentially benefit from the use of decrement and branch instructions on those architectures that support them.

    movl %eax, %ecx
    loop LT1
    jmp L1
    LT1: loop LT2
    jmp L2
    LT2: loop L0
    jmp L3

3.9 Index Mapping

Index mapping is a table-based technique similar to jump tables but with a wider range of applicability. Instead of indexing directly into a sparse table of labels (typically each four bytes long), this method uses a narrower lookup table of bytes to first transform the index variable into a dense zero-based range.

Switch statements with up to 256 unique labels can be handled by a byte-wide look-up table. This technique can also be extended to switch statements with even more unique labels by using two-byte shorts instead.

    unsigned int t = x - 10;
    if (t > 7)
      goto L0;
    static const char T1[8] =
      { 0, 1, 1, 2, 0, 2, 2, 1 };
    t = (unsigned char)T1[t];
    if (t == 0)
      goto L1;
    if (t == 1)
      goto L2;

    movzbl T1(%eax), %eax

Often index mapping is immediately followed by a regular table jump. This idiom, called double dispatch, has the advantage that the range of values in the first table is known and therefore the usual bounds test on the following table jump can be omitted. On the SPARC architecture, which usually needs to multiply the table jump index by four to access the correct destination word, this multiplication can be precomputed and stored in the byte map (provided that the switch statement has fewer than 64 unique destinations).

Mathematically, index mapping provides a minimal perfect hash function when each target label is unique and there are no ranges.

On many CISC processors, such as the x86 family, binary index mapping (i.e. when there is only a single non-default label and the byte array contains only ones and zeros) can take advantage of instructions that compare directly against memory. On the i386, for example, the instruction cmpb $0, T1(%eax) avoids the usual load followed by a compare or test.

3.10 Condition Codes

Depending upon the architecture, and upon the frequency of case values, it may be beneficial to manipulate explicit Boolean expressions rather than the more common 'short-circuit evaluation' conditional branches.

    char t1 = (x == 6);
    char t2 = (x == 11);
    if (t1 | t2)
      goto L1;

    cmpl $6, %eax
    sete %dl
    cmpl $11, %eax
    sete %al
    orb %al, %dl
    jne L1

In the example above, the i386 architecture manipulates the condition codes as bytes, but on many RISC architectures with multiple condition code or predicate registers, they can be manipulated directly. As an example, the PowerPC architecture provides a suitable cror instruction.

3.11 Imperfect Hashing

Hashing functions provide a method of handling sparse switch statements. By quickly calculating an auxiliary value and branching or multiway branching on its value, we can 'divide and conquer' the original switch. A review of suitable (i.e., cheap) hash functions is given by Dietz [Dietz92]. More recently, Erlingsson et al. have described a practical method for quickly selecting a suitable bitwise-and hash function called "Multiway Radix Search Trees (MRST)" [Erlingsson96].

Modern processor architectures contain a number of instructions that perform useful hashing functions. Examples include popcount, ffs (find first set), clz (count leading zeros), and ctz (count trailing zeros).

One simple use of imperfect hashing is handling switch statements where the index expression has a multi-word type (such as long long). In such cases, the implementation could be considered a switch of the high-word, dispatching to switches on the low-word. This is especially advantageous on narrow processors such as AVR.

The MRST scheme that extracts consecutive bits from the index value for the hash (i.e. hash functions of the form t = (x >> C1) & C2) is particularly well suited to architectures that have bit-field extraction instructions, such as the ARM and the PowerPC. For the special case where the field is a single bit, i.e. C2 is one, the comparison can be performed using a bit-wise 'and' instruction, or a special-purpose bit-test instruction where available.

3.12 Perfect Hashing

In the field of switch statement code generation, perfect hashing describes any transformation of the original integer index expression that is a bijection. A bijective integer operation has the property that every output value is generated by a unique input value, enabling the existence of an inverse. Such an operation just permutes the integer values, but as mentioned above, these permuted case values may allow a more efficient implementation than the original (isomorphic) case values.

Examples of suitable bijective operations (permutations) on K-bit words include logical inversion (~x), integer negation (-x), addition and subtraction of an integer constant (modulo 2^K), bitwise-exclusive-or (xor) of an integer constant, byte-swap, bitwise rotation, and even multiplication by any odd number (proof left to the reader).

    switch (x) {
    case 0:
    case 4192257:
    case 8384514:
    case 12576771:
    case 16769028:
    case 20961285:
      goto L1;
    }

    unsigned int t = x * 2049;
    if (t < 6)
      goto L1;

    movl %eax, %edx
    sall $11, %eax
    addl %edx, %eax
    cmpl $5, %eax
    jbe L1

A far more frequent application of perfect hashing is the use of rotation (instead of subtraction) to reduce the range of target case values.

    switch (x) {
    case 16: goto L1;
    case 32: goto L2;
    case 48: goto L3;
    case 64: goto L4;
    }

    unsigned int t = (x >> 4) | (x << 28);
    switch (t) {
    case 1: goto L1;
    case 2: goto L2;
    case 3: goto L3;
    case 4: goto L4;
    }

    rorl $4, %eax

Instruction set support for multimedia functions can be a rich source of suitable bijective instructions. For example, the IA-64's mux1.rev, mux1.mix, mux1.shuf, and mux1.alt can all potentially be used to improve the performance of a switch statement implementation.

It should be clear that the modulo-2^K subtractions, used in range tests, as bounds checks for table jumps and bit tests, and functionally in decrement chains, are just a special case of the use of subtraction as a perfect hash function.

3.13 Safe Hashing

Somewhere between perfect (bijective) hashing and imperfect hashing is safe hashing. These are many-to-one index transformations with the fortunate property that all index values that map to the same output value have the same outcome (target label).

Examples of suitable safe hashing functions are division, bitwise logical and arithmetic shifts, bitwise-and, and bitwise-or.

    switch (x) {
    case 0:
    case 1: goto L1;
    case 2:
    case 3: goto L2;
    case 4:
    case 5: goto L3;
    }

    unsigned int t = x >> 1;
    switch (t) {
    case 0: goto L1;
    case 1: goto L2;
    case 2: goto L3;
    }

    shrl %eax

Another example of the potential benefit of safe hashing is the classic Has30Days function.

    switch (x) {
    case 4:  // April
    case 6:  // June
    case 9:  // September
    case 11: // November
      return true;
    }

    unsigned int t = x | 2;
    switch (t) {
    case 6:
    case 11:
      return true;
    }
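That the bitwise-or in the Has30Days example is indeed safe can be checked by comparing the hashed form against the original switch. The sketch below does so; the two function names are hypothetical and chosen here only for the comparison.

```c
#include <stdbool.h>

/* The original four-case switch. */
bool has_30_days_reference(unsigned int month)
{
    switch (month) {
    case 4:   /* April */
    case 6:   /* June */
    case 9:   /* September */
    case 11:  /* November */
        return true;
    default:
        return false;
    }
}

/* The safe-hashed form: x | 2 merges 4 with 6 and 9 with 11, and no
   other input is mapped onto 6 or 11, so two tests replace four. */
bool has_30_days_hashed(unsigned int month)
{
    unsigned int t = month | 2u;
    return t == 6u || t == 11u;
}
```

The only inputs with x | 2 equal to 6 are 4 and 6, and the only inputs with x | 2 equal to 11 are 9 and 11, so the two functions agree for every index value and not just for the months 1 to 12.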

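The multiplicative perfect hash of Section 3.12 works because 2049 and 4192257 are inverses of one another modulo 2^32, so multiplying the case values by 2049 recovers 0 through 5. The sketch below checks this and shows one way such an inverse could be found; the Newton-style iteration is a standard technique supplied here as an assumption, not a method described in the paper, and both function names are hypothetical.

```c
#include <stdint.h>

/* Multiplicative inverse of an odd number modulo 2^32.
   For odd a, a * a == 1 (mod 8), so a is its own inverse to 3 bits;
   each iteration doubles the number of correct low-order bits. */
uint32_t inverse_mod_2_32(uint32_t a)
{
    uint32_t inv = a;
    for (int i = 0; i < 5; i++)     /* 3 -> 6 -> 12 -> 24 -> 48 bits */
        inv *= 2u - a * inv;
    return inv;
}

/* The perfect hash from the example: multiplication by an odd constant
   permutes the 32-bit values. */
uint32_t perfect_hash(uint32_t x)
{
    return x * 2049u;
}
```

Since every odd multiplier has such an inverse, multiplication by an odd constant can always be undone, which is what makes it a bijection on K-bit words.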
3.14 Conditional Moves

Modern processors can often avoid the penalty of conditional branching by the use of conditional move instructions.

    void *t = &&L0;
    if (x == 1)
      t = &&L1;
    if (x == 2)
      t = &&L2;
    if (x == 3)
      t = &&L3;
    goto *t;

    movl %eax, %ecx
    movl $L0, %edx
    movl $L1, %eax
    cmpl $1, %ecx
    cmove %eax, %edx
    movl $L2, %eax
    cmpl $2, %ecx
    cmove %eax, %edx
    movl $L3, %eax
    cmpl $3, %ecx
    cmove %eax, %edx
    jmp *%edx

4 Superoptimization

The term superoptimization was first coined by Henry Massalin in [Massalin87] and describes the task of finding the optimal code sequence for a single, loop-free sequence of instructions. Whilst most compiler optimizations attempt to improve code by applying heuristic transformations, superoptimization performs an exhaustive search in the space of valid instruction sequences. An outstanding example of superoptimization is the GNU superoptimizer [Granlund92]. A less rigorous but practical application of superoptimization is the task of generating instruction sequences for multiplication by integer constants, known in GCC as the routine synth_mult [Bernstein86, Lefevre01].

Unlike most traditional superoptimizers, the superoptimizer implementation described in this work needs to handle explicit control flow instructions. In addition to the strict linear instruction sequences enumerated in previous efforts, a representation of conditional branches introduces instructions with two successor instructions; one for a taken branch and one for a not-taken branch. In this work, we restrict the representation of control flow graphs to binary trees. Looping constructs (such as those in inlined tabular implementation methods) are represented by a single "macro instruction" hiding the internal backward edges from the superoptimizer. Although there may be some benefit in extending the current enumeration strategy to general directed acyclic graphs (DAGs), the restriction to binary trees helps limit the combinatorial explosion associated with exhaustive search.

The multiway branch superoptimizer works by a divide-and-conquer approach, repeatedly simplifying or transforming the remaining switch cases that need to be implemented. At each node in the search tree, the problem state consists of the set of case nodes that remain to be considered, and the set of potential values the index expression may have. At the start of the recursive search, the root of the tree, the set of case values are those of the original switch, and the set of potential values are all 2^K values representable in the index type. The terminal nodes at the leaves of the tree are unconditional jump nodes or table jump nodes. In theory, the value range propagation pass of a compiler may be able to refine the initial set of potential index values.

Some example solution trees enumerated by the superoptimizer are given in Figure 1. These depict some of the possible balanced binary tree implementations for a switch statement with four unique case values.

5 Cost Model

An important aspect of superoptimization is the objective function that is used to assess the merit of each solution. When optimizing for code size, a size in bytes can be associated with each node in a solution tree, and the total code size is simply the sum of the node sizes in the tree. When optimizing for performance, a cycle count can be associated with each edge in the tree. For conditional branches, different timings can be associated with the taken and the not-taken edges. The worst-case performance is calculated by determining the maximum total edge cost along any path from the root to a leaf node. By associating a frequency with each original case value, we can determine the average or expected cost of a multiway branch. The expected average cost of a switch statement is the sum over all leaves of the probability of the leaf multiplied by the cost of the leaf. In practice, the use of floating point probabilities can be avoided by using integer

[Figure 1 (diagram): four binary tree implementations of a switch over the case values 1 to 4.]

Figure 1: Four alternate balanced binary tree solutions for implementing a unique switch statement with case values one, two, three, and four. Digits represent comparison instructions; relational operators represent conditional branch instructions with the taken edge below to the left and the not-taken edge below to the right.

frequencies. These frequencies can be taken from pro- 6 Microbenchmarking filing information or assumed to be equal when profile information is not available. In order to both parameterize the cost model described previously and confirm the predictions made by the su- peroptimizer, it is possible to measure the performance In practice, rather than purely minimize average perfor- of possible code sequences on real hardware. mance, worst-case performance, or code size, it is of- ten more useful to use Pareto frontier methods to per- The use of microbenchmarking is even applicable to tar- form multiobjective optimization. If two implementa- gets where real hardware isn’t available. The methodol- tions have the same performance, then the shorter so- ogy can be used with virtual machines, such as a JVM, lution is preferable. Often when optimizing for size, or with simulators, such as GNU sim or SIMH. One we’re not interested in the absolute shortest code if the application of this technique was to evaluate the bene- casel performance is terrible. Likewise, when optimizing for fits of the VAX’s instruction by running VAX performance, solutions that use an unrealistic amount of Net/BSD on SIMH. memory may be inappropriate. The restriction of jump The advanced branch prediction and memory caching tables based on density can be seen as an instance of strategies performed by modern processors can often this. Likewise, if the training data used to generate pro- significantly skew and bias the results of microbench- file edge data doesn’t cover all of the cases, the per- marks, where memory (for jump tables) is often held in formance of cases with a zero frequency are not used cache, and where hot branches can often by predicted in estimating the average case performance. Whenever from history tables. 
To avoid (or more accurately to ac- two implementations have the same average case behav- count for) these effects, explicit cache flushing can be ior, we can avoid potential issues with pathological input used [Whaley08]. values by preferring the one with the better worst case performance. 7 Benchmark Corpus

An interesting additional cost metric not considered in To evaluate whether the advanced switch statement im- the present study is the optimization of code to reduce plementations described above are applicable on real- power consumption [Tiwari94]. Much like microbench- world code, a survey of existing switch statement usage marking, the cost of each node could be parameterized was undertaken. By using a specially instrumented ver- by chip-level simulation. sion of GCC, the switch statements were extracted from 112 • A Superoptimizer Analysis of Multiway Branch Code Generation the source code of 5,953 packages of the Debian Linux Range Count Fraction distribution. 1 51804 9.91% 2 62757 12.01% These 5,953 Debian packages contained, or more ac- 3 70489 13.49% curately compiled, 522,681 switch statements. These 4 49311 9.43% switch statements contained 3,024,896 case ranges, giv- 5–8 80043 15.31% ing a mean of 5.79 case ranges per switch statement. A 9–16 66019 12.63% case range is defined a consecutive sequence of case val- 17–32 47947 9.17% ues that map to the same non-default target label. The 33–64 31998 6.12% largest switch statement contained 4,541 case ranges. > 64 62313 11.92% Table 1 summarizes the distribution of switch sizes, as measured by the number of case ranges. Table 2: Distribution of Switch Ranges

  # Cases   Count    Fraction
  1         61193    11.71%
  2        115980    22.19%
  3        103623    19.83%
  4         65671    12.56%
  5–8      102898    19.69%
  9–16      48575     9.29%
  17–32     17462     3.34%
  > 32       7279     1.39%

Table 1: Distribution of Switch Sizes

As shown by these figures, the vast majority of switch statements contain relatively few case ranges. Over a third have only one or two cases, and less than two percent have more than 32 cases. However, the distribution has a long tail with 19 switch statements having more than a thousand case ranges.

Of the 522,681 switch statements, 424,600 (81.24%) are unique, where every case range branches to a distinct destination label. These ‘unique’ switch statements don’t benefit from bit tests or safe hashing methods. There were 78,925 (15.10%) binary switch statements that had a single destination label other than the default. There were 94,523 (18.08%) switch statements that had non-singleton ranges, i.e. consecutive case values that branched to the same non-default destination.

Table 2 is a summary of the ranges or spans of the observed switch statements. These figures show that over eighty percent of all switch statements span a range of at most 32 values. This reveals that techniques such as index mapping and bit testing should frequently be applicable.

Table 3 provides an overview of switch statement density. Rather than the more usual (floating point) measure of density, calculated as n/r where r is the range of values and n the number of values, we instead report its integral reciprocal, sparsity, which is defined as ⌈r/n⌉. Using this measure, all switch statements with a density greater than one third (33.3%) have a sparsity less than or equal to 3.

  Sparsity  Count    Fraction
  1        286427    54.80%
  2         85380    16.34%
  3         24429     4.67%
  4         16701     3.20%
  5         16081     3.08%
  6         10374     1.98%
  7          7509     1.44%
  8          5301     1.01%
  9          5459     1.04%
  10         3271     0.63%
  > 10      61749    11.81%

Table 3: Distribution of Switch Densities

These figures show that extending the range of applicability of jump tables by decreasing the density threshold has rapidly diminishing returns. GCC currently uses a density threshold of 10% when optimizing for speed, and 33% when optimizing for size. LLVM currently uses a more conservative density threshold of 40%.

8 Results

An example x86 solution found by the described superoptimizer is given below.

  switch (x) {
  case 20:
  case 21:
  case 23:
  case 24: goto L1;
  }

  cmpl $22, %eax
  je   L0
  subl $20, %eax
  cmpl $4, %eax
  ja   L0
  jmp  L1

A more interesting example is taken from compiling the Linux kernel on IA-64. A frequent idiom is to switch on powers of two.

  switch (x) {
  case 1: goto L1;
  case 2: goto L2;
  case 4: goto L3;
  case 8: goto L4;
  ...
  case 16384: goto L15;
  case 32768: goto L16;
  }

The fastest solution found uses count-trailing-zeros as an imperfect hash function, which in turn makes use of the target’s popcount instruction.

  int t = __builtin_ctz(x);
  static const void *T1[32] = {
    &&LT1, &&LT2, &&LT3, ...
    &&L0, &&L0, &&L0, &&L0 };
  goto *T1[t];
  LT1:
    if (x == 1)
      goto L1;
    goto L0;
  LT2:
    if (x == 2)
      goto L2;
    goto L0;
  ...

The switch statement lowering code in the current version of LLVM differs from that used by the current version of GCC in several minor respects. Instead of making the decision whether to use a table jump or sequence of bit tests once for the whole switch statement, LLVM uses a divide-and-conquer approach, allowing a sequence of binary comparisons to lead to leaf nodes that perform table jumps or bit tests to implement the remaining comparisons. This allows more than one jump table to be used to implement a single switch statement. The threshold for switching from comparisons to jump tables is slightly lower for LLVM, at a maximum of three comparisons vs. four for GCC. LLVM also has a stricter density requirement for jump tables than GCC.

  Method              Count    Fraction
  binary trees       202633    38.77%
    (without ranges) 163548    31.29%
    (with ranges)     39085     7.48%
  table jumps        151893    29.06%
    (without delta)   99932    19.12%
    (with delta)      51961     9.94%
  sequential tests   153211    29.31%
  bit tests           14944     2.86%
    (without delta)   11575     2.21%
    (with delta)       3369     0.64%

Table 4: GCC Implementation Statistics

Of the 522,681 switch statements in the survey corpus, 5,135 have two singleton case values that branch to the same label, and of these, 1,564 (0.30%) have case values that differ by a single bit and are therefore suitable for ior safe hashing.

Of the GCC table jumps, 1,056 (0.20%) are binary and have a single non-default label and could therefore be more efficiently implemented with a Boolean index map.

When bit testing, if the number of bits set in the immediate constant is more than half of the available bits, on some architectures it may be cheaper to test the complement bit pattern and reverse the sense of the conditional branch. Analysis of the Debian data set reveals that this occurs extremely infrequently (62 times for binary bit tests, and 516 times for multiway bit tests).

Of the switch statements implemented by GCC using binary trees, 29,155 (5.58% of the total) have more than four case ranges, but failed to be implemented by a table jump for density reasons. We shall refer to this set as the difficult subset. The largest difficult switch statement, found in the conversions.c source file of Debian’s msort-8.42 package, had 1,080 case ranges, covering 1,202 case values branching to 320 unique destination labels.
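Several of the comparison-saving idioms counted in this section can be written directly in C. The following is a minimal sketch, not the superoptimizer's output: all three function names are hypothetical, and the OR-based test in matches_pair is only an assumed reading of "ior safe hashing" for two case values that differ in a single bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the x86 sequence for cases 20, 21, 23 and 24:
   reject 22 first, then use a single unsigned comparison on x - 20.
   Values below 20 wrap around to large unsigned numbers and fail. */
static bool is_20_21_23_24(uint32_t x)
{
    if (x == 22u)
        return false;
    return x - 20u <= 4u;
}

/* Two case values a and b that differ in exactly one bit can be
   tested with one comparison by forcing that bit on (inclusive OR)
   in both the index and the constant. */
static bool matches_pair(uint32_t x, uint32_t a, uint32_t b)
{
    uint32_t diff = a ^ b;

    /* Precondition: exactly one bit differs between a and b. */
    assert(diff != 0u && (diff & (diff - 1u)) == 0u);
    return (x | diff) == (a | diff);
}

/* Bit test: bit i of mask is set when case value i belongs to the
   set.  Only indices 0..31 can be encoded in a 32-bit mask. */
static bool bit_test(uint32_t x, uint32_t mask)
{
    return x < 32u && ((mask >> x) & 1u) != 0u;
}
```

Each function replaces a chain of equality comparisons with a small, fixed number of arithmetic operations and comparisons. For example, matches_pair(x, 4, 6) accepts exactly the values 4 and 6.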

Of this difficult subset, 1,806 (0.35%) were “powers of two” where each case value had only a single bit set. There were also 3,083 (0.59%) switch statements where every case value had trailing zero bits, and in 2,066 (0.40%) of these, rotating by the number of common trailing zero bits reduced the range (and therefore increased the density) enough to be implemented by a table jump.

Index mapping turns out to be a particularly effective implementation strategy for dealing with these 29,155 difficult cases. Using GCC’s current 10% density criterion, which permits using up to 40 bytes of memory per case value (on a 32-bit machine), index mapping can extend the applicability of table jumps to another 15,346 (2.94%) switch statements (or over half of the difficult cases). Additionally, of the 151,893 switch statements currently implemented by table jumps, 77,958 (14.92%) would have reduced memory consumption using “double dispatch” index mapping. Likewise, 19,790 (3.79%) of the current table jumps have fewer than 5 unique destinations, allowing the table jump to be eliminated completely by index mapping followed by sequential comparisons or a binary search tree.

One approach to improving GCC’s implementations of the difficult subset is to use a binary tree of comparisons to split/subset the original switch statement into partitions that are dense enough to be implemented by table jumps. Analysis of the difficult set reveals that 15,131 (2.89%) contain one (or more) subranges that contain enough case ranges and are dense enough to satisfy GCC’s table jump criteria. Table 5 gives the histogram of the number of table-jump clusters in the difficult subset. The maximum number of dense clusters observed in a switch statement is 28, found both in the inventor_2.1.5 package’s SoFieldConvertors.c++ and in the netrik_1.15.3 package’s parse-syntax.c.

  Clusters  Count    Fraction
  0         14024    48.10%
  1         12599    43.21%
  2          2005     6.88%
  3           356     1.22%
  4            67     0.23%
  5–8          85     0.29%
  9–16         10     0.03%
  > 16          9     0.03%

Table 5: Dense Clusters in Difficult Subset

An area of active research in switch statement lowering is how best to select the pivot element to partition switch statements so as to take advantage of these dense clusters. An easy-to-implement approach is to use the existing mechanism of selecting the median case value (or frequency-weighted median) as is traditionally done for constructing binary search trees. This has the unwanted effect of occasionally splitting dense clusters, reducing the number of potential jump tables. Bernstein [Bernstein85] suggested splitting at the largest gap between case values. Korobeynikov [Korobeynikov07] implemented a density-balancing metric used to select a reasonable pivot element in LLVM. A common failing with these two “greedy” approaches is that they inadvertently skew binary search trees even when there are no dense clusters to be found. These improve the performance of the rare (15,131 or 2.89%) binary trees that contain dense clusters at the expense of the far more common (187,502 or 35.87%) binary trees that should attempt to be suitably balanced independently of the integer values of their cases. A better global approach has been proposed by Delorie [Delorie04] that initially identifies the dense clusters and treats these much like case ranges, as leaves (or potentially internal nodes) when building a balanced binary search tree. These remain heuristics, as optimal switch statement lowering may occasionally require intentional splitting of dense clusters, much like testing frequent cases prior to table jumps can improve average case performance.

9 Future Work

The C# language extends the switch statement to allow switching on strings, where each case contains a constant string literal. The implementation of string switches is analogous to the classic computer science problem of keyword recognition. Additional strategies include perfect and imperfect hashing (as used by the GNU utility gperf), Patricia tries, or finite state machines (as used by the UNIX utilities flex and lex).

In this work, the implementation of a switch statement has been equated with that of a multiway branch. However, for many uses of the switch statement in real code, it is possible to avoid branching altogether and replace the switch with one or more table look-ups. For example, the Has30Days example presented earlier can be implemented as the following:

  if ((unsigned)x > 11)
    return 0;
  static const int T[12] =
    { 0, 0, 0, 0, 1, 0,
      1, 0, 0, 1, 0, 1 };
  return T[x];

This mechanism can be extended to more diverse targets through the use of unification.

  switch (x) {
  case 0: foo(2,4); break;
  case 1: foo(8,3); break;
  default: foo(5,9); break;
  }

. . . can be implemented as. . .

  int t1, t2;
  if ((unsigned)x < 2) {
    static const int T1[2] = { 2, 8 };
    static const int T2[2] = { 4, 3 };
    t1 = T1[x];
    t2 = T2[x];
  } else {
    t1 = 5;
    t2 = 9;
  }
  foo(t1,t2);

Another possible area of research is to combine the tabular implementation methods described earlier with dynamic data structures such as splay-trees, self-balancing binary search trees, or move-to-front lists. The advantage of such a hybrid technique is that the performance of the implementation could dynamically adapt to the frequency distribution of the index expression.

Finally, to borrow from Newton’s third law, “For every optimization, there is an equal and opposite pessimization.” For each of the transformations described in this article, it would be nice to recognize the original higher-level switch statement from its optimized (lowered) equivalent. Programmers by necessity often attempt to hand-optimize code to the current generation of computer hardware. The ability for a compiler to automatically undo these (premature) optimizations and then re-lower them using the target’s optimizer could potentially avoid the performance problems frequently associated with legacy code.

10 Acknowledgments

The author would like to thank OpenEye Scientific Software for the freedom to investigate code generation issues as a hobby; IBM, HP, SGI, Apple, SUN, and Intel for providing hardware; and the GCC community for inspiring this research. Particularly, Jan Hubička, Andi Kleen, DJ Delorie, Martin Jambor, and Kazu Hirata have discussed the issue of GCC’s switch statement generation. The author is especially grateful to Martin Michlmayr for actually running the instrumented GCC over the Debian Linux distribution’s source code repository.

References

[Bernstein85] Robert L. Bernstein, “Producing Good Code for the Case Statement,” Software – Practice and Experience, Vol. 15, No. 10, pp. 1021–1024, October 1985.

[Bernstein86] Robert L. Bernstein, “Multiplication by Integer Constants,” Software – Practice and Experience, Vol. 16, No. 7, pp. 641–652, July 1986.

[Delorie04] DJ Delorie, “Case Clustering,” gcc-patches:2004-07/msg01234, July 2004.

[Dietz92] H.G. Dietz, “Coding Multiway Branches Using Customized Hash Functions,” ECE Technical Report, School of Electrical Engineering, Purdue University, 1992.

[Erlingsson96] Ulfar Erlingsson, Mukkai Krishnamoorthy and T.V. Raman, “Efficient Multiway Radix Search Trees,” Information Processing Letters, Vol. 60, No. 3, pp. 115–120, November 1996.

[Granlund92] Torbjörn Granlund and Richard Kenner, “Eliminating Branches using a Superoptimizer and the GNU C Compiler,” Proceedings of the Conference on Programming Language Design and Implementation (PLDI), ACM SIGPLAN Notices, Vol. 27, No. 7, pp. 341–352, July 1992.

[Hennessy82] J.L. Hennessy and N. Mendelsohn, “Compilation of the Pascal Case Statement,” Software – Practice and Experience, Vol. 12, pp. 879–992, September 1982.

[Lefevre01] Vincent Lefèvre, “Multiplication by an Integer Constant,” citeseer:491491.html, 2001.

[Kannan94] Sampath Kannan and Todd A. Proebsting, “Correction to ‘Producing Good Code for the Case Statement’,” Software – Practice and Experience, Short Communication, Vol. 24, No. 2, pp. 233, February 1994.

[Korobeynikov07] Anton Korobeynikov, “Improved Switch Lowering for the LLVM Compiler System,” Proceedings of the 2007 Spring Young Researchers Colloquium on Software Engineering (SYRCoSE’2007), Moscow, Russia, May 2007.

[Massalin87] Henry Massalin, “Superoptimizer: A Look at the Smallest Program,” ACM SIGPLAN Notices, Vol. 22, No. 10, pp. 122–126, October 1987.

[Mueller02] Frank Mueller and David B. Whalley, “Avoiding Unconditional Jumps by Code Replication,” Proceedings of the Conference on Programming Language Design and Implementation (PLDI), ACM SIGPLAN Notices, Vol. 27, No. 7, pp. 322–330, July 1992.

[Sale81] Arthur Sale, “The Implementation of Case Statements in Pascal,” Software – Practice and Experience, Vol. 11, pp. 929–942, 1981.

[Sayle01] Roger A. Sayle, “Optimize Tablejumps for Switch Statements,” gcc-patches:2001-10/msg01234, October 2001.

[Sayle03a] Roger A. Sayle, “Implement Switch Statements with Bit Tests,” gcc-patches:2003-01/msg01733, January 2003.

[Sayle03b] Roger A. Sayle, “Tune Jump Tables for -Os,” gcc-patches:2003-08/msg02054, August 2003.

[Spuler94] David A. Spuler, “Compiler Code Generation for Multiway Branch Statements as a Static Search Problem,” Technical Report, Department of Computer Science, James Cook University, Australia, January 1994.

[Tiwari94] Vivek Tiwari, Sharad Malik and Andrew Wolfe, “Power Analysis of Embedded Software: A First Step Towards Software Power Minimization,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 2, No. 4, pp. 437–445, 1994.

[Warren02] Henry S. Warren, Jr., “Hacker’s Delight,” Addison-Wesley Publishers, 2002.

[Whaley08] R. Clint Whaley and Anthony M. Castaldo, “Achieving Accurate and Context-Sensitive Timing for Code Optimization,” Software – Practice and Experience, Accepted for publication, 2008.

[Wienskoski06] Edmar Wienskoski, “Switch Statement Case Reordering FDO,” GCC Summit 2006, Ottawa, Canada, pp. 235–241, June 2006.