Threaded Code Variations and Optimizations

Threaded Co de Variations and Optimizations M. Anton Ertl TU Wien Abstract header of x code field dovar code orth has been traditionally implemented as in- F body direct threaded co de, where the co de for non- es is the co de- eld address of the word. To primitiv header of foo docol code um b ene t from combining sequences get the maxim code field primitives into sup erinstructions, the co de pro- of code field e should b e a primitivefol- duced for a non-primitiv body @ code wed by a parameter (e.g., lit addr for variables). lo code field pap er takes a lo ok at the steps from a tra- This ;s code ditional threaded-co de implementation to sup erinstructions, and at the size and sp eed e ects of the Figure 1: Traditional indirect threaded co de various steps. The use of sup erinstructions gives sp eedups of up to a factor of 2 on large b enchmarks on pro cessors with branch target bu ers, but re- pap er describ es the various steps from an indirect quires more space for the primitives and the opti- threaded implementation to an implementation us- mization tables, and also a little more space for the ing sup erinstructions (Section 2), and evaluates the threaded co de. e ect of these steps on run-time and co de size (Sec- tion 4); these steps also a ect cache consistency is- sues on the 386 architecture, whichcanhave a large 1 Intro duction in uence on p erformance (Section 3). Traditionally,Forth has b een implemented using an interpreter for indirect threaded co de. However, over time programs have tended to dep end less on 2 Threaded co de variations sp eci c features of this implementation technique, and an increasing number of Forth systems have Wewillusethefollowing co de as a running example: used other implementation techniques, in particu- lar native co de compilation. One of the goals of the Gforth pro ject is to variable x provide comp etetive p erformance, another goal is p ortability to a wide range of machines. To meet : foo x @ ; the p ortability goal, we decided to stay with a threaded-co de engine compiled with GCC [Ert93]; to regain ground lost on the eciency front, wede- cided to combine sequences of primitives into su- 2.1 Traditional indirect threaded p erinstructions. This technique has b een prop osed co de bySchutz [Sch92] and implemented by Wil Baden In indirect threaded co de (Fig. 1) the co de of a colon in this4th [Bad95] and by Marcel Hendrix in a ver- sion of eforth. It is related to the concepts of sup er- de nition consists of a sequence of the co de eld ad- combinators [Hug82] and sup erop erators [Pro95]. dresses (CFAs) of the words contained in the colon de nition. SuchaCFA points to a co de eld that Non-primitives in traditional indirect threaded contains the address of the machine co de that p er- co de cannot be combined into sup erinstructions, forms the function of the word. but it is p ossible to compile them into using primitives that can be combined, and then into using The reason for the indirection through the co de direct threading instead of indirect threading. This eld is to supp ort non-primitives, likethevariable x in our example: The dovar routine can compute Corresp ondence Address: Institut fur Computer- the body address by adding the co de eld size to sprachen, Technische Universitat Wien, Argentinierstrae 8, A-1040 Wien, Austria; [email protected] en.a c.a t the CFA. header of x header of x jmp dovar dovar code jmp dovar dovar code body body header of foo header of foo docol code jmp docol docol code jmp docol lit lit code body @ code @ code ;s code @ ;s ;s code Figure 2: Direct threaded co de, traditional variant Figure 3: Primitive-centric direct threaded co de 2.2 Traditional-style direct threaded co de class co de colon de nition call body For primitives the indirection is only needed be- constant lit value cause we do not knowinadvance whether the next value lit body @ word is a primitive or not. So, in order to avoid variable lit body the overhead of the indirection for primitives, some user variable useraddr offset Forth implementors have implemented a variantof defered lit body @ execute this scheme using direct threaded co de (see Fig. 2). eld lit offset + For primitives, the threaded co de for a word p oints do es-de ned lit body call does-code directly to the machine co de, and the execution to- ;co de-de ned lit cfa execute ken p oints there, to o. For non-primitives, the threaded co de cannot Figure 4: Compiling non-primitives for the point directly to the machine-co de routine (the primitive-centric scheme (with the least number of doer ), b ecause the do er would not know how to additional primitives) nd the body of the word. So the threaded co de points to a (variant of the) co de eld, and that eld contains a machine-co de jump to the do er. This change replaces a load in every word by a sumes a one-cell execution token (represented bya jump in every non-primitive. Because only 20%{ CFA), not a primitive with an argument. 25% of the dynamically executed words are non- Figure 4 shows how the various classes of non- primitives, direct threading is faster on most pro ces- primitives can b e compiled. Instead of using a se- sors, but not always on some p opular pro cessors quence of several primitives for some classes, new (see Section 3). primitives could be intro duced that p erform the The main disadvantage of direct threading in whole action in one primitive. However, if we com- an implementation like Gforth is that it requires bine frequent sequences of primitives into sup erin- architecture-sp eci c co de for creating the contents structions, this will happ en automatically; explic- of the co de elds (Gforth currently supp orts direct itly intro ducing these primitives may actually de- threading on 7 architectures). grade the ecacy of the sup erinstruction optimiza- tion, b ecause the optimizer would not know without 2.3 Primitive-centric threaded co de additional e ort that, e.g., our value-primitive is the same as the lit @ combination it found elsewhere. An alternative metho d to provide the b o dy address is to simply layitdown into the threaded co de as This scheme takes more space for the co de of non- immediate parameter. Then the threaded co de for primitives; this is the original reason for preferring a non-primitivedoes not p oint to the \co de eld" the traditional style when memory is scarce. There of the nonprimitive, but to a primitive that gets probably is little di erence from traditional-style the inline parameter and p erforms the function of direct threading in run-time p erformance on most the non-primitive. For the variable x in our exam- pro cessors: Only non-primitives are a ected; in the ple this primitiveis lit (the run-time primitivefor traditional-style scheme there is a jump and an ad- literal). dition to compute the b o dy address, whereas in the primitive-centric scheme there is a load with often This scheme is shown in Figure 3. The non- fully-exp osed latency. For some p opular pro cessors primitives still have a co de eld, whichisnotused the p erformance di erence can b e large, though (see in ordinary threaded co de execution. But it is used Section 3). for execute (and do defer), b ecause execute con- header of x : myconst create , does> @ ; code field dovar code 5 myconst five body docol code indirect header of foo call code field code field (does>) lit lit code @ code field ;s @ @ code ;s header of five code field dodoes code code field ;s code 5 Figure 5: Hybrid direct/indirect threaded co de double-indirect call Hybrid direct/indirect threaded 2.4 (does>) co de dodoes code @ e-centric scheme the Forth engine With the primitiv ;s (inner interpreter, primitives, and do ers) is divided to two parts that are relatively indep endent of in header of five h other: eac code field 5 Ordinary threaded co de. Figure 6: A does>-de ned word with indirect and Execute (and do defer), co de elds and execu- double-indirect threaded co de. tion tokens. Another interesting consequence of the division Only execute (and do defer) deals with execution 1 between ordinary threaded co de and execute etc. tokens and co de elds ,sowe can mo dify these com- in primitive-centric co de is that do ers (e.g., do col) p onents in co ordinated ways. The mo di cation we can only be invoked by execute; therefore only are interested in is to use indirect threaded co de execute has to rememb er the CFA for use by the elds; of course this requires an additional indirec- do ers. Unfortunately,theobvious way of expressing tion in the execute co de, it requires adding co de this in GNU C leads to the conservative assump- elds to primitives (see Fig. 5), and the execution tion that this value is alive (needs to b e preserved) tokens are the addresses of the co de elds. How- across all primitives, and this results in sub optimal ever, the ordinary threaded co de p oints directly to register allo cation.

Threaded Code Variations and Optimizations

An Architectural Trail to Threaded-Code Systems

Basic Threads Programming: Standards and Strategy

Effective Inline-Threaded Interpretation of Java Bytecode

Threaded Code Is Not in Fact Implemented in the Computer's Hardware

Csc 453 Interpreters & Interpretation

Static Analysis of Lockless Microcontroller C Programs

A Case Study of Multi-Threading in the Embedded Space

The Flow Programming Language: an Implicitly-Parallelizing, Programmer-Safe Language

Virtual Machine Showdown: Stack Versus Registers

Multiple Cores + SIMD + Threads) (And Their Performance Characteristics

Generating Multithreaded Code from Parallel Haskell for Symmetric Multiprocessors Alejandro Caro

The Vector-Thread Architecture