Threaded Co de Variations and Optimizations
M. Anton Ertl
TU Wien
Abstract header of x
code field dovar code
orth has been traditionally implemented as in-
F body
direct threaded co de, where the co de for non-
es is the co de- eld address of the word. To
primitiv header of foo docol code
um b ene t from combining sequences
get the maxim code field
primitives into sup erinstructions, the co de pro-
of code field
e should b e a primitivefol-
duced for a non-primitiv body @ code
wed by a parameter (e.g., lit addr for variables).
lo code field
pap er takes a lo ok at the steps from a tra-
This ;s code
ditional threaded-co de implementation to sup erin-
structions, and at the size and sp eed e ects of the
Figure 1: Traditional indirect threaded co de
various steps. The use of sup erinstructions gives
sp eedups of up to a factor of 2 on large b enchmarks
on pro cessors with branch target bu ers, but re-
pap er describ es the various steps from an indirect
quires more space for the primitives and the opti-
threaded implementation to an implementation us-
mization tables, and also a little more space for the
ing sup erinstructions (Section 2), and evaluates the
threaded co de.
e ect of these steps on run-time and co de size (Sec-
tion 4); these steps also a ect cache consistency is-
sues on the 386 architecture, whichcanhave a large
1 Intro duction
in uence on p erformance (Section 3).
Traditionally,Forth has b een implemented using an
interpreter for indirect threaded co de. However,
over time programs have tended to dep end less on
2 Threaded co de variations
sp eci c features of this implementation technique,
and an increasing number of Forth systems have
Wewillusethefollowing co de as a running example:
used other implementation techniques, in particu-
lar native co de compilation.
One of the goals of the Gforth pro ject is to
variable x
provide comp etetive p erformance, another goal is
p ortability to a wide range of machines. To meet
: foo x @ ;
the p ortability goal, we decided to stay with a
threaded-co de engine compiled with GCC [Ert93];
to regain ground lost on the eciency front, wede-
cided to combine sequences of primitives into su-
2.1 Traditional indirect threaded
p erinstructions. This technique has b een prop osed
co de
bySchutz [Sch92] and implemented by Wil Baden
In indirect threaded co de (Fig. 1) the co de of a colon in this4th [Bad95] and by Marcel Hendrix in a ver-
sion of eforth. It is related to the concepts of sup er-
de nition consists of a sequence of the co de eld ad-
combinators [Hug82] and sup erop erators [Pro95].
dresses (CFAs) of the words contained in the colon
de nition. SuchaCFA points to a co de eld that
Non-primitives in traditional indirect threaded
contains the address of the machine co de that p er-
co de cannot be combined into sup erinstructions,
forms the function of the word.
but it is p ossible to compile them into using prim-
itives that can be combined, and then into using
The reason for the indirection through the co de
direct threading instead of indirect threading. This
eld is to supp ort non-primitives, likethevariable
x in our example: The dovar routine can compute
Corresp ondence Address: Institut fur Computer-
the body address by adding the co de eld size to
sprachen, Technische Universitat Wien, Argentinierstrae 8,
A-1040 Wien, Austria; [email protected] en.a c.a t the CFA. header of x header of x jmp dovar dovar code jmp dovar dovar code body body
header of foo header of foo docol code jmp docol docol code jmp docol lit lit code body @ code @ code ;s code @
;s ;s code
Figure 2: Direct threaded co de, traditional variant
Figure 3: Primitive-centric direct threaded co de
2.2 Traditional-style direct threaded
co de
class co de
colon de nition call body
For primitives the indirection is only needed be-
constant lit value
cause we do not knowinadvance whether the next
value lit body @
word is a primitive or not. So, in order to avoid
variable lit body
the overhead of the indirection for primitives, some
user variable useraddr offset
Forth implementors have implemented a variantof
defered lit body @ execute
this scheme using direct threaded co de (see Fig. 2).
eld lit offset +
For primitives, the threaded co de for a word p oints
do es-de ned lit body call does-code
directly to the machine co de, and the execution to-
;co de-de ned lit cfa execute
ken p oints there, to o.
For non-primitives, the threaded co de cannot
Figure 4: Compiling non-primitives for the
point directly to the machine-co de routine (the
primitive-centric scheme (with the least number of
doer ), b ecause the do er would not know how to
additional primitives)
nd the body of the word. So the threaded co de
points to a (variant of the) co de eld, and that eld
contains a machine-co de jump to the do er.
This change replaces a load in every word by a
sumes a one-cell execution token (represented bya
jump in every non-primitive. Because only 20%{
CFA), not a primitive with an argument.
25% of the dynamically executed words are non-
Figure 4 shows how the various classes of non-
primitives, direct threading is faster on most pro ces-
primitives can b e compiled. Instead of using a se-
sors, but not always on some p opular pro cessors
quence of several primitives for some classes, new
(see Section 3).
primitives could be intro duced that p erform the
The main disadvantage of direct threading in
whole action in one primitive. However, if we com-
an implementation like Gforth is that it requires
bine frequent sequences of primitives into sup erin-
architecture-sp eci c co de for creating the contents
structions, this will happ en automatically; explic-
of the co de elds (Gforth currently supp orts direct
itly intro ducing these primitives may actually de-
threading on 7 architectures).
grade the ecacy of the sup erinstruction optimiza-
tion, b ecause the optimizer would not know without
2.3 Primitive-centric threaded co de
additional e ort that, e.g., our value-primitive is the
same as the lit @ combination it found elsewhere.
An alternative metho d to provide the b o dy address
is to simply layitdown into the threaded co de as
This scheme takes more space for the co de of non-
immediate parameter. Then the threaded co de for
primitives; this is the original reason for preferring
a non-primitivedoes not p oint to the \co de eld"
the traditional style when memory is scarce. There
of the nonprimitive, but to a primitive that gets
probably is little di erence from traditional-style
the inline parameter and p erforms the function of
direct threading in run-time p erformance on most
the non-primitive. For the variable x in our exam-
pro cessors: Only non-primitives are a ected; in the
ple this primitiveis lit (the run-time primitivefor
traditional-style scheme there is a jump and an ad-
literal).
dition to compute the b o dy address, whereas in the
primitive-centric scheme there is a load with often This scheme is shown in Figure 3. The non-
fully-exp osed latency. For some p opular pro cessors primitives still have a co de eld, whichisnotused
the p erformance di erence can b e large, though (see in ordinary threaded co de execution. But it is used
Section 3). for execute (and do defer), b ecause execute con- header of x : myconst create , does> @ ; code field dovar code 5 myconst five body docol code indirect header of foo call code field code field (does>) lit lit code @ code field ;s @ @ code ;s header of five code field dodoes code code field ;s code
5
Figure 5: Hybrid direct/indirect threaded co de double-indirect
call
Hybrid direct/indirect threaded 2.4 (does>)
co de dodoes code
@
e-centric scheme the Forth engine
With the primitiv ;s
(inner interpreter, primitives, and do ers) is divided
to two parts that are relatively indep endent of
in header of five h other: eac code field
5
Ordinary threaded co de.
Figure 6: A does>-de ned word with indirect and
Execute (and do defer), co de elds and execu-
double-indirect threaded co de.
tion tokens.
Another interesting consequence of the division
Only execute (and do defer) deals with execution
1
between ordinary threaded co de and execute etc.
tokens and co de elds ,sowe can mo dify these com-
in primitive-centric co de is that do ers (e.g., do col)
p onents in co ordinated ways. The mo di cation we
can only be invoked by execute; therefore only
are interested in is to use indirect threaded co de
execute has to rememb er the CFA for use by the
elds; of course this requires an additional indirec-
do ers. Unfortunately,theobvious way of expressing
tion in the execute co de, it requires adding co de
this in GNU C leads to the conservative assump-
elds to primitives (see Fig. 5), and the execution
tion that this value is alive (needs to b e preserved)
tokens are the addresses of the co de elds. How-
across all primitives, and this results in sub optimal
ever, the ordinary threaded co de p oints directly to
register allo cation.
the co de of the primitives, and uses direct threaded
NEXTs.
The main advantage of this hybrid scheme is that
2.5 Double-indirect threaded co de
we do not need to create machine-co de jumps in
thecode eld, which helps the p ortability goals of
For does>-de ned words, the inner interpreter gets
Gforth and reduces the maintenance e ort. An-
the CFA and has to nd the body, the co de ad-
other advantage is that this allows us to separate
dress of do do es, and the do es-co de (the Forth co de
co de and data completely, without incurring the
b ehind the does>). With indirect threaded co de,
run-time cost of indirect threaded co de for most of
Gforth uses a two-cell co de eld that contains the
the co de.
co de address and the do es-co de. Therefore all co de
The overall p erformance impact should b e small,
elds in Gforth are two cells wide.
b ecause execute and do defer account for only 1%{
An alternativewould b e to use a double indirec-
1.6% of the executed primitives and do ers; and it
tion (@ @) to get from the co de eld to the co de ad-
should b e a slight sp eedup on most pro cessors, b e-
dress (double-indirect threaded co de, see Fig. 6). In
cause most (70%{97%) of the executed or deferred
this scheme, the co de eld for does>-de ned words
words are colon de nitions, and indirect threaded
points near the do es-co de, and the cell there p oints
co de is often faster for non-primitives (b ecause a
to do do es.
jump is often more exp ensivethana@).
This allows to reduce the co de eld size to one
1
cell, but requires an additional indirection on every
Expanding ;co de-de ned words into lit cfa execute
ensures that this holds for uses of ;co de-de ned words, to o. use of the co de eld; using double-indirect thread-
hniques on the 386 architecture.
header of x tec
CPUs usually split the rst-level cache
code field dovar code Mo dern
to an instruction cacheandadatacache. Instruc-
body in
hes access the instruction cache, loads and
docol code tion fetc he.
header of foo stores access the data cac w do es the in-
code field code field If a store writes an instruction, ho
cache learn ab out that? On most archi-
lit_@_;s lit_@_;s code struction
tectures the program is required to announce this
to the CPU in some sp ecial way after writing the in-
struction, and b efore executing it. However, the 386
Figure 7: Combining primitives into sup erinstruc-
architecture do es not require anysuchsoftware sup-
tions in hybrid direct/indirect threaded co de
2
port , so the hardware has to deal with this problem
by itself. The various implementations of this ar-
ing for all co de would incur a signi cant p erfor-
chitecture deal with this problem in the following
mance p enalty, but the p enalty is small for a hybrid
ways:
with a primitive-centric direct threading scheme.
Double-indirect threaded co de also requires one ad-
The Pentium, Pentium MMX, K5, K6, K6-2,
ditional cell for the additional indirection for every
and K6-3 don't allow a cache line (32 bytes
primitive.
on these pro cessors) to b e in b oth the instruc-
Forth systems using indirect threaded co de and
tion and the data cache. If the instruction
one-cell co de elds usually use a scheme for im-
cache loads the line, the data cache has to evict
plementing does>-de ned words that is similar to
it rst, writing back the mo di ed instruction.
our double-indirect threaded scheme, with one dif-
Similarly, when the line is loaded into the data
ference: instead of a p ointer to do do es there is a
cache,itisinvalidated in the instruction cache.
machine-co de jump to do do es b etween (does>) and
The e ect of this is that having alternating
the do es-co de. Gforth uses an appropriately mo d-
data accesses and instruction executions in the
i ed variant of this approach for direct threaded
same cache line is relatively exp ensive (dozens
co de on some platforms [Ert93].
of cycles on each switch).
The Pentium Pro, Pentium II, Pentium III,
2.6 Sup erinstructions
Celeron, Athlon and Duron allow the same
cache line to b e in b oth caches, in shared state
We can add new primitives (sup erinstructions) that
(i.e., read-only). As so on as there is a write to
perform the action of a sequence of primitives.
the line, the line is evicted from the instruction
These sup erinstructions are then used in place of
cache. The e ect is that it is relatively exp en-
the original sequence; the immediate parameters to
sive to alternate writes to and instruction exe-
the original primitives are just app ended to the su-
cutions from the same cache line (32 bytes on
p erinstruction (see Fig. 7). In Gforth we select a
the Intel pro cessors, 64 bytes on the AMDs).
few hundred frequently o ccuring sequences as su-
p erinstructions.
According to the Pentium 4 optimization man-
This optimization can be used with any of the
ual [Int01], it is exp ensive on the Pentium 4
threading schemes explained earlier, but it gives
to alternate writes and instruction executions
b etter results with the primitive-centric schemes,
from the same 1KB-region; actually the man-
b ecause a traditional-style non-primitive cannot b e
ual recommends keeping co de and data on sep-
part of a sup erinstruction. However, it is p ossible
arate pages (4KB).
to use the traditional scheme and convert only the
invo cations of those non-primitives into primitive-
Howarevarious Forth implementation techniques
centric style that are to b e included in sup erinstruc-
a ected by that?
tions.
Sup erinstructions reduce the threaded co de size,
Indirect-threaded co de. If primitives are im-
but require more nativecode. The main advantage
plementedasshown in Fig. 1, i.e., with co de elds
(and the reason for their implementation in Gforth)
adjacenttothe co de, the data read from the co de
is the reduction in run-time, mainly by reducing the
eld is so on followed by an instruction read nearby
numb er of mispredicted indirect branches.
(usually in the same cache line). This leads to low
p erformance on Pentium{K6-3; e.g., Win32Forth
su ered heavily from this.
3 Cache consistency
2
Actually, the 486 requires a jump in order to ush the
This section discusses an issue that has a signi cant
pip eline, but that do es not help with the cache consistency
p erformance impact on most Forth implementation problem.
ortunately, this problem is easy to avoid by
F execution time
putting the co de into an area separate from the co de
Gforth do es this, and p erforms well on these
elds. 1
pro cessors with indirect threaded co de.
Direct-threaded co de. Non-primitives have a
jump in the co de eld close to the data that usually indirect threaded
accessed so on after executing the jump. So for is direct threaded
0.5
traditional-style direct-threaded co de the Pentium{
K6-3 will slowdown when executing non-primitives,
and the Pentium Pro{Pentium 4 will slow down
when writing to variables and values. There may
additional slowdowns due to co de elds for
be 0
words b eing in the same cache line as the other trad 0 100 400 1600
doprims 50 200 800
data, esp ecially with the longer cache lines of the
Athlon/Duron, and the 1KB consistency checking
Figure 8: Execution time of bench-gc relative
region of the Pentium 4. Note that these problems
to traditional-style indirect threaded co de on an
do not show up when running some of the p opular
800MHz Athlon
small benchmarks, b ecause their inner lo ops often
contain only primitives.
voided by using indirect
These problems can b e a execution time
threaded co de or by using primitive-centric direct-
in the latter case execute still ex-
threaded co de; 1
the jumps in the co de elds and will cause
ecutes indirect threaded
slowdowns. This can b e avoided byusingahybrid
direct/indirect-threaded co de scheme.
direct threaded
ecode. ManyForth-to-native-co de compil-
Nativ 0.5
ers store the native co de here, interleaving that
co de with data for variables, headers, etc. This
can lead to p erformance loss due to cache consis-
tency maintenance in a somewhat erratic fashion
h data shares a cache line with (dep ending on whic 0
trad 0 100 400 1600
h co de in a particular run). It also leads to bad
whic doprims 50 200 800
utilization of b oth caches.
This problem can b e avoided byhaving separate
Figure 9: Execution time of bench-gc relative
co de and data memory areas.
to traditional-style indirect threaded co de on a
600MHz 21164a
4 Evaluation
We built and ran all of these variants with b oth
This section compares the execution sp eed and
indirect and direct threaded co de, making it p ossi-
memory requirements of a number of variants of
ble to identify howvarious comp onents of a scheme
Gforth:
contribute to p erformance. We did not run hybrid
schemes (they are not yet implemented in Gforth),
trad Traditional-style threaded co de.
but we exp ect the p erformance to b e very close to
doprims Primitive-centric threaded co de using
direct threaded co de. For technical reasons, the ker-
sp ecial primitives for values, elds, etc., such
nel of Gforth (a part that contains, e.g., the com-
that only one primitive is executed per non-
piler and text interpreter, but not the full wordlist
primitive.
and search order supp ort) is always compiled in
mostly Scheme 0 in these exp eriments.
0 Primitive-centric co de, expanding some words
into multiple primitives, as shown in Fig. 4.
Equivalent to using 0 sup erinstructions.
4.1 Sp eed
50, 100, 200, 400, 800, 1600 Using 50, 100, ... Figure 8 and 9 showhowthe bench-gc benchmark
sup erinstructions representing the n most fre- b ehaves on an Athlon and a 21164a (Alpha archi-
quently executed sequences in brainless (a tecture), resp ectively. Other b enchmarks b ehavein
chess program written byDavid Kuhling). similar ways, although the magnitude of the e ects
aries with the b enchmark and with the pro cessor.
v bytes
For the traditional-style scheme, indirect thread-
threaded code
ing is faster than direct threading on current 386
100000 native code
architecture pro cessors (by up to a factor of 3), b e-
cause of cache consistency issues; on other architec-
tures direct threading b eats indirect threading also
for the traditional-style scheme, but not byasmuch
50000
as in the other schemes (b ecause of the additional
jumps). engine data
The primitive-centric scheme gives ab out the
same p erformance as the traditional-style scheme
0
For direct threaded co de
for indirect threaded co de. trad 0 100 400 1600 hitecture,
it pro duces a go o d sp eedup on the 386 arc doprims 50 200 800
mainly due to the elimination of cache invalidations.
On the Alpha it gets a little faster, b ecause of the
Figure 10: The sizes of Gforth's primitives (native
elimination of the jumps through the co de elds.
co de), engine data (mainly p eephole optimization
Overall, the combination of the primitive-centric
tables), and threaded co de a ected by the di erent
scheme and direct threading runs faster than tra-
schemes
ditional indirect threaded co de on all pro cessors.
Expanding non-primitives into multiple primi-
everything that app ears in the dictionary, not
tives (Scheme 0) costs some p erformance, but intro-
just the threaded co de. The co de grows quite
ducing sup erinstructions recovers that right away.
a bit by going to the primitive-centric scheme,
On the Athlon, the Pentium I I I, and the 21264 the
and the savings incurred by sup erinstructions
sp eedup from sup erinstructions is typically up to a
cannot fully recover that cost.
factor of 2 for large b enchmarks, and up to a fac-
tor of 5.6 on small b enchmarks (matrix ). A large
Native co de is the text size of the Gforth binary,
part of this sp eedup is caused by the improved in-
which consists mainly of the co de for the prim-
direct branch prediction accuracy of the BTB on
itives. For large numb ers of sup erinstructions,
these CPUs. The sp eedup from sup erinstructions
this grows signi cantly.
is quite a bit less on pro cessors without BTB like
the 21164a and the K6-2 (e.g., for b ench-gc a factor
Engine data is the data size of the Gforth binary,
of 1.38 on the K6-2 vs. 1.86 for the Athlon).
which contains sup erinstruction optimization
Having a large numb er of sup erinstructions gives
tables, among other things. These, of course
diminishing returns. On the 21164a, there are also
grow with the numb er of sup erinstructions.
slowdowns from more sup erinstructions in some
con gurations; this is probably due to con ict
Overall, sup erinstructions as currently imple-
misses in the small (8KB) direct-mapp ed instruc-
mented in Gforth cannot reduce the co de size;
tion cache.
other techniques, like converting only those non-
Using a large number of sup erinstructions also
primitives into a primitive-centric form that are
poses some practical problems: Compiling Gforth
combined into a sup erinstruction, or eliminating
with 800 sup erinstructions requires ab out 100MB
co de elds, and switching to byte co de mightchange
virtual memory on the 386 architecture, compil-
the picture for suciently large programs. On the
ing it with 1600 sup erinstructions requires ab out
other hand, the increase in memory consumption
300MB and 1.5 hours on a 800MHz Athlon with
from primitive-centric schemes and sup erinstruc-
192MB RAM. So, in the release version Gforth will
tions should not b e a problem in desktop and server
probably use only a few hundred sup erinstructions.
systems. But for manyemb edded systems, the tra-
ditional scheme will continue to be the threaded-
co de metho d of choice.
4.2 Size
Figure 10 shows data ab out the sizes of various com-
5 Conclusion
p onents of Gforth:
Switching from traditional-style threaded co de to a
Threaded co de is the non-kernel part of the primitive-centric scheme op ens up a numberofop-
threaded co de of Gforth for the 386 architec- p ortunities, most notably a wider applicability of
ture; it contains, e.g., the full wordlist and sup erinstructions, but it also makes direct threaded
search order supp ort, see, and the assembler co de viable on 386 architecture pro cessors, and en-
and disassembler. The shown size includes ables a hybrid direct/indirect (or direct/double-
indirect) threading scheme that o ers the p erfor-
mance of direct-threaded co de without requiring
any (non-p ortable) machine-co de generation. The
price paid is a mo derate increase in threaded-co de
size.
Sup erinstructions o er go o d sp eedups on pro ces-
sors with BTBs, mo derate sp eedups on pro ces-
sors without BTBs, and a mo derate reduction in
threaded co de size (but not enough to fully recover
the increase through primitive-centric co de), but re-
quire more space for the primitives.
References
[Bad95] Wil Baden. Pinhole optimization. Forth
Dimensions, 17(2):29{35, 1995.
[Ert93] M. Anton Ertl. A p ortable Forth engine. In
EuroFORTH '93 conference proceedings,
MarianskeLazne (Marienbad), 1993.
[Hug82] R. J. M. Hughes. Sup er-combinators. In
ConferenceRecord of the 1980 LISP Con-
ference, Stanford, CA, pages 1{11, New
York, 1982. ACM.
[Int01] Intel. Intel Pentium 4 Processor Optimiza-
tion, 2001.
[Pro95] Todd A. Pro ebsting. Optimizing an
ANSI C interpreter with sup erop erators.
In Principles of Programming Languages
(POPL '95), pages 322{332, 1995.
[Sch92] Udo Schutz. Optimierung von Fadenco de.
In FORTH-Tagung, Rosto ck, 1992. Forth
Gesellschaft e.V.