Threaded Code Variations and Optimizations



M. Anton Ertl

TU Wien

Correspondence address: Institut für Computersprachen, Technische Universität Wien, Argentinierstraße 8, A-1040 Wien, Austria; [email protected]

Abstract

Forth has been traditionally implemented as indirect threaded code, where the code for non-primitives is the code-field address of the word. To get the maximum benefit from combining sequences of primitives into superinstructions, the code produced for a non-primitive should be a primitive followed by a parameter (e.g., lit addr for variables). This paper takes a look at the steps from a traditional threaded-code implementation to superinstructions, and at the size and speed effects of the various steps. The use of superinstructions gives speedups of up to a factor of 2 on large benchmarks on processors with branch target buffers, but requires more space for the primitives and the optimization tables, and also a little more space for the threaded code.

1 Introduction

Traditionally, Forth has been implemented using an interpreter for indirect threaded code. However, over time programs have tended to depend less on specific features of this implementation technique, and an increasing number of Forth systems have used other implementation techniques, in particular native code compilation.

One of the goals of the Gforth project is to provide competitive performance; another goal is portability to a wide range of machines. To meet the portability goal, we decided to stay with a threaded-code engine compiled with GCC [Ert93]; to regain ground lost on the efficiency front, we decided to combine sequences of primitives into superinstructions. This technique has been proposed by Schütz [Sch92] and implemented by Wil Baden in this4th [Bad95] and by Marcel Hendrix in a version of eforth. It is related to the concepts of supercombinators [Hug82] and superoperators [Pro95].

Non-primitives in traditional indirect threaded code cannot be combined into superinstructions, but it is possible to compile them into code that uses primitives that can be combined, and then to use direct threading instead of indirect threading. This paper describes the various steps from an indirect threaded implementation to an implementation using superinstructions (Section 2), and evaluates the effect of these steps on run-time and code size (Section 4); these steps also affect cache consistency issues on the 386 architecture, which can have a large influence on performance (Section 3).

2 Threaded code variations

We will use the following code as a running example:

variable x

: foo x @ ;

2.1 Traditional indirect threaded code

In indirect threaded code (Fig. 1) the code of a colon definition consists of a sequence of the code field addresses (CFAs) of the words contained in the colon definition. Such a CFA points to a code field that contains the address of the machine code that performs the function of the word.

Figure 1: Traditional indirect threaded code (diagram: the headers, code fields and bodies of x and foo, with the code fields pointing to the dovar, docol, @ and ;s code)

The reason for the indirection through the code field is to support non-primitives, like the variable x in our example: The dovar routine can compute the body address by adding the code field size to the CFA.
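As a rough illustration of Section 2.1, the following C sketch runs the example foo through an indirect threaded inner interpreter. It is only a sketch under stated assumptions: the names Word, run_foo and bye are hypothetical, the code field holds a C function pointer instead of a machine-code address, and the engine is written in portable C rather than in the GNU C that Gforth uses.

```c
#include <stdint.h>

typedef intptr_t Cell;              /* one cell holds an integer or an address */
typedef struct Word Word;
typedef void (*Code)(Word *cfa);    /* stands in for a machine-code address */

struct Word {                       /* hypothetical layout, for illustration only */
    Code code;                      /* code field */
    Cell body[4];                   /* body: data, or threaded code (CFAs) */
};

static Cell  stack[16], *sp;        /* data stack */
static Cell *rstack[16], **rp;      /* return stack of saved instruction pointers */
static Cell *ip;                    /* next cell of threaded code */
static int   running;

/* Doer for variables: the body address is the CFA plus the code field size
   (here: the offset of body within Word). */
static void dovar(Word *cfa) { *sp++ = (Cell)cfa->body; }
/* Doer for colon definitions: continue in the threaded code of the body. */
static void docol(Word *cfa) { *rp++ = ip; ip = cfa->body; }
/* Primitives do not need the CFA. */
static void fetch(Word *cfa) { (void)cfa; sp[-1] = *(Cell *)sp[-1]; }  /* @  */
static void semis(Word *cfa) { (void)cfa; ip = *--rp; }                /* ;s */
static void bye(Word *cfa)   { (void)cfa; running = 0; }  /* stops this sketch */

static Word w_fetch = { fetch, {0} };
static Word w_semis = { semis, {0} };
static Word w_bye   = { bye,   {0} };
static Word w_x     = { dovar, {0} };   /* variable x  */
static Word w_foo   = { docol, {0} };   /* : foo x @ ; */

/* Store value in x, execute foo, and return the top of the data stack. */
Cell run_foo(Cell value)
{
    Cell top[2];
    top[0] = (Cell)&w_foo;
    top[1] = (Cell)&w_bye;

    w_x.body[0]   = value;
    w_foo.body[0] = (Cell)&w_x;       /* threaded code of foo: CFAs of x @ ;s */
    w_foo.body[1] = (Cell)&w_fetch;
    w_foo.body[2] = (Cell)&w_semis;

    sp = stack; rp = rstack; ip = top; running = 1;
    while (running) {
        Word *cfa = (Word *)*ip++;    /* fetch the next CFA */
        cfa->code(cfa);               /* load the code address from the code field, jump */
    }
    return *--sp;
}
```

Calling run_foo(42) stores 42 in x and returns 42. Every word, primitive or not, is reached through the load from its code field; this load is the indirection discussed above.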

2.2 Traditional-style direct threaded code

For primitives the indirection is only needed because we do not know in advance whether the next word is a primitive or not. So, in order to avoid the overhead of the indirection for primitives, some Forth implementors have implemented a variant of this scheme using direct threaded code (see Fig. 2). For primitives, the threaded code for a word points directly to the machine code, and the execution token points there, too.

Figure 2: Direct threaded code, traditional variant (diagram: as Figure 1, but the code fields of x and foo contain jmp dovar and jmp docol, and the threaded code of foo points directly to the @ and ;s code)

For non-primitives, the threaded code cannot point directly to the machine-code routine (the doer), because the doer would not know how to find the body of the word. So the threaded code points to a (variant of the) code field, and that field contains a machine-code jump to the doer.

This change replaces a load in every word by a jump in every non-primitive. Because only 20%-25% of the dynamically executed words are non-primitives, direct threading is faster on most processors, but not always on some popular processors (see Section 3).

The main disadvantage of direct threading in an implementation like Gforth is that it requires architecture-specific code for creating the contents of the code fields (Gforth currently supports direct threading on 7 architectures).

2.3 Primitive-centric threaded code

An alternative method to provide the body address is to simply lay it down into the threaded code as immediate parameter. Then the threaded code for a non-primitive does not point to the "code field" of the non-primitive, but to a primitive that gets the inline parameter and performs the function of the non-primitive. For the variable x in our example this primitive is lit (the run-time primitive for literal).

This scheme is shown in Figure 3. The non-primitives still have a code field, which is not used in ordinary threaded code execution. But it is used for execute (and dodefer), because execute consumes a one-cell execution token (represented by a CFA), not a primitive with an argument.

Figure 3: Primitive-centric direct threaded code (diagram: the body of foo contains lit, the body address of x, @ and ;s; the code fields of x and foo still contain jmp dovar and jmp docol)

Figure 4 shows how the various classes of non-primitives can be compiled. Instead of using a sequence of several primitives for some classes, new primitives could be introduced that perform the whole action in one primitive. However, if we combine frequent sequences of primitives into superinstructions, this will happen automatically; explicitly introducing these primitives may actually degrade the efficacy of the superinstruction optimization, because the optimizer would not know without additional effort that, e.g., our value-primitive is the same as the lit @ combination it found elsewhere.

class               code
colon definition    call body
constant            lit value
value               lit body @
variable            lit body
user variable       useraddr offset
deferred            lit body @ execute
field               lit offset +
does-defined        lit body call does-code
;code-defined       lit cfa execute

Figure 4: Compiling non-primitives for the primitive-centric scheme (with the least number of additional primitives)

This scheme takes more space for the code of non-primitives; this is the original reason for preferring the traditional style when memory is scarce. There probably is little difference from traditional-style direct threading in run-time performance on most processors: Only non-primitives are affected; in the traditional-style scheme there is a jump and an addition to compute the body address, whereas in the primitive-centric scheme there is a load with often fully-exposed latency. For some popular processors the performance difference can be large, though (see Section 3).
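To make the primitive-centric scheme more concrete, here is a minimal C sketch of the running example compiled as in Figure 4: foo becomes lit body @ ;s, and foo itself is invoked with call body. The names Cell, run_foo and bye are assumptions made for this illustration, and C function pointers stand in for machine-code addresses; it is not the Gforth engine.

```c
#include <stdint.h>

typedef union Cell Cell;
typedef void (*Prim)(void);
union Cell {                  /* one cell of threaded code (hypothetical layout) */
    Prim     code;            /* direct threading: points straight at the code */
    intptr_t n;               /* inline parameter: number or data address */
    Cell    *addr;            /* inline parameter: address of threaded code */
};

static intptr_t stack[16], *sp;     /* data stack */
static Cell    *rstack[16], **rp;   /* return stack */
static Cell    *ip;                 /* next cell of threaded code */
static int      running;

static void lit(void)   { *sp++ = (ip++)->n; }             /* push inline parameter */
static void fetch(void) { sp[-1] = *(intptr_t *)sp[-1]; }  /* @ */
static void call(void)  { Cell *body = (ip++)->addr; *rp++ = ip; ip = body; }
static void semis(void) { ip = *--rp; }                    /* ;s */
static void bye(void)   { running = 0; }                   /* stops this sketch */

static intptr_t x_body;       /* body of: variable x */
static Cell     foo_body[4];  /* : foo x @ ;  compiled as  lit <x_body> @ ;s */

/* Store value in x, execute foo, and return the top of the data stack. */
intptr_t run_foo(intptr_t value)
{
    Cell top[3];

    x_body = value;
    foo_body[0].code = lit;                  /* variable: lit body */
    foo_body[1].n    = (intptr_t)&x_body;
    foo_body[2].code = fetch;
    foo_body[3].code = semis;

    top[0].code = call;                      /* colon definition: call body */
    top[1].addr = foo_body;
    top[2].code = bye;

    sp = stack; rp = rstack; ip = top; running = 1;
    while (running)
        (ip++)->code();       /* direct threaded NEXT: no load from a code field */
    return *--sp;
}
```

Calling run_foo(5) returns 5. No code field is touched during ordinary execution here; in the scheme described above, code fields would only be needed for execute and dodefer, which this sketch leaves out.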

Figure 5: Hybrid direct/indirect threaded code

Figure 6: A does>-defined word with indirect and double-indirect threaded code. (Diagram with two panels, "indirect" and "double-indirect", for the example : myconst create , does> @ ; and 5 myconst five.)

2.4 Hybrid direct/indirect threaded code

With the primitive-centric scheme the Forth engine (inner interpreter, primitives, and doers) is divided into two parts that are relatively independent of each other:

• Ordinary threaded code.

• Execute (and dodefer), code fields and execution tokens.

Only execute (and dodefer) deals with execution tokens and code fields (footnote 1), so we can modify these components in coordinated ways. The modification we are interested in is to use indirect threaded code fields; of course this requires an additional indirection in the execute code, it requires adding code fields to primitives (see Fig. 5), and the execution tokens are the addresses of the code fields. However, the ordinary threaded code points directly to the code of the primitives, and uses direct threaded NEXTs.

The main advantage of this hybrid scheme is that we do not need to create machine-code jumps in the code field, which helps the portability goals of Gforth and reduces the maintenance effort. Another advantage is that this allows us to separate code and data completely, without incurring the run-time cost of indirect threaded code for most of the code.

The overall performance impact should be small, because execute and dodefer account for only 1%-1.6% of the executed primitives and doers; and it should be a slight speedup on most processors, because most (70%-97%) of the executed or deferred words are colon definitions, and indirect threaded code is often faster for non-primitives (because a jump is often more expensive than a @).

Another interesting consequence of the division between ordinary threaded code and execute etc. in primitive-centric code is that doers (e.g., docol) can only be invoked by execute; therefore only execute has to remember the CFA for use by the doers. Unfortunately, the obvious way of expressing this in GNU C leads to the conservative assumption that this value is alive (needs to be preserved) across all primitives, and this results in suboptimal register allocation.

Footnote 1: Expanding ;code-defined words into lit cfa execute ensures that this holds for uses of ;code-defined words, too.

2.5 Double-indirect threaded code

For does>-defined words, the inner interpreter gets the CFA and has to find the body, the code address of dodoes, and the does-code (the Forth code behind the does>). With indirect threaded code, Gforth uses a two-cell code field that contains the code address and the does-code. Therefore all code fields in Gforth are two cells wide.

An alternative would be to use a double indirection (@ @) to get from the code field to the code address (double-indirect threaded code, see Fig. 6). In this scheme, the code field for does>-defined words points near the does-code, and the cell there points to dodoes.

This allows the code field size to be reduced to one cell, but requires an additional indirection on every use of the code field; using double-indirect threading for all code would incur a significant performance penalty, but the penalty is small for a hybrid with a primitive-centric direct threading scheme. Double-indirect threaded code also requires one additional cell for the additional indirection for every primitive.

Forth systems using indirect threaded code and one-cell code fields usually use a scheme for implementing does>-defined words that is similar to our double-indirect threaded scheme, with one difference: instead of a pointer to dodoes there is a machine-code jump to dodoes between (does>) and the does-code. Gforth uses an appropriately modified variant of this approach for direct threaded code on some platforms [Ert93].

2.6 Superinstructions

We can add new primitives (superinstructions) that perform the action of a sequence of primitives. These superinstructions are then used in place of the original sequence; the immediate parameters to the original primitives are just appended to the superinstruction (see Fig. 7). In Gforth we select a few hundred frequently occurring sequences as superinstructions.

Figure 7: Combining primitives into superinstructions in hybrid direct/indirect threaded code (diagram: the body of foo uses the superinstruction lit_@_;s, which points to the lit_@_;s code)

This optimization can be used with any of the threading schemes explained earlier, but it gives better results with the primitive-centric schemes, because a traditional-style non-primitive cannot be part of a superinstruction. However, it is possible to use the traditional scheme and convert into primitive-centric style only the invocations of those non-primitives that are to be included in superinstructions.

Superinstructions reduce the threaded code size, but require more native code. The main advantage (and the reason for their implementation in Gforth) is the reduction in run-time, mainly by reducing the number of mispredicted indirect branches.

3 Cache consistency

This section discusses an issue that has a significant performance impact on most Forth implementation techniques on the 386 architecture.

Modern CPUs usually split the first-level cache into an instruction cache and a data cache. Instruction fetches access the instruction cache, loads and stores access the data cache. If a store writes an instruction, how does the instruction cache learn about that? On most architectures the program is required to announce this to the CPU in some special way after writing the instruction, and before executing it. However, the 386 architecture does not require any such software support (footnote 2), so the hardware has to deal with this problem by itself. The various implementations of this architecture deal with this problem in the following ways:

• The Pentium, Pentium MMX, K5, K6, K6-2, and K6-3 don't allow a cache line (32 bytes on these processors) to be in both the instruction and the data cache. If the instruction cache loads the line, the data cache has to evict it first, writing back the modified instruction. Similarly, when the line is loaded into the data cache, it is invalidated in the instruction cache. The effect of this is that having alternating data accesses and instruction executions in the same cache line is relatively expensive (dozens of cycles on each switch).

• The Pentium Pro, Pentium II, Pentium III, Celeron, Athlon and Duron allow the same cache line to be in both caches, in shared state (i.e., read-only). As soon as there is a write to the line, the line is evicted from the instruction cache. The effect is that it is relatively expensive to alternate writes to and instruction executions from the same cache line (32 bytes on the Intel processors, 64 bytes on the AMDs).

• According to the Pentium 4 optimization manual [Int01], it is expensive on the Pentium 4 to alternate writes and instruction executions from the same 1KB-region; actually the manual recommends keeping code and data on separate pages (4KB).

Footnote 2: Actually, the 486 requires a jump in order to flush the pipeline, but that does not help with the cache consistency problem.

How are various Forth implementation techniques affected by that?

Indirect-threaded code. If primitives are implemented as shown in Fig. 1, i.e., with code fields adjacent to the code, the data read from the code field is soon followed by an instruction read nearby (usually in the same cache line). This leads to low performance on the Pentium through K6-3 processors; e.g., Win32Forth suffered heavily from this.
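Whether a data access and an instruction fetch interfere in this way depends on whether their addresses fall into the same aligned block: a 32-byte or 64-byte cache line, or the 1KB region of the Pentium 4. The small C helper below sketches that check; the name same_block is hypothetical, and the block size is assumed to be a power of two.

```c
#include <stdint.h>

/* Nonzero if addresses a and b lie in the same aligned block of `size`
   bytes, e.g. a cache line; `size` is assumed to be a power of two. */
int same_block(uintptr_t a, uintptr_t b, uintptr_t size)
{
    uintptr_t mask = ~(size - 1);   /* clears the offset within the block */
    return (a & mask) == (b & mask);
}
```

For example, the addresses 0x101f and 0x1020 lie in different 32-byte lines but in the same 64-byte line, so a code field and machine code at these addresses would interfere on the processors with 64-byte lines but not on those with 32-byte lines.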

Fortunately, this problem is easy to avoid by putting the code into an area separate from the code fields. Gforth does this, and performs well on these processors with indirect threaded code.

Direct-threaded code. Non-primitives have a jump in the code field close to the data that usually is accessed soon after executing the jump. So for traditional-style direct-threaded code the Pentium through K6-3 processors will slow down when executing non-primitives, and the Pentium Pro through Pentium 4 processors will slow down when writing to variables and values. There may be additional slowdowns due to code fields for words being in the same cache line as the other data, especially with the longer cache lines of the Athlon/Duron, and the 1KB consistency checking region of the Pentium 4. Note that these problems do not show up when running some of the popular small benchmarks, because their inner loops often contain only primitives.

These problems can be avoided by using indirect threaded code or by using primitive-centric direct-threaded code; in the latter case execute still executes the jumps in the code fields and will cause slowdowns. This can be avoided by using a hybrid direct/indirect-threaded code scheme.

Native code. Many Forth-to-native-code compilers store the native code here, interleaving that code with data for variables, headers, etc. This can lead to performance loss due to cache consistency maintenance in a somewhat erratic fashion (depending on which data shares a cache line with which code in a particular run). It also leads to bad utilization of both caches.

This problem can be avoided by having separate code and data memory areas.

4 Evaluation

This section compares the execution speed and memory requirements of a number of variants of Gforth:

trad
    Traditional-style threaded code.

doprims
    Primitive-centric threaded code using special primitives for values, fields, etc., such that only one primitive is executed per non-primitive.

0
    Primitive-centric code, expanding some words into multiple primitives, as shown in Fig. 4. Equivalent to using 0 superinstructions.

50, 100, 200, 400, 800, 1600
    Using 50, 100, ... superinstructions representing the n most frequently executed sequences in brainless (a chess program written by David Kühling).

We built and ran all of these variants with both indirect and direct threaded code, making it possible to identify how various components of a scheme contribute to performance. We did not run hybrid schemes (they are not yet implemented in Gforth), but we expect the performance to be very close to direct threaded code. For technical reasons, the kernel of Gforth (a part that contains, e.g., the compiler and text interpreter, but not the full wordlist and search order support) is always compiled mostly in Scheme 0 in these experiments.

4.1 Speed

Figures 8 and 9 show how the bench-gc benchmark behaves on an Athlon and a 21164a (Alpha architecture), respectively. Other benchmarks behave in similar ways, although the magnitude of the effects varies with the benchmark and with the processor.

Figure 8: Execution time of bench-gc relative to traditional-style indirect threaded code on an 800MHz Athlon (y axis: execution time, 0 to 1; x axis: variants trad, doprims, 0, 50, 100, 200, 400, 800, 1600; curves: indirect threaded, direct threaded)

Figure 9: Execution time of bench-gc relative to traditional-style indirect threaded code on a 600MHz 21164a (y axis: execution time, 0 to 1; x axis: variants trad, doprims, 0, 50, 100, 200, 400, 800, 1600; curves: indirect threaded, direct threaded)

For the traditional-style scheme, indirect threading is faster than direct threading on current 386 architecture processors (by up to a factor of 3), because of cache consistency issues; on other architectures direct threading beats indirect threading also for the traditional-style scheme, but not by as much as in the other schemes (because of the additional jumps).

The primitive-centric scheme gives about the same performance as the traditional-style scheme for indirect threaded code. For direct threaded code it produces a good speedup on the 386 architecture, mainly due to the elimination of cache invalidations. On the Alpha it gets a little faster, because of the elimination of the jumps through the code fields. Overall, the combination of the primitive-centric scheme and direct threading runs faster than traditional indirect threaded code on all processors.

Expanding non-primitives into multiple primitives (Scheme 0) costs some performance, but introducing superinstructions recovers that right away. On the Athlon, the Pentium III, and the 21264 the speedup from superinstructions is typically up to a factor of 2 for large benchmarks, and up to a factor of 5.6 on small benchmarks (matrix). A large part of this speedup is caused by the improved indirect branch prediction accuracy of the BTB on these CPUs. The speedup from superinstructions is quite a bit less on processors without BTB like the 21164a and the K6-2 (e.g., for bench-gc a factor of 1.38 on the K6-2 vs. 1.86 for the Athlon).

Having a large number of superinstructions gives diminishing returns. On the 21164a, there are also slowdowns from more superinstructions in some configurations; this is probably due to conflict misses in the small (8KB) direct-mapped instruction cache.

Using a large number of superinstructions also poses some practical problems: Compiling Gforth with 800 superinstructions requires about 100MB virtual memory on the 386 architecture; compiling it with 1600 superinstructions requires about 300MB and 1.5 hours on an 800MHz Athlon with 192MB RAM. So, in the release version Gforth will probably use only a few hundred superinstructions.

4.2 Size

Figure 10 shows data about the sizes of various components of Gforth:

Threaded code
    is the non-kernel part of the threaded code of Gforth for the 386 architecture; it contains, e.g., the full wordlist and search order support, see, and the assembler and disassembler. The shown size includes everything that appears in the dictionary, not just the threaded code. The code grows quite a bit by going to the primitive-centric scheme, and the savings achieved by superinstructions cannot fully recover that cost.

Native code
    is the text size of the Gforth binary, which consists mainly of the code for the primitives. For large numbers of superinstructions, this grows significantly.

Engine data
    is the data size of the Gforth binary, which contains superinstruction optimization tables, among other things. These, of course, grow with the number of superinstructions.

Figure 10: The sizes of Gforth's primitives (native code), engine data (mainly peephole optimization tables), and threaded code affected by the different schemes (y axis: bytes, 0 to 100000; x axis: variants trad, doprims, 0, 50, 100, 200, 400, 800, 1600; series: threaded code, native code, engine data)

Overall, superinstructions as currently implemented in Gforth cannot reduce the code size; other techniques, like converting only those non-primitives into a primitive-centric form that are combined into a superinstruction, eliminating code fields, or switching to byte code, might change the picture for sufficiently large programs. On the other hand, the increase in memory consumption from primitive-centric schemes and superinstructions should not be a problem in desktop and server systems. But for many embedded systems, the traditional scheme will continue to be the threaded-code method of choice.

5 Conclusion

Switching from traditional-style threaded code to a primitive-centric scheme opens up a number of opportunities, most notably a wider applicability of superinstructions, but it also makes direct threaded code viable on 386 architecture processors, and enables a hybrid direct/indirect (or direct/double-indirect) threading scheme that offers the performance of direct-threaded code without requiring any (non-portable) machine-code generation. The price paid is a moderate increase in threaded-code size.

Superinstructions offer good speedups on processors with BTBs, moderate speedups on processors without BTBs, and a moderate reduction in threaded code size (but not enough to fully recover the increase through primitive-centric code), but require more space for the primitives.

References

[Bad95] Wil Baden. Pinhole optimization. Forth Dimensions, 17(2):29-35, 1995.

[Ert93] M. Anton Ertl. A portable Forth engine. In EuroFORTH '93 conference proceedings, Mariánské Lázně (Marienbad), 1993.

[Hug82] R. J. M. Hughes. Super-combinators. In Conference Record of the 1980 LISP Conference, Stanford, CA, pages 1-11, New York, 1982. ACM.

[Int01] Intel. Intel Pentium 4 Processor Optimization, 2001.

[Pro95] Todd A. Proebsting. Optimizing an ANSI C interpreter with superoperators. In Principles of Programming Languages (POPL '95), pages 322-332, 1995.

[Sch92] Udo Schütz. Optimierung von Fadencode. In FORTH-Tagung, Rostock, 1992. Forth Gesellschaft e.V.