The Cog writing a JIT in a high-level dynamic language

Eliot Miranda Independent Consultant San Francisco [email protected]

Introduction 1. Overview Cog is an extension of the VMMaker [1] vir- Cog is implemented within the Squeak VMMaker tual machine framework for implementing Smalltalk package and translated to using the package’s virtual machines. The framework implements virtual Slang translator. Being a Smalltalk program, Cog is machines in a subset of Smalltalk and allows their evaluable within a Smalltalk environment and hence incremental development and execution within the allows incremental development and debugging of Smalltalk IDE, and then translates the code to C to most of the system. But to evaluate generated x86 produce a production VM. machine code the framework is extended with an Cog adds a just-in-time compiler for x86 to the exist- interface to the core processor implementation in the ing interpreter. At this stage in its evolution Cog Bochs [3] IBM PC emulator, which is married to the produces a VM that runs language-intensive (rather byte array object holding the heap and generated than library intensive) benchmarks from the com- code, and to variables in the Smalltalk interpreter puter language shootout [2] at around 5 times faster and JIT objects. than the interpreter. While Cog’s code generation Cog was implemented in a commercial context, the and message send optimization techniques are not Qwaq/Teleplace1 [4] application of the Croquet [5] novel, its plumbing of an x86 simulator into its virtual worlds system to business communication. In framework, which allows full simulation of gener- this context, incremental delivery of increased per- ated machine code, is probably of some interest to formance over the base interpreter and minimisation the VMIL community. The VM is also well- of risk was a major requirement. Hence Cog modularised and extensible, as exemplified by re- evolved in several stages, and forethought was given cently being applied to Newspeak [6]. to making major components pluggable to support Cog is fully open-source, released under the MIT that evolution. Squeak license. The author maintains a blog with This paper follows the above sketch. We first cover detailed articles on various components of the sys- the Slang Smalltalk-to-C translator and its exten- tem [7]. sions, going on to describe the processor simulation General Terms Virtual Machine Implementa- system with Bochs at its core. The next two sections tion, High-level language, dynamic language, just- then look at different aspects of modularity and ex- in-time compiler. tensibility in Cog. The last section presents a couple

1 The company was founded as Qwaq, Inc (see the left of a qwerty keyboard) and changed its name to Teleplace to avoid offense; try selling Qwaq Forums to a medical establishment. of cool hacks made possible by using Smalltalk and essor implementation in the Bochs [3] IBM PC emu- then we conclude. lator. This accesses data in the memory ByteArray, and executes instructions in the generated code por- 2. Slang tion of the memory ByteArray. Using Bochs is both As described in [1] the VMMaker framework uses a less work and doubtless provides faster execution subset of Smalltalk that is easily translated to C. The than implementing an x86 simulator in Smalltalk. translator, called Slang, is a one-pass traversal over The Bochs x86 simulator is derived by taking the the parse trees of methods in the framework, core processor definition from Bochs, changing the straightforwardly writing out C code. This implies memory interface so that it fetches/stores from/to the no rich data structures, no Dictionaries (hash maps), memory ByteArray and wrapping this up in a Squeak Sets or OrderedCollections (dequeues, stacks), only VM plugin2 with a primitive interface whose re- Arrays, simple to:by:do: loops, and no polymor- ceiver is an Alien3 for the Bochs C++ processor ob- phism; each message in the framework must have ject and whose arguments are commonly the memory only one implementation. ByteArray and some read/write/execute limits. The primitive interface comprises 6 methods for execu- The base Interpreter is monolithic, inheriting from tion, single stepping, disassembly, initialization and the object representation/heap/garbage collector Ob- error message retrieval. The remaining methods ac- jectMemory. All messages are therefore to self. Elements of this architecture were far too restrictive cess the register state of the x86 represented by the for a JIT and the architecture was extended, by split- C++ object. ting the Interpreter from the ObjectMemory, and Machine code is generated into the “method zone”, keeping the JIT separate from these two. In located low in the memory ByteArray. Correspond- to this support for multiple classes, there is support ingly there are structures in the memory object, such for record and pointer-to-record types which are used as machine-code methods, that must be accessed extensively to implement data structures such as in from the Smalltalk side of things, and hence proxy the Interpreter, a stack page of Smalltalk method ac- objects are used to wrap addresses in the memory tivations, or in the JIT, an abstract instruction, or an object (see Figure 1 on page 6). These wrapper entry on the simulation stack used to map stack byte- classes are auto-generated from the classes that de- code to register-based code. fine the relevant ADTs, e.g. CogMethod defines the ADT for a machine-code method header, generates The base Interpreter uses a ByteArray called memory to model the object heap, using integer addresses in the C typedef in the Slang translation and generates CogMethodSurrogate32/64 to access CogMethod the memory ByteArray as object references. In Cog instances in the simulation memory (see appendix). this memory ByteArray is extended with space for stack pages holding method activations in the form Accessing methods and variables in the interpreter, of stack frames, and with space for generated code. such as the bulk of primitives that are not imple- mented in machine code (window, file and operating 3. Bochs system interfaces, FFI, save/restore and complex To evaluate generated x86 machine code the frame- execution primitives such as perform: (Smalltalk’s work is extended with an interface to the core proc- apply)), requires a way of mapping machine code

2 A Squeak VM plugin is a class implementing a set of primitives that either implement or provide an interface to some useful functionality. It can be either statically or dynamically linked into the VM and is hence a modular extension to the VM providing encapsulation and platform independence over an FFI based implementation. See [12].

3 An Alien is a wrapper for some external address along with a set of primitives such as longAt:put: for accessing external memory relative to that address. accesses to Smalltalk message sends. This is done cacheing techniques, which are focussed on call in- using illegal addresses. Each method and/or variable structions. But the mapping must also allow modifi- that needs to be accessed from machine code is given cation of the activation records, causing modification a unique illegal address and entered in a read and a or destruction of the associated stack frame as ap- write dictionary mapping illegal address to evaluable propriate. Such code is complex and Qwaq/ (either a Symbol or a Block) that can be used to ei- Teleplace management rightly wanted to speed de- ther fetch/store a variable or call a method. Bochs is livery of improved performance by requiring that the invoked to execute machine code until it hits an ille- context-to-stack mapping machinery was realised gal address, at which point the invocation primitive first in the context of a modified interpreter, without fails and maps the Bochs exception into a Smalltalk having to wait for the development of the JIT. This ProcessorSimulationTrap Exception subclass that context-to-stack mapping StackInterpreter is typi- records the offending instruction, next instruction cally about twice as fast as the original context-based address, the illegal address, and register source or Interpreter for Smalltalk-intensive code. Further, destination. The system simulator then handles Qwaq/Teleplace negotiated a naive initial JIT code ProcessorSimulationTrap exceptions by looking up generator for early delivery. This code generator has the illegal address in the relevant Dictionary, evaluat- a one-to-one correspondence between bytecodes and ing it and continuing execution. generated code, hence every bytecode that pushes an Execution of the simulator (see Figure 1 on page 6) operand results in machine code that also pushes that starts in Smalltalk, the interpreter being pure Small- operand on the real stack. talk. But once the CoInterpreter has asked the JIT to Given the StackInterpreter it was natural to use it via compile a method execution may continue in ma- the CoInterpreter subclass4 which married the chine code, surfacing back up to Smalltalk on en- context-to-stack mapping machinery, bytecode inter- countering an illegal address, or the VM being inter- preter and primitive set to Cog’s JIT (called the Co- rupted. Use of single-stepping allows capturing ma- git), providing a mixed execution model where code chine instructions so that one may conveniently ex- may either be interpreted or compiled to machine amine the previous N instructions on hitting a bug. code and freely inter-called. Hence Cog evolved in five stages, an extended bytecode set and new Small- 4. Modularity, Evolution and Mixed- talk bytecode compiler that implements full closures Mode Execution and hence enables context-to-stack mapping, a faster Cog was implemented in a commercial context, the context-to-stack mapping interpreter, a naive code Qwaq/Teleplace [4] application of the Croquet [5] generator, a threading model supporting a non- virtual worlds system to business communication. In blocking FFI and a sophisticated code generator, and this context, incremental delivery of increased per- is currently being extended with a more efficient ob- formance over the base interpreter and minimisation ject representation and a garbage collector that sup- of risk was a major requirement. One unique chal- ports pinning to serve the threaded FFI. This evolu- lenge in implementing a high-performance Smalltalk tion has been enabled by keeping the VM modular virtual machine is efficient access to contexts, Small- with forethought given to making things like the ob- talk’s first-class activation records [8][9]. Such a ject representation and bytecode set pluggable. machine must map activation records to stack There are distinct advantages to the mixed execution frames, to be able to apply conventional inline model. The Cogit can refuse to compile certain

4 Slang was extended to allow methods to be redefined in subclasses, with super sends being expanded during translation. methods, especially extremely large ones that con- preterMT subclass of the CoInterpreter, itself a sub- sume too much code space. Such methods are typi- class of StackInterpreter. 5 cally class initializers and rarely run ; interpreting The second major extension was the implementation them once typically consumes much less space and of a stack-to-register mapping code generator that is much faster than compiling and running them avoids reifying operands on the actual stack by once. The Cogit has a fixed size machine code zone, maintaining a simulation stack during compilation which has good performance characteristics due to and only generating code to access an operand when its small working set size, and is also quick to scan compiling a bytecode that consumes operands such and update when, for example, redefinition of a as a send or instance variable assignment. This sim- Smalltalk method in the IDE necessitates flushing of ple technique may have been pioneered by Peter associated machine code and inline caches. The Co- Deutsch in the ObjectWorks 2.4 Smalltalk VM, but Interpreter implements an extremely simple policy, was not described in the literature. It is similar to the only compiling a method if it has less than a certain technique used in the Intel VTune Java JIT [11]. number of literals and has been found in the first- Also developed is a code generator that provides level method lookup cache. Hence Cog typically support for adaptive optimization and speculative compiles a bytecoded method to machine code on its inlining, by adding performance counters to code 6 second invocation , avoiding compiling any method generated for conditional branch bytecodes (which used only once (e.g. for initialization). provides basic block frequency information), and When the code zone fills up a reclamation is sched- code to reify inline cache state as Smalltalk objects. uled which throws away least recently used methods Hence the Cogit hierarchy is up to a quarter of the space. In a pure JIT in order to continue execution the VM may have to reclaim Cogit code while attempting to edit that code (e.g. to add a SimpleStackBasedCogit method to a polymorphic inline cache) which can be StackToRegisterMappingCogit extremely tricky, given that reclamation involves StackToRegisterMappingCogitChecker compacting the space, relocating all methods in it. SistaStackToRegisterMappingCogit But with the mixed mode model reclamation is per- formed only at suspension points at the next avail- where StackToRegisterMappingCogitChecker as a able opportunity, providing significant simplifica- test class that checks stack depths are correct at sus- tion. pension points, and Sista stands for Speculative In- lining Small-talk Architecture. 5. Modularity and Extensibility The bytecode decode and generation scheme is plug- The first major extension of the VM was the imple- gable; a bytecode set is defined by an Array of Cog- mentation of a threading model, similar to that in the BytecodeDescriptor instances that contain a message Python VM, but using an extremely efficient locking selector to generate the bytecode (mapped to a func- mechanism [10], where any number of native threads tion pointer in C), the number of bytes in the byte- can share the VM with only one owning the VM at code, etc. Hence there is no monolithic central any one time. This is implemented in the CoInter- switch statement in the compiler, and bytecode sets should be easily changed, although to date this has

5 of course during development class initializers may be run often, but they may be evaluated only after redefinition, and being evaluated on demand by the programmer their performance is typically not an issue.

6 The CoInterpreter will also compile to machine code if a backward branch bytecode (which occurs at the end of all loops) is evaluated a consecutive number of times in the same method. only been exercised in adding bytecodes for the at start-up, using reflection, arranges that all methods Newspeak implementation. implemented in CoInterpreter and overridden in Co- The object representation is also pluggable, all code InterpreterMT have an override in CogVMSimulator for generating object access being abstracted into a that invokes either the CoInterpreter version or the separate hierarchy, currently with only two classes, CoInterpreterMT version depending on CoInterpre- an abstract CogObjectRepresentation and its sole terMTʼs cogThreadManager instance variable being subclass CogObjectRepresentationForSqueakV3 nil or non-nil. For example: that implements the current Squeak object format. I CogVMSimulator methods for simulation switch am just starting work on a new object representation initializeInterpreter: bytesToShift "Auto-generated by CogVMSimulator>> based on a scheme the author used in the 64-bit ensureMultiThreadingOverridesAreUpToDate" VisualWorks implementation where object headers contain indices into a class table rather than a full- ^self perform: #initializeInterpreter: width pointer to a class object. This has space ad- withArguments: {bytesToShift} vantages but also advantages in inline caches, which inSuperclass: (cogThreadManager now contain constant class indices instead of class ifNil: [CoInterpreter] pointers subject to movement by the garbage collec- ifNotNil: [CoInterpreterMT]) In adding threading the system needed to simulate tor. multiple processors. This was implemented by 6. Clever Tricks wrapping the Bochs Alien with a proxy that essen- tially implements only doesNotUnderstand:. The The utility of being able to use Smalltalk, a doesNotUnderstand: method then checks the current scriptable dynamic language, to run queries during process and if it has changed, saves the register set simulation, for example to gather statistics from in- and switches it to that of the current process. line caches, should not be underestimated. 7. Conclusion Preceding sections have been glib about the details of simulation. One issue is integer overflow in C Cog is a new implementation of the Smalltalk-80 and its absence in Smalltalk (which has bignum sup- virtual machine for Squeak. It is a Smalltalk pro- port). Hence the framework uses simulator sub- gram, providing incremental development and high- classes to implement certain methods that reconcile level debugging, while it is a JIT, depending on ex- Smalltalk and C’s peculiarities and provide ternal processor simulators to efficiently simulate simulation-only debugging code. For example, machine code execution. It is well modularised and ObjectMemory methods for interpreter access extensible and as such provides a sound basis for the isIntegerValue: intValue evolution of Smalltalk execution engines. It is in ^ (intValue bitXor: (intValue << 1)) >= 0 active development and is the standard execution engine for the Squeak and Smalltalk dialects ObjectMemorySimulator methods for simulation and for the Newspeak . isIntegerValue: valueWord ^ valueWord >= 16r-40000000 and: [valueWord Acknowledgements <= 16r3FFFFFFF] I wish to thank Andreas Raab and Greg Nuyens ex of Qwaq/Teleplace, where Cog started and matured, So CoInterpreter started life with a substantial sub- and Yaron Kashai at Cadence Design Systems, Inc, class CogVMSimulator. When CoInterpreterMT was added, I wanted to avoid maintaining two simulator where I have the opportunity to continue Cog’s evo- subclasses with essentially the same code. So lution and apply it to Newspeak. CogVMSimulator inherits from CoInterpreterMT and major simulation objects set-up on initialization CogObject Representation ForSqueakV3

Cogit NewCoObjectMemory StackToRegister- MappingCogit

CogMethodZone

CoInterpreter free space

Smalltalk object heap e.g. several Mb

CogStackPageSurrogate32

stack pages e.g. 32kb proxy instances CogMethodSurrogate32 created as necessary machine-code wrap addresses/indices methods within memory ByteArray e.g. 1Mb CogMethodSurrogate32 run-time linkage < 3kb

Figure 1. Cog simulation architecture memory ByteArray References Appendix - CogMethod code [1] , , John Maloney, Scott VMStructType Wallace, and , “Back to the future: the subclass: #CogBlockMethod instanceVariableNames: story of Squeak, a practical Smalltalk written in it- 'objectHeader homeOffset startpc padToWord self”, Proceedings of the 1997 ACM SIGPLAN con- cmNumArgs cmType cmRefersToYoung ference on Object-oriented programming systems, cmIsUnlinked cmUsageCount stackCheckOffset' languages and applications, pp 318–326, ACM, 1997 CogBlockMethod [2] computer language benchmarks game subclass: #CogMethod http://shootout.alioth.debian.org/ instanceVariableNames: 'blockSize blockEntryOffset methodObject [3] Bochs, http://en.wikipedia.org/wiki/Bochs, methodHeader selector' http://bochs.sourceforge.net/ CogMethod auto-generated typedef: [4] Qwaq/Teleplace, http://www.teleplace.com/ typedef struct { [5] Croquet, sqInt objectHeader; unsigned cmNumArgs : 8; http://en.wikipedia.org/wiki/Croquet_Project unsigned cmType : 3; [6] Newspeak, http://newspeaklanguage.org/ unsigned cmRefersToYoung : 1; unsigned cmIsUnlinked : 1; [7] Cog blog, http://www.mirandabanda.org/cogblog unsigned cmUsageCount : 3; unsigned stackCheckOffset : 16; [8] http://www.mirandabanda.org/cogblog/2009/ ushort blockSize; 01/14/under-cover-contexts-and-the-big-frame-up ushort blockEntryOffset; sqInt methodObject; [9] Eliot Miranda, “Context management in Visual- sqInt methodHeader; Works 5i”, OOPSLA '99 Workshop on Simplicity, sqInt selector; } CogMethod; Performance and Portability in Virtual Machine De- sign, November 1999, Denver, CO. Sample auto-generated CogMethodSurrogate32 meth- http://www.esug.org/data/Articles/misc/oopsla99-con ods: texts.pdf cmNumArgs ^memory unsignedByteAt: address + 5 [10] David Simmons, private communication. cmType [11] Ali-Reza Adl-Tabatabai, Michael Cierniak, ^(memory unsignedByteAt: address + 6) bitAnd: 16r7 Guei-Yuan Lueh, Vishesh M. Parikh, James M. Stichnoth, "Fast, Effective Code Generation in a cmType: aValue self assert: (aValue between: 0 and: 16r7). Just-In-Time Java Compiler", In Proceedings of the memory SIGPLAN `98 Conference on Programming Lan- unsignedByteAt: address + 6 guage Design and Implementation, pp 280-290. put: ((memory unsignedByteAt: address + 6) Published as SIGPLAN Notices 33(5), May 1998. bitAnd: 16rF8) + aValue. ^aValue [12] Tim Rowledge, “A Tour of the Squeak Object cmRefersToYoung Engine”, ^(((memory unsignedByteAt: address + 6) http://www.google.com/search?q=%22A+Tour+of+th bitShift: -3) bitAnd: 16r1) ~= 0 e+Squeak+Object+Engine%22+filetype%3Apdf methodObject: aValue ^memory unsignedLongAt: address + 13 put: aValue