
CONS Should Not CONS Its Arguments, or, a Lazy Alloc is a Smart Alloc

Henry G. Baker
Nimble Computer Corporation, 16231 Meadow Ridge Way, Encino, CA 91436
(818) 501-4956, (818) 986-1360 (FAX)

November, 1988; revised April and December, 1990, November, 1991.
This work was supported in part by the U.S. Department of Energy, Contract No. DE-AC03-88ER80663.

Abstract

Lazy allocation is a model for allocating objects on the execution stack of a high-level language which does not create dangling references. Our model provides safe transportation into the heap for objects that may survive the deallocation of the surrounding stack frame. Space for objects that do not survive the deallocation of the surrounding stack frame is reclaimed without additional effort when the stack is popped. Lazy allocation thus performs a first-level garbage collection, and if the language supports garbage collection of the heap, then our model can reduce the amortized cost of allocation in such a heap by filtering out the short-lived objects that can be more efficiently managed in LIFO order. A run-time mechanism called result expectation further filters out unneeded results from functions called only for their effects. In a shared-memory multi-processor environment, this filtering reduces contention for the allocation and management of global memory.

Our model performs simple local operations, and is therefore suitable for an interpreter or a hardware implementation. Its overheads for functional data are associated only with assignments, making lazy allocation attractive for mostly functional styles.

Many existing stack allocation optimizations can be seen as instances of this generic model, in which some portion of these local operations have been optimized away through static analysis techniques.

Important applications of our model include the efficient allocation of temporary data structures that are passed as arguments to anonymous procedures which may or may not use these data structures in a stack-like fashion. The most important of these objects are functional arguments (funargs), which require some run-time allocation to preserve the local environment. Since a funarg is sometimes returned as a first-class value, its lifetime can survive the stack frame in which it was created. Arguments which are evaluated in a lazy fashion (Scheme delays or "suspensions") are similarly handled. Variable-length argument "lists" themselves can be allocated in this fashion, allowing these objects to become "first-class". Finally, lazy allocation correctly handles the allocation of a Scheme control stack, allowing Scheme continuations to become first-class values.

1. Introduction.

Stack allocation of objects in higher level programming languages is desired because it is elegant, efficient, and can handle the great majority of short-lived object allocations. Traditional higher-level languages such as Algol, Pascal and Ada have preferred to perform all automatic storage management by using a stack, while non-stack allocation remains the responsibility of the programmer. However, the limitations of stack allocation are semantically confining, because a strict last-in, first-out (LIFO) allocation/deallocation ordering does not allow for important classes of program behavior such as that of returning a functional argument as a result. Therefore, the programmer is forced to use complex and error-prone techniques to simulate this behavior himself, even though the abstract programming language may be capable of expressing the behavior more elegantly and directly using, for example, functional arguments as results. Modern higher level languages such as Lisp, Smalltalk, Mesa, Modula-3, ML, and Eiffel seek to escape these LIFO restrictions to gain in expressive power while retaining elegance and simplicity in the language. Insofar as they succeed, they can greatly improve engineering productivity and software quality.

A cost must be paid for this flexibility through the increased use of heap allocation for objects in the language. Yet the vast majority of objects obey a straight-forward last-in, first-out (LIFO) allocation semantics, and could profitably utilize stack allocation. One would therefore like to provide stack allocation for these objects, while reserving the more expensive heap allocation for those objects that require it. In other words, one would like an implementation which retains the efficiency of stack allocation for most objects, and therefore does not saddle every program with the cost for the increased flexibility--whether the program uses that flexibility or not.

The problem in providing such an implementation is in determining which objects obey LIFO allocation semantics and which do not. Much research has been done on determining these objects at compile time using various static methods which are specific to the particular type of object whose allocation is being optimized. Unfortunately, these techniques are limited, complex and expensive. We present a technique which acts more like a cache or a virtual memory, in the sense that no attempt is made to predict usage at compile time, but the usage is determined at run time. In other words, the system learns about the usage of objects "on-the-fly".

The major contribution of this paper is the recognition that a wide variety of stack-allocation optimizations are all instances of the same underlying mechanism--lazy allocation. Using this insight, we can simplify hardware architecture and language implementations by factoring the problem into two abstraction layers--language implementation using generic storage allocation, and the implementation of generic storage allocation using lazy allocation. Safety is also enhanced by the elimination of "dangling references" due to objects escaping a stack-allocated frame. While lazy allocation already provides for stack-allocation of arguments and temporaries, a programmer can extract even better performance by utilizing "continuation-passing style" to stack-allocate function results.

2. Stack Allocation is an Implementation Issue, not a Language Issue

The stack allocation of variables and contexts in higher-level compiled languages such as C, Pascal and Ada has been a fact of life for so many years since its introduction in Algol-60 that most programmers today assume that stack allocation is a language, rather than an implementation, issue. Stack allocation cannot be a language issue, however, since the most direct mapping of the nested lexical variable scopes in these languages is a tree, not a stack. Rather, stack allocation was a conscious decision on the part of these language designers to provide the implementation efficiency of a stack as a storage allocation mechanism even though this choice substantially compromised language elegance. We have also since learned that stack allocation of variables and contexts severely limits the expressiveness of a programming language. Indeed, many of the advances in programming languages after Algol-60 can be characterized as attempts to ameliorate the restrictions of stack allocation. For example, most of the functionality of Simula-67 could have been achieved in Algol-60 through the dropping of the stack allocation requirement along with the syntactic restrictions on procedure arguments and returned values.

The stack allocation of Algol-60 provided a major advance over Fortran's static allocation. Functions could be arbitrarily nested, and recursion became possible. So long as the basic values being manipulated were numbers and individual characters, stack allocation proved remarkably expressive. With the advent of dynamic strings of characters and larger dynamic objects such as arrays, stack allocation started to break down. PL/I used pointers and explicit allocation/deallocation to avoid the limits of stack allocation. Substantial arguments in the 1970's raged about the allocation of

24 ACM SIGPLAN Notices, Volume 27, No. 3, March 1992

function-calling and variable-allocation contexts--"retention" (non-stack allocation) versus "deletion" (normal stack allocation) [Berry71] [Fischer72]. Non-stack-allocation has become steadily more important as the sophistication and complexity of programs have increased. For example, in modern "object-oriented" programming style, most objects are heap-allocated rather than stack-allocated.

Unfortunately, heap-allocation is substantially less efficient than stack-allocation. As a result, programmers constantly seek to utilize stack allocation whenever possible. This change requires much work, because the source code changes required to change from one sort of allocation to another are substantial, even when the logic of the program has not changed. Most importantly, the programmer is required to use heap-allocation for entire classes of objects, even when the vast majority of these objects can be safely stack-allocated. This results from the difficulty in determining at design or compile time which of the objects can be safely stack-allocated, because it depends on a particular pattern of function calls, which in turn depends on the input data.

The use of different constructs and mechanisms for heap-allocated objects than for stack-allocated objects is therefore less productive for the programmer and less efficient at run-time. It is also an invitation to disaster, because using stack allocation inappropriately can cause a program to fail in a spectacular manner. We believe that the intertwining of an implementation mechanism (storage allocation) with a programming language mechanism (variable and function scoping) is confusing to the programmer and inefficient for the hardware. It is a violation of the design principle of providing "levels of abstraction", wherein each level can be understood in its own terms, and not require a detailed understanding of every lower level. We therefore describe a mechanism called "lazy allocation" which can be used to provide a uniform interface to the programmer, while taking advantage of stack-allocation as much as possible.

The simplification in language design and implementation permitted by lazy allocation through improved abstraction is important even if additional run-time efficiency on current hardware is not immediately forthcoming. The problem factorization by lazy allocation should allow for more efficient hardware architectures as well as language implementations, which will eventually translate into improved price/performance ratios.

3. The Model.

The standard model of a run-time system in a modern high level language consists of a traditional execution stack formatted into stack frames, each of which represents the execution state of a particular invocation of a particular procedure. The model may also include a heap, which contains objects that cannot be allocated and deallocated in a last-in, first-out (LIFO) manner. Typical languages using this model are Pascal, C and Ada (without tasks).

C and PL/I make most storage allocation and pointer management the responsibility of the programmer. In these languages, he is allowed to take the address of an object on the stack and store this pointer into an arbitrary variable. A common instance of creating references to local variables is when local variables are passed as arguments to another procedure by reference. Passing large objects by reference is usually cheaper than passing them by value, but by-reference argument passing can lead to a dangling reference if a reference into the stack is then stored into a variable having a larger scope, such as a global variable. A dangling reference can cause bugs if it is dereferenced outside of the stack object's lifetime. Thus, the power to take the address of an object on the stack can lead to efficient programs in which temporaries are stack-allocated, but this power can also produce bugs which are difficult to find.

Our model detects these objects whose references are about to escape from the lifetime of their enclosing stack frame, and relocates these objects into the permanent heap. This graduation from temporary to permanent status we call eviction. Due to the constant threat of sudden eviction, any access to such an object must be capable of detecting this movement so it can find the relocated object at its new address by following a forwarding pointer.1 The mechanisms for dealing with objects which can be unexpectedly moved are well known in the context of "on-the-fly", or "incremental", compacting garbage collectors [Baker78] [Lieberman83].

Once we have installed the machinery necessary to deal with objects which can suddenly move, we can contemplate the possibility of allocating everything initially on the stack. If we implement an "escape alarm system" to detect escaping references, we can use these alarms to trap to an eviction routine which will transport evicted objects into the heap. We call the policy of stack allocation followed by eviction traps "lazy allocation", because the system is too lazy to perform the work involved in heap allocation until this work is forced by the object's usage.

In the lazy allocation model, we have a global heap, a stack formatted into stack frames, and a small number of machine registers. We posit the existence of an ordering relation on object addresses which adheres to the following axioms:

1. The location address of an object in the global heap compares lower than the location address of an object in the stack.

2. The location address of an object in a frame near the stack top compares greater than the location address of an object in a frame near the base of the stack.

Lazy allocation will enforce the following rule:

Temporary Restraining Order -- No temporary object in the stack can be pointed to by an object whose location address compares lower; i.e., no object in the heap can point into the stack, and no object in the stack can point "up" the stack, only "down" the stack.

The Temporary Restraining Order (TRO) is enforced by checking every primitive operation which could possibly violate it. The only operation which can violate TRO is the storage of a pointer into a stack frame (or the heap) which compares lower than the object pointed at. In other words, when performing an assignment, we need only check that the ordering is preserved:

component(p) := q;  /* p,q pointers; assert(p >= q). */

Statistically, most component assignments in higher level languages will respect the TRO rule, because component assignments are usually used to initialize components of newer data structures to reference older ones.2 The main exceptions to this observation are assigning to (more) global variables and returning values. By assigning a pointer to a global variable, we are very likely to violate TRO because our lazy allocation will allocate everything on the stack.

Since most function returns result in binding a variable in the caller's frame, they can have the same effect on returned values as an assignment to a global variable--eviction. In some cases, this eviction can be avoided by having the caller allocate the space for the returned result in his own frame, and passing a reference to this space as an additional argument. This technique, which we call "caller result allocation", has been used since the early 1960's in Fortran, Cobol and PL/I compilers. There are two problems with caller result allocation. First, when the called routine attempts to initialize any components of this structure, these assignments may cause eviction of the components. Second, caller result allocation only works when the caller knows the size of the object being returned in advance. Nevertheless, it can be used to reduce the number of evictions of returned values.3

1 Forwarding pointers are only required for side-effectable objects; functional (read-only) objects are copied, but do not require the detection or following of forwarding pointers [Baker90b].

2 On architectures without special TRO-checking hardware, common cases like initialization can be easily recognized and optimized.

3 Functions in expression-oriented languages like Lisp will often return values which are never used when the function is called for its side-effects rather than for its returned values--e.g., print. An important optimization when using lazy allocation is to inform a called function when results are not expected, so that eviction of

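The TRO check and the transport it triggers can be made concrete with a small C sketch. This is a toy model, not the paper's implementation: the names (`rank`, `evict`, `set_car`), the fixed-size arrays standing in for the stack and heap, and the two-field CONS-like object are all invented here for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEAP_MAX  64
#define STACK_MAX 64

/* A toy two-component object, as in a Lisp CONS cell. */
typedef struct Obj {
    struct Obj *car, *cdr;   /* component pointers (may be NULL) */
    struct Obj *forward;     /* forwarding pointer once evicted  */
} Obj;

static Obj heap[HEAP_MAX];   static int heap_top  = 0;
static Obj stack[STACK_MAX]; static int stack_top = 0;

static int in_stack(const Obj *p) {
    return (uintptr_t)p >= (uintptr_t)stack &&
           (uintptr_t)p <  (uintptr_t)(stack + STACK_MAX);
}

/* Axioms 1 and 2: every heap address ranks below every stack address,
   and stack addresses rank higher toward the stack top. */
static ptrdiff_t rank(const Obj *p) {
    return in_stack(p) ? (p - stack) + HEAP_MAX : (p - heap);
}

/* Transport q -- and, recursively, any stack objects it references --
   into the heap, leaving forwarding pointers behind. */
static Obj *evict(Obj *q) {
    if (q == NULL || !in_stack(q)) return q;   /* already safe      */
    if (q->forward) return q->forward;         /* already moved     */
    Obj *h = &heap[heap_top++];
    q->forward = h;           /* install first, so the recursion terminates */
    h->car = evict(q->car);
    h->cdr = evict(q->cdr);
    h->forward = NULL;
    return h;
}

/* TRO-checked assignment: component(p) := q. */
static void set_car(Obj *p, Obj *q) {
    if (q != NULL && rank(p) < rank(q))  /* p would point "up": trap */
        q = evict(q);
    p->car = q;
}
```

Storing a stack object into a heap object trips the check and transports the stack object (and everything reachable from it) into the heap; a store that points "down" the order proceeds untouched.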
When a TRO violation is about to take place, the system executes a "transporter trap" [Moon84], which evicts the target object of the pointer from the stack into the heap. The process of relocating this object can cause recursive transporter traps, however, when pointers which are components of the object being relocated are discovered to also point into the stack. If relocation of the object, installation of forwarding addresses, and updating of pointer components are all performed in the correct order, this recursion will terminate. When the recursion terminates, not only has the object causing the first trap been relocated out of the stack, but so have all of the objects accessible from it. When the original trap has completed, the updated address is then used to complete the assignment operation which caused that trap.

Since forwarding addresses are left for every non-functional object which is evicted in this manner, references from machine registers and from objects still in the stack to the evictee can follow these forwarding addresses to find the moved object. A functional (read-only) object still in the stack does not need a forwarding address because the stack original is equivalent to the heap copy,4 and continues to function correctly until the frame is exited, at which point it is abandoned because no live references to it exist. Thus, the transporter traps caused by attempted violations of the TRO rule serve to relocate the objects so that the TRO rule is preserved, and so that the semantics of the objects themselves are also preserved.

Let us call a stack which does not violate TRO well-ordered. By preserving the Temporary Restraining Order, we have the following theorem:

Theorem. If a frame from a well-ordered stack is popped, and the registers contain no references to the popped frame, then no dangling references are created by the pop.

Proof by induction. The only dangling references could come from the registers, further down in the stack, or the heap. By hypothesis, the registers have no pointers to the popped frame. There is no higher stack frame, hence no references from the stack above. Due to transporter traps, there are no references from down the stack or from the heap. Therefore, the only references to the popped frame are from the popped frame itself, in which case the entire frame is garbage. QED.

Since the stack in our model is restricted to transient objects, it is like a "flophouse" hotel. Since such an address is not respectable, one does not give it out. In order to put down "roots", one must move to a more respectable address in the permanent heap.

4. Complexity

Stack allocation has recently come under attack as being slower than garbage-collected heap allocation under certain circumstances [Appel87a]. Since many of the benefits of lazy allocation are lost in this instance, we must first demonstrate that stack allocation retains its advantage under most conditions.

As [Baker78a] and [Appel87a] demonstrate, the cost of heap allocation can be reduced to approximately the same as stack allocation. This reduction is achieved by increasing real memory size (relative to the program) so that the number of garbage collections (which are really heap copies) is reduced. With enough memory, the amortized cost of garbage collection (heap copying) approaches zero, so that the cost of allocation (incrementing a pointer) becomes the dominant cost. Since the cost of stack allocation/deallocation is also the incrementation/decrementation of a pointer, the costs of heap and stack allocation become similar.

This reduction in execution time is only achieved, however, through a tremendous increase in the size of real memory. Using the model in either [Baker78a] or [Appel87a], the allocation time becomes independent of the memory size only when the fraction of space used is negligible--i.e., at least an order of magnitude smaller! Even with the exponential reduction in memory prices over the last 30 years, CPU prices have fallen even faster, resulting in an ever-increasing fraction of system costs tied up in memory. These trends indicate that the use of garbage collection to simulate stack allocation is a waste of resources. Stack allocation, on the other hand, operates with the same efficiency (for those objects having LIFO behavior) whether memory is nearly empty or nearly full. Therefore, the "space-time product" for stack allocation is better than that for garbage collection whenever the utilization of memory is greater than a small fraction.

The growing popularity of "generational garbage collectors" [Lieberman83] [Ungar84] [Appel89] is an indication that the efficient use of real memory is relatively important in most applications. Lazy allocation can be viewed as a variation on generational garbage collection in which individual objects are generations, and tenuring to the heap is immediate.

Because lazy allocation copies objects, it is an interesting exercise to compare the complexity of our model with that of a straight-forward copying garbage collector such as Cheney's [Cheney70]. We have shown in [Baker78a] that the amortized cost of allocation in a system with a copying garbage collector is linearly proportional to the amount of space being allocated; the constant of proportionality is a more complex expression depending upon the occupancy factor of the total amount of storage under management. Since we must usually initialize an object being allocated, and this initialization process is usually linearly proportional to the size of the object being allocated, the most we can hope for from our lazy stack model is a better constant of proportionality than in the uniform heap-allocation model.

Allocating an object on the stack is virtually the same process as allocating an object in a compact heap because it involves only a movement of the free space pointer. Deallocation of a stack frame is essentially free, since the stack frame need not be scanned. Well-ordering violations can occur, however, resulting in the relocation of objects from the stack. The total size of the relocated objects during a recursive transportation trap cannot exceed the current size of the stack, and once an object is relocated to the heap, it will not be relocated again.5

Thus, in the worst case, every object is allocated on the stack, and is then relocated to the heap. This is a worst case in storage because, since every object was allocated once on the stack and then relocated to the heap, the storage on the stack is no longer occupied (until the frame is popped). It is a worst case in time, because every object is allocated first on the stack and then relocated to the heap, when it could have been allocated directly on the heap in the first place. If the cost of moving an object exceeds the cost of building it directly on the heap, then our model will result in poorer performance than the pure compact heap model. (This analysis does not count the added costs of constantly checking for objects which may have moved, so that their forwarding pointers may be followed.)

The literature indicates, however, that while the worst case may occur, the most common cases are LIFO allocation of objects. If this is the case, then most objects will get allocated on the stack, and will become inaccessible before the stack frame is popped.6 As a result, most of these objects will never be evicted from the stack. Those objects which are evicted will require somewhat more effort to

unneeded results is not performed; this result expectation optimization is discussed in a later section.

4 If the language offers an address-comparison operator (e.g. Lisp's EQ) for functional objects, then forwarding pointers will be required. [Baker90b] argues for a more comprehensive and portable treatment of such an object identity predicate so that functional objects can be relocated without requiring the checking or leaving of forwarding pointers.

5 We here assume that there is no sharing of functional objects, since eviction unshares them due to the lack of forwarding pointers. This assumption is generally true, but if functional objects are to be highly shared, then forwarding pointers should be checked and left as if they were non-functional objects.

6 These statistics can be improved through the result expectation optimization, discussed later.
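The cost argument of Section 4 rests on the observation that stack allocation, like compact-heap allocation, is a single pointer movement, while popping a frame is one store regardless of how many objects the frame holds. A minimal C sketch of this discipline (the names `alloc_in_frame`, `frame_mark`, and `pop_frame` are invented here; real frames also hold saved registers and return addresses):

```c
#include <assert.h>
#include <stddef.h>

#define STACK_BYTES 4096
static unsigned char stack_area[STACK_BYTES];
static size_t sp = 0;                  /* the free-space pointer */

/* Allocation: round up for alignment and advance the pointer. */
static void *alloc_in_frame(size_t n) {
    void *p = &stack_area[sp];
    sp += (n + 7) & ~(size_t)7;        /* keep 8-byte alignment */
    return p;
}

/* A frame is delimited by a saved value of the free-space pointer;
   popping it is a single store, whatever the frame contains. */
static size_t frame_mark(void)        { return sp; }
static void   pop_frame(size_t mark)  { sp = mark; }
```

Deallocation never scans the frame, which is why objects that die before their frame is popped are reclaimed "without additional effort"; only the evicted minority pay the copying cost.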

get them relocated to the heap, but in these cases we do not begrudge the additional effort because 1) these objects are still accessible, and 2) we have saved so much time as a result of stack allocating the vast majority of objects, that we can afford to be more generous in these few instances.

The programmer himself usually has a very good idea about which allocated objects are likely to have stack-like extents, and which objects are not. If he knows that a particular object is almost certain to cause a well-ordering trap, he can allocate it directly on the heap himself, and save the system the time and expense of figuring this out for itself. The programmer can indicate this information either in the form of a declarative "hint", or by calling an allocation routine with a different name. In either case, the percentage of objects which can be successfully reclaimed by lazy allocation will be greatly increased.

By using "continuation-passing style"--discussed later--the programmer can extend the lifetime on the stack of values returned from functions, and thereby gain efficiency not easily obtained through more traditional optimizations. We conjecture that the majority of objects reclaimed early by "ephemeral" or "generational" garbage collectors are extremely temporary objects which could be more efficiently stack-allocated using continuation-passing style and lazy allocation.

Lazy allocation utilizes forwarding pointers to correctly preserve "object identity". Since we have assumed that the heap already utilizes forwarding pointers in its management, lazy stack allocation will exact no additional penalty. Brooks's forwarding scheme [Brooks84] can be used to advantage, since its overhead on a cached architecture is rather small. As we discuss later, functional objects require no forwarding pointers, in which case all overhead is concentrated into assignment operations.

In real architectures, there may be additional savings from our model that will not show up in raw instruction counts. Modern architectures have smaller, faster memory banks which cache the most heavily used memory locations. Between the effects of caching and paging, the amortized cost of cached memory references can be up to two orders of magnitude faster than the cost of non-cached memory references [Jouppi90]. Since our model attempts to keep everything on the stack as long as possible, it is likely to have more local behavior than a model which allocates objects directly on the heap.7 If the hardware memory system knows that this is a stack, additional optimizations can be made, such as not writing back popped stack frames to the backing store.

Another optimization possible with our model is the inlining of the allocation routine itself. In this case, the initializing information is copied directly into the appropriate locations. For example, a Lisp-like z:=CONS(x,y) is just "t1:=x; t2:=y; z:=addr(t1)", where t1,t2 are adjacent locations known to the compiler, and the compiler can therefore optimize out the incrementation of the allocation pointer. In a system without a lazy allocation routine, one would need additional register shuffling, followed by storage to an unknown location of memory, even when the allocation routine itself was inlined. Such a routine would be slower than our lazy allocation--perhaps as slow as an out-of-line procedure.

A new generation of shared-memory "symmetric multiprocessing" (SMP) machines is starting to be utilized for tasks requiring flexibly allocated storage. On such machines a global storage allocation routine must be protected by a lock to avoid inconsistency, and this lock can become a performance bottleneck. If the allocation responsibility is split up so that each processor has its own allocator, this bottleneck is removed. Lazy allocation elegantly achieves such a split, because each processor already has its own stack, and lazy allocation may reduce the allocation demand enough so that permanent allocations can be made using a global, shared allocator without undue contention.

5. Applications

A. Malloc/New

The most straight-forward application of lazy allocation is a version of C's malloc or Pascal/Ada's new which allocates the object within the current stack frame--expanding it if necessary. In other words, C's malloc becomes equivalent to the old alloca. With lazy allocation, one never calls the heap malloc/new directly, but leaves it to the eviction trap to call these routines. Since ephemeral objects will be deallocated automatically when the enclosing stack frame is exited, free (dispose) needn't do anything when handed a pointer into the stack. However, for a small cost, some error checking can be gained by marking the stack object as free, so that any access or attempted eviction will cause an error trap instead.

B. Functional Arguments

The correct implementation of functional/procedural arguments ("funargs" [Moses70]) in higher level languages is quite complicated. The creators of the DoD standard Ada language were so fearful of these constructs that they were summarily banned from the language [STEELMAN78, Requirement 5D]. In transmitting a functional/procedural argument, not only must a pointer to the executable instructions be passed, but also a pointer to the environment of the function. This is because a function defined at a lexical level other than the topmost level may refer to local variables of another function. These variables are not local to the function being passed, and they are referred to as "free" variables.

Some languages such as C finesse the free variable problem by not allowing functions to be defined at other than the topmost lexical level. In this case, all free variables are global, and since global variables can be statically allocated, their addresses can be embedded into the instructions so that an additional environment pointer need not be passed. Much expressive power is lost in the language by this restriction, however.

More powerful languages such as Algol, PL/I, Pascal and Lisp allow the definition of functions and procedures within inner lexical contexts, and therefore must package up pointers to both the code and the environment when passing a function/procedure as an argument. While this involves more work, these functional/procedural arguments are strictly more powerful than C-style functional arguments. True functional arguments are capable of simulating arbitrary data structures--at least within their lifetimes--and therefore the work necessary to allocate a new structure for their implementation is required.

Algol, PL/I and Pascal have carefully restricted the use of functional arguments to make sure that they can never escape the lifetime of the stack frame in which they were created. As a result of these restrictions, functional arguments can be safely allocated on the stack. Therefore, these objects are not "first-class". Part of the reason for these language restrictions has to do with dangling references. If a functional argument refers to objects on the stack, and the stack is popped to the point where these objects are no longer valid, then a dangling reference will be created which can cause nasty bugs.

The restrictions on the use of functional arguments in Algol, PL/I and Pascal are rather arbitrary and constrain the expressive power of the programming language, although not so much as the restrictions of the C language. Several modern languages like ML and Scheme offer "first-class" functional arguments, which can be returned from functions and stored in data structures. However, the efficiency impact of this freedom is normally quite severe. Not only must the functional arguments themselves now be allocated on the heap, but much of the normal variable-binding environment must also be heap-allocated. Only in this way can dangling references be avoided.

Lazy allocation can solve the problem of allowing functional arguments to be first-class, while keeping stack allocation for the vast majority of these objects that do not escape the lifetime of their creators. In the case of Lisp languages like Common Lisp and Scheme, the environment stack is kept separate from the control stack, so that lazy allocation "almost" works without any change. A

7 [Stanley87] reports a phenomenally large reference locality for stack caches, as opposed to other data references.

A problem which occurs in any language which offers both functional arguments and side-effects, however, is making sure that the object being side-effected is "the" object, rather than a copy of it. In particular, when one creates multiple funargs which share the same free variable, and one or both of the funargs side-effect this variable, it is important that the side-effect be visible to all of the funargs. The usual solution to this problem is to allocate a unique assignable "cell" in the heap for each such variable, which is then bound as the "value" of the variable; [Kranz88] calls this transformation assignment conversion.8 When multiple funargs are created which reference this variable, each will then reference the same cell. The Common Lisp example below exhibits this situation.

(let* ((x 3))                    ; Create cell x initialized to 3.
  ((lambda (y z)
     (let* ((ex x)               ; Get the current value of x (= 3).
            (ey (funcall y))     ; x <- 4
            (ez (funcall z)))    ; New current value of x (= 4).
       (list ex ey ez)))
   #'(lambda () (setq x 4) x)    ; Funarg with free var. x.
   #'(lambda () x)))             ; Funarg with free var. x.
=> (3 4 4)

Unfortunately, assignment conversion is very expensive, because it causes every variable to be bound to such a heap-allocated cell, unless it can be statically shown that either 1) the cell is never assigned to after initialization, or 2) the cell is never shared among funargs [Kranz86] [Kranz88]. Using lazy allocation, however, we can cheaply allocate an assignable cell for every variable in the same stack frame as the variable itself, and this cell will only be relocated to the heap when it tries to escape from that stack frame's lifetime.9 Thus, lazy allocation can be used for both the funarg environment itself, as well as the assignable cells, so that in the most common situations all of the allocations for funargs occur on the stack and are never relocated to the heap.

In the case of non-Lisp languages, where the environment stack and the control stack are usually merged, we must be more careful. The simplest solution would be to utilize lazy allocation on the stack frames themselves, which would work because any frame relocated into the heap would leave a forwarding address for any other object which pointed to it. Unfortunately, the eviction of one stack frame has the undesired effect of immediately evicting all of the stack frames nearer to the bottom of the stack as well. When using such a policy, the top stack frame's eviction causes the entire stack to be relocated into the heap. While there may be occasions where this massive relocation is desired (see the later section on Scheme continuations), this policy is usually overkill for funargs.

A better solution for implementing first-class funargs in non-Lisp languages would be to more closely follow the Lisp example. A funarg environment need only retain the bindings of the variables free in the function being passed, so the funarg environment could be a simple vector of these variable values. Of course, the same "assignable cell" indirection must be made for free variables in non-Lisp languages that we demonstrated above for Lisp. Since lazy allocation allows both the assignable cells and these environment vectors to be (initially) stack-allocated, we get excellent performance until a funarg is returned or stored into a global variable and becomes first-class.

C. Lazy Argument Evaluation

Lazy evaluation of the arguments of a function call is the deferral of the evaluation of the argument expression until the argument is actually "used". Lazy evaluation of arguments can be more expressive and efficient than so-called "applicative" evaluation if an argument is never used, because the argument expression will never be evaluated.

Lazy evaluation appeared in Algol-60 under the guise of "call-by-name", and was implemented by means of "thunks" [Randell64]. Thunks are quite similar to functional arguments, in that they have some executable code and an environment in which to execute the code. In fact, the semantics of Scheme's delay expressions are given in terms of functional arguments [IEEE-Scheme90].

As a result of the simple implementation of lazy argument evaluation using functional arguments, lazy allocation will work for lazy evaluation in the same way that it works for functional arguments. Most "lazy" arguments will be evaluated before their defining stack frame is exited, and those that have not, will most likely never be evaluated. However, those that do escape from their defining stack frame will be evicted to the heap, where they will evaluate correctly if the need arises.

The possibility of escaping Scheme delays suggests an optimization that may sometimes be beneficial. If a delay is about to escape, one may want to arrange for its immediate ("strict") evaluation. If the delay is functional, then the time of its evaluation cannot affect its value, yet this value may occupy less space than the delay itself (zero if the value is an "immediate" quantity such as a floating point number), thus improving heap utilization and performance. We call the evaluation order resulting from this optimization "downward lazy evaluation", due to its similarity to "downward funargs"--i.e., those functional arguments that obey stack ordering.

D. Argument "Lists"

Programming language designers have long been tantalized by the similarities between parameter lists and structured values ("records" or "structures"). For example, Ada utilizes nearly the same syntax for declarations of both, and nearly the same syntax for function calls and record "aggregates" [Ada83]. Lisp semantics presume the existence of a Lisp list of arguments [McCarthy65]. Yet nearly all language designers are forced to back down from unifying these two concepts due to the unacceptable loss of efficiency that would result. Making parameter lists into true first-class objects would force them to be heap-allocated, with the concomitant problems of determining when and how to deallocate them.

Lazy allocation offers the language designer the ability to unify record structures and parameter lists without the loss of efficiency. Every function and procedure can be elegantly defined as accepting just one parameter--a record structure. A function/procedure call constructs an argument object on the stack by creating a new instance of this structure initialized with the individual argument components. A pointer to this argument object is then passed to the function or procedure, where a "destructuring pattern match" of the argument is made using the parameter record definition as a pattern. Elegance and power are obtained in two ways: the capabilities of record structures are made available to argument objects, and the capabilities of argument objects (e.g., pattern-matched destructuring) are made available to record objects.10 Even more fascinating is the transfer of other argument-passing ideas to data structures--e.g., lazy evaluation becomes lazy component initialization [Friedman76].

In the most common case, argument objects are immediately destructured, and do not escape the lifetime of the called function or procedure, and therefore can be deallocated when the function/procedure returns. However, should the argument object become first-class, it is evicted from the stack, and is thereby retained when the function/procedure returns.

By giving argument objects a (potentially) first-class existence, some efficiencies can be obtained. For example, some recursive procedures pass on a number of arguments unchanged to successive recursions, where they are used only when the recursion terminates.

8 Erik Sandewall circulated a memo describing the same technique in 1974 [Sandewall74]; Greenblatt utilized this idea in the MIT Lisp Machine [Greenblatt74].
9 On a machine with a cache, the variable and its cell should be initially located in the same cache line, so that the additional indirection through the variable reference to the cell will execute as quickly as possible.
10 So long as these argument objects are functional [Baker90b], one can still pass argument objects through RISC architecture registers rather than memory, because functional objects can be transparently copied--even to a bank of registers.

By passing a pointer to a portion of the argument object instead of copying these arguments, some effort can be saved. Furthermore, if the number and/or size of these arguments are related to the depth of the recursion, an algorithm that is quadratic in complexity may be reduced to linear complexity [Dybvig88].

E. ANSI-C STDARG's

The programming language BCPL [Richards74], from which the C language descends, passed all arguments uniformly as a vector on the stack which could be accessed in exactly the same way as any other vector. This clean model depended upon a uniform size for all objects, and was therefore abandoned in C. Until the development of UNIX varargs and then ANSI stdargs, there was no portable way to pass an arbitrary number of arguments to functions, most especially variants of printf. These portable forms offer a "stream" access to the argument "list", and the receiver of the variable argument "list" sequentially reads this stream, and provides the type information necessary to decode it. These streams provide only data, and no pointers to these arguments are allowed.

The stdargs facility is currently the only facility that provides stack allocation of variable-sized quantities of storage in ANSI-C; ANSI-C is highly tuned to allow only fixed-size stack frames whose size is known by the compiler.11 With the demise of alloca, which allocated variable-sized objects on the stack, ANSI-C's restriction against taking the address of such an argument is particularly obnoxious, because some applications of alloca could have been (painfully) simulated using variable-sized stdargs.

Both stdargs and varargs are ugly and non-modular, and introduce notions not used elsewhere in C. Because C wants to charge all costs for variable-length argument lists to those functions which use them, these forms must not interfere with the passing of fixed-length argument lists in registers (the norm on RISC architectures).12 Because the simplest implementation of a stream is the incrementing of a pointer variable through a memory structure, the first step in most implementations of va_start is to store all of the register-passed arguments into memory, and then utilize simple pointer-stepping for all arguments. Indeed, given the fact that the argument-reading stream may be passed on to additional functions, it is difficult to conceive of a compiler smart enough to implement stdargs/varargs in any other way. The inability to create a pointer to a stdargs argument is therefore unreasonably restrictive, since it almost certainly resides in addressable memory. Presumably, the no-pointers restriction is meant to protect the user from the particularities of storage allocation of the storage needed for the variable-length argument list, which may be allocated by va_start and deallocated by va_end. Such protection is out of character for C, since no such protections exist for other stack and heap-allocated objects. Most implementations allocate the storage needed for va_start on the main C stack (using a version of the now-banned alloca), however, in which case the va_end is extraneous. Allocating storage for va_start in any other place is almost certain to run afoul of longjmp, which must then decode the stack and execute va_end for each stack frame involving stdargs for which va_start was executed, but va_end was not.

A lazy allocation mechanism could dramatically simplify the stdargs device of ANSI-C. A new, first-class polymorphic "stream" could be defined which could be opened and read sequentially from a variable-length argument list. Since this data structure would exist in the stack frame storage of either the calling or the called subprogram, no restrictions on taking addresses would be needed. If any pointers survived, the targeted objects would be automatically moved into the heap. Reading such a stream would produce a truncated stream consisting of the rest of the list. If any stream ("ap") pointers survived, then the "rest" of the stream (including objects it referred to) would be relocated to the heap.

F. Common Lisp &REST Arguments and Scheme ". z" Arguments

The semantics of the Lisp language were originally defined by a "meta-circular" interpreter which created actual Lisp lists of evaluated arguments as part of its evaluation of function application [McCarthy65]. While most modern Lisp implementations put evaluated arguments onto a stack instead of a list, Common Lisp and Scheme retain one vestige of the original Lisp evaluator--the &REST argument. Both Lisps allow for the passing of an arbitrary number of arguments to a function, but the called function must somehow be capable of addressing these arguments. Common Lisp uses normal positional matching for the first several arguments, and has a keyword-matching capability, but functions with a large number of relatively homogeneous arguments such as "+" are most elegantly handled using a &REST parameter. The semantics of the &REST parameter are that it is bound to a Lisp list of the arguments remaining after all required and optional parameters have been bound. Scheme does not have keyword arguments, but does allow the equivalent of Common Lisp's &REST parameter, which is denoted by putting a symbol in the last "cdr" position of the parameter "list", which is therefore an improper list.

Unfortunately, the creation of first-class Lisp lists for handling these &REST parameters is quite expensive. Yet to be safe, a true list must be constructed for the &REST parameter, since the called function may do anything with this list it likes, including returning it or side-effecting it. For example, Lisp's LIST function itself has the trivial definition where it simply returns its &REST argument:13

(defun list (&rest args) args)

Since there are many reasons for passing variable numbers of arguments to a function, and only a few of them involve creating a list, it is unfortunate that Lisp forces a list allocation for this common situation.

The Lisp Machines derived from the MIT Lisp Machine [Greenblatt74] actually do format their argument lists on the stack to look like Lisp lists so that they can be passed as &REST arguments and traversed using the normal CAR and CDR functions. However, these argument lists are not first-class Lisp lists, because the lists so created have the same lifetime as the enclosing stack frame. Therefore, while these &REST arguments can be passed down the stack, they can never be stored into the heap, returned past their creation point, or RPLACD'ed. However, because they are formatted as lists, they can be passed to other functions as lists--e.g., as the last argument to APPLY--and thereby avoid a quadratic explosion of copying [Dybvig88].

In our lazy allocation model, however, all CONS'ing is first performed on the stack, so there is no additional penalty for CONS'ing &REST arguments. Furthermore, using lazy CONS'ing, &REST lists are truly first-class lists, since they are created using exactly the same mechanism that is used to create any Lisp list.

(As an aside, we point out that if Lisp argument lists are to be constructed so that the CDR pointers always point towards the base of the stack, and if each argument is to be inserted when its list cell is allocated, then this virtually requires that arguments to a Lisp function be evaluated in reverse order of appearance.)

11 ANSI-C doesn't strictly require stack allocation (on the main C stack) of variable-size argument lists, but any other implementation must deal with signals and jmp/longjmp, which require main-stack-allocation semantics.
12 Curiously, arguments passed in registers are required to be stored into memory upon entry to a function, unless the corresponding parameter is declared with a storage class of register (which is not the default). An optimization not always performed is to store the argument only if the address-of ("&") operator is ever applied to the parameter; this usage can easily be detected by the compiler.
13 In Scheme, this becomes (define (list . z) z).

G. Tail Recursion

Tail recursion is a Lisp optimization that was elevated in the Scheme dialect into a requirement. By requiring that a tail-recursive routine called to a depth of n is allowed to use only O(1) amount of control stack, Scheme can simulate iterative control structures in a storage-efficient manner without a distinct iteration construct. Typical implementations achieve this by reusing the stack frame on a tail-recursive call. The introduction of lazy allocation requires a new understanding of the meaning of a "tail recursion optimization", since any allocation performed during such a loop will increase the size of the stack frame which is being reused, potentially allocating an unbounded amount of stack space. One interpretation is that such a program utilizes additional storage during its execution, and therefore isn't "really" iterative at all. Another interpretation is that the Scheme tail recursion requirement is unreasonable, since a lazy allocation implementation utilizing arbitrary storage space may be more efficient (within its storage limitations) than a more strict stack frame reusing strategy, and the Scheme requirement makes the programmer "subvert" the compiler in order to achieve his wish. A default interpretation is that a tail recursion "optimization" disables lazy allocation, and forces allocation directly into the heap.

H. Scheme Continuations

Scheme is a dialect of Lisp which has a very interesting construct called a continuation. A continuation is a functional argument that embodies the "rest of the computation". When a continuation function is called with an argument, it does not act like a normal call, but instead returns from a previous expression evaluation. Normally, when a function is called, the arguments are evaluated, an argument list is constructed, the caller is suspended, and control is transferred to the callee. The callee creates a new frame on top of the stack and starts execution. Eventually, the callee returns with a returned value. A return is usually implemented by saving the returned value in a register, popping the callee's frame from the stack, and continuing execution at the point in the caller's code where the callee was originally called. During the execution of the callee, the suspended caller can be considered a kind of functional argument, which, if called, would execute the "rest of the computation". This "function" even takes an argument--the value to be returned from the callee. This "function"'s main fault is that if it is called, it will never return. We designate this theoretical "function" the continuation of the suspended caller program.

In a traditional stack implementation, the continuation has more than a passing resemblance to a functional argument. It consists of a pair of values: the point in the program text where execution will resume, and an environment--the current stack--in which to interpret lexical variable occurrences in the program text. If we could somehow package this continuation into a "real" functional argument, then we could simplify the notion of function calling by always including a continuation argument, and no longer including the return "PC" and the return stack-pointer as integral parts of the function calling sequence. In fact, the "jump to subroutine" operation of most modern computer architectures can be viewed as an optimization of the sequence "push continuation argument; jump to the beginning of the called function".

In the most primitive Scheme model, then, there are no returns from functions, only calls to continuations. The basic difference between a normal function and a continuation is that calling a normal function will push onto the stack, while calling a continuation will pop from the stack.

Traditional stack implementations of traditional languages work correctly, because under normal conditions all continuations created during the execution of a program have strict LIFO allocation/deallocation behavior. In other words, the continuation (return point, stack pointer) does not escape the lifetime of its creator, and therefore no dangling references are created. In the Scheme language, however, since a function callee can gain access to his continuation through a special construct (call/cc), this continuation can escape the lifetime of its creator and become a first-class object. (ANSI-C [ANSI-C88] also defines the operations setjmp and longjmp, which allow for the "capture" and application of a continuation which is not first-class.)

If a system utilizes lazy allocation, then stack frames will remain on the stack, so long as LIFO allocation/deallocation behavior is observed. If a continuation attempts to escape its creator's lifetime, however, its stack frame will be evicted from the stack, which will recursively cause all lower stack frames to also be evicted.14 An implementation of a language which must deal with the possibility of a stack frame suddenly being moved can be quite inefficient, because virtually every access to the stack frame must check for the existence of a forwarding pointer. Functional stack frames--which may still point to assignable local variables--can relax this restriction by allowing copying without forwarding.15 We can thus obtain a behavior analogous to that of many current implementations of Scheme which copy the entire control stack to the heap when a continuation is captured. Of course, assignable local variables cannot be copied, but must be relocated in order to retain their shared semantics.

We do not contend that lazy allocation solves all problems in the high performance implementation of Scheme continuations, and experience may prove that Scheme continuations are better managed with more specialized techniques. Nevertheless, it is interesting that the generic lazy allocation model faithfully captures the behavior of several existing Scheme implementations, without requiring any special handling.

I. Function Results and the Result Expectation Optimization

As we have pointed out above, lazy allocation is not lazy when it comes to result values. This is because results must be returned "up" the stack, usually by assigning them to a temporary value in the caller's frame. This is unfortunate, because the efficient handling of results is as important as the efficient handling of arguments. The efficient allocation of results, however, is a difficult problem.

The first optimization for reducing result eviction we call result expectation. Since many functions are called "for side-effects" rather than "for result value" in expression-oriented languages like Lisp, the caller should notify the function of this expectation so that unneeded function results can be thrown away instead of evicted.16 Other functions are called for their result, but only 1 bit of result information is actually used--whether the result matches a distinguished value (nil in the case of Lisp, 0 in the case of C). In such cases, we need preserve only the single bit of result information actually needed, to avoid the eviction of large structures which are already mostly garbage.

Another optimization for avoiding the heap-allocation of function results is for the caller to allocate space for the result and pass this space by reference; this technique, which we call "caller result allocation", has been used since at least the early 1960's in Fortran, Cobol and PL/I compilers. Eviction upon callee return is avoided since the callee no longer performs allocation. While this technique is widely used in programming language implementations, it depends upon the ability of the caller to guess the correct size of the result, and it forces the called routine to use side-effects to communicate its result information. When the size cannot be guessed--e.g., in the case of some Ada unconstrained array results--this method fails, and heap-allocation must be used. Heap allocation, however, runs the risk of "storage leakage" in non-garbage-collected language implementations if an error or other non-local transfer of control fails to deallocate this storage when the stack is contracted.

Another method for avoiding the heap-allocation of function results is for the result to be allocated in the stack, but to redefine the caller's stack frame to include this result before returning to the caller.17

14 Lower stack frames will be evicted only if they have not already been evicted; this solves the multiple stack copy problem mentioned in [Hieb90].
15 Stack frames in Scheme may have assignable slots even when no side-effects are performed by the programmer; this behavior is a result of the ability of Scheme continuations to be resumed many different times.
16 Result expectation is the run-time analogue of a classic compiler optimization used in expression-oriented languages like Lisp.

This scheme is similar to caller result allocation, except that the caller conceptually passes the entire "rest-of-the-stack" as the result reference, which is then chopped back to its actual size before being returned. This scheme also works, and has been used in some Ada compilers [Sherman80], but can waste arbitrary amounts of space on the stack if the process is iterated. Consider, for example, a recursive program which allocates a result at the bottom of the recursion. This result, and all the intervening space, will become part of the caller's stack frame, even though most of this space is no longer used. Since one knows the current extent of the stack, one could conceivably copy the result back to "close up" the space, but if this process is iterated, a quadratic explosion of copying could result [Dybvig88]. Therefore, the non-lazy eviction of a result value to the heap just once can be more efficient than trying too hard to keep the result on the stack.

Unlike previous schemes for redefining the stack frame [Sherman80], we suggest that whether the object is relocated to the heap or kept as part of the caller's stack frame should be the choice of the caller as part of his result expectation. In other words, the result expectation code to be included in every function call consists of at least the following two bits of information:18

ResultCode  Action
00 - don't return a result (used for non-last position of "progn")
01 - nil/non-nil as result (used for boolean position of "if")
10 - evict result if necessary (normal lazy allocation operation)
11 - don't pop frame (redefine caller's frame to include result)

The result expectation code inherited by a function from its caller is used only when the function attempts to return a result (in Lisp, when it executes a function in the "tail-call" position); the rest of the time, it computes its own result expectation code (perhaps dependent upon the caller's expectation code) when calling out to other functions. By propagating result expectation codes, the result consumer can inform the result producer of its wishes regarding the allocation of this result. A programmer or a compiler can therefore use result expectations to avoid evictions when the amount of wasted space is calculated to be within acceptable limits.19

J. Extending Result Lifetimes and Multiple Return Values

There is another method for allocating function results on the stack which will not cause immediate eviction. This method depends upon a source-to-source conversion of a program called "continuation-passing-style conversion" (CPS conversion). In converting to CPS, a program is turned inside out so that returns and returned values, which can be viewed as implicit calls to continuations, are converted into explicit calls on explicit continuations with the values as arguments. While the CPS form of a program will execute and produce the same answer as the original program, its execution on a traditional stack implementation will use considerably more stack space. Since there are no longer any returns, neither are there any stack pops, at least until the very end of the program, and so the amount of stack used can be substantial.

The conversion to CPS form has certain benefits, however. Since the stack is retained until the very end of the program, there is never any possibility of dangling references for stack-allocated objects. Therefore, the CPS form of the program can execute correctly even when the original form of the program would have failed due to some stack-allocated object leaving the scope of its creator. In other words, CPS style offers a "retention" rather than a "deletion" strategy for stack frames [Fischer72].

The "continuation-passing style" of programming thus offers new flexibility to the programmer who wishes to utilize stack-allocation whenever possible. If he calls a function with a stack-allocated argument which could then become part of that function's returned value, he is likely to get a dangling reference without lazy allocation, or cause the eviction of a large structure with lazy allocation. However, he can postpone the eviction for a while by calling that function with an explicit continuation which will accept the "returned" value and continue executing without popping the stack or causing any evictions.

The most trivial example of all is the Lisp CONS function itself, which allocates a list cell. If implemented as a true Lisp function in a system using lazy allocation, the list cell would be allocated on the stack and initialized with its "car" and "cdr" components. As we have already pointed out, however, returning a value typically causes its eviction (and the eviction of its components). Therefore, although CONS tries to be lazy, the effect of returning the newly allocated object causes its eviction to the heap, so our CONS isn't lazy after all! If we call this CONS with an explicit continuation, however, within which the newly allocated list cell is manipulated in a normal fashion, then the cell is not immediately evicted, and remains lazy.

Below is such an implementation of a lazy CONS in C.20

void lazy_cons(x, y, cont)
int x; list y; void cont(list);
{struct {int car; list cdr;} z;  /* The cons cell. */
 z.car=x; z.cdr=y;               /* Initialize the lazy cons cell. */
 cont(&z);}                      /* Give cont ptr. to cons cell. */

The use of continuation-passing-style allows the programmer himself to choose whether allocation will be lazy or not. In this way, he can use his greater knowledge of the program behavior to avoid unnecessary evictions, but also avoid the creation of large amounts of garbage in the stack.

Continuation-passing style has yet another benefit. Unlike normal function-application notation, continuation-passing style can deal with multiple returned values. For example, a Euclidean division algorithm "function" can return both a quotient and a remainder. In such a case, the "continuation" function must utilize more than one parameter in order to receive all of the results. Common Lisp also provides a number of forms to handle "multiple values", but does not utilize continuation-passing style for their implementation. Common Lisp multiple values are not strictly necessary, as all of the benefits of multiple values can be achieved through the composing of multiple values into a Lisp structure which can then be decomposed by the user of the function. In order to save the time and garbage collection required to compose and decompose these structures, however, Common Lisp uses a special mechanism to provide multiple values. We show that the benefits (including stack allocation) of Common Lisp's multiple-value mechanism can be simulated through a new first-class "multiple-value" structure type together with lazy allocation.21

(defstruct multiple-value
  (values nil :read-only t))

(defun values (&rest args)
  (make-multiple-value :values args))

17 Although values are preserved on the stack, exited stack frames are spliced out of the call-chain so that unwind-protect's in exited frames are not inadvertently executed.
18 The MIT Lisp Machine function call instruction uses a similar coding (without lazy allocation) for its "destination operand" [Greenblatt74]; however, the callee does not utilize this information to avoid allocating useless results!
19 The "continuation-passing style" (CPS) of programming, as discussed in the next section, can achieve the same allocation behavior as result expectation, but with greater overhead. Furthermore, one cannot use CPS on code for which one does not have the source--e.g., library code--so result expectation is to be preferred.
20 Due to the lack of full function closures in C, we would have trouble actually using this CONS in any serious way.
21 This multiple-value structure should be functional [Baker90b] to achieve the maximum benefits of lazy allocation.

    (defun multiple-value-call (fn &rest args)
      (apply fn
             (mapcan
               #'(lambda (arg)
                   (if (multiple-value-p arg)
                       (multiple-value-values arg)
                       (list arg)))
               args)))

To provide the programmer with the retention benefits of CPS without the requirement of turning his source code inside-out, we define the CONTCALL special form. The semantics of CONTCALL can be defined as follows:

    (defun contcall (continuation fn &rest args)
      (multiple-value-call continuation (apply fn args)))

In other words, CONTCALL applies the function to the arguments, and then applies the "continuation" function to this result. The implementation of CONTCALL is special, however. Whereas the stack would normally have been contracted after the execution of (apply fn args), it is not, so that the result(s) of this application is (are) left on the stack for the application of continuation.

Using CONTCALL, we can then define Common Lisp's MULTIPLE-VALUE-BIND special form, whose purpose appears to be the extraction of multiple values from a function call without causing any extraneous heap allocation.

    (defmacro multiple-value-bind (vars (fn . args) &body body)
      `(contcall #'(lambda ,vars ,@body) ,fn ,@args))

Of course, once we have lazy allocation and CONTCALL, we no longer need to clutter up the Lisp language with "multiple values", since the "non-consing" benefits can already be achieved without multiple values. CONTCALL can also be used for a definition of a LET which extends the stack so that the variables are bound to stack-allocated values. Such a stack-extending LET is usually the intention of the programmer; this is the motivation for proposals for a "dynamic-let" [Queinnec88], but without causing failure if the values escape the scope of the allocation.

    (defmacro dlet ((var (fn . args)) &body body)
      `(contcall #'(lambda (,var &rest ignore) ,@body)
                 ,fn ,@args))

Below, we show how to program a complex division routine which allocates all intermediate results on the stack.

    (defun cdiv (z1 z2)
      (dlet ((z2bar (conjugate z2)))
        (dlet ((z2norm (ctimes z2 z2bar)))
          (dlet ((z1z2bar (ctimes z1 z2bar)))
            (dlet ((rz2norm (realpart z2norm)))
              (complex (/ (realpart z1z2bar) rz2norm)
                       (/ (imagpart z1z2bar) rz2norm)))))))

Using continuation-passing style to extend the life of stack-allocated objects can be used for more substantial applications. For example, the storage needed for the intermediate results in a chain of matrix multiplications can be allocated in this fashion, so that only the final product matrix becomes a first-class heap object.

K. "Functional" Data Structures

So far, our lazy allocation model has utilized strict "relocation" semantics in order to preserve the "object identity" of allocated objects. In this semantics, there is only one "true" location for each object, but this location can sometimes change. For "functional" data structures--data structures which cannot be side-effected--we can relax the strict "relocation" semantics and utilize "copying" semantics. This is because the behavior of a functional data structure is determined by the values of its components, and since they cannot be changed, a copy of the data structure having the same components will have the same behavior. (A more thorough treatment of object identity for functional objects can be found in a companion paper [Baker90b].)

Copying semantics for functional objects can have some benefits in our lazy allocation model. When a functional object is copied, one need not necessarily leave a forwarding address, since the original is as good as the copy. Since forwarding addresses must be detected and followed during execution, the cost of detection and following may be more than the costs of copying. At the hardware level, the installation of forwarding pointers also causes a cache write-back, which adds additional load to the memory system. Thus, for small or unshared functional objects the cost of copying is less than the cost of storing, checking and following forwarding pointers.

Copying semantics can be exponentially less efficient than relocation semantics if substantial substructure sharing occurs, however. A simple linear Lisp list of length n in which the CAR of each list cell is assigned to be the same as its CDR has been called a "blam list" [McCarthy65] because it explodes into a structure of 2^n list cells when it is functionally copied. The only exponential blowup of this type we have observed occurs in Macsyma's representation of the determinant of an nxn matrix in O(n^3) cells; this structure expands into an expression with O(n!) terms. On the other hand, the extended size of a functional data structure is constant and can be computed incrementally as it is constructed. The information needed to make a copy/no-copy decision can therefore be gathered cheaply at run-time.

There are a number of "functional" structures even in imperative languages. Argument lists and functional arguments are usually side-effect free. In some languages, strings cannot be modified by side-effects. ANSI C offers the const qualifier. Scheme continuation structures, being similar to functional arguments, are side-effect free, although they may have pointers to non-functional objects. The various kinds of numbers in Common Lisp are functional--even large objects like infinite precision integers and structured objects like complex floating point numbers. The "multiple values" structures returned from Common Lisp function calls are also functional, even though they are not first-class Common Lisp objects.

The functionality of these objects--at least their top level structures--accounts for many of the other variations on lazy allocation. Thus, while [Steele77] used lazy allocation for integers and floating point numbers, it did not have to leave or check for forwarding addresses because these numeric objects were functional. Similarly, lazy allocation implementations of functional arguments and continuations do not bother to leave or check for forwarding addresses because there is very little potential sharing, and evictions happen very rarely, so copying is not a problem.

Due to Common Lisp's insistence upon the use of true Lisp lists for &REST arguments, however, one cannot legally use copying semantics for these objects, because Common Lisp list cells can be side-effected. As a result, the lazy allocation of &REST arguments is less efficient than if &REST arguments were based on a functional sequence structure instead of non-functional list cells. Scheme's requirement that &REST lists be always copied is just as bad, because the majority of such lists are functional and could otherwise be shared and thereby avoid a quadratic explosion of copying in deeply nested recursions [Dybvig87].

6. Future Work--An Incremental Model

The model as described above is not particularly incremental, because a transporter trap in a deeply nested stack frame could cause an unbounded number of objects to be copied before returning to the execution of the program. One can view the copying effort involved in eviction as the effort which was deferred by lazy allocation, and has suddenly come due. While this effort might still be less than the amount of effort saved by using lazy allocation, it is time that is not easily interrupted, and can therefore cause problems in a real-time system.

A more incremental system would evict objects from a stack frame just before it is popped, and would evict only the "top level" of those objects. Unfortunately, this sort of a system leads to great complications. In such a system, a stack frame must now be scanned before popping, in order to evict any remaining objects. However, unlike the non-incremental scheme where we could inductively prove that no pointers to the stack frame exist at the time of popping, the incremental scheme has no such property. If there are live objects remaining in the stack frame, then ipso facto there must be live

pointers. Unfortunately, we do not know where those pointers are, so we cannot update them when the objects are moved out of the stack frame.

The only solution is to follow a technique invented by Bishop [Bishop77] and used by Lieberman and Hewitt [Lieberman83]. We use a separate entry table to keep track of those pointers which violate a stack's well-ordering. This entry table initially starts out empty, but when a stack frame is cleared of live objects, some of these objects may continue to point into other stack frames. These pointers are routed indirectly through the entry table. When the next stack frame is to be cleared, the entry table is searched for objects entering the stack frame, and those objects are then relocated. In this way, we can incrementalize the eviction process.

Our incremental scheme is nearly equivalent to Lieberman and Hewitt's generational garbage collection, in which the global heap and each stack frame are separate generations. Unlike Lieberman and Hewitt, however, who would move a result object through every intermediate generation, we move such objects directly to the oldest generation--the global heap--in order to avoid a quadratic explosion in copying effort [Dybvig88]. Our policy is similar to Ungar's tenuring policy [Ungar84], which also avoids the copying of long-lived objects through the intermediate generations. The address ordering relation, the relocation process and the manipulation of the entry tables in our scheme are all identical to those of Lieberman and Hewitt, however.

7. Conclusions and Previous Work

We have shown how a general model called lazy allocation can simply and elegantly explain many traditional programming language optimizations aimed at increasing the fraction of storage allocations that can be performed on a stack. Lazy allocation requires the ability to deal with objects which can be suddenly moved, but once this cost has been paid, lazy allocation can result in great simplifications in other parts of a language implementation. Lazy allocation puts most of its overhead burden on assignments, which makes it attractive for the "mostly functional" programming styles of modern expression-oriented languages. Lazy allocation also has benefits in shared-memory multi-processor environments, where the potential bottleneck of a global allocator is shielded by lazy stack allocation from the bulk of the allocation load.

We have also described a new run-time technique called result expectation, which informs called functions of what results are expected and where they should be put, so that unexpected results need not be heap-allocated. While interesting in its own right, result expectation works with lazy allocation to reduce the number of evictions of function results from functions called for their effect rather than for their result.

The concept of lazy allocation is the result of 10 years of pondering the possibility of "backing up"--under certain circumstances--the allocation pointer of the author's real-time garbage collection algorithm [Baker78a], in order to improve its amortized performance. The single-bit reference count [Wise77] for stack-allocated objects can be subsumed by address ordering, yielding the current concept of lazy allocation.

Due to the ubiquity of the problem, the literature on stack-allocating various kinds of objects is so large that we can reference only a small fraction. The stack allocation of variable binding environments encompasses the Algol-60 display [Randell64], Lisp's binding environments [Greenblatt74], Lisp's shallow binding [Baker78b], and Lisp's cactus stacks [Bobrow73]. The stack allocation of functional objects like numbers is discussed in [Steele77] and [Brooks82]. "Dynamic extent objects" [Queinnec88] have been proposed as part of the EuLisp standard. Our lazy allocation completely subsumes dynamic extent objects, and our trapping for the purpose of eviction is no more expensive than trapping to determine lifetime errors.

The deletion (stack-allocation) versus retention (heap-allocation) implementation strategies for Algol-like compiled languages have been studied by [Berry71] [Fischer72] [Berry78a] [Berry78b] [Berry78c]. [Blair85] describes an optimistic stack-heap, which is approximately our lazy allocation applied to stack frames; unlike our lazy allocation, however, the optimistic stack-heap is not used for user-allocated data. In other words, these models do not separate the issues of language implementation (frames) from storage allocation (stack allocation).

The stack allocation of functional arguments has been studied by [Johnston71], [Steele78], [McDermott80] and many others. Johnston [Johnston71] is said to have used the term lazy contour, which is a close approximation to our lazily allocated stack frame. The stack allocation of continuations has been studied by Steele [Steele78], Stallman [Stallman80], Bartley [Bartley86], Clinger [Clinger88], Danvy [Danvy87], Deutsch and Schiffman [Deutsch84], Dybvig [Dybvig87], Kranz [Kranz86] [Kranz88], [Moss87], [Hieb90] and many others. The straightforward application of lazy stack allocation to Dybvig's heap model [Dybvig87] yields a large fraction of the optimizations he performs by hand; the lazy allocation of continuations also avoids the multiple stack copies mentioned in [Hieb90]. Deutsch and Schiffman use the term volatile for lazy stack frames, and stable for evicted stack frames. Dybvig [Dybvig88] is apparently the first to have pointed out in print that the consistent copying of a large argument to successive levels of recursion can convert a linear algorithm into a quadratic one.

The concept of lazy allocation was almost discovered by Lieberman and Hewitt [Lieberman83], since they had all of the necessary machinery. However, the additional concept of "genetic order" [Terashima78] was missing. McDermott discovered a form of lazy allocation [McDermott80] for implementing the variable-binding environments used in a lexically-scoped Lisp interpreter, and he also indicated its possible use for managing Scheme continuations. [Morrison82] is simply lazy allocation applied to the consing performed in [Baker78b]. [Mellender89] implements Smalltalk with a scheme based on the same concepts as "lazy allocation", but with substantially greater complexity. Tucker Taft [Kownacki87] [Taft91] independently developed for the Ada-9X language the idea of a run-time "scope check", with a user-defined copy-to-heap if required; this excellent proposal was unfortunately later withdrawn.

Stallman's phantom stacks [Stallman80], which were invented to implement Scheme on the MIT Scheme Chip [Steele79], are an interesting alternative for solving the same kinds of stack allocation problems as lazy allocation. In phantom stacks, objects are stack-allocated in the same manner as in lazy allocation. The difference between the two models comes when LIFO order is violated. In lazy allocation, we evict objects from the stack to the heap, while in phantom stacks, a new stack is initiated and the old stack is abandoned in place, where it becomes a passive set of objects in the heap. Thus, lazy allocation and phantom stacks are duals of one another: lazy allocation moves objects from the stack, while phantom stacks move the stack from the objects. [Hieb90] rediscovered phantom stacks and gives an analysis which is more appropriate for the execution of Scheme on a modern RISC processor.

Both Prolog [Warren83] and Forth [Moore80] make more extensive use of stacks than do traditional Lisp implementations, and gain substantially in elegance and speed as a result.

8. Acknowledgements

We wish to thank Dan Friedman, Andre van Meulebrouck, Carolyn Talcott and the referees for their helpful suggestions and criticisms.

9. References

Aho, A.V. "Nested Stack Automata". JACM 16,3 (July 1969), 383-406.
ANSI-C. Draft Proposed American National Standard Programming Language C. ANSI, New York, NY, 1988.
Appel, Andrew W. "Garbage Collection Can Be Faster Than Stack Allocation". Info. Proc. Lett. 25 (1987), 275-279.
Appel, A., and MacQueen, D.B. "A Standard ML Compiler". ACM Conf. Funct. Prog. & Comp. Arch., Sept. 1987.
Appel, Andrew W., Ellis, John R., and Li, Kai. "Real-time Concurrent Garbage Collection on Stock Multiprocessors". ACM Prog. Lang. Des. and Impl., June 1988, 11-20.
Appel, Andrew W. "Simple Generational Garbage Collection and Fast Allocation". SW Prac. & Exper. 19,2 (Feb. 1989), 171-183.
Baker, Henry G. "List Processing in Real Time on a Serial Computer". CACM 21,4 (April 1978), 280-294.

Baker, Henry G. "Shallow Binding in Lisp 1.5". CACM 21,7 (July 1978), 565-569.
Baker, Henry G. "Unify and Conquer (Garbage, Updating, Aliasing, ...) in Functional Languages". Proc. 1990 ACM Conf. on Lisp and Functional Progr., June 1990, 218-226.
Baker, Henry G. "Equal Rights for Functional Objects". Submitted to ACM TOPLAS, 1990.
Barth, J. "Shifting Garbage Collection Overhead to Compile Time". CACM 20,7 (July 1977), 513-518.
Bartley, D.H., and Jansen, J.C. "The Implementation of PC Scheme". ACM Lisp & Funct. Prog., Aug. 1986, 86-93.
Berry, D.M. "Block Structure: Retention vs. Deletion". Proc. 3rd Sigact Symp. Th. of Comp., Shaker Hgts., OH, 1971.
Berry, D.M., et al. "Time Required for Reference Count Management in Retention Block-Structured Languages, Part 1". Int'l. J. Computer & Info. Sci. 7,1 (1978), 11-64.
Berry, D.M., et al. "Time Required for Reference Count Management in Retention Block-Structured Languages, Part 2". Int'l. J. Computer & Info. Sci. 7,2 (1978), 91-119.
Berry, D.M., and Sorkin, A. "Time Required for Garbage Collection in Retention Block-Structured Languages". Int'l. J. Computer & Info. Sci. 7,4 (1978), 361-404.
Bishop, P.B. Computer Systems with a Very Large Address Space and Garbage Collection. Ph.D. Thesis, TR-178, MIT Lab. for Comp. Sci., Camb., MA, May 1977.
Blair, J.R., Kearns, P., and Soffa, M.L. "An Optimistic Implementation of the Stack-Heap". J. Sys. & Soft. 5 (1985), 193-202.
Bobrow, D.G., and Wegbreit, B. "A Model and Stack Implementation of Multiple Environments". CACM 16,10 (Oct. 1973), 591-603.
Bobrow, D.G., et al. "Common Lisp Object System Specification X3J13". ACM SIGPLAN Notices 23 (Sept. 1988); also X3J13 Document 88-002R, June 1988.
Boehm, H.-J., and Demers, A. "Implementing Russell". Proc. Sigplan '86 Symp. on Compiler Constr., Sigplan Not. 21,7 (July 1986), 186-195.
Brooks, R.A., et al. "An Optimizing Compiler for Lexically Scoped LISP". ACM Lisp & Funct. Prog., 1982, 261-275.
Brooks, R.A. "Trading Data Space for Reduced Time and Code Space in Real-Time Garbage Collection on Stock Hardware". ACM Lisp & Funct. Progr., 1984, 256-262.
Chase, David. "Garbage Collection and Other Optimizations". Ph.D. Thesis, Rice U. Comp. Sci. Dept., Nov. 1987.
Cioni, G., and Kreczmar, A. "Programmed Deallocation without Dangling Reference". IPL 18,4 (May 1984), 179-187.
Clinger, W.D., Hartheimer, A.H., and Ost, E.M. "Implementation Strategies for Continuations". ACM Conf. on Lisp and Funct. Prog., July 1988, 124-131.
CLtL: Steele, Guy L., Jr. Common Lisp: The Language. Digital Press, 1984.
Danvy, O. "Memory Allocation and Higher-Order Functions". Proc. Sigplan '87 Symp. on Interpreters and Interpretive Techniques, ACM Sigplan Notices 22,7 (July 1987), 241-252.
Dybvig, R.K. "Three Implementation Models for Scheme". Ph.D. Thesis, Univ. N. Carolina at Chapel Hill Dept. of Comp. Sci., also TR#87-011, April 1987, 180p.
Dybvig, R.K. "A Variable-Arity Procedural Interface". ACM Lisp and Functional Prog. (July 1988), 106-115.
Fischer, M.J. "Lambda Calculus Schemata". Proc. ACM Conf. on Proving Assertions about Programs, Sigplan Not. 7,1 (Jan. 1972).
Fisher, D.A. "Bounded Workspace Garbage Collection in an Address-Order Preserving List Processing Environment". Inf. Proc. Lett. 3,1 (July 1974), 29-32.
Friedman, D.P., and Wise, D.S. "CONS Should Not Evaluate its Arguments". In S. Michaelson and R. Milner (eds.), Automata, Languages and Programming, Edinburgh U. Press, Edinburgh, Scotland, 1976, 257-284.
Goldberg, B., and Park, Y.G. "Higher Order Escape Analysis: Optimizing Stack Allocation in Functional Program Implementations". Proc. ESOP '90, Springer-Verlag, May 1990.
Greenblatt, R., et al. "The Lisp Machine". AI WP 79, MIT, Camb., MA, Nov. 1974; rev. vers. in Winston, P.H., and Brown, R.H., eds., Artificial Intelligence, an MIT Perspective: Vol. II. MIT Press, Camb., MA, 1979.
Harbison, S.P., and Steele, G.L., Jr. C: A Reference Manual, 2nd Ed. Prentice-Hall, Englewood Cliffs, NJ, 1987.
Harrington, Steven J. "Space Efficient Copying Storage Recovery". Computer J. 24,4 (1981), 316-319.
Hederman, Lucy. "Compile Time Garbage Collection". M.S. Thesis, Rice U. Computer Science Dept., Sept. 1988.
Hieb, R., Dybvig, R.K., and Bruggeman, Carl. "Representing Control in the Presence of First-Class Continuations". ACM Sigplan '90 Conf. Prog. Lang. Design & Impl., June 1990, 66-77.
IEEE-Scheme. IEEE Standard for the Scheme Programming Language. IEEE-1178-1990, IEEE, NY, Dec. 1990.
Johnston, J.B. "The Contour Model of Block Structured Processes". Sigplan Notices (Feb. 1971).
Jouppi, N.P. "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers". Proc. 17th Int'l. Symp. on Computer Arch., ACM Computer Arch. News 18,2 (June 1990).
Kownacki, R., and Taft, S.T. "Portable and Efficient Dynamic Storage Management in Ada". Proc. ACM SigAda Int'l. Conf. "Using Ada", Dec. 1987, 190-198.
Kranz, D., et al. "ORBIT: An Optimizing Compiler for Scheme". Proc. Sigplan '86 Symp. on Compiler Constr., Sigplan Notices 21,7 (July 1986), 219-233.
Kranz, David A. ORBIT: An Optimizing Compiler for Scheme. Ph.D. Thesis, Yale Univ., also YALEU/DCS/RR-632, Feb. 1988, 138p.
Lieberman, H., and Hewitt, C. "A Real-Time Garbage Collector Based on the Lifetimes of Objects". CACM 26,6 (June 1983), 419-429.
McCarthy, J., et al. LISP 1.5 Programmer's Manual. MIT Press, Camb., MA, 1965.
McDermott, D. "An Efficient Environment Allocation Scheme in an Interpreter for a Lexically-scoped LISP". 1980 Lisp Conf., Stanford, CA, Aug. 1980, 154-162.
Mellender, F., et al. "Optimizing Smalltalk Message Performance". In Kim, W., and Lochovsky, F.H., eds., Object-Oriented Concepts, Databases and Applications. Addison-Wesley, Reading, MA, 1989, 423-450.
Moon, D. "Garbage Collection in a Large Lisp System". ACM Symp. on Lisp & Funct. Prog., 1984, 235-246.
Moore, Charles H. "The Evolution of FORTH, an Unusual Language". Byte Magazine 5,8 (special issue on the Forth Programming Language), Aug. 1980, 76-92.
Morrison, Donald F., and Griss, Martin L. "An Efficient A-list-like Binding Scheme". Unpublished manuscript, Dept. of Computer Science, U. of Utah, Jan. 1982, 12p.
Moses, J. "The Function of FUNCTION in LISP". Memo 199, MIT AI Lab., Camb., MA, June 1970.
Moss, J.E.B. "Managing Stack Frames in Smalltalk". SIGPLAN '87 Symp. on Interpreters and Interpretive Techniques, Sigplan Notices 22,7 (July 1987), 229-240.
Queinnec, Christian. "Dynamic Extent Objects". Lisp Pointers 2,1 (July-Sept. 1988), 11-21.
Randell, B., and Russell, L.J. ALGOL 60 Implementation. Acad. Press, London and NY, 1964.
Rees, J., and Clinger, W., et al. "Revised Report on the Algorithmic Language Scheme". Sigplan Not. 21,12 (Dec. 1986), 37-79.
Richards, M., Evans, A., and Mabee, R.R. "The BCPL Reference Manual". MIT MAC TR-141, Dec. 1974, 53p.
Ruggieri, C., and Murtagh, T.P. "Lifetime Analysis of Dynamically Allocated Objects". ACM POPL '88, 285-293.
Samples, A.D., Ungar, D., and Hilfinger, P. "SOAR: Smalltalk without Bytecodes". OOPSLA '86 Proc., Sigplan Not. 21,11 (Nov. 1986), 107-118.
Sandewall, E. "A Proposed Solution to the FUNARG Problem". 6.894 Class Notes, MIT AI Lab., Sept. 1974, 42p.
Shaw, Robert A. "Improving Garbage Collector Performance in Virtual Memory". Stanford CSL-TR-87-323, March 1987.
Sherman, Mark, et al. "An ADA Code Generator for VAX 11/780 with UNIX". ACM SIGPLAN ADA Conf., SIGPLAN Notices 15,11 (Nov. 1980), 91-100.
Stallman, Richard M. "Phantom Stacks: If You Look Too Hard, They Aren't There". AI Memo 556, MIT AI Lab., 1980.
Stanley, T.J., and Wedig, R.G. "A Performance Analysis of Automatically Managed Top of Stack Buffers". Proc. 14th Int'l. Symp. on Computer Arch., ACM Computer Arch. News 15,2 (June 1987), 272-281.
Steele, Guy L., Jr. "Fast Arithmetic in MacLisp". Proc. 1977 Macsyma Users' Conference, NASA Sci. and Tech. Info. Off. (Wash., DC, July 1977), 215-224; also AI Memo 421, MIT AI Lab., Camb., MA.
Steele, Guy L., Jr. Rabbit: A Compiler for SCHEME (A Study in Compiler Optimization). AI-TR-474, Artificial Intelligence Laboratory, MIT, May 1978.
Steele, Guy L., Jr., and Sussman, Gerald J. "Design of LISP-Based Processors, or SCHEME: A Dielectric LISP, or Finite Memories Considered Harmful, or LAMBDA: The Ultimate Opcode". MIT AI Memo 514, March 1979.
STEELMAN. Dept. of Defense Requirements for High Order Computer Programming Languages. June 1978.
Taft, S. Tucker, et al. Ada-9X Draft Mapping Document. Wright Lab., AFSC, Eglin AFB, FL, Feb. 1991.
Terashima, M., and Goto, E. "Genetic Order and Compactifying Garbage Collectors". IPL 7,1 (Jan. 1978), 27-32.
Ungar, D. "Generation Scavenging: A Non-disruptive, High Performance Storage Reclamation Algorithm". ACM Soft. Eng. Symp. on Prac. Software Dev. Envs., Sigplan Notices 19,6 (June 1984), 157-167.
Warren, David H.D. Applied Logic--Its Use and Implementation as a Programming Tool. Ph.D. Thesis, U. of Edinburgh, 1977; also Tech. Note 290, SRI Int'l., Menlo Park, CA, 1983, 230p.
Wise, D.S., and Friedman, D.P. "The One-Bit Reference Count". BIT 17 (1977), 351-359.
