Vol. 47 No. SIG 0(PRO 29) IPSJ Transactions on Programming June 2006

A Transformation-Based Implementation of Lightweight Nested Functions

Tasuku Hiraishi,† Masahiro Yasugi† and Taiichi Yuasa†

The SC language system has been developed to provide a transformation-based language extension scheme for SC languages (extended/plain languages with an S-expression based syntax). Using this system, many flexible extensions to the C language can be implemented by transformation rules over S-expressions at low cost mainly because of the pre-existing Common Lisp capabilities for manipulating S-expressions. This paper presents the LW-SC (LightWeight-SC) language as an important application of this system, which features nested functions (i.e., a function defined inside another function). Without returning from a function, the function can manipulate its caller’s local variables (or local variables of its indirect callers) by indirectly calling a of its (indirect) caller. Thus, many high-level services with “stack walk” can be easily and elegantly implemented by using LW-SC as an intermediate language. Moreover, such services can be efficiently implemented because we design and implement LW-SC to provide “lightweight” nested functions by aggressively reducing the costs of creating and maintaining nested functions. The GNU C also provides nested functions as an extension to C, but our sophisticated translator to standard C is more portable and efficient for occasional “stack walk”.

port nested functions. A nested function is a 1. Introduction function defined inside another function. With- The C language is often indispensable for de- out returning from a function, the function can veloping practical systems. Furthermore, ex- manipulate its caller’s local variables (or local tended C languages are sometimes suitable for variables of its indirect callers) sleeping in the elegant and efficient development. We can execution stack by indirectly calling a nested implement language extension by modifying a function of its (indirect) caller. C compiler, but sometimes we can do it by This paper presents the implementation of translating an extended C program into C. We an extended SC language, named LW-SC have developed the SC language system8),10) to (LightWeight SC), which features nested func- help such transformation-based language exten- tions☆. Many high-level services with “stack sions. SC languages are extended/plain C lan- walk” mentioned above can be easily and el- guages with an S-expression based syntax and egantly implemented by using LW-SC as an the extensions are implemented by transforma- intermediate language. Moreover, such ser- tion rules over S-expressions. Thus we can re- vices can be efficiently implemented because duce implementation costs mainly because we we design and implement LW-SC to provide can easily manipulate S-expressions using Lisp. “lightweight” nested functions by aggressively The fact that C has low-level operations (e.g., reducing the costs of creating and maintaining pointer operations) enables us to implement nested functions. Though the GNU C com- many flexible extensions using the SC language piler15) (GCC) also provides nested functions system. But without taking “memory” ad- as an extension to C, our sophisticated transla- dresses, C lacks an ability to access variables tor to standard C is more portable and efficient sleeping in the execution stack, which is re- for occasional “stack walk”. quired to implement high-level services with Note that, though this paper presents an im- “stack walk” such as capturing a stack state for plementation using the SC language system, check-pointing and scanning roots for copying our technique is not limited to it. GC (Garbage Collection). A possible solution to this problem is to sup-

☆ We have previously reported the implementation of † Department of Communications and Computer En- LW-SC as an example of a language extension us- gineering, Graduate School of Informatics, Kyoto ing the SC language system.10) This paper discusses University further details about LW-SC itself.

1 2 IPSJ Transactions on Programming June 2006

extended SC-B code transformation rule-set B transformation rule-set A

SC preprocessor SC translator SC preprocessor SC translator

extended SC-A code C code extended SC-C code

SC preprocessor SC translator SC preprocessor SC compiler C compiler

transformation rule-set C SC-0 code executable file Fig. 1 Code translation phases in the SC language system.

evaluated as a defmacro form of Com- 2. The SC Language System mon Lisp to define an SC macro. After This section explains the SC language system the definition, every list in the form of for giving the specification and an implementa- (macro-name ···) is replaced with the re- tion of LW-SC. More details are available in our sult of the application of Common Lisp’s past paper.8),10) macroexpand-1 function to the list. The 2.1 Overview algorithm to expand nested macro applica- The SC language system, implemented in tions complies with the standard C specifi- Common Lisp, deals with the following S- cation. expression-based languages: • (%defconstant macro-name S-expression) • SC-0, the base SC language, and defines an SC macro in the same way as • extended SC languages, a %defmacro directive, except that every and consists of the following three kinds of mod- symbol which eqs macro-name is replaced ules: with S-expression after the definition. • The SC preprocessor — includes SC files • (%undef macro-name) and handles macro definitions and expan- undefines the specified macro defined by sions, %defmacros or %defconstants. • The SC translator — interprets transfor- • (%ifdef symbol list 1 list 2) mation rules for translating SC code into (%ifndef symbol list 1 list 2) another SC, and If the macro specified by symbol is defined, • The SC compiler — compiles SC-0 code list 1 is spliced there. Otherwise list 2 is into C. spliced. Fig. 1 shows code translation phases in the • (%if S-expression list 1 list 2) SC language system. Extended SC code is S-expression is macro-expanded, then the translated into SC-0 by the SC translators, then result is evaluated by Common Lisp. If the translated into C by the SC compiler. Be- return value eqls nil or 0, list 2 is spliced fore each translation phase with a transforma- there. Otherwise list 1 is spliced. tion rule-set is applied, preprocessing by the • (%error string) SC preprocessor is performed. Extension im- interrupts the compilation with an error plementers can develop a new translation phase message string. simply by writing new transformation rules. • (%cinclude file-name) 2.2 The SC Preprocessor file-name specifies a C header file. The C The SC preprocessor handles the following code is compiled into SC-0 and the result SC preprocessing directives to transform SC is spliced there. The SC programmers can programs: use library functions and most of macros • (%include file-name) such as printf, NULL declared/#defined corresponds to an #include directive in C. in C header files☆.

The file file-name is included. ☆ • In some cases such a translation is not obvious. (%defmacro macro-name lambda-list In particular, it is sometimes impossible to trans- S-expression1···S-expressionn) late #define macro definitions into %defmacro or Vol. 47 No. SIG 0(PRO 29) A Transformation-Based Implementation of Lightweight Nested Functions 3

2.3 The SC Translator and Transfor- uation result of (every #’function-name mation Rules list) is non-nil. A transformation rule for the SC translator The function function-name can be what is de- is given by the syntax: fined as a list of transformation rules or an usual Common Lisp function (a built-in function or (function-name pattern parm2 ··· parmn) what is defined separately from transformation -> expression rules). In evaluating expression, the special variable where a function function-name is defined as x is bound to the whole matched S-expression an usual Lisp function. When the function is and, in the cases except (1), symbol is bound to called, the first argument is tested whether it the matched part in the S-expression. matches to pattern. If matched, expression is An example of such a function definition can evaluated by the Common Lisp system, then its be given as follows☆: value is returned as the result of the function call. The parameters parm2 ··· parmn, if any, (EX (,a[numberp] ,b[numberp])) are treated as usual arguments. -> ‘(,a ,b ,(+ a b)) A list of transformation rules may include two (EX (,a ,b)) or more rules with the same function name. In -> ‘(,a ,b ,a ,b) these cases, the first argument is tested whether (EX (,a ,b ,@rem)) it matches to each pattern in written order, and -> rem the result of the function call is the value of (EX ,otherwise) expression if matched. -> ’(error) It is permitted to abbreviate The application of the function EX can be ex- ··· (function-name pattern1 parm2 parmn) emplified as follows: -> expression ... (EX ’(3 8)) → (3 8 11) ··· → (function-name patternm parm2 parmn) (EX ’(x 8)) (x 8 x 8) -> expression (EX 8) → (error) (EX ’(3)) → (error) (all the expressions are identical and only pat- (EX ’(x y z)) → (z) terns are different from each other) to Each set of transformation rules defines one or ··· (function-name pattern1 parm2 parmn) more (in most cases) function(s). A piece of ... extended SC code is passed to one of the func- ··· (function-name patternm parm2 parmn) tions, which generates transformed code as the -> expression. result. Internally, transformation rules for a func- The pattern is an S-expression consisted of one tion are compiled into an usual Common Lisp of the following elements: function definition (defun). The output can be (1) symbol loaded by the load function, which enables the matches a symbol that is eq to symbol. programmers to easily test a part of transfor- (2) ,symbol mation rule-sets in an interactive environment. matches any S-expression. 2.4 The SC Compiler and the SC-0 (3) ,@symbol Language matches any list of elements longer than 0. We designed the SC-0 language as the final (4) ,symbol[function-name] target language for translation by transforma- matches an element if the evaluation result of (funcall #’function-name element) is ☆ In consideration of symmetry between expressions non-nil. and patterns, it is more pertinent to describe (5) ,@symbol[function-name] ‘(,a[numberp] ,b[numberp]) with a backquote. matches list (longer than 0) if the eval- However, this notation rule leads inconvenience that programmers have to put backquotes before most of patterns. We preferred shorter descriptions and %defconstant. We discussed this problem before.9) adopt the notation without backquotes. 4 IPSJ Transactions on Programming June 2006

(def (sum a n) (fn int (ptr int) int) (def (h i g) (fn int int (ptr (lightweight int int))) (def s int 0) (return (g (g i)))) (def i int 0) (do-while 1 (def (foo a) (fn int int) (if (>= i n) (break)) (def x int 0) (+= s (aref a (inc i)))) (def y int 0) (def (g1 b) (lightweight int int) (return s)) (inc x) Fig. 2 An SC-0 program. (return (+ a b))) (= y (h 10 g1)) (return (+ x y))) int sum (int* a, int n) { (def (main) (fn int) int s=0; (return (foo 1)) int i=0; do{ Fig. 4 An LW-SC program. if ( i >= n ) break; s += a[i++]; } while(1); block-item-list) return s; (the almost same syntax as ordinary func- } tion definitions’ except for the difference Fig. 3 C program equivalent to Fig. 2. between keywords fn and lightweight.) A nested function can access the lexically- tion rules. It has the following features: scoped variables in the creation-time environ- • an S-expression based, Lisp like syntax, ment and a pointer to it can be used as a • the C semantics; actually most of C code function pointer to indirectly call the closure. can be represented in SC-0 in a straightfor- For example, Fig. 4 shows an LW-SC program. ward manner☆, and When h indirectly calls the nested function g1, • practical for programming. it can access a parameter a and local variables Fig. 2 shows an example of such an SC-0 pro- x, y sleeping in foo’s frame. gram, which is equivalent to the program in 4. GCC’s Implementation of Nested Fig. 3. Functions In practice, the SC compiler is imple- mented as a transformation rule-set described GCC also features nested functions and the above, which specifies transformation from specification of nested functions of LW-SC is S-expressions to a string (instead of S- almost the same as the one of GCC. As well expressions). as GCC (but differently from closure objects in modern languages such as Lisp and ML), 3. Language Specification of LW-SC nested functions of LW-SC are valid only when LW-SC has the following features as exten- the owner blocks are alive. But unlike GCC, sions to SC-0. pointers to nested functions are not compati- • Nested function types: ble with ones to top-level functions. However, (lightweight type-expression-list) such limitations are insignificant for the pur- is added to the syntax for type-expression pose of implementing most high-level services • Calling nested functions: In function- with “stack walk” mentioned in Section 1. call expressions ((expression-list)), The GCC’s implementation of nested functions type of the first expression is permitted to causes high maintenance/creation costs for the be the nested function pointer type other following reasons: than the ordinary function pointer type. • In creating nested functions, there is the • Defining nested functions: In the places cost for initializing. To initialize a nested where variable definitions are allowed ex- function, GCC implements taking the ad- cept at the top-level, definitions of nested dress of it using a technique called tram- functions are permitted in the following polines.2) Trampolines are code fragments form: generated on the stack at runtime to indi- (def (identifier-list) rectly enter the nested function with a nec- (lightweight type-expression-list) essary environment. The cost of runtime code generation is high, and for some pro- ☆ except some features such as -> operators, for con- cessors like SPARC, it is necessary to flush structs, and while constructs. These are imple- mented as language extensions to SC-0 using the some instruction caches for the runtime- SC language system itself. generated trampoline code. Vol. 47 No. SIG 0(PRO 29) A Transformation-Based Implementation of Lightweight Nested Functions 5

• Local variables and parameters of a func- of nested functions (of LW-SC) and return tion generally may be assigned to registers addresses. if the function has no nested function. But (b) Each function call to an ordinary top-level an owner function of GCC’s nested func- function in LW-SC is transformed to the tions must keep the values of these vari- same function call in C, except that a spe- ables in the stack for the nested functions cial argument is added which saves the to access them usually via a static chain. stack pointer to the explicit stack. The Thus, the owner function must perform callee first initializes its frame pointer with memory operations to access these vari- the stack pointer, moves the stack pointer ables, which means that the cost of main- by its frame size, then executes its body. taining nested functions is high. (c) Each nested function definition in LW-SC LW-SC overcomes the former problem by trans- is moved to the top-level in C. Instead, a lating the nested function to a lazily-initialized value of a structure type, which contains pair (on the explicit stack) of the ordinary func- the pointer to the moved nested function tion pointer and the frame pointer, and the lat- and the frame pointer of the owner func- ter by saving the local variables to the “explicit tion, is stored on the explicit stack. Note stack” lazily (only on calls to nested functions), that initialization of the structure is de- as is shown in the following section. layed until nested functions are invoked to reduce costs of creating nested functions. 5. Implementation of LW-SC (d) Each function call to a nested function We implemented LW-SC described above by in LW-SC is translated into the following using the SC language system, that is, by writ- steps. ing transformation rules for translation into SC- 1. Push arguments passed to the nested 0, which is finally translated into C. function and the pointer to the struc- 5.1 Basic Ideas ture mentioned above in (c) to the ex- The basic ideas to implement nested func- plicit stack. tions by translation are summarized as follows: 2. Save the values of the all local variables • After transformation, all definitions of and parameters, and an integer corre- nested functions are moved to be top-level sponding to the current execution point definitions. (return address) into the explicit stack, • To enable the nested functions to access lo- then return form the function. cal variables of their owner functions, an 3. Iterate Step 2 until control is returned explicit stack is employed in C other than to main. The values of local variables the (implicit) execution stack for C. The and parameters of main are also stored explicit stack mirrors values of local vari- to the explicit stack. ables in the execution stack, and is referred 4. Referring to the structure which is to when local variables of the owner func- pointed to by the pointer pushed at tions are accessed. Step 1 (the one in (c)), call the • To reduce costs of creating and maintaining nested function whose definition has nested functions, operations to fix inconsis- been moved to the top-level in C. The tency between two stacks are delayed until callee first obtains its arguments by nested functions are actually invoked. popping the values pushed at Step 1, Function calls/returns and function defini- then executes its body. tions in LW-SC should be appropriately trans- 5. Before returning from the nested func- formed based on these ideas. tion, push the return value to the ex- 5.2 Transformation plicit stack. LW-SC programs are translated in the follow- 6. Reconstruct the execution stack by ing way to realize the ideas described in Section restoring the local variables, the param- 5.1. eters, and the execution points, with (a) Each generated C program employs an the values saved in the explicit stack explicit stack mentioned above on mem- at Step 3 (the values may be changed ory. This shows a logical execution during the call to the nested function), stack, which manages local variables, callee to return to (resume) the caller of the frame pointers, arguments, return values nested function. 6 IPSJ Transactions on Programming June 2006

Executing main Executing foo Just before g1

h h

foo foo foo foo

main main main main main main

C stack explicit stack C stack explicit stack C stack explicit stack

1. 2., 3. 4.(Start g1)

get the argument h h h h g1 g1 foo foo foo g1 foo main main main main main main

C stack explicit stack C stack explicit stack C stack explicit stack

4.(Executing g1) 5., 6. 7. return value g1 h h h h

g1 foo foo foo foo main main main main main main

C stack explicit stack C stack explicit stack C stack explicit stack

: The stack memory that contains correct values of local variables. : The structure that contains a pointer to function g1(moved to the top level) and a pointer to foo’s frame. : The arguments to g1 and the pointer to the above structure. : The return value of g1.

Fig. 5 Details of an indirect call to a nested function g1 in Fig. 4.

7. If necessary, get the return value of the from the beginning of the execution until the nested function pushed at Step 5 by end of the first indirect call to a nested func- popping the explicit stack. tion g1 (Each number in the figure corresponds Note that a callee (a nested function) can to the step of the nested function call described access the local variables of its owner func- in (d)). Notice that the correct values of the tions through the frame pointers contained local variables are saved in the explicit stack in the structure that have been saved at during the execution of the nested function and Step 1. otherwise in the C stack. For example, Fig. 5 shows the state transi- 5.3 Transformation Rules tion of the two stacks☆, in the case of Fig. 4, To implement the transformation described above, we wrote transformation rules. The en- ☆ “The C stack” here just states the set of local vari- tire transformation is divided into the follow- ables and parameters, whose values are stored not ing four phases (rule-sets) for simplicity and only in the stack memory but also in registers. reusability of each phase. Vol. 47 No. SIG 0(PRO 29) A Transformation-Based Implementation of Lightweight Nested Functions 7

(1) The type rule-set: adds type information to all the expressions of an input program. (2) The temp rule-set: transforms an input (Tp0 (,@declaration-list) ) program in such a way that no function -> (progn call appears as a subexpression (except as ... (let (*str-alist* *v-alist* *lastv-alist*) a right hand side of an assignment). (mapcar #’Tp1 declaration-list))) (3) The lightweight rule-set: performs the ;;;;; declaration ;;;;; transformation described in Section 5.2. (Tp1 (,scs[SC-SPEC] ,id[ID] ,texp ,@init)) -> (progn (4) The untype rule-set: removes the type (push (cons id (remove-type-qualifier texp)) information added by the type rule-set *v-alist*) ‘(,scs ,id ,texp ,@(mapcar #’Tpi init))) from expressions to generate correct SC-0 (Tp1 (,scs[SC-SPEC] (,@id-list[ID]) code. (fn ,@texp-list) ,@body)) -> (let* ((texp-list2 The following subsections present the details (mapcar #’rmv-tqualifier texp-list)) of these transformation rule-sets. (*v-alist* (cons (cons (first id-list) 5.3.1 The type rule-set ‘(ptr (fn ,@texp-list2))) *v-alist*)) Transformation by the temp rule-set and the (new-body nil)) lightweight rule-set needs type information of (let ((b-list (cmpd-list (cdr id-list) all expressions. The type rule-set adds such (cdr texp-list2)))) information. More concretely, it transforms (setq new-body each expression into type-expression ex- (let((*v-alist* (append b-list *v-alist*)) (the (*str-alist* *str-alist*)) pression). (mapcar #’Tpb body)))) Fig. 6 shows the (abbreviated) transforma- ‘(,scs (,@id-list) (fn ,@texp-list),@new-body)) tion rule-set. Tp0 is applied to input program ... (e.g., in Fig. 7) to get the transformed program (Tp1 ,otherwise) -> (error "sytax error") (e.g., in Fig. 8). Tp1 receives declarations and ;;;;;;; body ;;;;;; renews the dynamic variables which save the in- (Tpb (do-while ,exp ,@body)) formation about defined variables, structures, -> (switch ,(Tpe exp) ,@(let ((*v-alist* *v-alist*) etc. Tpe actually transforms expressions refer- (*str-alist* *str-alsit*)) ring to the dynamic variables. (mapcar #’Tpb body))) ... 5.3.2 The temp rule-set (Tpb ,otherwise) A function call appearing as a subexpression -> (let ((expression-stat (Tpe otherwise))) (if (eq ’$not-expression expression-stat) such as (g x) in (f (g x)) makes it difficult to (Tp1 otherwise) add some operations just before/after the func- expression-stat)) tion call. The temp rule-set makes such function ;;;;; expression ;;;;; (Tpe ,id[ID]) calls not appear. -> ‘(the ,(assoc-vartype id) ,id) Because some temporary variables are needed ... (Tpe (ptr ,exp)) for the transformation, the definitions of those -> (let ((exp-with-type (Tpe exp))) are inserted at the head of the function body. ‘(the (ptr ,(cadr exp-with-type)) (ptr ,exp-with-type))) For example, a program in Fig. 10 is trans- (Tpe (mref ,exp)) formed to the program in Fig. 11 using this -> (let* ((exp-with-type (Tpe exp)) rule-set. (exp-type (cadr exp-with-type))) ‘(the ,(deref-type exp-type) Fig. 9 shows the (abbreviated) temp rule- (mref ,exp-with-type))) set. The actual transformation is performed by (Tpe (,fexp[EXPRESSION] ,@arg-list)) -> (let* ((fexp-with-type (Tpe fexp)) Tmpe, which returns a 3-tuple of (fexp-type (cadr fexp-with-type)) • a list of the variable definitions to be in- (type-fn (cadr fexp-type))) serted at the head of the current function, ‘(the ,(cadr type-fn) (call (the ,type-fn • a list of the assignments to be inserted just ,(caddr fexp-with-type)) before the expression, and ,@(mapcar #’Tpe arg-list)))) (Tpe ,otherwise) • an expression with which the current ex- -> ’$not-expression pression should be replaced. Fig. 6 The type rule-set (abbreviated). Tmp and Tmp2 combine the tuples appropriately and finally Tmp0 returns the transformed code. 5.3.3 The lightweight rule-set Now the transformation described in Section 8 IPSJ Transactions on Programming June 2006

(def (g x) (fn int int) (return (* x x)))

(def (f x) (fn double double) (Tmp0 (,@decl-list)) (return (+ x x))) ->(progn ... (def (h x) (fn char double) (let ((*used-id* (get-all-id x)) (return (f (g x)))) (*prev-continue* nil)) (mapcar #’Tmp1 x))) Fig. 7 An example for the type rule-set (before ;;;; declaration ;;;;; transformation). (Tmp1 (,scs[SC-SPEC] (,@id-list[ID]) (fn ,@texp-list) ,@body)) -> (let* ((tmpbody (Tmp2 body)) (def (g x) (fn int int) (newdecl (first tmpbody)) (return (the int (newbody (second tmpbody))) (* (the int x) (the int x))))) ‘(,scs (,@id-list) (,fntag ,@texp-list) ,@newdecl ,@newbody)) (def (f x) (fn double double) ... (return (the double ;;;;; body ;;;;; (+ (the double x) (the double x))))) (Tmp2 (,@item-list)) -> (let* ((tmpitemlist (mapcar #’Tmp item-list)) (def (h x) (fn char double) (decl-list (apply #’append (return (the double (mapcar #’first tmpitemlist))) (call (the (fn double double) f) (prev-stat (apply #’append (mapcar (the int (call (the (fn int int) g) #’(lambda (x) ‘(,@(second x) ,(third x))) (the double x))))))) tmpitemlist)))) Fig. 8 An example for the rule-set (after (list decl-list prev-stat)) type (Tmp (do-while ,exp ,@body)) transformation). ->(let* ((tmpexp (Tmpe exp)) (*prev-continue* (second tmpexp)) (tmpbody (Tmp2 body))) 5.2 is realized by the lightweight rule-set. (list (append (first tmpexp) (first tmpbody)) Fig. 12 shows the (abbreviated) lightweight nil (do-while ,(third tmpexp) rule-set which is related to the transformation ,@(second tmpbody) ,@*prev-continue*))) of “ordinary function” calls and “nested func- (Tmp (return ,@exp)) tion” calls. Esp appearing in the code is a spe- -> (if (null exp) ‘(nil nil (return)) cial parameter which is added to each func- (let ((tmpexp (Tmpe (car exp)))) tion and keeps the stack top of the explicit ‘(,(first tmpexp) ,(second tmpexp) (return ,(third tmpexp))))) stack. Efp is a special local variable added to ... each function, which acts as the (explicit) frame (Tmp ,otherwise) ->(let ((tmpe-exp (Tmpe otherwise))) pointer of the function. Lwe-xfp transforms (if (eq ’$not-expression tmpe-exp) references to local variables into references to (list (list (Tmp1 otherwise)) nil) the explicit stack. tmpe-exp)) ;;;;; expression ;;;;; “Ordinary function” calls and “nested func- (Tmpe (the ,texp (call ,fexp ,@arg-list))) tion” calls can be statically distinguished with -> (case texp ((void) the functions’ types because ordinary function ... ) types are incompatible with lightweight nested (otherwise (let* function types. ((tmpexps (comb-list (mapcar #’Tmpe arg-list))) The transformation of each operation is de- (tempid (generate-id "tmp")) tailed as follows (the rules unrelated to function (tmp-decl1 ‘(def ,tempid ,texp)) (tmp-decl calls are omitted in the figure): (append (first tmpexps) ‘(,tmp-decl1))) Calling ordinary functions: The function (tmp-set1 ‘(the ,texp (= (the ,texp ,tempid) (the ,texp (call ,fexp ,@(third tmpexps)))))) call is performed as a part of the condi- (tmp-set tional expression of the while statement, (append (second tmpexps) ‘(,tmp-set1)))) where the stack pointer is passed to the (list tmp-decl tmp-set ‘(the ,texp ,tempid))))) (Tmpe (the ,texp (+ ,exp1 ,exp2))) callee as an additional first argument. If -> (let ((op (caaddr x)) the callee procedure normally finished, the (t-exp1 (Tmpe exp1)) (t-exp2 (Tmpe exp2))) (list ‘(,@(first t-exp1) ,@(first t-exp2)) condition becomes false and the body of ‘(,@(second t-exp1) ,@(second t-exp2)) while loop is not executed. Otherwise, ‘(the ,texp (,op ,(third t-exp1) ,(third t-exp2))))) if the callee returned for a “nested func- ... tion” call, the condition becomes true. In Fig. 9 The temp rule-set (abbriviated). the body of the while loop, the values of local variables are saved to the explicit stack, an integer that corresponds to the Vol. 47 No. SIG 0(PRO 29) A Transformation-Based Implementation of Lightweight Nested Functions 9

;;;; Due to the temp rule-set, a function call expression must be appeared in either of the following form ;;;; as a statement expression: ;;;; * (= variable function-call-expression) ;;;; * (= function-call-expression). ;;; ‘‘Ordinary function’’ call (Lwe (the ,texp0 (= (the ,texp1 ,id) (the ,texp (call (the (fn ,@texp-list) ,exp-f) ,@exp-list))))) (Lwe (the ,texp (call (the (fn ,@texp-list) ,exp-f) ,@exp-list))) -> (let (...) ... (let* (...) (list nil decl-list (cons ’(= new-esp esp) prev-list) ‘(while ‘(and (== (= ,(Lwe-xfp ‘(the ,texp1 ,id)) (call ,fexp new-esp ,@(cdr tmpid-list))) (special ,texp0)) (!= (= (fref efp -> tmp-esp) (mref-t (ptr char) esp)))) ;; Save the values of local variables to the frame. ,@(make-frame-save *current-func*) ... ;; Save the current execution point. (= (fref efp -> call-id) ,(length (finfo-label-list *current-func*))) ;; Return from the current function ;; (In main, call the nested function here instead of the following steps). ,(make-suspend-return *current-func*) ;; Continue the execution from here when reconstructing the execution stack. (label ,(caar (push (cons (generate-id "L_call" *used-id-list*) nil) (finfo-label-list *current-func*))) nil) ;; Restore local variables from the explicit stack. ,@(make-frame-resume *current-func*) ... (= new-esp (+ esp 1)))))) ;;; ‘‘Nested function’’ call (Lwe (the ,texp0 (= (the ,texp1 ,id) (the ,texp (call (the (lightweight ,@texp-list) ,exp-f) ,@exp-list))))) (Lwe (the ,texp (call (the (lightweight ,@texp-list) ,exp-f) ,@exp-list))) -> (let (...) ... (list ’() fp-decl ’() ‘(begin ... (= argp (aligned-add esp (sizeof (ptr char)))) ;; Push the arguments passed to the nested function ,@(mapcar (compose #’(lambda (x) ‘(push-arg ,(second x) ,(third x) argp)) #’Lwe-xfp) (reverse exp-list)) ;; Push the structure objcect that corresponds to the frame of the nested function to the explicit stack. (= (mref-t (ptr closure-t) argp) ,xfp-exp-f) ... ;; Save the values of local variables to the frame. ,@(make-frame-save *current-func*) (= (fref efp -> argp) argp) (= (fref efp -> tmp-esp) argp) ;; Save the current execution point. (= (fref efp -> call-id) ,(length (finfo-label-list *current-func*))) ;; Return from the current function (In main, call the nested function here instead of the following steps). ,(make-suspend-return *current-func*) ;; Continue the execution from here after the function call finishes. (label ,(caar (push (cons (generate-id "L_call" *used-id-list*) nil) (finfo-label-list *current-func*))) nil) ;; Restore local variables from the explicit stack. ,@(make-frame-resume *current-func*) ;; Get the return value (if necessary). ,@(when assign-p ‘( (= ,(Lwe-xfp ‘(the ,texp1 ,id)) (mref-t ,texp1 (fref efp -> argp))) )) ))) Fig. 12 The lightweight rule-set (abbriviated).

current execution point is also saved to the tion call in the conditional expression of the explicit stack ((fref efp -> call-id)), while statement is restarted. The assign- and then the current function temporarily ment (= new-esp (+ esp 1)) at the end exits. This function is re-called for recon- of the while block sets a flag at the LSB structing the execution stack after the ex- of the explicit stack pointer that indicates ecution of the nested function. Then the reconstructing the execution stack. control is transferred to the label that is Calling Nested functions: The arguments put next to the return by a goto statement passed to the nested function and the clo- which is added in the head of the function. sure structure (contains the nested function Then the values of local variables are re- pointer and the frame pointer of its owner stored form the explicit stack and the func- function) are pushed to the explicit stack. 10 IPSJ Transactions on Programming June 2006

(def (g x) (fn int int) (UTp0 ,decl-list) (return -> (UTp decl-list) (the int (UTp (the ,texp ,exp)) (+ (the int -> (Utp exp) (= (the int x) (the int 3))) (UTp (call ,@exp-list)) (the int -> (mapcar #’Utp exp-list) (call (the (fn int int) g) (UTp (,@lst)) (the int x))))))) -> (mapcar #’UTp lst) Fig. 10 An example for the temp rule-set (before (UTp ,otherwise) transformation). -> otherwise Fig. 13 The untype rule-set. (def (g x) (fn int int) (def tmp1 int) (def tmp2 int) definitions, and adding the structure defi- (the int nition that represents the function’s frame (= (the int tmp1) (the int in the explicit stack and is referred to by (= (the int x) (the int 3))))) efp. (the int (= (the int tmp2) 5.3.4 The untype rule-set (the int The output code transformed by the (call (the (fn int int) g) (the int x))))) lightweight rule-set is not valid SC-0 code be- (return cause it contains type information. The untype (the int rule-set removes such information and gener- (+ (the int tmp1) (the int tmp2))))) ate valid SC-0 code. The rule-set is very sim- Fig. 11 An example for the temp rule-set (after transformation). ple; only needs to search (the ...) forms re- cursively and to remove the type information. Fig. 13 shows the entire untype rule-set. Then, like an “ordinary function” call, the As an example of the total translation, Ap- values of local variables and the executing pendix A.1 shows the entire SC-0 code gener- point are saved, the current function ex- ated from the LW-SC program in Fig. 4. its, and the execution point is restored by 6. Evaluation goto after the procedures for calling the nested function. Then the values of local 6.1 Creation and Maintenance Cost variables are restored and the return value To measure costs of creating and maintain- of the nested function is taken from the top ing nested functions, we employed the follow- of the explicit stack, if exists. ing programs with nested functions for several Returning from functions: Returns from high-level services and compared them with the ordinary function need no transformation. corresponding plain C programs: On the other hand, returns from nested BinTree (copying GC) creates a binary functions must be transformed to push the search tree with 200,000 nodes, with a return value to the explicit stack, and just copying-collected heap (Fig. 14). to return 0 to indicate that the execution Bin2List (copying GC) converts a binary of the function is normally finished. tree with 500,000 nodes into a linear list, Function definitions: The following steps with a copying-collected heap (Fig. 15). are added before the functions’ body: fib(34) (check-pointing) calculates the 34th • initializing the frame pointer of the ex- Fibonacci number recursively, with a capa- plicit stack (efp) and the stack pointer bility of capturing a stack state for check- (esp), pointing (Fig. 16). • judging whether reconstruction of the nqueens(13) (load balancing) solves the N- execution stack is required or not queens problem (N=13) on a load- and, if required, executing goto balancing framework based on lazy parti- to the label corresponding to tioning of sequential programs.21),22) (fref efp -> call-id), and Note that nested functions are never invoked, • popping parameters from the explicit that is, garbage collection, check-pointing and stack, in the case of nested functions. task creation do not occur, in these measure- The transformation also involves adding ments because we measured the costs of creat- the parameter esp that receives the explicit ing and maintaining nested functions. stack pointer, adding some local variable We measured the performance on 1.05GHz Vol. 47 No. SIG 0(PRO 29) A Transformation-Based Implementation of Lightweight Nested Functions 11

Table 1 Performance measurements (for the creation and maintenance cost).

Elapsed Time in seconds S:SPARC (relative time to plain C) P:Pentium C GCC LW-SC XCC CL-SC BinTree S 0.180 0.263 0.192 0.181 0.249 copying (1.00) (1.46) (1.07) (1.00) (1.38) GC P 0.152 0.169 0.156 0.150 0.179 (1.00) (1.11) (1.03) (0.988) (1.18) Bin2List S 0.292 0.326 0.303 0.289 0.318 copying (1.00) (1.12) (1.04) (0.99) (1.09) GC P 0.144 0.145 0.151 0.146 0.154 (1.00) (1.01) (1.05) (1.01) (1.07) fib(34) S 0.220 0.795 0.300 0.226 0.361 check- (1.00) (3.61) (1.36) (1.03) (1.64) pointing P 0.0628 0.152 0.138 0.0751 0.162 (1.00) (2.42) (2.20) (1.20) (2.58) nqueens(13) S 0.478 1.04 0.650 0.570 1.05 load (1.00) (2.18) (1.36) (1.19) (2.20) balancing P 0.319 0.428 0.486 0.472 0.544 (1.00) (1.34) (1.52) (1.48) (1.71)

(deftype sht (ptr (lightweight void void))) (def (randinsert scan0 this n) (def (cpfib save0 n) (fn void sht (ptr Bintree) int) (fn int (ptr (lightweight void)) int) (decl i int) (def pc int 0) (decl k int) (def s int 0) (decl seed (array unsigned-short 3)) (def (save1) (lightweight void) (def (scan1) (lightweight void void) (save0) (= this (move this)) (save-pc pc) (scan0)) (save-int s) (= (aref seed 0) 3) (save-int n)) (= (aref seed 1) 4) (if (<= n 2) (= (aref seed 2) 5) (return 1) (for ((= i 0) (< i n) (inc i)) (begin (= k (nrand48 seed)) (= pc 1) (insert scan1 this k k))) (+= s (cpfib save1 (- n 1))) (= pc 2) Fig. 14 The LW-SC program for BinTree. (+= s (cpfib save1 (- n 2))) (return s))) ) (deftype sht (ptr (lightweight void void))) Fig. 16 The LW-SC program for fib(34). (def (bin2list scan0 x rest) (fn (ptr Alist) sht (ptr Bintree) (ptr Alist)) (def a (ptr Alist) 0) UltraSPARC-III and 3GHz Pentium 4 using (def kv (ptr KVpair) 0) GCC with -O2 optimizers. Table 1 summa- (def (scan1) (lightweight void void) (= x (move x)) rizes the results of performance measurements, (= rest (move rest)) where “C” means the plain C program with- (= a (move a)) (= kv (move kv)) out high-level services, and “GCC” means the (scan0)) use of GCC’s nested functions. The “XCC” (if (fref (mref x) right) means the use of XC-Cube, which is an ex- (= rest (bin2list scan1 (fref (mref x) right) rest)) ) tended C language with some primitives added (= kv (getmem scan1 (ptr KVpair_d))) for safe and efficient shared memory program- (= (fref (mref kv) key) (fref (mref x) key)) 23) (= (fref (mref kv) val) (fref (mref x) val)) ming. XC-Cube also features nested func- (= a (getmem scan1 (ptr Alist_d))) tions with lightweight closures,21),22) which are (= (fref (mref a) kv) kv) implemented at the assembly language level by (= (fref (mref a) cdr) rest) (= rest a) modifying GCC directly☆. The “CL-SC” (clo- (if (fref (mref x) left) sure SC) means the use of nested functions with (= rest (bin2list scan1 (fref (mref x) left) rest))) non-lightweight closures, whose implementation (return rest) ) Fig. 15 The LW-SC program for Bin2List. ☆ The detail of its implementation will be reported by a separate paper. 12 IPSJ Transactions on Programming June 2006 (def (mod-sort a n d) comparator, whose owner is the caller of (fn void (ptr int) int int) (def (comp-mod pp pq) the sorting function (Fig. 18). In the plain (lightweight int (ptr void) (ptr void)) C program, the comparison function is de- (return (if-exp (< (% (mref (cast (ptr int) pp)) d) fined as the ordinary function where d is (% (mref (cast (ptr int) pq)) d)) declared as a global variable. 1 (if-exp (== (% (mref (cast (ptr int) pp)) d) Bin2List (copying GC) works as the same (% (mref (cast (ptr int) pq)) d)) as Bin2List in Section 6.1, except that the 0 garbage collector actually runs and nested -1)))) (quicksort a n (sizeof int) comp-mod)) functions are called for scanning the stack Fig. 18 The LW-SC program of QSort (calling the (therefore there is no plain C program). sorting function by passing a nested function The collectors employ a simple breadth- comp-mod as a comparator). first (non-recursive) copying GC algorithm. Table 2 summarizes the results of perfor- is almost the same as LW-SC except that all lo- mance measurements. In LW-SC, the invo- cal variables and parameters are stored into the cation cost is high because saving (restoring) explicit stack. the values in the execution stack are necessary Since nested functions are created frequently upon calling (returning from) nested functions, in fib(34), LW-SC shows good performance on which causes bad performance in QSort. What SPARC, compared to GCC where the cost of is worse is that the cost of invoking a nested flushing instruction caches is significant. On function increases depending on the depth of the other hand, LW-SC shows not so good per- the execution stack at the time of the invoca- formance on Pentium 4 where overhead with tion. To show it clearly, we invoked mod-sort in additional operations in LW-SC is emphasized. Fig. 18 on top of various numbers of intermedi- Since several local variables can get callee- ating function calls (Fig. 17). The result shows save registers in BinTree, LW-SC shows good the elapsed time increases proportionally to the performance on SPARC, even if function calls stack depth only in LW-SC. We think that the (i.e. creations) are infrequent. This effect is not cost of throwing an exception to an exception so significant in fib(34) since there is few local handler may also change with a similar reason. variable accesses in the fib function. CL-SC shows good performance in QSort be- LW-SC does not show good performance in cause the unwinding and the reconstructing the nqueens(13) since unimportant variables are al- execution stack are unnecessary. located to registers. Since Pentium 4 has only Notice that GCC on Pentium shows bad per- a few callee-save registers and performs ex- formance in QSort. We guess that this is be- plicit save/restore of callee-save registers which cause trampoline code placed in a writable data is implicit with SPARC’s register window, the area (not a read-only program area) prevents penalty of wrong allocation is serious. the processor from prefetching instructions. XC-Cube shows better performance than All implementations show almost the same LW-SC mainly because it does not employ some performance in Bin2List even when only GC of additional operations in LW-SC, for exam- times are compared. This is because the invo- ple checking flags after returning from ordinary cation costs are negligible relative to the other functions and at the beginning of function bod- costs for GC (such as scanning heaps). ies (by using assembly-level techniques such as These results show that LW-SC works effec- modifying return addresses). However, the dif- tively if nested functions are not so frequently ference is negligibly small if the body of a func- called, and that CL-SC works better if they are tion is sufficiently large. called very often. Programmers and compiler CL-SC shows worse performance than LW- writers can choose one of these implementations SC since all local variables and parameters are depending on their situation. stored in the explicit stack and they never get 7. Related Work registers. 6.2 Invocation Cost 7.1 Compiler-Based Implementations To measure the cost of invoking nested func- of Nested Functions tions, we employ the following programs: As described above, GCC also features nested QSort sorts 200,000 integers by the quick sort function but it is less portable and takes high algorithm invoking a nested function as a maintenance/creation costs. XC-Cube imple- Vol. 47 No. SIG 0(PRO 29) A Transformation-Based Implementation of Lightweight Nested Functions 13

Table 2 Performance measurements (for the invocation cost).

Elapsed Time in seconds C GCC LW-SC XCC CL-SC QSort SPARC 0.795 0.821 7.04 8.03 0.931 (200,000) (Ratio to C) (1.00) (1.03) (8.86) (10.1) (1.17) Pentium 0.139 3.44 3.77 3.38 0.186 (Ratio to C) (1.00) (24.7) (27.1) (24.3) (1.33) Bin2List SPARC — 0.495 0.522 0.495 0.526 copying (GC time) 0.278 0.296 0.279 0.302 GC Pentium — 0.248 0.257 0.249 0.259 (GC time) 0.0647 0.0685 0.0669 0.0714

70 18 C C 16 60 GCC GCC LW-SC LW-SC XCC 14 XCC 50 CL-SC 12 CL-SC 40 10

30 8 6 20 Elapsed Time [sec] Elapsed Time [sec] 4 10 2 0 0 0 20 40 60 80 100 0 20 40 60 80 100 Depth of Execution Stack (# of calls) Depth of Execution Stack (# of calls) UltraSPARC-III Pentium 4 Fig. 17 Elapsed time in QSort against the number of intermediating function calls. ments nested functions with lightweight clo- ous work. sures by modifying the GCC compiler. It shows 7.4.1 Garbage Collection better performance, but it also lacks portabil- To implement garbage collection, the collec- ity. tor needs to be able to find all roots, each of 7.2 Closure Objects in Modern Lan- which holds a reference to an object in the guages garbage-collected heap. In C, a caller’s pointer Many modern languages such as Lisp and ML variable may hold an object reference, but it implement closures as first class objects. Those may be sleeping in the execution stack until closure objects are valid after exit of their owner the return to the caller. Even when using di- blocks. In most implementations they require rect stack manipulation, it is difficult for the some runtime supports such as garbage collec- collector to distinguished roots from other el- tion, which makes it C too inefficient to be used ements in the stack. For this reason, conser- as an intermediate language to implement high- vative collectors1) are usually used. Conserva- level languages. tive copying collectors can inspect the execution 7.3 Portable Assembly Languages stack but cannot modify it. Accurate copying C--11),14) also has an ability to access the GC can be performed by using translation tech- variables sleeping in the execution stack by us- niques based on “structure and pointer”6),7) ing the C-- runtime system to perform “stack with higher maintenance costs. walk”. We expect that its efficiency is better Fig. 15 partially shows how scanning roots than LW-SC, and almost equal to XC-Cube. In can be implemented using nested functions. terms of portability, LW-SC has an advantage getmem allocates a new object in heap and may that we can use pre-existing C . invoke the copying collector with nested func- 7.4 High-Level Services tion scan1. The copying collector can indi- This section lists high-level services which rectly call scan1 which performs the movement are important applications of nested functions (copy) of objects using roots (x, rest, a and and their implementation techniques in previ- kv) and indirectly calls scan0 in a nested man- 14 IPSJ Transactions on Programming June 2006 ner. The actual entity of scan0 may be another to walk the execution stack independently of instance of scan1 in the caller. The nested calls the stack pointer. When the child thread is are performed until the bottom of the stack is blocked, it can transfer the control to an arbi- reached. trary ancestor thread without copying the stack 7.4.2 Capturing/Restoring Stack State frames to heap. StackThreads/MP employs the Porch16) is a translator that transforms C unmodified GNU C compiler and implements programs into C programs supporting portable non-standard control-flows by a combination of checkpoints. Portable checkpoints capture an assembly language postprocessor and run- the state of a computation in a machine- time libraries. independent format that allows the transfer of Lazy Threads5) employ a similar but differ- computations across binary incompatible ma- ent approach to frame management and thread chines. They introduce source-to-source com- suspension. Frames are allocated in “stacklet”, pilation techniques for generating code to save which is a small stack for several frames. A and recover from such portable checkpoints au- blocked child thread returns the control to the tomatically. To save the stack state, the pro- parent without copying the stack frame to heap. gram repeatedly returns and legitimately saves When the parent is not at top of the stacklet, the parameters/local variables until the bot- it first allocates a new stacklet for allocating a tom of the stack is reached. During restoring, stack frame. Lazy Threads are implemented by this process is reversed. Similar techniques can modifying the GNU C compiler. be used to implement migration and first-class The implementation techniques for multiple continuations. threads using nested functions are shown in 8), As shown in Fig. 16, the stack state can be 17). captured without returning to the callers us- 7.4.4 Load Balancing ing nested functions. It uses similar tech- To realize efficient dynamic load balancing niques with the ones for scanning roots de- by transferring tasks among computing re- scribed above. sources in fine-grained parallel computing such 7.4.3 Multi-Threads: Latency Hiding as search problems, load balancing schemes Concert,13) OPA20) use similar translation which lazily create and extract a task by split- techniques to support suspension and resump- ting the present running task, such as Lazy Task tion of multiple threads on a single processor Creation (LTC),12) are effective. In LTC, a with a single execution stack (e.g., for latency newly created thread is directly and immedi- hiding). They create a new child thread as an ately executed like a usual call while (the con- ordinary function call and if the child thread tinuation of) the oldest thread in the computing completes its execution without being blocked, resource may be stolen by other idle computing the child thread simply returns the control to resources. Usually, the idle computing resource the parent thread. But in case of the sus- (thief ) randomly selects another computing re- pension of the child thread, the C functions source (victim) for stealing a task. for the child thread legitimately saves its (live) Compilers (translators) for multithreaded parameters/local variables into heap-allocated languages generate low-level code. In the orig- frames and simply returns the control to the inal LTC,12) assembly code is generated to di- parent thread. When a suspended thread be- rectly manipulate the execution stack. Both come runnable, it may legitimately restore nec- translators for Cilk4) and OPA20) generate C essary values from the heap-allocated frames. code. Since it is illegal and not portable for C The library implementation of StackThreads19) code to directly access the execution stack, the provides special two service routines: Cilk and OPA translators generate two versions switch_to_parent to save the context (state) (fast/slow) of code; the fast version code saves of the child thread and transfer the control to values of live variables in a heap-allocated frame the parent thread, and restart_thread to re- upon call (in the case of Cilk) or return (in the store the context and transfer the control to case of OPA) so that the slow version code can the restarted thread. These routines are imple- continue the rest of computation based on the mented in assembly languages by paying special heap-allocated saved continuation. attention to the treatment of callee-save regis- A message passing implementation3) of LTC ters. employs a polling method where the victim de- StackThreads/MP18) allows the frame pointer tects a task request sent by the thief and re- Vol.47 No.SIG0(PRO29) A Transformation-Based Implementation ofLightweight Nested Functions 15 turns a new task created by splitting the present 1299 (1996). running task. This techniques enables OPA,20) 7) Henderson, F.: Accurate Garbage Collection StackThreads/MP18) and Lazy Threads5) to in an Uncooperative Environment, Proc. of the support load balancing. 3rd International Symposium on Memory Man- We restructure LTC with backtracking, agement, pp.150–156 (2002). where callers’ variable are accessed by us- 8) Hiraishi, T., Li, X., Yasugi, M., Umatani, S. ing nested functions for infrequent task cre- and Yuasa, T.: Language Extension by Rule- ation.21),22) Based Transformation for S-Expression-Based C Languages, IPSJ Transsactions on Program- 8. Conclusion and Future Work ming, Vol. 46, No. SIG1(PRO 24), pp. 40–56 (2005). (in Japanese). This paper has presented a technique to im- 9) Hiraishi, T., Yasugi, M. and Yuasa, T.: Effec- plement nested functions for the C language, tive Utilization of Existing C Header Files in employing the SC language system. Since the Other Languages with Different Syntaxes, 7th implementation is transformation-based, it en- Workshop on Programming and Programming ables us to implement high-level services with Languages (PPL2005) (2005). (in Japanese). “stack walk” in a portable way. Further- 10) Hiraishi, T., Yasugi, M. and Yuasa, T.: Im- more, such services can be efficiently imple- plementing S-Expression Based Extended Lan- mented because we aggressively reduce the cost guages in Lisp, Proceedings of the International of creating and maintaining nested functions Lisp Conference, Stanford, CA, pp. 179–188 using “lightweight” closures. Future work in- (2005). cludes acutually implementing high-level lan- 11) Jones, S. P., Ramsey, N. and Reig, F.: C--: guages with such services (e.g., providing a a Portable Assembly Language that Supports garbage collected heap with a copying collec- Garbage Collection, International Conference on Principles and Practice of Declarative Pro- tor). gramming (1999). Acknowledgments This work was partly 12) Mohr, E., Kranz, D. A. and Halstead, Jr., supported by the Japan Society for the Pro- R.H.: Lazy Task Creation: A Technique for In- motion of Science (JSPS) the 21st Century creasing the Granularity of Parallel Programs, COE Program (Grant No. 14213201) and IEEE Transactions on Parallel and Distributed MEXT Grant-in-Aid for Exploratory Research Systems, Vol.2, No.3, pp.264–280 (1991). (17650008). 13) Plevyak, J., Karamcheti, V., Zhang, X. and Chien, A. A.: A Hybrid Execution Model References for Fine-Grained Languages on Distributed 1) Boehm, H.-J. and Weiser, M.: Garbage Col- Memory Multicomputers, Supercomputing’95 lection in an Uncooperative Environment, Soft- (1995). ware Practice & Experience, Vol.18, No.9, pp. 14) Ramsey, N. and Jones, S. P.: A Single In- 807–820 (1988). termediate Language That Supports Multiple 2) Breuel, T. M.: Lexical Closures for C++, Implementations of Exceptions, Proc. of the Usenix Proceedings, C++ Conference (1988). ACM SIGPLAN’00 Conference on Program- 3) Feeley, M.: A Message Passing Implementa- ming Language Design and Implementation, tion of Lazy Task Creation, Proceedings of pp.285–298 (2000). International Workshop on Parallel Symbolic 15) Stallman, R. M.: Using and Porting GNU Computing: Languages, Systems, and Appli- Compiler Collection (1999). cations, Lecture Notes in Computer Science, 16) Strumpen, V.: Compiler Technology for No.748, Springer-Verlag, pp.94–107 (1993). Portable Checkpoints, http://theory.lcs. 4) Frigo, M., Leiserson, C. E. and Randall, mit.edu/∼porch/ (1998). K.H.: The Implementation of the Cilk-5 Mul- 17) Tabata, Y., Yasugi, M., Komiya, T. and tithreaded Language, ACM SIGPLAN Notices Yuasa, T.: Implementation of Multiple Threads (PLDI’98), Vol.33, No.5, pp.212–223 (1998). by Using Nested Functions, IPSJ Transactions 5) Goldstein, S. C., Schauser, K. E. and Culler, on Programming, Vol.43, No.SIG 3(PRO 14), D.E.: Lazy Threads: Implementing a Fast Par- pp.26–40 (2002). (in Japanese). allel Call, Journal of Parallel and Distributed 18) Taura, K., Tabata, K. and Yonezawa, A.: Computing, Vol.3, No.1, pp.5–20 (1996). StackThreads/MP: Integrating Futures into 6) Hanson, D. R. and Raghavachari, M.: A Calling Standards, Proceedings of ACM SIG- Machine-Independent Debugger, Software – PLAN Symposium on Principles & Practice Practice & Experience, Vol.26, No.11, pp.1277– of Parallel Programming (PPoPP), pp. 60–71 16 IPSJ Transactions on Programming June 2006

(1999). 19) Taura, K. and Yonezawa, A.: Fine-grain Mul- (def (h esp i g) (fn int (ptr char) int (ptr closure-t)) tithreading with Minimal Compiler Support – (def argp (ptr char)) A Cost Effective Approach to Implementing (def efp (ptr (struct h_frame))) Efficient Multithreading Languages, Proc. of (def new-esp (ptr char)) (def esp-flag size-t (bit-and (cast size-t esp) 3)) Conference on Programming Language Design (def tmp int) and Implementation, pp.320–333 (1997). (def tmp2 int) 20) Umatani, S., Yasugi, M., Komiya, T. and (def tmp_fp (ptr closure-t)) (def tmp_fp2 (ptr closure-t)) Yuasa, T.: Pursuing Laziness for Efficient Im- plementation of Modern Multithreaded Lan- ;; Judge whether reconstruction of the execution stack is guages, Proc. of the 5th International Sympo- ;; required or not. (if esp-flag sium on High Performance Computing, Lecture (begin Notes in Computer Science, No.2858, pp.174– (= esp (cast (ptr char) 188 (2003). (bit-xor (cast size-t esp) esp-flag))) (= efp (cast (ptr (struct h_frame)) esp)) 21) Yasugi, M., Komiya, T. and Yuasa, T.: Dy- ;; Move the stack pointer by the frame size. namic Load Balancing by Using Nested Func- (= esp tions and Its High-Level Description, IPSJ (cast (ptr char) (+ (cast (ptr align-t) esp) Transactions on Advanced Computing Systems, (/ (+ (sizeof (struct h_frame)) Vol.45, No.SIG 11(ACS 7), pp.368–377 (2004). (sizeof align-t) -1) (in Japanese). (sizeof align-t))))) (= (mref (cast (ptr (ptr char)) esp)) 0) 22) Yasugi, M., Komiya, T. and Yuasa, T.: An ;; Restore the execution point. Efficient Load-Balancing Framework Based on (label LGOTO Lazy Partitioning of Sequential Programs, Pro- (switch (fref (mref efp) call-id) (case 0) (goto l_CALL) ceedings of Workshop on New Approaches to (case 1) (goto l_CALL2))) Software Construction, pp.65–84 (2004). (goto l_CALL))) 23) Yasugi, M., Takada, J., Tabata, Y., Komiya, (= efp (cast (ptr (struct h_frame)) esp)) ;; Move the stack pointer by the frame size. T. and Yuasa, T.: Primitives for Shared (= esp Memory and its Implementation with GCC, (cast (ptr char) IPSJ Transsactions on Programming, Vol.43, (+ (cast (ptr align-t) esp) (/ (+ (sizeof (struct h_frame)) No. SIG 1(PRO 13), pp. 118–132 (2002). (in (sizeof align-t) -1) Japanese). (sizeof align-t))))) (= (mref (cast (ptr (ptr char)) esp)) 0) ;; Call the nested function g. Appendix (begin (= tmp_fp g) (= argp A.1 An example of translation from (cast (ptr char) LW-SC into SC-0 (+ (cast (ptr align-t) esp) (/ (+ (sizeof (ptr char)) ;;; The pointer to the moved ‘‘nested function’’. (sizeof align-t) -1) (deftype nestfn-t (sizeof align-t))))) (ptr (fn (ptr char) (ptr char) (ptr void)))) ;; Push the arguments passed to nested function. ;;; The structure which contains the pointer to the moved (exps (= (mref (cast (ptr int) argp)) i) ;;; nested function and the frame pointer of (= argp ;;; the owner function. (cast (ptr char) (deftype closure-t struct (+ (cast (ptr align-t) argp) (def fun nestfn-t) (/ (+ (sizeof int) (def fr (ptr void))) (sizeof align-t) -1) (sizeof align-t)))))) (deftype align-t double) ;; Push the structure objcect that corresponds to ;; the frame of the nested function to ;;; The auxiliary function for calling nested functions. ;; the explicit stack. (def (lw-call esp) (fn (ptr char) (ptr char)) (= (mref (cast (ptr (ptr closure-t)) argp)) tmp_fp) (def clos (ptr closure-t) ;; Save the values of local variables to the frame. (mref (cast (ptr (ptr closure-t)) esp))) (= (fref efp -> tmp2) tmp2) (return ((fref clos -> fun) esp (fref clos -> fr)))) (= (fref efp -> tmp) tmp) (= (fref efp -> g) g) ;;; The frame structure of function h. (= (fref efp -> i) i) (def (struct h_frame) (= (fref efp -> argp) argp) (def tmp-esp (ptr char)) (= (fref efp -> tmp-esp) argp) (def argp (ptr char)) ;; Save the current execution point. (def call-id int) (= (fref efp -> call-id) 0) (def tmp2 int) (return (- (cast int 0) 1)) (def tmp int) ;; Continue the execution from here after the func- (def g (ptr closure-t)) tion call finishes. (def i int)) (label l_CALL nil) Vol. 47 No. SIG 0(PRO 29) A Transformation-Based Implementation of Lightweight Nested Functions 17

;; Restore local variables from the explicit stack. (def xfp (ptr (struct foo_frame)) xfp0) (= tmp2 (fref efp -> tmp2)) (def esp-flag size-t (bit-and (cast size-t esp) 3)) (= tmp (fref efp -> tmp)) (def parmp (ptr char) (= g (fref efp -> g)) (cast (ptr char) (= i (fref efp -> i)) (bit-xor (cast size-t esp) esp-flag))) ;; Get the return value. ;; Pop parameters from the explicit stack. (= tmp (mref (cast (ptr int) (fref efp -> argp))))) (def b int ;; Call the nested function g. (exps (begin (= parmp (= tmp_fp2 g) (cast (ptr char) (= argp (- (cast (ptr align-t) parmp) (cast (ptr char) (/ (+ (sizeof int) (sizeof align-t) -1) (+ (cast (ptr align-t) esp) (sizeof align-t))))) (/ (+ (sizeof (ptr char)) (mref (cast (ptr int) parmp)))) (sizeof align-t) -1) (sizeof align-t))))) (label LGOTO nil) ;; Push the arguments passed to nested function. (= efp (cast (ptr (struct g1_in_foo_frame)) esp)) (exps (= (mref (cast (ptr int) argp)) tmp) ;; Move the stack pointer by the frame size. (= argp (= esp (cast (ptr char) (cast (ptr char) (+ (cast (ptr align-t) argp) (+ (cast (ptr align-t) esp) (/ (+ (sizeof int) (/ (+ (sizeof (struct g1_in_foo_frame)) (sizeof align-t) -1) (sizeof align-t) -1) (sizeof align-t)))))) (sizeof align-t))))) ;; Push the structure objcect that corresponds to (= (mref (cast (ptr (ptr char)) esp)) 0) ;; the frame of the nested function to (inc (fref xfp -> x)) ;; the explicit stack. ;; Push the return value to the explicit stack. (= (mref (cast (ptr (ptr closure-t)) argp)) (= (mref (cast (ptr int) efp)) (+ (fref xfp -> a) b)) tmp_fp2) (return 0)) ;; Save the values of local variables to the frame. (= (fref efp -> tmp2) tmp2) (def (foo esp a) (fn int (ptr char) int) (= (fref efp -> tmp) tmp) (def efp (ptr (struct foo_frame))) (= (fref efp -> g) g) (def new-esp (ptr char)) (= (fref efp -> i) i) (def esp-flag size-t (bit-and (cast size-t esp) 3)) (= (fref efp -> argp) argp) (def x int 0) (= (fref efp -> tmp-esp) argp) (def y int 0) ;; Save the current execution point. (def tmp3 int) (= (fref efp -> call-id) 1) (return (- (cast int 0) 1)) ;; Judge whether reconstruction of the execution stack is ;; Continue the execution from here after ;; required or not. ;; the function call finishes. (if esp-flag (label l_CALL2 nil) (begin (= tmp2 (fref efp -> tmp2)) (= esp (cast (ptr char) (= tmp (fref efp -> tmp)) (bit-xor (cast size-t esp) esp-flag))) (= g (fref efp -> g)) (= efp (cast (ptr (struct foo_frame)) esp)) (= i (fref efp -> i)) ;; Move the stack pointer by the frame size. ;; Get the return value. (= esp (= tmp2 (mref (cast (ptr int) (cast (ptr char) (fref efp -> argp))))) (+ (cast (ptr align-t) esp) (return tmp2)) (/ (+ (sizeof (struct foo_frame)) (sizeof align-t) -1) ;;; The frame structure of function foo. (sizeof align-t))))) (def (struct foo_frame) (= (mref (cast (ptr (ptr char)) esp)) 0) (def tmp-esp (ptr char)) (label LGOTO (def argp (ptr char)) ;; Restore the execution point. (def call-id int) (switch (fref (mref efp) call-id) (def tmp3 int) (case 0) (goto l_CALL3))) (def y int) (goto l_CALL3))) (def x int) (= efp (cast (ptr (struct foo_frame)) esp)) (def a int) ;; Move the stack pointer by the frame size. (def g10 closure-t)) (= esp (cast (ptr char) ;;; The frame structure of function g1 . (+ (cast (ptr align-t) esp) (def (struct g1_in_foo_frame) (/ (+ (sizeof (struct foo_frame)) (def tmp-esp (ptr char)) (sizeof align-t) -1) (def argp (ptr char)) (sizeof align-t))))) (def call-id int) (= (mref (cast (ptr (ptr char)) esp)) 0) (def b int) (= new-esp esp) (def xfp (ptr (struct foo_frame)))) ;; Call the ordinary function h. (while ;;; Nested function g1 (moved to the top-level). (and (def (g1_in_foo esp xfp0) (== (= tmp3 (h new-esp 10 (fn (ptr char) (ptr char) (ptr void)) (ptr (fref (def new-esp (ptr char)) (cast (ptr (struct foo_frame)) (def efp (ptr (struct g1_in_foo_frame))) esp) ;; The frame pointer of the owner function. -> g10)))) 18 IPSJ Transactions on Programming June 2006

(- (cast int 0) 1)) Tasuku Hiraishi was born in (!= (= (fref efp -> tmp-esp) 1981. He received the B.E. de- (mref (cast (ptr (ptr char)) esp))) 0)) ;; Save the values of local variables to the frame. gree in Information Science in (= (fref efp -> tmp3) tmp3) 2003, the Master of Informatics (= (fref efp -> y) y) (= (fref efp -> x) x) in 2005, from Kyoto University. (= (fref efp -> a) a) He is currently a PhD candidate (= (fref efp -> g10 fun) g1_in_foo) (= (fref efp -> g10 fr) (cast (ptr void) efp)) at Graduate School of Informat- ;; Save the current execution point. ics, Kyoto University. His research interests in- (= (fref efp -> call-id) 0) clude programming languages and parallel pro- (return (- (cast int 0) 1)) ;; Continue the execution from here after cessing. He is a student member of the Japan ;; the function call finishes. Society for Software Science and Technology. (label l_CALL3 nil) ;; Restore local variables from the explicit stack. (= tmp3 (fref efp -> tmp3)) (= y (fref efp -> y)) (= x (fref efp -> x)) Masahiro Yasugi was born (= a (fref efp -> a)) in 1967. He received a B.E. in (= new-esp (+ esp 1))) electronic engineering, an M.E. (= y tmp3) (return (+ x y))) in electrical engineering, and a Ph.D. in information science ;;; The frame structure of function main . (def (struct main_frame) from the University of Tokyo in (def tmp-esp (ptr char)) 1989, 1991, and 1994, respec- (def argp (ptr char)) tively. In 1993–1995, he was a fellow of the (def call-id int) (def tmp4 int)) JSPS (at the University of Tokyo and the Uni- versity of Manchester). In 1995–1998, he was (def (main) (fn int) (def efp (ptr (struct main_frame))) a research associate at Kobe University. Since (def new-esp (ptr char)) 1998, he has been working at Kyoto University (def estack (array char 65536)) ; The explicit stack. (def esp (ptr char) estack) as an assistant professor (lecturer) and an as- (def tmp4 int) sociate professor (since 2003). In 1998–2001, he was a researcher at PRESTO, JST. His re- (label LGOTO nil) (= efp (cast (ptr (struct main_frame)) esp)) search interests include programming languages ;; Move the stack pointer by the frame size. and parallel processing. He is a member of the (= esp (cast (ptr char) ACM and the Japan Society for Software Sci- (+ (cast (ptr align-t) esp) ence and Technology. (/ (+ (sizeof (struct main_frame)) (sizeof align-t) -1) (sizeof align-t))))) (= (mref (cast (ptr (ptr char)) esp)) 0) Taiichi Yuasa received a (= new-esp esp) (while Bachelor of Mathematics degree (and in 1977, the Master of Mathe- (== (= tmp4 (foo new-esp 1)) (- (cast int 0) 1)) matical Sciences degree in 1979, (!= (= (fref efp -> tmp-esp) and a Doctor of Science degree (mref (cast (ptr (ptr char)) esp))) 0)) (def goto-fr (ptr char)) in 1987, all from Kyoto Univer- (= (mref (cast (ptr (ptr char)) esp)) 0) sity, Kyoto, Japan. He joined (= (fref efp -> tmp4) tmp4) the faculty of the Research Institute for Mathe- ;; Execute nested functions. (= goto-fr (lw-call (fref efp -> tmp-esp))) matical Sciences, Kyoto University, in 1982. He (if (== (cast (ptr char) goto-fr) is currently a Professor at the Graduate School (cast (ptr char) efp)) (goto LGOTO)) of Informatics, Kyoto University, Kyoto, Japan. (= new-esp (+ esp 1))) His current areas of interest include symbolic (return tmp4)) computation and programming language sys- (Received September 22, 2005) tems. He is a member of the ACM, the IEEE, (Accepted December 19, 2005) the Institute of Electronics, Information and Communication Engineers, and the Japan So- ciety for Software Science and Technology.