Stack-based typed assembly language

This paper presents STAL, a variant of Typed Assembly Language with constructs and types to support a limited form of stack allocation. As with other statically-typed low-level languages, the type system of STAL ensures that a wide class of errors cannot occur at run time, and therefore the language can be adapted for use in certifying compilers where security is a concern. Like the Java Virtual Machine Language (JVML), STAL supports stack allocation of local variables and procedure activation records, but unlike the JVML, STAL does not presuppose fixed notions of procedures, exceptions, or calling conventions. Rather, compiler writers can choose encodings for these high-level constructs using the more primitive RISC-like mechanisms of STAL. Consequently, some important optimizations that are impossible to perform within the JVML, such as tail call elimination or callee-saves registers, can be easily expressed within STAL.


Introduction and Motivation
Statically typed source languages have efficiency and software engineering advantages over their dynamically typed counterparts. Modern type-directed compilers [19,25,7,32,20,28,12] exploit the properties of typed languages more extensively than their predecessors by preserving type information computed in the front end through a series of typed intermediate languages. These compilers use types to direct sophisticated transformations such as closure conversion [18,31,17,4,21], region inference [8], subsumption elimination [9,11], and unboxing [19,24,29]. In many cases, without types, these transformations are less effective or simply impossible. Furthermore, the type translation partially specifies the corresponding term translation and often captures the critical concerns in an elegant and succinct fashion. Strong type systems not only describe but also enforce many important invariants. Consequently, developers of type-based compilers may invoke a type-checker after each code transformation, and if the output fails to type-check, the developer knows that the compiler contains an internal error. Although type-checkers for decidable type systems will not catch all compiler errors, they have proven themselves valuable debugging tools in practice [22].
Despite the numerous advantages of compiling with types, until recently, no compiler propagated type information through the final stages of code generation. The TIL/ML compiler, for instance, preserves types through approximately 80% of compilation but leaves the remaining 20% untyped. Many of the complex tasks of code generation including register allocation and instruction scheduling are left unchecked and types cannot be used to specify or explain these low-level code transformations.
These observations motivated our exploration of very low-level type systems and corresponding compiler technology. In Morrisett et al. [23], we presented a typed assembly language (TAL) and proved that its type system was sound with respect to an operational semantics. We demonstrated the expressiveness of this type system by sketching a type-preserving compiler from an ML-like language to TAL. The compiler ensured that well-typed source programs were always mapped to well-typed assembly language programs and that they preserved source level abstractions such as user-defined abstract data types and closures. Furthermore, we claimed that the type system of TAL did not interfere with many traditional compiler optimizations including inlining, loop-unrolling, register allocation, instruction selection, and instruction scheduling.
However, the compiler we presented was critically based on a continuation-passing style transform, which eliminated the need for a control stack. In particular, activation records were represented by heap-allocated closures as in the SML of New Jersey compiler (SML/NJ) [5,3]. For example, Figure 1 shows the TAL code our heap-based compiler would produce for the recursive factorial computation. Each function takes an additional argument which represents the control stack as a continuation closure. Instead of "returning" to the caller, a function invokes its continuation closure by jumping directly to the code of the closure, passing the environment of the closure and the result in registers.
Allocating continuation closures on the heap has many advantages over a conventional stack-based implementation. First, it is straightforward to implement control primitives such as exceptions, first-class continuations, or user-level lightweight coroutine threads when continuations are heap allocated [3,31,34]. Second, Appel and Shao [2] have shown that heap allocation of closures can have better space properties, primarily because it is easier to share environments. Third, there is a unified memory management mechanism (namely the garbage collector) for allocating and collecting all kinds of objects, including stack frames. Finally, Appel and Shao [2] have argued that, at least for SML/NJ, the locality lost by heap-allocating stack frames is negligible.
Nevertheless, there are also compelling reasons for providing support for stacks. First, Appel and Shao's work did not consider imperative languages, such as Java, where the ability to share environments is greatly reduced nor did it consider languages that do not require garbage collection. Second, Tarditi and Diwan [14,13] have shown that with some cache architectures, heap allocation of continuations (as in SML/NJ) can have substantial overhead due to a loss of locality. Third, stack-based activation records can have a smaller memory footprint than heap-based activation records. Finally, many machine architectures have hardware mechanisms that expect programs to behave in a stack-like fashion. For example, the Pentium Pro processor has an internal stack that it uses to predict return addresses for procedures so that instruction pre-fetching will not be stalled [16]. The internal stack is guided by the use of call/return primitives which use the standard control stack.
Clearly, compiler writers must weigh a complex set of factors before choosing stack allocation, heap allocation, or both. The target language must not constrain these design decisions. In this paper, we explore the addition of a stack to our typed assembly language in order to give compiler writers the flexibility they need. Our stack typing discipline is remarkably simple, but powerful enough to compile languages such as Pascal, Java, or ML without adding high-level primitives to the assembly language. More specifically, the typing discipline supports stack allocation of temporary variables and values that do not escape, stack allocation of procedure activation frames, exception handlers, and displays, as well as optimizations such as callee-saves registers. Unlike the JVM architecture [20], our system does not constrain the stack to have the same size at each control-flow point, nor does it require new high-level primitives for procedure call/return. Instead, our assembly language continues to have low-level RISC-like primitives such as loads, stores, and jumps. However, source-level stack allocation, general source-level stack pointers, general pointers into either the stack or heap, and some advanced optimizations cannot be typed.
A key contribution of the type structure is that it provides a unifying declarative framework for specifying procedure calling conventions regardless of the allocation strategy. In addition, the framework further elucidates the connection between a heap-based continuation-passing style compiler and a conventional stack-based compiler. In particular, this type structure makes explicit the notion that the only differences between the two styles are that, instead of passing the continuation as a boxed, heap-allocated tuple, a stack-based compiler passes the continuation unboxed in registers and allocates the environments for continuations on the stack. The general framework makes it easy to transfer transformations developed for one style to the other. For instance, we can easily explain the callee-saves registers of SML/NJ [5,3,1] and the callee-saves registers of a stack-based compiler as instances of a more general CPS transformation that is independent of the continuation representation.

Overview of TAL and CPS-Based Compilation
In this section, we briefly review our original proposal for typed assembly language (TAL) and sketch how a polymorphic functional language, such as ML, can be compiled to TAL in a continuation-passing style, where continuations are heap-allocated. The syntax of TAL appears in Figure 2. The instruction unpack[α, r], v binds α in the following instructions. We consider syntactic objects to be equivalent up to alpha-conversion, and consider label assignments, register assignments, heaps, and register files equivalent up to reordering of labels and registers. Register names do not alpha-convert. The notation X̄ denotes a sequence of zero or more Xs, and | · | denotes the length of a sequence.
The instruction set consists mostly of conventional RISC-style assembly operations, including arithmetic, branches, loads, and stores. One exception, the unpack instruction, strips the quantifier from the type of an existentially typed value and introduces a new type variable into scope. On an untyped machine, this is implemented by an ordinary move. The other non-standard instruction is malloc, which is explained below. Evaluation is specified as a deterministic rewriting system that takes programs to programs (see Morrisett et al. [23] for details).
The types for TAL consist of type variables, integers, tuple types, existential types, and polymorphic code types. Tuple types contain initialization flags (either 0 or 1) that indicate whether or not components have been initialized. For example, if register r has type ⟨int⁰, int¹⟩, then it contains a label bound in the heap to a pair that can contain integers, where the first component may not have been initialized, but the second component has been. In this context, the type system allows the second component to be loaded, but not the first. If an integer value is stored into r(0), then afterwards r has the type ⟨int¹, int¹⟩, reflecting the fact that the first component is now initialized. The instruction malloc r[τ1, . . . , τn] heap-allocates a new tuple with uninitialized fields and places its label in register r.
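For instance, the following short sequence (a sketch in the notation above; the immediate move into r2 is assumed) allocates a pair and initializes both fields, with the comments tracking r1's type:

  malloc r1[int, int]   % r1 : ⟨int⁰, int⁰⟩
  mov r2, 1             % r2 : int
  st r1(0), r2          % r1 : ⟨int¹, int⁰⟩
  st r1(1), r2          % r1 : ⟨int¹, int¹⟩
  ld r3, r1(1)          % legal: the second field is now initialized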
Code types (∀[α1, . . . , αn].Γ) describe code blocks (code[α1, . . . , αn]Γ.I), which are instruction sequences I that expect a register file of type Γ and in which the type variables α1, . . . , αn are held abstract. In other words, Γ serves as a register file pre-condition that must hold before control may be transferred to the code block. Code blocks have no post-condition because control is either terminated via a halt instruction or transferred to another code block.
The type variables that are abstracted in a code block provide a means to write polymorphic code sequences. For example, the polymorphic code block

  code[α]{r1:α, r2:∀[].{r1:⟨α¹, α¹⟩}}.
    malloc r3[α, α]
    st r3(0), r1
    st r3(1), r1
    mov r1, r3
    jmp r2

roughly corresponds to a CPS version of the SML function fn (x:α) => (x, x). The block expects upon entry that register r1 contains a value of the abstract type α, and r2 contains a return address (or continuation label) of type ∀[].{r1:⟨α¹, α¹⟩}. In other words, the return address requires register r1 to contain an initialized pair of values of type α before control can be returned to this address. The instructions of the code block allocate a tuple, store into the tuple two copies of the value in r1, move the pointer to the tuple into r1, and then jump to the return address in order to "return" the tuple to the caller. If the code block is bound to a label ℓ, then it may be invoked by simultaneously instantiating the type variable and jumping to the label (e.g., jmp ℓ[int]).
Source languages like ML have nested higher-order functions that might contain free variables and thus require closures to represent functions. At the TAL level, we represent closures as a pair consisting of a code block label and a pointer to an environment data structure. The type of the environment must be held abstract in order to avoid typing difficulties [21], and thus we pack the type of the environment and the pair to form an existential type.
All functions, including continuation functions introduced during CPS conversion, are thus represented as existentials. For example, once CPS converted, a source function of type int → ⟨⟩ has type (int, (⟨⟩ → void)) → void.¹ After closures are introduced, the code will have type:

  ∀[α1].{r1:α1, r2:int, r3:∃α2.⟨∀[].{r1:α2, r2:⟨⟩}¹, α2¹⟩}

Finally, at the TAL level the function will be represented by a value with the type:

  ∃α1.⟨∀[].{r1:α1, r2:int, r3:∃α2.⟨∀[].{r1:α2, r2:⟨⟩}¹, α2¹⟩}¹, α1¹⟩

Here, α1 is the abstracted type of the closure's environment. The code for the closure requires that the environment be passed in register r1, the integer argument in r2, and the continuation in r3.
The continuation is itself a closure where α 2 is the abstracted type of its environment. The code for the continuation closure requires that the environment be passed in r1 and the unit result of the computation in r2.
To apply a closure at the TAL level, we first use the unpack operation to open the existential package. Then the code and the environment of the closure pair are loaded into appropriate registers, along with the argument to the function. Finally, we use a jump instruction to transfer control to the closure's code. Figure 1 gives the CPS-based TAL code for the following ML expression, which computes six factorial:

  let fun fact n = if n = 0 then 1 else n * (fact (n - 1))
  in fact 6
  end

¹ The void return types are intended to suggest the non-returning aspect of CPS code.

Adding Stacks to TAL
In this section, we show how to extend TAL to achieve a Stack-based Typed Assembly Language (STAL). Figure 3 defines the new syntactic constructs for the language, including:

  types                 τ ::= · · · | ns
  stack types           σ ::= ρ | nil | τ::σ
  type assignments      ∆ ::= · · · | ρ, ∆
  register assignments  Γ ::= {r1:τ1, . . . , rn:τn, sp:σ}
  word values           w ::= · · · | ns

In what follows, we informally discuss the dynamic and static semantics for the modified language, leaving formal treatment to Appendix A.
Operationally, we model stacks (S) as lists of word-sized values. Uninitialized stack slots are filled with nonsense (ns). Register files now include a distinguished register, sp, which represents the current stack. There are four new instructions that manipulate the stack. The salloc n instruction places n words of nonsense on the top of the stack. In a conventional machine, assuming stacks grow towards lower addresses, an salloc instruction would correspond to subtracting n from the current value of the stack pointer. The sfree n instruction removes the top n words from the stack, and corresponds to adding n to the current stack pointer. The sld r, sp(i) instruction loads the i-th word of the stack into register r, whereas the sst sp(i), r instruction stores register r into the i-th word. Note that the instructions ld and st cannot be used with the stack pointer.
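For instance, assuming sp initially describes a stack of type σ and register r1 holds an integer, the following well-typed sequence (a small sketch) allocates two slots, uses one, and frees both; the comments show sp's type after each instruction:

  salloc 2         % sp : ns::ns::σ
  sst sp(0), r1    % sp : int::ns::σ
  sld r2, sp(0)    % r2 : int
  sfree 2          % sp : σ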
A program becomes stuck if it attempts to execute:
• sfree n and the stack does not contain at least n words,
• sld r, sp(i) and the stack does not contain at least i + 1 words or else the i-th word of the stack is ns, or
• sst sp(i), r and the stack does not contain at least i + 1 words.
As in the original TAL, the typing rules for the modified language prevent well-formed programs from becoming stuck.
Stacks are described by stack types (σ), which include nil and τ ::σ. The latter represents a stack of the form w::S where w has type τ and S has type σ. Stack slots filled with nonsense have type ns. Stack types also include stack type variables (ρ) which may be used to abstract the tail of a stack type. The ability to abstract stacks is critical for supporting procedure calls and is discussed in detail later.
As before, the register file for the abstract machine is described by a register file type (Γ) mapping registers to types. However, Γ also maps the distinguished register sp to a stack type σ. Finally, code blocks and code types support polymorphic abstraction over both types and stack types.
One of the uses of the stack is to save temporary values during a computation. The general problem is to save on the stack n registers, say r 1 through r n , of types τ 1 through τ n , perform some computation e, and then restore the temporary values to their respective registers. This would be accomplished by the following instruction sequence where the comments (delimited by %) show the stack's type at each step of the computation.

  salloc n            % ns:: · · · ::ns::σ
  sst sp(0), r1       % τ1::ns:: · · · ::ns::σ
      ...
  sst sp(n − 1), rn   % τ1::τ2:: · · · ::τn::σ
      (code for e)
  sld r1, sp(0)       % τ1::τ2:: · · · ::τn::σ
      ...
  sld rn, sp(n − 1)   % τ1::τ2:: · · · ::τn::σ
  sfree n             % σ

If, upon entry, ri has type τi and the stack is described by σ, and if the code for e leaves the state of the stack unchanged, then this code sequence is well-typed. Furthermore, the typing discipline does not place constraints on the order in which the stores or loads are performed.
It is straightforward to model higher-level primitives, such as push and pop. The former can be seen as simply salloc 1 followed by a store to sp(0), whereas the latter is a load from sp(0) followed by sfree 1. Also, a "jump-and-link" or "call" instruction which automatically moves the return address into a register or onto the stack can be synthesized from our primitives. To simplify the presentation, we did not include these instructions in STAL; a practical implementation, however, would need a full set of instructions appropriate to the architecture.
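In the notation above, the two encodings might be rendered as the following sketches (with r1 : τ and sp : σ initially):

  % push r1:
  salloc 1         % sp : ns::σ
  sst sp(0), r1    % sp : τ::σ

  % pop r1:
  sld r1, sp(0)    % r1 : τ
  sfree 1          % sp : σ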
The stack is commonly used to save the current return address and temporary values across procedure calls. Which registers to save and in what order is usually specified by a compiler-specific calling convention. Here we consider a simple calling convention where it is assumed there is one integer argument and one unit result, both of which are passed in register r1, and the return address is passed in the register ra. When invoked, a procedure may choose to place temporaries on the stack as shown above, but when it jumps to the return address, the stack should be in the same state as it was upon entry. Naively, we might expect the code for a function obeying this calling convention to have the following STAL type:

  ∀[].{r1:int, sp:σ, ra:∀[].{r1:⟨⟩, sp:σ}}

Notice that the type of the return address is constrained so that the stack must have the same shape upon return as it had upon entry. Hence, if the procedure pushes any arguments onto the stack, it must pop them off.
However, this typing is unsatisfactory for two reasons. The first problem is that there is nothing preventing the procedure from popping off values from the stack and then pushing new values (of the appropriate type) onto the stack. In other words, the caller's stack frame is not protected from the function's code. The second problem is much worse: such a function can only be invoked from states where the stack is exactly described by σ. This effectively prevents invocation of the procedure from two different points in the program. In particular, there is no way for the procedure to push its return address on the stack and jump to itself.
The solution to both problems is to abstract the type of the stack using a stack type variable:

  ∀[ρ].{r1:int, sp:ρ, ra:∀[].{r1:⟨⟩, sp:ρ}}

To invoke a function with this type, the caller must instantiate the bound stack type variable ρ with the current type of the stack. As before, the function can only jump to the return address when the stack is in the same state as it was upon entry. However, the first problem above is addressed because the type checker treats ρ as an abstract stack type while checking the body of the code. Hence, the code cannot perform an sfree, sld, or sst on the stack. It must first allocate its own space on the stack; only this space may be accessed by the function, and the space must be freed before returning to the caller.² The second problem is solved because the stack type variable may be instantiated in different ways. Hence multiple call sites with different stack states, including recursive calls, may now invoke the function. In fact, a recursive call will usually instantiate the stack variable with a different type than the original call because, unless it is a tail call, it will need to store its return address on the stack.

Figure 4 gives stack-based code for the factorial example of the previous section. The function is invoked by moving its environment (an empty tuple) into r1, the argument into r2, and the return address label into ra, and jumping to the label l_fact. Notice that the nonzero branch must save the argument and current return address on the stack before jumping to l_fact in a recursive call. It is interesting to note that the stack-based code is quite similar to the heap-based code of Figure 1. Indeed, the code remains in a continuation-passing style, but instead of passing the continuation as a heap-allocated tuple, the environment of the continuation is passed in the stack pointer and the code of the continuation is passed in the return address register.
To more fully appreciate the correspondence, consider the type of the TAL version of l_fact from Figure 1:

  ∀[].{r1:⟨⟩, r2:int, r3:∃α.⟨∀[].{r1:α, r2:int}¹, α¹⟩}

and the type of the STAL version from Figure 4:

  ∀[ρ].{r1:⟨⟩, r2:int, sp:ρ, ra:∀[].{r1:int, sp:ρ}}

The existential quantifier over the continuation's environment type becomes a universal quantifier over the tail of the stack, and the continuation's code and environment are passed unboxed in ra and sp rather than as a boxed pair in r3.

Our techniques can be applied to other calling conventions and do not appear to inhibit most optimizations. For instance, tail calls can be eliminated in CPS simply by forwarding a continuation closure to the next function. If continuations are allocated on the stack, we have the mechanisms to pop the current activation frame off the stack and to push any arguments before performing the tail call (see the sketch below). Furthermore, the type system is expressive enough to type this resetting and adjusting for any kind of tail call, not just a self tail call. As another example, some CISC-style conventions place the environment, the argument(s), and the return address on the stack, and return the result on the stack. With this convention, the factorial code would have a type of the form:

  ∀[ρ].{sp:⟨⟩::int::∀[].{sp:int::ρ}::ρ}
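For instance, with the register-based convention above, a self tail call to l_fact need only discard the current activation frame and reuse the caller's return address (a sketch; a frame of k words is assumed, with the argument already in r1 per the convention):

  sfree k         % sp : ρ, the current frame is popped
  jmp l_fact[ρ]   % ra still holds the caller's return address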

Exceptions
We now consider how to implement exceptions in STAL. We will find that a calling convention for function calls in the presence of exceptions may be derived from the heap-based CPS calling convention, just as was the case without exceptions. However, implementing this calling convention will require that the type system be made more expressive by adding compound stack types. This additional expressiveness will turn out to have uses beyond exceptions, allowing most compiler-introduced uses of pointers into the midst of the stack.

Exception Calling Conventions
In a heap-based CPS framework, exceptions are implemented by passing two continuations: the usual continuation and an exception continuation. Code raises an exception by jumping to the latter. For an integer to unit function, this calling convention is expressed as the following TAL type (ignoring the outer closure and environment):

  ∀[].{r1:int, ra:∃α1.⟨∀[].{r1:α1, r2:⟨⟩}¹, α1¹⟩, re:∃α2.⟨∀[].{r1:α2, r2:exn}¹, α2¹⟩}

Stack allocating the environments of these two continuations and unboxing their code pointers leads to the corresponding STAL type:

  ∀[ρ1, ρ2].{r1:int, sp:ρ1 • ρ2, ra:∀[].{r1:⟨⟩, sp:ρ1 • ρ2}, re′:∀[].{r1:exn, sp:ρ2}, re:ptr(ρ2)}

Here the regular continuation's code is passed in ra and its environment is the whole stack, while the exception continuation's code is passed in re′ and a pointer to its environment, a tail of the stack, is passed in re. This type uses two new constructs we now add to STAL (see Figure 5). When σ1 and σ2 are stack types, the stack type σ1 • σ2 is the result of appending the two types. Thus, in the above type, the function is presented with a stack with type ρ1 • ρ2, all of which is expected by the regular continuation, but only a tail of which (ρ2) is expected by the exception continuation. Since ρ1 and ρ2 are quantified, the function may still be used for any stack so long as the exception continuation accepts some tail of that stack.
To raise an exception, the exception is placed in r1 and control is transferred to the exception continuation. This requires cutting the actual stack down to just that expected by the exception continuation. Since the length of ρ1 is unknown, this cannot be done by sfree. Instead, a pointer to the desired position in the stack is supplied in re, and is moved into sp. The type ptr(σ) is the type of pointers into the stack at a position where the stack has type σ. Such pointers are obtained simply by moving sp into a register.
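Concretely, raising an exception amounts to two instructions (a sketch following the convention above, in which the handler's code is assumed to be passed in re′ and the exception value to be in r1 already):

  mov sp, re    % sp : ρ2, discarding the frames described by ρ1
  jmp re′       % the handler receives the exception value in r1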

Compound Stacks
The additional syntax to support exceptions is summarized in Figure 5. The new type constructors were discussed above. The word value ptr(i) is used by the operational semantics to represent pointers into the stack; the element pointed to is i words from the bottom of the stack (see Figure 7 for details). Of course, on a real machine, these would be implemented by actual pointers. The instructions mov rd, sp and mov sp, rs save and restore the stack pointer, and the instructions sld rd, rs(i) and sst rd(i), rs allow for loading from and storing through such pointers.
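A pointer into the stack can be invalidated when the stack shrinks and grows again. For example, the following sketch (assuming the stack initially has type τ::σ) ends with a load that would read nonsense if it were permitted:

  mov r1, sp       % r1 : ptr(τ::σ)
  sfree 1          % sp : σ
  salloc 1         % sp : ns::σ
  sld r2, r1(0)    % erroneous: the word r1 points to is now ns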
To prohibit erroneous loads of this sort, the type system requires that the pointer rs be valid in the instructions sld rd, rs(i), sst rd(i), rs, and mov sp, rs. An invariant of our system is that the type of sp always describes the current stack, so using a pointer into the stack will be sound if that pointer's type is consistent with sp's type. Suppose sp has type σ1 and r has type ptr(σ2); then r is valid if σ2 is a tail of σ1 (formally, if there exists some σ such that σ1 = σ • σ2). If a pointer is invalid, it may be neither loaded from nor moved into the stack pointer. In the above example, the load is rejected because the stack type r1 points to, τ::σ, is not a tail of sp's type, ns::σ.

Using Compound Stacks
Recall the type for a function in the presence of exceptions:

  ∀[ρ1, ρ2].{r1:int, sp:ρ1 • ρ2, ra:∀[].{r1:⟨⟩, sp:ρ1 • ρ2}, re′:∀[].{r1:exn, sp:ρ2}, re:ptr(ρ2)}

An exception may be raised within the body of such a function by restoring the handler's stack from re and jumping to the handler. A new exception handler may be installed by copying the stack pointer to re and making forthcoming function calls with the stack type variables instantiated to nil and ρ1 • ρ2. Calls that do not install new exception handlers would attach their frames to ρ1 and pass on ρ2 unchanged.
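For instance, a function whose own stack currently has type ρ1 • ρ2 might install a new handler and then call a function l_f as follows (a sketch; the label l_handler, the use of re′ for the handler's code, and the usual set-up of r1 and ra are all assumptions following the convention above):

  mov re′, l_handler     % handler code, of type ∀[].{r1:exn, sp:ρ1 • ρ2}
  mov re, sp             % re : ptr(ρ1 • ρ2), the handler's stack
  jmp l_f[nil, ρ1 • ρ2]  % callee's ρ1 is nil; its ρ2 is the entire current stack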
Since exceptions are probably raised infrequently, an implementation could save a register by storing the exception continuation's code pointer on the stack, instead of in its own register. If this convention were used, functions would expect stacks with the type ρ1 • (τhandler::ρ2) and exception pointers with the type ptr(τhandler::ρ2), where τhandler = ∀[].{r1:exn, sp:ρ2}.

This last convention illustrates a use for compound stacks that goes beyond implementing exceptions. We have a general tool for locating data of type τ amidst the stack by using the calling convention:

  ∀[ρ1, ρ2].{sp:ρ1 • (τ::ρ2), r1:ptr(τ::ρ2), . . .}

One application of this tool would be for implementing Pascal with displays. The primary limitation of this tool is that if more than one piece of data is stored amidst the stack, although quantification may be used to avoid specifying the precise locations of that data, function calling conventions would have to specify in what order data appears on the stack. It appears that this limitation could be removed by introducing a limited form of intersection type, but we have not yet explored the ramifications of this enhancement.
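Under the handler-on-the-stack convention just described, raising an exception would first recover the handler's code from the stack (a sketch; the exception value is assumed to be in r1 already):

  mov sp, re       % sp : τhandler::ρ2
  sld r2, sp(0)    % r2 : ∀[].{r1:exn, sp:ρ2}
  sfree 1          % sp : ρ2
  jmp r2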

Related and Future Work
Our work is partially inspired by Reynolds [26], which uses functor categories to "replace continuations by instruction sequences and store shapes by descriptions of the structure of the run-time stack." However, Reynolds was primarily concerned with using functors to express an intermediate language of a semantics-based compiler for Algol, whereas we are primarily concerned with type structure for general-purpose target languages.
Stata and Abadi [30] formalize the Java bytecode verifier's treatment of subroutines by giving a type system for a subset of the Java Virtual Machine language. In particular, their type system ensures that for any program control point, the Java stack is of the same size each time that control point is reached during execution. Consequently, procedure call must be a primitive construct (which it is in JVML). In contrast, our treatment supports polymorphic stack recursion, and hence procedure calls can be encoded with existing assembly-language primitives.
Tofte and others [8,33] have developed an allocation strategy involving regions. Regions are lexically scoped containers that have a LIFO ordering on their lifetimes, much like the values on a stack. As in our approach, polymorphic recursion on abstracted region variables plays a critical role. However, unlike the objects in our stacks, regions are variable-sized, and objects need not be allocated into the region which was most recently created. Furthermore, there is only one allocation mechanism in Tofte's system (the stack of regions) and no need for a garbage collector.
In contrast, STAL only allows allocation at the top of the stack and assumes a garbage collector for heap-allocated values. However, the type system for STAL is considerably simpler than the type system of Tofte et al., as it requires no effect information in types.
Bailey and Davidson [6] also describe a specification language for modeling procedure calling conventions and checking that implementations respect these conventions. They are able to specify features such as a variable number of arguments that our formalism does not address. However, their model is explicitly tied to a stack-based calling convention and does not address features such as exception handlers. Furthermore, their approach does not integrate the specification of calling conventions with a general-purpose type system.

Although our type system is sufficiently expressive for compilation of a number of source languages, it falls short in several areas. First, it cannot support general pointers into the stack because of the ordering requirements; nor can stack and heap pointers be unified so that a function taking a tuple argument can be passed either a heap-allocated or a stack-allocated tuple. Second, threads and advanced mechanisms for implementing first-class continuations such as the work by Hieb et al. [15] cannot be modeled in this system without adding new primitives.
However, we claim that the framework presented here is a practical approach to compilation. To substantiate this claim, we are constructing a compiler called TALC that maps the KML programming language [10] to a variant of STAL described here, suitably adapted for the Intel IA32 architecture. We have found it straightforward to enrich the target language type system to include support for other type constructors, such as references, higher-order constructors, and recursive types. The compiler uses an unboxed stack allocation style of continuation passing, as discussed in this paper.
Although we have discussed mechanisms for typing stacks at the assembly language level, our techniques generalize to other languages. The same mechanisms, including the use of polymorphic recursion to abstract the tail of a stack, can be used to introduce explicit stacks in higher level calculi. An intermediate language with explicit stacks would allow control over allocation at a point where more information is available to guide allocation decisions.

Summary
We have given a type system for a typed assembly language with both a heap and a stack. Our language is flexible enough to support the following compilation techniques: CPS using both heap allocation and stack allocation, a variety of procedure calling conventions, displays, exceptions, tail call elimination, and callee-saves registers.
A key contribution of the type system is that it makes procedure calling conventions explicit and provides a means of specifying and checking calling conventions that is grounded in language theory. The type system also makes clear the relationship between heap allocation and stack allocation of continuation closures, capturing both allocation strategies in one calculus.

A Formal STAL Semantics
This appendix contains a complete technical description of our calculus, STAL. The STAL abstract machine is very similar to the TAL abstract machine (described in detail in Morrisett et al. [23]). The syntax appears in Figure 6. The operational semantics is given as a deterministic rewriting system in Figure 7. The notation a[b/c] denotes capture-avoiding substitution of b for c in a. The notation a{b → c} represents map update.

To make the presentation simpler for the branching rules, some extra notation is used for expressing sequences of type and stack type instantiations. We introduce a new syntactic class (ψ) for type sequences:

  ψ ::= · | τ, ψ | σ, ψ

The static semantics is similar to TAL's but requires extra judgments for definitional equality of types, stack types, and register file types, and uses a more compositional style for instructions. Definitional equality is needed because two stack types (such as (int::nil) • (int::nil) and int::int::nil) may be syntactically different but represent the same type. The judgments are summarized in Figure 8, the rules for type judgments appear in Figure 9, and the rules for term judgments appear in Figures 10 and 11. The notation ∆′, ∆ denotes appending ∆′ to the front of ∆.

As with TAL, STAL is type sound:

Theorem A.1 (Type Soundness) If ⊢ P and P −→* P′, then P′ is not stuck.