Certified Self-Modifying Code

Hongxu Cai
Department of Computer Science and Technology, Tsinghua University
Beijing, 100084, China
[email protected]

Zhong Shao
Department of Computer Science, Yale University
New Haven, CT 06520, USA
[email protected]

Alexander Vaynberg
Department of Computer Science, Yale University
New Haven, CT 06520, USA
[email protected]

Abstract

Self-modifying code (SMC), in this paper, broadly refers to any program that loads, generates, or mutates code at runtime. It is widely used in many of the world's critical software systems to support runtime code generation and optimization, dynamic loading and linking, OS boot loading, just-in-time compilation, binary translation, or dynamic code encryption and obfuscation. Unfortunately, SMC is also extremely difficult to reason about: existing formal verification techniques—including Hoare logic and type system—consistently assume that program code stored in memory is fixed and immutable; this severely limits their applicability and power.

This paper presents a simple but novel Hoare-logic-like framework that supports modular verification of general von-Neumann machine code with runtime code manipulation. By dropping the assumption that code memory is fixed and immutable, we are forced to apply local reasoning and separation logic at the very beginning, and treat program code uniformly as regular data structure. We address the interaction between separation and code memory and show how to establish the frame rules for local reasoning even in the presence of SMC. Our framework is realistic, but designed to be highly generic, so that it can support assembly code under all modern CPUs (including both x86 and MIPS). Our system is expressive and fully mechanized. We prove its soundness in the Coq proof assistant and demonstrate its power by certifying a series of realistic examples and applications—all of which can directly run on the SPIM simulator or any stock x86 hardware.

    Common Basic Constructs:
      opcode modification          GCAP2     Sec 5.2
      control flow modification    GCAP2     Sec 6.2
      unbounded code rewriting     GCAP2     Sec 6.5
      runtime code checking        GCAP1     Sec 6.4
      runtime code generation      GCAP1     Sec 6.3
      multilevel RCG               GCAP1     Sec 4.1
      self-mutating code block     GCAP2     Sec 6.9
      mutual modification          GCAP2     Sec 6.7
      self-growing code            GCAP2     Sec 6.6
      polymorphic code             GCAP2     Sec 6.8
      code optimization            GCAP2/1   Sec 6.3
    Typical Applications:
      code compression             GCAP1     Sec 6.10
      code obfuscation             GCAP2     Sec 6.9
      code encryption              GCAP1     Sec 6.10
      OS boot loaders              GCAP1     Sec 6.1
      shellcode                    GCAP1     Sec 6.11

Table 1. A summary of GCAP-supported applications

1. Introduction

Self-modifying code (SMC), in this paper, broadly refers to any program that purposely loads, generates, or mutates code at runtime. It is widely used in many of the world's critical software systems. For example, runtime code generation and compilation can improve the performance of operating systems [21] and other application programs [20, 13, 31]. Dynamic code optimization can improve the performance [4, 11] or minimize the code size [7]. Dynamic code encryption [29] or obfuscation [15] can support code protection and tamper-resistant software [3]; they also make it hard for crackers to debug or decompile the protected binaries. Evolutionary computing systems can use dynamic techniques to support genetic programming [26]. SMC also arises in applications such as just-in-time compilation, dynamic loading and linking, OS boot loaders, binary translation, and virtual machine monitors.

Unfortunately, SMC is also extremely difficult to reason about: existing formal verification techniques—including Hoare logic [8, 12] and type system [27, 23]—consistently assume that program code stored in memory is immutable; this significantly limits their power and applicability.

In this paper we present a simple but powerful Hoare-logic-style framework—namely GCAP (i.e., CAP [33] on General von Neumann machines)—that supports modular verification of general machine code with runtime code manipulation. By dropping the assumption that code memory is fixed and immutable, we are forced to apply local reasoning and separation logic [14, 30] at the very beginning, and treat program code uniformly as regular data structure. Our framework is realistic, but designed to be highly generic, so that it can support assembly code under all modern CPUs (including both x86 and MIPS). Our paper makes the following new contributions:

• Our GCAP system is the first formal framework that can successfully certify any form of runtime code manipulation—including all the common basic constructs and important applications given in Table 1. We are the first to successfully certify a realistic OS boot loader that can directly boot on stock x86 hardware. All of our MIPS examples can be directly executed in the SPIM 7.3 simulator [18].

• GCAP is the first successful extension of Hoare-style program logic that treats machine instructions as regular mutable data structures. A general GCAP assertion can not only specify the usual precondition for data memory but also can ensure that code segments are correctly loaded into memory before execution. We develop the idea of parametric code blocks to specify and reason about all possible outputs of each self-modifying program. These results are general and can be easily applied to other Hoare-style verification systems.

• GCAP supports both modular verification [9] and frame rules for local reasoning [30]. Program modules can be verified separately and with minimized import interfaces. GCAP pinpoints the precise boundary between non-self-modifying code groups and those that do manipulate code at runtime. Non-self-modifying code groups can be certified without any knowledge about each other's implementation, yet they can still be safely linked together with other self-modifying code groups.

• GCAP is highly generic in the sense that it is the first attempt to support different machine architectures and instruction sets in the same framework without modifying any of its inference rules. This is done by making use of several auxiliary functions that abstract away the machine-specific semantics and by constructing generic (platform-independent) inference rules for certifying well-formed code sequences.

In the rest of this paper, we first present our von-Neumann machine GTM in Section 2. We stage the presentation of GCAP: Section 3 presents a Hoare-style program logic for GTM; Section 4 presents a simple extended GCAP1 system for certifying runtime code loading and generation; Section 5 presents GCAP2 which extends GCAP1 with general support of SMC. In Section 6, we present a large set of certified SMC applications to demonstrate the power and practicality of our framework. Our system is fully mechanized—the Coq implementation (including the full soundness proof) is available on the web [5].

(Machine)   M ::= (Extension, Instr, Ec : Instr → ByteList,
                   Next : Address → Instr → State ⇀ State,
                   Npc : Address → Instr → State → Address)
(State)     S ::= (M, E)
(Mem)       M ::= {f ↦ b}∗
(Extension) E ::= ...
(Address)   f, pc ::= ... (nat nums)
(Byte)      b ::= ... (0..255)
(ByteList)  bs ::= b,bs | b
(Instr)     ι ::= ...
(World)     W ::= (S, pc)

Figure 1. Definition of target machine GTM

If Decode(S, pc, ι) is true, then

    (S, pc) ⟼ (Next_{pc,ι}(S), Npc_{pc,ι}(S))

Figure 2. GTM program execution

2. General Target Machine GTM

In this section, we present the abstract machine model and its operational semantics, both of which are formalized inside a mechanized meta logic (Coq [32]). After that, we use an example to demonstrate the key features of a typical self-modifying program.

2.1 SMC on Stock Hardware

Before proceeding further, a few practical issues for implementing SMC on today's stock hardware need to be clarified. First, most CPUs nowadays prefetch a certain number of instructions into the cache before executing them. Therefore, instruction modification has to occur long before it becomes visible. Meanwhile, for backward compatibility, some processors detect and handle this case themselves (at the cost of performance). Another issue is that some RISC machines require the use of branch delay slots. In our system, we assume that all these details are hidden from programmers at the assembly level (which is true for the SPIM simulator).

There exist more obstacles against SMC at the OS level. Operating systems usually assign a flag to each memory page to protect data from being executed or code from being modified. But this can often be circumvented through special techniques. For example, under Microsoft Windows only stacks are allowed to support both write access and execution at the same time, which provides a way to do SMC.

To simplify the presentation, we will no longer consider these effects because they do not affect our core ideas in the rest of this paper. It is important to first sort out the key techniques for reasoning about SMC; applying our system to deal with specific hardware and OS restrictions is left as future work.

2.2 Abstract Machine

Our general machine model, namely GTM, is an abstract framework for von Neumann machines. GTM is general because it can be used to model modern computing architectures such as x86, MIPS, or PowerPC. Fig 1 shows the essential elements of GTM. An instance M of a GTM machine is modeled as a 5-tuple that determines the machine's operational semantics.

A machine state S should consist of at least a memory component M, which is a partial map from memory addresses to stored Byte values. Byte specifies the machine byte, which is the minimum unit of memory addressing. Note that because the memory component is a partial map, its domain can be any subset of the natural numbers. E represents the other additional components of a state, which may include register files and disks, etc. No explicit code heap is involved: all the code is encoded and stored in the memory and can be accessed just as regular data. Instr specifies the instruction set, with an encoding function Ec describing how instructions can be stored in memory as byte sequences. We also introduce an auxiliary Decode predicate, which is defined as follows:

    Decode((M,E), f, ι) ≜ Ec(ι) = (M[f], ..., M[f+|Ec(ι)|−1])

In other words, Decode(S, f, ι) states that under the state S, certain consecutive bytes stored starting from memory address f are precisely the encoding of instruction ι.

Program execution is modeled as a small-step transition relation between two Worlds (i.e., W ⟼ W′), where a world W is just a state plus a program counter pc that indicates the next instruction to be executed. The definition of this transition relation is formalized in Fig 2. Next and Npc are two functions that define the behavior of all available instructions. When instruction ι located at address pc is executed at state S, Next_{pc,ι}(S) is the resulting state and Npc_{pc,ι}(S) is the resulting program counter. Note that Next can be a partial function (since memory is partial), while Npc is always total.

To make a step, a certain number of bytes starting from pc are fetched and decoded into an instruction, which is then executed following the Next and Npc functions. There is no transition if Next is undefined on a given state. As expected, if there is no valid transition from a world, the execution gets stuck.

To make program execution deterministic, the following condition should be satisfied:

    ∀S, f, ι1, ι2. Decode(S, f, ι1) ∧ Decode(S, f, ι2) −→ ι1 = ι2

In other words, Ec should be prefix-free: under no circumstances should the encoding of one instruction be a prefix of the encoding of another one. Instruction encodings on real machines follow regular patterns (e.g., the actual value for each operand is extracted from certain bits). These properties are critical when involving operand-modifying instructions. Appel et al. [2, 22] gave a more specific decoding relation and an in-depth analysis.

The definitions of the Next and Npc functions should also guarantee the following property: if ((M,E), pc) ⟼ ((M′,E′), pc′) and M′′ is a memory whose domain does not overlap with those of M and M′, then ((M ∪ M′′, E), pc) ⟼ ((M′ ∪ M′′, E′), pc′). In other words, adding extra memory does not affect the execution process of the original world. This property can be further refined into the following two fundamental requirements of the Next and Npc functions:
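The fetch-decode-execute loop of Fig 2 can be made concrete with a small simulation sketch. The two-instruction machine below (an `inc` on the state extension and an absolute `jmp`) is entirely hypothetical; it only illustrates that memory is a partial map of bytes, that the encoding is prefix-free, and that a step is Decode followed by Next and Npc.

```python
# Minimal GTM sketch: memory is a partial map (dict) from addresses to
# bytes; a world is ((mem, ext), pc).  The hypothetical encoding below is
# prefix-free (0x01 vs 0x02 b), so decoding is deterministic.
def decode(mem, pc):
    """Return (instruction, size) uniquely decoded at pc, or None."""
    if mem.get(pc) == 0x01:
        return ("inc", None), 1
    if mem.get(pc) == 0x02 and pc + 1 in mem:
        return ("jmp", mem[pc + 1]), 2
    return None

def step(world):
    """One transition of Fig 2: fetch, decode, then apply Next and Npc."""
    (mem, ext), pc = world
    d = decode(mem, pc)
    if d is None:                    # no valid transition: execution is stuck
        return None
    (op, arg), size = d
    if op == "inc":                  # Next updates the state extension ...
        return ((mem, ext + 1), pc + size)
    return ((mem, ext), arg)         # ... while Npc gives the next pc

# A two-instruction loop: inc at address 0, jmp 0 at address 1.
w = (({0: 0x01, 1: 0x02, 2: 0x00}, 0), 0)
for _ in range(4):
    w = step(w)
```

After four steps the loop has executed `inc` twice and sits back at address 0, which is exactly the behavior the transition relation predicts.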

2 2007/3/31

(State)    S ::= (M, R)
(RegFile)  R ∈ Register → Value
(Register) r ::= $1 | ... | $31
(Value)    i, ⟨w⟩₁ ::= ... (int nums)
(Word)     w, ⟨i⟩₄ ::= b,b,b,b
(Instr)    ι ::= nop | li rd,i | add rd,rs,rt | addi rt,rs,i
               | move rd,rs | lw rt,i(rs) | sw rt,i(rs)
               | la rd,f | j f | jr rs | beq rs,rt,i | jal f
               | mul rd,rs,rt | bne rs,rt,i

Figure 3. MMIPS data types

(Word)         w ::= b,b
(State)        S ::= (M, R, D)
(RegFile)      R ::= {r16 ↦ w}∗ ∪ {rS ↦ w}∗
(Disk)         D ::= {l ↦ b}∗
(Word Regs)    r16 ::= rAX | rBX | rCX | rDX | rSI | rDI | rBP | rSP
(Byte Regs)    r8 ::= rAH | rAL | rBH | rBL | rCH | rCL | rDH | rDL
(Segment Regs) rS ::= rDS | rES | rSS
(Instr)        ι ::= movw w,r16 | movw r16,rS | movb b,r8
                   | jmp b | jmpl w,w | int b | ...

Figure 5. Mx86 data types

    ∀M, M0, M′, E, E′, pc, ι.
        M⊥M0 ∧ Next_{pc,ι}(M, E) = (M′, E′) −→
        M′⊥M0 ∧ Next_{pc,ι}(M⊎M0, E) = (M′⊎M0, E′)            (1)

    ∀pc, ι, M, M′, E. Npc_{pc,ι}(M, E) = Npc_{pc,ι}(M⊎M′, E)   (2)

where

    M⊥M′ ≜ dom(M) ∩ dom(M′) = ∅,
    M⊎M′ ≜ M ∪ M′  if M⊥M′.

Ec(ι) ≜ ...

if ι =              then Next_{pc,ι}(M, R) =
  jal f               (M, R{$31 ↦ pc + 4})
  nop                 (M, R)
  li rd,i / la rd,i   (M, R{rd ↦ i})
  add rd,rs,rt        (M, R{rd ↦ R(rs) + R(rt)})
  addi rt,rs,i        (M, R{rt ↦ R(rs) + i})
  move rd,rs          (M, R{rd ↦ R(rs)})
  lw rt,i(rs)         (M, R{rt ↦ ⟨M(f), ..., M(f+3)⟩₁})  if f = R(rs) + i ∈ dom(M)
  sw rt,i(rs)         (M{f, ..., f+3 ↦ ⟨R(rt)⟩₄}, R)      if f = R(rs) + i ∈ dom(M)
  mul rd,rs,rt        (M, R{rd ↦ R(rs) × R(rt)})
  otherwise           (M, R)

and, if ι =         then Npc_{pc,ι}(M, R) =
  j f                 f
  jr rs               R(rs)
  beq rs,rt,i         pc + i  when R(rs) = R(rt);  pc + 4  when R(rs) ≠ R(rt)
  jal f               f
  bne rs,rt,i         pc + i  when R(rs) ≠ R(rt);  pc + 4  when R(rs) = R(rt)
  otherwise           pc + |Ec(ι)|

Figure 4. MMIPS operational semantics

R(rAH) := R(rAX) & (255 ≪ 8)      R(rAL) := R(rAX) & 255
R(rBH) := R(rBX) & (255 ≪ 8)      R(rBL) := R(rBX) & 255
R(rCH) := R(rCX) & (255 ≪ 8)      R(rCL) := R(rCX) & 255
R(rDH) := R(rDX) & (255 ≪ 8)      R(rDL) := R(rDX) & 255

R{rAH ↦ b} := R{rAX ↦ (R(rAX) & 255 | b ≪ 8)}
R{rAL ↦ b} := R{rAX ↦ (R(rAX) & (255 ≪ 8) | b)}
R{rBH ↦ b} := R{rBX ↦ (R(rBX) & 255 | b ≪ 8)}
R{rBL ↦ b} := R{rBX ↦ (R(rBX) & (255 ≪ 8) | b)}
R{rCH ↦ b} := R{rCX ↦ (R(rCX) & 255 | b ≪ 8)}
R{rCL ↦ b} := R{rCX ↦ (R(rCX) & (255 ≪ 8) | b)}
R{rDH ↦ b} := R{rDX ↦ (R(rDX) & 255 | b ≪ 8)}
R{rDL ↦ b} := R{rDX ↦ (R(rDX) & (255 ≪ 8) | b)}

Figure 6. Mx86 8-bit register use

Ec(ι) ≜ ...

if ι =                 then Next_{pc,ι}(M, R, D) =
  movw w,r16             (M, R{r16 ↦ w}, D)
  movw r16₁,r16₂         (M, R{r16₂ ↦ R(r16₁)}, D)
  movw r16,rS            (M, R{rS ↦ R(r16)}, D)
  movw rS,r16            (M, R{r16 ↦ R(rS)}, D)
  movb b,r8              (M, R{r8 ↦ b}, D)
  movb r8₁,r8₂           (M, R{r8₂ ↦ R(r8₁)}, D)
  jmp b                  (M, R, D)
  jmpl w1,w2             (M, R, D)
  int b                  BIOS_Call_b(M, R, D)
  ...

and if ι =             then Npc_{pc,ι}(M, R) =
  jmp b                  pc + 2 + b
  jmpl w1,w2             w1 ∗ 16 + w2
  non-jmp instructions   pc + |Ec(ι)|
  ...

Figure 7. Mx86 operational semantics

2.3 Specialization

By specializing every component of the machine M according to different architectures, we obtain different machine instances.

MIPS specialization. The MIPS machine MMIPS is built as an instance of the GTM framework (Fig 3). In MMIPS, the machine state consists of an (M,R) pair, where R is a register file, defined as a map from each of the 31 registers to a stored value. $0 is not included in the register set since it always stores constant zero and is immutable according to the MIPS convention. A machine Word is a composition of four Bytes. To achieve interaction between registers and memory, two operators — ⟨·⟩₁ and ⟨·⟩₄ — are defined (details omitted here) to do type conversion between Word and Value.

The set of instructions Instr is minimal and contains only the basic MIPS instructions, but extensions can easily be made. MMIPS supports direct jump, indirect jump, and jump-and-link (jal) instructions. It also provides relative addressing for branch instructions (e.g., beq rs,rt,i), but for clarity we will continue using code labels to represent the branching targets in our examples. The Ec function follows the official MIPS documentation and is omitted; interested readers can consult our Coq implementation. Fig 4 gives the definitions of Next and Npc. It is easy to see that these functions indeed satisfy the requirements we mentioned earlier.

x86 (16-bit) specialization. In Fig 5, we show our x86 machine, Mx86, as an instance of GTM. The specification of Mx86 is a restriction of the real x86 architecture. It is limited to 16-bit real mode, and has only a small subset of instructions, including indirect and conditional jumps. We must note, however, that this is not due
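Requirement (1) above can be sanity-checked on a toy Next function. The sketch below models a simplified single-byte store (an assumption for illustration, not the paper's full sw semantics): Next is partial, defined only when the target address is already in dom(M), and joining in extra disjoint memory leaves that memory untouched.

```python
# Check the memory-extension property (requirement (1)) on a toy store
# instruction: Next writes one byte at an address already in dom(M);
# otherwise it is undefined (returns None).
def next_store(mem, addr, byte):
    if addr not in mem:          # partial: writing outside dom(M) is undefined
        return None
    m2 = dict(mem)
    m2[addr] = byte
    return m2

def join(m0, m1):
    """Disjoint union of two partial memories (the ⊎ operator)."""
    assert not set(m0) & set(m1), "domains must be disjoint"
    return {**m0, **m1}

m  = {100: 0xAA}                              # original memory M
m1 = {200: 0xBB}                              # extra disjoint memory M''
r     = next_store(m, 100, 0x07)              # run on M alone
r_big = next_store(join(m, m1), 100, 0x07)    # run on M ⊎ M''
# Requirement (1): the big run equals the small run joined with M''.
assert r_big == join(r, m1)
```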

to inability to add such instructions, but simply due to the lack of time that would be needed to be both extensive and correct in our definitions. But the subset is not so trivial as to be useless; in fact, it is adequate for certification of interesting examples such as OS boot loaders.

In order to certify a boot loader, we augment the Mx86 state to include a disk. Everything else that is machine specific has no effect on the GTM specification. Some of the features of Mx86 include the 8-bit registers, which are dependent on the 16-bit registers; this fact is responsible for Fig 6, which shows how the 8-bit registers are extracted out of the 16-bit ones. The memory of Mx86 is segmented, a fact which is mostly invisible, except in the long jump instruction (Fig 7).

We define the BIOS call as a primitive operation in the semantics, treating it as a black box with proper formal specifications (Fig 8). In this paper, we only define a BIOS command for disk read, as it is needed for our boot loader. Since the rules of the BIOS semantics are large, it is unwieldy to present them in the usual mathematical notation, and instead a more descriptive form is used. The precise definition can be extracted if one takes the parameters to be let statements, the condition and the command number to be a conjunctive predicate over the initial state, and the effect to be the ending state in the form (M′,R′,D′). Since we did not want to present a complex definition of the disk, we assume our disk has only one cylinder, one side, and 63 sectors. The BIOS disk read command uses that assumption.

Call 0x13 (disk operations)        (id = 0x13)
Command 0x02 (disk read)           (R(rAH) = 0x02)
Parameters:
    Count   = R(rAL)      Cylinder = R(rCH)
    Sector  = R(rCL)      Head     = R(rDH)
    Disk Id = R(rDL)      Bytes    = Count ∗ 512
    Src     = (Sector − 1) ∗ 512
    Dest    = R(rES) ∗ 16 + R(rBX)
Conditions:
    Cylinder = 0          Head = 0
    Disk Id  = 0x80       Sector < 63
Effect:
    M′ = M{Dest + i ↦ D(Src + i)}  (0 ≤ i ≤ Bytes)
    R′ = R{rAH ↦ 0}      D′ = D

Figure 8. Subset of Mx86 BIOS operations

2.4 A Taste of SMC

We give a sample piece of self-modifying code (i.e., opcode.s) in Fig 9. The example is written in MMIPS syntax. We use line numbers to indicate the location of each instruction in memory. It seems that this program will copy the value of register $4 to register $2 and then call halt. But it could jump to the modify subroutine first, which will overwrite the target code with the new instruction addi $2,$2,1. So the actual result of this program can vary: if R($2) ≠ R($4), the program copies the value of $4 to $2; otherwise, the program simply adds 1 to $2.

Even such a simple program cannot be handled by any existing verification framework, since none of them allow code to be mutated at any time. General SMCs are even more challenging: they are difficult to understand and reason about because the actual program itself is changing during the execution—it is difficult to figure out the program's control and data flow.

        .data                        # Data declaration section
100 new:    addi $2, $2, 1           # the new instr

        .text                        # Code section
200 main:   beq $2, $4, modify       # do modification
204 target: move $2, $4              # original instr
208         j halt                   # exit
212 halt:   j halt
216 modify: lw $9, new               # load new instr
224         sw $9, target            # store to target
232         j target                 # return

Figure 9. opcode.s: Opcode modification

3. Hoare-Style Reasoning under GTM

Hoare-style reasoning has always been done over programs with separate code memory. In this section we want to eliminate this restriction. To reason about GTM programs, we formalize the syntax and the operational semantics of GTM inside a mechanized meta logic. For this paper we use the calculus of inductive constructions (CiC) [32] as our meta logic. Our implementation is done using Coq [32], but all our results also apply to other proof assistants. We use the following syntax to denote terms and predicates in the meta logic:

(Term) A, B ::= Set | Prop | Type | x | λx:A. B | A B | A → B | ind. def. | ...
(Prop) p, q ::= True | False | ¬p | p ∧ q | p ∨ q | p → q | ∀x:A. p | ∃x:A. p | ...

3.1 From Invariants to Hoare Logic

The program safety predicate can be defined as follows:

    Safe_n(W) ≜ True                                  if n = 0
                ∃W′. W ⟼ W′ ∧ Safe_{n−1}(W′)          if n > 0

    Safe(W) ≜ ∀n:Nat. Safe_n(W)

Safe_n states that the machine is safe to execute n steps from a world, while Safe describes that the world is safe to run forever.

The invariant-based method [17] is a common technique for proving safety properties of programs.

Definition 3.1 An invariant is a predicate, namely Inv, defined over machine worlds, such that the following holds:
• ∀W. Inv(W) −→ ∃W′. (W ⟼ W′)              (Progress)
• ∀W. Inv(W) ∧ (W ⟼ W′) −→ Inv(W′)          (Preservation)

The existence of an invariant immediately implies program safety, as shown by the following theorem.

Theorem 3.2 If Inv is an invariant, then ∀W. Inv(W) → Safe(W).

The invariant-based method can be directly used to prove the example we just presented in Fig 9: one can construct a global invariant for the code, much like the tedious one in Fig 10, which satisfies Definition 3.1.

Invariant-based verification is powerful, especially when carried out with a rich meta logic. But a global invariant for a program, usually troublesome and complicated, is often unrealistic to directly find and write down. A global invariant needs to specify the condition on every memory address in the code. Even worse, this prevents separate verification (of different components) and local reasoning: every program module has to be annotated with the same global invariant, which requires knowledge of the entire code base.
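The two possible behaviors of opcode.s can be replayed with a small interpreter over the handful of MMIPS instructions it uses. This is a sketch only: instructions are kept symbolic rather than byte-encoded, and the 8-byte spacing of the lw/sw entries simply mirrors the addresses in the Fig 9 listing.

```python
# Replay opcode.s (Fig 9) symbolically: memory maps addresses to
# instruction tuples; the .data word at 100 is itself an instruction
# value, read by lw and written over `target` by sw.
def run(r2, r4, steps=12):
    mem = {100: ("addi", 2, 2, 1),    # new:    the replacement instruction
           200: ("beq", 2, 4, 216),   # main:   to modify when $2 = $4
           204: ("move", 2, 4),       # target: original instruction
           208: ("j", 212),           # exit to halt
           212: ("j", 212),           # halt:   spin forever
           216: ("lw", 9, 100),       # load the new instruction as data
           224: ("sw", 9, 204),       # store it over target
           232: ("j", 204)}           # jump back to target
    reg, pc = {2: r2, 4: r4, 9: None}, 200
    for _ in range(steps):
        if pc == 212:                 # halt reached
            break
        ins = mem[pc]
        if ins[0] == "beq":
            pc = ins[3] if reg[ins[1]] == reg[ins[2]] else pc + 4
        elif ins[0] == "j":
            pc = ins[1]
        elif ins[0] == "move":
            reg[ins[1]] = reg[ins[2]]; pc += 4
        elif ins[0] == "addi":
            reg[ins[1]] = reg[ins[2]] + ins[3]; pc += 4
        elif ins[0] == "lw":
            reg[ins[1]] = mem[ins[2]]; pc += 8   # matches the listing's spacing
        elif ins[0] == "sw":
            mem[ins[2]] = reg[ins[1]]; pc += 8   # the self-modifying store
    return reg[2]
```

Running `run(0, 5)` takes the non-modifying path and copies $4 into $2, while `run(5, 5)` rewrites `target` first and then increments $2, exactly the two outcomes described above.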

(CodeBlock)  B ::= f: I
(InstrSeq)   I ::= ι;I | ι
(CodeHeap)   C ::= {f ↦ I}∗
(Assertion)  a ∈ State → Prop
(ProgSpec)   Ψ ∈ Address ⇀ Assertion

Figure 12. Auxiliary constructs and specifications

Inv(M,R,pc) ≜
    Decode(M,100,ι100) ∧ Decode(M,200,ι200) ∧ Decode(M,208,ι208) ∧
    Decode(M,212,ι212) ∧ Decode(M,216,ι216) ∧ Decode(M,224,ι224) ∧
    Decode(M,232,ι232) ∧
    (   (pc=200 ∧ Decode(M,204,ι204)) ∨ (pc=204 ∧ Decode(M,204,ι204))
      ∨ (pc=204 ∧ Decode(M,204,ι100) ∧ R($2) = R($4))
      ∨ (pc=208 ∧ R($4) ≤ R($2) ≤ R($4) + 1)
      ∨ (pc=212 ∧ R($4) ≤ R($2) ≤ R($4) + 1)
      ∨ (pc=216 ∧ R($2) = R($4))
      ∨ (pc=224 ∧ R($2) = R($4) ∧ ⟨R($9)⟩₄ = Ec(ι100))
      ∨ (pc=232 ∧ R($2) = R($4) ∧ Decode(M,204,ι100)))

Figure 10. A global invariant for the opcode example

a ⇒ a′ ≜ ∀S. (a S → a′ S)          a ⇔ a′ ≜ ∀S. (a S ↔ a′ S)
¬a ≜ λS. ¬(a S)                    a ∧ a′ ≜ λS. (a S) ∧ (a′ S)
a ∨ a′ ≜ λS. (a S) ∨ (a′ S)        a → a′ ≜ λS. (a S) → (a′ S)
∀x. a ≜ λS. ∀x. (a S)              ∃x. a ≜ λS. ∃x. (a S)
a ∗ a′ ≜ λ(M0, R). ∃M, M′. M0 = M⊎M′ ∧ a (M, R) ∧ a′ (M′, R)

where M⊎M′ ≜ M ∪ M′ when M⊥M′, and M⊥M′ ≜ dom(M) ∩ dom(M′) = ∅

Figure 13. Assertion operators
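The distinction drawn in Fig 13 between a → a′ (an assertion) and a ⇒ a′ (a proposition) can be made concrete in a shallow-embedding sketch: the lifted operators return new predicates, while implication between assertions collapses to a single truth value. The state space and sample assertions below are illustrative assumptions only.

```python
# Shallow embedding sketch of Fig 13: assertions are predicates over
# states (mem, regs).  "a -> a'" is again an assertion (a predicate),
# while "a => a'" is a proposition (here: a bool over a finite state space).
def a_imp(a, b):            # assertion-level ->: still a predicate
    return lambda s: (not a(s)) or b(s)

def entails(a, b, states):  # specification-level =>: a single truth value
    return all((not a(s)) or b(s) for s in states)

# a tiny enumerated state space (hypothetical memories and register files)
states = [(m, r) for m in ({}, {100: 1}, {200: 2})
                 for r in ({"$2": 0}, {"$2": 7})]
nonempty = lambda s: len(s[0]) > 0          # some address is allocated
holds1   = lambda s: s[0].get(100) == 1     # address 100 holds 1

assert entails(holds1, nonempty, states)        # holds1 => nonempty
assert not entails(nonempty, holds1, states)    # but not the converse
assert callable(a_imp(nonempty, holds1))        # a -> a' is itself an assertion
```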

[Figure 11 is a diagram: a code heap on the left is partitioned into code blocks f1, f2, f3 on the right, each annotated with its precondition {p1}, {p2}, {p3}.]

Figure 11. Hoare-style reasoning of assembly code

Traditional Hoare-style reasoning over assembly programs (e.g., CAP [33]) is illustrated in Fig 11. Program code is assumed to be stored in a static code heap separated from the main memory. A code heap can be divided into different code blocks (i.e., consecutive instruction sequences) which serve as basic certifying units. A precondition is assigned to every code block; no postcondition shows up, since we use CPS (continuation-passing style) to reason about low-level programs. Different blocks can be independently verified and then linked together to form a global invariant, completing the verification.

Here we present a Hoare-logic-based system GCAP0 for GTM. Compared to the limited usage of existing systems such as TAL or CAP, a system working on GTM has several advantages.

First, the instruction set and operational semantics come as an integrated part of a TAL or CAP system, and thus are usually fixed and limited. Although extensions can be made, they cost a fair amount of effort. On the other hand, GTM abstracts both components out, so a system that directly works on it is robust and suitable for different architectures.

Also, GTM is more realistic, since programs are encoded and stored in memory just as on a real computing platform. Besides regular assembly code, a GTM-based verification system is able to manage real binary code.

Our generalization is not trivial. Firstly, unifying different types of instructions (especially regular commands and control transfer instructions) without loss of usability requires an intrinsic understanding of the relation between instructions and program specifications. Secondly, code is easily guaranteed to be immutable in an abstract machine that separates the code heap out as an individual component, which GTM does not. Surprisingly, the same immutability can be enforced in the inference rules using a simple separating conjunction borrowed from separation logic [14, 30].

3.2 Specification Language

Our specification language is defined in Fig 12. A code block B is a syntactic unit that represents a sequence I of instructions, beginning at a specific memory address f. Note that in CAP, we usually insist that jump instructions can only appear at the end of a code block. This is no longer required in our new system, so the division of code blocks is much more flexible.

The code heap C is a collection of code blocks that do not overlap, represented by a finite mapping from addresses to instruction sequences. Thus a code block can also be understood as a singleton code heap. To support Hoare-style reasoning, assertions are defined as predicates over GTM machine states (i.e., via "shallow embedding"). A program specification Ψ is a partial function which maps a memory address to its corresponding assertion, with the intention to represent the precondition of each code block. Thus, Ψ only has entries at each location that indicates the beginning of a code block.

Fig 13 defines an implication relation (⇒) and an equivalence relation (⇔) between two assertions, and lifts all the standard logical operators to the assertion level. Note the difference between a → a′ and a ⇒ a′: the former is an assertion, while the latter is a proposition! We also define the standard separation logic primitives [14, 30] as assertion operators. The separating conjunction (∗) of two assertions holds if they can be satisfied on two separate memory areas (the register file can be shared). Separating implication, empty heap, and singleton heap can also be defined directly in our meta logic. Solid work has been established on separation logic, which we use heavily in our proofs.

Lemma 3.3 (Properties of separation logic) Let a, a′ and a′′ be assertions. Then we have:
1. a ∗ a′ ⇔ a′ ∗ a.
2. (a ∗ a′) ∗ a′′ ⇔ a ∗ (a′ ∗ a′′).
3. Given a type T, for any assertion function P ∈ T → Assertion and assertion a, (∃x ∈ T. P x) ∗ a ⇔ ∃x ∈ T. (P x ∗ a).
4. Given a type T, for any assertion function P ∈ T → Assertion and assertion a, (∀x ∈ T. P x) ∗ a ⇔ ∀x ∈ T. (P x ∗ a).

We omit the proofs of these properties since they are standard laws of separation logic [14, 30].
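Laws 1 and 2 of Lemma 3.3 can be checked by brute force over small finite memories, using the split-enumeration semantics of ∗ from Fig 13. This sketch omits registers (they are shared anyway); the sample assertions are arbitrary choices for the check.

```python
# Brute-force check of Lemma 3.3 (1) and (2): separating conjunction is
# commutative and associative.  An assertion is a predicate over a memory
# (a dict); sep(a, b) enumerates every split of the finite domain.
from itertools import chain, combinations

def sep(a, b):
    def holds(mem):
        doms = chain.from_iterable(
            combinations(mem, k) for k in range(len(mem) + 1))
        return any(a({k: mem[k] for k in d}) and
                   b({k: v for k, v in mem.items() if k not in d})
                   for d in doms)
    return holds

# a few sample assertions over addresses {0, 1, 2}
emp      = lambda m: m == {}
has0     = lambda m: m == {0: 10}
dom_size = lambda n: (lambda m: len(m) == n)

mems = [{}, {0: 10}, {0: 10, 1: 11}, {0: 10, 1: 11, 2: 12}]
for a in (emp, has0, dom_size(1)):
    for b in (emp, has0, dom_size(2)):
        for c in (emp, dom_size(1)):
            for m in mems:
                assert sep(a, b)(m) == sep(b, a)(m)                  # law 1
                assert sep(sep(a, b), c)(m) == sep(a, sep(b, c))(m)  # law 2
```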

blk(f: I) ≜ λS. Decode(S, f, ι)                            if I = ι
blk(f: I) ≜ λS. Decode(S, f, ι) ∧ (blk(f+|Ec(ι)|: I′) S)    if I = ι;I′

blk(C) ≜ ∀f ∈ dom(C). blk(f: C(f))

Ψ ⇒ Ψ′ ≜ ∀f ∈ dom(Ψ). (Ψ(f) ⇒ Ψ′(f))

(Ψ1 ∪ Ψ2)(f) ≜  Ψ1(f)           if f ∈ dom(Ψ1) \ dom(Ψ2)
                Ψ2(f)           if f ∈ dom(Ψ2) \ dom(Ψ1)
                Ψ1(f) ∨ Ψ2(f)   if f ∈ dom(Ψ1) ∩ dom(Ψ2)

Figure 14. Predefined functions

Ψ ⊢ W   (Well-formed World)

    Ψ ⊢ C : Ψ    (a ∗ (blk(C) ∧ blk(pc: I))) S    Ψ ⊢ {a} pc: I
    ------------------------------------------------------------ (world)
                          Ψ ⊢ (S, pc)

Ψ ⊢ C : Ψ′   (Well-formed Code Heap)

    Ψ1 ⊢ C1 : Ψ′1    Ψ2 ⊢ C2 : Ψ′2    dom(C1) ∩ dom(C2) = ∅
    --------------------------------------------------------- (link)
              Ψ1 ∪ Ψ2 ⊢ C1 ∪ C2 : Ψ′1 ∪ Ψ′2

    Ψ ⊢ {a} f: I
    ----------------------- (cdhp)
    Ψ ⊢ {f ↦ I} : {f ↦ a}

Ψ ⊢ {a} B   (Well-formed Code Block)

    Ψ ⊢ {a′} (f+|Ec(ι)|): I    Ψ ∪ {f+|Ec(ι)| ↦ a′} ⊢ {a} f: ι
    ----------------------------------------------------------- (seq)
                        Ψ ⊢ {a} f: ι;I

    ∀S. a S → Ψ(Npc_{f,ι}(S)) (Next_{f,ι}(S))
    ------------------------------------------ (instr)
                  Ψ ⊢ {a} f: ι

Figure 15. Inference rules for GCAP0

Fig 14 defines a few important macros: blk(B) holds if B is stored properly in the memory of the current state; blk(C) holds if all code blocks in the code heap C are properly stored. We also define two operators between program specifications. We say that one program specification is stricter than another, namely Ψ ⇒ Ψ′, if every assertion in Ψ implies the corresponding assertion at the same address in Ψ′. The union of two program specifications is just the disjunction of the two corresponding assertions at each address. Clearly, any two program specifications are stricter than their union specification.

3.3 Inference Rules

Fig 15 presents the inference rules of GCAP0. We give three sets of judgments (from local to global): well-formed code block, well-formed code heap, and well-formed world.

Intuitively, a code block is well-formed (Ψ ⊢ {a} B) iff, starting from a state satisfying its precondition a, the code block is safe to execute until it jumps to a location in a state satisfying the specification Ψ. The well-formedness of a single instruction (rule (instr)) directly follows this understanding. Inductively, to validate the well-formedness of a code block beginning with ι under precondition a (rule (seq)), we should find an intermediate assertion a′ serving simultaneously as the precondition of the tail code sequence and the postcondition of ι. In the second premise of (seq), since our syntax does not have a postcondition, a′ is directly fed into the accompanying specification.

Note that for a well-formed code block, even though we have added an extra entry to the program specification Ψ when we validate each individual instruction, the Ψ used for validating each code block and the tail code sequence remains the same.

We can instantiate the (seq) and (instr) rules on each instruction if necessary. For example, specializing (instr) over the direct jump (j f′) results in the following rule:

    a ⇒ Ψ(f′)
    ------------------
    Ψ ⊢ {a} f: j f′

Specializing (instr) over the add instruction makes

    ∀(M,R). a (M,R) → a′ (M, R{rd ↦ R(rs) + R(rt)})
    ------------------------------------------------
    Ψ ∪ {f+4 ↦ a′} ⊢ {a} f: add rd,rs,rt

which via (seq) can be further reduced to

    Ψ ⊢ {a′} f+4: I    ∀(M,R). a (M,R) → a′ (M, R{rd ↦ R(rs) + R(rt)})
    -------------------------------------------------------------------
    Ψ ⊢ {a} f: add rd,rs,rt; I

Another interesting case is the conditional jump instructions, such as beq, which can be instantiated from rule (seq) as

    Ψ ⊢ {a′} (f+4): I
    ∀(M,R). a (M,R) → ((R(rs) = R(rt) → Ψ(f+i) (M,R)) ∧ (R(rs) ≠ R(rt) → a′ (M,R)))
    ---------------------------------------------------------------------------------
    Ψ ⊢ {a} f: beq rs,rt,i; I

The instantiated rules are straightforward to understand and convenient to use. Most importantly, they can be automatically generated directly from the operational semantics of GTM.

The well-formedness of a code heap (Ψ ⊢ C : Ψ′) states that, given Ψ′ specifying the preconditions of each code block of C, all the code in C can be safely executed with respect to specification Ψ. Here the domain of C and Ψ′ should always be the same. The (cdhp) rule casts a code block into a corresponding well-formed singleton code heap, and the (link) rule merges two disjoint well-formed code heaps into a larger one.

A world is well-formed with respect to a global specification Ψ (the (world) rule), if:
• the entire code heap is well-formed with respect to Ψ;
• the code heap and the current code block are properly stored;
• a precondition a is satisfied, separately from the code section;
• the current instruction sequence is well-formed under precondition a.

The (world) rule also shows how we use separating conjunction to ensure that the whole code heap is indeed in the memory and always immutable: because assertion a cannot refer to the memory region occupied by C, and the memory domain never grows during the execution of a program, the whole reasoning process below the top level never involves the code heap region. This guarantees that no code modification can happen during the program execution.

To verify the safety and correctness of a program, one needs to first establish the well-formedness of each code block. All the code blocks are linked together progressively, resulting in a well-formed global code heap where the two accompanying specifications must match. Finally, the (world) rule is used to prove the safety of the initial world of the program.

Soundness and frame rules. The soundness of GCAP0 guarantees that any well-formed world is safe to execute. Establishing a well-formed world is equivalent to an invariant-based proof of program correctness: the accompanying specification Ψ corresponds to a global invariant that the current world satisfies. This is established through a series of weakening lemmas, then progress and preservation lemmas; frame rules are also easy to prove.

Lemma 3.4 Ψ ⊢ C : Ψ′ if and only if for every f ∈ dom(Ψ′) we have f ∈ dom(C) and Ψ ⊢ {Ψ′(f)} f: C(f).

Proof Sketch: For the necessity, prove by inversion of the
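Read operationally, the instantiated (add) rule is a verification-condition generator: the premise says a implies the weakest precondition of a′ under the register update, and (seq) threads the intermediate assertion a′ through a straight-line block. The sketch below replays that reduction over a toy register state; the function names and the sample block are my own, not the paper's.

```python
# VC-generation sketch for the instantiated rules: for "add rd,rs,rt"
# the premise is  forall (M,R). a (M,R) -> a' (M,R{rd -> R(rs)+R(rt)}),
# i.e. a implies the weakest precondition of a' under the update.
def wp_add(rd, rs, rt, post):
    def pre(regs):
        regs2 = dict(regs)
        regs2[rd] = regs[rs] + regs[rt]   # effect of add rd,rs,rt
        return post(regs2)
    return pre

# block:  add $3,$1,$2 ; add $4,$3,$3
# postcondition:  R($4) = 2 * (R($1) + R($2))
post = lambda r: r[4] == 2 * (r[1] + r[2])
a1 = wp_add(4, 3, 3, post)   # intermediate assertion a' fed to (seq)
a0 = wp_add(3, 1, 2, a1)     # precondition of the whole block

# a0 is valid for every initial register file: check a few samples
assert a0({1: 5, 2: 7, 3: 0, 4: 0})
assert a0({1: -1, 2: 1, 3: 9, 4: 9})
```

Because the weakest precondition is computed backwards, the residual check `a ⇒ a0` is exactly the side condition the instantiated rule asks for, and here a0 happens to be universally valid.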

6 2007/3/31  rule, the  rule, and the - rule. For the suffi- Ψ ⊢W (Well-formed World) ciency, just repetitively apply the - rule.  Ψ ⊢(C,Ψ) C′ ⊆ C ′ ′ Lemma 3.5 contains the standard weakening rules and can be (a ∗ (blk(C ) ∧ blk(pc: I))) S Ψ ⊢{a}pc: I ∀f ∈ dom(Ψ′).(Ψ′(f) ∗ (blk(C′) ∧ blk(pc: I)) ⇒ Ψ(f)) proved by induction over the well-formed-code-block rules and by (-) using Lemma 3.4. Ψ ⊢(S,pc)
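The invariant-based reading of well-formed worlds can be made concrete on a toy machine. The following Python sketch is purely illustrative (the machine, its instruction encoding, and the chosen invariant are our own inventions, not the paper's GTM): it checks a candidate invariant for exactly the progress and preservation properties that the lemmas below establish for the specification-induced invariant.

```python
# Toy von Neumann machine: memory maps addresses to values; code and
# data share the same memory.  Illustrative sketch, not the paper's GTM.

def step(world):
    """One machine step: returns the next world, or None if stuck."""
    mem, pc = world
    ins = mem.get(pc)
    if ins is None:
        return None                      # nothing decodes here: stuck
    op, arg = ins
    mem = dict(mem)
    if op == "jmp":
        return (mem, arg)
    if op == "inc":                      # increment a data cell
        mem[arg] = mem.get(arg, 0) + 1
        return (mem, pc + 1)
    return None

def check_invariant(inv, worlds):
    """Progress + preservation of `inv` over a finite sample of worlds."""
    for w in worlds:
        if inv(w):
            nxt = step(w)
            assert nxt is not None, "progress fails"
            assert inv(nxt), "preservation fails"
    return True

# A two-instruction loop: 0: inc 100;  1: jmp 0
prog = {0: ("inc", 100), 1: ("jmp", 0)}
# invariant: pc stays inside the loop and the code is never modified
inv = lambda w: w[1] in (0, 1) and all(w[0].get(a) == prog[a] for a in prog)

w0 = (dict(prog), 0)
worlds = [w0]
for _ in range(10):                      # sample the reachable worlds
    worlds.append(step(worlds[-1]))
assert check_invariant(inv, worlds)
```

Here the assertion that the stored program is unmodified plays the role of the blk(C) conjunct in the top-level precondition.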

Lemma 3.5 contains the standard weakening rules and can be proved by induction over the well-formed-code-block rules and by using Lemma 3.4.

Lemma 3.5 (Weakening Properties)

    Ψ ⊢ {a} B     Ψ ⇒ Ψ′     a′ ⇒ a
    ---------------------------------
    Ψ′ ⊢ {a′} B

    Ψ1 ⊢ C : Ψ2     Ψ′1 ⇒ Ψ1     Ψ2 ⇒ Ψ′2
    ---------------------------------------
    Ψ′1 ⊢ C : Ψ′2

Proof:
1. By induction over the SEQ rule and the INSTR rule.
2. Easy to see from 1 and Lemma 3.4. □

Lemma 3.6 (Progress) If Ψ ⊢ W, then there exists a world W′ such that W 7−→ W′.

Proof: By inversion of the PROG rule we know that there is a code sequence I indeed in memory, starting from the current pc. Suppose the first instruction of I is ι; then from the rules for well-formed code blocks, we learn that there exists a Ψ′ such that Ψ′ ⊢ {a} ι. Since a is satisfied in the current state, inversion of the INSTR rule guarantees that our world W is safe to be executed further for at least one step. □

Lemma 3.7 (Preservation) If Ψ ⊢ W and W 7−→ W′, then Ψ ⊢ W′.

Proof Sketch: Suppose the premises of the PROG rule hold. Again, a code block pc : I is present in the current memory. Given the W′ = (M′,E′,pc′) such that W 7−→ W′, we only need to find an a′ and I′ satisfying (a′ ∗ (blk(C) ∧ blk(pc′ : I′))) (M′,E′) and Ψ ⊢ {a′} pc′ : I′. There are two cases:
1. I = ι is a single-instruction sequence. It is easy to show that pc′ ∈ dom(Ψ) and pc′ ∈ dom(C). We choose a′ = Ψ(pc′) and I′ = C(pc′) to satisfy the two conditions.
2. I = ι;I′ is a multi-instruction sequence. Then by inversion of the SEQ rule, either we reduce to the previous case, or pc′ = pc+|Ec(ι)| and there is an a′ such that Ψ ⊢ {a′} pc′ : I′. Thus this a′ and I′ are what we choose to satisfy the two conditions. □

Theorem 3.8 (Soundness of GCAP0) If Ψ ⊢ W, then Safe(W).

Proof: Define the predicate Inv ≜ λW.(Ψ ⊢ W). Then by Lemma 3.6 and Lemma 3.7, Inv is an invariant of GTM. Together with Theorem 3.2, the conclusion holds. □

The following lemma (a.k.a. the frame rule) captures the essence of local reasoning in separation logic:

Lemma 3.9 (FRAME)

    Ψ ⊢ {a} B
    ------------------------------
    (λf.Ψ(f) ∗ a′) ⊢ {a ∗ a′} B

where a′ is independent of every register modified by B; and, at the code-heap level,

    Ψ ⊢ C : Ψ′
    ----------------------------------
    (λf.Ψ(f) ∗ a) ⊢ C : λf.Ψ′(f) ∗ a

where a is independent of every register modified by C.

Proof: For the first rule, we need the following important fact about our GTM specializations: if M0 ⊥ M1 and

    Next_{pc,ι}(M0, E) = (M′0, E′)

then M′0 ⊥ M1, and

    Next_{pc,ι}(M0 ⊎ M1, E) = (M′0 ⊎ M1, E′)

With this, it is easy to prove the rule by induction over the two well-formed-code-block rules. The second rule is then straightforward. □

Note that the correctness of this rule relies on the condition we gave in Sec 2 (incorporating extra memory does not affect the program execution), as also pointed out by Reynolds [30].

With the frame rule, one can extend a locally certified code block with an extra assertion, given the requirement that this assertion holds separately in conjunction with the original assertion as well as the specification. Frame rules at different levels will be used as the main tool to divide code and data to solve the SMC issue later. All the derived rules and the soundness proof have been fully mechanized in Coq [5] and will be used freely in our examples.

4. Certifying Runtime Code Generation

GCAP1 is a simple extension of GCAP0 to support runtime code generation. In the top PROG rule of GCAP0, the precondition a for the current code block must not specify memory regions occupied by the code heap, and all the code must be stored in memory and remain immutable during the whole execution process. In the case of runtime code generation, this requirement has to be relaxed, since the entire code may not be in memory at the very beginning: some can be generated dynamically!

Inference rules. GCAP1 borrows the same definitions of well-formed code heaps and well-formed code blocks as GCAP0: they use the same set of inference rules (see Fig 15). To support runtime code generation, we change the top rule and insert an extra layer of judgments, called well-formed code specification (see Fig 16), between well-formed world and well-formed code heap.

Ψ ⊢ W (Well-formed World):

    Ψ ⊢ (C,Ψ)   C′ ⊆ C   (a ∗ (blk(C′) ∧ blk(pc: I))) S
    Ψ′ ⊢ {a} pc: I   ∀f ∈ dom(Ψ′). (Ψ′(f) ∗ (blk(C′) ∧ blk(pc: I)) ⇒ Ψ(f))
    ---------------------------------------------------------------------- (PROG)
    Ψ ⊢ (S, pc)

Ψ ⊢ (C,Ψ′) (Well-formed Code Specification):

    Ψ1 ⊢ (C1,Ψ′1)   Ψ2 ⊢ (C2,Ψ′2)   dom(C1) ∩ dom(C2) = ∅
    ------------------------------------------------------- (LINK)
    Ψ1 ∪ Ψ2 ⊢ (C1 ∪ C2, Ψ′1 ∪ Ψ′2)

    Ψ ⊢ C : Ψ′
    ------------------------------------- (LIFT)
    Ψ ∗ blk(C) ⊢ (C, Ψ′ ∗ blk(C))

Figure 16. Inference rules for GCAP1

If "well-formed code heap" is a static reasoning layer, "well-formed code specification" is more like a dynamic one. Inside a well-formed code heap, no information about program code is included, since it is implicitly guaranteed by the code-immutability property. For a well-formed code specification, on the other hand, all information about the required program code should be provided in the preconditions of all code blocks.

We use the LIFT rule to transform a well-formed code heap into a well-formed code specification by attaching the whole code information to the specifications on both sides. The LINK rule has the same form as its GCAP0 counterpart, except that it works on the dynamic layer.
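The GTM memory-extension fact used in the proof of Lemma 3.9 (executing in a larger, disjointly extended memory gives the same result on the owned part and leaves the frame untouched) can be spot-checked on a toy step function. This is a hedged sketch: the instruction encoding and `step` below are made up for illustration and are not the paper's GTM semantics.

```python
# Frame property sketch: if M0 is disjoint from M1 and a step in M0
# succeeds, then the same step in M0 ∪ M1 succeeds, evolves the M0
# part identically, and leaves the M1 part untouched.
# Toy instructions (ours): ("sw", dst, val) stores; ("jmp", t) jumps.

def step(mem, pc):
    op = mem.get(pc)
    if op is None:
        return None
    mem = dict(mem)
    if op[0] == "sw":
        _, dst, val = op
        if dst not in mem:
            return None                 # may only write owned memory
        mem[dst] = val
        return mem, pc + 1
    if op[0] == "jmp":
        return mem, op[1]
    return None

m0 = {0: ("sw", 10, 42), 10: 0}         # code plus one owned data cell
m1 = {50: ("jmp", 0), 99: 7}            # disjoint frame memory

new_small, pc1 = step(m0, 0)
new_big, pc2 = step({**m0, **m1}, 0)
assert pc1 == pc2
assert all(new_big[a] == m1[a] for a in m1)          # frame untouched
assert all(new_big[a] == new_small[a] for a in m0)   # owned part agrees
```

The second assertion is exactly why the frame assertion a′ survives every step of B unchanged.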

The new top rule (PROG) replaces the well-formed code heap with a well-formed code specification. The initial condition is now weakened! Only the current (locally immutable) code heap with the current code block, rather than the whole code heap, is required to be in memory. Also, when proving the well-formedness of the current code block, the current code heap information is stripped from the global program specification.

Local reasoning. On the dynamic reasoning layer, since code information is carried with assertions and passed between code modules all the time, the verification of one module usually involves knowledge of the code of another (as a precondition). Sometimes such knowledge is redundant and breaks local reasoning. Fortunately, a frame rule can be established at the code-specification level as well: we can first locally verify a module, then extend it with the frame rule so that it can be linked with other modules later.

Lemma 4.1 (FRAME-SPEC)

    Ψ ⊢ (C,Ψ′)
    --------------------------------------
    (λf.Ψ(f) ∗ a) ⊢ (C, λf.Ψ′(f) ∗ a)

where a is independent of any register that is modified by C.

Proof: By induction over the derivation of Ψ ⊢ (C,Ψ′). There are only two cases. If the final step is done via the LINK rule, the conclusion follows immediately from the induction hypothesis. If the final step is via the LIFT rule, it must be derived from a well-formed-code-heap derivation:

    Ψ0 ⊢ C : Ψ′0    (3)

with Ψ = λf.Ψ0(f) ∗ blk(C) and Ψ′ = λf.Ψ′0(f) ∗ blk(C). We first apply the code-heap frame rule of Lemma 3.9 to (3) to obtain

    (λf.Ψ0(f) ∗ a) ⊢ C : λf.Ψ′0(f) ∗ a

and then apply the LIFT rule to get the conclusion. □

In particular, by setting the above assertion a to be the knowledge about code not touched by the current module, that code can be excluded from the local verification. As a more concrete example, suppose that we have two locally certified code modules C1 and C2, where C2 is generated by C1 at runtime. We first apply FRAME-SPEC to extend C2 with assertion blk(C1), which reveals the fact that C1 does not change during the whole execution of C2. After this, the LINK rule is applied to link them together into a well-formed code specification. We give more examples of GCAP1 in Section 6.

Soundness. The soundness of GCAP1 can be established in the same way as Theorem 3.8 (see the TR [5] for more detail).

Theorem 4.2 (Soundness of GCAP1) If Ψ ⊢ W, then Safe(W).

This theorem can be established following a similar technique as in GCAP0 (more cases need to be analyzed because of the additional layer). However, we delay the proof to the next section to give a better understanding of the relationship between GCAP1 and our more general framework GCAP2. By converting a well-formed GCAP1 program to a well-formed GCAP2 program, we will see that GCAP1 is sound given that GCAP2 is sound (see Theorem 5.5 and Theorem 5.8 in the next section).

To verify a program that involves runtime code generation, we first establish the well-formedness of each code module (which never modifies its own code) using the rules for well-formed code heaps, as in GCAP0. We then use the dynamic layer to combine these code modules into a global code specification. Finally, we use the new PROG rule to establish the initial state and prove the correctness of the entire program.

4.1 Example: Multilevel Runtime Code Generation

We use a small example, mrcg.s (Fig 17), to demonstrate the usability of GCAP1 for runtime code generation. Our mrcg.s is already fairly subtle: it does multilevel RCG, which means that code generated at runtime may itself generate new code. Multilevel RCG has its practical usage [13]. In this example, the code block B1 generates B2 (containing two instructions), which in turn generates B3 (containing only a single instruction).

The original code:

       main: la $9, gen        # get the target addr
             li $8, 0xac880000 # load Ec(sw $8,0($4))
             sw $8, 0($9)      # store to gen
B1 {         li $8, 0x00800008 # load Ec(jr $4)
             sw $8, 4($9)      # store to gen+4
             la $4, ggen       # $4 = ggen
             la $9, main       # $9 = main
             li $8, 0x01200008 # load Ec(jr $9) to $8
             j gen             # jump to target
        gen: nop               # to be generated
             nop               # to be generated
       ggen: nop               # to be generated

The generated code:

B2 {    gen: sw $8, 0($4)
             jr $4
B3 {   ggen: jr $9

Figure 17. mrcg.s: Multilevel runtime code generation

The first step is to verify B1, B2, and B3 separately and locally, as the following three judgments show:

    {gen ↦ a2 ∗ blk(B2)} ⊢ {a1} B1
    {ggen ↦ a3 ∗ blk(B3)} ⊢ {a2} B2
    {main ↦ a1} ⊢ {a3} B3

where

    a1 ≜ λS.True,
    a2 ≜ λ(M,R).R($9) = main ∧ R($8) = Ec(jr $9) ∧ R($4) = ggen,
    a3 ≜ λ(M,R).R($9) = main.

As we see, B1 has no requirement on its precondition, B2 simply requires that proper values are stored in the registers $4, $8, and $9, while B3 demands that $9 point to the label main.

All three judgments are straightforward to establish by means of the GCAP1 inference rules (the SEQ rule and the INSTR rule). For example, the pre- and selected intermediate conditions for B1 are as follows:

{λS.True}
main: la $9, gen
{λ(M,R).R($9) = gen}
      li $8, 0xac880000
      sw $8, 0($9)
{(λ(M,R).R($9) = gen) ∗ blk(gen: sw $8,0($4))}
      li $8, 0x00800008
      sw $8, 4($9)
{blk(B2)}
      la $4, ggen
      la $9, main
      li $8, 0x01200008
{a2 ∗ blk(B2)}
      j gen

The first five instructions generate the body of B2; then the registers are loaded with proper values to match B2's requirement. Notice the three li instructions: the encoding of each generated instruction is directly specified here as an immediate value.

Notice also that blk(B1) has to be satisfied as a precondition of B3, since B3 points back to B1. However, to achieve modularity we do not require it in B3's local precondition; instead, we leave this condition to be added later via our frame rule.
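To see the multilevel generation at work, the control flow of mrcg.s can be mimicked with a toy interpreter. Everything below (the `store`/`jmp` encoding and the addresses MAIN, GEN, GGEN, HALT) is a hypothetical stand-in for the MIPS code above, not its actual semantics: the block at MAIN writes the block at GEN, which in turn writes the block at GGEN.

```python
# Multilevel runtime code generation on a toy machine.  "Encodings"
# are just instruction tuples stored into memory cells.
MAIN, GEN, GGEN, HALT = 0, 100, 200, 300

mem = {
    MAIN:     ("store", GEN,     ("store", GGEN, ("jmp", HALT))),
    MAIN + 1: ("store", GEN + 1, ("jmp", GGEN)),
    MAIN + 2: ("jmp", GEN),
    HALT:     ("jmp", HALT),
}
# note: GEN and GGEN hold nothing yet; that code is generated at runtime

def run(mem, pc, fuel=20):
    trace = []
    while fuel:
        trace.append(pc)
        op = mem[pc]
        if op[0] == "store":            # write one instruction into memory
            mem[op[1]] = op[2]
            pc += 1
        else:                            # ("jmp", target)
            if op[1] == pc:
                break                    # reached the halt self-loop
            pc = op[1]
        fuel -= 1
    return trace

trace = run(dict(mem), MAIN)
assert GEN in trace and GGEN in trace and trace[-1] == HALT
```

Control really flows through two generations of freshly written code before reaching the halt loop, which is exactly the shape of obligation the three judgments above discharge.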

Figure 18. mrcg.s: GCAP1 specification

After the three code blocks are locally certified, the LIFT rules are respectively applied to each of them, as illustrated in Fig 18, resulting in three well-formed singleton code heaps. Afterwards, B2 and B3 are linked together, and we apply the FRAME-SPEC rule to the resulting code heap so that it can successfully be linked with the other code heap, forming the coherent global well-formed specification (as Fig 18 indicates):

    ΨG = {main ↦ a1 ∗ blk(B1), gen ↦ a2 ∗ blk(B2) ∗ blk(B1), ggen ↦ a3 ∗ blk(B3) ∗ blk(B1)}

which should satisfy ΨG ⊢ (C, ΨG) (where C stands for the entire code heap). Now we can finish the last step: applying the PROG rule to the initial world, so that the safety of the whole program is assured.

5. Supporting General SMC

Although GCAP1 is a nice extension of GCAP0, it can hardly be used to certify general SMC. For example, it cannot verify the opcode-modification example given in Fig 9 at the end of Sec 2. In fact, GCAP1 will not even allow the same memory region to contain different runtime instructions.

General SMC does not distinguish between code heap and data heap, and therefore poses new challenges: first, at runtime, the instructions stored at the same memory location may vary from time to time; second, the control flow is much harder to understand and represent; third, it is unclear how to divide a self-modifying program into code blocks so that they can be reasoned about separately. To tackle these obstacles, we have developed a new verification system, GCAP2, supporting general SMC. Our system is still built upon our machine model GTM.

5.1 Main Development

The main idea of GCAP2 is illustrated in Fig 19. Again, the potential runtime code is decomposed into code blocks representing the instruction sequences that may possibly be executed. Each code block is assigned a precondition, so that it can be certified individually. Unlike in GCAP1, since instructions can be modified, different runtime code blocks may overlap in memory, or even share the same entry location. Hence, if a code block contains a jump to a certain memory address (such as f1 in Fig 19) at which several blocks start, it is usually not possible to tell statically which block it will jump to at runtime. What our system requires instead is that whenever the program counter reaches this address (e.g. f1 in Fig 19), there exists at least one code block there whose precondition is matched. After all the code blocks are certified, they can be linked together to establish the correctness of the program.

Figure 19. Typical Control Flow in GCAP2 (code heap; code blocks at f1, ..., f4 with preconditions p1, ..., p4; control flow between them)

To support self-modifying features, we relax the requirements on well-formed code blocks. Specifically, a well-formed code block now describes an execution sequence of instructions starting at a certain memory address, rather than merely a static instruction sequence currently stored in memory. There is no difference between these two readings in the non-self-modifying case, since static code always executes as it is, while a fundamental difference appears in the more general SMC cases. The new reading, execution code block, better characterizes the real control flow of the program. Sec 6.9 discusses the importance of this generalization further.

Specification language. The specification language is almost the same as GCAP1's, but GCAP2 introduces one new concept, called a code specification (denoted Φ in Fig 20), which generalizes the previous code-and-specification pair to resolve the problem of having multiple code blocks starting at a single address. A code specification is a partial function that maps code blocks to their assertions. When certifying a program, the domain of the global Φ indicates all the code blocks that can show up at runtime, and the corresponding assertion of a code block describes its global precondition. The reader should note that although Φ is a partial function, it can have an infinite domain (indicating that there might be an infinite number of possible runtime code blocks).

    (ProgSpec) Ψ ∈ Address ⇀ Assertion
    (CodeSpec) Φ ∈ CodeBlock ⇀ Assertion

Figure 20. Assertion language for GCAP2

Inference rules. GCAP2 has three sets of judgments (see Fig 21): well-formed world, well-formed code specification, and well-formed code block. The key idea of GCAP2 is to eliminate the well-formed-code-heap layer of GCAP1 and push the "dynamic reasoning layer" down inside each code block, even into single instructions. Interestingly, this makes the rule set of GCAP2 look much like GCAP0's rather than GCAP1's.

The inference rules for well-formed code blocks have one tiny but essential difference from GCAP0/GCAP1: a well-formed instruction (rule INSTR) has the additional requirement that the instruction actually be at the proper location in memory. Previously, in GCAP1, this was guaranteed by the LIFT rule, which adds the whole static code heap into the preconditions; in GCAP2, only the currently executing instruction is required to be present in memory.

Intuitively, the well-formedness of a code block Ψ ⊢ {a} f: I now states that if a machine state satisfies assertion a, then I is the only possible code sequence to be executed starting from f, until we get to a program point where the specification Ψ can be matched.
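The distinction between a static code block and an execution code block can be demonstrated on a small self-modifying toy program. This is a Python sketch with an invented encoding (not GTM): the instruction actually executed at an address differs from the one the loader stored there.

```python
# "Execution code block" sketch: record the instructions a run actually
# executes, and compare them with the static listing.

def run(mem, pc, steps):
    executed = []
    for _ in range(steps):
        op = mem[pc]
        executed.append((pc, op))
        if op[0] == "store":            # overwrite a memory cell
            mem = dict(mem)
            mem[op[1]] = op[2]
            pc += 1
        elif op[0] == "jmp":
            pc = op[1]
    return executed

# 100: overwrite address 102, then flow to the freshly written code
static = {100: ("store", 102, ("jmp", 100)), 101: ("jmp", 102), 102: ("nop",)}
executed = run(dict(static), 100, 3)

static_at_102 = static[102]            # what the loader put there: nop
executed_at_102 = executed[2][1]       # what actually ran there: jmp 100
assert static_at_102 != executed_at_102
assert executed_at_102 == ("jmp", 100)
```

The execution code block starting at 100 is therefore `store; jmp; jmp 100`, not the static sequence `store; jmp; nop`, which is exactly the sequence a GCAP2 judgment at 100 would describe.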

Ψ ⊢ W (Well-formed World):

    ⟦Φ⟧ ⊢ Φ     a S     ⟦Φ⟧ ⊢ {a} pc: I
    ------------------------------------ (PROG)
    ⟦Φ⟧ ⊢ (S, pc)

    where ⟦Φ⟧ ≜ λf.∃I.Φ(f: I)

Ψ ⊢ Φ (Well-formed Code Specification):

    Ψ1 ⊢ Φ1     Ψ2 ⊢ Φ2     dom(Φ1) ∩ dom(Φ2) = ∅
    ----------------------------------------------- (LINK)
    Ψ1 ∪ Ψ2 ⊢ Φ1 ∪ Φ2

    ∀B ∈ dom(Φ). Ψ ⊢ {Φ B} B
    -------------------------- (SPEC)
    Ψ ⊢ Φ

Ψ ⊢ {a} B (Well-formed Code Block):

    Ψ ⊢ {a′} f+|Ec(ι)|: I     Ψ ∪ {f+|Ec(ι)| ↦ a′} ⊢ {a} f: ι
    ---------------------------------------------------------- (SEQ)
    Ψ ⊢ {a} f: ι; I

    ∀S. a S → (Decode(S,f,ι) ∧ Ψ(Npc_{f,ι}(S)) Next_{f,ι}(S))
    ----------------------------------------------------------- (INSTR)
    Ψ ⊢ {a} f: ι

Figure 21. Inference rules for GCAP2

The precondition of a non-self-modifying code block B must now include B itself, i.e. blk(B). This extra requirement does not compromise modularity, since the code is already present and can easily be moved into the precondition. For dynamic code, the initially stored code may differ from the code actually being executed. An example is as follows:

{blk(100: movi r0,Ec(j 100); sw r0,102)}
100: movi r0, Ec(j 100)
     sw r0, 102
     j 100

The third instruction is actually generated by the first two, and thus does not appear in the precondition. This kind of judgment can never be expressed in GCAP1, nor in any previous Hoare-logic model. Note that our generalization does not make verification more difficult: as long as the specification and precondition are given, the well-formedness of a code block can be established in the same mechanized way as before.

The judgment Ψ ⊢ Φ (well-formed code specification) is fairly comparable with the corresponding judgment in GCAP1, if we notice that the pair (C,Ψ) is just a way to represent a more limited Φ. The rules here basically follow the same idea, except that the SPEC rule allows universal quantification over code blocks: if every block in a code specification's domain is well-formed with respect to a program specification, then the code specification is well-formed with respect to that same program specification.

The interpretation operator ⟦−⟧ establishes the semantic relation between program specifications and code specifications: it transforms a code specification into a program specification by uniting the assertions (i.e. taking the disjunction) of all blocks starting at the same address. In the judgment for well-formed worlds (rule PROG), we use ⟦Φ⟧ as the specification to establish the well-formed code specification and the current well-formed code block. We do not need to require the current code block to be stored in memory (as GCAP1 did), since such a requirement is already specified in the assertion a.

Soundness. The soundness proof follows almost the same techniques as in GCAP0. First, due to the similarity between the rules for well-formed code blocks, the same weakening lemmas still hold:

Lemma 5.1

    Ψ ⊢ {a} B     Ψ ⇒ Ψ′     a′ ⇒ a
    ---------------------------------
    Ψ′ ⊢ {a′} B

    Ψ ⊢ Φ     Ψ ⇒ Ψ′     ∀B ∈ dom(Φ′). (Φ′ B ⇒ Φ B)
    -------------------------------------------------
    Ψ′ ⊢ Φ′

We also have the following relation between well-formed code specifications and well-formed code blocks.

Lemma 5.2  Ψ ⊢ Φ if and only if for every B ∈ dom(Φ) we have Ψ ⊢ {Φ B} B.

Proof Sketch: For the necessity, prove by inversion of the LINK and SPEC rules together with the weakening properties (Lemma 5.1). The sufficiency is trivial by the SPEC rule. □

Lemma 5.3 (Progress) If Ψ ⊢ W, then there exists a world W′ such that W 7−→ W′.

Proof Sketch: Easy to see by inversion of the PROG rule, the SPEC rule, and the INSTR rule, successively. □

Lemma 5.4 (Preservation) If Ψ ⊢ W and W 7−→ W′, then Ψ ⊢ W′.

Proof Sketch: The first premise is already guaranteed. The other two premises can be verified by checking the two cases of the well-formed-code-block rules. □

Theorem 5.5 (Soundness of GCAP2) If Ψ ⊢ W, then Safe(W).

Local reasoning. Frame rules are still the key idea for supporting local reasoning.

Theorem 5.6 (Frame Rules)

    Ψ ⊢ {a} B
    ------------------------------ (FRAME)
    (λf.Ψ(f) ∗ a′) ⊢ {a ∗ a′} B

where a′ is independent of every register modified by B;

    Ψ ⊢ Φ
    ------------------------------- (FRAME-SPEC)
    (λf.Ψ(f) ∗ a) ⊢ (λB.Φ B ∗ a)

where a is independent of every register modified by any code block in the domain of Φ.

In fact, since we no longer have the static code layer in GCAP2, the frame rules play an even more important role in achieving modularity. For example, to link two code modules that do not modify each other, we first use the frame rule to feed the code information of the other module into each module's specification and then apply the LINK rule.

5.2 Example and Discussion

We can now use GCAP2 to certify the opcode-modification example given in Fig 9. There are four runtime code blocks that need to be handled. Fig 22 shows the formal specification for each code block, including both the local version and the global version. Note that B1 and B2 overlap in memory, so we cannot use GCAP1 to certify this example.

Locally, we need to make sure that each code block is indeed stored in memory before it can be executed. To execute B1 and B4, we also require that the memory location at the address new store a proper instruction (which will be loaded later). On the other hand, since B4 and B2 can be executed if and only if the branch occurs at main, they both have the precondition R($2) = R($4).

After verifying all the code blocks with respect to their local specifications, we can apply the frame rule to establish the extended specifications. As Fig 22 shows, the frame rule is applied to the local judgments of B2 and B4, adding blk(B3) on both sides to form the corresponding global judgments.
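The partial-correctness property being specified here, that a3 holds whenever control reaches halt, can be sanity-checked by brute-force simulation of a toy analogue of opcode.s. The register model and instruction encoding below are ours, for illustration only; the opcode at the modified address flips from a move to an addi, exactly as in the example.

```python
# Toy analogue of opcode.s: the instruction at `target` is overwritten
# (move -> addi) before it is re-entered; at `halt` the invariant
# a3:  r4 <= r2 <= r4 + 1  should always hold.
MAIN, TARGET, HALT, MODIFY, NEW = 0, 1, 3, 4, 7

def run(r2, r4, fuel=30):
    mem = {
        MAIN:   ("beq", MODIFY),        # if r2 == r4 goto modify
        TARGET: ("move",),              # r2 := r4   (later becomes addi)
        2:      ("jmp", HALT),
        HALT:   ("jmp", HALT),
        MODIFY: ("loadstore",),         # mem[TARGET] := mem[NEW]
        5:      ("jmp", TARGET),
        NEW:    ("addi",),              # r2 := r2 + 1
    }
    pc, at_halt = MAIN, []
    while fuel:
        op = mem[pc]
        if op[0] == "beq":
            pc = op[1] if r2 == r4 else pc + 1
        elif op[0] == "move":
            r2 = r4; pc += 1
        elif op[0] == "addi":
            r2 = r2 + 1; pc += 1
        elif op[0] == "loadstore":
            mem[TARGET] = mem[NEW]; pc += 1
        elif op[0] == "jmp":
            if op[1] == pc:
                at_halt.append((r2, r4)); break
            pc = op[1]
        fuel -= 1
    return at_halt

for r2, r4 in [(0, 0), (3, 5), (7, 7)]:
    for final_r2, final_r4 in run(r2, r4):
        assert final_r4 <= final_r2 <= final_r4 + 1   # assertion a3
```

Both paths (branch taken with the rewritten addi, branch not taken with the original move) land at halt inside the a3 bound.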

For B1, blk(B3) ∗ blk(B4) is added; the additional blk(B4) in the specification entry for halt will then be weakened away by the weakening rule (the union of two program specifications, used in the LINK rule, is defined in Fig 14). Finally, all these judgments are joined together via the LINK rule to establish the well-formedness of the global code. This is similar to how we certify code using GCAP1 in the previous section, except that the LIFT step is no longer required here. The global code specification is exactly

    ΦG = {B1 ↦ a1 ∗ a′1, B2 ↦ a2 ∗ blk(B3), B3 ↦ a3 ∗ blk(B3), B4 ↦ a4 ∗ a′1}

which satisfies ⟦ΦG⟧ ⊢ ΦG, and ultimately the PROG rule can be successfully applied to validate the correctness of opcode.s. We have actually proved not only the type safety of the program but also its partial correctness: for instance, whenever the program reaches the line halt, the assertion a3 always holds.

The runtime code blocks:

B1 {   main:   beq $2, $4, modify
       target: move $2, $4        # B2 starts here
               j halt
B2 {   target: addi $2, $2, 1
               j halt
B3 {   halt:   j halt
B4 {   modify: lw $9, new
               sw $9, target
               j target

where

    a1 ≜ blk(new : addi $2,$2,1) ∗ blk(B1)
    a′1 ≜ blk(B3) ∗ blk(B4)
    a2 ≜ (λ(M,R).R($2)=R($4)) ∗ blk(B2)
    a3 ≜ λ(M,R).R($4) ≤ R($2) ≤ R($4)+1
    a4 ≜ (λ(M,R).R($2)=R($4)) ∗ blk(new : addi $2,$2,1) ∗ blk(B1)

Then the local judgments are as follows:

    {modify ↦ a4, halt ↦ a3} ⊢ {a1} B1
    {halt ↦ a3} ⊢ {a2} B2
    {halt ↦ a3 ∗ blk(B3)} ⊢ {a3 ∗ blk(B3)} B3
    {target ↦ a2} ⊢ {a4 ∗ blk(B4)} B4

After applying the frame rules and linking, the global specification becomes:

    {modify ↦ a4 ∗ a′1, halt ↦ a3 ∗ a′1} ⊢ {a1 ∗ a′1} B1
    {halt ↦ a3 ∗ blk(B3)} ⊢ {a2 ∗ blk(B3)} B2
    {halt ↦ a3 ∗ blk(B3)} ⊢ {a3 ∗ blk(B3)} B3
    {target ↦ a2 ∗ blk(B3)} ⊢ {a4 ∗ a′1} B4

Figure 22. opcode.s: Code and specification

Parametric code. In SMC, as mentioned earlier, the number of code blocks we need to certify might be infinite. Thus, it is impossible to enumerate and verify them one by one. To resolve this issue, we introduce auxiliary variables (i.e. parameters) into the code body, developing parametric code blocks and, correspondingly, parametric code specifications.

Traditional Hoare logic only allows auxiliary variables to appear in the pre- or post-condition of code sequences. In our new framework, by allowing parameters to appear in the code body and its assertion at the same time, assertions, code body, and specifications can interact with each other. This makes our program logic even more expressive.

One simplest case of a parametric code block is as follows:

    f: li $2, k
       j halt

with the number k as a parameter. It simply represents a family of code blocks where k ranges over all possible natural numbers. The code parameters can potentially be anything, e.g., instructions, code locations, or the operands of some instructions. Taking a whole code block or a code heap as a parameter may allow us to express and prove more interesting applications.

Certifying parametric code makes use of the universal quantifier in the SPEC rule. In the example above we need to prove the judgment

    ∀k. (Ψk ⊢ {λS.True} f : li $2,k; j halt)

where Ψk(halt) = (λ(M,R).R($2) = k), to guarantee that the parametric code block is well-formed with respect to the parametric specification Ψ.

Parametric code blocks are not just used in verifying SMC; they can be used in other circumstances. For example, to prove position-independent code, i.e. code whose function does not depend on the absolute memory address where it is stored, we can parameterize the base address of that code to do the certification. Parametric code can also improve modularity, for example, by abstracting out certain code modules as parameters. We will give more examples of parametric code blocks in Sec 6.

Expressiveness. The following important theorem shows the expressiveness of our GCAP2 system: as long as there exists an invariant for the safety of a program, GCAP2 can be used to certify it with a program specification which is equivalent to the invariant.

Theorem 5.7 (Expressiveness of GCAP2) If Inv is an invariant of GTM, then there is a Ψ such that for any world (S,pc) we have

    Inv(S,pc) ←→ ((Ψ ⊢ (S,pc)) ∧ (Ψ(pc) S)).

Proof: We give a way to construct Ψ from Inv:

    Φ(f: ι) = λS.Inv(S,f) ∧ Decode(S,f,ι),  ∀f,ι
    Ψ = ⟦Φ⟧ = λf.λS.Inv(S,f)

We first prove the forward direction. Given a world (S,pc) satisfying Inv(S,pc), we want to show that there exists an a such that Ψ ⊢ (S,pc). Observe that for all f and S,

    Ψ(f) S ←→ ∃I.Φ(f: I) S
           ←→ ∃ι.Inv(S,f) ∧ Decode(S,f,ι)
           ←→ Inv(S,f) ∧ ∃ι.Decode(S,f,ι)
           ←→ Inv(S,f)

where the last equivalence is due to the progress property of Inv together with the transition relation of GTM.

Now we prove the first premise of PROG, i.e. ⟦Φ⟧ ⊢ Φ, which by rule SPEC is equivalent to

    ∀B ∈ dom(Φ). (⟦Φ⟧ ⊢ {Φ B} B)

which is

    ∀f,ι. (Ψ ⊢ {Φ(f: ι)} f: ι)    (4)

which can be rewritten as

    ∀f,ι. (λf.λS′.Inv(S′,f) ⊢ {λS′.Inv(S′,f) ∧ Decode(S′,f,ι)} f: ι)    (5)

On the other hand, since Inv is an invariant, it satisfies the progress and preservation properties. Applying the INSTR rule, (5) can easily be proved. Therefore ⟦Φ⟧ ⊢ Φ.

Now, by the progress property of Inv, there must exist a valid instruction ι such that Decode(S,pc,ι) holds. Choose

    a ≜ λS′.Inv(S′,pc) ∧ Decode(S′,pc,ι)

Then it is obvious that a S is true. And since the correctness of ⟦Φ⟧ ⊢ {a} pc: ι is included in (4), all the premises of rule PROG have been satisfied. So we get Ψ ⊢ (S,pc), and it is trivial to see that Ψ(pc) S holds. Conversely, for our Ψ, if Ψ(pc) S holds, then Inv obviously holds on the world (S,pc). This completes our proof. □

Together with the soundness theorem (Theorem 5.5), we have shown that there is a correspondence between a global program specification and a global invariant for any program.

It should also come as no surprise that any program certified under GCAP1 can always be translated into GCAP2. In fact, the judgments of GCAP1 and GCAP2 have very close connections. The translation relation is described by the following theorem. Here we use ⊢GCAP1 and ⊢GCAP2 to represent the two kinds of judgments respectively. Let

    ⟦({fi ↦ Ii}∗, {fi ↦ ai}∗)⟧ ≜ {(fi : Ii) ↦ ai}∗

then we have

Theorem 5.8 (GCAP1 to GCAP2)
1. If Ψ ⊢GCAP0 {a} B, then (λf.Ψ(f) ∗ blk(B)) ⊢GCAP2 {a ∗ blk(B)} B.
2. If Ψ ⊢GCAP1 (C,Ψ′), then Ψ ⊢GCAP2 ⟦(C,Ψ′)⟧.
3. If Ψ ⊢GCAP1 W, then Ψ ⊢GCAP2 W.

Proof:
1. Since a ∗ blk(B) holds, by the frame rule of CAP we know that blk(B) is always satisfied and never modified during the execution of B. Thus the additional premise of INSTR is always satisfied, and it is easy to reach the conclusion by induction over the length of the code block.
2. With the help of 1, we see that if Ψ ⊢GCAP1 (C,Ψ′), then for every f ∈ dom(Ψ′) we have f ∈ dom(C) and Ψ ⊢GCAP2 {Ψ′(f)} f: C(f). The result is then obvious.
3. Directly derived from 2. □

Finally, as promised, from Theorem 5.8 and the soundness of GCAP2 we directly obtain the soundness of GCAP1 (Theorem 4.2).

6. More Examples and Applications

We show the certification of a number of representative examples and applications using GCAP (see Table 1 in Sec 1).

6.1 A Certified OS Boot Loader

An OS boot loader is a simple, yet prevalent application of runtime code loading. It is an artifact of a limited bootstrapping protocol, but one that continues to exist to this day. The limitation on the x86 architecture is that the BIOS of the machine will only load the first 512 bytes of code into main memory for execution. The boot loader is the code contained in those bytes; it loads the rest of the OS from the disk into main memory and begins executing the OS (Fig 24). Therefore certifying a boot loader is an important piece of a complete OS certification.

To show that we support a real boot loader, we have created one that runs on the Bochs simulator [19] and on a real x86 machine. The code (Fig 23) is very simple: it sets up the registers needed to make a BIOS call that reads the hard disk into the correct memory location, makes the call to actually read the disk, and then jumps to the loaded memory.

{rDL = 0x80 ∧ D(512) = Ec(jmp −2) ∗ blk(B1)}
       bootld: movw $0, %bx       # can not use ES
B1 {           movw %bx, %es      # kernel segment
               movw $0x1000, %bx  # kernel offset
               movb $1, %al       # num sectors
               movb $2, %ah       # disk read command
               movb $0, %ch       # starting cylinder
               movb $2, %cl       # starting sector
               movb $0, %dh       # starting head
               int $0x13          # call BIOS
               ljmp $0, $0x1000   # jump to kernel
{blk(B2)}
B2 {   kernel: jmp $-2            # loop

Figure 23. bootloader.s: Code and specification

Figure 24. A typical boot loader (the BIOS copies sector 1, the boot loader, to address 0x7c00 before start-up; the boot loader copies sector 2, the kernel, to address 0x1000, where it starts executing)

The specifications of the boot loader are also simple. rDL = 0x80 makes sure that the number of the disk is given to the boot loader by the hardware. The value is passed unaltered to the int instruction, and is needed for that BIOS call to read from the correct disk. D(512) = Ec(jmp −2) makes sure that the disk actually contains a kernel with specific code. The code itself is not important, but the entry point into this code needs to be verifiable under a trivial precondition, namely that the kernel is loaded. The code itself can be changed; the boot loader proof will not change if the code changes, as it simply relies on the proof that the kernel code is certified. The assertion blk(B1) just says that the boot loader is in memory when executed.

6.2 Control Flow Modification

The fundamental unit of the self-modification mechanism is the modification of single instructions. There are several possibilities for modifying an instruction. Opcode modification, which mutates an operation instruction (i.e. an instruction that does not jump) into another operation instruction, has already been shown in Sec 5.2 as a typical example. This kind of code is relatively easy to deal with, since it does not change the control flow of a program. Another case, which looks trickier, is control flow modification. In this scenario, an operation instruction is modified into a control transfer instruction, or the opposite way; we can also modify one control transfer instruction into another.

Control flow modification is useful for patching subroutine call addresses. We give an example here that demonstrates control flow modification and show how to verify it.

halt: j halt
main: la $9, end
      lw $8, 0($9)
      addi $8, $8, -1
      addi $8, $8, -2
      sw $8, 0($9)
end:  j dead
dead:

The entry point of the program is the main line. At first look, the code seems "deadly" because there is a jump to the dead line, which does not contain a valid instruction. But fortunately, before the program executes to that line, the instruction at it will be fetched out, subtracted by 3, and stored back, so that the jump instruction at end then points elsewhere.
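The patching step can be replayed on a toy model in which a jump is just a ("jmp", target) pair, so "subtracting" directly retargets it. This is a deliberate simplification of the instruction-encoding arithmetic in the example above (there, the lw/addi/sw sequence edits the jump's encoded word); the addresses and the single "patch" pseudo-instruction are ours.

```python
# Toy control-flow modification: the jump at `end` is data that the
# program rewrites before executing it, so the "deadly" jump to DEAD
# is never taken.
HALT, MAIN, END, DEAD = 0, 1, 2, 3
mem = {
    HALT: ("jmp", HALT),
    MAIN: ("patch", END, -3),        # retarget mem[END]: j dead -> j halt
    END:  ("jmp", DEAD),             # DEAD holds no valid instruction
}

def run(mem, pc, fuel=10):
    visited = []
    while fuel:
        visited.append(pc)
        op = mem[pc]
        if op[0] == "patch":
            tgt = mem[op[1]]
            mem[op[1]] = ("jmp", tgt[1] + op[2])
            pc += 1
        elif op[0] == "jmp":
            if op[1] == pc:
                break                # halt self-loop reached
            pc = op[1]
        fuel -= 1
    return visited

visited = run(dict(mem), MAIN)
assert DEAD not in visited and visited[-1] == HALT
```

The invalid location is never reached because the jump has been retargeted before control arrives at it, which is precisely what the certified specification must capture.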

12 2007/3/31

But fortunately, before the program executes to that line, the instruction at it will be fetched out, subtracted by 3, and stored back, so that the jump instruction at end then points to the second addi instruction. The program then continues executing from that line, until it finally hits the halt line and loops forever.

Our specification for the program is described in Fig 25. There are four code blocks that need to be certified, and the execution process is now very clear.

    {blk(main : la $9,end; I1; j dead)}
    main: la $9, end
          lw $8, 0($9)
          addi $8, $8, -1
          addi $8, $8, -2
          sw $8, 0($9)
          j mid2
    {(λ(M,R). ⟨R($8)⟩4 = Ec(j mid2) ∧ R($9)=end) ∗ blk(mid2 : I2; j mid2)}
    mid2: addi $8, $8, -2
          sw $8, 0($9)
          j mid1
    {(λ(M,R). ⟨R($8)⟩4 = Ec(j mid1) ∧ R($9)=end) ∗ blk(mid1 : I1; j mid1)}
    mid1: lw $8, 0($9)
          addi $8, $8, -1
          addi $8, $8, -2
          sw $8, 0($9)
          j halt
    {blk(halt : j halt)}
    halt: j halt

    where
      I1 = lw $8,0($9); addi $8,$8,−1; I2
      I2 = addi $8,$8,−2; sw $8,0($9)

Figure 25. Control flow modification

6.3 Runtime Code Generation and Code Optimization

Runtime code generation (also known as dynamic code generation), which dynamically creates code for later use during the execution of a program [16], is probably the broadest usage of SMC.

    .data                     # Data declaration section
    vec1:   .word 22, 0, 25
    vec2:   .word 7, 429, 6
    result: .word 0
    tpl:    li $2, 0          # template code
            lw $13, 0($4)
            li $12, 0
            mul $12, $12, $13
            add $2, $2, $12
            jr $31

    .text                     # Code section
    # {True}
    main:   li $4, 3          # vector size
            li $8, 0          # counter
            la $9, gen        # target address
            la $11, tpl       # template address
            lw $12, 0($11)
            sw $12, 0($9)     # copy the 1st instr
            addi $9, $9, 4
    # {p(R($8))}
    loop:   beq $8, $4, post
            li $13, 4
            mul $13, $13, $8
            lw $10, vec1($13)
            beqz $10, next    # skip if zero
            lw $12, 4($11)
            add $12, $12, $13
            sw $12, 0($9)
            lw $12, 8($11)
            add $12, $12, $10
            sw $12, 4($9)
            lw $12, 12($11)
            sw $12, 8($9)
            lw $12, 16($11)
            sw $12, 12($9)
            addi $9, $9, 16
    # {p(R($8)+1)}
    next:   addi $8, $8, 1
            j loop
    # {p(3)}
    post:   lw $12, 20($11)
            sw $12, 0($9)
            la $4, vec2
            jal gen
    # {λ(M,R). R($2) = v1 · v2}
            sw $2, result
            j main
    gen:

Figure 26. Vector Dot Product

The code generated at runtime usually has better performance than static code in several respects, because more information is available at runtime. For example, if the value of a variable does not change during runtime, the program can generate specialized code with the variable as an immediate constant, which enables code optimizations such as pre-calculation. [28] gives plenty of examples of the utilization of runtime code generation, and the certification of all those applications fits into our system. Here we show the certification of one of them: the runtime code generation version of a fundamental operation of matrix multiplication, vector dot product, which was also mentioned in [31].

Efficient matrix multiplication is important in digital signal processing and scientific computation. Sometimes the matrices have certain characteristics, such as sparseness (a large number of zeros) or small integers. Runtime code generation allows these characteristics to be exploited in locally optimized code for each row of the matrix, based on the actual values. Since the code for each row is specialized once but used n times (once for each column of the other matrix), the performance improvement easily surpasses the cost of generating the code at runtime.

The code shown in Fig 26 (which includes the specification) reads the vector v1 stored at vec1, generates specialized code at gen with elimination of multiplications by 0, and then runs the code at gen to compute the dot product of v1 and v2 and store it into the memory address result. For the vector (22, 0, 25), the generated code is shown in Fig 28.

    Ik,u ≜ lw $13,k($4); li $12,u; mul $12,$12,$13; add $2,$2,$12,   if u > 0
    Ik,u ≜ ∅,                                                        if u = 0

    p(k) ≜ λ(M,R). R($4)=3 ∧ k≤R($4) ∧ R($9)=gen+4+16k ∧
           blk(gen: li $2,0; I0,M(vec1+0); ...; Ik−1,M(vec1+k−1))(M,R)

Figure 27. Dot Product Assertion Macro

Since this program does not involve code modification, we use GCAP1 to certify it. Each code block can be verified separately with the assertions we gave in Fig 26 and Fig 27. The main part of the program can be directly linked and certified to form a well-formed code heap. On the other hand, the generated code, as in Fig 28, can be verified on its own.
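To make the specialization strategy concrete outside the MIPS setting, here is a short Python sketch of the same idea. It is ours, not part of the paper's development, and all names in it are hypothetical: for a fixed row v1 it emits straight-line source with each element folded in as a constant, skips multiplications by zero, and compiles the result at runtime with exec, mirroring how the generator in Fig 26 copies and patches template instructions into gen.

```python
# Hypothetical sketch (not the paper's code): runtime specialization of a
# dot product for a fixed vector v1, with multiply-by-zero elimination.
def specialize_dot(v1):
    lines = ["def dot(v2):", "    res = 0"]
    for k, c in enumerate(v1):
        if c != 0:                      # skip zero entries, as in Fig 26
            lines.append(f"    res += {c} * v2[{k}]")
    lines.append("    return res")
    ns = {}
    exec("\n".join(lines), ns)          # "runtime code generation"
    return ns["dot"]

dot = specialize_dot([22, 0, 25])       # the vector vec1 from Fig 26
print(dot([7, 429, 6]))                 # 22*7 + 25*6 = 304
```

For v1 = (22, 0, 25) the generated function performs only two multiply-adds, just as the specialized code in Fig 28 does.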

After these are all done, we use the linking rules of GCAP1 to merge the two parts together, which completes the whole verification. The modularity of our system guarantees that the verification of the generated code (Fig 28) does not need any information about its "mother code".

    # {R($31) = back ∧ R($4) = vec2}
    gen: li $2, 0            # int gen(int *v)
         lw $13, 0($4)       # {
         li $12, 22          #   int res = 0;
         mul $12, $12, $13   #   res += 22 * v[0];
         add $2, $2, $12     #   res += 25 * v[2];
         lw $13, 8($4)       #   return res;
         li $12, 25          # }
         mul $12, $12, $13
         add $2, $2, $12
         jr $31

Figure 28. Generated Code of Vector Dot Product

6.4 Runtime Code Checking

As we mentioned, the limitation of previous verification frameworks is not only about self-modifying code. They also cannot deal with "self-reading code" that does not do any code modification. The example below is a program where reading its own code helps decide how the program executes; we verify it with GCAP1.

Here is an interesting application of code-reading programs. As we know, there is a subroutine call instruction in most architectures (jal in MIPS, call in x86). The call instruction is essentially a composition of two operations: first store the address of the next instruction into the return address register ($31 in MIPS), and then jump to the target address. When the subroutine finishes executing, hopefully, the return address register points to the next instruction in the parent program, so that the program can continue execution normally. However, this is not guaranteed, because a vicious parent program can store a bad address into the return address register and jump to the subroutine. The code in Fig 29 shows how to avoid this kind of cheating by a runtime code-checking mechanism.

The main program calls f as a subroutine. After the subroutine has done its job (storing 42 into register $8), and before returning, it first checks whether the instruction preceding the return address in register $31 is indeed a jal (or call) instruction. If not, it simply halts the program; otherwise, it returns.

    # {True}
    main: jal f
    # {λ(M,R). R($8) = 42}
          move $2, $8
          j halt
    # {(λ(M,R). R($31)=main+4) ∗ blk(addr : jal f) ∗ blk(main : jal f)}
    f:    li $8, 42
          lw $9, -4($31)
          lw $10, addr
          bne $9, $10, halt
          jr $31
    # {R($2) = R($8) = 42}
    halt: j halt
    addr: jal f

Figure 29. Runtime code-checking

The precondition of f asserts the return address, and both addr and main store the jal f instruction, so that the code block can check the equality of these two memory locations. Note that the precondition of main does not specify the content of addr, since this can be added later via the frame rule.

6.5 Fibonacci Number and Parametric Code

To demonstrate the usage of parametric code, we construct an example fib.s to calculate the Fibonacci function fib(0) = 0, fib(1) = 1, fib(i+2) = fib(i) + fib(i+1), shown in Fig 30. More specifically, fib.s calculates fib(M(num)), which is fib(8) = 21, and stores it into register $2.

    .data                    # Data declaration section
    num:  .byte 8

    .text                    # Code section
    main: lw $4, num         # set argument
          lw $9, key         # $9 = Ec(addi $2,$2,0)
          li $8, 1           # counter
          li $2, 1           # accumulator
    loop: beq $8, $4, halt   # check if done
          addi $8, $8, 1     # inc counter
          add $10, $9, $2    # new instr to put
    key:  addi $2, $2, 0     # accumulate
          sw $10, key        # store new instr
          j loop             # next round
    halt: j halt

Figure 30. fib.s: Fibonacci number

It looks strange that this is possible, since throughout the whole program the only instructions that write $2 are the fourth instruction, which assigns 1 to it, and the line key, which initially does nothing. The main trick, of course, comes from the code-modifying instruction on the line next to key. In fact, the third operand of the addi instruction on the line key alters to the next Fibonacci number (temporarily calculated and stored in register $10 before the instruction modification) during every loop. Fig 31 illustrates the complete execution process.

Since the opcode of the line key has an unbounded number of runtime values, we need to seek help from parametric code blocks. The formal specification for each code block is shown in Fig 32. We specify the program using three code blocks, where the second block, the kernel loop of our program B2,k, is a parametric one.

    {blk(B1)}
          main: lw $4, num
    B1          lw $9, key
                li $8, 1
                li $2, 1
    {(λ(M,R). R($4)=M(num) ∧ R($9)=Ec(addi $2,$2,0) ∧
      R($8)=k+1 ∧ R($2)=fib(k+1)) ∗ blk(B2,k)}
          loop: beq $8, $4, halt
                addi $8, $8, 1
    B2,k        add $10, $9, $2
          key:  addi $2, $2, fib(k)
                sw $10, key
                j loop
    {(λ(M,R). R($2)=fib(M(num))) ∗ blk(B3)}
    B3    halt: j halt

Figure 32. fib.s: Code and specification

          main: lw $4, num
    B1          lw $9, key
                li $8, 1
                li $2, 1

    {$8=1,$2=1,$4=8}     {$8=2,$2=1,$4=8}         {$8=7,$2=13,$4=8}    {$8=8,$2=21,$4=8}
    B2,0                 B2,1                      B2,6                 B2,7
    loop: beq $8,$4,halt beq $8,$4,halt            beq $8,$4,halt       beq $8,$4,halt
          addi $8,$8,1   addi $8,$8,1              addi $8,$8,1         addi $8,$8,1
          add $10,$9,$2  add $10,$9,$2     ...     add $10,$9,$2        add $10,$9,$2
          addi $2,$2,0   addi $2,$2,1              addi $2,$2,8         addi $2,$2,13
          sw $10,key     sw $10,key                sw $10,key           sw $10,key
          j loop         j loop                    j loop               j loop

    {$2=fib($8)=21}
    B3    halt: j halt

    where fib(0)=0, fib(1)=1, fib(i+2)=fib(i)+fib(i+1).

Figure 31. fib.s: Control flow
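The execution traced in Fig 31 can also be mimicked in a few lines of Python. The sketch below is ours, not from the paper: the immediate operand of the instruction at key is modeled directly as a mutable integer, so storing a new value into it plays the role of the sw $10, key self-modification, and the loop body follows the instruction order of Fig 30 exactly.

```python
# Hypothetical simulation (not the paper's code) of fib.s from Fig 30.
def fib_smc(n=8):
    key_imm = 0        # immediate of `key: addi $2, $2, <imm>`
    acc = 1            # $2, accumulator
    counter = 1        # $8, counter
    while counter != n:          # loop: beq $8, $4, halt
        counter += 1             #       addi $8, $8, 1
        new_imm = acc            #       add  $10, $9, $2 (base encoding + $2)
        acc += key_imm           # key:  addi $2, $2, <imm> (old operand runs)
        key_imm = new_imm        #       sw   $10, key (install new operand)
    return acc

print(fib_smc(8))   # fib(8) = 21
```

Note the order, which is the crux of the example: the new operand is computed before key executes, but installed only afterwards, so each iteration key still adds the previous Fibonacci number.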

The parameter k appears in the operand of the key instruction as an argument of the Fibonacci function.

Consider the execution of code block B2,k. Before it is executed, $9 stores Ec(addi $2,$2,0) and $2 stores fib(k+1). Therefore, the execution of the third instruction add $10,$9,$2 changes $10 into Ec(addi $2,$2,fib(k+1))¹, so at the end of the loop $2 is now fib(k+1)+fib(k) = fib(k+2), and key holds the instruction addi $2,$2,fib(k+1) instead of addi $2,$2,fib(k); the program then continues to the next loop B2,k+1.

The global code specification we finally get is as follows:

    Φ ≜ {B1 ↝ a1, B2,k ↝ a2,k | k ∈ Nat, B3 ↝ a3}    (6)

But note that this formulation is just for readability; it is not directly expressible in our meta logic. To express parameterized code blocks, we need to use existential quantifiers. The example is in fact represented as Φ ≜ λB.λS. (B=B1 ∧ a1 S) ∨ (∃k. B=B2,k ∧ a2,k S) ∨ (B=B3 ∧ a3 S). One can easily see the equivalence between this definition and (6).

¹ To simplify the case, we assume in this example that the encoding of the addi instruction has a linear relationship with respect to its numerical operand.

6.6 Self Replication

Combining self-reading code with runtime code generation, we can produce a self-growing program, which keeps replicating itself forever. This kind of code appears commonly in Core War, a game where different people write assembly programs that attack the other programs. Our demo code selfgrow.s is shown in Fig 33. After initializing the registers, the code repeatedly duplicates itself and continues to execute the new copy, as Fig 34 shows.

    main: la $8, loop
          la $9, new
          move $10, $9
    loop: lw $11, 0($8)
          sw $11, 0($9)
          addi $8, $8, 4
          addi $9, $9, 4
          bne $8, $10, loop
          move $10, $9
    new:

Figure 33. selfgrow.s: Self-growing Code

Figure 34. selfgrow.s: Self-growing process (the code heap after each round: every round appends a fresh copy of the loop body at new, and execution continues in the newest copy)

The block starting at loop is the code body that keeps being duplicated. During the execution, this part is copied to the new location; then the program continues executing from new, until another code block is duplicated. Note that here we rely on the property that instruction encodings for branches use relative addressing; thus every time our code is duplicated, the target address of the bne instruction changes accordingly.

The copying process goes on and on, till the whole available memory is consumed and, presumably, the program would crash. However, under our assumption that the memory domain is infinite, this code never kills itself and thus can be certified.

The specification is shown in Fig 35. The code block B0 is certified separately; its precondition merely requires that B0 itself matches the memory. All the other code, including the original loop body and every generated copy, is parameterized as a code block family and certified altogether. In their preconditions, besides the requirement that the code block matches the memory, there should exist an integer i between 0 and 5 (both inclusive) such that the first i instructions have been copied properly, and the three registers $8, $9, and $10 hold the proper values.

    {blk(B0)}
          main: la $8, loop
    B0          la $9, new
                move $10, $9
    {(λ(M,R). ∃0≤i<6. R($8)=k+4i ∧ R($9)=k+24+4i ∧ R($10)=k+24 ∧
      ∀0≤j<4i. M(k+j)=M(k+24+j)) ∗ blk(Bk)}
          k:    lw $11, 0($8)
                sw $11, 0($9)
    Bk          addi $8, $8, 4
                addi $9, $9, 4
                bne $8, $10, k
                move $10, $9

Figure 35. selfgrow.s: Code and specification

6.7 Mutual-modifying Modules

In the circumstances of runtime code generation, the interaction between two modules is one-way: the "mother code" manipulates

the "child code" but not the opposite. If both modules operate on each other, the situation becomes trickier, but with GCAP2 it can be handled without much difference. Here is a mini example showing what can happen under this circumstance:

    main:  j alter
           sw $8, alter
    alter: lw $8, main
           li $9, 0
           sw $9, main
           j main

The main program first jumps to the alter block; after alter has modified the instruction at main and transferred control back, the main block modifies the alter instruction back, which makes the program loop forever.

The specification for this tangle is fairly simple, as shown in Fig 36. The actual number of code blocks that need to be certified is 4, although there are only two starting addresses: both the preconditions and the code bodies of the code blocks at the same starting address differ completely. For example, the first time the program enters main, it only requires B1 itself, while the second time it enters, the code body changes to B3 and the precondition requires that register $8 hold the correct value. In fact, certifying this example makes the control flow much clearer to the reader.

    {blk(B1)}
    B1  { main:  j alter
    {blk(B1) ∗ blk(B2)}
          alter: lw $8, main
    B2           li $9, 0
                 sw $9, main
                 j main
    {(λ(M,R). ⟨R($8)⟩4 = Ec(j alter)) ∗ blk(B3)}
          main:  nop
    B3           sw $8, alter
    {blk(B4)}
    B4  { alter: j alter

Figure 36. Specification - Mutual Modification

6.8 Polymorphic Code

Polymorphic code is code that mutates itself while keeping its algorithm equivalent. It is commonly used in writing computer viruses and trojans that want to hide their presence. Since anti-virus software usually identifies a virus by recognizing particular byte patterns in memory, polymorphic code has been a way to combat this.

The following oscillating code is a simple demo example. Every time it executes through, the content of the code switches (between B1 and B2), but its function is always to add 42 to register $2.

    # B0 =
    main: la $10, body

    # B1 =                   # B2 =
    body: lw $8, 12($10)     lw $8, 12($10)
          lw $9, 16($10)     lw $9, 16($10)
          sw $9, 12($10)     sw $8, 16($10)
          j dead             addi $2, 21
          addi $2, 21        j dead
          sw $8, 16($10)     sw $9, 12($10)
          lw $9, 8($10)      lw $9, 8($10)
          lw $8, 20($10)     lw $8, 20($10)
          sw $9, 20($10)     sw $9, 20($10)
          sw $8, 8($10)      sw $8, 8($10)
          j body             j body

Safety is not obvious, as there is an illegal instruction j dead in the program. Since the code blocks that need to be certified overlap, we use GCAP2 to prove it. The specification is shown in Fig 37. Note the tiny difference between the code above and the code in Fig 37: this difference demonstrates the distinction between "stored" code blocks and "executing" code blocks, and it is what makes our verification work.

    {blk(B0)}
    main: la $10, body

    {blk(B1)}                {blk(B2)}
    body: lw $8, 12($10)     lw $8, 12($10)
          lw $9, 16($10)     lw $9, 16($10)
          sw $9, 12($10)     sw $8, 16($10)
          addi $2, 21        addi $2, 21
          addi $2, 21        addi $2, 21
          sw $8, 16($10)     sw $9, 12($10)
          lw $9, 8($10)      lw $9, 8($10)
          lw $8, 20($10)     lw $8, 20($10)
          sw $9, 20($10)     sw $9, 20($10)
          sw $8, 8($10)      sw $8, 8($10)
          j body             j body

Figure 37. Specification - Polymorphic Code

6.9 Code Obfuscation

The purpose of code obfuscation is to confuse debuggers. Although there are many approaches that obfuscate code without involving SMC, such as control-flow based obfuscation, SMC is a much more effective way due to its hardness for understanding. By instrumenting a large amount of redundant, garbage SMC into normal code, plenty of memory reading and writing instructions will considerably confuse the disassembler. The method mentioned in [15] is a typical attempt. In PLDI'06's tutorial, Bjorn De Sutter also demonstrated Diablo, a binary rewriting program that instruments self-modifying code for code obfuscation.

Certifying an SMC-obfuscated program with traditional verification techniques is extremely tough. At the same time, reasoning about this kind of code is fairly simple under our framework. Take the following code as an example:

          main: la $8, g
                lw $9, 0($8)
                addi $10, $9, 4
                sw $10, g
    B1          lw $11, h
          g:    sw $9, 0($8)
          h:    j dead
                sw $11, h
                j main

The program does several weird memory loads and stores, and it seems that there is a dead instruction, while actually it does nothing harmful. The lines g and h change during the execution and are changed back afterwards, so the code looks intact after one round of execution. If we remove the last line, this piece of code can be instrumented into a regular program to fool the debugger. In our GCAP2 system, such code can be verified with a single judgment, as simply as Fig 38 shows.

One observation is that this program can also be certified with the more traditional-style system GCAP1. In contrast, though, since GCAP1, like traditional reasoning, does not allow a code block to mutate any of its own code, the program has to be divided into several smaller intact parts to certify. This approach has several disadvantages. Firstly, the code blocks are divided too small and are too difficult to manipulate. Secondly, this approach would fail

    {blk(B1)}
    main: la $8, g
          lw $9, 0($8)
          addi $10, $9, 4
          sw $10, g
          lw $11, h
    g:    sw $9, 4($8)
    h:    sw $9, 0($8)
          sw $11, h
          j main

Figure 38. Specification - Obfuscation

A simple encryption and decryption example encrypt.s adapted

    # {True}
    main: la $8, pg
          la $9, pgend
          li $10, 0xffffffff
    # {(λ(M,R). (pg ≤ R($8)

Figure 40. The execution of runtime code decryption (the code heap initially holds the decrypter at start followed by the encrypted main code: 0x12c5fa93, 0x229ab1d8, 0x98d93bc3, 0xcc693de8, 0x127f983b, 0x907e6d64, ...; after decrypting, the same region holds the decrypted main code: beq $8, $9, 8; nop; li $2, 1; li $3, 2; add $2, $2, $3; li $4, 5; ...)
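The decryption scheme pictured in Fig 40 can be sketched in Python. This is our illustration, not the paper's encrypt.s: where the real decrypter XORs machine words in the code heap and then jumps to the freshly written instructions, here the "code heap" is a dictionary, the payload is Python source text, and the key and all names are hypothetical.

```python
# Hypothetical sketch (not the paper's encrypt.s) of runtime code decryption
# in the style of Fig 40: a plaintext decrypter, an encrypted payload, an
# in-place XOR pass, and only then a "jump" into the decrypted code.
KEY = 0x5A

def encrypt(src: bytes) -> bytes:
    return bytes(b ^ KEY for b in src)   # XOR is its own inverse

heap = {"payload": encrypt(b"result = 2 + 3")}   # encrypted main code

def decrypter():
    heap["payload"] = bytes(b ^ KEY for b in heap["payload"])  # decrypt in place
    ns = {}
    exec(heap["payload"].decode(), ns)   # "jump" into the decrypted code
    return ns["result"]

print(decrypter())   # 5
```

As in the paper's setting, nothing about the payload is trusted until it has been rewritten in place; a certified version would attach a precondition to the entry point of the decrypted region, as the assertion on pg and pgend above begins to do.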

modifying techniques in shellcoding have nothing new beyond the previous examples and can be certified without surprise.

7. Implementation

We use the Coq proof assistant [32], which comes with an expressive underlying logic, to implement our full system, including the formal construction of GTM, MMIPS, Mx86, GCAP0, GCAP1, and GCAP2, as well as the proofs of soundness and related properties of all three systems. The assertion language and the basic properties of separation logic are also implemented and proved.

We have certified several typical examples, including a complete OS boot loader written in x86, demonstrating GCAP1, and the Fibonacci example in MIPS, demonstrating GCAP2. The system can be immediately used for more realistic and theoretical applications.

8. Related Work and Conclusion

Previous assembly code certification systems (e.g., TAL [23], FPCC [1, 10], and CAP [33, 25]) all treat code separately from data memory, so that only immutable code is supported. Appel et al. [2, 22] described a model that treats machine instructions as data in the von Neumann style; they raised the verification of runtime code generation as an open problem, but did not provide a solution. TALT [6] also implemented the von Neumann machine model where code is stored in memory, but it still does not support SMC. TAL/T [13, 31] is a typed assembly language that provides some limited capabilities for manipulating code at runtime. TAL/T code is compiled from Cyclone, a type-safe C subset with extra support for template-based runtime code generation. However, since the code generation is implemented by specific macro instructions, it does not support any code modification at runtime.

Otherwise, very little work has been done on the certification of self-modifying code. Previous program verification systems, including Hoare logic, type systems, and proof-carrying code [24], consistently maintain the assumption that program code stored in memory is immutable.

We have developed a simple Hoare-style framework for modularly verifying general von Neumann machine programs, with strong support for self-modifying code. By statically specifying and reasoning about the possible runtime code sequences, we can now successfully verify arbitrary runtime code modification and/or generation. Taking a unified view of code and data has given us some surprising benefits: we can now apply separation logic to support local reasoning on both program code and regular data structures.

Acknowledgment

We thank Xinyu Feng, Zhaozhong Ni, Hai Fang, and anonymous referees for their suggestions and comments on an earlier version of this paper. Hongxu Cai's research is supported in part by the National Natural Science Foundation of China under Grant No. 60553001 and by the National Basic Research Program of China under Grants No. 2007CB807900 and No. 2007CB807901. Zhong Shao and Alexander Vaynberg's research is based on work supported in part by gifts from Intel and Microsoft, and NSF grant CCR-0524545. Any opinions, findings, and conclusions contained in this document are those of the authors and do not reflect the views of these agencies.

References

[1] A. W. Appel. Foundational proof-carrying code. In Proc. 16th IEEE Symp. on Logic in Computer Science, pages 247–258, June 2001.
[2] A. W. Appel and A. P. Felty. A semantic model of types and machine instructions for proof-carrying code. In Proc. 27th ACM Symposium on Principles of Programming Languages, pages 243–253, Jan. 2000.
[3] D. Aucsmith. Tamper resistant software: An implementation. In Proceedings of the First International Workshop on Information Hiding, pages 317–333, London, UK, 1996. Springer-Verlag.
[4] V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: a transparent dynamic optimization system. In Proc. 2000 ACM Conf. on Prog. Lang. Design and Implementation, pages 1–12, 2000.
[5] H. Cai, Z. Shao, and A. Vaynberg. Certified self-modifying code (extended version & Coq implementation). Technical Report YALEU/DCS/TR-1379, Yale Univ., Dept. of Computer Science, Mar. 2007. http://flint.cs.yale.edu/publications/smc.html.
[6] K. Crary. Toward a foundational typed assembly language. In Proc. 30th ACM Symposium on Principles of Programming Languages, pages 198–212, Jan. 2003.
[7] S. Debray and W. Evans. Profile-guided code compression. In Proceedings of the 2002 ACM Conference on Programming Language Design and Implementation, pages 95–105, New York, NY, 2002. ACM Press.
[8] R. W. Floyd. Assigning meaning to programs. Communications of the ACM, Oct. 1967.
[9] N. Glew and G. Morrisett. Type-safe linking and modular assembly language. In Proc. 26th ACM Symposium on Principles of Programming Languages, pages 250–261, Jan. 1999.
[10] N. A. Hamid, Z. Shao, V. Trifonov, S. Monnier, and Z. Ni. A syntactic approach to foundational proof-carrying code. In Proc. 17th Annual IEEE Symp. on Logic in Computer Science, pages 89–100, July 2002.
[11] G. M. Henry. Flexible high-performance matrix multiply via a self-modifying runtime code. Technical Report TR-2001-46, Department of Computer Sciences, The University of Texas at Austin, Dec. 2001.
[12] C. A. R. Hoare. Proof of a program: FIND. Communications of the ACM, Jan. 1971.
[13] L. Hornof and T. Jim. Certifying compilation and run-time code generation. Higher Order Symbol. Comput., 12(4):337–375, 1999.
[14] S. S. Ishtiaq and P. W. O'Hearn. BI as an assertion language for mutable data structures. In Proc. 28th ACM Symposium on Principles of Programming Languages, pages 14–26, 2001.
[15] Y. Kanzaki, A. Monden, M. Nakamura, and K.-i. Matsumoto. Exploiting self-modification mechanism for program protection. In COMPSAC '03, page 170, 2003.
[16] D. Keppel, S. J. Eggers, and R. R. Henry. A case for runtime code generation. Technical Report UWCSE 91-11-04, University of Washington, Nov. 1991.
[17] L. Lamport. The temporal logic of actions. ACM Transactions on Programming Languages and Systems, 16(3):872–923, May 1994.
[18] J. Larus. SPIM: a MIPS32 simulator. v7.3, 2006.
[19] K. Lawton. BOCHS: IA-32 emulator project. v2.3, 2006.
[20] P. Lee and M. Leone. Optimizing ML with run-time code generation. In Proc. 1996 ACM Conf. on Prog. Lang. Design and Implementation, pages 137–148. ACM Press, 1996.
[21] H. Massalin. Synthesis: An Efficient Implementation of Fundamental Operating System Services. PhD thesis, Columbia University, 1992.
[22] N. G. Michael and A. W. Appel. Machine instruction syntax and semantics in higher order logic. In International Conference on Automated Deduction, pages 7–24, 2000.
[23] G. Morrisett, D. Walker, K. Crary, and N. Glew. From System F to typed assembly language. ACM Transactions on Programming Languages and Systems, 21(3):527–568, 1999.
[24] G. Necula. Proof-carrying code. In Proc. 24th ACM Symposium on Principles of Programming Languages, pages 106–119, New York, Jan. 1997. ACM Press.

[25] Z. Ni and Z. Shao. Certified assembly programming with embedded code pointers. In Proc. 33rd ACM Symposium on Principles of Programming Languages, Jan. 2006.
[26] P. Nordin and W. Banzhaf. Evolving Turing-complete programs for a register machine with self-modifying code. In Proc. of the 6th International Conf. on Genetic Algorithms, pages 318–327, 1995.
[27] B. C. Pierce. Advanced Topics in Types and Programming Languages. The MIT Press, Cambridge, MA, 2005.
[28] M. Poletto, W. C. Hsieh, D. R. Engler, and M. F. Kaashoek. 'C and tcc: a language and compiler for dynamic code generation. ACM Transactions on Programming Languages and Systems, 21(2):324–369, 1999.
[29] Ralph. Basics of SMC. http://web.archive.org/web/20010425070215/awc.rejects.net/files/text/smc.txt, 2000.
[30] J. Reynolds. Separation logic: a logic for shared mutable data structures. In Proc. 17th IEEE Symp. on Logic in Computer Science, 2002.
[31] F. M. Smith. Certified Run-Time Code Generation. PhD thesis, Cornell University, Jan. 2002.
[32] The Coq Development Team, INRIA. The Coq proof assistant reference manual. The Coq release v8.0, 2004–2006.
[33] D. Yu, N. A. Hamid, and Z. Shao. Building certified libraries for PCC: Dynamic storage allocation. Science of Computer Programming, 50(1-3):101–127, Mar. 2004.
