Graduate Theses and Dissertations | Iowa State University Capstones, Theses and Dissertations

2021

Trust, transforms, and control flow: A graph-theoretic method to verifying source and binary control flow equivalence

Ryan Christopher Goluch
Iowa State University

Follow this and additional works at: https://lib.dr.iastate.edu/etd

Recommended Citation
Goluch, Ryan Christopher, "Trust, transforms, and control flow: A graph-theoretic method to verifying source and binary control flow equivalence" (2021). Graduate Theses and Dissertations. 18498. https://lib.dr.iastate.edu/etd/18498

This Thesis is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected].

Trust, transforms, and control flow: A graph-theoretic method to verifying source

and binary control flow equivalence

by

Ryan Christopher Goluch

A thesis submitted to the graduate faculty

in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Major: Computer Engineering (Secure and Reliable Computing)

Program of Study Committee:
Suresh Kothari, Major Professor
Samik Basu
Akhilesh Tyagi

The student author, whose presentation of the scholarship herein was approved by the program of study committee, is solely responsible for the content of this thesis. The Graduate College will ensure this thesis is globally accessible and will not permit alterations after a degree is conferred.

Iowa State University

Ames, Iowa

2021

Copyright © Ryan Christopher Goluch, 2021. All rights reserved.

DEDICATION

I would like to dedicate this work to my family, friends, colleagues, and mentors. I am grateful to know you all and have you in my life. You all have helped get me to this point, become who I am today, and contribute a verse. Thank you.

TABLE OF CONTENTS

Page

LIST OF TABLES ...... v

LIST OF FIGURES ...... vi

ACKNOWLEDGMENTS ...... ix

ABSTRACT ...... x

CHAPTER 1. OVERVIEW ...... 1
1.1 Research Theme ...... 1
1.2 Thesis Overview ...... 2

CHAPTER 2. A NEW APPROACH ...... 3
2.1 Background and Related Work ...... 3
2.2 Control Flow ...... 4
2.2.1 Extracting Control Flow from Binary Code ...... 5
2.2.2 Control Flow Based Obfuscation ...... 7
2.2.3 Deobfuscation ...... 8
2.2.4 Control Flow Based Software Security ...... 9
2.3 Graph Isomorphism ...... 9
2.3.1 Graph Isomorphism Techniques ...... 10
2.4 Our Approach ...... 11
2.5 Practicality of Our Approach ...... 13

CHAPTER 3. TRANSFORM MOTIVATION ...... 15
3.1 Practicality of Graph Transforms ...... 16
3.2 Transform Cases ...... 16
3.2.1 Case 1 ...... 16
3.2.2 Case 2 ...... 19
3.2.3 Case 3 ...... 21
3.3 Transform Algorithm Overview ...... 23
3.3.1 Transform Algorithm ...... 25

CHAPTER 4. STATIC FUNCTION TRANSFORM ...... 27
4.1 XINU Example ...... 28
4.2 Static Function Transform Overview ...... 31
4.2.1 Finding Static Function Calls ...... 33
4.2.2 Create and Store Original Static CFGs ...... 34
4.2.3 Create Transformed Graph ...... 34
4.2.4 Application to XINU Example ...... 35
4.3 Implications of the Static Transform ...... 35

CHAPTER 5. SWITCH STATEMENT TRANSFORM ...... 44
5.1 Switch Statement Transform ...... 44
5.2 Working Example ...... 45
5.2.1 Finding Switch and Case Nodes ...... 46
5.2.2 Sorting Cases in Ascending Order ...... 48
5.2.3 Creation of 2-way Branches ...... 48
5.2.4 Switch Block Replacement ...... 51
5.2.5 Transformed Working Example ...... 53
5.3 XINU Example ...... 56
5.4 Implications of the Switch Transform ...... 58

CHAPTER 6. SHORT CIRCUIT TRANSFORM ...... 60
6.1 Working Example ...... 61
6.1.1 Finding CC Nodes ...... 61
6.1.2 Processing non-CC Nodes to Retain ...... 63
6.1.3 CC Nodes ...... 65
6.1.4 Transformed Working Example ...... 66
6.2 Implications of Design Decision ...... 68

CHAPTER 7. ISOMORPHISM ALGORITHM TO CHECK CONTROL FLOW EQUIVALENCE ...... 74
7.1 Introduction ...... 74
7.2 Description of the Isomorphic Algorithm ...... 74

CHAPTER 8. RESULTS ...... 81
8.1 Categorization using Isomorphism ...... 81
8.1.1 G-Iso Example ...... 82
8.1.2 L-Iso Example ...... 82
8.1.3 A-Iso Example ...... 83
8.1.4 N-Iso Example ...... 84
8.2 XINU Results ...... 85
8.3 Discussion of Results ...... 87

CHAPTER 9. CONCLUSION AND FUTURE RESEARCH DIRECTIONS ...... 92
9.1 Future Research Directions ...... 93

BIBLIOGRAPHY ...... 95

LIST OF TABLES

Page

Table 6.1 CFG Edge Combinations ...... 63

Table 8.1 XINU Isomorphism Results ...... 85

Table 8.2 XINU Overlap Results ...... 87

LIST OF FIGURES

Page

Figure 3.1 sgetch.c Source CFG ...... 18

Figure 3.2 sgetch.c Binary CFG ...... 18

Figure 3.3 sgetch.c Source CFG ...... 18

Figure 3.4 sgetch.c Binary CFG ...... 18

Figure 3.5 sgetch.c Source Transformed CFG ...... 19

Figure 3.6 sgetch.c Binary Transformed CFG ...... 19

Figure 3.7 signal.c Source CFG ...... 21

Figure 3.8 signal.c Binary CFG ...... 21

Figure 3.9 signal.c Source Marked CFG ...... 22

Figure 3.10 signal.c Binary Marked CFG ...... 22

Figure 3.11 signal.c Source Transformed CFG ...... 22

Figure 3.12 signal.c Binary Transformed CFG ...... 22

Figure 3.13 fputc.c Source CFG ...... 24

Figure 3.14 fputc.c Binary CFG ...... 24

Figure 3.15 fputc.c Source Transformed CFG ...... 24

Figure 3.16 fputc.c Binary Transformed CFG ...... 24

Figure 3.17 fputc.c Correct Source Transformed CFG ...... 24

Figure 4.1 qsort.c Source and Binary CFGs ...... 28

Figure 4.2 partition() Source CFG ...... 32

Figure 4.3 swap_elements() Source CFG ...... 32

Figure 4.4 qsort.c Call Graph ...... 32

Figure 4.5 partition() Call Graph ...... 32

Figure 4.6 qsort.c Transformed CFG ...... 37

Figure 4.7 create.c removed conditional ...... 41

Figure 4.8 create.c Source and Binary CFGs ...... 42

Figure 4.9 create.c Source and Binary PCGs ...... 43

Figure 5.1 Original Example CFGs ...... 47

Figure 5.2 Binary CFG for Switch with 10 cases ...... 51

Figure 5.3 Original CFG- Switch with 3 cases ...... 54

Figure 5.4 Transformed CFG- Switch with 3 cases ...... 54

Figure 5.5 Binary CFG- Switch with 3 cases ...... 54

Figure 5.6 Original CFG- Switch with 4 cases ...... 55

Figure 5.7 Transformed CFG- Switch with 4 cases ...... 55

Figure 5.8 Binary CFG- Switch with 4 cases ...... 55

Figure 5.9 ttyControl Source CFG ...... 57

Figure 5.10 ttyControl Transformed CFG ...... 57

Figure 5.11 ttyControl Binary CFG ...... 57

Figure 6.1 Working Example CFG ...... 62

Figure 6.2 Nested CC Nodes ...... 66

Figure 6.3 Working Example Transformed CFG ...... 68

Figure 6.4 Original and Transformed CFGs ...... 69

Figure 6.5 Working Example Binary CFG ...... 70

Figure 6.6 Initial BDD ...... 71

Figure 6.7 Relationship between transformed CFG and binary CFG ...... 71

Figure 6.8 BDD Reduction Steps ...... 72

Figure 6.9 Removal of redundant leaf nodes part 2 ...... 72

Figure 6.10 Final Reduced Ordered BDD ...... 72

Figure 6.11 Relationship between transformed CFG and final BDD ...... 73

Figure 7.1 mailboxAlloc.c Source Transformed Graph ...... 79

Figure 7.2 mailboxAlloc.c Binary Transformed Graph ...... 79

Figure 7.3 mailboxAlloc.c Updated Binary Transformed Graph ...... 79

Figure 8.1 Transformed Graphs for getnet.c ...... 82

Figure 8.2 CFGs for x_reboot.c ...... 83

Figure 8.3 Transformed Graphs for rarp_in.c ...... 84

Figure 8.4 MIPS and PPC Code for rarp_in.c Branch Consolidation ...... 85

Figure 8.5 Transformed Graphs for dgread.c ...... 86

Figure 8.6 Data Flow Optimization Example ...... 89

Figure 8.7 Colored Graphs for dgread.c ...... 91

ACKNOWLEDGMENTS

First, I would like to acknowledge and thank my parents and my sister. I would not be anywhere close to where I am today had it not been for their constant support and encouragement in what I do, and for always providing me the opportunity to be successful in my education, work, and life. They are my sounding board, my rock, always helping to give me a new perspective on things. Second, I owe a large debt of gratitude and thanks to my advisor, Dr. Suresh Kothari. Little did I know back in the spring of 2019, as a student in his course, that I would be here two years later presenting this research. Dr. Kothari's support and encouragement helped turn a daunting research question into something approachable and interesting to work on. His mentorship has helped me learn and grow in more ways than I can count as a student, researcher, and person.

It has been a true privilege to have the opportunity to work with Dr. Kothari. I would also like to acknowledge and thank Payas Awadhutkar. When I first started this research, Payas took me under his wing and helped make sure that I had a solid foundation to build on, while always being happy to answer my many questions. Finally, I would like to acknowledge everyone else I have met throughout my journey at Iowa State, since you have all helped me get to this point, for which I will be forever thankful.

ABSTRACT

The software development process often requires the use of tools that the developers did not write themselves, such as a compiler. Additionally, when security researchers perform tasks such as binary analysis and reverse engineering, they may use disassembly and decompiler tools that they did not develop themselves. Yet in all of these cases, there is inherent trust placed in the tools being used for these tasks, often without thought to whether or not that trust is valid.

In this work, we provide an overview of what has already been done toward verifying and establishing this trust, and we present our contribution to this area. Our approach is an algorithm that allows us to validate or invalidate that trust by comparing the control flow graph (CFG) for a piece of source code to the corresponding CFG for the disassembled binary. This is done by putting the source CFG through a series of program-functionality-preserving transforms that implement different control-structure compiler optimizations. We are then able to use a graph isomorphism algorithm to determine whether or not the two CFGs are isomorphic and thus whether the trust is valid. We evaluate the effectiveness of this algorithm against the XINU codebase and report our results.

CHAPTER 1. OVERVIEW

1.1 Research Theme

This research question was inspired by Ken Thompson's Turing Award lecture "Reflections on Trusting Trust" [52]. Thompson presented the philosophical, yet practical, question "Can you trust software that you did not write yourself?". The short answer to his question is no: you cannot trust software that you did not write yourself. This includes any tools that you may have used in developing your software. Thought of recursively, you would need to have written, on your own, all of the code used up until you reach your base case (i.e., some tool). Even at this point, there is still no way to be certain that your code is safe and secure. This lack of complete trust is often not addressed and takes on the role of an accepted risk in the software development process. Accepting this risk does not change the fact that developers continually write what is thought to be secure code, yet we still end up with zero-days, memory leaks, and countless other software bugs and security vulnerabilities showing up in the wild, whether from the developers or the tools they are using.

In the years after this lecture, David Wheeler presented a solution to detect the attack Thompson had described [55]. This solution relies upon having a second, trusted compiler with which one can generate a second binary for bitwise comparison of the outputs. The downside to this approach is that it focuses only on the compiler and relies only upon a bitwise comparison. It does not allow someone to actually see the comparison between the source code and what is generated for the binary. There is no way to verify that the control structure executed at run time is maintained from the initial source control flow graph (CFG) down to the binary CFG. In many ways, this research is a new take on the ideas presented by Thompson and Wheeler, as well as an attempt to efficiently and effectively answer

Thompson's question. The research presented here not only focuses on the security and safety of software being developed, but on a broader scale focuses on the ability to trust that the tools being developed and used for security research are accurate in the results they produce.

The backbone of this work is the creation of a common ground for two different CFGs, the source and binary CFGs for a given piece of code, in order to determine whether they have the same control structure. All of this is done with the underlying goal of effectively answering the question "Can we trust this software?". To do this, we take a graph-theoretic approach in developing an algorithm that determines whether the control structure is the same between a source and a binary CFG.

To properly model the control structure of both functions, CFGs are used, which makes this, at a high level, a graph isomorphism problem. The idea is that if the control structures are the same between the source code, which is what the developer intended, and the binary, then we are able to trust the software.

1.2 Thesis Overview

Throughout the rest of this thesis, we present the following ideas and subsequent results. First, we review in Chapter 2 the related work that has already been done in this area and the ideas it presents. Next, we present the motivation behind our approach of CFG transforms, as well as explanations of the transforms themselves, throughout Chapters 3, 4, 5, and 6, respectively. Once we have established the motivation for taking a graph transform approach and explained our transforms, Chapter 7 describes our finalized algorithm, and Chapter 8 discusses the results of our algorithm tested against the XINU codebase. We conclude this thesis in Chapter 9 with a discussion of future research directions that build off of this work.

CHAPTER 2. A NEW APPROACH

2.1 Background and Related Work

A 1974 U.S. Air Force report [33] discusses trap doors that can sabotage the compilation itself. To understand compiler trap doors, first consider an application whose source is intact, but whose binary has a trap door. Let B_T be the malicious application binary that contains the trap door code T_A. While T_A is invisible, it is vulnerable to recompilation: ordinarily, a recompilation of the intact application source would remove the trap door T_A. Now, consider a trap door T_C in the compiler binary that bands together with T_A. Since the compiler binary itself is infected, a recompilation of the compiler source does not remove T_C. With T_C intact, T_A is reinstated when the application is recompiled. There is no escape from T_C or T_A; the compiler and the application binaries have trap doors despite recompilations.

The trap doors exemplified in the report [33] were patches to the binary object files of the Multics operating system. The report suggested a countermeasure to such object code trap doors by having customers recompile the system from source, although the report also notes that this could play directly into the hands of a penetrator who has made changes in the source code. In fact, the Air Force Multics contract specifically required that Honeywell deliver source code to the Pentagon to permit such recompilations. However, the report pointed out the possibility that a trap door in the PL/I compiler could install trap doors into the Multics operating system when modules were compiled, and could maintain its own existence by recognizing when the PL/I compiler was compiling itself. This recognition was the basis, several years later, for the TCSEC Class A1 requirement for generation of new versions from source using a compiler maintained under strict configuration control [32]. A story on CNET news reports that the Chinese government has similar concerns about planted trap doors in Microsoft software [23].

The Air Force report [33] was an inspiration to Ken Thompson, who implemented a self-inserting compiler trap door into an early version of Unix. Thompson described his trap door in his 1984 Turing Award lecture "Reflections on Trusting Trust" [52]. Compiler trap doors are significantly more complex than such object code patches; however, they are quite practical to implement at a nominal cost.

Relatively little research effort has been spent on guaranteeing that a compiler actually implements the control flow in the binary as it is intended by the source. The absence of control flow guarantees has a pervasive adverse impact on almost all software safety and security techniques, such as control flow integrity [6, 7, 17, 19, 58, 60], model checking [15], obfuscation [27, 40], de-obfuscation [54], and others [16, 18, 25, 28, 48]; they are all founded on the presumption that the control flow in the binary is as intended by the source.

The C compiler is written in C. The compiler trap door is an example of the "chicken and egg" problems that arise when compilers are written in their own language. No amount of source-level verification or scrutiny can protect us from using untrusted code. Instead of the compiler, trap doors can be installed in any program-handling program such as an assembler, a loader, or even hardware microcode. As the level of a program gets lower, these trap doors become harder and harder to detect [52]. To obfuscate a binary, Linn and Debray [42] exploit precisely the fact that standard disassemblers assume certain properties of compiler-generated code, and those properties can be violated by obfuscation without changing the program's functionality. Standard disassemblers are confused and fail to correctly translate binary code into its corresponding assembly representation.

2.2 Control Flow

Frances Allen from IBM introduced the notion of the control flow graph (CFG) [8]. She defined a basic block as a linear sequence of program instructions having one entry point (the first instruction executed) and one exit point (the last instruction executed). She defined a control flow graph as a directed graph in which the nodes represent basic blocks, and the edges represent control flow paths.

Mathematically, a CFG is a visual depiction of a binary relation that stipulates the successors s_i1, s_i2, ..., s_ij for each program statement s_i, where a successor can follow s_i during an execution of the program. The binary relation is the collection of pairs (s_i, s_i1), (s_i, s_i2), ..., (s_i, s_ij). In a CFG, the nodes represent program statements, and the edges represent the pairs in the binary relation. Since it is defined for a function in the code, the CFG has a unique root corresponding to the first program statement of the function. With multiple RETURN statements, a function can have multiple last program statements. We assume a dummy program statement that is the successor of all RETURN statements. As a result, the CFG has a unique leaf node. We shall refer to the root as the entry node and the leaf as the exit node of the CFG.
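The single-exit convention above can be sketched in code. The following is a minimal illustration, not the thesis's implementation: the CFG is a dictionary encoding the successor relation, statement names are hypothetical, and a dummy EXIT node is appended as the successor of every RETURN so that the graph has a unique leaf.

```python
# Hypothetical sketch: a CFG as the successor relation described above.
# The EXIT dummy node and statement labels are illustrative only.

def build_cfg(successors):
    """Given a dict mapping each statement to its successor list,
    add a dummy EXIT node that follows every RETURN statement,
    so the CFG has a unique leaf node."""
    cfg = {stmt: list(succs) for stmt, succs in successors.items()}
    cfg["EXIT"] = []
    for stmt, succs in cfg.items():
        if stmt != "EXIT" and not succs:          # a RETURN has no successor
            cfg[stmt] = ["EXIT"]
    return cfg

# A function with two RETURN statements (two leaves before the transform):
cfg = build_cfg({
    "s1: if (x > 0)": ["s2", "s3"],
    "s2: return a":   [],
    "s3: return b":   [],
})
assert cfg["s2: return a"] == ["EXIT"]
assert cfg["s3: return b"] == ["EXIT"]
```

With this convention, both branches of the if-statement reach the same exit node, which is what gives the CFG its unique entry/exit pair.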

2.2.1 Extracting Control Flow from Binary Code

A decompiler, or reverse compiler, is a program that attempts to perform the inverse process of the compiler: given an executable program compiled from any high-level language, the aim is to produce a high-level language program that performs the same function as the executable program. The reverse-engineering process can be divided into two parts: disassembly and decompilation [40]. The task of the disassembly phase is the extraction of the symbolic representation of the instructions (assembly code) from binary code. Decompilation is the process of reconstructing higher-level semantic structures from the program's assembly-level representation.

The frontend of the decompiler takes as input a binary program for a specific machine, loads it into memory, parses it, and produces an intermediate representation of the program, as well as the program's control flow graph. The parser disassembles code starting at the entry point given by the loader and follows the instructions sequentially until a change in the flow of control is met. The flow of control is changed by a conditional, unconditional, or indexed branch, or a procedure call; in these cases the target branch label(s) start new instruction paths to be disassembled. A path is finished by a return instruction or program termination. All instruction paths are followed in a recursive manner [22].

Control flow gets complicated due to machine instructions that use indexed or indirect addressing modes. To handle these, heuristic methods are implemented. For example, while disassembling code, the parser must check for sequences of instructions that represent a multiway branch (e.g., a switch statement), normally implemented as an indexed jump through a jump table [21].

Inferring control flow from a binary is an involved process. Due to the nature of machine code instructions, the compiler might need to introduce intermediate branches in an executable program, because no machine instruction is capable of branching more than a certain maximum distance in bytes (architecture dependent). An optimization pass over the control flow graph removes this redundancy by replacing the target branch location of every conditional or unconditional jump that branches to an unconditional jump (and any chains of branches in this format) with the final target basic block. After this process, some basic blocks are no longer referenced, as they were used only for intermediate branches. These nodes must be eliminated from the graph as well.
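The clean-up pass just described can be sketched abstractly. In this illustrative (not authoritative) version, the CFG is a dictionary of successor lists, `trampolines` is an assumed precomputed set of blocks that contain only an unconditional jump, and block names are hypothetical; the pass retargets edges through jump chains and then drops unreferenced blocks.

```python
# Illustrative sketch of the pass described above: retarget every edge that
# lands on a trampoline block (one holding only an unconditional jump),
# following chains, then drop blocks no longer reachable from the entry.

def thread_jumps(cfg, entry, trampolines):
    def final_target(block, seen=()):
        # follow a chain of unconditional jumps to its ultimate target
        while block in trampolines and block not in seen:
            seen += (block,)
            block = cfg[block][0]
        return block

    threaded = {b: [final_target(t) for t in succs]
                for b, succs in cfg.items()}
    # keep only blocks still reachable from the entry point
    live, stack = set(), [entry]
    while stack:
        b = stack.pop()
        if b not in live:
            live.add(b)
            stack.extend(threaded[b])
    return {b: s for b, s in threaded.items() if b in live}

# A conditionally branches to trampoline T1, which jumps to T2, then to C:
cfg = {"A": ["T1", "B"], "T1": ["T2"], "T2": ["C"], "B": [], "C": []}
out = thread_jumps(cfg, "A", {"T1", "T2"})
assert out == {"A": ["C", "B"], "B": [], "C": []}
```

Note that both steps of the pass are needed: threading alone would leave the intermediate blocks T1 and T2 in the graph as dead nodes.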

Failure to statically resolve the target of an indirect branch instruction thus leads to either an incomplete or a grossly over-approximated control flow graph. Often, data flow analysis can aid in resolving such indirect branches; however, data flow analysis already requires a precise control flow graph to work on. This seemingly paradoxical situation has been referred to as an inherent "chicken and egg" problem in the literature [51]. An extension of the classical worklist algorithm in program analysis is used to empower control flow reconstruction by using data flow [35, 36, 37].

The literature also discusses methods for constructing the CFGs that are extracted from a binary. Two of those methods are the following:

1. Method Based on XML [59]: an abstract description of binary assembly code is stored in an XML format. Drawing the binary disassembly information from IDA Pro, the accuracy of this method for the ARM architecture is empirically evaluated.

2. Provably Correct Control Flow Graphs [10]: the literature contains several approaches to extract control-flow graphs automatically from program code; however, typically no formal argument is given why the extraction is property preserving. This paper defines the CFG as a Kripke structure, presents a flow graph extraction algorithm for Java bytecode (JBC), and proves that the extraction algorithm is sound with respect to program behavior. The extraction algorithm considers all the typical intricacies of Java, e.g., virtual method call resolution, the differences between dynamic and static object types, and exception handling.

2.2.2 Control Flow Based Obfuscation

Software Obfuscation generally refers to the process of transforming a program P through an obfuscator O into an equivalent program O(P) such that an adversary can derive no more information from a white-box observation of O(P) than from a black-box observation of P [27].

Three general categories of obfuscation are lexical obfuscation, data obfuscation, and control flow obfuscation. Obfuscation can enhance software security by making it harder for an attacker to steal intellectual property or to make unauthorized modifications to software. On the other hand, obfuscation could also be used by attackers to hide malicious code such as viruses or Trojan horses from virus scanners.

Typically, control flow obfuscation is a transformation of the CFG. For example, the control flow graph flattening obfuscation [16] breaks the CFG of a program by removing all easily identifiable if-statements and loops. These structures are then flattened: they all become different cases of a large switch statement that routes the control flow through the statements of the right case.

Lexical obfuscation introduces spurious code blocks into the source program to obscure the real control flow. Data flow obfuscation extracts and hides the real control flow, typically by dividing the source program into several pieces and adding dispersed dummy code segments; it uses a runtime dispatcher to assemble the real control flow.

For all obfuscation methods, establishing the equivalence of O(P) with P inherently requires that the CFG extracted from the binary code is correct. Usually, the main concern when defining a program obfuscation is ensuring that it does not degrade the performance of the compiled code.

Surprisingly, very few works in the literature are concerned with the semantics preservation of program obfuscations. However, some obfuscations (e.g., CFG flattening) are non-trivial semantic transformations. Even if they are well understood, their implementations on real languages such as C are still error-prone [16]. The paper by Tsai et al. [53] presents a graph-theoretic approach to express several existing control-flow obfuscation techniques as a sequence of basic transformations on the CFGs.

2.2.3 Deobfuscation

Kruegel et al. [40] present static analysis techniques to correctly disassemble Intel x86 binaries that are obfuscated to resist static disassembly. Their approach combines control-flow-based and statistical techniques to deal with hard-to-disassemble binaries.

Additionally, Udupa et al. [54] indicate that much of the effect of code obfuscation, designed to increase the difficulty of static analyses, can be defeated using simple combinations of straightforward static and dynamic analyses. Their paper presents several techniques to defeat obfuscation; many of these techniques, such as cloning and path feasibility analysis, are based on control flow graphs. Obfuscation and de-obfuscation form an arms race. It is possible to devise obfuscation techniques that make the existing disassembly algorithms less effective. However, this arms race is usually in favor of the de-obfuscator [40]. Obfuscators have many constraints: they must devise techniques that transform the program without seriously impacting the run-time performance or increasing the binary's size or memory footprint, while there are no such constraints for the de-obfuscator. The control flow graph is crucial to both obfuscators and de-obfuscators.

2.2.4 Control Flow Based Software Security

Recent literature has produced the following methods, which are used to ensure software security based on the control flow of a piece of software.

1. Binary Control Flow Trimming: this is a method to reduce the size and complexity of binary software by removing functionalities unwanted by code-consumers (but possibly intended by code-producers) without any reliance on source code or debug metadata [28].

2. Control Flow Integrity: this is a method to restrict the set of possible control-flow transfers to those that are strictly required for correct program execution. The analysis phase of the method approximates the set of legitimate control-flow transfers; the enforcement phase applies the CFG at runtime to ensure that all executed branches correspond to edges in the CFG [7]. The prolific research on CFI in recent years (25 new CFI algorithms appeared in the top four applied security conferences alone) is mainly aimed at improving performance, enforcing richer policies, obtaining higher assurance of policy compliance, and protecting against more subtle and sophisticated attacks [58]. An empirical study of the precision, security, and performance of CFI techniques is presented in [17].

3. Model Checking with CFGs: many software safety and security properties are described as properties of CFGs. The paper [15] introduces a formalism based on a linear-time temporal logic for specifying global security properties pertaining to the control flow of the program and illustrates its expressive power with several existing properties.

2.3 Graph Isomorphism

Deciding whether two graphs are structurally identical, or isomorphic, is a classical algorithmic problem. Graph Isomorphism (GI) as a computational problem first appears in the chemistry literature of the 1950s as the problem of matching a molecular graph. GI has emerged as one of the few natural problems in the complexity class NP that could neither be classified as being hard (NP-complete) nor shown to be solvable in polynomial time [29]. GI has many applications in biology, chemistry, physics, mathematics, and computer science. The problem is of central interest in areas such as logic, group theory, and quantum computing.

Working with graphs is arguably more challenging than working with feature vectors, as even the basic mathematical operations of addition (graph merge) and subtraction (graph difference) cannot be defined in a standard way [49].

The ideas of the graph isomorphism problem can be applied to binary analysis as well. Recognizing reused functions in binary code is critical to digital forensics, especially as much modern malware typically contains many functions borrowed from open-source software packages. Based on a graph that integrates control flow, register flow, and function calls, the paper [9] presents a graph matching technique for identifying reused functions in binary code. The problem of matching between binaries is important for software copyright enforcement as well as for identifying disclosed vulnerabilities in software. The paper [34] presents a binary code matching technique using a statistical model comprising instruction mnemonics and control flow subgraphs. The paper [57] uses graph matching techniques to measure a similarity metric between two function-call graphs to identify obfuscated malware variants. Additionally, isomorphic graphs have applications to cyber security: the paper [48] presents a verification approach that detects different types of hardware trojans in register-transfer-level (RTL) models by exploiting an efficient control flow subgraph matching algorithm.

2.3.1 Graph Isomorphism Techniques

We now present a brief summary of different techniques to decide if two graphs are isomorphic.

1. Key History: Babai [11] presented a polynomial-time probabilistic algorithm for vertex-colored graphs with bounded color class sizes. Luks [43] showed that isomorphism can be decided in polynomial time on graph classes of bounded degree. McKay developed the GI tool Nauty, which marks a breakthrough in practical isomorphism testing [45]. In a recent breakthrough, Babai proved that GI is solvable in quasi-polynomial time [12].

2. Combinatorial Algorithms: isomorphism is weakly established by the consistent failure to find a "certificate" of non-isomorphism. For example, we can count the vertices, edges, and triangles in both graphs; if any of these counts differ, the graphs are non-isomorphic. Or we can look at the degrees of the vertices: if there is some d such that the two graphs have a different number of vertices of degree d, the graphs are non-isomorphic. The Weisfeiler-Leman algorithm provides a systematic approach to generating such certificates [50].

3. Color Refinement Algorithms: color refinement is an important subroutine of almost all practical graph isomorphism tools, and it is also a building block for many theoretical results. The color refinement algorithm iteratively computes a coloring of the vertices of a graph. The actual colors used are irrelevant; what matters is the partition of the vertices into color classes. The final coloring has the property that any two vertices of the same color have the same number of neighbors in each color class.

4. Group Theoretic Algorithms: As GI and the automorphism group problem are polynomially equivalent, it suffices to solve the latter [44]. These algorithms are used to establish polynomial time bounds for GI problems on graphs of bounded degree [29].

2.4 Our Approach

We present a graph-theoretic technique to verify whether source and binary are control flow equivalent. As an idea, it is simple: compare the control flow graphs derived from the source and the binary. Joxean Koret, a Basque hacker, describes in a blog post [39], “Histories of comparing binaries with source codes”, his attempt at comparing the control flow graphs. Koret observes:

“CFGs generated from human written code are totally different from the CFGs we get after the compiler generates its output. It gets even worse when we consider that there can be various different optimization levels applied.” Moreover, extracting precise CFGs from binary has its own challenges [10, 22, 37, 56].

We leverage existing work [4, 14, 21, 26, 35, 36, 46, 51, 56, 59] on constructing control flow graphs from binary. The main novelty of our research is rooted in our method to compare the source and the binary CFGs. As a quick overview, the steps are:

1. We compare the CFGs using feature vectors VS and VB for the source and the binary, respectively. These feature vectors are based on distinguished nodes and edges such as loop headers, loop back edges, branch points, join points, etc.

2. We have designed program-functionality-preserving transformations to transform the source CFG so that VS comes closer to VB. We identified specific transformations by categorizing the differences between VS and VB and designed a transformation for each category. We leveraged the knowledge of compiler optimizations (short-circuiting [21], switch conversion [22], etc.) to design these transformations.

3. We designed a vertex-color-preserving isomorphism algorithm to check whether the transformed source CFG is isomorphic to the corresponding binary CFG. The isomorphism maps a vertex of the transformed source CFG to a linear block of the binary CFG. Our isomorphism algorithm is based on the Weisfeiler-Leman algorithm [50] and color refinement [29, 30, 31, 41]. The vertex coloring uses the distinguished features incorporated in forming the feature vectors.

For the proposed Control Flow Equivalence (CFE) method, we have come up with several transformations to bring VS closer to VB (Step 2 in the above description). To identify the required transformations, we have used real-world software as well as hand-made programs (e.g., switch statements to identify the control structure variability of the binary with respect to the number of switch cases). We have identified a sufficient set of transformations; they can be applied in a sequence to make VS equal to VB for the real-world software and the hand-made test programs in our empirical study. We expect that new transformations will be discovered as the CFE method is used on other software. The vertex coloring isomorphism is quite useful for pinpointing the differences between the source and the binary CFG. New transforms can be designed by examining these differences.

We have created a tool suite that implements the CFE method. The tool suite uses the disassembler and the Atlas platform [2, 24]. The tool suite provides capabilities to: (a) construct and visualize CFGs, (b) compute the feature vectors VS and VB, and (c) construct the isomorphism mapping and show the result of complete or partial matching.

2.5 Practicality of Our Approach

Let us discuss practical use cases of the CFE method. Foremost, it can be used to check for compiler trap doors. If the source and the binary are found to be not equivalent by CFE, the differences revealed by CFE can be checked for possible compiler trap doors. Next, suppose you have only the application binary and you want to use techniques such as CFI or model checking. A critical prerequisite for these techniques is an accurate method M for constructing CFGs from binary. CFE can be used to check the accuracy of M. Typically, the performance and accuracy of CFI or model checking are evaluated on benchmarks such as SPECint [5] or Apache [1]. With CFE, the accuracy of M can be checked using the source and binary for the benchmarks. The general need for ascertaining control flow equivalence is discussed in [9, 13, 47].

Let us conclude by discussing connections between the CFE method, structured programs, and obfuscation. Let P be a structured program source and B(P ) its binary. Is B(P ) a structured program? The definition that a structured program is a program without GO TO is not helpful to answer the question. Structuredness should be about a property of CFGs that simplifies program comprehension and analysis, and not about the presence or absence of GO TO [38].

CFGs of structured programs are reducible graphs. In general, reducible graphs are easier to analyze. Many static analyses either require the code to have a reducible CFG or may be too costly when applied to irreducible CFGs [20]. An irreducible CFG can be transformed into a reducible CFG that preserves program functionality, but the transform can cause an exponential blow-up in the number of nodes [20]. The graph transformations we have created for CFE preserve program functionality and graph reducibility. Thus, if the CFGs of P and B(P) are shown to be equivalent using CFE, then it makes sense to say that B(P) is a structured program. Let us now connect CFE to obfuscation. In a graph-theoretic approach to quantitative analysis of control-flow obfuscating transformations [53], obfuscation is modeled as a sequence of graph transformations on CFGs, and the authors propose a measure of the difficulty of reversing the obfuscation.

Compilation itself can be considered a control-flow obfuscation. Using the metric proposed in [53], if P and B(P) are CFE equivalent, then reducibility is preserved, and thus the compiler obfuscation is not difficult. Note that the compiler, or any obfuscation, would truly be difficult to reverse if the original CFG is reducible but the obfuscated CFG is not reducible.

CHAPTER 3. TRANSFORM MOTIVATION

When first approaching this verification problem, we looked at trying to use the metrics associated with a graph for verification. This included things such as the path count for each function and different node counts (e.g., the total number of nodes, exit points, and loops).

These metrics were used to create the initial feature vectors as described in Chapter 2, which could be used to describe each graph and compare them to one another. However, we quickly learned that this was not sufficient and failed to provide meaningful insights. Path counts are a useful checkpoint to see if two graphs may have the same structure. However, path counts and the initial vectors as a whole fail to precisely identify whether the structure written in source code is the same structure that is executed in the binary.

A difference in path counts between the source and binary CFG for the same function allowed us to confirm that the structure had changed between the two CFGs. The differing path counts showed that the source and binary were not CFE. More work needed to be done at this point in order to fully check, at a verification level, that the control structure of a source graph is maintained in the binary. At a trust level, more work needed to be done to ensure that the tools being used were accurate, whether that is the compiler, the disassembler, or both. To some degree, you are unable to trust the code that you write if you are unable to trust the tools used to create that code. These problems go hand in hand, and it is because of this that we needed to do more. We needed to look closer at the graphs themselves and start to bridge the gaps between the source and binary CFGs by working to get them to a common ground; thus came the rise of the transforms.

3.1 Practicality of Graph Transforms

It is well known that the graph isomorphism problem is intractable in the general sense. This was initially thought to be the case when investigating the equivalence of CFGs generated for source code and its respective disassembled binary code. However, we found this not to hold, for multiple reasons. The first is that CFGs have inherent properties that make this problem more tractable. These properties include the limited number of nodes and edges as well as the fact that all edges are directed. The second is that we are able to transform these graphs to create a “common ground” between the characteristics of each graph. For example, since the binary graphs only evaluate a single condition at each branch node, we can transform the source such that this constraint of one condition per branch node is enforced. With these transformations, and the subsequent reduction of the graphs to only the nodes that impact the control flow of a piece of code, the isomorphic graph problem becomes much more tractable.

One transformation that all graphs undergo in order to start building this common ground is the consolidation of all exit points into a single leaf node. This is represented in the source graphs as the “src exit” node and in the binary graphs as the “binary exit” node. Below we present our case to fully illustrate the necessity of these transforms before attempting to determine whether or not the graphs are isomorphic, along with an overview of the algorithm which graphs go through in order to be transformed. Without these transforms, the graph isomorphism problem for source and binary CFGs is a nonstarter.

3.2 Transform Cases

3.2.1 Case 1

The first case that we present is that of a single branch node found in the source and binary CFGs for the function sgetch.c in XINU. In this example, the branch node is an if statement, which can be seen on line 6 of Listing 3.1. The original CFGs for the source and binary can be seen in Figures 3.1 and 3.2, respectively. As evidenced by the two graphs and the knowledge of how assembly languages work, we know that the linear blocks of code in the source are condensed to smaller blocks in the binary. This correspondence can be seen in Figures 3.3 and 3.4, where each colored box in the source corresponds to the colored box in the binary. In order to verify that the overall control structure matches between the source and the binary graphs, we only care about retaining the nodes which have an impact on the control flow of a program. Thus we only care about retaining the branch and exit nodes from both graphs. Now if we transform each of these graphs such that they are reduced to only those two nodes, we get what is shown in Figures 3.5 and 3.6. From these graphs, we can clearly see that, modulo the linear blocks of code that do not impact the overall control flow, the graphs are isomorphic. This case thus illustrates the need to transform the graphs to only the nodes that impact the control flow of the function.

1  static int sgetch(int _ignored, int _str_p){
2      const char **str_p = (const char **)_str_p;
3      const char *str = *str_p;
4      int c;
5
6      if (*str == '\0'){
7          c = EOF;
8      }
9      else {
10         c = (unsigned char)(*str);
11         *str_p = ++str;
12     }
13     return c;
14 }

Listing 3.1: sgetch.c Source Code

Figure 3.2: sgetch.c Binary CFG

Figure 3.1: sgetch.c Source CFG

Figure 3.3: sgetch.c Source CFG
Figure 3.4: sgetch.c Binary CFG

Figure 3.5: sgetch.c Source Transformed CFG
Figure 3.6: sgetch.c Binary Transformed CFG

3.2.2 Case 2

The second case that we present is that of a single branch node from the source CFG in signal.c mapping to two branch nodes in the binary CFG. This is because the single branch node in the source is what we will refer to as a compound conditional. That is, if you have four conditions that need to be evaluated at a branch node, you would have something like the following in your code: c1 && c2 || c3 && c4. In source code, we are allowed to have multiple conditions evaluated in a single branch node. However, when our source code is compiled down to a binary, the assembly language is only able to evaluate one condition at a time. Therefore the branch node consisting of multiple conditions is broken down into branch nodes where only one simple condition is evaluated at any given time. Additionally, this allows the compiler to limit how much code is run by doing a minimal evaluation of the conditions. The presence of these compound conditions creates a challenge when attempting to determine whether or not source and binary graphs are isomorphic.

Therefore, in order to address this problem, we have created a transform that breaks up a compound condition into its corresponding simple conditions. In this transform, we assume left-to-right evaluation ordering; that is, in the example above, c1 is evaluated before c2 and so on. For the purposes of this research, this transform is referred to as the “short circuit” transform.

The source code for the signal.c example can be seen below in Listing 3.2. We can see from this source code that on line 31 we have a preprocessor symbol that is replaced with the text found in Listing 3.3. This shows the compound condition that is ultimately created in the source on line 6, with the original source and binary CFGs shown in Figure 3.7 and Figure 3.8. From an initial comparison, these graphs do not appear to be isomorphic; however, should you take the source and transform it such that all conditions are simple conditions, they are in fact isomorphic. In Figure 3.9 and Figure 3.10, we can see, via colored boxes, which conditional branch nodes from the source correspond to a given branch node in the binary. Now if we apply the short circuit transform to the source, followed by the reduction of linear code blocks that was applied in Case 1, we get the graphs seen in Figures 3.11 and 3.12. Once this transformation has been applied, it is clear that the graphs are isomorphic and that the control flow between the source and binary is maintained.

1  syscall signal(semaphore sem){
2      register struct sement *semptr;
3      irqmask im;
4
5      im = disable();
6      if (isbadsem(sem)){
7          restore(im);
8          return SYSERR;
9      }
10     semptr = &semtab[sem];
11     if ((semptr->count++) < 0){
12         ready(dequeue(semptr->queue), RESCHED_YES);
13     }
14     restore(im);
15     return OK;
16 }

Listing 3.2: signal.c Source Code

1  /* isbadsem - check validity of requested semaphore id and state */
2  #define isbadsem(s)  ((s >= NSEM) || (SFREE == semtab[s].state))

Listing 3.3: isbadsem() Replacement Code

Figure 3.7: signal.c Source CFG
Figure 3.8: signal.c Binary CFG

3.2.3 Case 3

The final case that will be discussed is the need for architecture-specific transforms to handle any instructions that alter the control flow, depending on the instruction set in use. For this case, we will use the example of fputc.c; the architecture-specific instruction utilized in the binary graph is sltiu, which stands for “Set on Less Than Immediate Unsigned”. This instruction performs a less-than comparison in a single instruction and records the boolean result to be used later. As we will see, this is beneficial in optimizing how branch nodes are evaluated for the MIPS architecture.

Figure 3.9: signal.c Source Marked CFG
Figure 3.10: signal.c Binary Marked CFG

Figure 3.11: signal.c Source Transformed CFG
Figure 3.12: signal.c Binary Transformed CFG

To start, we have the source code for fputc.c in Listing 3.4. The corresponding source and binary CFGs can be seen in Figures 3.13 and 3.14. Initially, it would appear that the compound condition found on line 29 of the source code would cause us to perform the short circuit transform from Section 3.2.2 and produce the graph seen in Figure 3.15. However, when this is compared to the reduced binary CFG seen in Figure 3.16, we see that the graphs are clearly not isomorphic. This is because, in the binary graph, the sltiu instruction is used as a way to reduce the number of comparisons that need to be performed. Therefore, the final transformed source CFG should actually be what is shown in Figure 3.17, and when comparing this to what is seen in Figure 3.16, we can see that the graphs are indeed isomorphic.

1  int fputc(int c, int dev){
2      int ret;
3
4      ret = putc(dev, c);
5      if (ret == SYSERR || ret == EOF){
6          return EOF;
7      }
8      else {
9          return (int)(unsigned char)ret;
10     }
11 }

Listing 3.4: fputc.c Source Code

3.3 Transform Algorithm Overview

Until this point, we have seen what general transforms we can do and how they are effective at modifying one graph to be isomorphic with another. The real power in our algorithm comes into play when combining all of the transforms together, since this is what happens when you compile source code down into a binary. These transforms and their algorithmic application all work to reduce the source and binary CFGs down to a common set of nodes, with the goal of making the isomorphism checking process as easy as possible. The source and binary CFGs start on opposite ends of the spectrum, and these transforms work to bring them both to the center.

Figure 3.13: fputc.c Source CFG
Figure 3.14: fputc.c Binary CFG
Figure 3.15: fputc.c Source Transformed CFG
Figure 3.16: fputc.c Binary Transformed CFG
Figure 3.17: fputc.c Correct Source Transformed CFG

3.3.1 Transform Algorithm

The creation of the final transformed source graphs may seem counterintuitive at first; however, the ordering is necessary to ensure accurate and efficient transformations. The first steps entail expanding the original graph to replace any static function calls with the corresponding static function CFG, which is discussed in detail in Chapter 4. Once we have done this, we know that we have all nodes that may end up in the final source graph after any other transformations. Next, we can start modifying any source-specific structures such as switch statements and branch nodes that would cause short circuiting.

When it comes to applying the switch and short circuit transforms covered in Chapters 5 and 6, respectively, the order in which they are applied does not matter. This is because one is not dependent upon the other. There may be short circuit branch nodes which appear in the code for a case statement; however, that does not impact the changes that need to be made to the n-way branch of the switch statement. The same applies to a function which might have a switch statement after a compound condition.

The final step for each graph we process is to remove any node which does not impact the control structure of the graph and to consolidate the exit points. This allows us to reduce the graphs, which otherwise may be quite large, down to only the nodes that truly matter for determining whether the source and binary CFGs are equivalent. Nodes that would be retained include branch nodes, exit points of static functions, loop headers, and loopback tails. Loopback tails are the tails of edges leading back to a loop header; they are retained when they are needed to demonstrate the presence of a loop body, in order to prevent the formation of a “self-loop” that was not originally present in the graph. In the remainder of this chapter, we refer to this reduction transform as Tr. Consolidating the exit points allows a source graph to have one exit node instead of n exit nodes, one for each return statement, which helps to bring the number of nodes in the source and binary graphs to an equilibrium. The same is true for the binary graphs, which have the potential to have more than one exit node, although this is less likely. For the remainder of this chapter and the discussion of this algorithm, the transform which consolidates exit points is referred to as Te. Once the source graph has gone through all of these transforms, we can produce the final transformed source graph. A high-level overview of this process is shown in Algorithm 1.

Algorithm 1: Source CFG Transform Algorithm(CFG)
Input: CFG to transform
Output: The transformed CFG
1 S1 = Static-Transform(CFG)
2 S2 = Switch-Transform(S1)
3 S3 = Short-Circuit-Transform(S2) and Te(S2)
4 S4 = Tr(S3)
5 return S4

By applying the transforms that will be discussed in Chapters 4, 5, and 6, we are working to move the source CFG closer to the center of the spectrum, shrinking the divide between the source and binary CFGs. After both graphs have undergone their necessary transforms, the isomorphism checking algorithm can be applied.

CHAPTER 4. STATIC FUNCTION TRANSFORM

In writing C source code, developers have the option to make use of static functions in their source files. There are some benefits to using static functions within your code. For instance, it allows you to modularize your code and simplify the primary function that is being written by creating what are commonly referred to as “helper functions”. If you have segments of code that are reused in multiple places in the same function, it makes sense to simply put the reused segments in a separate function that you then invoke. An additional benefit to using the static keyword in your function declaration is that it ensures your function is only visible to the code located in the same source file or translation unit. That is, if you have a file called foo.c which has a function static void fooBar() and another file called bar.c, the scope of fooBar() is limited to foo.c. The static function is not visible to bar.c. Since the scope of a static function is limited to just the translation unit where it is written, developers can reuse function names in multiple places should they need or want to. This is analogous to having a “local” function or using the private keyword in object-oriented languages.

The use of static functions can be very useful when writing source code, as shown above; however, their presence complicates things when we attempt to analyze the disassembled binary versions. Additionally, these complications that arise in the binary analysis are not present when analyzing the source code. The primary complication stems from the scope of the static functions. Since static functions can only be invoked by code in the same source file, there is no need for the compiler to create a function call in the binary. This leads to the code from the static function being inserted directly at the call site in the binary. Now when we disassemble the binary and create the subsequent CFG, we potentially have a different control structure than that of the source code. Ultimately this leads to challenges in analyzing and verifying that the control structure generated by the compiler matches that which is developed in the source code.

4.1 XINU Example

We will now provide the analysis for an example of the use of static functions in XINU and their impact, as previously discussed. The function that we will be analyzing is qsort.c, and the respective CFGs can be seen in Figures 4.1a and 4.1b.

(a) Source qsort.c CFG (b) Binary qsort.c CFG

Figure 4.1: qsort.c Source and Binary CFGs

Upon a first glance over the CFGs, we can see that there is something very different about the control structure in the source compared to the binary. For starters, the differing number of nodes is a point of interest, along with the presence of loops within the binary. This is something very interesting and initially unexpected. There are certain things the compiler will do that we can take into account when doing this analysis, but creating nodes and loops in the binary where there appear to be none in the source is not one of them.

Since we have found this discrepancy in the CFGs, it is beneficial to go back to the source code for further investigation. If we look back at Figure 4.1a, we can see that there is a series of function calls in the sequence of nodes along the true branch going out of the root. The corresponding source code for just the first function call can be seen in Listing 4.1. Going one step further, we also have the source code and CFG for the function partition(), which is the first function called if that branch condition evaluates to true. This code and its CFG can be seen in Listing 4.1 and Figure 4.2.

Now that we have the source code and can see what is being called in the source CFG, we can see that there is clearly a static function that can be executed by qsort.c. Within this static function, partition(), we can see that there is guaranteed to be a do-while loop and two branch conditions in the binary (see lines 61 to 79 of the source code). With this in view, it makes sense that there would be the extra loops as well as branch diamonds in the binary CFG and PCG, since we know how the compiler interprets and uses static functions.

With the additional-branch-nodes problem starting to be solved, we can see proof that this is in fact what happens, that is, the replacement of the static function call with the associated code. First, we can look at the rest of the execution in the partition() function. If we were to hit line 77 in partition(), then we are at yet another static function call, this time to swap_elements(). The source code and CFG for swap_elements() can be seen in Listing 4.1 and Figure 4.3. Here we can see that there is another loop that can be executed, which accounts for the final branch node seen in the binary CFG.

Additionally, a more formal indication can be seen if we look at the call graphs for the function partition() and qsort.c, as seen in Figures 4.4 and 4.5. These two call graphs formally tell us that the only thing that could possibly be in the location of that function call is the compiled source code for partition().

The example given is just for the first static function that is called within qsort.c; however, the logic and reasoning applied here hold true for the remainder of the static function calls.

1  static size_t partition(void *base, size_t nmemb, size_t size,
                           int (*compar)(const void*, const void*));
2  static void swap_elements(void *p1, void *p2, size_t size);
3
4  void qsort(void *base, size_t nmemb, size_t size,
              int (*compar)(const void*, const void*)){
5      if (nmemb > 1){
6          size_t pivot_index = partition(base, nmemb, size, compar);
7          qsort(base, pivot_index, size, compar);
8          qsort(base + (pivot_index + 1) * size, nmemb - (pivot_index + 1),
9                size, compar);
10     }
11 }
12
13 static size_t partition(void *base, size_t nmemb, size_t size,
                           int (*compar)(const void*, const void*)){
14     void *p1, *p2;
15     size_t p1_index = 1;
16
17     /* Pivot is at @base. */
18     p1 = base + size;
19     p2 = base + (nmemb * size);
20     do{
21         if ((*compar)(p1, base) <= 0){
22             p1 += size;
23             p1_index++;
24         }
25         else{
26             p2 -= size;
27             swap_elements(p1, p2, size);
28         }
29     } while (p1 != p2);
30
31     p1 -= size;
32     p1_index--;
33     swap_elements(p1, base, size);
34     return p1_index;
35 }
36
37 static void swap_elements(void *_p1, void *_p2, size_t size){
38     unsigned char *p1 = _p1;
39     unsigned char *p2 = _p2;
40     size_t i;
41     unsigned char tmp;
42
43     for (i = 0; i < size; i++){
44         tmp = p1[i];
45         p1[i] = p2[i];
46         p2[i] = tmp;
47     }
48 }

Listing 4.1: qsort.c Source Code

4.2 Static Function Transform Overview

In order to address the challenges presented by using static functions, we have developed a

source CFG transform to mimic the behavior seen in binaries. This transform takes a given

source code CFG and replaces any static function calls that are not used within a branch node,

with the respective static function CFG. At a high level, this algorithm follows the steps seen in

Algorithm2. 32

Figure 4.2: partition() Source CFG
Figure 4.3: swap_elements() Source CFG

Figure 4.4: qsort.c Call Graph
Figure 4.5: partition() Call Graph

Algorithm 2: Static-Function-Transform(CFG)
Input: Source control flow graph, CFG, to transform
Output: The transformed CFG
1 Find all static functions within the given CFG
2 Create and store all static CFGs for ease of use later
3 Create nodes and edges for the graph to be returned
4 return Transformed CFG

4.2.1 Finding Static Function Calls

The first step that needs to be taken in this transform is to identify all of the possible static function calls within a given source CFG. This is easily done through the query, tagging, and filtering abilities provided by the Atlas API. The idea behind locating all of the possible static functions as the first step is to help minimize the number of “context switches” that need to happen later in the algorithm. Since we know that this transform deals with static functions and we know what we will need to do with those functions, it makes sense to find them first to ease the creation of the final graph. Additionally, taking this first step and the next step described in Section 4.2.2 makes the number of times a given static function is called irrelevant. This is because we pre-process as much information about the static functions as possible before creating the final graph that will be returned. Algorithm 3 shows a pseudocode version of this step of the algorithm.

Algorithm 3: Static-Function-Searching(CFG)
Input: Source control flow graph, CFG, to transform
Output: A list of the names of the static functions found in the source CFG
1 tUnits = Find all Translation Units in the Graph Universe
2 cfgTUnit = Select the Translation Unit for the given CFG
3 functionCalls = Find all functions called in cfgTUnit
4 staticFunctions = List to contain the names of all found static functions
5 foreach Node f in functionCalls do
6     if f.tagged(“static function”) and !staticFunctions.contains(f) then
7         staticFunctions.add(f)
8 return staticFunctions

4.2.2 Create and Store Original Static CFGs

Now that we have identified all possible static function calls within the given CFG, we will generate the original CFGs for each of the static functions. This allows us to map each static function to its respective CFG and store the original CFGs that we will use to create the nodes in the final, returned CFG. This is done by iterating through each of the static functions returned from Algorithm 3. While we perform this iteration, we need to check that the translation unit of the found CFG matches the translation unit of the given CFG. This is because static functions allow developers to use the same function name across multiple source files. If this check is satisfied, we map the CFG to the given static function. This process can be seen in Algorithm 4.

Algorithm 4: Static-CFG-Creation(staticFunctions)
Input: A list of static functions found in the source CFG
1 originalTU = Original CFG translation unit
2 staticCFG = Map containing each static function and the respective CFG
3 foreach Function f in staticFunctions do
4     c = CFG(f)
5     cTU = Translation Unit for c
6     if !staticCFG.containsKey(f) and originalTU.equals(cTU) then
7         staticCFG.put(f, c)

4.2.3 Create Transformed Graph

The final step is to create the transformed CFG that will be returned. This is also the most technically and logically challenging step of the transform. It is completed by iterating over the edges of the original CFG and first checking whether the nodes an edge comes from and points to are both static function calls. If this is the case, then we take the leaves of the “from graph” and add edges from each leaf to the root of the “to graph”. Should this not be the case, we check whether the node an edge points to is a static function, or whether the node an edge comes from is a static function. Based on the result of this check, we add the necessary nodes and edges between the nodes created from the original CFG and the static function CFG nodes. Lastly, if an edge does not meet any of the previously listed cases, then we simply create the from node, the to node, and the edge, since we know these all come from the original CFG and nothing needs to be changed from their original structure. It is during this step that we may also encounter the case of “nested” static functions, that is, one static function calling another static function. Should this be the case, we recursively call our “create static CFG” function until we do not find a static function call. As the recursion stack is collapsed, each static function call is replaced with its respective CFG. An outline of the code that performs this final step of the transform can be seen in Algorithm 5.

4.2.4 Application to XINU Example

Now, if we revisit our original example of the function qsort.c, we can apply this transform to that function. Using qsort.c allows us to see all parts of this algorithm in action: qsort.c contains a call to the static function partition(), and within partition() there are multiple calls to the static function swap_elements(). The results of this transform can be seen in Figure 4.6, and within this graph we see that the five branch nodes which were present in the original binary CFG are now present in the source CFG.

4.3 Implications of the Static Transform

When we look at the bigger picture, this transform initially makes our source CFGs much more complex by increasing the number of nodes: we are taking a single node that represents a function call and replacing it with the entire CFG of the function being called. However, this replacement is beneficial when we start to look at the CFGs generated from the disassembled binary. During compilation, static function calls are able to be replaced with the function's assembly code instead of an actual call, because the scope of a static function is limited to just the translation unit where it is located. Now, by replacing all of the static function calls in the source with their respective CFGs, we are able to get the source code graphs one step closer to what they look like in the binary, which aids in the verification of the source code against the respective binary.

Algorithm 5: Create-Returned-Graph(CFG, staticFunctions)
Input: Source CFG to transform and a list of the static functions in the source graph
 1 foreach Edge e in CFG do
 2   Node from = e.from
 3   Node to = e.to
 4   if staticFunctions.contains(from) and staticFunctions.contains(to) then
 5     Get from function's leaves
 6     Add edges from each leaf to root of to's graph
 7   else if staticFunctions.contains(to) then
 8     Create from node if needed
 9     Check to make sure to root has not been made
10     if root has been made then
11       Add edge between root and from node
12       continue
13     Create static CFG in returned graph
14     Recursively build other static CFGs if needed
15     Add edge between root and from node
16     Add edges between leaves and to's successor
17   else if staticFunctions.contains(from) then
18     Check to see if from has already been made
19     if from node in final graph then
20       Add edges between from leaves and to node
21       continue
22     Create static CFG in returned graph
23     Recursively build other static CFGs if needed
24     Add edges between leaves and to node
25   else
26     Create from node for final graph
27     Create to node for final graph
28     Add edge between new from and to nodes
29 return transformedCFG

Figure 4.6: qsort.c Transformed CFG
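A toy example (not from XINU) of why this inlining happens: because a `static` function has internal linkage, the compiler sees every call site within the translation unit and is free to substitute the function body for the call, so the call node may disappear from the binary CFG entirely.

```c
#include <assert.h>

/* static: visible only in this translation unit, so the compiler may
   inline the body at each call site rather than emit a call. */
static int square(int x)
{
    return x * x;
}

int sum_of_squares(int a, int b)
{
    /* in the binary this may simply become a*a + b*b, with no
       call instruction and no separate CFG for square() */
    return square(a) + square(b);
}
```

The source CFG has a call node for each use of `square`; after inlining, the binary CFG only contains the expanded bodies, which is exactly the gap the static transform closes.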

The overall goal of this research is to develop an algorithm that will allow us to verify whether the disassembled binary maintains the same control structure as the source code. This finding about static function usage and its implications helps inform the development of this algorithm. With this knowledge, we know that we cannot rely solely on comparing the number of paths in the source and binary CFGs. Likewise, comparing the number of nodes and edges, even in conjunction with the number of paths, is not a sufficient metric: the use of static functions will increase the number of nodes in the CFG as well as potentially increase the number of branch statements. Additionally, there is the potential for edge cases to occur.

While investigating the cause of the difference in paths for various CFGs, it makes sense to also check whether there were functions where the number of branch statements differed between the binary and source yet the number of paths was still the same. Upon investigation, we see that the function create.c meets this criterion. The source and binary CFGs associated with this function can be seen in Figures 4.8a and 4.8b respectively. Here again, we see that the number of branch nodes does not match between the source and binary (i.e. there are 3 conditional nodes and a loop node in the binary compared to only 3 conditional nodes in the source). This can be seen more clearly in the PCGs shown in Figures 4.9a and 4.9b.

When looking at the source code, we notice two things, the first being the presence of a static function, as seen in Listing 4.2. This static function has both a loop and a conditional in it, which we would expect to cause a total of 5 branch nodes in the binary CFG, not 4. The interesting thing, then, is that this conditional actually remains in the binary CFG while one in the source appears not to. The conditional in Figure 4.7 appears to be condensed to the following line of code in the binary: sltiu v1, v2, 0x80. The MIPS instruction sltiu stands for "set on less than immediate unsigned" and evaluates as follows: if v2 < 0x80

then set v1 = 1. From further investigation into the source code, we see that the constant MINSTK being used in the conditional is defined in a header file as the decimal value 128, which translates to the hex value 0x80. The comparison and check seen in the conditional are thus performed in a single assembly instruction. Ultimately this is an optimization done by the compiler at compile time; however, it also leads to an interesting edge case when trying to verify the control structures of the functions.
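The pattern can be reduced to the following sketch; the `MINSTK` value matches the thesis's description of XINU's header definition, and on MIPS the guard below is what the compiler folds into the single `sltiu v1, v2, 0x80` instruction:

```c
#include <assert.h>

#define MINSTK 128 /* per the thesis: defined in a XINU header; 128 == 0x80 */

/* Clamp a requested stack size to the minimum, as in create.c.
   The comparison is the part folded into one sltiu instruction. */
unsigned long clamp_stack_size(unsigned long ssize)
{
    if (ssize < MINSTK) {
        ssize = MINSTK;
    }
    return ssize;
}
```

The decimal/hex equivalence (128 == 0x80) is the link between the source constant and the immediate operand seen in the disassembly.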

One final implication this finding seems to have is with regard to reverse engineering. Without having the source code of a given binary for comparison, it appears to be a very daunting task to say with any amount of certainty whether the binary definitively came from the source. There are too many variables at play that create too many possible combinations of outputs with regard to control structures. Without a way to verify and understand the control structure of the binary, you are unable to say for certain what is actually going on.

static int thrnew(void);

tid_typ create(void *procaddr, uint ssize, int priority, const char *name, int nargs, ...)
{
    irqmask im;            /* saved state                  */
    ulong *saddr;          /* stack address                */
    tid_typ tid;           /* new thread ID                */
    struct thrent *thrptr; /* pointer to new control block */
    va_list ap;            /* list of thread arguments     */

    im = disable();

    if (ssize < MINSTK) {
        ssize = MINSTK;
    }

    saddr = stkget(ssize);
    if (SYSERR == (int)saddr) {
        restore(im);
        return SYSERR;
    }

    tid = thrnew();
    if (SYSERR == (int)tid) {
        stkfree(saddr, ssize);
        restore(im);
        return SYSERR;
    }

    thrcount++;
    thrptr = &thrtab[tid];

    thrptr->state = THRSUSP;
    thrptr->prio = priority;
    thrptr->stkbase = saddr;
    thrptr->stklen = ssize;
    strlcpy(thrptr->name, name, TNMLEN);
    thrptr->parent = gettid();
    thrptr->hasmsg = FALSE;
    thrptr->memlist.next = NULL;
    thrptr->memlist.length = 0;

    /* Set up default file descriptors. */
    thrptr->fdesc[0] = CONSOLE; /* stdin is console  */
    thrptr->fdesc[1] = CONSOLE; /* stdout is console */
    thrptr->fdesc[2] = CONSOLE; /* stderr is console */

    va_start(ap, nargs);
    thrptr->stkptr = setupStack(saddr, procaddr, INITRET, nargs, ap);
    va_end(ap);

    restore(im);
    return tid;
}

static int thrnew(void)
{
    int tid;
    static int nexttid = 0;

    for (tid = 0; tid < NTHREAD; tid++) {
        nexttid = (nexttid + 1) % NTHREAD;
        if (THRFREE == thrtab[nexttid].state) {
            return nexttid;
        }
    }
    return SYSERR;
}

Listing 4.2: create.c Source Code

Figure 4.7: create.c removed conditional

(a) Source create.c CFG (b) Binary create.c CFG

Figure 4.8: create.c Source and Binary CFGs

(a) Source create.c PCG (b) Binary create.c PCG

Figure 4.9: create.c Source and Binary PCGs

CHAPTER 5. SWITCH STATEMENT TRANSFORM

When writing software, developers often want to reduce the amount of redundant code and ensure that the code they keep is clear and readable. These ideas come into play when there need to be distinct ways of handling different situations based on the value of a variable, and thus we have switch statements in source code. Switch statements, along with their associated case statements, allow developers to easily and clearly control what code is executed based on the value of the switch parameter. Even though these blocks of code are easily read in the source, there is no such construct as a switch or case statement once the source code is compiled down into a binary executable. This is because in the source code, a switch statement can be viewed as an n-way branch node, where n is the number of cases. All of the components that make up these blocks of code are ultimately translated into multiple 2-way branch statements that are evaluated at runtime in order to execute the correct code based on the switch parameter while minimizing the number of comparisons that need to happen.

The translation of switch and case statements into n 2-way branches creates a challenge when attempting to verify the control flow of the source code against the disassembled binary, because the CFGs for source code do not view the cases in a switch statement as branches even though they are evaluated as such at run time. Therefore, if we attempt to analyze the source and binary CFGs for a piece of source code that utilizes switch statements, the analysis will get nowhere without taking into account the transformation the compiler performs on switch statements.
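To make the lowering concrete, here is a hand-written sketch (not actual compiler output) of a 4-case switch alongside one possible reduction into 2-way branches, with a middle case tested first in binary-search fashion; the case values and return values are illustrative only:

```c
#include <assert.h>

/* An n-way branch in the source... */
int with_switch(int x)
{
    switch (x) {
    case 1: return 10;
    case 2: return 20;
    case 3: return 30;
    case 4: return 40;
    default: return -1;
    }
}

/* ...and one possible lowering into 2-way branches. A middle case is
   checked first, then a "directional" comparison steers toward the
   lower or upper half, so only a few comparisons run per lookup. */
int with_branches(int x)
{
    if (x == 3) return 30; /* middle case first */
    if (x < 3) {           /* directional check: lower half */
        if (x == 1) return 10;
        if (x == 2) return 20;
    } else {               /* upper half */
        if (x == 4) return 40;
    }
    return -1;             /* default destination */
}
```

Both functions compute the same result, but only the second reflects the branch structure that actually appears in the disassembled binary.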

5.1 Switch Statement Transform

The remainder of this chapter presents an algorithm to accurately transform the switch and case statements in the source code into the n 2-way branches found in the binary, in order to have a unifying control structure between the source and binary CFGs. A high level overview of the algorithm can be seen in Algorithm 6.

Algorithm 6: Switch-Transform(CFG)
Input: Control flow graph, CFG, to transform
Output: The transformed CFG
 1 Create nodes and edges for returned graph. Find all switch and case nodes
 2 Map each switch node to a list of their caseNodes
 3 Sort cases in ascending order
 4 2-way branch creation with case nodes
 5 Create default node mapping
 6 Replace the switch block with the n 2-way branches
 7 return Transformed CFG

5.2 Working Example

Throughout this section of the chapter, we will provide detailed explanations of the steps taken in the algorithm to properly transform a switch statement found in the source code into the n 2-way branch structure found in the binary. We will use two different examples to show how the algorithm works throughout this section. The first example is a switch statement with 3 cases, each of which has a break statement as its final line. The second example is a switch statement with 4 cases and a default case, with no break statements or other modifying elements. The source code for these two examples can be seen in Listing 5.1. The initial CFGs for our working examples can be seen in Figure 5.1.

void testing_3_switch_break()
{
    int c1, c2, c3, x;
    switch (x) {
    case c1:
        printf("case1");
        break;
    case c2:
        printf("case2");
        break;
    case c3:
        printf("case3");
        break;
    }
}

void testing_4_switch()
{
    int c1, c2, c3, c4, x;
    switch (x) {
    case c1:
        printf("case1");
    case c2:
        printf("case2");
    case c3:
        printf("case3");
    case c4:
        printf("case4");
    }
}

Listing 5.1: Working Example Source Code

5.2.1 Finding Switch and Case Nodes

The first step that needs to be taken in this algorithm is to determine if there are any switch statements in the source graph. Assuming this to be true, we must do the following. First, identify any nodes which have an edge leading to the switch node in the source CFG. These nodes will be referred to as "switch predecessors", and it is important to identify them now in order to make it easier to add the n 2-way branch structure to the final graph later, since all switch predecessors will need to point to the root branch. Next, we must identify all the case statements, as well as a potential default case statement. During this process, we keep track of the true destination for each case, that is, the first node in the graph which would be reached should a case statement evaluate to true, in order to properly transform the final graph. Lastly, we create all the other control flow nodes and control flow edges which will be returned in the final graph.

(a) Original CFG- Switch with 3 cases (b) Original CFG- Switch with 4 cases

Figure 5.1: Original Example CFGs

Once all of this is done with the non-switch nodes, we iterate through all the switch and case nodes that were identified to create a mapping from each switch node to a list of its cases. This is done to help keep track of the case nodes associated with each switch statement, should there be more than one switch statement. Theoretically speaking, this could slow down the run time, but in practice we do not anticipate having large numbers of switch nodes or cases present. A programmatic view of this process can be seen in Algorithm 7.

5.2.2 Sorting Cases in Ascending Order

Once all of the cases are identified, we sort them in ascending order, that is, according to the order in which they appear in the source code. This must be done for two reasons. The first is that this algorithm is run on the original source CFG, and we cannot guarantee that the cases in the graph will be listed in sequential order or that they will be evaluated in sequential order in Algorithm 7. The second reason is that in the binary CFG, the root of the 2-way branch structure that gets created is the "middle case" from the source code. This appears to be an optimization done in order to minimize the number of comparisons that need to be checked. For the purposes of this transform algorithm, we used bubble sort, since we will likely never have an exceptionally large number of cases to process. A programmatic overview of this step can be seen in Algorithm 8.
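The sort itself is a straightforward bubble sort keyed on source line number; a sketch with a hypothetical `case_node` type standing in for the framework's node objects:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a case node: only the source line number
   matters for the sort in Algorithm 8. */
struct case_node {
    int line_number;
};

/* Bubble sort on line number, as in Algorithm 8. Quadratic, but fine
   for the handful of cases a realistic switch statement contains. */
void sort_case_nodes(struct case_node *nodes, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++) {
        for (size_t j = 0; j + 1 < n - i; j++) {
            if (nodes[j].line_number > nodes[j + 1].line_number) {
                struct case_node tmp = nodes[j];
                nodes[j] = nodes[j + 1];
                nodes[j + 1] = tmp;
            }
        }
    }
}
```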

5.2.3 Creation of 2-way Branches

Now that we have the cases properly sorted, we are able to transform them into their respective 2-way branch nodes. We have some flexibility in how this is done, due to the fact that the root of this structure is the middle case node and the nodes are already sorted. For the purposes of our code, we create the nodes from the middle through the upper half of the array first and then create the nodes for the lower half. This creation process entails turning the case node into a 2-way branch point and adding one of the branches: the true destination branch. We are not able to add the false branches at the initial creation; the reason why will become evident shortly. The only time a false branch is able to be added during this process is if we have an "empty" case (i.e. a case with no other logic in it) that simply falls through to the next case. If this happens, we know that we can add both edges.

Algorithm 7: Finding Switch Nodes(CFG)
Input: Control flow graph, CFG, to transform
 1 switchPredecessors = List of all switch predecessor nodes
 2 caseNodes = List of case nodes and true destinations
 3 switchNodes = List of switch nodes in the graph
 4 foreach Edge e in CFG.edges do
 5   if e.to().taggedWith("switch node") then
 6     Create from node for returned graph
 7     switchPredecessors.add(e.from())
 8     switchNodes.add(e.to())
 9   else if e.from().taggedWith("case node") and !e.from().taggedWith("default case") then
10     Create to node for returned graph
11     Find the switch node associated with the case node
12     caseNode c = Create case node object
13     caseNodes.add(c)
14     Skip rest of checks and start with new edge
15   else if e.from().taggedWith("default case") then
16     Create from node for returned graph
17     Create to node for returned graph
18     Add edge between created nodes
19   else if e.to().taggedWith("default case") and !e.from().taggedWith("switch node") then
20     Create from node for returned graph
21     Create to node for returned graph
22     Add edge between created nodes
23   else if e.to().taggedWith("case node") and !e.from().taggedWith("switch node") then
24     Create from node for returned graph
25     Add edge between created nodes
26   else
27     Create from node for returned graph
28     Create to node for returned graph
29     Add edge between created nodes
30 foreach Node s in switchNodes do
31   foreach Node n in caseNodes do
32     if s == n.getSwitchNode then
33       Map n to s's list of cases

Algorithm 8: Sort Case Nodes(caseNodes)
Input: An array of nodes from the source CFG which represent the start of each case
 1 foreach Node n in caseNodes do
 2   foreach Node n+1 in caseNodes do
 3     if n.lineNumber > n+1.lineNumber then
 4       Swap Nodes

Once we have created the 2-way branch nodes and added the true edges, we determine what the "default" destination is for each switch node. This is the destination taken should all of the cases evaluate to false, or the destination of the default case if one is present. Determining the "default" destinations is necessary for properly adding the false edges to the cases that act as leaves of the n 2-way branch structure.

The final stage of creating the 2-way branches depends on how the architecture the binary is compiled for handles switch statements. From our work we have found that the MIPS architecture adds an additional 2-way branch to the final graph if the number of cases is greater than or equal to 4. At runtime, this allows the computer to quickly reduce the number of cases that need to be checked. A "directional checkpoint" is added for every group of 3 or more branches; thus if there are 4 cases, there is 1 directional checkpoint. Figure 5.2 shows this occurring in the binary CFG for a sample function with 10 cases, where each of the boxed branch nodes is a directional checkpoint. This example helps illustrate what happens for a larger switch statement: if the middle case evaluates to false, the checkpoint determines whether to branch to the upper (larger) half or lower (smaller) half of the cases. Once all of the 2-way branches have been made for the cases, if there are at least 4 cases then we must also construct the upper/lower checkpoint node. We know a few things about this construction. First, the false edge from the middle case node must point to the checkpoint node. Second, the false edge of this checkpoint must point to the first case in the upper half. Lastly, the true edge must point to the first case in the lower half. With the creation of this checkpoint complete, we are able to iterate through the rest of the nodes and properly set the false edges. This structure contrasts with the PowerPC architecture, which simply creates a "waterfall" of if statements. Our algorithm handles both: if a switch statement has fewer than 4 cases on the MIPS architecture, we can simply loop through the list of created nodes and create false edges from index i to index i+1, and the same idea applies to the PowerPC architecture. With all of the false edges created, the creation of the 2-way branches from the original case nodes is complete. Algorithm 9 shows a programmatic view of this creation process.

Figure 5.2: Binary CFG for Switch with 10 cases

Algorithm 9: 2-way Branch Creation(sortedCaseNodes)
Input: An array of sorted case nodes to use for branch creation
 1 Node mid = Middle index from sortedCaseNodes
 2 createdBranches = Array to add created branches to
 3 foreach Node from mid to sortedCaseNodes.size()-1 do
 4   Create 2-way branch node, c, for given case node m
 5   Add true branch to node c
 6   createdBranches.add(c)
 7 foreach Node from mid to 0 do
 8   Create 2-way branch node, c, for given case node m
 9   Add true branch to node c
10   createdBranches.add(c)
11 if sortedCaseNodes.size() > 3 then
12   Create check point node, p
13   Add false edge from createdBranches.get(0) to p
14   Add false edge from p to createdBranches.get(1)
15   Add true edge from p to createdBranches.get(mid)
16   foreach Node n in createdBranches do
17     if n == mid then
18       Add false edge to default/switch false destination
19     else
20       Add false edge from n to n+1
21 else
22   foreach Node i = 1 through createdBranches do
23     Add false edge from i-1 to i

5.2.4 Switch Block Replacement

The final step in this algorithm is to replace the original switch and case statement block in the source code graph with our 2-way branch nodes. This is done by first iterating through the switch predecessor nodes and adding an edge from each predecessor to the first branch node (i.e. the middle case). After this is done, we need to make sure that any "waterfall edges" are added; that is, if there are consecutive cases without a return or break statement, the resulting fall-through is similar to what we see in the working example with 4 cases. These edges are added by iterating through the nodes which were case predecessors in the original graph (i.e. along the true path of another case) and adding edges to the 2-way branches as needed. Once this step is complete, the transformed CFG can be returned. Algorithm 10 shows this in a programmatic view.

Algorithm 10: Switch Replacement(createdBranches)
Input: An array of the 2-way branch nodes that were created
 1 foreach Node n in switchPredecessors do
 2   Add edge from n to createdBranches.get(0)
 3 foreach Node m in casePredecessors do
 4   Add edge from m to corresponding branch node
 5 return Transformed CFG

5.2.5 Transformed Working Example

We will now show the results of this algorithm on our two working examples. First, we will look at the 3-case switch statement with break statements at the end of each case. Figure 5.3 shows the original source CFG for this function. In Figure 5.4 we can see the transformation algorithm applied to the original CFG. Lastly, we can compare our transformed CFG to the one found in the binary, seen in Figure 5.5, and see how our transform matches what is seen in the binary in terms of the branch nodes and overall control structure.

Now we will look at our example of a 4-case switch statement with a default case and no break or return statements. Figure 5.6 again shows the original source CFG for the given function. Next, in Figure 5.7 we can see our algorithm applied to the source graph to get our transformed graph. Finally, in Figure 5.8 we have the binary CFG (compiled for the MIPS architecture), which Figure 5.7 mirrors in terms of overall control structure.

Figure 5.3: Original CFG- Switch with 3 cases Figure 5.4: Transformed CFG- Switch with 3 cases

Figure 5.5: Binary CFG- Switch with 3 cases

Figure 5.7: Transformed CFG- Switch with 4 cases Figure 5.6: Original CFG- Switch with 4 cases

Figure 5.8: Binary CFG- Switch with 4 cases

5.3 XINU Example

Now we will walk through a real world example of this algorithm being applied to a function from the XINU code base. The function in question is ttyControl.c, and its source code can be seen in Listing 5.2. This leads to the source CFG seen in Figure 5.9. After applying our algorithm to transform the graph, we get the graph seen in Figure 5.10. We are then able to compare this transformed graph to the graph found in the binary; the binary CFG can be seen in Figure 5.11. The lettered markings on the graphs in Figures 5.10 and 5.11 show the correlation of the transformed source control structure to that of the binary. This helps to demonstrate that our transform shows how the source code will be structured in the binary executable.

devcall ttyControl(device *devptr, int func, long arg1, long arg2)
{
    struct tty *ttyptr;
    device *phw;
    char old;

    ttyptr = &ttytab[devptr->minor];
    phw = ttyptr->phw;
    if (NULL == phw) {
        return SYSERR;
    }

    switch (func) {
    case TTY_CTRL_SET_IFLAG:
        old = ttyptr->iflags & arg1;
        ttyptr->iflags |= (arg1);
        return old;

    case TTY_CTRL_CLR_IFLAG:
        old = ttyptr->iflags & arg1;
        ttyptr->iflags &= ~(arg1);
        return old;

    case TTY_CTRL_SET_OFLAG:
        old = ttyptr->oflags & arg1;
        ttyptr->oflags |= (arg1);
        return old;

    case TTY_CTRL_CLR_OFLAG:
        old = ttyptr->oflags & arg1;
        ttyptr->oflags &= ~(arg1);
        return old;
    }

    return SYSERR;
}

Listing 5.2: ttyControl Source Code

Figure 5.9: ttyControl Source CFG

Figure 5.10: ttyControl Transformed CFG

Figure 5.11: ttyControl Binary CFG

5.4 Implications of the Switch Transform

The development of this transform for source code CFGs is very helpful in getting one step closer to verifying that the control structure of developer-written source code matches the control structure of what is executed on a given machine. This transform allows us to take an n-way branch (i.e. a switch statement) and translate it into the respective 2-way branches to mimic what happens in the binary. This translation will always happen in the binary, due to the fact that an executable can only ever contain 2-way branches. Additionally, it allows the binary to execute in a binary search fashion when the code is run, which helps to minimize the number of comparisons performed.

One of the interesting points to take away from the development of this transform is its generality. Based on our testing with simple examples, as well as the cases found in the XINU code base, this algorithm works with all possible combinations of switch statements. By that we are referring to the fact that there can be a variable number of cases, a default case is not required, and each case may contain a break or return statement which would alter the control structure. This algorithm helps to show that no matter what the source code structure appears to be, there is a simpler structure that underpins it all.

CHAPTER 6. SHORT CIRCUIT TRANSFORM

Oftentimes in source code, there are multiple conditions that can or need to be evaluated in order for a condition to be true and subsequent code to be executed. Consider, for the sake of an example, 4 boolean variables c1, c2, c3, c4 that are used to determine whether or not a branch is taken in the source code. The branch statement could look like anything from if ((c1 && c2) || (c3 && c4)) to if ((c1 || c2) && (c3 || c4)), or any other combination of the 4 variables and logical operators. The expression that these variables create is what will be referred to as a complex condition(al) (CC); i.e., a logical operator is used inside the condition and there is more than one variable that potentially needs to be evaluated. A condition that does not rely on any logical operators and only uses 1 variable is what will be referred to as a simple condition (SC). Programmers have the ability to break up these complex conditions into appropriately nested or linear conditions, but for simplicity and readability they are often left in their complex form.

When a programmer goes to compile and run their code, the complex conditions formed in the source code must be broken down into SCs, because assembly languages are only able to handle SCs. Breaking up the CCs creates a challenge when attempting to verify the control structure of a disassembled binary function against that of the source code, since all possible comparison metrics change. Things such as the path count and the number of nodes in the function's CFG change, as does the CFG itself, since more branches are added in the binary function. The addition of branch nodes that occurs from breaking down the CCs also creates another problem that we refer to as short circuiting. Short circuiting is a form of optimization performed by the compiler when deconstructing the CCs. Consider our example of if ((c1 || c2) && (c3 || c4)). When evaluating this condition, if c1 is true, there is no need to evaluate c2 since we know that the first half of the condition is satisfied; all we need to do now is check the second half. That is, you only evaluate as much as you need to in order to determine if the branch should be taken. This is reflected in how the simple conditions are organized in the disassembled CFG.

The remainder of this chapter presents an algorithm to accurately deconstruct the complex conditions in the source code, at all branch points, in order to take short circuiting into account and attempt to arrive at a unifying control structure between the source and binary CFGs. A high level overview of the algorithm can be seen in Algorithm 11.

Algorithm 11: SC-Transform(CFG)
Input: Control flow graph, CFG, to transform
Output: The transformed CFG
 1 Find all CC nodes
 2 Process all non-CC nodes
 3 Process all CC nodes
 4 return Transformed CFG

6.1 Working Example

While we provide detailed explanations of the steps that are taken in the algorithm, we will continue to use the working example of if ((c1 || c2) && (c3 || c4)) to illustrate various points throughout the explanation. The initial CFG that makes use of our working example can be seen in Figure 6.1 where the path to the printf statement is taken if the condition evaluates to true.

6.1.1 Finding CC Nodes

The first step that needs to be taken in this algorithm is to find all nodes that could cause short circuiting. This is done by iterating through all control flow conditions in the given CFG and tagging the nodes that contain CCs for processing later in the algorithm. It is important to isolate the CC nodes in order to ensure that all of the non-CC nodes maintain the correct relationships with the other nodes in the graph. That is, the tail of the original edge pointing to the CC should point to the first SC that is added in its place. If we look at our working example, we want to ensure that the root of the graph points to the SC which will contain c1 in the transformed graph. In addition to all the CC nodes being tagged appropriately, the non-CC nodes and edges are also tagged to aid in their processing, which is discussed in Section 6.1.2. Algorithm 12 shows a more programmatic view of the steps taken at this point in the algorithm.

Figure 6.1: Working Example CFG

Algorithm 12: Finding CC Nodes(CFG)
Input: Control flow graph, CFG, to transform
 1 cfg = CFG to transform
 2 conditionNodes = Condition Nodes in cfg
 3 /* ArrayList to hold all nodes with CCs */
 4 ccNodes = ∅
 5 foreach Node n in conditionNodes do
 6   if n contains ||, && then
 7     ccNodes.add(n)
 8     n.tag("cc node")
 9 foreach Node n : cfg.nodes do
10   if !ccNodes.contains(n) then
11     n.tag("retain node")
12 foreach Edge e : cfg.edges do
13   e.tag("retain edge")

6.1.2 Processing non-CC Nodes to Retain

Once we have identified and tagged the nodes that contain CCs, we are able to process all other nodes in the graph. To process the non-CC nodes, we iterate over all the edges in the given CFG, since this allows us to easily process all 3 components at the same time (the edge tail, the edge header, and the edge itself) while maintaining the correct structure. Each edge falls under 1 of 4 different cases, as seen in Table 6.1. The simplest case is if an edge falls under case 1, since the Header node is tagged as being nested and everything else is processed later with the other CC nodes. If the edge falls under case 2, then the only node that needs to be created for the transformed graph is the Header node. Cases 3 and 4 are where things start to get more complicated. Case 3 has the most complex logic of the 4 cases, due to the fact that the edge we are processing may be a loopback edge, or it may be the edge from a control flow node to the CC node. In the case of an edge from a control flow node to a CC node, we will refer to the Tail node as the CC-predecessor, since it is the predecessor node to the CC node. Depending on whether the edge is a loopback edge or contains the CC-predecessor as the Tail node, the Tail node is processed and added to an appropriate list in order to create the correct edges later on, when the CC nodes are processed. Lastly, if the edge falls under case 4, the nodes and edges are created appropriately. Once all of the non-CC nodes have been processed, we are able to process and expand the CC nodes. Algorithm 13 shows a programmatic view of processing the non-CC nodes.

Table 6.1: CFG Edge Combinations

Case  Tail Node  Header Node
1     CC         CC
2     CC         non-CC
3     non-CC     CC
4     non-CC     non-CC

Algorithm 13: Process non-CC Nodes(CFG, ccNodes)
Input: Control flow graph, CFG, to transform and an array of the CC Nodes
 1 cfg = CFG to transform
 2 ccNodes = Nodes that have CCs, found in Algorithm 12
 3 predecessors = List to hold CC predecessor nodes
 4 loopbackTails = List to hold loopback tail nodes
 5 foreach Edge e : cfg.edges do
 6     if ccNodes.contains(e.tail) and ccNodes.contains(e.header) then
 7         e.header.tag("nested cc")
 8     else if ccNodes.contains(e.tail) and !ccNodes.contains(e.header) then
 9         Create e.header node to be added to graph
10         Tag new node appropriately
11         Add node to graph
12     else if !ccNodes.contains(e.tail) and ccNodes.contains(e.header) then
13         Create e.tail node to be added to graph
14         Tag new node appropriately
15         if !e.taggedWith("LoopBackEdge") then
16             predecessors.add(e.tail)
17         if e.taggedWith("LoopBackEdge") then
18             loopbackTails.add(e.tail)
19     else if !ccNodes.contains(e.tail) and !ccNodes.contains(e.header) then
20         Create e.tail node to be added to graph
21         Create e.header node to be added to graph
22         Tag new nodes appropriately
23         Add nodes and edges to graph to be returned
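The four-way dispatch of Table 6.1 that drives Algorithm 13 amounts to a classification of each edge by the CC membership of its endpoints; a minimal sketch with illustrative names:

```python
# Minimal sketch of the Table 6.1 dispatch: each edge is classified by
# whether its tail and header are CC nodes.

def classify_edge(tail, header, cc_nodes):
    t, h = tail in cc_nodes, header in cc_nodes
    if t and h:
        return 1        # CC -> CC: header is a nested CC
    if t:
        return 2        # CC -> non-CC: create only the header node
    if h:
        return 3        # non-CC -> CC: CC-predecessor or loopback tail
    return 4            # non-CC -> non-CC: copy nodes and edge over

cc = {"cond"}
assert classify_edge("root", "cond", cc) == 3
assert classify_edge("cond", "print", cc) == 2
```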

6.1.3 Process CC Nodes

The last step in the algorithm is to process all of the original CC nodes. The challenge in this step is twofold. First, the control structure amongst the new SC nodes must be properly separated and created. Second, the overall structure of the CFG must be maintained while each CC node is expanded into its respective SC nodes. The first step in this process is to take the list of CC nodes identified and tagged in the first two steps and sort it such that any CC nodes considered "nested" are processed first. A nested CC node is a CC node which falls along one of the outgoing edges of another CC node. An example of this can be seen in Figure 6.2, which shows a nested CC node in the XINU function xsh_memdump.c. In the figure, there is a CC node at the top; along one outgoing edge is a normal control flow node, and along the other is another CC node. The second CC node is considered nested under the first. It is important to sort the CC node list so that the nested nodes are processed first, because the SCs they create become the true and false destination nodes for the parent CC node. So, in order to maintain the correct structure, we need to expand the nested CC nodes first.

Once the nodes are properly sorted, we are able to start iterating over them and creating the SC nodes. First, we obtain the true and false destination nodes; those are the nodes that the true and false edges of the original CC node point to in the CFG. Next, we take the CC node currently being processed and identify its conditions as well as the logical operators that connect them. Identifying the logical operators is important because they determine how the new paths are formed. The only condition for which this does not matter is the last one, since it always points to the original true and false destinations regardless of how the previous conditions evaluate. Once we have the conditions and operators stored appropriately in an array, we can start constructing the SCs for the graph that will be returned.

If we are processing the first condition, there are the additional steps of pointing the predecessor node, as well as any loopback tails, to the new SC. After that is checked, we are able to set the true and false destinations for each new SC node. The last condition's true and false destinations are set first, in order to prevent errors when the code is running. This is primarily due to what happens with a logical OR condition. Consider the second half of our working example: (c3 || c4). We will have set c4's true and false destinations first. Now, when we iterate through the array of SC nodes and get to c3, all we have to do for setting the true destination is look at the true destination of the next index; in this case, that is the true destination of c4. Once the last condition's true and false destinations are set, we work our way back "up" through the conditions. This bottom-up approach to the expansion and reconstruction minimizes what is unknown or needs to be double checked in the implementation code.
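The short-circuit destinations described above can be made explicit with a small recursive sketch over a boolean expression tree. The thesis operates on the flat condition/operator array; the tree encoding, node names, and destination names below are assumptions for illustration only.

```python
# Hedged sketch: expand a boolean expression tree into single-condition
# (SC) nodes with explicit short-circuit true/false destinations.

def expand(expr, t_dest, f_dest, edges):
    """Return the entry node; append (node, 'T'/'F', destination) edges."""
    if isinstance(expr, str):              # a single condition (SC node)
        edges.append((expr, "T", t_dest))
        edges.append((expr, "F", f_dest))
        return expr
    op, lhs, rhs = expr
    rhs_entry = expand(rhs, t_dest, f_dest, edges)
    if op == "||":                         # lhs true short-circuits past rhs
        return expand(lhs, t_dest, rhs_entry, edges)
    return expand(lhs, rhs_entry, f_dest, edges)   # "&&": lhs false bails out

# (c1 || c2) && (c3 || c4), with "printf" / "exit" as the destinations
ast = ("&&", ("||", "c1", "c2"), ("||", "c3", "c4"))
edges = []
entry = expand(ast, "printf", "exit", edges)
# entry is c1; e.g. c1 false falls through to c2, c2 true jumps to c3,
# and every path into "printf" means the whole condition was true
```

Evaluating any truth assignment by following these edges from the entry node reproduces the short-circuit paths of the transformed CFG in Figure 6.3.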

After all of this has happened, we return the transformed graph, which contains the expanded CC nodes, each of which now has only a single condition and is therefore an SC node. The programmatic view of this is shown in Algorithm 14.

Figure 6.2: Nested CC Nodes

6.1.4 Transformed Working Example

Running this algorithm on any given CFG with CC nodes transforms the CFG such that each CC node is expanded into SC nodes with the short-circuiting paths connecting them. We have now come full circle with our working example. If we recall, we started with the notion of if ((c1 || c2) && (c3 || c4)) as our

Algorithm 14: Process CC Nodes(CFG, ccNodes, predecessors, loopbackTails)
Input: Control flow graph, CFG, to transform and node lists for processing
Output: The transformed CFG
 1 /* Sorting nested nodes to be processed first */
 2 foreach Node n in ccNodes do
 3     currentNode = n
 4     foreach Node m = n + 1 in ccNodes do
 5         nextNode = m
 6         if !currentNode.taggedWith("nested cc") and nextNode.taggedWith("nested cc") then
 7             Swap currentNode and nextNode
 8 foreach Node x in ccNodes do
 9     trueDestination = x.getTrueEdge.to
10     falseDestination = x.getFalseEdge.to
11     conditionNodes = CC node conditions and logical operators info
12     foreach Node y in conditionNodes do
13         if y == 0 then
14             Add edge from all x's predecessors to y
15             Add edge from any loopback with x as the header to y
16     Set last condition true and false destinations
17     foreach Node z in conditionNodes do
18         Set true and false destinations for all other nodes
19 return Transformed CFG

initial condition. The initial CFG for this can be seen in Figure 6.1. After running this algorithm on that CFG, we have the result seen in Figure 6.3. As we saw previously, any path that leads to the printf statement is equivalent to the overall condition evaluating to true, and this is still the case in the transformed CFG. Figure 6.4 shows a side by side comparison of the original and transformed CFGs for reference.

Figure 6.3: Working Example Transformed CFG

6.2 Implications of Design Decision

The development of this short circuit transform is important with regard to our verification algorithm. First, the transform is useful in seeing how a programmer's source code will likely look once it is compiled down to a binary. If we look at Figure 6.5, we can see the CFG for the disassembled binary of the working example. Figure 6.7 then shows the side by side comparison of the transformed CFG and the binary CFG; the strong resemblance between the two is important for the verification problem.

(a) Original CFG (b) Transformed CFG

Figure 6.4: Original and Transformed CFGs

Second, this transform and the algorithm used to create it lead to some interesting findings in the realm of formal verification and reduced ordered binary decision diagrams (BDDs). We can start by thinking of the expression in a CC node as the boolean formula required for the creation of a BDD. Doing this with our working example gives the following formula: f(c1, c2, c3, c4) = (c1 ∨ c2) ∧ (c3 ∨ c4). Next, we can create the binary decision diagram for this formula, which can be seen in Figure 6.6. Now, to reduce the BDD, we would repeatedly perform the following steps:

1. Remove redundant leaf nodes

2. Remove redundant non-leaf nodes

3. Remove redundant choice points or tests

We can start this reduction process on our example BDD and get Figure 6.8a after the removal of redundant leaf nodes. Next, we remove the redundant non-leaf nodes, which are highlighted red in Figure 6.8b. We then remove redundant choice points or tests; in our case this amounts to consolidating portions of the graph, and the consolidated portions are highlighted blue in Figure 6.8c. After this, we are left with Figure 6.9, to which we again apply the second step. Ultimately this leads to the final reduced ordered BDD, as seen in Figure 6.10.

We can now see that the resultant reduced BDD is essentially identical to the graph produced by our short circuit transform; Figure 6.11 shows this relationship. The only difference is that the true and false destinations are not connected by an edge in the BDD.

One benefit of this algorithm, then, is that if we know the boolean formula, we do not need to enumerate all possible outcomes in order to obtain a reduced ordered BDD.
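The standard unique-table construction of a reduced ordered BDD makes this correspondence concrete. The sketch below, with assumed names and encodings rather than the thesis tooling, builds the ROBDD for the working example by Shannon expansion with hash-consing and redundant-test elimination; it ends with exactly four decision nodes, mirroring the four SC nodes of the transform.

```python
# Hedged sketch: reduced ordered BDD for
#   f(c1,c2,c3,c4) = (c1 or c2) and (c3 or c4)
# via Shannon expansion over a fixed variable order.

ORDER = ["c1", "c2", "c3", "c4"]
unique = {}        # (var, lo_child, hi_child) -> node id; 0/1 are terminals

def mk(var, lo, hi):
    if lo == hi:                  # rule 3: drop redundant tests
        return lo
    key = (var, lo, hi)
    if key not in unique:         # rules 1-2: hash-consing merges duplicates
        unique[key] = len(unique) + 2
    return unique[key]

def build(f, i=0, env=None):
    env = env or {}
    if i == len(ORDER):
        return 1 if f(env) else 0
    lo = build(f, i + 1, {**env, ORDER[i]: False})
    hi = build(f, i + 1, {**env, ORDER[i]: True})
    return mk(ORDER[i], lo, hi)

f = lambda e: (e["c1"] or e["c2"]) and (e["c3"] or e["c4"])
root = build(f)
# len(unique) == 4: one decision node per condition, matching the four
# SC nodes produced by the short circuit transform
```

Note that this sketch still enumerates assignments internally; the point of the observation in the text is that the transform reaches the same reduced structure directly from the formula.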

Figure 6.5: Working Example Binary CFG

Figure 6.6: Initial BDD

(a) Transformed CFG (b) Binary CFG

Figure 6.7: Relationship between transformed CFG and binary CFG

(a) Removal of redundant leaf nodes (b) Removal of redundant non-leaf nodes

(c) Removal of redundant choice points

Figure 6.8: BDD Reduction Steps

Figure 6.9: Removal of redundant leaf nodes part 2

Figure 6.10: Final Reduced Ordered BDD

(a) Transformed CFG (b) Final BDD

Figure 6.11: Relationship between transformed CFG and final BDD

CHAPTER 7. ISOMORPHISM ALGORITHM TO CHECK CONTROL FLOW EQUIVALENCE

7.1 Introduction

Tsai et al. [53] analyzed the CFGs extracted from obfuscated programs to abstract basic graph transformations that model obfuscation. They also show that various control flow obfuscations can be expressed as a sequence of these basic graph transformations from the CFG of the original program P to the CFG of the obfuscated program O(P) [53]. Their end goal is to quantify the reengineering difficulty due to obfuscation as a graph distance between the CFGs of P and O(P).

We have analyzed the CFGs extracted from binary programs to abstract basic graph transformations that model the compiler transformation from source to binary. We have shown that the compiler transformation can be expressed as a sequence of the basic graph transformations from the CFG of the source program P to the CFG of the binary program B(P). Our end goal now is to show that the transformed CFG is isomorphic to the CFG of B(P) modulo the linear blocks in the binary code (i.e., each linear block in B(P) is mapped to a single node in its CFG).

7.2 Description of the Isomorphic Algorithm

To begin checking for the presence of graph isomorphism between the two CFGs, we start by "coloring" the nodes and creating the attribute vectors as described in Chapter 2. The "colors" in our case are numeric labels for each node in the graph. This is necessary for the isomorphism problem in order to ensure that we have a unifying identifier for each node across the two graphs. To generate and assign the labels, we start by labeling all of the nodes that we know to be present and which are easily identifiable. The order in which these identifying nodes are labeled can be thought of as a "color". With that in mind, we have the following coloring order:

1. Exit Points

2. Root Nodes

3. Loop Headers

4. Loop Back Tails

Both the loop headers and loopback tails are labeled in ascending order of appearance in the code. By labeling these first, we ensure that these key graph structures have the same labels in both the source and binary graphs. Once all of these nodes have been labeled, we are able to label all of the other nodes. This is done in ascending order based on the line numbers of the source code and on line numbers assigned to the binary graphs. The binary graphs have their line numbers assigned upon being imported into our program, based on the input file from Radare.

Once all of the nodes are properly labeled, we begin to create the feature vector for each node. This starts by calculating the depth from the root for each node, which serves two purposes for this checking algorithm. The first, and most important, is that it ensures that the same number of edges are present in both graphs. For example, consider a source node X which has a depth set of {1, 2, 3}. This implies that there are three incoming edges to the node X in the source graph, and we want to ensure that the same holds for both the number of edges and the depths in the binary graph. If this is not the case, then we know something has changed, either intentionally by the compiler or unintentionally. The second thing that the depth set for each node provides is an additional metric to check, which helps strengthen the conclusion that two graphs are in fact isomorphic.

The remaining attributes of this vector are lists of a node's parents and children. These are used to ensure that edges have the same labeled end points during the isomorphism check.
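One plausible reading of the depth-set computation is sketched below: take each node's BFS depth from the root, then record, for every incoming edge (p, n), the value depth(p) + 1. The exact definition in the thesis tooling may differ; the sketch reproduces the property that the depth set of a node tracks its incoming edges.

```python
from collections import deque

# Hedged sketch of depth-set construction over an adjacency-list CFG.

def depth_sets(root, succ):
    depth = {root: 0}                     # BFS (shortest-path) depth
    q = deque([root])
    while q:
        u = q.popleft()
        for v in succ.get(u, []):
            if v not in depth:
                depth[v] = depth[u] + 1
                q.append(v)
    ds = {n: set() for n in depth}
    for u, vs in succ.items():
        for v in vs:
            ds[v].add(depth[u] + 1)       # one entry per incoming edge
    return ds

succ = {"r": ["a", "x"], "a": ["x"], "x": []}
ds = depth_sets("r", succ)
# "x" has two incoming edges, reached at depths 1 (from r) and 2 (from a)
```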

One final check must be performed before we can determine if the two graphs are isomorphic, and that is ensuring the loop headers in the binary graph are properly tagged. While developing this algorithm, we found that nodes originally tagged as loop headers in the binary graphs may not be the true loop headers. To account for this, we verify that each loop header in the binary transformed graph points to the same types of nodes as its respective source loop header. Consider the example of mailboxAlloc.c from XINU. This function's source code can be seen in Listing 7.1. On line 7, we have the start of a for loop where an index variable i is initialized to 0. The loop runs while the value of i is less than NMAILBOX, a constant defined to be 15. We can see the source transformed graph for this function in Figure 7.1 and the binary transformed graph in Figure 7.2, with the labels they would originally have been given. In the source graph, the node labeled 2 is a loop header and the node labeled 3 is a loopback tail, and the same goes for the binary based on the original tags applied to each node.

However, this labeling creates an issue when we check for isomorphism between the two graphs. The check will fail when we get to the node labeled 2, which in the source and binary has the following child and parent sets:

• Source Child Set: {1, 3}
• Source Parent Set: {3}
• Binary Child Set: {3, 4}
• Binary Parent Set: {3}

The mismatched child sets cause the algorithm to fail at node 2. Yet when looking at the source and binary transformed graphs without any labels present, they appear to be isomorphic. Upon checking the initial binary loop header, we see that both of its outgoing edges point to branch nodes, instead of one branch node and the exit as in the source. Just as we have worked to transform the source graph and bring it closer to what we see in the binary, we must now do the same to the binary graph. By ensuring that the binary loops point to the same structures as the source loops, the binary graph is moved closer to the center of that spectrum and the labels are properly applied.

Now, if we mentally move around some of the nodes, we can see that these graphs may actually be isomorphic. We can start by modifying the tags such that node 3 in the binary graph becomes the new loop header and node 2 becomes the new loopback tail. Once we have done this, we can relabel the graph and get what is seen in Figure 7.3. Upon rechecking, the updated binary graph and the original source graph are now isomorphic. Relabeling the graphs and the resultant isomorphism then begs the question "What happened?". It appears to be an optimization that lets the for loop act more like a do-while loop: the index variable is initialized to 0, so we know the comparison with NMAILBOX will always be true for the first iteration. Based on the result of this check, the binary either continues to loop or continues on with the rest of the code. Thus we can see the necessity of checking that a node identified as a loop header in the binary points to the same type of node as in the source.

Upon the completion of checking, and relabeling as necessary, all of the loop headers, we are able to perform the isomorphism check. This entails iterating through all of the nodes in both the source and binary graphs and checking each of the following: a node's children, a node's parents, and a node's depth list. If all of these match, we increment a counter variable; otherwise we break out of the loop, since we have found a point where the graphs are not isomorphic. Finally, we check whether the counter variable equals the number of nodes in each of the graphs; if so, we know our result to be true. Otherwise, we return false as our result. A high level overview of this process can be seen in Algorithm 15.

1 syscall mailboxAlloc(uint count){
2     static uint nextmbx = 0;
3     uint i;
4     struct mbox *mbxptr;
5     int retval = SYSERR;
6     wait(mboxtabsem);
7     for (i = 0; i < NMAILBOX; i++){
8         nextmbx = (nextmbx + 1) % NMAILBOX;
9         mbxptr = &mboxtab[nextmbx];
10        if (MAILBOX_FREE == mbxptr->state){
11            mbxptr->msgs = memget(sizeof(int) * count);
12            if (SYSERR == (int)mbxptr->msgs){
13                break;
14            }
15            mbxptr->count = 0;
16            mbxptr->start = 0;
17            mbxptr->max = count;
18            mbxptr->sender = semcreate(count);
19            mbxptr->receiver = semcreate(0);
20            if ((SYSERR == (int)mbxptr->sender) ||
21                (SYSERR == (int)mbxptr->receiver)){
22                memfree(mbxptr->msgs, sizeof(int) * (mbxptr->max));
23                semfree(mbxptr->sender);
24                semfree(mbxptr->receiver);
25                break;
26            }
27            mbxptr->state = MAILBOX_ALLOC;
28            retval = nextmbx;
29            break;
30        }
31    }
32    signal(mboxtabsem);
33    return retval;
34 }

Listing 7.1: mailboxAlloc.c Source Code

Figure 7.1: mailboxAlloc.c Source Transformed Graph

Figure 7.2: mailboxAlloc.c Binary Transformed Graph

Figure 7.3: mailboxAlloc.c Updated Binary Transformed Graph

Algorithm 15: Isomorphic Checking Algorithm(bin-CFG, src-CFG)
Input: Binary and source CFGs to check
Output: Isomorphism result
 1 /* Handle the case of linear functions */
 2 if binCFG.BranchNodes == 0 && srcCFG.BranchNodes == 0 then
 3     return True
 4 Label the source and binary exit points
 5 Sort source and binary loop headers in ascending order of appearance in code
 6 for Node l : loopHeaders do
 7     Label l with the current label
 8 Sort source and binary loopback tails in ascending order of appearance in code
 9 for Node t : loopbackTails do
10     Label t with the current label
11 Label the source and binary root nodes
12 Label all other nodes in the source and binary CFGs
13 Calculate node depths for all nodes in source and binary CFG
14 Check to make sure binary loop headers are properly tagged and labeled
15 boolean result = false
16 int isoCounter = 0
17 for Node s : srcCFG and Node b in binCFG do
18     boolean childResult = s.Children.containsAll(b.Children)
19     boolean parentResult = s.Parents.containsAll(b.Parents)
20     boolean depthResult = s.Depths.containsAll(b.Depths)
21     if childResult && parentResult && depthResult then
22         isoCounter += 1
23     else
24         break
25 if isoCounter == srcCFG.nodes == binCFG.nodes then
26     result = true
27 return result
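The comparison loop at the heart of Algorithm 15 can be sketched as follows, with each node's feature vector reduced to a (children, parents, depths) tuple of sets keyed by label. The dict encoding is an illustrative assumption, not the thesis's data structure.

```python
# Sketch of Algorithm 15's comparison loop: every label must have the
# same child, parent, and depth sets in both labeled graphs.

def check_iso(src, binry):
    if set(src) != set(binry):
        return False                      # differing labels or node counts
    iso_counter = 0
    for label in src:
        if src[label] == binry[label]:    # children, parents, depths agree
            iso_counter += 1
        else:
            break                          # first mismatch: not isomorphic
    return iso_counter == len(src)

# label -> (children, parents, depths) for a tiny 3-node chain
g = {2: ({1}, set(), {0}),                # root
     1: ({0}, {2}, {1}),
     0: (set(), {1}, {2})}                # exit
assert check_iso(g, dict(g))
```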

CHAPTER 8. RESULTS

In this chapter, we evaluate the effectiveness of our algorithm against the XINU operating system compiled for the MIPS and PowerPC 32-bit instruction sets using the GCC compiler.

This allows us to see how effectively we can identify whether the source and binary CFGs of the same functions are isomorphic across two different architectures, with their respective instructions and optimizations. The source CFGs were obtained using Atlas [2], a software analysis and visualization platform, and the binary CFGs were obtained using Radare [3], a binary analysis framework. We first explain the different types of isomorphism that we check for after applying our transforms to the source and binary CFGs and running our checking algorithm. We then discuss the results found as well as their implications.

8.1 Categorization using Isomorphism

We have already seen at a high level what the graph isomorphism problem entails in Chapters 1 and 2. For the purposes of these results, we refine the problem by defining different categories of graph isomorphism. The categories are as follows:

• N-Iso: Non-Isomorphic
• A-Iso: Architecturally-Isomorphic
• L-Iso: Linear-Isomorphic
• G-Iso: Graph-Isomorphic

Every function that we analyze will fall into one of the four categories listed above. These categories allow us to have a more detailed evaluation of our algorithm and consider cases such as the presence of architecture-specific instructions. In the following subsections, we describe what would cause a function to fall into a specific category and provide examples from each category.

8.1.1 G-Iso Example

Within the G-Iso category, we find functions whose final transformed source and binary graphs are isomorphic in the original sense of the graph isomorphism problem. For our purposes, a function falls into this category if nodes with corresponding labels in the source and binary transformed graphs have the same labeled children, parents, and depth lists. In Figure 8.1 we present an example for this category from the function getnet.c in XINU. As can be seen from both the shape of the graphs and the labels on each node, the graphs are clearly isomorphic. Applying the checking algorithm to these graphs shows that the children, parents, and depths are identical for each node in the source and binary.

(a) Source Transformed Graph (b) MIPS Transformed Graph (c) PPC Transformed Graph

Figure 8.1: Transformed Graphs for getnet.c

8.1.2 L-Iso Example

Functions categorized as L-Iso are ones where both the source and binary CFGs consist solely of linear code segments. There is no need to perform an additional transform on these graphs, since we can intuitively condense or expand them so that both have the same number of nodes and edges. Figure 8.2 shows the function x_reboot.c from XINU, which our algorithm places into the L-Iso category because the only nodes in the graphs are linear code blocks.

(a) Source CFG (b) MIPS CFG (c) PPC CFG

Figure 8.2: CFGs for x_reboot.c

8.1.3 A-Iso Example

The idea behind this category is to account for architecture-specific instructions, which may enable different optimizations for a given instruction set. For example, this could include instructions which allow for data flow optimizations that end in the consolidation of branch nodes. Figure 8.3 shows this happening for the function rarp_in.c from XINU. In the graphs, we can see that there is one less branch node in both of the binary graphs than in the source graph. This is because nodes 4 and 5 in the source graph can be condensed into a single branch node. Both instruction sets have the ability to check that the pid variable is within the specified range via multiple non-branch instructions. The specific instructions occurring in each binary can be seen in Figure 8.4. Both MIPS and PPC require looking at data that we do not necessarily have access to at the time of CFG generation. This data and the subsequent manipulation allow for the removal of the lower bound check done in node 4 of the source transformed graph. The data is then used in the lines sltiu v0, v0, 9 and cmplwi cr7, r0, 9 for MIPS and PPC respectively. These architecture-specific instructions perform a logical comparison of data and store the result in a given register. In the case of MIPS, we check if v0 < 9; if true, v0 is set to 1, and to 0 otherwise. For PPC, we check if r0 < 9, and if so, cr7 is set to the decimal value 4. Based on these results, the respective binary CFGs branch to the correct true or false destination.
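This consolidation matches the classic unsigned range-check idiom: a two-sided signed check collapses to one unsigned comparison, which is what the single sltiu/cmplwi realizes. The sketch below assumes a lower bound of 0 for the pid check (the thesis does not state the bounds explicitly) and uses a 32-bit mask to emulate unsigned register arithmetic.

```python
# Hedged sketch of the branch consolidation: a negative pid wraps to a
# large unsigned value, so one unsigned compare rejects it and no
# separate lower-bound branch is needed.

MASK = 0xFFFFFFFF                  # emulate a 32-bit register

def two_checks(pid):
    return 0 <= pid < 9            # source-level pair of branches

def one_check(pid):
    return (pid & MASK) < 9        # single unsigned compare (sltiu/cmplwi)

assert all(two_checks(p) == one_check(p) for p in range(-100, 100))
```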

(a) Source Transformed Graph (b) MIPS Transformed Graph (c) PPC Transformed Graph

Figure 8.3: Transformed Graphs for rarp_in.c

8.1.4 N-Iso Example

This category serves as a catch-all for functions that cannot be placed in one of the previous three categories. Something within these functions causes the labeling to differ between the source final transformed graph and the binary final transformed graph. This leads to a difference in each node's parent, children, and depth lists, which are used to verify whether the graphs are isomorphic. Figure 8.5 shows an example, the function dgread.c, which our algorithm categorizes as N-Iso. At compile time, the compiler reduces the number of branch nodes and edges present and modifies the children of some of the branch nodes. The missing node causes an issue with the labeling.

(a) MIPS Branch Consolidation (b) PPC Branch Consolidation

Figure 8.4: MIPS and PPC Code for rarp in.c Branch Consolidation

Lastly, there are architecture-specific instructions found in each graph; however, there are too many of them for us to definitively say that the graphs are architecturally isomorphic.

8.2 XINU Results

The following results were generated by running our algorithm against a set of 212 functions from the XINU operating system. Table 8.1 shows the number of functions in each category defined in Section 8.1. We found that the same functions appeared in the G-Iso and L-Iso categories for both instruction sets. This is an important finding: it suggests that, at its core, the algorithm is architecture independent.

Table 8.1: XINU Isomorphism Results

Category  MIPS  PPC
N-Iso     65    55
A-Iso     65    75
L-Iso     25    25
G-Iso     57    57

(a) Source Transformed Graph (b) MIPS Transformed Graph (c) PPC Transformed Graph

Figure 8.5: Transformed Graphs for dgread.c

The primary differences in results occur in the A-Iso and N-Iso categories. The difference in A-Iso arises because we have not exhausted all possible architecture-specific instructions that could cause a change in control flow. The difference in N-Iso arises from the need to account for other compiler optimizations (e.g., definition-use data flow chains) and further refinement of the source CFG creation. Even with these differences, there is still overlap in these categories, as can be seen in Table 8.2. In this table, we treat each category as a set of functions and create a numerical Venn diagram to see the relations. What this table shows is that, of the functions not categorized as L-Iso or G-Iso, over half fall into the same category for both MIPS and PPC. That is, a function which is not in L-Iso or G-Iso may be in N-Iso for both MIPS and PPC. These intersections show that, at its core, the algorithm generalizes across instruction sets. Looking at the N-Iso, L-Iso, and G-Iso intersections for both instruction sets, the same 57% of functions fall into the same category for MIPS and PPC.

Table 8.2: XINU Overlap Results

Category                Number of Functions
MIPS N-Iso ∪ PPC N-Iso  86
MIPS N-Iso ∩ PPC N-Iso  39
MIPS A-Iso ∪ PPC A-Iso  91
MIPS A-Iso ∩ PPC A-Iso  49
MIPS L-Iso ∩ PPC L-Iso  25
MIPS G-Iso ∩ PPC G-Iso  57

8.3 Discussion of Results

The results of this algorithm on the XINU codebase are very promising and show that this approach works for solving the verification problem. This is evident in the G-Iso and L-Iso categories alone. As for the A-Iso and N-Iso categories, moving those functions into the G-Iso category would require addressing at least one of the following for the source graph of each function:

1. Compiler optimizations: The transforms developed in this research focus solely on control structure changes that are made by the compiler due to the semantics of a given language. This does not include any changes that the compiler makes by looking at how variables and data are used throughout a given piece of code. An example of this can be seen in Figure 8.6 for the function xsh_date.c. In this function, the variable nargs is used in multiple conditions, with node 2 being the first. If node 2 evaluates to true, there is no need to check nodes 4 and 5 after node 3, since we see from the original source CFG that nodes 2 and 3 make up a complex condition. This is reflected in the binary, where both edges from node 4 point to the exit. In order to develop a transform that can accommodate this optimization, further work would be needed that looks at the variables used in the branch nodes and how they are evaluated.

2. Architecture specific instructions and optimizations: If this research were expanded to include more instruction sets, there may be a need to develop other architecture-specific transforms. Additionally, we would need to exhaust all possible architecture-specific instructions and their impact on the control flow of a function. Lastly, this category includes developing things like a source CFG transform to handle cases where a conditional is condensed in the binary, which would in turn move the function into the G-Iso category.

3. Source CFG refinement: Functions in this category are ones where more refinement is needed in the generation of the source transformed graph, for things such as macros or static function calls being used in a branch node comparison check. These refinements would help move graphs into the G-Iso category.

What we have built and shown here is a solid foundational algorithm onto which these extensions can be added. Addressing the above points is a matter of writing code to complete the necessary transforms to the source graphs. Then, in theory, we should see the N-Iso counts come down to 0 for all architectures. If this does not happen, then we know that something else is

(a) Source Original CFG

(b) Source Transformed Graph (c) Binary Transformed Graph

Figure 8.6: Data Flow Optimization Example

happening: either we have a malicious binary causing the source and binary graphs not to match, or the tools being used are not accurate and contain errors. Both are beneficial pieces of knowledge to have.

Even though we have not yet engineered the transforms that address the above challenges, we are able to identify the points within each A-Iso and N-Iso graph where the algorithm fails. This is possible because the single exit point that is established makes it easy to reverse the source and binary graphs. Analogous to the idea behind the max flow-min cut theorem, we try to check as many nodes as possible starting from the root and going forward. If this fails, we do the same from the exit. The root in our case acts as the source and the exit as the sink from the max flow-min cut theorem. Our algorithm differs slightly in that we do not want to find a cut, since a cut would imply that the graphs are not isomorphic.

The ability to identify the cut point is still insightful for a human performing this analysis, since it allows them to know immediately where to look for discrepancies between the source and binary graphs. Performing this isomorphism checking is challenging on small graphs, even by hand, and quickly becomes impractical to attempt manually as the graphs get larger. Take the example of dgread.c from Section 8.1.4. It is possible to identify by hand where the isomorphism fails just by looking at the graphs. However, it is easier if the failure is identified and highlighted for you, so you can spend more time figuring out why the isomorphism fails as opposed to where it fails. Instead of treating a symptom of the failure, you are able to treat the disease causing the symptom. This identification can be seen in Figure 8.7. In these figures, the green node is where our isomorphism-checking algorithm fails while traversing forward from the entry as the root. The yellow node is where the algorithm fails while traversing the reverse graph forward, that is, with all of the edges reversed and starting from the exit. By traversing the graph bottom-up in this reverse fashion, we can see whether the isomorphism failure has propagated to the rest of the graph or is confined to an isolated section. As seen in Figure 8.7, the failure is in an isolated section. Now consider the situation in which you are an analyst looking at these graphs. You will be much more efficient if you know exactly where to look in the source code for a bug than if you have to perform these steps by hand. This is compounded by the fact that you already know there is a difference based on the output of our algorithm.

(a) Source Colored Graph (b) MIPS Colored Graph (c) PPC Colored Graph

Figure 8.7: Colored Graphs for dgread.c

CHAPTER 9. CONCLUSION AND FUTURE RESEARCH DIRECTIONS

The ability to establish trust in software that you have not written is not an easy task. Thompson elegantly conveyed why trust cannot be implicit from the start with software. Instead, it must be established, and that requires blending theoretical and applied concepts. We have seen the previous work that addresses some of the issues around establishing this trust, including enforcing CFI and model checking the source code, as well as different methods for extracting a CFG from a given binary. However, until now there has been no method for comparing the source and binary CFGs to establish CFE.

In this work we have presented our algorithm to establish CFE between a given source and binary CFG. Our algorithm utilizes feature vectors from each CFG, which are composed of distinguishing characteristics such as loopback edges, loop headers, and branch points. These vectors help to define the minimum set of nodes and edges required to represent the control structure. Additionally, we have developed structure-preserving transforms for the CFGs to bring the source CFG and vector closer to the binary CFG and vector. Lastly, we designed and implemented an algorithm that applies a color-preserving isomorphic label to each of the vertices in the source and binary vectors. The color labeling is performed using the distinguishing features that comprise the vectors, which allows us to map a vertex in the source to a vertex in the binary.
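As a rough illustration of the kind of feature labeling summarized here, branch points and loop headers can be recovered from the back edges found by a depth-first search over the CFG. This is a minimal sketch under assumed conventions (a reducible CFG as a successor dict, illustrative label names, loop headers taking priority over branch labels), not the thesis implementation:

```python
def classify_nodes(g, entry):
    """Label each CFG node with a distinguishing feature of the kind the
    feature vectors use: 'loop-header' (target of a back edge), 'branch'
    (out-degree >= 2), else 'plain'. Back edges are detected during DFS
    as edges into a node still on the DFS stack. Illustrative sketch."""
    back_edges, state = [], {}        # state: 0 = on DFS stack, 1 = finished
    def dfs(v):
        state[v] = 0
        for w in g.get(v, []):
            if w not in state:
                dfs(w)
            elif state[w] == 0:       # successor still on stack: loopback edge
                back_edges.append((v, w))
        state[v] = 1
    dfs(entry)
    headers = {w for (_, w) in back_edges}
    labels = {}
    for v in g:
        if v in headers:
            labels[v] = 'loop-header'     # header wins even if it also branches
        elif len(g.get(v, [])) >= 2:
            labels[v] = 'branch'
        else:
            labels[v] = 'plain'
    return labels, back_edges
```

On a simple loop `0 -> 1 -> {2, 3}` with back edge `2 -> 1`, node 1 is labeled a loop header; such labels are what the color-preserving matching compares across the source and binary graphs.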

We evaluated our transforms and algorithms against the XINU operating system, which comprised 212 functions with a source and corresponding binary CFG. These functions were placed into different isomorphic categories in order to better understand our results. From this analysis, we saw that our algorithm was able to determine that CFE holds for 82 of those functions, which did not contain any architecture-specific instructions. Additionally, a total of 49 functions were found to be in the A-Iso category for each of the instruction sets. All of these results help to show the effectiveness of our algorithm in determining the equivalence of two CFGs and whether or not you can ultimately trust the given piece of code.

9.1 Future Research Directions

As this portion of our work comes to a conclusion, there are some interesting open questions and potential for future research. There are two primary areas we would be able to expand on in the future. The first, from an engineering point of view, is developing the following transforms:

• Compiler and data flow optimization transforms for the source CFG

• Architecture-specific instruction transforms for optimizations that cause node consolidations and differing edge destinations

These transforms could be applied immediately to the instruction sets we have tested against, all with the goal of trying to zero out the N-Iso category. Additionally, it would be interesting to see how this algorithm performs on even more instruction sets such as x86, ARM, and others. Doing so would also reveal other architecture-specific transforms that may be needed. All of these can be considered engineering expansions of the tool we have already created to implement our final algorithm.

The second area we would be able to expand upon in the future is research-related questions. Specifically, the following questions, with this algorithm at the core of each, would be interesting to pursue:

• Can this algorithm, and the ability to model the source code in the manner in which it would be executed on a machine, be used to prevent vulnerable code from being released into the wild? If so, how? This question comes down to trying to determine whether there are vulnerabilities that we are not necessarily able to see in the original source-level control structure, but that we would be able to see in the modified source control structure.

• Does this approach allow us to be more effective at identifying a piece of malware, either at the source or the binary level? With this question, we try to determine how this algorithm and graph-theoretic approach can be used to better defend the software and systems we use. Attackers only need to be right once with their malware, so it is in our best interest to always ensure that we have the best defenses possible.

Both of these areas of work would allow this algorithm and the tools behind it to become even more effective at the work they have already shown to be possible.

BIBLIOGRAPHY

[1] Apache benchmark. http://httpd.apache.org/docs/current/programs/ab.html.

[2] Atlas. http://www.ensoftcorp.com/atlas/.

[3] Radare. https://www.radare.org/n/.

[4] Rose: An open-source compiler infrastructure maintained by the Lawrence Livermore National Lab. http://rosecompiler.org/ROSE HTML Reference/index.html.

[5] Specint benchmark. https://en.wikipedia.org/wiki/SPECint.

[6] Abadi, M., Budiu, M., Erlingsson, U., and Ligatti, J. Control-flow integrity. In Proceedings of the 12th ACM Conference on Computer and Communications Security (2005), CCS '05, Association for Computing Machinery, pp. 340–353.

[7] Abadi, M., Budiu, M., Erlingsson, U., and Ligatti, J. Control-flow integrity principles, implementations, and applications. ACM Trans. Inf. Syst. Secur. 13, 1 (Nov. 2009).

[8] Allen, F. E. Control flow analysis. SIGPLAN Not. 5, 7 (July 1970), 1–19.

[9] Alrabaee, S., Shirani, P., Wang, L., and Debbabi, M. Sigma: A semantic integrated graph matching approach for identifying reused functions in binary code. Digital Investigation 12 (2015), S61–S71.

[10] Amighi, A., de Carvalho Gomes, P., and Huisman, M. Provably correct control-flow graphs from java programs with exceptions. In Papers of the 2nd International Conference on Formal Verification of Object-Oriented Software, FoVeOOS'11 (Oct. 2011), no. 26 in Karlsruhe Reports in Informatics, Karlsruhe Institute of Technology, pp. 31–48.

[11] Babai, L. Monte-carlo algorithms in graph isomorphism testing. Université de Montréal Technical Report, DMS, 79-10 (1979).

[12] Babai, L. Graph isomorphism in quasipolynomial time. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing (2016), pp. 684–697.

[13] Balakrishnan, G., Reps, T., Melski, D., and Teitelbaum, T. Wysinwyx: What you see is not what you execute. In Verified Software: Theories, Tools, Experiments (2007).

[14] Bardin, S., Herrmann, P., and Vedrine,´ F. Refinement-based cfg reconstruction from unstructured programs. In Verification, Model Checking, and Abstract Interpretation (2011), Springer Berlin Heidelberg, pp. 54–69.

[15] Besson, F., Jensen, T., Le Métayer, D., and Thorn, T. Model checking security properties of control flow graphs. Journal of Computer Security 9, 3 (July 2001), 217–250.

[16] Blazy, S., and Trieu, A. Formal verification of control-flow graph flattening. pp. 176–187.

[17] Burow, N., Carr, S. A., Nash, J., Larsen, P., Franz, M., Brunthaler, S., and Payer, M. Control-flow integrity: Precision, security, and performance.

[18] Canfora, G., and Di Penta, M. New frontiers of reverse engineering. pp. 326 – 341.

[19] Carlini, N., Barresi, A., Payer, M., Wagner, D., and Gross, T. R. Control-flow bending: On the effectiveness of control-flow integrity. In 24th USENIX Security Symposium (USENIX Security 15) (Aug. 2015), USENIX Association, pp. 161–176.

[20] Carter, L., Ferrante, J., and Thomborson, C. Folklore confirmed: Reducible flow graphs are exponentially larger. In Proceedings of the 30th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (2003), Association for Computing Machinery, pp. 106–114.

[21] Cifuentes, C., and Gough, K. J. Decompilation of binary programs. Software: Practice and Experience 25, 7 (1995), 811–829.

[22] Cifuentes, C., and Van Emmerik, M. Recovery of jump table case statements from binary code. Science of Computer Programming 40, 2 (2001), 171–188. Special Issue on Program Comprehension.

[23] Cooper, C. Who says paranoia doesn’t pay off?, September 2002. https://www.cnet.com/news/who-says-paranoia-doesnt-pay-off/.

[24] Deering, T., Kothari, S., Sauceda, J., and Mathews, J. Atlas: A new way to explore software, build analysis tools. In Companion Proceedings of the 36th International Conference on Software Engineering (2014), Association for Computing Machinery, pp. 588–591.

[25] Ding, R., Qian, C., Song, C., Harris, B., Kim, T., and Lee, W. Efficient protection of path-sensitive control security. In 26th USENIX Security Symposium (USENIX Security 17) (Aug. 2017), USENIX Association, pp. 131–148.

[26] Eagle, C. The IDA Pro book, 2011.

[27] Ge, J., Chaudhuri, S., and Tyagi, A. Control flow based obfuscation. In Proceedings of the 5th ACM Workshop on Digital Rights Management (2005), Association for Computing Machinery, pp. 83–92.

[28] Ghaffarinia, M., and Hamlen, K. W. Binary control-flow trimming. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security (2019), pp. 1009–1022.

[29] Grohe, M., and Schweitzer, P. The graph isomorphism problem. Commun. ACM 63, 11 (Oct. 2020), 128–134.

[30] Hogan, A. Skolemising blank nodes while preserving isomorphism. In Proceedings of the 24th International Conference on World Wide Web (2015), WWW ’15, International World Wide Web Conferences Steering Committee, pp. 430–440.

[31] Jaffke, L., and de Oliveira Oliveira, M. On weak isomorphism of rooted vertex-colored graphs. In Graph-Theoretic Concepts in Computer Science (2018), A. Brandstädt, E. Köhler, and K. Meer, Eds., Springer International Publishing, pp. 266–278.

[32] Karger, P., and Schell, R. Thirty years later: Lessons from the Multics security evaluation.

[33] Karger, P., and Schell, R. Multics security evaluation: Vulnerability analysis. ESD-TR-74-193 2 (1974). http://csrc.nist.gov/publications/history/karg74.pdf.

[34] Khoo, W. M., Mycroft, A., and Anderson, R. Rendezvous: A search engine for binary code. In Proceedings of the 10th Working Conference on Mining Software Repositories (2013), MSR '13, IEEE Press, pp. 329–338.

[35] Kinder, J. Static Analysis of x86 Executables. PhD thesis, Technische Universität, November 2010.

[36] Kinder, J., and Veith, H. Jakstab: A static analysis platform for binaries. In Computer Aided Verification (Berlin, Heidelberg, 2008), A. Gupta and S. Malik, Eds., Springer Berlin Heidelberg, pp. 423–427.

[37] Kinder, J., Zuleger, F., and Veith, H. An abstract interpretation-based framework for control flow reconstruction from binaries. In Verification, Model Checking, and Abstract Interpretation (Berlin, Heidelberg, 2009), N. D. Jones and M. Müller-Olm, Eds., Springer Berlin Heidelberg, pp. 214–228.

[38] Knuth, D. E. Structured programming with go to statements. ACM Comput. Surv. 6, 4 (Dec. 1974), 261–301.

[39] Koret, J. Histories of comparing binaries with source codes, Aug. 2018.

[40] Kruegel, C., Robertson, W., Valeur, F., and Vigna, G. Static disassembly of obfuscated binaries. In 13th USENIX Security Symposium (USENIX Security 04) (Aug. 2004), USENIX Association.

[41] Lal, A., and van Melkebeek, D. Graph isomorphism for colored graphs with color multiplicity bounded by 3. Tech. rep., Department of Computer Sciences, University of Wisconsin-Madison, 2005.

[42] Linn, C., and Debray, S. Obfuscation of executable code to improve resistance to static disassembly. In Proceedings of the 10th ACM Conference on Computer and Communications Security (2003), Association for Computing Machinery, pp. 290–299.

[43] Luks, E. M. Isomorphism of graphs of bounded valence can be tested in polynomial time. Journal of computer and system sciences 25, 1 (1982), 42–65.

[44] Mathon, R. A note on the graph isomorphism counting problem. Information Processing Letters 8, 3 (1979), 131–136.

[45] McKay, B. D., et al. Practical graph isomorphism.

[46] Minh Hai, N., Nguyen, B., Quan, T., and Ogawa, M. A hybrid approach for control flow graph construction from binary code. pp. 159–164.

[47] Nagarajan, V., Gupta, R., Zhang, X., Madou, M., de Sutter, B., and de Bosschere, K. Matching control flow of program versions. In 2007 IEEE International Conference on Software Maintenance (2007), pp. 84–93.

[48] Piccolboni, L., Menon, A., and Pravadelli, G. Efficient control-flow subgraph matching for detecting hardware trojans in rtl models. ACM Trans. Embed. Comput. Syst. 16, 5s (Sept. 2017).

[49] Riesen, K., Jiang, X., and Bunke, H. Exact and Inexact Graph Matching: Methodology and Applications. Springer US, Boston, MA, 2010, pp. 217–247.

[50] Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K., and Borgwardt, K. M. Weisfeiler-lehman graph kernels. J. Mach. Learn. Res. 12 (Nov. 2011), 2539–2561.

[51] Theiling, H. Extracting safe and precise control flow from binaries. In Proceedings Seventh International Conference on Real-Time Computing Systems and Applications (2000), pp. 23–30.

[52] Thompson, K. Reflections on trusting trust. Turing Award Lecture, Communications of the ACM 27, 8 (1984), 761–763.

[53] Tsai, H., Huang, Y., and Wagner, D. A graph approach to quantitative analysis of control-flow obfuscating transformations. IEEE Transactions on Information Forensics and Security 4, 2 (2009), 257–267.

[54] Udupa, S. K., Debray, S. K., and Madou, M. Deobfuscation: reverse engineering obfuscated code. In 12th Working Conference on Reverse Engineering (WCRE’05) (2005), pp. 10–54.

[55] Wheeler, D. A. Countering trusting trust through diverse double-compiling. In 21st Annual Computer Security Applications Conference (ACSAC’05) (2005), pp. 13–48.

[56] Xu, L., Sun, F., and Su, Z. Constructing precise control flow graphs from binaries.

[57] Xu, M., Wu, L., Qi, S., Xu, J., Zhang, H., Ren, Y., and Zheng, N. A similarity metric method of obfuscated malware using function-call graph. Journal of Computer Virology and Hacking Techniques 9 (02 2013).

[58] Xu, X., Ghaffarinia, M., Wang, W., Hamlen, K. W., and Lin, Z. CONFIRM: Evaluating compatibility and relevance of control-flow integrity protections for modern software. In 28th USENIX Security Symposium (USENIX Security 19) (Aug. 2019), USENIX Association, pp. 1805–1821.

[59] Yin, W., Jiang, L., Yin, Q., Zhou, L., and Li, J. A control flow graph reconstruction method from binaries based on xml. In 2009 International Forum on Computer Science-Technology and Applications (2009), vol. 2, pp. 226–229.

[60] Zhang, M., and Sekar, R. Control flow integrity for COTS binaries. In 22nd USENIX Security Symposium (USENIX Security 13), USENIX Association, pp. 337–352.