secuBT: Hacking the Hackers with User-Space Virtualization

Mathias Payer Department of Computer Science ETH Zurich

Abstract

In the age of coordinated malware distribution and zero-day exploits security becomes ever more important. This paper presents secuBT, a safe execution framework for untrusted binary code based on the fastBT dynamic binary translator.

secuBT implements user-space virtualization using dynamic binary translation and adds a system call interposition framework to limit and guard the interoperability of binary code with the kernel.

Fast binary translation is a key component of user-space virtualization. secuBT uses and extends fastBT, a generator for low-overhead, table-based dynamic (just-in-time) binary translators. We discuss the most challenging sources of overhead and propose optimizations to further reduce these penalties. We argue for hardening techniques to ensure that the translated program cannot escape from the user-space virtualization.

An important feature of secuBT is that only translated code is executed. This ensures code validity and makes it possible to rewrite individual instructions. The system call interposition framework validates every system call and offers the choice to (i) allow it, (ii) abort the program, or (iii) redirect it to a user-space emulation.

1 Introduction

In a world of increasing software complexity and diversity it is important to sandbox and virtualize running applications. Staying up-to-date on all software applications and libraries is hard but necessary. On the other hand, patches as well as malware discovery are reactive. secuBT is a step towards proactive security and fault detection.

Two interesting scenarios are (i) sandboxing of server processes, and (ii) execution of untrusted code. Server daemons that offer an interface to the Internet are under constant attack. If these daemons are virtualized and restricted to only specific system calls and parameters then an exploit is unable to escalate privileges and execute unintended code. The second scenario targets plugins and other downloaded untrusted code that a user would like to run in a special sandbox. Whenever the program issues a system call, the user can decide if it is allowed or not. The sandbox limits the possible interactions between the virtualized program and the operating system. System calls are checked depending on name, parameters, and call location.

Such a sandbox could be implemented as a kernel module. This would require additional code that runs in the kernel itself, which poses an additional security risk. User-space virtualization uses binary translation to sandbox processes. The virtualization layer controls all instructions that can be executed by the virtualized program. System calls are rewritten by the translation system so that an authorization function is executed first. This authorization decides whether a system call is (i) allowed, (ii) aborted, or (iii) redirected to a sandbox-internal function that emulates the system call in user-space. To the virtualized application it appears as if the system call is executed directly.

We propose a sandbox that is built on a dynamic binary translation (BT) system. Using BT ensures that all executed code is translated first, and therefore a program cannot escape the sandbox. If the translator encounters unsafe or invalid code it aborts the program. This builds a safe execution framework that validates executed code. The translator can rewrite individual instructions, e.g. redirect system calls to handler functions.
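The translate-before-execute idea can be illustrated with a small sketch. It assumes a hypothetical one-byte toy instruction set rather than x86; the opcode names, the table layout, and the function translate_block are inventions for this example and are not taken from secuBT or fastBT. Safe instructions are copied, the system call instruction is rewritten into a call to a handler, and anything else aborts the translation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-byte toy opcodes; real x86 decoding is far more involved. */
enum toy_op {
    OP_NOP = 0x00, OP_ADD = 0x01, OP_SYSCALL = 0x02,
    OP_HLT = 0x03, OP_RET = 0x04,
    OP_CALL_HANDLER = 0x80   /* only ever emitted by the translator */
};

enum action { ACT_ABORT = 0, ACT_COPY, ACT_REWRITE_SYSCALL, ACT_END_BLOCK };

/* Opcode table: every opcode that is not listed defaults to ACT_ABORT (0). */
static const enum action opcode_table[256] = {
    [OP_NOP]     = ACT_COPY,
    [OP_ADD]     = ACT_COPY,
    [OP_SYSCALL] = ACT_REWRITE_SYSCALL,
    [OP_RET]     = ACT_END_BLOCK
    /* OP_HLT stands in for a privileged instruction and stays ACT_ABORT */
};

/*
 * Translates one block from src into cache.
 * Returns the number of bytes emitted, or -1 if an unsafe or unknown
 * instruction is found or the cache buffer is too small.
 */
long translate_block(const uint8_t *src, size_t src_len,
                     uint8_t *cache, size_t cache_len)
{
    assert(src != NULL && cache != NULL);
    size_t out = 0;
    for (size_t i = 0; i < src_len; i++) {
        enum action act = opcode_table[src[i]];
        if (act == ACT_ABORT || out >= cache_len)
            return -1;                      /* caller terminates the program */
        if (act == ACT_REWRITE_SYSCALL)
            cache[out++] = OP_CALL_HANDLER; /* redirect to a handler */
        else
            cache[out++] = src[i];          /* identity translation */
        if (act == ACT_END_BLOCK)
            break;                          /* stop at the block boundary */
    }
    return (long)out;
}
```

In this sketch the default table entry is the abort action, so an opcode the table does not know is rejected rather than copied; this default-deny choice is an assumption of the example.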

Section 2 covers the design of the encapsulation framework and implementation of the binary translator fastBT [14]. Section 3 elaborates the security extension and the system call interposition framework. The system is evaluated in Section 4, followed by related work and our conclusion.

2 fastBT: Dynamic Binary Translation

Fast binary translation is the key component to implement user-space virtualization. The dynamic translation system checks and verifies every machine code instruction before it is executed. Static binary translation does not suffice because it is unable to detect hidden code or malicious code that targets the static binary translator itself. Direct control transfers are translated and redirected to the code cache. Indirect control transfers are translated into an online lookup and dispatch to guarantee that only translated branch targets are reached. The translation system can change, adapt, or remove any invalid instruction. Using the translation system, system calls are rewritten and redirected to an interposition system.

To limit the overhead of binary translation the translation process must be fast and must produce competitive code. This section presents implementation details of fastBT [14], a fast and flexible binary translator that is used to implement the executable space protection and system call interposition. An important feature of fastBT is that the return addresses on the stack remain unchanged. This adds complexity to return instructions as they are translated to a lookup and an indirect control transfer. An unchanged stack has the following advantages: (i) the program can validate its own call stack for debugging, (ii) exception handling uses return addresses to escape to the correct frame, (iii) the program does not know that it is translated, and (iv) the address of the code cache is hidden from the program.

2.1 Basic Translator

The translator processes basic blocks of the original program, places them in the code cache and adds entries to the mapping table. The mapping table is used to map between program locations in the original program and the code cache. This translation process ensures that the execution flow always stays in the code cache. If the program is about to branch to an untranslated block of code then the translator is invoked to translate that block. The translated block is then placed in the code cache and the execution of the program resumes at the newly translated block in the code cache. See Figure 1 for an overview of such a generic translator.

Figure 1: Runtime layout of the binary translator. Labeled elements: original program, translator, opcode table, mapping, trampoline to translate, code cache.

2.1.1 Translation Tables

The basic translation engine is a simple table-based iterator. When the translator is invoked with a pointer to a basic block, it first adds a reference to the mapping table from the original program location to the location in the code cache. The iterator then loops through the basic block one instruction at a time.

To decode the variable-length x86 instructions a large multidimensional translation table is constructed that encodes all possible combinations of machine code instructions and parameters. Starting with a base table each byte of the current instruction is checked. If the decoding of the instruction is not finished in the current table, a pointer redirects to the next table and the decoding process continues with the next byte. This process is repeated until the instruction is completely decoded.

As a next step the instruction and its arguments are passed to a corresponding action function that handles the translation of this particular instruction. Action functions can generate arbitrary code, or alter, copy, or remove the instruction. Generated code is emitted into the code cache.

The translation process stops at recognizable basic block (BB) boundaries like branches or return instructions. Some BB boundaries, like the targets of backward jumps, are not recognizable; in such a case a part of the BB will be translated a second time.

At the end of a BB the translator checks if the outgoing edges are already translated, and adds jumps to the translated targets. If a target is not translated, the translator builds a trampoline that starts the translation engine for the corresponding BB, and adds a jump to the trampoline.

2.1.2 Predefined Actions

The translator needs different action functions to support the identity transformation. All safe instructions that are executable in user-space are copied to the code cache. Privileged instructions (i.e., all instructions that are not allowed to execute in user-space) are intercepted and the program is aborted; otherwise the kernel would signal a segmentation fault.

Special care is needed for control instructions. These are handled by specific action functions that check the target of the branch instruction. These action functions emit extra code that gives the illusion that the original code is executed.

• The call action handler emits code that saves the correct instruction pointer of the original location on the stack, but branches into code that is located in the code cache.

• The ret action handler emits code that pops the return instruction pointer from the stack, finds the translated target and branches to the corresponding translated target.

• The indirect jump action handler emits code for a runtime lookup of the corresponding target and issues a branch to the translated target.

If the target is not already translated then the action function generates a trampoline that translates the target when it is executed for the first time.

2.1.3 Code Cache

During the execution of a program it is likely that certain code regions are executed multiple times. Therefore it makes sense to keep translated code in a cache for later reuse. As a result code is only translated once, reducing the overall translation overhead.

fastBT uses a per-thread cache strategy to increase code locality. An additional advantage of thread local caches is that the action functions can emit hard-coded pointers to thread local data structures; otherwise the code would need to call expensive lookup functions.

The combination of the basic translator that stops at BB boundaries and the design of the code cache leads to a greedy trace extraction. These traces are formed as a side effect of the first execution and can possibly speed up the program.

2.1.4 Mapping Table

The mapping table maps between code locations in the original program and code locations in the code cache. The translator adds entries for all translated basic blocks. An entry consists of two pointers, the first pointing to the original location, the second pointing to the translated location.

The hash function used in any lookup function returns a relative offset into the mapping table. Adding the result from the hash function to the base pointer of the mapping table results in the address of a table entry. This process is kept simple so that inlined machine code is able to check these entries efficiently.

2.2 Optimizations

The basic binary translator is able to use hardcoded references in the code cache for direct control transfers. The translator must emit an online lookup of the target if the control transfer is indirect (e.g. indirect jump, indirect call, function return, or newly translated block) and the target is not known at translation time. This indirect jump routine resolves the indirect target, issues a mapping table lookup, translates new targets, and redirects the control flow to the specified target.

The execution of these indirect control transfers accounts for the majority of the overhead introduced through BT. Different optimization strategies like function inlining, an inlined lookup and dispatch, and inlined predictions are used to reduce the overhead of indirect control transfers.
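The mapping table lookup and the general indirect jump routine described above can be sketched in C as follows. This is a minimal illustration under stated assumptions: the table size, the hash function, the use of linear probing on collisions, and all names (map_entry, map_lookup, resolve_indirect, fake_translate) are hypothetical and not taken from the fastBT sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAP_ENTRIES 1024u               /* assumed size, power of two */
#define HASH_MASK   (MAP_ENTRIES - 1u)

/* One entry: two pointers, original location and translated location. */
struct map_entry {
    uintptr_t orig;        /* location in the original program, 0 = empty */
    uintptr_t translated;  /* location in the code cache */
};

static struct map_entry map_table[MAP_ENTRIES];

/* Assumed hash: drop the low bits and mask to the table size. */
static size_t hash_slot(uintptr_t orig)
{
    return (size_t)((orig >> 2) & HASH_MASK);
}

/* Returns the translated location, or 0 if orig is not translated yet. */
uintptr_t map_lookup(uintptr_t orig)
{
    size_t slot = hash_slot(orig);
    for (size_t i = 0; i < MAP_ENTRIES; i++) {
        const struct map_entry *e = &map_table[(slot + i) & HASH_MASK];
        if (e->orig == orig)
            return e->translated;
        if (e->orig == 0)
            return 0;                   /* empty slot ends the probe */
    }
    return 0;
}

/* Adds or updates a mapping; returns -1 if the table is full. */
int map_insert(uintptr_t orig, uintptr_t translated)
{
    size_t slot = hash_slot(orig);
    for (size_t i = 0; i < MAP_ENTRIES; i++) {
        struct map_entry *e = &map_table[(slot + i) & HASH_MASK];
        if (e->orig == 0 || e->orig == orig) {
            e->orig = orig;
            e->translated = translated;
            return 0;
        }
    }
    return -1;
}

typedef uintptr_t (*translate_fn)(uintptr_t orig);

/* Indirect jump routine: look up, translate on a miss, return the target. */
uintptr_t resolve_indirect(uintptr_t orig, translate_fn translate)
{
    assert(translate != NULL);
    uintptr_t hit = map_lookup(orig);
    if (hit != 0)
        return hit;                     /* already in the code cache */
    uintptr_t fresh = translate(orig);  /* translate the new target */
    if (fresh != 0)
        map_insert(orig, fresh);
    return fresh;
}

/* Stand-in translator for demonstration only: counts its invocations. */
unsigned translate_calls;
uintptr_t fake_translate(uintptr_t orig)
{
    translate_calls++;
    return orig + 0x1000;
}
```

Calling resolve_indirect twice with the same original address invokes the stand-in translator only once; the second call is served from the table, which is the effect the code cache and mapping table are meant to achieve.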

2.2.1 fastRET

Function returns are indirect control transfers where the target lies on the top of the stack. The basic translator calls the indirect jump routine and handles the return instruction like an indirect jump.

The fastRET optimization translates a return instruction into a thread local lookup in the mapping table and a branch to the translated target without an additional call.

Using the implementation shown in Figure 2 the fastRET optimization uses 13 instructions compared to more than 20 instructions if the general indirect jump routine is used. The fastRET optimization has two advantages: (i) no lookup is needed for pointers to local data structures, as they can be embedded in the code cache, and (ii) the code is inlined in the code cache, and no extra function call is executed.

        pushl %ebx
        pushl %ecx
        movl 8(%esp), %ebx                  # load rip
        movl %ebx, %ecx
        andl HASH_PATTERN, %ebx
        subl maptbl_start(0,%ebx,8), %ecx
        jecxz hit
        popl %ecx
        popl %ebx
        pushl tld
        call ind_jmp
    hit:
        movl maptbl_start+4(0,%ebx,8), %ebx
        movl %ebx, tld->ind_jump_target
        popl %ecx
        popl %ebx
        leal 4(%esp), %esp                  # adjust stack
        jmp *(tld->ind_jump_target)

Figure 2: A translated return instruction with the fast return optimization, rip is the return instruction pointer and tld the pointer to thread local data.

2.2.2 Indirect call prediction

The indirect call prediction caches the last lookup target and destination (see Figure 3). If the target is the same then the control can be transferred without a mapping table lookup, otherwise the cache must be updated. This optimization speculates that the target does not change often. fastBT uses this optimization for indirect jumps relative to a memory address (e.g. jmp *0x11223344).

        cmpl $cached_target, (%esp)
        je hit_tr
        pushl target
        pushl tld
        pushl $addrOfCachedTarget
        call indcall_fixup
    hit_tr:
        pushl rip
        jmp translated_target

Figure 3: Translated indirect call instruction with an included prediction, target is the target of the call instruction, rip is the return instruction pointer, and tld the pointer to thread local data.

3 Security Features

A program is secure if the execution of unintended code is not possible. Unintended code does not meet the user's expectations and executes unintended system calls. If all system calls of an application are redirected to an interposition framework and inspected then the program is unable to execute unintended actions like, e.g., deleting files, escalating privileges, or spawning processes.

secuBT offers the possibility to inspect system calls before they are executed. Using this framework malicious code is unable to break out of the sandbox, even if some unwanted code is executed. To support this user-space virtualization the underlying binary translator must ensure that a program is not able to escape BT. If only safe instructions are translated and all system calls are checked then a program will not be able to escape from the virtualization. Unsafe code is caught during the translation process, and the program is terminated. Unsafe system calls are caught before they are executed and the program is terminated.

The system call interposition framework can be extended by a policy-based user-space system call authorization framework [13].

User-space virtualization does not guard against overwriting data structures in files that are already mapped to memory.
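The inspection of system calls before they are executed can be sketched as a per-system-call handler table. The following C fragment is only an illustration: the types, the table size, the demo policy, and the system call numbers are hypothetical for this sketch and do not reproduce the secuBT interface. Each handler sees the system call number and its arguments and returns one of three verdicts: allow, abort the program, or return a fake value without entering the kernel.

```c
#include <assert.h>
#include <stddef.h>

enum verdict { SC_ALLOW, SC_ABORT, SC_FAKE };

struct sc_request  { long nr; long args[6]; };
struct sc_decision { enum verdict verdict; long fake_result; };

typedef struct sc_decision (*sc_handler)(const struct sc_request *req);

#define SC_TABLE_SIZE 512
static sc_handler sc_table[SC_TABLE_SIZE];   /* NULL = no handler installed */

/* Illustrative system call numbers; hypothetical for this sketch. */
enum { DEMO_NR_WRITE = 4, DEMO_NR_EXECVE = 11, DEMO_NR_GETPID = 20 };

/* Allow writes to stdout and stderr only; abort on any other descriptor. */
static struct sc_decision demo_write(const struct sc_request *req)
{
    struct sc_decision d = { SC_ABORT, 0 };
    if (req->args[0] == 1 || req->args[0] == 2)
        d.verdict = SC_ALLOW;
    return d;
}

/* Hide the real process id by returning a fake value. */
static struct sc_decision demo_getpid(const struct sc_request *req)
{
    (void)req;
    struct sc_decision d = { SC_FAKE, 1 };
    return d;
}

void sc_install_demo_policy(void)
{
    sc_table[DEMO_NR_WRITE]  = demo_write;
    sc_table[DEMO_NR_GETPID] = demo_getpid;
    /* DEMO_NR_EXECVE is left without a handler and is therefore denied */
}

/* Called by the rewritten system call site before the kernel is entered. */
struct sc_decision authorize(const struct sc_request *req)
{
    assert(req != NULL);
    struct sc_decision deny = { SC_ABORT, 0 };
    if (req->nr < 0 || req->nr >= SC_TABLE_SIZE || sc_table[req->nr] == NULL)
        return deny;                 /* default-deny is an assumption here */
    return sc_table[req->nr](req);
}
```

A system call without an installed handler is denied in this sketch, so forgetting to cover a system call fails closed; whether secuBT uses the same default is not stated in the text above.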

3.1 Executable Space Protection

Modern processors support an extension called the executable bit. Only code on pages where this bit is set can be executed. This protection guards against code introduced into non-executable regions like the stack.

secuBT implements such a protection mechanism on a more precise per-section granularity for regions defined in the ELF headers. Before a memory location is translated, the translator checks whether it lies in an executable section of any loaded library or the program itself. The program is terminated if the target is not in an executable section. This mechanism protects against code introduced into non-executable regions like the stack and the heap.

The code regions of the original program are marked as non-executable and only the code cache contains executable code. As the user program does not know the location of the code cache it is unable to change already emitted code.

3.2 Protecting Internal Data Structures through mprotect

A translated program is not able to directly discover addresses of internal data structures of the binary translator. But because the binary translator and the user program share the same address space there is a chance that the user program can change memory locations that belong to the binary translator.

An explicit protection can be achieved by write-protecting all memory pages used by the binary translator. As soon as the binary translator switches to translated code, write-bits are cleared for all pages used by secuBT. If an untranslated basic block is translated then the write privileges are added before the translator is started. These transitions are controlled by secuBT and make it impossible for the user program to change emitted code in the code cache. This extension adds many additional mprotect system calls. The mprotect system calls are responsible for a large part of the translation overhead but result in a higher level of security.

3.3 System Call Interposition

To be effective an exploit must execute unintended system calls (e.g., I/O, opening network sockets, executing other programs, and privilege escalation). A mechanism that restricts the system calls a virtualized program can execute is an effective safeguard.

The fastBT framework supports system call rewriting of sysenter instructions and int 0x80 instructions. The Linux kernel uses both mechanisms alongside each other [9].

The system call interposition framework extends the binary translator and redirects all system calls to specific user-defined functions. It is possible to define a specific function per system call that checks the parameters, call stack, and the system call number and either allows the system call, denies the system call and aborts the program, or denies the system call and returns a fake value.

4 Evaluation

This section provides an analysis of the overhead introduced through the fastBT virtualization and secuBT sandboxing. This overhead is separated into (i) BT overhead alone, (ii) additional overhead for syscall validation and executable space protection, and (iii) full virtualization using mprotect to guard the internal data structures from attacks against the sandbox.

The benchmarks were run on an Ubuntu 9.04 system with an E6850 Intel Core2Duo CPU running at 3.00GHz, 2GB RAM, and GCC version 4.3.3.

4.1 Overhead for different security configurations

Table 1 shows overheads for the different SPEC CPU2006 benchmarks compared to an untranslated run. The different configurations are:

fastBT: A configuration without additional security features, showing the overhead of the virtualization and binary modification toolkit.

secuBT: This configuration shows the overhead of secuBT with executable space protection and system call interposition.

secuBT (full): The last configuration shows full encapsulation including protection of internal data structures using explicit memory protection through mprotect.

    Benchmark         fastBT    secuBT    secuBT (full)
    400.perlbench     66.87%    67.70%    72.22%
    401.bzip2          4.34%     3.89%     4.19%
    403.gcc           32.20%    31.97%    84.81%
    429.mcf            0.25%     0.00%     0.25%
    445.gobmk         15.71%    15.71%    18.00%
    456.hmmer          4.54%     5.51%     5.94%
    458.sjeng         36.04%    35.90%    35.76%
    462.libquantum     0.98%     0.98%     0.98%
    464.h264ref        8.19%    10.21%    10.21%
    471.omnetpp       16.53%    15.73%    15.93%
    473.astar          5.49%     5.32%     5.16%
    483.xalancbmk     30.19%    29.38%    32.35%
    410.bwaves         2.35%     2.46%     2.68%
    416.gamess        -3.50%    -2.80%    -2.10%
    433.milc           2.30%     2.18%     2.54%
    434.zeusmp        -0.25%    -0.25%    -0.13%
    435.gromacs        0.00%     0.00%     0.00%
    436.cactusADM     -0.66%     0.00%     0.00%
    437.leslie3d       0.00%     0.00%     0.00%
    444.namd           0.49%     0.49%     0.49%
    447.dealII        46.38%    44.02%    45.11%
    450.soplex         4.83%     4.46%     6.69%
    453.povray        39.23%    39.78%    41.44%
    454.calculix      -1.12%    -1.12%     3.35%
    459.GemsFDTD       1.79%     1.79%     2.68%
    465.tonto         11.35%    10.27%    13.51%
    470.lbm            0.00%    -0.11%     0.00%
    482.sphinx3        0.35%     0.95%     1.42%
    Average           11.60%    11.59%    14.41%

Table 1: Overhead for different configurations executing the SPEC CPU2006 benchmarks (relative to an untranslated run). The configurations are fastBT, secuBT, and secuBT with full memory protection.

The average slowdown for fastBT of below 12% is tolerable. The secuBT security extensions do not add any measurable overhead compared to fastBT. The full protection mechanism results in an average overhead of 14.41%. The overhead for fastBT and basic secuBT protection for most programs is between 0% and 10%, whereas some benchmarks like 400.perlbench, 403.gcc, 458.sjeng, 483.xalancbmk, 447.dealII, and 453.povray result in a higher overhead of 30% to 67%. secuBT adds static overhead per translated block and per system call. The SPEC CPU2006 benchmarks have a low number of system calls and high code reuse, which is typical for server applications. Therefore the secuBT extensions add no measurable overhead to these programs.

secuBT with full protection leads to more overhead because the number of system calls increases. But the overall overhead is low in these benchmarks, although the translation overhead is a lot higher. The translation overhead is small compared to the runtime overhead of the translated program. As soon as all active code is translated no further mprotect calls are necessary. For short running programs with low code reuse the translation overhead would be higher.

5 Related Work

Related work to secuBT combines ideas from different fields of research. An important area is systems that enforce some kind of security policy by either limiting the instruction set or relying on some kernel infrastructure.

As secuBT is tightly coupled to the fastBT binary translator this section covers related work from binary translation as well.

5.1 Enforcing Security

Security can be enforced on many levels. Some of them are limiting the system calls a program can use, limiting the instruction set, or using hardware extensions to limit the program.

• The Google Native Client [18] is able to execute x86 code in a sandbox. Native Client uses techniques similar to software-based fault isolation (SFI) [17] systems. The instruction set is limited to a safe subset of the IA-32 ISA, making illegal operations impossible. Before the execution a verifier checks if the program is valid, then the program is executed without any additional virtualization. This limits the instructions that can be used; the programs are linked statically and no dynamic libraries can be used. Programs must be compiled with a custom-tailored compiler.

• Janus [10] is a system call interposition framework that uses the Solaris process tracing facility (ptrace) to allow one user mode process to filter the system calls of a second process. This framework builds on kernel support and has two drawbacks: (i) the traced application is already in the kernel when it is stopped, which poses a potential security problem, and (ii) the overhead of switching between the inspecting process and the application is high.

• Vx32 [8] implements a user-space sandbox built on BT that uses segmentation to hide the internal data structures. Due to the use of segmentation the Vx32 system is limited to 32-bit code. System calls and illegal instructions are translated to traps that call special handler functions. The proposed BT results in a high overhead as there are no optimizations for indirect control transfers.

5.2 Binary Translation

Complete system virtualization by QEMU [3] offers full encapsulation, but comes with high overhead. Other full system virtualization tools like VMware [7; 6] and Xen [2] rely on kernel support. A disadvantage of system virtualization is that every VM is an independent system with its own configuration. From a security and safety perspective the encapsulation of such an approach is needed without the complexity of individual systems. secuBT offers user-space virtualization, combining encapsulation with simple configuration on a single system.

The three binary translators that are most similar to fastBT in either architecture or function are HDTrans, DynamoRIO, and PIN:

• HDTrans [16; 15] is a light-weight, table-based instrumentation system. A code cache is used for translated code as well as trace linearization and optimizations for indirect jumps. HDTrans resembles fastBT most closely with respect to speed and implementation, but there are significant differences in the optimizations for indirect jumps. Additionally HDTrans only covers a subset of the IA-32 instruction set.

• Dynamo is a dynamic optimization system developed by Bala et al. [1]. DynamoRIO [4; 5; 11] is the binary-only IA-32 version of Dynamo for Linux and Windows. The translator extracts and optimizes traces for hot regions. Hot regions are identified by adding profiling information to the instruction stream. These regions are converted into an IR, optimized and recompiled to native code. DynamoRIO was recently acquired by Google and newer versions are released as open-source.

• PIN [12] is an example of a dynamic instrumentation system that exports a high-level instrumentation API that is available at runtime. The system offers an online high-level interface to all instructions. PIN uses the user-supplied definition and dynamically instruments the running program.

The drawback of these systems is that they were not designed with security in mind. A binary with specially crafted instructions is able to escape the binary translation system and can execute untranslated code, circumventing the protection mechanisms.

6 Conclusion

We present secuBT, a low overhead binary translation framework that implements security in user-space and offers the ability to limit programs in their use of system calls and privileged instructions.

The secuBT framework uses dynamic binary translation to support the full IA-32 ISA without kernel support. Full binary translation escapes dangerous instructions at runtime and interposes system calls with an interposition framework.

secuBT combines the advantages of full system translation without the disadvantages. Applications are virtualized and encapsulated while using a shared system image with a single system configuration and no additional overhead to run multiple operating systems.

The source code of the secuBT virtualization framework can be downloaded at http://nebelwelt.net/projects/secuBT.

References

[1] Bala, V., Duesterwald, E., and Banerjia, S. Dynamo: a transparent dynamic optimization system. In PLDI ’00 (Vancouver, BC, Canada, 2000), pp. 1–12.

[2] Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer, R., Pratt, I., and Warfield, A. Xen and the art of virtualization. In SOSP ’03 (New York, NY, USA, 2003), pp. 164–177.

[3] Bellard, F. QEMU, a fast and portable dynamic translator. In ATEC ’05: Proc. conf. USENIX Ann. Technical Conf. (Berkeley, CA, USA, 2005), pp. 41–41.

[4] Bruening, D., Duesterwald, E., and Amarasinghe, S. Design and implementation of a dynamic optimization framework for Windows. In ACM Workshop Feedback-directed Dyn. Opt. (FDDO-4) (2001).

[5] Bruening, D., Garnett, T., and Amarasinghe, S. An infrastructure for adaptive dynamic optimization. In CGO ’03 (Washington, DC, USA, 2003), pp. 265–275.

[6] Bugnion, E. Dynamic binary translator with a system and method for updating and maintaining coherency of a translation cache. US Patent 6704925, March 2004.

[7] Devine, S. W., Bugnion, E., and Rosenblum, M. Virtualization system including a virtual machine monitor for a computer with a segmented architecture. US Patent 6397242.

[8] Ford, B., and Cox, R. Vx32: lightweight user-level sandboxing on the x86. In ATC’08: USENIX 2008 Annual Technical Conference (Berkeley, CA, USA, 2008), USENIX Association, pp. 293–306.

[9] Garg, M. Sysenter based system call mechanism in linux 2.6 (http://manugarg.googlepages.com/systemcallinlinux2_6.html).

[10] Goldberg, I., Wagner, D., Thomas, R., and Brewer, E. A. A secure environment for untrusted helper applications: Confining the wily hacker. In Proceedings of the 6th Usenix Security Symposium (1996).

[11] Hazelwood, K., and Smith, M. D. Managing bounded code caches in dynamic binary optimization systems. TACO ’06: ACM Trans. Arch. Code Opt. 3, 3 (2006), 263–294.

[12] Luk, C.-K., Cohn, R., Muth, R., Patil, H., Klauser, A., Lowney, G., Wallace, S., Reddi, V. J., and Hazelwood, K. Pin: building customized program analysis tools with dynamic instrumentation. In PLDI ’05 (New York, NY, USA, 2005), pp. 190–200.

[13] Payer, M., and Gross, T. secuBT: Enforcing security through user-space process virtualization. Tech. rep.

[14] Payer, M., and Gross, T. Requirements for fast binary translation. In 2nd Workshop on Architectural and Microarchitectural Support for Binary Translation (2009).

[15] Sridhar, S., Shapiro, J. S., and Bungale, P. P. HDTrans: a low-overhead dynamic translator. SIGARCH Comput. Archit. News 35, 1 (2007), 135–140.

[16] Sridhar, S., Shapiro, J. S., Northup, E., and Bungale, P. P. HDTrans: an open source, low-level dynamic instrumentation system. In VEE ’06 (New York, NY, USA, 2006), pp. 175–185.

[17] Wahbe, R., Lucco, S., Anderson, T. E., and Graham, S. L. Efficient software-based fault isolation. In SOSP ’93: Proceedings of the fourteenth ACM symposium on Operating systems principles (New York, NY, USA, 1993), ACM, pp. 203–216.

[18] Yee, B., Sehr, D., Dardyk, G., Chen, J. B., Muth, R., Ormandy, T., Okasaka, S., Narula, N., and Fullagar, N. Native client: A sandbox for portable, untrusted x86 native code. IEEE Symposium on Security and Privacy (2009), 79–93.
