English Shellcode

English Shellcode Joshua Mason, Sam Small Fabian Monrose Greg MacManus Johns Hopkins University University of North Carolina iSIGHT Partners Baltimore, MD Chapel Hill, NC Washington, DC {josh, sam}@cs.jhu.edu [email protected] [email protected] ABSTRACT General Terms History indicates that the security community commonly Security, Experimentation takes a divide-and-conquer approach to battling malware threats: identify the essential and inalienable components of an attack, then develop detection and prevention tech- Keywords niques that directly target one or more of the essential com- Shellcode, Natural Language, Network Emulation ponents. This abstraction is evident in much of the litera- ture for buffer overflow attacks including, for instance, stack 1. INTRODUCTION protection and NOP sled detection. It comes as no surprise Code-injection attacks are perhaps one of the most com- then that we approach shellcode detection and prevention mon attacks on modern computer systems. These attacks in a similar fashion. However, the common belief that com- are used to deliver and run arbitrary code on victims’ ma- ponents of polymorphic shellcode (e.g., the decoder) can- chines, often enabling unauthorized access and control of not reliably be hidden suggests a more implicit and broader system resources, applications, and data. Typically, the vul- assumption that continues to drive contemporary research: nerabilities being exploited arise due to some level of neglect namely, that valid and complete representations of shellcode on the part of system and application developers to properly are fundamentally different in structure than benign pay- define and reject invalid program input. Indeed, the canon- loads. While the first tenet of this assumption is philosoph- ical consequences of such neglect, which include buffer and ically undeniable (i.e., a string of bytes is either shellcode or heap overflow attacks, format string attacks, and (more re- it is not), truth of the latter claim is less obvious if there exist cently) heap spray attacks, categorically demonstrate some encoding techniques capable of producing shellcode with fea- of the most popular code-injection techniques. tures nearly indistinguishable from non-executable content. Generally speaking, an attacker’s first objective in a code- In this paper, we challenge the assumption that shellcode injection attack is to gain control of a machine’s program must conform to superficial and discernible representations. counter. The program counter is a special purpose ma- Specifically, we demonstrate a technique for automatically chine register that identifies the next instruction scheduled producing English Shellcode, transforming arbitrary shell- for execution. By gaining control of the program counter, code into a representation that is superficially similar to an attacker is able to redirect program execution and dis- English prose. The shellcode is completely self-contained— rupt the intended behavior of the program. With the abil- i.e., it does not require an external loader and executes as ity to manipulate the program counter, attackers sometimes valid IA32 code)—and can typically be generated in under redirect a victim’s machine to execute (already present) ap- an hour on commodity hardware. Our primary objective in plication or system code in a manner beneficial to an at- this paper is to promote discussion and stimulate new ideas tacker’s intent. For instance, return-to-libc attacks provide for thinking ahead about preventive measures for tackling awell-documentedexampleofthiskindofmanipulation. evolutions in code-injection attacks. In a code-injection attack, however, attackers redirect the program counter to execute code delivered by the attackers themselves. Depending on the details of the particu- Categories and Subject Descriptors lar vulnerability that an attacker is targeting, injected code can take several forms including source code for an inter- K.6.5 [Security and Protection]: Invasive software preted scripting-language engine, intermediate byte-code, or natively-executable machine code. Despite differences in the style and implementation of different exploits, e.g., buffer overflow versus format string at- Permission to make digital or hard copies of all or part of this work for tacks, all code-injection attacks share a common component: personal or classroom use is granted without fee provided that copies are the injected code. This payload, once executed after suc- not made or distributed for profit or commercial advantage and that copies cessful manipulation of the program counter, often provides bear this notice and the full citation on the first page. To copy otherwise, to attackers with arbitrary control over a vulnerable machine. republish, to post on servers or to redistribute to lists, requires prior specific Frequently (though not always), attackers deliver a payload permission and/or a fee. CCS’09, November 9–13, 2009, Chicago, Illinois, USA. that simply launches a command shell. It is for this reason Copyright 2009 ACM 978-1-60558-352-5/09/11 ...$10.00. that many in the hacking community generically refer to the payload portion of a code-injection attack as shellcode. will likely continue to evade intrusion detection and preven- Among those less familiar with code-injection attacks, there tion systems because malcode developers do not follow the is sometimes a subtle misconception that shellcode is nec- “rules”. As this cat and mouse game plays on, it is clear that essarily delivered in tandem with whichever message ulti- the attackers will adapt. So should we, especially as it per- mately exploits a vulnerability and grants an attacker con- tains to exploring new directions for preventative measures trol of the program counter. While this assumption typ- against code-injection attacks. ically holds in more traditional buffer overflow vulnerabilities, modern attacks demonstrate that attackers have de- 2. ON THE ARMS RACE veloped numerous techniques to covertly deliver (and ulti- In this paper, we focus on natively-executable shellcode mately store into memory) shellcode separately from and for x86 processors. In this case, machine code and shellcode prior to triggering the exploit. For instance, if an attacker are fundamentally identical; they both adhere to the same can manipulate memory at a known heap address, they may binary representation directly executable by the processor. store their shellcode there, using its address later when over- Shellcode developers are often faced with constraints that writing a return address on the stack [10]. We draw atten- limit the range of byte-values accepted by a vulnerable ap- tion to this distinction because our use of the term shellcode plication. For instance, many applications restrict input to in this paper specifically denotes the injected code irrespec- certain character-sets (e.g., printable, alphanumeric, MIME), tive of individual attacks or vulnerabilities. or filter input with common library routines like isalnum Typically, shellcode takes the form of directly executable and strspn. The difficulty in overcoming these restrictions machine code, and consequently, several defensive measures and bypassing input filters depends on the range of accept- that attempt to detect its presence, or prevent its execution able input. Of course, these restrictions can be bypassed altogether, have been proposed. Indeed, automated inspec- by writing shellcode that does not contain restricted byte- tion of user input, system memory, or network traffic for values (e.g., null-bytes). Although such restrictions often content that appears statistically or superficially executable limit the set of operations available for use in an attack, are now common (e.g., [23, 1, 16, 15, 27]). However, as ex- attackers have derived encodings to convert unconstrained pected, a number of techniques have been developed that shellcode honoring these restrictions by building equivalency circumvent these protective measures, or make their job far operations from reduced instruction sets (e.g., [25, 11]). more difficult (e.g., polymorphism [5, 7]). Of special note are the alphanumeric encoding engines [18] Recently, it has been suggested that even polymorphic present in Metasploit (see www.metasploit.com). These en- shellcode is constrained by an essential component: the de- gines convert arbitrary payloads to representations composed coder. The argument is that the decoder is a necessary and only of letters and numerical digits. These encodings are sig- executable companion to encoded shellcode, enabling the nificant for two reasons. First, alphanumeric shellcode can encoded portion of the payload to undergo an inverse trans- be stored in atypical and otherwise unsuspected contexts formation to its original and executable form. Since the such as syntactically valid file and directory names or user decoder must be natively executable, the prevailing thought passwords [18]. Second, the alphanumeric character set is is that we can detect its presence assuming that this portion significantly smaller than the set of characters available in of the payload will bear some identifiable features not com- Unicode and UTF-8 encodings. This means that the set of in- mon to valid or non-executable data. It is this assumption— structions available for composing alphanumeric shellcode is that shellcode is fundamentally different in structure than relatively small. To cope with these restrictions, patching or non-executable payload data—that continues to drive some self-modification

Load more