Are Text-Only Data Formats Safe? Or, Use This LaTeX Class File to Pwn Your Computer

Stephen Checkoway (UC San Diego), Hovav Shacham (UC San Diego), Eric Rescorla (RTFM, Inc.)

Abstract

We show that malicious TeX, BibTeX, and METAPOST files can lead to arbitrary code execution, viral infection, denial of service, and data exfiltration, through the file I/O capabilities exposed by TeX's Turing-complete macro language. This calls into doubt the conventional wisdom that text-only data formats that do not access the network are likely safe. We build a TeX virus that spreads between documents on the MiKTeX distribution on Windows XP; we demonstrate data exfiltration attacks on web-based LaTeX previewer services.

1 Introduction

The divide between "code" and "data" is among the most fundamental in computing. Code expresses behavior or functionality to be carried out by a computer; data encodes and describes an object (a photo, a spreadsheet, etc.) that is conceptually inert, and examined or manipulated by means of appropriate code.

The complexity of data formats for media manipulated by desktop systems, together with the inability of programmers to write bug-free code, has generated a stream of exploits in common media formats. These exploits take advantage of software bugs to induce arbitrary behavior when a user views a data file, even seemingly simple ones such as Windows' animated cursors [16]. The inclusion of powerful scripting languages in file formats like Microsoft's Word has led to so-called macro viruses,¹ and to PostScript documents that violate a paper reviewer's anonymity [3]. These two trends have combined in the use of PDF files that include JavaScript to exploit bugs in Adobe's Acrobat; by one report [19], some 80% of exploits in the fourth quarter of 2009 used malicious PDF files. Thus the complexity and opacity of data formats have made data behave more like code. On the other side, a line of work culminating in the English-language shellcode of Mason et al. [13] has shown how to make code look more like data.

¹ Amusingly, some advocacy documents list "no macro viruses" as an advantage TeX has over Word; see, e.g., http://web.mit.edu/klund/www/urk/texvword.html.

In this paper, we present a case study of another unsafe data format, one that is of particular interest to the academic community: TeX. Unlike Word documents or PDF files, the input file formats associated with TeX are all plain text and thus, naïvely, "safe." LaTeX and BibTeX files are routinely transmitted in research environments, a practice we show is fundamentally unsafe. Compiling a document with standard TeX distributions allows total system compromise on Windows and information leakage on UNIX.

TeX is unsafe. Donald Knuth's TeX is the standard typesetting system for mathematical documents. It is also a Turing-complete macro language used to interpret scripts from potentially untrusted sources. In this paper, we show that a specific capability exposed to TeX macros, the ability to read and write arbitrary files, makes it (and other commonly used bits of TeXware, such as BibTeX and METAPOST) a threat to system security and data privacy.

We demonstrate two concrete attacks. First, as an example of running arbitrary programs, we build a TeX virus that affects recent MiKTeX distributions on Windows XP, spreading to all of a user's TeX documents. The virus requires no user action beyond compiling an infected file. Our virus does nothing but infect other documents, but it could download and execute binaries or undertake any other action it wishes.

Second, we describe data exfiltration and denial of service attacks against web-based LaTeX and METAPOST previewer services. Our findings have implications for any online service that compiles or hosts TeX files on behalf of untrusted users, including the Comprehensive TeX Archive Network (CTAN) and Cornell University Library's arXiv.org preprint server.

Defenses. The lesson we teach here is one learned over and over: As the Internet has made document sharing easier and more pervasive, file formats once considered trusted have become attack vectors, either because the parser was insecure or because the scripting capabilities exposed to files in the format have unforeseen consequences. Barring a fundamental change in the way that data-handling code is designed and implemented, we must set aside the idea that data, unlike code, can be "safe"; we should instead treat data-processing code as inherently insecure and design systems that can withstand its compromise, as Bernstein, for example, has advocated [5].

For TeX specifically, there are three main approaches to protect against abuse of interpreted languages. First, one could audit the interpreter for vulnerabilities that allow the attacker to subvert the intended restrictions on the scripting language. Such vulnerabilities are commonly found in supposedly safe file formats and frequently allow the attacker to execute arbitrary code, as in Dowd's recent ActionScript exploit [8]. We know of no such vulnerabilities in TeX, but their absence does nothing to defend against capabilities granted to TeX scripts by design, including the file I/O that forms the basis for our attacks. A second approach is to attempt to establish a safe subset of commands, through blacklisting, whitelisting, or other forms of filtering or rewriting. (This is akin to code-rewriting systems in which code is verified safe at load-time [17].) As we show below, the malleability of the TeX language makes it difficult to filter safely. A final, more drastic approach is to treat the entire system as untrusted and sandbox it using the operating system's isolation mechanisms; as we show, this seems like the most promising approach for TeX.

Observations similar to the ones we have made for TeX apply to other data formats that are programmable (e.g., using JavaScript) or require complicated and error-prone parsers. Ensuring that all programs that process such formats are appropriately sandboxed represents a reimagination of the way traditional desktop environments are engineered; a redesigned system would dovetail with the principles laid out by Bernstein [5].

2 Low-level details of TeX

In this section, we recall some features of the TeX programming language and the LaTeX macro package. The discussion covers only the behaviors on which our attack relies; for more complete coverage we refer the reader to [6, 11, 24]. Even so, the discussion is quite technical. Readers not interested in TeX arcana are encouraged to continue to Section 3, referring back to this section for reference as necessary.

Important control sequences. TeX and LaTeX behavior is principally controlled by a variety of control sequences, conventionally a sequence of characters prefaced by a backslash (\). Below are some of the control sequences we will use in the remainder of the paper.

\catcode  TeX primitive that changes the category code of a character: \catcode`\X=0 changes the default category code of X from "letter" to "escape character."
\csname ... \endcsname  TeX primitive that builds control sequences: \csname foo\endcsname is (almost) the same as \foo.
\input  TeX primitive that reads a file as if the text were typed directly into the main document: \input file (or in LaTeX, \input{file}).
\@input  LaTeX internal similar to TeX's \input.
\@@input  LaTeX internal identical to TeX's \input.
\jobname  TeX primitive that expands to the name of the file being compiled without its extension.
\newread  (La)TeX macro to allocate a new stream for file reading: \newread\file.
\openin  TeX primitive that opens a file and associates it with a read stream: \openin\file=foo.ext.
\read  TeX primitive that reads a line from a file, assigning each character the category code currently in effect: \read\file to\line stores the tokens produced from the file into \line.
\readline  e-TeX extension that behaves as \read but assigns only the category codes "other" and "space."
\relax  TeX primitive that takes no action; just relaxes.
\write  TeX primitive that writes an expanded token list to a file: \write\file{foo}.

Other control sequences are used below, but either their behavior is clear or their use is not of central importance.

TeX parsing behavior. TeX's behavior is usually described in terms of a "mouth" and a "stomach." The exact behavior is fairly complex but the following simplified description will suffice for this paper. TeX's mouth reads each line of input character by character and produces a stream of tokens which are acted on by TeX's stomach. There are two types of tokens produced by TeX's mouth, character tokens and control sequences, and their production is governed by the category code (an integer in 0–15) of the characters read. At any given time, each input character is associated with a single category code. Except in certain situations, expandable tokens (e.g., macros) are expanded into other tokens en route to TeX's stomach. Once in the stomach, TeX processes the tokens, performing assignments (such as changing category codes) and typesetting.

When TeX encounters two identical character tokens with category code 7 (by default only ^), followed by two lowercase hexadecimal digits, it treats the four characters as if a single character whose ASCII value is that hexadecimal number had appeared in the input. For example, ^^5c is read as a backslash.

3 Malicious TeX usage

It is generally assumed that it is safe to process arbitrary, untrusted documents with TeX, and by extension LaTeX. However, this is untrue; in fact, TeX can write arbitrary