Masaryk University Faculty of Informatics

Design and implementation of a framework for viewing and analysis of malicious documents

Diploma Thesis

Bc. Richard Nossek

2013

Statement

I hereby declare that I have worked on this thesis independently, using only the sources listed in the bibliography. All resources, sources and literature that I used or drew upon in preparing it are properly cited in the thesis with a full reference to the source.

Acknowledgement

I would like to thank my advisor, RNDr. Václav Lorenc, for his guidance and advice during my work on this thesis.

Abstract

The goal of this thesis is to provide an in-depth assessment of the use of the PDF (Portable Document Format) file format as an attack vector and of the current state of the field of malicious document analysis. First, we provide a detailed introduction to the inner organization and structure of PDF files and describe how different features can be used for obfuscation purposes. Next, we survey available options for viewing PDF documents in a web browser environment, as well as tools for PDF document analysis. The practical part consists of designing and implementing a web application that serves as a framework for malicious document analysis.

Keywords

Portable Document Format, PDF, malware, malicious documents, PDF analysis, PDF analysis tools, analysis framework.

Contents

Introduction
1 Portable Document Format
1.1 Version history
1.2 PDF architecture
1.3 PDF file structure
1.3.1 File header
1.3.2 File body
1.3.2.1 Boolean objects
1.3.2.2 Numeric objects
1.3.2.3 String objects
1.3.2.4 Name objects
1.3.2.5 Array objects
1.3.2.6 Dictionary objects
1.3.2.7 Stream objects
1.3.2.8 Null object
1.3.3 Cross-reference table
1.3.4 File Trailer
1.3.5 Incremental Updates
1.3.6 Object Streams
2 Vulnerabilities and exploits
2.1 PDF as an attack vector
2.2 Propagation and payload execution
2.2.1 Email
2.2.2 Drive-by downloads
2.2.3 COM objects
2.3 Adobe Reader vulnerabilities
2.3.1 Most frequently exploited vulnerabilities
2.4 PDF obfuscation techniques
2.4.1 Header
2.4.2 String objects
2.4.3 Name objects
2.4.4 Encryption
2.4.5 Filters
2.4.6 Code fragmenting
2.4.7 Object streams
2.4.8 JavaScript obfuscation
2.5 PDF JavaScript
2.5.1 Function getField
2.5.2 Function this.info
2.5.3 Function getAnnot
3 PDF online viewers and analysis tools
3.1 Online PDF viewers
3.1.1 Browser plugins
3.1.2 Server-side format conversion
3.2 Analysis Tools
3.2.1 PDF Tools
3.2.1.1 PDFiD
3.2.1.2 PDF-parser
3.2.2 peepdf
3.2.3 PDF Stream Dumper
3.2.4 Origami
3.2.5 jsunpack
3.2.6 Wepawet
3.2.7 PDF Examiner
3.2.8 PDF X-Ray
4 Application design
4.1 Framework definition
4.2 Document analysis
4.3 Design and prototypes
4.3.1 Database relations
4.3.2 Security concerns
5 Implementation
5.1 Used technologies
5.2 Document root organization
5.2.1 Upload page
5.2.1.1 Flash conversion script
5.2.1.2 Supplementing score system
5.2.1.3 Uploader component security
5.2.2 File browser page
5.2.3 Report page
5.3 Implementation overview
5.4 Deployment
Conclusions and future work
References

Introduction

In recent years, the use of information technology (IT) has become more pervasive in most aspects of our everyday lives. Hand in hand with this change goes the increased use of electronic documents, as individuals, businesses and governments adapt to the electronic environment. Portable Document Format (PDF) [1] files are currently one of the most used formats thanks to their rich feature list, portability and Adobe's freely available reader software. However, the popularity of the format inevitably drew the attention of malware authors, who quickly recognized an opportunity and began to use vulnerabilities in Adobe Reader [2] as an attack vector. What makes PDF files special in this regard is that the extensive PDF specification was soon found to provide legitimate ways of disguising a malicious payload inside documents. These obfuscation techniques not only rendered the traditional signature-based detection of anti-virus (AV) software ineffective, but also made static analysis problematic. Most AV scanners have since implemented PDF parsing functionality, yet it was shown [3] that combining multiple techniques and avoiding common patterns can bring detection rates back close to zero even today. Specialized tools and an understanding of the format are necessary to parse and correctly examine the contents of PDF files.

The main goal of the theoretical part of this thesis is to review available options for viewing PDF files in web browsers, survey tools for analysis of PDF documents and test the extent of their functionality. However, in order to do so, we first need to understand the structure of PDF documents and the fundamentals of obfuscation techniques and vulnerabilities specific to the format. The first few chapters therefore cover this topic.

In the practical part of this work, we design and implement a web application that serves as a framework for analysis of malicious PDF documents. One of the aims was to develop it using only free software [4].
The first chapter provides an overview of PDF version history [5], which also lists the functionality added since the first release. It also describes how PDF documents are structured and organized internally. The second chapter first recounts how malicious documents are usually distributed and executed [6] and then offers a summary of known vulnerabilities [7] related to Adobe Reader. Additionally, it explains how PDF obfuscation [8] works and lists the differences in Adobe's implementation of the JavaScript engine. In chapter three, we discuss available options for viewing PDF files in a browser environment and then review and describe the currently freely available tools for PDF analysis (both offline and online), namely PDF Tools [9], peepdf [10], PDF Stream Dumper [11], Origami [12], jsunpack [13], Wepawet [14], PDF Examiner [15] and PDF X-Ray [16]. Chapters four and five cover the design and implementation of our web application. This includes establishing a workflow for static analysis of malicious PDF documents, outlining potential security issues and providing instructions on how to deploy the application on a web server. Finally, we draw conclusions about the problems of malicious document analysis and discuss future possibilities regarding our application.

Chapter 1

Portable Document Format

This chapter summarizes the version history of the Portable Document Format and describes how data is organized internally inside a PDF document.

1.1 Version history

Portable Document Format is a file format that was created by Adobe Systems in 1993. According to Adobe, PDF is a fixed-layout format for representing two-dimensional documents in a manner independent of the application software, hardware and operating system, capturing all the elements of a printed document as an electronic image that can be viewed, navigated, printed or forwarded to someone else. In essence, PDF allows users to view documents exactly as their authors designed them, regardless of any differences between the author's and reader's systems and without the need to have the software used to create the document. PDF has since become a de facto standard for electronic distribution of documents. Even though Adobe Systems Inc. holds patents to PDF, anyone may create and publish applications that can read or create PDF documents without having to pay any royalties.

Initially, PDF was created with the idea of the paperless office in mind. The format was intended to provide a way for companies, corporations and other organizations to exchange documents electronically. It was first publicly discussed at a Seybold conference in 1991 (at the time it was referred to as 'Interchange PostScript'). PDF 1.0 was introduced a year later at Comdex Fall. Acrobat 1.0, the software used to create and view PDF files, was released on 15th June 1993. PDF 1.0 included features such as bookmarks, internal links and embedded fonts. However, the format was not successful at first, mostly due to the high pricing of creation tools and the lack of a free version of Acrobat Reader. The first edition was revised twice. PDF 1.1, along with the corresponding Adobe Acrobat 2.0, was released in 1994. It included several new features, such as password protection, encryption (MD5 and RC4), device-independent colors, threads and links. PDF 1.2 [17] (and Adobe Acrobat 3.0) was released two years later, in 1996.
It introduced, among other things, interactive page elements, fill-in forms and the Forms Data Format used for transmitting form data to and from the Web. PDF 1.3 [18], the second edition of PDF, was released in 2000 and added support for the new features of the Adobe imaging model embodied in PostScript LanguageLevel 3. The most important features introduced were new data structures for efficient mapping of strings and numbers to PDF objects, several new types of functions, embedding of files of any type within a PDF document, several new annotation types, digital signatures, support for JavaScript, a way to capture information from the Web and convert it to PDF form, and prepress support.

The third edition of the format, PDF 1.4 [19], was released a year later in 2001. The most important addition in PDF 1.4 was the transparent imaging model, which allows objects to be painted with different degrees of opacity so that previously painted objects can show through. Other new additions include enhancements to encryption (RC4 key lengths greater than 40 bits), the ability to import content from one PDF document to another, preferences for controlling the area of a page to be displayed or printed, annotation names and dictionaries, new trigger events, many improvements to Forms Data Format and interactive forms, and metadata streams, which represent a new way of attaching descriptive information to PDF documents.

Many changes were introduced in PDF 1.5 [20] (fourth edition), which came out in 2003. The changes include the ability to display images using JPEG2000, greater compression of PDF files due to an extension to the use of streams, enhancements to interactive presentations, a few new annotation types, a feature enabling PDF files to be viewed as a slide-show, enhanced support for embedding and playback of multimedia, improvements to digital signatures and, most importantly, several new options for the encryption of documents, such as syntax for public-key security handlers using PKCS#7, PKCS#7 with SHA-1 and public-key encryption (RSA up to 4096 bits). Support for Windows 98 was officially dropped.

PDF 1.6 [21], the fifth edition of PDF, was released in 2004 and introduced several changes. These include the ability to increase the maximum page size of a PDF document, further enhancements to the syntax of color spaces so that applications have more options when displaying colors that are not available on the target device, embedding of OpenType fonts within PDF files, new options for markup annotations, a way to specify relations between the dimensions of real world objects and their document counterparts, the ability to lock the size of certain objects when they are printed (regardless of the dimensions of the rest of the page), cross-document linking to embedded files, modification detection and prevention (MDP) signatures and incorporation of 3D graphical data in U3D format.
Encryption was enhanced to support the AES algorithm, PKCS#7 with SHA-256, DSA (up to 4096 bits) and selective encryption of embedded files.

An important milestone came with the release of PDF 1.7 [22] in November 2006. Based on PDF Reference 1.7, Adobe Systems Inc. prepared an initial draft of the ISO 32000 document, which was then reviewed and edited by ISO Technical Committee 171 under a special fast-track procedure. The final revised documentation was approved in January 2008 and published as ISO 32000-1:2008 in July 2008 [23]. PDF 1.7, and consequently ISO 32000-1:2008, contains all of the functionality documented in PDF 1.0 to PDF 1.6, with few exceptions. Newly introduced changes included enhanced control over the presentation of 3D artwork, new viewer preference settings that specify print characteristics (such as paper selection, page range, copies, scaling) to make PDF documents more suitable for use in legal communication, XFA 2.4 rich text elements and attributes, and support for PKCS#7 with SHA-384, SHA-512 or RIPEMD160.

Adobe is not producing a PDF 1.8 reference. Instead, new extensions to PDF 1.7 were published: PDF 1.7 Extension Level 3 [24] and PDF 1.7 Extension Level 5 [25], which were released in 2008 and 2009, respectively. The extensions add support for 256-bit AES encryption, support for Unicode-based passwords and pass-phrases, an improved way of attaching multimedia files, XFA (2.5, 2.6 and 3.0) rich text features and a way to enforce viewer preferences. PDF 2.0 is currently in development, and Extension Level 3 and Extension Level 5 have been submitted to ISO to become a part of the next version of the ISO 32000 standard.

1.2 PDF architecture

PDF is a page description language (PDL), one of a group of formats used to describe the appearance of a printed page at a higher level than a simple output bitmap. The PDF format combines three separate technologies:

• PostScript: a page description language, a subset of which is used in PDF to generate the graphics and layout of the document.

• Font-Embedding system: embedding fonts into the document allows documents to be transferred across platforms or systems with ease.

• Storage system: a structured system which groups all the elements, compresses the data as needed and combines them into a single file.

Adobe PostScript is also a page description language, but unlike PDF, it is considered a fully fledged programming language. It has distinct disadvantages compared to PDF when it comes to viewing documents. Most importantly, because it is an interpreted programming language with an implicit global state, any instruction related to a given page in a PostScript file can change the appearance of any of the following pages. That means that in order to view a page with a PostScript viewer, all previous pages need to be processed sequentially before the target page can be displayed. Ordinarily, PDF code is generated from a source PostScript file by collecting and tokenizing the output graphical commands, gathering any related files, fonts or graphics and then compressing everything into a single file.

1.3 PDF file structure

The internal framework of a basic PDF file consists of the following four main elements:

• A simple one line header indicating which version of the PDF specification was used to build the file.

• Body that contains all the objects that make up the file.

• Cross-reference table that contains all links to the indirect objects in the file.

• A trailer that indicates the location of the cross-reference table and a few additional special objects within the body of the file.

Figure 1.1: Basic structure of a PDF file

1.3.1 File header

The first line of a PDF document is always a header of the form %PDF-1.X, where X is a number between 0 and 7. The header indicates that the file is indeed a PDF document and also the version used to build it. However, beginning with PDF 1.4, the Version entry found in the document's catalog dictionary, if present, is used instead of the header to determine the version to which the file conforms. If the document contains binary data (most PDF files do), the header line is followed by a comment line containing at least 4 binary characters (characters with codes of 128 or greater). This is to ensure correct behavior of file transfer applications that try to determine whether they should treat the file content as text or binary data.
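The header check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name parse_header_version is hypothetical, and the check is deliberately strict (the header must be at the very start of the data), which need not match how a particular reader behaves.

```python
import re

# Header form described in the text: %PDF-1.X with X between 0 and 7.
HEADER_RE = re.compile(rb"%PDF-1\.([0-7])")


def parse_header_version(data: bytes):
    """Return the version string (e.g. '1.4') if data starts with a PDF header, else None.

    Hypothetical helper: it only inspects the header and ignores the
    Version entry of the catalog dictionary mentioned in the text.
    """
    match = HEADER_RE.match(data)
    if match is None:
        return None
    return "1." + match.group(1).decode("ascii")
```

For example, calling parse_header_version on the first bytes of a file gives a quick plausibility check before any further parsing is attempted.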

1.3.2 File body

The body of a PDF file consists of a series of indirect objects such as fonts, images, bookmarks and so on. The objects are called indirect, because they are labeled (they can be referred to by other objects). Beginning with PDF 1.5, the body can also include object streams, where each stream contains a sequence of indirect objects. There are eight types of objects: booleans, numeric objects, strings, names, arrays, dictionaries, streams and null.

1.3.2.1 Boolean objects

Boolean objects in PDF documents are the keywords true and false, each representing the corresponding logical value.

1.3.2.2 Numeric objects

There are two subtypes of numeric objects – integer and real. An integer is written as one or more decimal digits, optionally preceded by a sign (e.g. 5616, +9, -59). The value is interpreted as a signed integer and converted into an integer object. Similarly, a real number is written as one or more decimal digits with a period (leading, trailing or embedded), optionally with a sign (e.g. 22.7, -8.95, +.005). Such a value is interpreted as a real number and converted into a real object. The range and precision of these numbers depend on the machine the reader is running on.
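The two numeric forms can be told apart with two regular expressions. The following Python sketch is an illustration only; parse_number is a hypothetical name, and Python's own int and float stand in for whatever numeric types a real reader would use.

```python
import re

# Integer: optional sign followed by one or more decimal digits.
INTEGER_RE = re.compile(r"[+-]?\d+")
# Real: optional sign, digits with a leading, trailing or embedded period.
REAL_RE = re.compile(r"[+-]?(\d+\.\d*|\.\d+)")


def parse_number(token: str):
    """Convert a PDF numeric token into an int or a float (hypothetical helper)."""
    if INTEGER_RE.fullmatch(token):
        return int(token)
    if REAL_RE.fullmatch(token):
        return float(token)
    raise ValueError(f"not a numeric object: {token!r}")
```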

1.3.2.3 String objects

String objects are made of a series of zero or more bytes, which can be written in two different ways. Literal strings are written as zero or more characters enclosed in parentheses. The string can contain any characters with three exceptions – an unbalanced left parenthesis, an unbalanced right parenthesis and the backslash. Balanced pairs of parentheses can be present within a string. The backslash character represents an escape character. The character following the backslash determines its interpretation:

\n      Line feed
\r      Carriage return
\t      Horizontal tab
\b      Backspace
\f      Form feed
\(      Left parenthesis
\)      Right parenthesis
\\      Backslash
\ddd    Character code ddd (one, two or three octal digits)

The \ddd sequence is used for representing characters outside the printable ASCII character set, although it can encode any character (e.g. (\053) is a single character string containing a plus sign). A backslash character at the end of a line indicates that the string continues on the following line. In this case, the PDF reader ignores the backslash and the following end-of-line marker and treats the string as if it wasn't split. For instance, the following two strings are the same:

(Hello World!)

(Hello \
World!)
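The escape rules above can be illustrated with a small decoder. The Python sketch below is not taken from any existing tool; decode_literal_string is a hypothetical helper that expects the bytes between the outer parentheses and does not check that parentheses are balanced.

```python
# Single-character escapes from the table above (byte value -> decoded byte).
ESCAPES = {
    ord("n"): 0x0A, ord("r"): 0x0D, ord("t"): 0x09, ord("b"): 0x08,
    ord("f"): 0x0C, ord("("): 0x28, ord(")"): 0x29, ord("\\"): 0x5C,
}


def decode_literal_string(body: bytes) -> bytes:
    """Decode the escape sequences of a literal string body (hypothetical helper)."""
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != 0x5C:                      # ordinary byte, copy as is
            out.append(ch)
            i += 1
            continue
        i += 1                              # skip the backslash
        if i >= len(body):
            break
        nxt = body[i]
        if nxt in ESCAPES:
            out.append(ESCAPES[nxt])
            i += 1
        elif 0x30 <= nxt <= 0x37:           # \ddd: up to three octal digits
            j = i
            while j < len(body) and j < i + 3 and 0x30 <= body[j] <= 0x37:
                j += 1
            out.append(int(body[i:j], 8) & 0xFF)
            i = j
        elif nxt in (0x0A, 0x0D):           # line continuation: drop the EOL marker
            i += 1
            if nxt == 0x0D and i < len(body) and body[i] == 0x0A:
                i += 1
        else:                               # assumption: unknown escape keeps the character
            out.append(nxt)
            i += 1
    return bytes(out)
```

Applied to the two example strings, the decoder yields the same bytes for both, since the backslash and the line break that follows it are dropped.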

Hexadecimal strings are written as a sequence of pairs of hexadecimal digits encoded as ASCII characters, enclosed in angle brackets. Each pair then represents one byte of the string. When the number of hexadecimal digits within a string is odd, it is automatically assumed that the last digit is a 0 (e.g. <901FA> is a 3 byte string with hexadecimal values 90 1F A0). This form is useful for including binary data within a PDF file. White-space characters (line feed, carriage return, form feed, horizontal tab and space) within a hexadecimal string are ignored.
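A hexadecimal string can be decoded in the same spirit. The sketch below assumes the token still includes its angle brackets; decode_hex_string is a hypothetical name chosen for this illustration.

```python
def decode_hex_string(token: str) -> bytes:
    """Decode a hexadecimal string such as '<901FA>' into bytes (hypothetical helper)."""
    if not (token.startswith("<") and token.endswith(">")):
        raise ValueError("hexadecimal strings are enclosed in angle brackets")
    # White space inside the brackets is ignored.
    digits = "".join(token[1:-1].split())
    # An odd number of digits means the final digit is assumed to be 0.
    if len(digits) % 2:
        digits += "0"
    return bytes.fromhex(digits)
```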

1.3.2.4 Name objects

Introduced in PDF 1.2, name objects are defined by a sequence of any characters (with the exception of null). Within PDF documents, the sequence is preceded by a solidus sign (/), which is not a part of the name, but indicates that the following characters represent a name. The sequence is unique to each object, meaning that two name objects with identical sequences of characters in fact represent the same object. A name also doesn't have any internal structure; its characters are not considered to be individual elements. Any regular character in a name is written either as itself or by using its 2-digit hexadecimal code, preceded by a number sign (#). Non-regular characters (and the number sign itself) are always represented by a number sign followed by their 2-digit hexadecimal code. Examples of name syntax and the corresponding results are as follows:

/Hello    Hello
/C#23     C#
/A#42     AB
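The three examples can be reproduced with a short Python sketch. The helper name decode_name is hypothetical, and the function assumes the token contains only the solidus and the characters of the name.

```python
import re


def decode_name(token: str) -> str:
    """Resolve #xx escapes in a name token such as '/A#42' (hypothetical helper)."""
    if not token.startswith("/"):
        raise ValueError("a name token starts with a solidus")
    # Replace every number sign followed by two hexadecimal digits with the
    # character that has this code; the leading solidus is not part of the name.
    return re.sub(
        r"#([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        token[1:],
    )
```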

1.3.2.5 Array objects

Array objects are one-dimensional, sequential collections of objects, written as a sequence of objects enclosed in square brackets. PDF arrays are heterogeneous, meaning they can contain any combination of objects, including other arrays (e.g. [/Hello (World) 123.45]).

1.3.2.6 Dictionary objects

Dictionary objects are associative tables consisting of pairs of objects. Each entry consists of a key (first element) and a value (second element). The key is always a name object, but the value can be any object, including another dictionary. Because they collect and bind together the attributes of more complex objects (e.g. a font or a page of the document), dictionary objects are the cornerstone of any PDF document. They are written as a sequence of key-value entries enclosed in double angle brackets (<< >>). The following is an example of a dictionary object:

<< /Type /Hello
   /Subtype /World
   /SampleItem 12.125
   /Subdictionary << /SubItem1 false
                     /SubItem2 (a string)
                  >>
>>

Entries can contain null values (null object) and such entries are treated as if they don't exist. It is common to include Type and Subtype entries for more complex objects, in order to describe what kind of object the dictionary identifies.

1.3.2.7 Stream objects

A stream object is a sequence of bytes, which may be of unlimited length. Stream objects have the following format:

dictionary
stream
x bytes
endstream

The stream always begins with a dictionary, which specifies the length of the data that make up the stream, followed by the keyword stream and an end-of-line marker – either a line feed or a carriage return followed by a line feed (a carriage return alone is not allowed, so that stream data beginning with a line feed can still be recognized). Next comes the sequence of zero or more bytes that are the contents of the stream, followed by another end-of-line marker and the keyword endstream. Beginning with PDF 1.2 the data may also be contained in an external file. In this case, the dictionary specifies the file and any bytes between the keywords stream and endstream are ignored. Stream dictionaries may contain several different entries. The Length entry, which indicates the length of the stream data, is mandatory. Optional entries are Filter, DecodeParms, F, FFilter, FDecodeParms and DL. The meaning of the dictionary entries is as follows:

• Length: specifies the number of bytes between the keywords stream and endstream, excluding the end-of-line marker just before endstream.

• Filter: specifies the name of the filter that should be used to process the stream data. It is possible to use an array of filters – they will be applied in the specified order.

• DecodeParms: the value is either a dictionary or an array of dictionaries. The dictionary contains parameters to be used by the filter specified in Filter entry. In case of multiple filters, an array of dictionaries is used, where each filter has one entry (null entry is used for filters that have no parameters).

• F: specifies the file containing the stream data. If this entry is present, all bytes between stream and endstream keyword shall be ignored.

• FFilter: specifies the name of a filter used to decode the data in the external file. An array of names is used for multiple filters.

• FDecodeParms: specifies the parameter dictionary to be used by the filter from the FFilter entry. An array of dictionaries is used for multiple filters.

• DL: introduced in PDF 1.5, the entry represents how many bytes are in the decoded stream. However, this value is not reliable, because for many stream filters it is not possible to determine it precisely.
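How Length and Filter work together can be sketched in Python. The example assumes that the stream dictionary has already been parsed into a plain Python dict (a hypothetical representation) and implements only two of the standard filters, FlateDecode and ASCIIHexDecode; DecodeParms and external files are left out.

```python
import zlib


def _ascii_hex_decode(data: bytes) -> bytes:
    """Decode ASCIIHexDecode data; '>' marks the end of the data."""
    digits = b"".join(data.split(b">", 1)[0].split())
    if len(digits) % 2:
        digits += b"0"
    return bytes.fromhex(digits.decode("ascii"))


# Only two filters are covered in this sketch.
DECODERS = {
    "FlateDecode": zlib.decompress,
    "ASCIIHexDecode": _ascii_hex_decode,
}


def decode_stream(stream_dict: dict, raw: bytes) -> bytes:
    """Apply the filters named in Filter, in the specified order (hypothetical helper)."""
    data = raw[: stream_dict["Length"]]      # Length is the mandatory entry
    filters = stream_dict.get("Filter", [])
    if isinstance(filters, str):             # a single name instead of an array
        filters = [filters]
    for name in filters:
        if name not in DECODERS:
            raise NotImplementedError(f"filter not covered by this sketch: {name}")
        data = DECODERS[name](data)
    return data
```

Because filters can be chained, an analysis tool has to keep decoding until the filter list is exhausted before it can inspect the actual content of a stream.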

1.3.2.8 Null object

The null object, denoted by the keyword null, is unequal in type to any other object. Indirect object references to any nonexistent object are the same as null object. When specified as a value within a dictionary object, the value should be treated as if it doesn't exist.

1.3.3 Cross-reference table

The cross-reference table consists of links to all the indirect objects in a PDF file. Each line of the table specifies the location of a single object within the body of the document. The objects can be accessed randomly, which means that it isn't necessary to read the entire file when attempting to locate a particular object. Beginning with PDF 1.5 the cross-reference information (or part of it) may also be contained in cross-reference streams. The table consists of one or more cross-reference sections. Initially, the table has a single section and each time the file is incrementally updated one new section is created. A cross-reference section begins with a line containing the keyword xref, followed by one or more cross-reference subsections. If a file has never been incrementally updated, the section contains a single subsection whose numbering begins at 0. The advantage of using the subsection structure is that a new cross-reference section can simply be added when we delete or add new objects. Each subsection begins with a line of the form 'X Y', where X is the number of the first object in the subsection and Y is the number of entries in the subsection. This line is then followed by the cross-reference entries, one on each line, each 20 bytes long. There are two types of entries: one indicating objects that are in use and another for deleted objects. Both entry types have similar formats:

[10-digit number] [5-digit number] [f or n] [2 end-of-line characters]

The entry starts with a 10-digit number (padded with leading zeros if needed). For objects that are in use, it denotes the byte offset in the decoded stream; for a free entry, it is the object number of the next free object. It is followed by a space and a 5-digit generation number, then by another space and either the character n for objects that are in use or the character f indicating a free entry. The entry is closed by 2 end-of-line characters. The free entries in the cross-reference table form a linked list, where each entry specifies the number of the next one. The first entry in the table is always free, has generation number 65535 and represents the head of the linked list. The tail of the list points back to object number 0 (the head of the list). Optionally, free entries can point directly back to object number zero and have generation number 65535 (in that case they are not a part of the linked list). The generation number is initially set to 0 for all entries in the cross-reference table with the exception of object number 0. Upon deletion of an indirect object, the corresponding entry in the cross-reference table is marked as free and the generation number is incremented by 1. The incremented value indicates what generation number should be used the next time an object with the same object number is created. The maximum value a generation number can reach is 65535. Entries that reach the maximum value cannot be reused. The following is a simple example of what a cross-reference section of a PDF file might look like:

xref
0 1
0000000000 65535 f
3 1
0000025325 00000 n
23 2
0000025518 00002 n
0000025635 00000 n
30 1
0000025777 00000 n

This example contains a single cross-reference section with five entries in four subsections. The first subsection contains a single free entry – object number 0. The second subsection also contains a single entry, for an object in use – object number 3. The following subsection contains two objects (number 23 and 24), which are both in use. The generation number of object 23 indicates that its object number has been reused twice. Lastly, the fourth subsection contains a single object that is in use (object number 30).
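The example section can also be read programmatically. The following Python sketch is a simplified illustration: parse_xref_section and the key names of the result are hypothetical, the input is assumed to be text with one entry per line, and the strict 20-byte entry length is not verified.

```python
EXAMPLE_SECTION = """\
xref
0 1
0000000000 65535 f
3 1
0000025325 00000 n
23 2
0000025518 00002 n
0000025635 00000 n
30 1
0000025777 00000 n
"""


def parse_xref_section(text: str) -> dict:
    """Map object numbers to their cross-reference entries (hypothetical helper)."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or lines[0] != "xref":
        raise ValueError("a cross-reference section starts with the keyword xref")
    entries = {}
    i = 1
    while i < len(lines):
        # Subsection header 'X Y': first object number and number of entries.
        first, count = (int(x) for x in lines[i].split())
        i += 1
        for obj_num in range(first, first + count):
            field, generation, kind = lines[i].split()
            entries[obj_num] = {
                # Byte offset for 'n' entries, next free object number for 'f'.
                "offset_or_next_free": int(field),
                "generation": int(generation),
                "in_use": kind == "n",
            }
            i += 1
    return entries
```

Calling parse_xref_section(EXAMPLE_SECTION) yields entries for objects 0, 3, 23, 24 and 30, matching the description of the example.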

1.3.4 File trailer

The trailer section is used to quickly locate the cross-reference table, as well as a few other special objects. The trailer has the following format:

trailer
<< key1 value1
   ...
   keyn valuen >>
startxref
[byte offset]
%%EOF

Reading a PDF file entails an unusual concept: the trailer is actually the first thing that is read after the header. This is to improve the speed at which the objects can be accessed when loading the PDF file. The trailer section is denoted by the keyword trailer on a single line, followed by lines containing key-value pairs enclosed in double angle brackets – the trailer dictionary. The dictionary is followed by two lines. The first one contains the keyword startxref and the second one the byte offset from the beginning of the file to the beginning of the xref keyword in the last cross-reference section in the decoded stream. The trailer dictionary can contain the following 6 keys:

• Size: indicates the total number of entries in the cross-reference table.

• Prev: only used when the file has more than one cross-reference section (possible in updated files). The value indicates the byte offset from the beginning of the file to the beginning of the previous cross-reference section in the decoded stream.

• Root: the catalog dictionary for the PDF document. The catalog contains information on how the document should be displayed on the screen, as well as references to objects that define the document's contents, outline and other attributes.

• Encrypt: contains the document's encryption dictionary.

• Info: contains the document's information dictionary, which contains metadata for the document.

• ID: an array of two byte strings that together form a file identifier.
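Since the trailer is read first, a parser typically starts from the end of the file. The Python sketch below shows one possible way to obtain the offset that follows startxref; find_startxref is a hypothetical helper, and it assumes the keyword, the offset and the %%EOF marker appear in exactly the layout shown above.

```python
import re


def find_startxref(data: bytes) -> int:
    """Return the byte offset stored after the last startxref keyword (hypothetical helper)."""
    pos = data.rfind(b"startxref")
    if pos == -1:
        raise ValueError("no startxref keyword found")
    # Expected layout: startxref, the offset on the next line, then %%EOF.
    match = re.match(rb"startxref\s+(\d+)\s+%%EOF", data[pos:])
    if match is None:
        raise ValueError("malformed end of file")
    return int(match.group(1))
```

Using the last occurrence of the keyword matters for updated files, where every revision contributes its own trailer.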

1.3.5 Incremental updates

One of the advantages the PDF format offers is the use of incremental updates. Whenever a PDF file is updated, the original contents remain untouched; instead, the changes are appended to the end of the file.

Figure 1.2: Incrementally updated PDF file

This way, small changes made to a very large document can be saved very quickly. It also means it is possible to save changes to documents in a situation where we cannot overwrite the contents of the edited file (e.g. when editing a document via an HTTP connection). The incremental file structure is shown in figure 1.2. When a file is updated, the newly added cross-reference section only consists of entries for objects that have been changed, replaced or deleted. Objects that have been deleted are not actually removed from the file. They remain unchanged within the file, but the corresponding cross-reference entries mark them as deleted. The newly added trailer is the same as the previous trailer, with the exception of the Prev entry (if there is one). A new Prev entry that points to the location of the previous cross-reference section is added instead.
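The effect of stacked cross-reference sections can be modeled very simply. In the Python sketch below, each section is assumed to be already parsed into a dict that maps an object number to a hypothetical (state, value) pair; the function name merge_xref_sections is likewise only illustrative.

```python
def merge_xref_sections(sections):
    """Combine cross-reference sections, oldest first (hypothetical helper).

    Entries from later sections override earlier ones, which mirrors how an
    update changes, replaces or deletes objects without touching the original data.
    """
    merged = {}
    for section in sections:
        merged.update(section)
    return merged


# Original file: objects 1 and 2 are in use.
original = {1: ("n", 15), 2: ("n", 120)}
# Update: object 2 is marked as deleted, object 3 is added.
update = {2: ("f", 0), 3: ("n", 900)}
current_view = merge_xref_sections([original, update])
```

The bytes of object 2 are still present in the file; only the merged table no longer refers to them, which is why earlier revisions of a document can often be recovered during analysis.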

1.3.6 Object streams

Introduced in PDF 1.5, an object stream is a stream object that contains a sequence of indirect objects. The purpose of object streams is to allow indirect objects to make use of PDF stream compression filters. Not all objects can be stored this way – stream objects, objects with a generation number greater than zero, a document's encryption dictionary and an object representing the value of the Length entry in an object stream dictionary cannot be stored in an object stream. The entries listed below are those of the closely related cross-reference stream dictionary, also introduced in PDF 1.5, which contains all the regular stream entries as well as a few additional ones:

• Type: is a required entry that indicates what object the dictionary describes.

• Size: has value equal to the highest object number in the section plus one. Required entry.

• Index: is an optional entry consisting of an array of integers that contains a pair of integers for each subsection in the section. The first integer is the first object number in the subsection and the second one is the number of entries in the subsection.

• Prev: is used only when the file has more than one cross-reference stream; it serves the same function as the Prev entry in the trailer dictionary (indicates the byte offset from the beginning of the file to the beginning of the previous cross-reference stream).

• W: is a required entry representing the size of the fields in a single cross-reference entry.

Chapter 2

Vulnerabilities and exploits

The use of vulnerabilities in Adobe Acrobat and Adobe Reader as an attack vector first appeared in early 2007. In recent years, PDF has gained much popularity and become a de facto standard for sharing electronic documents – many PCs are even shipped with Adobe Reader pre-installed. Moreover, due to the way the format is designed, it not only has a very large attack surface, but also provides features that help malware authors evade detection by anti-virus software. All of these are part of the reason why PDF has been receiving more and more attention from malware authors lately. This chapter describes malicious document distribution channels, lists known vulnerabilities in Adobe Reader and Acrobat and discusses obfuscation techniques used in PDF files.

2.1 PDF as an attack vector

Before going into specific vulnerabilities and exploits, it is important to understand how exactly the act of opening a malicious PDF document can lead to arbitrary code being executed on the victim machine. It should be noted that the vulnerabilities do not occur within the PDF format itself, but rather in the implementations of various PDF browsers. The workflow of infection via a PDF document is shown in figure 2.2.

Figure 2.1: Incrementally updated PDF file

When a user opens a malicious file in a vulnerable browser, the code inside attempts to insert a certain sequence of bytes at a previously determined location in the PDF reader's memory. The technique of inserting data into memory is called heap spraying [26] and the malicious piece of code being inserted is called shellcode. The vast majority of PDF related exploits use JavaScript to perform the heap spray. In this case, the desired effect can be achieved by concatenating a single character string with itself many times, growing it exponentially. The result of each iteration is then copied together with the desired shellcode into an array. This way, it is assured that a sufficient amount of memory has been sprayed to successfully carry on with the infection process.

Shellcode in this instance represents a small segment of code, usually written in machine code. It serves as a payload for the exploit and typically downloads and executes another malicious program (e.g. a trojan), thus finishing the infection process. The term itself originally described code that spawned a Unix command shell, but has since been accepted as a general term for code that allows an attacker to gain control over the victim machine after a vulnerability has been successfully exploited.
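The exponential growth mentioned above is the reason why only a handful of iterations is needed to fill a large block of memory. As a small arithmetic sketch (the helper name doublings_needed is ours, and it only models string lengths, not memory contents):

```python
def doublings_needed(initial_length: int, target_length: int) -> int:
    """Count how many self-concatenations grow a string to the target length."""
    length = initial_length
    rounds = 0
    while length < target_length:
        length += length  # same effect on the length as s += s has on a string
        rounds += 1
    return rounds


# A two-character seed reaches 0x4000 characters after 13 doublings.
ROUNDS = doublings_needed(2, 0x4000)
```

For an analyst, a short loop of this shape that is bounded by a large constant is therefore a useful indicator when reading JavaScript extracted from a document.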

2.2 Propagation and payload execution

There are several ways infected PDF documents can be distributed. The most widespread method entails attaching the malicious file to an email that is in some way enticing to the victim. Another possibility is to host the file on a Web page and deliver it when a victim visits the page. While not prevalent among the samples found in the wild, it is in fact also possible for a malicious document to execute its payload without the victim ever opening the file.

2.2.1 Email

In most cases, the author of a malicious document relies on some form of social engineering to manipulate the victim into opening the infected file. Depending on the amount of premeditation that goes into creating such a document, it is possible to identify two distinct techniques – email spam and targeted attacks.

The use of mass mail campaigns with PDF attachments that entice the user is an ongoing issue. The attack relies on the victim viewing the document with a vulnerable viewer, thus executing the embedded payload. Therefore, the content of such messages is key to their success and needs to be as convincing as possible. It usually revolves around current news headlines or political, religious and other controversial subjects. A common technique in recent years is to copy the content of recent major news articles from the Web and spoof the sender's name to make it seem that the mail originates from the author of the article. Attached PDF files often have legitimate content, so as not to alert the user, but they also carry a payload that is dropped when the file is opened.

Targeted attacks are the second, more dangerous form of attack. They involve the use of PDF files and social engineering techniques to target a specific organization or individual. Authors of such attacks invest a significant amount of time into researching and gathering information about the victim in order to come up with a document that is plausible and seems to originate from a legitimate source. Targeted attacks are much rarer than simple spam mails, due to the effort and planning that is necessary to execute them. On the other hand, they tend to be much more sophisticated and stealthy, with the victim often never realizing their machine has been infected.

2.2.2 Drive-by downloads

Another way to use malicious PDF documents to compromise victim computers is to host such files on a Web page. Unlike the email propagation methods, the victim doesn't need to actively open the document. Instead, a web browser plug-in automatically launches the file and the infection process happens silently in the background. In most cases, the Web page contains JavaScript code [27] that gathers information about the visitor's machine and chooses one of several different exploits based on that information. These automated attacks are in most cases carried out by web exploit kits – packs of malicious programs that are available for purchase on the black market.

2.2.3 COM objects

Most malware authors rely on the methods described in sections 2.2.1 and 2.2.2 to get the victims to open the infected document. However, it should be noted that on Windows operating systems with Adobe Reader installed, it is possible for the malicious payload of PDF documents to be executed without any user interaction at all. As a part of its installation package, Adobe Reader installs a Windows Explorer Shell Extension and an iFilter, both of which allow Windows components to read PDF files without any user input.

The Column Handler Shell Extension is a COM [28] object that allows Windows Explorer to read and process PDF documents (for example to render a thumbnail preview of the document). Column handlers are programs that provide Windows Explorer with additional data to display. When a PDF file is listed in a window, the column handler extension is called to read the PDF document and extract all the necessary information (such as Title, Author, etc.). This, however, can cause the exploit to be triggered and the malicious code to be executed inside the Windows Explorer process.

The iFilter is a COM object that extends the Windows Indexing Service and provides it with the ability to read and index PDF files. When a PDF document is found, the Windows Indexing Service calls the iFilter, which in turn loads the Acrobat PDF parser, thus executing the payload inside the PDF document.

2.3 Adobe Reader vulnerabilities

PDF can be in many ways considered a container format – 3D data, pictures, multimedia files, Flash [29] or JavaScript can all be embedded within a document. Documentation for vulnerability CVE-2007-5020, which entailed taking advantage of a flaw related to the getmail() function, was released in September 2007. Since then, many different exploits have appeared and continue to be used with varying degrees of success. The exploits can be broadly classified into three categories – JavaScript based, Flash based and others.

JavaScript based exploits are often used in malicious PDF documents. Support for JavaScript programming was added in PDF 1.3, with many JavaScript functions being available in the form of APIs. JavaScript based exploits represent a significant portion of all PDF exploits and disabling JavaScript support in PDF reader software is an effective way to neutralize them.

Embedded Flash objects have been the prime source of PDF exploits in recent months. Currently, five out of the last six published vulnerabilities revolve around flaws in Flash, with all five being ranked as critical. In some cases, JavaScript is used to help set up or trigger the exploit.

2.3.1 Most frequently exploited vulnerabilities

Publicly known information security vulnerabilities are cataloged in the CVE (Common Vulnerabilities and Exposures) [30] dictionary. Currently, thirty vulnerabilities related to PDF files can be found within the dictionary, each of them being exploited in the wild with varying degrees of frequency and success. This section lists all commonly exploited PDF threats.

CVE-2008-0655 was released in early February 2008. The threat targets Adobe Reader and Adobe Acrobat 8.1.1 and earlier versions. It carries a high level of risk, since a successful exploit allows remote code execution and subsequent remote control of the victim machine. The exploit involves a specially crafted PDF file, which contains JavaScript that sprays the heap with malicious shellcode and then executes it using a malformed Collab.collectEmailInfo() call. An example of the code responsible for the stack overflow follows:

var overflow = unescape("%u0a0a%u0a0a");
while (overflow.length < 0x4000)
    overflow += overflow;
this.collabStore = Collab.collectEmailInfo({subj: "", msg: overflow});

CVE-2008-2992 was originally published in April 2008. This vulnerability affects Adobe Reader and Adobe Acrobat 8.1.2 and earlier. It is another stack-based buffer overflow that allows attackers to execute arbitrary code on the victim machine. The vulnerability occurs due to a boundary error when parsing a malformed util.printf() JavaScript call. The following is an example of the JavaScript code that triggers the exploit:

var badnumber = "12999999999999999999";
for (i = 0; i < 276; i++) {
    badnumber += "8";
}
util.printf("%45000f", badnumber);

CVE-2009-0927 was initially published in March 2009. The vulnerable versions are Adobe Reader and Adobe Acrobat 7 prior to version 7.1.1, Adobe Reader and Adobe Acrobat 8 prior to version 8.1.4 and Adobe Reader and Adobe Acrobat 9.0.0. It is another JavaScript based exploit, which uses a specially crafted PDF file to spray the heap with malicious shellcode and cause a stack-based buffer overflow via the Collab object's getIcon() method. The end result is remote code execution and consequent remote control of the infected machine. The following is an example of JavaScript responsible for the buffer overflow:

var exploit = unescape("%09");
while (exploit.length < 0x4000) {
    exploit += exploit;
}
exploit = "N." + exploit;
app.doc.Collab.getIcon(exploit);

CVE-2009-4324 is yet another JavaScript based vulnerability. It was widely exploited in the wild during December 2009, until a patch was released in January 2010. It targets Adobe Reader and Adobe Acrobat versions 9.x before 9.3, and versions 8.x before 8.2 on Windows and Mac OS X. The exploit itself takes advantage of a flaw in the newPlayer() method of the doc.media object. Specifically, the bug is triggered by supplying a null argument to the newPlayer() method and using the util.printd function to allocate a new object (of equal size) where the media object should be. The code that triggers the exploit may look like this:

var exploit1 = "0d0c0d0c0d0c0d0c41706d7a554b4d67794f6f4d585a764";
var exploit2 = "f4c566fb48584249666d5666f625a456775686a46525871";
var exploit3 = "4e79617a614878756b754d4c57647a58704d4644624b4f";
try { this.media.newPlayer(null); } catch (e) {}
util.printd(exploit1 + exploit2 + exploit3, new Date());

CVE-2010-0188 was published in February 2010. It targets a vulnerability in Adobe Reader and Acrobat 8.x before 8.2.1 and 9.x before 9.3.1. The exploit takes advantage of the fact that Adobe used an outdated version of the libtiff library (a library used to read, write and manipulate TIFF images), which is susceptible to exploitation of vulnerability CVE-2006-3459. The malicious PDF file contains a base64 encoded TIFF image and an XFA form with embedded JavaScript. The JavaScript sprays the heap with shellcode and then performs a set call to the ImageEdit element of the AcroForm.api plug-in. The call uses the TIFF image combined with an invalid parameter, thus triggering the exploit.

CVE-2010-2883 was disclosed to the public in September 2010. The vulnerable versions of Adobe Reader and Adobe Acrobat are 9.x prior to 9.4, and 8.x prior to 8.2.5 on Windows and Mac OS X. The vulnerability lies in the CoolType.dll library installed by Adobe Reader and Acrobat. Once the reader software initializes the CoolType Typography Engine, a function called strcat (which is known to be insecure) is used to overwrite the stack. The exploit is then triggered via the library icucnv36.dll, which redirects application flow control. CVE-2010-2883 is one of the few critical vulnerabilities that are capable of bypassing techniques that prevent stack overflow and heap overflow (Data Execution Prevention and Address Space Layout Randomization) and that do not rely on the use of JavaScript or Adobe Flash.

CVE-2010-3654 was released in October 2010. The vulnerability exists in the authplay.dll component that comes with Adobe Reader and Adobe Acrobat 9.4 and earlier 9.x versions for Windows, Mac OS X and UNIX operating systems. The vulnerability can be exploited by a specially crafted PDF file, which contains JavaScript designed to spray the heap with malicious shellcode and a SWF file that causes a type confusion in ActionScript bytecode, which triggers the exploit.

CVE-2011-0609 was discovered and published in March 2011 and affects the authplay.dll component shipped with Adobe Reader and Acrobat 9.x through 9.4.2 and 10.x through 10.0.1 on Windows and Mac OS X. In this case, the malicious PDF file contains two SWF files. The first one sprays the heap with shellcode through AVM2 (ActionScript Virtual Machine version 2), retrieves the second SWF file from a hex encoded string object and then executes it. The second file then takes advantage of the fact that the AVM2 does not check bytecode streams properly before executing them. This leads to execution of memory previously initialized by the heap spray, thus triggering the exploit.

CVE-2011-0611 is another vulnerability involving the authplay.dll library. It was released to the public in April 2011 and the affected versions of Adobe Reader are 9.x before 9.4.4 and 10.x through 10.0.1 on Windows, 9.x before 9.4.4 and 10.x before 10.0.3 on Mac OS X. Vulnerable versions of Adobe Acrobat include 9.x before 9.4.4 and 10.x before 10.0.3 on Windows and Mac OS X. The exploitation process is quite similar to the one used to exploit CVE-2011-0609. The infected PDF file contains a SWF file that sprays the heap with malicious shellcode and retrieves a second SWF file from a hex encoded string object. The second SWF contains code that crashes Adobe Flash and overwrites pointers in memory. This triggers the exploit and results in execution of the malicious shellcode. CVE-2011-0611 is currently the most frequently exploited PDF vulnerability.

CVE-2011-2462 is a vulnerability found in the Universal 3D (U3D) component shipped with Adobe Reader and Adobe Acrobat. Affected versions are 10.1.1 and earlier on Windows and Mac OS X, and Adobe Reader 9.x through 9.4.6 on UNIX systems. It was discovered and published in December 2011. The vulnerability can be exploited by a specially crafted PDF file. The process involves spraying the heap via JavaScript and then rendering the malformed U3D object, which causes memory corruption and triggers the vulnerability.

2.4 PDF obfuscation techniques

One of the main factors contributing to the high popularity of PDF files among authors of malware is the fact that the format provides many ways and techniques to obfuscate the contents of a file. This is further aided by less than strict PDF reader implementations that often parse documents correctly even though the file contains severe deviations from the format specification, thus providing malware authors with even more options. It is quite easy to design and implement an effective obfuscation of PDF files and the purpose of employing such evasion techniques is twofold. Not only does it make manual analysis of the contents of suspicious PDF files much more difficult and time consuming, but it also renders anti-virus scanners ineffective. Additionally, since many exploits rely on the use of JavaScript, all obfuscation techniques specific to JavaScript also apply.

2.4.1 Header

According to the PDF specification, a PDF file should start with a header of the form %PDF-1.x (where x is a number that identifies the version of PDF) on a single line. However, Adobe Reader will parse the file correctly as long as the header is anywhere within the first 1024 bytes of the file. Moreover, the last character can be omitted and the header is still recognized. The goal of such a non-conforming header is to confuse anti-virus software so that the file is not analyzed as a PDF document.
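An analysis tool therefore has to search for the header in the same lenient way the reader does. The following is a minimal sketch of such a check; the function name find_pdf_header and the regular expression are our own assumptions and only model the two relaxations described above (header anywhere in the first 1024 bytes, last character optional).

```python
import re
from typing import Optional, Tuple

# Matches "%PDF-1.4" as well as the truncated form "%PDF-1."
HEADER_RE = re.compile(rb"%PDF-(\d\.\d?)")


def find_pdf_header(data: bytes, window: int = 1024) -> Optional[Tuple[int, str]]:
    """Return (byte offset, version text) of the header, or None if absent."""
    match = HEADER_RE.search(data[:window])
    if match is None:
        return None
    return match.start(), match.group(1).decode("ascii")


CONFORMING = find_pdf_header(b"%PDF-1.4\n1 0 obj\n")
SHIFTED = find_pdf_header(b"junk" * 10 + b"%PDF-1.\n1 0 obj\n")
TOO_LATE = find_pdf_header(b"x" * 2000 + b"%PDF-1.4\n")
```

A non-zero offset or a truncated version string does not prove that a file is malicious, but both are deviations from the specification that are worth reporting to the analyst.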

2.4.2 String objects

Strings inside a PDF often store JavaScript scripts or other malicious code. String objects can be represented in two distinct ways, both of which can be used by malware authors to add a layer of obfuscation to their scripts. Strings enclosed in parentheses can be split over many lines by terminating each line with a backslash character. Additionally, each character can be represented by its octal value preceded by a backslash. Strings can also be written as hexadecimal characters (optionally separated by white space) enclosed in angle brackets. For example:

/JS (app.alert({cMsg: 'Hello World!'});)

and

/JS (\141\160\160.\141\154\145\162t({\
c\115sg: 'H\145llo \
World!'});)

and

/JS <6170702E616C657274287B634D73673A202748656C6C6F20 576F 72 6C 64 21 27 7D 29 3B>

are all equivalent and valid strings.
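To compare such strings, an analysis tool has to normalize both representations first. The sketch below decodes the body of a literal string (octal escapes, standard escapes and backslash line continuations) and the body of a hexadecimal string. The function names are our own, and the sketch assumes the enclosing parentheses or angle brackets have already been removed.

```python
import re

_ESCAPES = {b"n": b"\n", b"r": b"\r", b"t": b"\t", b"b": b"\b", b"f": b"\f",
            b"(": b"(", b")": b")", b"\\": b"\\"}
_ESCAPE_RE = re.compile(rb"\\(\r\n|\r|\n|[0-7]{1,3}|.)", re.DOTALL)


def decode_literal_string(body: bytes) -> bytes:
    """Resolve escape sequences inside a parenthesized PDF string."""
    def replace(match):
        token = match.group(1)
        if token in (b"\r\n", b"\r", b"\n"):
            return b""  # backslash + end of line: the string simply continues
        if re.fullmatch(rb"[0-7]{1,3}", token):
            return bytes([int(token.decode("ascii"), 8) & 0xFF])
        return _ESCAPES.get(token, token)  # unknown escape: drop the backslash
    return _ESCAPE_RE.sub(replace, body)


def decode_hex_string(body: bytes) -> bytes:
    """Decode the contents of a <...> string, ignoring white space."""
    digits = b"".join(body.split())
    if len(digits) % 2:
        digits += b"0"  # a missing final digit is assumed to be 0
    return bytes.fromhex(digits.decode("ascii"))


PLAIN = b"app.alert({cMsg: 'Hello World!'});"
OCTAL = (rb"\141\160\160.\141\154\145\162t({" + b"\\\n"
         + rb"c\115sg: 'H\145llo " + b"\\\n" + rb"World!'});")
HEXED = (b"6170702E616C657274287B634D73673A202748656C6C6F20"
         b" 576F 72 6C 64 21 27 7D 29 3B")
```

Both decode_literal_string(OCTAL) and decode_hex_string(HEXED) yield the same bytes as PLAIN, which is what allows keyword searches to be run on the normalized form.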

2.4.3 Name objects

Name objects can contain a hexadecimal representation of alphanumeric characters (i.e. #61 is equivalent to the character a). Any number of characters can be represented this way. This can be used to disguise important key entries in dictionaries – such as /JavaScript, /JS, /RichMedia and /OpenAction. For instance, the name object /#52#69#63#68#4D#65#64#69#61 is equivalent to the name object /RichMedia.
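Undoing this obfuscation is straightforward. A minimal sketch, assuming the name has already been isolated from the surrounding dictionary (the function name normalize_name is ours):

```python
import re


def normalize_name(name: bytes) -> bytes:
    """Replace every #xx escape in a PDF name with the byte it encodes."""
    return re.sub(rb"#([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1).decode("ascii"), 16)]),
                  name)


FULLY_ENCODED = normalize_name(b"/#52#69#63#68#4D#65#64#69#61")
PARTLY_ENCODED = normalize_name(b"/J#61vaScript")
```

Since any subset of the characters may be encoded, normalization has to be applied before names are compared against a list of suspicious keywords.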

2.4.4 Encryption

The contents of PDF files can be encrypted using the AES or RC4 algorithms. Such encryption doesn't change the structure of the file itself. That means it is still possible to see all the indirect objects and their corresponding attributes, but all the strings and data streams are encrypted. The purpose of encryption within PDF documents is twofold. Firstly, it can be used to prevent any user that doesn't know the password from viewing the contents of the document (this is called the user password). Secondly, encryption can be used to prevent printing or modification of the document (the owner password). Using a password to encrypt malicious documents would be counterproductive from the malware author's point of view, because the user would be alerted by an unexpected password dialog window. However, an empty string is also considered a valid password. PDF documents encrypted using an empty string as the user password can be viewed in any reader software without any ill side effects. This provides malware authors with another layer of obfuscation, because the content of a suspicious PDF file needs to be decrypted before it can be analyzed.

2.4.5 Filters

The PDF format allows data inside stream objects to be encoded and compressed using filters. A very common choice among malware authors is to use the FlateDecode filter, which decompresses data encoded using the zlib/deflate compression method. It is also possible to apply several filters in a cascade (i.e. /Filter [/FlateDecode /ASCII85Decode] represents a data stream that is encoded in ASCII base-85 representation and then compressed with the zlib/deflate method). Moreover, when a hexadecimal representation is used, it can contain an arbitrary number of white spaces, therefore producing varying sequences of bytes when compressed by the zlib/deflate method.
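Decoding a cascade means applying the filters in the order in which they are listed. The following sketch covers only the two filters named above, using the Python standard library. The table FILTERS and the function names are our own, and the sample stream is built locally for illustration rather than taken from a real document.

```python
import base64
import zlib


def flate_decode(data: bytes) -> bytes:
    return zlib.decompress(data)


def ascii85_decode(data: bytes) -> bytes:
    data = data.strip()
    if data.startswith(b"<~"):   # optional Adobe-style start marker
        data = data[2:]
    if data.endswith(b"~>"):     # end-of-data marker used in PDF streams
        data = data[:-2]
    return base64.a85decode(data)


FILTERS = {"/FlateDecode": flate_decode, "/ASCII85Decode": ascii85_decode}


def decode_stream(raw: bytes, filter_names: list) -> bytes:
    """Apply each listed filter in turn, first to last."""
    for name in filter_names:
        raw = FILTERS[name](raw)
    return raw


PAYLOAD = b"app.alert('Hello World!');"
# Encoded the way the text describes: base-85 first, then deflate.
ENCODED = zlib.compress(base64.a85encode(PAYLOAD) + b"~>")
DECODED = decode_stream(ENCODED, ["/FlateDecode", "/ASCII85Decode"])
```

A complete tool would need an entry in the table for every filter defined by the specification, as well as support for filter parameters, which this sketch leaves out.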

2.4.6 Code fragmenting

Malware authors often take advantage of the fact that it is possible to split up JavaScript into multiple statements and functions. The fragments can then be stored inside several indirect objects and concatenated together later. One special case is the use of name dictionaries. A name dictionary is used to reference objects by a name instead of an object reference. All actions within the name dictionary are also executed when a PDF document is opened, in order to define JavaScript functions that can later be used by other scripts present in the document. This also ensures that the malicious script fragments are gathered and the JavaScript is executed as a whole. Additionally, it is possible to compress or encrypt each fragment separately, which helps the file evade detection by AV scanners and complicates manual analysis.

2.4.7 Object streams

An object stream is a special type of indirect object (denoted /ObjStm) that contains one or more other indirect objects. Any indirect object may be embedded inside an object stream (with the exception of other object streams) and as such can contain JavaScript and other potentially malicious content. Additionally, filters may also be applied to object streams to encode and compress their contents.

2.4.8 JavaScript obfuscation

The implementation of the JavaScript engine within Adobe Reader is very similar to the one found in web browsers. As such, all the detection evasion techniques commonly used to obfuscate malicious code in JavaScript-based malware also apply to JavaScript in PDF. Since the mere presence of obfuscated JavaScript within a PDF document is enough to consider the file highly suspicious, detailed analysis and description of these generic methods are outside the scope of this work. In general, the most common techniques include:

• Regular expressions: can be used to retrieve and put together code hidden inside a long string. Regular expressions can produce fairly complex and efficient obfuscation, especially when combined with other techniques.

• Unescape and replace functions: the unescape() function is used to decode a hexadecimal representation of raw binary data. This can be used as a method of obfuscation when the decoded data are an ASCII string. The replace() function searches for a substring or regular expression within a string and then replaces any matched substring with a new substring. These two functions are often used in conjunction to help obfuscate malicious code.

• Eval function: eval() is a global function that allows a string to be evaluated as if it were an expression. Additionally, if the argument is one or several JavaScript statements, eval() executes them. Typically used in combination with regular expressions and code splitting, eval() is one of the most common and effective ways to obfuscate malicious JavaScript code.
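During static analysis, the first two techniques can be reversed without ever executing the script. The sketch below re-implements the decoding performed by unescape() (both the %XX and the %uXXXX form) and strips filler characters in the way a replace() call with a global regular expression would. The function names and the sample strings are our own illustration.

```python
import re

_UNESCAPE_RE = re.compile(r"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})")


def js_unescape(text: str) -> str:
    """Decode %XX and %uXXXX sequences; other characters pass through."""
    def replace(match):
        code = match.group(1) or match.group(2)
        return chr(int(code, 16))
    return _UNESCAPE_RE.sub(replace, text)


def strip_filler(text: str, filler: str) -> str:
    """Remove every occurrence of a filler pattern from the text."""
    return re.sub(filler, "", text)


# Hypothetical obfuscated sample: filler "XX" inserted into escaped text.
HIDDEN = "%61%70XX%70%2E%61%6CXX%65%72%74"
REVEALED = js_unescape(strip_filler(HIDDEN, "XX"))
```

The decoded text is only inspected, never passed to an interpreter, which is the main difference between the analyst's workflow and the eval() call used by the malicious script.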

2.5 PDF JavaScript

Adobe's implementation of the JavaScript engine also offers several specific functions that allow retrieval of data from various indirect objects inside a PDF file. The purpose of these API functions is to provide access to the document's information dictionary (which stores metadata, such as title, author, subject, keywords, etc.), as well as data stored inside various widgets. However, this functionality can be exploited by malware authors, who can hide encrypted JavaScript or shellcode inside these indirect objects. The malicious code can then be retrieved using appropriate indirect object references. The benefit of such an approach is the fact that all visible JavaScript appears to be legitimate and all referenced objects must be examined in order to determine whether the JavaScript is in fact harmful. Examples of API functions previously used by malware authors are described in the following sections (2.5.1 to 2.5.3).

2.5.1 Function getField

The getField() function provides access to data stored in the field object of a specified widget. For instance, the following example inserts the current date into a text field:

function date() {
    var field = this.getField("Date");
    field.value = util.printd("dd mmmm yyyy", new Date());
}
date();

The field object can be used to store malicious JavaScript. For instance, a document can contain the following object:

10 0 obj
<<
/Type /Annot
/Subtype /Widget
/DV (%61%70%70%2E%61%6C%65%72%74%20%28%22%48%65%6C%6C%6F%20%57%6F%72%6C%64%21%22%29%3B)
/V (%61%70%70%2E%61%6C%65%72%74%20%28%22%48%65%6C%6C%6F%20%57%6F%72%6C%64%21%22%29%3B)
/T (Harmful)
>>
endobj

The value (/V) and default value (/DV) entries both contain JavaScript obfuscated as escaped characters. The field name entry (/T) denotes the name under which this object can be referenced. The following JavaScript code then, after decoding the hexadecimal representation into raw data, executes the code stored inside the object:

eval(unescape(this.getField("Harmful").value));

In this case, the effect is a simple pop-up alert box with the text "Hello World!" in it. However, the JavaScript can be substituted with any malicious code, which can be executed or stored inside a variable for later use.
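From the analyst's point of view, the hidden script can be recovered by reading the field object directly instead of running the JavaScript. The following sketch does this for a widget object of the kind described in this section. The regular expressions are a simplification that assumes unnested, unescaped parentheses, and the names OBJECT_TEXT and extract_field are ours.

```python
import re
from urllib.parse import unquote

ENCODED = ("%61%70%70%2E%61%6C%65%72%74%20%28%22%48%65%6C%6C%6F"
           "%20%57%6F%72%6C%64%21%22%29%3B")
OBJECT_TEXT = f"""10 0 obj
<< /Type /Annot /Subtype /Widget
/DV ({ENCODED})
/V ({ENCODED})
/T (Harmful) >>
endobj"""


def extract_field(object_text: str) -> tuple:
    """Return (field name, decoded field value) of a widget object."""
    name = re.search(r"/T\s*\(([^)]*)\)", object_text)
    value = re.search(r"/V\s*\(([^)]*)\)", object_text)
    if name is None or value is None:
        raise ValueError("object has no /T or /V entry")
    raw = re.sub(r"\s+", "", value.group(1))  # drop stray line breaks
    return name.group(1), unquote(raw)        # percent-decode the value


FIELD_NAME, SCRIPT = extract_field(OBJECT_TEXT)
```

Here unquote() from the standard library stands in for unescape(), which is sufficient for the %XX form used in this object but not for %uXXXX sequences.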

2.5.2 Function this.info

The document information dictionary is a dictionary object referenced by the Info entry in the trailer of a PDF file. It contains various metadata, such as title, author, subject, keywords, producer, creator and creation date. Each entry may also be used to store arbitrary data and can be accessed via the JavaScript API function this.info().

2.5.3 Function getAnnot

The API function Screen.App.GetAnnot() operates in a very similar fashion to getField() and can be used to access the data stored in a screen annotation object (ScreenAnnot), which specifies a region of a page where media clips shall be played.

Chapter 3

PDF online viewers and analysis tools

In this chapter, we first discuss what options are available when it comes to viewing PDF files in a web browser environment. Next, we examine existing tools for PDF document analysis. The goal is to evaluate their features and capabilities and discuss the advantages and disadvantages they offer when performing static analysis of a malicious PDF document.

3.1 Online PDF viewers

The ability to view PDF files inside web browsers has undergone significant development in recent years. In general, there are two methods to render a PDF document in a browser window.

3.1.1 Browser plug-ins

The first is to use a plug-in for one of the native reader applications. The most prominent examples are plug-ins for Adobe Reader and Foxit Reader [31]. Most of the popular web browsers (Internet Explorer, Mozilla Firefox 18 or older, Safari, Opera) ship with Adobe PDF Reader pre-installed and set as the default PDF viewer. From a security standpoint, this approach has the obvious disadvantage of introducing all the vulnerabilities in Adobe Reader software to the browser.

Because a large percentage of malicious samples found in the wild specifically target Adobe Reader due to its popularity, Mozilla Firefox replaced the plug-in with a new built-in browser based purely on web technologies called PDF.js [32]. PDF.js was released in February 2013 as a part of Firefox 19. It currently stands in a unique position among PDF browsers as it is written entirely in JavaScript and HTML5 (it is not a browser plug-in), released as open source and its development is community driven. When displaying a document, PDF.js decodes and extracts its contents and then proceeds to display them as HTML5. The downside is that the reader currently doesn't support many of the more advanced features and interactive elements of the PDF format. At the time of writing, no major vulnerabilities in the implementation had been found.

Google Chrome ships with a Foxit Reader plug-in pre-installed. Foxit Reader suffered from several major vulnerabilities in the past, some of which were shared with Adobe's Reader. However, it is a safe assumption that an attacker is more likely to target Adobe PDF Reader due to its much larger user base. Moreover, Chrome's rendering engine runs in a sandboxed module with restricted privileges [33]. These two factors alone greatly reduce the chance that a user's system will be compromised by a malicious PDF file.

3.1.2 Server-side format conversion

The second approach is to convert the PDF file into another format, typically images, a Flash movie or HTML markup, which can then be viewed in a browser. This way, the risk connected with opening and rendering suspicious PDF documents is carried by the web server performing the conversion. Software, scripts and online services for conversion are readily available on all operating systems, both free and proprietary. The downside is that some of the more advanced features of the PDF format, such as multimedia and interactive content, are stripped from the document. Some examples of PDF to bitmap conversion software published under the GNU GPL license include SWFTools [34], GIMP [35] and Poppler [36]. FlexPaper [37] is a lightweight viewer component that can convert and publish PDF files on a web server in HTML, HTML5 or Flash format. Crocodoc [38] is a proprietary solution that allows users to embed documents in HTML5 format using a viewer JavaScript component or a cross-domain iframe. Online services that allow users to upload and view documents in image, Flash or HTML5 format include the following:

Google Docs is a web based office suite developed by Google as a part of its Google Drive service. Users can upload their files to Google Drive or use the embeddable online viewer found at https://docs.google.com/viewer. The viewer supports many different document and image formats. The only requirement is a compatible browser (currently Internet Explorer, Firefox, Chrome) with JavaScript enabled.

Scribd is a document sharing service available at http://www.scribd.com/. It allows users to upload documents in various formats and converts them into iPaper, which is a document format built with Adobe Flash. Uploaded documents can be made private and safely viewed in Flash Player.

A.nnotate was originally a proof-of-concept project on authoring structured content from text, issued by the Scottish government. It now functions as a web service (available at http://a.nnotate.com/) for storing and displaying documents in various formats, including PDF files. Uploaded files are processed and rendered on the server side and then sent to the browser as images. As the name suggests, one of the main features is allowing users to attach custom annotations to the documents.

Docstoc is an online document repository and store, which can be found at http://www.docstoc.com/. The main goal is to allow users to share or sell their documents and it is oriented more towards the needs of business owners. However, since it is free to use and allows anyone to privately upload and view their documents, it may be used to view suspicious or untrustworthy documents without risk. Uploaded documents are processed on the server side and served to the browser in the form of a Flash file.

3.2 Analysis tools

PDF analysis tools are specific programs that allow the user to examine the structure of an inspected document in detail. Typically, these tools are not capable of rendering the document itself, but instead provide means to view the contents of individual objects inside the file. To do this, they need to have the capability to remove the effects of the various obfuscation techniques described previously. In this chapter, we examine four such offline scripts and applications and three online tools. The selection covers all currently publicly available options that are commonly used by researchers. Online virus scanners (such as virustotal.com, virusscan.jotti.org or virscan.org) are not included. This is because even though they are capable of processing PDF files and scanning them for known vulnerabilities, they do not allow manual analysis of the inspected document.

3.2.1 PDF Tools

PDF Tools is a set of Python scripts developed by Didier Stevens, freely available for download at http://blog.didierstevens.com/programs/pdf-tools/. Released in 2008, it was one of the first tools allowing in-depth analysis of malicious PDF files. The source code has since been dedicated to the public domain by the author, which means it isn't protected by intellectual property laws. The pack consists of three separate command line scripts, two of which are relevant for our purposes, namely pdfid.py and pdf-parser.py. The last script (make-pdf.py) allows one to create PDF documents with embedded JavaScript, which is mostly useful for testing purposes.

3.2.1.1 PDFiD

This script performs a scan of a selected file and looks for certain predefined keywords. The idea is to first use this script to quickly establish an overview of the structure of a document and a list of potentially malicious objects inside it. The next step is then to use the PDF-parser script to examine the contents of any suspicious objects in detail. The program is capable of dealing with obfuscated name objects and is designed to be as simple as possible in order to keep security bugs to a minimum.
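The principle of such a keyword scan can be sketched in a few lines. The code below is our own simplified illustration and not the actual PDFiD implementation; the keyword list is a small subset chosen for the example, and names are normalized first so that hexadecimal escapes do not hide them.

```python
import re
from collections import Counter

# Illustrative subset only; a real scanner uses a longer list.
KEYWORDS = [b"/JS", b"/JavaScript", b"/OpenAction", b"/AA", b"/ObjStm"]


def normalize_names(data: bytes) -> bytes:
    """Replace #xx escapes so that /J#61vaScript becomes /JavaScript."""
    return re.sub(rb"#([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1).decode("ascii"), 16)]),
                  data)


def count_keywords(data: bytes) -> Counter:
    """Count whole-name occurrences of each keyword in the raw file bytes."""
    normalized = normalize_names(data)
    counts = Counter()
    for keyword in KEYWORDS:
        # The lookahead keeps /JS from matching inside a longer name.
        pattern = re.escape(keyword) + rb"(?![A-Za-z0-9])"
        counts[keyword.decode("ascii")] = len(re.findall(pattern, normalized))
    return counts


SAMPLE = (b"1 0 obj << /OpenAction 2 0 R >> endobj "
          b"2 0 obj << /S /J#61vaScript /JS (app.alert(1)) >> endobj")
COUNTS = count_keywords(SAMPLE)
```

A scan of this kind only sees what is stored in plain form, which is why keywords inside compressed object streams have to be examined separately.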

Figure 3.1: PDFiD malicious file scan results

The latest version looks for the following keywords: obj, endobj, stream, endstream, xref, startxref, trailer, /Page, /Encrypt, /ObjStm, /JS, /JavaScript, /AA, /OpenAction, /AcroForm, /JBIG2Decode, /RichMedia, /Launch, /EmbeddedFile, /Colors > 2^24. The first eight keywords in this list are not directly linked to any vulnerabilities inside the file and, barring some exceptions, will be present in every inspected document. They are included to give the analyst a general idea about the contents of the document, as malicious documents often have a single page and a high number of suspicious keywords compared to the total number of objects. The keyword /ObjStm indicates that there are object streams embedded in the document, which will need to be inspected later, as they may contain additional malicious objects. The rest of the keywords indicate the presence of objects that are either directly responsible for known vulnerabilities, help trigger them or can be used to hide a malicious payload. As such, each occurrence warrants a more thorough investigation.

The script can be supplied with several additional arguments to modify its behavior. Currently available options are:

• -s or --scan: is used to force the script to scan an entire directory.

• -a or --all: displays all possible name objects and their counts inside the file

• -f or --force: forces the script to scan a file even if it doesn't contain a valid PDF header

• -d or --disarm: is a feature that swaps case in all /AA, /OpenAction, /JS and /JavaScript name objects. Since the PDF language is case sensitive, PDF readers will simply skip these objects when parsing and rendering the document, thus disabling all JavaScript and automatic actions. It should be noted that objects inside object streams remain unaffected, so the disarm option can only be used reliably if we make sure none of the aforementioned names are contained inside the streams.

• -e or --extra: adds an additional layer of analysis in the form of entropy calculations. Specifically, it lists all name objects found after the last end-of-file marker, the entropy of all bytes, the entropy of all bytes inside streams and the entropy of all bytes outside streams. The idea is to allow the analyst to detect whether additional, malformed data were appended to the document. Data inside streams are typically compressed or encrypted, so their byte entropy should be close to the maximum value of 8.0 (the number of bits in a byte). The rest of the data is typically much less random, with entropy around 4 or 5. If any additional encoded or encrypted data were added outside the streams, the entropy value outside streams is usually significantly higher.
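To make the entropy measure concrete, the following sketch computes the Shannon entropy of a byte sequence in bits per byte. It only illustrates the formula; the function name and sample inputs are our own and are not taken from PDFiD.

```python
import math
import os
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, between 0.0 and 8.0."""
    if not data:
        return 0.0
    total = len(data)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(data).values()
    )

# Repetitive PDF syntax versus random bytes standing in for compressed
# or encrypted stream content.
plain = b"<< /Type /Catalog /Pages 2 0 R >>" * 100
random_like = os.urandom(65536)
print(round(byte_entropy(plain), 2))
print(round(byte_entropy(random_like), 2))
```

Comparing the value computed over the bytes outside streams with the range expected for plain PDF syntax is what allows the analyst to spot appended encoded data.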

3.2.1.2 PDF-parser

As the name suggests, PDF-parser is used to parse the structure of a PDF document and view the contents of individual objects. The tool is fully command line based and is controlled via arguments passed to the script. The available arguments are:

• -a or --stats: displays statistics of the objects inside the PDF document. This argument can be used for detecting unusual objects or as a way to fingerprint different samples of malicious PDF files that share origin and attack vector. This is because such files often have similar or identical statistics, despite having different size and content.

• -s or --search: is one of the most important arguments, which allows the user to search for a specified string inside indirect objects. The result is a list of objects that contain the provided string. The search is not case sensitive and can handle name object obfuscation via canonicalization. This option does not search inside object streams, so their contents need to be inspected manually.

• -o or --object: another core command, it outputs the content data of an indirect object with specified id.

• -r or --reference: selects all objects referencing the specified indirect object.

• -t or --type: retrieves all objects of a given type.

• -f or --filter: applies one or more filters to a stream. As of version 0.3.9 five different filters are supported (FlateDecode, ASCIIHexDecode, ASCII85Decode, LZWDecode and RunLengthDecode).

• -w or --raw: forces the script to output raw data, as opposed to the printable Python representation.

• -d or --dump: dumps the contents of specified stream to a file. This is useful when the content is actually a malicious payload.

• -n or --nocanonicalizedoutput: disables canonicalization of names in output

Figure 3.2: PDF-parser - searching for JavaScript

Figure 3.3: PDF-parser – viewing content of an object

All searches disregard versions, which means that if a document contains multiple versions of an object, all instances are output. Figures 3.2 and 3.3 show the process of searching for JavaScript in the file we scanned with PDFiD in figure 3.1. We see that the referenced object is in fact a stream compressed using the FlateDecode filter. Using the filter option then results in obfuscated JavaScript being extracted from the stream object.
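What applying filters to a stream amounts to can be sketched with the Python standard library. The example below handles only two of the filters named above and is not PDF-parser's code; the function names and the sample stream are assumptions for illustration.

```python
import binascii
import zlib

def ascii_hex_decode(data: bytes) -> bytes:
    """Decode ASCIIHexDecode data: hex digits, optional whitespace, '>' as end marker."""
    hex_digits = b"".join(data.split(b">")[0].split())
    if len(hex_digits) % 2:
        hex_digits += b"0"  # an odd final digit is treated as if followed by 0
    return binascii.unhexlify(hex_digits)

# Maps a filter name to the function that reverses it.
DECODERS = {
    "/FlateDecode": zlib.decompress,
    "/ASCIIHexDecode": ascii_hex_decode,
}

def apply_filters(data: bytes, filters: list[str]) -> bytes:
    """Apply the decoders in the order in which the filters are listed."""
    for name in filters:
        data = DECODERS[name](data)
    return data

# Build a stream that was compressed first and hex encoded second, then
# undo both layers in the order a /Filter array would list them.
script = b"app.alert('hello');"
encoded = binascii.hexlify(zlib.compress(script)) + b">"
print(apply_filters(encoded, ["/ASCIIHexDecode", "/FlateDecode"]))
```

Chaining several filters in this way is a common obfuscation technique, which is why automatic handling of filter chains matters for the tools reviewed in this chapter.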

3.2.2 peepdf

Developed and maintained by Jose Miguel Esparza, peepdf is a PDF analysis tool written in Python. It is available for download at http://eternal-todo.com/tools/peepdf-pdf-analysis-tool and the source code is published under the GNU General Public License version 3. Compared to PDF Tools by Didier Stevens, the author's aim is to provide a more comprehensive program, capable not only of dissecting PDF documents, but also of analyzing embedded JavaScript and shellcode. Installation of the Python-Spidermonkey module is required for the JavaScript analysis functionality. Shellcode emulation requires installation of the libemu library and its Python wrapper Pylibemu.

The user interface is command line based. When running the script, the initial step is to select one of four possible execution modes. Basic execution (no additional arguments) displays various statistics about the inspected document as well as a list of any suspicious objects. Figure 3.4 shows an example output of basic mode execution on a malicious document. XML output mode (-x) displays the same information as basic execution, but in XML format. Batch mode (-s) loads a sequence of commands previously prepared in a specified file and executes them, thus automating the analysis process. Finally, switching to interactive mode (-i) enables the user to choose from a new set of commands that allow more thorough inspection of the target document.

Figure 3.4: Basic execution of peepdf

The list of interactive mode commands is too long to describe each one in detail. In general, they can be grouped into four categories:

• PDF inspection: includes commands that allow the user to look for and inspect data inside objects. The search command looks for a specified string or hexadecimal value in the objects. This includes all streams both in their decoded and encrypted forms. The info command returns statistics about the document or any of its elements. The object and stream commands decode and show the content of a specified object or stream. It should be noted that as of version 0.2 only the /ASCIIHexDecode, /FlateDecode and /LZWDecode filters are supported. The rawobject and rawstream commands also display the content of an object or stream, but without decoding or decrypting it. The tree and offset commands show a visual representation of the dependencies between objects and of the physical structure of a document. It is possible to XOR stream data with an arbitrary key or to perform a brute force XOR search using the xor and xor_search commands.

• PDF modification: this category isn't strictly relevant to analysis of malicious documents. It includes a command that creates simple PDF files that execute specified JavaScript once opened (create) and commands that modify objects inside existing documents (i.e. the commands encrypt, decrypt, encode and decode).

• JavaScript analysis: is possible thanks to the integration of the Python-Spidermonkey module, which allows evaluation and calling of JavaScript functions and scripts in Python. It allows the user to execute target JavaScript code (command js) and view all its code stages (js_code). The js_join command joins code that is split into multiple strings into a single variable. If necessary, js_unescape and js_beautify can be used to improve the readability of obfuscated JavaScript. Lastly, js_analyse performs substitutions in the code until the last stage of JavaScript is obtained and then scans the result for escaped bytes and shellcodes. However, this automatic analysis does not support PDF specific JavaScript functions (e.g. getAnnots). Because these functions are often used by malware authors, additional manual analysis is usually necessary.

• Shellcode analysis: is performed by a wrapper around the libemu shellcode emulator. The analysis is started by running the sctest command on a target file or variable, which attempts to hook Win32 API calls made by the shellcode.
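The idea behind a brute force XOR search, as mentioned among the inspection commands above, can be sketched as follows. The sketch assumes a single-byte key and uses our own function names; it does not reproduce peepdf's xor_search implementation.

```python
def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte of data with a single-byte key."""
    return bytes(b ^ key for b in data)

def xor_search(data: bytes, needle: bytes) -> list[tuple[int, int]]:
    """Return (key, offset) pairs for which the decoded data contains needle."""
    hits = []
    for key in range(256):
        offset = xor_bytes(data, key).find(needle)
        if offset != -1:
            hits.append((key, offset))
    return hits

# A URL hidden with key 0x5A; the analyst only knows what to look for.
hidden = xor_bytes(b"junk http://example.invalid/payload.exe", 0x5A)
print(xor_search(hidden, b"http://"))
```

Searching for strings that a payload is likely to contain, such as a URL scheme, is a quick way to recover the key without emulating the shellcode.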

When searching for or accessing objects, it is possible to specify which version of the object we want to display. Figure 3.5 shows the process of searching for JavaScript in a suspicious document. We see that object 6 references object 111611, which in turn contains obfuscated JavaScript inside an encoded stream.

Figure 3.5: peepdf – searching for JavaScript in interactive mode

3.2.3 PDF Stream Dumper

PDF Stream Dumper is a PDF document analyzer developed by David Zimmer. The software is open source and runs only on MS Windows platforms (namely Windows 2000, Windows XP, Windows Vista and Windows 7). It was originally released in July 2010. Like peepdf, it was designed to be an all-in-one tool for malicious document analysis and as such includes functionality for JavaScript and shellcode analysis.

The program uses a graphical user interface (GUI) for all of its components. The main window consists of a standard toolbar on top, with the rest of the space divided into three main sections. The section on the left hand side contains a color coded list of all objects inside the file. Various filters can be applied to this list, so that only specific objects are shown. This can be used to filter out all duplicate objects or to show only objects that contain JavaScript, for instance. The main section then displays detailed statistics or the content of the selected object (in plain text or hexadecimal format). The last section at the bottom is reserved for displaying errors, debug information and search results.

PDF Stream Dumper includes all the necessary features for examining the contents of PDF files. It is capable of dealing with obfuscated name objects as well as chains of multiple filters. Objects and streams are automatically decoded and decrypted. As of version 0.9.417 eight filters are supported (FlateDecode, RunLengthDecode, ASCIIHexDecode, ASCII85Decode, LZWDecode, JBIG2Decode, CCITTFaxDecode, DecodePredictors). All objects that contain suspicious keywords are appropriately color coded (e.g. red for JavaScript, green for /OpenAction and /AA). It is possible to look for arbitrary strings inside the file or use predefined search options for elements involved in known vulnerabilities (JavaScript, Flash and U3D objects, TTF fonts, action tags, name obfuscation, XML streams, filter chains or PRC files). The tool can visually format PDF headers to make them easier to read. It is also capable of decrypting encrypted documents, although this feature doesn't seem to be working correctly with some of the latest malicious PDF samples.

Figure 3.6: PDF Stream Dumper interface

JavaScript analysis is performed in a separate window. The tool implements multiple deobfuscation techniques including manual and automatic unescaping of selected code, refactoring of variables and functions and visual formatting. The interface also has access to a toolbox class that can, among other things, simplify fragmented strings, produce hexdumps and read and write files. The scan for known vulnerabilities performs a simple string search for the names of vulnerable functions. Therefore, it is only effective for fully deobfuscated code. PDF Stream Dumper also implements several shellcode tools: a GUI wrapper for the shellcode analysis tool sclog, a GUI wrapper for the libemu based tool scdbg, stubs for saving shellcode into an executable and an XOR brute force feature.

3.2.4 Origami

Origami (found at http://esec-lab.sogeti.com/pages/Origami) is a PDF manipulation framework developed by Sogeti ESEC Lab. It is written entirely in Ruby and allows users to create, analyze and modify documents. The tool was initially released in May 2011. The current stable version is distributed via the RubyGems package manager and the development version can be downloaded from https://code.google.com/p/origami-pdf/. While the core of the framework is fully scriptable, the tool comes prepackaged with several scripts covering some of the commonly performed tasks. The script pdfdecompress removes all compression filters from a document, pdfdecrypt attempts to remove encryption from a specified document, pdfencrypt encrypts a document using a specified encryption algorithm, pdf2graph generates a Graphviz DOT file [39] out of a PDF document, and pdfmetadata extracts and displays document metadata. Pdf2ruby is a script that converts a PDF file into a Ruby script that can later be recompiled by Origami into the original document. Pdfcocoon takes a specified document, embeds it inside a new PDF file and adds an /OpenAction object that opens the original document upon launch. Pdfcop scans documents for signs of malicious structures that violate policies the user can specify in a separate configuration file. Finally, pdfwalker allows the user to view the contents of a PDF file in a GTK graphical user interface. For more customized tasks, users can write their own Ruby scripts or take advantage of the pdfsh command, which opens a classic Ruby shell that includes the Origami namespace. The API commands allow the user to view and manipulate the contents of documents on the object level. It is possible to add or remove pages, add actions and triggers, and attach files, signatures and Flash content.

Figure 3.7: Origami PDF Walker interface

3.2.5 jsunpack

Found at http://jsunpack.jeek.org, jsunpack is an online JavaScript unpacker created by Blake Hartstein. The functionality is provided by a back-end tool, jsunpack-n, that emulates browser functionality based on the provided input. Currently supported input types are PDF files, packet captures (pcap files), HTML files, JavaScript files and SWF files. The source code of jsunpack-n is published under the GNU General Public License version 2. The tool is capable of dealing with PDF obfuscation techniques including name obfuscation and filter chaining, as well as any JavaScript obfuscation. The latest version can also handle encrypted documents, although the decryption process doesn't seem to be working correctly for some samples, resulting in false negatives. It is not possible to manually inspect the PDF object structure. Instead, submitting a file or script generates a permanent report with analysis results and details. This report contains information on the analysis process, alerts for known vulnerabilities, as well as any detected JavaScript or shellcode, which also become available for download as separate files. The report is then marked as malicious or benign. It should be noted that jsunpack only detects JavaScript based exploits. PDF files using other methods of triggering exploits are marked as benign, unless the tool can detect malicious shellcode inside one of the objects.

Figure 3.8: jsunpack – report sample

3.2.6 Wepawet

Wepawet (http://wepawet.iseclab.org/) is an online service for analyzing web-based threats developed by a group of researchers from the University of California. The tool is proprietary and copyrighted; access is provided only for internal use by end users. It is capable of analyzing PDF, JavaScript or Flash files. Submitting a file or a URL generates a report containing deobfuscation results and a list of embedded shellcodes, malware and ActiveX controls. Figure 3.9 shows a report for an uploaded malicious PDF sample. It is not possible to manually inspect the structure of the file or the contents of PDF objects. Based on reports for multiple submitted samples, Wepawet is capable of dealing with name and string obfuscation, multiple filters and encryption. Some of the latest samples using encryption were incorrectly marked as benign (false negatives). Based on the result of the analysis, the report marks the file as benign, suspicious or malicious. Only files exploiting known vulnerabilities are marked as malicious. Not all vulnerabilities are correctly identified (e.g. CVE-2009-4324), but files containing these are still marked as suspicious due to the presence of obfuscated JavaScript and shellcode. The tool also integrates the shellcode analyzer shellzer, which generates a list of API functions called by the shellcode and identifies URLs fetched at run-time. However, this extended shellcode analysis functionality seems to be disabled at the moment, at least for the publicly available service.

Figure 3.9: Wepawet – sample report

3.2.7 PDF Examiner

PDF Examiner is a proprietary analysis tool developed by Malware Tracker. A free online version is available at http://www.malwaretracker.com/pdfexaminer.php. As the name suggests, it specializes in the analysis of PDF documents. Uploaded files are dissected, objects are decoded and decrypted, and the result is saved in the form of a report. Unlike the other reviewed online tools, PDF Examiner allows manual analysis of the elements inside the file. The user interface consists of two main panels. The one on the left side displays a list of all objects in the PDF, the one on the right shows the contents of the currently selected item from this list. It is also possible to download the deobfuscated content of objects as a file. The document information section contains basic statistics and a list of detected exploits or suspicious uses of obfuscation techniques. Vulnerabilities were correctly identified by their CVE numbers for all tested samples. A beta OpenIOC link is generated for malicious PDF files. Figure 3.10 shows a sample report for a malicious PDF file. The tool is capable of dealing with name and string object obfuscation, filter chains (supported options are FlateDecode, ASCIIHexDecode, LZWDecode, ASCII85Decode, RunLengthDecode, CCITTFaxDecode) and encryption (40+128 bit RC4, 128 bit AESV2 and 256 bit AESV3). All tested samples were decrypted correctly. It can also automatically detect and remove common JavaScript obfuscation techniques.

Figure 3.10: PDF Examiner – sample report

3.2.8 PDF X-Ray

PDF X-Ray was originally a proprietary online tool developed by Brandon Dixon, hosted at http://pdfxray.com/. Even though the service hasn't been online for several months, it should still be included in this survey because the author decided to release the source code to the public. The tool is built on customized peepdf and PDF Tools functionality, which it uses to parse documents and convert them into JSON objects. The objects are then stored in MongoDB, a document-oriented NoSQL database. The idea behind this approach is that the tool can base its detection not only on object size and names, but also on entropy and object similarity by comparing them to other malicious samples.

Chapter 4

Application design

This chapter establishes a definition of a framework and describes the process of designing our web application.

4.1 Framework definition

The word framework is used often in computer programming. At the same time, there is no standardized or universally agreed definition for it. Before starting the design process, it is important to establish a general notion of what a framework is and what we expect from it. In general, its purpose is to tie together otherwise discrete components in order to create more efficient or useful software or to make complex technologies easier to handle. As a side effect, it often introduces a level of methodology that innately forces the developer to adhere to a concrete design approach. To make things clearer, we can use the definition formulated by E. Gamma and other authors in the book Design Patterns [40]:

Framework is a skeleton of an application into which developers plug in their code and provides most of the common functionality.

What this means for the application developed as a part of this work, is that it should provide unified facilities for analysis of various formats of malicious documents. It should be extendable and customizable so that support for new document types or new analysis methods can be added with ease. Therefore, the first step towards designing such framework is to examine how we go about analyzing suspicious documents and establish a proper workflow for this process.

4.2 Document analysis

Previous chapters described all the fundamentals necessary to understand how vulnerabilities in PDF files come about and how malware authors exploit them. Before going into any detailed design, it should be clear how the process of analyzing a PDF document works. A lot of work in this area was originally done by Didier Stevens. The tools he developed, along with his e-book on malicious document analysis, helped establish a solid foundation for general understanding of this process. The idea is to first triage documents by looking for typical signs of harmful content. This includes a combination of automatic actions and JavaScript, Flash, embedded files, vulnerable encodings or obfuscation techniques. In practice, this is a matter of performing a simple string search. This step could be performed manually, but since we need to account for name object obfuscation and potentially deal with large documents, it is faster and more convenient to use an automated tool. It should be noted that even if this search finds a match, it does not necessarily mean the document is malicious. All the features of the PDF language have their legitimate uses, so a positive match merely indicates that there is something worth investigating. The second step is then to perform a follow-up analysis based on the initial scan. The goal at this stage is to go through all the suspicious elements and determine whether their use is valid or not. For instance, a large block of obfuscated JavaScript inside an encrypted object or compressed stream is a telltale sign of malicious intent. Use of some sort of tool is necessary at this point to deal with PDF specific obfuscation. While we need to know when a given object is encoded or encrypted, viewing the data in its encoded or encrypted form serves little purpose. Moreover, decompression, decryption and decoding of objects is a repetitive task. Therefore, the process should be automated and seamless. The final stage is then to extract any discovered JavaScript, Flash, shellcode or embedded files and analyze them. This task is no longer specific to document formats in general and we use tools appropriate for the job (deobfuscators and beautifiers for JavaScript, debuggers or emulators for shellcode).

Figure 4.1: Document analysis workflow

While the scope of this work is limited to the PDF format, it should be possible to integrate support for analysis of other types of malicious documents found in the wild. In practice, the other option often used by malware authors is the family of Microsoft Office formats, namely Word, Excel and PowerPoint. While these documents have vastly different internal structures compared to PDF files, the analysis process is very similar. Simple scripts can be used to scan the document for the presence of generic shellcode patterns or Visual Basic macros. Further tools are then used to navigate through the inner contents of the file and extract malicious elements for follow-up inspection. This translates into a workflow that is identical to the one we already established for PDF files. Therefore, no additional changes are necessary.
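Since the workflow is the same for every document type, the triage step can be thought of as a lookup of a type specific scanner. The sketch below is purely illustrative and does not show the application's actual code; the registry, the scanner functions and the placeholder checks are all hypothetical.

```python
from typing import Callable, Dict, List

# Maps a file type to its quick-scan function. Both the registry and the
# scanner names are hypothetical, introduced only for this sketch.
QUICK_SCANNERS: Dict[str, Callable[[bytes], List[str]]] = {}

def register(file_type: str):
    """Decorator that registers a quick-scan function for one file type."""
    def decorator(func):
        QUICK_SCANNERS[file_type] = func
        return func
    return decorator

@register("pdf")
def scan_pdf(data: bytes) -> List[str]:
    names = ("/JavaScript", "/OpenAction", "/Launch")
    return [name for name in names if name.encode() in data]

@register("doc")
def scan_office(data: bytes) -> List[str]:
    # Placeholder check: a real scanner would parse the container format.
    return ["VBA macro marker"] if b"VBA" in data else []

def triage(file_type: str, data: bytes) -> List[str]:
    """Run the scanner registered for the file type and return its findings."""
    scanner = QUICK_SCANNERS.get(file_type)
    if scanner is None:
        raise ValueError(f"no scanner registered for {file_type!r}")
    return scanner(data)

print(triage("pdf", b"<< /OpenAction 1 0 R /JavaScript >>"))
```

Under this arrangement, supporting a new document type means registering one more function, while the code that drives the workflow stays unchanged.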

4.3 Design and prototypes

The first step towards designing an efficient application is to determine what the user requirements are. The answer in our case is simple: the user is trying to determine whether a given document is malicious or not. This process is already described in detail in the previous section, so the goal is to make sure it is translated well into the design. Moreover, it is important to ascertain that our design doesn't conflict with the notion of an extensible framework. We formulate the following requirements:

1. The result is a web application that allows the user to perform malicious document analysis.

2. Allow users to upload their documents and include a comment.

3. Preliminary scan functionality for PDF documents.

4. Enable the user to browse uploaded files.

5. Enable the user to view the contents of uploaded files.

6. In-depth analysis functionality for PDF documents that includes:

• view of object structure.

• removal of PDF obfuscation.

• view of contents of decoded objects.

• user alert functionality for suspicious elements.

• results displayed in form of report.

• ability to access the analysis report later, or share it with others.

7. Comment system that allows users to attach comments to analysis reports.

8. Implementing support for other document types must be simple and possible without making additional changes to the application.

To go with the requirement list, we also prepare a short list of use cases:

1. User visits the index (upload) page, chooses a file he wishes to upload and attaches a comment if he so desires. User is taken to the file browser category with his uploaded file actively selected. User reviews quick scan results and file statistics and proceeds to view the report page by selecting the 'analyze' option. In the report page, user views file statistics, suspicious objects and inspects contents of relevant objects based on this information. User attaches a comment to the report.

2. User visits the index page and selects 'files' option to load a list of uploaded documents. Selecting 'view' option opens new browser tab that shows the document's content and provides navigation tools.

3. User visits the index page and selects 'files' option to load a list of uploaded documents. User views file statistics and quick scan results by selecting them in the list. Selecting 'analyze' option takes the user to the report page of selected file. User views contents of the report, attached comments and leaves his own comment.

4. User visits the report page directly via a direct link to a report. He may view the contents of the report, attached comments and leave his own comment.

To minimize the number of clicks required to go through the analysis process, we can perform the quick scan automatically right after the file is uploaded. The results can then be saved into the database along with other metadata. A potential issue is the performance of the server-side scripts for document analysis. Having the scripts generate HTML markup, saving it in a separate file and then displaying it on the report page ensures that we have to analyze each file only once. Subsequent user views of the report page are then a simple matter of displaying the report file. This also solves the issue of extensibility, as each file type can use its own appropriate report layout. Figure 4.2 shows the resulting application flow for the primary use case scenario. Additionally, the user should be able to navigate to the upload, file browser or report page at will. Accessing the report page directly without specifying a document hash in the URL should present the user with a search box.
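The "analyze once, then serve the stored report" idea can be sketched in a few lines. This is an illustration under our own assumptions rather than the application's code: the function name, the report directory layout and the stand-in analysis function are hypothetical, and the report file is keyed by the MD5 hash of the document.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable

def get_report(document: bytes, report_dir: Path,
               analyze: Callable[[bytes], str]) -> str:
    """Return the HTML report for a document, running the analysis at most once."""
    digest = hashlib.md5(document).hexdigest()  # used here only as an identifier
    report_path = report_dir / f"{digest}.html"
    if not report_path.exists():
        # First request: run the expensive analysis and store its markup.
        report_path.write_text(analyze(document), encoding="utf-8")
    return report_path.read_text(encoding="utf-8")

calls = []

def fake_analyze(data: bytes) -> str:
    """Stand-in for the real analysis script; records how often it runs."""
    calls.append(1)
    return f"<p>{len(data)} bytes analyzed</p>"

with tempfile.TemporaryDirectory() as tmp:
    first = get_report(b"%PDF-1.4 sample", Path(tmp), fake_analyze)
    second = get_report(b"%PDF-1.4 sample", Path(tmp), fake_analyze)

print(first == second, len(calls))
```

Because the stored report is plain markup, each file type is free to produce its own layout without the report page having to know about it.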

Figure 4.2: Application flow

Based on use cases and requirements, we can create a series of mockups of the primary pages that the user sees when navigating through the website. Mockups and wireframes are prototype models that are used to plan the layout of the user interface and relay the relationships between pages.

The main page is the default entry point for the application. At the same time it serves as the upload page. It contains a navigation menu, an upload widget and a button that displays a text area where the user can attach a comment to the file. The two states of the upload page are shown in figure 4.3. On the left is the default state, on the right is the comment widget.

Figure 4.3: Upload page

The file browser page provides a listing of all uploaded documents. The actively selected item displays details about the file in one column, a list of suspicious elements in a second column and two buttons that navigate to the report page or open the file in a viewer.

Figure 4.4: File browser page

Report page contains a direct link to the report that can be copied, content of the report itself and a comment section at the bottom of the page. For PDF files the report contains general information about the file, object structure and an interactive list of all objects. Selecting an object displays additional information about its content.

Figure 4.5: Report page

4.3.1 Database relations

The data is organized in two tables. Table uploads stores all the file metadata (name, type, size, md5 hash, time of upload and the file path), the uploader's comment and the quick scan results. Table comments stores the comment information (user's name, optional URL entry, user's email, comment text, timestamp). Since each comment has to be associated with a file, we add a foreign key that points to the primary key in the uploads table. The database structure is shown in figure 4.6. The files and reports are stored in the file system, not the database.
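The two tables and their relation can be sketched as follows. The sketch uses Python's built-in sqlite3 module as a stand-in for the MySQL database, and all column names and types are assumptions derived from the description above rather than the actual schema.

```python
import sqlite3

# Hypothetical column names; only the listed fields come from the text.
SCHEMA = """
CREATE TABLE uploads (
    id               INTEGER PRIMARY KEY,
    name             TEXT NOT NULL,
    type             TEXT NOT NULL,
    size             INTEGER NOT NULL,
    md5              TEXT NOT NULL,
    uploaded_at      TEXT NOT NULL,
    path             TEXT NOT NULL,
    uploader_comment TEXT,
    quick_scan       TEXT
);
CREATE TABLE comments (
    id         INTEGER PRIMARY KEY,
    upload_id  INTEGER NOT NULL REFERENCES uploads(id),
    user_name  TEXT NOT NULL,
    url        TEXT,
    email      TEXT,
    body       TEXT NOT NULL,
    created_at TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript(SCHEMA)

upload_id = conn.execute(
    "INSERT INTO uploads (name, type, size, md5, uploaded_at, path) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ("sample.pdf", "pdf", 1024, "0" * 32, "2013-01-01 12:00:00", "/srv/uploads/sample.pdf"),
).lastrowid
conn.execute(
    "INSERT INTO comments (upload_id, user_name, body, created_at) VALUES (?, ?, ?, ?)",
    (upload_id, "analyst", "Contains obfuscated JavaScript.", "2013-01-01 12:05:00"),
)
print(conn.execute("SELECT COUNT(*) FROM comments WHERE upload_id = ?", (upload_id,)).fetchone()[0])
```

The foreign key guarantees that a comment can never refer to a file that was not uploaded.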

Figure 4.6: Database tables and relations

4.3.2 Security concerns

Because the application processes, stores and displays user input, several security issues need to be addressed. Storing user input in a database opens a gateway for SQL injection attacks. All data must be validated for type, syntax and length. Any user data displayed in the browser must be safe to view, since displaying unsanitized user input may lead to cross site scripting attacks. Uploaded files also need to be thoroughly validated. Otherwise, an attacker could upload server-side scripts and have them executed by the application. Because we work with malicious files by design, they should not be stored in a web accessible location.
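The three countermeasures can be illustrated with a short sketch. It is written in Python with sqlite3 standing in for the PHP and MySQL stack and does not reproduce the application's code. The length limit, the table layout, the function names and the assumption that a PDF header must appear within the first 1024 bytes are all choices made for this example.

```python
import html
import re
import sqlite3

MAX_COMMENT_LENGTH = 2000  # hypothetical limit chosen for this sketch

def is_pdf(data: bytes) -> bool:
    """Accept only data with a PDF header near the start (assumed 1024-byte window)."""
    return re.search(rb"%PDF-\d\.\d", data[:1024]) is not None

def validate_comment(text: str) -> str:
    """Check the length of a comment before it is stored."""
    text = text.strip()
    if not text or len(text) > MAX_COMMENT_LENGTH:
        raise ValueError("comment is empty or too long")
    return text

def store_comment(conn: sqlite3.Connection, upload_id: int, text: str) -> None:
    # Placeholders keep user input out of the SQL statement itself.
    conn.execute(
        "INSERT INTO comments (upload_id, body) VALUES (?, ?)",
        (upload_id, validate_comment(text)),
    )

def render_comment(text: str) -> str:
    # Escaping turns markup characters into entities before output.
    return "<p>" + html.escape(text) + "</p>"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (upload_id INTEGER, body TEXT)")
hostile = "x'); DROP TABLE comments; --"
store_comment(conn, 1, hostile)
stored = conn.execute("SELECT body FROM comments").fetchone()[0]
print(stored == hostile)
print(render_comment("<script>alert(1)</script>"))
```

The hostile comment is stored verbatim as data and the script tag is rendered as harmless text, which is the behavior both checks aim for.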

Chapter 5

Implementation

5.1 Used technologies

The standard web technologies used in development include HTML, CSS and a JavaScript framework for front-end development and a combination of PHP and MySQL for all the server-side data handling and storage. The full list of frameworks, plug-ins and scripts used is as follows:

• jQuery: is a free, open source JavaScript framework actively supported in multiple browsers (Firefox, Chrome, Internet Explorer, Safari, Opera). It is licensed under the MIT License. The library offers enhanced AJAX support and Document Object Model (DOM) element selection and manipulation. It also makes handling events and creating effects and animations on a website much easier. It is extensible via user made plug-ins and due to jQuery's popularity among web developers, there are thousands of them available on the Internet.

• Uploadify: is a jQuery plug-in that uses a combination of JavaScript, Flash and server-side scripts to allow the user to upload single or multiple files to the server. It is open source software developed by Reactive Apps and published under the MIT License. It is not to be confused with UploadiFive, which is a commercial HTML5-based upload plug-in created by the same author.

• PDFiD: is a PDF analysis tool written in Python that was already reviewed and described in detail in chapter three. It is simple and fast and therefore our tool of choice for preliminary analysis of PDF files.

• PDF X-Ray Lite: is a Python command line tool developed by security researcher Brandon Dixon. The tool was released as a part of REMnux version 3, a trimmed-down version of the Ubuntu distribution designed to help analysts reverse-engineer malicious software. As the name suggests, it is a lightweight version of PDF X-Ray, which we reviewed in section 3.2.8. The main difference is that the back-end database support, API and reporting functionality have been removed. The tool merely converts PDF documents into JSON format, attempts to remove any PDF specific obfuscation and saves selected output into a file (a plain HTML document by default). This fits our design exactly, and the script can easily be extended with custom code if we need to change or extend its behavior.

• SWFTools: is a bundle of several utilities that allow the user to read, modify, combine or convert various other formats into Flash files.

5.2 Document root organization

This section describes how the files are organized within the web root directory and what their purpose is. The PHP files are all stored directly in the document root. Style sheets, images and JavaScript libraries each have their own sub-directories. Additionally, the PDFiD and PDF X-Ray scripts are in the document folder and any required modules and libraries are stored in a separate sub-directory. This arrangement is mostly for development purposes and may work in low-traffic situations. However, if the application were to be deployed live, the analysis scripts should be hosted on a separate server and executed via remote procedure calls (RPC). The following subsections describe how each of the pages designed in the previous chapter is implemented:

Figure 5.1: Upload page layout

5.2.1 Upload page

The upload page also serves as the entry point for the application. The navigation bar sliding functionality is implemented in the menu_slider.js JavaScript file. The JavaScript component of the Uploadify plug-in is used to limit the file size, set allowed file types, prepare data to be sent to the server-side script (user comment and security token) and tell the script where to look for various required files (server-side PHP scripts, Flash uploader component and images). The server-side component (upload-server-side.php) is then responsible for saving the file into the designated upload directory, running a quick scan appropriate to the file type (PDFiD in the case of PDF files), running a PDF scoring script and saving the results along with all the metadata into the database. It also converts a copy of the file into Flash and saves it in a web-accessible folder. The layout of the page is shown in figure 5.1.

5.2.1.1 Flash conversion script

To safely display the contents of uploaded documents back to the user, we convert them into Flash files. The advantage of this approach is that we can keep the PDF files saved safely in a location that is not web accessible and serve only the Flash copies to the browser. The conversion is handled by the pdf2swf script from the SWFTools bundle, using the following command:

pdf2swf -T 8 -B rfxview.swf [path to pdf] -o [path to output flash file]

Parameter -T lets us specify the version of Flash to be used, and parameter -B rfxview.swf tells the script to combine the converted file with a viewer. This allows the result to be browsed in Flash Player and adds a navigation panel.
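The application itself calls pdf2swf from the server-side PHP script. Purely as an illustration, the following Python sketch shows one way such a call could be assembled and executed without passing the file names through a shell. The function names, the default values and the timeout are assumptions made for this example; only the -T, -B and -o parameters come from the command described above.

```python
import subprocess
from pathlib import Path


def build_pdf2swf_command(pdf_path, swf_path, flash_version=8, viewer="rfxview.swf"):
    """Return the pdf2swf argument list for the conversion described in the text.

    The function name and the defaults are hypothetical; only the -T, -B and -o
    parameters are taken from the command shown above.
    """
    return [
        "pdf2swf",
        "-T", str(flash_version),  # Flash version of the output file
        "-B", viewer,              # viewer to combine with the converted document
        str(pdf_path),
        "-o", str(swf_path),
    ]


def convert_to_flash(pdf_path, swf_path, timeout=60):
    """Run pdf2swf and report success. Requires SWFTools to be installed."""
    command = build_pdf2swf_command(pdf_path, swf_path)
    # An argument list (no shell) prevents file names from being interpreted
    # as shell syntax.
    result = subprocess.run(command, capture_output=True, timeout=timeout)
    return result.returncode == 0 and Path(swf_path).exists()
```

Building the command as a list rather than a single string is a deliberate choice here, because uploaded file names are user-controlled input.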

5.2.1.2 Supplementing score system

To help determine whether an uploaded file may be harmful, the uploads are scored using a Python script. The original concept was put forth by Brandon Dixon in 2010 [41] and took advantage of the fact that malicious PDF files are usually small (less than 2 megabytes), consist of a single page, close all streams and objects and contain JavaScript. However, many of the vulnerabilities that appeared in recent years took advantage of embedded files and Flash content without using JavaScript. As such, RichMedia needs to be taken into account as well. After testing with fifteen malicious samples and twenty various clean samples, the following scoring system produced the most reliable results:

• add 1 point to the primary score for JavaScript or RichMedia

• add 0.25 to the primary score each for: size under 1.8 megabytes, fewer than 2 pages, number of obj equal to number of endobj, number of stream equal to number of endstream

• add 0.5 to the secondary score each for: embedded files, JBIG2Decode, Launch, OpenAction, AA, Colors

Based on the totals, the file is judged suspicious when the primary score is 1.5 or higher, or when the primary score is 1 and the secondary score is 0.5 or higher. The file is judged to be high risk when the primary score is 2 or higher and the secondary score is 0.5 or higher. Otherwise the file is marked as clean. Using this system, there were two false positives, both caused by small PDF forms that used JavaScript for form validation. However, since small files containing JavaScript should always warrant follow-up inspection, these false positives are acceptable. Out of fifteen tested malicious samples, eight were marked as suspicious and seven as high risk; none were marked as clean, so there were no false negatives.
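The scoring rules can be summarised in a short Python sketch. It is not the script used by the application: the function names, the dictionary of keyword counts (in the style of PDFiD output) and the interpretation of a megabyte as 1024 × 1024 bytes are assumptions made for illustration. The sketch also assumes that JavaScript and RichMedia together contribute a single point, and it checks the high risk condition first because every high risk file also satisfies the suspicious threshold.

```python
# Keywords that each add 0.5 to the secondary score (names assumed to follow
# PDFiD-style counters).
SECONDARY_INDICATORS = ("EmbeddedFile", "JBIG2Decode", "Launch", "OpenAction", "AA", "Colors")


def score_pdf(size_bytes, counts):
    """Return (primary, secondary) for a file.

    counts is a hypothetical mapping from keyword name to number of occurrences.
    """
    def has(name):
        return counts.get(name, 0) > 0

    primary = 0.0
    if has("JavaScript") or has("JS") or has("RichMedia"):
        primary += 1.0
    if size_bytes < 1.8 * 1024 * 1024:  # assumption: binary megabytes
        primary += 0.25
    if counts.get("Page", 0) < 2:
        primary += 0.25
    if counts.get("obj", 0) == counts.get("endobj", 0):
        primary += 0.25
    if counts.get("stream", 0) == counts.get("endstream", 0):
        primary += 0.25

    secondary = 0.5 * sum(1 for name in SECONDARY_INDICATORS if has(name))
    return primary, secondary


def classify(primary, secondary):
    """Apply the thresholds from the text, most severe verdict first."""
    if primary >= 2 and secondary >= 0.5:
        return "high risk"
    if primary >= 1.5 or (primary == 1 and secondary >= 0.5):
        return "suspicious"
    return "clean"
```

For example, a small single-page form that uses JavaScript but has no secondary indicators reaches a primary score of 2 and a secondary score of 0, which the thresholds classify as suspicious. This is the false positive case discussed above.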

5.2.1.3 Uploader component security

The uploader component is a potential source of several security holes that need to be addressed. First, the security token, consisting of an MD5 hash value of the current timestamp concatenated with a unique salt, needs to be validated by the server-side script. This is to make sure the user is accessing the upload functionality as intended. Additionally, it is not enough to validate the file type on the client side, as such checks can be easily bypassed. The file type needs to be validated by the PHP script before saving the file. The upload directory needs to have permissions set up properly and ideally should be placed outside the document root. If this is not possible, extra measures need to be taken to make sure the contents cannot be publicly accessed. The user's comment needs to be properly validated and processed so that it can be safely placed in a MySQL query and viewed in a browser.
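The two server-side checks described above, token validation and file type validation, can be illustrated with a minimal Python sketch. The application performs these checks in PHP; the function names, the maximum token age and the decision to look for the PDF header within the first 1024 bytes (rather than only at offset zero) are assumptions made for this example.

```python
import hashlib
import hmac

PDF_MAGIC = b"%PDF-"


def make_token(timestamp, salt):
    """MD5 hex digest of the timestamp concatenated with the salt."""
    return hashlib.md5((str(timestamp) + salt).encode("utf-8")).hexdigest()


def token_is_valid(token, timestamp, salt, now, max_age_seconds=3600):
    """Recompute the token on the server and compare it in constant time.

    The age limit is an assumption added for illustration; it is not part of
    the scheme described in the text.
    """
    if not (0 <= now - timestamp <= max_age_seconds):
        return False
    expected = make_token(timestamp, salt)
    return hmac.compare_digest(token.encode("utf-8"), expected.encode("utf-8"))


def looks_like_pdf(first_bytes):
    """Check the file content, not the file name supplied by the client."""
    return PDF_MAGIC in first_bytes[:1024]
```

A content-based check such as looks_like_pdf complements, but does not replace, storing uploads outside the document root.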

5.2.2 File browser page

This page displays a list of all uploaded files. The list is rendered with the jQuery UI Accordion widget, where clicking on an item expands it and displays additional information. The markup is generated by instances of the PHP class Rows (rows.class.php). Using a class to represent a table row ensures that we can easily implement AJAX functionality to modify the rows from the client side, should we need it in the future. The layout of the file browser page is shown in figure 5.2.

Figure 5.2: File browser page layout

5.2.3 Report page

Figure 5.3: Report page layout

The page displays a report based on the MD5 hash specified in the URL. The PHP script (report.php) checks whether a report for a file with the specified MD5 hash already exists and displays it if it does. If it does not and a matching PDF file is found in the database, it runs the PDF X-Ray Lite script with the file as a parameter. Once the report is generated, it is displayed to the user. Figure 5.3 shows the report page layout. When the page is visited directly without any MD5 hash specified, the user is presented with a search box instead. The report itself is divided into five categories:

• General information: displays the hash values of the file (MD5, SHA-1 and SHA-256), the PDF header and the exact file size.

• Object order: shows a list of all objects in the file, their offsets and exact size.

• Versions: displays version information extracted from the document. Each version entry has its author, creator, producer and creation or modification date.

• Suspicious objects: displays an accordion list (jQuery widget) of objects marked as suspicious due to their size or presence of JavaScript, Flash, automated actions or vulnerable encodings. Selecting an object displays its header data (with name obfuscation removed), hash, whether it was encrypted or not, list of references and in case of object streams, their decoded contents.

• All objects: displays an accordion list of all objects found in the file. Each item can be expanded to display its header data, encryption information, reference list and decoded object stream contents (if applicable).
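As an illustration of the General information category, the following Python sketch derives the three hash values, the PDF header and the file size from the raw bytes of a file. It is not taken from PDF X-Ray Lite; the function name, the returned field names and the header pattern are assumptions made for this example.

```python
import hashlib
import re

# Matches headers such as %PDF-1.4; searched only near the start of the file.
HEADER_PATTERN = re.compile(rb"%PDF-\d\.\d")


def general_information(data):
    """Summarise raw file bytes: size, three digests and the PDF header, if any."""
    header = HEADER_PATTERN.search(data[:1024])
    return {
        "size": len(data),
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "header": header.group(0).decode("ascii") if header else None,
    }
```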

The comment system is based on a template written by Martin Angelov [42]. When a user submits a new comment, the form (implemented in comment-form.php) is sent via AJAX and received by the back-end script (submit.php). The comment markup is generated by the Comment class (comment.class.php), where each instance represents a comment fetched from the database. The Gravatar service is used to display an avatar associated with the user's email address. Thorough validation of user input is necessary because it is stored in the database and then displayed in the browser. The validation is performed by the static method validate() implemented in the Comment class file. It takes the input from each form field, passes it through an appropriate filter and encodes all special characters so that the input is safe to store and view.
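The idea behind this kind of validation can be sketched in Python as follows. The application implements it in PHP inside the Comment class; the function name, the simplified email pattern, the length limit and the error messages below are assumptions made for illustration only.

```python
import html
import re

# Deliberately simple pattern: something@something.something without spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
MAX_BODY_LENGTH = 2000  # hypothetical limit, not taken from the application


def validate_comment(name, email, body):
    """Return (cleaned, errors); cleaned values have HTML special characters encoded."""
    name, email, body = name.strip(), email.strip(), body.strip()
    errors = {}
    if not name:
        errors["name"] = "Please enter a name."
    if not EMAIL_PATTERN.match(email):
        errors["email"] = "Please enter a valid email address."
    if not body or len(body) > MAX_BODY_LENGTH:
        errors["body"] = "Please enter a comment of reasonable length."

    cleaned = {
        "name": html.escape(name, quote=True),
        "email": html.escape(email.lower(), quote=True),
        "body": html.escape(body, quote=True),
    }
    return cleaned, errors
```

Encoding special characters protects the browser view; when the values are written to the database, they should additionally be passed as query parameters rather than concatenated into the SQL string.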

5.3 Implementation overview

This section describes how the application functions as a framework and how additional scripts or formats could be integrated. The overall structure of the application follows the logic outlined in section 4.2 and as such is divided into upload, browser and report components. The server-side upload component (upload-server-side.php) not only saves the file into the upload folder and saves the metadata into the database, but also executes any scripts whose output we intend to display in the browser component. The code where additional scripts may be added is the following:

switch ($fileParts['extension']) {
    case "pdf":
        // 1 represents the PDF file type
        $filetype = 1;
        // [PDF-related script execution code: quick scan, score and conversion here]
        break;
}

Currently, scripts that produce the quick scan information, compute the score and handle the PDF to Flash conversion are included. Output meant to be displayed in the browser needs to be saved in the database or, in the case of conversion, saved into the appropriate folder. Adding scripts for a different file type is a simple matter of adding another case to the switch.

The browser component consists of an Accordion widget that is populated from the database by a PHP script (rows.class.php). This script generates HTML markup for each database row (i.e. each uploaded file) and the result is displayed inside the Accordion widget sections. The code where the output is customized is the following:

switch ($this->data['filetype']) {
    // PDF file
    case "1":
        // [code generating HTML markup for PDF files here]
        break;
}

The overview information for other document types will presumably be very different from the information used for PDF files. Adding a new case to the switch and writing custom HTML markup will be necessary.

The report component (report.php) provides all the markup for the report page, except for the report itself. When the report script is called, it first checks whether a report exists for the MD5 hash passed as a GET parameter. If it does, the script displays it on the page. If it does not, the script runs the PDF X-Ray script and displays the report once it is generated. The script output includes all HTML markup that is necessary to display the report in Accordion format. The makereport function that constructs the report out of JSON objects is located in malobjectclass.py, which can be found in the lib folder. The code for inclusion of other scripts is similar to that of the other two components:

switch ($this->data['filetype']) {
    // PDF file
    case "1":
        // [code executing analysis script and displaying the results]
        break;
}
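The extension mechanism shared by the three components, selecting type-specific code based on the file type, can also be expressed as a registry of handlers. The Python sketch below is an alternative formulation for illustration, not the PHP code used by the application; all names in it are hypothetical, and the PDF handler only returns a placeholder instead of running the actual scripts.

```python
import os

# Maps a lower-case file extension to (numeric file type, handler function).
HANDLERS = {}


def register(extension, filetype):
    """Decorator that registers a handler for one file extension."""
    def decorator(func):
        HANDLERS[extension.lower()] = (filetype, func)
        return func
    return decorator


@register("pdf", filetype=1)  # 1 represents the PDF file type, as in the text
def handle_pdf(path):
    # Placeholder for the quick scan, scoring and Flash conversion steps.
    return {"steps": ["quickscan", "score", "conversion"]}


def process_upload(path):
    """Look up the handler for the file's extension and run it."""
    extension = os.path.splitext(path)[1].lstrip(".").lower()
    if extension not in HANDLERS:
        raise ValueError("unsupported file type: %r" % extension)
    filetype, handler = HANDLERS[extension]
    return filetype, handler(path)
```

With this formulation, supporting a new document type means registering one more handler, which corresponds to adding another case to each switch statement.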

5.4 Deployment

The first thing we need to deploy the application is a web server that can serve PHP pages. Apache version 2.2.22 and PHP 5.3.13 were used during development. The PHP version needs to be 5.2 or greater, because the application uses filter functions introduced in PHP 5.2.0. As mentioned previously, the application connects to a MySQL database server and the connection is set up in the file connect.php. The default values use a local root account to connect to the database, which is only acceptable for development purposes. If the application were to be deployed live, a new user account with minimal privileges should be set up. The two tables used to store application data can be created with the following SQL statements:

CREATE TABLE uploads (
    id INT(10) NOT NULL AUTO_INCREMENT PRIMARY KEY,
    filename VARCHAR(200) NOT NULL,
    filetype INT(2) NOT NULL,
    filesize VARCHAR(20) NOT NULL,
    filemd5 VARCHAR(32) NOT NULL,
    filedate INT(11) NOT NULL,
    filepath VARCHAR(200) NOT NULL,
    filescan VARCHAR(4000) NOT NULL,
    filescore VARCHAR(200) NOT NULL,
    comment VARCHAR(150) NOT NULL default ''
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

CREATE TABLE comments (
    id int(10) NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(128) NOT NULL default '',
    url VARCHAR(255) NOT NULL default '',
    email VARCHAR(255) NOT NULL default '',
    body TEXT NOT NULL,
    dt TIMESTAMP NOT NULL default CURRENT_TIMESTAMP,
    file_id_fk int(10) NOT NULL,
    FOREIGN KEY(file_id_fk) REFERENCES uploads(id) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

Python needs to be installed in order to run the PDFiD and PDF X-Ray scripts. The latest production version of Python 2 should be used, because some of the code used by PDF X-Ray is not compatible with Python 3. Version 2.7.3 was used during development. Additionally, PDF X-Ray requires the simplejson module to be installed. The module is packaged and distributed in the standard way, so installation is a simple matter of executing the setup script from a terminal.

Conclusions and future work

Analyzing PDF documents isn't a simple task. The format itself is very diverse and, to make matters worse, Acrobat Reader is very flexible when it comes to parsing the file. As a result, malware developers are provided with a lot of options when it comes to designing a new threat. PDF then represents a significant obfuscation layer that can be difficult to remove. Each AV vendor uses its own proprietary PDF parsing technology and overall detection rates have improved substantially over the last two years. However, as Jose Miguel Esparza showed at the CARO 2011 Workshop, more sophisticated use of obfuscation techniques that is tailored to evade commonly used indicators can substantially decrease the chance of detection and even render some of the specialized tools ineffective.

There are some online services for PDF analysis that allow users to share links to generated reports, but none that directly promote cooperation between analysts and could serve as a unified platform for malicious document analysis. Such a tool would be very useful when dealing with newly emerging threats. The framework implemented as a part of this thesis, while functional as far as static analysis of PDF files is concerned, mostly represents a template of how I think analysis of malicious files can be done collectively.

There are several areas that can be improved in the future, regarding both the PDF analysis functionality and the framework itself. On the framework side, a user login system could allow users to manage their uploaded samples or track specific reports. A voting system could supplement the comment system to mark samples or any of their parts as benign or malicious. Another useful feature to consider is integration of a cyber threat information sharing method, such as Open Indicators of Compromise (OpenIOC) [43] or Structured Threat Information eXpression (STIX) [44].
Furthermore, the tool currently doesn't incorporate any JavaScript or shellcode analysis functionality and external tools need to be used to examine them. Therefore, implementing a JavaScript deobfuscator or sandbox would further simplify the analysis process. Lastly, a potential area of interesting future research could be dynamic analysis of PDF files (not just extracted shellcode).

References

[1] Adobe Systems Inc.: Document Management – Portable Document Format – Part 1: PDF 1.7, First Edition [online] 2009, [ref. 25. 4. 2012]. Available at: http://www.adobe.com/devnet/pdf/pdf_reference.html

[2] Adobe Systems Inc.: Adobe Reader XI [online] 2012, [ref. 25. 4. 2012]. Available at: http://www.adobe.com/products/reader.html

[3] Esparza, J.M.: Obfuscation and [non-] detection of malicious PDF files [online] 2011, [ref. 15. 12. 2012]. Available at: http://eternal-todo.com/blog/obfuscation-non-detection-malicious-pdf-files

[4] Wikipedia: Free software [online] 2012, [ref. 15. 12. 2012]. Available at: http://en.wikipedia.org/wiki/Free_software

[5] Prepressure: The history of PDF [online] 2012, [ref. 25. 4. 2012]. Available at: http://www.prepressure.com/pdf/basics/history

[6] Selvaraj, K., Gutierrez N.F.: The Rise of PDF Malware [online] 2010, [ref. 20. 4. 2012]. Available at: http://www.symantec.com/connect/blogs/rise-pdf-malware

[7] Malware tracker: pdf current threats [online] 2012, [ref. 20. 4. 2012]. Available at: http://www.malwaretracker.com/pdfthreat.php

[8] Itabashi, K.: Portable Document Format Malware [online] 2011, [ref. 20. 4. 2012]. Available at: http://www.symantec.com/connect/blogs/portable-document-format-malware

[9] Stevens, D.: Malicious PDF Analysis E-book [online] 2010, [ref. 20. 4. 2012]. Available at: http://blog.didierstevens.com/2010/09/26/free-malicious-pdf-analysis-e-book/

[10] Esparza, J.M.: peepdf [online] 2012, [ref. 20. 4. 2012]. Available at: http://eternal-todo.com/tools/peepdf-pdf-analysis-tool

[11] Zimmer D.: PDF Stream Dumper [online] 2010, [ref. 20. 4. 2012]. Available at: http://sandsprite.com/blogs/index.php?uid=7&pid=57

[12] Sogeti ESEC Labs: Origami [online] 2010, [ref. 15. 4. 2012]. Available at: http://esec-lab.sogeti.com/pages/Origami

[13] Hartstein, B.: jsunpack [online] 2012, [ref. 20. 4. 2012]. Available at: http://jsunpack.jeek.org/

[14] The Wepawet team: Wepawet [online] 2012, [ref. 20. 4. 2012]. Available at: http://wepawet.iseclab.org/

[15] Malware tracker: pdf examiner [online] 2012, [ref. 20. 4. 2012]. Available at: http://www.malwaretracker.com/pdf.php

[16] Dixon, B.: PDF X-Ray [online] 2012, [ref. 20. 12. 2012]. Available at: https://github.com/9b/pdfxray_public

[17] Portable Document Format Reference Manual Version 1.2 [online] 1996, [ref. 15. 3. 2013]. Available at: http://www.pdf-tools.com/public/downloads/pdf-reference/pdfreference12.pdf

[18] Adobe Systems Inc.: PDF Reference, Second edition [online] 2000, [ref. 15. 3. 2013]. Available at: http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference13.pdf

[19] Adobe Systems Inc.: PDF Reference, Third edition [online] 2001, [ref. 15. 3. 2013]. Available at: http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf

[20] Adobe Systems Inc.: PDF Reference, Fourth edition [online] 2003, [ref. 15. 3. 2013]. Available at: http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference15_v6.pdf

[21] Adobe Systems Inc.: PDF Reference, Fifth edition [online] 2004, [ref. 15. 3. 2013]. Available at: http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference16.pdf

[22] Adobe Systems Inc.: PDF Reference, Sixth edition [online] 2006, [ref. 15. 3. 2013]. Available at: http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/pdf_reference_1-7.pdf

[23] Adobe Systems Inc.: Document management – Portable document format – Part 1: PDF 1.7 [online] 2008, [ref. 15. 3. 2013]. Available at: http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf

[24] Adobe Systems Inc.: Adobe Supplement to the ISO 32000, Base Version 1.7, Extension Level 3 [online] 2008, [ref. 15. 3. 2013]. Available at: http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/adobe_supplement_iso32000.pdf

[25] Adobe Systems Inc.: Adobe Supplement to the ISO 32000, Base Version 1.7, Extension Level 5 [online] 2009, [ref. 15. 3. 2013]. Available at: http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/adobe_supplement_iso32000_1.pdf

[26] Ratanaworabhan, P., Livshits, B., Zorn, B.: Nozzle: A Defense Against Heap-spraying Code Injection Attacks [online] 2009, [ref. 14. 4. 2013]. Available at: http://research.microsoft.com/pubs/81085/usenixsec09b.pdf

[27] Zakas, N.Z.: JavaScript: Programujeme profesionálně. Brno: Computer Press, 2009

[28] Microsoft: COM: Component Model Object Technologies [online] 2012. Available at: https://github.com/9b/pdfxray_public

[29] Adobe Systems Inc.: Adobe Flash Player [online] 2012, [ref. 25. 4. 2012]. Available at: http://www.adobe.com/software/flash/about/

[30] The MITRE Corporation: Common Vulnerabilities and Exposures [online] 2012, [ref. 25. 4. 2012]. Available at: http://cve.mitre.org/

[31] Foxit Corporation: Foxit Reader [online] 2012, [ref. 20. 12. 2012]. Available at: http://www.foxitsoftware.com/Secure_PDF_Reader/

[32] Gal, A., et al: PDF.js [online] 2012, [ref. 17. 4. 2013]. Available at: http://mozilla.github.io/pdf.js/

[33] Barth, A., Reis, Ch., Jackson, C.: The security of the Chromium Browser [online] 2008, [ref. 20. 4. 2013]. Available at: http://seclab.stanford.edu/websec/chromium/chromium-security-architecture.pdf

[34] Kramm, M., et al: SWFTools [online] 2003, [ref. 17. 4. 2013]. Available at: http://www.swftools.org/

[35] The Gimp Development Team: Gimp User Manual [online] 2010, [ref. 17. 4. 2013]. Available at: http://www.gimp.org/docs/

[36] freedesktop.org: Poppler wiki [online] 2013, [ref. 17. 4. 2013]. Available at: http://freedesktop.org/wiki/Software/poppler/

[37] Devaldi Ltd: Flexpaper Documentation [online] 2013, [ref. 17. 4. 2013]. Available at: http://flexpaper.devaldi.com/docs.htm

[38] Crocodoc: Crocodoc API doc [online] 2013, [ref. 17. 4. 2013]. Available at: https://crocodoc.com/docs/

[39] AT&T Labs Research: GraphViz Documentation [online] 2013, [ref. 17. 4. 2013]. Available at: http://graphviz.org/Documentation.php

[40] Gamma, E., Helm, R., Johnson, R., Vlissides, J.: Design Patterns: Elements of Reusable Object-Oriented Software [online] 1994, [ref. 15. 6. 2012]. Available at: http://www.cs.up.ac.za/cs/aboake/sws780/references/patternstoarchitecture/Gamma-DesignPatternsIntro.pdf

[41] Dixon, B.: Scoring PDFs Based on Malicious Filter [online] 2010, [ref. 28. 12. 2012]. Available at: http://blog.9bplus.com/scoring-pdfs-based-on-malicious-filter

[42] Angelov, M.: Simple AJAX Commenting System [online] 2012, [ref. 27. 12. 2012]. Available at: http://tutorialzine.com/2010/06/simple-ajax-commenting-system/

[43] MANDIANT: OpenIOC [online] 2012, [ref. 28. 12. 2012]. Available at: http://www.openioc.org/

[44] The MITRE Corporation: Standardizing Cyber Threat Intelligence Information with the Structured Threat Information eXpression [online] 2012, [ref. 28. 12. 2012]. Available at: http://measurablesecurity.mitre.org/docs/STIX-Whitepaper.pdf