Design and Implementation of a Framework for Viewing and Analysis of Malicious Documents

Masaryk University
Faculty of Informatics

Diploma Thesis
Bc. Richard Nossek
2013

Statement

I hereby declare that I have worked on this thesis independently, using only the sources listed in the bibliography. All resources, sources, and literature that I used or drew upon in preparing it are properly cited in the thesis with a full reference to the source.

Acknowledgement

I would like to thank my advisor, RNDr. Václav Lorenc, for his guidance and advice during my work on this thesis.

Abstract

The goal of this thesis is to provide an in-depth assessment of the use of the PDF (Portable Document Format) file format as an attack vector and of the current state of the field of malicious document analysis. First, we provide a detailed introduction to the internal organization and structure of PDF files and describe how different features can be used for obfuscation purposes. Next, we survey the available options for viewing PDF documents in a web browser environment, as well as tools for PDF document analysis. The practical part consists of designing and implementing a web application that serves as a framework for malicious document analysis.

Keywords

Portable Document Format, PDF, malware, malicious documents, PDF analysis, PDF analysis tools, analysis framework.
Contents

Introduction
1 Portable Document Format
  1.1 Version history
  1.2 PDF architecture
  1.3 PDF file structure
    1.3.1 File header
    1.3.2 File body
      1.3.2.1 Boolean objects
      1.3.2.2 Numeric objects
      1.3.2.3 String objects
      1.3.2.4 Name objects
      1.3.2.5 Array objects
      1.3.2.6 Dictionary objects
      1.3.2.7 Stream objects
      1.3.2.8 Null object
    1.3.3 Cross-reference table
    1.3.4 File Trailer
    1.3.5 Incremental Updates
    1.3.6 Object Streams
2 Vulnerabilities and exploits
  2.1 PDF as an attack vector
  2.2 Propagation and payload execution
    2.2.1 Email
    2.2.2 Drive-by downloads
    2.2.3 COM objects
  2.3 Adobe Reader vulnerabilities
    2.3.1 Most frequently exploited vulnerabilities
  2.4 PDF obfuscation techniques
    2.4.1 Header
    2.4.2 String objects
    2.4.3 Name objects
    2.4.4 Encryption
    2.4.5 Filters
    2.4.6 Code fragmenting
    2.4.7 Object streams
    2.4.8 JavaScript obfuscation
  2.5 PDF JavaScript
    2.5.1 Function getField
    2.5.2 Function this.info
    2.5.3 Function getAnnot
3 PDF online viewers and analysis tools
  3.1 Online PDF viewers
    3.1.1 Browser plugins
    3.1.2 Server-side format conversion
  3.2 Analysis Tools
    3.2.1 PDF Tools
      3.2.1.1 PDFiD
      3.2.1.2 PDF-parser
    3.2.2 peepdf
    3.2.3 PDF Stream Dumper
    3.2.4 Origami
    3.2.5 jsunpack
    3.2.6 Wepawet
    3.2.7 PDF Examiner
    3.2.8 PDF X-Ray
4 Application design
  4.1 Framework definition
  4.2 Document analysis
  4.3 Design and prototypes
    4.3.1 Database relations
    4.3.2 Security concerns
5 Implementation
  5.1 Used technologies
  5.2 Document root organization
    5.2.1 Upload page
      5.2.1.1 Flash conversion script
      5.2.1.2 Supplementing score system
      5.2.1.3 Uploader component security
    5.2.2 File browser page
    5.2.3 Report page
  5.3 Implementation overview
  5.4 Deployment
Conclusions and future work
References

Introduction

In recent years, the use of information technology (IT) has become more pervasive in most aspects of our everyday lives. Hand in hand with this change goes the increased use of electronic documents, as individuals, businesses and governments adapt to the electronic environment. Portable Document Format (PDF) [1] files are currently among the most widely used formats thanks to their rich feature list, portability and Adobe's freely available reader software.

However, the popularity of the format inevitably drew the attention of malware authors, who quickly recognized an opportunity and began to use vulnerabilities in Adobe Reader [2] as an attack vector. What makes PDF files special in this regard is that the extensive PDF specification was soon found to provide legitimate ways of disguising a malicious payload inside documents. These obfuscation techniques not only rendered the traditional signature-based detection of anti-virus (AV) software ineffective, but also made static analysis problematic. Most AV scanners have since implemented PDF parsing functionality, yet it has been shown [3] that combining multiple techniques and avoiding common patterns can bring detection rates back close to zero even today. Specialized tools and an understanding of the format are necessary to parse and correctly examine the contents of PDF files.

The main goal of the theoretical part of this thesis is to review the available options for viewing PDF files in web browsers, to survey tools for the analysis of PDF documents, and to test the extent of their functionality. In order to do so, however, we first need to understand the structure of PDF documents and the fundamentals of the obfuscation techniques and vulnerabilities specific to the format. The first few chapters therefore cover this topic. In the practical part of this work, we design and implement a web application that serves as a framework for the analysis of malicious PDF documents. One of the aims was to develop it using only free software [4].
The first chapter provides an overview of PDF version history [5], including the functionality added since the initial release. It also describes how PDF documents are structured and organized internally. The second chapter first recounts how malicious documents are usually distributed and executed [6] and then offers a summary of all known vulnerabilities [7] related to Adobe Reader. Additionally, it explains how PDF obfuscation [8] works and lists the differences in Adobe's implementation of the JavaScript engine. In chapter three, we discuss the available options for viewing PDF files in a browser environment and then review and describe all currently freely available tools for PDF analysis (both offline and online), namely PDF Tools [9], peepdf [10], PDF Stream Dumper [11], Origami [12], jsunpack [13], Wepawet [14], PDF Examiner [15] and PDF X-Ray [16]. Chapters four and five cover the design and implementation of our web application. This includes establishing a workflow for static analysis of malicious PDF documents, outlining potential security issues and providing instructions on how to deploy the application on a web server. Finally, we draw conclusions about the problems of malicious document analysis and discuss future possibilities regarding our application.

Chapter 1 Portable Document Format

This chapter summarizes the version history of the Portable Document Format and describes how data is organized internally inside a PDF document.

1.1 Version history

Portable Document Format is a file format that was created by Adobe Systems in 1993. According to Adobe, PDF is a fixed-layout format for representing two-dimensional documents in a manner independent of the application software, hardware and operating system; it lets you capture all the elements of a printed document as an electronic image that you can view, navigate, print or forward to someone else.
In essence, PDF allows users to view documents exactly as their authors designed them, regardless of any differences between the author's and the reader's systems and without needing the software used to create the document. PDF has since become a de facto standard for the electronic distribution of documents. Even though Adobe Systems Inc. holds patents to PDF, anyone may create and publish applications that read or create PDF documents without having to pay any royalties.

Initially, PDF was created with the idea of the paperless office in mind. The format was intended to provide a way for companies, corporations and other organizations to exchange documents electronically. It was first publicly discussed at a Seybold conference in 1991, when it was referred to as 'Interchange PostScript'. PDF 1.0 was introduced a year later at Comdex Fall. Adobe Acrobat 1.0, the software used to create and view PDF files, was released on 15th June 1993. PDF 1.0 included features such as bookmarks, internal links and embedded fonts. However, the format was not successful at first, mostly due to the high price of the creation tools and the lack of a free version of Acrobat Reader.

The first edition was revised twice. PDF 1.1, along with the corresponding Adobe Acrobat 2.0, was released in 1994. It included several new features, such as password protection, encryption (MD5 and RC4), device-independent colors, threads and links. PDF 1.2 [17] (and Adobe Acrobat 3.0) was released two years later, in 1996. It introduced, among other things, interactive page elements, fill-in forms and the Forms Data Format, used for transmitting form data to and from the Web.

PDF 1.3 [18], the second edition of PDF, was released in 2000 and added support for the new features of the Adobe imaging model embodied in PostScript LanguageLevel 3.
The most important features introduced were new data structures for efficient mapping of strings and numbers to PDF objects, several new types of functions, embedding of files of any type within a PDF document, several new annotation types, digital signatures, support for JavaScript, a way to capture information from the Web and convert it to PDF form, and prepress support.
