The Pennsylvania State University

The Graduate School

Department of Computer Science and Engineering

A FRAMEWORK FOR MIME TYPE IDENTIFICATION AND

CONTENT FILTERING IN THE FIREFOX

A Thesis in

Computer Science and Engineering

by

Matthew James Rummel

© 2012 Matthew James Rummel

Submitted in Partial Fulfillment of the Requirements for the Degree of

Master of Science

December 2012

The thesis of Matthew James Rummel has been reviewed and approved* by the following:

Patrick McDaniel
Professor of Computer Science and Engineering
Thesis Adviser

Trent Jaeger
Associate Professor of Computer Science and Engineering

Lee Coraor
Associate Professor of Computer Science and Engineering
Director of Graduate Affairs

*Signatures are on file in the Graduate School

Abstract

Modern Web browser architectures allow for extensibility in order to support an evolving variety of content. Each supported plugin interacts with the browser and underlying host through a diverse set of operations that bring new challenges to the security model. These capabilities provide the means for a growing number of attack vectors that leverage the lax MIME type verification utilities in browsers to disguise malicious files. Once loaded by a browser, these objects take advantage of the escalated privileges available to their concealed payload in order to execute commands on the client. Such attacks can be launched from files shared on social media sites, through email, or from a server controlled by the attacker. To protect against these threats, we offer MIME Detector, a Firefox browser extension to identify and monitor the browser’s use of loading objects. By utilizing a collection of open source tools and internal browser components, the tool is able to determine the MIME type of incoming content and enforce an acceptable use policy. Our testing shows that this research provides a solid framework towards providing users with a greater level of control over how Web based content interacts with their client.

Table of Contents

List of Tables

List of Figures

Acknowledgments

Chapter 1. Introduction
  1.1 Camouflaging Malicious Content
    1.1.1 GIFAR
    1.1.2 Flash and ZIP Archives
    1.1.3 Chameleon Files
  1.2 Research Statement

Chapter 2. Related Work
  2.1 Client Filtering
    2.1.1 String Based Filtering
    2.1.2 Control Flow Detection
  2.2 Server Filtering
    2.2.1 Common Approaches
    2.2.2 Automata Based
  2.3 Comparison to Project

Chapter 3. Implementation
  3.1 User Interface
    3.1.1 Site Elements
    3.1.2 Settings
    3.1.3 Action Log
  3.2 Browser Interaction
    3.2.1 Channel Proxy
    3.2.2 Content Evaluation
  3.3 MIME Identification and HTML Parsing

Chapter 4. Evaluation
  4.1 Rule Set Tests
  4.2 MIME Identification Tests
  4.3 Web Browsing Test

Chapter 5. Conclusions

Appendix. Web Browsing Test Results

References

List of Tables

3.1 Monitored HTML tags and their associated reference attribute

4.1 The result of the tag test evaluation
4.2 The results of the camouflaged objects evaluation

A.1 A sample rule set for general Web browsing
A.2 An evaluation of identification results
A.3 A listing of items blocked by the extension
A.4 A comparison of collected performance metrics

List of Figures

1.1 Sample Java and HTML code to launch a GIFAR attack that lists a user’s files [9]
1.2 A Postscript file modified to contain HTML code and an HTML file with a GIF header [7]

3.1 The user interface tabs
3.2 The stages of a file’s evaluation

Acknowledgments

I am appreciative of the guidance I have received from my advisor, Dr. Patrick McDaniel. His perspective and feedback were instrumental in leading this thesis to successful completion. I am also grateful for the support of my family and friends. Their unwavering encouragement has always been a positive influence in all of my endeavors.

Most of all, I would like to express my deepest gratitude to Allison for her understanding, patience, and reassurance throughout the duration of this project; I couldn’t have done it without her.

Chapter 1

Introduction

The incorporation of Web 2.0 technologies in the World Wide Web has brought substantial changes to both the user experience and security model of Internet applications. As a platform for services and user content, Web based products allow for increased ease of collaboration and data dissemination amongst distributed parties. This vast dispersion of files originating from end users, combined with the execution of client side code, can also be leveraged to compromise the privacy of users and the integrity of their devices. A recent report by Symantec lists blogs and Web communication as the category of websites most frequently utilized to launch such an attack [1]. The report further cites plugins, including Oracle Java, Adobe Flash, and Adobe Acrobat Reader, as commonly providing a mechanism for many malicious exploits. Research has shown these manipulations to include cross site forgeries [8], cross site script attacks [7], and malware [10]. Additionally, it has been revealed that any type of file, even those as seemingly benign as images, can be used to exploit properties of Web architectures [9] [7]. Thus, the ability to add media to websites coupled with the requirement that browsers support rich content presents an ongoing challenge in browser security.

In this research, we examined a particular category of Web based attacks in which an object loaded into a browser is embedded with the payload of a malicious object of a different MIME type. By disguising malicious files in this manner, attackers are able to circumvent content policies enforced by both browsers and servers. Such attacks have been described as content repurposing by Sundareswaran and Squicciarini [29] and “chameleons” by Barth et al. [8]. The objective of this project was to develop a framework, implemented as a browser extension, to prevent such exploits.

1.1 Camouflaging Malicious Content

Regardless of the method used to repurpose content, there are some common characteristics that can be recognized in each approach. Each attack implements some form of digital steganography, or the practice of disguising data by placing it within other data, thereby concealing the secret payload [4]. Although standard MIME types have recognizable signatures, the process of finding all signatures within a given payload of data has proven to be a difficult task at both the client and server. Furthermore, when MIME types are inferred through different recognition techniques, it is possible that the server will identify the object as being of one type, while the client attempts to utilize it as though it were another. The following descriptions exemplify the attack vectors and capabilities of hidden Web content.

1.1.1 GIFAR

To date, the most highly publicized repurposing attack is the GIFAR, so named for its construction as a concatenation of an image, such as a GIF, and a Java archive, or JAR. The GIFAR vulnerability was presented at the Black Hat USA Conference in 2008 based on research by Billy Rios and Petko Petkov. The attack was regarded as one of the top Web hacking techniques of that year based on its simplicity and ability to compromise a victim’s privacy [14]. The vulnerability was patched shortly after the presentation and is no longer a threat in versions of Java since 1.6.0.11 and 1.5.0.17 [9].

A notable property that contributed to the effectiveness of the GIFAR is its distribution through images. While most Web applications will not allow executable code to be uploaded, images are frequently permitted and widely shared in social media and content management applications. In addition to third party sites, an attacker may consider storing the malicious content on their own domain and attracting users to their site through advertisements or other means. Once the GIFAR is stored on a third party server, the attacker must find a way to embed HTML code that enables the JAR to execute. This code can be inserted into the webpage due to lax text input sanitization or other attack methods whereby HTML can be injected. An additional method is to upload the GIFAR to a server and then send an HTML email to the victim. The HTML message would contain an tag that embeds the tag, thus referencing the GIFAR as a link. When the user clicks on the link, a page is loaded that invokes the applet and thus carries out the attack [29].

The overall extent of a GIFAR’s effectiveness is largely based on the security measures in place on the client, the browser settings, and the security awareness of a potential victim. A number of these scenarios were discussed by Ron Brandis, a researcher at EWA-Australia [9]. If the firewall setting on the user’s local machine prohibits the GIFAR from establishing a connection back to a server controlled by the attacker, then no information can be retrieved. If a TCP tunnel can be established, then a fairly low level set of attacks could be launched to return information such as the target’s internal IP address, to send spam emails, or to forward commands to botnets. The attacker could also potentially establish connections to a third party server, which on a large scale could lead to a distributed denial of service attack [29].

    // Included in Evil.class in a JAR concatenated to evil.gif
    public class Evil extends JApplet {
        public void start() {
            // attackerIP and attackerPort stand for the attacker's server
            Socket socket = new Socket(attackerIP, attackerPort);
            DataOutputStream out = new DataOutputStream(socket.getOutputStream());
            // Run a directory listing and read its output line by line
            Process p = Runtime.getRuntime().exec("ls -l");
            BufferedReader in = new BufferedReader(
                new InputStreamReader(p.getInputStream()));
            String line = "";
            while ((line = in.readLine()) != null)
                out.writeUTF(line + "\n");
        }
    }

Fig. 1.1 Sample Java and HTML code to launch a GIFAR attack that lists a user’s files [9].

More detrimental attacks to the user’s system would require a signed applet in order to satisfy the restrictions enforced by the JVM’s Security Manager on access to the local system. If the applet in the GIFAR is signed with a fake certificate that lists a seemingly trusted source as its publisher, the user may be inclined to accept the certificate. This action would provide the GIFAR unrestricted access to run a number of attacks. Such attacks include executing commands, modifying files, or retrieving persistent cookies. Sample code for executing a command in a GIFAR and returning the results to an attacker is shown in Figure 1.1.

In addition to the image files for which GIFARs are named, the technique can also be used to launch attacks from Microsoft Office Open XML files (.docx, .pptx, and .xlsx), which use an XML format and ZIP compression to store office documents. Using a ZIP utility, one can insert the contents of the JAR file into the archive. As in the GIF version, the attack is launched when the file is loaded and the JAR invoked by the HTML.

1.1.2 Flash and ZIP Archives

Another camouflaging technique that uses Adobe Flash was presented by Michael Bailey, a Senior Researcher at Foreground Security, at the Black Hat USA conference in 2010 [6]. Flash content contained in .swf files is controlled by executable code called ActionScript. ActionScript is restricted by a same origin policy similar to that which governs JavaScript; ActionScript loaded from one domain cannot execute code, access cookies, or read content that originated from another domain unless permission has been explicitly granted. Of course, if the Flash file is hosted on a server in the domain being attacked, then that file can execute scripts in the context of that domain. This enables a malicious Flash object to steal information or cookies from the domain under attack. Unlike the GIFAR attack, a Flash executable does not require any additional HTML code in order to be invoked.

Simply changing the extension of a .swf file to one associated with a MIME type that the server permits may be sufficient to upload a Flash file in poorly secured applications. If the server has more restrictive filtering, the attacker could attempt to disguise the SWF by prepending it to a file that conforms to the ZIP format. As in the GIFAR attack, the ZIP format is a good candidate to disguise Flash files, as a ZIP file can be appended to any binary format and remain valid. However, unlike JARs, SWF files will not be executed if concatenated at the end of another file. The solution is to append the ZIP file to the SWF and then attempt the upload [5]. Bailey reports that several server side validators will recognize the file to be of the ZIP format and thus the attack can commence. Currently, there have not been any published fixes from Adobe.
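The property that a ZIP archive remains valid when it follows unrelated leading bytes can be checked directly. The following is a minimal Python sketch, not part of MIME Detector, that builds a harmless in-memory archive, places it after a GIF signature used as a stand-in for another format, and confirms that a standard ZIP reader still accepts the result. The helper names and the sample contents are illustrative assumptions.

```python
import io
import zipfile

GIF_MAGIC = b"GIF89a"  # stand-in for the leading bytes of some other format


def build_zip(entries):
    """Return the bytes of an in-memory ZIP archive holding the given entries."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in entries.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def has_leading_signature(data, signature):
    """Check only the first bytes, as a simple signature test would."""
    return data.startswith(signature)


def parses_as_zip(data):
    """ZIP readers locate the archive from the end of the file."""
    return zipfile.is_zipfile(io.BytesIO(data))


def zip_entry_names(data):
    """List the archive members found in the data."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


# Harmless example: a GIF-like prefix followed by a one-entry archive.
combined = GIF_MAGIC + b"\x00" * 32 + build_zip({"note.txt": b"harmless"})
```

A validator that inspects only the leading bytes and one that looks for the ZIP directory at the end of the file would each report a different type for `combined`, which is the ambiguity the attacks described above rely on.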

1.1.3 Chameleon Files

An attack technique developed by Barth et al. introduced a way of hiding HTML elements inside of Postscript and image files, creating what they described as “chameleons” [7]. The attack was implemented by editing the header of a Postscript file to include HTML tags that executed JavaScript. The attack was tested by loading the modified Postscript file into a HotCRP server, which is a content management product used for conference paper submissions. When a hypothetical reviewer of the paper accessed the page containing the modified file, the JavaScript was executed with the credentials of the logged in reviewer in the domain of the conference server and was able to rate the submitted paper with presumably high marks.

Chameleon image files were created by placing a GIF88 signature at the beginning of an HTML file. The constructed file was then uploaded into Wikipedia, a collaborative online encyclopedia. Although Wikipedia does search for embedded HTML in uploaded files, it checks for only a limited set of potential tags. Using these unchecked tags, the file was able to be uploaded. When the page was viewed with Internet Explorer, the JavaScript contained in the HTML was executed.

Both sample attacks executed by Barth et al. are examples of content sniffing cross site script attacks. The attacks relied on the privilege escalation that occurs in the browser when HTML tags are detected. Because HTML is able to run scripts, it is regarded as being of a higher privilege in comparison to other elements such as an image. The inability of both the browser and the server to detect this mixing of privileges allows the attack to occur. A sample code segment depicting the Postscript and GIF attacks is shown in Figure 1.2.

    % Post script to execute HTML
    %!PS-Adobe-2.0
    %%Creator:
    %%Title: evil.dvi
    %%Pages: 3

    <!-- HTML file hidden as gif -->
    GIF88

Fig. 1.2 A Postscript file modified to contain HTML code and an HTML file with a GIF header [7].

1.2 Research Statement

The success of these attacks reveals that current techniques used to determine MIME types on both the server and client are lacking in uniformity and effectiveness. It also highlights that browsers can make use of a file in the manner specified by the HTML without any verification of the content or the use intended by the application. Although the intended use could be inferred from the Content-Type header defined in the HTTP protocol, this field can be left empty by some servers, filtered by an external proxy, or disregarded by a receiving browser based on its own MIME identification scan [7].

This research attempts to address the issue of acceptable content use in the browser by constructing a tool to determine an object’s MIME type and appropriately filter content based on its context. We developed a tool called MIME Detector as an extension to the Firefox browser to fulfill this ambition. This tool listens for incoming content as it enters the browser, performs scans on the data to identify the MIME types of the objects being loaded, and then determines which objects should be allowed or denied. Load requests are evaluated based on the default criteria of whether an object can be identified and, if so, how many identities are recognized. If a type cannot be determined or more than one MIME type is detected, the object load is canceled. If it does evaluate to a single MIME value, the context of the object’s use as an element loading in an HTML page or in the browser window itself is evaluated by a user-defined rule set. Although many experts believe that ultimately MIME type filtering is an issue that must be solved by Web applications, the short term prevention of these security deficiencies can best be addressed by users in the client [18].
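The default criteria described above can be summarized in a few lines. The following Python sketch is illustrative only and is not the extension's code: the `Decision` type, the `evaluate_load` function, and the representation of the rule set as a mapping from a loading context to its permitted MIME types are all assumptions made for this example.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"


def evaluate_load(detected_types, context, rules):
    """Decide whether an object load should proceed.

    detected_types: set of MIME types recognized in the object's data
    context: how the object is used, e.g. "img" or "window" (assumed labels)
    rules: hypothetical rule set mapping a context to its permitted MIME types
    """
    # No identity, or more than one identity, cancels the load.
    if len(detected_types) != 1:
        return Decision.DENY
    (mime_type,) = detected_types
    # A single identity is then checked against the rule for its context.
    if mime_type in rules.get(context, set()):
        return Decision.ALLOW
    return Decision.DENY


# Hypothetical rule set: images in <img> elements, HTML in the main window.
sample_rules = {"img": {"image/gif", "image/png"}, "window": {"text/html"}}
```

Under these assumptions, an object that scans as both `image/gif` and `application/zip` is denied before the rule set is consulted, while a plain GIF requested by an `img` element is allowed.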

Chapter 2

Related Work

Every browser and properly configured server that allows user uploads utilizes a method to determine a file’s MIME type. This is accomplished through various approaches. Thus, the problem of identifying hidden malicious content is compounded by the disjointed methods used to determine a file’s type. Many attack vectors in the domain of content repurposing are made possible because a server determines a file to be of one type, while the browser interprets it as another. While a universal recognition technique may be beneficial to both content providers and users, no standard has yet been adopted. The following work highlights attempts to improve MIME type detection, content filtering, and security best practices at both the client and server.

2.1 Client Filtering

Browsers use their MIME identification utilities to determine how loading content should be rendered when the MIME type is not supplied by the server or is overruled by the browser. MIME sniffing is typically achieved by comparing the leading bytes of data to well-known signatures. Common implementations of these methods have proven to be ineffective in stopping content based attacks. This has led to the development of alternative techniques that attempt to improve upon standard capabilities.
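Leading-byte comparison of this kind can be sketched as follows. This is a simplified Python illustration, not any browser's actual algorithm; the signature table is a small assumed sample and the function name `sniff_leading_bytes` is hypothetical.

```python
# A small, assumed sample of well-known leading-byte signatures.
SIGNATURES = [
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
]


def sniff_leading_bytes(data):
    """Return the MIME type of the first signature matching the start of data."""
    for signature, mime_type in SIGNATURES:
        if data.startswith(signature):
            return mime_type
    return None  # no known signature at the start of the data


# Only the first bytes are consulted, so anything placed later goes unnoticed.
plain_gif = b"GIF89a" + b"\x00" * 16
gif_then_zip = b"GIF89a" + b"\x00" * 16 + b"PK\x03\x04" + b"\x00" * 16
```

Both sample inputs are reported as `image/gif`, which illustrates why a check limited to the leading bytes cannot reveal a second format concealed further into the file.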

2.1.1 String Based Filtering

Barth et al. blended the detection abilities of several content sniffing algorithms in popular Web browsers to form a single MIME recognition method [7]. The browsers incorporated in this experiment were Internet Explorer, Firefox 3, Safari 3.1, and Google Chrome. As Firefox and Chrome are open source, the content sniffing algorithms in these browsers were determined by source code inspection. Since Safari and Internet Explorer utilize proprietary algorithms, their detection abilities were modeled using a technique called string-enhanced white-box-exploration.

White-box-exploration executes code in an environment that records the constraints of conditional statements encountered in the execution path of a given input. The module then negates one of the recorded predicates in the execution path such that the opposite path is explored. For instance, if the condition x < 0 was satisfied in the current execution, this statement would be negated to x ≥ 0. An input generator then analyzes the new constraints list and generates inputs that will exercise the chosen negated condition. These items are added to the set of inputs under consideration for successive rounds. A prioritized list of new inputs based on overall code coverage is selected from this set. The highest priority inputs start the next round and execution continues until all conditional paths have been explored.
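The exploration loop described above can be illustrated on a toy program. The Python sketch below is a simplification under stated assumptions: the program under test (`classify`) is invented for the example, a brute-force search over a small integer domain stands in for the constraint solver, and inputs are processed without the coverage-based prioritization the original technique uses.

```python
def classify(x, trace):
    """Toy program under test; every branch records (predicate, outcome)."""
    def branch(predicate):
        outcome = predicate(x)
        trace.append((predicate, outcome))
        return outcome

    if branch(lambda v: v < 0):
        return "negative"
    if branch(lambda v: v % 2 == 0):
        return "even"
    return "odd"


def solve(constraints, domain):
    """Stand-in solver: first value in domain satisfying every constraint."""
    for candidate in domain:
        if all(pred(candidate) == wanted for pred, wanted in constraints):
            return candidate
    return None


def explore(program, seed, domain):
    """Run the program, negate recorded predicates, and repeat on new inputs."""
    pending, results = [seed], {}
    while pending:
        x = pending.pop()
        trace = []
        result = program(x, trace)
        path = tuple(outcome for _, outcome in trace)
        if path in results:
            continue  # this execution path was already explored
        results[path] = (x, result)
        # Keep the constraints before position i and negate the one at i.
        for i, (predicate, outcome) in enumerate(trace):
            new_input = solve(trace[:i] + [(predicate, not outcome)], domain)
            if new_input is not None:
                pending.append(new_input)
    return results


paths = explore(classify, seed=-5, domain=range(-3, 4))
```

Starting from a single negative seed, negating `v < 0` yields a non-negative input, and negating the parity check on that input reaches the remaining branch, so all three paths of the toy program are visited.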

Enhancements to white-box-exploration introduced by Barth et al. provide an abstraction of string conditional statements documented during execution. This is accomplished by recording constraints on the output of string operations rather than the conditionals themselves. Thus, an independent representation of the operation is saved instead of the byte level string comparison operations. A solver component that understands these inputs is then utilized to create a new generation of inputs. By analyzing string operations in a manner decoupled from their low level operations, content sniffing algorithms were able to be modeled and compared regardless of the language in which the browser was written.

After modeling the proprietary MIME detection capabilities, the team’s comparison of the browsers revealed significant differences in the byte length examined for signature matching, the conditions on which content sniffing is triggered, and signature definitions. These methods were consolidated through combining the induced properties of the browsers’ content recognition algorithms to create an improved browser content sniffer. This comprehensive algorithm was then pared in accordance with two key design principles: avoiding privilege escalation and ensuring signatures are prefix disjoint.

Privilege escalation occurs when an object of one type is elevated to have the privileges of another type. For example, an object that is received with a Content-Type header of type image/jpeg should never be interpreted as application/x-shockwave-flash, as these applications have the ability to run scripts whereas images do not. A signature set is said to be prefix-disjoint if any signature it uses to recognize HTML does not share a prefix with any other signature. HTML is considered to have the highest privilege of all MIME types due to its ability to run JavaScript. Using these principles, the new content sniffer proved to be more effective at recognizing hidden objects than each browser’s independent methods. This content sniffing procedure was adopted in Google Chrome and partially implemented in Internet Explorer 8.
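The prefix-disjoint property can be tested mechanically for literal byte signatures. The Python sketch below assumes one reading of "share a prefix": two literal signatures conflict when one is a prefix of the other, since a single input could then match both. The signature values and function names are illustrative and are not taken from any browser.

```python
def share_prefix(first, second):
    """True when one literal signature is a prefix of the other."""
    return first.startswith(second) or second.startswith(first)


def prefix_conflicts(html_signatures, other_signatures):
    """List every (HTML signature, other signature) pair that conflicts."""
    return [
        (html_sig, other_sig)
        for html_sig in html_signatures
        for other_sig in other_signatures
        if share_prefix(html_sig, other_sig)
    ]


def is_prefix_disjoint(html_signatures, other_signatures):
    """A signature set is prefix-disjoint when no HTML signature conflicts."""
    return not prefix_conflicts(html_signatures, other_signatures)


# Illustrative signature sets (assumed values, for demonstration only).
html_sigs = [b"<html", b"<script"]
image_and_doc_sigs = [b"GIF87a", b"GIF89a", b"%PDF-"]
loose_sigs = image_and_doc_sigs + [b"<"]  # an overly short, ambiguous signature
```

Here `html_sigs` is prefix-disjoint from `image_and_doc_sigs`, whereas adding the one-byte signature `b"<"` creates a conflict with both HTML signatures.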

2.1.2 Control Flow Detection

Sundareswaran and Squicciarini introduced a technique that focuses on modeling the browser’s actions as opposed to detecting component types. This procedure is implemented in the DeCore tool, or Detecting Content Repurposing Attacks on Clients’ Systems [28]. This process involves constructing control flow graphs with nodes that represent the current state of the browser and transitions depicting events on the local system or between the client and a server. The tool was developed as a browser extension compatible with both Firefox and Chrome. Using the control flow model, the extension is able to detect operations that indicate a repurposing attack may be occurring.

The DeCore system relies on two main components: an auditor and a detector. The auditor monitors deviations in a browser’s behavior from what is expected through a series of verifications. First, it compares a file’s extension to a list of extensions that are acceptable for the given target. The target could be the main browser window or a contained plugin. For example, the component would specify that the Java plugin should only receive files with a .jar or .class extension and the Acrobat plugin should only be provided files of type .pdf. The auditor also analyzes interactions between the user and the browser and between the browser and servers. When a page is loaded, the auditor constructs a control flow graph by monitoring the DOM. This is achieved by analyzing the components in the DOM tree and mapping the possible interactions and events that can take place. Thus in the graph, the nodes represent the client state and the edges the action or IO necessary to move the browser to a different state. When events occur, they are analyzed in the context of the browser’s state and the state of associated files on the client’s device. An interaction is deemed to be suspicious if a state transition occurs that does not match an allowable transition in the control flow graph.
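The two auditor checks described above, the per-target extension list and the transition check against a control flow graph, can be sketched as follows. This Python example is an illustration and not DeCore's implementation: the target labels, state names, event names, and the `ControlFlowGraph` class are hypothetical, while the extension lists for the Java and Acrobat plugins follow the example given in the text.

```python
import os

# Extensions acceptable for each target, following the example in the text.
ALLOWED_EXTENSIONS = {
    "java-plugin": {".jar", ".class"},
    "acrobat-plugin": {".pdf"},
}


def extension_allowed(target, filename):
    """Compare a file's extension with the list permitted for the target."""
    extension = os.path.splitext(filename)[1].lower()
    return extension in ALLOWED_EXTENSIONS.get(target, set())


class ControlFlowGraph:
    """Nodes are client states; edges are events that move between states."""

    def __init__(self):
        self._edges = {}

    def allow(self, state, event, next_state):
        """Record a transition that is expected for the loaded page."""
        self._edges.setdefault((state, event), set()).add(next_state)

    def is_suspicious(self, state, event, next_state):
        """A transition is suspicious when it was never mapped as allowed."""
        return next_state not in self._edges.get((state, event), set())


# Hypothetical graph for a page that shows images and has one link.
graph = ControlFlowGraph()
graph.allow("page_loaded", "image_request", "page_loaded")
graph.allow("page_loaded", "user_click_link", "navigating")
```

With this graph, an image request that leaves the browser in a state such as `applet_running` would be flagged, since no such edge was mapped when the page was loaded.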

The detector is notified when such an unmapped transition occurs. These violations are regarded as the signature of a potential content repurposing attack. The detector has various methods by which it determines if an attack is actually occurring, based on a defined interaction policy. Sundareswaran and Squicciarini experimented with several policies that included contingencies such as the state of the local file system, the number of loading pages per user interaction, and the state of cookies or other stored content, to determine if the unexpected transition was an attack. They were able to develop a policy that was successful in blocking a number of content repurposing attacks involving Flash and Java. They also reported that the monitoring required very little overhead with regard to system resources.

2.2 Server Filtering

An alternative approach to content filtering is to perform file analysis at the server side. As the creators and administrators of Web applications ultimately possess the most insight with respect to what MIME types should be permitted in their domain, server side approaches can offer site specific guarantees of security and functionality beyond that of generalized browser based measures. Furthermore, rendering speed is a critical metric for browsers and can have a significant impact on their market share [30]. Thus, having protection methods that use client resources can have an appreciable impact on the Web browsing experience as well as the popularity of a given browser. The following techniques attempt to enforce content security at the server.

2.2.1 Common Approaches

Currently, upload filtering is the most commonly practiced method of protecting domains from malicious MIME types [28]. As in client side filtering, scanning for signatures and HTML tags are the most prevalent techniques. The advantage of performing this action at the server side is that servers likely have more resources available to perform such scans and thus can handle more resource intensive procedures for examining files. The number of bytes examined by each browser also differs, with some older browsers using as few as 256 bytes [2]. Furthermore, as new MIME formats are constantly evolving, browser filtering methods must continually be updated to support content that servers choose to provide. This concern can be mitigated by having servers handle the filtering and choosing only to allow signatures they recognize. Nevertheless, from the user’s perspective this protection is insufficient as it requires one to place their trust in the servers they visit on the Internet.

Perhaps the easiest method to ensure files uploaded by users do not exhibit scripting concerns is to store them on servers with a separate domain [28]. Because the domain is different from that of a loaded HTML page, any included attacks that attempt to misuse the same origin principle will be prevented. While the simplicity of this approach makes it highly effective, it may be cost prohibitive for smaller organizations. Furthermore, some attacks can still be launched from malicious content linked to the site from other servers.

Another protection technique is to restrict servers to only execute authenticated scripts [28]. In this scenario, scripts would be subjected to a hash function or another form of verification to ensure that they were loaded from a legitimate source. This approach prevents malicious scripts from requesting or submitting information on behalf of the user as in the case of a cross site forgery. Nonetheless, this method does not guarantee that the authenticated scripts will not exhibit some form of information leakage that can be exploited. Furthermore, this method does not protect access to the client’s cookies or other exploits that can occur on the client side.
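One simple form of the script verification described above is a digest allowlist. The following Python sketch assumes SHA-256 as the hash function and an in-memory set of trusted digests; both choices, and the function names, are assumptions for illustration, since the text only specifies "a hash function or another form of verification".

```python
import hashlib


def digest(script_bytes):
    """Return the SHA-256 digest of a script as a hexadecimal string."""
    return hashlib.sha256(script_bytes).hexdigest()


def build_allowlist(trusted_scripts):
    """Collect the digests of scripts known to come from a legitimate source."""
    return {digest(script) for script in trusted_scripts}


def is_authenticated(script_bytes, allowlist):
    """A script is executed only if its digest appears in the allowlist."""
    return digest(script_bytes) in allowlist


# Hypothetical trusted script and a modified copy of it.
trusted = b"function greet() { return 'hello'; }"
tampered = trusted + b" /* injected */"
allowlist = build_allowlist([trusted])
```

Any change to a script alters its digest, so the modified copy fails the check. As the text notes, this says nothing about whether a trusted script itself leaks information.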

The most limited protection procedures involve performing a transformation on uploaded files [28]. For instance, a server could apply a conversion algorithm on uploaded images. Many websites use such tools for uploaded images by modifying the size and pixel rate in order to conserve storage space. Such conversions would have a high probability of destroying any hidden files encapsulated in the images. However, this technique is only applicable to images as there is no other practical means of transforming binaries of other types without destroying the usable data. Furthermore, there is no guarantee of disabling hidden content by using these methods as it is possible for an attack to be designed that can withstand a known transformation.

2.2.2 Automata Based

A more complex server side solution is presented by Gebree et al., who developed an automata based script to filter MIME types. This method examines uploaded files for selected HTML tags that have attributes that could reference code [12]. Examples of tags that could contain scripts include ,, and