ZAP-ESUP: ZAP Efficient Scanner for Server Side Template Injection Using Polyglots

Diogo Miguel Reis Silva

Thesis to obtain the Master of Science Degree in Information Systems and Computer Engineering

Supervisor(s): Prof. Pedro Miguel dos Santos Alves Madeira Adão

Examination Committee
Chairperson: Prof. Luis Manuel Antunes Veiga
Supervisor: Prof. Pedro Miguel dos Santos Alves Madeira Adão
Member of the Committee: Prof. Miguel Nuno Dias Alves Pupo Correia

October 2018

Dedicated to my grandfather António and my grandmother Luzita.

Acknowledgments

First of all, I would like to thank my parents for all the support, help, and love they gave me. I would also like to thank my brother, my girlfriend, my grandparents, and all my family for their support and affection. The quality of this document would never have been as good had it not been reviewed several times by my supervisor, Rafaela, Vasco, and Filipe, whom I thank for their great work and patience. During the development of my scanner, whenever needed, I had the help of the ZAP team, thc202, Simon Bennetts, and kingthorin, whom I thank for the time spent and their patience. The last two years of my life were made of great friendships, happiness, and achievements thanks to the whole STT team, but especially thanks to Madlebro, Jofra, Jcfg, Majo, Xtrm0, NCA, Sabino, and LordCommander. The STT team is only possible thanks to professor Adão and his tireless dedication; thanks to him I fulfilled my childhood dream of becoming a Hacker and of working, in the future, on something I would do in my free time, for which I am immensely grateful. Finally, I would like to thank my friends who accompanied me throughout my academic path and with whom I loved working: Nuno, Alexandre, and Gonçalo.

Resumo

Recently, Kettle [1] exposed the discovery of a new type of vulnerability which he called Server Side Template Injection (SSTI). This vulnerability occurs in template engines, which are programs used to combine data models with templates. These templates contain both Hypertext Markup Language (HTML) and template code, which defines how the dynamic HTML is generated depending on the data model received. In some template engines, the template code allows the execution of the full functionality of the programming language. If the user input is incorrectly inserted in the middle of the template instead of being used as the data model, an attacker can execute code on the server. SSTI can be considered a vulnerability of the A1-Injection class, which is the vulnerability class with the highest security risk according to the Open Web Application Security Project (OWASP) Top 10 2017 [2]. To the best of my knowledge, there are only two vulnerability scanners that detect and exploit SSTI: Burp Suite and Tplmap. These solutions are either proprietary (Burp Suite) or have a limited amount of fixed payloads and are consequently restricted to a limited number of template engines (Tplmap). Neither of them can find vulnerabilities when the input is stored and later used in other pages (Stored SSTI). In this work, I studied the situations where SSTI may be present, developed a vulnerability scanner that automatically searches for SSTI in a broader range of situations (reflected, stored, and blind), introduced a more efficient technique that uses polyglot payloads to search for SSTI with less than 25% of the requests made by the other scanners, and concluded by building and using a test suite to compare the existing solutions. This solution will be made available as a plug-in for the OWASP Zed Attack Proxy, an open-source tool for finding vulnerabilities in web applications that is used by a large number of users.

Keywords: security, web applications, SSTI, injection, vulnerability scanner

Abstract

Recently, Kettle [1] exposed the discovery of a new type of vulnerability which he called SSTI. A template engine is software used to combine data models with templates, which contain both static HTML and template code. This template code defines how the dynamic HTML is generated depending on the given data model, and some engines even allow full programming language functionality. If the user input is incorrectly inserted in the middle of the template instead of being used as the data model, an attacker can execute code on the server. SSTI can be considered an A1-Injection, the class with the highest security risk according to the OWASP Top 10 2017 [2]. To the best of my knowledge, only two solutions have been developed to detect or exploit SSTI: Burp Suite and Tplmap. These solutions are either proprietary software (Burp Suite) or have a limited amount of fixed payloads and are consequently restricted to a limited number of template engines (Tplmap). Neither of them can find vulnerabilities when the input is stored and used in other pages (Stored SSTI). In this work, I studied the situations where SSTI may be present, developed a scanner that automatically detects SSTI vulnerabilities in a broader range of situations (reflected, stored, and blind), introduced an efficient technique that uses polyglot payloads to detect SSTI with less than 25% of the requests made by the other scanners, and concluded by constructing and using a set of tests to compare with the existing solutions. The solution will be made available as a plug-in for OWASP Zed Attack Proxy, a widely used open-source penetration testing tool for finding vulnerabilities in web applications.

Keywords: security, web application, SSTI, injection, vulnerability scanner, polyglot

Contents

Acknowledgments
Resumo
Abstract
List of Tables
List of Figures
Nomenclature
Glossary

1 Introduction
  1.1 Motivation
  1.2 Solution
  1.3 Objectives
  1.4 Thesis Outline

2 Background
  2.1 Web Applications
  2.2 Vulnerability Detection
  2.3 Server Side Template Injection
    2.3.1 Relation of SSTI with other types of vulnerabilities
    2.3.2 Exploiting Server Side Template Injection
    2.3.3 Real cases of SSTI
    2.3.4 Analysis of situations where the vulnerability can happen
  2.4 Web Scanners for Injection Vulnerabilities
  2.5 Vulnerability Scanners for Server Side Template Injection
  2.6 OWASP Zed Attack Proxy

3 Implementation
  3.1 Architecture and Interactions
  3.2 Components Description
    3.2.1 Sink Manager
    3.2.2 Efficient Vulnerability Detector
    3.2.3 Message Comparator
    3.2.4 Arithmetic Evaluation Detector
    3.2.5 Blind Vulnerability Detector
    3.2.6 Syntax Fixer

4 Experimental Evaluation
  4.1 Simple Tests - Reflected Results
  4.2 Stored and Blind SSTI Test Cases
  4.3 Injection inside template code tests
  4.4 Real Example Test
  4.5 Performance Tests
  4.6 Generalisation Capacity Tests

5 Conclusions
  5.1 Achievements
  5.2 Future Work

Bibliography

List of Tables

2.1 Possible test result classification
2.2 Probe pairs from Backslash Powered Scanner [37]

3.1 Capabilities depending on the ZAP Strength configuration
3.2 Tests to discover the best way of causing errors
3.3 Tests to polyglots
3.4 Specific tests to , DustJs and Go

4.1 Simple Vulnerabilities Detection Table (yes - found vulnerability; ybne - found vulnerability but says it is not exploitable; the RCE column says whether I found an exploit for RCE in some source or by myself)
4.2 Stored and Blind SSTI tests results
4.3 Injection inside template code tests results
4.4 Performance Tests Table
4.5 Generalisation capacity tests results

List of Figures

2.1 Process of rendering a template
2.2 Reflected SSTI timeline
2.3 Stored SSTI with posterior injection and rendering timeline
2.4 Stored SSTI with immediate injection and rendering timeline
2.5 Blind SSTI with posterior injection and rendering timeline
2.6 Blind SSTI with immediate injection and rendering timeline
2.7 Backslash Scanner logic
2.8 Learning algorithm

3.1 Plugin Architecture
3.2 Polyglot testing logic

Chapter 1

Introduction

1.1 Motivation

The web nowadays has an increasing importance in our society and economy. Contrary to most software, web applications need to be exposed to the internet, otherwise users would not be able to access them. This leaves the web server exposed to attackers, allowing them to find and exploit vulnerabilities. One successful cyber-attack may cause serious damage to a company, whether in money or in reputation. Thus, it is of major importance to include security testing in the software development life cycle: it reduces the number of vulnerabilities and consequently the risks associated with security attacks. Manual security testing is a reliable way of testing software, but it is not an efficient one. If the application has a large set of functionalities or a big codebase, manual testing will require several people and a significant amount of time. A fast and efficient way to perform security testing is to use automated vulnerability scanners, which can perform many more tests in a shorter period of time.

Recently, Kettle [1] exposed the discovery of a new type of vulnerability which he called Server Side Template Injection. With this vulnerability it is possible to achieve remote code execution on the server. A template engine, also known as a template processor, is software used to combine templates with data models. Web application templates contain both static HTML and template code. This template code is what defines how the dynamic HTML is generated depending on the given data model, and it can have simple functionality or the full power of a programming language. The templates are usually stored in individual files but can also be stored as a string in the middle of the code. The expected usage of these templates is to call a render function that receives two arguments: the template and the data model. If the developer wants to use the user input in the resulting page, he should add it to the data model, as represented in Listing 1. SSTI vulnerabilities exist when the user input is incorrectly inserted in the middle of the template instead of being used as the data model argument of the rendering function, as represented in Listing 2. Since the user input is inserted in the template, an attacker can inject template code. If the template code has the full programming language functionality, he can run malicious code on the server.

template = 'hello ${data}'
render(template, USERINPUT)

Listing 1: Correct usage of user input.

template = 'hello ' + USERINPUT + ''
render(template)

Listing 2: Insertion causing SSTI.

SSTI can be considered an A1-Injection vulnerability, the class with the highest security risk according to the OWASP Top 10 2017 [2].

To the best of my knowledge, there are two automated vulnerability scanners able to find SSTI. The first is Burp Suite (Burp) [3], developed by PortSwigger. Burp has a crawler able to automatically obtain the locations it should test, and a vulnerability scanner for the more common vulnerabilities. The second is Tplmap [4], an open-source tool to find and exploit SSTI and code injection vulnerabilities. It is a command-line tool that receives as input the locations to test.

The cost of commercial vulnerability scanners ranges from hundreds to tens of thousands of dollars [5, 6]. These prices can be excessive for individuals or small companies that want to have their websites tested. Although Burp is one of the cheapest vulnerability scanners, at the time of writing this document a single license of Burp Suite Professional costs €349 per year. The solution here is to resort to open-source security tools to improve the security of small websites. Tplmap is an open-source tool, but contrary to Burp it does not have crawling capabilities, so the locations to test need to be declared manually or by another tool. Another problem of Tplmap is that it is a tool specifically designed for this one vulnerability and, according to [7, 8], generic tools able to deal with a wider spectrum of vulnerabilities are often chosen due to cost and time restrictions.

Bau et al. [6] evaluated several vulnerability scanners and noted that all the tested tools performed very poorly at detecting stored vulnerabilities. Eight years later, both Burp and Tplmap still do not have any capability to detect SSTI when the payload is stored and later injected.

Of the 90 template engines enumerated at [9], Tplmap can only detect SSTI in 15 of them, as it requires a plugin to be developed for each template engine, and each plugin contains tests designed with only that template engine in mind. The tests in Section 4.6 show that Burp is able to generalise in some cases, but it may not have been designed with the intention of finding vulnerabilities in unknown template engines.

1.2 Solution

This thesis proposes a solution that automatically searches for SSTI in a more efficient way and in a larger number of situations than the existing solutions. Prior to developing this solution, I conducted a study of the possible situations where this vulnerability may happen, based on the existing known cases of successful exploitation of the vulnerability, on other already well-studied vulnerabilities, and on other web vulnerability scanners. With the knowledge obtained from this study it was possible to develop a scanner able to tackle situations not contemplated by the other solutions, as is the case of stored SSTI. Based on this information, I also created a repository of vulnerable web applications to test SSTI scanners, which is publicly available at [10]. To detect SSTI vulnerabilities it is necessary to have some proof that the payload sent was injected in the template instead of being used as the data model. Both Burp and Tplmap obtain proof of injection by sending template code that, when injected in the template, is rendered into an intended result. My scanner also follows this approach but, to minimise the amount of requests sent, it first checks whether SSTI is likely to exist in that location. If it concludes that it is likely, it sends the rendering tests; otherwise it proceeds to the next location, saving unnecessary requests. To determine whether SSTI is likely to exist, my scanner tries to cause and detect the failure of the rendering process due to invalid syntax. This alone does not save many requests, because the syntax of the several engines differs and requires different payloads to cause errors, but I created two generic payloads, usually called polyglots in the literature, which are able to cause rendering errors in all the studied template engines. To detect errors, this solution compares the response where the error might have occurred with a previous response in a normal situation and infers whether the changes are caused by some error. This step reduces the number of requests to 24% of the requests made by Burp and 7% of the ones made by Tplmap. My solution also tries to detect the vulnerability in template engines I have not analysed, so instead of using tests made specifically for one template engine, it uses tests that should work in a broader number of engines. I wanted to create a tool with impact, useful to the community, so I decided to develop my ideas in the format of a plugin to OWASP Zed Attack Proxy (ZAP), which is already a well-known and widespread tool for web security testing.
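To make this flow concrete, the following sketch shows the two-step logic just described (the polyglot string, the error heuristic, and the confirmation payload are illustrative placeholders; the actual polyglots and the message comparison are developed in Section 3):

import requests

POLYGLOT = '<%={{={@{#{${zap}}%>'  # placeholder only, not the thesis's real polyglot

def looks_like_error(baseline, probed):
    # crude heuristic: a status change or a large change in body size
    if baseline.status_code != probed.status_code:
        return True
    return abs(len(baseline.text) - len(probed.text)) > 0.3 * len(baseline.text)

def scan(url, param):
    baseline = requests.get(url, params={param: 'zap'})
    probed = requests.get(url, params={param: POLYGLOT})
    if not looks_like_error(baseline, probed):
        return None                                      # SSTI unlikely: skip rendering tests
    confirm = requests.get(url, params={param: '{{7*7}}'})  # one of several tag styles
    return 'SSTI likely' if '49' in confirm.text else 'template error triggered'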

1.3 Objectives

Thus, the five main contributions of this project are the following:

1. Study of the situations where SSTI is present.

2. Development of a vulnerability scanner able to detect the cases found in my study.

3. Development of an efficient way to find SSTI using polyglot payloads.

4. Development of web applications to evaluate the efficiency and correctness of SSTI vulnerability scanners.

5. Integration of the tool as a plugin in the ZAP scanner.

1.4 Thesis Outline

The remainder of this document is structured as follows. Section 2 gives the information necessary to understand the problem, describes the existing work and solutions in the area, and studies the situations where SSTI may happen. Section 3 describes the solution in more detail, including the architecture, the process of development, and the reasons for the choices made. Section 4 describes the evaluation of the scanner and the web applications developed to study SSTI scanners. Finally, Section 5 concludes by summarising this work.

Chapter 2

Background

2.1 Web Applications

A web application, or "web app", is a computer program that uses the client-server computing model and whose client runs in a web browser. Websites such as online stores, web email services, and social networks are examples of web applications. The client-server computing model is a distributed model that partitions tasks between a resource provider, called the server, and a consumer of those resources, called the client. Usually there are many clients for each server, and the communication between them is made through the internet. The clients of web applications are web browsers, which are responsible for providing an interface to the user and for doing some client-side logic. The web servers are responsible for implementing the main logic of the application and for storing the state, if it exists. The communication between the client and the server is done through the Hypertext Transfer Protocol (HTTP). Each time the client connects to the server it receives some HTML code that defines the structure of the page, some JavaScript code that allows dynamic pages and more functionality, and CSS that defines the style of the page. In more recent pages, the HTML sent is just enough to load the JavaScript code, and the rest of the content is then obtained and created dynamically by the JavaScript. Each website or web application can have one or more pages. In the past, each page had a different Uniform Resource Locator (URL), but now some web applications change pages dynamically within the same URL.

2.2 Vulnerability Detection

Security Vulnerabilities A software security vulnerability can be defined as a flaw in the software design, implementation, or operation that can be exploited to violate the security policy [11]. Arkin et al. [12] state that most security vulnerabilities are related to an attacker's unexpected but intentional misuse of the application. Vulnerabilities can be introduced during the several stages of development. Design flaws are vulnerabilities caused by bad design during the software design phase. Since this is one of the first stages in

software development, the code will be based on it. The later these vulnerabilities are found, the more work is required to fix them. One example of a design flaw is performing the security checks on the client side, wrongly assuming that they cannot be tampered with. Security implementation flaws are vulnerabilities caused by programming bugs that can be abused to obtain non-intended functionality. This is the case of SSTI, where the template engine functions are used incorrectly. Operational security flaws are those caused by a wrong configuration of the program or of the environment where it runs. One example of this type of vulnerability is a database management system accessible from outside the server and protected by weak credentials. According to [13], vulnerabilities can be divided into two classes:

1. Input validation vulnerabilities are caused by using malicious inputs without filtering or validating them before, leading to non-intended actions.

2. Logic vulnerabilities are caused by the usage of intended actions in an unexpected order with the objective of obtaining non-intended functionality or behaviour.

In this context, I define an attack as the attempt to use one or more security vulnerabilities to perform a malicious action. The piece of code that exploits a vulnerability is called an exploit.

Vulnerability Testing I use the word test for each action or group of actions with the purpose of finding a single vulnerability. The result of a test can be classified depending on the value it reports and on its correctness. The usual classification defines the value as positive or negative, and the correctness as true or false. If the test considers that a vulnerability exists, the value is positive, and negative otherwise. If the value returned accurately represents reality, it is considered true; otherwise it is considered false. From the combination of the previous values we get four possible classifications of test results, represented in Table 2.1.

                              Correct result   Incorrect result
Identified as vulnerable      True positive    False positive
Identified as non-vulnerable  True negative    False negative

Table 2.1: Possible test result classification.

Depending on the information we have about the tested entity, security testing can be divided into three categories [14]:

1. Black-box testing is done with only an external description of the software. For instance, if the program under testing is a web application, the only information available/needed may be its address.

2. White-box testing is done with access to the source code of the program. Since the source is known, it is possible to achieve higher coverage.

3. Grey-box testing combines the capabilities of the two previous categories, obtaining the best of each.

Miller et al. [15] introduced the term fuzzing to describe the automatic generation of tests. They aimed to test the reliability of Unix programs by sending unexpected inputs. At that time, many of the programming languages in use allowed or required the programmer to directly manipulate memory. The objective of the fuzzers was to generate inputs that caused memory corruptions or invalid memory accesses, which led to crashes. The paradigms changed, and much of today's software, such as web applications, is written in higher-level programming languages. These languages do not have memory corruption issues, making the old fuzzers useless against them. Since the publication of [15], new techniques to discover software vulnerabilities have been developed, focusing on new types of vulnerabilities. Austin and Williams [16] compared the three most common vulnerability discovery techniques: static analysis, manual penetration testing, and automated penetration testing, also known as vulnerability scanning. Manual penetration testing is a black-box testing technique performed without the aid of an automated tool; it consists of interacting with the application with the objective of obtaining non-intended behaviour. It may include sending tests for a specific vulnerability and tricking the logic of the application. The number of tests a human can perform manually is small; consequently, it is an inefficient way of doing repetitive tasks such as sending tests for specific vulnerabilities, and for this reason automated vulnerability scanners have been developed. Automated vulnerability scanners are able to send many requests in a short period of time, allowing them to get better coverage of each entry point for the several known input validation vulnerability classes. Static code analysis is done in a white-box style and consists of analysing the code without executing it. According to Correia and Sousa [14], static code analysis is the attempt to do manual code review in an automated way. It can be done in a simple way, by checking the code for words from a list, or in a more sophisticated way, by taking the semantics of the program into consideration. Each of these techniques has strengths and weaknesses. Static analysis uses a white-box approach, which gives better insight into how the program works compared to black-box approaches, but to obtain this insight it needs to understand the semantics of the programming language used. For this reason, it is difficult to create a white-box tool for all languages. Meanwhile, tools that use a black-box approach do not need to know each of the languages, making it possible to have a single tool for all of them. As an example, a local file inclusion vulnerability consists in loading a file from the computer where the program is running; in a black-box approach, the tests for this vulnerability can be the same for all programming languages. Austin and Williams [16] concluded that manual testing was the most effective at finding design flaws, while the other techniques found the largest number of input validation issues. In our case, since each template engine has its own semantics and I want to make my tool as broad as possible, developing a black-box scanner is a better choice than a white-box one. My goal is to develop a black-box vulnerability scanner.

Architecture of web vulnerability scanners Doupé et al. [5] state that web application scanners are composed of three modules: a crawler, an attacker module, and an analysis module. The crawler is responsible for discovering all the reachable pages and their respective input points. It receives as input

one or more URLs and parses the available content at those locations to obtain new URLs to explore. In my solution I will not develop this module, since the scanner is going to be integrated into ZAP, which already has one. The attacker module analyses the input points found by the crawler: for each entry point and for each vulnerability it tests, the attacker module generates payloads that are likely to trigger the vulnerability. The last module is the analysis module, which analyses the responses to the attacker module's requests to detect possible vulnerabilities. Antunes and Vieira [7] suggested an architecture for vulnerability testing tools for web services. The main difference from web applications is that the interface is well defined and known by the client. Since the input points are known, this architecture does not have a module like the crawler of [5]. Another difference from the previously described architecture is the existence of a workload emulator. This component generates valid requests to understand the correct execution of the program, allowing anomalies to be detected and abnormal behaviours to be discarded. This study is not exclusive to black-box testing, so it also includes a module named service monitor that collects information from the program's execution.
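As a minimal sketch, the three modules of [5] can be viewed as the following pipeline (module internals elided; the function names are mine, not from [5]):

def scan_application(seed_urls, crawl, attack, analyse):
    for entry_point in crawl(seed_urls):               # crawler: discover pages and inputs
        for request, response in attack(entry_point):  # attacker: send crafted payloads
            for finding in analyse(request, response): # analysis: flag likely vulnerabilities
                yield finding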

Scanners Benchmarking After developing the web vulnerability detector for SSTI, I need to test it and compare it with the other existing solutions. Antunes and Vieira [8] proposed an approach for benchmarking the effectiveness of vulnerability scanners for web services. They consider benchmarks to be tools that allow evaluating and comparing different solutions with respect to a specific property. For them, a benchmark consists of the workload, which is the task that needs to be done; the measures, which characterise the effectiveness; and the procedure, which is the set of rules to follow during the benchmark. The workloads can be of three types: real workloads, i.e., a real application; realistic workloads, i.e., a real application with vulnerabilities inserted for the purpose of the benchmark; and synthetic workloads, which are pieces of code created with the purpose of being vulnerable. The measures used to compare the scanners were:

• Precision - the ratio between the number of true positive tests and the total of positive tests.

• Recall - the ratio between the number of vulnerabilities found and the number of known vulnerabilities, as formalised below.
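Writing TP, FP, and FN for true positives, false positives, and false negatives (notation mine; [8] states the definitions in words), the two measures are:

\[ \text{precision} = \frac{TP}{TP + FP} \qquad \text{recall} = \frac{TP}{TP + FN} \]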

In order to have credibility, a benchmark should be repeatable by a person other than its creator, should be portable to allow its execution with a new tool, and must represent real cases. Doupé et al. [5] and Bau et al. [17] evaluated several automated black-box web vulnerability scanners. The authors of [5] consider that the scanners have a testing cycle that starts with crawling, followed by input selection and injection, and ends with response analysis. In their evaluation they consider all these steps, because all of them affect the final result of the scanner. Since in my work I do not implement the first step, only the last two are relevant for us. They tried to make the tests as realistic as possible by using fully functional web applications. Bau et al. [17] also evaluated the complete testing cycle of the web application scanners. They tried to evaluate all vulnerability detection capabilities recommended in the Web Application Security Consortium evaluation guide for web vulnerability scanners [18], which includes the categories: Protocol

Support, Authentication, Session Management, Crawling, Parsing, Testing, Command and Control, and Reporting. The metrics used to evaluate vulnerability detection were: elapsed time and scanner-generated network traffic, entry point coverage, vulnerability detection rate, and false positive rate. Scanner time is the amount of time the scanner needs to execute the full scan. This metric is important from a usability perspective, since using a tool that takes 66 minutes is different from using one that takes 473 minutes. Network traffic can be related to the number of requests made and to the resources used by the server to answer those requests: the lower the number of requests, the better. Coverage, in their case, is related to the ability of the crawler to reach all the available URLs, but in the scanner context it can be seen as the percentage of the application code that is tested. The vulnerability detection rate is the same as recall in [8], i.e., the number of vulnerabilities found over the number of known vulnerabilities. In my opinion this is the most important metric, since it is the one that shows whether the tool really does its job. The last metric, the false positive rate, is important to evaluate the reliability of the scanner. Like precision in [8], it indicates whether the information reported by the scanner is useful. For instance, a scanner that reports every possible vulnerability at each entry point will have a vulnerability detection rate of 100%, but it is useless, as almost all the vulnerabilities it reports do not exist.

2.3 Server Side Template Injection

Web template engines are used in web development. They allow the reuse of static elements of a web page while defining dynamic elements with template code, based on a data model generated by the program logic. According to [19], these systems are usually composed of three elements, represented in Figure 2.1: the first is the templates, which contain the static HTML and the template code marking the places where dynamic information will be inserted; the second is the data model, which comes from the program logic; and the last is the template engine, which combines the previous two. This separation between business logic and presentation logic allows parallel and independent development of both. At the time I am writing this thesis, there are more than 90 template engines [9], each with its unique templating language. Some of them, such as Mustache [20], have only simple features, but others, such as Jinja2 [21], are richer and allow full programming language functionality inside the template. Template languages vary even among the ones used with the same programming language. For instance, to execute Python code in Jinja2 [21] the syntax is {{code}}, while in Mako [22] the syntax is ${code}.
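As a quick illustration of the two syntaxes (a minimal sketch, assuming the jinja2 and mako packages are installed; this is the intended, safe usage where the user value travels in the data model):

from jinja2 import Template as JinjaTemplate
from mako.template import Template as MakoTemplate

print(JinjaTemplate('hello {{ name }}!').render(name='world'))  # Jinja2 tags: {{code}}
print(MakoTemplate('hello ${name}!').render(name='world'))      # Mako tags: ${code}

Both calls print hello world!; only the template syntax differs.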

The templates are usually created by the developer, but sometimes they can also be provided by the users. For this case, some template engines have a secure mode, or sandboxed mode, with reduced capabilities in the execution environment compared to the normal version. However, as shown in [1], some of these protections can be easily bypassed, possibly leading to remote code execution on the server.

What is Server Side Template Injection? If, in a web application, the user input is inserted in the template (Listing 3) instead of being used as a rendering argument (Listing 4), the server is vulnerable to SSTI.

Figure 2.1: Process of rendering a template.

Since the user input is inserted in the template, an attacker can inject template code. The reason why SSTI exists is very similar to the reason why SQL Injection (SQLI) exists, i.e., the programmer inserts the user input into the query instead of using it as an argument in a prepared statement.

...
name = request.form['name']
template = 'hello ' + name + '!'
return Template(template).render()
...

Listing 3: Incorrect insertion of user input causing SSTI in Mako template engine.

...
name = request.form['name']
template = 'hello ${data}!'
return Template(template).render(data=name)
...

Listing 4: Correct usage of user input in Mako template engine.
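For reference, a minimal runnable application reproducing the flaw of Listing 3 could look as follows (a sketch assuming Flask and Mako are installed; the route and parameter names are illustrative):

from flask import Flask, request
from mako.template import Template

app = Flask(__name__)

@app.route('/hello', methods=['POST'])
def hello():
    name = request.form['name']          # attacker-controlled input
    template = 'hello ' + name + '!'     # input concatenated into the template: SSTI
    return Template(template).render()   # the input is executed as template code

app.run()
# POSTing name=${7*7} returns 'hello 49!', proving template execution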

The code in Listing 3 can be easily exploited to obtain remote code execution. The attacker just needs to send template code consisting of Python code wrapped in the engine's start and end tags. An example of an exploit for the previous vulnerable code (in Mako) is:

${__import__("subprocess").check_output("ls")}

This exploit executes commands on the system; more specifically, it lists the current directory.

Server-Side Template Injection: RCE for the modern webapps

Kettle [1] defines SSTI as: "Template Injection occurs when user input is embedded in a template in an unsafe manner." To the best of my knowledge, this was the first work on SSTI, and the paper contains the fundamental basis for my work, as it defines a methodology to find and exploit SSTI vulnerabilities. The methodology described in the paper is divided into three parts: detection, identification, and exploitation.

SSTI can appear in the context of plain text, as HTML content, or it can appear inside the code specific to the template engine. In the detection phase, the payloads necessary to detect the vulnerability depend on the context. If the user input is inserted in the middle of plain text, the vulnerability can be detected by sending generic HTML, which, in the presence of a vulnerability, will be reflected much like in Cross-site Scripting (XSS). To distinguish this reflection from XSS, the paper proposes sending template engine statements like ${7*7}. If the output contains 49, it means that the input was executed by the template engine. To detect the vulnerability when the input is inserted in the template code, they propose sending closing tags followed by HTML tags. If the application is vulnerable, it will either fail to render or reflect the HTML tag without showing the closing tag. After identifying the existence of an SSTI vulnerability, the next phase of this methodology is to identify the template engine in use. The Burp Suite scanner uses a decision tree of valid payloads to identify the template engine, representing engines as leaves of this tree. It starts by sending a payload that should only work for some engines; if that works, it goes down that branch, otherwise it goes down the other branch. This is done recursively until the selected branch contains only one template engine.
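A sketch of this identification step (a two-leaf toy tree; the payload/engine pairs are illustrative and not Burp's actual tree):

def identify_engine(send):
    # send(payload) returns the body of the response to that payload
    if '49' in send('${7*7}'):
        return 'branch of engines using ${...} tags, e.g. Mako'
    if '49' in send('{{7*7}}'):
        return 'branch of engines using {{...}} tags, e.g. Jinja2 or Twig'
    return 'unknown engine'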

The last phase is exploitation, and the paper explains a generic strategy to create SSTI payloads by hand. The recommended steps are:

1. Read the existing documentation to understand the syntax.

2. Look for security considerations because they are the ones that may lead to vulnerabilities.

3. Read about the available resources in the template engine, such as methods, functions, variables, and plug-ins.

4. See which of the existent resources are available at runtime in the template execution environment.

5. Search for possible attacks with the available resources.

The paper only covered SSTI with the objective of achieving remote code execution on the machine, but SSTI can also be used to perform other attacks, such as XSS.

2.3.1 Relation of SSTI with other types of vulnerabilities

Programming languages can be divided into two categories, compiled and interpreted [23]. Compiled languages are translated into machine code before being used, while interpreted languages are translated during runtime. Most languages used in web development are interpreted, which introduces an attack vector related to interference with the rendering.

Attacks that interfere with the rendering are usually designated injections [2, 23, 24]. Some of the more prevalent injections are SQLI, code injection, command injection, and LDAP injection. Template engines render/interpret templates at runtime, and if the user input is inserted in the template before rendering, the attacker is allowed to control the rendering. This leads us to place the vulnerability in the injection category. SSTI can sometimes be confused with a simple XSS vulnerability, since the behaviour is similar: the attacker sends a payload and the payload is reflected somewhere without filtering. To find SSTI, I can use techniques like the ones used for XSS and learn from the vast number of real cases of that vulnerability. In fact, in [25] the author claims to have found an XSS when in reality the input reflection was caused by SSTI, which could be escalated to Remote Code Execution (RCE). The existence of SSTI does not imply the existence of XSS, nor does the existence of XSS imply the existence of SSTI.

2.3.2 Exploiting Server Side Template Injection

Since the release of [1], the security of template engines has improved, making some of the described attacks impossible. Nevertheless, the ability to abuse template engines is not limited to remote code execution, and in the following sections I analyse which vulnerability classes can be reached by having user input inserted in the template before rendering. During this analysis, and whenever possible, I exemplify with a template engine in which I have not been able to obtain RCE. I intend to demonstrate that the imposed restrictions do not prevent an attacker from causing damage through other paths.

Remote Code Execution

RCE is the highest capability an attacker might want. Even when the code runs with low privileges, the attacker can then perform a privilege escalation attack, ending with complete control of the machine. Despite some of the security changes made in template engines since [1], most engines, when vulnerable to SSTI, are susceptible to RCE. There are several known exploits to achieve RCE with SSTI; many of them can be obtained from the source code of Tplmap, from the paper [1], or from other online sources [26]. With these exploits I can infer some information about the way this vulnerability can be abused. To get some practical knowledge, I created 17 vulnerable web applications with different template engines and their respective exploits. In some template engines, the code accepted inside the tags is almost the full programming language. In these cases, the attacker can write an exploit as if he was programming normal code, with no restrictions. These are the trivial cases, and one example is the Python Mako engine [22], which can be exploited by just sending a payload such as:

${__import__("os").popen('id').read()}

where ${} are the specific tags the template engine uses to execute code. Inside them, this exploit imports the os module, which contains functionality to interact with the operating system; then it

uses the function popen from os, which runs the process given as argument, in this case id; and lastly it reads the result. Another simple example is the exploit for PHP, which calls the system function and gets a command executed on the server:

{system("ls")}

Denial of Service

The goal of Denial of Service (DOS) attacks is to reduce the usability of a web application or to crash it completely. To cause DOS, an attacker can starve the system's resources, which is possible by performing actions that require high amounts of resources in a non-intended way. When the attacker can inject template code that is rendered, he may have several ways to cause DOS. The most dangerous situation is when RCE can be achieved, allowing destructive actions, for example deleting files. When it is possible to execute code, the attacker may also tamper with the stored data to make it inconsistent or incorrect, which may cause malfunctioning of the server. Some template engines have security features that do not allow the execution of code or system commands, but all the template engines I used in my test servers have cyclic/iterative primitives that allow the creation of nested loops. With these loops it is possible to make the web application use high amounts of processing power, by doing heavy operations in the middle of the nested loops or by creating a much bigger response that consumes high amounts of resources.
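For instance, in an engine with Jinja2-like syntax, a payload along the following lines (bounds chosen arbitrarily for illustration) makes the server evaluate the loop body millions of times while building a very large response:

{% for i in range(10000) %}{% for j in range(10000) %}{{ i * j }}{% endfor %}{% endfor %}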

Cross Site Scripting

Most of the time, when you have a template injection you can also have an XSS, because the user input is inserted in the response without filtering. If the programmer is aware of XSS, he will HTML-encode special characters such as '<', '>', and quotes before inserting them in the HTML, making the input safe. Even the browser may have protections [27] that detect whether user input with dangerous characters such as '<' or '>' is inserted in the middle of the HTML. These protections take into consideration the input sent by the user and the output received from the server, but they cannot know what happens inside the web application. If somehow the execution of the web application transforms the input, it is possible to bypass these protections. Using the Google Chrome browser, I accessed my purposely vulnerable (to SSTI) web application that uses the Twig [28] engine, and when I sent an XSS payload Chrome showed:

`This page isn’t working Chrome detected unusual code on this page and blocked it to protect your personal information (for example, passwords, phone numbers, and credit cards). Try visiting the site's homepage. ERR_BLOCKED_BY_XSS_AUDITOR`

This is due to the XSS Auditor of Google Chrome [27] that protects the users from reflected XSS by comparing the sent arguments with the response. One easy way to bypass this filter is by sending

a payload that, when rendered by the web application, results in something very different from what was sent in the request. One example of a payload that bypasses the XSS Auditor of Chrome Version 68.0.3440.106 (Official Build) (64-bit) is

<{{"script"}}>alert(1)<{{"/script"}}>

whose rendering result is:

<script>alert(1)</script>

Twig protects against XSS when the template code inserts user input into the HTML, by HTML-encoding it (the correct usage of template engines), but it does not do the same when the input is inserted in the template itself, because the user is not supposed to control the template (the incorrect usage of template engines). If "<", ">" and "/" are outside the template code, they will not be HTML-encoded, and if the strings "script" are inside the template code as {{"script"}}, what is sent will be considered an invalid HTML tag, because { is not alphanumeric ("HTML elements all have names that only use ASCII alphanumerics" [29]); this allows the auditor to be bypassed. Other strategies may be developed to bypass the filter by using the engine's features. These strategies may also allow bypassing Web Application Firewall (WAF) restrictions that would otherwise prevent XSS.

Local File Inclusion and Local File Write

Sometimes it is possible to read and write files on the system. Considering the case of Jinja2 [21], a Python engine, when sending the payload used above in Section 2.3.2 for Mako (adapted with the correct tags), I quickly noticed that it does not work, because the server returns the following error:

UndefinedError: '__import__' is undefined

This happens because __import__() was removed from the environment, as well as dir() and open(), which do not work either. From the work available in [26] it is possible to define a way to bypass this restriction of not having import available in Python. The main idea is to go from any available object "up the tree" to the superclass Object and then "down the tree" again to obtain one of the subclasses of Object. One of the subclasses of Object is precisely file, which can be used to read and write files. The Python language has objects, and those objects have methods and attributes. One attribute of every object is its class, so if the payload:

{{"".__class__}} is sent the class String is obtained. With the class String is possible to get its superclass by getting the attribute __mro__. One of the available super classes is Object that is at index 2 and with the payload

{{"".__class__.__mro__[2]}} we can get the class Object. We now reached the root of the tree. We can then get all its subclasses with __subclasses__() and from the result we choose the index of the intended class. In this case

it is index 40 that contains the File class, and with the class File we can now read and write files on the system, even without the ability to execute commands or the full Python language functionality.

{{"".__class__.__mro__[2].__subclasses__()[40]("/etc/passwd","r").read()}}

Information Leakage

Some template engines have access to the program's global environment variables, and if it is possible to access them from the template code, an attacker can obtain sensitive information. In Jinja2, just by sending the payload {{config.items()}} we can get the configuration variables. One of them is SECRET_KEY, which is used to sign cookies and therefore allows us to create the cookies we want and to escalate privileges.

2.3.3 Real cases of SSTI

One important source of information for understanding this vulnerability is the public cases where it was found. These cases include Common Vulnerabilities and Exposures (CVE) entries and publicly available bug-bounty reports. In this section I describe some of the cases I found related to SSTI. The most relevant information is the place in the template where the input was inserted, when the payload was rendered, and the location where the rendering result was found.

SSTI at rider.uber.com, template engine Jinja2, rendering result in email [30] On 25 March 2016, Orange submitted a ticket to Uber reporting an SSTI vulnerability. When the user changed his profile name to a template format string {{'7'*7}}, he received an email with the string 7777777, which indicates that the code inside the tags was executed. There is no available information on whether the email was rendered between the request and the response or rendered later. The input was inserted in the middle of HTML.

SSTI at unikrn.com, template engine Smarty, rendering result in email [31] On 29 August 2016, Yaworsk found that by editing his profile name to {7*7} and inviting a friend by email, the friend would receive the email with the code evaluated. With this vulnerability he was able to read local files from the server. As in the previous case, analysing the payloads I infer that the input was inserted in the middle of HTML. I also do not know whether it was rendered between the request and the response or rendered later.

SSTI in CMS Made Simple, template engine Smarty, rendering in response [32] On 11 July 2017, Gogebakan Mithat reported an SSTI vulnerability in the open-source content management system CMS Made Simple. The URL contained a parameter named "detailtemplate" that was vulnerable to SSTI. By sending the value string:{6*3} in that parameter, the response contained the value 18. In this case, the template injected by the payload was shown in the response.

SSTI at whitelabel error page, Spring template engine, rendering in response [25, 33, 34] I found three reports of vulnerabilities in web applications that used the Spring framework. In all of them, the vulnerability was found in a Whitelabel error page. On 13 October 2016, Dave Vieira-Kurz found an SSTI vulnerability (CVE-2016-4977) in Spring Security OAuth; the description of the vulnerability is available in his blog [25]. When the redirect_uri was incorrect, the server returned the error page with the input inserted and rendered. By sending the payload ${777-111} he received its result (666), and after confirming the existence of the vulnerability he used a payload that allowed him to obtain RCE on the server. On 13 December 2017, Tghawkins made a blog post [34] about an SSTI vulnerability found at datax.yahoo.com. The vulnerability existed in the URL parameter "active" at the Application Programming Interface (API) endpoint "v2/taxonomy". When this parameter's value was not a Boolean, an error was returned with the value in the middle of the text. If the value sent was {191*7}, the error message in the response contained the value 1337. The vulnerability found in [33] was very similar to the two previous ones, differing only in the parameter where the input was inserted and consequently rendered. The web application had a URL argument whose value was injected and rendered before the response, and the rendering result was shown in an error message in the response.

SSTI at Craft CMS, Twig template engine, rendering result in Location header [35] In Craft CMS, when users updated their profile they would be redirected, but the destination was set by a POST parameter named redirect. If we sent Twig template code in that parameter, we would receive the rendered result in the Location header of the response. The authors of this vulnerability were able to leak configuration information by abusing functionality added to Twig by Craft CMS.

SSTI at Craft CMS plugin SEOmatic, Twig template engine, rendering result in response header [36] On 19 July 2018, Sebastian (also known as @0xB455 on Twitter) found a vulnerability in the Craft CMS plugin SEOmatic. When he fuzzed parts of the API URL, he found that by sending a Twig template in the URL, as in /api/foo{{2*8}}bar/..., he received the rendering result in the Link header of the response. There is no known exploit for the Twig template engine alone, but as in [35], the author managed to use features added by Craft CMS to leak configuration values, including the database password. In summary, the vulnerability entry point was the URL path, the payload was injected and rendered before the response, and the result was shown in a response header value.

Related vulnerability cases

The number of publicly available examples of websites vulnerable to SSTI is not enough to obtain information about all the possible cases of where and how it might appear. In some of the previously described cases, the vulnerabilities were found while investigating XSS vulnerabilities. This information reinforces my idea that both vulnerabilities are closely related, and consequently the situations where SSTI exists are like those of XSS. Besides reflected XSS, which is similar to the previous SSTI examples, there also exist stored XSS and blind XSS. Stored XSS happens when the user input is stored by the web application and then it

is inserted without filtering in another page, leading to code execution in that location. Blind XSS is a case of stored XSS with the additional condition that the place where the injection happens is not accessible to the attacker. One possible example is the insertion of unfiltered user input in a log available on the administrator page. I believe that these subvariants also exist in SSTI.

2.3.4 Analysis of situations where the vulnerability can happen

Result Location and Time of Rendering

In order for an SSTI vulnerability to occur, the user input needs to be inserted in a template, and that template needs to be rendered. Based on the location where the rendering result is shown, I divided SSTI into three sub-classes: reflected SSTI, stored SSTI, and blind SSTI.

Reflected SSTI If we send an SSTI piece of code to the web application and this code is inserted in the template that is used to create the response, we can immediately see the rendering result in the response and consequently identify the vulnerability. In the real examples we can see several cases where this happens [25, 32–36]. The timeline representation of Reflected SSTI is depicted in Figure 2.2.

Figure 2.2: Reflected SSTI timeline.

Stored SSTI Sometimes the user input is stored in some persistent way and may be used in another page of the website; if it is inserted in the template in an unsafe way there, it can also lead to SSTI. We will refer to these cases as Stored SSTI with posterior injection and rendering. I did not find any real case where a stored input caused an injection in another page. Nonetheless, stored XSS vulnerabilities, where user input is stored and later rendered in pages other than the one where the input was sent, do exist. In this case, to detect and exploit the SSTI, an attacker needs to send the payload on one page and then go to the other, where the input will be injected and rendered. The timeline representation of Stored SSTI is in Figure 2.3. Template engines are not only used to generate web pages; they can also be used to generate emails or any other kind of text. The rendering of this content with user input injected can happen during the processing of the client request but be shown somewhere else. We will refer to it as Stored SSTI with

Figure 2.3: Stored SSTI with posterior injection and rendering timeline.

immediate injection and rendering. Two possible examples of these injection cases are [30, 31], where the rendering result was received in an email. In these examples, I am not able to confirm whether the rendering was done before or after the response from the server. Figure 2.4 represents the timeline of Stored SSTI with immediate injection and rendering.

Figure 2.4: Stored SSTI with immediate injection and rendering timeline.

Blind SSTI In the previous cases, where I talked about the user input being stored and later causing an injection, I considered that it was always possible for the attacker to see the rendering result. Nonetheless, the existence of blind XSS attacks shows us that the user input may end up in a restricted page, such as an administrator panel. Blind SSTI can also be split into blind SSTI with posterior injection and rendering (Figure 2.5) and blind SSTI with immediate injection and rendering (Figure 2.6).

Figure 2.5: Blind SSTI with posterior injection and rendering timeline.

Figure 2.6: Blind SSTI with immediate injection and rendering timeline.

Location in the Template Where the Injection Occurs

Another factor that differentiates the possible cases is whether the input is inserted in the middle of HTML content (Listing 5) or in the middle of existing template code (Listing 6). In the first case, simple template code results in correct syntax, but in the second, the template can end up with invalid syntax, causing the rendering to fail. This factor is completely independent of the previously described cases and can occur in all of them.

2.4 Web Scanners for Injection Vulnerabilities

Black-box scanners for web vulnerabilities usually have two main components: one is the crawler, and the other is the vulnerability scanner, which is sometimes divided into an attacker module and an analysis module. The crawler uses varied techniques to discover existing pages and the web forms on them.

...
name = request.form['name']
template = '... Hello ' + name + ' ...'
return Template(template).render()
...

Listing 5: Insertion of input in HTML zone.

...
name = request.form['name']
template = '...${name="' + name + '"}...'
return Template(template).render()
...

Listing 6: Insertion of input in template code zone.

Web forms and other elements of HTTP requests, such as cookies and headers, are possible entry points to the web application and constitute the web application's attack surface. However, not all scanners include this crawling feature; some, such as Tplmap [4], require the user to specify the entry points to test. After discovering the attack surface, the vulnerability scanner tests the entry points of the web application. There are very few published works on SSTI, but since it belongs to the injection vulnerability class, I searched for work on other kinds of injection vulnerabilities found in web applications.

The Backslash Powered Scanner

Traditional scanners send a set of payloads for each type of vulnerability and then check for some kind of signature in the response, using for example a . This approach is limited to known classes of vulnerabilities and may lead to false negatives in case of some filter or WAF is being used. To tackle this, Kettle [37] introduced a new approach to find vulnerabilities from known and unknown classes using black box scanners. The main idea behind his solution is to mimic a human security expert. The scanner, instead of scanning for vulnerabilities, scans for interesting behaviours. It starts by making minimal changes in some part of the input and if it results in interesting behaviour investigates it further, otherwise ignores it.

His scanner has predefined probe pairs that contain one element that should cause an error and a very similar one that should not. Each pair is used to answer some question, and a few examples are shown in Table 2.2. For each probe pair the scanner starts by sending a simple string such as foo and gets the respective response. Afterwards, it appends the element of the probing pair that should cause an error (foo') and, if the response is the same as for the first request, it means that nothing interesting happened and it tries the next pair. If, on the other hand, the behaviour was different, it sends the correct element of the probing pair (foo\'). If the result of this second element of the probing pair differs from the result obtained with foo, the scanner has found an interesting behaviour, which it investigates further by iterating into more specific questions. This logic is represented in Figure 2.7.

Question                           Probe pair
Am I in a single-quoted string?    z'\z vs z\'z
Am I in a numeric context?         x/0 vs x/1
Am I in a file path?               ./../x vs ././x
Am I a function invocation?        sprintg vs sprintf
Am I in a JSON value?              ","a"," vs ","a":"

Table 2.2: Probe pairs from the Backslash Powered Scanner [37]

Figure 2.7: Backslash Scanner logic.

In order to compare different responses, the scanner sees a response as a collection of attributes. The attributes collected are: status code, line count, word count, reflection count, and the frequency of keywords. If we persistently get different attribute values between the two elements of a probing pair, we have a different and interesting result that is worth investigating further. Although this solution is better than injecting fixed payloads, a big disadvantage of this tool is that, because it just presents interesting behaviour instead of finding specific vulnerabilities, it needs a security expert to manually check the findings. The author left as future work the adaptation of the scanner to classify individual issues, which I believe can be applied to SSTI.
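The triage loop described above is compact enough to sketch. The following Python fragment is a minimal illustration of the probe-pair idea, not the scanner's actual code; send() and similar() are hypothetical helpers standing in for the HTTP layer and the attribute-based response comparison.

# Minimal sketch of probe-pair triage in the spirit of [37]. send(payload)
# and similar(resp_a, resp_b) are assumed helpers, not real APIs.
PROBE_PAIRS = [
    ("'", "\\'"),      # am I in a single-quoted string? (cf. Table 2.2)
    ("/0", "/1"),      # am I in a numeric context?
]

def triage(send, similar):
    base = send("foo")                         # reference behaviour
    findings = []
    for breaker, fixer in PROBE_PAIRS:
        broken = send("foo" + breaker)         # element expected to cause an error
        if similar(base, broken):
            continue                           # nothing interesting; try the next pair
        fixed = send("foo" + fixer)            # the "correct" element of the pair
        if not similar(broken, fixed):         # the escape changed the behaviour:
            findings.append((breaker, fixer))  # worth investigating further
    return findings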

Automated Testing for SQL Injection Vulnerabilities: An Input Mutation Approach

Appelt et al. [38] propose an automated black-box testing approach that detects SQLI vulnerabilities and tries to produce executable inputs that bypass WAFs. Their solution starts from an initial valid test case. Afterwards, it applies mutation operators to the payload in different combinations. This creates possibly new attack patterns, thus increasing the probability of detecting SQLI vulnerabilities.

To create executable SQLI code that bypasses WAFs the scanner uses three kinds of mutations: behaviour-changing, syntax-repairing, and obfuscation. Behaviour-changing mutations have the objective of changing the application's expected behaviour. For example, if a numeric parameter of a request is vulnerable to SQLI and we append OR 1 = 1 to 9, sending 9 OR 1 = 1 instead of 9, we will get a different result. Syntax-repairing is used to fix the incorrect syntax of the queries when a behaviour-changing

mutation caused an error. If we change the query that is made to the database it will change the syntax and possibly cause an error, so we need to change the injection to have correct syntax again. Obfuscation mutations are used to bypass WAF string pattern matching. One example is to replace spaces with equivalent character sequences such as + or /**/. The tests of this scanner were made with and without the presence of a WAF. In both cases the results were better than those of previous scanners that used predefined attack payloads. When the WAF was used the results were much better because the other solutions were not able to execute any valid injection in the database. These techniques can contribute to the development of SSTI scanners able to bypass WAFs and to automatically generate Proofs of Concept (POCs). I do not apply them in my case as they are not necessary to prove that the vulnerability exists.
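To make the three mutation families concrete, here is a small hedged sketch; the operators shown are single illustrative instances of each family, not the full operator set of [38].

import random

# Illustrative instances of the three mutation families; the real approach
# applies many such operators in different combinations.
def behaviour_changing(payload):
    return payload + " OR 1=1"           # try to alter the query's result set

def syntax_repairing(payload):
    return payload + " -- "              # e.g. comment out the broken tail of the query

def obfuscation(payload):
    return payload.replace(" ", "/**/")  # evade space-based WAF signatures

def mutate(seed, rounds=3):
    operators = [behaviour_changing, syntax_repairing, obfuscation]
    payload = seed
    for _ in range(rounds):
        payload = random.choice(operators)(payload)
    return payload

print(mutate("9"))  # e.g. "9/**/OR/**/1=1 -- " depending on the operators drawn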

KameleonFuzz: Evolutionary Fuzzing for Black-Box XSS Detection

Duchene et al. [39] introduced a black-box XSS fuzzer for web applications using a genetic algorithm guided by an attack grammar. The goal of KameleonFuzz is to learn the model of the application in order to generate malicious inputs.

First, the scanner crawls the application to learn the control flow, and then it performs a taint flow analysis to know which parameters are reflected. Since the inserted input can be slightly modified by the server, they use a heuristic-based algorithm to find reflected content instead of searching for the exact input, which otherwise could lead to false negatives. At this point the scanner knows which inputs should be tested and the input sequences that are needed to reach the output. The attack grammar then produces several fuzzing values adapted to the context where the input is reflected. For instance, if the input appears inside a tag property value, the needed exploit is different from the case where the input is reflected in the middle of the text. Afterwards, the scanner infers the taint for each of the sent values, checking if they generated a valid XSS attack. If they did not, it calculates the fitness score for each of the results (where the fitness score is higher when the result is closer to triggering an XSS).

The next generation of fuzzing values is then created by applying mutation and crossover operators to the previous generation's input values with the highest fitness scores. This iteration ends when the defined stop condition is reached. This condition can be a time limit, or whenever a vulnerability is found.
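The generation loop can be summarised as follows. This is a schematic Python sketch under stated assumptions: fitness(), mutate() and crossover() are placeholders for the attack-grammar-guided operators of the paper.

import random

def evolve(population, fitness, mutate, crossover,
           generations=50, elite=10):
    for _ in range(generations):                     # stop condition: generation/time bound
        scored = sorted(population, key=fitness, reverse=True)
        if fitness(scored[0]) >= 1.0:                # stop condition: a valid XSS was triggered
            return scored[0]
        parents = scored[:elite]                     # keep the inputs with the highest fitness
        population = [mutate(crossover(random.choice(parents),
                                       random.choice(parents)))
                      for _ in range(len(population))]
    return None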

To test KameleonFuzz the authors compared it against four other vulnerability scanners on seven different applications. In each of these applications KameleonFuzz was the one with the best results, and it only reported true positives. In fact, this scanner only reports true positives because it validates the result by using the browser to execute the response.

The main weakness of this tool is the assumption that the application can be reset, an ability that may not be available to the scanner, as for instance in a bug bounty program or in any production environment where the data cannot be restarted or lost.

The authors left as future work the automatic generation of the attack grammar, since in this paper it was created by hand, and the adaptation of these techniques to other classes of vulnerabilities. They also noted that their solution could be easily adapted to other injection vulnerabilities by creating the appropriate attack grammar. However, in that case the tool would need to somehow intercept the result after being transformed, which was not the case in XSS as the result is seen on the client side.

XSS Analyser: Finding Your way in the Testing Jungle

Tripp et al. [40] created a grammar for XSS that can generate up to 500 × 10^6 payloads. Although this number of payloads allows a better coverage of the vulnerability for an entry point compared to the usual number of payloads sent by other scanners, using all these payloads on one endpoint is not feasible in a reasonable amount of time. If we multiply these values by the number of entry points in a real application, then it becomes impossible. To solve this problem the XSS Analyser uses a learning algorithm that prunes the grammar by learning from the previous requests which inputs were not accepted, leading to a reduction of the search space. Figure 2.8 represents the learning algorithm.

The grammar used is composed of tokens, and a payload is a combination of tokens that follows the grammar rules. The algorithm starts by sending one complete payload generated from the grammar.

If the request fails or does not generate a valid XSS exploit, the tool sends each token of the payload individually to find the cause of the failure. After the token responsible for the failure is identified, it tries to apply filter bypass strategies to it, and if some of these bypass strategies work, the grammar is updated by replacing the token with its bypassing version. The payloads are then generated using the updated grammar. On the other hand, if all the bypass techniques fail, the token is removed from the grammar (pruning) and the payloads generated afterwards will not contain this token, thus reducing the search space. The algorithm runs until it finds a valid XSS attack or until the grammar is empty, as sketched below.
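A hedged sketch of the pruning loop follows; generate(), exploits(), accepted() and bypasses() are stand-ins for the paper's grammar sampling, XSS oracle, per-token acceptance test and filter-evasion strategies.

def scan(grammar, generate, exploits, accepted, bypasses):
    # grammar: dict mapping each token to its currently used form
    while grammar:
        payload, tokens = generate(grammar)
        if exploits(payload):
            return payload                        # a valid XSS attack was found
        for token in tokens:
            if accepted(token):
                continue                          # this token is not the cause of failure
            for alternative in bypasses(token):
                if accepted(alternative):
                    grammar[token] = alternative  # replace token by its bypassing version
                    break
            else:
                grammar.pop(token)                # prune: future payloads avoid this token
    return None                                   # grammar exhausted, no attack found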

Figure 2.8: Learning algorithm.

Compared to the IBM AppScan [41] version existing at the time the paper was written, the XSS Analyser found more than double the vulnerabilities with less than half the requests made. Compared to the brute-force version of the algorithm, it obtained 99% of the coverage with an average of just 10 requests instead of 2301.

2.5 Vulnerability Scanners for Server Side Template Injection

Burp Suite Pro

Burp Suite is a tool for web security testing, considered the default tool for this purpose in the industry. It is a graphical tool developed by PortSwigger Security and at the time of writing it had 3 different editions: Enterprise (3499€), Professional (350€), and Community (free). All the editions have manual tools to test web applications, such as a proxy to intercept requests and the capability to change and repeat requests. It is also possible to do fuzzing with user-supplied word lists, but in the Community edition the number of requests per second is very limited, decreasing even more as the number of requests made increases. A web crawler named Spider is also available in all editions. The main difference between the Community edition and the others is the automated vulnerability scanner. This scanner has the purpose of finding several well-known vulnerabilities such as XSS, SQLI and RCE, but also SSTI since the release of [1].

Inference of Burp's workings: Although [1] provides insights on how to exploit SSTI vulnerabilities, it does not specify in depth how the Burp Suite scanner searches for and exploits this vulnerability. To better understand how it operates I needed to analyse its behaviour on vulnerable and non-vulnerable web applications. In the Burp menus I did not find any way to see the requests performed; I could only see the request that found the vulnerability, if it existed. To collect the requests made by the scanner I created two identical web applications, where one was vulnerable and the other was not. Both stored the requests received from the scanner. Then I collected the payloads Burp sent when configured to find SSTI. The template engine used was Mako, whose tags are ${ and }. The following enumerations show the requests made by Burp when scanning for SSTI. In both I removed several payloads from the beginning that only contained alphanumeric characters because they did not offer any useful information and would occupy unnecessary space.

Requests Non-Vulnerable

1. s6w8g${404*853}t7rmi.

2. yp73p{{429*544}}j8jyg

3. n91ue{{699|add:447}}mpxex

4. #set($a=242*353)e7x5x${a}lr1dv

5. l4f0k<%= 622*795 %>cg8qs

6. uat52

7. = 814*707

8. olggq{{.}}o5w0k{{..}}ak3mi

9. YuqdTQUJ}}m7u62'/"

10. YuqdTQUJ%}vxezt'/"

11. YuqdTQUJafwba%>r2oz1'/"

12. YuqdTQUJ%]bvfw4'/"

13. YuqdTQUJyke1e;//';//";//%>?>mp5gb'/"

Requests Vulnerable

1. al2i2${432*647}pm917

2. m35j0{649*135}{*commentedout*}{903*469}ec6vu

3. g84jv${"zyz".toString().replace("y", "u")}rlft6

4. pdygy${"g3k3j".join("wprw5")}b9ppi

From the analysis of the performed requests I could infer the following. By making several scans I noticed that the requests made when the input point is not vulnerable are always the same, changing only the random numbers and strings. The requests made to both web applications are the same until the payload al2i2${432*647}pm917. This payload contains the tags used by the template engine with an arithmetic operation inside. The template engine renders it and the scanner receives the result of the arithmetic operation in the response. At that point the tool probably finished the detection phase described in the paper, because it was able to execute an arithmetic operation in the web application. The following requests are probably used for the identification phase, also described in the paper, which tries to detect the template engine in use. Further evidence that they were used to identify the template engine is that the generated vulnerability report correctly identified it as Mako, and that instead of showing the payload that detected the vulnerability it shows the last payload, which indicates the engine. The paper includes one last phase, denominated attack, which would exploit the found vulnerability, but this is not executed by the scanner. From the 9th request onwards, the payloads change from a structure of RANDOMCHARS + startTag + Multiplication + endTag + RANDOMCHARS to something like RANDOMCHARS + endTags + RANDOMCHARS + startTag + RANDOMCHARS. The possible purpose of this second type of payload may be the detection of the vulnerability when the input appears in the middle of existing code, as in the case exemplified in Listing 6.

The information obtained from the tests available in Section 4.3 allows us to say that Burp is not able to cover all the possible cases, described in Subsection 2.3.4, where the vulnerabilities may happen. When Burp sends a request with the payload it probably just searches for the rendering result in the response, leaving all the stored SSTI cases untested. Since it cannot find the vulnerabilities where the rendering result is not in the response, it is also likely that it will not be able to detect the cases of blind SSTI.

Tplmap

Tplmap is an open-source tool to detect and exploit SSTI and code injection vulnerabilities [4]. It is a command line tool but can also be used as a Burp Suite extension. This tool has no crawling capabilities and receives as input the page to test and the respective parameters. After a successful vulnerability exploitation, the tool has the capability to create an interaction with the server as if it were a remote shell. It supports around 15 different template engines and code injection vulnerabilities in Python, JavaScript, Ruby and PHP. For each of these template engines the corresponding plugin may have four different exploitation capabilities: remote command execution, blind code injection, code evaluation, and file read and write.

Inference of Tplmap's workings: Tplmap is a modular tool, and to add the capability to search for SSTI in a new template engine one just needs to create a new plug-in for that template engine. The same applies to testing code injection in a new programming language. Each test only works for one template engine or language. To detect the vulnerability Tplmap can resort to two techniques: rendering checking and time-based blind injection testing. The first technique works by requesting from each of the plugins one rendering test and the expected result. If this result is in the response, the endpoint is considered vulnerable and the tool identifies the engine as the one of the current plugin. An example of a rendering test and expected result from the Twig plugin:

...
'test_render': '"%(s1)s\n"|nl2br' % { 's1': rand.randstrings[0] },
'test_render_expected': '%(res)s<br />' % { 'res': rand.randstrings[0] }
...

In this example the %(s1)s of test_render is replaced by a random string, producing the payload "RANDOM\n"|nl2br, whose expected rendering result is RANDOM followed by an HTML line break. Then, if the endpoint is considered vulnerable, Tplmap starts to discover if it is able to execute language code, allowing it to read/write on the system. At last, it checks if it can execute commands on the system. If it can execute operating system commands, it has an option to run a reverse shell on the system.

The time-based testing is done by sending 2 payloads: one that evaluates to true and should take time to execute, and one that evaluates to false and takes the normal request time. If both have the expected behaviour the target is considered vulnerable. This allows Tplmap to find stored and blind SSTI with immediate injection and rendering, because the response will be delayed. Since in stored and blind SSTI with posterior injection and rendering the code will be executed later, there will be no delay in the response and the vulnerability will not be detected with this strategy. Since Tplmap does not have any crawling capability or a parameter to declare the locations where it should look for the result, it will not be able to detect stored and blind SSTI with posterior injection and rendering. If the sent payload is inserted in the middle of existing template code, Tplmap may be able to fix the syntax and then operate as in a normal situation.
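The timing check itself reduces to comparing elapsed times against a threshold. A minimal sketch, assuming a hypothetical send() helper and a pair of engine-specific payloads, one of which should sleep on the server:

import time

def looks_time_based_vulnerable(send, slow_payload, fast_payload, delay=5.0):
    start = time.monotonic()
    send(slow_payload)                        # "true" branch: should sleep `delay` seconds
    slow = time.monotonic() - start
    start = time.monotonic()
    send(fast_payload)                        # "false" branch: should return immediately
    fast = time.monotonic() - start
    return slow >= delay and fast < delay     # both behaved as expected -> vulnerable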

2.6 OWASP Zed Attack Proxy

OWASP Zed Attack Proxy (ZAP) [42] is a graphical penetration testing tool to find web application vulnerabilities. It is an open-source project, open to general public usage, modification, and extension. It contains several features for manual testing, such as a proxy server, message modification, message replay, and unlimited fuzzing with built-in word lists or user-inserted ones. It also has functionalities for automatic testing: two web crawlers and active and passive vulnerability scanners. The vulnerability scanner and the other automated features can be used through the graphical interface or automatically in the testing pipeline of a web application. This can be done by adding the plugin for Jenkins [43]; Jenkins is a well-known system to support the automation of software building and deployment. It is also possible to use these features through the Representational State Transfer (REST) API. ZAP has an architecture where it is possible to extend the existing functionality by developing independent add-ons. There are four types of add-ons, but the only relevant one for us is the Active-scanner rules, which are used by the vulnerability scanner. These add-ons receive as input the HTTP message and the parameter to inject and, after running, should report the vulnerabilities found. I will develop one add-on to test the injection points for SSTI. One of the ZAP plugins is the PersistentXSSUtils scanner, an active scan with the objective of finding stored XSS. It works in 3 separate phases:

1. First, it sends several random payloads in each of the entry points from all the pages ZAP is testing.

2. Second, it goes to all the pages and checks if each page contains any of the initially sent payloads. If so, it associates this page as a sink of the input parameter where the value was sent.

3. Lastly, it goes to all the entry points that have at least one sink, sends XSS payloads and checks for the vulnerability in the sinks.

I consider the first two steps a taint analysis of the web application because they check where the input will be inserted (locations tainted by the input). This feature will allow us to obtain the sinks without making additional requests.

Another useful ZAP plugin is the ChallengeCallbackPlugin, whose job is to receive requests from the vulnerable web applications. To use this functionality my scanner registers a random token and as response receives a URL associated with the token, which is then used in the payload sent by the scanner. If the web application is vulnerable it makes a request back to this URL controlled by the ChallengeCallbackPlugin, which will call the entity that registered the associated token, in this case our plugin (a schematic model of this exchange is sketched below). All the ZAP scanners must fulfil some restrictions. One of those restrictions is the maximum number of requests that should be made in total at each of the Strength levels. The maximum number at "low" is up to around 6 requests, at "medium" up to around 12, at "high" up to around 24, and at "insane" any number of requests can be made.
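The following is a simplified model of the token/URL exchange, not the actual ZAP API; the class and method names are hypothetical.

import uuid

class CallbackRegistry:
    """Simplified model of the ChallengeCallbackPlugin interaction."""
    def __init__(self, base_url):
        self.base_url = base_url
        self.handlers = {}

    def register(self, handler):
        token = uuid.uuid4().hex[:6]              # random token
        self.handlers[token] = handler
        return f"{self.base_url}/SSTI/{token}"    # URL to embed in the payload

    def on_request(self, path):
        token = path.rsplit("/", 1)[-1]
        if token in self.handlers:
            self.handlers[token]()                # notify the registering scanner

registry = CallbackRegistry("http://124.124.124.124:58895")
url = registry.register(lambda: print("callback received: blind SSTI confirmed"))
# ... a payload containing `url` is sent; if the target is vulnerable it fetches it:
registry.on_request(url)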

Chapter 3

Implementation

In this chapter I am going to present my solution and the reasons behind my decisions. The main goals for the scanner are:

1. Detect SSTI on the largest number of engines, including unknown template engines.

2. Cover all the cases described in Subsection 2.3.4, i.e., reflected SSTI, stored SSTI, blind SSTI, and inside or outside template code.

3. Use as few requests as possible.

The first choice I needed to make was whether the scanner would use a white-box or black-box approach. White-box scanners need to know the semantics of the programming language in use, making the creation of a generic scanner impossible. This made the choice of black-box evident. Since I wanted the scanner to be useful to the community I decided to develop a plugin for a widely used tool. My choice was OWASP ZAP because it is open-source and easy to extend. All the ZAP plugins should follow some guidelines, and one of them is the maximum number of requests made at each of the possible strength levels. In my solution the included capabilities vary between the strength levels, allowing us to follow this guideline and at the same time obtain the best possible performance. The capabilities of each configuration are represented in Table 3.1.

         Use polyglot   Reflected SSTI   Stored SSTI   Blind SSTI   Syntax Fixing
Low      X              X                X
Medium                  X                X
High                    X                X             X
Insane                  X                X             X            X

Table 3.1: Capabilities depending on the ZAP Strength configuration

With these several possible configurations the scanner can fulfil the needs of broader types of users, from the ones that can only make a reduced number of requests to the ones that have the time and resources to do a more intense scan. In all these configurations it has the capability to find reflected and stored SSTI. At the levels "high" and "insane" I added the capability to find blind SSTI. The scanner does this by sending payloads that make callbacks to the scanner. The Insane level includes the ability to find some

vulnerabilities where the input is inserted in the middle of existing template code. It consists of fixing the syntax in simple cases. What distinguishes the "low" level from "medium" is the usage of polyglots that cause errors if an SSTI vulnerability exists. They cause errors in all the templates I analysed with at most 5 requests. I adapted the Backslash Powered Scanner [37] detection technique, described in Section 2.4, to detect errors. In my solution the plugin performs 3 different requests: the first is a reference request, the second a request using the polyglot, and the third a request using an innocuous version of the polyglot (backslashed). Then the scanner computes the differences between the reference response and the original response obtained by the crawler (or user); the differences between the polyglot's response and the original response; and finally the difference between the innocuous version's response and the original response. Comparing the first two differences it is possible to infer which changes are the normal behaviour of the page, and which are the ones caused by an injection. If this difference is higher than a threshold, comparing the first and the third allows us to infer whether the previous difference was indeed due to an injection, or just because some character in the polyglot was blocked. Notice that the third request, and the computation of the third difference, are only needed in case we suspect that there exists an injection. At the beginning of this work there was also the objective of automatically exploiting the vulnerability after finding it, but it was not a feature that the ZAP developers wanted, as it is not needed for vulnerability detection (and can also cause some unintended damage to the application under testing). Another reason was that my plan included the use of exploit grammars to generate payloads, which would allow generalising the payloads to unknown template engines. This could have worked in some languages such as JavaScript due to the similarity between the code allowed in the templates, but in others, such as Java, it would not work. To have this feature it would be needed to detect the language in order to choose the language-specific grammar. This feature was implemented but later removed because, without an exploitation module, it was useless. The scanner's interaction with the user and the server is done through the ZAP interface. To scan an endpoint, ZAP calls each active scanner indicating the original user HTTP request, the location to inject, and the original value.

3.1 Architecture and Interactions

The scanner was designed with independent modules. This will allow the usage of those modules in future work, as recommended in the guidelines for web application scanner architecture from [7]. The plugin can be divided into separate components that interact with ZAP, with the server under test, and between themselves. These interactions are represented in Figure 3.1. The Sink Manager (SinkM) has the objective of managing all the locations where the payloads appear, which are the locations where an injection may occur. Other modules use this one to obtain the result of their actions in the locations where the vulnerability may happen. The efficient vulnerability detector sends polyglot payloads in the parameter under test, then asks

Figure 3.1: Plugin Architecture.

SinkM for the normal similarity between responses with different inputs in that sink and the similarity between the new response and the original one. With this it can decide if there was a rendering error in the web application. SinkM uses the Message Comparator (MComparator) to obtain the similarity between the responses. If the Efficient Vulnerability Detector (EfficientVD) detects that a vulnerability may exist, the scanner calls the Arithmetic Evaluation Detector (ArithmeticED). The ArithmeticED tries to find the vulnerability by sending multiplication code such as 7777*2222 and then asks SinkM for the string representation of the actual state of the sinks. If it includes the multiplication result it considers that a vulnerability was found. This component tries to generalise the available tags to the maximum number of template engines, but some template engines do not support arithmetic operations, and for this reason I created 3 specific tests that send one payload and then check if the expected rendering result was obtained. The Syntax Fixer module is a wrapper of the ArithmeticED that sends the ArithmeticED requests with prefixes to fix the syntax before rendering the normal payload. It sends them several times with different prefixes. The Blind Vulnerability Detector (BlindVD) operates by sending payloads that make a callback to ZAP's ChallengeCallback plugin. This plugin receives requests to register a certain token and returns a URL to which the callback should be made. The BlindVD sends the payload with the intended URL and, when ZAP receives a request on that endpoint, it informs the plugin. The SinkM obtains the stored sinks from the Stored XSS scanner of ZAP.

3.2 Components Description

3.2.1 Sink Manager

The SinkM has the responsibility of storing and retrieving information about all the known sink points. At this moment these sink points only include the response to the request and the pages of the website where the input is shown later, but it would be easy to implement a sink abstraction for emails. SinkM does not have this feature because it needs some changes in the ZAP core. The sinks of the stored type are obtained from the Stored XSS scanner plugin. Before testing for XSS this plugin searches for the locations where its inputs are shown. Thanks to this, it is possible to know the locations where the input is going to be reflected without any additional requests. To obtain these locations SinkM makes a function call to the plugin with the endpoint and the parameter tested, and it returns the sink locations. The stored sink points should be created immediately after sending one reference request to the server because, once created, they will obtain the current state of the sink and compare it to the one stored by the XSS scanner. The result of this comparison will be considered the normal variation of the response between 2 different inputs. Then, each time the sink is updated the actual similarity is stored. The same happens with the reflected sinks (the responses to the requests where the payload is sent), which at creation receive the original user request and one request created to serve as reference. The similarity between the two is stored as the reference similarity. The stored sinks can be removed at any time if the scanner so intends. This is useful in the cases where it is already known that some of the sinks are not vulnerable, but doubts still exist about others. This is an important component for the detection of stored SSTI.

3.2.2 Efficient Vulnerability Detector

This component has the objective of deciding if it is worth sending all the payloads from the ArithmeticED that will be described in Section 3.2.4. It allows us to save a large number of requests when a vulnerability does not exist. Since the case where the vulnerability does not exist is by far the most common, my scanner will achieve a great performance; considering a pessimistic number, the relation will be less than 1 to 100. The final number of requests is 3 in the normal case and 5 in the worst case, instead of 12 in the arithmetic evaluation detection module or 17 in Burp. One trend in the security community during the writing of this thesis was contests to see who could create the payload that finds XSS vulnerabilities in the biggest number of contexts with the lowest length possible [44–47], which is usually called a polyglot. This is the kind of technique that allows scanners and testers to increase coverage while reducing the number of requests. Inspired by this idea I decided to create one polyglot to identify SSTI. Contrary to XSS, which needs to be aware only of HTML and JavaScript, this polyglot needs to work on several languages and template engines, making the task impossible with a reasonable size. However, to detect SSTI the polyglot does not need to render on the server, it just needs to prove that the engine is trying to render it. The easiest way to detect that the input is being rendered in a template is the failure of rendering due to syntax errors. Causing errors is easier

to achieve than creating a correct syntax because it does not need to take the existing code into consideration. If a normal SSTI payload is inserted in the middle of the HTML it will render, and if it is inserted in the template code it will probably fail; but if the EfficientVD can create an incorrect syntax in the first case, it will also be incorrect in the second. Another advantage of creating an incorrect syntax is that a payload that causes errors in one template engine may also cause errors in other template engines with a similar tag syntax. This is an advantage because it requires fewer characters to cause errors in all the templates. I believe that even if the error is correctly caught by the web application there will still exist a detectable deviation from the normal behaviour. With all these facts I decided to detect SSTI by causing errors, using the same polyglot payload for all the template engines.

Main logic of the scanner

My idea assumes that it is possible to detect errors, because if I can cause errors but not detect them then this idea does not work. Some of the more evident errors can be detected by humans. For instance, if we see an error page we know that something unexpected happened, and the same happens to a scanner if an error HTTP code is returned. In the subtler cases where it is difficult for humans to detect errors, what we would do is compare the response obtained when we try to cause the error with the response of a normal utilisation. This is exactly the idea I implement in my efficient scanner, inspired by [37]. The objective of the Backslash scanner is to detect interesting behaviour, while the EfficientVD's objective is to improve performance without increasing the false negative rate; consequently the requirements are also different. Some parameters, such as search inputs, cause big differences in the response, so the difference in HTML structure will always happen and the responses will always be considered different. When following the logic in Figure 2.7, the result obtained from the safe element of the probe pair will be different from the base request and consequently considered not interesting, which could result in a false negative. To avoid this problem, instead of comparing the responses with a reference payload created by us, we compare them with the original request from the user, taking into consideration the alterations the reference request has relative to the original. My strategy is depicted in Figure 3.2. The first step is to send a reference request and compare it with the original user request. Then the EfficientVD sends the polyglot and compares the response with the original user request. If the difference between the similarities of these pairs is low, it means that the variation obtained from the original request is the normal one when a different input is sent. If the difference in similarity between the pairs is big, it means that the scanner considers that a strange behaviour occurred. One possible problem is that the characters used in the template engines are usually special characters, including the ones used in XSS, < and >, which may cause the request to fail not because of rendering but because some character was being blocked by the application or a WAF. To verify that the error is not due to these character problems the EfficientVD sends a safe version of the polyglot in which the special characters are escaped to make them innocuous to the template. If the similarity of this request with the user's original request is also very different from the reference one, the EfficientVD considers that the error is not caused by the template engine and the endpoint is not vulnerable. Otherwise, it considers that there exists a strange

behaviour and an SSTI might be possible. Details about how the comparison between 2 responses is done are discussed in the following Subsection 3.2.3.

Figure 3.2: Polyglot testing logic.

The reference request has the objective of detecting the normal variations caused by sending a different value in the parameter. To create a reference payload that is different but close to the one the user would send, we use characters already present in the user's original request.
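Putting the three requests together, the decision can be sketched as follows; similarity() stands in for the MComparator of Subsection 3.2.3, send() for the injection of a value into the parameter under test, and the threshold value is illustrative only.

def polyglot_check(send, similarity, original_resp,
                   reference_value, polyglot, escaped_polyglot,
                   threshold=0.3):
    ref_sim = similarity(original_resp, send(reference_value))   # normal variation
    poly_sim = similarity(original_resp, send(polyglot))
    if ref_sim - poly_sim <= threshold:
        return False          # variation is the normal one for a different input
    # Suspicious: rule out the polyglot's special characters being blocked.
    safe_sim = similarity(original_resp, send(escaped_polyglot))
    if ref_sim - safe_sim > threshold:
        return False          # the innocuous version also deviates: blocked characters
    return True               # strange behaviour: an SSTI might be possible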

Causing errors to detect user input rendering

Not all template engines have the same tolerance to templates with incorrect syntax. This difference in tolerance is accompanied by different behaviours in case of incorrect syntax. Some template engines remove the element from the template, others return an error in its place [28], and some others even fail to complete the rendering of the template [21]. The errors most identifiable in the responses are the ones where the engine fails completely to render the template and the ones that return error messages.

Tests of payloads that may cause errors

I tested and compared the possible ways of causing a detectable error with the objective of finding the one that is most effective and efficient. The tests, as well as the results, are described in this section. To make these tests I used 18 web applications made by me, each with its own template engine. In all of them the user input was inserted in the middle of the HTML and then rendered. The code is available at our repository [10]. In a first experiment I performed tests with 4 different tag/content combinations. I started by sending each of the previously collected template starting tags (e.g., ${) to each vulnerable web application. The second combination was to send just the ending tag (e.g., }). The third combination was to send the start tag followed by an ending tag (e.g., ${}). The last was the start tag, some

content, and the ending tag (e.g., ${foo}). The full list of payloads is:

• Start tags: {, ${, #{, {#, {@, {{, {{=, <%=, #set(, <#assign ex=, <#

• End tags: }, }}, %>, ), >

• Both tags: {}, ${}, #{}, {#}, {@}, {{}}, {{=}}, <%=%>, #set(), <#>

• Both + var: {gR}, ${gR}, #{gR}, {#gR}, {@gR}, {{gR}}, {{=gR}}, <%=gR%>,

#set($x=gR)${x},

<#assign ex="freemarker.template.utility.Execute"?new()> ${ex("expr gR")}

Engine                Start Tag   End Tag     Both Tags   Both+var          Tags
Jinja2                error       reflected   error       disappear         {{X}}
Mako                  error       reflected   disappear   error             ${X}
Tornado               error       reflected   error       error             {{X}}
Django                reflected   reflected   error       disappear         {{X}}
Smarty                error       reflected   error       error             {X}
Smarty (secure mode)  error       reflected   error       error             {X}
Twig                  error       reflected   error       disappear         {{X}}
FreeMarker            reflected   reflected   reflected   error             <#Y>${X}
Velocity              reflected   reflected   reflected   reflected         #set($Y=X)${Y}
Thymeleaf             reflected   reflected   reflected   reflected
Jade                  error       reflected   error       disappear         #{X}
Nunjucks              error       reflected   error       disappear         {{X}}
Dot                   reflected   reflected   error       error             {{=X}}
Dust                  reflected   reflected   reflected   disappear/error   {#X} or {X} or {@X}
EJS                   error       disappear   error       error             <%=X%>
Vuejs                 reflected   reflected   reflected   disappear         {{X}}
Slim                  error       reflected   disappear   error             #{X}
ERB                   disappear   reflected   disappear   error             <%= X %>
Number of errors      10          0           10          10                -
Not reflected         11          1           13          16                -

Table 3.2: Tests to discover the best way of causing errors

In Table 3.2 the results for each template engine are the ones obtained with its respective tags. The results obtained show that just sending the end tag causes no errors; the best it did was to make the input in the EJS template engine disappear. The "start tag", "both tags" and "both tags with var" combinations all produced 10 cases where they caused an error in the web application, in the form of an error message or an error status code. Of these 3, "both tags with var" was the one that obtained the most strange behaviours (errors and disappearances) in total. From this I concluded that the best option to detect errors is to send the start tag and the end tag with a variable name that probably does not exist in the template execution environment. This variable should not exist in the context because its nonexistence is what causes the error.

Developing an SSTI Polyglot

Since the best way of causing errors is to have the start and end tags with something invalid inside, I created one polyglot by wrapping each template engine's start and end tags around the payload built so far. The following examples are simplified versions of the real process. Starting from the string "xux" I got something like:

{{= ${ {{ xux }} } }}

The resulting payload contained repeated combinations on the left and right, caused by the usage of the same tags by several template engines. These characters were redundant. To reduce the payload size I removed repeated tags, obtaining:

{{= ${ xux }}

By reordering the tags in a way that they could share common characters I obtained something like:

${{= xux }}

The real payload obtained from this process was:

<#set($x<%={{={@{#{${xux}}%>)

which we call the general polyglot. The escaped version of the general polyglot was backslashed to avoid rendering, resulting in:

\<\#set\(\$x\<\%\=\{\{\=\{\@\{\#\{\$\{xux\}\}\%\>\)

Engine                General Polyglot   Escaped General Polyglot
Jinja2                error              reflected
Mako                  error              reflected
Tornado               error              reflected
Django                error              reflected
Smarty                error              error
Smarty (secure mode)  error              error
Twig                  error              reflected
FreeMarker            reflected          reflected
Velocity              reflected          reflected
Thymeleaf             reflected          reflected
Jade                  error              reflected
Nunjucks              error              reflected
Dot                   error              reflected
Dust                  error              reflected
EJS                   error              reflected
Vuejs                 error              reflected
Slim                  error              reflected
ERB                   error              reflected
Number of errors      15                 2
Not reflected         15                 2

Table 3.3: Tests of the polyglots

None of the payload types created errors in all the template engines. When I used the general polyglot, I got an error in all the template engines except the Java ones (FreeMarker, Velocity and Thymeleaf), which reflected the input. This behaviour was also observed in the previous tests. Another unexpected behaviour was that the Smarty template engine returned errors when receiving the backslashed version of the general polyglot. From the results of the experiment I could infer that the general polyglot does not work in any Java template engine. This is probably because Java template engines handle errors better than the ones from other languages, and referencing a nonexistent variable is not enough to cause a visible error. I searched the documentation for the valid syntax and, based on it, I tried to create payloads that violated it. I created one payload for each of them that caused errors and, by combining them in a certain order, I obtained a single payload that causes errors in the 3 Java template engines, which I call the Java polyglot. The constructed payload was:

<%={{={@{#{${xu}}%>

I used the new general polyglot against all the non-Java template engines and got the same result as with the previous version of the general polyglot, which was longer. To solve the problem of Smarty returning errors with the backslashed version, I added the Twig commentary tags {*...*} around the polyglot. The general polyglot and the Java polyglot together were able to cause errors in all the template engines, so this is an excellent option due to the low number of payloads needed to test all the existing template engines. However, it also has some problems: one of them is the size of 19 characters, which may exceed the parameter's maximum size, and another one is the existence of many characters that are associated with other vulnerabilities and might be blocked by the application. Due to all their benefits, the polyglots were chosen to detect input rendering. This solution does not come for free and in some corner cases can be less reliable than sending all the possible template tags and looking for rendering. So, I decided to only use it when the strength setting is Low, with the objective of reducing the requests to the allowed maximum of 5.

3.2.3 Message Comparator

The objective of the Message Comparator, as the name states, is to compare 2 different responses and return a value that represents how similar they are to each other. For each of the requests it needs the payload that was sent and the whole response, including headers and body. To compare two different responses R1 and R2, the MComparator uses heuristics, and each heuristic has an associated weight. The final result is a value between 0 and 1: 0 means that the

responses are very different, and 1 that they are very similar. Every evaluation starts with the total variable set to 1, which is then multiplied by each of the heuristic results, taking into account its weight, leading to the formula:

\[ \text{Total} = \prod_{H \in \text{Heuristics}} \Big( H(R_1, R_2) \times \text{weight}(H) + \big(1 - \text{weight}(H)\big) \Big) \]

The MComparator has 6 heuristics with different weights: body tree structure, status code, headers comparison, input reflection, and line and word count. The names and main ideas of the heuristics used are the same as the ones in [37], but they are calculated and implemented in different ways.
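As a minimal sketch of the combination formula, with two toy heuristics and illustrative weights (not the plugin's actual values):

def combine(resp1, resp2, heuristics):
    total = 1.0
    for heuristic, weight in heuristics:
        total *= heuristic(resp1, resp2) * weight + (1 - weight)
    return total

heuristics = [
    # status code: weight 1, so a mismatch zeroes the whole product
    (lambda a, b: 1.0 if a["status"] == b["status"] else 0.0, 1.0),
    # word count: ratio of the smaller to the larger count
    (lambda a, b: min(a["words"], b["words"]) / max(a["words"], b["words"], 1), 0.5),
]
print(combine({"status": 200, "words": 120},
              {"status": 200, "words": 100}, heuristics))  # ~0.917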

Body trees structure. The body of an HTTP response can have several content types. If two responses have different types they are considered very different, so their similarity value is 0. The same happens if one of the responses has the content type defined and the other does not. In web applications the template engines are mostly used to generate HTML content, which means that it is only needed to evaluate the cases where the response type is HTML. In the other cases the structure is considered equal and the value is set to 1. HTML pages are structured in elements, and each element may contain one or more child elements. I define as root the HTML element that contains all the others, and as leaves the elements that do not contain any children. It is possible to define one path between the root of an HTML page (<html>) and each of the leaves, including all the elements between them. In the example

<html>
  <body>
    <p>My first paragraph.</p>
    <div>
      <p>My second paragraph.</p>
    </div>
  </body>
</html>

Listing 7: Example of HTML code

represented in Listing 7, the existing paths are:

html -> body -> p
html -> body -> div -> p

Two HTML responses can be compared by getting the percentage of paths in one response that also exist in the other. The MComparator checks the percentage of response R1's paths that exist in R2, and then the percentage of R2's paths that exist in R1. The final value is the lowest of these two percentages. This heuristic allows obtaining a high similarity in search results because it is agnostic to the content, which always changes depending on the input.

At this moment the Body Trees Structure heuristic is only able to compare two responses with content type HTML, because that is what is needed for the scanner, but I intend to extend it to other types such as JavaScript Object Notation (JSON).
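A hedged sketch of the path-based comparison using the standard library parser; for brevity it records every root-to-element prefix path rather than only root-to-leaf paths, which is a simplification of the heuristic described above.

from html.parser import HTMLParser

class PathCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack, self.paths = [], set()

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.add(" -> ".join(self.stack))   # record the path down to this element

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

def paths(html):
    collector = PathCollector()
    collector.feed(html)
    return collector.paths

def tree_similarity(html1, html2):
    p1, p2 = paths(html1), paths(html2)
    if not p1 or not p2:
        return 1.0
    shared = p1 & p2
    # percentage of R1's paths in R2, and of R2's paths in R1; keep the lowest
    return min(len(shared) / len(p1), len(shared) / len(p2))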

Status code. If the status codes of the two responses are different it probably means that the responses also differ. For this reason I give this heuristic the value of 1 if they are equal and 0 otherwise. This is the most reliable way to detect an error in the page, so the weight attributed to it is 1. Since the weight is 1, if the status code is different the pages will immediately be considered different.

Headers comparison. Besides the status code, other headers may also indicate that the web application handled the request differently. This heuristic returns the percentage of headers existing in both responses that also have the same value.

Line and word count. The existence of an error message in the response will cause the response to have fewer or more words/lines. This is caused by the difference in size between the normal response and the error message. The value obtained in these 2 heuristics is the division of the smallest by the largest number of words/lines in each response.

Input reflection. One of the indicators of the existence of SSTI is the insertion of the user input in the HTML of the response. If the input is not rendered as template code it will be reflected exactly as it was inserted; otherwise it will be rendered and transformed. In case of an error, the input will not be reflected, unless it is part of the error message. The MComparator can detect the existence of SSTI by the disappearance of the reflection of the input. The payload cannot be a common word or have a small number of characters, as such payloads can appear in a response independently of the request made. Because of these situations the MComparator needs to know exactly which occurrences of the input in the response are really caused by the input payload. Consider a payload P1 different from a payload P2, a response R1 resulting from sending P1, and a response R2 resulting from sending P2. The number of non-reflected occurrences of P1 in R1 is probably equal to the number of occurrences of P1 in R2 (where P1 was not in the request). So, the MComparator considers the number of caused reflections of P1 in R1 to be the number of occurrences of P1 in R1 minus the number of occurrences of P1 in R2. The result of this heuristic is calculated by dividing the number of caused reflections in R1 by the number of caused reflections in R2, or the opposite in case the latter is smaller than the former. If P1 is a sub-string of P2 the MComparator needs to remove the occurrences of P2 from R2 before counting the occurrences of P1 in it, otherwise it counts the reflections of P2 as if they were reflections of P1 and consequently considers that there exist more occurrences not caused by reflection than really exist. Consider that R1 is the code available in Listing 8, P1 is "a", R2 is the code available in Listing 9, and P2 is "aaaa". The data obtained from the responses is:

• The number of “a” occurrences in R1 is 2;

• The number of “a” occurrences in R2 is 5;

• The payload “a” is sub-string of “aaaa”;

• The number of “a” occurrences in R2 after removing “aaaa” is 1;

• The number of “aaaa” occurrences in R1 is 0;

• The number of “aaaa” occurrences in R2 is 1;

With this data the MComparator considers that the number of “a” reflections is 2 − 1 = 1 and the number of “aaaa” reflections is 1 − 0 = 1. Since both have the same number of reflections, it considers that they had a similar behaviour (a sketch of this computation follows Listing 9).

<html>
<body>
  <h1>The Amazing Website</h1>
  <p>No results found for: "a"</p>
</body>
</html>

Listing 8: Response to payload “a”

<html>
<body>
  <h1>The Amazing Website</h1>
  <p>No results found for: "aaaa"</p>
</body>
</html>

Listing 9: Response to payload “aaaa”
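The computation from this example can be written directly; this sketch assumes plain sub-string counting over the raw response bodies.

def caused_reflections(p1, r1, p2, r2):
    # remove P2 from the other response first when P1 is a sub-string of it
    r2_clean = r2.replace(p2, "") if p1 in p2 and p1 != p2 else r2
    return r1.count(p1) - r2_clean.count(p1)

r1 = '<h1>The Amazing Website</h1><p>No results found for: "a"</p>'
r2 = '<h1>The Amazing Website</h1><p>No results found for: "aaaa"</p>'
print(caused_reflections("a", r1, "aaaa", r2))   # 2 - 1 = 1
print(caused_reflections("aaaa", r2, "a", r1))   # 1 - 0 = 1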

Relevant keywords count. Some English words are very common in error messages. If the MComparator finds those words in a response but did not find them in the user's original message, it may infer the existence of an error. The list of words is: error, problem, unexpected, template, line, syntax, warning, unknown, and token.

Heuristics weight. The MComparator could combine all the heuristics giving the same weight to all of them, but that would not be the best choice. For example, a change in the status code is a more reliable indicator of different behaviour than a change in the number of words since, if the input parameter is

used to perform a search, two different inputs will always return two different numbers of words. For this reason I gave different weights to each heuristic. The current weights for each heuristic were decided based on 3 factors: my perception of which are the ones that best indicate a possible error; the heuristics' values obtained when I sent the polyglots to the vulnerable web applications; and the heuristics' values when I sent the polyglot to real non-vulnerable websites. However, each web application has its own behaviour, and having a static weight for each heuristic is a solution that might not be the best for any of them. To solve this problem I created a functionality that uses a comparison between 2 responses to automatically tune the weights. This process starts by sending a reference request (with response RR) to the server and comparing it with the original request (with response OR). Then, the weight attributed to each heuristic becomes its original value multiplied by the heuristic value obtained in the comparison, i.e., weight_new(H1) = weight(H1) × H1(OR, RR). So, if the weight of heuristic H1 was initially 0.5 and the H1 result between OR and RR was 0.5, then the new weight of H1 becomes 0.5 × 0.5 = 0.25. Another possible solution would be to completely replace the weight by the heuristic value obtained. With this feature it is possible to tune the weights to the behaviour of each page in order to obtain a better comparison.
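The tuning step itself is a one-liner over the heuristic list; a minimal sketch:

def tune_weights(heuristics, or_resp, rr_resp):
    # scale each weight by the heuristic's own score between OR and RR
    return [(h, weight * h(or_resp, rr_resp)) for h, weight in heuristics]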

3.2.4 Arithmetic Evaluation Detector

This module has the responsibility of executing a reliable and deterministic detection of SSTI. I have chosen a solution that is simple but that allows building a more generic tool. Other scanners send template-specific tests, and if the tests fail they consider that no vulnerability exists. What the ArithmeticED does is simply send all the tags it knows with arithmetic operations inside and observe if the expected operation result is in any of the sinks' responses. This method allows finding the vulnerability in any existing template engine that respects 2 requirements: it has one of the tags we collected from our engine examples, and it evaluates and returns the result if we send a multiplication operation inside the tags. For each of the known template tags the ArithmeticED sends a multiplication in the standard Number1*Number2 format with the corresponding tags around it, where the numbers are randomly generated for each request. After each request it gets the current state of each sink from the sink manager and checks if the result of the multiplication is in the response. The result is searched for in several big-number representations (123.000.000 or 123,000,000 or 123000000). The biggest drawback of this solution is the existence of template engines that do not follow the second requirement: they do not evaluate a multiplication in the standard Number1*Number2 format. One example of such cases is the Django template engine, which is part of one of the most common web frameworks for Python. For the Django, DustJs and Golang template engines I created specific test cases, as Tplmap does. Those tests send a certain input that, when rendered by the specific

template engine, returns some expected result. The tests used are represented in Table 3.4, where N1 and N2 are random numbers.

Template   Payload                                          Expected Result
Django     {{N1|add:N2}}                                    N1+N2
DustJs     {{@math key="N1" method="add" operand="N2"/}}    N1+N2
Go         AAA{{print "N1" "N2"}}BBB                        AAAN1N2BBB

Table 3.4: Specific tests for Django, DustJs and Go
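The detection loop of the ArithmeticED can be sketched as follows; TAG_PAIRS is a small excerpt of the collected tags, and send()/sink_text() are hypothetical helpers for sending the payload and querying the SinkM.

import random

TAG_PAIRS = [("${", "}"), ("{{", "}}"), ("<%=", "%>"), ("#{", "}")]

def representations(n):
    grouped = f"{n:,}"
    return [str(n), grouped, grouped.replace(",", ".")]  # 123000000 / 123,000,000 / 123.000.000

def detect_arithmetic(send, sink_text):
    for start, end in TAG_PAIRS:
        a = random.randint(1000, 9999)
        b = random.randint(1000, 9999)
        send(f"{start}{a}*{b}{end}")            # fresh random numbers per request
        expected = a * b
        if any(rep in sink_text() for rep in representations(expected)):
            return True                          # rendered multiplication found in a sink
    return False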

3.2.5 Blind Vulnerability Detector

It is possible that the result of the rendering with the injected input is not accessible to the user, in which case the ArithmeticED cannot check if the input is being rendered. To detect blind SSTI we use the callback module of ZAP. First, the BlindVD registers a certain token in the callback module and receives a URL with a specific path that should be accessed by the vulnerable server in case the vulnerability exists. Secondly, it adds the URL to payloads that, when rendered, make a request to the specified URL. When ZAP receives a request on that endpoint it informs the BlindVD. If the BlindVD receives this notification it considers that the web application is vulnerable. The payloads that make the callbacks are made specifically for each template engine and are all predefined. One of these payloads is:

${__import__("subprocess").check_output( "wget http://124.124.124.124:58895/SSTI/d5BzH9",shell=True)}

This payload is equal to the payload used to exploit the Mako template engine by executing operating system commands. The only difference is that instead of executing the command ls it executes: wget http://124.124.124.124:58895/SSTI/d5BzH9

Wget is a command line tool used for retrieving files over HTTP. In this situation it is retrieving content from ZAP, which has the IP address 124.124.124.124, using the path /SSTI/d5BzH9, which is associated with my scanner.

3.2.6 Syntax Fixer

The main objective of this module is to generate payloads able to execute when the payload is inserted inside existing template tags. If the user input is inserted in the middle of HTML the payloads sent by the ArithmeticED will be rendered, but not if it is inserted in the middle of code, because we would be creating an incorrect syntax. Considering the payload {9*9}, an example of incorrect syntax is pictured in Listing 10, because the second starting tag appears before the first closing tag. This module adapts the ArithmeticED payloads by adding a prefix that fixes the current syntax. In Table 3.2 it is possible to see that a closing tag alone causes no errors in any template engine except EJS, so fixing the syntax after the payload is not necessary.

...
No results found for: {x = {9*9} }
...

Listing 10: Incorrect template syntax after insertion in middle of template code.

All the template engines have a different syntax, and in each of them the number of possible situations is too big. Since our scanner cannot fix all the cases, I focused my efforts on fixing the most common situations. One common way of causing other types of injections (SQLI, code injection, XSS inside tags, command injection) is to escape from inside a string and insert the payload. Following these examples, 2 of my prefixes are simply the characters used to delimit strings, " and '. The other prefix is a number, which can be used in arithmetic operations, cycles and conditions. So my prefixes are: ", ', and a number. The first thing needed to fix the syntax of a global template is to fix the code where the input is inserted. To fix the code, the Syntax Fixer (SyntaxFix) uses one of the 3 prefixes described in the previous paragraph. Then, SyntaxFix needs to end the current template code to start new code. To end the template code, it adds the end tag used by the template engine. At this point the SyntaxFix is in a clean state, so it can finally send the usual payload generated by the ArithmeticED. To generate a payload capable of detecting the vulnerability in Listing 6, the scanner would proceed in the following way:

1. Get the character to fix code " →...${name="""}...

2. Add the ending tag →...${name=""}"}...

3. Add the normal payload →...${name=""}${99*99}"}...

The resulting code can be split into 3 parts: the first is the fixed code ${name=""}, the second is the payload to detect SSTI ${99*99}, and the last is "}, which will not cause any errors in the rendering, based on the results available in Table 3.2. One possible improvement for this module is the creation of a polyglot that fixes the syntax, with the payload 1}"}'} being a possible candidate, where } is the closing tag of the template under test. This solution fixes the syntax in the 3 previously described tests and it could be used in all the payloads instead of sending special tests. However, I still need to test it on all the template engines to be sure about its behaviour. Another problem is the increase in the size of the payloads, which can be blocked by size restrictions, as happens in [30]. The proposed solution is not the best possible way to fix the syntax because each template engine has several types of tags for different functions, and to address them we would need deeper knowledge about each template engine. This is a trade-off I needed to make in order to have a tool independent of specific template engines.
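The prefix wrapping itself is trivial to express; a minimal sketch, where end_tag is the closing tag of the template engine under test:

PREFIXES = ['"', "'", "1"]    # string delimiters and a number

def fixed_payloads(payload, end_tag):
    # e.g. fixed_payloads('${99*99}', '}') -> ['"}${99*99}', "'}${99*99}", '1}${99*99}']
    return [prefix + end_tag + payload for prefix in PREFIXES]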

Chapter 4

Experimental Evaluation

Metrics: To evaluate my solution, I used some ideas from Doupé et al. [5] and Bau et al. [17], but I ignored the metrics related to crawling because crawling is out of the scope of my plug-in, which only tests a given endpoint. I considered the vulnerability detection rate and the number of HTTP requests as the main evaluation metrics. I will not consider the number of false positives because none of the scanners produced any. The results for these metrics will be compared against the same results for the scanners Burp Suite Pro and Tplmap.

Scanners configuration: In some of the tests where it was useful, I collected results using more than one scanner configuration, because each user might have different needs and consequently different configurations. Unless stated otherwise, the configuration of the scanners was:

• I set the Burp Suite PRO 2.0 scanner with the “accuracy” option set to minimise the false negative rate, and the “Audit speed” set to fast.

• Tplmap was set with level equal to 0, which is the number of levels of nested code to fix before the payload. Besides this, it does not have any relevant configuration. Later, in the syntax fixing tests, this value was set to 1.

• ZAP-ESUP was configured with attack strength set to “low” to test its optimisation, except for the test where the rendering result is never visible to the attacker (blind SSTI). In that case the chosen configuration was “high”, because it is the lowest level at which my tool uses payloads that make callbacks.

All the test web applications I developed are available at: https://github.com/DiogoMRSilva/websitesVulnerableToSSTI

4.1 Simple Tests - Reflected Results

Tests Specification

The first test intends to determine whether the scanners are able to detect simple SSTI vulnerabilities. Failing these tests implies that a scanner is not able to detect the vulnerability in more complex situations with the same template engine. I created 19 vulnerable web applications using six different programming languages. The only thing each web application does is receive one parameter “name” in the request, insert the parameter in the template before rendering, and return the rendering result in the response. Tplmap has no crawling capabilities. It has a command line interface that receives the URL to test, and if the parameter is a POST parameter, the user needs to declare that the request is a POST request and name the parameter to inject. The command used for each of the tests was:

python tplmap.py -u http://IP:PORT -X POST -d name=aaaa --level=0
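For reference, a minimal sketch of one of these vulnerable endpoints follows; this version assumes Flask with Jinja2 and is illustrative rather than the exact code in the repository:

# Illustrative reflected-SSTI test application (assumed Flask + Jinja2).
from flask import Flask, request, render_template_string

app = Flask(__name__)

@app.route("/", methods=["POST"])
def index():
    name = request.form.get("name", "")
    # VULNERABLE: the input is concatenated into the template source instead
    # of being passed as data, so a payload such as {{7*7}} is rendered as 49.
    return render_template_string("<p>Hello " + name + "</p>")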

Both ZAP and Burp have crawling capabilities, but since crawler quality is out of the scope of this evaluation, I wanted to put all scanners on the same level. So, I created a script that used each of the scanners as a proxy and made one POST request. In the POST body the parameter “name” was set to “aaaa”, similarly to the information given to Tplmap. Then, on each of the scanners, I ran the scan for SSTI on the collected endpoints. The vulnerability detection rate is calculated by dividing the number of vulnerabilities found by the number of existing ones.

Tests Results and Analysis

The results obtained from the test show that all scanners had a similar vulnerability detection rate on the simple vulnerabilities. ZAP-ESUP found 84.2% of the vulnerabilities, Burp 73.7%, and Tplmap 78.9%. This result should not be used as a strict distinction between the scanners, because they may detect vulnerabilities in template engines that I did not contemplate. This information can, however, be useful to understand the generalisation capabilities of the scanners. ZAP-ESUP is able to detect SSTI in a template engine not contemplated during my plugin's development if that engine has tags equal to some of the contemplated ones and if it executes simple arithmetic operations such as multiplication. For example, if ZAP-ESUP did not know of the Slim template engine (Ruby) but did know about Jade (JavaScript), it would still be able to find the vulnerabilities because Slim satisfies these conditions; the same does not happen with Burp and Tplmap, as we can see in the results for these 2 template engines. Meanwhile, there are some template engines that do not execute arithmetic operations, e.g., Django [48], making ZAP-ESUP (before including specific tests) unable to find the vulnerability, while Burp has specific tests for Django and finds it. The Golang [49] web application was created after the development of the plugin, and the vulnerability was not found by any of the scanners. ZAP-ESUP was unable to find it because the Go template engine does not execute arithmetic operations, thus failing one of the conditions; the other scanners failed because they did not have a specific test for it.

Engine                Language    Burp   ZAP    Tplmap  RCE  Tags
Jinja2                Python      yes    yes    yes     yes  {{X}}
Mako                  Python      yes    yes    yes     yes  ${X}
Tornado               Python      yes    yes    yes     yes  {{X}}
Django                Python      yes    no     no      no   {{X}}
Smarty                PHP         yes    yes    ybne    yes  {X}
Smarty (secure mode)  PHP         yes    yes    ybne    no   {X}
Twig                  PHP         yes    yes    ybne    no   {{X}}
FreeMarker            Java        yes    yes    yes     yes  <#Y>${X}
Velocity              Java        yes    yes    yes     yes  #set($Y=X)${Y}
Thymeleaf             Java        no     yes    no      no   -
Jade                  JavaScript  yes    yes    yes     yes  #{X}
Nunjucks              JavaScript  yes    yes    yes     yes  {{X}}
Dot                   JavaScript  no     yes    yes     yes  {{=X}}
Dust                  JavaScript  no     no     ybne    no   {#X} or {X} or {@X}
EJS                   JavaScript  yes    yes    yes     yes  <%=X%>
Vuejs                 JavaScript  yes    yes    ybne    yes  {{X}}
Slim                  Ruby        no     yes    no      yes  #{X}
ERB                   Ruby        yes    yes    yes     yes  <%= X %>
Go                    Golang      no     no     no      -    -
Vulnerability detection rate      14/19  16/19  15/19   -    -
Vulnerability detection rate %    74%    84%    78%     -    -

Table 4.1: Simple Vulnerabilities Detection Table. yes - found the vulnerability; ybne - found the vulnerability but reports it as not exploitable; the RCE column indicates whether I found an exploit for RCE in some source or by myself.

After these results, I added some special tests to my plugin to address the failed cases (Go, Dust, Django), as described in Table 3.4.

4.2 Stored and Blind SSTI Test Cases

Tests Specification

As we have seen in the public cases of SSTI in Subsection 2.3.3 and in my analysis of the possible subvariants of SSTI performed in Subsection 2.3.4, just looking for reflected SSTI in the server response to the request is not enough to find all cases of SSTI. For this reason, I created tests for stored and blind SSTI. To test the ability to detect the vulnerability in these special cases without having the result influenced by each scanner's capability to detect the vulnerability for a certain template engine, I created these web applications using the Mako template engine. All tested scanners were able to detect the vulnerability in the simple test using this template engine, making it ideal for these tests. I could have created this test for each of the template engines, but I believe it would require a significant amount of work without yielding more useful results. The first test represents stored SSTI with posterior injection and rendering, as shown in Figure 2.3. This web application is similar to the simple tests performed in Section 4.1 as it receives a

single name parameter. However, instead of inserting it in the template rendered for the response, the application stores it in a file. Later, when the “/stored” endpoint is requested, the content of the file is read and inserted in the template before being rendered.
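A minimal sketch of how such a stored test application could look, assuming Flask with Mako; the file name and routes are illustrative:

# Illustrative stored-SSTI test application (assumed Flask + Mako).
from flask import Flask, request
from mako.template import Template

app = Flask(__name__)
STORE = "stored_input.txt"  # illustrative location for the persisted input

@app.route("/", methods=["POST"])
def save():
    # The "name" parameter is stored instead of being rendered in this response.
    with open(STORE, "w") as f:
        f.write(request.form.get("name", ""))
    return "saved"

@app.route("/stored")
def stored():
    with open(STORE) as f:
        tainted = f.read()
    # VULNERABLE: the stored input becomes part of the template source, so
    # Mako code such as ${7*7} is only evaluated on this later request.
    return Template("<p>Last submission: " + tainted + "</p>").render()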

Tplmap, by design, is unable to find this kind of vulnerability because it has no way to declare possible result locations and no crawling capabilities. To test ZAP-ESUP and Burp I proceeded as I did with the simple tests, but I also added the “/stored” endpoint to the requests made by the script.

The second test simulates stored and blind SSTI with immediate injection and rendering, represented in Figures 2.4 and 2.6. The test code is exactly the same as that of the Mako simple test, but instead of using the user input in the template of the response, it is used in a template whose result is never shown. To test this web application I proceeded exactly as in the simple tests in order to have the same conditions for the three scanners.

Tests Results and Analysis

Test                                      Burp  ZAP-ESUP  Tplmap
Input rendered in other visible location  no    yes       no
Rendering result not visible to attacker  no    yes       yes

Table 4.2: Stored and Blind SSTI tests results

As expected, Tplmap was not able to detect the vulnerability when the input is rendered in another location, since it only had knowledge about the endpoint where the input is inserted (stored SSTI with posterior injection and rendering). I do not know why Burp did not find this vulnerability, but the most probable reason is that it only searches for input rendering in the response to the request itself. This means that Burp is unable to find several types of SSTI, including some of the real cases analysed previously in Subsection 2.3.3. ZAP-ESUP was able to find this vulnerability since it searches each of the sink points (endpoints where that parameter value is later shown) for signs of input rendering.

In the case of stored and blind SSTI with immediate injection and rendering, where the input is rendered before the response to the client but the result is never visible to the attacker, Burp was not able to detect the vulnerability. The other two scanners detected the vulnerability, but using 2 different strategies. While ZAP-ESUP detected the vulnerability by sending requests that make callbacks to the attacker, Tplmap found it by sending requests with template code that causes a time delay; if the delay is noticed in the response, Tplmap considers the endpoint vulnerable. Both strategies have benefits and drawbacks. ZAP-ESUP works even if the input is rendered after the response is received, because the payload will still make a callback to ZAP, whereas in the same case Tplmap will not detect the vulnerability because there will be no delay in the response from the server. On the other hand, if the server is behind a firewall that only allows HTTP requests started from outside to the server, the callback will never reach ZAP and the vulnerability will not be detected, while Tplmap would detect it since it only relies on the response time.
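To illustrate the two strategies, the probes below are hedged examples of what such payloads can look like for Mako (whose expressions are plain Python); they are not the exact payloads used by either tool, and the callback URL reuses the ZAP address and path from the earlier wget example:

# Callback strategy (ZAP-ESUP style): the server fetches a scanner-controlled
# URL, proving execution even when the rendered output is never shown.
callback_probe = "${__import__('os').popen('wget http://124.124.124.124/SSTI/d5BzH9').read()}"

# Time-delay strategy (Tplmap style): a noticeable delay in the response
# signals that the injected template code was executed.
delay_probe = "${__import__('time').sleep(10)}"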

4.3 Injection inside template code tests

As seen when defining the several variants of SSTI in Subsection 2.3.4, SSTI payloads can be injected inside or outside the template code. For that reason, I created two tests to evaluate the scanners in this area. In these tests I also did not want the results to be influenced by each scanner's capability to detect the vulnerability for a certain template engine, so I created these web applications with the Mako template engine. To test these web applications I proceeded exactly as in the simple tests in order to have the same conditions for the three scanners. The payload insertions in the template were:

...
${1234 == %s}
...

and

...
${"prefix"+"%s"}
...

The user input will replace the %s in the template.

Tests Results and Analysis

Test                                                  Burp  ZAP-ESUP  Tplmap
Input inserted in the middle of template code (math)  no    yes       yes
Input inserted in the middle of template code (text)  yes   yes       yes

Table 4.3: Injection inside template code tests results

All of the scanners were able to detect the vulnerability when the input was inserted in the middle of template code inside a string. Burp detected this vulnerability as RCE: in some templates the code executed inside the specific tags is equal or very close to the one executed by the programming language, enabling RCE scanners to find SSTI. In the case where the input is inserted in code performing an arithmetic operation, Burp was the only one that did not find it. Tplmap found both vulnerabilities, probably because of its feature to detect SSTI inside nested code up to a number of levels defined by the user. ZAP-ESUP was also able to detect the vulnerability when the strength setting was set to “insane”, because at lower strengths it does not even send payloads capable of fixing the syntax. In the future this may be changed by using a polyglot that fixes the syntax. ZAP-ESUP was able to find the vulnerability in these two simple cases of injection inside existing code, but it will fail in more complex situations. On the other hand, Tplmap has a deeper knowledge about each of the template engines and tries several possible combinations of the elements of each engine to detect the vulnerability in these situations. This, however, comes at the cost of losing generality and the need to define such payloads manually for each template engine.

4.4 Real Example Test

My simple web applications are not a complete representation of a real website, because their response complexity and functionality are very low. Since my scanner uses heuristics to determine different behaviours, these characteristics will influence its results. To show that my scanner can work in real cases, I decided to test its capabilities in a real web application. To make it harder, I required the vulnerability to be in a search input. Search inputs are one of the most challenging situations for my plugin's detection of interesting behaviour, since the page content always changes depending on the input. I created a vulnerable version of the open-source e-commerce framework Spree [50] where I changed the main page search functionality. First, I forced it to use a string as a template instead of following the predefined Model View Controller architecture. Then I injected the user input before rendering, leaving it vulnerable to SSTI. ZAP-ESUP was configured with strength set to “low” because that is the only setting that uses the message comparison, which is the component that may fail due to the variation in the search results. The vulnerability was found in the version where it existed, and the number of requests used was 14. In the non-vulnerable version of the web store the number of requests made was 4: the original request, the reference one, and the two polyglots. This result shows that my heuristics can detect different behaviours not only in my tests but also in a real website.

4.5 Performance Tests

The usage of vulnerability scanners can be very broad. Scans may be run before each commit, or once a week as part of a weekly build. The scan can be done on a test server or on a production server; bug bounty programs are an example of the latter. These variations lead to different needs and requirements for vulnerability scanners. When the scan is infrequent and a dedicated server is available, a scanner can send a large number of requests and take more than one hour to run. The same cannot happen on each commit, so a light scan is more suitable in that case to increase usability. When a production server is being tested, one must consider that breaking the server, or slowing it down for other users, causes damage to the company. For these reasons the scanner should be able to perform as well as possible in each situation.

Tests Results and Analysis

Table 4.4 shows the performance results for the 3 scanners, with different settings for each.

Information about the table. In the table, the ZAP-ESUP results that have two values represent the number of requests including or excluding the requests made by ZAP for the taint analysis. If the scan targets only SSTI, the extra requests should be counted because they are necessary for my scanner; but if we are also scanning for other vulnerabilities, those requests should not be counted, because they would be made anyway to find stored XSS and my plugin just uses the information already available.

                       Burp fast  Burp Thorough  Tplmap  ZAP low  ZAP med  ZAP high
Average on vulnerable  12         12             29      14/12    10/8     22
Maximum                24y        24y            55n     19/17n   14/12n   26yn
Minimum                8y         8y             2y      10/8y    6/3y     18y
Non vulnerable         17         19             55      5/4      14/12    26/24
Two sinks              14         14             55      19/16    9/7      29
Result not visible     14         14             8       19       14       30
Inserted middle code   14         14             55      6        14       26

Table 4.4: Performance Tests Table. The “y” and “n” next to the maximum and minimum results indicate whether the number of requests came from a scan that found a vulnerability. The average line represents the average number of requests over all the simple web applications, and the maximum and minimum also refer to those tests.

Analysing the results, it is possible to see that Tplmap is the scanner that makes the highest number of requests to a vulnerable endpoint, with an average of 29 requests. This is due to its effort to find vulnerabilities blindly and to the number of template engines contemplated, since it runs specific tests for each one. It is also possible to notice that the highest number of requests made by Tplmap is 55, made whenever there is no vulnerability. The average number of requests Tplmap makes when the site is vulnerable is not far from the number ZAP-ESUP makes when the strength is set to “high”. That is the level where ZAP-ESUP starts to also detect blind vulnerabilities, making the comparison of the two fairer. Even with this configuration ZAP-ESUP makes fewer requests, two of which come from the taint analysis. Burp's settings did not make any difference in the number of requests, changing only the number of requests when there is no vulnerability, from 17 to 19.

ZAP-ESUP's results may seem contradictory at first. Comparing the average number of requests at “low” and “medium”, the reader sees that the values at “medium” are lower than at “low”. This happens because when “low” is selected my solution first uses the polyglots to determine whether the web application may be vulnerable; if it is, it then sends the same requests that “medium” sends. These extra actions increase the number of requests by 3 for a non-Java engine and 5 for a Java one. This average refers to the cases where the endpoint is vulnerable. Since the majority of the entry points in a web application will not be vulnerable, the most relevant value is the number of requests made when there is no vulnerability. Consider a case where the “low” setting is used to test a web application with 50 entry points, 1 of which is vulnerable and 9 others exhibit strange behaviour. The number of requests made by Burp should be 12 + 17 × 49 = 845, by Tplmap 29 + 55 × 49 = 2724, by ZAP-ESUP “medium” 8 + 12 × 49 = 596, and by ZAP-ESUP “low” 12 + 17 × 9 + 4 × 40 = 325. The number of requests made by ZAP-ESUP in the “low” setting is 38% of the number made by Burp and 12% of the number made by Tplmap. In an optimistic situation where there is no SSTI at all and the scanner's predictor works correctly, the number of requests made by ZAP-ESUP in the “low” setting is 4/17 = 24% of the number made by Burp and 4/55 = 7% of the number made by Tplmap.
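As a sanity check, the arithmetic above can be reproduced directly; the per-endpoint counts come from Table 4.4 and the 50-entry-point site is the hypothetical scenario just described:

# Request-count estimate for 50 entry points: 1 vulnerable, 9 strange, 40 ordinary.
burp    = 12 + 17 * 49          # 845
tplmap  = 29 + 55 * 49          # 2724
zap_med = 8 + 12 * 49           # 596
zap_low = 12 + 17 * 9 + 4 * 40  # 325: the full battery runs only where behaviour looks odd
print(zap_low / burp, zap_low / tplmap)  # ~0.38 and ~0.12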

At the “high” strength my plugin always includes the 15 predefined requests that make callbacks to ZAP, which need to be engine specific. The remaining requests are the same as those made when the strength is set to “medium”.

Impact of the number of sinks on ZAP-ESUP's performance: Another important factor to take into consideration is that, in order to test stored SSTI, the number of requests performed by ZAP-ESUP grows linearly with the number of sinks for the entry point under test. My scanner sends one payload and then searches each of the sink points for signs of a possible SSTI. Thus, to get the number of requests made by ZAP-ESUP to test stored SSTI at “low” and “medium”, I multiply the number of sinks by the number of requests stated in Table 4.4, not counting the requests used for taint analysis. For instance, for an entry point with 6 sinks, ZAP-ESUP would need to perform 6 times the requests it would make if there were a single sink, as illustrated below. When ZAP-ESUP searches for reflected SSTI there is only a single sink, hence the values are those stated in Table 4.4.
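In code form, the rule amounts to a simple multiplication; the per-sink count is taken from Table 4.4 for ZAP “low” on a vulnerable endpoint, with the taint-analysis requests excluded:

requests_per_sink = 12                            # ZAP "low", vulnerable endpoint, no taint requests
sinks = 6                                         # the 6-sink entry point from the example above
stored_ssti_requests = sinks * requests_per_sink  # 72 requests in total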

4.6 Generalisation Capacity Tests

There are 90 engines enumerated at [9]; implementing tests for each of those templates would require much effort, and using all of the tests in the scanner would require many requests. Since many templates use the same tags, one possible solution to this problem is to generalise tags already known to other possible templates. To simulate unknown template engines, I created 12 web applications that, when they receive the specified template tags, evaluate their content and return the result in place of the tags. Half of the 12 web applications used the Python eval() function and the other half used the Ruby eval() function. Each half used the same 6 different tags, as shown in Table 4.5. The tests were performed using the same procedure as the simple tests.
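A minimal sketch of one of the Python halves, assuming Flask; the regular expression and route are illustrative, and here anything between {{ and }} is passed to eval():

# Illustrative "unknown engine" test application: {{ expr }} -> eval(expr).
import re
from flask import Flask, request

app = Flask(__name__)

@app.route("/", methods=["POST"])
def render():
    tainted = request.form.get("name", "")
    # Replace every {{ expr }} occurrence with the result of eval(expr),
    # mimicking a template engine that uses these tags.
    result = re.sub(r"\{\{(.*?)\}\}", lambda m: str(eval(m.group(1))), tainted)
    return "<p>" + result + "</p>"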

Tags    Evaluator      Burp  ZAP-ESUP  Tplmap
{}      Python eval()  yes   yes       yes
{}      Ruby eval()    yes   yes       no
${ }    Python eval()  yes   yes       yes
${ }    Ruby eval()    yes   yes       no
{{ }}   Python eval()  yes   yes       yes
{{ }}   Ruby eval()    yes   yes       yes
<%= %>  Python eval()  yes   yes       no
<%= %>  Ruby eval()    yes   yes       yes
#{ }    Python eval()  no    yes       no
#{ }    Ruby eval()    no    yes       no
{{= }}  Python eval()  no    yes       no
{{= }}  Ruby eval()    no    yes       no

Table 4.5: Generalisation capacity tests results

From the results we can affirm that Tplmap has no generalisation capabilities, because for 3 of the tags it found the vulnerability in just one of the programming languages. It has the ability to exploit both

eval functions and knows all the tags, so if it were designed to generalise it would be able to detect the vulnerabilities. Burp, on the other hand, appears to have a good generalisation capability, mainly owing to the payloads we collected in Section 2.5 that contain arithmetic operations, allowing it to detect the cases that use these tags. It did not find the vulnerability when the tags were {{= }}; since it is also unable to find SSTI in Dot, which uses these tags, we cannot expect generalisation there. It was also not able to find the vulnerability when the tags were #{ }, which are used in Jade, a template engine that Burp detects, indicating that Burp does not try to generalise the tags it knows. ZAP-ESUP was able to detect the vulnerability in all of them because the tool was designed with this possibility in mind.

Chapter 5

Conclusions

This thesis presented the study of Server Side Template Injection, a recently identified vulnerability, the development of a black-box web application vulnerability scanner for SSTI, and the evaluation of my solution as well as its comparison with other SSTI scanners.

5.1 Achievements

I categorised SSTI variants depending on the time of rendering, the location of rendering, and the location inside the template where the input may be injected. The study was based on real cases and well studied vulnerabilities, which allowed me to define cases not contemplated by any of the already existing solutions. I defined 5 main types of SSTI: reflected SSTI, stored SSTI with posterior injection and rendering, stored SSTI with immediate injection and rendering, blind SSTI with posterior injection and rendering, and blind SSTI with immediate injection and rendering. Then, I developed a black-box web application vulnerability scanner in the form of a plugin for OWASP-ZAP, which is a widely known scanner used by security specialists. The scanner development had as its main objectives efficiency, the ability to generalise to the largest possible number of engines, the ability to cover all the types of SSTI previously studied, and the use of an architecture that allows future improvement and component reuse. To find stored SSTI, we use information already available in ZAP about the locations where the user input is reflected, and each time we send a payload we check those locations for the result. To reduce the number of requests, we combined several existing ideas into a new one: by detecting vulnerabilities through induced crashes as fuzzers do, by using the Backslash Scanner ideas to detect errors, and by following the community trend of polyglots able to cause errors in all the templates, I was able to detect SSTI with just 2 payloads and a maximum of 5 requests. To conclude my work, I ran several tests on the scanners, and thanks to the previous study my scanner was able to find stored SSTI, which none of the previously existing solutions could. This is an important feature, since two of the collected real cases were of this type. Additionally, the performance tests proved that our efficient scanner using polyglots is able to find the vulnerability with just 24% of the

requests that Burp uses and 7% of the requests that Tplmap uses.

5.2 Future Work

Test other possible arrangements of the components. I made several decisions in my scanner that seemed the most suitable given the real-world examples available and my own usage of the tool, but all the modules we developed can be rearranged to work in different configurations. One example of such a modification is to send the polyglots first to decide whether it is worth trying to fix the syntax.

Improve the current scanner with currently available features. Due to lack of time I did not finish the scanner in the state I intended. I developed a good way to balance the weights by using the reference request, but it is not used yet, since it would require refactoring the code and developing more tests to ensure it works as expected. The efficient scanner and the message comparator are already obtaining good results, but they are not being used at their full potential because I need to test them on more websites, or even develop an automated way to generate test cases.

Improve HTML comparison. My method to compare HTML works well and I was not able to find a better one, but I believe techniques from other areas of computer science could be used. For instance, if we create an abstraction of the HTML as a graph we might be able to improve the comparison. Good work in this area could include a study of the format of real websites and the normal changes they undergo.

Definition of thresholds. When we compared the similarity of two pairs of responses, we used a value of 10% as the maximum distance between the two for the endpoint to be considered vulnerable; as in the previous point, a larger data collection should be gathered and analysed to tune this threshold.
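As an illustration only (this is not the comparator actually implemented in the scanner), a 10% maximum-distance check could be realised with a simple similarity ratio:

# Illustrative response-distance check using difflib; not the plugin's method.
import difflib

def within_threshold(resp_a, resp_b, max_distance=0.10):
    similarity = difflib.SequenceMatcher(None, resp_a, resp_b).ratio()
    return (1.0 - similarity) <= max_distance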

Bibliography

[1] J. Kettle. Server-side template injection: RCE for the modern webapp. 2015.

[2] Top 10-2017 A1-Injection - OWASP. OWASP, 2017 (accessed September 5, 2018). URL https://www.owasp.org/index.php/Top_10-2017_A1-Injection.

[3] PortSwigger Ltd. Burp Suite, 2015 (accessed October 7, 2018). URL http://portswigger.net/burp/.

[4] E. Pinna. Tplmap, (accessed October 7, 2018). URL https://github.com/epinna/tplmap.

[5] A. Doupé, M. Cova, and G. Vigna. Why Johnny can't pentest: An analysis of black-box web vulnerability scanners. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pages 111–131. Springer, 2010.

[6] J. Bau, E. Bursztein, D. Gupta, and J. Mitchell. State of the art: Automated black-box web application vulnerability testing. In 2010 IEEE Symposium on Security and Privacy, pages 332–345. IEEE, 2010.

[7] N. Antunes and M. Vieira. Designing vulnerability testing tools for web services: approach, components, and tools. International Journal of Information Security, 16(4):435–457, 2017.

[8] N. Antunes and M. Vieira. Benchmarking vulnerability detection tools for web services. In Web Services (ICWS), 2010 IEEE International Conference on, pages 203–210. IEEE, 2010.

[9] Wikipedia. Comparison of web template engines, 2018 (accessed October 7, 2018). URL https://en.wikipedia.org/wiki/Comparison_of_web_template_engines.

[10] D. Silva. SSTI tests repository, 2018. URL https://github.com/DiogoMRSilva/websitesVulnerableToSSTI.

[11] R. Shirey. RFC 2828. Internet Security Glossary, 2000.

[12] B. Arkin, S. Stender, and G. McGraw. Software penetration testing. IEEE Security & Privacy, 3(1):84–87, 2005.

[13] X. Li and Y. Xue. Block: a black-box approach for detection of state violation attacks towards web applications. In Proceedings of the 27th Annual Computer Security Applications Conference, pages 247–256. ACM, 2011.

[14] M. P. Correia and P. J. Sousa. Segurança no software. Lisboa: FCA, 2017.

[15] B. P. Miller, D. Koski, C. P. Lee, V. Maganty, R. Murthy, A. Natarajan, and J. Steidl. Fuzz revisited: A re-examination of the reliability of unix utilities and services. Technical report, 1995.

[16] A. Austin and L. Williams. One technique is not enough: A comparison of vulnerability discovery techniques. In 2011 International Symposium on Empirical Software Engineering and Measurement, pages 97–106. IEEE, 2011.

[17] J. Bau, E. Bursztein, D. Gupta, and J. Mitchell. State of the art: Automated black-box web application vulnerability testing. In 2010 IEEE Symposium on Security and Privacy, pages 332–345. IEEE, 2010.

[18] Web Application Security Scanner Evaluation Criteria. Web Application Security Consortium, 2009 (accessed September 5, 2018). URL http://projects.webappsec.org/w/page/13246986/Web%20Application%20Security%20Scanner%20Evaluation%20Criteria.

[19] Web template system - Wikipedia, 2018 (accessed October 7, 2018). URL https://en.wikipedia.org/wiki/Web_template_system.

[20] Mustache Template Engine, (accessed October 7, 2018). URL https://mustache.github.io/.

[21] A. Ronacher. Jinja2 Template Engine, 2014 (accessed October 7, 2018). URL http://jinja.pocoo.org/.

[22] M. Bayer. Mako Template Engine, (accessed October 7, 2018). URL http://www.makotemplates.org/.

[23] D. Stuttard and M. Pinto. The web application hacker's handbook: Finding and exploiting security flaws. John Wiley & Sons, 2011.

[24] Injection Flaws - OWASP, 2015 (accessed September 5, 2018). URL https://www.owasp.org/index.php/Injection_Flaws.

[25] D. Vieira-Kurz. CVE-2016-4977: RCE In Spring Security OAUTH 1&2, 2016 (accessed October 7, 2018). URL https://secalert.net/#cve-2016-4977.

[26] T. Tomes. Exploring SSTI in Flask/Jinja2, Part II, 2016 (accessed October 7, 2018). URL https://nvisium.com/blog/2016/03/11/exploring-ssti-in-flask-jinja2-part-ii.

[27] Chromium XSS Auditor, 2010 (accessed October 7, 2018). URL https://www.chromium.org/developers/design-documents/xss-auditor.

[28] Twig Template Engine, (accessed September 5, 2018). URL https://twig.symfony.com/.

[29] HTML Standard - Tag names, 11 October 2018 (accessed October 11, 2018). URL https://html.spec.whatwg.org/#syntax-tag-name.

[30] T. Orange. SSTI at rider.uber.com, 2016 (accessed October 7, 2018). URL https://hackerone.com/reports/125980, http://blog.orange.tw/2016/04/bug-bounty-uber-ubercom-remote-code_7.html.

[31] Pete (yaworsk). SSTI at unikrn.com, 2017 (accessed October 7, 2018). URL https://hackerone.com/reports/164224.

[32] M. Gogebakan. SSTI in CMS Made Simple. URL https://www.netsparker.com/blog/web-security/exploiting-ssti-and-xss-in-cms-made-simple/, https://www.netsparker.com/web-applications-advisories/ns-17-032-server-side-template-injection-vulnerability-in-cms-made-simple/.

[33] T. Adhikary. Spring boot RCE, 2017 (accessed October 7, 2018). URL http://deadpool.sh/2017/RCE-Springs/.

[34] Tghawkins. SSTI at datax.yahoo.com, 2016 (accessed October 7, 2018). URL https://hawkinsecurity.com/2017/12/13/rce-via-spring-engine-ssti/.

[35] N. Berg and J. Kloosterman. SSTI in Craft CMS, 2016 (accessed October 7, 2018). URL https://www.securify.nl/advisory/SFY20160608/craft-cms-affected-by-server-side-template-injection.html.

[36] Sebastian (0xB455). CVE-2018-14716: Server Side Template Injection with Craft CMS plugin SEOmatic, 2018 (accessed October 7, 2018). URL http://ha.cker.info/exploitation-of-server-side-template-injection-with-craft-cms-plguin-seomatic/.

[37] J. Kettle. Backslash Powered Scanning: Hunting Unknown Vulnerability Classes.

[38] D. Appelt, C. D. Nguyen, L. C. Briand, and N. Alshahwan. Automated testing for SQL injection vulnerabilities: an input mutation approach. In Proceedings of the 2014 International Symposium on Software Testing and Analysis, pages 259–269. ACM, 2014.

[39] F. Duchene, S. Rawat, J.-L. Richier, and R. Groz. Kameleonfuzz: evolutionary fuzzing for black-box xss detection. In Proceedings of the 4th ACM Conference on Data and Application Security and Privacy, pages 37–48. ACM, 2014.

[40] O. Tripp, O. Weisman, and L. Guy. Finding your way in the testing jungle: a learning approach to web security testing. In Proceedings of the 2013 International Symposium on Software Testing and Analysis, pages 347–357. ACM, 2013.

[41] IBM AppScan, (accessed October 7, 2018). URL http://www-03.ibm.com/software/products/pt/appscan-standard.

[42] OWASP Zed Attack Proxy Project, (accessed October 7, 2018). URL https://www.owasp.org/index.php/OWASP_Zed_Attack_Proxy_Project.

[43] Jenkins automation server, (accessed October 7, 2018). URL https://jenkins.io/.

[44] M. Karlsson and F. Almroth. The Ultimate SQL Injection Payload, 2013 (accessed October 7, 2018). URL https://labs.detectify.com/2013/05/29/the-ultimate-sql-injection-payload/.

[45] X. J. (@filedescriptor). XSS Polyglot Challenge v2, 2018 (accessed October 7, 2018). URL https://polyglot.innerht.ml/.

[46] A. Elsobky. Unleashing an Ultimate XSS Polyglot, 2016 (accessed October 7, 2018). URL https://github.com/0xsobky/HackVault/wiki/Unleashing-an-Ultimate-XSS-Polyglot.

[47] M. Karlsson. Polyglot payloads in practice by avlidienbrunn at HackPra, 2014 (accessed October 7, 2018). URL https://www.slideshare.net/MathiasKarlsson2/polyglot-payloads-in-practice-by-avlidienbrunn-at-hackpra, https://youtu.be/EeZDw64YwV0.

[48] Django Template Engine, (accessed September 5, 2018). URL https://docs.djangoproject.com/en/2.1/topics/templates/.

[49] Go Template Engine, (accessed September 5, 2018). URL https://golang.org/pkg/text/template/.

[50] S. Solutions. Spree Commerce, 2018 (accessed October 7, 2018). URL https://spreecommerce.org/.