Detecting and Removing Web Application Vulnerabilities With
Total Page:16
File Type:pdf, Size:1020Kb
IEEE TRANSITIONS ON REALIBILITY 1 Detecting and Removing Web Application Vulnerabilities with Static Analysis and Data Mining Ibéria Medeiros, Nuno Neves, Member, IEEE, and Miguel Correia, Senior Member, IEEE Abstract—Although a large research effort has been going WAP-TA Web Application Protection Taint Analyzer on for more than a decade, the security of web applications XSS Cross-site Scripting continues to be a challenging problem. An important part of that problem derives from vulnerable source code, often written in unsafe languages like PHP. Source code static analysis tools NOTATIONS are a solution to find vulnerabilities, but they tend to generate acc accuracy of classifier false positives and require considerable effort for programmers to manually fix the code. We explore the use of a combination fn false negative outputted by a classifier of methods to discover vulnerabilities in source code with less fp false positive outputted by a classifier false positives. We combine taint analysis, which finds candidate fpp false positive rate of prediction vulnerabilities, with data mining, in order to predict the existence kappa kappa statistic of false positives. This approach brings together two approaches pd probability of detection that are apparently orthogonal: humans coding the knowledge about vulnerabilities (for taint analysis) versus automatically pfd probability of false detection obtaining that knowledge (with machine learning, for data pr precision of classifier mining). Given this enhanced form of detection, we propose prd precision of detection doing automatic code correction by inserting fixes in the source prfp precision of prediction code. Our approach was implemented in the WAP tool and an tn true negative outputted by a classifier experimental evaluation was performed with a large set of PHP applications. Our tool found 388 vulnerabilities in 1.4 million tp true positive outputted by a classifier lines of code. Its accuracy and precision were approximately 5% tpp true positive rate of prediction better than PhpMinerII’s and 45% better than Pixy’s. wilcoxon Wilcoxon signed-rank test Index Terms—Web applications, input validation vulnera- bilities, false positives, source code static analysis, automatic I. INTRODUCTION protection, software security, data mining. INCE its appearance in the early 1990s, the Web evolved S from a platform to access text and other media to a ACRONYMS AND ABBREVIATIONS framework for running complex web applications. These ap- plications appear in many forms, from small home-made to AST Abstract Syntax Tree large-scale commercial services (e.g., Google Docs, Twitter, DT/PT Directory Traversal or Path Traversal Facebook). However, web applications have been plagued with K-NN K-Nearest Neighbor security problems. For example, a recent report indicates an LFI Local File Inclusion increase of web attacks of around 33% in 2012 [34]. Arguably, LR Logistic Regression a reason for the insecurity of web applications is that many MLP Multi-Layer Perceptron programmers lack appropriate knowledge about secure coding, NB Naive Bayes so they leave applications with flaws. However, the mecha- OSCI OS Command Injection nisms for web application security fall in two extremes. On PHPCI PHP Command Injection one hand, there are techniques that put the programmer aside, RFI Remote File Inclusion e.g., web application firewalls and other runtime protections RT Random Tree [12], [24], [37]. On the other hand, there are techniques that SCD Source Code Disclosure discover vulnerabilities but put the burden of removing them SQLI SQL Injection on the programmer, e.g., black-box testing [2], [4], [14] and SVM Support Vector Machine static analysis [15], [17], [27]. TEPT Tainted Execution Path Tree The paper explores an approach for automatically protecting TST Tainted Symbol Table web applications while keeping the programmer in the loop. UD Untainted Data The approach consists in analyzing the web application source Vul Vulnerabilitiy(ies) code searching for input validation vulnerabilities and inserting WAP Web Application Protection fixes in the same code to correct these flaws. The programmer Manuscript received August XX, 2014; revised September XX, 2014. is kept in the loop by being allowed to understand where I. Medeiros and N. Neves are with the LaSIGE, Faculdade de Ciên- the vulnerabilities were found and how they were corrected. cias, Universidade de Lisboa - Portugal (e-mail: [email protected] and This contributes directly for the security of web applications [email protected]). M. Correia is with the INESC-ID, Instituto Superior Técnico, Universidade by removing vulnerabilities, and indirectly by letting the de Lisboa - Portugal (e-mail: [email protected]). programmers learn from their mistakes. This last aspect is IEEE TRANSITIONS ON REALIBILITY 2 enabled by inserting fixes that follow common security coding or PostgreSQL. The tool might be extended with more flaws practices, so programmers can learn these practices by seeing and databases, but this set is enough to demonstrate the the vulnerabilities and how they were removed. concept. Designing and implementing WAP was a challenging We explore the use of a novel combination of methods task. The tool does taint analysis of PHP programs, a form to detect this type of vulnerabilities: static analysis and data of data flow analysis. To do a first reduction of the number mining. Static analysis is an effective mechanisms to find of false positives, the tool performs global, interprocedural vulnerabilities in source code, but tends to report many false and context-sensitive analysis, which means that data flows positives (non-vulnerabilities) due to its undecidability [18]. are followed even when they enter new functions and other This problem is particularly difficult with languages such as modules (other files). This involves the management of several PHP that are weakly typed and not formally specified [7]. data structures, but also to deal with global variables (that in Therefore, we complement a form of static analysis, taint PHP can appear anywhere in the code, simply by preceding analysis, with the use of data mining to predict the existence of the name with global or through the $_GLOBALS array) false positives. This solution combines two apparently opposite and resolving module names (which can even contain paths approaches: humans coding the knowledge about vulnerabil- taken from environment variables). Handling object orientation ities (for taint analysis) versus automatically obtaining that with the associated inheritance and polymorphism was also a knowledge (with supervised machine learning supporting data considerable challenge. mining). We evaluated the tool experimentally by running it with both To predict the existence of false positives we introduce the simple synthetic code and with 45 open PHP web applications novel idea of assessing if the vulnerabilities detected are false available in the internet, adding up to more than 6,700 files positives using data mining. To do this assessment, we measure and 1,380,000 lines of code. Our results suggest that the tool attributes of the code that we observed to be associated with is capable of finding and correcting the vulnerabilities from the presence of false positives, and use a combination of the classes it was programmed to handle. the three top-ranking classifiers to flag every vulnerability as The main contributions of the paper are: (1) an approach false positive or not. We explore the use of several classifiers: for improving the security of web applications by combining ID3, C4.5/J48, Random Forest, Random Tree, K-NN, Naive detection and automatic correction of vulnerabilities in web Bayes, Bayes Net, MLP, SVM, and Logistic Regression. applications; (2) a combination of taint analysis and data Moreover, for every vulnerability classified as false positive, mining techniques to identify vulnerabilities with low false we use an induction rule classifier to show which attributes positives; (3) a tool that implements that approach for web are associated with it. We explore the JRip, PART, Prism applications written in PHP with several database management and Ridor induction rule classifiers for this goal. Classifiers systems; (4) a study of the configuration of the data mining are automatically configured using machine learning based on component and an experimental evaluation of the tool with a labeled vulnerability data. considerable number of open source PHP applications. Ensuring that the code correction is done correctly requires assessing that the vulnerabilities are removed and that the II. INPUT VALIDATION VULNERABILITIES correct behavior of the application is not modified by the fixes. We propose using program mutation and regression testing to Our approach is about input validation vulnerabilities, so confirm, respectively, that the fixes do the function to what this section presents briefly some of them (those handled they are programmed to (blocking malicious inputs) and that by the WAP tool). Inputs enter an application through entry the application remains working as expected (with benign points (e.g., $_GET) and exploit a vulnerability by reaching a inputs). Notice that we do not claim that our approach is sensitive sink (e.g., mysql_query). Most attacks involve mixing able to correct any vulnerability, or to detect it, only the input normal input with metacharacters or metadata (e.g., ’, OR), so validation vulnerabilities it is programmed to deal with. applications can be protected by placing sanitization functions The paper also describes the design of the Web Application in the paths between entry points and sensitive sinks. Protection (WAP) tool that implements our approach [1]. WAP SQL injection (SQLI) vulnerabilities are caused by the use analyzes and removes input validation vulnerabilities from of string-building techniques to execute SQL queries. Fig. 1 code1 written in PHP 5, which according to a recent report shows PHP code vulnerable to SQLI. This script inserts in a is used by more than 77% of the web applications [16]. WAP SQL query (line 4) the username and password provided by covers a considerable number of classes of vulnerabilities: the user (lines 2, 3).