IEEE TRANSACTIONS ON RELIABILITY, VOL. 69, NO. 1, MARCH 2020

DIAVA: A Traffic-Based Framework for Detection of SQL Injection Attacks and Vulnerability Analysis of Leaked Data

Haifeng Gu, Student Member, IEEE, Jianning Zhang, Tian Liu, Ming Hu, Student Member, IEEE, Junlong Zhou, Member, IEEE, Tongquan Wei, Member, IEEE, and Mingsong Chen, Senior Member, IEEE

Abstract—SQL injection attack (SQLIA) is among the most common security threats to web-based services that are deployed on cloud. By exploiting web software vulnerabilities, SQL injection attackers can run arbitrary malicious code on their targets to acquire or compromise sensitive data. Although web application firewalls (WAFs) are offered by most cloud service providers, tenants are reluctant to pay for them, since there are few approaches that can report accurate SQLIA statistics for their deployed services. Traditional WAFs focus on blocking suspicious SQL requests. Few of them can accurately decide whether an attack is really harmful and quickly answer how severe the attack is. To raise the tenants' awareness of the seriousness of SQLIAs, in this paper, we introduce a novel traffic-based SQLIA detection and vulnerability analysis framework named DIAVA, which can proactively and promptly send warnings to tenants. By analyzing the bidirectional network traffic of SQL operations and applying our proposed multilevel regular expression model, DIAVA can accurately identify successful SQLIAs among all the suspects. Meanwhile, the severity of such SQLIAs and the vulnerabilities of the corresponding leaked data can be quickly evaluated by DIAVA based on its GPU-based dictionary attack analysis engine. Experimental results show that DIAVA not only outperforms state-of-the-art WAFs in detecting SQLIAs from the perspectives of precision and recall, but also enables real-time vulnerability evaluation of leaked data caused by SQL injection.

Index Terms—GPU, network traffic, regular expression, SQL injection attack, web application firewall.

Manuscript received June 5, 2018; revised October 20, 2018 and April 9, 2019; accepted June 19, 2019. Date of publication July 24, 2019; date of current version March 2, 2020. This work was supported in part by the National Key Research and Development Program of China under Grant 2018YFB2101300, in part by the Natural Science Foundation of China under Grant 61872147 and Grant 61802185, and in part by the Natural Science Foundation of Jiangsu under Grant BK20180470. Associate Editor: W. Eric Wong. (Corresponding author: Mingsong Chen.)

H. Gu, J. Zhang, T. Liu, M. Hu, and T. Wei are with the Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, Shanghai 200062, China (e-mail: [email protected]; 10142510262@ecnu.cn; [email protected]; [email protected]; [email protected]).

J. Zhou is with the Department of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing 210094, China (e-mail: [email protected]).

M. Chen is with the Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, Shanghai 200062, China, and also with the Shanghai Institute of Intelligent Science and Technology, Tongji University, Shanghai 200092, China (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TR.2019.2925415

NOMENCLATURE

CPU  Central processing unit.
CUDA  Compute unified device architecture.
DBMS  Database management system.
GPU  Graphic processing unit.
HTTP  HyperText transfer protocol.
JSON  JavaScript object notation.
MD5  Message digest 5.
NIC  Network interface card.
RegExp  Regular expression.
SHA  Secure hash algorithm.
SIMD  Single instruction multiple data.
SQL  Structured query language.
SQLIA  SQL injection attack.
WAF  Web application firewall.

Notations

algo  algorithm.
cptxt  Given ciphertext.
dataList  List of extracted data.
dict  Selected dictionary.
num  Size of dictionary segment.
ptxt  Decrypted plaintext.
re3  The third-level RegExps.
req  String containing the HTTP request.
resp  String containing the HTTP response.
succ  Indicator of attack success.
sup  Set of supplemental prefixes and suffixes.

I. INTRODUCTION

Due to outstanding merits such as continuous availability, easy accessibility, and flexibility, an increasing number of enterprises and individuals from different business sectors, such as online shopping, e-banking, healthcare, e-government, and social media, have made their services available on the web. This shift will be further accelerated by the maturity of the latest web technologies (e.g., HTML5), which enable richer web applications and more enhanced user experiences.

Along with the prosperity of web applications, inevitably they are becoming the main targets of malicious attackers. It is reported that in the third quarter of 2017, a web application

0018-9529 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: East China Normal University. Downloaded on March 05,2020 at 01:35:52 UTC from IEEE Xplore. Restrictions apply.

experiences on average 500–700 attacks per day [1]. As one of the most serious threats to web applications, SQL Injection Attacks (SQLIAs) are widely used by attackers to obtain unauthorized access to sensitive information. As an example, a forum of the popular multiplayer online game "Dota 2" was attacked by SQL injection on July 10, 2016. The attack raised serious concerns about user privacy, since around 2 million player records were exposed, including email addresses, IP addresses, usernames, user IDs, and corresponding passwords. Although the user passwords of this forum were protected using the MD5 hashing algorithm, the original passwords can be easily figured out [2].

Cloud computing [3] has become the mainstream platform to run web services, since it enables on-demand service provision in a pay-as-you-go manner and facilitates both the deployment and maintenance of web-based software systems. To protect web applications on cloud from external attacks, Web Application Firewalls (WAFs) are an indispensable mechanism that can be tailored for specific types of attacks. To protect valuable business and customer data under SQLIAs, a WAF checks each input entry in the incoming SQL request. If the value matches an attack pattern specified with a predefined set of rules (e.g., regular expressions), the WAF will block the request [4], [5]. Although WAFs are good at intrusion detection and prevention [6], few of them can answer questions such as "Can this attack cause disastrous damage or a data leak?" or "How severe is the attack?" This is mainly because WAFs usually adopt a unidirectional analysis, which only checks the incoming SQL queries without considering the database responses.

Although WAFs are promising in defending against most malicious attacks, few cloud service providers offer such services for free. For example, Alibaba charges for its WAF service according to the bandwidth of target web applications. Its enterprise-version WAF costs around 1400 dollars per month to protect a website with a limited bandwidth of 30 Mb/s. Consequently, most cloud tenants, especially small enterprises, are reluctant to pay for this security option. This is not simply because of the high price but rather because of the unawareness of the severity of malicious SQLIAs [7], [8]. Most tenants believe that with proper encryption mechanisms the privacy of data can be protected even if malicious attacks happen. Without evidence of real valuable information leaked from their web services, it is hard to persuade them to invest in WAF products. Clearly, the bottleneck here is the lack of effective detection and evaluation approaches that can help cloud service providers to accurately collect the successful SQLIAs and quickly send warnings of leaked information to their tenants.

In order to proactively generate warnings of malicious SQLIAs to cloud tenants whose web services are attacked, this paper presents a traffic-based framework named DIAVA for cloud service providers. DIAVA can not only monitor and collect successful SQLIA information, but also reveal the vulnerability of the leaked data. This paper makes the following three major contributions:

1) We introduce a novel multilevel regular expression model which can be used to accurately detect SQL injection attacks from incoming network traffic.
2) By considering both SQL requests and database responses, we propose a bidirectional network traffic analysis method that can accurately capture successful SQL injections leading to the leak of confidential data.
3) Based on GPU, we propose a parallel dictionary collision analysis approach that can quickly figure out the original content of leaked ciphertexts, which enables the quick evaluation of the vulnerability of leaked data.

Unlike traditional WAFs, DIAVA does not block malicious SQLIAs. Note that besides the online monitoring of network traffic for cloud-based web services, DIAVA can also be used to test the robustness of newly deployed web applications on cloud against SQLIAs. Experimental results using network traffic captured from both real and simulated networks show the superiority of our DIAVA framework in identifying malicious SQLIAs and evaluating the vulnerability of leaked data.

The rest of this paper is organized as follows. Section II presents the related work on various detection and analysis techniques for SQLIAs. Section III introduces the preliminary knowledge of different types of SQLIAs. Section IV provides the implementation details of our DIAVA framework. Section V compares our framework with two state-of-the-art WAFs and reports the performance evaluation results. Finally, Section VI concludes this paper.

II. RELATED WORK

As more and more web applications are deployed on cloud, SQL injection attacks have become a major threat to web-based services. According to a report [9], over 80% of web applications on the Internet potentially have at least one serious vulnerability. In order to detect and prevent SQLIAs, various web application program analysis and SQL injection detection techniques have been investigated.

By analyzing the syntaxes and behaviors of web-based software, the vulnerabilities of web-based applications can be exploited to facilitate the detection of SQL injection attacks. For example, Xie et al. [10] utilized static taint analysis techniques to find SQL injection vulnerabilities in PHP scripts. In [11], Livshits et al. introduced a static analysis approach based on a scalable and precise points-to analysis. Their approach can be used to find vulnerabilities in statically analyzed code that match the user-provided vulnerability patterns. In [12], Bandhakavi et al. proposed a program analysis-based method called CANDID. Based on the structure comparison with extracted programmer-intended queries, their approach can check the vulnerabilities of user input queries. To enable vulnerability mining for web applications, Shar et al. [13] presented a vulnerability prediction approach using supervised machine learning. In [14], Wang et al. introduced a novel approach that can distinguish malicious behaviors by learning SQL statements using program tracing techniques. In [15], Kieyzun et al. presented a mutation-based attack generation method that can track taints through programs' execution to identify SQL injection attacks. In [16], Kar et al. presented a novel method that can detect injection attacks by modeling SQL queries as graphs of tokens and using the centrality measure of nodes to train a support vector machine. Their approach can protect multiple web applications in a shared hosting scenario. In [17], Thomé et al. proposed the concept of security slicing for the auditing of


TABLE I TYPES OF SQL INJECTION ATTACKS

common injection vulnerabilities. Although the above approaches can effectively identify vulnerabilities that may be exploited by SQL injection attackers, few of them can be used in cloud computing, since the source code of most web applications is not available to cloud service providers.

As runtime techniques to detect and prevent malicious SQL injections, Web Application Firewalls are promising in finding anomalies from incoming SQL requests [18]. By filtering out and blocking SQL injection attacks, WAFs are widely used to protect web applications. For example, GreenSQL [19] and ModSecurity [20] are two popular WAFs based on regular expression-based rules, which can define the patterns of suspicious SQL statements. To make WAFs more robust against SQLIAs, various WAF testing approaches have been investigated. For example, Appelt et al. [21] examined the impact of WAFs and database proxies on black-box SQL injection testing. They proposed using database proxies as an oracle for SQL injection testing. In [22], Appelt et al. presented an automated testing approach to produce executable and harmful SQL statements using their proposed mutation operators. Their method can generate randomized inputs to detect SQL vulnerabilities, i.e., inputs that are executable, pass the firewall, and unduly reveal or compromise data in the database. Moreover, they proposed a machine learning-based approach [23], [24] that can generate effective attack inputs to bypass the target WAFs. In [25], Liu et al. utilized feature matrix models to improve the efficiency of SQL injection vulnerability penetration testing. In [26], Ceccato et al. presented SOFIA, which is a security oracle for SQL injection vulnerabilities. It can be used to assess whether SQL-based test executions expose any vulnerabilities. Although the capabilities of WAFs can be enhanced by the above approaches, most of them focus on incoming SQL queries. Few of them take the database responses into account. Therefore, they can hardly evaluate SQLIA severities.

Similar to our approach, Sekar [28] proposed a taint-tracking-based approach which utilizes syntax- and taint-aware policies to detect and block SQL injection attacks. Although his method is also bidirectional, considering both SQL requests and database responses, he did not investigate what database data are leaked. Therefore, his approach cannot reveal how serious the SQLIAs are. To the best of our knowledge, our proposed approach is the first attempt that can detect SQLIAs and evaluate their severities simultaneously based on network traffic analyses. Our DIAVA framework can not only accurately detect suspicious incoming SQL requests using our proposed multilevel regular expression based rules, but also enable quick vulnerability evaluation of leaked sensitive data.

III. PRELIMINARY KNOWLEDGE

Table I presents the five major types of SQL injection attacks, which are widely used to extract or modify users' sensitive data in practice. The following sections will introduce each of them in detail with examples.

A. Boolean-Based Blind SQLIAs

To get unauthorized access to or extract sensitive information from databases, attackers can inject tautologies in WHERE clauses, which can force a SQL query to return all results, basically ignoring any other conditionals in the instrumented WHERE clauses. The salient features of this kind of attack include keywords such as MAKE_SET and ELT. In the following SQLIA example, the predicate "OR 1 = 1 --" is added to the WHERE clause, which makes the evaluation of this clause always true. As a result, all the students' information will be leaked out.

Example 1: SELECT * FROM students WHERE sname = 'abc' OR 1 = 1 -- AND password = '123';

B. Error-Based SQLIAs

In order to spy on sensitive information of a database management system (DBMS), attackers can inject logically incorrect statements in some SQL query to make the DBMS return database-specific error messages. When conducting error-based SQLIAs, attackers try to figure out injectable parameters that can be used to perform database fingerprinting and data extraction. Error messages returned from the DBMS may contain sensitive information that can help attackers to identify important parameters of databases. The salient features of this type of attack typically include keywords like FLOOR, ORDER BY, and UPDATEXML.

For example, assume that we want to query the information of students whose age is 10 from a remote database table students via a web-based application. We use the URL http://example.com/page.php?age=10 to initialize the query. Such a URL will be sent to some HTTP server and transformed into a SQL statement "SELECT * FROM students WHERE age = 10." To obtain the sensitive information (i.e., the number of columns) of the table students, we can resort to the clause ORDER BY. We may conduct the error-based SQLIA using a modified URL like this: http://example.com/page.php?age=10 order by 5. The following SQLIA example shows the corresponding injected SQL statement.


Fig. 1. Overview of our DIAVA framework.

Example 2: SELECT * FROM students WHERE age = 10 ORDER BY 5;

If the execution of this SQL statement returns an error, we can conclude that the table students has fewer than five columns. Otherwise, we can infer that the table has more than four columns. By properly decreasing or increasing x in the clause ORDER BY x, attackers can easily figure out the exact number of columns of the table.

C. Stacked Queries

The intent of stacked query-based SQLIAs is to get the permission to access/add/modify data in databases or to perform a denial of service attack. Attackers can inject multiple SQL statements in a single query by concatenating them together with semicolons in between. Therefore, the semicolon is the salient feature of stacked query-based SQLIAs. In the example shown as follows, the original SQL statement (the one before the semicolon) tries to figure out the information of a student whose name is abc. By appending a new SQL statement after the original statement, the attackers can maliciously delete the table user.

Example 3: SELECT * FROM students WHERE sname = 'abc' AND password = '123'; DROP TABLE user;

D. Union Query-Based SQLIAs

To bypass authentication or extract sensitive data from databases, attackers can inject a new SELECT statement within the original SQL statement. The following is an example that uses the keyword UNION. In this case, both the original and new SELECT statements will be executed.

Example 4: SELECT COUNT (*) FROM students WHERE sname = 'abc' UNION SELECT CardNumber FROM credit WHERE sname = 'admin' -- ' AND password = '123';

E. Time-Based Blind SQLIAs

The time-based blind SQLIAs try to change the time behaviors of DBMSs or web applications. Based on typical keywords such as SLEEP, WAITFOR, and BENCHMARK, attackers can either retrieve sensitive information or verify their established assumptions. Usually, by properly putting some time-delay clauses in a conditional statement of a SQL query, attackers can determine whether the condition is true or false based on the response time of target DBMSs. The following example shows how to use a time-based blind SQLIA to obtain the version information of a MySQL DBMS.

Example 5: SELECT * FROM users WHERE id=1-IF(MID(VERSION(), 1, 1) = '5', SLEEP(100), 0);

In this case, if the response of the DBMS takes 100 s or more, we can infer that the target MySQL version is 5.x.

IV. OUR APPROACH

Fig. 1 presents an overview of our DIAVA framework. It contains two parts: 1) the front-end that collects network traffic related to SQLIAs; and 2) the back-end that identifies SQLIAs and evaluates the vulnerability of leaked data by dictionary attack analysis (i.e., dictionary-based collision). In our approach, we introduce a multilevel RegExp (regular expression) model consisting of three RegExp sets that enable both SQLIA detection and vulnerability analysis of leaked data. Our framework adopts Hyperscan, which is a high-performance library developed by Intel, to facilitate the multilevel matching of RegExps. Based on our proposed first-level RegExps, the traffic collection module in the DIAVA front-end can effectively filter out network traffic that does not contain suspected SQLIAs. When performing behavior analysis of captured network traffic in the back-end, our framework uses the second-level RegExps to identify successful SQLIAs, and then uses the third-level RegExps to extract the leaked sensitive data in the form of either plaintexts or ciphertexts. If the extracted leaked data are encrypted, the DIAVA back-end employs a dictionary attack based approach that can evaluate the vulnerability of the leaked data based on the complexity of its deciphering process. To enable real-time analysis, our dictionary attack analysis module employs GPU as its underlying parallel deciphering engine, which can quickly figure out the original contents of the leaked sensitive data. Based on the front-end and back-end modules configured using our proposed multilevel RegExp model, DIAVA can efficiently and accurately identify successful SQLIAs from massive network traffic, and check the vulnerability (i.e., severity) of SQLIAs

based on the corresponding leaked data. The following sections will introduce each component of DIAVA in detail.

A. Network Traffic Collection of SQL Injection Attacks

In order to detect malicious SQLIAs from network traffic, the front-end of DIAVA monitors all the possible ports of host virtual machines that are frequently used by the HTTP protocol. By setting the proper size of hugepages and the length of queues for receiving network packets, the front-end can collect all the network traffic passing through the network interface cards (NICs). In the front-end, we use Hyperscan as our high-performance engine for regular expression matching. Based on regular expressions collected from well-known SQLIA tools (e.g., SQLMap [27]), we create the first-level RegExps for the purpose of SQLIA detection from network traffic. By compiling our proposed first-level RegExps and converting them into a Hyperscan database, Hyperscan can deal with network traffic in a streaming mode concurrently and efficiently based on automata theory and a rich set of APIs.

Listing 1. Excerpts from the first-level RegExps used in front-end.

Listing 1 shows the skeleton of our first-level RegExps employed in the DIAVA front-end. These RegExps focus on matching the features of malicious SQLIAs existing in the request packets of network traffic. Since in the HTTP protocol most web requests are based on the POST and GET methods, our front-end mainly takes the network traffic related to these two methods of HTTP requests into account. Note that due to space limitation, Listing 1 does not list all the RegExps used in our framework. In the current version of DIAVA, we mainly consider the five types of SQLIAs introduced in Section III. For each type of SQLIAs, we define a set of RegExps to model the matching patterns, which can accurately capture the network traffic related to SQLIAs. Note that in our approach, the front-end is extendable. Other RegExps can be easily incorporated into the first-level RegExps to detect more types of SQLIAs.

To enable SQLIA detection for different DBMSs (e.g., PostgreSQL, Oracle, MySQL), for each suspected SQLIA the DIAVA front-end converts the corresponding network traffic into a unified intermediate file in JSON format. As an example, Listing 2 presents the skeleton of a JSON-based front-end output related to a suspected SQLIA. To enable the following SQLIA detection and evaluation, each JSON file should record the information about the matched regular expression, the IPs and ports of relevant clients and servers, and a sequence of sessions (denoted by "seq" in Listing 2). In the JSON file, we use a triplet in the form of (timestamp, request, response) to save the details of each session.

Listing 2. Skeleton of the front-end output in JSON format.

B. Behavior Analysis and Data Extraction

In the DIAVA back-end, the function of the behavior analysis and data extraction module is to determine whether suspected SQLIAs detected by the DIAVA front-end can be successfully performed in their target web applications. If the SQLIAs succeed, this module will figure out whether there exists any leaked sensitive information. If the leaked data are in the form of ciphertexts, they will be sent to our GPU-based dictionary attack analysis module for further vulnerability evaluation. As shown in Fig. 1, this module consists of two parts: 1) the behavior analysis submodule that parses front-end outputs in JSON format and determines the status of a SQLIA, where successful indicates that the SQLIA successfully updates its target database or obtains some sensitive data, and failed means that the SQLIA is malicious but takes no effect; and 2) the leaked data extraction submodule that extracts the leaked data in ciphertexts for the severity analysis using a dictionary attack-based method. To enable accurate behavior analysis and data extraction, we collected a large set of successful SQLIA samples during the first half of 2016. Such information can be used to guide the RegExp generation for the two submodules.

1) Behavior Analysis: To determine whether a SQLIA is successfully performed, based on the collected successful SQLIAs, we propose a set of 14 RegExps as shown in Listing 3. Although the RegExps in Listing 3 can identify all the collected SQLIA samples, there still exist SQLIAs that


cannot be detected, especially the error-based SQLIAs, which are platform-specific.

To enable the processing of such SQLIAs, we analyze the characteristics of existing error-based SQLIAs and summarize a comprehensive set of supplemental regular expressions as listed in Listing 4. We combine both the common and supplemental regular expressions as the second-level RegExps for SQLIA behavior analysis. Note that we do not conduct leaked data extraction for network traffic which only contains Boolean-based, time-based blind, or stacked query-based SQLIAs. This is mainly because the analysis of these SQLIAs highly relies on the order of collected sessions. However, such order information is hard to guarantee during practical network traffic collection. Since a single session cannot be used to determine the success of such a SQLIA, we skip the leaked data extraction and evaluation for these SQLIAs.

Listing 3. RegExps for behavior analysis of common SQLIAs.

Fig. 2 presents the workflow of our behavior analysis submodule. Since a front-end output may consist of a large quantity of sessions, to reduce the overall analysis time for a single SQLIA, we adopt both the Map and Reduce operations in our workflow. Note that during the behavior analysis, we do not need to reassemble the TCP streams for the collected network traffic. This is because our approach focuses on the content of the traffic. The order of sessions is not considered during the following vulnerability analysis.

Fig. 2. Workflow of behavior analysis.

Listing 4. RegExps for the behavior analysis of error-based SQLIAs.

In the Map stage, our approach partitions the input session file into multiple sessions (see the example of "seq" shown in Listing 2), where each session is handled by a thread implementing the behavior analysis function. Based on our proposed second-level RegExps (shown in Listings 3 and 4), the session file can be analyzed in parallel. For each behavior analysis function, its input is a session defined in some front-end output. Its goal is to figure out some predefined strings instrumented by attackers, which can be used to determine the status of SQLIAs. In our approach, we use the second-level RegExps to identify the injected predefined strings from both request and response segments of a session.

In the Reduce stage, our approach tries to gather all the session analysis results and report the status of the SQLIA. Since our approach focuses on the detection and evaluation of leaked sensitive data based on the bidirectional network traffic analysis, we define the SQLIA status (i.e., successful and failed) based on the existence of predefined strings in the database responses. If some predefined string detected in the SQL request also exists in the database responses, our behavior analysis submodule will report that a SQLIA has been successfully identified. Otherwise, a failed SQLIA will be reported, since it is hard for DIAVA to determine whether this attack is successfully performed based on the collected network traffic. To facilitate the fast determination of SQLIA statuses, our approach uses the Filter function when gathering the analysis results, which can classify the successful and failed attacks in sessions efficiently. Finally, if there is at least one successful attack during the session file analysis, a successful SQLIA will be reported. Otherwise, the resultant SQLIA status will be marked as failed.

Besides figuring out SQLIA statuses, our behavior analysis submodule can be used to collect as much useful information (e.g., SQLIA statements, the location of SQLIAs, DBMS parameters) as possible to enable the following extraction and evaluation of leaked sensitive data. For example, according to the HTTP protocol standard, the request segment of a session can be divided into five fields, i.e., request-line, general-header, request-header, entity-header, and message-body. With the help of predefined strings obtained using our proposed second-level RegExps, we can quickly locate the fields that contain SQLIAs and identify the corresponding SQLIA statements.

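The field-location step mentioned at the end of the behavior analysis discussion can be illustrated as follows. This is a minimal sketch, assuming the raw HTTP request is available as a single string with CRLF line endings and that the predefined string (`needle`) has already been obtained; the function name `locate_fields` is hypothetical. The header classification follows the RFC 2616 grouping, where any header outside the general and request groups is treated as an entity-header.

```python
# Header names grouped as in RFC 2616; all names are lower-cased for lookup.
GENERAL = {"cache-control", "connection", "date", "pragma", "trailer",
           "transfer-encoding", "upgrade", "via", "warning"}
REQUEST = {"accept", "accept-charset", "accept-encoding", "accept-language",
           "authorization", "expect", "from", "host", "if-match",
           "if-modified-since", "if-none-match", "if-range",
           "if-unmodified-since", "max-forwards", "proxy-authorization",
           "range", "referer", "te", "user-agent"}


def locate_fields(raw_request, needle):
    """Return the names of the request fields whose content contains `needle`."""
    head, _, body = raw_request.partition("\r\n\r\n")
    lines = head.split("\r\n")
    hits = []
    if needle in lines[0]:  # first line is the request-line
        hits.append("request-line")
    for line in lines[1:]:
        name, _, value = line.partition(":")
        if needle not in value:
            continue
        key = name.strip().lower()
        if key in GENERAL:
            kind = "general-header"
        elif key in REQUEST:
            kind = "request-header"
        else:  # unrecognized headers count as entity-headers
            kind = "entity-header"
        if kind not in hits:
            hits.append(kind)
    if needle in body:
        hits.append("message-body")
    return hits
```

For a POST request whose User-Agent header and form body both carry the injected predicate, such a helper would report the request-header and message-body fields, which narrows down where the SQLIA statement should be read from.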

2) Extraction of Leaked Data: Along with the increasing popularity of automated penetration tools in hacking web ap- plications, more and more SQL injection attackers like to steal sensitive information from databases. To evaluate the severity of SQLIAs, we need to figure out which sensitive data have been leaked to attackers. Therefore, after obtaining the traffic data of successful SQLIAs and the features of SQL injections from the behavior analysis, the major job of our leaked data extrac- tion submodule is to locate and identify which part of collected traffic contains the leaked data. By executing malicious instrumented SQL statements, attack- ers may obtain a large set of query results. Without effective techniques or tools, it is hard for attackers to quickly locate their desired sensitive data by themselves. Therefore, database attackers prefer to instrument some identifier strings within their malicious SQL statements in corresponding HTTP requests. Generally, these identifier strings are encoded in the form of hexadecimal strings, CHR functions (for Oracle database), or CHAR functions (for MySQL and MSSQL databases). Rather than changing the meanings of their host SQL statements, such identifier strings are only used as anchor points to facilitate the identification of sensitive data. In our approach, we propose a set of ten RegExps (the third-level RegExps) as shown in Listing 5, which are generated based on the collected successful SQLIA samples. Note that these ten RegExps can be used to identify all the identifier strings existing in the collected SQLIA sam- ples. In other words, they can be used to locate the positions of leaked data caused by the collected SQLIA samples. 
These third-level RegExps can process the following six major categories of identifier strings used in practice:
1) hexadecimal strings (starting with “0x”);
2) CHR functions;
3) CHAR functions;
4) “0x” strings encrypted with SHA-1, SHA-256, and SHA-512;
5) common strings hashed by MD5;
6) “0x” strings and CHAR functions hashed by MD5.

Listing 5. RegExps for data extraction.

Since identifier strings act as anchor points for locating sensitive data, our leaked data extraction submodule first needs to scan the HTTP responses sent by DBMSs to figure out their locations. Algorithm 1 presents the details of our approach that utilizes identifier strings to identify leaked data. This algorithm has two inputs: an HTTP request req containing some malicious SQL statement, and the corresponding HTTP response resp that may contain leaked sensitive data.

In Algorithm 1, step 1 initializes two lists specialStrList and dataList, which are used to save identifier strings collected from req and leaked data collected from resp, respectively. Steps 2–10 try to search for identifier strings in an iterative manner until req has been completely scanned. In each iteration, we try to figure out the location of the identifier string at the head of req. We use three variables hexIdx, chrIdx, and charIdx to indicate the indices of the latest found identifier strings in the form of hexadecimal format, chr() function, and char() function, respectively. As shown in step 2, at the beginning of each iteration, we set all these three variables to the largest integer value MAXINT. In step 3, the function GetIdx updates the value of hexIdx, chrIdx, or charIdx with the location of detected identifier strings based on the third-level RegExps. If no identifier string is identified from req, the whole iteration will be terminated by step 5. Otherwise, steps 6–9 will record and sort the identifier strings in specialStrList in the order of their occurrences. Since the char() function has two different variants, in step 8, we use the function CheckCHARFormat to check which variant is used in req. For example, char(100, 100, 100) and char(100)||char(100)||char(100) denote two strings with the same contents. To facilitate the following leaked data extraction, we unify the detection of char() function-based identifier strings. By using the flag flg, our approach will convert char(100, 100, 100) into char(100)||char(100)||char(100). Without loss of generality, the function GetSpecialChar will transform a string in the form of char(x1,x2,...) into another string in the form of char(x1)||char(x2)|| .... When all the


Listing 6. SQLIA example with leaked data in ciphertext.

TABLE II EXTRACTED DATASET FROM THE RESPONSE BODY

Fig. 3. SQLIA example with leaked data in plaintext.

identifier strings are figured out, step 11 will extract leaked data from the HTTP response resp. If there are k identifier strings in specialStrList, step 11 will extract k − 1 sensitive data and save them in dataList, where each sensitive data is located in between two adjacent identifier strings saved in specialStrList. Finally, the algorithm reports all the extracted sensitive data from resp.

Based on the third-level RegExps (see Listing 5) and Algorithm 1, our data extraction submodule can extract sensitive data from the network traffic of database responses based on identifier strings detected in HTTP requests. Fig. 3 gives an example demonstrating how DIAVA extracts sensitive data leaked from a MySQL DBMS. In this example, an attacker injects his instrumented SQL payload in the field of HTTP GET method trying to steal user information. Note that the obtained HTTP request (i.e., the first three lines in Fig. 3) from network traffic is URL encoded. Therefore, we need to conduct the URL decoding for this request before the process of leaked data extraction. For this example, the URL decoded string is GET /main.php?id=1 AND EXP((SELECT * FROM (SELECT CONCAT(0x71706a6a71, (SELECT (ELT(9170=9170, 1))), 0x7162627671, 0x78))x)) & dbtype=1 & Submit=Submit. By analyzing this request using Algorithm 1, we can find that 0x71706a6a71 and 0x7162627671 are the two identifier strings, where 0x71706a6a71 denotes the string “qpjjq” and 0x7162627671 indicates “qbbvq” in hexadecimal format. The bottom part of Fig. 3 shows the response information returned by the MySQL DBMS. From this information, we can find that this is an error-based SQL injection attack. Based on the identifier strings, our approach can extract the leaked data (i.e., 1 in this example), which is in the plaintext form.

Listing 6 gives another SQLIA example, where the leaked data are encrypted. Unlike the HTTP request shown in Fig. 3, the HTTP request in this example has already been URL decoded. By using the third-level RegExps, we can figure out the identifier strings from this request, i.e., CHR(113)||CHR(106)||CHR(120)||CHR(122)||CHR(113), CHR(119)||CHR(99)||CHR(101)||CHR(109)||CHR(117)||CHR(99), and CHR(113)||CHR(106)||CHR(113)||CHR(122)||CHR(113), which correspond to strings “qjxzq,” “wcemuc,” and “qjqzq,” respectively. Based on these three identifier strings, our approach can be used to extract leaked data from the HTTP response (see lines 12–18 in Listing 6), which are saved in the response segments of sessions saved in corresponding front-end output. Since our approach can figure out four identifier strings from the HTTP request though two of them are the same, from each session of the HTTP response, we can extract three strings as leaked data. Table II presents the extracted data from the HTTP response by the successful SQLIA. In this example, the extracted user ID information is located in between “qjxzq” and “wcemuc.” The extracted user name information is located in between “wcemuc” and “wcemuc.” The extracted user password information is located in between “wcemuc” and “qjqzq.” Note that both user IDs and names are in plaintext, whereas user passwords are all encrypted.

C. Dictionary Attack Analysis Using GPU

In the previous sections, we presented both SQLIA traffic collection and detection modules of DIAVA based on our multilevel regular expression model, which can accurately capture and extract sensitive information leaked by malicious SQLIAs. To protect data privacy, the sensitive information of web applications is usually encoded, but often with weak encryption algorithms. To evaluate the vulnerability of leaked data, it is necessary to check the decryption complexity of these ciphertexts. If a ciphertext can be easily decrypted and restored to its corresponding plaintext, it means that the ciphertext is seriously vulnerable. Note that even for the decryption of a simple ciphertext, a huge amount of computation effort is needed. Since DIAVA may deal with a large quantity of SQLIA traffic at the same time, it is required that DIAVA can conduct the real-time vulnerability analysis of leaked data. To speed up the decryption process, DIAVA adopts GPU as its parallel computation engine for the evaluation of leaked data, which can conduct the dictionary attack efficiently to decipher the extracted sensitive data. In the current version of DIAVA, we only consider four hash-based


Fig. 5. Workflow of our GPU-based dictionary attack analysis.

Fig. 4. Workflow for selecting dictionaries and encryption methods.

weak encryption algorithms (i.e., MD5, SHA-1, SHA-256, and SHA-512), which form part of several widely used security protocols (e.g., SSL, PGP, and IPsec). Note that since our evaluation module is dictionary independent, other encryption algorithms can be easily incorporated into our framework. The following part will present the details of our GPU-based dictionary attack analysis.

1) Determination of Dictionaries and Encryption Methods: In our DIAVA framework, we construct the dictionary set (as shown in Fig. 4) based on the dictionaries provided by SecLists [29], which is an open-source project widely used by security testers. To enable efficient dictionary attack analysis, the dictionary items in SecLists are regrouped based on both the types (i.e., all digits, all letters, mixture of digits and letters with special characters) and lengths of dictionary items. We create a new dictionary for each new group. In other words, the items in the same dictionary have the same type and length. In this way, we can save the dictionary attack time, since the unnecessary comparison with impossible scenarios can be significantly reduced.

Before the dictionary attack analysis, it is required to identify which encryption algorithm is used for the construction of ciphertexts. Moreover, a proper selection of dictionaries will increase the chance of the success of deciphering. Fig. 4 shows our approach to identifying weak encryption methods for leaked data and selecting a dictionary for the collision analysis. To figure out these two key factors in deciphering the leaked data, our approach requires the following four inputs:
1) features of web applications;
2) features of SQL payloads;
3) the feature of encrypted leaked data;
4) features of DBMSs.

Among these four inputs, the first two are used for the selection of dictionaries. Since most SQLIAs target the user information (e.g., user name, password) of web applications, the web application features mainly involve the format of user information. If the format of user information is available a priori, we can select or construct a succinct dictionary with a higher success ratio for the attack analysis. To obtain such information, DIAVA has a built-in web crawler which can retrieve such important web application features from the sign-up and login webpages of web applications. Based on the obtained web application features, we can figure out the plaintext characteristics of leaked data, which can be used for the dictionary selection. For example, if we know that the passwords of some web application are in the form of all digits, we will not select a dictionary of passwords that contain letters. Besides web application features, SQL payload features are mainly gathered during the behavior analysis. By analyzing such features (e.g., table names, attribute names, aggregation functions) from injected SQL statements, we can infer the meaning or content of leaked data, which can be used to facilitate the dictionary selection.

The third and fourth inputs are used to determine which weak encryption algorithms are used for generating the leaked data. In our approach, the third input is only used to determine the length of a ciphertext. Since different DBMSs have different default built-in encryption methods, the DBMS features obtained during behavior analysis can be used to determine the type of encryption algorithms used by leaked data. For example, if we have a ciphertext (encoded in 32 hex digits) leaked from some MySQL-based database, we can assume that this one is encrypted by the MD5 algorithm. This is because MD5 is the built-in encryption algorithm of MySQL and a general MD5 ciphertext has a length of 32. Note that in the current version of DIAVA, we only consider four major encryption algorithms (i.e., MD5, SHA-1, SHA-256, and SHA-512) for the dictionary attack analysis. Since the process of encryption algorithm determination is programmable, by extending ciphertext/DBMS features and adding more determination rules, other hash-based encryption algorithms can be easily integrated into our framework.

2) GPU-Based Dictionary Attack Analysis: Fig. 5 presents the workflow of our GPU-based dictionary attack analysis. To increase the success ratio of dictionary attack, we extend the selected dictionary (see details in Section IV-C1) using the plaintext information (e.g., user name, date of birth) extracted from the same leaked data as the ciphertext to be decrypted. Such plaintexts are concatenated with the dictionary items as their prefixes or suffixes. Based on the extended dictionary and GPU platforms, our approach can conduct dictionary attack in parallel, which improves the overall decryption time of the given ciphertext. In our approach, we construct one extended dictionary for each given ciphertext. Therefore, it is not necessary to


construct a hash table for each extended dictionary to facilitate the search for corresponding plaintext. Instead, our approach partitions the dictionary into a set of small segments with the same size and assigns each segment to a GPU thread for parallel search in an SIMD manner.

To reduce the overall deciphering time, our GPU-based dictionary attack analysis approach involves three major steps. Since GPU cannot access the host memory directly, in the first step our approach first transfers all the inputs (i.e., selected dictionary, prefix and suffix information, and ciphertext) from host memory to GPU memory. To enable the parallel searching, the first step then assigns the dictionary segments and corresponding prefix and suffix data to specific GPU threads based on the GPU runtime settings. The second step mainly deals with the parallel dictionary extension and collision analyses. In this step, each GPU thread tries to assemble a local dictionary based on its assigned dictionary segment. After concatenating the dictionary items with the given prefix and suffix information, each GPU thread needs to encrypt the newly generated dictionary items with the identified encryption algorithm. If the encryption result equals the input ciphertext, it means that the dictionary attack succeeds. This step terminates only when either one GPU thread attacks successfully or the whole dictionary has been scanned. Finally, the third step moves the attack analysis result from the GPU memory to host memory and reports the overall attacking time and whether the given ciphertext has been successfully decrypted.

Algorithm 2 presents the details of the GPU-based dictionary collision procedure used in our DIAVA framework. Note that DictAttack is a global function that is invoked by the CPU and executed by a GPU thread. The algorithm has six inputs. The inputs dict, sup, and cptxt denote the selected dictionary, the set of prefix and suffix strings, and the given ciphertext for collision analysis, respectively. Before the invocation of DictAttack, all of this information is transferred from the CPU memory to GPU memory. Note that since generally the size of sup is small but it is used by all the GPU threads for dictionary extension, we use the GPU shared memory to store such information to further improve the overall dictionary attack time. The input succ is a global variable for all GPU threads indicating the success of the attack. If one GPU thread decrypts the given ciphertext successfully, we will update this variable to notify other GPU threads to terminate their collision analysis. The inputs num and algo are two parameters used to specify the number of dictionary items and the encryption algorithm (i.e., MD5, SHA-1, SHA-256, and SHA-512) for cptxt, respectively.

In Algorithm 2, steps 1–2 figure out the range of dictionary items assigned to the current GPU thread. Based on the two nested loops shown by steps 3–11, the GPU thread iteratively processes the collision analysis for each extended dictionary item, where the inner loop deals with the extension and collision analysis for only one dictionary item. In the inner loop, step 3 checks whether other GPU threads have conducted a successful attack. If yes, the whole GPU thread will be terminated. To increase the success ratio of the dictionary attack, steps 4–6 enumerate three combinations of concatenations between the original dictionary item and the supplementary strings obtained from the same leaked data. In step 7, we encrypt the extended plaintexts using the specified algorithm and save the generated ciphertexts using a variable vector named result. Step 8 compares the content of each element of result against cptxt. If there does not exist a match, the function Equals will return −1, and the inner loop will continue. Otherwise, Equals will return the index of the matched element. Meanwhile, the GPU thread will mark succ as true, save the deciphered plaintext, and terminate its current execution.

D. Back-End Outputs for Further Analysis

When the back-end finishes both the behavior and vulnerability analyses of leaked data, all the statistics collected or inferred from network traffic will be saved in a file in JSON format for further analysis. Listing 7 presents an excerpt from the back-end output of a successful SQLIA. In this example, the keys client_ip and client_port denote the IP address and port of the attacker, respectively, while the keys server_ip and server_port indicate the IP address and port of the victim virtual machine. Meanwhile, the geographic locations of the attacker and victim denoted by client_location and server_location are also figured out by our proposed RegExps.¹ If the key exploit is set to “OK,” it means that the SQLIA is successfully performed. Each value of the key sqlinject contains all the information for a single session of the SQLIA, e.g., attack type (“ATTACK_TYPE”), session

¹Due to the space limitation, in this paper we only present the RegExps that are directly related to the extraction and evaluation of leaked data. We omit the introduction to other RegExps used in DIAVA.


TABLE III TAMPERING METHODS OF ATTACK GENERATION IN SQLMAP

Listing 7. Skeleton of back-end output.

contents (“session”), and leaked data (“data”). In this example, the SQLIA type for the illustrated session is “UNION-BASED INJECTION.” The value of session includes the statement of the SQL injection attack and the location of its occurrence (i.e., “request-line”) in the HTTP protocol. The value of data keeps all the information for leaked data. If some field of the “data” is encrypted, in this part we will provide the original ciphertext, the name of the encryption method, the corresponding plaintext, and the overall decryption time using our GPU-based dictionary attack method. In this example, we only present the leaked information for a user. We can find that the leaked password of the user is encrypted using the MD5 algorithm. Since it only costs 200 ms to decipher the password, an attacker can easily launch a malicious attack on the corresponding web application.

V. EXPERIMENTAL RESULTS

To evaluate the effectiveness of our proposed framework, we conducted the experiment using network traffic captured from both simulated and real networks. Our network simulation is on top of a private cloud platform based on Openstack. To simulate the database transactions, our private cloud is installed with a PHP/MySQL web application DVWA [30] that supports a variety of database management systems, e.g., MySQL (version 5.1.73) and PostgreSQL (version 8.4). We adopted and modified the open source tool SQLMap [27] to automatically generate SQL injections, and we used the open source database firewall GreenSQL (version 1.3.0) as a reference for comparison. Based on the above settings, we collected the simulated network traffic using the tool Wireshark. Since the cloud itself does not affect the network traffic of web services, in this experiment, we did not use any cloud platform to collect real network traffic. Instead, we collected the real network traffic from the backbones of our campus network based on port mirroring using optical beam splitters. To enable the accurate identification of successful SQLIAs and vulnerability evaluation of leaked data caused by SQL injections, we implemented our DIAVA framework based on the programming language Python (version 2.7), Hyperscan (version v4.0.0), and CUDA (version 7.5). All the experimental results were obtained on a Linux machine with AMD 3.2GHz CPU, 16 GB RAM, and NVIDIA GeForce GTX 1050 GPU.

A. Results of SQLIA Detection

Since our approach focuses on the SQL injection detection for web applications, during the preprocessing of the real network traffic, we only collected the traffic from the frequently used ports (e.g., 80, 5000, 8080) of web-based applications. In the experiment, we randomly selected the network traffic of five days in 2016. To obtain simulated network traffic, we used SQLMap with four tampering methods as shown in Table III to generate SQL injection attacks. The first method is to encode arithmetic and logical operators in SQL with HTML reserved characters. For instance, the expression 1!=2 (or 12) can be converted into 1<>2 by HTML encoding. The second method tries to replace whitespaces within a SQL statement with empty comments. For example, we can replace the whitespace within a SQL snippet AND 1=1 with an empty comment and get AND/**/1=1. The third method is to randomly change the case of SQL reserved word letters, e.g., AND 1=1 can be converted into aNd /**/ 1=1. The last method allows us to randomly insert comments into a SQL statement, which makes the SQL injection hard to identify. It is important to note that SQLMap is extendable, i.e., all the other tampering methods can be easily integrated into SQLMap to enable the generation of unidentifiable SQLIAs. Based on the above four tampering strategies, we can attack the web application DVWA using the modified malicious SQL queries. We collected the simulated network traffic using the tool Wireshark. Note that both real and simulated network traffic data are saved in JSON files with a total size of 152.7 MB.

Table IV shows the comparison results of different SQLIA detection approaches. In this table, the first five rows use the real network traffic of the five selected days (in the form of mm/dd as shown in the second column), and the last two rows adopt the network traffic collected in the simulated network environments with two different DBMSs (i.e., MySQL and PostgreSQL). All such information is indicated by the first and second columns. The third column presents the number of victim IPs attacked


TABLE IV COMPARISON OF DIFFERENT SQLIA DETECTION APPROACHES

by SQLIAs. The fourth column denotes the total attack records in the collected network traffic, while the fifth column gives the number of identified attacks that succeeded in obtaining the confidential information from databases. Note that our approach focuses on SQLIAs that cause information leak, since it is based on network traffic. The sixth column has three subcolumns. Its first subcolumn indicates the number of detected SQLIAs by the method proposed in [31]. Its second and third subcolumns present the corresponding false positives and false negatives,² respectively. The seventh column shows the SQLIA detection results for GreenSQL [19]. Since GreenSQL itself is not based on network traffic analysis and its execution requires the connection with DBMSs, we only present the detection results for simulated SQLIAs. In the first subcolumn of the seventh column, we present the detection results in the format of x(y), where x indicates the total number of reported SQLIAs and y means the number of real successful SQLIAs with leaked data. The last column shows the detection results using our proposed framework. To reflect the effectiveness of our approach, in the first subcolumn we present the detection results in the format of x/y where x and y indicate the number of suspected SQLIAs identified in the first phase (using the first-level RegExps) and the number of SQLIAs confirmed in the second phase (using the second-level RegExps), respectively.

²Since our approach focuses on detection of SQLIAs with leaked data, we define a SQLIA without leaked data as a false positive and an undetected SQLIA with leaked data as a false negative.

From this table, we can find that our approach outperforms both approaches presented in [19] and [31]. Due to our proposed multilevel regular expression model, DIAVA can detect more SQLIAs than the two approaches. As an example for real network traffic of September 21, our approach can detect all the 3000 SQLIAs, while the approach in [31] can only detect 913 SQLIAs. For the example of simulated network traffic using PostgreSQL, our approach can detect 1373 SQLIAs. However, GreenSQL can only detect 337 SQLIAs. Furthermore, based on our bidirectional network traffic analysis method, DIAVA can accurately figure out SQLIAs with confidential data leak. We can observe that the numbers of both false positives and false negatives using DIAVA are 0, while the number of false positives of both approaches in [19] and [31] is high. This is because neither approach takes bidirectional network traffic into account. Moreover, GreenSQL has a high rate of false negatives. In other words, our approach can detect a large quantity of SQLIAs that cannot be detected by GreenSQL. For example, the following three SQLIAs cannot be detected by GreenSQL, while DIAVA can identify all of them.

SQL 1 (PostgreSQL): SELECT * FROM users WHERE id = 1 AND 7244= CAST((CHR(113)||CHR(107)||CHR(113)) || (SELECT (CASE WHEN (7244=7244) THEN 1 ELSE 0 END))::text || (CHR(113)||CHR(107)||CHR(113)) AS NUMERIC)

SQL 2 (MySQL): SELECT name,pwd FROM users WHERE id = extractvalue (1, concat(0x7e, (SELECT distinct concat(0x7e7e7e, name, 0x3a, passwd, 0x7e7e7e) FROM users limit 0, 1)))

SQL 3 (MySQL): SELECT name,pwd FROM users WHERE id = updatexml (1, concat(0x7e7e7e, (SELECT @@version), 0x7e7e7e), 1)

B. GPU-Based Analysis of Dictionary Attack

We applied our GPU-based dictionary attack analysis engine to assess the vulnerabilities of leaked data, which can be used to generate SQLIA analysis reports for cloud tenants. To show the effectiveness and efficiency of our approach, we compared our GPU-based approach with the traditional CPU-based dictionary attack method with multicores. We adopted an AMD CPU with six physical cores supporting 12 concurrent threads. Therefore, in the performance comparison, we set the number of CPU threads to be 1, 4, 8, and 12, respectively. To fully investigate the capacity of NVIDIA GeForce GTX 1050 GPU, in the experiment we set the size of GPU grids and blocks to both be 512.

In this experiment, we focused on the dictionary attack of leaked password information encrypted by hash-based encryption algorithms. Note that the leaked information collected from the real network environment uses the encryption algorithm of MD5 or SHA-1, while the leaked information collected from the simulated network environment uses the encryption algorithm of SHA-256 or SHA-512. All the leaked information was obtained from the collected traffic as shown in Table IV. To construct an appropriate dictionary for deciphering, we collected the 2000 most frequently used passwords from the open source password list [29]. To improve the success ratio of dictionary attacks, we created new passwords by concatenating each collected password with extracted characters (e.g., user name, email, date of birth) from the leaked information as a prefix or suffix, and augmented the dictionary with these newly generated passwords.

For the case of real network traffic, we extracted 97 ciphertexts encrypted by the MD5 algorithm from the attacks that happened on Oct. 16th as shown in Table IV. Fig. 6 presents the dictionary attack time for all the 97 ciphertexts using both the given GPU and CPU (with different number of threads) platforms. In the figure, the x-axis indicates the indices of the extracted 97 ciphertexts, and


Fig. 6. Dictionary attack time for ciphertexts encrypted by MD5.

Fig. 7. Successful attack time for MD5-based ciphertexts.

Fig. 8. Dictionary attack time for ciphertexts encrypted by SHA-1.

Fig. 9. Successful attack time for SHA-1-based ciphertexts.

Fig. 10. Dictionary attack time for ciphertexts encrypted by SHA-256.

the y-axis denotes the dictionary attacking time. From Fig. 6, we can observe that our GPU-based method can achieve the best performance compared with CPU-based approaches with different cores. Moreover, the more CPU cores are used, the faster the dictionary attacks complete. In this example, the average time of dictionary attacks using GPUs is 74.92 ms, while the average dictionary attacking time based on the given CPU using one thread, four threads, eight threads, and 12 threads is 1613.80 ms, 441.71 ms, 341.73 ms, and 271.39 ms, respectively. Among all the 97 ciphertexts, 17 of them were successfully decrypted. Note that since our GPU- and CPU-based approaches adopt the same underlying algorithm, their ability of dictionary attacking is the same. Fig. 7 shows the time information for the 17 successful attacks. For these 17 ciphertexts, the average attack time using GPU and CPU with one thread, four threads, eight threads, and 12 threads is 74.40 ms, 1518.67 ms, 423.07 ms, 319.60 ms, and 243.56 ms, respectively. We can find that our GPU-based approach outperforms the CPU-based approach with 12 threads significantly with a speedup of 3.27 times.

To show the efficacy of our approach for the dictionary attack of ciphertexts encrypted by the SHA-1 algorithm, we selected 105 ciphertexts from the real network traffic collected on November 1st as shown in Table IV. From Fig. 8, we can find that our GPU-based approach significantly outperforms the CPU-based approaches. The average attack time using GPU and CPU with one thread, four threads, eight threads, and 12 threads is 45.09 ms, 1632.52 ms, 458.28 ms, 355.37 ms, and 280.78 ms, respectively. We can find that our GPU-based approach outperforms the CPU-based approach with 12 threads significantly with a speedup of 6.23 times.

Among all the 105 ciphertexts encrypted by SHA-1, three of them are decrypted successfully. The attack time information is shown in Fig. 9. We can find that our GPU-based approach outperforms the CPU-based method with one thread by several orders of magnitude. For the CPU setting with 12 threads, our GPU-based approach can achieve 5.82 times improvement in terms of attacking time.

In the simulated network environment, we deployed web applications on our private cloud and saved the users’ information in a PostgreSQL database. We created 50 random SQLIAs containing ciphertexts encrypted by the SHA-256 algorithm. Fig. 10 shows the time information for these 50 attacks. The average attack time using GPU and CPU with one thread, four threads, eight threads, and 12 threads is 39.14 ms, 1631.84 ms, 466.84 ms, 347.78 ms, and 287.94 ms, respectively. Our GPU-based approach outperforms the CPU-based approach with 12 threads significantly with a speedup of 7.36 times.

Among these 50 SQLIAs, three of them with vulnerable ciphertexts were decrypted. Fig. 11 shows the time information for these three successful attacks. The average successful attack

Authorized licensed use limited to: East China Normal University. Downloaded on March 05,2020 at 01:35:52 UTC from IEEE Xplore. Restrictions apply. GU et al.: DIAVA: A TRAFFIC-BASED FRAMEWORK FOR DETECTION OF SQL INJECTION ATTACKS AND VULNERABILITY ANALYSIS 201

85.46 ms, 1529.78 ms, 419.05 ms, 295.93 ms and 261.63 ms, respectively. We can find that our GPU-based approach outper- forms CPU-based approach with 12 threads significantly with a speedup of 3.06 times.

VI. CONCLUSION Along with the prosperity of web applications that deploy their services on cloud, SQL injection attack (SQLIA) is among the most common security threats to cloud databases. Various Fig. 11. Successful attack time for SHA-256-based cipertexts. SQL injection techniques are utilized by malicious attackers to acquire or compromise sensitive data in cloud databases. How- ever, due to the unawareness of cloud tenants about the severity of SQL injection attacks, few of them adopt web application firewalls to protect their data privacy in databases. To figure out the successful SQL injection attacks as well as convince cloud tenants of the necessity of adopting WAFs, this paper proposed a traffic-based framework, which can accurately detect the SQL injection attacks and report their severity in a real-time way. Ex- perimental results using both real SQLIA traffic from Internet and synthetic SQLIAs generated by the tool SQLMap showed that our bidirectional traffic-based framework can not only accu- Fig. 12. Dictionary attack time for ciphertexts encrypted by SHA-512. rately identify SQL injection attacks with successful acquisition of sensitive data, but also enable the real-time vulnerability eval- uation of leaked data by SQLIAs.

REFERENCES [1] Positive Technology, “Web application attack statistics of Q3,” 2017. [2] P. H. O’Neill, “Years-old security flaw leads to Dota 2 forum hack that ex- posed 1.5M passwords,” 2016. [Online]. Available: https://www.dailydot. com/layer8/dota2-breach-2million/ [3] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, “Cloud computing and emerging IT platforms: Vision, hype, and reality for deliv- ering computing as the 5th utility,” Future Gener. Comput. Syst., vol. 25, no. 6, pp. 599Ð616, 2009. [4] N. M. Sheykhkanloo, “A learning-based neural network model for the Fig. 13. Successful attack time for SHA-512-based cipertexts. detection and classification of SQL injection attacks,” Int. J. Cyber Warfare Terrorism, vol. 7, no. 2, pp. 16Ð41, 2017. [5] M. Kim and D. Lee, “Data-mining based SQL injection attack detection using internal query trees,” Expert Syst. Appl., vol. 41, no. 11, pp. 5416Ð time using GPU and CPU with one thread, four threads, eight 5430, 2014. threads, and 12 threads is 37.20 ms, 1513.61 ms, 461.39 ms, [6] M. Le, A. Stavrou, and B. Kang, “DoubleGuard: Detecting intrusions in multitier web applications,” IEEE Trans. Dependable Secure Comput.,vol. 312.76 ms and 257.35 ms, respectively. Our GPU-based ap- 9, no. 4, pp. 512Ð525, Jul./Aug. 2012. proach outperforms CPU-based approach with 12 threads sig- [7] P.Chen, J. Wang, L. Pan, and H. Yu,“Research and implementation of SQL nificantly with a speedup of 6.92 times. injection prevention method based on ISR,” in Proc. Int. Conf. Comput. Commun., 2016, pp. 1153Ð1156. To check the performance of our approach for the ciphertexts [8] L. Shar and H. Tan, “Predicting common web application vulnerabilities encrypted by SHA-512 algorithm. In the simulated network, from input validation and sanitization code patterns,” in Proc. Int. Conf. we set up web applications with a MySQL database to manage Automated Softw. Eng., 2012, pp. 310Ð313. 
[9] WhiteHat Security, “WhiteHat website security statistic report,” 2010. the users’ information. We randomly generated 50 SQLIAs for [10] Y. Xie and A. Aiken, “Static detection of security vulnerabilities in script- evaluation. Fig. 12 shows the attack time for the correspond- ing languages,” in Proc. USENIX Secur. Symp., 2009, pp. 179Ð192. ing cipertexts extracted from the simulated traffic. For the 50 [11] V. B. Livshits and M. S. Lam, “Finding security vulnerabilities in Java applications with static analysis,” in Proc. USENIX Secur. Symp., 2005, SQLIAs, the average attack time using GPU and CPU with one p. 18. thread, four threads, eight threads, and 12 threads is 92.53 ms, [12] S. Bandhakavi, P. Bisht, P. Madhusudan, and V. N. Venkatakrishnan, 1771.99 ms, 489.95 ms, 360.67 ms, and 312.56 ms, respectively. “CANDID: Preventing SQL injection attacks using dynamic candi- date evaluations,” in Proc. ACM Conf. Comput. Commun. Secur., 2007, Our GPU-based approach outperforms the CPU-based approach pp. 12Ð24. with 12 threads significantly with a speedup of 3.38 times. [13] L. K. Shar, H. B. K. Tan, and L. C. Briand, “Mining SQL injection and Fig. 13 presents the time information of successful dictionary cross site scripting vulnerabilities using hybrid program analysis,” in Proc. Int. Conf. Softw. Eng., 2013, pp. 642Ð651. attacks. Among the 50 attacks, three of them were successful. [14] Y. Wang and Z. Li, “SQL injection detection via program tracing and For the 50 SQLIAs, the average attack time using GPU and CPU machine learning,” in Proc. Int. Conf. Internet Distrib. Comput. Syst., 2012, with one thread, four threads, eight threads, and 12 threads is pp. 264Ð274.
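As a rough illustration of the dictionary attack evaluated in the experiments above, the following Python sketch hashes each candidate word and compares the digest with the given ciphertext. It is only a minimal, sequential CPU sketch under stated assumptions, not the GPU-based engine of DIAVA: the function name dictionary_attack, the toy word list, and the assumption of unsalted hexadecimal digests are ours, while cptxt and algo follow the notation of this paper.

```python
import hashlib
from typing import Iterable, Optional


def dictionary_attack(cptxt: str, algo: str, dictionary: Iterable[str]) -> Optional[str]:
    """Return the first dictionary word whose digest equals cptxt, else None.

    Assumption: cptxt is an unsalted hexadecimal digest and algo is a name
    accepted by hashlib.new(), e.g. "sha1", "sha256", or "sha512".
    """
    target = cptxt.strip().lower()
    for word in dictionary:
        digest = hashlib.new(algo, word.encode("utf-8")).hexdigest()
        if digest == target:
            return word  # plaintext recovered: the leaked value is vulnerable
    return None  # no dictionary word matches: the attack fails


# Hypothetical toy dictionary; a real evaluation would use a large word list.
toy_dictionary = ["123456", "password", "letmein"]
leaked = hashlib.sha256(b"letmein").hexdigest()
recovered = dictionary_attack(leaked, "sha256", toy_dictionary)
```

Because every variant tests the same candidates against the same digest, a multithreaded CPU or GPU implementation of this loop recovers exactly the same plaintexts; only the elapsed time differs, which is what the timing figures compare.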

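The 12-thread speedups quoted in the experiments are the ratio of the average CPU attack time to the average GPU attack time. The short Python check below recomputes them from the averages reported in the text. The dictionary keys and the helper name speedups are labels of our own choosing, and the 5.82 times figure for Fig. 9 is left out because its underlying averages are not listed in the text.

```python
# Average dictionary-attack times in ms, copied from the reported results.
# Tuple layout: (GPU, CPU 1 thread, CPU 4 threads, CPU 8 threads, CPU 12 threads).
REPORTED_MS = {
    "Fig. 6, all 97 ciphertexts": (74.92, 1613.80, 441.71, 341.73, 271.39),
    "Fig. 7, 17 successful": (74.40, 1518.67, 423.07, 319.60, 243.56),
    "Fig. 8, SHA-1, all 105": (45.09, 1632.52, 458.28, 355.37, 280.78),
    "Fig. 10, SHA-256, all 50": (39.14, 1631.84, 466.84, 347.78, 287.94),
    "Fig. 11, SHA-256, 3 successful": (37.20, 1513.61, 461.39, 312.76, 257.35),
    "Fig. 12, SHA-512, all 50": (92.53, 1771.99, 489.95, 360.67, 312.56),
    "Fig. 13, SHA-512, 3 successful": (85.46, 1529.78, 419.05, 295.93, 261.63),
}


def speedups(row):
    """Return CPU-time / GPU-time ratios for 1, 4, 8, and 12 threads."""
    gpu, *cpu = row
    return [t / gpu for t in cpu]


for label, row in REPORTED_MS.items():
    ratios = ", ".join(f"{r:.2f}x" for r in speedups(row))
    print(f"{label}: {ratios}")
```

The last ratio in each printed row is the GPU speedup over the 12-thread CPU setting. For the rows listed here, the single-thread CPU time is roughly 18 to 42 times the GPU time.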

Haifeng Gu (S’18) received the B.E. and M.E. degrees from the Department of Computer Science and Technology, Sichuan Normal University, Chengdu, China, in 2013 and 2015, respectively. He is currently working toward the Ph.D. degree in software engineering with the Department of Embedded Software and System, East China Normal University, Shanghai, China.
His research interests include the area of cloud reliability, hardware/software covalidation, symbolic execution, statistical model checking, and software testing.

Jianning Zhang received the B.E. degree from the Software Engineering Institute, East China Normal University, in 2018. He is currently working toward the Ph.D. degree in software engineering with the Department of Embedded Software and System, East China Normal University, Shanghai, China.
His research interests include the area of computer architecture, design automation of embedded systems, and software engineering.

Tian Liu received the B.S. and M.E. degrees from the Department of Computer Science and Technology, Hohai University, Nanjing, China, in 2011 and 2014, respectively, and the engineering degree from the Department of Information and Statistic, Polytech’Lille, Villeneuve-d’Ascq, France, in 2012. He is currently working toward the Ph.D. degree in software engineering with the School of Computer Science and Software Engineering, East China Normal University.
His research interests include the area of GPU-based acceleration, cloud computing, emerging nonvolatile memory, embedded systems, and machine learning.

Ming Hu (S’19) received the B.E. degree from the School of Computer Science and Software Engineering, East China Normal University, Shanghai, China, in 2017. He is currently working toward the Ph.D. degree in software engineering with the Department of Embedded Software and System, East China Normal University, Shanghai, China.
His research interests include the area of program analysis, design automation of cyber-physical systems, and software testing.

Junlong Zhou (S’15–M’17) received the M.E. and Ph.D. degrees in computer science from East China Normal University, Shanghai, China, in 2014 and 2017, respectively.
He was a Visiting Scholar with the University of Notre Dame, Notre Dame, IN, USA, during 2014–2015. He is currently an Assistant Professor with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China. He was a Research Visitor with the University of Notre Dame. He has published several papers in the areas of his research interests, which include real-time embedded systems, cloud computing, and cyber-physical systems.
He is an Active Reviewer of several international journals, including the IEEE TRANSACTIONS ON COMPUTERS, IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, Journal of Systems and Software, and Journal of Circuits, Systems, and Computers.

Tongquan Wei (S’06–M’11) received the Ph.D. degree in electrical engineering from Michigan Technological University, Houghton, MI, USA, in 2009.
He is currently an Associate Professor with the Department of Computer Science and Technology, East China Normal University. He has served as a Regional Editor for the Journal of Circuits, Systems, and Computers since 2012. He also served as a Guest Editor for several special sections of the IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS and ACM TECS. His research interests include the areas of green and reliable embedded computing, cyber-physical systems, parallel and distributed systems, and cloud computing.

Mingsong Chen (M’11–SM’17) received the B.S. and M.E. degrees from the Department of Computer Science and Technology, Nanjing University, Nanjing, China, in 2003 and 2006, respectively, and the Ph.D. degree in computer engineering from the University of Florida, Gainesville, FL, USA, in 2010.
He is currently a Professor with the School of Computer Science and Software Engineering, East China Normal University. He is an Associate Editor of IET Computers and Digital Techniques, and Journal of Circuits, Systems and Computers. His research interests include the area of cloud computing, design automation of cyber-physical systems, parallel and distributed systems, and formal verification techniques.
