
INSTITUTE FOR DEVELOPMENT AND RESEARCH IN BANKING TECHNOLOGY

AUTOMATIC DETECTION OF SECURITY VULNERABILITIES IN SOFTWARE APPLICATIONS USING STATIC ANALYSIS

Avinash Uttav
Sophomore, B.Tech. Computer Science and Engineering
Indian Institute of Technology Guwahati

Guide: Dr. V. Radha

CERTIFICATE

This is to certify that Mr. Avinash Uttav, pursuing the B.Tech. course at the Indian Institute of Technology Guwahati with Computer Science and Engineering as his major subject, has undertaken a project as an intern at the Institute for Development and Research in Banking Technology (IDRBT), Hyderabad, from May 7th to July 9th, 2012.

He was assigned the project “Automatic Detection of Security Vulnerabilities in Software Applications Using Static Analysis” under the guidance of Dr. V. Radha, Assistant Professor, IDRBT. During the course of the project he undertook a study of prevailing security vulnerabilities and of the CODAN framework in Eclipse for building checkers.

Mr. Avinash Uttav has done excellent work on the project assigned to him.

We wish him all the best in all his endeavors.

Dr. V.Radha (Project Guide)

Assistant Professor IDRBT, Hyderabad


ACKNOWLEDGEMENT

I would like to express my sincere gratitude to the Institute for Development and Research in Banking Technology (IDRBT), and particularly to Dr. V. Radha (Assistant Professor), who was my guide in this project. The opportunity to learn about security vulnerabilities that many good programmers do not know about, and to apply static analysis using CODAN in Eclipse, was a boon to me, as one rarely gets such exposure. I would not hesitate to add that this short stint at IDRBT has added a different facet to my life. It is a unique organization that combines academics, research, technology, communication services, crucial applications, etc., while at the same time acting as an arm of regulation, spreading technology, facilitating the implementation of technology in banking and non-banking systems, playing the role of an NGO (without being one), and carrying out many other varied activities.

I want to express my sincere gratitude to Shri B. Sambamurthy, Director of IDRBT, for his great support towards all the project trainees and his great enthusiasm towards achieving academic excellence in the institute. The support of Dr. M. V. N. K. Prasad, in charge of project trainees, was also overwhelming and worth appreciating.

I am extremely grateful to Dr. V. Radha for her advice, innovative suggestions and supervision. I thank her for introducing me to an excellent software application and giving me the opportunity to approach diverse sections of people, from bankers to the general public.

I am very thankful to Mr. Anil Kumar, M.Tech. Information Technologies at IDRBT, for helping me to get familiar with the application. He gave me a chance to study the application and its use from a different perspective. I am thankful to my college, IIT Guwahati, for giving me this golden opportunity to work in a high-end research institute like IDRBT. I am thankful to IDRBT for providing such an amazing platform for students to work in real application-oriented research. Finally, I thank one and all who made this project successful, either directly or indirectly.

Avinash Uttav Project Trainee Department of Information Technology IDRBT, Hyderabad


ABSTRACT

Much of the blame for security violations goes to bad software, and in particular to mistakes at the coding level. Even the best security algorithms can be broken due to bad programming. There is a need to detect vulnerabilities such as buffer overflows and format string bugs in software before it is deployed. If the presence of these vulnerabilities can be verified at the time the program is written, the developer can correct them then and there. We developed “checkers” using Codan, a light-weight static analysis framework in CDT (Eclipse's C/C++ Development Tooling project). Our checkers perform real-time analysis on the abstract syntax tree of the code to find some format string and buffer overflow vulnerabilities in C programs.

We used regular expressions for detecting format string vulnerabilities. We examined many buffer overflow vulnerabilities and identified three common characteristics present in them: one, data is read from an untrusted source; two, untrusted data is insufficiently validated; and three, untrusted data is used in a potentially vulnerable function. We used taint analysis to find buffer overflow vulnerabilities.


Contents:

1. Introduction
2. Software Vulnerabilities
3. Illustration of Vulnerabilities
4. Static Analysis
5. Proposed Method
   • Detecting Format String Vulnerabilities
   • Detecting Buffer Overflow Vulnerabilities for Intra-Functional Use
   • Detecting Buffer Overflow Vulnerabilities for Inter-Functional Calls
6. Implementation Details
7. Conclusion


INTRODUCTION

Software vulnerabilities are one of the main causes of security incidents in computer systems. In 2004, the United States Computer Emergency Readiness Team released 27 security alerts, 25 of which reported a critical software vulnerability [31].

Software vulnerabilities arise from deficiencies in the design of computer programs or mistakes in their implementation. An example of a design flaw is the Solaris sadmin service, which allows any unprivileged user to forge their security credentials and execute arbitrary commands as root. The solution to this problem is to redesign the software and enforce the use of a stronger authentication mechanism. Vulnerabilities of this kind are harder to fix, but fortunately, they are rare. Most software vulnerabilities are a result of programming mistakes, in particular the misuse of unsafe and error-prone features of the C programming language, such as pointer arithmetic, lack of a native string type and lack of array bounds checking.

Though the causes of software vulnerabilities are not much different from the causes of software defects in general, their impact is a lot more severe. A user might be willing to save their work more often in case a program crashes, but there is little they can do to lessen the consequences of a security compromise. This makes the problem of detecting existing vulnerabilities and preventing new ones very important for all software developers.

The oldest approach to finding software vulnerabilities is manual source code auditing. As the name suggests, this method involves reading the source code of a program and looking for security problems. It is similar to the code review process common in the software engineering discipline, but it places additional requirements on the auditor. In addition to familiarity with the software architecture and source code, he or she should have considerable computer security expertise to be able to identify vulnerable code. A comprehensive source code audit requires a lot of time and its success depends entirely on the skills of the auditor. Despite these shortcomings, many vulnerabilities have been found using this method and it is still considered one of the best approaches.

A second option is so-called fuzz testing, or fuzzing. This method works by feeding invalid or malformed data to a program and monitoring its responses. A segmentation fault or an unhandled exception while processing malformed input is a good indication of a vulnerability. The fuzzing data may be randomly generated, or specifically crafted to exercise corner cases. It can be fed to a program as command line parameters, network packets, input files, or any other way a program accepts user input. Fuzzing frameworks such as SPIKE [1] and Peach [6] allow for rapid development of new application- or protocol-specific testing tools that can then be used for vulnerability detection. The advantage of fuzzing is that it can be easily automated and used for regression testing during development.

Runtime checking is a technique for preventing the exploitation of vulnerabilities, but it can also be used for vulnerability detection, especially in combination with fuzz testing. This method involves inserting additional checks into a program to ensure that its behavior conforms to a certain set of restrictions. For example, Mudflap [7] is a pointer use checking extension to the GCC compiler. It adds instrumentation code to potentially unsafe pointer operations in C programs and detects errors such as NULL pointer dereferencing, buffer overflows and memory leaks. ProPolice [9] is another GCC extension that uses instrumentation to protect applications from stack smashing attacks. The Microsoft Visual C++ compiler includes a similar feature. Runtime checking can help uncover vulnerabilities during fuzz testing, even if they do not result in a system crash.

A disadvantage of both fuzz testing and runtime checking is that some vulnerabilities can be triggered only under a very specific set of circumstances, which might not arise during regular use of the program. For example, exploiting an application might require sending a long sequence of specifically crafted network packets to reach a vulnerable state. Such vulnerabilities can be found during a manual source code audit, but the process is often too time-consuming.

Static source analysis tools greatly increase the efficiency of source code auditing. Tools like Flawfinder [35] and SPLINT [10] examine the source code of a program and report possible vulnerabilities. Due to limitations inherent in source code analysis, these tools produce both false positives (reporting safe code as vulnerable) and false negatives (missing vulnerable code). Despite these limitations, static source analysis tools are very useful for guiding security auditors to potentially vulnerable code and significantly reduce the time it takes to perform an audit.


SOFTWARE VULNERABILITIES

In this section we give an overview of software vulnerabilities commonly found in computer programs.

Format string bugs:

Format string bugs belong to a relatively recent class of attacks. The first published reports of format string bugs appeared in 2000, followed by the rapid discovery of similar vulnerabilities in most high-profile software projects. These include the Apache web server, wu-ftpd FTP server, OpenBSD kernel and many others. The format string bug is a relatively simple vulnerability. It arises when data received from an attacker is passed as a format string argument to one of the output formatting functions in the standard C library, commonly known as the printf family of functions.

A format function is a special kind of ANSI C function that takes a variable number of arguments, one of which is the format string. A format string is an ASCIIZ string used to specify and control the representation of different variables. A format function uses the format string to convert C data types into a string representation. These are called variadic functions since they accept a variable number of arguments. The input format string decides the number of arguments to be read off the stack by the function.

When an attacker gains control of the format string passed to a format function, the dangers include: a) the attacker can crash the program, b) he can view the process memory and the stack, and c) he can overwrite arbitrary memory.

Consider the following format function

printf("%s", a); (correct)

The first argument to printf() is a format string that specifies the number and types of the other arguments. Since no checking is done for printf() either at run time or compile time to verify whether it is called with the correct number and type of arguments, the following printf() call can be dangerous:

printf(a); (may be incorrect)


If the format string is under the control of the attacker, the printf() function can be used to write data to an arbitrary memory location. An attacker can use this to modify the control flow of a vulnerable program and execute code of his or her choice.

Buffer Overflows

A buffer overflow generally occurs due to a programming flaw that allows more data to be written to a buffer than the buffer was designed to hold. This overflow can overwrite useful information, but what makes these attacks particularly damaging is that an attacker can often overflow the buffer in such a way that code of the attacker’s choosing executes on the victim’s machine. Such buffer overflow attacks are somewhat delicate, and developing such an attack is usually time-consuming, and requires a significant amount of trial and error.

The classic buffer overflow is a result of misuse of string manipulation functions in the standard C library. The C language has no native string type and uses arrays of characters to represent strings. When reading strings from the user or concatenating existing strings, it is the programmer’s responsibility to ensure that all character arrays are large enough to accommodate the length of the strings. Unfortunately, programmers often use functions such as strcpy() and strcat() without verifying their safety. This may lead to user data overwriting memory past the end of an array.

• String buffer overflow

    char dst[256];
    scanf("%s", s);
    strcpy(dst, s);

The string s is read from an untrusted source and its length is controlled by the attacker. The string is copied into a buffer of fixed size using the strcpy() function. If the length of the string is more than the size of the destination buffer, the strcpy() function will overwrite data present after the end of the buffer.

• Memcpy buffer overflow

    char src[10], dst[10];
    int n;
    scanf("%d", &n);
    memcpy(dst, src, n);

The integer n is read from an untrusted source and its value is controlled by the attacker. No checking is done on n before it is passed as the third argument to memcpy(). If the attacker chooses a value of n bigger than the size of the destination buffer, the memcpy() function will overwrite data present after the end of the buffer.

• Gets buffer overflow

    char s[256];
    gets(s);

The gets() function is used to read user input and store it into a fixed size buffer. By supplying more data than the buffer can hold, an attacker can overwrite data present after the end of the buffer.

• Scanf buffer overflow

    char s[256];
    scanf("%s", s);

Scanf is used to read user input and store it into a fixed size buffer. If the "%s" format specifier is used, scanf will read user input until either whitespace or the end of file is reached. By supplying more data than the buffer can hold, an attacker can overwrite data present after the end of the array.


Integer Overflows

A third kind of software vulnerability is the integer overflow. These vulnerabilities are harder to exploit than buffer overflows and format string bugs, but despite this they have been discovered in OpenSSH, Internet Explorer and the kernel. There are two kinds of integer issues: sign conversion bugs and arithmetic overflows. Sign conversion bugs occur when a signed integer is converted to an unsigned integer. On most modern hardware a small negative number, when converted to an unsigned integer, will become a very large positive number. Consider the following C code:

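    1 char buf[10];
    2 int n = read_int();
    3 if (n < (int) sizeof(buf))
    4     memcpy(buf, src, n);

(The cast on line 3 makes the comparison signed. Comparing n directly against sizeof(buf) would convert n to an unsigned type, and a negative n would then fail the check.)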

On line 2 we read an integer n from the user. On line 3 we check if n is smaller than the size of a buffer and if it is, we copy n bytes into the buffer. If n is a small negative number, it will pass the check. The memcpy() function expects an unsigned integer as its third parameter, so n will be implicitly converted to an unsigned integer and become a large positive number. This leads to the program copying too much data into a buffer of insufficient size and is exploitable in similar fashion to a buffer overflow.

Arithmetic overflows occur when a value larger than the maximum integer size is stored in an integer variable. The C standard says that an arithmetic overflow causes “undefined behavior”, but on most hardware the value wraps around and becomes a small positive number. For example, on a 32-bit Intel processor incrementing 0xFFFFFFFF by 1 will wrap the value around to 0. The main cause of arithmetic overflows is addition or multiplication of integers that come from an untrusted source. If the result is used to allocate memory or as an array index, an attacker can overwrite data in the program. Consider the following example of a vulnerable program:

    1 int n = read_int();
    2 int *a = malloc(n * 4);
    3 a[1] = read_int();


On line 1 we read an integer n from the user. On line 2 we allocate an array of n integers. To calculate the size of the array, we multiply n by 4. If the attacker chooses the value of n to be 0x40000000, the result of the multiplication will overflow the maximum integer size of the system and will become 0. The malloc() function will allocate 0 bytes, but the program will believe that it has allocated enough memory for 0x40000000 array elements. The array access operation on line 3 will overwrite data past the end of the allocated memory block.


Illustration of Printf Vulnerability:

The printf format function is a special kind of ANSI C function that takes a variable number of arguments, one of which is the format string. A format string is an ASCIIZ string used to specify and control the representation of different variables. A format function uses the format string to convert C data types into a string representation. These are called variadic functions since they accept a variable number of arguments. The input format string decides the number of arguments to be read off the stack by the function.

When an attacker gains control of the format string passed to a format function, the dangers include:

• The attacker can view process memory
• He can crash the program
• He can view the stack

How to view process memory

Example:

    #include <stdio.h>

    int main(int argc, char **argv)
    {
        char *secret1 = "this is secret 1\n";
        char *secret2 = "this is secret 2\n";

        printf(argv[1]);    /* vulnerable: user input is the format string */
        return 0;
    }

We saved this program as fs.c, compiled it with gcc, and wrote the executable to fs using the -o option.


Figure 1: Compilation of the program

We then executed the program, passing the string “idrbt” as a command line argument. The program displayed idrbt as the output.

Figure 2: Execution of the program with input idrbt

Now if the attacker controls the program input and gives “%s” as the command line argument, printf treats it as a format specifier, retrieves the first argument from the stack, and displays the string it points to as the output.


Figure 3: Execution of the program with input %s

If the attacker gives several format specifiers, the printf function retrieves one argument from the stack for each of them and displays them as output.

Figure 4: Execution of the program with input %s%s%s%s%s%s

The local variables of the program, including the pointers to its strings, are stored on the stack, so if the attacker gives more format specifiers, this data can be retrieved.


Figure 5: Execution of the program with input %s%s%s%s%s%s%s

In order to understand how it happens we must load the program into gdb. GDB, the GNU Project debugger, allows us to see what is going on `inside' another program while it executes. We loaded the program into gdb.


Figure 6: loaded the program in gdb

View the assembler code of the main function.

Figure 7: assembler code of the program

In this program we set a breakpoint at the printf statement.


Figure 8 : viewing the program stack and its elements

We then executed the program, passing the string “idrbt” as the command line argument. Since we set a breakpoint at the printf function, the program stops there.

Then we viewed the stack at this stage. The first eight addresses of the stack are shown in the above snapshot.

If we view the data present at these addresses, the second address points to the string idrbt. The last address points to the data “this is secret 1” given in the program. The second-to-last address points to the data “this is secret 2”. Therefore, if the attacker gives more specifiers, he can retrieve the data present in the program.

How to crash the program

Some value on the stack may not be a valid address. When printf follows such a value for a %s specifier, it points to nothing and the program crashes. When we view the first 16 locations of the stack, there is an address which is out of bounds.

Figure 9: giving more specifiers

How to view the stack

The %x specifier prints its argument in hexadecimal form. If we give “%x”, printf prints the hexadecimal value of the first location. If we modify it slightly and give “0x%08x”, it prints the value as eight zero-padded hexadecimal digits with a 0x prefix, which looks like an address.


If we repeat the above sequence, we can view the stack, as shown in the snapshot below.

Figure 10: Execution of the program


STATIC ANALYSIS

Static source analysis tools examine the source code of a program without executing it. The purpose of these tools is to extract some information from the source code or make judgments about it. The most common use of static analysis is in optimizing compilers. The sophistication of the analysis performed by tools varies from those that only consider the behavior of individual statements and declarations, to those that include the complete source code of a program in their analysis.

Different Methods of Static Analysis

The first thing a static analysis tool needs to do is transform the code to be analyzed into a program model, a set of data structures that represent the code. As you would expect, the model a tool creates is closely linked to the kind of analysis it performs, but generally static analysis tools borrow a lot from the compiler world. In fact, many static analysis techniques were developed by researchers working on compilers and compiler optimization problems. There are different methods to detect software vulnerabilities.

Pattern Matching

This is the simplest static analysis technique. Pattern matching checks a sequence of characters or tokens for the presence of some pattern. For example, it can be used to find all occurrences of “strcpy” in the source code. Most of these would be calls to the strcpy() function in the standard C library, which is often misused and is a good indicator of a potential vulnerability. Further inspection of each call site is required to determine which calls to strcpy() are safe and which are not. Pattern matching cannot detect complicated vulnerabilities. Another drawback is that the number of false positives can be very large.

Lexical Analysis

Tools that operate on source code begin by transforming the code into a series of tokens, discarding unimportant features of the program text such as whitespace or comments along the way. The creation of the token stream is called lexical analysis. The tokens are matched against a database of known vulnerability patterns. Lexing rules often use regular expressions to identify tokens. This technique is used by tools like Flawfinder [6], RATS [3] and ITS4 [4]. Lexical analysis improves the accuracy of pattern matching, because a lexer can handle irregular whitespace and code formatting. The number of false positives reported by these tools is still very high.

Parsing and AST analysis

The next step in improving the accuracy of static analysis is parsing the source code and building an abstract syntax tree (AST) representation of the program. This work is typically performed by the frontend of a compiler and this offers an opportunity for code reuse. A language parser uses a context-free grammar (CFG) to match the token stream. The grammar consists of a set of productions that describe the symbols (elements) in the language. The parser performs a derivation by matching the token stream against the production rules. If each symbol is connected to the symbol from which it was derived, a parse tree is formed.

The abstract syntax tree allows us to analyze not only the syntax, but also the semantics of a program. Lexical analysis tools will be confused by a variable with the same name as a vulnerable function, but AST analysis will accurately distinguish the different kinds of identifiers. On the AST level macros and complicated expressions are expanded, which can reveal vulnerabilities hidden from lexical analysis tools. The pattern matching approach can be significantly improved by matching ASTs instead of sequences of tokens or characters.

Tracking Control Flow

Many static analysis algorithms (and compiler optimization techniques) explore the different execution paths that can take place when a function is executed. To make these algorithms efficient, most tools build a control flow graph on top of the AST or intermediate representation. The nodes in a control flow graph are basic blocks: sequences of instructions that will always be executed starting at the first instruction and continuing to the last instruction, without the possibility that any instructions will be skipped. Edges in the control flow graph are directed and represent potential control flow paths between basic blocks. Back edges in a control flow graph represent potential loops. We can apply reachability analysis on this control flow graph to see whether data flows into privileged sections and violates the security policy.

Type qualifiers

Some more advanced vulnerability detection tools are based on the type qualifier framework developed by Jeffrey Foster [2]. He describes type qualifiers as properties that “qualify” the standard types in languages such as C, C++, Java and ML. Most of these languages already have a limited number of type qualifiers (for example the const and register keywords in C programs). Foster proposes a general purpose system for adding user-defined type qualifiers by annotating the source code and detecting type inconsistencies by type qualifier inference.

Tracking Dataflow

Detecting buffer overflows, integer overflows, array indexing, or pointer arithmetic problems requires more complicated analyses than pattern matching or type inference. These vulnerabilities arise when variables in a program can take on values outside a certain safe range. For example a strcpy() function call is vulnerable when the size of the source string exceeds the size of the destination buffer. Detecting the conditions under which this can occur without executing the program is a hard problem. In any nontrivial program there are dependencies between the data manipulated by the code, which further complicates the task. Data-flow analysis is a traditional compiler technique for solving similar problems and can be used as a basis of vulnerability detection systems.

Taint Analysis

Security tools need to know which values in a program an attacker could potentially control. Using dataflow to determine what an attacker can control is called taint propagation. It requires knowing where information enters the program and how it moves through the program. Taint propagation is the key to identifying many input validation and representation defects. The concept of tainting refers to marking data coming from an untrusted source as “tainted” and propagating its status to all locations where the data is used. A security policy specifies what uses of untrusted data are allowed or restricted. An attempt to use tainted data in a violation of this policy is an indication of vulnerability.

The best-known example of this technique is the taint mode provided by the Perl programming language [20]. When running in this mode, the interpreter flags all data read from files, network sockets, command line arguments, environment variables and other untrusted sources as tainted. The language provides facilities for untainting (or “laundering”) untrusted data after the programmer has verified that it is safe. Tainted data may not be used in any function which modifies files, directories and processes, or executes external programs. If this rule is violated, the interpreter will abort the execution of the program.

Tainting is particularly well suited for interpreted environments because data can be flagged and inspected at runtime. Applying tainting to a compiled programming language requires static analysis and does not guarantee the same level of precision. The type qualifier inference used with great success by Shankar [8] to detect format string bugs is a form of taint analysis.

Taint analysis is used to determine the program locations where data is read from an untrusted source and where tainted data is used as an argument to a potentially vulnerable function. This is a forward data-flow analysis problem and can be solved efficiently using an iterative algorithm. There are only two possible lattice values: TAINTED and NOT TAINTED. The meet function for the lattice is based on the idea that anything that meets a tainted value becomes tainted.


PROPOSED METHOD

In this chapter we explain our methods to detect printf format string and buffer overflow vulnerabilities. We want to detect vulnerabilities at the time the code is written, which helps the developer correct them then and there. Eclipse makes this possible through Codan, a tool that performs real-time analysis on the code to find common defects, violations of policies, etc. We develop “checkers” using Codan. Eclipse internally creates an abstract syntax tree for the given program. We use this abstract syntax tree to find format string vulnerabilities and buffer overflow vulnerabilities.

Codan

Codan is a light-weight static analysis framework in CDT (CDT is Eclipse's C/C++ Development Tooling project) which performs real-time analysis on the code to find common defects, violations of policies, etc. The framework contains common components that are shared between static analysis tools for C/C++, such as:

• User interface to control problem enablement and parameters
• Different launch modes (as you type, on demand, as a builder)
• An Eclipse view to display additional problem information (such as extra backtrace or more complex problem parameters)
• A generic marker type for problems with extra fields
• API to run and log the problems
• Base classes for checkers
• JUnit testing framework
• QuickFix base classes

Profile Editor (Problem Preferences)

• We can enable or disable our checker.
• The severity of the problem is specified, and we can change it.
• When we place the cursor on a checker, its description is displayed.
• Codan is used only to find vulnerabilities.


Figure 11: Codan profile

Codan is used by many people. The Codan users are:

• Tool vendors
    • To create plug-ins containing end-user checkers and templates
• Software architects, process enforcement people
    • To create customized new checkers, based on templates (no programming involved)
    • To create problem profiles
• Developers, testers, code inspectors
    • To check for errors as you type and have a quick way to fix them, during development
    • To find bugs, security violations, API violations, coding standard violations during debugging, testing, code inspection or before code commit


Developing a Checker in Codan

To create a checker you have to extend one of the framework classes, and add extension points where you specify a class name and describe a problem that checker finds, which includes name, message, default severity, etc. Name of the problem would be visible on the Configuration page.

To create a checker:

1. Define the problem(s) that your checker is capable of finding, cross-check existing checkers to see if it is already there, and create a bug report describing your intention to implement it.
2. Create an extension to define your checker and the problem(s) it can find; define a new category or assign it to an existing one.
3. Create an extension point for org.eclipse.cdt.codan.core.checkers. Specify one or more problems that your checker can detect, with name, id, default severity and enablement.
4. Extend the abstract checker that supports a given model, and implement it. Current base classes: AbstractCheckerWithProblemPreferences (Resource), AbstractCIndexChecker (Index), AbstractIndexAstChecker (C/C++ AST per compilation unit), AbstractAstFunctionChecker (C/C++ AST per function).
5. Create a quick fix for the problem (optional).
6. Create an extension to the problem view (optional).
7. Create JUnit test cases.

Extension Point:

... messagePattern="Statement has no effect" />

Abstract Syntax Tree

We use the abstract syntax tree to detect security vulnerabilities, so our approach is a form of static analysis. Eclipse builds the abstract syntax tree internally for the given code and exposes it as a tree structure that we can traverse.

An abstract syntax tree (AST) is a tree representation of the abstract syntactic structure of source code written in a programming language. Each node of the tree denotes a construct occurring in the source code like literals, expressions, controls etc. The syntax is 'abstract' in the sense that it does not represent every detail that appears in the real syntax. For instance, grouping parentheses are implicit in the tree structure, and a syntactic construct such as an if-condition-then expression may be denoted by a single node with two branches.

An abstract syntax tree reflects the semantic structure of a program. Analysis over the abstract syntax tree is used to verify context conditions, and to provide information for translating or interpreting the program. While writing the code, the abstract syntax tree of the newly written code is appended to the previous abstract syntax tree.
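As a small illustration of what "abstract" means here, the sketch below parses an expression and prints the shape of its tree. It uses Python's built-in ast module on a Python expression purely for convenience; it is not the Eclipse CDT AST used in this project, and the helper names shape and parse_shape are our own.

```python
import ast


def shape(node):
    """Render an expression AST as a nested tuple of node names."""
    if isinstance(node, ast.BinOp):
        return (type(node.op).__name__, shape(node.left), shape(node.right))
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return node.value
    raise ValueError("unsupported node: " + type(node).__name__)


def parse_shape(source):
    """Parse one expression and return the shape of its AST."""
    return shape(ast.parse(source, mode="eval").body)


# Parentheses leave no node of their own; they only decide the nesting.
print(parse_shape("(a + b) * c"))  # ('Mult', ('Add', 'a', 'b'), 'c')
print(parse_shape("a + b * c"))    # ('Add', 'a', ('Mult', 'b', 'c'))
```

Redundant parentheses, as in ((a)) + b, produce exactly the same tree as a + b, which is the sense in which grouping is implicit in the tree structure.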

Consider the following program:

    #include <stdio.h>
    int main()
    {
        int a;
        printf("enter a number");
        scanf("%s", &a);
        printf(a);
    }

The abstract syntax tree of the above program is given below:


Figure 12: Abstract Syntax Tree

The abstract syntax tree viewer is available in the CDT Testing feature (the org.eclipse.cdt.ui.tests package). To view the abstract syntax tree of a program, this feature must be selected together with CDT at installation time.


Figure 13: installation window of codan

Detecting Format String Vulnerabilities

The format string bug is a relatively simple vulnerability. It arises when data received from an attacker is passed as a format string argument to one of the output formatting functions in the standard C library, commonly known as the printf family of functions.

Regular Expressions

We use regular expressions to detect these types of vulnerabilities. In computing, a regular expression provides a concise and flexible means to "match" (specify and recognize) strings of text, such as particular characters, words, or patterns of characters. Regular expressions are a way to describe a set of strings based on common characteristics shared by each string in the set. They can be used to search, edit, or manipulate text and data. Regular expressions follow their own syntax.

In Java, regular expressions are supported by the java.util.regex API. This package consists of three classes: Pattern, Matcher, and PatternSyntaxException.

• A Pattern object is a compiled representation of a regular expression. We create one by invoking one of the public static compile methods of Pattern, which return a Pattern object.
• A Matcher object is the engine that interprets the pattern and performs match operations against an input string. It is obtained by invoking the matcher method on a Pattern object.
• A PatternSyntaxException object is an unchecked exception that indicates a syntax error in a regular expression pattern.
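The compile-then-match workflow can be tried out quickly in Python, whose re module plays a role similar to java.util.regex. This is only an analogy for illustration: re.compile corresponds roughly to Pattern.compile, fullmatch to Matcher.matches, and re.error to PatternSyntaxException. The identifier pattern and the helper is_valid_pattern are assumptions made for this example, not part of the project.

```python
import re


def is_valid_pattern(text):
    """Return True if text compiles as a regular expression."""
    try:
        re.compile(text)
        return True
    except re.error:
        return False


# Compile once, then reuse the compiled pattern for every match.
identifier = re.compile(r"[_a-zA-Z][_a-zA-Z0-9]*")

print(identifier.fullmatch("argv") is not None)    # True
print(identifier.fullmatch("9lives") is not None)  # False
print(is_valid_pattern("[a-z"))                     # False: unterminated set
```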


Algorithm:

In our method we detect 2 types of printf vulnerabilities.

Type 1: The vulnerabilities that are due to not specifying the format string in a format function.

printf(a);

printf(argv[1]);

Type 2: The vulnerabilities that are due to excess number of format specifiers in format string when compared to the actual number of variables given in the format function.

printf("%s%s%s",a);

printf("%d%d0x%08x%s",a);

Our method consists of 4 major parts.

1. Expressing the regular expression for the vulnerable function.

2. Finding the signature of the function call.

3. Matching the signature of the function call with regular expression of the vulnerable function call.

4. Comparing the format specifiers in format string with arguments in format function.

Method:

1 Regular Expression: We take the regular expression of the vulnerable function as

"printf.(([_]?|[a-zA-Z])[_a-zA-Z0-9,\\[\\]]*)."

where

printf          exactly this sequence of six letters
.               any single character (here the opening or closing parenthesis)
[abc]           any one of the letters a, b, or c
[a-z]           any one character from a through z
[a-zA-Z0-9]     any one letter or digit
X?              optional, X occurs once or not at all
X*              X occurs zero or more times

2 Finding the Signature: The algorithm traverses the abstract syntax tree of the program by visiting each node. It checks whether the node is an expression statement. If it is, we check its child node to see whether it is a function call expression. If it is a function call, we retrieve the signature of that call. The signature contains general information about the function: its name and parameters.
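The traversal in this step can be sketched on a toy tree. The Node class, the node kind names, and the function call_signatures below are hypothetical stand-ins chosen for illustration; the real checker walks the CDT AST from Java, and its node types are not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical AST node: a kind, its source text, and child nodes."""
    kind: str
    text: str = ""
    children: list = field(default_factory=list)


def call_signatures(root):
    """Collect the source text of every call that is an expression statement."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.kind == "ExpressionStatement":
            for child in node.children:
                if child.kind == "FunctionCallExpression":
                    found.append(child.text)
        # Push children in reverse so they are visited in source order.
        stack.extend(reversed(node.children))
    return found


tree = Node("FunctionDefinition", "main", [
    Node("DeclarationStatement", "int a;"),
    Node("ExpressionStatement", 'printf("enter a number");',
         [Node("FunctionCallExpression", 'printf("enter a number")')]),
    Node("ExpressionStatement", "printf(a);",
         [Node("FunctionCallExpression", "printf(a)")]),
])

print(call_signatures(tree))
```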

3 Matching the Signature: The regular expression of the vulnerable function, "printf.(([_]?|[a-zA-Z])[_a-zA-Z0-9,\\[\\]]*).", is given as an argument to the Pattern object, and the signature is given as an argument to the Matcher object. The Matcher checks whether the signature matches the regular expression. If it does, there is a vulnerability.
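A minimal sketch of this matching step, written in Python rather than the Java used by the checker. It assumes that the whole signature must match the pattern (the behaviour of Matcher.matches in Java); the names VULN_SIGN and is_type1 are our own.

```python
import re

# The pattern from the text, written as a Python raw string. The leading and
# trailing "." stand for the opening and closing parentheses of the call.
VULN_SIGN = re.compile(r"printf.(([_]?|[a-zA-Z])[_a-zA-Z0-9,\[\]]*).")


def is_type1(signature):
    """True if the whole call signature matches the vulnerable pattern."""
    return VULN_SIGN.fullmatch(signature) is not None


for call in ["printf(a)", "printf(argv[1])", 'printf("hai")', 'printf("%d",a)']:
    print(call, is_type1(call))
```

The first two calls are flagged because their first argument is a bare variable. The last two are not, since the double quote that opens a string literal is not in the allowed character set.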

4 Comparing the format specifiers: If the number of format specifiers in the printf() call is greater than the number of variables, there is a vulnerability. The regular expression for a format specifier is

"%[-+#0]?[0-9]*(\\.[0-9]*)?[hlL]?[cdieEfgGosuxXp]"

Its parts are, in order: flags, width, precision, length, and specifier.

• Specifier is the most significant part; it defines the type and the interpretation of the value of the corresponding argument.

• Width gives the minimum number of characters to be printed. If the value to be printed is shorter than this number, the result is padded with blank spaces. The value is not truncated even if the result is larger.

• Precision:

For integer specifiers (d, i, o, u, x, X): precision specifies the minimum number of digits to be written. If the value to be written is shorter than this number, the result is padded with leading zeros.

For e, E and f specifiers: this is the number of digits to be printed after the decimal point.

For s: this is the maximum number of characters to be printed. By default all characters are printed until the ending null character is encountered.

• Length tells how the argument should be interpreted.

• Flags:

0 -> Left-pads the number with zeroes instead of spaces, where padding is specified.

+ -> Forces the result to be preceded by a plus or minus sign (+ or -) even for positive numbers.

- -> Left-justifies within the given field width; right justification is the default.

# ->

    • Used with o, x or X specifiers, the value is preceded with 0, 0x or 0X respectively for values different from zero.

    • Used with e, E and f, it forces the written output to contain a decimal point even if no digits would follow. By default, if no digits follow, no decimal point is written.

    • Used with g or G, the result is the same as with e or E but trailing zeros are not removed.

Algorithm FindVulnerability(Tree AST, String VulnSign, int Vulnerability)
/** The algorithm takes the AST and VulnSign as input and returns the vulnerability number **/
String Signature; TreeNode Node;
int No_Format_Specifiers; int No_Var_Args;
Step 1: Node = AST;
Step 2: while (Node) {
  Node = Node.Child; /** Visit the child node **/
  If (Node = ExpressionStatement) {
    Node = Node.Child; /** Visit the child node **/
    If (Node = FunctionCallExpression) {
      Signature = Sig(Node); /** Get the signature of the function call **/
      If (function name in Signature = "printf") {
        If (Signature matches VulnSign) {
          /** Examples: printf(a); printf(c); **/
          Return(1); /** There is a vulnerability **/
        }
        Else {
          /** Get the number of specifiers and the number of arguments **/
          No_Format_Specifiers = get_specifiers(Node);
          No_Var_Args = get_args(Node);
          If (No_Format_Specifiers > No_Var_Args) {
            /** Examples: printf("%d%d%d",a); printf("%d%s%s%d",a,d); **/
            Return(2); /** There is a vulnerability **/
          }
          Else {
            /** Examples: printf("hai"); printf("%d",a); **/
            Return(0); /** No vulnerability **/
          }
        }
      }
    }
  }
}
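The comparison of specifiers against arguments (the Type 2 check) can be sketched as follows. This is an illustration in Python, not the checker's Java code. It assumes a pattern with the same five fields (flag, width, precision, length, specifier), written so that width and precision may span several digits; the names SPECIFIER, count_specifiers and is_type2 are our own, and value_args stands for the arguments that follow the format string.

```python
import re

# Fields in order: flag, width, precision, length, specifier.
SPECIFIER = re.compile(r"%[-+#0]?[0-9]*(?:\.[0-9]*)?[hlL]?[cdieEfgGosuxXp]")


def count_specifiers(format_string):
    """Number of conversion specifiers found in a format string."""
    return len(SPECIFIER.findall(format_string))


def is_type2(format_string, value_args):
    """True if the format string asks for more values than are supplied."""
    return count_specifiers(format_string) > len(value_args)


print(count_specifiers("%d%d0x%08x%s"))  # 4
print(is_type2("%s%s%s", ["a"]))         # True
print(is_type2("%d", ["a"]))             # False
```

This sketch counts textual matches only, so a literal percent sign written as %% is not handled specially.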

Detecting Buffer Overflow Vulnerabilities (Intra-Functional)

Introduction

A buffer overflow occurs when a program writes data outside the bounds of allocated memory. Buffer overflow vulnerabilities are usually exploited to overwrite values in memory to the advantage of the attacker. Buffer overflow mistakes are plentiful, and they often give an attacker a great deal of control over the vulnerable code. Many buffer overflow vulnerabilities are related to string manipulation.

String buffer overflow

    char dst[256];
    scanf("%s", s);
    strcpy(dst, s);

The string s is read from an untrusted source and its length is controlled by the attacker. The string is copied into a buffer of fixed size using the strcpy() function. If the length of the string exceeds the size of the destination buffer, the strcpy() function will overwrite data past the end of the buffer. This type of code occurs when a program is processing user input.

Memcpy buffer overflow

    char src[10], dst[10];
    int n;
    scanf("%d", &n);
    memcpy(dst, src, n);

The integer n is read from an untrusted source and its value is controlled by the attacker. No input validation is performed before n is used as the third argument of memcpy(). If the attacker chooses a value of n bigger than the size of the destination buffer, the memcpy() function will overwrite data past the end of the buffer.

There is a common pattern present in these vulnerabilities. All vulnerabilities in these classes are a result of an execution path with the following three characteristics:

1. Data is read from an untrusted source.

2. Untrusted data is insufficiently validated.

3. Untrusted data is used in a potentially vulnerable function or a language construct.

Taint Analysis

The concept of tainting refers to marking data coming from an untrusted source as “tainted” and propagating its status to all locations where the data is used. We use interprocedural taint analysis to determine the program locations where data is read from an untrusted source and where tainted data is used as an argument to a potentially vulnerable function. This is a forward data-flow analysis problem and can be solved efficiently using an iterative algorithm. There are only two possible lattice values: TAINTED and NOT TAINTED. The meet function for the lattice is based on the idea that anything that meets a tainted value becomes tainted.
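The two-value lattice and its meet function can be written down directly. The sketch below is an illustration in Python; the names Taint and meet are our own and do not come from the checker's source.

```python
from enum import Enum


class Taint(Enum):
    NOT_TAINTED = 0
    TAINTED = 1


def meet(*values):
    """Combine lattice values: the result is tainted if any input is."""
    if any(value is Taint.TAINTED for value in values):
        return Taint.TAINTED
    return Taint.NOT_TAINTED


# m = n + 1, where n came from an untrusted source and 1 is a constant.
print(meet(Taint.TAINTED, Taint.NOT_TAINTED))
```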

ALGORITHM

1. Find all calls to functions that read data from an untrusted source. Mark the values returned by these functions as TAINTED and store them in an array.

2. If a tainted value is used in an expression, mark the result of the expression as TAINTED and store the result in the array.

3. Find all calls to potentially vulnerable functions. If one of their arguments is tainted, report this as a vulnerability.

4. Repeat the above process until the end of the program.
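The four steps above can be sketched on a simplified program representation. Everything in this example is an assumption made for illustration: statements are plain tuples instead of AST nodes, the code is straight-line (no branches or loops), and the sets SOURCES and SINKS list only a few function names.

```python
SOURCES = {"scanf"}             # assumed untrusted input functions
SINKS = {"memcpy", "strcpy"}    # assumed potentially vulnerable functions


def find_tainted_uses(statements):
    """Return (statement index, function, argument) for each tainted sink use."""
    tainted = set()
    reports = []
    for index, stmt in enumerate(statements):
        kind = stmt[0]
        if kind == "call" and stmt[1] in SOURCES:
            # Step 1: values written by an input function become tainted.
            tainted.update(stmt[2])
        elif kind == "assign":
            # Step 2: the result is tainted if any operand is tainted.
            _, target, operands = stmt
            if any(name in tainted for name in operands):
                tainted.add(target)
        elif kind == "call" and stmt[1] in SINKS:
            # Step 3: report tainted arguments of vulnerable functions.
            for arg in stmt[2]:
                if arg in tainted:
                    reports.append((index, stmt[1], arg))
    return reports


# scanf("%d", &n); m = n + 1; memcpy(dst, src, m);
program = [
    ("call", "scanf", ["n"]),
    ("assign", "m", ["n"]),
    ("call", "memcpy", ["dst", "src", "m"]),
]
print(find_tainted_uses(program))  # [(2, 'memcpy', 'm')]
```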

Detecting Buffer Overflow Vulnerabilities for Inter-Functional Calls

Memcpy Buffer Overflow for Inter-Functional Calls

    void main()
    {
        int a, b;
        char string1[10], string2[20];
        a = 10;
        b = function(a);
        memcpy(string1, string2, b);
    }

    int function(int a)
    {
        int b;
        scanf("%d", &b);
        a = b + 6;
        return(a);
    }

In the above program a tainted value is returned from function() to main(), where memcpy() uses it as the number of bytes to copy from string2 to string1.

ALGORITHM:

• Apply taint analysis as discussed above and store the tainted variables in an array.
• Keep track of return statements; whenever there is one, check the arguments returned.
• Compare these arguments with the array containing tainted elements. If a match is found, store the function name in an array.
• Traverse the whole tree twice and identify the function call expressions for the stored function names.
• Taint the variable in which the returned value is stored.
• Search again for uses of the tainted variables from the updated array among the variables used in potentially vulnerable functions.
• Mark the vulnerability there.

We can apply the same algorithm to all the other checkers developed to detect intra-functional vulnerabilities.
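The inter-functional steps can be sketched in the same simplified style. The tuple-based statements, the statement kinds "assign_call" and "return", and the function names tainted_vars and analyse are all hypothetical; the sketch handles straight-line code and a single level of calls only.

```python
SOURCES = {"scanf"}
SINKS = {"memcpy"}


def tainted_vars(body, tainted_functions):
    """Variables tainted inside one function body (straight-line code)."""
    tainted = set()
    for stmt in body:
        if stmt[0] == "call" and stmt[1] in SOURCES:
            tainted.update(stmt[2])
        elif stmt[0] == "assign" and tainted & set(stmt[2]):
            tainted.add(stmt[1])
        elif stmt[0] == "assign_call" and stmt[2] in tainted_functions:
            tainted.add(stmt[1])       # receives a tainted return value
    return tainted


def analyse(functions):
    """Return (function, sink, argument) for each tainted sink argument."""
    # Pass 1: functions whose returned variable is tainted.
    tainted_functions = set()
    for name, body in functions.items():
        local = tainted_vars(body, set())
        if any(s[0] == "return" and s[1] in local for s in body):
            tainted_functions.add(name)
    # Pass 2: propagate through call sites, then check the sinks.
    reports = []
    for name, body in functions.items():
        local = tainted_vars(body, tainted_functions)
        for stmt in body:
            if stmt[0] == "call" and stmt[1] in SINKS:
                reports += [(name, stmt[1], a) for a in stmt[2] if a in local]
    return reports


program = {
    "main": [
        ("assign", "b", []),                       # b = 10
        ("assign_call", "a", "function", ["b"]),   # a = function(b)
        ("call", "memcpy", ["string1", "string2", "a"]),
    ],
    "function": [
        ("call", "scanf", ["b"]),                  # scanf("%d", &b)
        ("assign", "c", ["b", "a"]),               # c = b + a
        ("return", "c"),
    ],
}
print(analyse(program))  # [('main', 'memcpy', 'a')]
```

The two passes mirror the two traversals in the list above: the first records which functions return a tainted value, and the second taints the variables that receive those return values before checking the vulnerable calls.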


Implementation Details:

Software Details

Operating System: Windows 7 Professional

Software Used: JDK 1.6, MinGW compiler, Eclipse IDE

Tools and Technologies Used

We have used the Eclipse tool and Java technology in our project.

Installation Procedure

• Download the JDK 1.6 software, run the installer, and accept the default path. Install the full program. After successful installation, add the JDK to the environment variables as follows:

Right-click My Computer, go to Properties -> Advanced system settings -> Environment Variables, and create a user variable named PATH with the value C:\Program Files\Java\jdk1.6.0_21\bin.

• Install the MinGW tools for C/C++:

1. Log in to your regular user account.
2. Download the MinGW installer and run it. Your browser may warn that "This file is not commonly downloaded and may harm your computer", but run it anyway. (In one dialog box, selecting 'additional options' allowed it to run.) This installation will download various files from the web.
3. Accept the default installation folder C:\MinGW. Do NOT attempt to place it in C:\Program Files\MinGW (the location cannot contain spaces).
4. At the Select Components dialog, check the MSYS Basic System.
5. Add the C:\MinGW\bin folder to your Windows Path variable.

• Download the Eclipse IDE, run it, and accept the default path. Install the full program. Afterwards add the CDT plug-ins, and then the Codan plug-ins.

Results of Printf Format String Vulnerabilities

Consider the following example, which contains two vulnerabilities.

Example 1:

    main()
    {
        int a, b, c, d[] = {66, 23, 1};
        char x[10], y[20];
        printf("enter the three numbers");
        scanf("%d%d%d", &a, &b, &c);
        printf("%s%s", x, y);
        printf(a);              // vulnerability
        printf("%d%s%s", a);    // vulnerability
    }

These vulnerabilities are detected immediately, while the program is being written in Eclipse.

Figure 14: printf vulnerability

The red mark in front of the line indicates that there is a vulnerability. When we move the cursor to the red mark, it reports that this is a printf() vulnerability, as shown in the snapshot below.


Figure 15: displaying printf vulnerability

When the developer specifies the format string in that printf() call, the vulnerability alert disappears, as shown below.

Figure 16: no printf vulnerability

After the next line is written, it is flagged as well, since it also contains a vulnerability.


Figure 17: printf vulnerability

After the required number of variables is added, the vulnerability alert disappears.

Figure 18: no printf vulnerability


Example 2:

    #include <stdio.h>
    main(int argc, char **argv)
    {
        char *secret1 = "this is secret 1\n";
        char *secret2 = "this is secret 2\n";
        printf(argv[1]);    // vulnerability
    }

The above example contains a printf vulnerability. It is detected immediately, while the program is being written in Eclipse. The red mark in front of the line indicates that there is a vulnerability.

Figure 19: printf vulnerability

Results of Buffer overflow Vulnerabilities

Consider the following example.

Example 1:

    main()
    {
        int n;
        char src[20], dst[10];
        printf("enter n value");
        scanf("%d", &n);
        memcpy(dst, src, n);    // vulnerability
    }

The above example contains a memcpy buffer overflow. The value n is read from an untrusted source, so it is tainted. This tainted value is used in the memcpy function, so there is a possible vulnerability. It is detected immediately, while the program is being written in Eclipse. The red mark in front of the line indicates that there is a vulnerability.

Figure 20: Abstract syntax tree of the program. Annotations in the figure: "Tainted value"; "Vulnerability, since it is using tainted value".


Figure 21: memcpy vulnerability

Example 2:

    main()
    {
        int n, m;
        char src[20], dst[10];
        printf("enter n value");
        scanf("%d", &n);
        m = n + 1;
        memcpy(dst, src, m);    // vulnerability
    }

The above example contains a memcpy buffer overflow. The value n is read from an untrusted source, so it is tainted. This tainted value is used in a binary expression, so the result becomes tainted. The result is then used in the memcpy function, so there is a possible vulnerability.


Figure 22: Abstract syntax tree of the program. Annotations in the figure: "Tainted value"; "Since it is using tainted value, it becomes tainted"; "Vulnerability, since it is using tainted value".


This vulnerability is detected immediately, while the program is being written in Eclipse. The red mark in front of the line indicates that there is a vulnerability.

Figure 23: memcpy vulnerability

Buffer Overflow Detection for Interfunctional Calls

Example

    void main()
    {
        int a, b;
        char string1[10], string2[20];
        b = 10;
        a = function(b);
        memcpy(string1, string2, a);
    }

    int function(int a)
    {
        int b, c;
        scanf("%d", &b);
        c = b + a;
        return(c);
    }

In this program the value b is passed from main() to function(). function() returns a tainted value, because c is the result of a binary operation on a and on the value b read by scanf, which is tainted.

Figure 24: AST for the program. Annotations in the figure: "The tainted function is called here"; "Here value of b is tainted due to scanf"; "Here value of c is tainted".

Showing error for interfunctional memcpy vulnerability

Figure 25: Showing checkers for interfunctional calls


Conclusion:

There are many static analysis tools, such as Flawfinder, RATS, and ITS4. All these tools find vulnerabilities after the application is developed. We have instead developed checkers that help the developer find vulnerabilities at the time of writing the code itself.

We have studied in detail a variety of vulnerabilities that can be caused by misuse of unsafe and error-prone features of the C programming language, such as pointer arithmetic, the lack of a native string type, and the lack of array bounds checking.

We have developed checkers in the Codan framework provided by the Eclipse CDT package. These checkers implement the taint analysis algorithms and check the inter-functional transfer of tainted objects at the time of writing the code itself.

The checkers are able to spot most of the mistakes, but there are still exceptions, such as for loops and while loops.


Figure 26: Screenshot of the checkers developed in this project

In the above figure, the Format String Vulnerability checker is the only one provided by Eclipse in the latest release of the CDT package. During the course of the project I worked on introducing taint analysis into the memory copy checker and developed an interfunctional memory copy checker for the interfunctional transfer of tainted values.

