Static Code Analysis of C++ in LLVM

Linköping University | Department of Computer and Information Science Bachelor Thesis, 16hp | Programming Spring term 2016 | LIU-IDA/LITH-EX-G--16/011--SE Static Code Analysis of C++ in LLVM Olle Kvarnström Supervisor, Jonas Wallgren Examiner, Klas Arvidsson Upphovsrätt Detta dokument hålls tillgängligt på Internet – eller dess framtida ersättare – under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår. Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art. Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart. För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/. Copyright The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/. © Olle Kvarnström Static Code Analysis of C++ in LLVM Olle Kvarnström IDA, Linköping University [email protected] these rules manually would require great time and effort, and ABSTRACT Just like the release of the Clang compiler, the advent of the work is therefore more suitably done with an automatic LLVM in the field of static code analysis already shows great tool. However, since these rules are custom tailored to suit promise. When given the task of covering rules not ideally the needs of Saab, there is a limited set of tools available. At covered by a commercial contender, the end result is not only the moment, Saab strives to enforce a higher degree of overwhelmingly positive, the implementation time is only a strictness than contemporary tools can provide. fraction of what was initially expected. While LLVM’s Creating a new tool for static code analysis from scratch support for sophisticated AST analysis is remarkable, being would be a major undertaking, requiring a lot of time and the main reason these positive results, it’s support for data effort. However, a new actor in the open source community flow analysis is not yet up to par. Despite this, as well as a might provide a solution. lack of thorough documentation, LLVM should already be a strong rival for any commercial tool today. LLVM, formerly an acronym for Low Level Virtual Machine, is a compiler infrastructure, which originates from INTRODUCTION a thesis on aggressive multi-stage optimizing compiler technology [2]. The design of the LLVM internal code Motivation representation would allow for sophisticated C and C++ code Developing software for airborne systems is a delicate analysis [3]. process, where even the smallest flaws may lead to death, severe damage to property, or harm to the environment. This Objective defines airborne systems as being safety-critical [1]. This paper examines the use of LLVM in static code analysis. The end goal is evaluating if LLVM is suitable for working Working in an environment with such high stakes requires a in conjunction with, or even replacing, using a commercial great deal of care to be taken. Both the source code and tool. To do this, we focus on covering rules not correctly resulting programs must be thoroughly analyzed and tested supported by the current tools at Saab. These tools cover the to iron out any minor inconsistencies. Analyzing code can be rules either too strictly, which emits false positives, too done either dynamically or statically. leniently, which causes false negatives, or not at all. Dynamic code analysis means that you let the code execute, It is important to have a 1:1 relationship between issues and usually in a controlled environment. The execution is error message. False positives is a major issue in static carefully studied, and used to confirm that the program analysis today, and is the main contributing reason for static behaves as expected. This type of analysis can be difficult to analysis tools not being used in a wider scale [4, 5]. False apply in a safety-critical system with limited resources. negatives are also problematic, since they force the end user Static code analysis, on the other hand, reads the program’s to do additional manual checks. source code and reasons how the running application will In order to implement custom rule checkers in LLVM, a behave. This means that the code can be analyzed without prototype tool for analyzing C++ source code is created. This access to the safety-critical system, and without any risks prototype runs through a Continuous Integration(CI)-tool as involved. Since access to the system might be limited, this part of the build process. It is also desirable for the prototype can also save the developers’ time. to be usable by the developers, either by running it manually, Saab is a Swedish defense and security company, commonly or through their Integrated Development Environment(IDE). known for its aircraft fighter JAS 39 Gripen. In its The rules to cover are the following: continuous ambition to produce cutting-edge technology, 1. No operations are allowed to make machine or static code analysis is a natural part of Saab’s software compiler dependent presumptions on a type's bit development process. To ensure that their software is not representation. only safe, but also easy to maintain and verify, it is developed 2. Variable and function names are not to be reused in according to strict code standards. These standards can have nested scopes, or from base classes. hundreds of rules, which are oftentimes much stricter than 3. Pointers shall not be used for iterating over arrays the underlying program language. Assuring compliance with for reading or writing. 1 4. Nothing should be defined in the global scope, then analyzed for inconsistencies. Below are a few common except for in header files. methods. 5. The assignment operator in a class must return a Abstract syntax tree const reference to this, or void. An abstract syntax tree(AST) is an attempt to describe the 6. Pointers to functions must have a typedefined type. structural essence of the underlying programming language 7. Inside a function, all definitions must happen before [8]. Each node in the tree represents a language construct, any executable code. Exception: Defining variables where the trunk of the tree contains operators, and the leaf in a for-loop’s initializer is always permitted. nodes are variables. By comparing the nodes’ relation to its 8. When assigning or comparing a pointer to NULL, neighbors, one can detect local errors such as type the NULL must be type casted to the same type. differences. The strength of analyzing the AST, is in its 9. #define shall not be used for numeric constants. efficiency when it comes to assessing well-defined, 10. Bitwise operations on Booleans are not allowed. structural rules [9]. AST-detection has previously been 11. No function parameters shall have side effects. successfully applied in finding cases of code reuse in a large 12. No class, struct or enum declaration shall happen at code base [10]. This was done by checking for duplicates, or the time of its type definition. near duplicates, of parts of the AST. AST-detection is 13. The keywords register, auto, export and mutable are however limited, and more advanced methods are needed for not to be used. more sophisticated checks. 14. The operators *=, /=, <<=, >>=, &=, ^=, %= and |= are not to be used. Data flow analysis 15. Pointers to functions are not to be used. Data flow analysis separates the code into blocks, which are 16. The standard types are not to be used. Exceptions: sequences of instructions with one entry point and one exit void and char. point. Variables within these blocks are analyzed on how 17. Free standing block statements delimited by {} are they will behave, depending on which paths the code not permitted. execution will take [11]. For example, when reaching an if- statement, the program will investigate each possible route Research Questions separately, and afterwards compile a set of all the variables’ • Can LLVM be used to cover the rules described possible values. This set will be used in future instructions, above, without emitting false positives or false to ensure that no matter which route the code takes, the negatives? variable data is correct. • How long does it take to write a rule check in LLVM? The strength of using data flow analysis is its thoroughness • How well suited is LLVM for static analysis? in which it tests all possible, even how unlikely, ways the code execution may travel.

Static Code Analysis of C++ in LLVM

The LLVM Instruction Set and Compilation Strategy

What Is LLVM? and a Status Update

CUDA Flux: a Lightweight Instruction Profiler for CUDA Applications

Implementing Continuation Based Language in LLVM and Clang

Code Transformation and Analysis Using Clang and LLVM Static and Dynamic Analysis

Sham: a DSL for Fast Dsls

Introduction to LLVM (Low Level Virtual Machine)

Polymorphisation: Improving Rust Compilation Times Through

Formalizing the LLVM Intermediate Representation for Verified Program Transformations

“Which Targets Does Clang Support?”

Bringing Next Generation C++ to Gpus

Llvmlinux: the Linux Kernel with Dragon Wings