Evaluating the Ratio of Alive Code in Java Third-Party Libraries
A Comparison between a Static and a Dynamic Approach

ANDREAS BROMMUND

DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Master in Computer Science
Date: July 11, 2019
Supervisor: Pontus Johnson
Examiner: Elena Troubitsyna
School of Electrical Engineering and Computer Science
Host company: Omegapoint Stockholm AB
Swedish title: Analysera andelen använd kod i tredjepartsbibliotek – en jämförelse mellan ett statiskt och ett dynamiskt tillvägagångssätt

Abstract

Today's software development relies heavily on third-party libraries. However, some libraries offer a rich set of functionality of which only a small part is used. This leads to an unnecessarily complex codebase that needs maintenance.

This thesis compares two methods for calculating the ratio of used code in third-party libraries. The first method uses the existing tool JTombstone, which analyses the code statically. This static approach always examines the whole program; however, it overestimates the result. The second method uses a dynamic approach. This method always underestimates the result, because only the part of the program that is executed is examined. The dynamic code analyser modifies all classes contained in the third-party library. At the beginning of every method a print statement is added, which prints the signature of the current method. In this way, the list of all executed methods is generated.

The findings of the thesis are that the first approach always yields a higher value and that the difference between the two methods decreases as the code coverage increases.
The thesis cannot state which method is the best; however, a good solution is to combine both methods to generate an interval that bounds the correct value.

Sammanfattning (Summary)

Today's software development relies heavily on the use of third-party libraries. However, many of the libraries contain a large amount of functionality of which only a small part is used. This creates unnecessarily complex software that must be maintained.

This thesis compares two different methods used to calculate the ratio of used code in third-party libraries. The first method uses JTombstone, a tool that analyses the code statically. Because it analyses the code statically, the whole project is always analysed; however, the tool calculates too high a value. The second method instead builds on a dynamic evaluation of the code. With a dynamic approach, only the part of the code that was run is evaluated, which leads to the program generating too low a result. The tool that analyses the code dynamically modifies all classes belonging to the third-party library. At the beginning of every method, the tool adds a print statement that prints the method signature of that specific method. In this way, a list of the methods that have been called is obtained.

The thesis found that the first method always generates a larger value. The results also show that the difference between the two methods decreases when the tests exercise a larger part of the code. With the results that were generated, it is not possible to determine which of the two methods is the best. A good solution is to combine the methods and use the two results to create an upper and a lower bound for the correct value.

Contents

1 Introduction
  1.1 Background
  1.2 Research Question
  1.3 Hypothesis
  1.4 Delimitations
  1.5 Contribution
2 Theory
  2.1 Third-Party Libraries
  2.2 Dead Code
  2.3 Dynamic and Static Dispatch
  2.4 Call Graph
    2.4.1 Sound and Precise Call Graph
  2.5 Code Coverage
  2.6 Static Code Analysis
    2.6.1 Class Hierarchy Analysis
    2.6.2 Rapid Type Analysis
  2.7 Dynamic Analysis
3 Related Work
  3.1 Call Graph Construction for Java Libraries
  3.2 DUM-Tool
  3.3 Dead Code Elimination for Web Systems Written in PHP: Lessons Learned from an Industry Case
4 Methods
  4.1 Test Data
  4.2 Dead Code Granularity
  4.3 Tools Used
    4.3.1 Java
    4.3.2 Javap
    4.3.3 JTombstone
    4.3.4 Java Agent
  4.4 Experiment Process
    4.4.1 Initialisation
    4.4.2 Dynamic Analysis
    4.4.3 Static Analysis
    4.4.4 Calculate Code Coverage
    4.4.5 Calculating and Validating the Result
5 Results
6 Discussions
  6.1 Result
  6.2 Methodology
  6.3 Sources of Error
  6.4 Future Work
    6.4.1 Improve the Code Coverage
    6.4.2 Improve the Static Code Analysis
    6.4.3 Analyse the Functionality of the Code
  6.5 Ethical Considerations
7 Conclusions
Bibliography

Glossary

API   Application Programming Interface.
CHA   Class Hierarchy Analysis.
CPA   Closed-package assumption.
HTTP  Hypertext Transfer Protocol.
JVM   Java Virtual Machine.
OPA   Open-package assumption.
RTA   Rapid Type Analysis.

Chapter 1
Introduction

This chapter gives a brief background to the problem, the aim of the thesis and the research question.

1.1 Background

Today's software development relies heavily on third-party code libraries, and the reuse of code is a necessary part of modern development. It is an easy way to include functionality and speed up the development process. However, risk analysis is usually omitted when the decision to include a library is made. Most third-party libraries have a rich set of functions of which only a few are used in the software. This leads to an increased codebase with a high ratio of dead code.
[1]

An increased codebase leads to a bigger attack surface, and more resources are needed to maintain the software. To avoid exposure to known vulnerabilities, the maintainer must always stay up to date with new patches and actively patch the libraries. The burden of patching increases with the number of libraries and with the size of the codebase. Therefore, it is necessary to have control of the dependency tree to reduce unnecessary code.

A first step towards taking control of the dependency problem is to gain knowledge of which dependencies are included. Secondly, it is necessary to find which functions are essential for the project and which are not. One way of solving this is to measure the ratio of used and unused code in the dependencies. This thesis focuses on the second problem and investigates two different approaches to solving it.

1.2 Research Question

Is a dynamic approach a better technique than a static approach for measuring the amount of unused code in third-party libraries included in Java open source projects with high code coverage?

1.3 Hypothesis

1. The number of methods categorised as alive is higher for the static analysis approach than for the dynamic approach.
2. The difference between the two methods decreases when the code coverage increases.

1.4 Delimitations

The test data contains only a few projects, and all of them must be written in the programming language Java. Only static code analysis and dynamic analysis are investigated.

1.5 Contribution

In this thesis, a new dead code analyser is developed. The analyser measures the number of used methods in the third-party libraries. The application is classified as a dynamic approach and is compared with the existing tool JTombstone (http://jtombstone.sourceforge.net).

Chapter 2
Theory

The relevant theory is presented in this chapter.
2.1 Third-Party Libraries

Third-party libraries are a vital part of this thesis and must be defined more precisely. The definition is taken from Heinemann et al. [1]; however, they use the term software reuse. All code not written by the developers themselves is categorised as third-party code. However, code provided by the operating system or the programming language is not included in the definition. Therefore, the Java API is not classified as a third-party library.

To distinguish between third-party code and self-written code in the rest of this chapter and throughout the thesis, the following six definitions, taken from Romano et al. [2], are used:

• internal code is the code written specifically for the software,
• external code is code in the third-party libraries,
• internal classes are the classes in the internal code,
• external classes are the classes in the external code,
• internal methods are the methods in the internal classes and
• external methods are the methods in the external classes.

    1  public int add(int a, int b){
    2      int c = a;
    3      int d = 3;
    4      if(false){
    5          System.out.print("Dead");
    6      }
    7      return c + b;
    8  }

Listing 2.1: This code snippet exemplifies dead code. For example, line 5 is dead because the condition of the if statement is always false.

2.2 Dead Code

Dead code is one of the key concepts in this thesis, and it is crucial to understand its meaning. In computer science, dead code or unreachable code has different meanings in different fields. Software engineers define dead code [3] as the part of the program that is never executed.
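
The software-engineering definition can be illustrated at method granularity with a minimal sketch. It is not the analyser developed in the thesis: the class names AliveRatioSketch and TinyLibrary, the executed set and the helper methods are hypothetical, and the entry probes are written by hand here, whereas the thesis describes a tool that adds a print statement to the start of every method by modifying the classes. The sketch only assumes that an external method counts as alive once it has been entered at least once.

```java
import java.lang.reflect.Method;
import java.util.Set;
import java.util.TreeSet;

public class AliveRatioSketch {

    /** Hypothetical stand-in for an external class; not taken from the thesis. */
    public static class TinyLibrary {
        // Signatures of the methods that have been entered at least once.
        public static final Set<String> executed = new TreeSet<>();

        public static int add(int a, int b) {
            executed.add("add(int,int)"); // hand-written entry probe (assumption)
            return a + b;
        }

        public static int multiply(int a, int b) {
            executed.add("multiply(int,int)");
            return a * b;
        }

        public static int negate(int a) {
            executed.add("negate(int)");
            return -a;
        }
    }

    /** Ratio of executed methods to declared methods; 0.0 for an empty class. */
    public static double aliveRatio(int executedMethods, int totalMethods) {
        return totalMethods == 0 ? 0.0 : (double) executedMethods / totalMethods;
    }

    /** Counts the methods declared in a class, skipping compiler-generated ones. */
    public static int declaredMethodCount(Class<?> type) {
        int count = 0;
        for (Method method : type.getDeclaredMethods()) {
            if (!method.isSynthetic()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The "internal code" only calls add, so multiply and negate
        // are never executed in this run.
        int sum = TinyLibrary.add(2, 3);

        int total = declaredMethodCount(TinyLibrary.class);
        int alive = TinyLibrary.executed.size();

        System.out.println("sum = " + sum);
        System.out.println("executed methods: " + TinyLibrary.executed);
        System.out.println("alive ratio: " + alive + "/" + total
                + " = " + aliveRatio(alive, total));
    }
}
```

A method missing from the executed set is dead only with respect to this particular run; another input could still reach it, which is why a dynamic measurement of this kind underestimates the ratio of alive code.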