
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2020

Improving Testing in an Agile Environment

JÉRÔME DE CHAUVERON

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Abstract

Software development has evolved at an ever-increasing pace over the past years; one of the forces behind this acceleration is the move from on-premise applications to cloud-based software: Software as a Service (SaaS). Cloud computing has changed the way applications are deployed, used and tested. Time-to-market for software has decreased from the order of months to the order of days. Furthermore, source code management based on Git (introduced in 2005) changed the way software is developed, allowing for collaborative work thanks to automatic merging and versioning. Additionally, continuous integration (CI) tools developed on top of Git facilitate regular testing, building and deployment. Nevertheless, despite being necessary, integration tools require an extensive amount of cloud resources, which may stretch the duration of code integration. The goal of this thesis is to optimize the speed of the CI pipeline and improve the software tests while optimizing the load on the cloud resources.

As a result of the work of this thesis, the completion time of the software tests has been decreased by 21%, and the total completion time of the continuous integration pipeline by 18%. Furthermore, new bugs and anomalies were detected thanks to improved software tests and new approaches for the emulation of extreme scenarios. The bugs were corrected, making the system more resilient and improving the user experience.

Sammanfattning

Software development has accelerated at an ever-increasing pace in recent years. One of the forces behind this acceleration is the transition from on-premise applications to cloud-based software: Software as a Service (SaaS). Cloud computing has changed the way applications are deployed, used and tested. The time for software to reach the market has decreased from months to days.

Furthermore, source code management based on Git (introduced in 2005) has changed how software is developed, enabling collaborative work thanks to automatic merging and version control. In addition, continuous integration (CI) tools built on top of Git facilitate regular testing, building and deployment. Although they are necessary, integration tools require an extensive amount of cloud resources, which may stretch the duration of code integration.

The goal of this thesis is to optimize the speed of the CI pipeline and to improve the software tests while optimizing the load on the cloud resources.

As a result of the work in this thesis, the completion time of the software tests has been reduced by 21%, and the total completion time of the continuous integration pipeline by 18%. Furthermore, new bugs and anomalies were discovered thanks to improved software tests and new approaches for emulating extreme scenarios. The bugs were corrected, making the system more resilient and improving the user experience.

Acknowledgements

I would like to thank my supervisor Shatha Jaradat for her hard work, time, and guidance throughout this thesis. I would also like to thank my manager and colleagues at Dassault Systèmes for their cooperation, advice and friendship. It has been a very challenging experience. I would like to thank my examiner Mihhail Matskin.

February 9, 2020
Jérôme de Chauveron

Contents

1 Introduction
  1.1 Motivation
  1.2 Background
  1.3 Problem
  1.4 Purpose
  1.5 Goal
  1.6 Research Methodology
  1.7 Delimitation
  1.8 Ethics and Sustainability
  1.9 Outline

2 Background and Related Work
  2.1 Background on Software testing
  2.2 Pipeline Description
  2.3 Code and Testing Metrics
  2.4 Integration Testing & Functional testing
  2.5 Fuzzy testing Background
  2.6 Load testing Background
  2.7 Literature review and related work

3 Methods
  3.1 Code coverage for End-to-End testing
  3.2 Automating Coverage for End-to-End tests
  3.3 Background and Set-up description
  3.4 Implementation description
  3.5 Load Testing Tools Comparison

4 Experiments
  4.1 Functional test traces analysis
  4.2 Parameters influencing failure rate
  4.3 Gitlab Runner usage optimization
  4.4 Troubleshooting and Fuzz testing Results and Analysis

5 Conclusion and future work

List of Figures

1 Software testing Pyramid [1]
2 Git practices description [3]
3 CI/CD Pipeline description
4 Representation of process arrival in Gitlab [6]
5 Example of Cyclomatic number computation
6 Code Example for Halstead complexity computation
7 Representation of an End to End Test scenario [5]
8 Code example of an unreachable path via blackbox testing
9 SAGE algorithm description [14]
10 Flow Chart of the AFL algorithm
11 GAN architecture [21]
12 Instrumenting code, an example
13 HTML Code Coverage Report
14 Troubleshooting setup description
15 Evolution of Katalon test duration
16 Pipeline execution time distribution
17 Job triggering pipeline failure distribution
18 Pipeline success function of the number of .ts files modified
19 Merge Pipeline total execution time evolution
20 Example of an asynchronous request failure
21 Example of Linear size increase of factor 2 on a JSON of depth 2
22 Example of Recursive size increase of factor 2 on a JSON of depth 2

List of Tables

1 Comparison of different End-to-End testing tools
2 Comparison of the different tools used for troubleshooting
3 Comparison of the different Inter-process communication tools
4 Comparison of the different tools for load testing
5 Correlation between code metrics and pipeline success rate

List of acronyms

API   Application Programming Interface
AWS   Amazon Web Services
CI/CD Continuous Integration / Continuous Deployment
HTTPS Hypertext Transfer Protocol Secure
JSON  JavaScript Object Notation
IDE   Integrated Development Environment
OS    Operating System
QA    Quality Assurance
REST  REpresentational State Transfer
SSL   Secure Socket Layer
TCP   Transmission Control Protocol
UI    User Interface
VM    Virtual Machine

1 Introduction

1.1 Motivation

The DevOps approach and the Agile methodology for software development have had a profound impact on the way software is developed, shipped, and deployed. The era of mainframes (1970-1980) was defined by technologies such as Cobol and Multiple Virtual Storage (MVS), with 1-5 year release cycles and extremely high risks in terms of meeting customer needs. Then came the era of client/server (1990), still with high risk and a relatively slow release cycle of 3-12 months; the current era of cloud is pushing the release cycle down to the order of days. This new approach pushes for faster release cycles and thus drives quality analysis towards more automation, as manual quality analysis is tedious, lengthy, and expensive. This thesis was conducted during the development of the UI for a web application, referred to as the web-app, with the purpose of improving software testing and the CI pipeline efficiency.

1.2 Background

Software quality is a leading concern for companies, as a bug might be extremely expensive in terms of the time spent correcting it and its impact on the client. To maintain a certain level of quality, tests are used to certify that the system can handle a typical user scenario (with different network configurations). Moreover, the system needs to remain functional despite unreachable servers, high network congestion or unexpected API responses. In fact, those conditions may occur in a real-world scenario despite being very unlikely to arise in a testing environment. The objective is to have the most relevant tests possible and to have them automated, while using the minimum of resources in terms of VMs. Relevant tests meaning that the tests need to go through as much of the system as possible and remain close to real-world usage. Moreover, the tests should assess that the system is functional in various scenarios. The studied system contains multiple types of tests; the ones detailed in this thesis are the following:

• Functional test
• Load test
• Unit test
• Chaos test

To evaluate and improve the relevancy of those tests we are using:

• Code coverage: a metric indicating how much of the code is being executed through a test (it can be applied to any type of test, but in our case it is designed for functional testing). It is the percentage of lines being executed during the test, as well as the line-per-line execution detail. The objective is to maximize the percentage of executed code and to complement the functional tests with unit tests for the parts of the code not executed during the functional tests.

• Mitmproxy: a proxy used to modify on the fly the responses given by the server, to test the resilience of the UI

• Netem: a network emulation tool adding delay, loss, packet re-ordering, and breaking connections1

The code management platform used is GitLab, which enables us to extract traces of the activity and the pipeline history; these will be used to optimize the continuous integration pipeline.

1.3 Problem

A continuous integration pipeline enables software to be built, tested and integrated into the master branch continuously, enabling more collaborative work and a faster release cycle. However, a CI pipeline is expensive to set up and maintain due to the different jobs required: deployment of the code, End-to-End tests, unit tests and build. All those jobs require VMs, and maintaining them is a tedious process. Moreover, the CI pipeline contains E2E tests checking the entire system, but the E2E test is lengthy in terms of execution time and hard to maintain. On top of this, those tests need to remain relevant and evolve with the system; otherwise a bug won't be detected when code is pushed, making the bug harder to locate and correct.

Finally, due to their execution time, the jobs contained in the continuous integration pipeline may generate a delay in the test feedback loop. This would significantly decrease developers' productivity and impact the release cycle.

1.4 Purpose

In this master thesis we evaluate improvements to the CI pipeline efficiency and possible enhancements of the software tests. Furthermore, we intend to keep cloud resource usage low by running as few VMs as possible. In the first place, we use functional testing combined with metrics on the code, namely cyclomatic complexity and code coverage. It is important to ensure the relevance of the tests in terms of how much of the code is being tested, the variety of the scenarios and the different paths in the control structure of the code. The motivation behind this is to improve the code quality and to accelerate the CI pipeline execution time, crucial for the developers' productivity and the product development timeline. Furthermore, we use different troubleshooting techniques: adding delay, connection loss, altering JSON responses and emulating a service being down, with the aim of testing scenarios that would not be explored by a traditional functional test or most manual tests.

Troubleshooting is performed to meet the user experience requirements of the web-app in extreme conditions (e.g., high latency, down services). Due to the heterogeneity of the cloud environment, the consistency of the different services used by the web-app cannot be guaranteed. Troubleshooting is performed with the purpose of improving the end-user experience in real-world scenarios that may not be reproducible in a testing environment. Moreover, we are

1Netem is a tool acting directly on the qdisc table of the operating system (it works only on Linux-based operating systems): https://wiki.linuxfoundation.org/networking/netem

monitoring the VMs used to complete those jobs, with the intention of reducing the number of cloud resources required for the continuous integration pipeline. We defined the goals as explained in the coming section.

1.5 Goal

The goals of this thesis are automating the software tests and improving them: their execution time, and their relevance compared to both typical real-world usage and extreme scenarios. Another goal is to test as much of the system as possible in terms of the quantity of code executed.

Hence the final objective is to increase the quality of the code and the resiliency of the system, and to decrease the CI pipeline completion time. As a matter of fact, fast and continuous test feedback is crucial for the developers' productivity. In order to reach the defined goals the following work is completed:

• Testing the system in extreme scenarios: high latency and unavailability of certain resources
• Testing the system with inconsistent data structures coming from the REST API
• Introducing code coverage with functional and End-to-End testing
• Introducing fuzz testing to test the system with unexpected input
• Performing load testing on the system

The ultimate goal is to run those jobs on a minimum number of runners while keeping the completion time low. In fact, we want the test feedback to reach the developers as fast as possible.

1.6 Research Methodology

The methodology used in this thesis is a quantitative research approach based on the following results:

• Experiments gathering values on the performance before and after the different improvements, both on the continuous integration pipeline and on the End-to-End tests.

• Comparing the improvements with the original system and with the state of the art established in software testing research.

1.7 Delimitation

It is essential to note the material outside of the scope of this thesis. First, the work of this thesis is concentrated on software testing and CI optimization. However, the software developed, referred to as the web-app, won't itself be studied in this thesis, neither its UI nor its back-end. Moreover, the optimization of the cloud resource allocation outside of the software delivery pipeline is also out of the scope of this thesis. Furthermore, the deployment of the web-app in production is also outside of the scope of this thesis.

1.8 Ethics and Sustainability

The data analyzed in this thesis is collected from the code management tool (GitLab)2 used for the development of the web-app. These data were anonymized, such that the data set cannot be used to retrieve per-user information such as the number of lines committed or the pipeline failure rate. Nevertheless, as this data set contains confidential information (code, commit names), it will remain in the company for confidentiality and intellectual property reasons. Nonetheless, all the details on how this data is used and analyzed are given in full in this thesis, and the analysis can be reproduced on any other GitLab project. Finally, regarding the social, economic and sustainability implications, there are no major negative implications, as the data retrieval is performed only once and the data set is fully anonymous. Moreover, a favorable implication of this thesis is the performance increase of the deployment pipeline and the optimization of cloud resource usage.

1.9 Outline

This thesis is divided into the following chapters:

• Introduction: introduction to the field of research - Chapter 1
• Background and Related Work: a comprehensive summary of the research problem - Chapter 2
• Methods: the methods and tools used for this thesis - Chapter 3
• Experiments: the experiments' setup, results and their analysis - Chapter 4
• Conclusion and future work: conclusion on the work done, wrapping up the thesis - Chapter 5

2GitLab is a web-based platform with CI/CD tools on top of a Git-repository.

2 Background and Related Work

2.1 Background on Software testing

Detailed below are the different types of software tests:

• Unit test is a test for functions or small fractions of the code. It does not take context or any external resources (database dependencies) into account. Unit tests verify small independent parts of the code based on preconditions.

• Fuzzy testing is a test that modifies a program's input, while remaining in the required range of structure, and monitors for any unexpected output or error. Fuzz testing (or fuzzing) is mainly used to detect security breaches or vulnerabilities and is one of the fastest evolving research fields in software testing.

• Smoke test is a test verifying that the basic functionalities are working as planned.

• Integration test is a test similar to a unit test, except that it tests the system together with its dependencies. For instance, in the case of a function calling a database, a unit test wouldn't call a real database whereas an integration test would.

• Functional test is a test similar to an integration test in the sense that it checks more than one component of the system at a time, for instance the UI and the database. However, a functional test also checks the values returned by those different components, as defined in the product definition (e.g. an integration test would query a database while a functional test would test the value displayed in the user interface). Note that a functional test can be a model-based test: the expected behavior of the system under test is defined in a model (UML or SysML) and an abstract test suite is created based on the model. Afterwards, actual tests are generated for the system.

• End to End (E2E) test is a test that verifies the complete set of functionality of a system by emulating the behavior of a user. In the case of the web-app, the test is performed using a browser automation tool (e.g. Katalon).

• Regression test is a test triggered after a change in the system (such as new or updated library/framework, new code or re-factoring). It checks if this change generates an error or unexpected behavior.

• Security test is a test that checks any potential security breach or flaw in the system to ensure its security.

• Performance/Load test is a test that puts the system under high load and monitors its performance. It is performed with JMeter in the case of our web-app.

• ”Chaos test” is a test that troubleshoots the system and assesses how it responds in the worst conditions: services down, extreme delays, servers not responding or incoherent data structures in an API response. It is used to increase software resilience by maintaining a decent user experience in an extreme-case scenario. This concept was introduced by Netflix with the Simian Army and the Chaos Monkey test framework (emulating AWS instance / region failures).

• Manual test is a test completed by a QA engineer to explore the system manually, without the use of any external automation tool.

Figure 1: Software testing Pyramid [1]

This set of different types of tests is often referred to as the software testing pyramid [2] (as shown in Figure 1), as being the most basic, simple and efficient testing framework. It contains at its core the most basic tests, unit tests, verifying only small fractions of the code, and at the highest level the tests that are the most difficult to maintain: integration, functional and End-to-End tests checking the entire system. Additionally, note that unit tests are not enough to verify a system: unit tests cannot check multi-threaded code, private methods or the compatibility of different elements.

High-level tests require the whole app to work, but break as soon as one of the components is down, making them not fully reproducible and hard to maintain. Furthermore, for the scope of this thesis, we focus mainly on End-to-End, chaos, fuzzy and performance testing, even though other types of tests (smoke, regression, unit) are also being used.

Having tests at the integration level is essential to maintain the quality of the system; moreover, automating them decreases the risk of human error and the cost in terms of human resources, and enables faster testing, even though some tests might not be suited to automation tools.

The approach taken for testing here is Behavior-Driven Development (BDD) [8], which suits End-to-End testing best. Moreover, there are also ways to prevent bugs in the first place with code parsing, e.g. tools such as DeepCode, TSLint or Sonar3. For the development of the web-app we are using TSLint as it is made explicitly for TypeScript code, whereas DeepCode or Sonar are more generic and harder to integrate in the pipeline. Those tools are based on general best practices, such as keeping functions small and with a small number of arguments and variables, as well as decreasing the cyclomatic complexity (detailed for instance in the book Clean Code [7]). Those tools give recommendations on the code without executing it, thus they do not replace software testing. Figure 2 gives a detailed diagram of what code is executed in the different scenarios of the CI pipeline.

3Static code analysis tools Deepcode https://www.deepcode.ai/, TSLint https://palantir.github.io/tslint/, Sonar https://www.sonarqube.org/


2.2 Pipeline Description

First of all, the development and Git practices are detailed in Figure 2.

Figure 2: Git practices description [3]

Each developer has one branch, and in most cases each branch is created to resolve a Jira story4. Then, after a story is finished, a merge request is created towards the master branch. It is either approved or rejected by a developer other than the one working on the branch, in the interest of having an external view on the code and keeping developers updated with the new code. The code on the master branch is then used for the deployment in production. However, this is not done through GitLab, and it won't be discussed in this thesis, as it is a lengthy and complicated process on its own.

Each of the actions detailed previously triggers a different pipeline containing different jobs (test, build, deployment). The CI pipeline is composed of the following steps, as detailed in Figure 3; note that this pipeline was specifically designed for the development of the web-app. In red are the improvements to the pipeline completed during this thesis. Furthermore, the End-to-End test completion time was decreased as a result of the work for this thesis. Each step (represented as a box in Figure 3) contains multiple jobs: deployment to a server, tests. This pipeline is triggered on 3 different types of events: push, merge and production deployment via a cron job. Each event triggers a different type of pipeline. Actually, as the number of Git pushes is considerably larger than the number of merges, the pipeline triggered by a push contains far fewer jobs, in the interest of reducing the load on the runners and the feedback loop time for the developers. Note that in the CI pipeline, to move from step n to n + 1, step n needs to complete successfully (similarly to the End-to-End Katalon test described in Figure 7). Each step is composed of different jobs that run on GitLab Runners. Each job is labeled with tags and runners are given a set of tags (such as: Linux, Windows, Deployment, UnitTest); the job is then executed on a runner with the corresponding tag. This can lead to an unfair load balance between the runners.

4Jira is a project management tool developed by Atlassian used for agile software development

Figure 3: CI/CD Pipeline description

The allocation and matching of a job to a runner can be described as the following birth-death process: a specific case of a Continuous-Time Markov Chain, as shown in Figure 4.

[Birth-death chain over the states (0, 0), (0, 1), (1, 1), (1, 2), ..., (1, n), with arrival rate λ and service rate µ]

Figure 4: Representation of process arrival in Gitlab [6]

Note that this Markov Chain only describes a single runner, with the state of the system being described by the following two values: (a, b), a being the number of jobs being processed and b the number of jobs in the queue. λ is the birth rate, thus the rate at which jobs arrive at the runner, and µ the death rate, so the rate at which the runner executes jobs. In this case, the birth and death rates are equal for all states, as all the jobs are considered to have equal execution time (in fact, each runner mainly runs one type of job). Moreover, the number of runners remains invariant for the scope of this thesis. However, the arrival rate depends on the time. The goal in optimizing runner resources is to have as few runners as possible being idle while keeping the waiting time as low as possible.
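To make the birth-death model above concrete, the following short simulation sketches a single runner with hypothetical arrival and service rates (the rates, like the rest of the snippet, are illustrative assumptions and not measurements from our set-up); it estimates the mean job waiting time and the fraction of time the runner is idle, the two quantities being traded off.

import random

def simulate_runner(lam=2.0, mu=3.0, n_jobs=10_000, seed=42):
    """Toy discrete-event simulation of a single GitLab runner modelled as an
    M/M/1 birth-death process: jobs arrive with rate `lam` (per hour) and are
    executed with rate `mu`. Returns the mean waiting time and the fraction of
    time the runner sits idle. The rates are hypothetical, not measured values."""
    rng = random.Random(seed)
    clock = 0.0            # arrival clock
    runner_free_at = 0.0   # time at which the runner finishes its current job
    total_wait = 0.0
    busy_time = 0.0
    for _ in range(n_jobs):
        clock += rng.expovariate(lam)          # next job arrival
        start = max(clock, runner_free_at)     # the job waits if the runner is busy
        total_wait += start - clock
        service = rng.expovariate(mu)          # job execution time
        busy_time += service
        runner_free_at = start + service
    makespan = max(clock, runner_free_at)
    return total_wait / n_jobs, 1.0 - busy_time / makespan

if __name__ == "__main__":
    mean_wait, idle_fraction = simulate_runner()
    print(f"mean wait: {mean_wait:.3f} h, idle fraction: {idle_fraction:.2%}")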

GitLab is used for the code management and build part, and Katalon, integrated directly into GitLab (detailed in 2.2), is used for the End-to-End testing. Note that there are other tools for CI, but we are using GitLab as it is the best suited for this web-app: it is open-source and covers multiple steps of the integration process.

2.3 Code and Testing Metrics

In order to guarantee the efficiency and the relevance of our tests at the functional level, we use the following metrics:

1. Code coverage: Instrumenting the code to gain insight into which lines are being executed, showing what our tests do not cover.

2. Cyclomatic complexity: It represents the number of independent paths through a given program. It is defined as follows:

(i) M = E − N + 2P
(ii) M = E − N + P

With E the number of edges, N the number of vertices, P the number of connected components and M the complexity. Formula (i) is used in the general case while (ii) is used for strongly connected graphs.

Figure 5: Example of Cyclomatic number computation

As shown in Figure 5, the cyclomatic complexity for the left graph is thus 9 − 8 + 2 × 1 = 3, as well as for the right graph: 10 − 8 + 1 = 3.

3. Halstead complexity:
η1: Number of distinct operators, N1: Total number of operators
η2: Number of distinct operands, N2: Total number of operands
The difficulty is defined as D = (η1 / 2) × (N2 / η2). It is a metric indicating how long it takes for a person to understand the code; this metric is critical as it can increase the time for code review and development. The volume is defined as V = N × log2 η, with η = η1 + η2 the vocabulary and N = N1 + N2 the program length. The effort is defined as E = D × V and represents the time required for the actual coding. For instance, with the code in Figure 6:

main() {
    int a = 1;
    int b = 2;
    int sum = a + b;
    printf("%d", sum);
}

Figure 6: Code Example for Halstead complexity computation

The unique operators are: main, (), int, =, +, ,, ;, printf. The unique operands are: a, b, sum, ”%d”, 1, 2.
η1 = 9, η2 = 6, η = 15; N1 = 16, N2 = 9, N = 25.
Thus we have a difficulty of D = 6.75 and an effort of E = 6.75 × 97 = 654.75.
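As a sanity check of the formulas above, the short script below recomputes the metrics from the counts of the worked example; the small difference in E comes from the text rounding V to 97.

import math

# Counts taken from the worked example above (Figure 6).
eta1, eta2 = 9, 6     # distinct operators / distinct operands
N1, N2 = 16, 9        # total operators / total operands

eta = eta1 + eta2     # program vocabulary
N = N1 + N2           # program length

difficulty = (eta1 / 2) * (N2 / eta2)     # D = (eta1/2) * (N2/eta2)
volume = N * math.log2(eta)               # V = N * log2(eta)
effort = difficulty * volume              # E = D * V

print(f"D = {difficulty:.2f}, V = {volume:.2f}, E = {effort:.2f}")
# D = 6.75, V ~ 97.67, E ~ 659.3 (the text rounds V to 97, giving E = 654.75)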

Integrating code coverage is usually done with unit tests; nevertheless, for this thesis it will be implemented with End-to-End tests, thus a custom mechanism needs to be set up to retrieve code coverage from the Katalon tests.

2.4 Integration Testing & Functional testing

We represent an End-to-End test scenario as the following state machine (inspired by the model described in Testing modeled by finite-state machines [9]). Let:
m be the total number of test cases and ϕ(i) the number of actions performed in the i-th test case,
P(s) the probability of failure in state s,
S the set of all possible states,
Sij the state representing test case i and action j (an action being a click or a form submission).

[State machine: the chain of states S11, S12, ..., S1ϕ(1), S21, ..., S2ϕ(2), ..., Smϕ(m) leads to Success, and each state Sij transitions to Failure with probability P(Sij)]

Figure 7: Representation of an End to End Test scenario [5]

Note that a test is considered successful if and only if our agent reaches the state Success. Thus the likelihood of success is defined as:

∏_{s∈S} (1 − P(s))

Moreover, the different P(s) depend on what type of action was taken as well as on the time at which the action was initiated. The total time of execution for the test is set to tmax = 90 minutes. Thus, the later an action is performed, the more likely it is to trigger a failure. Our goal in developing this app is to decrease the likelihood of an error and thus to increase the overall resilience and user experience of the web-app. In the case of our web-app we use in total m = 82 test cases, one of them written for code coverage retrieval (executed before the last) and one to communicate with a proxy for UI troubleshooting.
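For illustration, the likelihood of success can be computed directly from per-state failure probabilities; the values below are hypothetical and only show the computation.

from math import prod

# Hypothetical per-state failure probabilities P(s) for a short scenario;
# the real values depend on the action type and on when it is executed.
failure_probs = {"S11": 0.01, "S12": 0.02, "S21": 0.005, "S22": 0.03}

# A run succeeds only if no state triggers a failure.
p_success = prod(1 - p for p in failure_probs.values())
print(f"likelihood of success: {p_success:.4f}")   # ~ 0.9364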

To perform the End-to-End tests, we have various options for which tools to use. Detailed below in Table 1 is a concise comparison of the different End-to-End testing tools, performed for the work of this thesis:

Tool | Description | Usage
Katalon | Testing tool based on Selenium, with a GUI and extensions | Entirely manageable through a GUI; can be programmed with scripting in Groovy
Puppeteer | Headless Chrome for E2E testing; integrates code coverage; open-source | JS scripting
Selenium | Testing tool built with Java; open-source | Scripting with Java
PhantomJS | Testing tool fully written in JS; doesn't use any browser; open-source | JS scripting
Cypress | Testing tool working only with Chrome; open-source | JS scripting
SoapUI | Testing UI similarly to Selenium; offers web services (includes mocking) and includes load testing | Java programming
TestCraft | Testing UI; offers a rich graphical user interface, automatic scheduled test playback, database monitoring and confirmation email testing | Usable through a GUI or Selenium code
Autify | Testing UI, with a rich graphical user interface; offers integration with Jenkins and TestRail | Usable only through a GUI
Mabl | Testing UI, with a rich GUI; offers integration with Jira, Jenkins, CircleCI as well as email and PDF validation and API testing | Usable only through a GUI

Table 1: Comparison of different End-to-End testing tools

For our specific requirements, we are using Katalon as it gives both the simplicity of a clickable GUI to manage tests and the possibility to customize tests with scripting. This facilitates the implementation of the troubleshooting tests (done with Mitmproxy and Netem) into the Katalon test scenarios.

2.5 Fuzzy testing Background

One other type of test used in this system is fuzzy testing (also called fuzzing); it consists in modifying the input of the system in such a way that its value is different enough from the baseline to trigger unexpected behavior. Furthermore, the system is monitored for crashes, memory leaks or any exception thrown.

Fuzzing is usually performed in order to detect security breaches, but can also be used to detect errors or unexpected behavior. The most relevant parts of the code to be tested are the parts accessible by non-privileged users, since they are the entry points for a potential attacker of the system. The attack vector could be a file download, an API call or an HTML form. A fuzzer is characterized by the following parameters:

• Mutation- or generation-based algorithm: Describes whether the fuzzer uses an input seed to generate new input values using a genetic algorithm (mutation-based), as opposed to generating new inputs internally, either from scratch or using a grammar such as a BNF spec (generation-based). Note that exploring the input space this way can take up much more time: a mutation-based algorithm, for instance, can mutate the input by flipping bits one by one.

• Input-structure aware (smart) or unaware: Describes whether the fuzzer is aware of the input structure expected by the system or generates an input entirely randomly.

• Program-structure aware or unaware: Describes whether the fuzzer generates input that maximizes code coverage, testing as much of the code as possible: whitebox (structure-aware). It would also try to exercise as many branches of the control structure and function calls as possible. Alternatively, the fuzzer may use the code as a black box, fully unaware of its structure or complexity. Consequently, fuzzers unaware of the program structure tend to have lower coverage and detect fewer bugs. For instance, in Figure 8 the first if statement would have only 1 chance in 10^32 of being executed. However, a fuzzer unaware of the program structure runs much faster, as it doesn't go through the code or analyze execution traces. Hence, the most efficient way to test a program through fuzzing is to start with a blackbox fuzzer, as it is fast and easy to set up, and then, if not enough bugs were discovered, to move to a whitebox fuzzer reaching more code and control paths.

if (a == 27) {
    abort();
    return 1;
} else {
    continue();
}

Figure 8: Code example of an unreachable path via blackbox testing

Multiple different fuzzing algorithms can be used; below is a comparison of the most broadly used fuzzing algorithms considered for the fuzzy testing experiment on the web-app:

• AFL (American Fuzzy Lop), developed by Google. It modifies the input of a program based on techniques meant to increase the code coverage and the number of control structures executed. AFL utilizes a set of rules based on experiments on fuzzy testing: it was shown that flipping a single bit triggers the exploration of (on average) 70 new paths (in the control structure), while flipping 2 bits triggers 20 and flipping 4 bits 10 additional paths. [15]

• SAGE (Scalable Automated Guided Execution), developed by Microsoft; its high-level architecture is detailed in Figure 9. The given inputs are checked to see whether they trigger a

Figure 9: SAGE algorithm description [14]

crash; then the code coverage is computed and constraints are automatically established based on the executed code. For instance, in the statement if(x+3==7) the constraint is that x equals 4. This is then given as input to a constraint solver such as the Z3 SMT solver5 [14] (a minimal constraint-solving sketch using Z3 is shown after this list).

• DART (Directed Automated Random Testing), one of the first dynamic test generation algorithms; it analyses traces from the program execution and code coverage to explore all the possible paths in the control structure. One of its extensions is CUTE, which handles multi-threaded programs.

• T-fuzzing (Transform Fuzzing) modifies the program itself so that the different control structures are executed, covering all the code [16]. Unlike the previously described algorithms, T-fuzzing doesn't modify the input. It is nevertheless also coverage-based.

One of the most efficient, easy to use and easy to set up fuzzers is American Fuzzy Lop (AFL)6: it is a mutation-based, input-structure-aware and program-structure-aware fuzzer. The algorithm used to fuzz-test the web-app for the work of this thesis is derived from the AFL algorithm. The AFL genetic algorithm is described in Figure 10 (a chart created for the work of this thesis to describe the AFL algorithm).

5Constraint satisfaction solver developed by Microsoft, solving logical problems
6First created by Michał Zalewski, AFL is now maintained by Google: https://github.com/google/AFL

Figure 10: Flow Chart of the AFL algorithm (starting from an initial input seed, a mutation step generates a new population; a fitness score is computed, and elements above the desirability threshold are added to the population, otherwise the original population is restored)

AFL was chosen among other fuzzer algorithms since it is easy to set up for a web application, runs faster than other structure-aware fuzzers, and has a very active developer community supporting it. Moreover, AFL has been used in the development and testing of large-scale, established projects, revealing errors and bugs in MySQL, Mozilla Firefox and OpenBSD.
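To illustrate the constraint-solving step used by SAGE-style whitebox fuzzers described earlier in this list, the sketch below feeds the branch condition if(x+3==7) to the Z3 SMT solver through its Python bindings. It is a minimal, stand-alone example and not part of the thesis tooling.

from z3 import Int, Solver, sat

# Symbolic input variable corresponding to the program input x.
x = Int("x")

solver = Solver()
solver.add(x + 3 == 7)   # constraint derived from the branch condition if(x+3==7)

if solver.check() == sat:
    model = solver.model()
    print("input reaching the branch:", model[x])   # prints 4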

The fuzzer we use for the web-app is derived from AFL: it modifies the top-level keys of a JSON and, if a bug or an incoherence is detected (similar to computing the fitness score), the fuzzer then modifies the JSON deeper, starting from this specific key's value (similar to adding new elements to the population). The reason we are using this approach is that we want to explore the JSON tree, as in a real-world scenario any key can be deleted. In addition to this, we want to keep the fuzzing completion time relatively low. Hence we assume that if deleting a top-level key doesn't trigger a failure, we may proceed to a more critical key of the JSON causing a failure.
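To make this approach concrete, here is a minimal sketch of such a JSON key-deletion fuzzer. The check_webapp() oracle is hypothetical; in the thesis set-up the modified responses are replayed through Mitmproxy and the Katalon scenario, which is not shown here.

import copy

def check_webapp(payload: dict) -> bool:
    """Hypothetical oracle: replay `payload` against the UI (for instance through
    Mitmproxy) and return True if the web-app still renders correctly."""
    raise NotImplementedError

def delete_at(doc: dict, path):
    """Return a deep copy of `doc` with the key at `path` removed."""
    mutated = copy.deepcopy(doc)
    node = mutated
    for key in path[:-1]:
        node = node[key]
    del node[path[-1]]
    return mutated

def fuzz_json(baseline: dict, node=None, path=()):
    """Delete keys one level at a time; only recurse under a key whose removal
    already breaks the UI, keeping the number of mutations (and the run time) low."""
    node = baseline if node is None else node
    failing_paths = []
    for key, value in node.items():
        if not check_webapp(delete_at(baseline, path + (key,))):
            failing_paths.append(path + (key,))              # this key is critical
            if isinstance(value, dict):                      # explore deeper keys
                failing_paths.extend(fuzz_json(baseline, value, path + (key,)))
    return failing_paths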

2.6 Load testing Background

Load testing consists of putting high demand on a system in order to test its limitations. Those tests increase the load on:

• Network (congestion, reordering, delay)
• Database server

• Web Server

• Load balancer
• Client-side application

Thus, for the purpose of fitting a real-world scenario, we use different types of load tests:

• Capacity test: Simulates slowly increasing and decreasing load for a reasonably long time

• Stress test: Simulates spikes or short bursts of load, to assess that the system can successfully recover after a failure.

• Scalability test: Simulates increasing loads under varying resources (number of CPUs, memory, storage, network capacity) while measuring system performance.

• Robustness test: Simulates a long period of load to assess the system's sustainability.

• Volume test: Simulates an increasing volume of data on the system without increasing the load (for instance, increasing the number of keys without having a significantly larger request).

For the scope of this thesis, we focus on load testing and volume testing.

2.7 Literature review and related work

As discussed previously, software testing and continuous integration are a critical part of software development and an ever-evolving area of research. We can divide it into 3 different areas: chaos and fuzzy testing, load testing, and finally pipeline trace analysis.

First, chaos testing: the research conducted in this domain (mainly by Netflix) [4] consisted in emulating cloud malfunctions for the purpose of improving back-end resilience; here our goal is to improve front-end resilience to similar cloud and network-related issues, and more, such as inconsistent data structures, network delay or loss. In the same way as chaos testing emulates malfunctioning or unexpected behavior from either the system itself or an outside component, fuzz testing assesses whether the behavior of the system with unexpected input is acceptable.

Fuzz testing has lately been one of the most active areas of research in software testing. Note that most of this research is aimed at improving whitebox testing, as it is the most complex type of fuzz testing, and it is the type we'll study in this part. The performance of fuzz testing is described in [18]: the number of vulnerabilities discovered via a fuzzer (the AFL fuzzer) is 11, while only 3 were discovered using unit tests. However, as mentioned in [18], not all systems satisfy the prerequisites of fuzz testing: fuzzers don't necessarily work in a multi-threaded environment, and the compiler might not support code coverage. Overall, the conclusion of this paper is that fuzz testing is cheaper than unit testing and can achieve better results; however, it might not be applicable to all software. Furthermore, it is shown that resource usage should be monitored, as an uncontrolled memory allocation would crash the system and be hard to detect.

One other active topic of research is improving whitebox Fuzz testing performances.

In [17] is a description of how fuzz testing can be improved (using the SAGE algorithm). The main takeaway message of this paper is that the quality and relevance of the original seed file is crucial and prevails over the number of iterations performed: in fact, in this paper 76% of the bugs were found while using 2 to 4 iterations. Consequently, input generation is decisive for the fuzzer's performance, and to get higher-quality and more varied input, an approach proposed in An Intelligent Fuzzing Data Generation Method Based on Deep Adversarial Learning [19] is to use a Wasserstein Generative Adversarial Network (WGAN). The GAN architecture is composed of two neural networks: a generator network (such as an auto-encoder) and a discriminator trained to recognize whether an input is real data or was synthetically generated, as described in Figure 11.

Figure 11: GAN architecture [21]

This network is used to generate network frames that are structurally similar to the training set, without having received any information on the type of network protocol. Consequently, it can detect patterns in the input that were not necessarily written in the technical specification. For instance, with the EtherCAT protocol the following were discovered: 137 packet injection attacks, 353 man-in-the-middle attacks and 31 working counter attacks. In addition to this, the training time is fairly short: a Test Input Accepted Rate (TIAR) of 85% after 40 epochs. Note that here the WGAN network trains faster and has a higher TIAR than the GAN network with the same meta-parameters.

Then, another area of research is pipeline trace analysis; it aims at improving the project pipeline success rate and thus the deployment time. The research already done on this topic mainly concerns the release cycle and the re-usability of the code, with for instance the COCOMO (COnstructive COst MOdel) metric [11] being used as a measure of the difficulty to write and integrate the code. Furthermore, load testing and tool benchmarking is also a very active area of research in software testing. Research was performed comparing the different types of load tests and tools being used, and how each controls a different aspect of the platform's resilience to load [12]. Moreover, other studies were conducted to compare the different tools [13]. Both of those studies conclude that JMeter both provides more accurate results and has a more accessible user interface that can be customized with plugins. Note that for the specific needs of this thesis we'll use JMeter7.

7A tool used for load testing and monitoring performance, focused on web applications: https://jmeter.apache.org/

Finally, one more area of research is intelligent software testing. It utilizes both fuzzing and search-based exploration to explore an application. It is primarily developed by Facebook for Android application testing (the Sapienz project [20]) and was able to discover 558 unique crashes in the top 1000 most popular Android applications. As previously described, autonomous UI or functional testing is expensive and tedious to maintain; however, it usually represents well the typical real-world usage of a system.


3 Methods

3.1 Code coverage for End-to-End testing

Code coverage is defined as a measure of the quantity of code being executed during a test. It gives a metric on how much of the system is being tested, hence indicating how relevant those tests are. Moreover, by having line-by-line coverage on the functional tests, the code not verified by E2E tests can be checked with additional unit tests. Some code is ”unreachable” with End-to-End testing; for instance, code executed in case of a page-not-found error is never executed during an End-to-End test. Conventionally, code coverage is performed with unit tests; here the goal is to integrate code coverage with our End-to-End tests. To obtain code coverage on E2E tests, one possible option would be to shift all our scripts to Puppeteer or Cypress scripts and to use their integrated code coverage functionalities. However, as the work to convert and maintain the current test scenarios as Puppeteer or Cypress scripts is not negligible, Katalon and the current scripts are kept. Moreover, Puppeteer and Cypress don't offer all the functionality that Katalon has, such as video recording.

Thus, to integrate coverage, the following 4 steps are performed:

1. Instrument the code: Adding counters for every instruction, as well as one for every branch in the control structure (as shown in Figure 12). It is essential to keep the original non-instrumented files stored in a backup directory, as they will be required to display the coverage line-by-line and to recover the original code of the web-app. For this, we use a tool called Istanbul8 that automatically generates instrumented .js files. Other tools exist, such as blanket.js; however, they do not offer the ability to merge multiple coverage objects or to generate a very detailed HTML coverage report. Note that this process can be lengthy for a large project, as each file's instrumentation can take up to 7 seconds (for a thousand lines of code). Moreover, all those extra instructions add to the execution time of the code: for the web-app it is a 30-40% increase in execution time.

Original code:

function add(a, b) {
  return a + b
}
variable = true

Instrumented code:

counter.statement[0]++
function add(a, b) {
  counter.function[0]++
  counter.statement[1]++
  return a + b
}
counter.statement[2]++
variable = true

Figure 12: Instrumenting code an example

8Istanbul is a JavaScript code coverage tool instrumenting the code (adding counters for each instruction) and converting raw coverage JSON into HTML reports https://istanbul.js.org/

2. Run the E2E tests: The tests run with a "custom capability" browser (as named in Katalon): a browser with an extension, here Custom JavaScript (CJS)9, used to inject JS code in the browser. This part of the process might take longer to run due to the instrumentation of the code; this must be taken into account when writing the tests.

3. Retrieve coverage & convert to HTML: Call a custom JS function injected using CJS to display the coverage object in the browser console as a JSON object. The result is an array of coverage objects (one object for each page reload); it is then merged into one using Istanbul-merge. Then we use Istanbul to convert this raw JSON object into an HTML page (a minimal sketch of this retrieval step is given after this list).

4. De-instrument the code: Recovering the original state of the system by deleting the instrumented directory and restoring the un-instrumented backup directory.

After converting the coverage JSON object to HTML we obtain the result shown in Figure 13.
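As an illustration of the retrieval step (step 3), the sketch below reads the counters that Istanbul-instrumented code keeps in the global window.__coverage__ object. It drives the browser with Selenium purely for brevity, whereas the actual set-up goes through Katalon and the CJS extension; the URL and file names are placeholders.

import json
from selenium import webdriver

# Illustrative sketch only: the thesis retrieves coverage through Katalon and the
# CJS browser extension, not through Selenium. Istanbul-instrumented code keeps
# its counters in the global `window.__coverage__` object.
driver = webdriver.Chrome()
driver.get("https://webapp.example.com")   # hypothetical URL of the instrumented web-app

# ... drive the End-to-End scenario here ...

coverage = driver.execute_script("return window.__coverage__;")
with open("coverage/raw-run1.json", "w") as fh:
    json.dump(coverage, fh)                # one raw coverage object per page load / run
driver.quit()

# The raw JSON files can then be merged (Istanbul-merge) and converted into an
# HTML report (Istanbul), as described in the steps above.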

Figure 13: HTML Code Coverage Report

The branches' execution percentage thus usually decreases with increasing cyclomatic complexity.

3.2 Automating Coverage for End-to-End tests

The objective with the code coverage is to have it included in GitLab as a cron job, running regularly. Code coverage is performed as a means to guarantee the relevance of the tests throughout the development of the web-app: it guarantees that all the code is tested. To obtain code coverage we created a web-based protocol instrumenting and de-instrumenting the JS code on our web server, in order to be able to generate the instrumented code remotely. This protocol calls a Python script utilizing the previously mentioned tools: Istanbul and Istanbul-merge. This test runs weekly, at a time when both of the teams are not pushing code, as the

9Chrome extension to inject JS (or CSS) code in any website, https://chrome.google.com/webstore/detail/custom-javascript-for-web/poakhlngfciodnhlhhgnaaelnpjljija

instrumentation slows down the web-app with a 30% increase in JavaScript execution time. Picking a time for the coverage test is not trivial, as running it requires a GitLab runner and a web server for 30 minutes. Moreover, some of the teams working on this system have a −9 hour time shift with Europe, thus coverage runs at 1 AM on Sunday. Coverage cannot run more than once a week mainly for this reason.

3.3 Background and Set-up description

The web-app is deployed in a cloud environment in which we cannot control the availability of all the services or the consistency of the data structures. We need to have a system that can remain functional in a heterogeneous cloud environment. Thus we want to be able to emulate how the web-app would react to an incoherent data structure (such as a missing key in a data structure), an unexpected 500 response code or a missing resource.

Tool | Language | Possible action | OSI level
Mitmproxy | Python | Intercept HTTP and HTTPS queries and responses and modify them on the fly | Application
Toxiproxy | Go | Cut TCP connections, decrease bandwidth, add latency | Network
Fiddler | JS | Modify HTTP/HTTPS responses on the fly, similarly to Mitmproxy, with scripts written in JS | Application
Netem | Shell | Update the qdisc table (on Unix systems) to add delay, packet loss or reordering; emulates WAN network characteristics | Network
Google Chrome | GUI | Integrated with the Chrome developer tools (More tools - Network conditions); doesn't offer any automation | Application

Table 2: Comparison of the different tools used for troubleshooting

For the specific needs of this study, the tool that we are using is Mitmproxy, as it gives the possibility to edit any HTTP response. It is open-source, easy to set up, programmable with Python scripts and works seamlessly on Windows and Linux. Furthermore, Netem is used to emulate different network conditions, as it is highly configurable and can easily be automated. Overall, based on the research on existing troubleshooting tools outlined in Section 3.3, it is the best suited tool for the work of this thesis. However, the Google Chrome developer tools are also used instead of Netem in some cases, as they don't require any extra installation and are the easiest solution, but they don't include all the capabilities of Netem and won't work with another browser.
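As an illustration of how Netem can be automated from a test script, a minimal sketch could look as follows; it only wraps the tc command, it must run as root on a Linux host, and the interface name and parameter values are assumptions.

import subprocess

IFACE = "eth0"   # hypothetical network interface; adjust to the test machine

def add_netem(delay_ms=200, jitter_ms=50, loss_pct=1.0):
    """Apply delay, jitter and packet loss with netem (Linux only, needs root)."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", IFACE, "root", "netem",
         "delay", f"{delay_ms}ms", f"{jitter_ms}ms",
         "loss", f"{loss_pct}%"],
        check=True,
    )

def clear_netem():
    """Remove the netem qdisc and restore normal network behaviour."""
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root", "netem"], check=True)

if __name__ == "__main__":
    add_netem()
    try:
        pass   # run the Katalon / E2E scenario against the degraded network here
    finally:
        clear_netem()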

Concerning the tool used to enable communication between the proxy and Katalon, we have the options shown in Table 3. From the comparison of the different mechanisms in Table 3, we concluded that the best suited for our use case is named pipes, as it works with different programming languages (in our case Java, Groovy and Python) and is compatible with both Linux and Windows, a critical point in our case as both OSes are used on the CI servers.

Mechanism | Description | OS compatibility
Shared memory | Multiple processes are given access to a block of memory; requires a synchronization mechanism | All OS
Named pipe | A pipe implemented through a file; multiple processes can read/write seamlessly | All OS
Socket | Data is sent over the local interface (loopback) | All OS
Signal | A signal is sent to another process; generally used to command a process and not to transfer data | Most OS
Pipe | A one-way buffered communication channel | Most OS

Table 3: Comparison of the different Inter-process communication tools

3.4 Implementation description

As discussed previously, to test the web-app in extreme scenarios we are using Mitmproxy, a Python proxy, to intercept HTTP traffic at the application layer and modify it on the fly. These methods also work with HTTPS connections, but the SSL certificate is then generated by Mitmproxy. Hence, this approach doesn't work with a system using certificate pinning (only allowing certificates that were issued by the Certificate Authority (CA) of the website domain). This limitation isn't specific to Mitmproxy and also applies to other software. Our set-up for chaos troubleshooting is detailed in Figure 14. Note that this set-up was specifically created for the work of this thesis.

Figure 14: Troubleshooting setup description
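For illustration, a minimal Mitmproxy addon in the spirit of this set-up could look as follows; the matched URL paths, the dropped key and the injected fault are hypothetical, and the actual scripts used for the thesis drive these modifications from the Katalon scenarios.

"""fault_injection.py -- run with: mitmproxy -s fault_injection.py"""
import json
from mitmproxy import http

class FaultInjection:
    def response(self, flow: http.HTTPFlow) -> None:
        # Hypothetical endpoint of the web-app's REST API.
        if flow.request.path.startswith("/api/items"):
            # Emulate a service being down for this endpoint.
            flow.response.status_code = 500
            return
        if flow.request.path.startswith("/api/user"):
            # Emulate an incoherent data structure: drop a top-level JSON key.
            try:
                payload = json.loads(flow.response.get_text())
                payload.pop("preferences", None)   # hypothetical key
                flow.response.set_text(json.dumps(payload))
            except (ValueError, TypeError):
                pass   # leave non-JSON responses untouched

addons = [FaultInjection()]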

3.5 Load Testing Tools Comparison

For this specific set of tests, we are using Apache JMeter, even though it isn't the most accessible tool to use or set up. Nevertheless, it is open-source, free, offers numerous testing possibilities and has an extensive community of users offering support, documentation and code examples. This choice was made after research, performed for the work of this thesis, on the most efficient load testing tools, presented in Table 4.

Tool | Language | Possible action
Apache JMeter | Java | Increase load and volume; SSL certificate validity; offers basic performance graphs; open-source
JMeter in Katalon | Groovy/Java | Integrating Katalon API testing in JMeter
WebLoad | Java | Increase load and volume; integrates with most CI/CD tools; offers a UI with performance graphs
Grinder | Jython (Java implementation of Python) | Increase load and volume; no integrated UI or performance graph display; open-source
Tsung | Erlang | Increase load and volume with a randomized user arrival distribution; HTML performance report; open-source
Gatling | Scala | Increase load and volume; also supports multiple L7 protocols; yields very detailed HTML reports; open-source

Table 4: Comparison of the different tools for load testing

But it should be remembered that there exists software improving the already existing load testing tools by adding a layer on top of them, for instance Taurus: an open-source wrapper for JMeter, Grinder or Gatling adding functionalities that make test writing easier and providing live and detailed JUnit reports.

4 Experiments

4.1 Functional test traces analysis

Using the GitLab API, we were able to extract metrics (completion time, pipeline failure rate, Katalon test failure rate, commit size) on the first 3 steps (code management, build, testing) of the CI pipeline and use them to improve the success rate of the pipeline. We are analyzing them with a view to decreasing the deployment time.
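For reference, a minimal sketch of such an extraction through the GitLab REST API could look as follows; the instance URL, project id and token are placeholders, and the actual extraction and anonymization scripts of the thesis are not reproduced here.

import requests

GITLAB = "https://gitlab.example.com/api/v4"   # hypothetical GitLab instance
PROJECT_ID = 1234                               # hypothetical project id
HEADERS = {"PRIVATE-TOKEN": "<personal-access-token>"}

def fetch_pipelines(per_page=100, max_pages=50):
    """Download the pipeline history (id, status, timestamps) page by page."""
    pipelines = []
    for page in range(1, max_pages + 1):
        resp = requests.get(
            f"{GITLAB}/projects/{PROJECT_ID}/pipelines",
            params={"per_page": per_page, "page": page},
            headers=HEADERS,
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        pipelines.extend(batch)
    return pipelines

if __name__ == "__main__":
    history = fetch_pipelines()
    failed = sum(p["status"] == "failed" for p in history)
    print(f"{len(history)} pipelines, {failed} failed")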

First, to detect the faulty part of the code, we are not able to use fault location techniques on the Katalon End-to-End tests, as the two main techniques for fault location (spectrum- or mutation-based), detailed below, cannot be applied to our End-to-End tests.

• Spectrum-based fault location: Running multiple tests and analyzing where a fault could occur in the code based on which lines are executed in the successful and non-successful tests.

• Mutation-based fault location: Running the same test multiple times while slightly modifying the code (one operator, for instance) and exploring the impact on the test output.

Note that both of these techniques require code coverage. The End-to-End test scenario takes over 20 minutes, and having different instances of it and running them multiple times would increase the testing time to over an hour, which is not acceptable for a continuous integration pipeline. For this reason, we assume that the committed code triggering a failure in the test is, in its entirety, faulty code. As we ran the tests we were able to notice that there is a correlation between the cyclomatic complexity and the part of the code that triggered tests to fail.

Figure 15 shows how the Katalon tests' execution time evolved as a function of the advancement of the project (pipeline number):

Figure 15: Evolution of Katalon test duration

As shown in Figure 15 the total completion time of the test doesn’t increase linearly and

tends to a limit after 40% of the pipelines launched. In fact, at the beginning numerous pages are being added to the web-app and numerous tests are being written for them, and changing pages is one of the longest parts in the execution of an End-to-End test. Note that the metric here is the pipeline number, not the date, for privacy reasons. Thus we can conclude that even though our project is getting bigger, the testing time remains acceptable even with End-to-End tests. Furthermore, the Katalon test execution represents 75.4% of the total merge pipeline execution time, as detailed in Figure 16 below; thus optimizing it is crucial to reduce the pipeline execution time.

Figure 16: Pipeline execution time distribution

After implementing code coverage and optimizing the Katalon tests based on the results obtained, the completion time of the Katalon tests was significantly decreased: a 200-second decrease in Katalon execution time. Furthermore, optimizing the waiting time allocated for page loading, i.e. the time between the execution of Siϕ(i) and S(i+1)1 (as described in Figure 7), decreased the total execution time by 43 seconds. The total time spent waiting for page loads during the Katalon test is 400 seconds, representing 36% of the total Katalon test execution time. The results of the Katalon test completion time optimization are summarized in the table below.

Optimization | Time gained (seconds)
Test refactoring after code coverage | 200 (16%)
Object detection | 27 (2.2%)
Page load waiting | 43 (2.8%)
Total | 270 (21%)

Finally, note that there is another way to decrease the completion time of an End-to-End test: mocking the API responses. By using a proxy to intercept the API requests and responding with prerecorded responses, we save a significant amount of time otherwise spent waiting for a remote server to respond. With this approach the API response time would be divided by 10. However, this approach is not suitable for our case, since both the front end and the back end are tested at the same time, so the API responses cannot be mocked.

4.2 Parameters influencing failure rate

The second part of the data analysis was pursued with a view to discovering a correlation between the number of tests failing and metrics related to the code pushed (type and number of lines modified, code complexity, commit size). First, the overall success rate is 70.6%. However, the failure rate of the pipelines containing the Katalon tests (only a Git merge triggers Katalon, a commit doesn't) is 70.75%. Even though only 21% of the pipelines contain Katalon tests, Katalon is responsible for 46% of all pipeline failures. Finally, among all the pipelines triggered by a merge (executing Katalon), Katalon tests caused 69.1% of the failures. The distribution of jobs triggering pipeline failures is shown in Figure 17. In order to identify the relevant factors, we computed the Bravais-Pearson correlation between the failure rate and the different factors (number of modified files, number of lines committed, type of files committed). The correlation is computed using the following formula:

r_xy = Σ (x_i − x̄)(y_i − ȳ) / √( Σ (x_i − x̄)² · Σ (y_i − ȳ)² )

with r_{xy} being the correlation between x and y, x and y the two compared arrays (of equal size), and \bar{x}, \bar{y} the mean values of those arrays. The correlation is considered low if it lies between -0.5 and 0.5.

Parameter                                           Bravais-Pearson correlation
Number of modified files                             0.13
Number of .html lines committed                     -0.21
Number of .ts lines committed                       -0.58
Number of .groovy (Katalon test) lines committed     0.01
Total number of lines added                          0.35
Cyclomatic complexity                               -0.82
Halstead complexity                                 -0.74

Table 5: Correlation between code metrics and pipeline success rate

From Table 5 we can see a strong correlation between the cyclomatic complexity and the pipeline failure rate.
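For reproducibility, the values in Table 5 can be recomputed directly from the pipeline traces; the sketch below assumes two equally sized numeric lists, for example the cyclomatic complexity of each merged commit and the corresponding pipeline outcome (1 for success, 0 for failure).

# Minimal sketch of the Bravais-Pearson correlation used for Table 5.
from math import sqrt

def pearson(x, y):
    """Correlation between two equally sized numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / sqrt(var_x * var_y)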

Thus, as detailed above, a large part of those parameters do not have any impact on the total pipeline failure rate. The parameters having a significant correlation with the failure rate (≥ 0.5 or ≤ −0.5) are the code complexity metrics: the cyclomatic and the Halstead complexity. This means that a higher number of possible paths in the code tends to increase the risk of pipeline failure (note that it is not necessarily due to a Katalon test failure), as it gets more challenging to write scripts exploring all the possible paths, whereas covering more functions is easier when writing tests.

Figure 17: Job triggering pipeline failure distribution

The Halstead complexity also has an impact on the failure rate. Variables or operands added to the code can cause errors or incompatibilities with the rest of the code. Moreover, since multiple developers write the code, code with a high Halstead complexity gets harder to understand; functions with a high Halstead complexity written by multiple developers are therefore more likely to cause a pipeline failure.

To tackle this issue of increasing failure rate, we installed a tool (tslint) measuring the cyclomatic complexity within the IDE (code editor) in order to keep it within boundaries and keep the functions relatively small and straightforward. Also note that code containing too many variables and operands could cause problems later on, as the functions become harder for other developers to understand and extend.

Moreover, one surprising fact in favor of the use of End-to-End tests is the very low correlation between the number of committed .html lines and the failure rate. HTML files contain the page structure and the button identifiers that Katalon uses to navigate through the app. This is thanks to efficient communication between the QA engineers and the developers, so that the tests and the code are updated simultaneously, and to the smart-locator techniques used to locate objects (buttons, links, form inputs) on the web page. By using different locator attributes (data-*, class, id, or text) we are able to decrease the number of "object not found" errors causing Katalon test failures.
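The fallback idea behind the smart locators can be illustrated as follows; this is a Python/Selenium sketch of the principle only (the real tests are Katalon/Groovy scripts) and the selectors are placeholders.

# Illustration of the smart-locator fallback idea; selectors are placeholders
# and the real tests are written as Katalon (Groovy) scripts.
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

CANDIDATE_LOCATORS = [
    (By.CSS_SELECTOR, "[data-test='submit-order']"),   # stable data-* attribute first
    (By.ID, "submit-order"),
    (By.CLASS_NAME, "submit-button"),
    (By.XPATH, "//button[normalize-space(text())='Submit']"),  # visible text last
]

def find_with_fallback(driver):
    """Try each candidate locator in turn so a renamed id or class does not break the test."""
    for by, selector in CANDIDATE_LOCATORS:
        try:
            return driver.find_element(by, selector)
        except NoSuchElementException:
            continue
    raise NoSuchElementException("No candidate locator matched the target element")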

Figure 18 presents the pipeline success rate as a function of the number of committed .ts files; we can see that the number of files and the success rate are inversely related. Note that the last failure rates, for 6 and 7 files, are based on only a few pipelines (3-4), whereas the others are based on hundreds. Consequently, this confirms that smaller commits tend to have a higher chance of being integrated into the project, as they have a lower pipeline failure rate. Finally, it should be recalled that the CI pipeline is ever-evolving and jobs are constantly added to or withdrawn from the pipeline. It is therefore not relevant to plot the overall pipeline success rate as a function of time, as both the optimizations and the pipeline modifications change the success rate.

Figure 18: Pipeline success rate as a function of the number of .ts files modified

4.3 Gitlab Runner usage optimization

As mentioned in part 2.2, the load on the runners (the VMs used for the CI pipeline with Gitlab) can be modeled as a Markov chain as in Figure 4. More precisely, it is an M/M/1 queue; this queue becomes unsustainable if λ > µ. Note that only the arrivals between 2 pm and 6 pm are considered, as this is the only time when the load on the system is significant. The first part of the server optimization concerns the usage of the runner dedicated to End-to-End testing and preventing it from becoming the bottleneck of the CI pipeline, since it runs the longest job and that job can only be run on a specific runner (it requires specific software to be installed). From the pipeline trace analysis, the arrival rate of new jobs on the runner associated with the End-to-End (Katalon) tests is λ = 1.2 jobs per hour. Hence, since the arrival rate is λ = 1.2/hour and the service rate is µ = 2.57/hour, the load on the Katalon runner is sustainable and the queue will not keep growing even under high load.
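This sustainability check follows directly from the standard M/M/1 formulas with the measured rates; the sketch below only restates that computation and does not add new measurements.

# M/M/1 sanity check for the Katalon runner with the measured rates.
lam, mu = 1.2, 2.57        # arrivals and completions per hour (measured, 2 pm - 6 pm)

rho = lam / mu                              # utilisation; the queue is stable only if rho < 1
jobs_in_system = rho / (1 - rho)            # L  = rho / (1 - rho)
time_in_system_min = 60 / (mu - lam)        # W  = 1 / (mu - lam), in minutes
wait_in_queue_min = 60 * rho / (mu - lam)   # Wq = rho / (mu - lam), in minutes

print(f"rho = {rho:.2f}, L = {jobs_in_system:.2f}, "
      f"W = {time_in_system_min:.1f} min, Wq = {wait_in_queue_min:.1f} min")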

However, the CI pipeline is composed of other types of jobs: unit tests, build, and deployment. Those jobs do not run on the same server as the functional tests, since they require less specific software to be installed. They nevertheless also require active runners; the goal is thus to evaluate the minimum number of runners required to maintain a functioning CI pipeline.

Each of those jobs runs on a different runner and is triggered by each Git push, meaning that each job is triggered on average every 9.5 minutes. Based on the completion time of those jobs and the frequency of Git pushes and merges, we can establish that the two runners already present in the system are sufficient. However, the load balance can be optimized so that the arrival rate λ is more evenly distributed among the runners, and optimizing deployment speed by choosing runners in the same cluster would enable on average a 15% decrease in both Git merge and Git push execution time. For this allocation optimization only the job tags need to be updated; for intellectual property reasons the runners' tags cannot be disclosed in this thesis.

On top of this, another runner is dedicated to the Katalon tests. Note that adding a second runner dedicated to the Katalon tests would enable faster pipeline completion. With up to 3 Katalon jobs in the waiting list at peak load, adding a second runner would decrease the waiting time for Katalon jobs by about 2 minutes: indeed, 10% of the Katalon tests are queued and the Katalon test execution time is 22 minutes. However, given the associated cost and maintenance for a relatively small gain, it was decided not to add a second Katalon runner.

Moreover, we can modify GIT_DEPTH, a parameter indicating how much of the Git history is retrieved when the code is copied to a Gitlab Runner. Decreasing GIT_DEPTH to 5 would reduce the repository size by 6.2% and thus speed up the copy to the different runners.

Overall, from all the optimizations performed on the pipeline, we have been able to decrease the total pipeline execution time by 320 seconds, as shown in Figure 19.

Figure 19: Merge Pipeline total execution time evolution

With: 1∗ being the Katalon test optimization, 2∗ being the runner usage optimization, and 3∗ being the runner location and GIT_DEPTH optimization.

4.4 Troubleshooting and Fuzz Testing Results and Analysis

From the Mitmproxy troubleshooting experiment deleting JSON keys, we obtained the following results:

• 1/3 of the deleted keys triggered no error in the UI and thus had no impact on the user experience,

• 1/3 of the deleted keys compromised some of the displayed information, but the platform was still fully functional and all the buttons/links were working,

• 1/3 of the deleted keys caused the platform to stop working: no display, or broken links.

We can thus conclude that, for the key-deleting troubleshooting part, roughly a third of the deletions were useful to detect critical bugs freezing the app; the third compromising the displayed information was not repairable, as the information was simply missing. Finally, for the third of deleted keys not triggering any error, we can conclude that the REST API contains too much data for some use cases: some keys are never used and could be removed to decrease transfer time. It should be remembered that those results were obtained by first deleting the highest-level keys and then deleting keys deeper in the JSON structure based on the results of the high-level-key experiments (a method derived from the AFL algorithm). Deleting only the top-level keys also triggers failures or unexpected behavior, but it would give a less comprehensive analysis of the system's resiliency since, in the real world, not only the top-level keys are altered.
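The key-deletion step can be scripted as a small mitmproxy addon; the sketch below assumes a recent mitmproxy version, and the target host and the key to drop are placeholders chosen per experiment rather than the values actually used.

# Minimal mitmproxy addon sketching the key-deletion step (run with `mitmproxy -s <file>`).
# TARGET_HOST and KEY_TO_DELETE are placeholders; in the experiments the keys were
# selected level by level, starting from the top of the JSON tree.
import json
from mitmproxy import http

TARGET_HOST = "api.example.com"
KEY_TO_DELETE = "userProfile"

def response(flow: http.HTTPFlow) -> None:
    if TARGET_HOST not in flow.request.pretty_host:
        return
    if "application/json" not in flow.response.headers.get("content-type", ""):
        return
    try:
        body = json.loads(flow.response.get_text())
    except (json.JSONDecodeError, TypeError):
        return
    if isinstance(body, dict):
        body.pop(KEY_TO_DELETE, None)             # drop the selected key
        flow.response.set_text(json.dumps(body))  # forward the mutated payload to the web-app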

Then, after adding delays (1.5 to 30 seconds) with Netem, 3 bugs were detected, all of which crashed the app completely. This test was critical since this type of delay occurs in real-world scenarios, not necessarily as a network delay but also as a response delay from the server; it happened on the web-app in a testing environment (since testing environments have minimal resources).

Moreover, as some of the resources of the web-app are loaded asynchronously, long delays may lead to a dependency error: some resources depend on each other to be executed, as shown in Figure 20. One error of this type was detected. We then modified the HTTP response code to 404 and 500 (not found or server error) for both CSS and JS resources.

For each JS resource turned into a 404 or 500, the web-app crashed completely (note that Chrome is sufficient to perform those tests, as specific URLs can be blocked through the Chrome developer tools). The 404 and 500 HTTP response codes for the CSS files had a serious impact on the user interface and the web-app became barely usable. Disabling JavaScript in the browser made the web-app unusable while no error message was displayed, an issue corrected later with a clear error message.
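The status-code experiment can likewise be scripted through the proxy; the sketch below assumes a recent mitmproxy version, and the matched extension and injected status code are experiment parameters rather than fixed values.

# Sketch of forcing 404/500 responses for selected static resources through mitmproxy
# (an alternative to blocking the URLs in the Chrome developer tools).
from mitmproxy import http

FORCED_STATUS = 500            # switch to 404 for the "not found" experiment
TARGET_EXTENSIONS = (".js",)   # switch to (".css",) for the stylesheet experiment

def request(flow: http.HTTPFlow) -> None:
    # Answer matching requests directly so the real resource is never served.
    if flow.request.path.split("?")[0].endswith(TARGET_EXTENSIONS):
        flow.response = http.Response.make(
            FORCED_STATUS, b"", {"content-type": "text/plain"}
        )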

Figure 20: Example of an asynchronous request failure

Finally, for the APIs returning JSON content, to test the resilience of the web-app to extremely large responses, the JSON responses are increased in size through the proxy, both linearly and recursively. Figures 21 and 22 show the tree representing the JSON data structure (JavaScript Object Notation, a key-value, text-based data format), with the original JSON on the left and the JSON after the size increase on the right.

• Linear size increase:

Figure 21: Example of Linear size increase of factor 2 on a JSON of depth 2

• Recursive size increase:

Figure 22: Example of Recursive size increase of factor 2 on a JSON of depth 2

Let m be the size increase factor and D the maximum depth of the JSON. The total size of the returned JSON is in the order of O(m × D) for the linear increase and O(m^D) for the recursive increase.
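A minimal sketch of the two inflation strategies is given below, assuming a simple interpretation in which the linear variant duplicates keys only at the top level of the JSON while the recursive variant duplicates keys at every level; key names and factors are illustrative only.

# Sketch of the two JSON inflation strategies applied through the proxy.
import copy

def inflate_linear(node, m):
    """Duplicate each top-level key m times: total size grows roughly as O(m * D)."""
    if not isinstance(node, dict):
        return node
    return {f"{key}_{i}": copy.deepcopy(value)
            for key, value in node.items() for i in range(m)}

def inflate_recursive(node, m):
    """Duplicate keys at every level: total size grows roughly as O(m ** D)."""
    if not isinstance(node, dict):
        return node
    children = {key: inflate_recursive(value, m) for key, value in node.items()}
    return {f"{key}_{i}": copy.deepcopy(value)
            for key, value in children.items() for i in range(m)}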

When running with the recursive size increase, we noticed that the actual time increase (in terms of code execution) is of the same order as the JSON size increase (with size increase factors of 2×, 3×, and 4×, the time increase was respectively 2.2×, 3.4×, and 4.7×). However, it crashes the web-app with an increase factor higher than 7, as the HTTP response containing the JSON was then 16 MBytes, which is too large for Google Chrome to handle.

However, with a linear size increase of factor m = 10, the execution time difference was not measurable and no difference was noticeable in the user interface, even though the JSON responses are substantially larger than what the rest of the system (database, back-end servers) is designed to handle. Moreover, we obtained similar results for both of those experiments with different websites and their APIs, not only our web-app. Note that the experiments were run on the same browser: Google Chrome. This highlights the fact that the time increase is mainly due to the processing time of Chrome's JS engine (V8) interpreting and executing the code, as well as the machine's performance, and not to the complexity of the executed algorithm.

Compared to the state of the art in fuzz testing (notably fuzzing using a GAN network [19]), the techniques used in this thesis enabled the detection of bugs and unexpected behavior, but no security breach was detected. The advantage of the approach used here, however, is that it is fast and simple to set up and can be run on any web application or API returning JSON; furthermore, the issues were detected in scenarios very close to typical real-world scenarios, yet outside of the typical test scenario and test environment.

5 Conclusion and future work

Software testing has become more and more critical as deployment cycles on the cloud get shorter and shorter. End-to-End testing is used (in addition to unit tests, smoke tests, and load tests) because, besides testing the code itself, it also exercises all the resources (databases, APIs). Those resources did cause failures, but they only represent a small percentage of the total failure rate (10% of the Katalon failures are not related to the web-app development). Moreover, the failures are rarely due to DOM inconsistencies and are more often caused by the actual TypeScript code. However, by default Katalon End-to-End tests do not provide any coverage option, so we had to use a custom implementation: Istanbul to instrument the code, CJS to retrieve the coverage object, and Istanbul again to convert this object into a human-readable HTML report.

This shows that End-to-End tests can provide metrics on how much of the code is executed, which can be used to improve the tests. Nonetheless, End-to-End tests do not pinpoint the faulty code and remain more time-consuming to maintain and execute. They remain mandatory, however, since errors not anticipated in unit test preconditions, or incompatibilities between components, will not be detected by unit tests. Moreover, thanks to code coverage and optimizations (page loading and object locators), the execution time of the Katalon tests was decreased by 21%. Furthermore, by using a proxy to modify the HTTP and JSON responses on the fly (adding delay, increasing packet loss, modifying response codes to 404 or 500, and deleting keys in the JSON), errors were detected that would never have been found in a typical testing environment, yet may happen in real-world scenarios with lossy network connections or unpredictable cloud resources. Those failures were corrected with either a clear error message or no error at all, the application handling the API response or the lossy network by itself.

The conclusion of this thesis is that a continuous integration pipeline can be improved both in terms of software testing and in terms of execution time: through code coverage, through on-the-fly API response edits to test the resiliency of the system, through pipeline analysis to help detect which parameters cause pipeline failures, through tslint to flag pipeline failure threats directly while the code is written, and through Gitlab runner allocation optimization to reduce execution time.

Finally, there is room for improvement in fuzz testing to detect new errors and new scenarios, one of the most promising techniques being Generative Adversarial Networks to generate input for fuzz testing. The technique discussed in this thesis can be improved by not only deleting JSON keys but also altering the values contained in the JSON; modifying the values would test how the front-end code handles different types of responses from the API. Regarding the E2E test optimization, the completion time can be decreased by using a proxy mocking the API responses, but only if the test is used to check the front end and not the back end. Additionally, a way to decrease the failure rate of E2E tests is to create a smart-healing mechanism that automatically detects changes in the UI (button location, text, element id) and updates the tests without breaking.

References

[1] The Practical Test Pyramid - Ham Vocke - https://martinfowler.com/articles/practical-test-pyramid.html
[2] Software Testing Pyramid reference - https://www.softwaretestingnews.co.uk/adopting-the-test-pyramid-model-approach/

[3] Gitlab Workflow - https://about.gitlab.com/solutions/gitlab-flow/
[4] Chaos Engineering - Basiri A., Behnam N., de Rooij R., Hochstein L., Kosewski L., Reynolds J., Rosenthal C. - IEEE Software 2016
[5] All About Test Part 2 - Testing Stateful System - Simeon Sheye - https://www.codeproject.com/Articles/530312/All-About-Test-Part-2-Testing-Stateful-System
[6] An M/M/1 Markov chain representation in Crafting a Real-Time Information Aggregator for Mobile Messaging - Jenq-Shiou Leu - Journal of Computer Systems, Networks and Communications 2010

[7] Clean Code: A Handbook of Agile Software Craftsmanship - Robert C. Martin
[8] A Study of the Characteristics of Behaviour Driven Development - Carlos Solis, Xiaofeng Wang et al. - IEEE SEAA 2011
[9] Testing Software Design Modeled by Finite-State Machines - Tsun S. Chow - IEEE Transactions on Software Engineering 1978
[10] Cyclomatic complexity density and productivity - Gill G. K., Kemerer C. F. - IEEE Transactions on Software Engineering 1991, p. 29
[11] Improving software development management through software project telemetry - Johnson P. M., Kou H., Paulding M., Zhan Q., Kagawa A., Yamashita T. - IEEE Software 2005
[12] Comparative Study on Performance Testing with JMeter - Dr. Niranjanamurthy, Kiran Kumar S, Anupama Saha, Dr. Dharmendra Chahar - International Journal of Advanced Research in Computer and Communication Engineering 2016

[13] Comparative Study of Performance Testing Tools: Apache JMeter and HP LoadRunner - Rizwan Bahrawar Khan - Master Thesis, BTH 2016
[14] SAGE: Whitebox Fuzzing for Security Testing - Patrice Godefroid, Michael Y. Levin, and David Molnar - Microsoft - https://patricegodefroid.github.io/public_psfiles/cacm2012.pdf
[15] Binary Fuzzing strategies: what works, what doesn't - https://lcamtuf.blogspot.com/2014/08/binary-fuzzing-strategies-what-works.html
[16] T-Fuzz: Fuzzing by program transformation - Hui Peng, Yan Shoshitaishvili, Mathias Payer - IEEE Symposium on Security and Privacy 2018

[17] Automated Whitebox Fuzz Testing - Patrice Godefroid, Michael Y. Levin, David Molnar - NDSS 2008
[18] Fuzz Testing in Practice: Obstacles and Solutions - Jie Liang, Mingzhe Wang, Yuanliang Chen, Yu Jiang and Renwei Zhang

[19] An Intelligent Fuzzing Data Generation Method Based on Deep Adversarial Learning - Hui Peng, Yan Shoshitaishvili, Mathias Payer - IEEE 2019

[20] Sapienz: Multi-objective Automated Testing for Android Applications - Ke Mao, Mark Harman, Yue Jia

[21] A Large-Scale Study on Regularization and Normalization in GANs - Karol Kurach, Mario Lucic, Xiaohua Zhai, Marcin Michalski, Sylvain Gelly

TRITA-EECS-EX-2020:84

www.kth.se