Received 26 April 2016; Revised 6 June 2016; Accepted 6 June 2016
DOI: xxx/xxxx

RESEARCH PAPER

Branch Coverage Prediction in Automated Testing†

Giovanni Grano | Timofey V. Titov | Sebastiano Panichella | Harald C. Gall

Department of Informatics, University of Zurich, Zurich, Switzerland

Correspondence: Giovanni Grano, Binzmühlestrasse 14, Zurich, Switzerland. Email: grano@ifi.uzh.ch

Summary

Software testing is crucial in continuous integration (CI). Ideally, at every commit, all the test cases should be executed and, moreover, new test cases should be generated for the new source code. This is especially true in a Continuous Test Generation (CTG) environment, where the automatic generation of test cases is integrated into the continuous integration pipeline. In this context, developers want to achieve a certain minimum level of coverage for every software build. However, executing all the test cases and generating new ones for all the classes at every commit is not feasible. As a consequence, developers have to select which subset of classes should be tested and/or targeted by test-case generation. We argue that knowing a priori the branch coverage that can be achieved with test-data generation tools can help developers take informed decisions about those issues. In this paper, we investigate the possibility of using source-code metrics to predict the coverage achieved by test-data generation tools. We use four different categories of source-code features and assess the prediction on a large dataset involving more than 3,000 Java classes. We compare different machine learning algorithms and conduct a fine-grained feature analysis aimed at investigating the factors that most impact prediction accuracy. Moreover, we extend our investigation to four different search-budgets. Our evaluation shows that the best model achieves an average MAE of 0.15 on EvoSuite and 0.21 on Randoop with nested cross-validation over the different budgets. Finally, the discussion of the results demonstrates the relevance of coupling-related features for prediction accuracy.

KEYWORDS: Machine Learning, Software Testing, Automated Software Testing, Coverage Prediction

1 INTRODUCTION

Software testing is widely recognized as a crucial task in any software development process 1, estimated to account for at least about half of the entire development cost 2,3. In recent years, we have witnessed a wide adoption of continuous integration (CI) practices, where new or changed code is integrated extremely frequently into the main codebase. Testing plays an important role in such a pipeline: in an ideal world, at every single commit, every test case of the system should be executed (regression testing). Moreover, additional test cases might be automatically generated to test all the new —or modified— code introduced into the main codebase 4. This is especially true in a Continuous Test Generation (CTG) environment, where the generation of test cases is directly integrated into the continuous integration cycle 4. However, due to the time constraints between frequent commits, complete regression testing is not feasible for large projects 5. Furthermore, even test suite augmentation 6, i.e., the automatic generation of tests considering code changes and their effect on the previous codebase, is hardly doable due to the extensive amount of time needed to generate tests for just a single class.

†This work is an extension of the conference paper presented at the MaLTeSQuE 2018 workshop.
As developers want to ensure a certain minimum level of branch coverage for every build, these computational constraints cause many challenges. For instance, developers have to select and rank a subset of classes for which to run test-data generation tools, or to allocate a search-budget (i.e., time) to devote to the test generation for each class. Knowing a priori the coverage that will be achieved by test-data generation tools with a given search-budget can help developers take informed decisions to answer such questions; in the following, we give some practical examples of decisions that can be taken by exploiting such a prediction. With the goal of maximizing the branch coverage on the entire system, developers might want to prioritize the test-data generation effort towards the classes for which they know a high branch coverage can be achieved. On the contrary, they would avoid spending precious computational time running test-case generation tools against classes that will never reach a satisfying level of coverage; for these cases, developers will likely manually write more effective tests. Similarly, knowing the achievable coverage given a certain search-budget, developers might be able to allocate such a budget in a more efficient way, with the goal of maximizing the achieved coverage while minimizing the time spent on the generation.

To address such questions, we built machine learning (ML) models to predict the branch coverage that will be achieved by two test-data generation tools —EvoSuite 7 and Randoop 8— on a given class under test (CUT). However, since the achievable branch coverage strongly depends on the search-budget (i.e., the time) allocated for the generation, we run each of the aforementioned tools experimenting with four different search-budgets. It is important to note that the branch coverage achieved with every budget represents the dependent variable our models try to predict. Therefore, in this study we train and evaluate four different machine learning models for each tool, where every model is specialized in the prediction of the achievable coverage given the allocated search-budget. It is worth noting that this specialization is needed to address particular questions, like the choice of the search-budget for each test-case generation.

To select the features needed to train the aforementioned models, we investigate metrics able to represent —or measure— the complexity of a class under test. Thus, we select a total of 79 factors coming from four different categories. We focus on source-code metrics for the following reasons: (i) they can be obtained statically, without actually executing the code; (ii) they are easy to compute; and (iii) they usually come for free in a continuous integration (CI) environment, where the code is constantly analyzed by several quality checkers. Among others, we rely on well-established source-code metrics such as the Chidamber and Kemerer (CK) 9 and the Halstead metrics 10. To detect the best algorithm for the branch-coverage prediction, we experiment with four distinct algorithms covering distinct algorithmic families. In the end, we find the Random Forest Regressor to be the best performing one in the context of branch-coverage prediction. Our final model shows a Mean Absolute Error (MAE) of about 0.15 for EvoSuite and about 0.22 for Randoop, on average over the experimented budgets.
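To make this setup concrete, the following is a minimal sketch of one such per-budget model, assuming scikit-learn and a hypothetical features.csv whose columns hold the 79 source-code metrics plus a coverage column with the branch coverage achieved under one search-budget; the file name, column names, and hyper-parameter grid are illustrative, not the paper's actual artifacts.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Hypothetical dataset: one row per class under test (CUT), with the
# 79 static source-code metrics as features and the branch coverage
# achieved by one tool under one search-budget as the target.
data = pd.read_csv("features.csv")
X = data.drop(columns=["coverage"]).values
y = data["coverage"].values  # branch coverage in [0, 1]

# Nested cross-validation: the inner loop tunes the forest's
# hyper-parameters, the outer loop estimates the generalization
# error in terms of MAE, the metric reported in the paper.
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

tuned_forest = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
    scoring="neg_mean_absolute_error",
    cv=inner_cv,
)

scores = cross_val_score(tuned_forest, X, y,
                         scoring="neg_mean_absolute_error", cv=outer_cv)
print(f"Nested CV MAE: {-scores.mean():.3f}")

# One such model would be trained per tool and per search-budget
# (2 tools x 4 budgets = 8 specialized models).
```

In this scheme, hyper-parameter selection never sees the outer test folds, so the resulting MAE estimates how the model would perform on unseen classes.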
Considering the performance of the devised models, we argue that they can be practically useful to predict the coverage that will be achieved by test-data generation tools in a real-case scenario. We believe that this approach can support developers in taking informed decisions when it comes to deploying and practically using test-case generation tools.

Contributions of the Paper

In this paper we define and evaluate machine learning models with the goal of predicting the branch coverage achievable by test-data generation tools like EvoSuite 7 and Randoop 8. The main contributions of the paper are:

• We investigate four different categories of code-quality metrics, i.e., Package Level, CK and OO, Java Reserved Keyword, and Halstead metrics 10, as features for the machine learning models;

• We experiment with the performance of four different machine learning algorithms, i.e., Huber Regression, Support Vector Regression, Multi-Layer Perceptron, and Random Forest Regressor, for the branch-coverage prediction model;

• We perform a large-scale study involving seven large open-source projects, for a total of 3,105 Java classes;

• We extensively ran EvoSuite and Randoop over all the classes of the study context, experimenting with four different budgets and multiple executions. The overall execution was parallelized over several multi-core servers, as the mere generation of such an amount of tests would take months on a single-CPU setup.

Novel Contribution of the Extension

This paper is an extension of our seminal work 11, in which we first proposed to predict the branch coverage that will be achieved by test-data generation tools. In the following, we summarize the novel contributions of this paper with respect to the original one:

• We introduce a new set of features, i.e., the Halstead metrics 10. They aim at determining a quantitative measure of complexity directly from the operators and operands in a class (a computational sketch is given at the end of this section);

• We introduce a Random Forest Regressor —an ensemble algorithm— for the branch prediction problem;

• We add a fine-grained analysis aimed at understanding the importance of the employed features in the prediction;

• We build the model with three more budgets in addition to the default one. We generate a test suite multiple times for each class and budget, averaging the results; our preliminary findings were based on a single generation;

• We use a larger dataset to train and evaluate the proposed machine learning models.

FIGURE 1 Construction process of the training dataset

Structure of the Paper

Section 2 describes the features we used to train and validate the proposed machine learning models; details about the extraction process are also given in this section. Section 3.1 introduces the research questions of the empirical study together with its context: we present there the subjects of the study, the procedure we use to build the training set, and the machine learning algorithms employed.
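As a complement to the Halstead features listed among the contributions above, the following is a minimal sketch of how the core Halstead measures can be derived from operator and operand counts. The token lists are hypothetical stand-ins for the output of a real Java lexer, and how tokens are classified (e.g., whether keywords count as operators) varies across Halstead implementations.

```python
import math

def halstead_metrics(operators, operands):
    """Core Halstead measures from flat lists of lexed tokens."""
    n1 = len(set(operators))  # distinct operators
    n2 = len(set(operands))   # distinct operands
    N1 = len(operators)       # total operator occurrences
    N2 = len(operands)        # total operand occurrences

    vocabulary = n1 + n2                  # n = n1 + n2
    length = N1 + N2                      # N = N1 + N2
    volume = length * math.log2(vocabulary) if vocabulary else 0.0
    difficulty = (n1 / 2) * (N2 / n2) if n2 else 0.0
    effort = difficulty * volume

    return {"vocabulary": vocabulary, "length": length,
            "volume": volume, "difficulty": difficulty, "effort": effort}

# Toy example: tokens of `int s = a + b; return s;`
ops = ["int", "=", "+", ";", "return", ";"]
opnds = ["s", "a", "b", "s"]
print(halstead_metrics(ops, opnds))
```

Each such measure, computed per class, would contribute one column to the feature matrix consumed by the prediction models.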