Towards Meaningful Software Metrics Aggregation

Maria Ulan, Welf Löwe, Morgan Ericsson, Anna Wingkvist
Department of Computer Science and Media Technology, Linnaeus University, Växjö, Sweden
{maria.ulan | welf.lowe | morgan.ericsson | anna.wingkvist}@lnu.se

Abstract

Aggregation of software metrics is a challenging task, and it becomes even more complex when weights are used to indicate the relative importance of software metrics. These weights are mostly determined manually, which results in subjective quality models that are hard to interpret. To address this challenge, we propose an automated aggregation approach based on the joint distribution of software metrics. To evaluate the effectiveness of our approach, we conduct an empirical study on maintainability assessment for around 5 000 classes from open source software systems written in Java and compare our approach with a classical weighted linear combination approach in the context of maintainability scoring and anomaly detection. The results show that both approaches assign similar scores, while our approach is more interpretable, sensitive, and actionable.

Index terms: Software metrics, Aggregation, Weights, Copula

Copyright © by the paper's authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: D. Di Nucci, C. De Roover (eds.): Proceedings of the 18th Belgium-Netherlands Software Evolution Workshop, Brussels, Belgium, 28-11-2019, published at http://ceur-ws.org

1 Introduction

Quality models provide a basic understanding of what data to collect and what software metrics to use. However, they do not prescribe how software (sub-)characteristics should be quantified and how metrics should be aggregated.

The problem of metrics aggregation has been addressed by the research community. Metrics are often defined at a method or class level, but quality assessment sometimes requires insights at the system level. One bad metric value can be evened out by other good metric values when summing them up or computing their mean [1]. Some effort has been directed into metrics aggregation based on inequality indices [2, 3], and based on thresholds [4–8] to map source code level measurements to a software system rating.

In this research, we do not consider aggregation along the structure of software artifacts, e.g., from classes to the system. We focus on another type of metrics aggregation, from low-level to higher-level quality properties; Mordal-Manet et al. call this type of aggregation metrics composition [9].

Different software quality models that use weighted metrics aggregation have been proposed, such as QMOOD [10], QUAMOCO [11], SIG [12], SQALE [13], and SQUALE [14]. The weights in these models are defined based on experts' opinions or surveys. It is questionable whether manual weighting and combination of the values with an arbitrary (not necessarily linear) function are acceptable operations for metrics of different scales and distributions.

As a countermeasure, we propose to use a probabilistic approach for metrics aggregation. In previous research, we considered software metrics to be equally important and developed a software metrics visualization tool. This tool allowed the user to define and manipulate quality models to reason about where quality problems were located, and to detect patterns, correlations, and anomalies [15].

Here, we define metrics scores by probability as a complementary Cumulative Distribution Function and link them with a joint probability by the so-called copula function. We determine weights from the joint distribution and aggregate software metrics by a weighted product of the scores. We formalize quality models to express quality as the probability of observing a software artifact with equal or better quality.

This approach is objective since it relies solely on data. It allows modifying quality models on the fly, and it creates a realistic scale since the distribution represents quality scores for a set of software artifacts.

2 Approach Overview

We consider a joint distribution of software metrics values, and for each software artifact, we assign a probabilistic score. W.l.o.g., we assume that all software metrics are defined such that larger values indicate lower quality. The joint distribution of software metrics provides the means of objective comparison of software artifacts in terms of their quality scores, which represent the relative rank of a software artifact within the set of all software artifacts observed so far, i.e., how good or bad a quality score is compared to other quality scores.

Let A = {a_1, ..., a_k} be a set of k software artifacts, and M = {m_1, ..., m_n} be a set of n software metrics. Each software artifact is assessed by metrics from M, and the result of this assessment is represented as a k×n performance matrix of metrics values. We denote by e_j(a_i), for i ∈ {1, ..., k} and j ∈ {1, ..., n}, the (i, j)-entry, which shows the degree of performance for a software artifact a_i measured for metric m_j. We denote by E_j = [e_j(a_1), ..., e_j(a_k)]^T ∈ 𝔼_j^k the j-th column of the performance matrix, which represents the metrics values for all software artifacts with respect to metric m_j, where 𝔼_j is the domain of these values.

For each software artifact a_i ∈ A and metric m_j ∈ M, we define a score s_j(a_i), which indicates the degree to which this software artifact meets the requirements for the metric. Formally, for each metric m_j we define a score function s_j:

  e_j(a): A → 𝔼_j,  s_j(e): 𝔼_j → [0, 1]    (1)

Based on the score functions s_j for each metric, our goal is to define an overall score function such that, for any software artifact, it indicates the degree to which this software artifact satisfies all metrics. Formally, we are looking for a function:

  F(s_1, ..., s_n): [0, 1]^n → [0, 1]    (2)

Such an aggregation function takes an n-tuple of metrics scores and returns a single overall score. We require the following properties:

1. If a software artifact does not meet the requirements for one of the metrics, the overall score should be close to zero.

  F(s_1, ..., s_n) → 0 as s_j → 0    (3)

2. If all scores of one software artifact are greater than or equal to all scores of another software artifact, the same should be true for the overall scores.

  s_1^i ≥ s_1^l ∧ ... ∧ s_n^i ≥ s_n^l ⇒ F(s_1^i, ..., s_n^i) ≥ F(s_1^l, ..., s_n^l),
  where s_j^i = s_j(e_j(a_i)), s_j^l = s_j(e_j(a_l))    (4)

3. If the software artifact perfectly meets all but one metric, the overall score is equal to that metric's score.

  F(1, ..., 1, s_j, 1, ..., 1) = s_j    (5)

We propose to express the degree of satisfaction with respect to a metric using probability. We define the score function of Equation (1) as follows:

  s_j(e_j(a)) = P(E_j > e_j(a)) = CCDF_{e_j}(a)    (6)

We calculate the Complementary Cumulative Distribution Function (CCDF). This score represents the probability of finding another software artifact with an evaluation value greater than the given value. For the multi-criteria case, we can specify a joint distribution in terms of n marginal distributions and a so-called copula function [16]:

  Cop(CCDF_{e_1}(a), ..., CCDF_{e_n}(a)) = Pr(E_1 > e_1(a), ..., E_n > e_n(a))    (7)

The copula representation of a joint probability distribution allows us to model both marginal distributions and dependencies. The copula function Cop satisfies the signature (2) and fulfills the required properties (3), (4), and (5).

We consider a weight vector, where each w_i represents the relative importance of metric m_i compared to the others:

  w = [w_1, ..., w_n]^T, where Σ_{i=1}^{n} w_i = 1    (8)

We compute weights using a non-linear exponential regression model for a sample of software artifacts, mapping the metrics scores of Equation (6) to the copula value of Equation (7). Note that these weights regard dependencies between software metrics. Finally, we define software metrics aggregation as a weighted composition of the metrics score functions:

  F(s_1, ..., s_n) = ∏_{j=1}^{n} s_j^{w_j}    (9)
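To make Equations (6)–(9) concrete, the following R sketch (R being the language used for the implementation reported in Section 3.4) estimates the CCDF scores and the copula value empirically, fits the weights, and aggregates the scores by the weighted product. It is a minimal sketch under stated assumptions, not our actual implementation: the data frame `metrics` is hypothetical example data, the log-linear fit is only one plausible reading of the non-linear exponential regression, and clipping and normalizing the fitted weights are simplifications introduced here.

```r
# Minimal sketch of Equations (6)-(9); `metrics` is hypothetical example data
# with one row per class and one column per metric (larger value = lower quality).
set.seed(1)
metrics <- data.frame(CBO = rpois(500, 5),  DIT = rpois(500, 2),
                      LCOM = rpois(500, 10), NOC = rpois(500, 1),
                      RFC = rpois(500, 20),  WMC = rpois(500, 15))

# Equation (6): empirical CCDF score, s_j(a) = P(E_j > e_j(a)),
# i.e., the fraction of artifacts with a strictly larger (worse) value.
ccdf_score <- function(x) sapply(x, function(v) mean(x > v))
scores <- as.data.frame(lapply(metrics, ccdf_score))

# Equation (7): empirical copula value = joint survival probability
# P(E_1 > e_1(a), ..., E_n > e_n(a)), estimated by counting.
joint_survival <- function(m)
  apply(m, 1, function(row) mean(apply(t(t(m) > row), 1, all)))
cop <- joint_survival(as.matrix(metrics))

# Equations (8)-(9): fit weights by regressing log(copula) on log(scores)
# (an assumed reading of the "non-linear exponential regression"), then
# clip, normalize to sum to one, and aggregate by the weighted product.
eps <- 1e-6                                   # avoids log(0) for the worst artifacts
fit <- lm(log(cop + eps) ~ . + 0, data = log(scores + eps))
w   <- pmax(coef(fit), 0); w <- w / sum(w)
F_score <- apply(scores, 1, function(s) prod(s ^ w))

# Ranking (cf. Equations (10)-(11)): higher F is better; ties share a rank.
rnk <- rank(-F_score, ties.method = "min")
```

With the strict inequality of Equation (6), the artifact with the worst value of a metric receives the score 0; the small eps above only guards against taking the logarithm of such zeros when fitting the weights.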

We consider a software artifact a_l to be better than or equally good as another software artifact a_i if the total score according to Equation (2) of a_l is greater than or equal to the total score of a_i:

  a_l ⪰ a_i ⇔ F(a_l) ≥ F(a_i)    (10)

Aggregation is defined as a composition of the product, exponential, and CCDF functions, which are monotonic functions. Hence, the score obtained by aggregation allows ranking the set A of software artifacts with respect to the metrics set M:

  Rank(a_l) ≤ Rank(a_i) ⇔ F(a_l) ≥ F(a_i)    (11)

From a practical point of view, probabilities can be calculated empirically, and each score can be obtained as the ratio of the number of software artifacts with a metric value greater (i.e., a quality lower) than the given value to the number |A| of software artifacts.

The proposed aggregation approach makes it possible to express the score for a software artifact as the probability to observe an artifact with equal or worse metrics values, based on all software artifacts observed. Once the quality scores are computed, the software artifacts can trivially be ranked by simply ordering the score values. We assign the same rank to software artifacts in case their total scores are equal. Low (high) ranks correspond to high (low) probabilities. This interpretation is the same on all levels of aggregation, from metrics scores to the total quality scores.

3 Preliminary Evaluation

We apply our approach to assess Maintainability and compare the results with an aggregation approach based on a weighted linear combination of software metrics. We measure the difference between rankings obtained by these approaches and study the agreement between aggregated scores. Finally, we compare the approaches by means of sensitivity and the ability to detect extreme values and Pareto optimal solutions.

In the following subsections, we investigate Java classes and their quality assessment using two research questions:

RQ1 How effective is our approach for a quality assessment?

RQ2 How actionable is our approach by means of sensitivity and anomaly detection?

3.1 Quality Model Description

We consider a quality model for maintainability assessment of classes, which relies on well-known software metrics from the Chidamber & Kemerer [17] software metrics suite:

CBO, Coupling Between Objects
DIT, Depth of Inheritance Tree
LCOM, Lack of Cohesion in Methods
NOC, Number Of Children
RFC, Response For a Class
WMC, Weighted Method Count (using Cyclomatic Complexity as method weight)

3.2 Data Set Description

We chose to investigate three open-source software systems. The systems were chosen by the following criteria: (i) they are written in Java, (ii) they are available on GitHub, (iii) they were forked at least once, (iv) they are sufficiently large (several tens of thousands of lines of code and several hundreds of classes), and (v) they have been under active development for several years. The projects we selected are three well-known and frequently used systems: JabRef¹, JUnit², and RxJava³. Table 1 shows descriptive statistics for these systems.

¹ JabRef, Graphical Java application for managing BibTeX and biblatex databases, https://github.com/JabRef/jabref
² JUnit, A framework to write repeatable tests for Java, https://github.com/junit-team/junit5
³ RxJava, Reactive Extensions for the JVM – a library for composing asynchronous and event-based programs using observable sequences for the Java VM, https://github.com/ReactiveX/RxJava

Table 1: Descriptive statistics of the investigated systems

                           JabRef     JUnit    RxJava
Number of classes (NOC)     1 532     1 119     2 744
Lines of code (LOC)       136 039    44 082   378 987
Version                     4.3.1     5.3.2     3.0.0

3.3 Measures

The result of the aggregation is a maintainability score, and a ranked list of software artifacts according to their maintainability score. To evaluate our approach, we compare it to a well-known approach considering the following measures:

Correlation: We study Spearman's correlation [18] between maintainability scores to assess the ordering, relative spacing, and possible functional dependency.

Ranking distance: We measure the distance between the two rankings based on the Kendall tau distance, which counts the number of pairwise disagreements between two lists [19].

Agreement: We measure agreement between maintainability scores using Bland-Altman statistics [20].

To evaluate if the aggregated scores can be used to detect extreme values and Pareto optimal solutions, we consider the following measures:

Sensitivity: We study the variety of score values to understand the percentage of software artifacts that share the same maintainability score. The overall sensitivity is the ratio of the number of unique scores to the number of software artifacts.

Anomaly detection: We compare the approaches in terms of their ability to detect anomalies (extreme values and Pareto optimal solutions) using the ratio of the number of detected anomalies to the total number of anomalies in a sample data set. An illustrative computation of these measures is sketched after Table 2.

Table 2: Agreement between approaches

          Correlation (Spearman)   Distance (Kendall)
JabRef                   0.93397              0.04829
jUnit                    0.98899              0.02483
RxJava                   0.96978              0.03083
Merged                   0.98953              0.03382
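The comparison measures above rely on standard statistics. The R sketch below (a minimal sketch, not our exact analysis scripts) illustrates one way to compute them for two hypothetical score vectors, here called score_prod and score_sum; the Kendall tau distance is derived from Kendall's correlation, which corresponds to the fraction of pairwise disagreements when there are no ties, and the Bland-Altman limits follow the mean ± 1.96 SD convention used in Figures 1-4.

```r
# Minimal sketch of the comparison measures; the two score vectors are
# hypothetical stand-ins for the aggregated maintainability scores.
set.seed(2)
score_prod <- runif(200)                          # e.g., scores from Eq. (12)
score_sum  <- score_prod + rnorm(200, sd = 0.05)  # e.g., scores from Eq. (13)

# Correlation: Spearman's rho between the two maintainability scores.
rho <- cor(score_prod, score_sum, method = "spearman")

# Ranking distance: normalized Kendall tau distance; without ties,
# (1 - tau) / 2 is the fraction of discordant (disagreeing) pairs.
tau       <- cor(score_prod, score_sum, method = "kendall")
ktau_dist <- (1 - tau) / 2

# Agreement: Bland-Altman statistics, mean difference and 95% limits.
avg    <- (score_prod + score_sum) / 2
delta  <- score_prod - score_sum
limits <- mean(delta) + c(-1.96, 1.96) * sd(delta)
plot(avg, delta, xlab = "mean of scores", ylab = "difference of scores")
abline(h = mean(delta), col = "blue"); abline(h = limits, col = "red")

# Sensitivity: ratio of unique scores to the number of software artifacts.
sensitivity <- length(unique(score_prod)) / length(score_prod)
```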

3.4 Preliminary Results and Analysis

We implemented all algorithms and statistical analyses in R⁴. The metrics data for the analysis was collected with VizzMaintenance⁵. We collected the metrics values for classes of the JabRef, JUnit, and RxJava software systems (5 317 classes in total). We considered their package structure to group classes and applied the Kolmogorov-Smirnov statistical test [21] to select a subset for further statistical analysis, which was composed of 5 101 classes. Moreover, we consider the quality assessment of each system separately to study potential differences between the software systems. We apply our aggregation approach (see Equation (12)) and compare the results with a weighted linear sum of metrics (see Equation (13)), which we normalized by the min-max transformation.

  s_CBO^{w_1} × s_DIT^{w_2} × s_LCOM^{w_3} × s_NOC^{w_4} × s_RFC^{w_5} × s_WMC^{w_6}    (12)

  w_1 × CBO + w_2 × DIT + w_3 × LCOM + w_4 × NOC + w_5 × RFC + w_6 × WMC    (13)

⁴ The R Project for Statistical Computing, https://www.r-project.org
⁵ VizzMaintenance, Eclipse plug-in, http://www.arisa.se/products.php

RQ1-effectiveness

We compare the approaches within a single software system and on the merged data set. First, we study the correlation between aggregation results. Second, we rank software classes based on the maintainability scores obtained by the two approaches. Table 2 shows the Kendall tau distance and Spearman's rho correlation. We observe a strong correlation between maintainability scores and a low distance between rankings.

Third, we study the agreement: in the Bland-Altman plot, each class is represented by a point with the average of the maintainability scores obtained by the two approaches as the x-value and the difference between these two scores as the y-value. The blue line represents the mean difference between scores and the red lines the 95% confidence interval (mean ± 1.96 SD). We can observe that the plots for JabRef and RxJava have a similar shape (cf. Figure 1, Figure 3) compared to jUnit (cf. Figure 2). We can observe a similar shape for the merged data set (cf. Figure 4), since in total JabRef and RxJava have almost four times more classes than jUnit. We can observe that in all plots the measurements are mostly concentrated near the blue line and only a few of them are outside of the red lines. The difference for jUnit is slightly smaller than for JabRef and RxJava. In sum, we conclude that the approaches agree, i.e., the aggregation results do not differ statistically, and may be used interchangeably for the ranking of software classes.

Figure 1: Bland-Altman plot for JabRef

RQ2-actionability

First, we study the variety of values for each metric and the number of extreme values, which we define by means of outliers. We detected 19 extreme values in total. In Table 3 we can observe that the metrics have quite low sensitivity; for each metric, 40 values on average are unique. We consider a multi-objective optimization problem based on the metrics, and we detect five possible Pareto optimal solutions, i.e., none of the metrics values can be improved without degrading some of the other metrics values. Second, we study the sensitivity and the ability to detect anomalies (extreme values and Pareto optimal solutions) for both approaches. In Table 4 we can observe that our approach is more sensitive and more suitable for anomaly detection.
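For completeness, the sketch below contrasts the two aggregations of Equations (12) and (13) and shows a straightforward dominance check for Pareto optimal classes. It is a minimal sketch under assumptions: it reuses the hypothetical metrics, scores, and w objects from the sketch in Section 2, and it assumes that smaller raw metric values are preferred when checking dominance; the concrete outlier rule we used for extreme values is not spelled out here and is therefore omitted.

```r
# Minimal sketch of Equations (12) and (13), reusing `metrics`, `scores`,
# and `w` from the sketch in Section 2.

# Equation (12): weighted product of the per-metric CCDF scores.
agg_product <- apply(scores, 1, function(s) prod(s ^ w))

# Equation (13): weighted linear sum of min-max normalized metric values.
# Larger values mean lower quality here, so the direction may need to be
# flipped before comparing with the product-based scores.
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
agg_linear <- as.matrix(as.data.frame(lapply(metrics, minmax))) %*% w

# Pareto optimal classes: a class is Pareto optimal if no other class is at
# least as good on every metric and strictly better on at least one
# (smaller raw values assumed better). O(k^2) check, fine for a sketch.
m <- as.matrix(metrics)
dominated <- sapply(seq_len(nrow(m)), function(i)
  any(apply(m, 1, function(o) all(o <= m[i, ]) && any(o < m[i, ]))))
pareto <- which(!dominated)
```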

Figure 2: Bland-Altman plot for jUnit

Figure 3: Bland-Altman plot for RxJava

Figure 4: Bland-Altman plot for Merged data

Table 3: Metrics variety of values

        Number of Extreme Values   Sensitivity
CBO                            3       0.00768
DIT                            7       0.00109
LCOM                           3       0.05746
NOC                            2       0.00365
RFC                            2       0.01848
WMC                            2       0.02086

Table 4: Comparison of approaches by actionability

                           Aggregation (Eq. 12)   Aggregation (Eq. 13)
Sensitivity                             0.41317                0.31656
Extreme values                          0.94736                0.63158
Pareto optimal solutions                      1                    0.6

3.5 Discussion

We define metric scores by means of probability, as it provides a simple interpretation for a quality score by means of the joint distribution. In contrast, quality scores obtained by a weighted linear combination of metrics do not provide a clear interpretation, especially when metrics are incomparable. We assume that larger metrics values indicate worse quality; however, both too small and too large values can be problematic for some of the software metrics. Note that this is not a limitation, since we could transform metrics to have this property. We extracted weights from the joint distribution, which we consider as a ground truth. This might be a threat to internal validity. We compare our approach only with a weighted linear combination of metrics; this might be a threat as well, since we do not compare it with other approaches. In this preliminary evaluation, we consider six metrics, three software systems written in Java, and focus on maintainability. This might be a threat to external validity.

4 Conclusion and Future Work

In conclusion, we defined an automated aggregation approach for software quality assessment. We defined probabilistic scores based on software metrics distributions and aggregated them using a weighted product, where we obtained the weights from the joint distribution. To evaluate the effectiveness and actionability of our approach, we conducted an empirical study for maintainability assessment. We collected the CBO, DIT, LCOM, NOC, RFC, and WMC metrics from the Chidamber & Kemerer metrics suite for classes of the JabRef, JUnit, and RxJava software systems, and compared our approach with a weighted linear combination of metrics. The results showed that the approaches agree and can be used interchangeably for ranking software artifacts. However, our approach is more effective and actionable, i.e., it has a clear interpretation, higher sensitivity, and is better at detecting extreme values and Pareto optimal solutions.

Our approach is mathematically well-defined, its generalization is not questionable, and it can be validated theoretically. For example, we can conduct simulation experiments to study the deviation between our and other approaches depending on the number of classes, the number of metrics, levels of aggregation, etc.

However, there is still a need for empirical validation of our approach. In the future, we plan to evaluate our approach on other data sets, such as the GitHub Java corpus, which contains around 15 000 software systems [22]. We also plan to compare our approach with the probabilistic approach of Bakota et al. [23].

Acknowledgments

We thank the anonymous reviewers whose comments and suggestions helped us improve and clarify the research presented in this paper.

References

[1] Bogdan Vasilescu, Alexander Serebrenik, and Mark van den Brand. By no means: A study on aggregating software metrics. In Proceedings of the 2nd International Workshop on Emerging Trends in Software Metrics, pages 23–26. ACM, 2011.

[2] Rajesh Vasa, Markus Lumpe, Philip Branch, and Oscar Nierstrasz. Comparative analysis of evolving software systems using the Gini coefficient. In 2009 IEEE International Conference on Software Maintenance, pages 179–188. IEEE, 2009.

[3] Alexander Serebrenik and Mark van den Brand. Theil index for aggregation of software metrics values. In 2010 IEEE International Conference on Software Maintenance, pages 1–9. IEEE, 2010.

[4] Ilja Heitlager, Tobias Kuipers, and Joost Visser. A practical model for measuring maintainability. In 6th International Conference on the Quality of Information and Communications Technology (QUATIC), pages 30–39. IEEE, 2007.

[5] José Pedro Correia and Joost Visser. Certification of technical quality of software products. In Proc. of the Int'l Workshop on Foundations and Techniques for Open Source Software Certification, pages 35–51, 2008.

[6] Tiago L. Alves, José Pedro Correia, and Joost Visser. Benchmark-based aggregation of metrics to ratings. In 2011 Joint Conference of the 21st International Workshop on Software Measurement and the 6th International Conference on Software Process and Product Measurement, pages 20–29. IEEE, 2011.

[7] Paloma Oliveira, Fernando P. Lima, Marco Tulio Valente, and Alexander Serebrenik. RTTool: A tool for extracting relative thresholds for source code metrics. In 2014 IEEE International Conference on Software Maintenance and Evolution, pages 629–632. IEEE, 2014.

[8] Kazuhiro Yamashita, Changyun Huang, Meiyappan Nagappan, Yasutaka Kamei, Audris Mockus, Ahmed E. Hassan, and Naoyasu Ubayashi. Thresholds for size and complexity metrics: A case study from the perspective of defect density. In 2016 IEEE International Conference on Software Quality, Reliability and Security (QRS), pages 191–201. IEEE, 2016.

[9] Karine Mordal, Nicolas Anquetil, Jannik Laval, Alexander Serebrenik, Bogdan Vasilescu, and Stéphane Ducasse. Software quality metrics aggregation in industry. Journal of Software: Evolution and Process, 25(10):1117–1135, 2013.

[10] Jagdish Bansiya and Carl G. Davis. A hierarchical model for object-oriented design quality assessment. IEEE Transactions on Software Engineering, 28(1):4–17, 2002.

[11] Stefan Wagner, Andreas Goeb, Lars Heinemann, Michael Kläs, Constanza Lampasona, Klaus Lochmann, Alois Mayr, Reinhold Plösch, Andreas Seidl, Jonathan Streit, et al. Operationalised product quality models and assessment: The Quamoco approach. Information and Software Technology, 62:101–123, 2015.

[12] Robert Baggen, José Pedro Correia, Katrin Schill, and Joost Visser. Standardized code quality benchmarking for improving software maintainability. Software Quality Journal, 20(2):287–307, 2012.

[13] Jean-Louis Letouzey and Thierry Coq. The SQALE analysis model: An analysis model compliant with the representation condition for assessing the quality of software source code. In Advances in System Testing and Validation Lifecycle (VALID), 2010 Second International Conference on, pages 43–48. IEEE, 2010.

[14] K. Mordal-Manet, F. Balmas, S. Denier, S. Ducasse, H. Wertz, J. Laval, F. Bellingard, and P. Vaillergues. The Squale model: a practice-based industrial quality model. In 2009 IEEE International Conference on Software Maintenance (ICSM), pages 531–534, Sept 2009.

[15] Maria Ulan, Sebastian Hönel, Rafael M. Martins, Morgan Ericsson, Welf Löwe, Anna Wingkvist, and Andreas Kerren. Quality models inside out: Interactive visualization of software metrics by means of joint probabilities. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), pages 65–75. IEEE, 2018.

[16] Roger B. Nelsen. An Introduction to Copulas. Springer Science & Business Media, 2007.

[17] Shyam R. Chidamber and Chris F. Kemerer. A metrics suite for object oriented design. IEEE Transactions on Software Engineering, 20(6):476–493, 1994.

[18] C. Spearman. General intelligence, objectively determined and measured. The American Journal of Psychology, 15(2):201–292, 1904.

[19] Maurice Kendall. Rank Correlation Methods. Griffin, 1948.

[20] J. Martin Bland and Douglas Altman. Measuring agreement in method comparison studies. Statistical Methods in Medical Research, 8(2):135–160, 1999.

[21] Myles Hollander and Douglas A. Wolfe. Nonparametric Statistical Methods. Wiley-Interscience, 1999.

[22] Miltiadis Allamanis and Charles Sutton. Mining source code repositories at massive scale using language modeling. In Proceedings of the 10th Working Conference on Mining Software Repositories, pages 207–216. IEEE Press, 2013.

[23] Tibor Bakota, Péter Hegedűs, Péter Körtvélyesi, Rudolf Ferenc, and Tibor Gyimóthy. A probabilistic software quality model. In 2011 27th IEEE International Conference on Software Maintenance (ICSM), pages 243–252. IEEE, 2011.
