Towards Meaningful Software Metrics Aggregation
Maria Ulan, Welf Löwe, Morgan Ericsson, Anna Wingkvist
Department of Computer Science and Media Technology
Linnaeus University, Växjö, Sweden
{maria.ulan | welf.lowe | morgan.ericsson | [email protected]}

Abstract

Aggregation of software metrics is a challenging task, and it is even more complex when weights are considered to indicate the relative importance of the metrics. These weights are mostly determined manually, which results in subjective quality models that are hard to interpret. To address this challenge, we propose an automated aggregation approach based on the joint distribution of software metrics. To evaluate the effectiveness of our approach, we conduct an empirical study on maintainability assessment for around 5,000 classes from open source software systems written in Java and compare our approach with a classical weighted linear combination approach in the context of maintainability scoring and anomaly detection. The results show that both approaches assign similar scores, while our approach is more interpretable, sensitive, and actionable.

Index terms: Software metrics, Aggregation, Weights, Copula

1 Introduction

Quality models provide a basic understanding of what data to collect and what software metrics to use. However, they do not specify how software (sub-)characteristics should be quantified or how metrics should be aggregated.

The problem of metrics aggregation is addressed by the research community. Metrics are often defined at a method or class level, but quality assessment sometimes requires insights at the system level. One bad metric value can be evened out by other good metric values when summing them up or computing their mean [1]. Some effort has been directed into metrics aggregation based on inequality indices [2, 3] and based on thresholds [4-8] to map source code level measurements to software system ratings.

In this research, we do not consider aggregation along the structure of software artifacts, e.g., from classes to the system. We focus on another type of metrics aggregation, from low-level metrics to higher-level quality properties; Mordal-Manet et al. call this type of aggregation metrics composition [9].

Different software quality models that use weighted metrics aggregation have been proposed, such as QMOOD [10], QUAMOCO [11], SIG [12], SQALE [13], and SQUALE [14]. The weights in these models are defined based on experts' opinions or surveys. It is questionable whether manual weighting and combination of the values with an arbitrary (not necessarily linear) function are acceptable operations for metrics of different scales and distributions.

As a countermeasure, we propose to use a probabilistic approach for metrics aggregation. In previous research, we considered software metrics to be equally important and developed a software metrics visualization tool. This tool allowed the user to define and manipulate quality models to reason about where quality problems were located and to detect patterns, correlations, and anomalies [15].

Here, we define metrics scores by probability as a complementary Cumulative Distribution Function and link them with a joint probability by the so-called copula function. We determine weights from the joint distribution and aggregate software metrics by a weighted product of the scores. We formalize quality models to express quality as the probability of observing a software artifact with equal or better quality. This approach is objective since it relies solely on data. It allows quality models to be modified on the fly, and it creates a realistic scale since the distribution represents quality scores for a set of software artifacts.

Copyright © by the paper's authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: D. Di Nucci, C. De Roover (eds.): Proceedings of the 18th Belgium-Netherlands Software Evolution Workshop, Brussels, Belgium, 28-11-2019, published at http://ceur-ws.org
2 Approach Overview

We consider a joint distribution of software metrics values, and for each software artifact, we assign a probabilistic score. W.l.o.g., we assume that all software metrics are defined such that larger values indicate lower quality. The joint distribution of software metrics provides the means of objective comparison of software artifacts in terms of their quality scores, which represent the relative rank of a software artifact within the set of all software artifacts observed so far, i.e., how good or bad a quality score is compared to other quality scores.

Let $A = \{a_1, \dots, a_k\}$ be a set of $k$ software artifacts, and $M = \{m_1, \dots, m_n\}$ be a set of $n$ software metrics. Each software artifact is assessed by the metrics from $M$, and the result of this assessment is represented as a $k \times n$ performance matrix of metrics values.

We denote by $e_j(a_i)$, for all $i \in \{1, \dots, k\}$ and $j \in \{1, \dots, n\}$, the $(i, j)$-entry of this matrix, which shows the degree of performance for a software artifact $a_i$ measured with metric $m_j$. We denote by $E_j = [e_j(a_1), \dots, e_j(a_k)]^T \in \mathbb{E}_j^k$ the $j$-th column of the performance matrix, which represents the metrics values of all software artifacts with respect to metric $m_j$, where $\mathbb{E}_j$ is the domain of these values.

For each software artifact $a_i \in A$ and metric $m_j \in M$, we define a score $s_j(a_i)$, which indicates the degree to which this software artifact meets the requirements for the metric. Formally, for each metric $m_j$ we define a score function $s_j$:

    $e_j(a) \colon A \mapsto \mathbb{E}_j, \qquad s_j(e) \colon \mathbb{E}_j \mapsto [0, 1]$    (1)

Based on the score functions $s_j$ for each metric, our goal is to define an overall score function such that, for any software artifact, it indicates the degree to which this software artifact satisfies all metrics. Formally, we are looking for a function

    $F(s_1, \dots, s_n) \colon [0, 1]^n \mapsto [0, 1]$    (2)

Such an aggregation function takes an $n$-tuple of metrics scores and returns a single overall score. We require the following properties:

1. If a software artifact does not meet the requirements for one of the metrics, the overall score should be close to zero:

    $F(s_1, \dots, s_n) \to 0 \text{ as } s_j \to 0$    (3)

2. If all scores of one software artifact are greater than or equal to all scores of another software artifact, the same should hold for the overall scores:

    $s_1^i \ge s_1^l \wedge \dots \wedge s_n^i \ge s_n^l \implies F(s_1^i, \dots, s_n^i) \ge F(s_1^l, \dots, s_n^l)$, where $s_j^i = s_j(e_j(a_i))$ and $s_j^l = s_j(e_j(a_l))$    (4)

3. If the software artifact perfectly meets all but one metric, the overall score is equal to that metric's score:

    $F(1, \dots, 1, s_j, 1, \dots, 1) = s_j$    (5)

We propose to express the degree of satisfaction with respect to a metric using probability. We define the score function of Equation (1) as follows:

    $s_j(e_j(a)) = \Pr(E_j > e_j(a)) = \mathrm{CCDF}_{e_j}(a)$    (6)

That is, we calculate the Complementary Cumulative Distribution Function (CCDF). This score represents the probability of finding another software artifact with an evaluation value greater than the given value. For the multi-criteria case, we can specify a joint distribution in terms of the $n$ marginal distributions and a so-called copula function [16]:

    $\mathrm{Cop}(\mathrm{CCDF}_{e_1}(a), \dots, \mathrm{CCDF}_{e_n}(a)) = \Pr(E_1 > e_1(a), \dots, E_n > e_n(a))$    (7)

The copula representation of a joint probability distribution allows us to model both the marginal distributions and the dependencies. The copula function Cop satisfies the signature (2) and fulfills the required properties (3), (4), and (5).

We consider a weight vector, where each $w_i$ represents the relative importance of metric $m_i$ compared to the others:

    $w = [w_1, \dots, w_n]^T$, where $\sum_{i=1}^{n} w_i = 1$    (8)

We compute the weights using a non-linear exponential regression model for a sample of software artifacts, mapping the metrics scores of Equation (6) to the copula values of Equation (7). Note that these weights account for dependencies between software metrics. Finally, we define software metrics aggregation as a weighted composition of the metrics score functions:

    $F(s_1, \dots, s_n) = \prod_{j=1}^{n} s_j^{w_j}$    (9)

We consider a software artifact $a_l$ to be better than or equally good as another software artifact $a_i$ if the total score according to Equation (2) of $a_l$ is greater than or equal to the total score of $a_i$:

    $a_l \succeq a_i \iff F(a_l) \ge F(a_i)$    (10)

Aggregation is defined as a composition of the product, exponential, and CCDF functions, which are monotonic. Hence, the score obtained by aggregation allows us to rank the set $A$ of software artifacts with respect to the metrics set $M$:

    $\mathrm{Rank}(a_l) \le \mathrm{Rank}(a_i) \iff F(a_l) \ge F(a_i)$    (11)

From a practical point of view, the probabilities can be calculated empirically: each score can be obtained as the ratio of the number of software artifacts with a greater (i.e., worse) value than a given metric value to the total number $|A|$ of software artifacts.

The proposed aggregation approach makes it possible to express the score for a software artifact as the probability to observe something with equal or worse metrics values, based on all software artifacts observed.

The following class-level metrics are considered:

CBO, Coupling Between Objects
DIT, Depth of Inheritance Tree
LCOM, Lack of Cohesion in Methods
NOC, Number Of Children
RFC, Response For a Class
WMC, Weighted Method Count (using Cyclomatic Complexity as method weight)

3.2 Data Set Description

We chose to investigate three open-source software systems. The systems were chosen by the following criteria: (i) they are written in Java, (ii) they are available on GitHub, (iii) they were forked at least once, (iv) they are sufficiently large (several tens of thousands of lines of code and several hundreds of classes), and (v) they have been under active development for several years. The projects we selected are three well-known and
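The empirical scoring and aggregation described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the class names, metric values, and weights are invented, and only two metrics (say, WMC and CBO) are used. It computes the empirical CCDF score of Equation (6) and combines the scores by the weighted product of Equation (9); larger metric values indicate lower quality, as assumed in Section 2.

```python
def ccdf_score(value, column):
    """Empirical CCDF, Eq. (6): fraction of artifacts whose metric
    value is strictly greater (i.e., worse) than `value`."""
    return sum(1 for v in column if v > value) / len(column)

def aggregate(scores, weights):
    """Weighted product of per-metric scores, Eq. (9)."""
    total = 1.0
    for s, w in zip(scores, weights):
        total *= s ** w
    return total

# Toy performance matrix: rows are classes, columns are two metrics
# (values invented for illustration).
artifacts = {
    "ClassA": [10, 4],
    "ClassB": [25, 9],
    "ClassC": [3, 2],
    "ClassD": [40, 7],
}
columns = list(zip(*artifacts.values()))  # one value vector per metric
weights = [0.6, 0.4]                      # assumed; sums to 1, Eq. (8)

overall = {
    name: aggregate([ccdf_score(v, columns[j]) for j, v in enumerate(vals)],
                    weights)
    for name, vals in artifacts.items()
}

# Higher overall score means better quality, Eqs. (10) and (11).
ranking = sorted(overall, key=overall.get, reverse=True)
print(ranking)  # ClassC ranks first
```

Note that ClassB and ClassD receive an overall score of 0 because each is worst in at least one metric, in line with property (3).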
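The weight determination can be sketched similarly. The paper fits a non-linear exponential regression mapping the metric scores of Equation (6) to the copula values of Equation (7); the sketch below approximates this (an assumption, not the authors' exact method) by estimating the empirical joint survival probability and fitting the exponents of Equation (9) with ordinary least squares in log space, using only artifacts where all probabilities are positive. All data values are invented.

```python
import math

# Toy performance matrix with two invented metrics; larger = worse.
matrix = {
    "a1": (1, 2), "a2": (2, 1), "a3": (3, 4),
    "a4": (4, 3), "a5": (5, 6), "a6": (6, 5),
}
vals = list(matrix.values())
n = len(vals)

def marginal(j, v):
    """Empirical CCDF score for metric j, Eq. (6)."""
    return sum(1 for row in vals if row[j] > v) / n

def joint(v1, v2):
    """Empirical joint survival probability, Eq. (7)."""
    return sum(1 for r in vals if r[0] > v1 and r[1] > v2) / n

# Regression sample in log space: log C ~ w1*log s1 + w2*log s2,
# i.e., the log of the weighted product in Eq. (9).
sample = []
for (v1, v2) in vals:
    s1, s2, c = marginal(0, v1), marginal(1, v2), joint(v1, v2)
    if min(s1, s2, c) > 0:  # log is undefined at zero
        sample.append((math.log(s1), math.log(s2), math.log(c)))

# Solve the 2x2 normal equations of least squares by hand.
S11 = sum(x1 * x1 for x1, _, _ in sample)
S22 = sum(x2 * x2 for _, x2, _ in sample)
S12 = sum(x1 * x2 for x1, x2, _ in sample)
b1 = sum(x1 * y for x1, _, y in sample)
b2 = sum(x2 * y for _, x2, y in sample)
det = S11 * S22 - S12 * S12
w1 = (b1 * S22 - b2 * S12) / det
w2 = (b2 * S11 - b1 * S12) / det

# Normalize so the weights sum to one, Eq. (8).
total = w1 + w2
weights = [w1 / total, w2 / total]
print(weights)
```

For this symmetric toy data the two metrics come out equally important, so the normalized weights are both 0.5; with real, dependent metrics the fitted exponents differ and reflect the dependencies between metrics.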