Are SonarQube Rules Inducing Bugs?

Valentina Lenarduzzi, Lahti-Lappeenranta University, Lahti-Lappeenranta, Finland (valentina.lenarduzzi@lut.fi)
Francesco Lomio, Tampere University, Tampere, Finland (francesco.lomio@tuni.fi)
Heikki Huttunen, Tampere University, Tampere, Finland (heikki.huttunen@tuni.fi)
Davide Taibi, Tampere University, Tampere, Finland (davide.taibi@tuni.fi)

arXiv:1907.00376v2 [cs.SE] 19 Dec 2019

Abstract—Background. The popularity of tools for analyzing Technical Debt, and particularly the popularity of SonarQube, is increasing rapidly. SonarQube proposes a set of coding rules, which represent something wrong in the code that will soon be reflected in a fault or will increase maintenance effort. However, our local companies were not confident in the usefulness of the rules proposed by SonarQube and contracted us to investigate the fault-proneness of these rules.
Objective. In this work we aim at understanding which SonarQube rules are actually fault-prone and which machine learning models can be adopted to accurately identify fault-prone rules.
Method. We designed and conducted an empirical study on 21 well-known mature open-source projects. We applied the SZZ algorithm to label the fault-inducing commits. We analyzed the fault-proneness by comparing the classification power of seven machine learning models.
Result. Among the 202 rules defined for Java by SonarQube, only 25 can be considered to have relatively low fault-proneness. Moreover, violations considered as "bugs" by SonarQube were generally not fault-prone and, consequently, the fault-prediction power of the model proposed by SonarQube is extremely low.
Conclusion. The rules applied by SonarQube for calculating technical debt should be thoroughly investigated and their harmfulness needs to be further confirmed. Therefore, companies should carefully consider which rules they really need to apply, especially if their goal is to reduce fault-proneness.

I. INTRODUCTION

The popularity of tools for analyzing technical debt, such as SonarQube, is increasing rapidly. In particular, SonarQube has been adopted by more than 85K organizations1, including nearly 15K public open-source projects2. SonarQube analyzes code compliance against a set of rules. If the code violates a rule, SonarQube adds the time needed to refactor the violated rule as part of the technical debt. SonarQube also identifies a set of rules as "bugs", claiming that they "represent something wrong in the code and will soon be reflected in a fault"; moreover, they also claim that zero false positives are expected from "bugs"3.

Four local companies, which have been using SonarQube for more than five years to detect possible issues in their code, reported that their developers do not believe that the rules classified as bugs can actually result in faults. Moreover, they also reported that the manual customization of the SonarQube out-of-the-box set of rules (named "the Sonar way"4) is very subjective, and their developers did not manage to agree on a common set of rules that should be enforced. Therefore, the companies asked us to understand if it is possible to use machine learning to reduce the subjectivity of the customization of the SonarQube model, considering only rules that are actually fault-prone in their specific context.

SonarQube is not the most used static analysis tool on the market. Other tools such as Checkstyle, PMD and FindBugs are more used, especially in Open Source Projects [1] and in research [2]. However, the adoption of another tool in the DevOps pipeline requires extra effort for the companies, including the training and the maintenance of the tool itself. If the SonarQube rules actually turned out to be fault-prone, our companies would not need to invest extra effort to adopt and maintain other tools.

To the best of our knowledge, no studies have investigated the fault-proneness of SonarQube rules; therefore, we accepted the challenge and designed and conducted this study. At best, only a limited number of studies have considered SonarQube rule violations [3], [4], but they did not investigate the impact of the SonarQube violations considered as "bugs" on faults.

The goal of this work is twofold:
• Analyze the fault-proneness of SonarQube rule violations and, in particular, understand if rules classified as "bugs" are more fault-prone than security and maintainability rules.
• Analyze the accuracy of the quality model provided by SonarQube in order to understand the fault-prediction accuracy of the rules classified as "bugs".

SonarQube and issue tracking systems adopt similar terms for different concepts. Therefore, in order to clarify the terminology adopted in this work, we define SQ-Violation as a violated SonarQube rule that generated a SonarQube "issue", and fault as an incorrect step, process, data definition, or any unexpected behavior in a computer program inserted by a developer and reported by the Jira issue tracker. We also use the term "fault-fixing" commit for commits where the developers have clearly reported bug-fixing activity, and "fault-inducing" commit for those commits that are responsible for the introduction of a fault.

The remainder of this paper is structured as follows. In Section II we present the background of this work, introducing the SonarQube violations and the different machine learning algorithms applied in this work; in Section II-A we introduce SonarQube and the SQ-Violations adopted in this work. In Section III, we describe the case study design. Section IV presents the obtained results. Section V identifies threats to validity, while Section VI describes related work. Finally, conclusions are drawn in Section VII.

1 https://www.sonarqube.org
2 https://sonarcloud.io/explore/projects
3 SonarQube Rules: https://tinyurl.com/v7r8rqo
4 SonarQube Quality Profiles: https://tinyurl.com/wkejmgr

II. BACKGROUND

A. SonarQube

SonarQube is one of the most common Open Source static code analysis tools, adopted both in academia [5], [2] and in industry [1]. SonarQube is provided as a service from the sonarcloud.io platform, or it can be downloaded and executed on a private server.

SonarQube calculates several metrics, such as the number of lines of code and the code complexity, and verifies the code's compliance against a specific set of "coding rules" defined for most common development languages. In case the analyzed source code violates a coding rule, or if a metric is outside a predefined threshold, SonarQube generates an "issue". SonarQube includes Reliability, Maintainability and Security rules.

Reliability rules, also named "bugs", create issues (code violations) that "represent something wrong in the code" and that will soon be reflected in a bug. "Code smells" are considered "maintainability-related issues" in the code that decrease code readability and code modifiability. It is important to note that the term "code smells" adopted in SonarQube does not refer to the commonly known code smells defined by Fowler et al. [6], but to a different set of rules. Fowler et al. [6] consider code smells as a "surface indication that usually corresponds to a deeper problem in the system", but they can be indicators of different problems (e.g., bugs, maintenance effort, and code readability), while rules classified by SonarQube as "Code Smells" only refer to maintenance issues. Moreover, only four of the 22 smells proposed by Fowler et al. are included in the rules classified as "Code Smells" by SonarQube (Duplicated Code, Long Method, Large Class, and Long Parameter List).

SonarQube also classifies the rules into five severity levels5: Blocker, Critical, Major, Minor, and Info.

In this work, we focus on the sq-violations, which are reliability rules classified as "bugs" by SonarQube, as we are interested in understanding whether they are related to faults.

SonarQube includes more than 200 rules for Java (Version 6.4). In the replication package (Section III-D) we report all the violations present in our dataset. In the remainder of this paper, column "squid" represents the original rule-id (SonarQube ID) defined by SonarQube. We did not rename it, to ease the replicability of this work. In the remainder of this work, we will refer to the different sq-violations with their id (squid). The complete list of violations can be found in the file "SonarQube-rules.xsls" in the online raw data.

B. Machine Learning Techniques

In this Section, we describe the machine learning techniques adopted in this work to predict the fault-proneness of sq-violations. Due to the nature of the task, all the models were used for classification. We compared eight machine learning models. Among these, we used one generalized linear model, Logistic Regression [7]; one tree-based classifier, Decision Tree [8]; and six ensemble classifiers: Bagging [9], Random Forest [10], Extremely Randomized Trees [11], AdaBoost [12], Gradient Boosting [13], and XGBoost [14], which is an optimized implementation of Gradient Boosting. All the models, except XGBoost, were implemented using the library Scikit-Learn6, applying the default parameters for building the models. For the ensemble classifiers we always used 100 estimators. The XGBoost classifier was implemented using the XGBoost library7, also trained with 100 estimators.

1) Logistic Regression [7]: Contrary to linear regression, which is used to predict a numerical value, Logistic Regression is used for predicting the category of a sample. In particular, a binary Logistic Regression model is used to estimate the probability of a binary result (0 or 1) given a set of independent variables. Once the probabilities are known, these can be used to classify the inputs into one of the two classes, based on their probability to belong to either of the two.

Like all linear classifiers, Logistic Regression projects the P-dimensional input x into a scalar by a dot product of the learned weight vector w and the input sample: w · x + w0, where w0 ∈ R is the constant intercept. To have a result which can be interpreted as a class membership probability (a number between 0 and 1), Logistic Regression passes the projected scalar through the logistic function (sigmoid). This function, for any given input x, returns an output value between 0 and 1. The logistic function is defined as

σ(x) = 1 / (1 + e^(-x)),

and the class probability of a sample x ∈ R^P is modeled as

Pr(y = 1 | x) = 1 / (1 + e^(-(w · x + w0))).

Logistic Regression is trained through maximum likelihood: the model's parameters are estimated so as to maximize the likelihood of observing the inputs with respect to the parameters w and w0. We chose this model as a baseline, as it requires limited computational resources, is easy to implement, and is fast to train.
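To make the projection-and-sigmoid formulation above concrete, here is a minimal sketch of such a baseline classifier; the violation-count matrix X and the fault labels y are randomly generated placeholders, not the study's data.

```python
# Minimal sketch of the Logistic Regression baseline described above.
# X (commits x rules) and y (1 = fault-inducing) are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.poisson(0.3, size=(1000, 20))      # counts of 20 hypothetical SQ-Violations per commit
y = (rng.random(1000) < 0.1).astype(int)   # roughly 10% fault-inducing commits

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Reproduce Pr(y = 1 | x) = 1 / (1 + exp(-(w . x + w0))) for the first commit.
w, w0 = clf.coef_[0], clf.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-(X[0] @ w + w0)))
print(p_manual, clf.predict_proba(X[:1])[0, 1])   # the two values coincide
```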

5SonarQube Issues and Rules Severity:’ 6https://scikit-learn.org https://docs.sonarqube.org/display/SONAR/Issues Last Access:May 2018 7https://xgboost.readthedocs.io 2) Decision Tree Classifier [8]: Utilizes a decision tree decision trees which are constructed choosing a subset of to return an output given a series of input variables. Its tree the samples of the original dataset. The difference with the structure is characterized by a root node and multiple internal Random Forest classifier is in the way in which the split point nodes, which are represented by the input variable, and leaf, is decided: while in the Random Forest algorithm the splitting corresponding to the output. The nodes are linked between one point is decided base on a random subset of the variables, another through branches, representing a test. The output is the Bagging algorithm is allowed to look at the full set of given by the decision path taken. A decision tree is structured variable to find the point minimizing the error. This translates as a if-then-else diagram: in this structure, given the value of in structural similarities between the trees which do not resolve the variable in the root node, it can lead to subsequent nodes the overfitting problem related to the single decision tree. This through branches following the result of a test. This process model was included as a mean of comparison with newer and is iterated for all the input variables (one for each node) until better performing models. it reaches the output, represented by the leaves of the tree. 5) Extremely Randomized Trees [11]: (ExtraTrees) [11], In order to create the best structure, assigning each input provides a further randomization degree to the Random Forest. variable to a different node, a series of metrics can be For the Random Forest model, the individual trees are created used. Amongst these we can find the GINI impurity and the by randomly choosing subsets of the dataset features. In information gain: the ExtraTrees model the way each node in the individual • Gini impurity measures how many times randomly cho- decision trees are split is also randomized. Instead of using sen inputs would be wrongly classified if assigned to a the metrics seen before to find the optimal split for each randomly chosen class; node (Gini impurity and Information gain), the cut-off choice • Information gain measures how important is the infor- for each node is completely randomized, and the resulting mation obtained at each node related to its outcome: the splitting rule is decided based on the best random split. Due more important is the information obtained in one node, to its characteristics, especially related to the way the splits the purer will be the split. are made at the node level, the ExtraTrees model is less In our models we used the Gini impurity measure to computationally expensive than the Random Forest model, generate the tree as it is more computationally efficient. The while retaining a higher generalization capability compared reasons behind the choice of decision tree models and Logistic to the single decision trees. Regression, are their simplicity and easy implementation. 6) AdaBoost [12]: is another ensemble algorithm based on Moreover, the data does not need to be normalized, and boosting [15] where the individual decision trees are grown the structure of the tree can be easily visualized. However, sequentially. 
Moreover, a weight is assigned to each sample of this model is prone to overfitting, and therefore it cannot the training set. Initially, all the samples are assigned the same generalize the data. Furthermore, it does not perform well with weight. The model trains the first tree in order to minimize the imbalanced data, as it generates a biased structure. classification error, and after the training is over, it increases 3) Random Forest [10]: is an ensemble technique that helps the weights to those samples in the training set which were to overcome overfitting issues of the decision tree. The term misclassified. Moreover, it grows another tree and the whole ensemble indicates that these models use a set of simpler model is trained again with the new weights. This whole models to solve the assigned task. In this case, Random Forest process continues until a predefined number of trees has been uses an ensemble of decision trees. generated or the accuracy of the model cannot be improved An arbitrary number of decision trees is generated consid- anymore. Due to the many decision trees, as for the other ering a randomly chosen subset of the samples of the original ensemble algorithms, AdaBoost is less prone to overfitting dataset [9]. This subset is created with replacement, hence and can, therefore, generalize better the data. Moreover, it a sample can appear multiple times. Moreover, in order to automatically selects the most important features for the task reduce the correlation between the individual decision trees a it is trying to solve. However, it can be more susceptible to random subset of the features of the original dataset. In this the presence of noise and outliers in the data. case, the subset is created without replacement. Each tree is therefore trained on its subset of the data, and it is able to 7) Gradient Boosting [13]: also uses an ensemble of indi- give a prediction on new unseen data. The Random Forest vidual decision trees which are generated sequentially, like for classifier uses the results of all these trees and averages them the AdaBoost. The Gradient Boosting trains at first only one to assign a label to the input. By randomly generating multiple decision tree and, after each iteration, grows a new tree in order decision trees, and averaging their results, the Random Forest to minimize the loss function. Similarly to the AdaBoost, the classifier is able to better generalize the data. Moreover, using process stops when the predefined number of trees has been the random subspace method, the individual trees are not created or when the loss function no longer improves. correlated between one another. This is particularly important 8) XGBoost [14]: can be viewed as a better performing im- when dealing with a dataset with many features, as the prob- plementation of the Gradient Boosting algorithm, as it allows ability of them being correlated between each other increases. for faster computation and parallelization. For this reason it 4) Bagging [9]: Exactly like the Random Forest model, can yield better performance compared to the latter, and can the Bagging classifier is applied to an arbitrary number of be more easily scaled for the use with high dimensional data. III.CASE STUDY DESIGN TABLE I THESELECTEDPROJECTS We designed our empirical study as a case study based on the guidelines defined by Runeson and H’ost´ [16]. 
In this Project Analyzed Last Faults SonarQube Section, we describe the empirical study including the goal and Name commits commit Violations LOC the research questions, the study context, the data collection Ambari 9727 396775 3005 42348 and the data analysis. Bcel 1255 75155 41 8420 Beanutils 1155 72137 64 5156 A. Goal and Research Questions Cli 861 12045 59 37336 Codec 1644 34716 57 2002 As reported in Section 1, our goals are to analyze the fault- Collections 2847 119208 103 11120 proneness of SonarQube rule violations (SQ-Violations) and Configuration 2822 124892 153 5598 the accuracy of the quality model provided by SonarQube. Dbcp 1564 32649 100 3600 Based on the aforementioned goals, we derived the following Dbutils 620 15114 21 642 Deamon 886 3302 4 393 three research questions (RQs). Digester 2132 43177 23 4945 RQ1 Which are the most fault-prone SQ-Violations? FileUpload 898 10577 30 767 Io 1978 56010 110 4097 In this RQ, we aim to understand whether the intro- Jelly 1914 63840 45 5057 duction of a set of SQ-Violations is correlated with Jexl 1499 36652 58 34802 the introduction of faults in the same commit and Jxpath 596 40360 43 4951 to prioritize the SQ-Violations based on their fault- Net 2078 60049 160 41340 Ognl 608 35085 15 4945 proneness. Sshd 1175 139502 222 8282 Our hypothesis is that a set of SQ-Violations should Validator 1325 33127 63 2048 be responsible for the introduction of bugs. Vfs 1939 59948 129 3604 RQ2 Are SQ-Violations classified as ”bugs” by Sonar- Sum 39.518 1,464,320 4,505 231,453 Qube more fault-prone than other rules? Our hypothesis is that reliability rules (”bugs”) Commit Labeling Residual Analysis Overall Validation Labeled Commits should be more fault-prone that maintainability rules > 95% SZZ Δ Δ V V V V (”code smells”) and security rules. IND + FIX 1 2 4 6 Labeled V1 V3 V4 V5 V6 RQ3 What is the fault prediction accuracy of the SQ-Violations Commits Importance

Model SonarQube quality model based on violations Logistic Reg. Best classified as ”bugs”? Decision Trees SQ-Violations Random Forest SonarQube claims that whenever a violation is clas- Gradient Boost. Extr. Rnd. Trees sified as a ”bug”, a fault will develop in the software. AdaBoost Therefore, we aim at analyzing the fault prediction XGBoost accuracy of the rules that are classified as ”bugs” by measuring their precision and recall. Fig. 1. The Data Analysis Process B. Study Context In agreement with the four companies, we considered open source projects available in the Technical Debt Dataset [17]. identified in the selected commits, and the total number of The reason for considering open source projects instead of SQ-Violations. their private projects is that not all the companies would C. Data Analysis have allowed us to perform an historical analysis of all their commits. Moreover, with closed source projects the whole Before answering our RQs, we first executed the eight process cannot be replicated and verified transparently. machine learning (ML) models, we compared their accuracy, For this purpose, the four companies selected together 21 and finally performed the residual analysis. out of 31 projects available, based on the ones that were more The next subsections describe the analysis process in details similar to their internal projects considering similar project as depicted in Figure 1. age, size, usage of patterns used and other criteria that we 1) Machine Learning Execution: In this step we aim at cannot report for reason of NDA. comparing fault-proneness prediction power of SQ-Violations The dataset includes the analysis of each commit of the by applying the eight machine learning models described in projects from their first commit until the end of 2015 with Section II-B. SonarQube, information on all the Jira issues, and a classifi- Therefore we aim at predicting the fault-proneness of a cation of the fault-inducing commits performed with the SZZ commit (labeled with the SZZ algorithm) by means of the SQ- algorithm [18]. Violations introduced in the same commit. We used the SQ- In Table I, we report the list of projects we considered Violations introduced in each commits as independent vari- together with the number of analyzed commits, the project ables (predictors) to determine if a commit is fault-inducing size (LOC) of the last analyzed commits, the number of faults (dependent variable). After training the eight models described in Section II-B, sample misclassified as positive, and false negative rate (FNR), we performed a second analysis retraining the models using a measuring the percentage of positive samples misclassified drop-column mechanism [19]. This mechanism is a simplified as negative. The measure of true positive rate is left out variant of the exhaustive search [20], which iteratively tests as equivalent to the recall. The way these measures were every subset of features for their classification performance. calculated can be found in Table II. The full exhaustive search is very time-consuming requiring 2P train-evaluation steps for a P -dimensional feature space. TABLE II Instead, we look only at dropping individual features one at a ACCURACY METRICS FORMULAE time, instead of all possible groups of features. Accuracy Measure Formula More specifically, a model is trained P times, where P is TP Precision FP +TP TP the number of features, iteratively removing one feature at a Recall FN+TP time, from the first to the last of the dataset. 
The difference in MCC √ TP ∗TN−FP ∗FN (FP +TP )(FN+TP )(FP +TN)(FN+TN) cross-validated test accuracy between the newly trained model precision∗recall f-measure 2 ∗ precision+recall and the baseline model (the one trained with the full set of TN TNR FP +T Ne features) defines the importance of that specific feature. The FP FPR TN+FP more the accuracy of the model drops, the more important for FN FNR FN+TP the classification is the specific feature. TP: True Positive; TN: True Negative; FP: False Positive; FN: False The feature importance of the SQ-Violation has been cal- Negative culated for all the machine learning models described, but we Finally, to graphically compare the true positive and the considered only the importance calculated by the most accu- false positive rates, we calculated the Receiver Operating Char- rate model (cross-validated with all P features, as described in acteristics (ROC), and the related Area Under the Receiver the next section), as the feature importances of a poor classifier Operating Characteristic Curve (AUC): the probability that a are likely to be less reliable. classifier will rank a randomly chosen positive instance higher 2) Accuracy Comparison: Apart from ranking the SQ- than a randomly chosen negative one. Violations by their importance, we first need to confirm the In our dataset, the proportion of the two types of commits is validity of the prediction model. If the predictions obtained not even: a large majority (approx. 90 %) of the commits were from the ML techniques are not accurate, the feature ranking non-fault-inducing, and a plain accuracy score would reach would also become questionable. To assess the prediction high values simply by always predicting the majority class. accuracy, we performed a 10-fold cross-validation, dividing the On the other hand, the ROC curve (as well as the precision data in 10 parts, i.e., we trained the models ten times always and recall scores) are informative even in seriously unbalanced using 1/10 of the data as a testing fold. For each fold, we situations. evaluated the classifiers by calculating a number of accuracy 3) SQ-Violations Residual Analysis: The results from the metrics (see below). The data related to each project have been previous ML techniques show a set of SQ-Violations related split in 10 sequential parts, thus respecting the temporal order, with fault-inducing commits. However, the relations obtained and the proportion of data for each project. The models have in the previous analysis do not imply causation between faults been trained iteratively on group of data preceding the test set. and SQ-Violations. The temporal order was also respected for the groups included In this step, we analyze which violations were introduced in the training set: as an example, in fold 1 we used group 1 in the fault-inducing commits and then removed in the fault- for training and group 2 for testing, in fold 2 groups 1 and 2 fixing commits. We performed this comparison at the file level. were used for training and group 3 for testing, and so on for Moreover, we did not consider cases where the same violation the remaining folds. was introduced in the fault-inducing commit, removed, re- As accuracy metrics, we first calculated precision and recall. introduced in commits not related to the same fault, and finally However, as suggested by [21], these two measures present removed again during the fault-fixing commit. 
some biases as they are mainly focused on positive examples In order to understand which SQ-Violations were introduced and predictions and they do not capture any information about in the fault-inducing commits (IND) and then removed in the the rates and kind of errors made. fault-fixing commit (FIX), we analyzed the residuals of each The contingency matrix (also named confusion matrix), and SQ-Violation by calculating: the related f-measure help to overcome this issue. Moreover, as recommended by [21], the Matthews Correlation Coefficient Residual = ∆IND + ∆FIX (MCC) should be also considered to understand possible dis- agreement between actual values and predictions as it involves where ∆IND and ∆FIX are calculated as: all the four quadrants of the contingency matrix. ∆IND = #SQ-Violations introduced in the fault-inducing From the contingency matrix, we retrieved the measure commit of true negative rate (TNR), which measures the percentage ∆ =#SQ-Violations removed in the fault-fixing commit of negative sample correctly categorized as negative, false FIX positive rate (FPR) which measures the percentage of negative Figure 2 schematizes the residual analysis. ΔIND ΔFIX 6) RQ3: What is the fault prediction accuracy of the Sonar- Introduced in the Introduced in the fault-inducing fault-inducing AND removed in the fault fixing Qube quality model based on violations classified as ”bugs”: Since SonarQube considers every SQ-Violation tagged as a ”bug” as ”something wrong in the code that will soon be reflected in a bug”, we also analyzed the accuracy of the model Inducing -1 Inducing Fix provided by SonarQube. Commits In order to answer our RQ3, we calculated the percentage of a2d7c9e49 a2d7c9e57 E8bfdb92 (21 Jun 2016) (22 Jun 2016) (11 Jul 2016) SQ-Violations classified as ”bugs” that resulted in being highly fault-prone according to the previous analysis. Moreover, we also analyzed the accuracy of the model calculating all the Fig. 2. Residuals Analysis accuracy measures reported in Section III-C2.
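The drop-column importance mechanism described in Section III-C1 could be sketched as follows; for brevity the sketch uses Scikit-Learn's default fold splitting rather than the temporally ordered folds described above, and X and y are assumed to be the per-commit violation counts and SZZ labels.

```python
# Sketch of the drop-column feature importance: the importance of a violation is
# the loss in cross-validated AUC when the model is retrained without that column.
# Default CV folds are used here for brevity (the study keeps commits in temporal order).
import numpy as np
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def drop_column_importance(X, y, n_folds=10):
    """X: commits x violations count matrix (NumPy array), y: fault-inducing labels (0/1)."""
    def cv_auc(features):
        model = XGBClassifier(n_estimators=100)
        return cross_val_score(model, features, y, cv=n_folds, scoring="roc_auc").mean()

    baseline = cv_auc(X)
    importances = {}
    for col in range(X.shape[1]):
        reduced = np.delete(X, col, axis=1)            # retrain without one violation
        importances[col] = baseline - cv_auc(reduced)  # larger drop = more important
    return importances
```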

D. Replicability

In order to allow the replication of our study, we published the raw data in the replication package8.

IV. RESULTS

In this work, we considered more than 37 billion effective lines of code and retrieved a total of 1,464,320 violations from 39,518 commits scanned with SonarQube. Table I reports the list of projects together with the number of analyzed commits and the size (in Lines of Code) of the latest analyzed commit. We retrieved a total of 4,505 faults reported in the issue trackers.

All the 202 rules available in SonarQube for Java were found in the analyzed projects. For reasons of space limitations, we will refer to the SQ-Violations only with their SonarQube id number (SQUID). The complete list of rules, together with their description, is reported in the online replication package (file SonarQube-rules.xlsx). Note that in column "Type" MA means Major, Mi means Minor, CR means Critical, and BL means Blocker.

We calculated the residuals for each commit/fix pair, verifying the introduction of the SQ-Violation Vi in the fault-inducing commit (IND) and the removal of the violation in the fault-fixing commit (FIX). If ΔIND was lower than zero, no SQ-Violations were introduced in the fault-inducing commit. Therefore, we tagged such a commit as not related to faults. For each violation, the analysis of the residuals led us to two groups of commits:
• Residual > 0: The SQ-Violations introduced in the fault-inducing commits were not removed during the fault-fixing.
• Residual ≤ 0: All the SQ-Violations introduced in the fault-inducing commits were removed during the fault-fixing. If Residual < 0, other SQ-Violations of the same type already present in the code before the bug-inducing commit were also removed.

For each SQ-Violation, we calculated descriptive statistics so as to understand the distribution of residuals. Then, we calculated the residual sum of squares (RSS) as:

RSS = Σ (Residual)²
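A compact sketch of the residual computation defined above (Residual = ΔIND + ΔFIX, read here as "introduced minus removed" so that a zero residual means every introduced violation was removed), including the RSS and the percentage of zero residuals used as acceptance criterion; the input pairs are hypothetical.

```python
# Sketch of the residual analysis for a single SQ-Violation rule.
# Each (introduced, removed) pair refers to one fault: violations of the rule
# introduced in the fault-inducing commit and removed in the fault-fixing commit.
# Sign convention assumed here: Residual = introduced - removed.
def residual_stats(pairs):
    residuals = [introduced - removed for introduced, removed in pairs]
    rss = sum(r ** 2 for r in residuals)                           # residual sum of squares
    zero_pct = 100.0 * sum(r == 0 for r in residuals) / len(residuals)
    return residuals, rss, zero_pct

# Hypothetical rule, fully cleaned up in 3 of 4 faults -> 75% zero residuals,
# below the 95% threshold used in the study.
print(residual_stats([(2, 2), (1, 1), (3, 3), (4, 1)]))
```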

We calculated the percentage of residuals equal to zero as: A. RQ1: Which are the most fault-prone SQ-Violations? #zero residuals ∗ 100% In order to answer this RQ, we first analyzed the importance #residuals of the SQ-Violations by means of the most accurate ML Based on the residual analysis, we can consider violations technique and then we performed the residual analysis. where the percentage of zero residuals was higher than 95% 1) SQ-Violations Importance Analysis: As shown in Fig- as a valid result. ure 3, XGBoost resulted in the most accurate model among 4) RQ1: Which are the most fault-prone SQ-Violations?: In the eight machine learning techniques applied to the dataset. order to analyze RQ1, we combined the results obtained from The 10-fold cross-validation reported an average AUC of 0.83. the best ML technique and from the residual analysis. There- Table III (column RQ1) reports average reliability measures fore, if a violation has a high correlation with faults but the for the eight models. percentage of the residual is very low, we can discard it from Despite the different measures have different strengths and our model, since it will be valuable only in a limited number of weaknesses (see Section III-C2), all the measures are consis- cases. As we cannot claim a cause-effect relationship without tently showing that XGBoost is the most accurate technique. a controlled experiment, the results of the residual analysis The ROC curves of all models are depicted in Table III are a step towards the identification of this relationship and while the reliability results of all the 10-folds models are the reduction of spurious correlations. available in the online replication package. 5) RQ2: Are SQ-Violations classified as bugs by Sonar- Therefore, we selected XGBoost as classification model for Qube more fault-prone than other rules?: The comparison of the next steps, and utilized the feature importance calculated rules classified as ”bugs” with other rules has been performed applying the drop-column method to this classifier. The XG- considering the results of the best ML techniques and the resid- Boost classifier was retrained removing one feature at a time ual analysis, comparing the number of violations classified as sequentially. ”bug” that resulted to be fault-prone from RQ1. We expect bugs to be in the most faults-prone rules. 8Replication Package: https://figshare.com/s/fe5d04e39cb74d6f20dd TABLE III MODEL RELIABILITY

| Measure | Logistic Regr. | Decision Tree | Bagging | Random Forest | Extra Trees | AdaBoost | Gradient Boosting | XGBoost | SQ "bugs" |
|---|---|---|---|---|---|---|---|---|---|
| Precision | 0.417 | 0.311 | 0.404 | 0.532 | 0.427 | 0.481 | 0.516 | 0.608 | 0.086 |
| Recall | 0.076 | 0.245 | 0.220 | 0.156 | 0.113 | 0.232 | 0.192 | 0.182 | 0.028 |
| MCC | 0.162 | 0.253 | 0.279 | 0.266 | 0.203 | 0.319 | 0.300 | 0.318 | 0.032 |
| f-measure | 0.123 | 0.266 | 0.277 | 0.228 | 0.172 | 0.301 | 0.275 | 0.275 | 0.042 |
| TNR | 0.996 | 0.983 | 0.990 | 0.995 | 0.995 | 0.993 | 0.995 | 0.997 | 0.991 |
| FPR | 0.004 | 0.002 | 0.010 | 0.004 | 0.005 | 0.007 | 0.005 | 0.003 | 0.009 |
| FNR | 0.924 | 0.755 | 0.779 | 0.844 | 0.887 | 0.768 | 0.808 | 0.818 | 0.972 |
| AUC | 0.670 | 0.501 | 0.779 | 0.802 | 0.775 | 0.791 | 0.825 | 0.832 | 0.509 |

The first eight columns refer to RQ1 (average over the 10-fold validation models); the last column ("SQ bugs") refers to the SonarQube "bug"-based model used in RQ2 and RQ3.

gether with the percentage residuals = 0 (number of SQ-Violations introduced during fault-inducing commits and

[Fig. 3 legend: XGBoost (AUC = 83.21 %), GradientBoost (AUC = 82.49 %), RandomForest (AUC = 80.18 %), AdaBoost (AUC = 79.14 %), Bagging (AUC = 77.85 %), ExtraTrees (AUC = 77.45 %), LogisticRegression (AUC = 67.04 %), DecisionTrees (AUC = 50.14 %)]

0.0 0.2 0.4 0.6 0.8 1.0 1192 1166 1541 1128 1213 1132 1130 1117 1481 1124 1226 1171 1168 1109 All Violations Fig. 3. ROC Curve (Average between 10-fold validation models) Introduced in Fault-inducing commits Introduced in Fault-Inducing commits and Removed in Fault-fixing commits 23 SQ-Violations have been ranked with an importance higher than zero by the XGBoost. In Table V, we report the Fig. 4. Comparison of Violations introduced in fault-inducing commits and SQ-Violations with an importance higher or equal than 0.01 % removed in fault-fixing commits (coloumn ”Intr. & Rem. (%)” reports the number of violations removed during fault-fixing commits). introduced in the fault-inducing commits AND removed in Column ”Res >95%”, shows a checkmark (X) when the the fault-fixing commits). The remaining SQ-Violations are percentage of residuals=0 was higher than 95%. reported in the raw data for reasons of space. coloumn ”Intr. Figure 4 compares the number of violations introduced in & Rem. (%)” means fault-inducing commits, and the number of violations removed The combination of the 23 violations guarantees a good in the fault-fixing commits. classification power, as reported by the AUC of 0.83. However, the drop column algorithm demonstrates that SQ-Violations B. Manual Validation of the Results have a very low individual importance. The most important In order to understand the possible causes and to validate the SQ-Violation has an importance of 0.62%. This means that results, we manually analyzed 10 randomly selected instances the removal of this variable from the model would decrease for the first 20 SQ-Violations ranked as more important by the the accuracy (AUC) only by 0.62%. Other three violations XGBoost algorithm. have a similar importance (higher than 0.5%) while others are The first immediate result is that, in 167 of the 200 manually slightly lower. inspected violations, the bug induced in the fault-inducing 2) Model Accuracy Validation: The analysis of residuals commit was not fixed by the same developer that induced it. shows that several SQ-Violations are introduced in fault- We also noticed that violations related to duplicated code inducing commits in more than 50% of cases. 32 SQ- and empty statements (eg. ”method should not be empty”) Violations out of 202 had been introduced in the fault-inducing always generated a fault (in the randomly selected cases). commits and then removed in the fault-fixing commit in more When committing an empty method (often containing only than 95% of the faults. The application of the XGBoost, also a ”TODO” note), developers often forgot to implement it and confirmed an importance higher than zero in 26 of these SQ- then used it without realizing that the method did not return Violations. This confirms that developers, even if not using the expected value. An extensive application of SonarQube, pay attention to these 32 rules, especially in case could definitely reduce this issue. However, we are aware that of refactoring or bug-fixing. is is a very common practice in several projects. Moreover, Table V reports the descriptive statistics of residuals, to- SQ-Violations such as 1481 (unused private variable should TABLE IV Construct Validity. As for construct validity, the results SONARQUBE CONTINGENCY MATRIX (PREDICTIONMODELBASEDON might be biased regarding the mapping between faults and SQ-VIOLATIONS CONSIDERED AS ”BUG” BY SONARQUBE) commits. We relied on the ASF practice of tagging commits Predicted Actual with the issue ID. 
However, in some cases, developers could IND NOT IND have tagged a commit differently. Moreover, the results could IND 32 342 NOT IND 1,124 38,020 also be biased due to detection errors of SonarQube. We are aware that static analysis tools suffer from false positives. In this work we aimed at understanding the fault proneness of be removed) and 1144 (unused private methods should be the rules adopted by the tools without modifying them, so as removed) unexpectedly resulted to be an issue. In several to reflect the real impact that developers would have while cases, we discovered methods not used, but expected to be using the tools. In future works, we are planning to replicate used in other methods, resulted in a fault. As example, if a this work manually validating a statistically significant sample method A calls another method B to compose a result message, of violations, to assess the impact of false positives on the not calling the method B results in the loss of the information achieved findings. As for the analysis timeframe, we analyzed provided by B. commits until the end of 2015, considering all the faults raised until the end of March 2018. We expect that the vast majority C. RQ2: Are SQ-Violations classified as ”bugs” by Sonar- of the faults should have been fixed. However, it could be Qube more fault-prone than other rules? possible that some of these faults were still not identified and Out of the 57 violations classified as ”bugs” by SonarQube, fixed. only three (squid 1143, 1147, 1764) were considered fault- Internal Validity. Threats can be related to the causation prone with a very low importance from the XGBoost and with between SQ-Violations and fault-fixing activities. As for the residuals higher than 95%. However, rules classified as ”code identification of the fault-inducing commits, we relied on the smells” were frequently violated in fault-inducing commits. SZZ algorithm [18]. We are aware that in some cases, the SZZ However, considering all the SQ-Violations, out of 40 the SQ- algorithm might not have identified fault-inducing commits Violations that we identified as fault-prone, 37 are classified correctly because of the limitations of the line-based diff as ”code smells” and one as security ”vulnerability”. provided by git, and also because in some cases bugs can be When comparing severity with fault proneness of the SQ- fixed modifying code in other location than in the lines that Violations, only three SQ-Violations (squid 1147, 2068, 2178) induced them. Moreover, we are aware that the imbalanced were associated with the highest severity level (blocker). data could have influenced the results (approximately 90% of However, the fault-proneness of this rule is extremely low the commits were non-fault-inducing). However, the applica- (importance <= 0.14%). Looking at the remaining violations, tion of solid machine learning techniques, commonly applied we can see that the severity level is not related to the with imbalanced data could help to reduce this threat. importance reported by the XGBoost algorithm since the rules External Validity. We selected 21 projects from the ASF, of different level of severity are distributed homogeneously which incubates only certain systems that follow specific and across all importance levels. strict quality rules. Our case study was not based only on one application domain. This was avoided since we aimed to find D. 
RQ3: Fault prediction accuracy of the SonarQube model general mathematical models for the prediction of the number ”Bug” violations were introduced in 374 commits out of of bugs in a system. Choosing only one or a very small number 39,518 analyzed commits. Therefore, we analyzed which of of application domains could have been an indication of the these commits were actually fault-inducing commits. Based non-generality of our study, as only prediction models from on SonarQube’s statement, all these commits should have the selected application domain would have been chosen. The generated a fault. selected projects stem from a very large set of application All the accuracy measures (Table III, column ”RQ2”) con- domains, ranging from external libraries, frameworks, and web firm the very low prediction power of ”bug” violations. The utilities to large computational infrastructures. The dataset vast majority of ”bug” violations never become a fault. Results only included Java projects. We are aware that different are also confirmed by the extremely low AUC (50.95%) and programming languages, and projects different maturity levels by the contingency matrix (Table IV). The results of the could provide different results. SonarQube model also confirm the results obtained in RQ2. Reliability Validity. We do not exclude the possibility that Violations classified as ”bugs” should be classified differently other statistical or machine learning approaches such as Deep since they are hardly ever injected in fault-inducing commits. Learning, or others might have yielded similar or even better accuracy than our modeling approach.
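For concreteness, the accuracy figures reported for the SonarQube "bug"-based model can be recomputed from the contingency matrix in Table IV using the formulae of Table II; the small sketch below does so (differences in the last digit with respect to Table III are due to rounding).

```python
# Recomputing the accuracy of the SonarQube "bug"-based model from Table IV:
# TP = 32 (predicted IND, actual IND), FP = 342, FN = 1,124, TN = 38,020.
from math import sqrt

TP, FP, FN, TN = 32, 342, 1124, 38020

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f_measure = 2 * precision * recall / (precision + recall)
mcc = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
tnr = TN / (TN + FP)
fpr = FP / (TN + FP)
fnr = FN / (FN + TP)

print(f"precision={precision:.3f} recall={recall:.3f} MCC={mcc:.3f} "
      f"f-measure={f_measure:.3f} TNR={tnr:.3f} FPR={fpr:.3f} FNR={fnr:.3f}")
# precision=0.086, recall=0.028, f-measure=0.042, TNR=0.991, FPR=0.009, FNR=0.972
```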

V. THREATS TO VALIDITY

In this Section, we discuss the threats to validity, including internal, external, construct validity, and reliability. We also explain the different adopted tactics [22].

VI. RELATED WORK

In this Section, we introduce the related work, analyzing the literature on SQ-Violations and fault prediction.

TABLE V
SUMMARY OF THE MOST IMPORTANT SONARQUBE VIOLATIONS RELATED TO FAULTS (XGBOOST IMPORTANCE > 0.2%)

| SQUID | Severity | Type | # Occ. | Intr. & Rem. (%) | Intr. in fault-ind. | Res. Mean | Res. Min | Res. Max | Res. Stdev | RSS | XGBoost Imp. (%) | Res. > 95% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S1192 | CRITICAL | CS | 1815 | 50.87 | 95.10 | 245.60 | -861 | 2139 | 344.42 | 1726 | 0.66 | X |
| S1444 | MINOR | CS | 96 | 2.69 | 97.92 | 4.59 | -7 | 73 | 10.34 | 94 | 0.62 | X |
| Useless Import Check | MAJOR | CS | 1026 | 28.76 | 97.27 | 33.37 | -170 | 351 | 61.58 | 998 | 0.41 | X |
| S00105 | MINOR | CS | 263 | 7.37 | 97.72 | 1.96 | -13 | 32 | 10.22 | 257 | 0.41 | X |
| S1481 | MINOR | CS | 568 | 15.92 | 95.25 | 10.41 | -6 | 83 | 14.60 | 541 | 0.39 | X |
| S1181 | MAJOR | CS | 200 | 5.61 | 97.00 | 8.87 | 0 | 88 | 13.43 | 194 | 0.31 | X |
| S00112 | MAJOR | CS | 1644 | 46.08 | 94.77 | 188.26 | -279 | 1529 | 270.34 | 1558 | 0.29 | |
| S1132 | MINOR | CS | 704 | 19.73 | 93.75 | 121.75 | -170 | 694 | 134.91 | 660 | 0.24 | |
| Hidden Field | MAJOR | CS | 584 | 16.37 | 92.98 | 26.96 | -12 | 143 | 29.42 | 543 | 0.23 | |
| S134 | CRITICAL | CS | 1272 | 35.65 | 94.65 | 70.66 | -66 | 567 | 88.07 | 1204 | 0.20 | |
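As a usage note, a table such as the one above lends itself to the rule-set customization recommended in this paper; the sketch below filters a Table V-like record set, keeping only rules flagged as Res. > 95% whose drop-column importance exceeds an illustrative threshold (the threshold is an example, not the paper's prescription).

```python
# Illustrative filtering of Table V-style records to build a customized rule set.
RULES = [
    # (squid, xgboost_importance_pct, res_above_95)  -- values taken from Table V
    ("S1192", 0.66, True), ("S1444", 0.62, True), ("Useless Import Check", 0.41, True),
    ("S00105", 0.41, True), ("S1481", 0.39, True), ("S1181", 0.31, True),
    ("S00112", 0.29, False), ("S1132", 0.24, False), ("Hidden Field", 0.23, False),
    ("S134", 0.20, False),
]

def select_rules(rules, min_importance=0.3):
    """Keep rules flagged Res. > 95% whose importance is at least min_importance (%)."""
    return [squid for squid, imp, res_ok in rules if res_ok and imp >= min_importance]

print(select_rules(RULES))
# -> ['S1192', 'S1444', 'Useless Import Check', 'S00105', 'S1481', 'S1181']
```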

Falessi et al. [3] studied the distribution of 16 metrics and and Jira Extracting Tool from Apache Hive and determined 106 SQ-Violations in an industrial project. They applied a significant independent variables for defect- and change-prone What-if approach with the goal of investigating what could classes, respectively. Then they used a Bayesian approach to happen if a specific SQ-Violation would not have been intro- build a prediction model to determine the ”technical debt duced in the code and if the number of faulty classes decrease proneness” of each class. Their model requires the identifi- in case the violation is not introduced. They compared four ML cation of ”technical debt items”, which requires manual input. techniques applying the same techniques on a modified version These items are ultimately ranked and given a risk probability of the code where they manually removed SQ-Violations. by the predictive framework. Results showed that 20% of faults were avoidable if the code Saarimki investigated the diffuseness of SQ-violations in smells would have been removed. the same dataset we adopted [26] and the accuracy of the Tollin et al. [4] investigated if SQ-Violations introduced SonarQube remediation time [27]. would led to an increase in the number of changes (code Regarding other code quality rules detection, 7 different churns) in the next commits. The study was applied on two machine learning approaches (Random Forest, Naive Bayes, different industrial projects, written in C# and JavaScript. They Logistic regression, IBl, IBk, VFI, and J48) [28] were suc- reported that classes affected by more SQ-Violations have a cessfully applied on 6 code smells (Lazy Class, Feature Envy, higher change proneness. However they did not prioritize or Middle Man Message Chains, Long Method, Long Param- classified the most change prone SQ-Violations. eter Lists, and Switch Statement) and 27 software metrics Digkas et al. [23] studied weekly snapshots of 57 Java (including Basic, Class Employment, Complexity, Diagrams, projects of the ASF investigating the amount of technical debt Inheritance, and MOOD) as independent variables. paid back over the course of the projects and what kind of Code smells detection was also investigated from the point issues were fixed. They considered SQ-Violations with severity of view of how the severity of code smells can be classified marked as Blocker, Critical, and Major. The results showed through machined learning models [29] such as J48, JRip, that only a small subset of all issue types was responsible Random Forest, Naive Bayes, SMO, and LibSVM with best for the largest percentage of technical debt repayment. Their agreement to detection 3 code smells (God Class, Large Class, results thus confirm our initial assumption that there is no and Long Parameter List). need to fix all issues. Rather, by targeting particular violations, VII.DISCUSSIONAND CONCLUSION the development team can achieve higher benefits. However, SonarQube classifies 57 rules as ”bugs”, claiming that they their work does not consider how the issues actually related will sooner or later they generate faults. Four local companies to faults. contacted us to investigate the fault prediction power of the Falessi and Reichel [24] developed an open-source tool to SonarQube rules, possibly using machine learning, so as to analyze the technical debt interest occurring due to violations understand if they can rely on the SonarQube default rule-set of quality rules. 
Interest is measured by means of various or if they can use machine learning to customize the model metrics related to fault-proneness. They use SonarQube rules more accurately. and uses linear regression to estimate the defect-proneness of We conducted this work analyzing a set of 21 well-known classes. The aim of MIND is to answer developers’ questions open source project selected by the companies, analyzing like: is it worth to re-factor this piece of code? Differently the presence of all 202 SonarQube detected violations in than in our work, the actual type of issue causing the defect the complete project history. The study considered 39,518 was not considered. commits, including more than 38 billion lines of code, 1.4 Codabux and Williams [25] propose a predictive model to million violations, and 4,505 faults mapped to the commits. prioritize technical debt. They extracted class-level metrics for To understand which sq-violations have the highest fault- defect- and change-prone classes using Scitool Understanding proneness, we first applied eight machine learning approaches to identify the sq-violations that are common in commits main result for the companies is that they will need to invest labeled as fault-inducing. As for the application of the different in the adoption of other tools to reduce the fault proneness machine learning approaches, we can see an important differ- and therefore, we will need to replicate this work considering ence in their accuracy, with a difference of more than 53% other tools such as FindBugs, PMD but also commercial tools from the worst model (Decision Trees AUC=47.3%±3%) and such as Coverity Scan, Cast Software and others. the best model (XGBoost AUC=83.32%±10%). This confirms Based on the overall results, we can summarize the follow- also what we reported in Section II-B: ensemble models, ing lessons learned: like the XGBoost, can generalize better the data compared to Lesson 1: SonarQube violations are not good predictors Decision Trees, hence it results to be more scalable. The use of of fault-proneness if considered individually, but can be good many weak classifiers, yields an overall better accuracy, as it predictors if considered together. Machine learning techniques, can be seen by the fact that the boosting algorithms (AdaBoost, such as XGBoost can be used to effectively train a customized GradientBoost, and XGBoost) are the best performers for this model for each company. classification task, followed shortly by the Random Forest Lesson 2: SonarQube violations classified as ”bugs” do not classifier and the ExtraTrees. seem to be the cause of faults. As next step, we checked the percentage of commits where a Lesson 3: SonarQube violation severity is not related to specific violation was introduced in the fault-inducing commit the fault-proneness and therefore, developers should carefully and then removed in the fault-fixing commit, accepting only consider the severity as decision factor for refactoring a those violations where the percentage of cases where the violation. same violations were added in the fault-inducing commit and Lesson 4: Technical debt should be calculated differently, removed in the fault-fixing commit was higher than 95%. and the non-fault prone rules should not be accounted as Our results show that 26 violations can be considered fault- ”fault-prone” (or ”buggy”) components of the technical debt prone from the XGBoost model. 
However, the analysis of while several ”code smells” rules should be carefully consid- the residuals showed that 32 sq-violations were commonly ered as potentially fault-prone. introduced in a fault-inducing commit and then removed in the fault-fixing commit but only two of them are considered fault- The lessons learned confirm our initial hypothesis about the prone from the machine learning algorithms. It is important to fault-proneness of the SonarQube violations. However, we are notice that all the sq-violations that are removed in more than not claiming that SonarQube violations are not harmful in 95% of cases during fault-fixing commits are also selected by general. We are aware that some violations could be more the XGBoost, also confirming the importance of them. prone to changes [3], decrease code readability, or increase When we looked at which of the sq-violations were con- the maintenance effort. sidered as fault-prone in the previous step, only four of them Our recommendation to companies using SonarQube is to are also classified as (”bugs”) by SonarQube. The remaining customize the rule-set, taking into account which violations to fault-prone sq-violations are mainly classified as ”code smells” consider, since the refactoring of several sq-violations might (SonarQube claims that ”code smells” increase maintenance not lead to a reduction in the number of faults. Furthermore, effort but do not create faults). The analysis of the accuracy since the rules in SonarQube constantly evolve, companies of the fault prediction power of the SonarQube model based should continuously re-consider the adopted rules. on ”bugs” showed an extremely low fitness, with an AUC of Research on technical debt should focus more on validating 50.94%, confirming that violations classified as ”bugs” almost which rules are actually harmful from different points of view never resulted in a fault. and which will account for a higher technical debt if not An important outcome is related to the application of the refactored immediately. machine learning techniques. Not all the techniques performed Future works include the replication of this work consid- equally and XGBoost was the most more accurate and fastest ering the severity levels of SonarQube rules and their impor- technique in all the projects. Therefore, the application XG- tance. We are working on the definition of a more accurate Boost to historical data is a good alternative to the manual model for predicting TD [30] Moreover, we are planning tuning of the model, where developers should select which to investigate whether classes that SonarQube identify as rules they believe are important based on their experience. problematic are more fault-prone than those not affected by The result confirmed the impression of the developers of any problem. Since this work did not confirmed the fault our companies. Their developers still consider it very useful to proneness of SonarQube rules, the companies are interested in help to develop clean code that adhere to company standards, finding other static analysis tool for this purpose. Therefore, and that help new developers to write code that can be easily we are planning to replicate this study using other tools such understood by other developers. Before the execution of this as FindBugs, Checkstyle, PMD and others. Moreover, we will study the companies were trying to avoid to violate the rules focus on the definition of recommender systems integrated classifies as bugs, hoping to reduce fault proneness. 
However, in the IDEs [31][32], to alert developers about the presence after the execution of this study, the companies individually of potential problematic classes based on their (evolution of) customized the set of rules considering only coding standards change- and fault-proneness and rank them based on the aspects and rules classified as ”security vulnerabilities”. The potential benefits provided by their removal. REFERENCES [19] Parr Terence, Turgutlu Kerem, Csiszar Christopher, and Howard Jeremy. Beware default random forest importances. http://explained. [1] Carmine Vassallo, Sebastiano Panichella, Fabio Palomba, Sebastian ai/rf-importance/index.. Accessed: 2018-07-20. Proksch, Harald C. Gall, and Andy Zaidman. How Developers Engage with Static Analysis Tools in Different Contexts. In Empirical Software [20] Hyunjin Yoon, Kiyoung Yang, and Cyrus Shahabi. Feature subset selec- Engineering, 2019. tion and feature ranking for multivariate time series. IEEE transactions [2] Valentina Lenarduzzi, Alberto Sillitti, and Davide Taibi. A survey on on knowledge and data engineering, 17(9):1186–1198, 2005. code analysis tools for software maintenance prediction. In 6th Inter- [21] D. M. W. Powers. Evaluation: From precision, recall and f-measure national Conference in Software Engineering for Defence Applications, to roc., informedness, markedness & correlation. Journal of Machine pages 165–175. Springer International Publishing, 2020. Learning Technologies, 2(1):37–63, 2011. [3] D. Falessi, B. Russo, and K. Mullen. What if i had no smells? 2017 [22] R.K. Yin. Case Study Research: Design and Methods, 4th Edition ACM/IEEE International Symposium on Empirical Software Engineering (Applied Social Research Methods, Vol. 5). SAGE Publications, Inc, and Measurement (ESEM), pages 78–84, Nov 2017. 4th edition, 2009. [4] F. Arcelli Fontana I. Tollin, M. Zanoni, and R. Roveda. Change prediction through coding rules violations. EASE’17, pages 61–64, New [23] G. Digkas, M. Lungu, P. Avgeriou, A. Chatzigeorgiou, and A. Ampat- York, NY, USA, 2017. ACM. zoglou. How do developers fix issues and pay back technical debt in [5] Valentina Lenarduzzi, Alberto Sillitti, and Davide Taibi. Analyzing the apache ecosystem? volume 00, pages 153–163, March 2018. forty years of software maintenance models. In 39th International [24] D. Falessi and A. Reichel. Towards an open-source tool for measuring Conference on Software Engineering Companion, ICSE-C ’17, pages and visualizing the interest of technical debt. pages 1–8, 2015. 146–148, Piscataway, NJ, USA, 2017. IEEE Press. [25] B.J. Williams Z. Codabux. Technical debt prioritization using predictive [6] M. Fowler and K. Beck. Refactoring: Improving the design of existing analytics. ICSE ’16, pages 704–706, New York, NY, USA, 2016. ACM. code. Addison-Wesley Longman Publishing Co., Inc., 1999. [7] D. R. Cox. The regression analysis of binary sequences. Journal of [26] Nyyti Saarimaki,¨ Valentina Lenarduzzi, and Davide Taibi. On the the Royal Statistical Society. Series B (Methodological), 20(2):215–242, diffuseness of code technical debt in open source projects of the apache 1958. ecosystem. International Conference on Technical Debt (TechDebt [8] Leo Breiman, Jerome Friedman, Charles J. Stone, and R.A. Olshen. 2019), 2019. Classification and regression trees Regression trees. 1984. [27] N. Saarimaki, M.T. Baldassarre, V. Lenarduzzi, and S. Romano. On the SEAA Euromicro [9] Leo Breiman. Bagging predictors. 
Machine Learning, 24(2):123–140, accuracy of sonarqube technical debt remediation time. 2019 8 1996. , 2019. [10] L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001. [28] N. Maneerat and P. Muenchaisri. Bad-smell prediction from software [11] Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely random- design model using machine learning techniques. pages 331–336, May ized trees. Machine Learning, 63(1):3–42, 4 2006. 2011. [12] Yoav Freund and Robert E Schapire. A Decision-Theoretic Generaliza- [29] Francesca Arcelli Fontana and Marco Zanoni. Code smell severity tion of On-Line Learning and an Application to Boosting. Journal of classification using machine learning techniques. Know.-Based Syst., Computer and System Sciences, 55(1):119–139, 8 1997. 128(C):43–58, July 2017. [13] Jerome H. Friedman. Greedy Function Approximation: A Gradient [30] Valentina Lenarduzzi, Antonio Martini, Davide Taibi, and Damian An- Boosting Machine. drew Tamburri. Towards surgically-precise technical debt estimation: [14] Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting Early results and research roadmap. In Proceedings of the 3rd ACM System. pages 785–794, New York, New York, USA, 2016. ACM Press. SIGSOFT International Workshop on Machine Learning Techniques for [15] Robert E. Schapire. The Strength of Weak Learnability. Machine Evaluation, MaLTeSQuE 2019, pages 37–42, 2019. Learning, 5(2):197–227, 1990. [16] P. Runeson and M. Host.¨ Guidelines for conducting and reporting [31] Andrea Janes, Valentina Lenarduzzi, and Alexandru Cristian Stan. A case study research in software engineering. Empirical Softw. Engg., continuous software quality monitoring approach for small and medium 14(2):131–164, 2009. enterprises. In Proceedings of the 8th ACM/SPEC on International [17] Valentina Lenarduzzi, Nyyti Saarimaki,¨ and Davide Taibi. The technical Conference on Performance Engineering Companion, pages 97–100, debt dataset. In 15th conference on PREdictive Models and data analycs 2017. In Software Engineering, PROMISE ’19, 2019. [32] Valentina Lenarduzzi, Christian Stan, Davide Taibi, Davide Tosi, and [18] and A. Zeller J. Sliwerski,´ T. Zimmermann. When do changes induce Gustavs Venters. A dynamical quality model to continuously monitor fixes? MSR ’05, pages 1–5, New York, NY, USA, 2005. ACM. software maintenance. 2017.