
Styler: Learning Formatting Conventions to Repair Checkstyle Errors

Benjamin Loriot, Fernanda Madeiral, Martin Monperrus

arXiv:1904.01754v3 [cs.SE] 10 Aug 2020

Abstract—Ensuring code formatting conventions is an essential aspect of modern quality assurance, because it helps code readability. In this paper, we present STYLER, a tool dedicated to fixing formatting errors raised by Checkstyle, a highly configurable format checker for Java. To fix formatting errors in a given project, STYLER 1) learns fixes for self-generated errors according to the project-specific Checkstyle ruleset, based on token sequences fed into an LSTM neural network, and then 2) predicts fixes. In an empirical evaluation, we find that STYLER repairs 38% of 11,220 real Checkstyle errors mined from 70 GitHub projects. Moreover, we compare STYLER with the IntelliJ plugin CHECKSTYLE-IDEA and the machine learning-based code formatters NATURALIZE and CODEBUFF. We find that STYLER fixes errors from a more diverse set of Checkstyle rules (24 rules, compared to CHECKSTYLE-IDEA: 19; NATURALIZE: 20; CODEBUFF: 17), and it uniquely repairs errors for two rules. Finally, STYLER generates small repairs and, once trained, it predicts repairs in seconds. The promising results suggest that STYLER can be used in IDEs and in Continuous Integration environments to repair Checkstyle errors.

I. INTRODUCTION

Code readability is the first requirement for program comprehension: one cannot comprehend what one cannot easily read. To improve code readability, most developers agree on using coding conventions, so the code is clear and uniformly consistent across a given code base or organization [23], [16]. A major challenge of using coding conventions is to keep all source code files consistent with the agreed conventions. The first step towards that is the detection of coding convention violations (or errors). This can be automatically performed using linters, which are static analysis tools that warn software developers about possible violations of coding conventions [36]. The usage of linters also brings challenges, because developers need to create a configuration according to their adopted conventions so that the linter detects the right violations (not more and not less), and then to eventually repair violations. In this paper, we focus on the latter task, automatically repairing linter violations, which is a little-researched problem, and we focus on formatting errors¹.

To repair a formatting error detected by a format checker, developers can either perform the fix manually or use a code formatter. Both alternatives are not satisfactory. Manually fixing formatting errors is a waste of valuable developer time. With code formatters, the key problem is that they do not take into account the project-specific convention rules, those that are configured by the developers for the used format checker.

Inspired by the problem statement of program repair [24], we state in this paper the problem of automatically repairing formatting errors: given a program, its format checker rules, and one rule violation, the goal is to modify the formatting so that no violation is raised by the format checker. In this paper, we explore this problem in the context of Checkstyle [8], a popular format checker for the Java language. We present STYLER, a repair tool dedicated to fixing Checkstyle formatting errors in Java source code. The uniqueness of STYLER is to be applicable to any formatting coding convention, because its approach is not based on rules to repair specific Checkstyle errors. The key idea of STYLER is the usage of machine learning to learn the coding conventions that are used in a software project. Once trained, STYLER predicts changes on formatting characters (e.g. whitespaces, new lines, indentation) to fix a formatting convention violation happening in the wild. Technically, STYLER uses a sequence-to-sequence machine learning model based on a long short-term memory neural network (LSTM).

We conduct a large-scale experiment to evaluate STYLER using a curated dataset of 11,220 real Checkstyle errors mined from 70 GitHub projects. Based on our research questions, we find that STYLER repairs many errors (38%), and repairs errors from more different Checkstyle formatting rules compared to the state-of-the-art machine learning formatters [3], [26] and the tailored, human-engineered IntelliJ plugin CHECKSTYLE-IDEA [9]. Moreover, STYLER produces small repairs and its performance is fast enough for developers. To sum up, our contributions are:

• A novel approach to fix violations of code formatting conventions, based on machine learning. The approach is able to learn project-specific formatting rules without manual setup;
• A tool, called STYLER, which implements our approach in the context of Java and Checkstyle, to repair Checkstyle formatting violations. The tool is made publicly available [21];
• A curated dataset of real-world formatting Checkstyle errors, which contains 11,220 errors mined from 70 GitHub repositories. To our knowledge, this is the largest dataset of this kind, made publicly available for future research;
• A comparative experiment of the performance of STYLER against the state of the art of automatic code formatting [9], [3], [26], showing that STYLER outperforms it.

The remainder of this paper is organized as follows. Section II and Section III present the background of this work. Section IV presents our tool, STYLER. Section V presents the design of our experiment for evaluating STYLER and comparing it with three code formatters; the experimental results are presented in Section VI. Section VII presents discussions, and Section VIII presents the related works. Finally, Section IX presents the final remarks.

¹In this paper, we refer to linters specialized in formatting as format checkers.

II. BACKGROUND

A. Coding Conventions

Coding conventions (also known as coding style or coding standards) are rules that developers agree on for writing code. The usage of coding conventions improves code readability, but it does not change the program behavior.

There are several classes of coding conventions: e.g. naming, control flow style, and formatting. In this paper, we focus on the latter: formatting coding conventions. Formatting here refers to the appearance or the presentation of the source code. One can change the formatting by using non-printable characters such as spaces, tabulations, and line breaks. In free-format languages such as Java and C++, the formatting does not change the abstract syntax tree. In non-free-format languages such as Haskell or Python, formatting is even related to behavior: correcting formatting issues can fix a bug [7].

For instance, a well-known formatting coding convention is about the placement of braces in code blocks. Figure 1 shows two ways that developers may follow when writing conditional blocks: one developer might place the left brace on a new line, while another one might place it at the end of the conditional line. Agreeing on coding conventions avoids edit wars and endless debates: all developers in a team decide on how to format code once and for all.

(a) Left curly on new line:
if (condition)
{
    // do something
}

(b) Left curly on end of line:
if (condition) {
    // do something
}

Fig. 1: Two conventions for placing a left curly brace.

B. Coding Convention Checkers

A challenge faced by developers is to keep their code compliant with the agreed coding conventions. Basically, every new change, every new commit, must satisfy the convention rules. Manually checking that code changes do not violate the coding conventions is not an option, because it would be too time-consuming and error-prone.

To overcome this problem, a mechanism to automatically check if code follows the coding convention rules is required. Such a tool is known as a linter, or coding convention enforcer [2]. A linter is a static analysis tool that warns software developers about possible code errors or violations of coding conventions [36]. Note that linters may go beyond coding conventions and also perform some basic static analysis on the program behavior.

Linters can usually be integrated in IDEs and build tools. When integrated in an IDE, the developer manually runs the linter before she commits her changes. If she does not do it, she might face a lot of errors raised by the linter after the end of the building step for a release or for shipping the program. On the other hand, when a linter is integrated in build tools, it is automatically executed in Continuous Integration (CI) environments. The important coding conventions might be configured to make CI builds break when they are violated. This way, developers are forced to repair coding convention violations early in the process.

Several linters have been developed depending on the programming language: e.g. ESLint [13] for JavaScript, Pylint [29] for Python, StyleCop [34] for C#, and RuboCop [31] for Ruby. For Java, which is our target language in this paper, the most commonly used linter is Checkstyle [8]. Checkstyle supports predefined well-known coding conventions, such as the Google Java Style Guide [16] and the Sun Code Conventions [35]. It also allows developers to configure a specific ruleset to match their own preferences. Checkstyle is a flexible linter that can be integrated both in an IDE (e.g. IntelliJ, Eclipse, and NetBeans) and in a build tool (e.g. Maven and Gradle). In the Java ecosystem, Checkstyle is often executed in Continuous Integration environments such as Travis and Circle CI.

III. STUDY OF CHECKSTYLE USAGE IN THE WILD

Static analysis tools have been the subject of investigation in recent research [39], [38], [22]. However, there is little empirical knowledge of the extent to which Checkstyle, one popular static analyzer, is used in the wild. To ground our work on a solid empirical basis, we therefore investigate the usage of Checkstyle and its rules in open source projects.

Checkstyle can be executed on a project in different ways. The straightforward ways are 1) by directly invoking Checkstyle on the command line, 2) by a build tool, or 3) by a continuous integration service. Independently of the way Checkstyle is executed, there must exist a configuration file with the Checkstyle rules defined by the developers: we refer to this file as the Checkstyle ruleset. In this section, we report on our large-scale study on the usage of Checkstyle on GitHub.

A. Checkstyle Usage in Practice

Method. To measure the usage of Checkstyle on GitHub, we queried GitHub² to retrieve only Java projects with at least five stars, because stars have been shown to be meaningful for sampling projects from GitHub [6]: we found 148,127 Java projects. Then, we searched each of them for a Checkstyle ruleset file. A Checkstyle ruleset file can have any name, but we followed a conservative approach towards identifying true positives: we used a set of commonly used names³. For simplicity, in the rest of this paper we refer to a Checkstyle ruleset file as checkstyle.xml.

Results. We found 3,793 Java projects containing a checkstyle.xml file, which is 2.56% of all Java projects with at least five stars on GitHub. Table I shows the proportion of those projects with their build tools and CI services, if any. We note that build tools are widely used among projects using Checkstyle: 98% of the projects use at least one build tool. Moreover, 55% of the projects use a continuous integration service, which shows the maturity of the sampled projects.

²On June 9, 2020.
³Checkstyle ruleset file commonly used names: [‘checkstyle.xml’, ‘.checkstyle.xml’, ‘checkstyle_rules.xml’, ‘checkstyle_config.xml’, ‘checkstyle_configuration.xml’, ‘checkstyle_checker.xml’, ‘checkstyle_checks.xml’, ‘google_checks.xml’, ‘sun_checks.xml’]. Variants obtained by replacing ‘_’ by ‘-’ are also used.
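The filename matching described above (footnote 3) can be sketched as follows. This is an illustrative re-implementation, not the authors' actual mining code; `find_ruleset` is a hypothetical helper name.

```python
import os

# Conservative set of Checkstyle ruleset file names from Section III.
CHECKSTYLE_NAMES = [
    "checkstyle.xml", ".checkstyle.xml", "checkstyle_rules.xml",
    "checkstyle_config.xml", "checkstyle_configuration.xml",
    "checkstyle_checker.xml", "checkstyle_checks.xml",
    "google_checks.xml", "sun_checks.xml",
]
# Variants obtained by replacing '_' with '-' are also accepted.
ACCEPTED = set(CHECKSTYLE_NAMES) | {n.replace("_", "-") for n in CHECKSTYLE_NAMES}

def find_ruleset(paths):
    """Return the paths in a repository file listing that look like a
    Checkstyle ruleset file."""
    return [p for p in paths if os.path.basename(p) in ACCEPTED]
```

For example, `find_ruleset(["src/Main.java", "config/checkstyle-checks.xml"])` returns `["config/checkstyle-checks.xml"]`, because `checkstyle-checks.xml` is the ‘-’ variant of `checkstyle_checks.xml`.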

Fig. 2: The top-10 most popular Checkstyle rules (bars in dark red in the original figure are formatting-related rules; bars in gray are non-formatting-related rules): RightCurly 3,719 (98.05%), RegexpSingleline 3,162 (83.37%), LeftCurly 3,083 (81.28%), PackageName 3,047 (80.33%), UpperEll 3,033 (79.96%), TypeName 3,018 (79.57%), ParameterName 2,996 (78.99%), MemberName 2,966 (78.2%), FileTabCharacter 2,955 (77.91%), MethodName 2,947 (77.7%).

TABLE I: Usage of build tools and CI services by 3,793 projects that use Checkstyle.
Build tool usage: Maven 54%, Gradle 47%, Ant 10%
CI usage:         TravisCI 51%, CircleCI 4%

B. Popularity of Checkstyle Rules

Method. To check the usage of Checkstyle rules⁴, we analyzed the previously-found checkstyle.xml files from the 3,793 projects using Checkstyle. Our goal is to investigate the most used rules and to check whether formatting-related rules, which are the target of this work, are widely used.

Results. We found at least one usage for 174 Checkstyle rules. Figure 2 shows the top-10 most used rules. The bars in dark red represent formatting-related rules, and the bars in gray represent the other rules. In the top-10 most used rules, there are four rules related to formatting. Notably, the top-3 most used rules are formatting-related ones. Therefore, we conclude that formatting-related rules are very important for developers, which validates the relevance of our work.

IV. STYLER

STYLER is a tool to fix Checkstyle formatting errors in Java source code, in order to help developers in different software development workflows. For instance, STYLER could be used locally as a pre-commit hook when developers are about to release projects. Also, it could be configured to run in Continuous Integration, where pull requests are automatically opened with formatting fix suggestions. In this section, we present the workflow and the technical principles of STYLER.

A. Targeted Error Types

STYLER is about learning how to repair errors related to formatting coding conventions (see Section II-A). For instance, consider that a developer specified as her preference that the left curly token “{” in a conditional block must always be placed on a new line (as shown in Figure 1a). If this rule is not satisfied (e.g. such as in Figure 1b), Checkstyle triggers a formatting-related error (see Figure 4a). In order to fix this violation, a new line break should be inserted in the program before the token “{”.

In Checkstyle, there are different classes of checks: e.g. formatting, naming, and lightweight linting checks. In STYLER, we exclusively focus on formatting checks, such as indentation and whitespace before and after punctuation. We ignore Checkstyle checks that are not related to formatting, e.g. unused imports and method names.

B. STYLER Workflow

Figure 3 shows the STYLER workflow. It is composed of two main components: ‘STYLER training’ for learning how to fix formatting errors and ‘STYLER prediction’ for actually repairing a concrete Checkstyle error. STYLER receives as input a software project, including its source code and its Checkstyle ruleset.

Fig. 3: STYLER workflow. STYLER training (learning): A. Training data generation; B. Error-encoding (tokenization); C. Training LSTM models. STYLER prediction (repairing): D. Checkstyle-error localization (Figure 4a); E. Error-encoding (tokenization) (Figures 4b, 4c); F. Predicting Checkstyle-error repair (LSTM models) (Figure 4e); G. Repair-decoding (de-tokenization) (Figure 4f); H. Repair verification; I. Repair selection.

The component ‘STYLER training’ is responsible for learning how to repair Checkstyle errors on the given project according to its project-specific Checkstyle ruleset. It creates the training data by injecting Checkstyle formatting errors in source code files of the project (step A). Then, it translates the training data into abstract token sequences (step B) in order to train LSTM neural networks (step C). The learned LSTM models are eventually used to predict repairs.

The component ‘STYLER prediction’ is responsible for predicting fixes for real Checkstyle errors. It first localizes Checkstyle errors by running Checkstyle on the project (step D).
Then, STYLER encodes the error line into an abstract token sequence (step E), which is given as input to the previously-learned LSTM models (step F). The models predict fixes for the given Checkstyle error: these fixes are in the format of abstract token sequences, so they must be translated back to Java code (step G). STYLER then runs Checkstyle on the new Java code containing the predicted fixes (step H). Finally, among the predicted fixes for which no Checkstyle error is raised, STYLER selects one formatting repair to give as output (step I). As STYLER only impacts the formatting of the code, its repairs do not change the behavior of the program under consideration.

⁴The set of Checkstyle rules we considered in our study is from Checkstyle version 8.33 (released on May 31, 2020).

[ERROR] .../NodeRelationshipCache.java:812:82: '{' at column 82 should be on a new line. [LeftCurly]
(a) Checkstyle LeftCurly rule violation.

812  public void visitChangedNodes( NodeChangeVisitor visitor, int nodeTypes ) {
813      long denseMask = changeMask( true);
(b) Source code snippet of the error.
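The verification and selection steps (H and I) described above can be sketched as follows. This is a simplified illustration with hypothetical helper names, not STYLER's actual API: candidates that still raise Checkstyle errors are discarded, and the surviving candidate with the smallest diff is returned.

```python
import difflib

def select_repair(original_code, candidates, passes_checkstyle):
    """original_code: the buggy source; candidates: predicted fixed sources;
    passes_checkstyle: a callable that runs the project ruleset on a source
    and returns True when no Checkstyle error is raised (hypothetical)."""
    # Step H: keep only candidates verified by Checkstyle.
    verified = [c for c in candidates if passes_checkstyle(c)]
    if not verified:
        return None  # no candidate repairs the error

    # Step I: pick the candidate with the smallest source code diff.
    def diff_size(candidate):
        a, b = original_code.splitlines(), candidate.splitlines()
        return sum(1 for line in difflib.ndiff(a, b) if line[:1] in "+-")

    return min(verified, key=diff_size)
```

The design choice of minimizing the diff reflects that, among equally valid fixes, the least intrusive one is easiest for developers to review.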

C. STYLER in Action

Consider the Checkstyle error presented in Figure 4a. This error is raised by a violation of the Checkstyle LeftCurly rule: the left curly should be on a new line. Checkstyle provides, for a given error, the location (line and column) where the Checkstyle rule is violated. The Java source code that caused such an error is presented in Figure 4b.

STYLER encodes the incorrectly formatted lines (Figure 4b) into the abstract token sequence shown in Figure 4c. Then, this abstract token sequence is given as input to LSTM models, which predict the formatting token sequence shown in Figure 4d. This predicted formatting token sequence is then used to modify the formatting tokens of the buggy abstract token sequence. It results in a predicted abstract token sequence, as shown in Figure 4e, that may fix the current Checkstyle error. The diff between Figure 4c and Figure 4e (highlighted in bold) shows that the predicted repair is the replacement of the formatting token 4_SP by 1_NL. This predicted repair means that the four whitespaces before the token “{” should be replaced by a new line.

Then, the predicted abstract token sequence (Figure 4e) is translated back to Java code (Figure 4f). Finally, when running Checkstyle on the new Java code, no Checkstyle error is raised, meaning that STYLER successfully repaired the error.

before-context Identifier 0_SP , 1_SP int 1_SP Identifier 1_SP ) 4_SP { 1_NL_4_ID long 1_SP Identifier 1_SP = 1_SP Identifier 0_SP ( 1_SP after-context
(c) Buggy abstract token sequence.

0_SP 1_SP 1_SP 1_SP 1_NL 1_NL_4_ID 1_SP 1_SP 1_SP 0_SP 1_SP
(d) Formatting token sequence generated by a LSTM model.

before-context Identifier 0_SP , 1_SP int 1_SP Identifier 1_SP ) 1_NL { 1_NL_4_ID long 1_SP Identifier 1_SP = 1_SP Identifier 0_SP ( 1_SP after-context
(e) Predicted abstract token sequence.

812  public void visitChangedNodes( NodeChangeVisitor visitor, int nodeTypes )
813  {
814      long denseMask = changeMask( true);
(f) Source code snippet with repaired formatting.

Fig. 4: STYLER: from the Checkstyle formatting error to a fix.
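The repair of Figure 4 can be made concrete with a minimal sketch of the de-tokenization: the formatting token before “{” changes from 4_SP (four spaces) to 1_NL (one new line), which places the brace on a new line. The `render` helper below is illustrative (it ignores the indentation part of n_NL tokens for simplicity) and is not STYLER's implementation.

```python
def render(tokens):
    """Render an abstract token sequence that alternates Java tokens and
    formatting tokens (n_SP, n_NL, n_NL_d_ID, ...) back into source text.
    Simplified sketch: indentation deltas after new lines are ignored."""
    out = []
    for t in tokens:
        if t.endswith("_SP"):
            out.append(" " * int(t.split("_")[0]))  # n_SP -> n spaces
        elif "_NL" in t:
            out.append("\n")                        # n_NL... -> new line
        else:
            out.append(t)                           # a Java token, kept as-is
    return "".join(out)

buggy = [")", "4_SP", "{"]   # "... ) {" kept on the same line
fixed = [")", "1_NL", "{"]   # predicted repair: 4_SP replaced by 1_NL
```

Here `render(buggy)` yields `")    {"` while `render(fixed)` yields `")\n{"`, i.e. the brace moves to a new line as the LeftCurly rule requires.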

D. Java Source Code Encoding

STYLER encodes the Java source code into an abstract token sequence that is required to predict formatting changes. First, STYLER translates each Java token to an abstract token by keeping the value of the Java keywords, separators, and operators (e.g. + → +), and by replacing the other token kinds, such as literals, comments, and identifiers, by their types (e.g. x → Identifier). Second, for each pair of subsequent Java tokens, STYLER creates an abstract formatting token that depends on the presence of a new line. If there is no new line, STYLER counts the number of whitespaces and represents it as n_SP, where n is the number of whitespaces (e.g. a single space → 1_SP). If there is no whitespace between two Java tokens (e.g. x=), STYLER adds 0_SP between the tokens. The same process is applied for tabulations.

If there are new lines between two Java tokens, STYLER first counts the number of new lines and represents it as n_NL, where n is the number of new lines. Then, STYLER calculates the indentation delta (∆) between the line containing the previous token and the line containing the next token: the delta is the difference of the indentation between the two lines (the indentation is composed of whitespace or tabulation characters, exclusively, depending on the project). Positive indentation deltas are represented by ∆_ID (indent), negative ones are represented by ∆_DD (dedent), and deltas equal to zero (there is no indentation change between the two lines) are ignored: they are not represented by an abstract token. The complete representation after the calculation of the number of new lines and the indentation delta is n_NL_∆_(ID|DD): for instance, in Figure 4b, the new line between lines 812 and 813 is represented by 1_NL_4_ID, i.e. one new line and indentation delta +4.

E. Training Data Generation

STYLER does not use predefined templates for repairing formatting errors. STYLER uses machine learning for inferring a model to repair formatting errors and, consequently, it needs training data. One option is to mine past commits from the project under consideration to collect training data. However, there might not exist enough data in the history of the project to cover all Checkstyle formatting rules.

So, in order to have enough data for training, our key insight is to generate the training data. The idea is to modify error-free Java source code files in the project in order to trigger Checkstyle formatting rule violations. Then, one obtains a pair of files (α_orig, α_err): α_orig is the file without the formatting error, and α_err is the file with the formatting error. α_orig is a repaired version of α_err, and we can use supervised machine learning to predict α_orig given α_err. We experiment with that idea in two different ways (called protocols in this paper) to generate training data: we name them Styler_random and Styler_3grams, and present them as follows.

The Styler_random protocol for injecting Checkstyle errors in a project consists of the automated insertion or deletion of a single formatting character (space, tabulation, or new line) in Java source files. These modifications require a careful procedure so that 1) the project still compiles and 2) its behavior is not changed. For this, we specify the locations in the source code files that are suitable for performing the modifications. For insertions, the suitable locations are before or after any token. For deletions, the suitable locations are 1) before or after any punctuation (“.”, “,”, “(”, “)”, “[”, “]”, “{”, “}”, and “;”), 2) before or after any operator (e.g. “+”, “-”, “*”, “=”, “+=”), and 3) in any token sequence longer than one indentation character.

The Styler_3grams protocol is meant to produce likely errors. It performs modifications at the abstract token level instead of directly changing the Java source code as Styler_random does. The idea is to replace formatting tokens by the ones used by developers in a similar context (i.e. with the same surrounding Java tokens). For that, we use 3-grams, where 3gram = {Java_token, formatting_token, Java_token}. Given an error-free Java file, the task of Styler_3grams is the following. First, the Java file is tokenized (see Section IV-D), and a random formatting token is picked and used to form a 3-gram, which is 3gram_orig. Then, given a corpus of 3-grams previously mined from a project, Styler_3grams finds a 3gram_i-corpus that matches the surrounding Java tokens of 3gram_orig. Several matches can be found, but the selection of a 3gram_i-corpus is random according to its frequency in the corpus. Then, 3gram_orig is replaced by 3gram_i-corpus: since the Java tokens match, only the formatting token is actually replaced. Finally, Styler_3grams performs a de-tokenization so that an error version of the original error-free Java file is created.

Algorithm 1 presents the algorithm that STYLER uses to generate one training dataset per protocol (Styler_random and Styler_3grams). The input of the algorithm is the Checkstyle ruleset of the project, a corpus of error-free Java files taken from the project, the number of errored files to be generated, and the injection protocol to be used. Then, in each batch iteration, a random file is selected from the corpus of error-free Java files, and the specified injection protocol is applied to it. Once a batch is completed, Checkstyle is executed so that the algorithm selects the modified files that contain a single error. The algorithm ends when the desired number of errored files is reached.

Algorithm 1 Batch injection of Checkstyle errors in Java files.
Input: ruleset – Checkstyle configuration of the project under consideration
Input: files – corpus of error-free Java files taken from the project
Input: numberOfErrors – number of errored files to be generated
Input: protocol – in [Styler_random, Styler_3grams]
Output: dataset – with Checkstyle errors
 1: const BATCH_SIZE ← 500
 2: var dataset ← {}
 3: while dataset.length < numberOfErrors do
 4:   var modifiedFiles ← {}
 5:   for i ← 0; i < BATCH_SIZE; i++ do
 6:     file ← selectRandom(files)
 7:     file′ ← changeFormatting(file, protocol)
 8:     modifiedFiles.append(file′)
 9:   end for
10:   checkstyleResult ← runCheckstyle(modifiedFiles, ruleset)
11:   erroredFiles ← selectErroredFiles(checkstyleResult)
12:   dataset.append(erroredFiles)
13: end while
14: return dataset

F. Error Encoding

In order to repair formatting errors, the Java source code encoding using an abstract token sequence (see Section IV-D) must capture both the error in the code and the context surrounding the error. Therefore, STYLER considers a token window of k lines before and after the error.

Once the context surrounding a formatting error is tokenized, STYLER places two tags around the error, so that its location and its violation type can be further identified. The tags consist of the name of the Checkstyle rule that was violated and raised the error. For instance, the error presented in Figure 4a is about the Checkstyle LeftCurly rule, so the tags around the error are <LeftCurly> and </LeftCurly>, as shown in Figure 4c.

To insert the tags concerning the error type in the abstract token sequence, STYLER needs to find a placement such that the tags surround the tokens related to the origin of the error and, at the same time, such that the number of tokens between the two tags is minimized, to have precise information about the location. STYLER places the tags according to the location information given by Checkstyle (line and column). When Checkstyle provides the line and the column, STYLER places the opening tag n tokens before the error and the closing tag n tokens after. When Checkstyle provides the line but not the column (e.g. when the error is about the LineLength rule), STYLER places the opening tag i tokens before the line and the closing tag j tokens after the end of the line. The values of k, n, i, and j are explained in Section IV-I.

G. Machine Learning Model

Learning (Figure 3–step C). STYLER aims to translate a buggy token sequence (input sequence) into a new token sequence with no Checkstyle errors (output sequence). STYLER uses a sequence-to-sequence translation based on a recurrent neural network, the LSTM (Long Short-Term Memory), similar to what is used for natural language translation. Thanks to the token abstraction employed by STYLER to encode Java source code (see Section IV-D and Section IV-F), the input and output vocabularies are small (respectively ∼150 and ∼50), hence are well handled by LSTM models. We use an LSTM with bidirectional encoding, which means that the embedding is able to catch information around the formatting error in the two directions: for instance, an error triggered by the Checkstyle WhitespaceAround rule, which checks that a token is surrounded by whitespaces, requires the contexts before and after the token.

Predicting/Repairing (Figure 3–step F). Once the LSTM models are trained (one per training protocol, see Section IV-E), STYLER can be used for predicting fixes for an erroneous sequence I as in Figure 4c. For an input sequence I, an LSTM model predicts x alternative formatting token sequences using a technique called beam search, which we use off-the-shelf. These alternatives are all potential repairs for the formatting error (e.g. Figure 4d).

Note that the LSTM models predict formatting token sequences (e.g. Figure 4d), but the goal is to have token sequences containing Java and formatting tokens (e.g. Figure 4e), so they can further be translated back to Java code. Then, STYLER generates a new abstract token sequence (Oi) for each formatting token sequence (Fi), based on the original input I, such as in Figure 5a. Recall that I is composed of pairs of Java tokens and formatting tokens (see Section IV-D), therefore its number of formatting tokens is L_I = length(I)/2. However, an LSTM model does not enforce the output size, thus we cannot guarantee that the length of a predicted formatting token sequence (L_Fi = length(Fi)) is equal to L_I. If L_Fi > L_I, STYLER uses the first L_I formatting tokens from Fi and ignores the remaining ones to generate Oi, such as in Figure 5b. If L_Fi < L_I, STYLER uses all formatting tokens from Fi, and copies the L_Fi+1, L_Fi+2, ..., L_I original formatting tokens from I, such as in Figure 5c. Finally, after creating x abstract token sequences Oi, STYLER continues its workflow (Figure 3–step G).

I  = ( 0_SP Identifier 0_SP , 1_SP Identifier 1_SP
Fi = 0_SP 1_SP 1_NL 1_SP
Oi = ( 0_SP Identifier 1_SP , 1_NL Identifier 1_SP
(a) length(Fi) = length(I)/2.

I  = ( 0_SP Identifier 0_SP , 1_SP Identifier 1_SP
Fi = 0_SP 1_SP 1_SP 2_SP 1_NL_4_DD
Oi = ( 0_SP Identifier 1_SP , 1_SP Identifier 2_SP
(b) length(Fi) > length(I)/2.

I  = ( 0_SP Identifier 0_SP , 1_SP Identifier 1_SP
Fi = 0_SP 1_SP
Oi = ( 0_SP Identifier 1_SP , 1_SP Identifier 1_SP
(c) length(Fi) < length(I)/2.

Fig. 5: Generation of the sequence Oi based on the predicted formatting tokens Fi and the input I.

H. Repair Verification and Selection

STYLER performs x predictions per training data generation protocol (i.e. Styler_random and Styler_3grams), so in the end STYLER generates x × 2 predictions to repair a single error. After the translation of those predictions back to Java source code (Figure 3–step G), STYLER performs a verification (Figure 3–step H), where Checkstyle is executed on the resulting Java source code files. From the correctly repaired files (i.e. the ones that do not result in Checkstyle errors), STYLER selects the best one to give as output, where the best prediction is the one that has the smallest source code diff (Figure 3–step I).

I. Implementation

STYLER is implemented in Python. We use javalang [18] for parsing and OpenNMT-py [25] for the machine learning part. The code is publicly available [21].

For optimally training the LSTM models, we performed an exploratory study by training models with different configurations. The configurations combine values for key parameters, which are the model attention type (general or mlp), the number of layers (1, 2, or 3) and the number of units (256 or 512) for the model encoder/decoder, and the model word embedding size (256 or 512). For each configuration, the training was performed for a maximum of 20k iterations, with a batch size of 32, and a model was saved at iterations 10k and 20k. This means that, in the end, we obtained 48 models (2 model attention types × 3 numbers of layers × 2 numbers of units × 2 embedding sizes × 2 numbers of training iterations) per training data generation protocol (i.e. Styler_random and Styler_3grams).

Those models were created for one open-source project⁵, randomly selected from the top-5 projects with most diversity in terms of number of formatting rules (see Section V-B). The project was given as input to STYLER, which produced training data by injecting Checkstyle errors in error-free files of the project (see Section IV-E). For each protocol, 10k errors were injected. This data was used to train the LSTM models, where 9k errors were used for training and 1k for validation. When the 48 models per protocol were created, we ran each of them on real errors from the project so that we could test the models and choose the configuration of the best ones. We picked the configuration of the models, one per protocol, that repaired the most real errors. The best Styler_random-based model was with general model attention type, 2 layers, 512 units, embedding size of 512, and 20k training iterations, and the best Styler_3grams-based model was with general model attention type, 1 layer, 512 units, embedding size of 256, and 20k training iterations. Those are the configurations we used for training the models for our experiments described in Section VI.

For prediction, the beam search creates x = 5 potential repairs per model. Finally, about the error encoding, we set k = 5, n = 10, i = 2, and j = 13. Recall that those parameters are about the token window before and after the error (i.e. the context surrounding the error) and the placement of tags for the location and violation type identification once the error is encoded. These parameters are made big enough to contain important information and, at the same time, small enough to still allow for learning and prediction, and were set based on meta-optimization.

⁵https://github.com/inovexcorp/mobi

V. EVALUATION DESIGN

We conduct an evaluation of STYLER on real Checkstyle errors mined from GitHub repositories, and compare STYLER against three state-of-the-art code formatting systems. In this section, we present the design of our evaluation.

A. Research Questions

We aim to answer the following five research questions.

RQ #1 [Accuracy]: To what extent does STYLER repair real-world Checkstyle errors, compared to other systems?
Overall accuracy is an important metric to measure the value of tools. We investigate the accuracy of STYLER on real Checkstyle errors, which allows us to understand to what extent STYLER repairs formatting errors that have occurred in practice. Moreover, we compare the accuracy of STYLER to the accuracy of three code formatters, by using the same dataset of errors, to investigate if, and to what extent, STYLER outperforms the competing systems.

RQ #2 [Error type]: To what extent does STYLER repair different error types, compared to other systems?
Checkstyle has different formatting rules, so it raises different error types.

B. Data Collection

To answer our research questions, we create a dataset of real Checkstyle formatting errors by mining open source projects. For that, we first build a list of projects to collect errors from by filtering projects from our study presented in Section III. We select the projects that 1) use Checkstyle, 2) have only one Checkstyle ruleset file, 3) contain at least one Checkstyle formatting rule in the Checkstyle ruleset, and 4) use Maven. This results in 1,791 projects.

For each project, we try to reproduce Checkstyle errors with the following procedure. We first clone the remote repository from GitHub⁶. Then, we search the history of the project for the last commit (c_n) that contains modifications in the checkstyle.xml file: this commit is used as a starting point for the reproduction of real errors.

We then perform a sanity check on the checkstyle.xml file from the commit c_n: if it contains unresolved variables, we discard the project. Otherwise, we submit all files of c_n to a process of finding a version of Checkstyle that is compatible with the checkstyle.xml of the project. This is necessary because new versions of Checkstyle sometimes break backward compatibility⁷, and they might fail to parse a checkstyle.xml used with previous versions of Checkstyle. The process consists of executing multiple Checkstyle versions on the project, from a newer version to an older one, until finding one version that does not fail or until the available options end⁸.

If a compatible Checkstyle version is found, we gather all commits since c_n, inclusive: this process ensures that all commits are based on the same version of the Checkstyle ruleset. For each selected commit, we check it out, and we
In this research question, we investigate if, and to check if the pom.xml file overrides any Checkstyle config- what extent, STYLER repairs different error types compared uration option: if it does, we discard that commit because to the other systems. This analysis is also important to find if we cannot untangle the Maven+Checkstyle configuration with the systems are complementary to each other. high accuracy. Otherwise, we run Checkstyle on the commit RQ #3 [Quality]: What is the size of the repairs generated by source tree. If at least one Checkstyle error is raised, we save STYLER, compared to other systems? the errored Java files and also the metadata information about There may be several alternative repairs that fix a given the errors (the Checkstyle error types and their location). Checkstyle error, including ones that change other lines in the We remove duplicate Java files according to the file content program and not only the ill-formatted line. In this research among all commits if any. Then, we select the files con- question, we compare the size of the repairs produced by taining a single Checkstyle error related to formatting. We STYLER against the repairs from the other systems. perform this selection to accurately evaluate repairs predicted by STYLER. Finally, we keep projects where all criteria yield RQ #4 [Performance]: How fast is STYLER for learning and at least 20 Checkstyle formatting errors. By applying this for predicting formatting repairs? systematic reproduction and selection process, we obtained a To investigate if STYLER is applicable in practice, we measure dataset containing 11,220 Checkstyle errors spread over 70 its performance for fixing Checkstyle errors. This is a valuable projects. Additionally, Table II shows the stats per Checkstyle information for who is interested in using STYLER as a pre- formatting rule. commit hook in IDEs or in continuous integration. RQ #5 [Technical analysis]: How do the two training data C. 
Systems Under Comparison generation techniques of STYLER contribute to its accuracy? We selected three systems to be compared with STYLER: Finally, we perform a technical analysis on the two protocols one is an IDE-based code formatter plugin for Checkstyle, for training data generation contained in STYLER (see Sec- tion IV-E), to investigate if one of them contributes more to the 6All repositories were cloned in June 24, 2020. 7 accuracy of STYLER. This is an important investigation from Checkstyle release notes: https://checkstyle.sourceforge.io/releasenotes. html the research viewpoint so that other researchers can further 8Our current implementation supports 35 Checkstyle versions, from 8.0 to choose a random or a 3-gram approach in related research. 8.33. 8

and the other two are the state-of-the-art of machine learning VI.EVALUATION RESULTS AND DISCUSSION formatters that aim to assist developers to fix code formatting- We present and discuss the results for our five research related issues without any prior or ad-hoc formatting rules. questions in this section. 1) CHECKSTYLE-IDEA: CHECKSTYLE-IDEA[9], also referred as CS-IDEA in this paper, is a plugin for the IntelliJ IDE. It provides IDE integrated feedback against a given A. Accuracy of STYLER (RQ #1) Checkstyle ruleset and suggests fixes for Checkstyle errors. To measure the accuracy of STYLER and the accuracy of the 2) NATURALIZE: NATURALIZE [3] is a tool dedicated other three systems on the 11,220 real errors, we categorize to assist developers on fixing coding conventions related to the repair attempts per status. Table III shows the results per naming and formatting in Java programs. It learns coding tool and per status of the repair attempts: repaired/no error conventions from a codebase and suggests fixes to developers refers to errors that were successfully repaired, i.e. no error such as formatting modifications, based on the n-gram model. is raised after the repair attempt; repaired/new errors refers 3) CODEBUFF: CODEBUFF [26] is a code formatter appli- to errors that were fixed, but new errors were introduced in cable to any programming language with an ANTLR grammar. the source code; not repaired/same error refers to errors that Instead of formatting the code according to ad-hoc rules for a were not repaired, i.e. the same error is still in the source language, CODEBUFF aims to infer the formatting rules given code; not repaired/same+new refers to errors that were not a grammar for the language and a set of files following the repaired and new errors were introduced in the source code; same formatting rules. 
For each token, a KNN model makes and broken refers to cases containing files that cannot be the decision to indent it or to align it with another token based parsed by javalang after the repair attempts. on the AST of the source file. STYLER repairs 38% of the errors while CS-IDEA repairs 63%, which is the greatest overall accuracy among the four D. Set-up considered tools. NATURALIZE and CODEBUFF repair less errors (13% and 15%, respectively). To check if there is a 1) CHECKSTYLE-IDEA: To use CS-IDEA, for each significant difference between STYLER and the other tools, we project in our dataset, we first create a project in IntelliJ used McNemar test and we considered α = 0.05: we found containing the checkstyle.xml file and the errored files. p-value=0.000 for all three tests. This means that STYLER and Then, we import the Checkstyle ruleset (Settings > Editor any other tool have a different proportion of errors. > Code Style > Import schema > Checkstyle configuration). We note that STYLER and CS-IDEA are the most reliable To run the CHECKSTYLE-IDEA plugin we simply call the function “Refactor code” from the IDE. tools in the sense of delivering to an end-user either a repaired source code or, in the worst case scenario, the code with 2) NATURALIZE and CODEBUFF adaptation: To use NAT- the same error. It is not the same case of NATURALIZE and URALIZE, we have to slightly modify it: i) NATURALIZE CODEBUFF, which have higher rates of delivering source code recommends multiple fixes, so we take the first one for a given with new errors or broken. They were, however, designed for error as being the repair; and ii) we changed NATURALIZE to a different goal, and do not take into account the Checkstyle only work for indentation, excluding fixes regarding variable ruleset of the project like STYLER and CS-IDEA do. Yet, naming conventions (which are out of the scope of this paper). 
they are relevant for our experiment since they are the state- To run CODEBUFF, we give it the required configuration, of-the-art of machine learning-based code formatters. Our including the number of spaces for indentation. This number is results show the need of specialized, focused-tools to repair based on the most common indentation used in the considered Checkstyle errors. projects (usually two or four spaces). 3) Training tools: We trained STYLER for each project in RQ #1: To what extent does STYLER repair real-world our real error dataset. The training process includes a step for Checkstyle errors, compared to other systems? creating the training data (see Figure 3–step A), where we STYLER repairs 38% (4,231/11,220) of the Checkstyle create 10,000 errors per project. To conduct a fair evaluation, errors we found in the wild, and it outperforms the two we ensure that STYLER learns repairs based on the same state-of-the-art machine learning systems, NATURALIZE and Checkstyle ruleset that is used for the real errors in the CODEBUFF. CS-IDEA is able to repair 63% of the errors, evaluation. Therefore, for each project from the real error however we note that CS-IDEA is heavily engineered, dataset, we select as training seeds all error-free Java files whereas STYLER’s approach to repair formatting errors from the last commit that modified the checkstyle.xml is fully automated and hence more appropriate for easily file used to collect the real errors. We take special care of handling new and configurable rules. consistency in the observed results: all three machine learning- based systems, STYLER,NATURALIZE, and CODEBUFF, are trained using the same Java files. B. Error Type Analysis (RQ #2) 4) Testing tools: Finally, we run all the four tools to repair To answer RQ #2, we investigate if STYLER is effective the 11,220 errors from the real error dataset. in fixing different Checkstyle error types (one error type is related to one Checkstyle rule). 
Figure 6 shows the repaired 9STYLER also targets the following rules that are not contained in our dataset: AnnotationLocation, AnnotationOnSameLine, EmptyForInitializer- Checkstyle errors per error type and per tool in a heatmap. The Pad, SingleSpaceSeparator, and TypecastParenPad. colour scale is from dark to light colours, where the darkest 9
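The alignment step of Section IV (Figure 5), which truncates a too-long predicted formatting sequence and pads a too-short one with the original formatting tokens of I, can be sketched as follows. The flat alternating-list representation and the formatting token names (e.g. 1_SP for one space) are illustrative assumptions, not STYLER's actual encoding.

```python
def generate_output_sequence(I, F):
    """Align a predicted formatting token sequence F with the input I,
    as in Figure 5: truncate F if it is too long (Figure 5b), and pad it
    with the original formatting tokens of I if it is too short (Figure 5c).

    Assumed (illustrative) representation: I is a flat list alternating
    Java tokens and formatting tokens; F is a list of formatting tokens.
    """
    java_tokens = I[0::2]       # Java tokens at even positions
    original_fmt = I[1::2]      # formatting tokens at odd positions
    L_I = len(original_fmt)     # number of formatting slots in I
    if len(F) >= L_I:
        fmt = F[:L_I]                    # keep only the first L_I predictions
    else:
        fmt = F + original_fmt[len(F):]  # copy the missing tail from I
    # Interleave the Java tokens with the aligned formatting tokens
    O = []
    for java_token, fmt_token in zip(java_tokens, fmt):
        O.extend([java_token, fmt_token])
    return O
```

For instance, with I = ["int", "1_SP", "x", "0_None", ";", "1_NL"] and a one-token prediction ["4_SP"], the two missing formatting tokens are copied back from I.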

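The per-status categorization of repair attempts used for Table III can be sketched as a classification over the sets of Checkstyle violations observed before and after an attempt. The input shapes and the helper below are hypothetical; the actual pipeline runs Checkstyle and javalang on the repaired files.

```python
def classify_attempt(errors_before, errors_after, parseable):
    """Classify a repair attempt into the statuses of Table III.

    errors_before / errors_after: sets of (rule, location) violations
    reported by Checkstyle before and after the repair attempt;
    parseable: whether javalang can still parse the repaired file.
    These input shapes are illustrative assumptions.
    """
    if not parseable:
        return "broken"
    original_error_gone = not (errors_before & errors_after)
    new_errors = bool(errors_after - errors_before)
    if original_error_gone and not errors_after:
        return "repaired/no error"
    if original_error_gone:
        return "repaired/new errors"
    if new_errors:
        return "not repaired/same+new"
    return "not repaired/same error"
```

Since the dataset keeps only single-error files (Section V-B), one attempt maps to exactly one status.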
TABLE II: Real error dataset stats per formatting rule9.

Checkstyle rule (25)        Projects (70)    Errors (11,220)
CommentsIndentation         10 ( 14%)            32 (<1%)
EmptyForIteratorPad          2 (  3%)            10 (<1%)
EmptyLineSeparator          22 ( 31%)         2,729 ( 24%)
FileTabCharacter            17 ( 24%)           595 (  5%)
GenericWhitespace            4 (  6%)             6 (<1%)
Indentation                 18 ( 26%)           755 (  7%)
LeftCurly                   14 ( 20%)           197 (  2%)
LineLength                  43 ( 61%)         2,774 ( 25%)
MethodParamPad               9 ( 13%)            62 (  1%)
NewlineAtEndOfFile          11 ( 16%)           321 (  3%)
NoLineWrap                   1 (  1%)            11 (<1%)
NoWhitespaceAfter            7 ( 10%)            44 (<1%)
NoWhitespaceBefore          15 ( 21%)           141 (  1%)
OneStatementPerLine          2 (  3%)             4 (<1%)
OperatorWrap                22 ( 31%)           231 (  2%)
ParenPad                     7 ( 10%)           120 (  1%)
Regexp                       4 (  6%)           374 (  3%)
RegexpMultiline              2 (  3%)             8 (<1%)
RegexpSingleline            14 ( 20%)           474 (  4%)
RegexpSinglelineJava         1 (  1%)           203 (  2%)
RightCurly                  18 ( 26%)           372 (  3%)
SeparatorWrap                5 (  7%)            16 (<1%)
TrailingComment              4 (  6%)           370 (  3%)
WhitespaceAfter             17 ( 24%)           563 (  5%)
WhitespaceAround            42 ( 60%)           808 (  7%)

TABLE III: Results on the 11,220 real errors per tool (RQ #1).

                 Repaired                Not repaired
Tool             No error   New errors   Same error   Same+new   Broken
STYLER           38%        10%          48%          3%         2%
CS-IDEA          63%        15%          21%          1%         0%
NATURALIZE       13%        23%          11%          25%        28%
CODEBUFF         15%        34%          4%           32%        14%

STYLER repairs errors for 24/25 Checkstyle rules, which is more than all other tools: CS-IDEA fixes errors coming from 19 rules, NATURALIZE from 20, and CODEBUFF from 17. STYLER covers more diverse error types, which confirms our premise of employing a machine learning approach. Surprisingly, NATURALIZE produces fixes for a high number of error types, even more than CS-IDEA, although NATURALIZE does not target Checkstyle in particular.

The reason for CS-IDEA's overall good performance is that it works well on the three most frequent rules in our dataset: LineLength, EmptyLineSeparator, and FileTabCharacter. STYLER, on the other hand, has a perfect success rate on rules that are not that frequent: EmptyForIteratorPad, GenericWhitespace, NoLineWrap, and ParenPad. Moreover, STYLER is the only tool that fixes (two) rules exclusively: NewlineAtEndOfFile and NoLineWrap. This shows that STYLER is able to repair error types for which one would not need to put engineering effort into writing repair code (as opposed to common rules).

STYLER is reasonably good at fixing LineLength. This is interesting because LineLength is a configurable rule, where the developers specify the maximum length allowed for lines. Fixing a LineLength error is not straightforward: it requires the repair tool to localize a place for breaking the line such that 1) the maximum length is not exceeded and 2) no compilation error is introduced. STYLER repaired 31% of the 2,774 LineLength errors.

STYLER works poorly for EmptyLineSeparator errors. This rule enforces an empty line after specific source code constructions, such as headers, fields, and constructors. This problem is likely due to the fact that our training data generation technique does not cover these cases properly. On the contrary, CS-IDEA has a high accuracy (92%) for this error type. This shows that STYLER and CS-IDEA are complementary to each other, and can potentially be used in conjunction.

RQ #2: To what extent does STYLER repair different error types, compared to other systems?
STYLER repairs errors from more Checkstyle rules (24) than the other tools (CS-IDEA: 19; NATURALIZE: 20; CODEBUFF: 17). It also exclusively fixes errors for two rules that no other tool handles. This confirms that a machine-learning approach to repairing formatting errors is able to capture many different types of problems. For some rules, CS-IDEA has a much higher accuracy than STYLER, suggesting that they can be considered as complementary in practice.
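The McNemar comparison used in RQ #1 can be reproduced from the discordant pairs, i.e. the errors repaired by one tool but not the other. A minimal stdlib sketch of the continuity-corrected test; the counts in the assertions are illustrative, not our data.

```python
from math import erf, sqrt

def mcnemar_p(b, c):
    """Continuity-corrected McNemar test on discordant pairs.

    b = number of errors repaired by tool A but not by tool B;
    c = number repaired by B but not by A. Returns the two-sided
    p-value of the chi-square statistic with 1 degree of freedom.
    """
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # For a chi-square with 1 df, P(X > chi2) = 2 * (1 - Phi(sqrt(chi2)))
    phi = 0.5 * (1 + erf(sqrt(chi2) / sqrt(2)))
    return 2 * (1 - phi)
```

With strongly unbalanced discordant pairs, the p-value drops well below α = 0.05, matching the significance reported for all three pairwise comparisons.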

CommentsIndentation (32) 9.4 % 40.6 % 25.0 % 0.0 % 40.6 %

EmptyForIteratorPad (10) 100.0 % 0.0 % 40.0 % 40.0 % 100.0 %

EmptyLineSeparator (2729) 2.3 % 91.9 % 20.4 % 1.0 % 94.5 %

FileTabCharacter (595) 9.4 % 93.1 % 6.4 % 31.8 % 98.8 %

GenericWhitespace (6) 100.0 % 100.0 % 16.7 % 33.3 % 100.0 %

Indentation (755) 84.2 % 92.2 % 3.8 % 74.8 % 94.4 %

LeftCurly (197) 95.9 % 92.9 % 35.5 % 34.5 % 95.9 %

LineLength (2774) 31.1 % 48.7 % 0.0 % 1.3 % 51.5 %

MethodParamPad (62) 51.6 % 80.6 % 11.3 % 12.9 % 87.1 %

NewlineAtEndOfFile (321) 61.4 % 0.0 % 0.0 % 0.0 % 61.4 %

NoLineWrap (11) 100.0 % 0.0 % 0.0 % 0.0 % 100.0 %

NoWhitespaceAfter (44) 18.2 % 22.7 % 2.3 % 15.9 % 22.7 %

NoWhitespaceBefore (141) 78.0 % 71.6 % 34.8 % 46.1 % 94.3 %

OneStatementPerLine (4) 25.0 % 25.0 % 0.0 % 0.0 % 25.0 %

OperatorWrap (231) 55.8 % 0.0 % 15.2 % 4.3 % 57.6 %

ParenPad (120) 100.0 % 36.7 % 35.0 % 26.7 % 100.0 %

Regexp (374) 2.9 % 2.9 % 8.6 % 11.0 % 14.2 %

RegexpMultiline (8) 0.0 % 0.0 % 0.0 % 0.0 % 0.0 %

RegexpSingleline (474) 10.8 % 37.8 % 2.1 % 5.3 % 38.0 %

RegexpSinglelineJava (203) 2.5 % 0.5 % 9.4 % 0.0 % 11.3 %

RightCurly (372) 54.3 % 76.3 % 3.5 % 26.6 % 78.2 %

SeparatorWrap (16) 37.5 % 6.2 % 25.0 % 0.0 % 37.5 %

TrailingComment (370) 88.1 % 0.0 % 31.4 % 0.0 % 88.9 %

WhitespaceAfter (563) 78.5 % 86.7 % 17.1 % 56.7 % 90.6 %

WhitespaceAround (807) 93.4 % 79.2 % 40.1 % 29.4 % 96.5 %

Styler Checkstyle-IDEA Naturalize CodeBuff All

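The per-tool percentages of Fig. 6, including the "All" column (errors repaired by at least one tool), can be computed from per-tool sets of repaired errors. A minimal sketch, assuming hypothetical sets of repaired error identifiers:

```python
def repair_rates(repaired_by_tool, total_errors):
    """Per-tool repair rates plus the 'All' rate of Fig. 6, i.e. the
    share of errors repaired by at least one tool (union of the sets).

    repaired_by_tool: hypothetical mapping from tool name to the set of
    error ids with status repaired/no error; total_errors: dataset size.
    """
    rates = {tool: 100.0 * len(ids) / total_errors
             for tool, ids in repaired_by_tool.items()}
    # "All": an error counts if any tool repaired it
    union = set().union(*repaired_by_tool.values())
    rates["All"] = 100.0 * len(union) / total_errors
    return rates
```

Since the union is taken over tools, the "All" rate is always at least as high as the best single tool, which is why complementary tools raise it above CS-IDEA alone for several rules.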
C. Size of the Repairs (RQ #3)

One dimension of repair quality is the size of the diff (added + deleted lines) between the source code with a Checkstyle error and the repaired source code. Among all repairs that pass all Checkstyle rules, the diff should be as small as possible to be the least disruptive for the developers. In the context of a pull request on GitHub, a smaller diff is usually considered easier to review and merge [12].

We calculate the size in lines of the diff for the errors that STYLER, CS-IDEA, NATURALIZE, and CODEBUFF repaired. Figure 7 shows the results: the x axis presents the size distribution of the diffs, and each boxplot represents one tool. STYLER (in green) and NATURALIZE (in yellow) have the smallest medians of diff size, both equal to five changed lines. Yet, STYLER suffers from fewer bad cases (the right-hand part of the distribution). CS-IDEA (in pink) and CODEBUFF (in blue) produce larger diff sizes, with medians equal to 7 and 55, respectively. In the worst cases, they produce the largest diffs: their 95th percentile passes 200 changed lines, compared to 7 lines for STYLER.

We performed the Wilcoxon rank sum test to verify if the distributions of the diff sizes obtained by STYLER and the other tools are systematically different from one another. We found p-value=0.000 when testing STYLER against CS-IDEA and CODEBUFF, and p-value=0.0000000039 when testing STYLER against NATURALIZE. Considering α = 0.05, we reject the null hypothesis, which means that the distribution of STYLER's diff sizes is significantly different from the others.

RQ #3: What is the size of the repairs generated by STYLER, compared to other systems?
STYLER has a median repair size of five changed lines, the same as NATURALIZE. Yet, NATURALIZE produces small formatting repairs with a less reliable predictability compared to STYLER. CS-IDEA and CODEBUFF clearly produce bigger formatting repairs. The ability to produce small diffs is an important property for code reviews and pull-request-based development; hence, our results show that STYLER can be realistically used in a modern software development context.

D. Performance (RQ #4)

To investigate if STYLER can be used in practice, we measure the execution time spent when running STYLER on the real error dataset. Table IV shows the minimum, median, average, and maximum time spent on projects, split over the different steps of the STYLER workflow. For training data generation, STYLER took at least 16 minutes and up to six and a half hours. Tokenizing the training data took up to 13 minutes, and training the models mostly took about one hour. Therefore, the training of STYLER (data generation + tokenization + model training) took around two and a half hours on average.
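The repair-size metric used in RQ #3 (the number of added plus deleted lines between the buggy and the repaired file) can be computed with a standard unified diff. A minimal sketch using Python's difflib; this is our reading of the metric, not STYLER's exact implementation.

```python
import difflib

def diff_size(buggy, repaired):
    """Repair size as in RQ #3: number of added plus deleted lines in
    the unified diff between the buggy and the repaired file contents."""
    diff = difflib.unified_diff(buggy.splitlines(), repaired.splitlines(),
                                lineterm="")
    # Count '+' and '-' lines, excluding the '+++' / '---' file headers
    return sum(1 for line in diff
               if (line.startswith("+") and not line.startswith("+++"))
               or (line.startswith("-") and not line.startswith("---")))
```

Under this metric, a single reformatted line counts as two (one deletion plus one addition), which is consistent with summing added and deleted lines.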

RQ #5: How do the two training data generation Styler Naturalize Checkstyle-IDEA CodeBuff techniques of STYLER contribute to its accuracy? For most errors, STYLER selects a repair predicted by the LSTM model based on the Styler3grams protocol because it produces a smaller diff, which is desirable for devel- opers. Yet, the model based on Stylerrandom exclusively contributes to the overall accuracy of STYLER with 20% of the fixes.

0 25 50 75 100 125 150 175 200 VII.DISCUSSION Diff size A. Machine learning versus rule-based approaches Fig. 7: Size of the repairs per tool. The two boxplot whiskers STYLER employs a machine-learning-based approach for represent the 5th and the 95th percentiles (RQ #3). repairing formatting convention violations. An alternative ap- proach would be a rule-based one. For instance, there would TABLE IV: Statistics on the performance of STYLER (RQ #4). be one transformation to be applied in the code per Checkstyle rule. As said, this approach requires the engineering of a Training Prediction Data generation Tokenization Models Average Time transformation for every single linter rule, which is time- Stepa: ABC E→I consuming. While this is costly, this might be even impractical Min 00:16:18 00:00:51 00:31:54 1.608 s/err for highly configurable linters such as Checkstyle: the rule- Med 00:45:10 00:09:09 00:59:14 2.215 s/err based repair system would need to have different transforma- Avg 01:15:38 00:08:18 00:58:38 2.277 s/err tions for the same linter rule due to the configurable properties. Max 06:30:44 00:13:51 01:22:27 3.407 s/err On the contrary, a machine learning approach does not require a The steps were executed in a computer containing a processor In- tel(R) Core(TM) i9-10980XE CPU @ 3.00GHz and 125GiB system costly human engineering. It is able to infer transformations memory. For training the models, we used GPUs GeForce RTX 2080 for a diverse set of linter rules. Our experiments have validated Ti. this property in the context of formatting errors raised by Checkstyle. training is meant to happen only when the coding conventions B. Threats to Validity change (i.e. the Checkstyle ruleset file), which means rarely (a given version of coding conventions usually lasts for months). 
STYLER generates training data for repairing errors based After STYLER is trained for a given project, it takes in average on the Checkstyle configuration file contained in a given two seconds to predict a repair, which is fast enough to be used project. This means that STYLER assumes that all formatting in IDEs or in continuous integration environments. rules contained in the Checkstyle configuration file are valid. In practice, however, developers might ignore the violations of RQ #4: How fast is STYLER for learning and for certain rules. Our experiment does not take this scenario into predicting formatting repairs? account, thus we do not claim that 100% of the fixes produced On average, STYLER needs about two hours and a half by STYLER are necessarily relevant for developers. for training, and two seconds for predicting a repair. The The real error dataset contains Checkstyle errors mined training time is not an issue since it only happens when the from GitHub repositories. It is to be noted that it does not Checkstyle ruleset file changes. The prediction time relates cover all existing Checkstyle formatting rules. It is worth to to usability: our results show that STYLER can be used in mention that we are still collecting real errors, and those can the IDE or in CI, in a practical setting. potentially cover new rules. Moreover, the dataset might not be representative of the real distribution of the 19 rules in the real world. Consequently, future research is needed to strengthen E. Technical Analysis on STYLER (RQ #5) the validity of our study. When selecting real errors, we chose only files containing a At prediction time, STYLER used two trained LSTM mod- single real Checkstyle error (see Section V-B). We performed els, each one based on a different training data generation pro- this selection so that we could accurately check if the error tocol: Stylerrandom and Styler3grams. We investigate how was correctly repaired by the tools. 
Files containing more than the two protocols contribute to the final output of STYLER. one error are hard to check the correctness of repairs: once an We found that STYLER fixed 852 Checkstyle errors with the error is repaired, the location of the other ones in the file would Stylerrandom-based model exclusively, while it fixed 1,008 change. Therefore, our results are based on single-error files, errors with the Styler3grams-based model — 2,374 errors and future investigations on multiple-error files are needed. were fixed with both models. This shows that the model based Finally, to compare the quality of the repairs produced on Styler3grams is more effective. Moreover, when selecting by STYLER with the repairs produced by the other three one repair to give as output (Figure 3–step I), STYLER selected tools, we measured the size in lines of the diff between the the repair from the Styler3grams-based model in most cases buggy and repaired program versions. However, the diff size because it generates smaller diffs. is only one dimension for comparing the tools, which only 12 approximates the developer’s perception on formatting repairs. They mined millions of buggy and patched program versions User studies, such as proposing to developers formatting from the history of GitHub repositories, and abstracted them repairs, are interesting future experiments to further investigate to train an Encoder-Decoder model. The model was able to fix the practical value of this research. hundreds of unique buggy methods in the wild. [10] proposed SequenceR, an end-to-end program repair approach focused on VIII.RELATED WORK one-line fixes. In an experiment with Defects4J, SequenceR A. The use of static analysis tools was shown to be able to learn to repair behavioral bugs by Static analysis tools have been subject of investigation in generating patches that pass all tests. recent research. 
[39] investigated their usage in 20 popular Java open source projects hosted on GitHub and using Travis C. Linter-error repair and formatting CI to support CI activities. They first found out that the Linter-error repair. There are some tools to fix errors raised projects use seven static analysis tools—[8], [14], [28], [20], by specific linters. For instance, ESLint [13] is a linter for [4], [11], and [19]—being Checkstyle the most used one. JavaScript, but it also includes automated solutions to repair About the integration of static analysis tools in CI pipelines, errors raised by it. For Python, there exists the autopep8 tool they found out that build breakages due those tools are mainly [5], which formats Python code to conform to the PEP 8 related to adherence to coding standards, while breakages Style Guide for Python Code [27]. For Java, there exists the related to likely bugs or vulnerabilities occur less frequently. CHECKSTYLE-IDEA[9] plugin for IntelliJ, which we used [39] discuss that some tools are sometimes configured to not to be compared to STYLER.CHECKSTYLE-IDEA is able to break the build but just to produce warnings, possibly because highlight the error and also to suggest fixes in some cases. of the high number of false positives. However, it is very limited in repairing errors from several [38] investigated the usage of static analysis tools from the different rules as we have shown in RQ #2. perspective of the development context in which these tools are Code formatters. A way to enforce formatting conventions lies used. For that, they surveyed 42 developers and interviewed in code formatters. In Section V-C, we described NATURALIZE 11 industrial experts that integrate static analysis tools in their [3] and CODEBUFF [26]: NATURALIZE recommends fixes workflow. 
They found out that static analysis tools are used in for coding conventions related to naming and formatting in three main development contexts, which are local environment, Java programs, and CODEBUFF infers formatting rules to code review, and continuous integration. Moreover, they also any language given a grammar. Similar to the idea behind found out that developers differently consider warning types CODEBUFF,[30] had previously experimented with different depending on the context, e.g., when performing code reviews learning algorithms and feature set variations to learn the style they mainly look at style conventions and code redundancies. of a given corpus so that it could be applied to arbitrary code. [22] focused on one specific static analysis tool: [33]. Beyond those academic systems, there are code formatters Through an online survey with 18 developers from different such as google-java-format [15], which reformats source code organizations, they found out that most respondents agree that according to the Google Java Style Guide [16], and as such the issues reported by static analysis tools are relevant for fixes violations of the Google Style. However, these formatters improving the design and implementation of software. are usually not configurable or require manual tweaking, which is a tedious process for developers. This is a problem because B. Learning for repairing compiler errors and behavioral not all developers are ready to follow a unique convention bugs style. STYLER, on the other hand, is generic and automatically Learning for repairing compiler errors. There are related captures the conventions used in a project to fix formatting works in the area of automatic repair of compiler errors. violations. In this case, the compiler syntax rules are the equivalent of the formatting rules. There, recurrent neural networks and IX.CONCLUSION token abstraction have been used to fix syntactic errors [7]. 
In this paper, we presented STYLER, which implements a In DeepFix, [17] use a language model for repairing syntactic novel approach to repair formatting errors raised by Check- compilation errors in C programs. Out of 6,971 erroneous C style, the popular linter for Java programs. STYLER creates programs, DeepFix was able to completely repair 27% and a corpus of Checkstyle errors, learns from it, and predicts partially repair 19% of the programs. Later, [1] proposed fixes for new errors, using machine learning. Our experimental TRACER, which outperformed DeepFix, repairing 44% of results on 11,220 real Checkstyle errors showed that STYLER the programs. [32] confirmed the efficiency of LSTM over repairs real errors from a more diverse set of Checkstyle rules n-grams and of token abstraction for single token compiling than the systems CHECKSTYLE-IDEA,NATURALIZE, and errors. These approaches do not target formatting errors, which CODEBUFF. Moreover, STYLER produces smaller repairs than is the target of STYLER. the compared systems, and its prediction time is low so it can Learning for repairing behavioral bugs. As for repairing be used in IDEs or in Continuous Integration environments. compiler errors, there are also learning systems for repairing There are interesting areas for future work. First, improve- behavioral bugs, those that, for instance, break test cases. [37] ments on the error injection protocols for creating training data investigated the feasibility of using Neural Machine Transla- can be done so as to improve the representativeness of seeded tion techniques for learning bug-fixing patches for real defects. formatting errors. This might increase the performance of 13

STYLER on real errors. Second, user studies can be conducted, in which repairs predicted by STYLER are proposed to developers, for instance via pull requests. This type of study would bring practical insights on the potential of STYLER. Additionally, STYLER's novel concept could be extended so as to repair other linter errors, beyond purely formatting ones.

REFERENCES

[1] Umair Z. Ahmed, Pawan Kumar, Amey Karkare, Purushottam Kar, and Sumit Gulwani. Compilation Error Repair: For the Student Programs, From the Student Programs. In Proceedings of the 40th International Conference on Software Engineering: Software Engineering Education and Training (ICSE-SEET '18), pages 78–87, New York, NY, USA, 2018. ACM.
[2] Miltiadis Allamanis. Learning Natural Coding Conventions. PhD thesis, School of Informatics, University of Edinburgh, 2016.
[3] Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. Learning Natural Coding Conventions. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE '14), pages 281–293, New York, NY, USA, 2014. ACM.
[4] Apache-rat. https://creadur.apache.org/rat/, 2020. Accessed: 2020-07-13.
[5] autopep8. https://pypi.org/project/autopep8/, 2018. Accessed: 2018-11-16.
[6] Moritz Beller, Georgios Gousios, and Andy Zaidman. Oops, My Tests Broke the Build: An Explorative Analysis of Travis CI with GitHub. In Proceedings of the 14th International Conference on Mining Software Repositories (MSR '17), pages 356–367, Piscataway, NJ, USA, 2017. IEEE Press.
[7] S. Bhatia and R. Singh. Automated Correction for Syntax Errors in Programming Assignments using Recurrent Neural Networks. CoRR, 2016.
[8] Checkstyle. https://checkstyle.sourceforge.io/, 2020. Accessed: 2020-07-13.
[9] CheckStyle-IDEA. https://plugins.jetbrains.com/plugin/1065-checkstyle-idea, 2019. Accessed: 2019-01-21.
[10] Zimin Chen, Steve Kommrusch, Michele Tufano, Louis-Noël Pouchet, Denys Poshyvanyk, and Martin Monperrus. SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair. IEEE Transactions on Software Engineering, 2019.
[11] Clirr. http://www.mojohaus.org/clirr-maven-plugin/, 2020. Accessed: 2020-07-13.
[12] Hugo Dias. The anatomy of a perfect pull request. https://medium.com/@hugooodias/the-anatomy-of-a-perfect-pull-request-567382bb6067, 2019. Accessed: 2019-05-10.
[13] ESLint. https://eslint.org/, 2018. Accessed: 2018-11-16.
[14] FindBugs. http://findbugs.sourceforge.net/, 2020. Accessed: 2020-07-13.
[15] google-java-format. https://github.com/google/google-java-format/, 2018. Accessed: 2018-11-16.
[16] Google Java Style Guide. http://checkstyle.sourceforge.net/reports/google-java-style-20170228.html, 2018. Accessed: 2018-11-16.
[17] Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade. DeepFix: Fixing Common C Language Errors by Deep Learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI '17), pages 1345–1351. AAAI Press, 2017.
[18] javalang. https://github.com/c2nes/javalang/, 2019. Accessed: 2019-01-22.
[19] jDepend. http://www.mojohaus.org/jdepend-maven-plugin/, 2020. Accessed: 2020-07-13.
[20] License-gradle-plugin. https://github.com/hierynomus/license-gradle-plugin, 2020. Accessed: 2020-07-13.
[21] Benjamin Loriot, Fernanda Madeiral, and Martin Monperrus. Open science repository for Styler. https://github.com/KTH/styler/, 2018.
[22] Diego Marcilio, Rodrigo Bonifácio, Eduardo Monteiro, Edna Canedo, Welder Luz, and Gustavo Pinto. Are Static Analysis Violations Really Fixed? A Closer Look at Realistic Usage of SonarQube. In Proceedings of the 27th International Conference on Program Comprehension (ICPC '19), pages 209–219. IEEE Press, 2019.
[23] Robert C. Martin. Clean Code. Prentice Hall, 2008.
[24] Martin Monperrus. Automatic Software Repair: a Bibliography. ACM Computing Surveys, 51(1):17:1–17:24, January 2018.
[25] OpenNMT-py. https://github.com/OpenNMT/OpenNMT-py/, 2019. Accessed: 2019-01-09.
[26] Terence Parr and Jurgen Vinju. Towards a Universal Code Formatter through Machine Learning. In Proceedings of the 9th ACM SIGPLAN International Conference on Software Language Engineering (SLE '16), pages 137–151, New York, NY, USA, 2016. ACM.
[27] PEP 8 – Style Guide for Python Code. https://www.python.org/dev/peps/pep-0008/, 2018. Accessed: 2018-11-16.
[28] PMD. https://pmd.github.io/, 2020. Accessed: 2020-07-13.
[29] Pylint. https://www.pylint.org/, 2018. Accessed: 2018-11-16.
[30] Steven P. Reiss. Automatic code stylizing. In Proceedings of the Twenty-Second IEEE/ACM International Conference on Automated Software Engineering (ASE '07), pages 74–83, New York, NY, USA, 2007. Association for Computing Machinery.
[31] RuboCop. https://docs.rubocop.org, 2018. Accessed: 2018-11-16.
[32] E. A. Santos, J. C. Campbell, D. Patel, A. Hindle, and J. N. Amaral. Syntax and sensibility: Using language models to detect and correct syntax errors. In Proceedings of the 25th International Conference on Software Analysis, Evolution and Reengineering (SANER '18), pages 311–322, 2018.
[33] SonarQube. https://www.sonarqube.org/, 2020. Accessed: 2020-07-13.
[34] StyleCop. https://github.com/StyleCop/StyleCop/, 2018. Accessed: 2018-11-16.
[35] Sun Code Conventions. https://www.oracle.com/technetwork/java/javase/documentation/codeconvtoc-136057.html, 2018. Accessed: 2018-11-16.
[36] K. F. Tómasdóttir, M. Aniche, and A. Van Deursen. The Adoption of JavaScript Linters in Practice: A Case Study on ESLint. IEEE Transactions on Software Engineering, pages 1–29, September 2018.
[37] Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. An empirical investigation into learning bug-fixing patches in the wild via neural machine translation. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE '18), pages 832–837, New York, NY, USA, 2018. Association for Computing Machinery.
[38] C. Vassallo, S. Panichella, F. Palomba, S. Proksch, A. Zaidman, and H. C. Gall. Context is king: The developer perspective on the usage of static analysis tools. In Proceedings of the 25th International Conference on Software Analysis, Evolution and Reengineering (SANER '18), pages 38–49, 2018.
[39] Fiorella Zampetti, Simone Scalabrino, Rocco Oliveto, Gerardo Canfora, and Massimiliano Di Penta. How open source projects use static code analysis tools in continuous integration pipelines. In Proceedings of the 14th International Conference on Mining Software Repositories (MSR '17), pages 334–344. IEEE Press, 2017.