Towards a Universal Code Formatter through Machine Learning

Terence Parr, University of San Francisco, USA ([email protected])
Jurgen Vinju, Centrum Wiskunde & Informatica, Netherlands ([email protected])

Abstract

There are many declarative frameworks that allow us to implement code formatters relatively easily for any specific language, but constructing them is cumbersome. The first problem is that "everybody" wants to format their code differently, leading to either many formatter variants or a ridiculous number of configuration options. Second, the size of each implementation scales with a language's grammar size, leading to hundreds of rules.

In this paper, we solve the formatter construction problem using a novel approach, one that automatically derives formatters for any given language without intervention from a language expert. We introduce a code formatter called CODEBUFF that uses machine learning to abstract formatting rules from a representative corpus, using a carefully designed feature set. Our experiments on Java, SQL, and ANTLR grammars show that CODEBUFF is efficient, has excellent accuracy, and is grammar invariant for a given language. It also generalizes to a 4th language tested during manuscript preparation.

Categories and Subject Descriptors  D.2.3 [Software Engineering]: Coding - Pretty printers

Keywords  Formatting algorithms, pretty-printer

1. Introduction

The way code is formatted has a significant impact on its comprehensibility [9], and manually reformatting code is just not an option [8, p.399]. Therefore, programmers need ready access to automatic code formatters or "pretty printers" in situations where formatting is messy or inconsistent. Many program generators also take advantage of code formatters to improve the quality of their output.

Because the value of a particular code formatting style is a subjective notion, often leading to heated discussions, formatters must be highly configurable. This allows, for example, current maintainers of existing code to improve their effectiveness by reformatting the code per their preferred style. There are plenty of configurable formatters for existing languages, whether built into IDEs or standalone tools like GNU indent, but specifying style is not easy. The emergent behavior is not always obvious, there exists interdependency between options, and the tools cannot take context information into account [13]. For example, here are the options needed to obtain K&R style with indent:

    -nbad -bap -bbo -nbc -br -brs -c33 -cd33 -ncdb -ce
    -ci4 -cli0 -cp33 -cs -d0 -di1 -nfc1 -nfca -hnl -i4
    -ip0 -l75 -lp -npcs -nprs -npsl -saf -sai -saw -nsc
    -nsob -nss

New languages pop into existence all the time and each one could use a formatter. Unfortunately, building a formatter is difficult and tedious. Most formatters used in practice are ad hoc, language-specific programs, but there are formal approaches that yield good results with less effort. Rule-based formatting systems let users specify phrase-formatting pairs, such as the following sample specification for formatting the COBOL MOVE statement using ASF+SDF [3, 12, 13, 15].

    MOVE IdOrLit TO Id-list =
      from-box( H [ "MOVE"
        Hts=25[to-box(IdOrLit)] Hts=49["TO"]
        Hts=53[to-box(Id-list)]])

This rule maps a pattern to a box expression. A set of such rules, complemented with default behavior for the unspecified parts, generates a single formatter with a specific style for the given language. Section 6 has other related work.

There are a number of problems with rule-based formatters. First, each specification yields a formatter for one specific style. Each new style requires a change to those rules or the creation of a new set. Some systems allow the rules to be parametrized, and configured accordingly, but that leads to higher rule complexity. Second, minimal changes to the associated grammar usually require changes to the formatting rules, even if the grammar changes do not affect the language recognized. Finally, formatter specifications are big. Although most specification systems have builtin heuristics for default behavior in the absence of a specification for a given language phrase, specification size tends to grow with the grammar size. A few hundred rules are no exception.

Formatting is a problem solved in theory, but not yet in practice. Building a good code formatter is still too difficult and requires way too much work; we need a fresh approach. In this paper, we introduce a tool called CODEBUFF [11] that uses machine learning to produce a formatter entirely from a grammar for language L and a representative corpus written in L. There is no specification work needed from the user other than to ensure reasonable formatting consistency within the corpus. The statistical model used by CODEBUFF first learns the formatting rules from the corpus, which are then applied to format other documents in the same style. Different corpora effectively result in different formatters. From a user perspective the formatter is "configured by example."

Contributions and roadmap. We begin by showing sample CODEBUFF output in Section 2 and then explain how and why CODEBUFF works in Section 3. Section 4 provides empirical evidence that CODEBUFF learns a formatting style quickly and using very few files. CODEBUFF approximates the corpus style with high accuracy for the languages ANTLR, Java and SQL, and it is largely insensitive to language-preserving grammar changes. To adjust for possible selection bias and model overfitting to these three well-known languages, we tested CODEBUFF on an unfamiliar language (Quorum) in Section 5, from which we learned that CODEBUFF works similarly well, yet improvements are still possible. We position CODEBUFF with respect to the literature in Section 6.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SLE'16, October 31 - November 1, 2016, Amsterdam, Netherlands.
(c) 2016 ACM. 978-1-4503-4447-0/16/10...$15.00. http://dx.doi.org/10.1145/2997364.2997383

2. Sample Formatting

This section contains sample SQL, Java, and ANTLR code formatted by CODEBUFF, including some that are poorly formatted to give a balanced presentation. Only the formatting style matters here so we use a small font for space reasons. Github [11] has a snapshot of all input corpora and formatted versions (corpora, testing details in Section 4). To arrive at the formatted output for document d in corpus D, our test rig removes all whitespace tokens from d and then applies an instance of CODEBUFF trained on the corpus without d, D \ {d}.

The examples are not meant to illustrate "good style." They are simply consistent with the style of a specific corpus. In Section 4 we define a metric to measure the success of the automated formatter in an objective and reproducible manner. No quantitative research method can capture the qualitative notion of style, so we start with these examples. (We use "..." for immaterial text removed to shorten samples.)

SQL is notoriously difficult to format, particularly for nested queries, but CODEBUFF does an excellent job in most cases. For example, here is a formatted query from file IPMonVerificationMaster.sql (trained with sqlite grammar on sqlclean corpus):

    SELECT DISTINCT t.server_name
                  ,t.server_id
                  ,'MessageQueuingService' AS missingmonitors
    FROM t_server t INNER JOIN t_server_type_assoc tsta ON t.server_id = tsta.server_id
    WHERE t.active = 1 AND tsta.type_id IN ('8')
          AND t.environment_id = 0 AND t.server_name NOT IN
          ( SELECT DISTINCT l.address
            FROM ipmongroups g INNER JOIN ipmongroupmembers m ON g.groupid = m.groupid
                 INNER JOIN ipmonmonitors l ON m.monitorid = l.monitorid
                 INNER JOIN t_server t ON l.address = t.server_name
                 INNER JOIN t_server_type_assoc tsta ON t.server_id = tsta.server_id
            WHERE l.name LIKE '%Message Queuing Service%'
                  AND t.environment_id = 0
                  AND tsta.type_id IN ('8')
                  AND g.groupname IN ('Prod O/S Services')
                  AND t.active = 1 )
    UNION
    ALL
    ...

And here is a complicated query from dmart_bits_IAPPBO510.sql with case statements:

    SELECT
        CASE WHEN SSISInstanceID IS NULL THEN 'Total'
             ELSE SSISInstanceID
        END SSISInstanceID
        ,SUM(OldStatus4) AS OldStatus4
        ...
        ,SUM(OldStatus4 + Status0 + Status1 + Status2 + Status3 + Status4) AS InstanceTotal
    FROM
    ( SELECT
          CONVERT(VARCHAR, SSISInstanceID) AS SSISInstanceID
          ,COUNT(CASE WHEN Status = 4 AND
                           CONVERT(DATE, LoadReportDBEndDate) < CONVERT(DATE, GETDATE())
                      THEN Status
                      ELSE NULL
                 END) AS OldStatus4
          ...
          ,COUNT(CASE WHEN Status = 4 AND
                           DATEPART(DAY, LoadReportDBEndDate) = DATEPART(DAY, GETDATE())
                      THEN Status
                      ELSE NULL
                 END) AS Status4
      FROM dbo.ClientConnection
      GROUP BY SSISInstanceID ) AS StatusMatrix
    GROUP BY SSISInstanceID

Here is a snippet from Java, our 2nd test language, taken from STLexer.java (trained with java grammar on st corpus):

    switch ( c ) {
        ...
        default:
            if ( c==delimiterStopChar ) {
                consume();
                scanningInsideExpr = false;
                return newToken(RDELIM);
            }
            if ( isIDStartLetter(c) ) {
                ...
                if ( name.equals("if") ) return newToken(IF);
                else if ( name.equals("endif") ) return newToken(ENDIF);
                ...
                return id;
            }
            RecognitionException re = new NoViableAltException("", 0, 0, input);
            ...
            errMgr.lexerError(input.getSourceName(),
                              "invalid character '"+str(c)+"'",
                              templateToken,
                              re);
            ...

Here is an example from STViz.java that indents a method declaration relative to the start of an expression rather than the first token on the previous line:

    Thread t = new Thread() {
        @Override
        public void run() {
            synchronized ( lock ) {
                while ( viewFrame.isVisible() ) {
                    try {
                        lock.wait();
                    }
                    catch (InterruptedException e) {
                    }
                }
            }
        }
    };
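The leave-one-out protocol described above (strip whitespace from document d, then format it with a model trained on D \ {d}) can be sketched generically. This is an illustrative harness of our own, not CODEBUFF's actual API: `formatAll` and the `train` functional parameter are hypothetical names, and collapsing whitespace with a regex stands in for removing whitespace tokens from the token stream.

```java
import java.util.*;
import java.util.function.Function;

// Hypothetical leave-one-out test rig. For each document, a formatter is
// trained on every OTHER document, then applied to a whitespace-stripped
// copy of the held-out document.
public class LeaveOneOutRig {
    public static Map<String, String> formatAll(
            Map<String, String> corpus,                                   // name -> text
            Function<Collection<String>, Function<String, String>> train) // D\{d} -> formatter
    {
        Map<String, String> results = new LinkedHashMap<>();
        for (String doc : corpus.keySet()) {
            List<String> others = new ArrayList<>();
            for (Map.Entry<String, String> e : corpus.entrySet())
                if (!e.getKey().equals(doc)) others.add(e.getValue());
            Function<String, String> formatter = train.apply(others); // model from D \ {d}
            // Collapse whitespace as a stand-in for discarding whitespace tokens.
            String stripped = corpus.get(doc).replaceAll("\\s+", " ");
            results.put(doc, formatter.apply(stripped));
        }
        return results;
    }
}
```

Any scoring metric (such as the edit-distance-based one of Section 4) can then compare each result against the original document.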

Formatting results are generally excellent for ANTLR, our third test language. E.g., here is a snippet from Java.g4:

    classOrInterfaceModifier
        :   annotation       // class or interface
        |   (   'public'     // class or interface
            ...
            |   'final'      // class only
            |   'strictfp'   // class or interface
            )
        ;

(Comments are passed through; see Section 3.5.) Among the formatted files for the three languages, there are a few regions of inoptimal or bad formatting. CODEBUFF does not capture all formatting rules and occasionally gives puzzling formatting. For example, in the Java8.g4 grammar, the following rule has all elements packed onto one line ("-" means we soft-wrapped output for printing purposes):

    unannClassOrInterfaceType
        :   (unannClassType_lfno_unannClassOrInterfaceType | -
            unannInterfaceType_lfno_unannClassOrInterfaceType) -
            (unannClassType_lf_unannClassOrInterfaceType | -
            unannInterfaceType_lf_unannClassOrInterfaceType)*
        ;

CODEBUFF does not consider line length during training or formatting, instead mimicking the natural line breaks found among phrases of the corpus. For Java and SQL this works very well, but not always with ANTLR grammars.

Here is an interesting Java formatting issue from Compiler.java that is indented too far to the right (column 102); it is indented from the {{. That is a good decision in general, but here the left side of the assignment is very long, which indents the put() code too far to be considered good style.

    public Map<...> X = new ... {{
                                    put("anchor", "true");
                                }};

In STGroupDir.java, the prefix token is aligned improperly:

    if ( verbose ) error("loadFile("+f+") in groupdir..."
                         "prefix="+
                                prefix);

We also note that some of the SQL expressions are incorrectly aligned, as in this sample from SQLQuery23.sql:

    AND X NOT LIKE '...'
        AND X NOT LIKE '...'
    AND X NOT...

Despite a few anomalies, CODEBUFF generally reproduces a corpus' style well. Now we describe the design used to achieve these results. In Section 4 we quantify them.

3. The Design of an AI for Formatting

Our AI formatter mimics what programmers do during the act of entering code. Before entering a program symbol, a programmer decides (i) whether a space or line break is required and, if a line break, (ii) how far to indent the next line. Previous approaches (see Section 6) make a language engineer define whitespace injection programmatically.

A formatting engine based upon machine learning operates in two distinct phases: training and formatting. The training phase examines a corpus of code documents, D, written in language L to construct a statistical model that represents the formatting style of the corpus author. The essence of training is to capture the whitespace preceding each token, t, and then associate that whitespace with the phrase context surrounding t. Together, the context and whitespace preceding t form an exemplar. Intuitively, an exemplar captures how the corpus author formatted a specific, fine-grained piece of a phrase, such as whether the author placed a newline before or after the left curly brace in the context of a Java if-statement.

Training captures the context surrounding t as an m-dimensional feature vector, X, that includes t's token type, parse-tree ancestors, and many other features (Section 3.3). Training captures the whitespace preceding t as the concatenation of two separate operations or directives: a whitespace ws directive followed by a horizontal positioning hpos directive if ws is a newline (line break). The ws directive generates spaces, newlines, or nothing while hpos generates spaces to indent or align t relative to a previous token (Section 3.1).

As a final step, training presents the list of exemplars to a machine learning algorithm that constructs a statistical model. There are N exemplars (Xj, wj, hj) for j = 1..N, where N is the number of total tokens in all documents of corpus D and wj ∈ ws, hj ∈ hpos. Machine learning models are typically both a highly-processed condensation of the exemplars and a classifier function that, in our case, classifies a context feature vector, X, as needing a specific bit of whitespace. Classifier functions predict how the corpus author would format a specific context by returning a formatting directive. A model for formatting needs two classifier functions, one for predicting ws and one for hpos (consulted if ws prediction yields a newline).

CODEBUFF uses a k-Nearest Neighbor (kNN) machine learning model, which uses the list of exemplars as the actual model. A kNN's classifier function compares an unknown context vector X to the Xj from all N exemplars and finds the k nearest. Among these k, the classifier predicts the formatting directive that appears most often (details in Section 3.4). It's akin to asking a programmer how they normally format the code in a specific situation. Training requires a corpus D written in L, a lexer and parser for L derived from grammar G, and the corpus indentation size to identify indented phrases; e.g., one of the Java corpora we tested indents with 2 not 4 spaces. Let F_{D,G} = (X, W, H, indentSize) denote the formatting model contents, with context vectors forming rows of matrix X and formatting directives forming elements of vectors W and H. Function 1 (see appendix) embodies the training process, constructing F_{D,G}.
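The kNN scheme just described can be sketched concretely. The sketch below is our own simplification: features are integer codes, directives are opaque strings, and "distance" is a plain feature-mismatch count (Section 3.4 defines CODEBUFF's actual distance function), but the exemplars-as-model structure and majority vote among the k nearest match the description above.

```java
import java.util.*;

// Minimal kNN sketch: the model IS the list of exemplars (Xj, directive_j).
// classify() finds the k nearest context vectors and returns the directive
// that appears most often among them.
public class KnnClassifier {
    private final List<int[]> X = new ArrayList<>();   // context feature vectors
    private final List<String> Y = new ArrayList<>();  // directives, e.g. "(sp,1)"

    public void add(int[] features, String directive) { X.add(features); Y.add(directive); }

    // Number of features that disagree (an illustrative stand-in distance).
    static int distance(int[] a, int[] b) {
        int d = 0;
        for (int i = 0; i < a.length; i++) if (a[i] != b[i]) d++;
        return d;
    }

    public String classify(int[] x, int k) {
        Integer[] idx = new Integer[X.size()];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort exemplar indexes by distance to the unknown context x.
        Arrays.sort(idx, Comparator.comparingInt(i -> distance(X.get(i), x)));
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < Math.min(k, idx.length); i++)
            votes.merge(Y.get(idx[i]), 1, Integer::sum);
        // Majority vote among the k nearest exemplars.
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }
}
```

Two such classifiers, one over the ws exemplars and one over the hpos exemplars, realize the two prediction functions the model needs.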

Once the model is complete, the formatting phase can begin. Formatting operates on a single document d to be formatted and functions with guidance from the model. At each token ti ∈ d, formatting computes the feature vector Xi representing the context surrounding ti, just like training does, but does not add Xi to the model. Instead, the formatter presents Xi to the ws classifier and asks it to predict a ws directive for ti based upon how similar contexts were formatted in the corpus. The formatter "executes" the directive and, if a newline, presents Xi to the hpos classifier to get an indentation or alignment directive. After emitting any preceding whitespace, the formatter emits the text for ti. Note that any token ti is identified by its token type, string content, and offset within a specific document, i.

The greedy, "local" decisions made by the formatter give "globally" correct formatting results; selecting features for the X vectors is critical to this success. Unlike typical machine learning tasks, our predictor functions do not yield trivial categories like "it's a cat." Instead, the predicted ws and hpos directives are parametrized. The following sections detail how CODEBUFF captures whitespace, computes feature vectors, predicts directives, and formats documents.

3.1 Capturing Whitespace as Directives

In order to reproduce a particular style, formatting directives must encode the information necessary to reproduce whitespace encountered in the training corpus. There are five canonical formatting directives:

1. nl: Inject newline
2. sp: Inject space character
3. (align, t): Left align current token with previous token t
4. (indent, t): Indent current token from previous token t
5. none: Inject nothing, no indentation, no alignment

For simplicity and efficiency, prediction for nl and sp operations can be merged into a single "predict whitespace" or ws operation, and prediction of align and indent can be merged into a single "predict horizontal position" or hpos operation. While the formatting directives are 2- and 3-tuples (details below), we pack the tuples into 32-bit integers for efficiency, w for ws directives and h for hpos.

For ws operations, the formatter needs to know how many (n) characters to inject: ws ∈ {(nl, n), (sp, n), none}, as computed in Function 2. For example, in the following Java fragment, the proper ws directive at position a is (sp, 1), meaning "inject 1 space," the directive at b is none, and c is (nl, 1), meaning "inject 1 newline."

    x = y;
     ^a  ^b
    z++;
    ^c

The hpos directives align or indent token ti relative to some previous token, tj for j < i. There are also two parameter-free variants that implicitly align or indent ti relative to the first token of the previous line:

    hpos ∈ {(align, tj), (indent, tj), align, indent}

In the following Java fragments, assuming 4-space indentation, directive (indent, if) captures the whitespace at position a, (align, if) captures b, and (align, x) captures c:

    if ( b ) {        f(x,         for (int i=0; ...
        z++;            y)             x=i;
        ^a              ^c             ^d
    }
    ^b

At position d, both (indent, for) and (align, '(') capture the formatting, but training chooses indentation over alignment directives when both are available. We experimented with the reverse choice, but found this choice better. Here, (align, '(') inadvertently captures the formatting only because for happens to be 3 characters long.

To illustrate the need for (indent, tj) versus plain indent, consider the following Java method fragment where the first statement is not indented from the previous line.

    public void write(String str)
        throws IOException {
        int n = 0;
        ^

Directive (indent, public) captures the indentation of int but plain indent does not. Plain indent would mean indenting 4 spaces from throws, the first token on the previous line, incorrectly indenting int 8 spaces relative to public. Directive indent is used to approximate nonstandard indentation as in the following fragment.

    f(100,
       0);
       ^

At the indicated position, the whitespace does not represent alignment or standard 4-space indentation. As a default for any nonstandard indentation, function capture_hpos returns plain indent as an approximation.

When no suitable alignment or indentation token is available, but the current token is aligned with the previous line, training captures the situation with directive align:

    return x +
      y +
      z;   // align with first token of previous line
      ^

While (align, y) is valid, that directive is not available because of limitations in how hpos directives identify previous tokens, as discussed next.

3.2 How Directives Refer to Earlier Tokens

The manner in which training identifies previous tokens for hpos directives is critical to successfully formatting documents and is one of the key contributions of this paper. The goal is to define a "token locator" that is as general as possible, yet specific enough to identify the intended previous token.
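Before turning to token locators, the ws side of Section 3.1 can be made concrete. The sketch below reduces the literal whitespace found before a token to a ws directive; captureWS is our own name and the logic is an illustrative simplification of Function 2 (newlines dominate spaces, the count n is carried in the directive, and tabs are ignored).

```java
// Sketch: map the raw whitespace string preceding a token to one of the
// ws directives (nl,n), (sp,n), or none, per the scheme above.
public class WsCapture {
    /** @param ws the literal inter-token whitespace, e.g. "\n    " */
    public static String captureWS(String ws) {
        int newlines = 0, spaces = 0;
        for (char c : ws.toCharArray()) {
            if (c == '\n') newlines++;
            else if (c == ' ') spaces++;
        }
        if (newlines > 0) return "(nl," + newlines + ")";  // hpos decides the indent
        if (spaces > 0) return "(sp," + spaces + ")";
        return "none";
    }
}
```

On the x = y; z++; fragment above, this yields (sp,1) before y, none before ;, and (nl,1) before z, matching the three annotated positions.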

But, the more specific the locator, the less applicable it is in other contexts. Consider the indicated positions within the following Java fragments, where align directives must identify the first token of previous function arguments.

    f(x,        f(x+1,      f(x+1,
      y)          y)          y,
      ^a          ^b          -z)
                              ^c

The absolute token index within a document is a completely general locator but is so specific as to be inapplicable to other documents or even other positions within the same document. For example, all positions a, b, and c could use a single formatting directive, (align, i), but x's absolute index, i, is valid only for a function call at that specific location. The model also cannot use a relative token index referring backwards. While still fully general, such a locator is still too specific to a particular phrase. At position a, token x is at delta 2, but at position b, x is at delta 4. Given argument expressions of arbitrary size, no single token index delta is possible and so such deltas would not be widely applicable. Because the delta values are different, the model could not use a single formatting directive to mean "align with previous argument." The more specific the token locator, the more specific the context information in the feature vectors needs to be, which in turn requires larger corpora (see Section 3.3).

[Figure 1. Parse tree for f(x+1,y,-z). Node rule:n in the tree indicates the grammar rule and alternative production number used to match the subtree phrase.]

[Figure 2. Parse tree for x+y+z;.]

We have designed a token locator mechanism that strikes a balance between generality and applicability. Not every previous token is reachable but the mechanism yields a single locator for x from all three positions above and has proven widely applicable in our experiments. The idea is to pop up into the parse tree and then back down to the token of interest, x, yielding a locator with two components: a path length to an ancestor node and a child index whose subtree's leftmost leaf is the target token. This alters formatting directives relative to previous tokens to be: (_, ancestor, child).

Unfortunately, training can make no assumptions about the structure of the provided grammar and, thus, parse-tree structure. So, training at ti involves climbing upwards in the tree looking for a suitable ancestor. To avoid the same issues with overly-specific elements that token indexes have, the path length is relative to what we call the earliest left ancestor, as shown in the parse tree in Figure 1 for f(x+1,y,-z).

The earliest left ancestor (or just left ancestor) is the oldest ancestor of t whose leftmost leaf is t, and identifies the largest phrase that starts with t. (For the special case where t has no such ancestor, we define left ancestor to be t's parent.) It attempts to answer "what kind of thing we are looking at." For example, the left ancestor computed from the left edge of an arbitrarily-complex expression always refers to the root of the entire expression. In this case, the left ancestors of x, y, and z are siblings, thus normalizing leaves at three different depths to a common level. The token locator in a directive for x in f(x+1,y,-z) from both y and z is (_, ancestor, child) = (_, 1, 0), meaning jump up 1 level from the left ancestor and down to the leftmost leaf of the ancestor's child 0.

The use of the left ancestor and the ancestor's leftmost leaf is critical because it provides a normalization factor among dissimilar parse trees about which training has no inherent structural information. Unfortunately, some tokens are unreachable using purely leftmost leaves. Consider the return x+y+z; example from the previous section and one possible parse tree for it in Figure 2. Leaf y is unreachable as part of formatting directives for z because y is not a leftmost leaf of an ancestor of z. Function capture_hpos must either align or indent relative to x or fall back on the plain align and indent.

The opposite situation can also occur, where a given token is unintentionally aligned with or indented from multiple tokens. In this case, training chooses the directive with the smallest ancestor, with ties going to indentation. And, finally, there could be multiple suitable tokens that share a common ancestor but with different child indexes. For example, if all arguments of f(x+1,y,-z) are aligned, the parse tree in Figure 1 shows that (align, 1, 0) is suitable to align y, and both (align, 1, 0) and (align, 1, 2) could align argument -z. Ideally, the formatter would align all function arguments with the same directive to reduce uncertainty in the classifier function (Section 3.4), so training chooses (align, 1, 0) for both function arguments.

141 The formatting directives capture whitespace in between Corpus N tokens Unique ws Unique hpos tokens but training must also record the context in which 19,692 3.0% 4.7% those directives are valid, as we discuss next. java 42,032 3.9% 17.4% java8 42,032 3.4% 7.5% 3.3 Token Context—Feature Vectors java_guava 499,029 0.8% 8.1% For each token present in the corpus, training computes sqlite 14,758 8.4% 30.8% tsql 14,782 7.5% 17.9% an exemplar that associates a context with a ws and hpos formatting-directive: (X, w, h). Each context has several fea- Figure 3. Percentage of unique context vectors in corpora. tures combined into a m-dimensional feature vector, X. The context information captured by the features must be specific enough to distinguish between language phrases requiring needed to achieve good generality. For example, rather than different formatting but not so specific that classifier func- relying solely on the exact tokens after a = token, it is more tions cannot recognize any contexts during formatting. The general to capture the fact that those tokens begin an expres- shorter the feature vector, the more situations in which each sion. A useful metric is the percentage of unique context vec- exemplar applies. Adding more features also has the poten- tors, which we counted for several corpora and show in Fig- tial to confuse the classifier. ure 3. Given the features described below, there are very few Through a combination of intuition and exhaustive exper- unique context for ws decisions (a few %). The contexts for imentation, we have arrived at a small set of features that hpos decisions, however, often have many more unique con- perform well. There are 22 context features computed dur- texts because ws uses 11-vectors and hpos uses 17-vectors. ing training for each token, but ws prediction uses only 11 of E.g., our reasonably clean SQL corpus has 31% and 18% them and hpos uses 17. 
(The classifier function knows which unique hpos vectors when trained using SQLite and TSQL subset to use.) The feature set likely characterises the context grammars, respectively. needs of the languages we tested during development to some For generality, the fewer unique contexts the better, as degree, but the features appear to generalize well (Section 5). long as the formatter performs well. At the extreme, a model Before diving into the feature details, it is worth describ- with just one X context would perform very poorly because ing how we arrived at these 21 features and how they affect all exemplars would be of the form (X, _, _). The formatting formatter precision and generality. We initially thought that a directive appearing most often in the corpus would be the sliding window of, say, four tokens would be sufficient con- sole directive returned by the classifier function for any X. text to make the majority of formatting decisions. For exam- The optimal model would have the fewest unique contexts ple, the context for x=1* would simply be the token but all exemplars with the same context having identical for- ··· ··· " matting directives. For our corpora, we found that a majority types of the surrounding tokens: X=[id,=,int_literal,*]. The of unique contexts for ws and almost all unique contexts for surrounding tokens provide useful but highly-specific infor- hpos predict a single formatting directive, as shown in Figure mation that does not generalize well. Upon seeing this ex- 4. For example, 57.1% of the unique corpus contexts act sequence during formatting, the classifier function would antlr are associated with just one ws directive and 95.7% of the find an exact match for X in the model and predict the associ- unique contexts predict one hpos directive. The higher the ated formatting directive. 
But, the classifier would not match ambiguity associated with a single context vector, the higher context x=y+ to the same X, despite having the same ··· ··· the uncertainty when making decisions during formatting. " formatting needs. The guava corpus stands out as having very few unique The more unique the context, the more specific the for- contexts for ws and among the fewest for hpos. This gives matter can be. Imagine a context for token ti defined as the a hint that the corpus might be much larger than necessary 20-token window surrounding each ti. Each context derived because the other Java corpora are much smaller and yield from the corpus would likely be unique and the model would good formatting results. Figure 8 shows the effect of corpus hold a formatting directive specific to each token position size on classifier error rates. The error rate flattens out after of every file. A formatter working from this model could training on about 10 to 15 corpus files. reproduce with high precision a very similar unknown file. In short, few unique contexts gives an indication of the The trade-off to such precision is poor generality because the potential for generality and few ambiguous decisions gives model has “overfit” the training data. The classifier would an indication of the model’s potential for accuracy. These likely find no exact matches for many contexts, forcing it to numbers do not tell the entire story because some contexts predict directives from poorly-matched exemplars. are used more frequently than others and those might all To get a more general model, context vectors use at most predict single directives. Further, while a single context could two exact token type but lots of context information from be associated with multiple directives, most of those could be the parse tree (details below). The parse tree provides infor- one specific directive. 
mation about the kind of phrase surrounding a token posi- With this perspective in mind, we turn to the details of the tion rather than the specific tokens, which is exactly what is individual features. The ws and hpos decisions use a differ-

    Corpus         Ambiguous ws directives   Ambiguous hpos directives
    antlr                 42.9%                      4.3%
    java                  29.8%                      1.7%
    java8                 31.6%                      3.2%
    java_guava            23.5%                      2.8%
    sqlite_noisy          43.5%                      5.0%
    sqlite                24.8%                      5.5%
    tsql_noisy            40.7%                      6.3%
    tsql                  29.3%                      6.2%

Figure 4. Percentage of unique context vectors in corpora associated with > 1 formatting directive.

3.3.1 Token Type and Matching Token Features

At token index i within each document, context feature vector Xi contains the following features related to previous tokens in the same document.

1. ti-1, token type of the previous token
2. ti, token type of the current token
3. Is ti-1 the first token on a line?
4. Is the paired token for ti the first on a line?
5. Is the paired token for ti the last on a line?

Feature #3 allows the model to distinguish between the following two different ANTLR grammar rule styles at position a, the DIGIT token, when ti=DIGIT, using two different contexts.

    DECIMAL : DIGIT+ ;          DECIMAL
                                    : DIGIT+
                                    ;

Exemplars for the two cases are:

    (X=[:, RULEREF, false, . . . ], w=(sp, 1), h=none)
    (X'=[:, RULEREF, true, . . . ], w'=(sp, 3), h'=none)

where RULEREF is type(DIGIT), the token type of rule reference DIGIT from the ANTLR meta-grammar. Without feature #3, there would be a single context associated with two different formatting directives.

Features #4 and #5 yield different contexts for common situations related to paired symbols, such as { and }, that require different formatting. For example, at position b, the ; token, the model knows that : is the paired previous symbol for ; (details below) and distinguishes between the styles. On the left, : is not the first token on a line whereas : does start the line for the case on the right, giving two different exemplars:

    (X=[. . . , false, false], w=(sp, 1), h=none)
    (X'=[. . . , true, false], w'=(nl, 1), h'=(align, :))

Those features also allow the model to distinguish between the first two of the following Java cases, where the paired symbol for } is sometimes not at the end of the line in short methods.

    void reset() {x=0;}     void reset() {     void reset() {
                                x=0;               x=0;}
                            }

Without features #4-#5, the formatter would yield the third.

Determining the set of paired symbols is nontrivial, given that the training can make no assumptions about the language it is formatting. We designed an algorithm, pairs in Function 4, that analyzes the parse trees for all documents in the corpus and computes plausible token pairs for every non-leaf node (grammar rule production) encountered. The algorithm relies on the idea that paired tokens are token literals, occur as siblings, and are not repeated siblings. Grammar authors also do not split paired token references across productions. Instead, authors write productions such as these ANTLR rules for Java:

    expr : ID '[' expr ']' | ... ;
    type : ID '<' ID (',' ID)* '>' | ... ;

that yield subtrees with the square and angle brackets as direct children of the relevant production. Repeated tokens are not plausible pair elements, so the commas in a generic Java type list, as in T<...>, would not appear in pairs associated with rule type. A single subtree in the corpus with repeated commas as children of a type node would remove comma from all pairs associated with rule type. Further details are available in Function 4 (see also source CollectTokenPairs.java). The algorithm neatly identifies pairs such as (?, :), ([, ]), and ((, )) for Java expressions and (enum, }), (enum, {), and ({, }) for enumerated type declarations. During formatting, paired (Function 5) returns the paired symbols for ti.

3.3.2 List Membership Features

Most computer languages have lists of repeated elements separated or terminated by a token literal, such as statement lists, formal parameter lists, and table column lists. The next group of features indicates whether ti is a component of a list construct and whether or not that list is split across multiple lines (“oversize”).

6. Is leftancestor(ti) a component of an oversize list?
7. leftancestor(ti) component type within the list, from {prefix token, first member, first separator, member, separator, suffix token}

With these two features, context vectors capture not only two different overall styles for short and oversize lists but how the various elements are formatted within those two kinds of lists. Here is a sample oversize Java formal parameter list annotated with list component types:

Figure 5. Formal args parse tree for void f(int x, int y); the repeated formalParameter siblings with a , separator are annotated with list component types prefix, 1st member, separator, member, and suffix.

Only the first member of a list is differentiated; all other members are labeled as just plain members because their formatting is typically the same. The exemplars would be:

    (X=[. . . , true, prefix], w=none, h=none)
    (X=[. . . , true, first member], w=none, h=none)
    (X=[. . . , true, first separator], w=none, h=none)
    (X=[. . . , true, member], w=(nl, 1), h=(align, first arg))
    (X=[. . . , true, separator], w=none, h=none)
    (X=[. . . , true, member], w=(nl, 1), h=(align, first arg))
    (X=[. . . , true, suffix], w=none, h=none)

Even for short lists on one line, being able to differentiate between list components lets training capture different but equally valid styles. For example, some ANTLR grammar authors write short parenthesized subrules like (ID|INT|FLOAT) but some write (ID | INT | FLOAT).

As with identifying token pairs, CODEBUFF must identify the constituent components of lists without making assumptions about grammars that hinder generalization. The intuition is that lists are repeated sibling subtrees with a single token literal between the 1st and 2nd repeated sibling, as shown in Figure 5. Repeated subtrees without separators are not considered lists. Training performs a preprocessing pass over the parse tree for each document, tagging the tokens identified as list components with values for features #6-#7. Tokens starting list members are identified as the leftmost leaves of repeated siblings (formalParameter in Figure 5). Prefix and suffix components are the tokens immediately to the left and right of list members but only if they share a common parent.

The training preprocessing pass also collects statistics about the distribution of list text lengths (without whitespace) of regular and oversize lists. Regular and oversize list lengths are tracked per (r, c, sep) combination for rule subtree root type r, child node type c, and separator token type sep; e.g., (r, c, sep) = (formalParameterList, formalParameter, ',') in Figure 5. The separator is part of the tuple so that expressions can distinguish between different operators such as = and *. Children of binary and ternary operator subtrees satisfy the conditions for being a list, with the operator as separator token(s). For each (r, c, sep) combination, training tracks the number of those lists and the median list length, (n, median).

3.3.3 Identifying Oversize Lists During Formatting

As with training, the formatter performs a preprocessing pass to identify the tokens of list phrases. Whereas training identifies oversize lists simply as those split across lines, formatting sees documents with all whitespace squeezed out. For each (r, c, sep) encountered during the preprocessing pass, the formatter consults a mini-classifier to predict whether that list is oversize or not based upon the list string length, ll. The mini-classifier compares the mean-squared-distance of ll to the median for regular lists and the median for oversize (big) lists and then adjusts those distances according to the likelihood of regular vs oversize lists. The a priori likelihood that a list is regular is p(reg) = n_reg / (n_reg + n_big), giving an adjusted distance to the regular type list as: dist_reg = (ll − median_reg)² × (1 − p(reg)). The distance for oversize lists is analogous.

When a list length is somewhere between the two medians, the relative likelihoods of occurrence shift the balance. When there are roughly equal numbers of regular and oversize lists, the likelihood term effectively drops out, giving just mean-squared-distance as the mini-classifier criterion. At the extreme, when all (r, c, sep) lists are big, p(big) = 1, forcing dist_big to 0 and, thus, always predicting oversize.

When a single token ti is a member of multiple lists, training and formatting associate ti with the longest list subphrase because that yields the best formatting, as evaluated manually across the corpora. For example, the expressions within a Java function call argument list are often themselves lists. In f(e1,...,a+b), token a is both a sibling of f's argument list but also the first sibling of expression a+b, which is also a list. Training and formatting identify a as being part of the larger argument list rather than the smaller a+b. This choice ensures that oversize lists are properly split. Consider the opposite choice where a is associated with list a+b. In an oversize argument list, the formatter would not inject a newline before a, yielding poor results:

    f(e1,
    ..., a+b)

Because list membership identification occurs in a top-down parse-tree pass, associating tokens with the largest construct is a matter of latching the first list association discovered.

3.3.4 Parse-Tree Context Features

The final features provide parse-tree context information:

8. childindex(ti)
9. rightancestor(ti-1)
10. leftancestor(ti)
11. childindex(leftancestor(ti))
12. parent^1(leftancestor(ti))
13. childindex(parent^1(leftancestor(ti)))
14. parent^2(leftancestor(ti))

15. childindex(parent^2(leftancestor(ti)))
16. parent^3(leftancestor(ti))
17. childindex(parent^3(leftancestor(ti)))
18. parent^4(leftancestor(ti))
19. childindex(parent^4(leftancestor(ti)))
20. parent^5(leftancestor(ti))
21. childindex(parent^5(leftancestor(ti)))

Here childindex(p) is the 0-based index of node p among the children of parent(p), childindex(ti) is shorthand for childindex(leaf(ti)), and leaf(ti) is the leaf node associated with ti. Function childindex(p) has a special case when p is a repeated sibling. If p is the first element, childindex(p) is the actual child index of p within the children of parent(p) but is special marker * for all other repeated siblings. The purpose is to avoid over-specializing the context vectors to improve generality. These features also use function parent^i(p), which is the ith parent of p; parent^1(p) is synonymous with the direct parent parent(p).

The child index of ti, feature #8, gives the necessary context information to distinguish the alignment token between the following two ANTLR lexical rules at the semicolon.

    BooleanLiteral          fragment DIGIT
        : 'true'                : [0-9]
        | 'false'               ;
        ;

On the left, the ; token is child index 3 but 4 on the right, yielding different contexts, X and X', to support different alignment directives for the two cases. Training collects exemplars (X, (align, 0, 1)) and (X', (align, 0, 2)), which aligns ; with the colon in both cases.

Next, features rightancestor(ti-1) and leftancestor(ti) describe what phrase precedes ti and what phrase ti starts. The rightancestor is analogous to leftancestor and is the oldest ancestor of ti whose rightmost leaf is ti (or parent(ti) if there is no such ancestor). For example, at ti=y in x=1; y=2; the right ancestor of ti-1 and the left ancestor of ti are both “statement” subtree roots.

Finally, the parent and child index features capture context information about highly nested constructs, such as:

    if ( x ) { }
    else if ( y ) { }
    else if ( z ) { }
    else { }

Each else token requires a different formatting directive for alignment, as shown in Figure 6; e.g., (align, 1, 3) means “jump up 1 level from leftancestor(ti) and align with leftmost leaf of child 3 (token else).” To distinguish the cases, the context vectors must be different. Therefore, training collects these partial vectors with features #10-#15:

    X = [. . . , stat, 0, blockStat, *, block, 0, . . . ]
    X = [. . . , stat, *, stat, 0, blockStat, *, . . . ]
    X = [. . . , stat, *, stat, *, stat, 0, . . . ]

where stat abbreviates statement:3 and blockStat abbreviates blockStatement:2. All deeper else clauses also use directive (align, 1, 3).

Figure 6. Alignment directives for nested if-else statements: the successive else tokens in the parse tree receive directives (align,0,0), (align,1,0), and (align,1,3).

Training is complete once the software has computed an exemplar for each token in all corpus files. The formatting model is the collection of those exemplars and an associated classifier that predicts directives given a feature vector.

3.4 Predicting Formatting Directives

CODEBUFF's kNN classifier uses a fixed k = 11 (chosen experimentally in Section 4) and an L0 distance function (ratio of the number of components that differ to vector length), but with a twist on classic kNN that accentuates feature vector distances in a nonlinear fashion. To make predictions, a classic kNN classifier computes the distance from unknown feature vector X to every Xj vector in the exemplars, (X, Y), and predicts the category, y, occurring most frequently among the k exemplars nearest X.

The classic approach works very well in Euclidean space with quantitative feature vectors but not so well with an L0 distance that measures how similar two code-phrase contexts are. As the L0 distance increases, the similarity of two context vectors drops off dramatically. Changing even one feature, such as the earliest left ancestor (kind of phrase), can mean very different contexts. This quick drop-off matters when counting votes within the k nearest Xj. At the extreme, there could be one exemplar where X = Xj at distance 0 and 10 exemplars at distance 1.0, the maximum distance. Clearly the one exact match should outweigh the 10 that do not match at all, but a classic kNN uses a simple unweighted count of exemplars per category (10 out of 11 in this case). Instead of counting the number of exemplars per category, our variation sums 1 − L0(X, Xj)^(1/3) for each Xj per category. Because distances are in [0..1], the cube root nonlinearly accentuates differences. Distances of 0 count as weight 1, like the classic kNN, but distances close to 1.0 count very little towards their associated category. In practice, we found feature vectors more distant than about 15% from unknown X to be too dissimilar to count. Exemplars at distances above this threshold are discarded while collecting the k nearest neighbors.
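The modified vote can be sketched as follows. This is a minimal sketch; CODEBUFF's actual implementation is in Java and narrows the neighbor search with an index:

```python
# Weighted kNN sketch: L0 distance over equal-length feature tuples, a 15%
# distance cutoff, and a cube-root vote weight of 1 - d**(1/3) per neighbor.
from collections import defaultdict

def l0_distance(x, y):
    """Ratio of differing components to vector length."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def predict(x, exemplars, k=11, max_dist=0.15):
    """exemplars: list of (feature_vector, directive) pairs."""
    near = sorted((l0_distance(xj, x), yj) for xj, yj in exemplars)
    near = [(d, y) for d, y in near if d <= max_dist][:k]
    votes = defaultdict(float)
    for d, y in near:
        # d=0 contributes weight 1; weights fall off quickly as d grows.
        votes[y] += 1 - d ** (1 / 3)
    return max(votes, key=votes.get) if votes else None
```

With one exact match and ten maximally distant exemplars, the cutoff discards the ten and the exact match wins, matching the intuition in the text.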

The classifier function uses features #1-#10 and #12 to make ws predictions and #2, #6-#21 for hpos; hpos predictions ignore Xj not associated with tokens starting a line.

3.5 Formatting a Document

To format document d, the formatter, Function 6, first squeezes out all whitespace tokens and line/column information from the tokens of d and then iterates through d's remaining tokens, deciding what whitespace to inject before each token. At each token, the formatter computes a feature vector for that context and asks the model to predict a formatting directive (whereas training examines the whitespace to determine the directive). The formatter uses the information in the formatting directive to compute the number of newline and space characters to inject. The formatter treats the directives like bytecode instructions for a simple virtual machine: {(nl, n), (sp, n), none, (align, ancestor, child), (indent, ancestor, child), align, indent}.

As the formatter emits tokens and injects whitespace, it tracks line and column information so that it can annotate tokens with this information. Computing features #3-#5 at token ti relies on line and column information for tj for some j < i.

. . . the model (and subsequently improve it) as well as report its efficacy in an objective manner. We propose the following measure. Given corpus D that is perfectly consistently formatted, CODEBUFF should produce the identity transformation for any document d ∈ D if trained on a corpus subset D \ {d}. This leave-one-out cross-validation allows us to use the corpus for both training and for measuring formatter quality. (See Section 5 for evidence of CODEBUFF's generality.) For each document, the distance between original d and formatted d' is an inverse measure of formatting quality.

A naive similarity measure is the edit distance (Levenshtein Distance [7]) between d' and d, but it is expensive to compute and will over-accentuate minor differences. For example, a single indentation error made by the formatter could mean the entire file is shifted too far to the right, yielding a very high edit distance. A human reviewer would likely consider that a small error, given that the code looks exactly right except for the indentation level. Instead, we quantify the document similarity using the aggregate misclassification rate, in [0..1], for all predictions made while generating d':

    error = (n_ws_errors + n_hpos_errors) / (n_ws_decisions + n_hpos_decisions)

As part of our examination of grammar invariance (details below), we used 2 different Java grammars and 2 different SQL grammars taken from ANTLR's grammar repository:

• java. A Java 7 grammar.
• java8. A transcription of the Java 8 language specification into ANTLR format.
• sqlite. A grammar for the SQLite variant of SQL.
• tsql. A grammar for the Transact-SQL variant of SQL.

Figure 7. Standard box-plot of leave-one-out validation error rate between formatted document d' and original d, per corpus and grammar (antlr n=12, sqlite n=36, tsql n=36, java st n=59, java8 st n=59, sqlite_noisy n=36, tsql_noisy n=36, java guava n=511, java8 guava n=511).

4.3 Formatting Quality Results

Our first experiment demonstrates that CODEBUFF can faithfully reproduce the style found in a consistent corpus. Details to reproduce all results are available in a README.md [11]. Figure 7 shows the formatting error rate, as described above. Lower median error rates correspond with higher-quality formatting, meaning that the formatted files are closer to the original. Manual inspection of the corpora confirms that consistently-formatted corpora indeed yield better results. For example, median error rates (17% and 19%) are higher using the two grammars on the sql_noisy corpus versus the cleaned-up sql corpus. The guava corpus has extremely consistent style because it is enforced programmatically, and consequently CODEBUFF is able to reproduce the style with high accuracy using either Java grammar. The antlr corpus results have a high error rate due to some inconsistencies among the grammars but, nonetheless, formatted grammars look good except for a few overly-long lines.

4.4 Grammar Invariance

Figure 7 also gives a strong hint that CODEBUFF is grammar invariant, meaning that training models on a single corpus but with different grammars gives roughly the same formatting results. For example, the error rates for the st corpus trained with java and java8 grammars are roughly the same, indicating that CODEBUFF's overall error rate (similarity of original/formatted documents) does not change when we swap out the grammar. The same evidence appears for the other corpora and grammars. The overall error rate could hide large variation in the formatting of individual files, however, so we define grammar invariance as a file-by-file comparison of normalized edit distances.

Definition 4.1. Given models F(D,G) and F(D,G') derived from grammars G and G' for a single language, L(G) = L(G'), a formatter is grammar invariant if the following holds for any document d: format(F(D,G), d) ⋄ format(F(D,G'), d) ≤ ε for some suitably small normalized edit distance ε.

Definition 4.2. Let operator d1 ⋄ d2 be the normalized edit distance between documents d1 and d2, defined by the Levenshtein Distance [7] divided by max(len(d1), len(d2)).

The median edit distances between formatted files (using leave-one-out validation) from 3 corpora provide strong evidence of grammar invariance for Java but less so for SQL:

• 0.001 for guava corpus with java and java8 grammars
• 0.008 for st corpus with java and java8 grammars
• 0.099 for sql corpus with sqlite and tsql grammars

The “average” difference between guava files formatted with different Java grammars is 1 character edit per 1000 characters. The less consistent st corpus yields a distance of 8 edits per 1000 characters. A manual inspection of Java documents formatted using models trained with different grammars confirms that the structure of the grammar itself has little to no effect on the formatting results, at least when trained on the context features defined in Section 3.3.

The sql corpus shows a much higher difference between formatted files, 99 edits per 1000 characters. Manual inspection shows that both versions are plausible, just a bit different in nl prediction. Newlines trigger indentation, leading to bigger whitespace differences. One reason for higher edit distances could be that the noise in the less consistent SQL corpus amplifies any effect that the grammar has on formatting. More likely, the increased grammar sensitivity for SQL has to do with the fact that the sqlite and tsql grammars are actually for two different languages. The TSQL language has procedural extensions and is Turing complete; the tsql grammar is 2.5x bigger than sqlite. In light of the different SQL dialects and noisier corpus, a larger difference between formatted SQL files is unsurprising and does not rule out grammar invariance.

4.5 Effects of Corpus Size

Prospective users of CODEBUFF will ask how the size of the corpus affects formatting quality. We performed an experiment to determine: (i) how many files are needed to reach the median overall error rate and (ii) whether adding more and more files confuses the kNN classifier. Figure 8 summarizes the results of an experiment comparing the median error rate for randomly-selected corpus subsets of varying sizes across different corpora and grammars. Each data point represents 50 trials at a specific corpus size.

The error rate quickly drops after about 5 files and then asymptotically approaches the median error rate shown in Figure 7. This graph suggests a minimum corpus size of about 10 files and provides evidence that adding more (consistently formatted) files neither confuses the classifier nor improves it significantly.

Figure 8. Effect of corpus size on median leave-one-out validation error rate using randomly-selected corpus subsets (0-30 training files; sqlite, antlr, java st, java8 st, java guava, java8 guava, and tsql corpus/grammar combinations).

4.6 Optimization and Stability of Model Parameters

Choosing a k for a kNN model is more of an art, but k = √N for N exemplars is commonly used. Through exhaustive manual review of formatted files, we instead arrived at a fixed k = 11 and then verified its suitability by experiment. Figure 9 shows the effect of varying k on the median error rate across a selection of corpora and grammars; k ranged from 1 to 99. This graph supports the conclusion that a formatter can make accurate decisions by comparing the context surrounding ti to very few model exemplars. Moreover, formatter accuracy is very stable with respect to k; even large changes in k do not alter the error rate very much.

Figure 9. Effect of k on median leave-one-out error rate (k from 1 to 99; java st, java8 st, antlr, sqlite, and tsql corpus/grammar combinations).

4.7 Worst-Case Complexity

Collecting all exemplars to train our kNN F(D,G) model requires two passes over the input. The first pass walks the parse tree for each d ∈ D, collecting matching token pairs (Section 3.3.1) and identifying list membership (Section 3.3.2). The size of a single parse tree is bounded by the size of the grammar times the number of tokens in the document, |G| × |d| for document d (the charge per token is a tree depth of at most |P| productions of G). To make a pass over all parse trees for the entire corpus, the time complexity is |G| × N, so in O(N) for N total tokens in the corpus.

The second pass walks each token ti ∈ d for all d ∈ D, using information computed in the first pass to compute feature vectors and capture whitespace as ws and hpos formatting directives. There are m features to compute for each of N tokens. Most of the features require constant time, but computing the earliest ancestors and identifying tokens for indentation and alignment might walk the depth of the parse tree in the worst case, for a cost of O(log(|G| × |d|)) per feature per token. For an average document size, a feature costs O(log(|G| × N/|D|)). Computing all m features and capturing whitespace for all documents costs

    O(m × N × log(|G| × N/|D|)) = O(N × m × (log|G| + log N − log|D|))

which is in O(N log N). Including the first pass over all parse trees adds an O(N) term but does not change the overall worst-case time complexity for training.

Formatting a document is much more expensive than training. For each token in a document, the formatter requires at least one ws classifier function execution and possibly a second for hpos. Each classifier function execution requires the feature vector X computation cost per the above, but the cost is dominated by the comparison of X to every Xj ∈ X and sorting those results by distance to find the k nearest neighbors. For n = |d| tokens in a file to be formatted, the overall complexity is quadratic, O(n × N log N). For an average document size of n = N/|D|, formatting a document costs O(N² log N). Box formatters are also usually quadratic; streaming formatters are linear (see Section 6).

4.8 Expected Performance

CODEBUFF is instrumented to report single-threaded CPU run-time for training and formatting. We report a median 1.5s training time for the (overly large by an order of magnitude) guava corpus (511 files, 143k lines) after documents are parsed, using the java grammar, and a training time of 1.8s with the java8 grammar.² The antlr corpus, with only 12 files, is trained within 72ms.

² Experiments run 20 times on an iMac17,1, OS 10.11.5, 4GHz Intel Core i7 with Java 8, heap/stack size 4G/1M; we ignore the first 5 measurements to account for warmup time. Details to reproduce are in the github repo.

Because kNN uses the complete list of exemplars as an internal representation, the memory footprint of CODEBUFF is substantial. For each of N tokens in the corpus, we track m features and two formatting directives (each packed into a single word). The size complexity is O(N) but with a nontrivial constant. For the guava corpus, the profiler shows 2.5M Xj ∈ X feature vectors, consuming 275M RAM. Because CODEBUFF keeps all corpus text in memory for testing purposes, we observe an overall 1.7G RAM footprint.

For even a modest-sized corpus, the number of tokens, N, is large and a naive kNN classifier function implementation is unusably slow. Ours uses an index to narrow the nearest-neighbor search space and a cache of prediction function results to make the performance acceptable. To provide worst-case formatting speed figures, CODEBUFF formats the biggest file (LocalCache.java) in the guava library (22,674 tokens, 5022 lines) in median times of 2.7s (java) and 2.2s (java8). At ~2300 lines/s even for such a large training corpus, CODEBUFF is usable in IDEs and other applications. Using the more reasonably-sized antlr corpus, formatting the 1777-line Java8.g4 takes just 70ms (25k lines/s).

5. Test of Generality

As a test of generality, we solicited a grammar and corpus for an unfamiliar language, Quorum³ by Andreas Stefik, and trained CODEBUFF on that corpus. The first look at the grammar and corpus for both the authors and the model occurred during the preparation of this manuscript. We dropped about half of the corpus files randomly to simulate a more reasonably-sized training corpus but did not modify any files. Most files in the corpus result in plausibly formatted code, but there are a few outlier files, such as HashTable.quorum. For example, the following function looks fine except that it is indented way too far to the right in the formatted output:

    action RemoveKey(Key key) returns Value
       repeat while node not= undefined
          if key:Equals(node:key)
             if previous not= undefined
                previous:next = node:next
             else
                array:Set(index, node:next)
             end
             ...
          end
       return undefined
    end

The median misclassification error rate for Quorum is very low, 4%, on a par with the guava model, indicating that it is consistent and that CODEBUFF captures the formatting style.

³ https://www.quorumlanguage.com

6. Related Work

The state of the art in language-parametric formatting stems from two sources: Coutaz [3] and Oppen [10]. Coutaz introduced the “Box” as a basic two-dimensional layout mechanism, and all derived work elaborates on specifying the mapping from syntax (parse) trees to Box expressions more effectively (in terms of meta-program size and expressiveness) [4, 12-14]. Oppen introduced the streaming formatting algorithm in which alignment and indentation operators are injected into a token stream (coming from a syntax tree serialization or a lexer token stream). Derivative and later work [1, 2, 18] elaborates on specifying how and when to inject these tokens for different languages in different ways and in different technological spaces [6].

Oppen's streams shine for their simplicity and efficiency in time and memory, while the expressivity of Coutaz' Boxes makes it easier to specify arbitrary formatting styles. The expressivity comes at the cost of building an intermediate representation and running a specialized constraint solver. Conceptually, CODEBUFF derives most from the Oppen school, but in terms of expressivity and being able to capture “naturally occurring” and therefore hard-to-formalize formatting styles, CODEBUFF approaches the power of Coutaz' Boxes. The reason is that CODEBUFF matches token context much like the algorithms that map syntax trees to Box expressions. CODEBUFF can, therefore, seamlessly learn to specialize for even more specific situations than a software language engineer would care to express. At the same time, it uses an algorithm that simply injects layout directives into a token stream, which is very efficient. There are limitations, however (detailed in Section 3).

Language-parametric formatters from the Box school are usually combined with default formatters, which are statically constructed using heuristics (expert knowledge). The default formatter is expected to guess the right formatting rule in order to relieve the language engineer from having to specify a rule for every language construct (like CODEBUFF does for all constructs). To do this, the input grammar is analyzed for “typical constructs” (such as expressions and block constructs), which are typically formatted in a particular way. The usefulness of default formatters is limited, though. We have observed engineers mapping languages to Box completely manually, while a default formatter was readily available. If CODEBUFF is not perfect, at least it generalizes the concept of a default formatter by learning from input sentences and not just the grammar of a language. This difference is the key benefit of the machine learning approach; CODEBUFF could act as a more intelligent default formatter in existing formatting pipelines.

PGF [1] by Bagge and Hasu is a rule-based system to inject pretty-printing directives into a stream. The meta-grammar for these rules (which the language engineer must write) is conceptually comparable to the feature set that CODEBUFF learns. The details of the features are different, however, and these details matter greatly to the efficacy of the learner.

149 BUFF uses during training is interesting to observe: we are 9. Appendix characterizing the domain of formatting from different ends. This section collects all of CODEBUFF’s key algorithms refer- Since CODEBUFF automatically learns by example, related enced in the main body. work that needs explicit formalizations is basically incom- parable from a language engineering perspective. We could Function 1: train(D, G, indentSize) model FD,G only compare the speed of the derived formatters and the ! X := []; W := []; H := []; j := 1; quality of the formatted output. It would be possible to com- foreach document d D do pare the default grammar-based formatters to see how many 2 tokens := tokenize(d); of their rules are useful compared to CODEBUFF, but (i) this is completely unfair because the default formatters do not have tree := parse(tokens); foreach ti tokens do access to input examples and (ii) default formatters are con- 2 structed in a highly arbitrary manner so there is no lesson to X[j] := compute context feature vector for ti, tree; learn other than their respective designer’s intuitions. W [j] := capture_ws(ti); Aside from the actual functionality of formatting text, re- H[j] := capture_hpos(ti, indentSize); lated work has extended formatting with “high fidelity” (not j := j +1; losing comments) [16, 17] and partial formatting (introduc- return (X, W, H, indentSize); ing a new source text in an already formatted context) [5]. CODEBUFF copies comments into the formatted output but partial formatting is outside the scope of the current contri- Function 2: capture_ws(ti) w ws ! 2 bution; CODEBUFF could be extended with such functionality. newlines := num newlines between ti 1 and ti; if newlines > 0 then return (nl, newlines); 7. 
Future Work col := ti.col - (ti 1.col + len(text(ti 1))); The relative feature weights in the current kNN classifier return (ws, col); were determined through trial and error so we anticipate swapping in a random forest classifier, which automatically Function 3: capture_hpos(ti, indentSize) h hpos determines the importance (weight) of each feature. Random ! 2 forest classifiers are also much more efficient than kNN clas- ancestor := leftancestor(ti); if ancestor w/child aligned with ti.col then sifiers. We anticipate introducing new classifier features to 9 rectify the few incorrectly-formatted files. halign := (align,ancestor,childindex) with smallest ancestor&childindex; if ancestor w/child at ti.col + indentSize then 8. Conclusion 9 hindent := (indent,ancestor,childindex) CODEBUFF is a step towards a universal code formatter that with smallest ancestor&childindex; uses machine learning to abstract formatting rules from if halign and hindent not nil then a representative corpus. Current approaches require com- return directive with smallest ancestor; plex pattern-formatting rules written by a language expert if halign not nil then return halign; whereas input to CODEBUFF is just a grammar, corpus, and indentation size. Experiments show that training requires if hindent not nil then return hindent; about 10 files and that resulting formatters are fast and highly if ti indented from previous line then return indent; accurate for three languages. Further, tests on a previously- return align; unseen language and corpus show that, without modification, CODEBUFF generalized to a 4th language. Formatting results Function 4: pairs(Corpus D) map node set<(s, t)> are largely insensitive to language-preserving changes in the ! 7! pairs := map of node set; grammar and our kNN classifier is highly stable with respect 7! repeats := map of node set; to changes in model parameter k. Based on these results, we 7! 
foreach d D do look forward to many more applications of machine learning 2 in software language engineering. foreach non-leaf node r in parse(d) do literals := t parent(t)=r, t is literal token ; { | } add (t ,t ) i
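For illustration, the whitespace-capture step (Function 2) can be written out in executable form. The following is a minimal Python sketch, assuming a simple token record with text, line, and col fields; these names are ours and not taken from the CODEBUFF implementation.

```python
from collections import namedtuple

# Minimal stand-in for a parser token; the field names are illustrative,
# not CODEBUFF's actual token API.
Token = namedtuple("Token", ["text", "line", "col"])

def capture_ws(prev, cur):
    """Return the whitespace directive between two adjacent tokens:
    ('nl', n) when n newlines separate them, otherwise ('ws', n)
    for n spaces on the same line."""
    newlines = cur.line - prev.line
    if newlines > 0:
        return ("nl", newlines)
    # Spaces = gap between end of previous token and start of current one.
    return ("ws", cur.col - (prev.col + len(prev.text)))
```

For two tokens on the same line the directive records the number of intervening spaces; across lines it records the newline count, matching the (nl, n)/(ws, n) pairs collected into W by Function 1's training loop.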

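The hand-tuned feature weights discussed under Future Work can be made concrete with a small weighted nearest-neighbor sketch over categorical feature vectors. The distance function, weighting scheme, and all names below are illustrative assumptions, not CODEBUFF's actual classifier.

```python
from collections import Counter

def weighted_knn_predict(X, Y, x, weights, k=3):
    """Majority vote over the k training vectors nearest to x, where
    distance is the weight-normalized count of mismatched categorical
    features (disagreement on a heavier feature costs more)."""
    def dist(a):
        return sum(w for w, ai, xi in zip(weights, a, x) if ai != xi) / sum(weights)
    nearest = sorted(range(len(X)), key=lambda i: dist(X[i]))[:k]
    return Counter(Y[i] for i in nearest).most_common(1)[0][0]
```

A random forest, as proposed above, would instead learn such weights (feature importances) from the corpus, removing the trial-and-error tuning step.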
Function 5: paired(pairs, token t_i) → t′
    mypairs := pairs[parent(t_i)];
    viable := { s | (s, t_i) ∈ mypairs, s ∈ siblings(t_i) };
    if |viable| = 1 then ttype := viable[0];
    else if ∃ (s, t_i) | s, t_i are common pairs then ttype := s;
    else if ∃ (s, t_i) | s, t_i are single-char literals then ttype := s;
    else ttype := viable[0];    // choose first if still ambiguous
    matching := [ t_j | t_j = ttype, j < i, t_j ∈ siblings(t_i) ];
    return last(matching);

Function 6: format(F_{D,G} = (X, W, H, indentSize), d)
    line := col := 0;
    d := d with whitespace tokens, line/column info removed;
    foreach t_i ∈ d do
        emit any comments to left of t_i;
        X_i := compute context feature vector at t_i;
        ws := predict directive using X_i and X, W;
        newlines := sp := 0;
        if ws = (nl, n) then newlines := n;
        else if ws = (sp, n) then sp := n;
        if newlines > 0 then    // inject newline and align/indent
            emit newlines '\n' characters;
            line += newlines; col := 0;
            hpos := predict directive using X_i and X, H;
            if hpos = (_, ancestor, child) then
                t_j := token relative to t_i at ancestor, child;
                col := t_j.col;
                if hpos = (indent, _, _) then col += indentSize;
                emit col spaces;
            else    // plain align or indent
                t_j := first token on previous line;
                col := t_j.col;
                if hpos = indent then col += indentSize;
                emit col spaces;
            end
        else col += sp; emit sp spaces;    // inject spaces
        t_i.line = line;    // set t_i location
        t_i.col = col;
        emit text(t_i);
        col += len(text(t_i))

References

[1] A. H. Bagge and T. Hasu. A pretty good formatting pipeline. In International Conference on Software Language Engineering (SLE'13), volume 8225 of LNCS. Springer, Oct. 2013.
[2] O. Chitil. Pretty printing with lazy dequeues. ACM TOPLAS, 27(1):163–184, Jan. 2005. ISSN 0164-0925.
[3] J. Coutaz. The Box, a layout abstraction for user interface toolkits. CMU-CS-84-167, Carnegie Mellon University, 1984.
[4] M. de Jonge. Pretty-printing for software reengineering. In Proceedings of ICSM 2002, pages 550–559. IEEE Computer Society Press, Oct. 2002.
[5] M. de Jonge. Pretty-printing for software reengineering. In ICSM 2002, 3–6 October 2002, Montreal, Quebec, Canada, pages 550–559, 2002.
[6] I. Kurtev, J. Bézivin, and M. Aksit. Technological spaces: An initial appraisal. In International Symposium on Distributed Objects and Applications, DOA 2002, 2002.
[7] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707, Feb. 1966.
[8] S. McConnell. Code Complete. Microsoft Press, 1993.
[9] R. J. Miara, J. A. Musselman, J. A. Navarro, and B. Shneiderman. Program indentation and comprehensibility. Commun. ACM, 26(11):861–867, 1983.
[10] D. C. Oppen. Prettyprinting. ACM TOPLAS, 2(4):465–483, 1980. ISSN 0164-0925.
[11] T. Parr, F. Zhang, and J. Vinju. Codebuff, June 2016. URL https://github.com/antlr/codebuff/tree/1.5.1.
[12] M. van den Brand and E. Visser. Generation of formatters for context-free languages. ACM Trans. Softw. Eng. Methodol., 5(1):1–41, Jan. 1996.
[13] M. G. van den Brand, A. T. Kooiker, J. J. Vinju, and N. P. Veerman. A language independent framework for context-sensitive formatting. In CSMR 2006, pages 10–pp. IEEE, 2006.
[14] M. G. J. van den Brand and E. Visser. Generation of formatters for context-free languages. ACM Trans. Softw. Eng. Methodol., 5(1):1–41, 1996. ISSN 1049-331X.
[15] M. G. J. van den Brand, A. van Deursen, J. Heering, H. A. de Jong, M. de Jonge, T. Kuipers, P. Klint, P. A. Olivier, J. Scheerder, J. J. Vinju, E. Visser, and J. Visser. The ASF+SDF Meta-Environment: a component-based language development environment. In CC '01, volume 2027 of LNCS. Springer-Verlag, 2001.
[16] J. Vinju. Analysis and Transformation of Source Code by Parsing and Rewriting. PhD thesis, University of Amsterdam, 2005.
[17] D. G. Waddington and B. Yao. High-fidelity C/C++ code transformation. Electron. Notes Theor. Comput. Sci., 141(4):35–56, Dec. 2005. ISSN 1571-0661.
[18] P. Wadler. A prettier printer. In Journal of Functional Programming, pages 223–244. Palgrave Macmillan, 1998.
