Towards a Universal Code Formatter through Machine Learning

Terence Parr, University of San Francisco, USA ([email protected])
Jurgen Vinju, Centrum Wiskunde & Informatica, Netherlands ([email protected])

Abstract

There are many declarative frameworks that allow us to implement code formatters relatively easily for any specific language, but constructing them is cumbersome. The first problem is that "everybody" wants to format their code differently, leading to either many formatter variants or a ridiculous number of configuration options. Second, the size of each implementation scales with a language's grammar size, leading to hundreds of rules.

In this paper, we solve the formatter construction problem using a novel approach, one that automatically derives formatters for any given language without intervention from a language expert. We introduce a code formatter called CODEBUFF that uses machine learning to abstract formatting rules from a representative corpus, using a carefully designed feature set. Our experiments on Java, SQL, and ANTLR grammars show that CODEBUFF is efficient, has excellent accuracy, and is grammar invariant for a given language. It also generalizes to a 4th language tested during manuscript preparation.

Categories and Subject Descriptors  D.2.3 [Software Engineering]: Coding - Pretty printers

Keywords  Formatting algorithms, pretty-printer

1. Introduction

The way code is formatted has a significant impact on its comprehensibility [9], and manually reformatting code is just not an option [8, p.399]. Therefore, programmers need ready access to automatic code formatters or "pretty printers" in situations where formatting is messy or inconsistent. Many program generators also take advantage of code formatters to improve the quality of their output.

Because the value of a particular code formatting style is a subjective notion, often leading to heated discussions, formatters must be highly configurable. This allows, for example, current maintainers of existing code to improve their effectiveness by reformatting the code per their preferred style. There are plenty of configurable formatters for existing languages, whether built into IDEs or standalone tools like GNU indent, but specifying style is not easy. The emergent behavior is not always obvious, there exists interdependency between options, and the tools cannot take context information into account [13]. For example, here are the options needed to obtain K&R style with indent:

    -nbad -bap -bbo -nbc -br -brs -c33 -cd33 -ncdb -ce
    -ci4 -cli0 -cp33 -cs -d0 -di1 -nfc1 -nfca -hnl -i4
    -ip0 -l75 -lp -npcs -nprs -npsl -saf -sai -saw -nsc
    -nsob -nss

New languages pop into existence all the time and each one could use a formatter. Unfortunately, building a formatter is difficult and tedious. Most formatters used in practice are ad hoc, language-specific programs, but there are formal approaches that yield good results with less effort. Rule-based formatting systems let users specify phrase-formatting pairs, such as the following sample specification for formatting the COBOL MOVE statement using ASF+SDF [3, 12, 13, 15].

    MOVE IdOrLit TO Id-list =
      from-box( H [ "MOVE"
        Hts=25[to-box(IdOrLit)] Hts=49["TO"]
        Hts=53[to-box(Id-list)]])

This rule maps a pattern to a box expression. A set of such rules, complemented with default behavior for the unspecified parts, generates a single formatter with a specific style for the given language. Section 6 has other related work.

There are a number of problems with rule-based formatters. First, each specification yields a formatter for one specific style. Each new style requires a change to those rules or the creation of a new set. Some systems allow the rules to be parametrized, and configured accordingly, but that leads to higher rule complexity. Second, minimal changes to the associated grammar usually require changes to the formatting rules, even if the grammar changes do not affect the language recognized. Finally, formatter specifications are big. Although most specification systems have builtin heuristics for default behavior in the absence of a specification for a given language phrase, specification size tends to grow with the grammar size. A few hundred rules are no exception.

Formatting is a problem solved in theory, but not yet in practice. Building a good code formatter is still too difficult and requires way too much work; we need a fresh approach. In this paper, we introduce a tool called CODEBUFF [11] that uses machine learning to produce a formatter entirely from a grammar for language L and a representative corpus written in L. There is no specification work needed from the user other than to ensure reasonable formatting consistency within the corpus. The statistical model used by CODEBUFF first learns the formatting rules from the corpus, which are then applied to format other documents in the same style. Different corpora effectively result in different formatters. From a user perspective the formatter is "configured by example."

Contributions and roadmap. We begin by showing sample CODEBUFF output in Section 2 and then explain how and why CODEBUFF works in Section 3. Section 4 provides empirical evidence that CODEBUFF learns a formatting style quickly and using very few files. CODEBUFF approximates the corpus style with high accuracy for the languages ANTLR, Java and SQL, and it is largely insensitive to language-preserving grammar changes. To adjust for possible selection bias and model overfitting to these three well-known languages, we tested CODEBUFF on an unfamiliar language (Quorum) in Section 5, from which we learned that CODEBUFF works similarly well, yet improvements are still possible. We position CODEBUFF with respect to the literature in Section 6.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SLE'16, October 31 - November 1, 2016, Amsterdam, Netherlands.
(c) 2016 ACM. 978-1-4503-4447-0/16/10...$15.00. http://dx.doi.org/10.1145/2997364.2997383

2. Sample Formatting

This section contains sample SQL, Java, and ANTLR code formatted by CODEBUFF, including some that are poorly formatted to give a balanced presentation. Only the formatting style matters here so we use a small font for space reasons. Github [11] has a snapshot of all input corpora and formatted versions (corpora, testing details in Section 4). To arrive at the formatted output for document d in corpus D, our test rig removes all whitespace tokens from d and then applies an instance of CODEBUFF trained on the corpus without d, D \ {d}.

The examples are not meant to illustrate "good style." They are simply consistent with the style of a specific corpus. In Section 4 we define a metric to measure the success of the automated formatter in an objective and reproducible manner. No quantitative research method can capture the qualitative notion of style, so we start with these examples. (We use "..." for immaterial text removed to shorten samples.)

SQL is notoriously difficult to format, particularly for nested queries, but CODEBUFF does an excellent job in most cases. For example, here is a formatted query from file IPMonVerificationMaster.sql (trained with sqlite grammar on sqlclean corpus):

    SELECT DISTINCT t.server_name
                  ,t.server_id
                  ,'MessageQueuingService' AS missingmonitors
    FROM t_server t INNER JOIN t_server_type_assoc tsta ON t.server_id = tsta.server_id
    WHERE t.active = 1 AND tsta.type_id IN ('8')
          AND t.environment_id = 0 AND t.server_name NOT IN
          ( SELECT DISTINCT l.address
            FROM ipmongroups g INNER JOIN ipmongroupmembers m ON g.groupid = m.groupid
                 INNER JOIN ipmonmonitors l ON m.monitorid = l.monitorid
                 INNER JOIN t_server t ON l.address = t.server_name
                 INNER JOIN t_server_type_assoc tsta ON t.server_id = tsta.server_id
            WHERE l.name LIKE '%Message Queuing Service%'
                  AND t.environment_id = 0
                  AND tsta.type_id IN ('8')
                  AND g.groupname IN ('Prod O/S Services')
                  AND t.active = 1 )
    UNION
    ALL
    ...

And here is a complicated query from dmart_bits_IAPPBO510.sql with case statements:

    SELECT
        CASE WHEN SSISInstanceID IS NULL THEN 'Total'
             ELSE SSISInstanceID
        END SSISInstanceID
        ,SUM(OldStatus4) AS OldStatus4
        ...
        ,SUM(OldStatus4 + Status0 + Status1 + Status2 + Status3 + Status4) AS InstanceTotal
    FROM
    ( SELECT
          CONVERT(VARCHAR, SSISInstanceID) AS SSISInstanceID
          ,COUNT(CASE WHEN Status = 4 AND
                           CONVERT(DATE, LoadReportDBEndDate) < CONVERT(DATE, GETDATE())
                      THEN Status
                      ELSE NULL
                 END) AS OldStatus4
          ...
          ,COUNT(CASE WHEN Status = 4 AND
                           DATEPART(DAY, LoadReportDBEndDate) = DATEPART(DAY, GETDATE())
                      THEN Status
                      ELSE NULL
                 END) AS Status4
      FROM dbo.ClientConnection
      GROUP BY SSISInstanceID ) AS StatusMatrix
    GROUP BY SSISInstanceID

Here is a snippet from Java, our 2nd test language, taken from STLexer.java (trained with java grammar on st corpus):

    switch ( c ) {
        ...
        default:
            if ( c==delimiterStopChar ) {
                consume();
                scanningInsideExpr = false;
                return newToken(RDELIM);
            }
            if ( isIDStartLetter(c) ) {
                ...
                if ( name.equals("if") ) return newToken(IF);
                else if ( name.equals("endif") ) return newToken(ENDIF);
                ...
                return id;
            }
            RecognitionException re = new NoViableAltException("", 0, 0, input);
            ...
            errMgr.lexerError(input.getSourceName(),
                              "invalid character '"+str(c)+"'",
                              templateToken,
                              re);
            ...

Here is an example from STViz.java that indents a method declaration relative to the start of an expression rather than the first token on the previous line:

    Thread t = new Thread() {
        @Override
        public void run() {
            synchronized ( lock ) {
                while ( viewFrame.isVisible() ) {
                    try {
                        lock.wait();
                    }
                    catch (InterruptedException e) {
                    }
                }
            }
        }
    };
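The leave-one-out protocol described above (strip whitespace from document d, then format it with a model trained on D \ {d}) can be sketched generically. This is an illustrative harness of our own, not CODEBUFF's actual API: `formatAll` and the `train` functional parameter are hypothetical names, and collapsing whitespace with a regex stands in for removing whitespace tokens from the token stream.

```java
import java.util.*;
import java.util.function.Function;

// Hypothetical leave-one-out test rig. For each document, a formatter is
// trained on every OTHER document, then applied to a whitespace-stripped
// copy of the held-out document.
public class LeaveOneOutRig {
    public static Map<String, String> formatAll(
            Map<String, String> corpus,                                   // name -> text
            Function<Collection<String>, Function<String, String>> train) // D\{d} -> formatter
    {
        Map<String, String> results = new LinkedHashMap<>();
        for (String doc : corpus.keySet()) {
            List<String> others = new ArrayList<>();
            for (Map.Entry<String, String> e : corpus.entrySet())
                if (!e.getKey().equals(doc)) others.add(e.getValue());
            Function<String, String> formatter = train.apply(others); // model from D \ {d}
            // Collapse whitespace as a stand-in for discarding whitespace tokens.
            String stripped = corpus.get(doc).replaceAll("\\s+", " ");
            results.put(doc, formatter.apply(stripped));
        }
        return results;
    }
}
```

Any scoring metric (such as the edit-distance-based one of Section 4) can then compare each result against the original document.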

Formatting results are generally excellent for ANTLR, our third test language. E.g., here is a snippet from Java.g4:

    classOrInterfaceModifier
        :   annotation       // class or interface
        |   (   'public'     // class or interface
            ...
            |   'final'      // class only
            |   'strictfp'   // class or interface
            )
        ;

(Comments are passed through; see Section 3.5.) Among the formatted files for the three languages, there are a few regions of inoptimal or bad formatting. CODEBUFF does not capture all formatting rules and occasionally gives puzzling formatting. For example, in the Java8.g4 grammar, the following rule has all elements packed onto one line ("-" means we soft-wrapped output for printing purposes):

    unannClassOrInterfaceType
        :   (unannClassType_lfno_unannClassOrInterfaceType | -
            unannInterfaceType_lfno_unannClassOrInterfaceType) -
            (unannClassType_lf_unannClassOrInterfaceType | -
            unannInterfaceType_lf_unannClassOrInterfaceType)*
        ;

CODEBUFF does not consider line length during training or formatting, instead mimicking the natural line breaks found among phrases of the corpus. For Java and SQL this works very well, but not always with ANTLR grammars.

Here is an interesting Java formatting issue from Compiler.java that is indented too far to the right (column 102); it is indented from the {{. That is a good decision in general, but here the left side of the assignment is very long, which indents the put() code too far to be considered good style.

    public Map<...> X = new ... {{
                                    put("anchor", "true");
                                }};

In STGroupDir.java, the prefix token is aligned improperly:

    if ( verbose ) error("loadFile("+f+") in groupdir..."
                         "prefix="+
                                prefix);

We also note that some of the SQL expressions are incorrectly aligned, as in this sample from SQLQuery23.sql:

    AND X NOT LIKE '...'
        AND X NOT LIKE '...'
    AND X NOT...

Despite a few anomalies, CODEBUFF generally reproduces a corpus' style well. Now we describe the design used to achieve these results. In Section 4 we quantify them.

3. The Design of an AI for Formatting

Our AI formatter mimics what programmers do during the act of entering code. Before entering a program symbol, a programmer decides (i) whether a space or line break is required and, if a line break, (ii) how far to indent the next line. Previous approaches (see Section 6) make a language engineer define whitespace injection programmatically.

A formatting engine based upon machine learning operates in two distinct phases: training and formatting. The training phase examines a corpus of code documents, D, written in language L to construct a statistical model that represents the formatting style of the corpus author. The essence of training is to capture the whitespace preceding each token, t, and then associate that whitespace with the phrase context surrounding t. Together, the context and whitespace preceding t form an exemplar. Intuitively, an exemplar captures how the corpus author formatted a specific, fine-grained piece of a phrase, such as whether the author placed a newline before or after the left curly brace in the context of a Java if-statement.

Training captures the context surrounding t as an m-dimensional feature vector, X, that includes t's token type, parse-tree ancestors, and many other features (Section 3.3). Training captures the whitespace preceding t as the concatenation of two separate operations or directives: a whitespace ws directive followed by a horizontal positioning hpos directive if ws is a newline (line break). The ws directive generates spaces, newlines, or nothing while hpos generates spaces to indent or align t relative to a previous token (Section 3.1).

As a final step, training presents the list of exemplars to a machine learning algorithm that constructs a statistical model. There are N exemplars (Xj, wj, hj) for j = 1..N, where N is the number of total tokens in all documents of corpus D and wj ∈ ws, hj ∈ hpos. Machine learning models are typically both a highly-processed condensation of the exemplars and a classifier function that, in our case, classifies a context feature vector, X, as needing a specific bit of whitespace. Classifier functions predict how the corpus author would format a specific context by returning a formatting directive. A model for formatting needs two classifier functions, one for predicting ws and one for hpos (consulted if ws prediction yields a newline).

CODEBUFF uses a k-Nearest Neighbor (kNN) machine learning model, which uses the list of exemplars as the actual model. A kNN's classifier function compares an unknown context vector X to the Xj from all N exemplars and finds the k nearest. Among these k, the classifier predicts the formatting directive that appears most often (details in Section 3.4). It's akin to asking a programmer how they normally format the code in a specific situation. Training requires a corpus D written in L, a lexer and parser for L derived from grammar G, and the corpus indentation size to identify indented phrases; e.g., one of the Java corpora we tested indents with 2 not 4 spaces. Let F_{D,G} = (X, W, H, indentSize) denote the formatting model contents, with context vectors forming rows of matrix X and formatting directives forming elements of vectors W and H. Function 1 (see appendix) embodies the training process, constructing F_{D,G}.
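The kNN scheme just described can be sketched concretely. The sketch below is our own simplification: features are integer codes, directives are opaque strings, and "distance" is a plain feature-mismatch count (Section 3.4 defines CODEBUFF's actual distance function), but the exemplars-as-model structure and majority vote among the k nearest match the description above.

```java
import java.util.*;

// Minimal kNN sketch: the model IS the list of exemplars (Xj, directive_j).
// classify() finds the k nearest context vectors and returns the directive
// that appears most often among them.
public class KnnClassifier {
    private final List<int[]> X = new ArrayList<>();   // context feature vectors
    private final List<String> Y = new ArrayList<>();  // directives, e.g. "(sp,1)"

    public void add(int[] features, String directive) { X.add(features); Y.add(directive); }

    // Number of features that disagree (an illustrative stand-in distance).
    static int distance(int[] a, int[] b) {
        int d = 0;
        for (int i = 0; i < a.length; i++) if (a[i] != b[i]) d++;
        return d;
    }

    public String classify(int[] x, int k) {
        Integer[] idx = new Integer[X.size()];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort exemplar indexes by distance to the unknown context x.
        Arrays.sort(idx, Comparator.comparingInt(i -> distance(X.get(i), x)));
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < Math.min(k, idx.length); i++)
            votes.merge(Y.get(idx[i]), 1, Integer::sum);
        // Majority vote among the k nearest exemplars.
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }
}
```

Two such classifiers, one over the ws exemplars and one over the hpos exemplars, realize the two prediction functions the model needs.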

Once the model is complete, the formatting phase can begin. Formatting operates on a single document d to be formatted and functions with guidance from the model. At each token ti ∈ d, formatting computes the feature vector Xi representing the context surrounding ti, just like training does, but does not add Xi to the model. Instead, the formatter presents Xi to the ws classifier and asks it to predict a ws directive for ti based upon how similar contexts were formatted in the corpus. The formatter "executes" the directive and, if a newline, presents Xi to the hpos classifier to get an indentation or alignment directive. After emitting any preceding whitespace, the formatter emits the text for ti. Note that any token ti is identified by its token type, string content, and offset within a specific document, i.

The greedy, "local" decisions made by the formatter give "globally" correct formatting results; selecting features for the X vectors is critical to this success. Unlike typical machine learning tasks, our predictor functions do not yield trivial categories like "it's a cat." Instead, the predicted ws and hpos directives are parametrized. The following sections detail how CODEBUFF captures whitespace, computes feature vectors, predicts directives, and formats documents.

3.1 Capturing Whitespace as Directives

In order to reproduce a particular style, formatting directives must encode the information necessary to reproduce whitespace encountered in the training corpus. There are five canonical formatting directives:

1. nl: Inject newline
2. sp: Inject space character
3. (align, t): Left align current token with previous token t
4. (indent, t): Indent current token from previous token t
5. none: Inject nothing, no indentation, no alignment

For simplicity and efficiency, prediction for nl and sp operations can be merged into a single "predict whitespace" or ws operation, and prediction of align and indent can be merged into a single "predict horizontal position" or hpos operation. While the formatting directives are 2- and 3-tuples (details below), we pack the tuples into 32-bit integers for efficiency, w for ws directives and h for hpos.

For ws operations, the formatter needs to know how many (n) characters to inject: ws ∈ {(nl, n), (sp, n), none}, as computed in Function 2. For example, in the following Java fragment, the proper ws directive at position a is (sp, 1), meaning "inject 1 space," the directive at b is none, and c is (nl, 1), meaning "inject 1 newline."

    x = y;
     ^a  ^b
    z++;
    ^c

The hpos directives align or indent token ti relative to some previous token, tj for j < i. There are also two parameter-free variants that implicitly align or indent ti relative to the first token of the previous line:

    hpos ∈ {(align, tj), (indent, tj), align, indent}

In the following Java fragments, assuming 4-space indentation, directive (indent, if) captures the whitespace at position a, (align, if) captures b, and (align, x) captures c:

    if ( b ) {        f(x,         for (int i=0; ...
        z++;            y)             x=i;
        ^a              ^c             ^d
    }
    ^b

At position d, both (indent, for) and (align, '(') capture the formatting, but training chooses indentation over alignment directives when both are available. We experimented with the reverse choice, but found this choice better. Here, (align, '(') inadvertently captures the formatting only because for happens to be 3 characters long.

To illustrate the need for (indent, tj) versus plain indent, consider the following Java method fragment where the first statement is not indented from the previous line.

    public void write(String str)
        throws IOException {
        int n = 0;
        ^

Directive (indent, public) captures the indentation of int but plain indent does not. Plain indent would mean indenting 4 spaces from throws, the first token on the previous line, incorrectly indenting int 8 spaces relative to public. Directive indent is used to approximate nonstandard indentation as in the following fragment.

    f(100,
       0);
       ^

At the indicated position, the whitespace does not represent alignment or standard 4-space indentation. As a default for any nonstandard indentation, function capture_hpos returns plain indent as an approximation.

When no suitable alignment or indentation token is available, but the current token is aligned with the previous line, training captures the situation with directive align:

    return x +
      y +
      z;   // align with first token of previous line
      ^

While (align, y) is valid, that directive is not available because of limitations in how hpos directives identify previous tokens, as discussed next.

3.2 How Directives Refer to Earlier Tokens

The manner in which training identifies previous tokens for hpos directives is critical to successfully formatting documents and is one of the key contributions of this paper. The goal is to define a "token locator" that is as general as possible, yet specific enough to identify the intended previous token.
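Before turning to token locators, the ws side of Section 3.1 can be made concrete. The sketch below reduces the literal whitespace found before a token to a ws directive; captureWS is our own name and the logic is an illustrative simplification of Function 2 (newlines dominate spaces, the count n is carried in the directive, and tabs are ignored).

```java
// Sketch: map the raw whitespace string preceding a token to one of the
// ws directives (nl,n), (sp,n), or none, per the scheme above.
public class WsCapture {
    /** @param ws the literal inter-token whitespace, e.g. "\n    " */
    public static String captureWS(String ws) {
        int newlines = 0, spaces = 0;
        for (char c : ws.toCharArray()) {
            if (c == '\n') newlines++;
            else if (c == ' ') spaces++;
        }
        if (newlines > 0) return "(nl," + newlines + ")";  // hpos decides the indent
        if (spaces > 0) return "(sp," + spaces + ")";
        return "none";
    }
}
```

On the x = y; z++; fragment above, this yields (sp,1) before y, none before ;, and (nl,1) before z, matching the three annotated positions.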

But, the more specific the locator, the less applicable it is in other contexts. Consider the indicated positions within the following Java fragments, where align directives must identify the first token of previous function arguments.

    f(x,        f(x+1,      f(x+1,
      y)          y)          y,
      ^a          ^b          -z)
                              ^c

The absolute token index within a document is a completely general locator but is so specific as to be inapplicable to other documents or even other positions within the same document. For example, all positions a, b, and c could use a single formatting directive, (align, i), but x's absolute index, i, is valid only for a function call at that specific location. The model also cannot use a relative token index referring backwards. While still fully general, such a locator is still too specific to a particular phrase. At position a, token x is at delta 2, but at position b, x is at delta 4. Given argument expressions of arbitrary size, no single token index delta is possible and so such deltas would not be widely applicable. Because the delta values are different, the model could not use a single formatting directive to mean "align with previous argument." The more specific the token locator, the more specific the context information in the feature vectors needs to be, which in turn requires larger corpora (see Section 3.3).

[Figure 1. Parse tree for f(x+1,y,-z). Node rule:n in the tree indicates the grammar rule and alternative production number used to match the subtree phrase.]

[Figure 2. Parse tree for x+y+z;.]

We have designed a token locator mechanism that strikes a balance between generality and applicability. Not every previous token is reachable but the mechanism yields a single locator for x from all three positions above and has proven widely applicable in our experiments. The idea is to pop up into the parse tree and then back down to the token of interest, x, yielding a locator with two components: a path length to an ancestor node and a child index whose subtree's leftmost leaf is the target token. This alters formatting directives relative to previous tokens to be: (_, ancestor, child).

Unfortunately, training can make no assumptions about the structure of the provided grammar and, thus, parse-tree structure. So, training at ti involves climbing upwards in the tree looking for a suitable ancestor. To avoid the same issues with overly-specific elements that token indexes have, the path length is relative to what we call the earliest left ancestor, as shown in the parse tree in Figure 1 for f(x+1,y,-z).

The earliest left ancestor (or just left ancestor) is the oldest ancestor of t whose leftmost leaf is t, and identifies the largest phrase that starts with t. (For the special case where t has no such ancestor, we define left ancestor to be t's parent.) It attempts to answer "what kind of thing we are looking at." For example, the left ancestor computed from the left edge of an arbitrarily-complex expression always refers to the root of the entire expression. In this case, the left ancestors of x, y, and z are siblings, thus normalizing leaves at three different depths to a common level. The token locator in a directive for x in f(x+1,y,-z) from both y and z is (_, ancestor, child) = (_, 1, 0), meaning jump up 1 level from the left ancestor and down to the leftmost leaf of the ancestor's child 0.

The use of the left ancestor and the ancestor's leftmost leaf is critical because it provides a normalization factor among dissimilar parse trees about which training has no inherent structural information. Unfortunately, some tokens are unreachable using purely leftmost leaves. Consider the return x+y+z; example from the previous section and one possible parse tree for it in Figure 2. Leaf y is unreachable as part of formatting directives for z because y is not a leftmost leaf of an ancestor of z. Function capture_hpos must either align or indent relative to x or fall back on the plain align and indent.

The opposite situation can also occur, where a given token is unintentionally aligned with or indented from multiple tokens. In this case, training chooses the directive with the smallest ancestor, with ties going to indentation. And, finally, there could be multiple suitable tokens that share a common ancestor but with different child indexes. For example, if all arguments of f(x+1,y,-z) are aligned, the parse tree in Figure 1 shows that (align, 1, 0) is suitable to align y, and both (align, 1, 0) and (align, 1, 2) could align argument -z. Ideally, the formatter would align all function arguments with the same directive to reduce uncertainty in the classifier function (Section 3.4), so training chooses (align, 1, 0) for both function arguments.

141 The formatting directives capture whitespace in between Corpus N tokens Unique ws Unique hpos tokens but training must also record the context in which 19,692 3.0% 4.7% those directives are valid, as we discuss next. java 42,032 3.9% 17.4% java8 42,032 3.4% 7.5% 3.3 Token Context—Feature Vectors java_guava 499,029 0.8% 8.1% For each token present in the corpus, training computes sqlite 14,758 8.4% 30.8% tsql 14,782 7.5% 17.9% an exemplar that associates a context with a ws and hpos formatting-directive: (X, w, h). Each context has several fea- Figure 3. Percentage of unique context vectors in corpora. tures combined into a m-dimensional feature vector, X. The context information captured by the features must be specific enough to distinguish between language phrases requiring needed to achieve good generality. For example, rather than different formatting but not so specific that classifier func- relying solely on the exact tokens after a = token, it is more tions cannot recognize any contexts during formatting. The general to capture the fact that those tokens begin an expres- shorter the feature vector, the more situations in which each sion. A useful metric is the percentage of unique context vec- exemplar applies. Adding more features also has the poten- tors, which we counted for several corpora and show in Fig- tial to confuse the classifier. ure 3. Given the features described below, there are very few Through a combination of intuition and exhaustive exper- unique context for ws decisions (a few %). The contexts for imentation, we have arrived at a small set of features that hpos decisions, however, often have many more unique con- perform well. There are 22 context features computed dur- texts because ws uses 11-vectors and hpos uses 17-vectors. ing training for each token, but ws prediction uses only 11 of E.g., our reasonably clean SQL corpus has 31% and 18% them and hpos uses 17. 
(The classifier function knows which unique hpos vectors when trained using SQLite and TSQL subset to use.) The feature set likely characterises the context grammars, respectively. needs of the languages we tested during development to some For generality, the fewer unique contexts the better, as degree, but the features appear to generalize well (Section 5). long as the formatter performs well. At the extreme, a model Before diving into the feature details, it is worth describ- with just one X context would perform very poorly because ing how we arrived at these 21 features and how they affect all exemplars would be of the form (X, _, _). The formatting formatter precision and generality. We initially thought that a directive appearing most often in the corpus would be the sliding window of, say, four tokens would be sufficient con- sole directive returned by the classifier function for any X. text to make the majority of formatting decisions. For exam- The optimal model would have the fewest unique contexts ple, the context for x=1* would simply be the token but all exemplars with the same context having identical for- ··· ··· " matting directives. For our corpora, we found that a majority types of the surrounding tokens: X=[id,=,int_literal,*]. The of unique contexts for ws and almost all unique contexts for surrounding tokens provide useful but highly-specific infor- hpos predict a single formatting directive, as shown in Figure mation that does not generalize well. Upon seeing this ex- 4. For example, 57.1% of the unique corpus contexts act sequence during formatting, the classifier function would antlr are associated with just one ws directive and 95.7% of the find an exact match for X in the model and predict the associ- unique contexts predict one hpos directive. The higher the ated formatting directive. 
But, the classifier would not match ambiguity associated with a single context vector, the higher context x=y+ to the same X, despite having the same ··· ··· the uncertainty when making decisions during formatting. " formatting needs. The guava corpus stands out as having very few unique The more unique the context, the more specific the for- contexts for ws and among the fewest for hpos. This gives matter can be. Imagine a context for token ti defined as the a hint that the corpus might be much larger than necessary 20-token window surrounding each ti. Each context derived because the other Java corpora are much smaller and yield from the corpus would likely be unique and the model would good formatting results. Figure 8 shows the effect of corpus hold a formatting directive specific to each token position size on classifier error rates. The error rate flattens out after of every file. A formatter working from this model could training on about 10 to 15 corpus files. reproduce with high precision a very similar unknown file. In short, few unique contexts gives an indication of the The trade-off to such precision is poor generality because the potential for generality and few ambiguous decisions gives model has “overfit” the training data. The classifier would an indication of the model’s potential for accuracy. These likely find no exact matches for many contexts, forcing it to numbers do not tell the entire story because some contexts predict directives from poorly-matched exemplars. are used more frequently than others and those might all To get a more general model, context vectors use at most predict single directives. Further, while a single context could two exact token type but lots of context information from be associated with multiple directives, most of those could be the parse tree (details below). The parse tree provides infor- one specific directive. 
mation about the kind of phrase surrounding a token posi- With this perspective in mind, we turn to the details of the tion rather than the specific tokens, which is exactly what is individual features. The ws and hpos decisions use a differ-

    Corpus         Ambiguous ws directives   Ambiguous hpos directives
    antlr                 42.9%                      4.3%
    java                  29.8%                      1.7%
    java8                 31.6%                      3.2%
    java_guava            23.5%                      2.8%
    sqlite_noisy          43.5%                      5.0%
    sqlite                24.8%                      5.5%
    tsql_noisy            40.7%                      6.3%
    tsql                  29.3%                      6.2%

Figure 4. Percentage of unique context vectors in corpora associated with > 1 formatting directive.

3.3.1 Token Type and Matching Token Features

At token index i within each document, context feature vector Xi contains the following features related to previous tokens in the same document.

1. ti-1, token type of the previous token
2. ti, token type of the current token
3. Is ti-1 the first token on a line?
4. Is the paired token for ti the first on a line?
5. Is the paired token for ti the last on a line?

Feature #3 allows the model to distinguish between the following two different ANTLR grammar rule styles at position a, the DIGIT token, when ti=DIGIT, using two different contexts.

    DECIMAL : DIGIT+ ;          DECIMAL
                                    : DIGIT+
                                    ;

Exemplars for the two cases are:

    (X=[:, RULEREF, false, . . . ], w=(sp, 1), h=none)
    (X'=[:, RULEREF, true, . . . ], w'=(sp, 3), h'=none)

where RULEREF is type(DIGIT), the token type of rule reference DIGIT from the ANTLR meta-grammar. Without feature #3, there would be a single context associated with two different formatting directives.

Features #4 and #5 yield different contexts for common situations related to paired symbols, such as { and }, that require different formatting. For example, at position b, the ; token, the model knows that : is the paired previous symbol for ; (details below) and distinguishes between the styles. On the left, : is not the first token on a line whereas : does start the line for the case on the right, giving two different exemplars:

    (X=[. . . , false, false], w=(sp, 1), h=none)
    (X'=[. . . , true, false], w'=(nl, 1), h'=(align, :))

Those features also allow the model to distinguish between the first two of the following Java cases, where the paired symbol for } is sometimes not at the end of the line in short methods.

    void reset() {x=0;}     void reset() {     void reset() {
                                x=0;               x=0;}
                            }

Without features #4-#5, the formatter would yield the third.

Determining the set of paired symbols is nontrivial, given that the training can make no assumptions about the language it is formatting. We designed an algorithm, pairs in Function 4, that analyzes the parse trees for all documents in the corpus and computes plausible token pairs for every non-leaf node (grammar rule production) encountered. The algorithm relies on the idea that paired tokens are token literals, occur as siblings, and are not repeated siblings. Grammar authors also do not split paired token references across productions. Instead, authors write productions such as these ANTLR rules for Java:

    expr : ID '[' expr ']' | ... ;
    type : ID '<' ID (',' ID)* '>' | ... ;

that yield subtrees with the square and angle brackets as direct children of the relevant production. Repeated tokens are not plausible pair elements, so the commas in a generic Java type list, as in T<...>, would not appear in pairs associated with rule type. A single subtree in the corpus with repeated commas as children of a type node would remove comma from all pairs associated with rule type. Further details are available in Function 4 (see also source CollectTokenPairs.java). The algorithm neatly identifies pairs such as (?, :), ([, ]), and ((, )) for Java expressions and (enum, }), (enum, {), and ({, }) for enumerated type declarations. During formatting, paired (Function 5) returns the paired symbols for ti.

3.3.2 List Membership Features

Most computer languages have lists of repeated elements separated or terminated by a token literal, such as statement lists, formal parameter lists, and table column lists. The next group of features indicates whether ti is a component of a list construct and whether or not that list is split across multiple lines (“oversize”).

6. Is leftancestor(ti) a component of an oversize list?
7. leftancestor(ti) component type within the list, from {prefix token, first member, first separator, member, separator, suffix token}

With these two features, context vectors capture not only two different overall styles for short and oversize lists but how the various elements are formatted within those two kinds of lists. Here is a sample oversize Java formal parameter list annotated with list component types:

Figure 5. Formal args parse tree for void f(int x, int y); the repeated formalParameter siblings with a , separator are annotated with list component types prefix, 1st member, separator, member, and suffix.

Only the first member of a list is differentiated; all other members are labeled as just plain members because their formatting is typically the same. The exemplars would be:

    (X=[. . . , true, prefix], w=none, h=none)
    (X=[. . . , true, first member], w=none, h=none)
    (X=[. . . , true, first separator], w=none, h=none)
    (X=[. . . , true, member], w=(nl, 1), h=(align, first arg))
    (X=[. . . , true, separator], w=none, h=none)
    (X=[. . . , true, member], w=(nl, 1), h=(align, first arg))
    (X=[. . . , true, suffix], w=none, h=none)

Even for short lists on one line, being able to differentiate between list components lets training capture different but equally valid styles. For example, some ANTLR grammar authors write short parenthesized subrules like (ID|INT|FLOAT) but some write (ID | INT | FLOAT).

As with identifying token pairs, CODEBUFF must identify the constituent components of lists without making assumptions about grammars that hinder generalization. The intuition is that lists are repeated sibling subtrees with a single token literal between the 1st and 2nd repeated sibling, as shown in Figure 5. Repeated subtrees without separators are not considered lists. Training performs a preprocessing pass over the parse tree for each document, tagging the tokens identified as list components with values for features #6-#7. Tokens starting list members are identified as the leftmost leaves of repeated siblings (formalParameter in Figure 5). Prefix and suffix components are the tokens immediately to the left and right of list members but only if they share a common parent.

The training preprocessing pass also collects statistics about the distribution of list text lengths (without whitespace) of regular and oversize lists. Regular and oversize list lengths are tracked per (r, c, sep) combination for rule subtree root type r, child node type c, and separator token type sep; e.g., (r, c, sep) = (formalParameterList, formalParameter, ',') in Figure 5. The separator is part of the tuple so that expressions can distinguish between different operators such as = and *. Children of binary and ternary operator subtrees satisfy the conditions for being a list, with the operator as separator token(s). For each (r, c, sep) combination, training tracks the number of those lists and the median list length, (n, median).

3.3.3 Identifying Oversize Lists During Formatting

As with training, the formatter performs a preprocessing pass to identify the tokens of list phrases. Whereas training identifies oversize lists simply as those split across lines, formatting sees documents with all whitespace squeezed out. For each (r, c, sep) encountered during the preprocessing pass, the formatter consults a mini-classifier to predict whether that list is oversize or not based upon the list string length, ll. The mini-classifier compares the mean-squared-distance of ll to the median for regular lists and the median for oversize (big) lists and then adjusts those distances according to the likelihood of regular vs oversize lists. The a priori likelihood that a list is regular is p(reg) = n_reg / (n_reg + n_big), giving an adjusted distance to the regular type list as: dist_reg = (ll − median_reg)² × (1 − p(reg)). The distance for oversize lists is analogous.

When a list length is somewhere between the two medians, the relative likelihoods of occurrence shift the balance. When there are roughly equal numbers of regular and oversize lists, the likelihood term effectively drops out, giving just mean-squared-distance as the mini-classifier criterion. At the extreme, when all (r, c, sep) lists are big, p(big) = 1, forcing dist_big to 0 and, thus, always predicting oversize.

When a single token ti is a member of multiple lists, training and formatting associate ti with the longest list subphrase because that yields the best formatting, as evaluated manually across the corpora. For example, the expressions within a Java function call argument list are often themselves lists. In f(e1,...,a+b), token a is both a sibling of f's argument list but also the first sibling of expression a+b, which is also a list. Training and formatting identify a as being part of the larger argument list rather than the smaller a+b. This choice ensures that oversize lists are properly split. Consider the opposite choice where a is associated with list a+b. In an oversize argument list, the formatter would not inject a newline before a, yielding poor results:

    f(e1,
    ..., a+b)

Because list membership identification occurs in a top-down parse-tree pass, associating tokens with the largest construct is a matter of latching the first list association discovered.

3.3.4 Parse-Tree Context Features

The final features provide parse-tree context information:

8. childindex(ti)
9. rightancestor(ti-1)
10. leftancestor(ti)
11. childindex(leftancestor(ti))
12. parent^1(leftancestor(ti))
13. childindex(parent^1(leftancestor(ti)))
14. parent^2(leftancestor(ti))

15. childindex(parent^2(leftancestor(ti)))
16. parent^3(leftancestor(ti))
17. childindex(parent^3(leftancestor(ti)))
18. parent^4(leftancestor(ti))
19. childindex(parent^4(leftancestor(ti)))
20. parent^5(leftancestor(ti))
21. childindex(parent^5(leftancestor(ti)))

Here childindex(p) is the 0-based index of node p among the children of parent(p), childindex(ti) is shorthand for childindex(leaf(ti)), and leaf(ti) is the leaf node associated with ti. Function childindex(p) has a special case when p is a repeated sibling. If p is the first element, childindex(p) is the actual child index of p within the children of parent(p) but is special marker * for all other repeated siblings. The purpose is to avoid over-specializing the context vectors to improve generality. These features also use function parent^i(p), which is the ith parent of p; parent^1(p) is synonymous with the direct parent parent(p).

The child index of ti, feature #8, gives the necessary context information to distinguish the alignment token between the following two ANTLR lexical rules at the semicolon.

    BooleanLiteral          fragment DIGIT
        : 'true'                : [0-9]
        | 'false'               ;
        ;

On the left, the ; token is child index 3 but 4 on the right, yielding different contexts, X and X', to support different alignment directives for the two cases. Training collects exemplars (X, (align, 0, 1)) and (X', (align, 0, 2)), which aligns ; with the colon in both cases.

Next, features rightancestor(ti-1) and leftancestor(ti) describe what phrase precedes ti and what phrase ti starts. The rightancestor is analogous to leftancestor and is the oldest ancestor of ti whose rightmost leaf is ti (or parent(ti) if there is no such ancestor). For example, at ti=y in x=1; y=2; the right ancestor of ti-1 and the left ancestor of ti are both “statement” subtree roots.

Finally, the parent and child index features capture context information about highly nested constructs, such as:

    if ( x ) { }
    else if ( y ) { }
    else if ( z ) { }
    else { }

Each else token requires a different formatting directive for alignment, as shown in Figure 6; e.g., (align, 1, 3) means “jump up 1 level from leftancestor(ti) and align with leftmost leaf of child 3 (token else).” To distinguish the cases, the context vectors must be different. Therefore, training collects these partial vectors with features #10-#15:

    X = [. . . , stat, 0, blockStat, *, block, 0, . . . ]
    X = [. . . , stat, *, stat, 0, blockStat, *, . . . ]
    X = [. . . , stat, *, stat, *, stat, 0, . . . ]

where stat abbreviates statement:3 and blockStat abbreviates blockStatement:2. All deeper else clauses also use directive (align, 1, 3).

Figure 6. Alignment directives for nested if-else statements: the successive else tokens in the parse tree receive directives (align,0,0), (align,1,0), and (align,1,3).

Training is complete once the software has computed an exemplar for each token in all corpus files. The formatting model is the collection of those exemplars and an associated classifier that predicts directives given a feature vector.

3.4 Predicting Formatting Directives

CODEBUFF's kNN classifier uses a fixed k = 11 (chosen experimentally in Section 4) and an L0 distance function (ratio of the number of components that differ to vector length), but with a twist on classic kNN that accentuates feature vector distances in a nonlinear fashion. To make predictions, a classic kNN classifier computes the distance from unknown feature vector X to every Xj vector in the exemplars, (X, Y), and predicts the category, y, occurring most frequently among the k exemplars nearest X.

The classic approach works very well in Euclidean space with quantitative feature vectors but not so well with an L0 distance that measures how similar two code-phrase contexts are. As the L0 distance increases, the similarity of two context vectors drops off dramatically. Changing even one feature, such as the earliest left ancestor (kind of phrase), can mean very different contexts. This quick drop-off matters when counting votes within the k nearest Xj. At the extreme, there could be one exemplar where X = Xj at distance 0 and 10 exemplars at distance 1.0, the maximum distance. Clearly the one exact match should outweigh the 10 that do not match at all, but a classic kNN uses a simple unweighted count of exemplars per category (10 out of 11 in this case). Instead of counting the number of exemplars per category, our variation sums 1 − L0(X, Xj)^(1/3) for each Xj per category. Because distances are in [0..1], the cube root nonlinearly accentuates differences. Distances of 0 count as weight 1, like the classic kNN, but distances close to 1.0 count very little towards their associated category. In practice, we found feature vectors more distant than about 15% from unknown X to be too dissimilar to count. Exemplars at distances above this threshold are discarded while collecting the k nearest neighbors.
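The modified vote can be sketched as follows. This is a minimal sketch; CODEBUFF's actual implementation is in Java and narrows the neighbor search with an index:

```python
# Weighted kNN sketch: L0 distance over equal-length feature tuples, a 15%
# distance cutoff, and a cube-root vote weight of 1 - d**(1/3) per neighbor.
from collections import defaultdict

def l0_distance(x, y):
    """Ratio of differing components to vector length."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def predict(x, exemplars, k=11, max_dist=0.15):
    """exemplars: list of (feature_vector, directive) pairs."""
    near = sorted((l0_distance(xj, x), yj) for xj, yj in exemplars)
    near = [(d, y) for d, y in near if d <= max_dist][:k]
    votes = defaultdict(float)
    for d, y in near:
        # d=0 contributes weight 1; weights fall off quickly as d grows.
        votes[y] += 1 - d ** (1 / 3)
    return max(votes, key=votes.get) if votes else None
```

With one exact match and ten maximally distant exemplars, the cutoff discards the ten and the exact match wins, matching the intuition in the text.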

The classifier function uses features #1-#10 and #12 to make ws predictions and #2, #6-#21 for hpos; hpos predictions ignore Xj not associated with tokens starting a line.

3.5 Formatting a Document

To format document d, the formatter, Function 6, first squeezes out all whitespace tokens and line/column information from the tokens of d and then iterates through d's remaining tokens, deciding what whitespace to inject before each token. At each token, the formatter computes a feature vector for that context and asks the model to predict a formatting directive (whereas training examines the whitespace to determine the directive). The formatter uses the information in the formatting directive to compute the number of newline and space characters to inject. The formatter treats the directives like bytecode instructions for a simple virtual machine: {(nl, n), (sp, n), none, (align, ancestor, child), (indent, ancestor, child), align, indent}.

As the formatter emits tokens and injects whitespace, it tracks line and column information so that it can annotate tokens with this information. Computing features #3-#5 at token ti relies on line and column information for tj for some j < i.

. . . the model (and subsequently improve it) as well as report its efficacy in an objective manner. We propose the following measure. Given corpus D that is perfectly consistently formatted, CODEBUFF should produce the identity transformation for any document d ∈ D if trained on a corpus subset D \ {d}. This leave-one-out cross-validation allows us to use the corpus for both training and for measuring formatter quality. (See Section 5 for evidence of CODEBUFF's generality.) For each document, the distance between original d and formatted d' is an inverse measure of formatting quality.

A naive similarity measure is the edit distance (Levenshtein Distance [7]) between d' and d, but it is expensive to compute and will over-accentuate minor differences. For example, a single indentation error made by the formatter could mean the entire file is shifted too far to the right, yielding a very high edit distance. A human reviewer would likely consider that a small error, given that the code looks exactly right except for the indentation level. Instead, we quantify the document similarity using the aggregate misclassification rate, in [0..1], for all predictions made while generating d':

    error = (n_ws_errors + n_hpos_errors) / (n_ws_decisions + n_hpos_decisions)

As part of our examination of grammar invariance (details below), we used 2 different Java grammars and 2 different SQL grammars taken from ANTLR's grammar repository:

• java. A Java 7 grammar.
• java8. A transcription of the Java 8 language specification into ANTLR format.
• sqlite. A grammar for the SQLite variant of SQL.
• tsql. A grammar for the Transact-SQL variant of SQL.

Figure 7. Standard box-plot of leave-one-out validation error rate between formatted document d' and original d, per corpus and grammar (antlr n=12, sqlite n=36, tsql n=36, java st n=59, java8 st n=59, sqlite_noisy n=36, tsql_noisy n=36, java guava n=511, java8 guava n=511).

4.3 Formatting Quality Results

Our first experiment demonstrates that CODEBUFF can faithfully reproduce the style found in a consistent corpus. Details to reproduce all results are available in a README.md [11]. Figure 7 shows the formatting error rate, as described above. Lower median error rates correspond with higher-quality formatting, meaning that the formatted files are closer to the original. Manual inspection of the corpora confirms that consistently-formatted corpora indeed yield better results. For example, median error rates (17% and 19%) are higher using the two grammars on the sql_noisy corpus versus the cleaned-up sql corpus. The guava corpus has extremely consistent style because it is enforced programmatically, and consequently CODEBUFF is able to reproduce the style with high accuracy using either Java grammar. The antlr corpus results have a high error rate due to some inconsistencies among the grammars but, nonetheless, formatted grammars look good except for a few overly-long lines.

4.4 Grammar Invariance

Figure 7 also gives a strong hint that CODEBUFF is grammar invariant, meaning that training models on a single corpus but with different grammars gives roughly the same formatting results. For example, the error rates for the st corpus trained with java and java8 grammars are roughly the same, indicating that CODEBUFF's overall error rate (similarity of original/formatted documents) does not change when we swap out the grammar. The same evidence appears for the other corpora and grammars. The overall error rate could hide large variation in the formatting of individual files, however, so we define grammar invariance as a file-by-file comparison of normalized edit distances.

Definition 4.1. Given models F(D,G) and F(D,G') derived from grammars G and G' for a single language, L(G) = L(G'), a formatter is grammar invariant if the following holds for any document d: format(F(D,G), d) ⋄ format(F(D,G'), d) ≤ ε for some suitably small normalized edit distance ε.

Definition 4.2. Let operator d1 ⋄ d2 be the normalized edit distance between documents d1 and d2, defined by the Levenshtein Distance [7] divided by max(len(d1), len(d2)).

The median edit distances between formatted files (using leave-one-out validation) from 3 corpora provide strong evidence of grammar invariance for Java but less so for SQL:

• 0.001 for guava corpus with java and java8 grammars
• 0.008 for st corpus with java and java8 grammars
• 0.099 for sql corpus with sqlite and tsql grammars

The “average” difference between guava files formatted with different Java grammars is 1 character edit per 1000 characters. The less consistent st corpus yields a distance of 8 edits per 1000 characters. A manual inspection of Java documents formatted using models trained with different grammars confirms that the structure of the grammar itself has little to no effect on the formatting results, at least when trained on the context features defined in Section 3.3.

The sql corpus shows a much higher difference between formatted files, 99 edits per 1000 characters. Manual inspection shows that both versions are plausible, just a bit different in nl prediction. Newlines trigger indentation, leading to bigger whitespace differences. One reason for higher edit distances could be that the noise in the less consistent SQL corpus amplifies any effect that the grammar has on formatting. More likely, the increased grammar sensitivity for SQL has to do with the fact that the sqlite and tsql grammars are actually for two different languages. The TSQL language has procedural extensions and is Turing complete; the tsql grammar is 2.5x bigger than sqlite. In light of the different SQL dialects and noisier corpus, a larger difference between formatted SQL files is unsurprising and does not rule out grammar invariance.

4.5 Effects of Corpus Size

Prospective users of CODEBUFF will ask how the size of the corpus affects formatting quality. We performed an experiment to determine: (i) how many files are needed to reach the median overall error rate and (ii) whether adding more and more files confuses the kNN classifier. Figure 8 summarizes the results of an experiment comparing the median error rate for randomly-selected corpus subsets of varying sizes across different corpora and grammars. Each data point represents 50 trials at a specific corpus size.

The error rate quickly drops after about 5 files and then asymptotically approaches the median error rate shown in Figure 7. This graph suggests a minimum corpus size of about 10 files and provides evidence that adding more (consistently formatted) files neither confuses the classifier nor improves it significantly.

Figure 8. Effect of corpus size on median leave-one-out validation error rate using randomly-selected corpus subsets (0-30 training files; sqlite, antlr, java st, java8 st, java guava, java8 guava, and tsql corpus/grammar combinations).

4.6 Optimization and Stability of Model Parameters

Choosing a k for a kNN model is more of an art, but k = √N for N exemplars is commonly used. Through exhaustive manual review of formatted files, we instead arrived at a fixed k = 11 and then verified its suitability by experiment. Figure 9 shows the effect of varying k on the median error rate across a selection of corpora and grammars; k ranged from 1 to 99. This graph supports the conclusion that a formatter can make accurate decisions by comparing the context surrounding ti to very few model exemplars. Moreover, formatter accuracy is very stable with respect to k; even large changes in k do not alter the error rate very much.

Figure 9. Effect of k on median leave-one-out error rate (k from 1 to 99; java st, java8 st, antlr, sqlite, and tsql corpus/grammar combinations).

4.7 Worst-Case Complexity

Collecting all exemplars to train our kNN F(D,G) model requires two passes over the input. The first pass walks the parse tree for each d ∈ D, collecting matching token pairs (Section 3.3.1) and identifying list membership (Section 3.3.2). The size of a single parse tree is bounded by the size of the grammar times the number of tokens in the document, |G| × |d| for document d (the charge per token is a tree depth of at most |P| productions of G). To make a pass over all parse trees for the entire corpus, the time complexity is |G| × N, so in O(N) for N total tokens in the corpus.

The second pass walks each token ti ∈ d for all d ∈ D, using information computed in the first pass to compute feature vectors and capture whitespace as ws and hpos formatting directives. There are m features to compute for each of N tokens. Most of the features require constant time, but computing the earliest ancestors and identifying tokens for indentation and alignment might walk the depth of the parse tree in the worst case, for a cost of O(log(|G| × |d|)) per feature per token. For an average document size, a feature costs O(log(|G| × N/|D|)). Computing all m features and capturing whitespace for all documents costs

    O(m × N × log(|G| × N/|D|)) = O(N × m × (log|G| + log N − log|D|))

which is in O(N log N). Including the first pass over all parse trees adds an O(N) term but does not change the overall worst-case time complexity for training.

Formatting a document is much more expensive than training. For each token in a document, the formatter requires at least one ws classifier function execution and possibly a second for hpos. Each classifier function execution requires the feature vector X computation cost per the above, but the cost is dominated by the comparison of X to every Xj ∈ X and sorting those results by distance to find the k nearest neighbors. For n = |d| tokens in a file to be formatted, the overall complexity is quadratic, O(n × N log N). For an average document size of n = N/|D|, formatting a document costs O(N² log N). Box formatters are also usually quadratic; streaming formatters are linear (see Section 6).

4.8 Expected Performance

CODEBUFF is instrumented to report single-threaded CPU run-time for training and formatting. We report a median 1.5s training time for the (overly large by an order of magnitude) guava corpus (511 files, 143k lines) after documents are parsed, using the java grammar, and a training time of 1.8s with the java8 grammar.² The antlr corpus, with only 12 files, is trained within 72ms.

² Experiments run 20 times on an iMac17,1, OS 10.11.5, 4GHz Intel Core i7 with Java 8, heap/stack size 4G/1M; we ignore the first 5 measurements to account for warmup time. Details to reproduce are in the github repo.

Because kNN uses the complete list of exemplars as an internal representation, the memory footprint of CODEBUFF is substantial. For each of N tokens in the corpus, we track m features and two formatting directives (each packed into a single word). The size complexity is O(N) but with a nontrivial constant. For the guava corpus, the profiler shows 2.5M Xj ∈ X feature vectors, consuming 275M RAM. Because CODEBUFF keeps all corpus text in memory for testing purposes, we observe an overall 1.7G RAM footprint.

For even a modest-sized corpus, the number of tokens, N, is large and a naive kNN classifier function implementation is unusably slow. Ours uses an index to narrow the nearest-neighbor search space and a cache of prediction function results to make the performance acceptable. To provide worst-case formatting speed figures, CODEBUFF formats the biggest file (LocalCache.java) in the guava library (22,674 tokens, 5022 lines) in median times of 2.7s (java) and 2.2s (java8). At ~2300 lines/s even for such a large training corpus, CODEBUFF is usable in IDEs and other applications. Using the more reasonably-sized antlr corpus, formatting the 1777-line Java8.g4 takes just 70ms (25k lines/s).

5. Test of Generality

As a test of generality, we solicited a grammar and corpus for an unfamiliar language, Quorum³ by Andreas Stefik, and trained CODEBUFF on that corpus. The first look at the grammar and corpus for both the authors and the model occurred during the preparation of this manuscript. We dropped about half of the corpus files randomly to simulate a more reasonably-sized training corpus but did not modify any files. Most files in the corpus result in plausibly formatted code, but there are a few outlier files, such as HashTable.quorum. For example, the following function looks fine except that it is indented way too far to the right in the formatted output:

    action RemoveKey(Key key) returns Value
       repeat while node not= undefined
          if key:Equals(node:key)
             if previous not= undefined
                previous:next = node:next
             else
                array:Set(index, node:next)
             end
             ...
          end
       return undefined
    end

The median misclassification error rate for Quorum is very low, 4%, on a par with the guava model, indicating that it is consistent and that CODEBUFF captures the formatting style.

³ https://www.quorumlanguage.com

6. Related Work

The state of the art in language-parametric formatting stems from two sources: Coutaz [3] and Oppen [10]. Coutaz introduced the “Box” as a basic two-dimensional layout mechanism, and all derived work elaborates on specifying the mapping from syntax (parse) trees to Box expressions more effectively (in terms of meta-program size and expressiveness) [4, 12-14]. Oppen introduced the streaming formatting algorithm in which alignment and indentation operators are injected into a token stream (coming from a syntax tree serialization or a lexer token stream). Derivative and later work [1, 2, 18] elaborates on specifying how and when to inject these tokens for different languages in different ways and in different technological spaces [6].

Oppen's streams shine for their simplicity and efficiency in time and memory, while the expressivity of Coutaz' Boxes makes it easier to specify arbitrary formatting styles. The expressivity comes at the cost of building an intermediate representation and running a specialized constraint solver. Conceptually, CODEBUFF derives most from the Oppen school, but in terms of expressivity and being able to capture “naturally occurring” and therefore hard-to-formalize formatting styles, CODEBUFF approaches the power of Coutaz' Boxes. The reason is that CODEBUFF matches token context much like the algorithms that map syntax trees to Box expressions. CODEBUFF can, therefore, seamlessly learn to specialize for even more specific situations than a software language engineer would care to express. At the same time, it uses an algorithm that simply injects layout directives into a token stream, which is very efficient. There are limitations, however (detailed in Section 3).

Language-parametric formatters from the Box school are usually combined with default formatters, which are statically constructed using heuristics (expert knowledge). The default formatter is expected to guess the right formatting rule in order to relieve the language engineer from having to specify a rule for every language construct (like CODEBUFF does for all constructs). To do this, the input grammar is analyzed for “typical constructs” (such as expressions and block constructs), which are typically formatted in a particular way. The usefulness of default formatters is limited, though. We have observed engineers mapping languages to Box completely manually, while a default formatter was readily available. If CODEBUFF is not perfect, at least it generalizes the concept of a default formatter by learning from input sentences and not just the grammar of a language. This difference is the key benefit of the machine learning approach; CODEBUFF could act as a more intelligent default formatter in existing formatting pipelines.

PGF [1] by Bagge and Hasu is a rule-based system to inject pretty-printing directives into a stream. The meta-grammar for these rules (which the language engineer must write) is conceptually comparable to the feature set that CODEBUFF learns. The details of the features are different, however, and these details matter greatly to the efficacy of the learner.

149 BUFF uses during training is interesting to observe: we are 9. Appendix characterizing the domain of formatting from different ends. This section collects all of CODEBUFF’s key algorithms refer- Since CODEBUFF automatically learns by example, related enced in the main body. work that needs explicit formalizations is basically incom- parable from a language engineering perspective. We could Function 1: train(D, G, indentSize) model FD,G only compare the speed of the derived formatters and the ! X := []; W := []; H := []; j := 1; quality of the formatted output. It would be possible to com- foreach document d D do pare the default grammar-based formatters to see how many 2 tokens := tokenize(d); of their rules are useful compared to CODEBUFF, but (i) this is completely unfair because the default formatters do not have tree := parse(tokens); foreach ti tokens do access to input examples and (ii) default formatters are con- 2 structed in a highly arbitrary manner so there is no lesson to X[j] := compute context feature vector for ti, tree; learn other than their respective designer’s intuitions. W [j] := capture_ws(ti); Aside from the actual functionality of formatting text, re- H[j] := capture_hpos(ti, indentSize); lated work has extended formatting with “high fidelity” (not j := j +1; losing comments) [16, 17] and partial formatting (introduc- return (X, W, H, indentSize); ing a new source text in an already formatted context) [5]. CODEBUFF copies comments into the formatted output but partial formatting is outside the scope of the current contri- Function 2: capture_ws(ti) w ws ! 2 bution; CODEBUFF could be extended with such functionality. newlines := num newlines between ti 1 and ti; if newlines > 0 then return (nl, newlines); 7. 
Future Work col := ti.col - (ti 1.col + len(text(ti 1))); The relative feature weights in the current kNN classifier return (ws, col); were determined through trial and error so we anticipate swapping in a random forest classifier, which automatically Function 3: capture_hpos(ti, indentSize) h hpos determines the importance (weight) of each feature. Random ! 2 forest classifiers are also much more efficient than kNN clas- ancestor := leftancestor(ti); if ancestor w/child aligned with ti.col then sifiers. We anticipate introducing new classifier features to 9 rectify the few incorrectly-formatted files. halign := (align,ancestor,childindex) with smallest ancestor&childindex; if ancestor w/child at ti.col + indentSize then 8. Conclusion 9 hindent := (indent,ancestor,childindex) CODEBUFF is a step towards a universal code formatter that with smallest ancestor&childindex; uses machine learning to abstract formatting rules from if halign and hindent not nil then a representative corpus. Current approaches require com- return directive with smallest ancestor; plex pattern-formatting rules written by a language expert if halign not nil then return halign; whereas input to CODEBUFF is just a grammar, corpus, and indentation size. Experiments show that training requires if hindent not nil then return hindent; about 10 files and that resulting formatters are fast and highly if ti indented from previous line then return indent; accurate for three languages. Further, tests on a previously- return align; unseen language and corpus show that, without modification, CODEBUFF generalized to a 4th language. Formatting results Function 4: pairs(Corpus D) map node set<(s, t)> are largely insensitive to language-preserving changes in the ! 7! pairs := map of node set; grammar and our kNN classifier is highly stable with respect 7! repeats := map of node set; to changes in model parameter k. Based on these results, we 7! 
foreach d D do look forward to many more applications of machine learning 2 in software language engineering. foreach non-leaf node r in parse(d) do literals := t parent(t)=r, t is literal token ; { | } add (t ,t ) i
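For illustration, the whitespace-capture step (Function 2) can be written out in executable form. The following is a minimal Python sketch, assuming a simple token record with text, line, and col fields; these names are ours and not taken from the CODEBUFF implementation.

```python
from collections import namedtuple

# Minimal stand-in for a parser token; the field names are illustrative,
# not CODEBUFF's actual token API.
Token = namedtuple("Token", ["text", "line", "col"])

def capture_ws(prev, cur):
    """Return the whitespace directive between two adjacent tokens:
    ('nl', n) when n newlines separate them, otherwise ('ws', n)
    for n spaces on the same line."""
    newlines = cur.line - prev.line
    if newlines > 0:
        return ("nl", newlines)
    # Spaces = gap between end of previous token and start of current one.
    return ("ws", cur.col - (prev.col + len(prev.text)))
```

For two tokens on the same line the directive records the number of intervening spaces; across lines it records the newline count, matching the (nl, n)/(ws, n) pairs collected into W by Function 1's training loop.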

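The hand-tuned feature weights discussed under Future Work can be made concrete with a small weighted nearest-neighbor sketch over categorical feature vectors. The distance function, weighting scheme, and all names below are illustrative assumptions, not CODEBUFF's actual classifier.

```python
from collections import Counter

def weighted_knn_predict(X, Y, x, weights, k=3):
    """Majority vote over the k training vectors nearest to x, where
    distance is the weight-normalized count of mismatched categorical
    features (disagreement on a heavier feature costs more)."""
    def dist(a):
        return sum(w for w, ai, xi in zip(weights, a, x) if ai != xi) / sum(weights)
    nearest = sorted(range(len(X)), key=lambda i: dist(X[i]))[:k]
    return Counter(Y[i] for i in nearest).most_common(1)[0][0]
```

A random forest, as proposed above, would instead learn such weights (feature importances) from the corpus, removing the trial-and-error tuning step.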
Function 5: paired(pairs, token t_i) → t′
    mypairs := pairs[parent(t_i)];
    viable := { s | (s, t_i) ∈ mypairs, s ∈ siblings(t_i) };
    if |viable| = 1 then ttype := viable[0];
    else if ∃ (s, t_i) | s, t_i are common pairs then ttype := s;
    else if ∃ (s, t_i) | s, t_i are single-char literals then ttype := s;
    else ttype := viable[0];    // choose first if still ambiguous
    matching := [ t_j | t_j = ttype, j < i, t_j ∈ siblings(t_i) ];
    return last(matching);

Function 6: format(F_{D,G} = (X, W, H, indentSize), d)
    line := col := 0;
    d := d with whitespace tokens, line/column info removed;
    foreach t_i ∈ d do
        emit any comments to left of t_i;
        X_i := compute context feature vector at t_i;
        ws := predict directive using X_i and X, W;
        newlines := sp := 0;
        if ws = (nl, n) then newlines := n;
        else if ws = (sp, n) then sp := n;
        if newlines > 0 then    // inject newline and align/indent
            emit newlines '\n' characters;
            line += newlines; col := 0;
            hpos := predict directive using X_i and X, H;
            if hpos = (_, ancestor, child) then
                t_j := token relative to t_i at ancestor, child;
                col := t_j.col;
                if hpos = (indent, _, _) then col += indentSize;
                emit col spaces;
            else    // plain align or indent
                t_j := first token on previous line;
                col := t_j.col;
                if hpos = indent then col += indentSize;
                emit col spaces;
            end
        else col += sp; emit sp spaces;    // inject spaces
        t_i.line = line;    // set t_i location
        t_i.col = col;
        emit text(t_i);
        col += len(text(t_i))

References

[1] A. H. Bagge and T. Hasu. A pretty good formatting pipeline. In International Conference on Software Language Engineering (SLE'13), volume 8225 of LNCS. Springer, Oct. 2013.
[2] O. Chitil. Pretty printing with lazy dequeues. ACM TOPLAS, 27(1):163–184, Jan. 2005. ISSN 0164-0925.
[3] J. Coutaz. The Box, a layout abstraction for user interface toolkits. CMU-CS-84-167, Carnegie Mellon University, 1984.
[4] M. de Jonge. Pretty-printing for software reengineering. In Proceedings of ICSM 2002, pages 550–559. IEEE Computer Society Press, Oct. 2002.
[5] M. de Jonge. Pretty-printing for software reengineering. In ICSM 2002, 3–6 October 2002, Montreal, Quebec, Canada, pages 550–559, 2002.
[6] I. Kurtev, J. Bézivin, and M. Aksit. Technological spaces: An initial appraisal. In International Symposium on Distributed Objects and Applications, DOA 2002, 2002.
[7] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707, Feb. 1966.
[8] S. McConnell. Code Complete. Microsoft Press, 1993.
[9] R. J. Miara, J. A. Musselman, J. A. Navarro, and B. Shneiderman. Program indentation and comprehensibility. Commun. ACM, 26(11):861–867, 1983.
[10] D. C. Oppen. Prettyprinting. ACM TOPLAS, 2(4):465–483, 1980. ISSN 0164-0925.
[11] T. Parr, F. Zhang, and J. Vinju. Codebuff, June 2016. URL https://github.com/antlr/codebuff/tree/1.5.1.
[12] M. van den Brand and E. Visser. Generation of formatters for context-free languages. ACM Trans. Softw. Eng. Methodol., 5(1):1–41, Jan. 1996.
[13] M. G. van den Brand, A. T. Kooiker, J. J. Vinju, and N. P. Veerman. A language independent framework for context-sensitive formatting. In CSMR 2006, pages 10–pp. IEEE, 2006.
[14] M. G. J. van den Brand and E. Visser. Generation of formatters for context-free languages. ACM Trans. Softw. Eng. Methodol., 5(1):1–41, 1996. ISSN 1049-331X.
[15] M. G. J. van den Brand, A. van Deursen, J. Heering, H. A. de Jong, M. de Jonge, T. Kuipers, P. Klint, P. A. Olivier, J. Scheerder, J. J. Vinju, E. Visser, and J. Visser. The ASF+SDF Meta-Environment: a component-based language development environment. In CC '01, volume 2027 of LNCS. Springer-Verlag, 2001.
[16] J. Vinju. Analysis and Transformation of Source Code by Parsing and Rewriting. PhD thesis, University of Amsterdam, 2005.
[17] D. G. Waddington and B. Yao. High-fidelity C/C++ code transformation. Electron. Notes Theor. Comput. Sci., 141(4):35–56, Dec. 2005. ISSN 1571-0661.
[18] P. Wadler. A prettier printer. In Journal of Functional Programming, pages 223–244. Palgrave Macmillan, 1998.
