Syntax Errors Just Aren't Natural: Improving Error Reporting With

Syntax Errors Just Aren’t Natural: Improving Error Reporting with Language Models Joshua Charles Abram Hindle José Nelson Amaral Campbell Department of Computing Department of Computing Department of Computing Science Science Science University of Alberta University of Alberta University of Alberta Edmonton, Canada Edmonton, Canada Edmonton, Canada [email protected] [email protected] [email protected] ABSTRACT programmers get frustrated by syntax errors, often resorting A frustrating aspect of software development is that com- to erratically commenting their code out in the hope that piler error messages often fail to locate the actual cause of they discover the location of the error that brought their a syntax error. An errant semicolon or brace can result in development to a screeching halt [14]. Syntax errors are many errors reported throughout the file. We seek to find annoying to programmers, and sometimes very hard to find. the actual source of these syntax errors by relying on the con- Garner et al. [4] admit that “students very persistently sistency of software: valid source code is usually repetitive keep seeking assistance for problems with basic syntactic de- and unsurprising. We exploit this consistency by construct- tails.” They, corroborated by numerous other studies [11, 12, ing a simple N-gram language model of lexed source code 10, 17], found that errors related to basic mechanics (semi- tokens. We implemented an automatic Java syntax-error lo- colons, braces, etc.) were the most persistent and common cator using the corpus of the project itself and evaluated its problems that beginner programmers faced. In fact these performance on mutated source code from several projects. errors made up 10% to 20% of novice programmer compiler Our tool, trained on the past versions of a project, can ef- problems. As an example, consider the missing brace at fectively augment the syntax error locations produced by the end of line 2 in the following Java code taken from the the native compiler. Thus we provide a methodology and Lucene 4.0.0 release: 1 for (int i = 0; i < scorers.length; i++) { tool that exploits the naturalness of software source code to 2 if (scorers[i].nextDoc() == NO_MORE_DOCS) detect syntax errors alongside the parser. 5 lastDoc = NO_MORE_DOCS; 6 return; Categories and Subject Descriptors 7 } 8 } D.2.5 [Software Engineering]: Testing and Debugging— Debugging aids, diagnostics; D.3.8 [Programming Lan- This mistake, while easy for an experienced programmer guages]: Processors—Parsing to understand and fix, if they know where to look, causes the Oracle Java compiler 1 to report 50 error messages, includ- General Terms ing those in Figure 1, none of which are on the line with the mistake. This poor error reporting slows down the software Languages, Human Factors development process because a human programmer must examine the source file to locate the error, which can be a very Keywords time consuming process. naturalness, language, n-grams, syntax, error location, NLP Some smart IDEs such as Eclipse, aim to address this problem but still fall short of locating the actual cause of the syntax error as seen in Figure 2. Thus syntax errors and 1. MOTIVATION poor compiler/interpreter error messages have been found Syntax errors plague new programmers as they struggle to be a major problem for inexperienced programmers [16, to learn computer programming [16, 4]. Even experienced 4, 17]. There has been much work on trying to improve error loca- Permission to make digital or hard copies of all or part of this work for tion detection and error reporting. Previous methods tend personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear to rely on rules and heuristics [3, 6, 8]. In contrast to those this notice and the full citation on the first page. Copyrights for components methods, we seek to improve on error location reporting by of this work owned by others than ACM must be honored. Abstracting with exploiting recent research on the regularity and naturalness credit is permitted. To copy otherwise, or republish, to post on servers or to of written source code [7]. Recent research by Hindle et redistribute to lists, requires prior specific permission and/or a fee. Request al. [7] exploits n-gram based language models, applied to permissions from [email protected]. existing source code, to predict tokens for code completion. MSR ’14, May 31 - June 07 2014, Hyderabad, India Copyright is held by the owner/author(s). Publication rights licensed to 1 ACM. Oracle’s Java compiler, version 1.7.0 13, is available ACM 978-1-4503-2863-0/14/05...$15.00. at http://www.oracle.com/technetwork/java/javase/ http://dx.doi.org/10.1145/2597073.2597102. downloads/index.htm Figure 1: Oracle Java Error messages from the motivational example. The effectiveness of the n-gram model at code completion and suggestion is due to the repetitive, and consistent, structure of code. Using this exact same model, and exploiting the natural structure of source code, we can improve error location detection. The idea behind this new method is to train a model on compilable source-code token sequences, and then evaluate on new code to see how often those sequences oc- cur within the model. Source code that does not compile should be surprising for an n-gram language model trained on source code that compiles. Intuitively, most available and distributed source code compiles. Projects and their source code are rarely released with syntax errors, although this property may depend on the project’s software development process. Thus, existing software can act as a corpus of compilable and working software. Furthermore, whenever a source file successfully compiles it can automatically up- date the training corpus because the goal is to augment the compiler’s error messages. Changes to the software might not compile. When that happens, locations in the source file that the model finds surprising should be prioritized as locations that a software engineer should examine. The following sections of the paper examine the feasibility of this new method by testing its ability to locate errors in a variety of situations, as an augmentation of compiler Figure 2: Top: Figure 1’s code snippet put into error reporting. These situations include situations which Eclipse; bottom: another example of failed Eclipse are much more adverse than expected in practice. syntax error location detection. Eclipse often de- The main contributions of this work include: tects syntax errors, but often reports them in the • A method of statistical, probabilistic syntax-error loca- wrong location. Eclipse’s suggested error location is tion detection that exploits n-gram language models. circled in red, and the actual error location is circled in green. • A prototype implementation of an n-gram language- model-based Java syntax error locator, UnnaturalCode,2 that can be used with existing build systems and Java 2. BACKGROUND compilers to suggest locations that might contain syn- An n-gram language model at its lowest level is simply a tax errors. collection of counts. These counts represent the number of times a phrase appears in a corpus. These phrases are re- • A validation of the feasibility of the new syntax error ferred to as n-grams because they consist of at most n words location detection method. or tokens. These counts are then used to infer the probability of a phrase: the probability is simply the frequency of • A validation of the integration of the new method with occurrence of the phrase in the original corpus. For example, the compiler’s own methods. if the phrase “I’m a little teapot” occurred 7 times in a cor- 3 pus consisting of 700 4-grams, its probability would be .01. • A modified version of MITLM, which has routines devel- However, it is more useful to consider the probability of a oped by the authors to calculate the entropy of short word in its surrounding context. The probability of finding sentences with respect to a large corpus quickly. the word “little” given the context of “I’m a teapot” is much higher because “little” may be the only word that 2 shows up in that context — a probability of 1. UnnaturalCode is available at https://github.com/ orezpraw/unnaturalcode These n-gram models become more accurate as n increases 3The modified MITLM package used in this pa- because they can count longer, more specific phrases. How- per is available at https://github.com/orezpraw/ ever, this relationship between n and accuracy is also prob- MIT-Language-Modeling-Toolkit lematic because most n-grams will not exist in a corpus of human-generated text (or code). Therefore most n-grams Cross Entropy of Project Code compared to English would have a probability of zero for a large enough n. This Code To Code issue can be addressed by using a smoothed model. Smooth- Lucene to Lucene Xerces to Xerces ing increases the accuracy of the model by estimating the English to English probability of unseen n-grams from the probabilities of the largest m-grams (where m<n) that the n-gram consists of and that exist in the corpus. For example, if the corpus does not contain the phrase “I’m a little teapot” but it does contain the phrases“I’m a”and“little teapot”it would estimate the probability of“I’m a little teapot”using a function of the two probabilities it does know. In UnnaturalCode, however, the entropy is measured in bits. Entropy in bits is simply CrossEntropy S = − log2 (p) , where p is the probability.

Load more