TSE SPECIAL ISSUE ON THE ISSTA 2008 BEST PAPERS

Learning a Metric for Code Readability

Raymond P.L. Buse, Westley Weimer

Abstract—In this paper, we explore the concept of code readability and investigate its relation to software quality. With data collected from 120 human annotators, we derive associations between a simple set of local code features and human notions of readability. Using those features, we construct an automated readability measure and show that it can be 80% effective, and better than a human on average, at predicting readability judgments. Furthermore, we show that this metric correlates strongly with three measures of software quality: code changes, automated defect reports, and defect log messages. We measure these correlations on over 2.2 million lines of code, as well as longitudinally, over many releases of selected projects. Finally, we discuss the implications of this study on programming language design and engineering practice. For example, our data suggests that comments, in and of themselves, are less important than simple blank lines to local judgments of readability.

Index Terms—software readability, program understanding, machine learning, software maintenance, code metrics, FindBugs

(Buse and Weimer are with the Department of Computer Science at The University of Virginia, Charlottesville, VA 22904. E-mail: {buse, weimer}@cs.virginia.edu. This research was supported in part by, but may not reflect the positions of, National Science Foundation Grants CNS 0716478 and CNS 0905373, Air Force Office of Scientific Research grant FA9550-07-1-0532, and Microsoft Research gifts.)

1 INTRODUCTION

We define readability as a human judgment of how easy a text is to understand. The readability of a program is related to its maintainability, and is thus a key factor in overall software quality. Typically, maintenance will consume over 70% of the total lifecycle cost of a software product [4]. Aggarwal claims that source code readability and documentation readability are both critical to the maintainability of a project [1]. Other researchers have noted that the act of reading code is the most time-consuming component of all maintenance activities [8], [33], [35]. Readability is so significant, in fact, that Elshoff and Marcotty, after recognizing that many commercial programs were much more difficult to read than necessary, proposed adding a development phase in which the program is made more readable [10]. Knight and Myers suggested that one phase of software inspection should be a check of the source code for readability [22] to ensure maintainability, portability, and reusability of the code. Haneef proposed adding a dedicated readability and documentation group to the development team, observing that, "without established and consistent guidelines for readability, individual reviewers may not be able to help much" [16].

We hypothesize that programmers have some intuitive notion of this concept, and that program features such as indentation (e.g., as in Python [40]), choice of identifier names [34], and comments are likely to play a part. Dijkstra, for example, claimed that the readability of a program depends largely upon the simplicity of its sequencing control (e.g., he conjectured that goto unnecessarily complicates program understanding), and employed that notion to help motivate his top-down approach to system design [9].

We present a descriptive model of software readability based on simple features that can be extracted automatically from programs. This model of software readability correlates strongly with human annotators and also with external (widely available) notions of software quality, such as defect detectors and software changes.

To understand why an empirical and objective model of software readability is useful, consider the use of readability metrics in natural languages. The Flesch-Kincaid Grade Level [11], the Gunning-Fog Index [15], the SMOG Index [28], and the Automated Readability Index [21] are just a few examples of readability metrics for ordinary text. These metrics are all based on simple factors such as average syllables per word and average sentence length. Despite this simplicity, they have each been shown to be quite useful in practice. Flesch-Kincaid, which has been in use for over 50 years, has not only been integrated into popular text editors including Microsoft Word, but has also become a United States governmental standard. Agencies, including the Department of Defense, require many documents and forms, internal and external, to have a Flesch readability grade of 10 or below (DOD MIL-M-38784B). Defense contractors are also often required to use it when they write technical manuals. These metrics can help organizations gain some confidence that their documents meet goals for readability very cheaply, and have become ubiquitous for that reason. We believe that similar metrics, targeted specifically at source code and backed with empirical evidence for effectiveness, can serve an analogous purpose in the software domain. Readability metrics for niche areas such as computer-generated math [26], treemap layout [3], and hypertext [17] have been found useful. We describe the first general readability metric for source code.

It is important to note that readability is not the same as complexity, for which some existing metrics have been empirically shown useful [41]. Brooks claims that complexity is an "essential" property of software; it arises from system requirements and cannot be abstracted away [12]. In the Brooks model, readability is "accidental" because it is not determined by the problem statement. In principle, software engineers can only control accidental difficulties, implying that readability can be addressed more easily than intrinsic complexity. While software complexity metrics typically take into account the size of classes and methods, and the extent of their interactions, the readability of code is based primarily on local, line-by-line factors. Our notion of readability arises directly from the judgments of actual human annotators who do not have context for the code they are judging. Complexity factors, on the other hand, may have little relation to what makes code understandable to humans. Previous work [31] has shown that attempting to correlate artificial code complexity metrics directly to defects is difficult, but not impossible.

Although both involve local factors such as indentation, readability is also distinct from coding standards (e.g., [2], [6], [38]), conventions primarily intended to facilitate collaboration by maintaining uniformity between code written by different developers.

In this study, we have chosen to target readability directly, both because it is a concept that is independently valuable and because developers have great control over it. We show in Section 5 that there is indeed a significant correlation between readability and quality. The main contributions of this paper are:

• A technique for the construction of an automatic software readability metric based on local code features.
• A survey of 120 human annotators on 100 code snippets that forms the basis of the metric presented and evaluated in this paper. We are unaware of any published software readability study of comparable size (12,000 human judgments). We directly evaluate the performance of our model on this data set.
• A set of experiments which reveal significant correlations between our metric for readability and external notions of software quality, including defect density.
• A discussion of the features involved in our metric and their relation to software engineering and programming language design.
• … one in the previous presentation.
• An additional experiment correlating our readability metric with a more natural notion of software quality and defect density: explicit human mentions of bug repair. Using software version control repository information, we coarsely separate changes made to address bugs from other changes. We find that low readability correlates more strongly with this direct notion of defect density than it does with the previous approach of using potential bugs reported by static analysis tools.
• An additional experiment which compares readability to cyclomatic complexity. This experiment serves to validate our claim that our notion of readability is largely independent from traditional measures of code complexity.
• An additional longitudinal experiment showing how changes in readability can correlate with changes in defect density as a program evolves. For ten versions of each of five programs, we find that projects with unchanging readability have similarly unchanging defect densities, while a project that experienced a sharp drop in readability was subject to a corresponding rise in defect density.

There are a number of possible uses for an automated readability metric. It may help developers write more readable software by quickly identifying code that scores poorly. It can assist project managers in monitoring and maintaining readability. It can serve as a requirement for acceptance. It can even assist inspections by helping to target effort at parts of a program that may need improvement. Finally, it can be used by other static analyses to rank warnings or otherwise focus developer attention on sections of the code that are less readable and thus more likely to contain bugs. The readability metric presented in this paper has already been used successfully to aid a static specification mining tool [24]; in that setting, knowing whether a code path was readable or not had over twice the predictive power (as measured by ANOVA F-score) as knowing whether a code path was feasible or not.

The structure of this paper is as follows. In Section 2 we present a study of readability involving 120 human annotators. We present the results of that study in Section 3, and in Section 4 we determine a small set of features that is sufficient to capture the notion of readability for a majority of annotators. In Section 5 we discuss the correlation between our readability metric and external notions of software quality.
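The natural-language metrics discussed above are simple enough to compute directly. As an illustration (not part of this paper's own technique), the Flesch-Kincaid Grade Level combines exactly the two factors mentioned, average sentence length and average syllables per word, using its standard published coefficients:

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: a U.S. school-grade estimate of
    reading difficulty (higher means harder to read)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Short, monosyllabic sentences score very low (easy to read):
easy = flesch_kincaid_grade(words=6, sentences=1, syllables=6)

# Long sentences full of polysyllabic words score much higher (hard):
hard = flesch_kincaid_grade(words=44, sentences=2, syllables=79)
```

A document meets the DOD requirement cited above when its grade comes out at 10 or below; the appeal of such formulas, and the motivation for an analogous code metric, is that they are this cheap to evaluate.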
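The local, line-by-line character of such code features can be made concrete with a small sketch. The extractor below is illustrative only: it assumes a handful of plausible surface features (line lengths, blank lines, comment lines, identifier lengths) rather than the feature set actually derived in Section 4:

```python
import re

def local_features(snippet: str) -> dict:
    """Compute simple line-by-line surface features of a code snippet.
    The feature choice here is a hypothetical example, not the paper's."""
    lines = snippet.split("\n")
    n = len(lines)
    # Identifier-like tokens: a letter or underscore, then word characters.
    idents = re.findall(r"[A-Za-z_]\w*", snippet)
    return {
        "avg_line_length": sum(len(ln) for ln in lines) / n,
        "max_line_length": max(len(ln) for ln in lines),
        "blank_line_ratio": sum(1 for ln in lines if not ln.strip()) / n,
        "comment_line_ratio": sum(1 for ln in lines if ln.strip().startswith("//")) / n,
        "avg_identifier_length": sum(map(len, idents)) / len(idents) if idents else 0.0,
    }

snippet = (
    "int mid = lo + (hi - lo) / 2;\n"
    "\n"
    "// Narrow the search to one half.\n"
    "if (key < a[mid]) hi = mid;"
)
features = local_features(snippet)
```

A standard machine-learning classifier can then be trained on such per-snippet feature vectors against the averaged human readability judgments; note that every feature above is computable from the snippet alone, with no surrounding context, matching the annotation setting described in Section 2.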