Autofolding for Source Code Summarization

Jaroslav Fowkes∗, Pankajan Chanthirasegaran∗, Razvan Ranca†, Miltiadis Allamanis∗, Mirella Lapata∗ and Charles Sutton∗
∗School of Informatics, University of Edinburgh, Edinburgh, EH8 9AB, UK
{jaroslav.fowkes, pchanthi, m.allamanis, csutton}@ed.ac.uk; [email protected]
†Tractable, Oval Office, 11-12 The Oval, London, E2 9DT, UK
[email protected]
arXiv:1403.4503v5 [cs.SE] 6 Mar 2017

Abstract: Developers spend much of their time reading and browsing source code, raising new opportunities for summarization methods. Indeed, modern code editors provide code folding, which allows one to selectively hide blocks of code. However, this is impractical to use, as folding decisions must be made manually or based on simple rules. We introduce the autofolding problem, which is to automatically create a code summary by folding less informative code regions. We present a novel solution by formulating the problem as a sequence of AST folding decisions, leveraging a scoped topic model for code tokens. On an annotated set of popular open source projects, we show that our summarizer outperforms simpler baselines, yielding a 28% error reduction. Furthermore, we find through a case study that our summarizer is strongly preferred by experienced developers. More broadly, we hope this work will aid program comprehension by turning code folding into a usable and valuable tool.

I. Introduction

Engineering large software systems presents many challenges due to the inherent complexity of software. Because of this complexity, programmers tend to spend more time reading and browsing code than actually writing it [1], [2]. Despite much research [3], there is still a large need for better tools that aid program comprehension, thereby reducing the cost of software development.

A key insight is that source code is written to be understood not only by machines, but also by humans. Programmers devote significant time and attention to writing their code in an idiomatic and intuitive way that can be easily understood by others: source code is a means of human communication. This fact raises the intriguing possibility that technology from the natural language processing (NLP) community can be adapted to help developers make sense of large repositories of code.

Often during development and maintenance, developers skim the code in order to quickly understand it [4]. A good summary of the source code aims to support this use case: by eliding less important details, a summary can be easier to read quickly and can help the developer gain a high-level conceptual understanding of the code.

Source code summarization has potential for valuable applications in many software engineering tasks, such as: (a) Understanding new code bases. Often developers need to quickly familiarize themselves with the core parts of a large code base. This can happen when a developer is joining an existing project, or when a developer is evaluating whether to use a new software library. (b) Code reviews. Reviewers need to quickly understand the key changes before reviewing the details. (c) Locating relevant code segments. During program maintenance, developers often skim code, reading only a couple of lines at a time, while searching for a code region of interest [4].

For this reason, many code editors include a feature called code folding, which allows developers to selectively display or hide blocks of source code. This feature is commonly supported in editors and is familiar to developers [5]–[7]. But in current Integrated Development Environments (IDEs), folding quickly becomes impractical because the folding decisions must be made manually by the programmer, or based on simple rules that some IDEs apply automatically, such as folding code blocks based on depth [8]. This creates an obvious chicken-and-egg problem, because the developer must already understand the source file to decide what should be folded.

In this paper, we propose that code folding can be a valuable tool for aiding program comprehension, provided that folding decisions are made automatically based on the code's content. We consider the autofolding problem, in which the goal is to automatically create a code summary by folding non-essential code elements that are not useful on first viewing. To our knowledge, we are the first to systematically study and quantitatively compare different methods for the autofolding problem. An illustrative example is shown in Figure 1. To any Java developer, the function of the StatusLine constructor and of the clone, getCode, getReason and toString methods is obvious even without seeing their method bodies. One possible summary of this source file is shown in Figure 2.

     1  /* Header */
     2  package org.zoolu.sip.header;
     3
     4  /** SIP Status-line, i.e. the first
     5   * line of a response message */
     6  public class StatusLine {
     7    protected int code;
     8    protected String reason;
     9
    10    /** Construct StatusLine */
    11    public StatusLine(int c, String r) {
    12      code = c;
    13      reason = r;
    14    }
    15
    16    /** Create a new copy of the request-line */
    17    public Object clone() {
    18      return new StatusLine(getCode(), getReason());
    19    }
    20
    21    /** Indicates whether some other Object
    22     * is "equal to" this StatusLine */
    23    public boolean equals(Object obj) {
    24      try {
    25        StatusLine r = (StatusLine) obj;
    26        if (r.getCode() == getCode() &&
    27            r.getReason().equals(getReason()))
    28          return true;
    29        else
    30          return false;
    31      } catch (Exception e) {
    32        return false;
    33      }
    34    }
    35
    36    public int getCode() {
    37      return code;
    38    }
    39
    40    public String getReason() {
    41      return reason;
    42    }
    43
    44    public String toString() {
    45      return "SIP/2.0 " + code + " " + reason + "\r\n";
    46    }
    47  }

Figure 1: Original source code. A snippet from bigbluebutton's StatusLine.java. We use this as a running example.

     1  /* Header...*/
     2  package org.zoolu.sip.header;
     3
     4  /** SIP Status-line, i.e. the first...*/
     6  public class StatusLine {
     7    protected int code;
     8    protected String reason;
     9
    10    /** Construct StatusLine...*/
    11    public StatusLine(int c, String r) {...}
    15
    16    /** Create a new copy of the request-line..*/
    17    public Object clone() {...}
    20
    21    /** Indicates whether some other Object...*/
    23    public boolean equals(Object obj) {
    24      try {
    25        StatusLine r = (StatusLine) obj;
    26        if (r.getCode() == getCode() &&
    27            r.getReason().equals(getReason()))
    28          return true;
    29        else
    30          return false;
    31      } catch (Exception e) {...}
    34    }
    35
    36    public int getCode() {...}
    39
    40    public String getReason() {...}
    43
    44    public String toString() {...}
    47  }

Figure 2: A summary of the file in Figure 1 which results from folding lines 1, 4–5, 11–14, 21–22, 31–33, 36–38 and 40–42. The ellipses indicate folded segments of code.

The key problem in content-based autofolding is to determine which tokens in a file are most representative of its content. We compare two different content models for this task: a simple vector space model (VSM) and a topic model that, building on work in NLP summarization [9], endows different scopes (files, projects, and the corpus) with separate topics, allowing the model to separate out those tokens that are used most often in a particular file. We find that the summaries from the topic model are significantly better than those from the VSM.

Previous work in code summarization has considered summarization using: (a) program slicing (i.e. hiding irrelevant lines of code for a chosen program path) [10], [11]; (b) natural language paraphrases [12], [13]; (c) short lists of keywords [14]–[17]; or (d) (potentially discontiguous) lines of code that match a user's query [18]. In contrast, our work is based on the idea that an effective summary can be obtained by carefully folding the original file, that is, by summarizing code with code. Our main contributions in this paper are:

• We introduce a novel autofolding method for source code summarization, called TASSAL (https://github.com/mast-group/tassal), based on optimizing the similarity between the summary and the source file. Because of certain constraints among the folding decisions, we formulate this method as a contiguous rooted subtree problem (Section III-C). This is, to our knowledge, the first content-based autofolding method for code summarization.
• To determine which non-essential regions should be folded, we introduce a novel topic model for code (Section III-B), building on machine learning methods used in NLP [9], which separates tokens according to whether they best characterize their file, their project, or the corpus as a whole. This allows TASSAL summaries to focus on file-specific tokens.
• We perform a comprehensive evaluation of our method on a set of popular open source projects from GitHub (Section IV), and find that TASSAL performs better than simpler baselines (Section V) at matching human judgements, with a relative error reduction of 28%. Furthermore, in a user study with experienced developers, TASSAL is strongly preferred to the baselines.
• We created a live demo of TASSAL [19] to showcase how it can be used to summarize open-source Java projects on GitHub. Our demo can be found at http://groups.inf.ed.ac.uk/cup/tassal/demo.html and a video highlighting the main features of TASSAL can be found at https://youtu.be/_yu7JZgiBA4.

More broadly, we hope that this work will aid program comprehension by turning code folding, perhaps an overlooked feature, into a useful, usable and valuable tool.

II. Related Work

The application of NLP methods to the analysis of source code text is only just beginning to be explored. Recent work has applied language modelling [20]–[24], natural language generation [12], [25], machine translation [26], and topic modelling [27] to the text of source code from large software projects. A main challenge in this area is to adapt existing NLP techniques to source code text.

[...] our work, we summarize code with code, which we would argue has the potential to provide a much richer and more [...]
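To make the idea of summarizing code with code concrete, the mechanical step behind a listing like Figure 2 (collapsing chosen line ranges and marking them with an ellipsis) can be sketched as below. This is only an illustrative sketch, not the TASSAL implementation: the class `FoldSketch`, the method `applyFolds`, and the representation of folds as sorted, non-overlapping `{start, end}` line pairs are all assumptions made for this example, and the sketch does not decide which regions to fold, which is the actual autofolding problem.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of TASSAL: collapses chosen line ranges of a listing.
class FoldSketch {

    /**
     * Returns a copy of lines in which every fold {start, end} (1-indexed,
     * inclusive, assumed sorted and non-overlapping) is collapsed to its
     * first line followed by an ellipsis marker.
     */
    static List<String> applyFolds(List<String> lines, int[][] folds) {
        List<String> summary = new ArrayList<>();
        int next = 0;   // index of the next fold to apply
        int lineNo = 1; // current 1-indexed line number
        while (lineNo <= lines.size()) {
            if (next < folds.length && folds[next][0] == lineNo) {
                // Keep the first line of the folded region and mark it.
                summary.add(lines.get(lineNo - 1) + "...");
                lineNo = folds[next][1] + 1; // skip the hidden lines
                next++;
            } else {
                summary.add(lines.get(lineNo - 1));
                lineNo++;
            }
        }
        return summary;
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList(
            "public int getCode() {",
            "    return code;",
            "}",
            "",
            "public String getReason() {",
            "    return reason;",
            "}");
        // Fold both method bodies (lines 1-3 and 5-7 of this small listing).
        for (String line : applyFolds(source, new int[][] {{1, 3}, {5, 7}})) {
            System.out.println(line);
        }
    }
}
```

Running `main` prints the two method signatures, each ending in the ellipsis marker and separated by the unfolded blank line. A content-based autofolder would replace the hard-coded ranges with ranges chosen by a model of which code regions are least informative.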
