Learning Natural Coding Conventions
Miltiadis Allamanis

Doctor of Philosophy
Institute for Adaptive and Neural Computation
School of Informatics
University of Edinburgh
2016

Abstract

Coding conventions are ubiquitous in software engineering practice. Maintaining a uniform coding style allows software development teams to communicate through code by making the code clear and, thus, readable and maintainable, two important properties of good code since developers spend the majority of their time maintaining software systems. This dissertation introduces a set of probabilistic machine learning models of source code that learn coding conventions directly from source code written in a mostly conventional style. This alleviates the coding convention enforcement problem, where conventions need to first be formulated clearly into unambiguous rules and then be coded in order to be enforced, a tedious and costly process.

First, we introduce the problem of inferring a variable’s name given its usage context and address this problem by creating Naturalize, a machine learning framework that learns to suggest conventional variable names. Two machine learning models, a simple n-gram language model and a specialized neural log-bilinear context model, are trained to understand the role and function of each variable and suggest new stylistically consistent variable names. The neural log-bilinear model can even suggest previously unseen names by composing them from subtokens (i.e. sub-components of code identifiers). The suggestions of the models achieve 90% accuracy when suggesting variable names at the top 20% most confident locations, rendering the suggestion system usable in practice.

We then turn our attention to the significantly harder method naming problem.
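As a rough illustration of the n-gram idea behind variable name suggestion, the sketch below scores candidate names by how probable the surrounding token sequence becomes once the name is filled in. It is a minimal sketch under stated assumptions, not the Naturalize implementation: the function names, the `VAR` placeholder, the add-one smoothing and the toy corpus are all hypothetical choices made for illustration.

```python
from collections import Counter

def train_trigram_counts(token_sequences):
    """Count trigrams and their bigram prefixes over tokenized code."""
    trigrams, bigrams = Counter(), Counter()
    for tokens in token_sequences:
        padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
        for i in range(2, len(padded)):
            trigrams[tuple(padded[i - 2:i + 1])] += 1
            bigrams[tuple(padded[i - 2:i])] += 1
    return trigrams, bigrams

def sequence_score(tokens, trigrams, bigrams, vocab_size):
    """Add-one smoothed trigram probability of a whole token sequence."""
    padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
    score = 1.0
    for i in range(2, len(padded)):
        tri = tuple(padded[i - 2:i + 1])
        score *= (trigrams[tri] + 1) / (bigrams[tri[:2]] + vocab_size)
    return score

def suggest_name(tokens, placeholder, candidates, trigrams, bigrams, vocab_size):
    """Rank candidate names by the score of the snippet with each name filled in."""
    def fill(name):
        return [name if t == placeholder else t for t in tokens]
    return sorted(
        candidates,
        key=lambda name: sequence_score(fill(name), trigrams, bigrams, vocab_size),
        reverse=True,
    )

# Toy corpus (assumption): whitespace-tokenized loop headers.
corpus = [
    "for ( int i = 0 ; i < n ; i ++ )".split(),
    "for ( int i = 0 ; i < len ; i ++ )".split(),
    "for ( int j = 0 ; j < n ; j ++ )".split(),
]
trigrams, bigrams = train_trigram_counts(corpus)
vocab = {t for seq in corpus for t in seq} | {"</s>"}

# VAR marks every occurrence of the variable whose name is being suggested.
snippet = "for ( int VAR = 0 ; VAR < n ; VAR ++ )".split()
ranked = suggest_name(snippet, "VAR", ["i", "j", "n"], trigrams, bigrams, len(vocab))
```

On this toy corpus the name used most often in the same context ranks first. A practical system would also need a confidence threshold so that it only makes suggestions where the model is sure, in the spirit of the "most confident locations" mentioned above.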
Learning to name methods, by looking only at the code tokens within their body, requires a good understanding of the semantics of the code contained in a single method. To achieve this, we introduce a novel neural convolutional attention network that learns to generate the name of a method by sequentially predicting its subtokens. This is achieved by focusing on different parts of the code and potentially directly using body (sub)tokens even when they have never been seen before. This model achieves an F1 score of 51% on the top five suggestions when naming methods of real-world open-source projects.

Learning about naming code conventions uses the syntactic structure of the code to infer names that implicitly relate to code semantics. However, syntactic similarities and differences obscure code semantics. Therefore, to capture features of semantic operations with machine learning, we need methods that learn semantic continuous logical representations. To achieve this ambitious goal, we focus our investigation on logic and algebraic symbolic expressions and design a neural equivalence network architecture that learns semantic vector representations of expressions in a syntax-driven way, while solely retaining semantics. We show that equivalence networks learn significantly better semantic vector representations compared to other existing neural network architectures.

Finally, we present an unsupervised machine learning model for mining syntactic and semantic code idioms. Code idioms are conventional “mental chunks” of code that serve a single semantic purpose and are commonly used by practitioners. To achieve this, we employ Bayesian nonparametric inference on tree substitution grammars. We present a wide range of evidence that the resulting syntactic idioms are meaningful, demonstrating that they do indeed recur across software projects and that they occur more frequently in illustrative code examples collected from a Q&A site.
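To make the notion of a recurring syntactic fragment concrete, the sketch below counts how many snippets contain each statement-level AST shape, with identifiers and literals abstracted away. This is a crude frequency count swapped in for illustration only; it is not the Bayesian nonparametric inference over tree substitution grammars described above. The helper names `fragment_shape` and `count_fragments`, the depth limit and the toy snippets are hypothetical.

```python
import ast
from collections import Counter

def fragment_shape(node, depth=2):
    """Serialize a node as its type plus child shapes, dropping names and literals."""
    name = type(node).__name__
    if depth == 0:
        return name
    children = [fragment_shape(child, depth - 1) for child in ast.iter_child_nodes(node)]
    return f"{name}({', '.join(children)})" if children else name

def count_fragments(snippets, depth=2):
    """For each statement-level shape, count the number of snippets containing it."""
    counts = Counter()
    for source in snippets:
        shapes = {
            fragment_shape(node, depth)
            for node in ast.walk(ast.parse(source))
            if isinstance(node, ast.stmt)
        }
        counts.update(shapes)  # a set, so each snippet contributes at most once
    return counts

# Toy "projects" (assumption): two accumulate-in-a-loop snippets and one unrelated one.
snippets = [
    "total = 0\nfor x in xs:\n    total += x\n",
    "acc = 0\nfor item in items:\n    acc += item\nprint(acc)\n",
    "with open(path) as f:\n    data = f.read()\n",
]
counts = count_fragments(snippets)

# The shape of an accumulate loop, independent of the variable names used.
loop_shape = fragment_shape(ast.parse("for a in b:\n    c += a\n").body[0])
recurring = [shape for shape, n in counts.items() if n >= 2]
```

Here the accumulate loop has the same shape in the first two snippets despite different variable names, so it is reported as recurring. Unlike this fixed-depth count, a tree substitution grammar can learn fragments of varying size and trades off fragment size against frequency.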
These syntactic idioms can be used as a form of automatic documentation of coding practices of a programming language or an API. We also mine semantic loop idioms, i.e. highly abstracted but semantics-preserving idioms of loop operations. We show that semantic idioms provide data-driven guidance during the creation of software engineering tools by mining common semantic patterns, such as candidate refactoring locations. This gives tool, API and language designers data-based evidence about general, domain-specific and project-specific coding patterns; instead of relying solely on their intuition, they can use semantic idioms to achieve greater coverage of their tool, new API or language feature. We demonstrate this by creating a tool that suggests loop refactorings into functional constructs in LINQ. Semantic loop idioms also provide data-driven evidence for introducing new APIs or programming language features.

Lay Summary

Software systems are made out of source code that defines in a formal and unambiguous way the instructions that a computer needs to execute. Source code is a core artifact of the software engineering process. However, since software systems need to be maintained and extended, source code needs to be frequently revisited by software engineers who need to read, understand and maintain the code. To this effect, source code acts as a means of communication between software developers and therefore needs to be easily understandable (and thus easily modifiable). To achieve this, software teams enforce, implicitly and explicitly, a set of coding conventions, i.e. a set of self-imposed restrictions on how source code is written. These conventions are not a product of any technical constraints or limitations but are imposed for efficient developer communication through source code. One important coding convention is related to naming software artifacts. The names need to clearly reveal the role and the function of each code artifact.
Other conventions include the idiomatic use of source code constructs. These idioms convey easily understandable semantics and therefore aid humans when reasoning about code functionality.

This thesis presents an automated way for inferring and enforcing coding conventions to help software engineers write conventional and thus more maintainable code. To achieve this, we use machine learning, a set of statistical and mathematical modeling methods whose parameters are learned from data and can be used to make “smart” predictions about previously unseen observations. Specifically, this thesis presents machine learning models that learn to suggest conventional names for software engineering artifacts. This task requires novel machine learning models that “understand” the role and the function of the source code artifacts and how they compose to provide a distinct functionality.

In addition, this dissertation presents a machine learning-based method that automatically finds widely used source code idioms from a large set of source code. Code idioms are “mental chunks” of code that serve a single, easily identifiable semantic purpose. The mined idioms serve as a form of documentation of how code libraries and programming language constructs are used. Finally, we mine semantic idioms, mental chunks of code that are not syntactic but represent common types of operations. We show how these idioms can be used within software engineering tools and to support the evolution of programming languages.

Acknowledgements

When writing an acknowledgments section, one has to decide between being brief but vague or exhaustive and specific. I will pick the latter since I feel it is the only way to fully express my gratitude to all the people who have helped in many different ways during the last few years.

This PhD thesis would not have been possible without the constant help from my PhD advisor, Charles Sutton.
We have spent hundreds of hours in discussions and emails about research projects, while he patiently taught me how to tackle hard problems and acquire a “taste” for research problems. Without his visionary understanding of the field and his belief that great research impact is possible, this dissertation would not be in its present state. I would also like to thank Earl T. Barr, who although not officially related to my PhD acted as a remote PhD advisor, frequently chatting about new ideas, while he patiently explained to me programming language and software engineering concepts. Although he is based at UCL, his support was vital throughout this PhD.

This PhD has been kindly and generously supported by Microsoft Research through its scholarship program, thanks to the Edinburgh Microsoft Research Joint Initiative in Informatics. The scholarship has funded my PhD studies for the first three years. It also funded my travel expenses to conferences at amazing places all over the world. I am also grateful to Microsoft Research for the great experiences during my two internships in Cambridge, UK and Redmond, WA, USA. I would like to specially thank Danny Tarlow, Andrew D. Gordon, Christian Bird and Mark Marron for their guidance throughout the internships and thereafter, which significantly helped me. My interactions with them led to important adjustments to the course of this dissertation.

I would like to also thank Premkumar Devanbu and Pushmeet Kohli for their valuable help, advice and feedback. I am also grateful to Mirella Lapata, Shay Cohen, Jaroslav Fowkes, Krzysztof Geras, Akash Srivastava, Pankajan Chanthirasegaran and the members of CUP, IANC and ILCC for the numerous discussions and feedback that I have received over the last few years.

This dissertation was only possible thanks to all the people and friends that have made me who I am; unfortunately I cannot list them all here.
However, I want to specially thank Stella for making life fun and interesting for the last three years. Finally, and most importantly, I am grateful to my parents, Aleka and Nikos, who have patiently taught me so many things and have been a constant help, support and inspiration. This thesis is dedicated to them.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.