Learning Natural Coding Conventions

Miltiadis Allamanis

Doctor of Philosophy
Institute for Adaptive and Neural Computation
School of Informatics
University of Edinburgh
2016

Abstract

Coding conventions are ubiquitous in software engineering practice. Maintaining a uniform coding style allows software development teams to communicate through code by making the code clear and, thus, readable and maintainable. These are two important properties of good code, since developers spend the majority of their time maintaining software systems. This dissertation introduces a set of probabilistic machine learning models of source code that learn coding conventions directly from source code written in a mostly conventional style. This alleviates the coding convention enforcement problem, where conventions need to first be formulated clearly into unambiguous rules and then be coded in order to be enforced, a tedious and costly process.

First, we introduce the problem of inferring a variable’s name given its usage context and address this problem by creating Naturalize, a machine learning framework that learns to suggest conventional variable names. Two machine learning models, a simple n-gram language model and a specialized neural log-bilinear context model, are trained to understand the role and function of each variable and suggest new, stylistically consistent variable names. The neural log-bilinear model can even suggest previously unseen names by composing them from subtokens (i.e. sub-components of code identifiers). The suggestions of the models achieve 90% accuracy when suggesting variable names at the top 20% most confident locations, rendering the suggestion system usable in practice.

We then turn our attention to the significantly harder method naming problem.
Learning to name methods, by looking only at the code tokens within their body, requires a good understanding of the semantics of the code contained in a single method. To achieve this, we introduce a novel neural convolutional attention network that learns to generate the name of a method by sequentially predicting its subtokens. This is achieved by focusing on different parts of the code and potentially directly using body (sub)tokens even when they have never been seen before. This model achieves an F1 score of 51% on the top five suggestions when naming methods of real-world open-source projects.

Learning about naming code conventions uses the syntactic structure of the code to infer names that implicitly relate to code semantics. However, syntactic similarities and differences obscure code semantics. Therefore, to capture features of semantic operations with machine learning, we need methods that learn semantic continuous logical representations. To achieve this ambitious goal, we focus our investigation on logic and algebraic symbolic expressions and design a neural equivalence network architecture that learns semantic vector representations of expressions in a syntax-driven way, while solely retaining semantics. We show that equivalence networks learn significantly better semantic vector representations compared to other, existing, neural network architectures.

Finally, we present an unsupervised machine learning model for mining syntactic and semantic code idioms. Code idioms are conventional “mental chunks” of code that serve a single semantic purpose and are commonly used by practitioners. To mine them, we employ Bayesian nonparametric inference on tree substitution grammars. We present a wide range of evidence that the resulting syntactic idioms are meaningful, demonstrating that they do indeed recur across software projects and that they occur more frequently in illustrative code examples collected from a Q&A site. These syntactic idioms can be used as a form of automatic documentation of coding practices of a programming language or an API. We also mine semantic loop idioms, i.e. highly abstracted but semantic-preserving idioms of loop operations. We show that semantic idioms provide data-driven guidance during the creation of software engineering tools by mining common semantic patterns, such as candidate refactoring locations. This gives tool, API and language designers data-based evidence about general, domain-specific and project-specific coding patterns: instead of relying solely on their intuition, they can use semantic idioms to achieve greater coverage of their tool, new API or language feature. We demonstrate this by creating a tool that suggests loop refactorings into functional constructs in LINQ. Semantic loop idioms also provide data-driven evidence for introducing new APIs or programming language features.

Lay Summary

Software systems are made out of source code that defines in a formal and unambiguous way the instructions that a computer needs to execute. Source code is a core artifact of the software engineering process. However, since software systems need to be maintained and extended, source code needs to be frequently revisited by software engineers who need to read, understand and maintain the code. In effect, source code acts as a means of communication between software developers and therefore needs to be easily understandable (and thus easily modifiable). To achieve this, software teams enforce, implicitly and explicitly, a set of coding conventions, i.e. a set of self-imposed restrictions on how source code is written. These conventions are not a product of any technical constraints or limitations but are imposed for efficient developer communication through source code. One important coding convention is related to naming software artifacts. The names need to clearly reveal the role and the function of each code artifact.
Other conventions include the idiomatic use of source code constructs. These idioms convey easily understandable semantics and therefore aid humans when reasoning about code functionality.

This thesis presents an automated way of inferring and enforcing coding conventions to help software engineers write conventional and thus more maintainable code. To achieve this, we use machine learning, a set of statistical and mathematical modeling methods whose parameters are learned from data and that can be used to make “smart” predictions about previously unseen observations. Specifically, this thesis presents machine learning models that learn to suggest conventional names for software engineering artifacts. This task requires novel machine learning models that “understand” the role and the function of the source code artifacts and how they compose to provide a distinct functionality.

In addition, this dissertation presents a machine learning-based method that automatically finds widely used source code idioms from a large set of source code. Code idioms are “mental chunks” of code that serve a single, easily identifiable semantic purpose. The mined idioms serve as a form of documentation of how code libraries and programming language constructs are used. Finally, we mine semantic idioms, mental chunks of code that are not syntactic but represent common types of operations. We show how these idioms can be used within software engineering tools and to support the evolution of programming languages.

Acknowledgements

When writing an acknowledgments section, one has to decide between being brief but vague or exhaustive and specific. I will pick the latter since I feel it is the only way to fully express my gratitude to all the people that have helped in many different ways during the last few years.

This PhD thesis would not have been possible without the constant help from my PhD advisor, Charles Sutton.
We have spent hundreds of hours in discussions and emails about research projects, while he patiently taught me how to tackle hard problems and acquire a “taste” for research problems. Without his visionary understanding of the field and his belief that great research impact is possible, this dissertation would not be in its present state. I would also like to thank Earl T. Barr, who, although not officially related to my PhD, acted as a remote PhD advisor, frequently chatting about new ideas, while he patiently explained to me programming language and software engineering concepts. Although he is based at UCL, his support was vital throughout this PhD.

This PhD has been kindly and generously supported by Microsoft Research through its scholarship program, thanks to the Edinburgh Microsoft Research Joint Initiative in Informatics. The scholarship has funded my PhD studies for the first three years. It also funded my travel expenses to conferences at amazing places all over the world. I am also grateful to Microsoft Research for the great experiences during my two internships in Cambridge, UK and Redmond, WA, USA. I would like to specially thank Danny Tarlow, Andrew D. Gordon, Christian Bird and Mark Marron for their guidance throughout the internships and thereafter, which helped me significantly. My interactions with them led to important adjustments to the course of this dissertation.

I would like to also thank Premkumar Devanbu and Pushmeet Kohli for their valuable help, advice and feedback. I am also grateful to Mirella Lapata, Shay Cohen, Jaroslav Fowkes, Krzysztof Geras, Akash Srivastava, Pankajan Chanthirasegaran and the members of CUP, IANC and ILCC for the numerous discussions and feedback that I have received over the last few years.

This dissertation was only possible thanks to all the people and friends that have made me who I am; unfortunately I cannot list them all here.
However, I want to specially thank Stella for making life fun and interesting for the last three years. Finally, and most importantly, I am grateful to my parents, Aleka and Nikos, who have patiently taught me so many things and have been a constant help, support and inspiration. This thesis is dedicated to them.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.
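
As a small illustration of the subtoken notion used in the abstract above (identifiers decomposed into sub-components, and method names predicted one subtoken at a time), the following Python sketch splits identifiers into subtokens and scores a suggested name against the true name with a subtoken-level F1. The function names `split_subtokens` and `subtoken_f1`, the splitting rule, and the choice to compute F1 over subtoken multisets are assumptions made for this example only; the thesis's own tokenization and evaluation metric may differ.

```python
import re
from collections import Counter
from typing import List


def split_subtokens(identifier: str) -> List[str]:
    """Split a camelCase or snake_case identifier into lowercase subtokens.

    Hypothetical splitting rule for illustration: runs of capitals
    (acronyms), capitalized or lowercase words, and digit runs.
    """
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [part.lower() for part in parts]


def subtoken_f1(predicted: str, actual: str) -> float:
    """F1 between the subtoken multisets of a predicted and an actual name."""
    pred = Counter(split_subtokens(predicted))
    gold = Counter(split_subtokens(actual))
    # Multiset intersection: each shared subtoken counts min(pred, gold) times.
    overlap = sum((pred & gold).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(split_subtokens("getInputStream"))  # ['get', 'input', 'stream']
    # Suggested name shares 2 of its 2 subtokens with the 3-subtoken true name.
    print(round(subtoken_f1("getStream", "getInputStream"), 2))  # 0.8
```

Scoring on subtokens rather than whole names gives partial credit to a suggestion such as `getStream` for `getInputStream`, which an exact-match metric would count as a complete miss.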
Recommended publications
  • Java Programming Standards & Reference Guide
    Java Programming Standards & Reference Guide Version 3.2 Office of Information & Technology Department of Veterans Affairs Java Programming Standards & Reference Guide, Version 3.2 REVISION HISTORY DATE VER. DESCRIPTION AUTHOR CONTRIBUTORS 10-26-15 3.2 Added Logging Sid Everhart JSC Standards , updated Vic Pezzolla checkstyle installation instructions and package name rules. 11-14-14 3.1 Added ground rules for Vic Pezzolla JSC enforcement 9-26-14 3.0 Document is continually Raymond JSC and several being edited for Steele OI&T noteworthy technical accuracy and / PD Subject Matter compliance to JSC Experts (SMEs) standards. 12-1-09 2.0 Document Updated Michael Huneycutt Sr 4-7-05 1.2 Document Updated Sachin Mai L Vo Sharma Lyn D Teague Rajesh Somannair Katherine Stark Niharika Goyal Ron Ruzbacki 3-4-05 1.0 Document Created Sachin Sharma i Java Programming Standards & Reference Guide, Version 3.2 ABSTRACT The VA Java Development Community has been establishing standards, capturing industry best practices, and applying the insight of experienced (and seasoned) VA developers to develop this “Java Programming Standards & Reference Guide”. The Java Standards Committee (JSC) team is encouraging the use of CheckStyle (in the Eclipse IDE environment) to quickly scan Java code, to locate Java programming standard errors, find inconsistencies, and generally help build program conformance. The benefits of writing quality Java code infused with consistent coding and documentation standards is critical to the efforts of the Department of Veterans Affairs (VA). This document stands for the quality, readability, consistency and maintainability of code development and it applies to all VA Java programmers (including contractors).
  • DevSecOps in Regulated Industries
    DEVSECOPS IN REGULATED INDUSTRIES ACCELERATING SOFTWARE RELIABILITY & COMPLIANCE
    TABLE OF CONTENTS: Executive Summary; Introduction; Impediments to DevSecOps Adoption; Playbook for DevSecOps Adoption; Conclusion
    EXECUTIVE SUMMARY
    DevOps practices enable rapid product engineering delivery and operations, particularly by agile teams using lean practices. There is an evolution from DevOps to DevSecOps, which is at the intersection of development, operations, and security. Security cannot be added after product development is complete and security testing cannot be done as a once-per-release cycle activity. Shifting security left implies integration of security at all stages of the Software Development Life Cycle (SDLC). Adoption of DevSecOps practices enables faster, more reliable and more secure software. While DevSecOps emerged from Internet and software companies, it can benefit other industries, including regulated and high security environments. This whitepaper covers how incorporating DevSecOps in regulated industries can accelerate software delivery, reducing the time from code change to production deployment or release while reducing security risks. This whitepaper defines a playbook for DevSecOps goals, addresses challenges, and discusses evolving workflows in DevSecOps, including cloud, agile, application modernization and digital transformation. Bi-directional requirement traceability, document generation and security tests should be part of the CI/CD pipeline. Regulated industries can securely move away
  • Advanced Tcl
    PART II Advanced Tcl
    Part II describes advanced programming techniques that support sophisticated applications. The Tcl interfaces remain simple, so you can quickly construct powerful applications. Chapter 10 describes eval, which lets you create Tcl programs on the fly. There are tricks with using eval correctly, and a few rules of thumb to make your life easier. Chapter 11 describes regular expressions. This is the most powerful string processing facility in Tcl. This chapter includes a cookbook of useful regular expressions. Chapter 12 describes the library and package facility used to organize your code into reusable modules. Chapter 13 describes introspection and debugging. Introspection provides information about the state of the Tcl interpreter. Chapter 14 describes namespaces that partition the global scope for variables and procedures. Namespaces help you structure large Tcl applications. Chapter 15 describes the features that support Internationalization, including Unicode, other character set encodings, and message catalogs. Chapter 16 describes event-driven I/O programming. This lets you run process pipelines in the background. It is also very useful with network socket programming, which is the topic of Chapter 17. Chapter 18 describes TclHttpd, a Web server built entirely in Tcl. You can build applications on top of TclHttpd, or integrate the server into existing applications to give them a web interface. TclHttpd also supports regular Web sites. Chapter 19 describes Safe-Tcl and using multiple Tcl interpreters. You can create multiple Tcl interpreters for your application. If an interpreter is safe, then you can grant it restricted functionality.
  • OHD C++ Coding Standards and Guidelines
    National Weather Service/OHD Science Infusion and Software Engineering Process Group (SISEPG) – C++ Programming Standards and Guidelines
    NATIONAL WEATHER SERVICE OFFICE of HYDROLOGIC DEVELOPMENT Science Infusion Software Engineering Process Group (SISEPG) C++ Programming Standards and Guidelines, Version 1.11, 11/17/2006
    1. Introduction
    2. Standards
      2.1 File Names
      2.2 File Organization
      2.3 Include Files
      2.4 Comments
      2.5 Naming Schemes
      2.6 Readability and Maintainability
        2.6.1 Indentation
        2.6.2 Braces
  • Theta Engineering Firmware Coding Conventions
    Theta Engineering Firmware Coding Conventions
    Best Practices
    What constitutes “best practice” in software development is an ongoing topic of debate in industry and academia. Nevertheless, certain principles have emerged over the years as being sound and beneficial. In taking a conservative stance on this topic, we will avoid the most recent and contentious ideas and stick with the ones that have withstood the test of time. The principles we will use in the development of firmware are:
    o Object oriented design – Even though we are not programming in an “object oriented language” per se, the principles of object oriented design are still applicable. We will use a C module to correspond to an “object”, meaning a body of code that deals with a specific item or conceptually small zone of functionality, and encapsulates the code and data into that module.
    o Separation of interface and implementation – Each module will have a .c file that comprises the implementation and a .h file specifying the interface. Coding details and documentation pertaining to the implementation should be confined to the .c file, while items pertaining to the interface should be in the .h file.
    o Encapsulation – Each module should encapsulate all code and data pertaining to its zone of responsibility. Each module will be self-contained and complete. Access to internal variables if necessary will be provided through published methods as described in the header file for the module. A module may use other appropriate modules in order to do its job, but may do so only through the published interface of those modules.
  • Tsduck Coding Guidelines
    TSDuck Coding Guidelines, Version 3.2, April 2021
    License
    TSDuck is released under the terms of the license which is commonly referred to as "BSD 2-Clause License" or "Simplified BSD License" or "FreeBSD License". See http://opensource.org/licenses/BSD-2-Clause.
    Copyright (c) 2005-2021, Thierry Lelégard
    All rights reserved.
    Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
    1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
    2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
  • Towards a Structured Specification of Coding Conventions
    Towards a Structured Specification of Coding Conventions
    Elder Rodrigues Jr. and Leonardo Montecchi, Instituto de Computação, Universidade Estadual de Campinas, Campinas, SP, Brazil. [email protected], [email protected]
    Abstract: Coding conventions are a means to improve the reliability of software systems. They can be established for many reasons, ranging from improving the readability of code to avoiding the introduction of security flaws. However, coding conventions often come in the form of textual documents in natural language, which makes them hard to manage and to enforce. Following model-driven engineering principles, in this paper we propose an approach and language for specifying coding conventions using structured models. We ran a feasibility study, in which we applied our language for specifying 215 coding rules from two popular rulesets. The obtained results are promising and suggest that the proposed approach is feasible. However, they also highlight that many challenges still need to be overcome. We conclude with an overview on the ongoing work for
    check similarity between rules, ii) identify conflicting rules, iii) understand if a tool is able to check a certain rule, iv) configure a tool to check a certain rule, etc.
    Following the principles behind Model-Driven Engineering (MDE) [20], all the artifacts in the software development process, thus including coding conventions, should be represented as structured models, to increase the degree of automation, improve integration, and reduce the possibility of human mistakes. However, to the best of our knowledge, there is little work on this topic in the literature.
    In this paper we investigate the possibility of specifying coding conventions through structured, machine-readable, models.
  • Practical Programming in Tcl and Tk
    Practical Programming in Tcl and Tk
    Brent Welch
    DRAFT, January 13, 1995. Updated for Tcl 7.4 and Tk 4.0.
    THIS IS NOT THE PUBLISHED TEXT. THE INDEX IS INCOMPLETE. SOME SECTIONS ARE MISSING. THE MANUSCRIPT HAS NOT BEEN EDITED. GET THE REAL BOOK: ISBN 0-13-182007-9
    An enhanced version of this text has been published by Prentice Hall: ISBN 0-13-182007-9
    Send comments via email to [email protected] with the word “book” in the subject.
    http://www.sunlabs.com/~bwelch/book/index.html
    The book is under copyright. Print for personal use only. This on-line DRAFT is available courtesy of the kind folks at PH.
    Table of Contents
    1. Tcl Fundamentals
      Getting Started
      Tcl Commands
      Hello World
      Variables
      Command Substitution
      Math Expressions
      Backslash Substitution
      Double Quotes
      Procedures
      A While Loop Example
      Grouping And Command Substitution
      More About Variable Substitution
  • Coding Conventions for C++ and Java Applications - Macadamian Technologies Inc
    Coding Conventions for C++ and Java applications - Macadamian Technologies Inc
    Table of Contents
    Source Code Organization
      ● Files and project organization
      ● Header Files
    Naming Conventions
      ● Function Names
      ● Class Names
      ● Variable Names
    Source Documentation
      ● Module Comments and Revision History
      ● Commenting Data Declarations
      ● Commenting Control Structures
      ● Commenting Routines
    Programming Conventions
      ● Use of Macros
      ● Constants and Enumerations
      ● Use of return, goto and throw for flow control
      ● Error Handling
      ● Style and Layout
    Testing/Debug Support
      ● Class Invariant
      ● Assertions and Defensive Programming
      ● Validation Tests
      ● Tracing
    Conclusion
      ● Glossary
      ● References
      ● History
      ● Other Coding Conventions on the Web
    Source Code Organization
    Files and project organization
    The name of files can be more than eight characters, with a mix of upper case and lower-case. The name of files should reflect the content of the file as clearly as possible. As a rule of thumb, files containing class definitions and implementations should contain only one class. The name of the file should be the same as the name of the class. Files can contain more than one class when inner classes or private classes are used.
  • Gamification for Enforcing Coding Conventions
    Gamification for Enforcing Coding Conventions
    Christian R. Prause, DLR Space Administration, Königswinterer Str. 522-524, Bonn, Germany, [email protected]
    Matthias Jarke, RWTH Aachen, Institut i5, Aachen, Germany, [email protected]
    ABSTRACT
    Software is a knowledge intensive product, which can only evolve if there is effective and efficient information exchange between developers. Complying to coding conventions improves information exchange by improving the readability of source code. However, without some form of enforcement, compliance to coding conventions is limited. We look at the problem of information exchange in code and propose gamification as a way to motivate developers to invest in compliance. Our concept consists of a technical prototype and its integration into a Scrum environment. By means of two experiments with agile software teams and subsequent
    quality characteristic means how cost-effectively developers can continuously improve and evolve the software. Maintainability is broken down into the sub-characteristics analyzability, changeability, stability, testability and maintainability compliance. They describe how easy places to be changed and causes of faults can be located, how well future changes are supported and can be realized, how much the software avoids unexpected effects from changes, how well the software supports validation efforts, and how compliant the interior of the software is to maintainability standards and conventions. Maintainability is like an internal version of the usability quality
  • Best Practice Programming Techniques for SAS® Users
    PharmaSUG 2016 – Paper AD11 Best Practice Programming Techniques for SAS® Users Kirk Paul Lafler, Software Intelligence Corporation, Spring Valley, California Mary Rosenbloom, Lake Forest, California Abstract It’s essential that SAS® users possess the necessary skills to implement “best practice” programming techniques when using the Base-SAS software. This presentation illustrates core concepts with examples to ensure that code is readable, clearly written, understandable, structured, portable, and maintainable. Attendees learn how to apply good programming techniques including implementing naming conventions for datasets, variables, programs and libraries; code appearance and structure using modular design, logic scenarios, controlled loops, subroutines and embedded control flow; code compatibility and portability across applications and operating platforms; developing readable code and program documentation; applying statements, options and definitions to achieve the greatest advantage in the program environment; and implementing program generality into code to enable its continued operation with little or no modifications. Introduction Code is an intellectual property and should be treated as a tangible asset by all organizations. Best practice programming techniques help to clarify the sequence of instructions in code, permit others to read code as well as understand it, assist in the maintainability of code, permit greater opportunity to reuse code, achieve measurable results, reduce costs in developing and supporting code, and assist in performance improvements (e.g., CPU, I/O, Elapsed time, DASD, Memory). Best Practice Concepts A best practice programming technique is a particular approach or method that has achieved some level of approval or acceptance by a professional association, authoritative entity, and/or by published research results. 
Successful best practice programming techniques translate into greater code readability, maintainability and longevity while ensuring code reusability.
  • A Language-Independent Static Checking System for Coding Conventions
    A Language-Independent Static Checking System for Coding Conventions
    Sarah Mount
    A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy, 2013
    This work or any part thereof has not previously been presented in any form to the University or to any other body whether for the purposes of assessment, publication or for any other purpose (unless otherwise indicated). Save for any express acknowledgements, references and/or bibliographies cited in the work, I confirm that the intellectual content of the work is the result of my own efforts and of no other person. The right of Sarah Mount to be identified as author of this work is asserted in accordance with ss.77 and 78 of the Copyright, Designs and Patents Act 1988. At this date copyright is owned by the author.
    Abstract
    Despite decades of research aiming to ameliorate the difficulties of creating software, programming still remains an error-prone task. Much work in Computer Science deals with the problem of specification, or writing the right program, rather than the complementary problem of implementation, or writing the program right. However, many desirable software properties (such as portability) are obtained via adherence to coding standards, and therefore fall outside the remit of formal specification and automatic verification. Moreover, code inspections and manual detection of standards violations are time consuming. To address these issues, this thesis describes Exstatic, a novel framework for the static detection of coding standards violations. Unlike many other static checkers Exstatic can be used to examine code in a variety of languages, including program code, in-line documentation, markup languages and so on.