
Ambiguous, Informal, and Unsound: Metaprogramming for Naturalness

Toni Mattis ([email protected]), Patrick Rein ([email protected]), and Robert Hirschfeld ([email protected])
Hasso Plattner Institute, University of Potsdam, Potsdam, Germany

Abstract

Program code needs to be understood by both machines and programmers. While the goal of executing programs requires the unambiguity of a formal language, programmers use natural language within these formal constraints to explain implemented concepts to each other. This so-called naturalness – the property of programs to resemble human communication – motivated many statistical and machine learning (ML) approaches with the goal to improve engineering activities.

The metaprogramming facilities of most programming environments model the formal elements of a program (meta-objects). If ML is used to support engineering or analysis tasks, complex infrastructure needs to bridge the gap between meta-objects and ML models, changes are not reflected in the ML model, and the mapping from an ML output back into the program's meta-object domain is laborious.

In the scope of this work, we propose to extend metaprogramming facilities to give tool developers access to the representations of program elements within an exchangeable ML model. We demonstrate the usefulness of this abstraction in two case studies on test prioritization and refactoring. We conclude that aligning ML representations with the program's formal structure lowers the entry barrier to exploit statistical properties in tool development.

CCS Concepts  • Computing methodologies → Machine learning; • Software and its engineering → Abstraction, modeling and modularity; Object oriented languages.

Keywords  meta-objects, metaprogramming, machine learning, naturalness

ACM Reference Format:
Toni Mattis, Patrick Rein, and Robert Hirschfeld. 2019. Ambiguous, Informal, and Unsound: Metaprogramming for Naturalness. In Proceedings of the 4th ACM SIGPLAN International Workshop on Meta-Programming Techniques and Reflection (META '19), October 20, 2019, Athens, Greece. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3358502.3361270

1 Introduction

The complexity of modern software systems requires collaboration among multiple programmers. Hence, the source code itself becomes an artifact of communication and persisted domain knowledge. The idea that programmers should "concentrate rather on explaining to human beings what we want a computer to do" has been a prominent hypothesis in Knuth's work on Literate Programming [9].

Even if modern programs are usually not "literate" in Knuth's sense (yet), machine learning and information retrieval techniques that worked well on literature and text documents have successfully been applied to source code. Examples include semantic clustering, a reverse-engineering approach by Kuhn et al. [10] that uses the semantic relatedness between identifiers to cluster software artifacts, or the extraction of latent concepts from large source code corpora using topic modeling by Linstead et al. [12].

More recent analyses performed by Hindle et al. in their work on the Naturalness of Software [6] systematically show that programs possess statistical regularities that correspond to those exploited by the natural language processing (NLP) community.
Especially as the transition from formal (rule-based) methods to statistical (probabilistic) methods has proven very productive in the NLP domain, we can reasonably expect that software engineering can analogously benefit from statistical/ML methods augmenting the currently prevalent formal methods.


Problem  An obstacle in applying these insights to software engineering is the heavyweight infrastructure needed to connect program and "black-box" ML models, and then mapping their output back into the program. The following problems play a role in this infrastructure and will be assessed in this work:

• The metaprogramming domain and the input domains of ML models are often misaligned. For example, NLP-based models will likely use documents or n-grams as primary abstraction, which need to be mapped to classes, methods, or ASTs.
• ML models have a resource-intensive training and relatively fast application mode, which works well on natural language with stable meanings of words. Abstractions and the corresponding identifiers in programs evolve much faster, requiring ML models to readjust on-line and within the granularity of programmers' changes.
• Introducing ML into a tool can introduce significant technical debt, since model-specific state needs to be passed through all interfaces that eventually need ML-determined properties. Re-use of ML models, intermediate representations, and results between multiple tools is not standardized and, to the authors' knowledge, rarely observed in practice.

Goal  Our goal is to extend metaprogramming facilities to provide access to the output generated by a large class of ML models. Today's prevalent form of metaprogramming provides a reification of the formal elements of a program (e.g. classes and methods in object-oriented systems), which we call meta-objects in the scope of this work. Static analysis tools, test runners, frameworks, or language extensions can be based on meta-objects, and should be able to access probabilistic properties computed through ML models as well.

To address the problems stated above, our design follows the following principles:

Mapping  Each meta-object should have a representation inside the ML model. This principle should eliminate the need to manage mappings between meta-object domain and ML domain in client code for using and updating the model.

Decoupling model from application  Code using properties computed by the ML model should not interleave with code managing the ML model itself. This allows both to vary independently so that analysis tools can continue to focus on operating in the meta-object domain like they did for formal analyses.

Shared representation  Different tools using the same ML-derived properties should access them through the same meta-objects. No separate pipelines need to be constructed.

This is not an exhaustive list of improvements, but a first take on overcoming misaligned inputs, outputs, and updates, while reducing the invasiveness of ML pipelines.

In this paper, we first elaborate on the role of naturalness in software and give a high-level overview over techniques utilizing this information in section 2 (algorithmic details of specific machine learning models are out of scope). In section 3, we then present a novel extension to Smalltalk meta-objects for accessing this information. We proceed to use this framework in section 4 to improve two development tools: the automatic test runner and the refactoring capabilities of the code browser, each with minimal changes to the existing meta-object code. In section 5, we elaborate on design decisions and document the requirements an ML model needs to satisfy in order to work in this new framework. In section 6, we discuss related and possible future work.

2 Naturalness in Programs

Although it is possible to derive representations and machine learning input from the formal structure of a system, their power stems from their capability to capture statistical effects associated with natural language and structure, such as patterns emerging at run-time or in the development process.

2.1 Manifestations of Naturalness

In programs, names, often joined from multiple words via camel case or separators, provide the basis for communicating computational concepts. From a statistical viewpoint, words that appear in similar contexts likely share the same meaning ("difference of meaning correlates with difference of distribution" [5]), which can be used to measure semantic relatedness of program parts independently of their formal relations between each other.

A wider range of language can be found in comments, ranging from pure natural language documentation to commented-out code. Ordering, spacing, alignment, indentation, and other "negative space" can convey meaning by visibly grouping code passages and, at the same time, separating them from less related code.

Beyond source code, related artifacts like issues, discussions, pull requests, build logs, or documentation can convey meaning. In a software repository, the version history reveals how spatially unrelated program parts can still be semantically connected by being frequently changed together. Dynamic information generated at run-time plays a significant role in program understanding, as the widespread use of graphical debuggers or printf-style logging indicates. Concrete data and control flows can be observed and statistically mined.

2.2 Exploiting Naturalness in Software Engineering

There are multiple approaches to exploit natural information through statistical and ML models in software engineering. A comprehensive study by Allamanis et al. identifies three families of models [1]:


Code-generating Models statistically learn how code is composed from simpler parts (e.g. AST nodes or tokens). Examples are grammars with probabilities at each rule, or n-gram language models that record which tokens likely follow a sequence of previous tokens. They can be replayed to generate new code that follows the learned conventions (code completion, program synthesis, project migration), locate unusual code (fault detection, code review), or generate unlikely code for fuzzing and obfuscation purposes.

Representational Models learn a compact, semantic representation of code (analogous to a lossy compression), e.g., embeddings into a vector space where conceptually related code is in close proximity and unrelated code more distant.¹ Such representations can be compared for "smart" code search or clone detection, or serve as input for classifiers that predict faults. Natural language generators can use semantic representations as input to generate names, commit messages, or code summarizations.

¹ Word2Vec is the analogous approach from the NLP domain [15].

Pattern Mining extracts groups of statistically often co-occurring features from programs. These can be structural (e.g. design patterns, high-level relationships), lexical (e.g. topics), or mixed. A popular approach that belongs in this category is topic modeling, where topics are generated by grouping semantically related or often co-occurring features (e.g. identifier names), and each unit of code (e.g. class) is assigned a probability of being concerned with each topic. In many cases, topics can be directly mapped to domain concepts and help with reverse engineering, search, recommender systems, and clone detection.

Figure 1. Illustration of a representational model embedding identifiers into a vector space, preserving their relationships approximately. In some models, the axes might reflect human-interpretable concepts (e.g. LDA), in others (e.g. word2vec) not. Typically, the embedding has more dimensions (up to 300) than depicted in this illustration.

In the scope of this work, we focus on expressive programmatic access to existing meta-objects rather than mechanisms to generate new program code or meta-objects from ML models. Representational as well as pattern mining models are prevalent in this domain.

2.3 Working with Representations

A milestone in dealing with the inherently ambiguous and redundant nature of human communication artifacts was the use of representations to model their meaning in a robust, machine-understandable way. A representation preserves the relevant semantic properties of each linguistic item and the relations between them without the need to store each individual, possibly noisy, relationship. Given two artifacts (e.g. words or texts), their relationship can be approximately re-computed from their representations, and unobserved relationships can be predicted effectively.

Obtaining a representation is usually preceded by feature extraction. During feature extraction, raw data is converted to an input representation suited for the model. Typical feature sets are vectors with each coordinate corresponding to the frequency of a specific word in an artifact, or bags of words. Stemming, weighting, and the removal of frequent words can be applied. In programs, for example, camel case can be split, and the range of features extends to syntactic and semantic elements of the program.
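For illustration, a minimal sketch of such lexical feature extraction in Squeak/Smalltalk (ours, not the paper's implementation): a camel-case identifier is split into a bag of lowercase word features.

| identifier words current |
identifier := 'prioritizedTestsGiven'.
words := Bag new.
current := WriteStream on: String new.
identifier do: [:char |
    (char isUppercase and: [current position > 0]) ifTrue: [
        words add: current contents asLowercase.
        current := WriteStream on: String new].
    char isLetter ifTrue: [current nextPut: char]].
current contents isEmpty ifFalse: [words add: current contents asLowercase].
words "a Bag containing 'prioritized', 'tests', and 'given'"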

Both representational and pattern mining models compute output representations that indicate how specific features or parts of the program relate to the latent concepts or patterns mined by the model. Often, these representations are vectors, like depicted in Figure 1, that exhibit a range of properties:

Topicality  Individual vector dimensions indicate the degree to which the represented entity coincides with a latent factor or topic. One of the most widely used models that exhibits this property is the LDA model [2]. In software engineering, such concepts can represent parts of the implemented domain, clusters of technical vocabulary, or different coding styles, among others.

Preserving semantic distance  An inner product, ⊙, of two representations can be defined, which results in a scalar that indicates how closely related the two represented features or meta-objects are. The semantic distance of units of software has many uses in software engineering, ranging from clone detection, over code search, to recommender systems. It can serve as a code metric with respect to coupling (inner product between separate modules) and cohesion (inner product between module constituents). In section 4, we demonstrate its use in prioritizing tests that are semantically related to a change, and judging whether moving code to another class improves cohesion.

Composing meanings  Some models have additive properties, i.e., if we compose two meta-objects or features with representations r1 and r2, their composition can be represented as r1 ⊕ r2. The ⊕ operation may need to preserve certain constraints beyond just adding vector elements, e.g. probabilities sum up to 1 or vector magnitudes stay normalized. In software engineering, composition becomes more relevant and we need to define a standard protocol even for models whose representations are not directly composable. In this work, we rely on composition to compute the meaning of a class from its methods, e.g. in Listing 8.

Analogical reasoning  Besides composition, a small family of representations provides semantics for subtraction, ⊖, of representations. With such an operation, the difference between meta-objects or features represents the relation in which they participate. Assume we have four classes, IntArray, String, Integer, and Char; then a representation with subtraction would approximate r_String ⊖ r_Char ≈ r_IntArray ⊖ r_Integer, indicating the system has learned that strings relate to characters like arrays relate to numbers. While this property is listed here for completeness reasons, implementing analogy in metaprogramming is left for future work.

3 A Metaprogramming-based Interface to Model Representations

Our goal is to extend meta-objects with properties that reflect the output of a (late-bound) statistical or ML model. We propose to introduce common abstractions for dealing with semantic information mined from natural language elements and other properties of a software system and attach them to the meta-object graph.

This way, meta-objects do not purely serve as input to an ML pipeline, but as both ends of an ML model. As a trade-off, this accessibility at meta-object level requires the ML model to no longer be a black box or run in a single-shot pipeline, but to match the structure of the program and react to changes at meta-object granularity (e.g. for individual classes or methods).

A note on notation  We use the Smalltalk language as notation and the Squeak/Smalltalk [8] system as implementation platform. In Smalltalk, a program consists of live meta-objects that can be inspected and manipulated rather than merely source code. This simplifies dealing with the meta-object domain compared to languages where reflection and run-time metaprogramming have life cycles and object models that differ from code-based or binary-based metaprogramming. (In Java, for example, the Eclipse JDT infrastructure can provide tool developers access to a method's AST, while reflection can provide Method objects at run-time, but neither does the AST exist at run-time, nor does a change to the Method object affect the source code or class file. In Smalltalk, there is only a single CompiledMethod instance playing both roles.)

3.1 Models and Meaning

In one of our case studies, we sort unit tests according to their relevance to the most recently changed method as a way to prioritize their order of execution. To access representations at meta-object level, we introduce the meaning method on each meta-object. Obtaining the meaning of a test method can be done using testMethod meaning. The resulting meaning can be compared to other meanings with respect to semantic similarity (<->) and helps us judge the relevance of a test method given a change: relevance := testMethod meaning <-> changedMethod meaning. The comparison returns larger values for more related methods with respect to the ML model. The model itself is abstracted away and not a concern at this point in the program, but can be accessed using the dynamically scoped variable ActiveModel.

In general, we provide the following entry points:

Meaning  The meaning of a meta-object m, obtainable via the accessor method m meaning, encapsulates its representation with respect to the model that is active at the time of calling this method.

Model  The model, accessible via the dynamically scoped variable ActiveModel, determines which representation is chosen for a meaning and provides implementations for the comparison and composition operators. Different models for different machine learning tasks can co-exist, but only one is active at a time.

The correspondence between meta-objects, their meanings, and the model is illustrated in Figure 2. By equipping any meta-object with a meaning method, we realize the mapping principle.

Meaning objects provide the following two operations:

Comparison  The comparison m1 <-> m2 between two meaning objects implements the inner product ⊙ of their encapsulated representations and returns a numeric value indicating how similar the underlying meta-objects are.

Composition  Creating a new meaning by composing two meta-objects' meanings, mc := m1 + m2, results in a representation for a (hypothetic) meta-object consisting of the two initial meta-objects. This implements the composition operator ⊕.

Implicit models  Instead of writing model meaningFor: m, we prefer the shorter notation m meaning, with meaning then invoking the dynamically scoped variable ActiveModel via a mechanism described in section 5. This design is controversial from a modularity viewpoint, because meta-objects are extended by ML-specific functionality rather than keeping the new functionality separate and importing it when needed. However, we found that it eliminates the need to pass a model object through several interfaces and discourages storing a (potentially outdated) model in some field. In Smalltalk, we have several mechanisms to extend existing abstractions in a modular way, e.g. through extension methods (defined at a class from one package but loaded and version-controlled by a different software package) or context-oriented programming (outlined in section 5), which keeps the development life-cycle of the mechanism separated from the core reflection API.
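For illustration, a usage sketch of these entry points (ours, not from the paper): Model new stands in for any concrete, trained model subclass, and the two compiled methods are arbitrary examples.

| m1 m2 relevance combined |
m1 := OrderedCollection >> #add:.
m2 := OrderedCollection >> #removeFirst.
ActiveModel value: Model new during: [
    relevance := m1 meaning <-> m2 meaning. "inner product of both representations"
    combined := m1 meaning + m2 meaning]. "composite meaning of both methods"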


Figure 2. Correspondence between meta-objects and representations in an ML model, and the role of Meaning objects as abstraction over the representation details chosen by the model.

4 Case Studies

In this section, we show how a fully decoupled ML model can be used from within the meta-object domain, using two examples that make heavy use of metaprogramming and benefit from integrating ML-learned representations: test prioritization and automated refactorings.

4.1 Test Prioritization

Unit and regression testing is a widely practiced approach to detect faults during software evolution. However, test suites for larger software projects tend to run for several hours, thereby limiting how often they are run, how up-to-date their feedback is, and how easily a defect can be fixed when finally found.

The goal of test prioritization is to identify tests that most likely fail and run them earlier. This both reduces the time to obtain feedback in the presence of faults and increases the confidence in test results when only a limited set of tests is run due to time constraints.

There are two major approaches: the static approach takes a program and a set of tests as input and tries to order tests in a way that large parts of the program are covered early on, while moving redundant, long-running, or highly specific tests into later stages. Change-based prioritization takes the most recent change as input and predicts which tests best detect faults introduced in this particular change. In this case study, we extend an existing change-triggered test runner to apply change-based prioritization before running tests.

Prioritizing by meaning  Since changes and test cases can be considered collections of meta-objects as well, they have a conceptual meaning within the program. Since we defined an inner product on representations that yields a number of their relatedness, we can use this as a metric to compare tests. The priority p of a test t in response to a changed meta-object m can be expressed as p(t, m) = R(t) ⊙ R(m) and thus allows tests to be sorted beforehand.

The AutoTDD runner  Smalltalk has a change-triggered test runner, AutoTDD, that runs a previously selected test suite after every individual method update. During configuration by the user, it creates a TestSuite object consisting of TestCase instances, each test case instance representing a test method (implemented at that specific class) to be executed. The runner then attaches itself to Smalltalk's SystemChangeNotifier and, whenever a method within the scope of the test suite is changed, all tests are re-run. The change event is an instance of ModifiedEvent that references the old meta-object and the new version replacing it.
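For illustration, a sketch (ours) of how such a runner can attach itself to Squeak's change notification; the selector #systemChanged: and the event filtering are simplifying assumptions, and the exact SystemChangeNotifier subscription selector may vary between Squeak versions.

TestAutoRunner >> subscribeToChanges
    SystemChangeNotifier uniqueInstance
        notify: self ofAllSystemChangesUsing: #systemChanged:

TestAutoRunner >> systemChanged: anEvent
    "Re-run the configured suite whenever a method in scope was modified."
    (anEvent isModified and: [anEvent item isKindOf: CompiledMethod])
        ifTrue: [self runSuite]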


Prioritizing tests  To implement test prioritization, what needs to be done is parametrizing the TestAutoRunner >> runSuite method (which gets called in the event handler upon modification) to runSuiteGiven: aMethodChange in order to pass the current change to the test suite, and instead of only iterating over TestSuite >> tests, the runner now iterates over TestSuite >> prioritizedTestsGiven: aMethodChange, allowing it to yield sorted test methods in response to the current change.

We use the <-> operator, which asks the ML model to compute the inner product (⊙) between a test method's representation and the changed method's representation (see Listing 1).

Listing 1. Sorting Tests
TestSuite >> prioritizedTestsGiven: aMethodChange
    | changeMeaning |
    changeMeaning := aMethodChange newMethod meaning.
    ^ self tests sorted: [:testMethod1 :testMethod2 |
        (testMethod1 meaning <-> changeMeaning)
            > (testMethod2 meaning <-> changeMeaning)]

Choosing a model  Test prioritization itself already works with simple lexical models. For our initial prototype, the model extracts identifier names from each meta-object. The representation is a multi-set of these features, with the inner product defined as the relative size of their intersection. Composition is realized as (multi-)set union. More sophisticated lexical models for test prioritization, including TF-IDF-based and LDA-based variants, have been discussed in a previous study [14] and implemented in Smalltalk using the new abstractions. In the LDA case, the model needs an initial setup to configure hyper-parameters and train for several iterations. The model registers as change listener, so any new method or change in the specified package causes an incremental re-training. Using such a model in test prioritization can detect introduced defects within the first few unit tests, as shown in [14].
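For illustration, the simple lexical model can be sketched against the Model interface from section 5.2 (ours, not the prototype evaluated in [14]): representations are the bags of identifier words themselves, and the inner product is their relative overlap. Feature extraction per meta-object would be supplied via the featuresFor: protocol, e.g. by camel-case splitting as outlined in section 2.3; composition falls back to the default destructuring composition (bag union).

Model subclass: #LexicalModel
    instanceVariableNames: ''
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Naturalness-Examples'.

LexicalModel >> representationForFeatures: aBagOfWords
    "The bag of extracted identifier words serves as representation directly."
    ^ aBagOfWords

LexicalModel >> similarityBetween: aBag and: anotherBag
    "Relative size of the intersection of the two word sets."
    | common |
    common := aBag asSet count: [:word | anotherBag includes: word].
    ^ common / ((aBag asSet size min: anotherBag asSet size) max: 1)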
Discussion  At this point, no direct reference to any ML model has been made, as our entry points to access the ML representations are fully defined by the meaning property of the meta-objects. The meta-objects themselves, however, must be passed to a suitable location, which required most of the changes in adapting the AutoTDD runner.

The sort method uses merge sort, requiring a number of comparisons in O(n log n). Each comparison requests representations of two method meta-objects. If the model itself provides lazy caching, there is only the need to compute n representations (with n being the number of test methods). An eager caching strategy, with the model listening for any changed test method and updating their representation proactively, requires no computation of representations during test runs, which might even be the preferable option given our goal to reduce feedback time during tests.

The quality of the prioritization results and their performance now depends on which model is being set up. This raises the question of the scope in which a model is active – in this case, we might want to have either a global scope, i.e., the test runner uses the system-wide default model, or a tool/module scope, where a model optimized for test prioritization is active anytime the control flow originates from any module in the test runner package.

4.2 Refactoring

Refactoring often aims at improving the modularity of a program. With a representational model, we can quantify modularity in terms of semantic similarity (inner product). If, for example, our aim is to move a method to another class, we can check if the destination class provides a more similar environment to that method than the original class.

In Squeak/Smalltalk, we can instrument the code browser that handles a simple variant of the Move Method refactoring by drag and drop: in Listing 2, we compare the similarities of the dragged method to both classes and request user confirmation if the change would move the method to a less related class.

Listing 2. Checking a refactoring
Browser >> dropOnMessageCategories: method at: index
    ... "drag/drop handling"
    destinationMatchesSemantically :=
        (destinationClass meaning <-> method meaning)
            > (sourceClass meaning <-> method meaning).
    destinationMatchesSemantically ifFalse: [
        ... "let user confirm the move"]

Reified refactorings  The Smalltalk Refactoring Browser [16] was one of the earliest tools to help programmers perform a range of common behavior-preserving modifications without manually editing the code. In this tool, refactorings are reified, i.e., they store the relevant configuration and provide a method to eventually re-write the source code. We can add analogously simple code to the transform method of a MoveMethodRefactoring to check the above constraint, see Listing 3.

Most refactorings can be extended by an effectiveness metric that takes the difference between modularity before and after the refactoring. For example:

Listing 3. Computing the effectiveness of a refactoring
RBChangeMethodNameRefactoring >> effectiveness
    | methodMeaning |
    methodMeaning := (self class >> oldSelector)
        parseTree body meaning. "includes comments"
    ^ (newSelector meaning <-> methodMeaning)
        - (oldSelector meaning <-> methodMeaning)

Analogously, renaming a class or extracting parts of an AST into another method can easily compare the fitness of their new name given the code.
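The same scheme carries over to other reified refactorings; for instance, a move-method variant of Listing 3 could be sketched as follows (ours; the accessors #method, #sourceClass, and #destinationClass are assumed for illustration):

MoveMethodRefactoring >> effectiveness
    | methodMeaning |
    methodMeaning := self method meaning.
    ^ (self destinationClass meaning <-> methodMeaning)
        - (self sourceClass meaning <-> methodMeaning)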

Discussion  Refactorings, especially the reified family used by the Refactoring Browser, are formal and meta-object-heavy operations that can benefit from insights generated by machine learning. The new capability to not only rely on formal modularity metrics (e.g. tight class cohesion or cyclomatic complexity [11]) but to factor in naturalness allows these refactoring metrics to quantify how well the intent to refactor towards semantically coherent modules is met. Even if the improvement of cohesion is insignificant in most smaller refactorings, a scoring mechanism can prevent code relocations to completely unrelated classes by mistake or misunderstanding.

In programming environments, suggesting names based on their adequacy for the module they describe, or highlighting the most promising targets for code-moving refactorings, can now be implemented based on the refactoring effectiveness. Combining natural and formal modularity metrics gives rise to future work on evaluating and recommending refactorings.

Shared representation  Both test prioritization and refactoring can rely on the same model and even re-use the representations. The Meaning instances that become updated following the change event triggered by the refactoring are immediately available to the prioritizing test runner. This demonstrates our third principle – shared representation – in action.

5 Implementation

This section elaborates on design and implementation decisions underlying the meaning/model protocol.

ML models as context  The primary entry point is the method meaning on any meta-object. The returned object encapsulates the implementation details of the meta-object's representation in the currently active ML model. A design decision is how the model can be properly separated to not interfere with code operating in the meta-object domain, as required by our principle of decoupling model from application.

Both dynamic variables [19] and context-oriented programming [7] are suitable to separate the ML model from meta-object code:

As a dynamic variable, a model can be brought in scope during certain control flow. Any code requiring the model asks the dynamic variable ActiveModel. At the point of configuration, programmers can use ActiveModel value: model during: [code] to run code under the specified model (see Listing 4).

Listing 4. Dynamic variable and double dispatch
CompiledMethod >> meaning
    ^ ActiveModel value meaningOfMethod: self
Class >> meaning
    ^ ActiveModel value meaningOfClass: self
...
Model >> meaningOfMethod: aMethod
    ^ self represent: aMethod parseTree
Model >> meaningOfClass: aClass
    "compute representation of aClass"
    ...

ActiveModel value: myModel new during: [
    metaObject meaning].

In context-oriented programming (Listing 5), the model can become a layer that provides method adaptations when activated. In contrast to a dynamic variable, client code does not need to address the active model, but provide variation points (methods), which the layer will dynamically override. If the host language supports dispatch on multiple arguments, which is not the case in Smalltalk, a multimethod can serve the same role.

Listing 5. Context-oriented Programming
Object >> meaning
    "default implementation"
Model (layer) >> CompiledMethod >> meaning
    "override meaning if active"
    ^ thisLayer represent: self parseTree
Model (layer) >> Class >> meaning
    "compute representation of class"
    ...

myModel withLayerDo: [ metaObject meaning].

While the context-oriented variant provides stronger separation between model and meta-objects and avoids double dispatch to obtain specific representations for each type of meta-object, it raises complexity by using a non-standard language concept. With dynamic variables, only a single model can be active, while layers allow multiple models to be stacked and provide composition semantics. Dynamic variables are sufficient for the use cases in this paper, so we continue to use them and leave a context-oriented implementation for future work.

5.1 Meaning Objects

Our Meaning objects require information about the model that generated them, the meta-object they represent, and operations to compare and combine them.

Structure of Meaning instances  Each Meaning instance holds the following private state (a class skeleton is sketched after the list):

• The model that owns this representation, model,
• The features, i.e., the input representation to the ML model, per default a Set,
• The output representation, whose type is specific to what the model's implementation uses,
• The version the model had when generating this instance, since re-training/updating the model invalidates all existing representations,
• The meta-object represented by this instance, to avoid duplication of data already present in the meta-object.
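A minimal class skeleton matching this state could look as follows (a sketch; the instance variable names mirror the list above and are not taken from the paper's prototype):

Object subclass: #Meaning
    instanceVariableNames: 'model features representation version subject'
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Naturalness-Core'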


The construction of such an instance involves the following steps:

1. The model is asked to provide the meaning for the requested meta-object, double-dispatching according to its type.
2. The model retrieves the meta-object features (e.g. words); again a double-dispatch protocol is used that leaves variation points for different feature extraction routines and re-use.
3. The model retrieves its representation of the given set of features (e.g. a vector) and constructs the Meaning instance.
4. Features and representations remain cached. The representation cache must be discarded if the model is re-trained; both must be discarded when the meta-object changes.

Inner Product  In common scenarios where naturalness is utilized, the representations of meta-objects are used to compute similarity. Examples include recommender systems, where similarity to a context or task decides which item is proposed, information retrieval, where similarity to a query is used to rank results, or software modularity, where similarities of items between modules relate to coupling and within a module to cohesion.

The analogous mathematical concept underlying similarity is the inner product between two representations, R(m1) ⊙ R(m2), with R being the representing function and mi any meta-object. In Smalltalk, the <-> operator (Listing 6) is used. It is only defined between Meaning instances of the same model having the same version.²

² Not storing derived properties from meta-objects for a long time is advisable in general, and in this case not only the meta-objects but also the model can change, invalidating the stored representation.

Listing 6. Implementing the similarity operator
Meaning >> <-> other
    ... "(ensure same model and version)"
    ^ self model
        similarityBetween: self representation
        and: other representation

Composition  When a meta-object is composed of other meta-objects (e.g. classes consist of name, superclass, class comments, fields, and methods), representations of the constituents can be reused to infer the meaning of their composition. However, not all representations lend themselves to simple composition: while models like word2vec [15] can approximate the meaning of compound sentences from adding individual word vectors, a probabilistic topic model would need to recompute the most likely representation not just from the constituents' representations but from their features.

We therefore provide two ways to compose meaning:

Immediate Composition  R(mc) = R(m1) ⊕ R(m2), where ⊕ is provided by the model.

Destructuring Composition  First, R(m) is explicitly modeled as R(m) = r(f(m)), where r computes the representation of a set of features, while f extracts these features from meta-object m. Then, R(m1) ⊕ R(m2) = r(f(m1) ∪ f(m2)), i.e., the combined set of features from both meta-objects is jointly represented. In contrast to immediate composition, which may apply normalization or other non-associative operations, this operation is associative and commutative. The implementation of this mechanism is visualized in Figure 3.

Most models from the NLP domain support destructuring composition, because they originally worked on documents. Joining the features of two documents treats them like a single concatenated document. In our implementation, destructuring composition is performed by default, unless a hook (Model >> collapseComposite:) in the model is overridden to directly compose two Meaning instances.

To model intermediate results during composition, we introduce the CompositeMeaning class, which holds a collection of Meaning objects, including other composites, and a NullMeaning, which will return the other meaning object whenever composed and yields an empty set of features if queried.

Listing 7. Implementation of the composite meaning
CompositeMeaning >> features
    ^ self meanings gather: [:meaning |
        meaning features]

CompositeMeaning >> representation
    ^ self model collapseComposite: self

Model >> collapseComposite: aCompositeMeaning
    ^ self representationForFeatures:
        aCompositeMeaning features

When asked for the representation, the CompositeMeaning recursively collects the features from child meanings (implemented by Meaning >> features, see Listing 7) and passes them to the model to be converted to the final representation. A composite meaning can be associated with a meta-object or a collection of meta-objects, in which case the representation will be cached and retrieved once the same meta-object or collection occurs again.
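The + operator itself can then be realized on top of CompositeMeaning; the following is a sketch (ours; the model: and meanings: setters are assumed for illustration):

Meaning >> + otherMeaning
    "Defer composition by wrapping both meanings into a composite,
    which the model collapses lazily via collapseComposite:."
    ^ CompositeMeaning new
        model: self model;
        meanings: (OrderedCollection with: self with: otherMeaning);
        yourself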


Figure 3. Compositional structure of meanings over their corresponding meta-objects. Data flow via features is highlighted. All data flows are supervised by the model.

5.2 Model Objects

To implement an own model, Model needs to be subclassed and the following interface implemented:

• For each type of meta-object, a method featuresFor: a⟨MetaObject⟩, with ⟨MetaObject⟩ being the class name (double dispatch). This should return a collection of features the model needs to represent the particular meta-object.
• representationForFeatures:, which takes a collection of features and returns their representation within the model.
• similarityBetween: r1 and: r2, computing the inner product r1 ⊙ r2 from two representations.

Around this minimal core, which already covers a wide range of simple models with word-like or n-gram features like word2vec or LDA-style topic models, a number of hooks can be overridden:

• representationFor: can be overridden for specific types of meta-objects to bypass feature extraction and directly emit a representation. This is useful if the model has already pre-computed a representation as part of the training phase.
• meaningFor: can analogously be overridden to provide specific Meaning instances or custom subclasses, e.g. as part of an optimization that requires the Meaning to cache additional information, or to override the composition operator.
• collapseComposite: to implement a custom resolution for composite meanings.

The default implementation of meaningForClass: in Listing 8 illustrates how composition can be used to express the meaning of a class as the sum of its methods.

Listing 8. Representing a class as composition of its methods
Model >> meaningForClass: aClass
    ^ (aClass selectors collect: [:selector |
            (aClass >> selector) meaning])
        inject: Meaning null
        into: [:methodMeaning1 :methodMeaning2 |
            methodMeaning1 + methodMeaning2]

Observer protocol  To keep a model up-to-date, it needs to subscribe to the SystemChangeNotifier after initial training and implement an event handler to deal with meta-objects changing. Different types of changes (add, remove, change) of methods and classes need to be handled, but not necessarily immediately.

6 Related Work and Outlook

ML in Software Engineering  State-of-the-art models used in software engineering go beyond the meta-object domain: Saeidi et al. [17] make use of kernel-based methods to learn representations of software modules that encode relationships in the evolutionary and dynamic domain as well by considering version history and call graphs.

The idea of representing changes as first-class meta-objects, e.g. via Change Boxes [4], and representing them inside a model that is capable of dealing with evolutionary information is appealing – it allows the model to "understand" what a change is about and, at the same time, learn new semantic correlations from co-changed program parts.

Tool building frameworks  Frameworks that aim at building development tools, like Vivide [18], can benefit from added capabilities of meta-objects. Since building tools with Vivide requires the implementation of relatively small transformation steps that extract the desired information from meta-objects, dealing with ML models would have been prohibitively complex before. Having access to the meaning of each object, however, allows individual parts of a tool to quickly sort by relatedness to another meta-object, cluster, or visualize large collections of meta-objects by projecting their representations in 2D.

Mirrors  Mirrors are a reflection concept that inspired our dual architecture. Mirrors are separated from the objects they reflect on and thus provide what Bracha et al. [3] call encapsulation, stratification, and ontological correspondence. Our meaning infrastructure realizes both encapsulation, by separating the meaning interface from a concrete ML model implementation, and, to a limited degree, ontological correspondence, by following the compositional structure of programs through composite meaning objects. Stratification is not fully achieved, as each meta-object is directly extended by the meaning method. A context-oriented approach where access to meanings is modularly separated into a layer would be more stratified.

Concept-aware IDEs  Programming environments that "understand" what code is about [13] can help with a wide range of program comprehension and engineering activities. Mining a code base for concepts that constitute semantically connected source code locations, e.g. via topic models, and identifying their distinguishing vocabulary, can help with reverse engineering activities and highlight refactoring opportunities. Code completion can sort suggestions by semantic relatedness to the current context, changes can trigger recommendations for related code locations that are likely relevant, and code search is more robust when based on conceptual similarity rather than exact lexical matches. Providing access to concept-level information at meta-objects minimizes the changes required at tooling level to implement this functionality.

7 Conclusion

Applications of machine learning in software engineering are often approximative and unsound, as they make probabilistic statements about the system rather than deducing formal properties. Similar approaches have proven powerful in the domain of natural language, where ambiguity, redundancy, and noise would make formal models too brittle. Analogously, many parts of programs that are essential for their comprehension, such as names, have meaning beyond their formal role. This "natural" aspect can be utilized by natural-language-inspired ML techniques and put to use for a variety of development activities, often in liaison with formal methods.

Metaprogramming and its corresponding meta-object domain have been representing formal elements of programs in the past and are frequently used in analyses and programming tools. Introducing the capabilities of machine learning by giving programmers access to the natural meaning of each meta-object as determined by an ML model serves as an important starting point to better integrate ML models with programming environments. We could demonstrate in our test runner and refactoring case studies that meta-object code can, without major changes, directly prioritize tests or judge the quality of a refactoring based on their similarity in the natural domain.

We conclude that much of the architectural overhead, which is rarely addressed by the research of ML in software engineering, can be eliminated through such a framework. With the possibilities of machine learning readily available for metaprogramming, future ML-based programming tools might also integrate better into the programming workflow, like code completion, syntax highlighting, or graphical debugging already do today.

References

[1] Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. 2018. A Survey of Machine Learning for Big Code and Naturalness. ACM Comput. Surv. 51, 4 (July 2018), 81:1–81:37. https://doi.org/10.1145/3212695
[2] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. J. Mach. Learn. Res. 3 (March 2003), 993–1022.
[3] Gilad Bracha and David Ungar. 2004. Mirrors: Design Principles for Meta-Level Facilities of Object-Oriented Programming Languages. In Proceedings of the 19th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA '04). ACM, New York, NY, USA, 331–344. https://doi.org/10.1145/1028976.1029004
[4] Marcus Denker, Tudor Gîrba, Adrian Lienhard, Oscar Nierstrasz, Lukas Renggli, and Pascal Zumkehr. 2007. Encapsulating and Exploiting Change with Changeboxes. In Proceedings of the 2007 International Conference on Dynamic Languages: In Conjunction with the 15th International Smalltalk Joint Conference 2007 (ICDL '07). ACM, New York, NY, USA, 25–49. https://doi.org/10.1145/1352678.1352681
[5] Zellig S. Harris. 1954. Distributional Structure. WORD 10, 2-3 (Aug. 1954), 146–162. https://doi.org/10.1080/00437956.1954.11659520
[6] Abram Hindle, Earl T. Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. On the Naturalness of Software. In Proceedings of the 34th International Conference on Software Engineering (ICSE '12). IEEE Press, Piscataway, NJ, USA, 837–847.
[7] Robert Hirschfeld, Pascal Costanza, and Oscar Nierstrasz. 2008. Context-Oriented Programming. Journal of Object Technology 7, 3 (March-April 2008), 125–151. https://doi.org/10.5381/jot.2008.7.3.a4
[8] Dan Ingalls, Ted Kaehler, John Maloney, Scott Wallace, and Alan Kay. 1997. Back to the Future: The Story of Squeak, a Practical Smalltalk Written in Itself. In Proceedings of the 12th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA '97). ACM, New York, NY, USA, 318–326. https://doi.org/10.1145/263698.263754
[9] D. E. Knuth. 1984. Literate Programming. Comput. J. 27, 2 (Jan. 1984), 97–111. https://doi.org/10.1093/comjnl/27.2.97
[10] Adrian Kuhn, Stéphane Ducasse, and Tudor Gîrba. 2007. Semantic Clustering: Identifying Topics in Source Code. Information and Software Technology 49, 3 (March 2007), 230–243. https://doi.org/10.1016/j.infsof.2006.10.017
[11] Michele Lanza and Radu Marinescu. 2007. Object-Oriented Metrics in Practice: Using Software Metrics to Characterize, Evaluate, and Improve the Design of Object-Oriented Systems. Springer Science & Business Media.
[12] Erik Linstead, Paul Rigor, Sushil Bajracharya, Cristina Lopes, and Pierre Baldi. 2007. Mining Concepts from Code with Probabilistic Topic Models. In Proceedings of the Twenty-Second IEEE/ACM International Conference on Automated Software Engineering (ASE '07). ACM, Atlanta, GA, USA, 461–464. https://doi.org/10.1145/1321631.1321709
[13] Toni Mattis. 2017. Concept-Aware Live Programming: Integrating Topic Models for Program Comprehension into Live Programming Environments. In Companion to the First International Conference on the Art, Science and Engineering of Programming (Programming '17). ACM, New York, NY, USA, 36:1–36:2. https://doi.org/10.1145/3079368.3079369
[14] Toni Mattis, Falco Dürsch, and Robert Hirschfeld. 2019. Faster Feedback Through Lexical Test Prioritization. In Proceedings of the Conference Companion of the 3rd International Conference on Art, Science, and Engineering of Programming (Programming '19). ACM, New York, NY, USA, Article 21, 10 pages. https://doi.org/10.1145/3328433.3328455
[15] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781 [cs] (Jan. 2013). arXiv:cs/1301.3781
[16] Don Roberts, John Brant, and Ralph Johnson. 1997. A Refactoring Tool for Smalltalk. Theory and Practice of Object Systems 3, 4 (1997), 253–263. https://doi.org/10.1002/(SICI)1096-9942(1997)3:4<253::AID-TAPO3>3.0.CO;2-T
[17] Amir Saeidi, Jurriaan Hage, Ravi Khadka, and Slinger Jansen. 2019. Applications of Multi-View Learning Approaches for Software Comprehension. The Art, Science, and Engineering of Programming 3, 3 (2019), (to appear).
[18] Marcel Taeumel, Michael Perscheid, Bastian Steinert, Jens Lincke, and Robert Hirschfeld. 2014. Interleaving of Modification and Use in Data-Driven Tool Development. In Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software (Onward! 2014). ACM, New York, NY, USA, 185–200. https://doi.org/10.1145/2661136.2661150
[19] Martin von Löwis, Marcus Denker, and Oscar Nierstrasz. 2007. Context-Oriented Programming: Beyond Layers. (2007).
