Inference and Regeneration of Programs That Store and Retrieve Data Martin Rinard and Jiasi Shen
Computer Science and Artificial Intelligence Laboratory Technical Report MIT-CSAIL-TR-2017-006, April 24, 2017
Massachusetts Institute of Technology, Cambridge, MA 02139 USA, www.csail.mit.edu

Martin Rinard and Jiasi Shen, EECS & CSAIL, MIT

Abstract

As modern computation platforms become increasingly complex, their programming interfaces are increasingly difficult to use. This complexity is especially inappropriate given the relatively simple core functionality that many of the computations implement. We present a new approach for obtaining software that executes on modern computing platforms with complex programming interfaces. Our approach starts with a simple seed program, written in the language of the developer's choice, that implements the desired core functionality. It then systematically generates inputs and observes the resulting outputs to learn the core functionality. It finally automatically regenerates new code that implements the learned core functionality on the target computing platform. This regenerated code contains both (a) boilerplate code for the complex programming interfaces that the target computing platform presents and (b) systematic error and vulnerability checking code that makes the new implementations robust and secure. By providing a productive new mechanism for capturing and encapsulating knowledge about how to use modern complex interfaces, this new approach promises to greatly reduce the developer effort required to obtain secure, robust software that executes on modern computing platforms.

1 Introduction

Within the last decade, undergraduate computer science enrollments, both within and outside the major, have dramatically increased [54]. As a result, undergraduates are now acquiring basic programming skills as a normal part of their college education. Indeed, the ability to write (relatively simple) programs, like the ability to read, write, and perform basic mathematical reasoning, is now increasingly seen as part of the personal portfolio of a literate person in our culture [32, 45].

At the same time, the systems on which even simple production software must execute are becoming increasingly complex. Some decades ago most software executed on a single machine, with the programming environment providing a few simple abstractions (such as file system interfaces) for accessing the devices attached to that machine. Most software today, in contrast, is expected to execute in complex, networked, distributed computing platforms. A common scenario, for example, is for a program to compute over data stored across many machines in a cloud computing environment to generate results that are then distributed via the Internet for graphical presentation on remote devices.

Modern software environments rely heavily on software packages that help developers deal with the resulting complexity. Examples include application server frameworks such as JBoss and IBM WebSphere, key/value storage systems such as Redis, NoSQL databases such as HBase, distributed memory caching systems such as memcached, and cluster computing frameworks such as Spark and MapReduce. While the implementations of such systems encapsulate the otherwise potentially overwhelming complexity of coordinating the actions of the components of large distributed computing systems, the programming interfaces they provide are far from easy to use (as can be seen in the large volume of questions posted to web sites such as Stack Overflow [6]). Indeed, developers that work with such systems spend much of their time constructing appropriate search terms to find previously developed code that they can copy and adapt for their needs. The complexity of these programming interfaces can be seen as especially inappropriate given the relatively simple core functionality that many of the computations implement. At a conceptual level, such computations often simply store and retrieve data or perform simple computations on the stored data. The complexity comes not from the core functionality that the computation implements, but from the computing platform on which the computation executes.

We propose a new approach for developing software for such computing platforms. Instead of coding to complex programming interfaces that existing software packages export, developers implement the core functionality in a seed program using the programming language of their choice. The seed program uses only the simplest standard programming interfaces, such as standard text input and output interfaces. Our system first interacts with the seed program to learn its core functionality. The system then regenerates (a potentially augmented version of) the computation that uses sophisticated software packages to implement the learned core functionality on the new, potentially much more complex computing platform. In effect, the knowledge and expertise required to use the relevant software packages are all encapsulated in our regenerator. This approach can be particularly productive in a world in which basic programming skills are widely available, but the specialized knowledge and expertise required to productively use specialized software packages is much scarcer. This is the case for the world we are now entering as a society: such specialized knowledge of specialized software packages is constantly changing, available to far fewer people, and more difficult to use for everyone regardless of their skill level. Our approach is founded on several principles:

• Program Inference via Active Learning: Starting with a seed program that (mostly) implements the desired core functionality, the system reverse engineers the seed program to infer a representation of the core functionality. It uses active learning to drive the process: because the specification is a program, it is possible to design algorithms that systematically interact with the program to learn the core functionality that it implements. In this paper we present a black box inference algorithm that interacts with the seed program by generating inputs and observing the resulting outputs. Gray box and white box approaches can also be appropriate: they can obtain certain kinds of information more quickly by observing aspects of the implementation, but may require more involved mechanisms that (dynamically or statically) analyze the program and/or its execution.

• Noisy, Partial Specifications: The inference algorithm should be designed to tolerate noise (in the form of implementation errors or overlooked corner cases) and partial implementations of the desired core functionality. The algorithm must therefore be able to isolate the desired common case behavior of the seed program and infer a general specification from that isolated behavior, all while identifying and discarding undesirable behavior (noise) that should not be part of the specification. Such algorithms relieve the developer of the seed program of the need to consider and implement obscure corner cases. The developer can instead simply implement the core common case functionality while omitting error checking and code that handles corner cases. An advantage is that implementing simply the core functionality is often substantially easier than implementing a robust

• Reinterpretation: Many modern programming languages support a simple and basic model of computation (sequential execution, file input and output, standard data structures, a single address space) that usually enables straightforward implementation of the desired core functionality. In many cases, however, the goal is to implement this core functionality in a much more complex environment: to operate on distributed data, to work with data stored in a relational database or key/value store, to access specialized computing devices, to execute time-consuming computations in parallel, to package the core functionality into an appealing graphical user interface potentially accessed via the Internet, or to access values available via remote sensors. To support such implementations, the regeneration algorithm will reinterpret standard constructs to translate them into implementations that operate successfully in the new, more complex target context.

1.1 Scope

The initial scope of the approach is programs that implement simple core functionality (such as storing and retrieving data or performing simple calculations over that stored data) on complex hardware platforms. We anticipate several use cases:

• New Software: In this use case, the seed implementation is developed from scratch, typically by implementing a simple text-based interface in a widely-taught language such as Python. These use cases typically involve substantial reinterpretation and augmentation to re-implement the functionality on more complex production computing environments and/or to provide the system with an enhanced user interface.

• Legacy Systems: In this use case, the developer starts with a legacy system that implements the desired functionality. Here one goal is to start with a system that runs in an obsolete or otherwise undesirable computing context to obtain a regenerated version that can operate successfully in a more modern context. Another goal is to start with a system that may have defects or security vulnerabilities to generate a program without defects or vulnerabilities (by,