Configuration Pinocchio
Total Page:16
File Type:pdf, Size:1020Kb
Configuration Pinocchio: The Lies Plainly Seen and the Quest to be a Real Discipline Andre P. Masella Ontario Institute for Cancer Research Abstract particularly Amazon Elastic Compute Cloud in 2006. As the layers of indirectly coupled components have The construction of configuration files has long been increased, the complexity of configurations have in- considered outside of the domain of “programming”. creased to match. As a central problem, the composabil- However, configuration files have a way of growing more ity of the configurations does not match the composabil- complex. There is a struggle between keeping a config- ity of the servers they describe. Moreover, the configu- uration terse, by having the system infer information au- rations of different servers operate in radically different tomatically, and explicit, without having excessive dupli- ways, each developing unique methods of propagating cation. Either the configuration file develops embedded default values and concisely expressing repetitive infor- domain-specific programming languages or a text-based mation. macro language is put in front. I will identify and catego- To build better configuration management software, rize patterns in the evolution of these programming lan- one must first understand the problem being solved. Con- guages and describe what kinds of patterns are needed to figuration seems simple enough: simply provide some avoid them. values to an application on start-up. Unfortunately, the reality is more nuanced. Firstly, the number of methods 1 Introduction for informing an application of its configuration is larger than expected. These include command line arguments, There has been a shift in the kinds of configurations writ- configuration files, environment variables, and values in ten. Previously, a server would be purchased, configured a database. Secondly, those values can contain sim- to perform some role, and largely left alone for its ser- ple data types, paths, composite data types, macro lan- viceable life. In compute clusters, the pattern is very dif- guages, and even programs written in Turing-complete ferent. Virtual machines allow repeated creation and de- languages that run inside the binary as part of its serv- ployment of machine configurations. Compounding this ing logic. A large number of these values also interact problem, applications have sprawled into many tiers of with each other in non-trivial ways; a change in one pa- services. It was previously common for a single service rameter can affect the interpretation of another. Finally, to be a single binary; now, it is likely to be several bina- the output of the configuration is usually opaque to the ries, running on different virtual machines, and the logic programmer since the true output of the program is hid- of the program is also likely to be spread out into the den inside the server. That is, there is part of the server database and peripheral services. This poses a problem that takes the configuration and interprets it, but there for testing as it is often impractical for a developer to is generally no way to view the interpreted result. This replicate the production environment on their desktop. interpretation depends on three conflated sources of in- Due to these factors, configuration of an application formation: the execution of the configuration (e.g., the has become much more complex. At the very least, con- macro expansion), the defaults provided by the binary, figuration of an application extends beyond the applica- and the invisible interface of the binary (i.e., the param- tion itself and must encompass all the peripheral services eters used by the server itself). It is concerning that the and the cluster environment–to say nothing of the build, defaults provided by the server are not necessarily static test, and deployment logic. This started happening in the themselves; they can be influenced by the server’s build late 1990s with the advent of multi-tier applications and process. accelerated with the use of cloud-based virtual machines, Language-theoretic security researches have described implicitly nesting. In fact, the Windows registry is a hi- Table 1: Configurations analyzed. erarchical key-value store and can be serialized to INI. Server Function Samba’s configuration format is exactly INI.[2] Asterisk Apache Web server stores data in a modified INI file; it has an extra layer of Asterisk Telephony server hierarchy by storing many separate files.[4] CUPS and BIND DNS server Apache HTTPd look different from INI files, but this is CUPS Printing server only superficial. In both formats directives are effectively Make Build system keys and the sections are nested into a hierarchy.[5, 1] NGINX Web server Again, BIND and NGINX looks very different from INI Samba File sharing server files, but describe semantically similar content.[7, 8] All of these servers have an additional property: the or- der of directives matters. They all have access control an exploitation path where any input data that can di- lists (ACL) that can be specified by a collection of “al- rect the flow of a program has the potential to be a pro- low” and “deny” directives, for which the order of the gram that exploits its host program. In this case, the directives matters. Most of the other directives are order host program is abstracted as a very unusual virtual ma- independent. chine, called a weird machine.[9, 3] Configuration is in- Make has the most different format. I wish to justify timately tied to this concept as the server is a weird ma- the inclusion of Make as a configuration format at all. chine for the configuration and the configuration and the Build configurations of all kinds tend to straddle the di- server taken together can be a weird machine for the vide between configuration format and script. For the queries. That is, a configuration can define new exploits purposes of this discussion, a script has control over the in a server. execution flow; in traditional programming languages, To determine the extent of the problem, a survey of the author of the program has control over the order in common servers will be conducted, the problems cate- which pieces of the program execute–this is true even gorized, and potential solutions discussed. in functional programming where the language itself has more control of the real behavior of the program. In a 2 Survey of Configurations configuration, this is not the case. Make provides an example of this: it is not possible to create a circular In order to draw patterns, I analyzed the configurations of dependency in Make as the Make interpreter can detect several common servers shown in Table 1. The goal is to and block it. A Makefile is really a declaration of a de- look for patterns of three types: how defaults are propa- sired scheduling behavior of rule bodies that the Make gated, how transformations are done on the configuration interpreter executes. Make does have key-value pairs for itself (i.e., macro languages), and the kinds of program- variables, but the main focus of a Makefile is the build ming languages that are embedded in the configuration rules. A rule could be considered a key-value pair, with to be used during serving (as distinct from macro lan- the body as the value and the sources and targets as the guages which are only processed at configuration time). key, but the algorithm by which Make examines the com- One of the major concerns with these embedded pro- posite keys would make that an inaccurate analogy. The gramming languages (EPL) is that they are underspec- rules are clearly separate configuration entities. ified; the semantics of their behavior is not laid out as Both Make and Apache HTTPd also have macro rules cleanly as would be expected in a normal programming embedded in the configuration. language. They also may have incomplete separation Despite the simplicity of these configuration files, As- from the macro languages in the configuration itself. terisk, Apache HTTPd, BIND, and NGINX all describe Turing-complete EPLs. 2.1 Formats 2.2 Default Propagation At first glance, it would seem that that the complexity of the configuration is related to the structure of the configu- In any binary, the configuration elements must eventually ration format. That is, there is a tacit assumption that INI be rendered to data structures or object graphs that direct configurations are semantically simpler than JSON ones; the behavior of the running program. When describing which is demonstrably false. Although the servers stud- these elements in the configuration, some information is ied use their own formats, they share remarkably similar elided. For example, each CUPS printer has a maximum structure. page limit, struct cupsd_printer_s.page_limit, Most of the servers use hierarchical key-value stores; but the matching configuration directive, PageLimit, is that is, something like an INI file, but the sections have not required. Therefore, CUPS must impute the elided 2 value by some method. This is what is meant by default ing served.[1] NGINX has a similar, though simplified, propagation. model.[8] Default propagation has two modes: implicit, where BIND has a very complex model. It is worth noting the binary has rules that control how defaults are propa- that BIND has two configuration formats: one for the gated, or explicit, where the programmer specifies how server itself and one for the DNS zones it is serving. In defaults are propagated. The ultimate value for a de- the zone format, which is much simpler, there is only one fault must be present in the binary itself. Sometimes, that default, the TTL for a record, which is either specified for choice could be determined at compile time. In commod- a record, or inherited from the $TTL directive, which is ity software, this is a potentially unpleasant surprise for required.