
Statistical Science
2014, Vol. 29, No. 2, 167–180
DOI: 10.1214/13-STS452
Institute of Mathematical Statistics, 2014

Object-Oriented Programming, Functional Programming and R

John M. Chambers

Abstract. This paper reviews some programming techniques in R that have proved useful, particularly for substantial projects. These include several versions of object-oriented programming, used in a large number of R packages. The review tries to clarify the origins and ideas behind the various versions, each of which is valuable in the appropriate context. R has also been strongly influenced by the ideas of functional programming and, in particular, by the desire to combine functional with object-oriented programming. To clarify how this particular mix of ideas has turned out in the current R language and supporting software, the paper will first review the basic ideas behind object-oriented and functional programming, and then examine the evolution of R with these ideas providing context. Functional programming supports well-defined, defensible software giving reproducible results. Object-oriented programming is the mechanism par excellence for managing complexity while keeping things simple for the user. The two paradigms have been valuable in supporting major software for fitting models to data and numerous other statistical applications. The paradigms have been adopted, and adapted, distinctively in R. Functional programming motivates much of R but R does not enforce the paradigm. Object-oriented programming from a functional perspective differs from that used in non-functional languages, a distinction that needs to be emphasized to avoid confusion. R initially replicated the S language from Bell Labs, which in turn was strongly influenced by earlier program libraries. At each stage, new ideas have been added, but the previous software continues to show its influence in the design as well. Outlining the evolution will further clarify why we currently have this somewhat unusual combination of ideas.

Key words and phrases: Programming languages, functional programming, object-oriented programming.

John M. Chambers is Consulting Professor, Department of Statistics, Stanford University, Stanford, California 94305-4065, USA.

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2014, Vol. 29, No. 2, 167–180. This reprint differs from the original in pagination and typographic detail.

arXiv:1409.3531v1 [stat.ME] 9 Sep 2014

1. INTRODUCTION

R has become an important medium for communicating new methodology in statistics and related technology. References to supporting software frequently accompany journal articles or other publications describing new results. The software is available to other users, ideally as a package in R in a standard repository. The benefits for statistics as a discipline are considerable: The community has rapid access to new ideas in a free, open-source format as software that can in most cases be installed and used immediately by those interested in the statistical techniques. The user community has both created and benefited from this resource.

This paper examines two of the most significant paradigms in programming languages generally: object-oriented programming (OOP) and functional programming. R makes use of both, but in its own way. Both paradigms are valuable for serious programming with the language. But in both cases, understanding the relevant ideas in the context of R is needed to avoid confusion. The confusion sometimes arises, in both cases, from applying to R interpretations of the paradigms that apply to other languages but not to this one. Section 2 of the paper will review the ideas, generally and in their R versions, with the goal of clarifying the basics. Given the importance of R software to the community, creators of new R software should benefit from understanding these concepts.

We will also examine in Section 3 of the paper the evolution that led to these versions of functional programming and OOP. The prime motivation was not language design in the abstract but to provide the tools needed for research and data analysis by the user community at the time. R originally reproduced the functionality of the S language at Bell Labs, which itself had evolved through several stages beginning in the late 1970s and which was in turn based on earlier statistical software libraries, mainly in Fortran.

R added important new ideas and has continued to evolve, but the main contents inherited through S shaped the capabilities and the approach to statistical computing. In a surprising number of areas, what we think of as “the R way” of organizing the computations actually reflects software developed twenty years or more before R existed.

Having been involved in all the stages, I am naturally inclined to a historical perspective, but it is also the case that the history itself had substantial impact on the results. It may be comforting to view programming languages as abstract definitions, but in practice they evolve from the needs, interests and limitations of their creators and users.

2. FUNCTIONAL AND OBJECT-ORIENTED PROGRAMMING: THE MAIN IDEAS

Functional and object-oriented programming fit naturally into statistical applications and into R. The original motivating use case, fitting models to data, remains compelling. An expression such as

    irisFit <- lm(Sepal.Width ~ . - Sepal.Length, iris)

calls a function that creates an object representing the linear model specified by the first argument, applied to the data specified by the second argument. The computation is functional, well-defined by the arguments. It returns an object whose properties provide the information needed to study and work with the fitted model. Other functions and other objects can adapt to different models in a form that is convenient for both the user and the implementer. Principles of functional programming guide us in writing reliable, reproducible functions for the different models. Object-oriented programming provides tools for defining the model objects clearly, and adapting to new ideas and new forms of models. Section 3.4 goes into details of the R implementations.

As they have been realized in R, both paradigms center on a few, intuitive concepts. The details are more complicated, as they usually are. In the case of functional programming, the realization in R is only partial, reflecting the language’s origins as well as practical considerations. In the case of OOP, there are now at least three realizations of the ideas in R, using two different paradigms. All three have significant applications and practical value. Despite all these devilish details, the main ideas remain visible and useful, particularly when programming serious applications using the language.

2.1 Functional Programming

For our purposes, the main principles of functional programming can be summarized as follows:

1. Programming consists largely of defining functions.
2. A function definition in the language, like a function in mathematics, implies that a function call returns a unique value corresponding to each valid set of arguments, but only dependent on these arguments.
3. A function call has no side effects that could alter other computations.

The implication of the second point is that functions in the language are mappings from the allowed set of arguments to some range of output values. In particular, the returned value should not depend on other quantities that affect the “state” of the software when the function call is evaluated.

True functional languages conform to these ideas both by what they do provide, such as pattern expressions, and what they do not provide, such as procedural iteration or dynamic assignments. The classic tutorial example of the factorial function, for example, could be expressed in the Haskell language by the pattern:

    factorial x = if x > 0 then x * factorial (x-1) else 1

plus some type information, such as that a value for x must be an integer scalar.

Is R a functional programming language in this sense? No. The structure of the language does not enforce functionality; Section 2.3 examines that structure as it relates to functional programming and OOP. The evolution of R from earlier work in statistical computing also inevitably left portions of earlier pre-functional computations; Section 3 outlines the history. Random number generation, for example, is implemented in a distinctly “state-based” model in which an object in the global environment (.Random.seed) represents the current state of the generators. Purely functional languages have developed techniques for many of these computations, but rewriting R to eliminate its huge body of supporting software is not a practical prospect and would require replacing some very well-tested and well-analyzed computations (random number generation being a good example).

Functional programming remains an important paradigm for statistical computing in spite of these limitations. Statistical models for data, the motivating example for many features in S and R, illustrate the value of analyzing the software from a functional programming perspective. Software for fitting models to data remains one of the most active uses of R. The functional validity of such software is important both for theoretical justification and to defend the results in areas of controversy: Can we show that the fitted models are well-defined functions of the data, perhaps with other inputs to the model such as prior distributions considered as additional arguments? The structure of R as described in Section 2.3 can provide support for analyzing functional validity. Equally usefully, such analysis can also illuminate the limits of functional validity for particular software, such as that for model-fitting.

2.2 Object-Oriented Programming

The main ideas of object-oriented programming are also quite simple and intuitive:

1. Everything we compute with is an object, and objects should be structured to suit the goals of our computations.
2. For this, the key programming tool is a class definition saying that objects belonging to this class share structure defined by properties they all have, with the properties being themselves objects of some specified class.
3. A class can inherit from (contain) a simpler superclass, such that an object of this class is also an object of the superclass.
4. In order to compute with objects, we can define methods that are only used when objects are of certain classes.

Many programming languages reflect these ideas, either from their inception or by adding some or all of the ideas to an existing language.

Is R an OOP language? Not from its inception, but it has added important software reflecting the ideas. In fact, it has done so in at least three separate forms, giving rise to some confusion that this paper attempts to reduce.

Some of the confusion arises from not recognizing that the final item in the list above can be implemented in radically different ways, depending on the general paradigm of the programming language. A key distinction is whether the methods are to be embedded in some form of functional programming.

Traditionally, most languages adopting the OOP paradigm are not functional; either the language began with objects and classes as a central motivation (SIMULA) or added the paradigm to an existing non-functional language (C++, Python). In such languages, methods were naturally associated with classes, essentially as callable properties of the objects. The language would then include syntax to call or invoke a method on a particular object, most often using the infix operator “.”. The class definition then encapsulates all the software for the class. Where methods are needed for other computations, such as special method names in Python or operator overloading in C++, these are provided by ad hoc mechanisms in the language, but the method remains part of the class definition.

In a language that is functional or that aspires to behave functionally as S and R do, the natural role of methods corresponds to the intuitive meaning of “method”: a technique for computing the desired result of a function call. In functional OOP, the particular computational technique is chosen because one or more arguments are objects from recognized classes.

Methods in this situation belong to functions, not to classes; the functions are generic. In the simplest and most common case, referred to as a standard generic function in R, the function defines the formal arguments but otherwise consists of nothing but a table of the corresponding methods plus a command to select the method in the table that matches the classes of the arguments. The selected method is a function; the call to the generic is then evaluated as a call to the selected method.

We will refer to this form of object-oriented programming as functional OOP as opposed to the encapsulated form in which methods are part of the class definition.

2.3 Their Relationship to R

To understand computations in R, two slogans are helpful:

• Everything that exists is an object.
• Everything that happens is a function call.

In contrast to languages such as Java and C++ where objects are distinct from more primitive data types, every reference in R is to an object, in particular, to a single internal structure type in the underlying C implementation. This applies to data in the usual sense and also to all parts of the language itself, such as function definitions and function calls. Computations that are more complex than a constant or a simple name are all treated as function calls by the R evaluator, with control structures and operators simply alternative syntax hiding the function call. [Details and examples are shown in Chambers (2008), pages 458–468.]

The two slogans, however, do not imply that computations in R must follow either functional or object-oriented programming in the senses outlined in the preceding sections. With respect to object-oriented programming, R has several implementations that have evolved as outlined in Section 3. These can be used by programmers to provide software following either of the OOP paradigms.

Functional programming’s relationship to R is less straightforward. The evaluation process in R does not enforce functional programming, but does encourage it to a degree. In particular, the evaluation process in R contributes to functional programming by largely avoiding side effects when function calls are evaluated, but some mechanisms in the language and especially in the underlying support code can behave in a non-functional way. To understand this in a bit more detail, we need to examine the evaluation process.

Computations in R are carried out by the R evaluator by evaluating function call objects. These have an expression for the function definition (usually a reference to it by name) and zero or more expressions for the arguments to the call. The full details are somewhat beyond our scope here, but an essential question is how references to objects are handled. Any programming language must have references to data, which in R means references to objects. As discussed in Section 3, the evolution of such references is central to the evolution of programming languages, especially for statistics.

In R a reference to an object is the combination of a name and a context in which to look up that name; the contexts in R are themselves objects, of type “environment”. A reference is therefore the combination of a name and an environment. (We’ll look at an example shortly.)

Note that we are talking about references to objects; most objects in R are not themselves reference objects. Languages implementing OOP in the traditional, non-functional form essentially always include reference objects, in particular, what are termed mutable references. If a method alters an object, say, by assigning new values to some of its properties, all references to that object see the change, regardless of the context of the call to the method. Whether the reassignment of the property takes place where the object originated or down in some other method makes no difference; the object itself is the reference.

In contrast, the reference in R consists of a name and an environment, namely the environment in which the object referred to has been assigned with that name. Most R programming is based on a concept of local references; that is, reassigning part of an object referred to by name alters the object referred to by that name, but only in the local environment. If that local reference started out as a reference in some other environment, that other reference is still to the original object.

To understand the relation of local references to functional programming in R, an example and a few more details of function call evaluation are needed. R evaluates function calls as objects. For example, when the evaluator encounters the call

    lm(Sepal.Width ~ . - Sepal.Length, iris)

it uses the object representing the call to create an environment for the evaluation.

The call identifies the function, also an object of course, typically referring to it by name. In this case lm refers to an object in the stats package. That object has formal arguments [14 of them, in the case of lm()]. The evaluator initializes an environment for the call with objects corresponding to the formal arguments, as unevaluated expressions built from the two actual arguments and default expressions found in the function definition. For details see Section 4 of the language definition, R Core Team (2013) and Chapter 13 of Chambers (2008). As an aside, the common use of terms like “call by value” (and the contrasting “call by reference”) for argument passing in R is invalid and misleading. Arguments are not “passed” in the usual sense.

Local references operate on all the objects in the environment to prevent side effects. The formal argument data to lm() matches the expression iris, which refers to an object in the datasets package. Expressions that extract information from data work on that object. But the local reference defined by data and the environment of the evaluation is distinct from the reference to iris in the package. If an assignment or replacement expression is encountered that would alter data, the evaluator will duplicate the object first to ensure locality of the reference.

The local reference paradigm is helpful in validating the functionality of an R function. Only the local assignments and replacements need to be examined; calls to other functions will not alter references in this environment, so long as those functions stick to local reference behavior. If a function f() calls a function g() and both functions stick to local reference assignments, then knowing that the value of a call to g() depends only on the arguments is all that is needed; how g() computes that value is irrelevant.

While local references help avoid side effects, they do not prevent computations from referring to objects or other data outside the functions being called, and therefore potentially returning a result that depends on a non-functional “state.” Whether a particular computation in R is strictly functional can only be determined by examining it in detail, including all the functions that call code in C or Fortran. The rest of this section takes a slight detour to consider how one might do that examination.

Validating Functionality in R

In principle, the functional validity of particular computations could be analyzed and either certified or the limitations to functionality reported. Such functional validation would be useful in cases where either the theoretical validity or the implications of the result in an application are being questioned. Fitting models to data provides a natural example for both aspects. Given a function taking as arguments data and a model specification and returning a fitted model object, can one validate that the returned object is functionally defined by the arguments? If not, can the non-functionality be parametrized meaningfully, in which case one can construct a functional version of the computation by including such parameters as implicit arguments? R does not have organized support for such validity investigations, but developing tools for the purpose would be a worthwhile project.

Functional validation is a bottom-up construction. The bottom layer consists of any functions called that are not implemented in R, typically those that call routines in C++, C or Fortran. Included are the R primitives, routines from numerical libraries and a variety of other standard sources, plus any new code brought in to implement the computation in question. The functional validity of each of these is an empirical assertion. Some are clearly non-functional, such as the “<<-” operator and assign() function that do nonlocal assignments.

Many computations in R eventually call subprograms not originally written for R. Each of these must be examined for potential non-functional behavior, sometimes a daunting task. However, good practice in using well-tested, preferably open-source supporting software will often provide a plausible basis.

If R code includes an interface to code in C, Fortran or other languages whose functional validity cannot be established, nothing more can be said. Other than such code, functional validity is likely to fail for one of three reasons:

• dependence on nonlocal values;

• using low-level computations in R known to violate functionality;
• changing functions or other objects at run time.

A prime example of the first is the use of external data, such as the global options object, for convergence tolerances or other parameters for iterative numerical computations. An example of the second is the inclusion of pseudo-random values in the calculation. The third problem might be caused, for example, by using a function from the global environment.

The third danger is greatly reduced when the code resides in the namespace of a package with explicit import rules. Any reasonable approach to validating functionality would make this a requirement.

My feeling is that most examples of failures could be corrected to create functionally valid extensions of the computation in question. Tolerances are often organized through the R options() function, explicitly designed to avoid functional programming by allowing users to set state parameters that are then queried by the calculation. Once identified, such options could be converted to additional arguments to the function being validated. [A general mechanism would be a version of getOption() that required the option in question to be supplied as an argument.]

Pseudo-random values are used in a variety of procedures, including some optimization techniques where they are expected to provide more robust numerical behavior by jittering values during iteration. These can be made functionally valid by using well-defined generator software, such as that supplied in R itself, and by treating the initial state of the generator as another nonlocal value to be incorporated as an additional argument. One should always include an explicit initialization via set.seed() in any example expected to be reproducible, and that practice can be the basis for a functionally valid version of the computation.

Beyond these specific examples, numerical computations often depend on the underlying parameters of the floating-point computations, for example, to select convergence criteria for iteration. Fortunately, several decades of work by numerical analysts and hardware designers have greatly standardized the specification of the numerical engine in modern computers: just knowing 32-bit or 64-bit gets us a long way.

Developing a framework for validating functionality seems to me an interesting cooperative research direction that could be of value to the statistical community.

3. THE EVOLUTION OF FUNCTIONAL PROGRAMMING, OOP AND R

The computational paradigms for functional programming and for object-oriented programming have evolved from a sequence of changes in software, beginning with the earliest programmable computers. During the same period, software for statistics was also evolving, one strand of which led through early libraries to S and then to R.

There may be an appearance of earlier languages being replaced by later and presumably improved approaches. It is true that each major revision asserts improvements that will extend our abilities to express our ideas in software. However, none of the versions of S or R actually totally replaced earlier software paradigms.

The current software in, and interfaced from, R illustrates this evolution. R has developed important new techniques, but originated from the S language, reproducing nearly all of S as it was described at that time. S in turn went through several evolutionary changes and was itself based on extensive earlier software, particularly libraries for Fortran programming. Examining the history shows that a surprising portion of what we see now is structure inherited from the early stages.

The form in which functional programming and OOP were adopted was also influenced by the existing software. Examining the history will explain many of the choices made.

3.1 From Hardware to Data and Libraries

The earliest general-purpose computers were programmed in terms of the physical machine, its storage and the basic operations provided to move data around and perform arithmetic and other operations. The IBM 650 (Figure 1) was probably the first computer widely sold and used (and the machine on which I did my first programming, around 1960).

In this pre-silicon world, storage for data or programs resided on a rotating magnetic drum, holding 2000 decimal words. Data could be read or written only when the corresponding segment of the drum passed under the appropriate fixed head, so that physical positioning of data was a serious aspect of performance. With this close view of the hardware, programming languages (assembly languages for the actual machine instructions) defined storage in terms of single physical units (words in the 650) and blocks of sequential storage.

Fig. 1. An IBM 650 computer, mid 1950s. Under the glass is the magnetic drum storage unit (memory), 2000 words for data and programs.

This was not an environment to encourage abstraction of ideas about data. However, by 1960 the first generation of “high-level” languages had been introduced and would support profound changes. For statistical computing this meant primarily Fortran.

In terms of data storage, Fortran actually continued the basic notion of single items (scalars) and contiguous blocks (arrays). Two major changes, however, were made. First, storage was described in terms of its content, the first data types including integer and floating-point numbers. Second, the language encouraged operations that iterated over the contents of the arrays. By interpreting an array as a sequence of equal-length subarrays, this indexing extended to matrices and to multi-way tables.

Along with the new paradigm for data and facilities for iteration, the high-level languages encouraged software to be organized in subroutines, so that a computational method could be realized as one or several units of software. While the changes may seem modest from the current perspective, they in fact supported a major revolution in scientific computing generally and emphatically so in computing for statistics.

Algorithm series and other publications supported by professional societies began to accumulate refereed, trustworthy procedures for many key computations. The statistics research group at Bell Labs developed a large Fortran library that reflected our needs and our philosophy of research and data analysis. The book “Computational Methods for Data Analysis”, Chambers (1977), did not present software but did reflect the tools that would later form the basis for S. After an introduction and discussion of program design, the remaining six chapters covered computations supported by the library:

3. Data Management and Manipulation (including sorting and table lookup).
4. Numerical Computations (approximations, Fourier transforms, integration).
5. Linear Models (numerical linear algebra, regression, multivariate methods).
6. Nonlinear Models (optimization, nonlinear least squares).
7. Simulation of Random Processes (random number generation and Monte Carlo).

8. Computational Graphics (plotting techniques, scatter plots, histograms and probability plots).

Each of these was supported in the pre-S era by subroutines that would then become the basis for corresponding functions in S.

Much of the organization for basic tools in R has inherited, through S, the structure of the subroutine library. That includes the graphical computations, in particular, features essential to S and R: separation of graphic device specification from plotting; the plot, figure and margins structure; graphical specification to control style. These were not created for S but taken over from previous Fortran software, described in Becker and Chambers (1977).

The Bell Labs software was in the background of Chambers (1977), but general readers were given instructions for obtaining similar software from publicly available sources for the methods described. The procedure would not always be simple, but the potential availability marked a big step forward. For the first time, statisticians could draw on an extensive range of relevant software to support their research, at least in principle. Various statistical software packages had existed for some time, but these were by and large oriented to routine analysis, to teaching or to specialized statistical techniques. Chambers (1977) and the software it reflected were aimed at research in statistics and challenging data analysis. For this purpose, a more general and open-ended approach was needed.

3.2 From Fortran to S

For those involved with statistical theory or applications, in academia or industry, there were two main limitations to the software described so far: availability and the programming interface. The Appendix to Chambers (1977) was a set of tables for each of the chapters, with rows corresponding to computational tools that were more or less available to readers. The last column of the table listed sources for the corresponding software. The entries in that column were not uniformly helpful; in the best situation, a generally available program library could be ordered that provided a number of the subroutines, but these were not designed for statistical applications, most being directed at numerical methods typically motivated by applications in physics. More than half of the entries read “Listing,” implying a laborious and error-prone manual procedure for the user. [As an example, many “bug reports” came to us as a result of confusing an “I” and a “1” when typing in the stable distribution software, Chambers, Mallows and Stuck (1976).]

Substantial in-house libraries, such as the one at Bell Labs, gave users a fairly wide range of computations, supported by improved numerical and other algorithms. However, to apply the computations specifically to a particular dataset with particular results in mind required some substantial additional Fortran programming. That programming had to be repeated and revised for each analysis or research question.

In the 1970s the situation was therefore a combination of improved basic computational capabilities but with a high programming barrier for most statisticians. The classical linear regression in Fortran as shown in Becker and Chambers (1985), for example, was fairly straightforward:

    call lsfit(X, N, P, y, coef, resid)

This computes the fitted model and returns it as vectors of coefficients and residuals. The data as objects are restricted to arrays, a matrix X and vector y for the data and two arrays, coef and resid, for the fitted model. The structure of the objects and their storage allocation remains the programmer’s responsibility. Linking the basic computation to the data in an actual analysis remained nontrivial and mistakes along the way were likely. And this is for the most standard of models. Even given an extensive library, the programming to apply the tools to most applications was a laborious, error-prone activity, usually assigned to dedicated programmers, research assistants or students. The statistician’s ideas went through nontrivial translation before they were expressed as computations.

The first two versions of S were designed to provide an “interactive environment” that included the computational areas described in Chambers (1977) and that allowed the statistician to formulate ideas directly for computation. The second version of S was licensed for general use and described in Becker and Chambers (1984).

In S, the linear regression computation became a simpler expression, storage for data was provided automatically and the returned model was now an object, with components for the coefficients and residuals:

    fit <- reg(X, y)
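The step from a Fortran subroutine call to a single S expression can still be seen in present-day R, whose stats package provides a function named lsfit() with a matrix-and-vector interface. The sketch below is only an illustration under stated assumptions: the simulated data, the variable names and the use of lsfit() in place of the S function reg() are choices made here, not taken from the original software.

```r
# Illustrative sketch: simulated data, not taken from the paper.
set.seed(1)
n <- 50
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))            # predictor matrix
y <- 1 + 2 * X[, "x1"] - X[, "x2"] + rnorm(n, sd = 0.1)

# stats::lsfit() stands in here for the S function reg() in the text:
# storage is allocated automatically and the result is a single object.
fit <- lsfit(X, y)

names(fit)              # includes "coefficients" and "residuals"
fit$coefficients        # intercept first, then one entry per column of X
length(fit$residuals)   # one residual per observation
```

Unlike the Fortran version, no array dimensions or output arrays are passed; the components are extracted from the returned object by name.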

At this stage, S had a functional appearance, not radically unlike R, but its paradigm was essentially an extension of the Fortran view. Dynamically created, self-describing objects were assigned in a single workspace, but the underlying computations were those of the earlier subroutine library: The functions in S, documented in Becker and Chambers (1984), were in fact interfaces to Fortran subroutines: reg() would in fact be programmed by calling lsfit(). Although there was a macro facility in the language, programming a function in this version of S meant “extending S” as described in the book of that name, Becker and Chambers (1985). The definition of the new function was programmed in an “interface language” built on Fortran and compiled from its Fortran translation. As the main programming mechanism this was unsatisfactory, in the sense that extending the language had a substantial learning barrier beyond using the language. The ability to access other software via an inter-system interface remains a key feature of R, however, one still under active development.

As important as the technical side was the beginning of a network of statisticians involved in creating and sharing software through the medium of the language. S was licensed from the early 1980s, available thanks to the newly distributed UNIX operating system, with inexpensive academic licenses to encourage adoption by university researchers, also following the example of UNIX. Open-source software was not an option, but the research community was increasingly involved and their interest stimulated further developments on our part, particularly from contacts with interested users belonging to a “beta testing” network.

Simultaneously, we were thinking about a new approach to the language itself, emphasizing the programming aspect of creating new software for statistical and other quantitative applications. Described initially in Chambers (1987) as a language separate from S, this research later merged with other changes to form the next version, labeled S3 and described in the “blue book,” Becker, Chambers and Wilks (1988). The slogans in Section 2.3 were basic to this version of S: everything is an object (stated explicitly) and function calls do all the computation (implicit).

This was functional programming (more or less) and object-based but not object-oriented. Objects were given structure through attributes attached to vectors and through named components, but there were no classes or methods.

3.3 From Data to Classes and Methods

The languages that originated the concepts of classes, properties, inheritance and methods came out of several motivations. The first, Simula, was concerned with simulating systems. In retrospect, modeling by simulation and modeling by fitting to data have clear correspondences but with quite a different perspective. For an example, suppose we want to simulate a simple model for an evolving population of individuals. In R notation, but quite in the style of Simula, we define a class SimplePop. An object from this class is a specific realization of the model population with properties that define the probabilities of birth and death, and a vector of population size at each generation. An object from the population is created by calling the generator for the class:

    p <- SimplePop(birth = 0.08, death = 0.1, size = 100)

Rather than a single functional computation as in the case of linear regression, computations proceed by simulating the evolution of the population object p. The object itself evolves; in the terminology of OOP, it is a mutable reference.

A corresponding difference in the programming paradigms of S and the emerging OOP languages was that the latter did not take a functional view of computation. Instead, computations largely consisted of invoking a method on an object. In the SimplePop example, the fundamental computation is to simulate one generation of the evolution by invoking the evolve() method:

    p$evolve()

The value returned by this method is irrelevant. The method’s purpose is to change the object, in this case by simulating one further generation and appending the resulting value to a property in the object, namely, p$size. (See files “SimplePop.R” and “SimplePopExample.R” in the supplementary materials.)

Following the development of Simula in the late 1960s, a variety of languages adopted this paradigm. C++ added classes and methods to the C language; like C, it was initially used for a variety of programming tasks implementing UNIX and application software for UNIX. In contrast to the “add-on” nature of C++, the Smalltalk language was a very pure, simplified realization of the ideas in Simula. Its major, and revolutionary, application was to implement the graphical user interface created at Xerox PARC in the 1970s. Many other versions of encapsulated OOP followed, either added on to existing languages or incorporated into new languages from the start. Dialects of the Lisp language and languages based on Lisp also incorporated OOP in various forms. During the 1980s, several research projects built statistical software on the basis of these languages, including some elegant and potentially widely applicable systems, notably LISP-STAT, Tierney (1990). As it turned out, however, the most widely used version of OOP for statistical applications would come from a somewhat casual approach in S.

3.4 Functional OOP in S and R

The chief motivation for introducing classes and functional methods to S was the initial application: fitting, examining and modifying diverse kinds of statistical models for data. This remains arguably the most compelling example for functional OOP in statistics. The “Statistical Models in S” project reported in Chambers and Hastie (1992), the “white book,” brought together ten authors presenting software for a variety of statistical models, from linear regression to tree-based models. The different models were presented as consistently as possible.

Each type of model had a definition as an object having the information, such as coefficients and other properties, required. The object was created by a corresponding function taking as arguments the data, model description and possibly other controlling parameters. A linear regression fit, for example, called the function lm():

    irisFit <- lm(Sepal.Width ~ . - Sepal.Length, iris)

and returned a corresponding linear regression object. Further computations on this object would examine the model, return information about it, or update the fit. The underlying computations still used basic software similar to that for lsfit() and reg(). However, the description of the model (a formula) and the data (a data frame) were designed to apply to statistical models generally. For example, to fit a generalized linear model the user called glm() with formula and data arguments typically similar to those in a call to lm(). Other arguments would provide information suitable to the particular type of model (a link function, e.g.).

For the convenience of the user, further computations should have a uniform appearance. To print or plot the fitted model or to compute predictions or an updated model corresponding to new data, the user should call the same function [print(), plot(), predict() or update()] in the same way, regardless of the type of model. The owner of the software for a particular type of model, on the other hand, would like to write just that version of each function, without being responsible for the other versions.

Once stated, this is essentially a prescription for functional OOP: a class of objects for each type of model, generic functions for the computations on the objects and methods for each function for each class. Where one class of models is an extension of another (analysis of variance as a subclass of linear models, e.g.), methods can be inherited when that makes sense.

An implementation of generic functions and methods was introduced as part of the statistical models project and described in the Appendix to the white book. The central mechanism was an explicit method dispatch. The function print(), for example, would evaluate the expression:

    UseMethod("print")

The evaluation of this call would examine the “class” attribute of the first formal argument to the function. If present, this would be a character vector. Eligible methods would be those matching one of the strings in the class vector; if none matched, a method matching the string “default” would be used. Inheritance was implemented by having more than one string in the class, with the first string being “the” class and the remainder corresponding to inherited behavior.

Chambers and Hastie (1992), in the discussion of classes and methods, noted that S differed from other OOP languages because of its functional programming style. In fact, this version of functional OOP finessed the resulting distinction from encapsulated OOP in two ways. First, the methods were dispatched according to a single argument, the first formal argument of the generic function in principle. As a result, the methods were unambiguously associated with a single class, as they would be in encapsulated OOP. Methods were actually dispatched on either argument to the usual binary operators, but a number of encapsulated OOP languages do the same, under the euphemism of operator overloading.
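The dispatch rule just described (match the strings of the class vector in order, then fall back to “default”) can be tried directly in R. The following is a minimal sketch; the generic describe() and the class names "linmod", "aovmod" and "unknown" are hypothetical names invented for this illustration, and the methods follow R's convention of naming a method as the generic name, a dot, and the class name.

```r
# Hypothetical generic and classes, named only for this illustration.
describe <- function(x, ...) UseMethod("describe")

# Methods are ordinary functions whose names combine generic and class.
describe.default <- function(x, ...) "some object"
describe.linmod  <- function(x, ...) "a linear model"

# The class attribute is just a character vector; the first string is
# "the" class and later strings supply inherited behavior.
fit1 <- structure(list(), class = "linmod")
fit2 <- structure(list(), class = c("aovmod", "linmod"))
fit3 <- structure(list(), class = "unknown")

describe(fit1)   # "a linear model": direct match on "linmod"
describe(fit2)   # "a linear model": no "aovmod" method, so "linmod" is used
describe(fit3)   # "some object": nothing matches, so the default is used
```

Note that nothing declares "aovmod" to be a subclass of "linmod"; the relationship exists only in the class vector attached to the individual object fit2.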

Second, the question of whether methods belonged to a class or a function was avoided by not having them belong to either. Methods were assigned as ordinary functions and identified by the pattern of their name: “function.class”. In any case, there were no class objects and generic functions were ordinary functions that invoked UseMethod() to select and call the appropriate method. Neither the function nor the class was able to own the methods.

Technically, the method dispatch in this version of OOP was instance-based, not class-based, since no rule enforced a consistent set of classes, that is, that all objects with a given first class string would have identical following strings for the superclasses. (R for some time had an S3 class in the base package with a main class string “POSIXt”, representing date/times, that could be followed in different objects by one of two strings that in fact represented specializations, i.e., subclasses, of “POSIXt”.)

The classes and methods implemented for statistical models constituted a bare-bones version of functional OOP, which is not to imply that this was a bad idea. Advantages include a relatively low learning barrier for programming and a thin implementation layer above the previously existing language, which in turn means less computational overhead in some circumstances. [Interestingly, the encapsulated OOP of Python has a similarly thin implementation, with classes containing methods but without defining the properties. A very analogous defense is made for that implementation, in Section 9 of the Python tutorial, Python (2013), e.g.]

A more formal version of functional OOP was developed at Bell Labs, introduced into S in the late 1990s and described in Chambers (1998). By this time, S-based software was exclusively licensed to the Insightful Corporation, which later purchased the rights to the S software, in 2004, and was itself subsequently purchased by Tibco.

The new paradigm differed from S3 classes and methods in three main ways:

1. Methods could be specified for an arbitrary subset of the formal arguments, and method dispatch would find the best match to the classes of the corresponding arguments in a call to the generic function.
2. Classes were defined explicitly with given properties (the slots) and optional superclasses for inheriting both properties and methods.
3. Generic functions, methods and class definitions were themselves objects of formally defined classes, giving the paradigm reflectivity.

The new paradigm was part of the version of S described in the 1998 book and generally referred to as S4. The S4 label is generally applied to this OOP paradigm, whether in S or R. S4 methods never had much chance of replacing S3 methods. In practice, many S4 generic functions were based on functions that already dispatched S3 methods. In this case, the S3 generic function became the default S4 method.

The work on S4 paralleled in time the arrival of R and its conversion into a broad-based joint project following the initial publication by Ihaka and Gentleman (1996). The implementation of R was designed to provide the functionality for S described in the blue book and white book, including S3 methods. Beginning in 2000, an implementation of the S4 version of OOP was added to R. The “Software for Data Analysis” book, Chambers (2008), includes a description of the R version.

Both versions of functional OOP will remain in R. Many prefer the simplicity of the old form, and in any case the very large body of existing code will not be discarded, and should not be. Some important extensions have been made, for example, by registering the S3 methods from a package. Major forward-looking projects have typically used the newer version, for example, the Bioconductor project for bioinformatics software, Gentleman et al. (2004), and the Rcpp interface to C++, Eddelbuettel and François (2011). Recent changes, such as making the S3 and S4 versions of inheritance as compatible as possible, have been aimed at helping the two forms to coexist productively.

Any programming language with some degree of formality is likely to have a higher initial learning barrier and require some extra specification from the programmer. A comparison of encapsulated OOP programming with Python to that with Java is an interesting parallel to S3 and S4. In both examples, the less formal version is likely to be quicker to learn, while the more formal version provides more information about the resulting software. That information in turn can support some forms of validation for the resulting software, as well as tools to analyze and describe it. Python and Java being rather different languages in other respects as well, projects are not too likely to make a choice between them based solely on the formality of the object-oriented programming.

With R, a conscious choice is more likely. The arguments for a more formal approach apply particularly, in my opinion, to projects with one or more of the following characteristics: a substantial amount of software is likely to be written; the application has a fairly wide scope in terms of either the data or the computing methods; or the validity and reliability of the resulting software are important.

Nothing prevents good software from being written without formal tools in this case, nor bad software from being written with them. However, there are several potential benefits that can be summarized in parallel with the main innovations noted above:

1. Allowing methods to depend on multiple arguments fits the functional paradigm in R, in which the arguments collectively define the domain of the function. Many functions in R are naturally applied to different classes of objects, not necessarily corresponding to the first argument, or only to one argument. For example, when binary operators such as arithmetic are defined for a new class, a clean design of methods for the operators often needs to distinguish three cases: the first operand only belonging to the new class, the second operand only or both operands.
2. A formal definition for a class allows programmers to rely on the properties of objects generated from the class. Otherwise, the nature of the objects can only be inferred, if at all, from analyzing all the software that creates or modifies an object of this class.
3. Having formal definitions for the generic functions, methods and class definitions themselves supports a growing set of tools for installing and using packages that include such functions, methods or classes.

The benefits of a general, reliable form of functional OOP extend to developments in the language itself. For example, reference classes were built on the S4 classes and methods, with no internal changes to the R evaluator required.

3.5 Reference Classes

Functional OOP remains an active area in R. In addition, reference classes, introduced to R in 2010 in version 2.12.0, provide an implementation of encapsulated OOP. Class definitions include the properties of the class with optional type declarations; properties may also be optionally declared read-only. Class definitions are themselves objects available at runtime. Methods are programmed as R functions, in which the object itself is implicitly available, not an explicit argument. Methods can access or assign properties in the object by name. These characteristics make the implementation more Java-like, say, than Python- or C++-like. The programmer defines a reference class in the R style, calling setRefClass() instead of setClass(). The call returns a generator for the class and saves the class definition object as a side effect, as does setClass() for S4 classes.

As a side comment, while R uses a model for most of its objects and computations that is fundamentally different from the object references in encapsulated OOP, a few key features made the implementation of reference classes in R possible and even relatively straightforward. Most importantly, the R data type “environment” provides a vehicle for object references and properties. Environments are universal in R and well supported by programming tools. In particular, the active binding mechanism, which allows access and assignment operations on objects in environments to be programmed in R, was valuable in the implementation.

Reference classes allow the use of encapsulated OOP for objects that suit that paradigm more naturally than they do functional OOP. As noted in Section 3.3, the essential distinction between functional and encapsulated OOP is whether an object is created, once, by a function call or is instead a mutable object that changes as methods are invoked.

Statistical computing has examples clearly suited to each of these paradigms. The linear model returned by lm() is not open to mutation. Change the numbers in the coefficients or residuals and you no longer have an object that should belong to that class. In contrast, a model simulating a dynamic process such as the SimplePop class in Section 3.3 exists precisely for the purpose of changing, with its evolution being the central point of interest. Other, less directly statistical computations in R also may correspond to mutable objects, for example, the frames or other objects in a graphical interface.

Not every case is clear cut. Sometimes, essentially the same class structure may be more appropriate for functional or encapsulated classes depending on the purpose of the computation. Data frames are a prime example.
This essential object structure is viewed naturally as functional when it is part of a functional object related to the data frame. For example, a fitted model that wanted to be fully reproducible could return the data frame on which the fitting was based [e.g., lm() includes the model frame it constructs]. Such a data frame is clearly functional; again, change it and you invalidate the model. On the other hand, a data frame to be used in data cleaning and editing is an object that needs to be mutable.

Having both paradigms in a single language is unusual. Some functional-style languages have implemented functional OOP, notably Dylan, interesting for its parallels with OOP in R; see Shalit (1996), particularly the discussion of method dispatch. Other languages with a functional structure have nevertheless added what is essentially encapsulated OOP, for example, Odersky, Spoon and Venners (2010) for the case of Scala.

We hope that providing both paradigms in R encourages software design that is natural for the application. It does at the same time pose some subtleties. Reference classes and reference class objects are somewhat abnormal in R. One needs to understand the distinctions from standard R objects.

The key is the local reference mechanism noted in Section 2.3. The R evaluator enforces local reference by duplicating an object when a computation might alter a nonlocal reference. Certain object types are exceptions that are not duplicated. The important exception is type “environment”. Reference classes are implemented by extending this type. Encapsulated OOP in R uses no special form of the function call. Method invocation is just a call to the “$” operator, for which reference classes have an S4 method. Reference semantics are obtained by one basic fact: environments are never duplicated automatically. The S4 class mechanism in R nevertheless allows one to subclass the “environment” type in order to define reference class behavior.

The objects in the fields of a reference class object can be ordinary R objects. They behave just as usual and when used in function calls will have regular local reference behavior in that call. It is only when fields in the reference object itself are replaced that the encapsulated OOP is relevant.

Reference class objects are also good candidates for interfaces to other languages that implement the same OOP paradigm, such as Java, C++ or Python. The R object could be a proxy for an object in the other language with methods invoked in R but executed on the original object. The Rcpp interface to C++, Eddelbuettel and François (2011), has a mechanism for extending C++ classes in this way. C++ classes can only be inferred from the source, meaning that either the programmer must supply the interface information (as in the current implementation) or some processing of the source must be applied (currently used to export functions from C++ but not classes). Java classes are accessible as objects, via “reflection” in Java terminology, so that in principle proxy classes in R should be possible. The rJavax package by Danenberg (2011) has an initial implementation. For Python, methods are available from the objects but properties are not formally defined. At the time of writing, basic interfaces to Python exist, for example, Grothendieck and Bellosta (2012), which could be extended to support class interfaces, with methods but not properties inferred from the Python class objects.

Further work on these and other inter-system interfaces would be a valuable contribution to the user community.

4. SUMMARY

R plays a major role in the communication and dissemination of new techniques for statistics and for results of statistical research more generally. In particular, the many packages written in R or using R as a base for interfacing to other software constitute an essential, rapidly growing resource. Therefore, the quality of such software and the ability of programmers to create and extend it are important.

The current R language and its supporting functionality are the result of many years of evolution, from early programming libraries through the S language to R, which itself has evolved and accumulated a variety of programming techniques. This evolution has been much influenced by the functional and object-oriented programming paradigms. New versions have continued to include supporting software and programming tools found useful at earlier stages along with improved capabilities.

The programming paradigms become especially relevant when the applications are complex or the quality of the resulting software is important. In particular, the versions of object-oriented programming in R can assist in dealing with complexity of the underlying data. As noted, R implements OOP in two forms, functional and encapsulated. These are complementary, with one or the other suitable for particular applications. The latter is essentially the form of OOP used in most other languages, but the former is distinctly different. Considerable confusion has arisen in discussions of OOP in R from not noting that distinction, which the present paper has tried to clarify.

More generally, understanding the role of object-oriented and functional programming in R may assist future contributing programmers in using related programming tools. The continuing rapid growth of R-based software and the expanding, challenging range of techniques it has to support make effective programming an important goal for the statistical community.

The importance of object-oriented programming is likely to increase as statistical software takes on new and challenging applications. In particular, the need to deal with increasingly large objects and distributed sources of data will bring in specialized classes of data and will need powerful computing tools. One important direction has been to transform selected software in R, particularly to speed up large-scale computations; see, for example, the companion paper Temple Lang (2014). Complementary to this is to interface to other languages and software when these provide better performance on “big data” and other computationally demanding applications. In particular, interfaces that match with object-oriented treatments for specialized forms of data can exploit the OOP facilities in R. The interface to C++, Eddelbuettel and François (2011), is an example. Further development of such interfaces will be of much benefit.

Functional programming is perhaps not such an obviously hot topic at the moment. However, the underlying philosophy that our software should be in the form of reliable, defensible units is very much part of R. Situations where the validity of statistical computations needs to be defended are likely to increase, given the growing need for statistical treatment of complex problems for science and society.

ACKNOWLEDGMENTS

Thanks to the Associate Editor and the referees for some helpful comments on presentation and content. Thanks especially to Vincent Carey for organizing and encouraging the set of talks and papers of which this is part.

REFERENCES

Becker, R. A. and Chambers, J. M. (1977). Gr-z: A system of graphical subroutines for data analysis. In Proc. Interface Symp. on Statistics and Computing 10 409–415.
Becker, R. A. and Chambers, J. M. (1984). S: An Interactive Environment for Data Analysis and Graphics. Wadsworth, Belmont, CA.
Becker, R. A. and Chambers, J. M. (1985). Extending the S System. Wadsworth, Belmont, CA.
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988). The New S Language. Chapman & Hall, Boca Raton, FL.
Chambers, J. M. (1977). Computational Methods for Data Analysis. Wiley, New York. MR0659716
Chambers, J. M. (1987). Interface for a quantitative programming environment. In Comp. Sci. and Stat., Proc. 19th Symp. on the Interface 280–286.
Chambers, J. M. (1998). Programming with Data: A Guide to the S Language. Springer, New York.
Chambers, J. M. (2008). Software for Data Analysis: Programming with R. Springer, New York.
Chambers, J. M. and Hastie, T., eds. (1992). Statistical Models in S. Chapman & Hall, Boca Raton, FL.
Chambers, J. M., Mallows, C. L. and Stuck, B. W. (1976). A method for simulating stable random variables. J. Amer. Statist. Assoc. 71 340–344. MR0415982
Danenberg, P. (2011). rJavax: rJava extensions. R package version 0.3. Available at http://CRAN.R-project.org/package=rJavax.
Eddelbuettel, D. and François, R. (2011). Rcpp: Seamless R and C++ integration. Journal of Statistical Software 40 1–18.
Gentleman, R. C., Carey, V. J., Bates, D. M. et al. (2004). Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology 5 R80.
Grothendieck, G. and Bellosta, C. J. G. (2012). rJython: R interface to Python via Jython. R package version 0.0-4. Available at http://CRAN.R-project.org/package=rJython.
Ihaka, R. and Gentleman, R. (1996). R: A language for data analysis and graphics. J. Comput. Graph. Statist. 5 299–314.
Odersky, M., Spoon, L. and Venners, B. (2010). Programming in Scala, 2nd ed. Artima, Walnut Creek, CA.
Python (2013). The Python Tutorial. Python. Available at http://docs.python.org/tutorial.
R Core Team (2013). R Language Definition. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-13-5. Available at http://cran.r-project.org/doc/manuals/R-lang.html/.
Shalit, A. (1996). The Dylan Reference Manual. Addison-Wesley, Reading, MA.
Temple Lang, D. (2014). Enhancing R with advanced compilation tools and methods. Statist. Sci. 29 181–200.
Tierney, L. (1990). LISP-STAT: An Object-Oriented Environment for Statistical Computing and Dynamic Graphics. Wiley, New York.