Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004

Introduction

Namespaces, Source Code Analysis, • is a powerful, high level language. and Byte Code Compilation • As R is used for larger programs there is a need for tools to

Luke Tierney – help code more reliable and robust Department of Statistics & Actuarial Science – help improve performance University of Iowa • This talk outlines three approaches: – name space management – code analysis tools – byte code compilation

1

Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004 Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004

Why Name Spaces Static Binding of Globals

Two issues: • R functions usually use other functions and variables: • static binding of globals f <- function(z) 1/sqrt(2 * pi)* exp(- z^2 / 2) • hiding internal functions • Intent: exp, sqrt, pi from base. Common solution: name space management tools. • Dynamic global environment: definitions in base can be masked.

2 3 Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004 Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004

Hiding Internal Functions Name Spaces for Packages Starting with 1.7.0 a package can have Some useful programming guidelines: a name space: • more complex functions from simpler ones. • Only explicitly exported variables are • Create and (re)use functional building blocks. visible when attached or imported. • A function too large to fit in an editor window may be too • Variables needed from other packages complex. can be imported. Problem: All package variables are globally visible • Imported packages are loaded; may • Lots of little functions means clutter for user. not be attached. • Lots of functions means name conflicts more likely. • Consequence: often use big functions with repeated code.

4 5

Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004 Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004

Name Spaces for Packages (cont.) NAMESPACE Directives Adding a name space to a package involves: • export • Adding a NAMESPACE file export(as.stepfun, ecdf, is.stepfun, stepfun) • Replacing require calls by import directives. • exportPattern • Replacing .First.lib by .onLoad (and maybe exportPattern("\\.test$") ). .onAttach • import import(mva) • importFrom importFrom(stepfun, as.stepfun)

6 7 Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004 Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004

NAMESPACE File Directives (cont.) NAMESPACE File Directives (cont.)

• useDynLib • exportClass, exportClasses useDynLib(stats) exportClasses(mle, profile.mle, summary.mle) • S3method • exportMethods S3method(print, dendrogram) exportMethods(AIC, BIC, coef, confint, logLik, ...) • importClassFrom • importMethods

8 9

Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004 Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004

Name Spaces and Method Dispatch Name Spaces and Method Dispatch (cont.)

• S3 dispatch is based on combining generic and class name. • One solution: register S3 methods with the generic. – no hope of private classes – methods are always available to the generic. – methods need not be exported • Looked up in environment where generic is called. ∗ enforces calling methods only via generic. • Problem: if a package is imported but not attached its ∗ simplifies author/maintainer’s task methods may not be visible at the call site. • Name space integration is conceptually simpler for S4 classes, methods, and generic functions.

• The current implementation is evolving and may become simpler.

10 11 Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004 Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004

Name Space Odds and Ends Source Code Analysis

• Name spaces are sealed. • R provides a powerful infrastructure for managing test code. – cannot add internal variables, imports, exports • Test code alone cannot cover all possible paths. – cannot change values by assignment • Source code analysis provides a complementary approach. – simplifies implementation • Source code analysis examines code for possibly erroneous – helps with byte code compilation constructs: • Exports can be accessed by “fully qualified name”, e.g. – using variables that are not defined stats::ppr. – calling functions with the wrong number of arguments • Internal variables can be accessed using a triple colon, e.g. – calling functions with incorrect argument types stats:::vcov.coxph • Name spaces help to make checks more precise.

12 13

Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004 Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004

Some Source Code Analysis Tools An Artificial Example

g<-function(x, y = TRUE) { • The package codetools provides some experimental tools: exp <- y w <- x – checkUsage checks individual functions. y <- x – checkUsagePackage checks a loaded package. if (exp) exp(x+3) + ext(z-3) else log(x, bace=2) • It is likely that these will eventually be merged into the } tools package. Results of a code analysis: > checkUsage(g,name="g") g: no visible global function definition for ’ext’ g: no visible binding for global variable ’z’ g: possible error in log(x, bace = 2): unused argument(s) (bace ...) g: local variable ’w’ assigned but may not be used

14 15 Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004 Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004

An Artificial Example (cont.) Issues and Tradeoffs

A More Sensitive Analysis: • Finding the right sensitivity is challenging > checkUsage(g,name="g",all=TRUE) g: local variable ’exp’ may shadow global value – high sensitivity causes too many false positives g: no visible global function definition for ’ext’ – low sensitivity misses too many real errors g: no visible binding for global variable ’z’ g: possible error in log(x, bace = 2): unused argument(s) • Current approach allows tuning by various parameters (bace ...) g: local variable ’exp’ used as function with no apparent local function definition • Another option is to attempt to prioritize messages g: local variable ’w’ assigned but may not be used g: parameter ’y’ changed by assignment – has had some successes on kernel code

16 17

Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004 Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004

Issues and Tradeoffs (cont.) Profiling • Rprof takes snapshots of call • More sophisticated checks are needed stack • Are user-supplied checks possible? • summaryRprof reports • Inferred or estimated type information may help cumulative time in each function. • Intra-procedural analysis may also help • Tools to show more detail may • Partial evaluation may be useful help. • May also be able to detect possible inefficiencies • One example: call graph color • Source annotation mechanisms may help coded by total time in function. • Package proftools contains some first steps. 18 19 Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004 Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004

Byte Code Compilation A Simple Example

• Compilation can improve efficiency: • Simplified normal density function: – user code will run faster f<-function(x, mu=0,sigma=1) – less native system code needed (1/sqrt(2 * pi)) * exp(-0.5 * ((x - mu)/sigma)^2) / sigma • Developing a can clarify the language: • Compiled with – features that are hard to compile are hard to understand fc<-cmpfun(f)

• Code analysis is closely related to compilation – code analysis for compilation is more conservative – code analysis for correctness is more speculative

20 21

Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004 Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004

Generated Code Generated Code (cont.)

• Byte code and assembly code for a stack machine: • The compiler √ √ 16 1 LDCONST 0.398942 push 1/ 2π – folds constant expressions like 1/ 2π 16 2 LDCONST -0.5 push constant −0.5 20 3 GETVAR x get, push x – inlines basic arithmetic functions 20 4 GETVAR mu get, push mu 45 SUB subtract • Some timings for 1,000,000 repetitions: 20 5 GETVAR sigma get, push sigma 47 DIV divide 16 6 LDCONST 2 push 2 Function x = 1 x = seq(0,3,len=5) 48 EXPT pop x, y, push xy 46 MUL multiply f 14.62 17.75 50 EXP pop x, push ex 46 MUL multiply fc 3.95 7.67 20 5 GETVAR sigma get, push sigma 47 DIV divide dnorm 4.59 7.24 1 RETURN pop, return value • Most improvement comes from constant folding.

22 23 Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004 Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004

The Compiler Operation

• byte code instruction set • Optimizations: • stack architecture – constant folding – special opcodes for most SPECIALs, many BUILTINs • similar approach to Python, , many Scheme systems – inlines simple .Internal calls: dnorm(y, 2, 3) is • also related to JVM, .NET replaced by .Internal(dnorm(y, mean = 2, sd = 3, log = FALSE)) • Alternatives: – special opcodes for many .Internals – (using GCC extensions) • Compiler currently uses a single recursive pass: – generate code – generate JVM, .NET code – fast, but limits optimizations.

24 25

Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004 Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004

Timings and Performance Future Directions

• well-vectorized code will not improve much • Partial evaluation when some arguments are constants • some examples see substantial speedup: • Intra-procedural optimizations and inlining Context Speedup • Run-time specialization Marching Cubes 2 • Vectorized opcodes MCMC 2.5 • Declarations (sealing, scalars, types, strictness) Dynamic Programming 2.4 • Advice on possible inefficiencies • internal version of lapply almost not needed anymore • C code generation (maybe C−−) • need to improve variable lookup • Compilation technology? (Lisp/ML, Haskell, Self, NESL) • need to improve function calling

26 27 Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004 Namespaces, Source Code Analysis, and Byte Code Compilation UseR! 2004

Conclusions Obtaining the Code

• As R is used for more high level projects, the need for • Code is still experimental programming support tools increases. • Once it is more stable it will be merged into R • The highly dynamic nature of R makes creating these tools • For now, you can obtain the code at challenging. • New language features such as annotations or declarations http://www.stat.uiowa.edu/~luke/R/ may help. • Results so far are quite promising. • Much more work remains to be done.

28 29