<<

CONTRIBUTED RESEARCH ARTICLES 5

Creating and Deploying an Application with ()Excel and R

Thomas Baier, Erich Neuwirth and Michele De Meo Prediction of a quantitative or categorical variable, is done through a tree structure, which even non- Abstract We present some ways of using R in professionals can read and understand easily. The Excel and build an example application using the application of a computationally complex algorithm package rpart. Starting with simple interactive thus results in an intuitive and easy to use tool. Pre- use of rpart in Excel, we eventually package the diction of a categorical variable is performed by a code into an Excel-based application, hiding all classification tree, while the term regression tree is used details (including R itself) from the end user. In for the estimation of a quantitative variable. the end, our application implements a service- Our application will be built for Excel oriented architecture (SOA) with a clean separa- and will make use of R and rpart to implement the tion of presentation and computation layer. functionality. We have chosen Excel as the primary tool for performing the analysis because of various advantages: Motivation • Excel has a familiar and easy-to-use user inter- Building an application for end users is a very chal- face. lenging goal. Building a application nor- • Excel is already installed on most of the work- mally involves three different roles: application de- stations in the industries we mentioned. veloper, statistician, and user. Often, statisticians are too, but are only (or mostly) fa- • In many cases, data collection has been per- miliar with statistical programming (languages) and formed using Excel, so using Excel for the anal- definitely are not experts in creating rich user inter- ysis seems to be the logical choice. faces/applications for (casual) users. • Excel provides many features to allow a For many—maybe even for most—applications of high-quality presentation of the results. Pre- statistics, is used as the primary user configured presentation options can easily interface. Users are familiar with performing simple adapted even by the casual user. computations using the and can easily format the result of the analyses for printing or inclu- • Output data (mostly graphical or tabular pre- sion in reports or presentations. Unfortunately, Excel sentation of the results) can easily be used in does not provide support for doing more complex further processing— e.g., embedded in Power- computations and analyses and also has documented Point slides or Word documents using OLE (a weaknesses for certain numerical calculations. subset of COM, as in Microsoft Corporation and Statisticians know a solution for this problem, and Digital Equipment Corporation(1995)). this solution is called R. R is a very powerful pro- gramming language for statistics with lots of meth- We are using R for the following reasons: ods from different statistical areas implemented in • One cannot rely on Microsoft Excel’ numerical various packages by thousands of contributors. But and statistical functions (they do not even give unfortunately, R’s user interface is not what everyday the same results when run in different versions users of statistics in business expect. of Excel). See McCullough and Wilson(2002) The following sections will show a very simple for more information. approach allowing a statistician to develop an easy- to-use and maintainable end-user application. Our • We are re-using an already existing, tested and example will make use of R and the package rpart proven package for doing the statistics. and the resulting application will completely hide the • Statistical programmers often use R for perform- complexity and the R user interface from the user. ing statistical analysis and implementing func- rpart implements recursive partitioning and re- tions. gression trees. These methods have become powerful tools for analyzing complex data structures and have Our goal for creating an RExcel-based applica- been employed in the most varied fields: CRM, fi- tion is to enable any user to be able to perform the nancial risk management, insurance, pharmaceuticals computations and use the results without any special and so on (for example, see: Altman(2002), Hastie knowledge of R (or even of RExcel). See Figure1 for et al.(2009), Zhang and Singer(1999)). an example of the application’s user interface. The The main reason for the wide distribution of tree- results of running the application are shown in Figure based methods is their simplicity and intuitiveness. 2.

The R Journal Vol. 3/2, December 2011 ISSN 2073-4859 6 CONTRIBUTED RESEARCH ARTICLES

Figure 1: Basic user interface of the RExcel based end-user application.

Figure 2: Computation results.

There are a few alternatives to RExcel for connect- scripting engine combined with a highly productive ing Excel with R: visual development environment. This environment is called Visual Basic for Applications, or VBA for XLLoop provides an Excel Add-In which is able to short (see Microsoft Corporation(2001)). VBA allows access various programming languages from creation of “worksheet-functions” (also called user within formulae. It supports languages like R, defined functions or UDFs), which can be used in Javascript and others. The “backend” languages a similar way as Excel’s built-in functions (like, e.g., are accessed using TCP/IP communication. In “SUM” or “MEAN”) to dynamically compute values to be contrast to this, RExcel uses COM, which has a shown in spreadsheet cells, and Subs which allow ar- very low latency (and hence is very fast). Ad- bitrary computations, putting results into spreadsheet ditionally, the architecture of RExcel supports ranges. VBA code can be bundled as an “Add-In” for a completely invisible R server process (which Microsoft Excel which provides an extension both in is what we need for deploying the application), the user interface (e.g., new menus or dialogs) and or can provide access to the R window while in functionality (additional sheet-functions similar to developing/testing. the built-in sheet functions). inference for R provides similar functionality as Extensions written in VBA can access third-party RExcel. From the website, the latest supported components compatible with Microsoft’s Component R version is 2.9.1, while RExcel supports current Object Model (COM, see Microsoft Corporation and R versions immediately after release (typically Digital Equipment Corporation(1995)). The link be- new versions work right out of the box without tween R and Excel is built on top of two components: an update in RExcel or statconnDCOM). RExcel is an Add-In for Microsoft Excel implement- ing a spreadsheet-style interface to R and statconn- DCOM. statconnDCOM exposes a COM component Integrating R and Microsoft Excel to any Windows application which encapsulates R’s functionality in an easy-to-use way. Work on integration of Excel and R is ongoing since statconnDCOM is built on an extensible design 1998. Since 2001, a stable environment is available with exchangeable front-end and back-end parts. In for building Excel documents based upon statistical the context of this article, the back-end is R. A back- methods implemented in R. This environment con- end implementation for (INRIA(1989)) is also sists of plain R (for Windows) and a toolbox called available. The front-end component is the COM in- statconnDCOM with its full integration with Excel terface implemented by statconnDCOM. This imple- using RExcel (which is implemented as an add-in for mentation is used to integrate R or Scilab into Win- Microsoft Excel). dows applications. A first version of an Uno (OpenOf- Like all Microsoft Office applications and some fice.org(2009)) front-end has already been released third-party applications, Excel provides a powerful for testing and download. Using this front-end R and

The R Journal Vol. 3/2, December 2011 ISSN 2073-4859 CONTRIBUTED RESEARCH ARTICLES 7

Scilab can easily be integrated into OpenOffice.org In the example below, we have a dataset on the applications, like Calc (see OpenOffice.org(2006)). credit approval process for 690 subjects. For each ROOo is available via Drexel(2009) and already sup- record, we have 15 input variables (qualitative and ports Windows, Linux and MacOS X. quantitative) while variable 16 indicates the outcome RExcel supports various user interaction modes: of the credit application: positive (+) or negative ( ). Based on this training dataset, we develop our classi-− Scratchpad and data transfer mode: Menus control fication tree which is used to show the influence of 15 data transfer from R to Excel and back; com- variables on the loan application and to distinguish mands can be executed immediately, either from “good” applicants from “risky” applicants (so as to Excel cells or from R command line estimate variable 16). The risk manager will use this Macro mode: Macros, invisible to the user, control model fully integrated into Excel for further credit data transfer and R command execution approval process. The goal of the code snippet below is to estimate Spreadsheet mode: Formulas in Excel cells control the model and display the chart with the classification data transfer and command execution, auto- tree. The risk manager can easily view the binary tree matic recalculation is controlled by Excel and use this chart to highlight the splits. This will help discover the most significant variables for the credit Throughout the rest of this article, we will describe approval process. For example, in this case it is clear how to use scratchpad mode and macro mode for proto- that the predictor variables V9 determines the best typing and implementing the application. binary partition in terms of minimizing the “impurity For more information on RExcel and statconn- measure”. In addition, the risk manager will notice DCOM see Baier and Neuwirth(2007). that when V9 equals a, the only significant variable to observe is V4. The graphical representation of the Implementing the classification classification tree is shown in Figure3 on page8. and regression tree library(rpart) #!rputdataframe trainingdata \ With RExcel’s tool set we can start developing our ap- 'database'!A1:P691 plication in Excel immediately. RExcel has a “scratch- fit<-rpart(V16~.,data=trainingdata) pad” mode which allows you to write R code directly plot(fit,branch=0.1,uniform=T,margin=.1, \ into a worksheet and to run it from there. Scratchpad compress=T,nspace=0.1) mode is the user interaction mode we use when pro- text(fit,fancy=T,use.n=T) totyping the application. We will then transform the #!insertcurrentrplot 'database'!T10 prototypical code into a “real” application. In prac- .off() tice, the prototyping phase in scratchpad mode will be omitted for simple applications. In an Excel hosted Note: \ in the code denotes a line break for readabil- R application we also want to transfer data between ity and should not be used in the real spread- Excel and R, and transfer commands may be embed- sheet ded in R code. An extremely simplistic example of code in an R code scratchpad range in Excel might After estimating the model andvisualizing the re- look like this: sults, you can use the Classification Tree to make pre- dictions about the variable V16: the applicants will #!rput inval 'Sheet1'!A1 be classified as “good” or “risky.” By running the result<-sin(inval) following code, the 15 observed variables for two #!rget result 'Sheet1'!A2 people are used to estimate the probability of credit approval. Generally, a “positive” value (+) with prob- R code is run simply by selecting the range con- ability higher than 0.5 will indicate to grant the loan. taining code and choosing Run R Code from the pop- So the risk manager using this model will decide to up menu. grant the credit in both cases. Lines starting with #! are treated as special RExcel commands. rput will send the value (contents) of a library(rpart) cell or range to R, rget will read a value from R and #!rputdataframe trainingdata \ store it into a cell or range. 'database'!A1:P691 In the example, the value stored in cell A1 of sheet #!rputdataframe newcases \ Sheet1 will be stored in the R variable inval. After 'predict'!A1:O3 evaluating the R expression result<-sin(inval), the outdf<-as.data.frame(predict(fit,newcases)) value of the R variable result is stored in cell A2. predGroup <- ifelse(outdf[,1]>0.5, \ Table1 lists all special RExcel commands and pro- names(outdf[1]),names(outdf[2])) vides a short description. More information can be res<-cbind(predGroup,outdf) found in the documentation of RExcel. #!rgetdataframe res 'predict'!R1

The R Journal Vol. 3/2, December 2011 ISSN 2073-4859 8 CONTRIBUTED RESEARCH ARTICLES

Command Description #!rput variable range store the value (contents) of a range in an R variable #!rputdataframe variable range store the value of a range in an R data frame #!rputpivottable variable range store the value of a range in an R variable #!rget r-expression range store the value of the R expression in the range #!rgetdataframe r-expression range store the data frame value of the R expression in the range #!insertcurrentplot cell-address insert the active plot into the worksheet

Table 1: Special RExcel commands used in sheets

Figure 3: Graphical representation of the classification tree in the RExcel based application.

The R Journal Vol. 3/2, December 2011 ISSN 2073-4859 CONTRIBUTED RESEARCH ARTICLES 9

The potential of this approach is obvious: Very when invoking an R/ function. If RExcel calls a func- complex models, such as classification and regression tion in R and R throws an error, an Excel dialog box trees, are made available to decision makers (in this will pop up informing the user of the error. An alterna- case, the risk manager) directly in Microsoft Excel. tive to extensive checking is to use R’s try mechanism See Figure2 on page6 for a screen-shot of the presen- to catch errors in R and handle them appropriately. tation of the results. Using these tools, we now can define a macro per- forming the actions our scratchpad code did in the previous section. Since we can define auxiliary fun- Tidying up the spreadsheet cions easily, we can also now make our design more modular. In an application to be deployed to end users, R code The workhorse of our application is the following should not be visible to the users. Excel’s mechanism VBA macro: for performing operations on data is to run macros written in VBA (Visual Basic for Applications, the Sub PredictApp() embedded in Excel). RExcel Dim outRange As Range implements a few functions and which ActiveWorkbook.Worksheets("predict") _ can be used from VBA. Here is a minimalistic exam- .Activate ple: If Err.Number <> 0 Then MsgBox "This workbook does not " _ Sub RExcelDemo() & "contain data for rpartDemo" RInterface.StartRServer Exit Sub RInterface.GetRApply "sin", _ End If Range("A2"), Range("A1") ClearOutput RInterface.StopRServer RInterface.StartRServer End Sub RInterface.RunRCodeFromRange _ ThisWorkbook.Worksheets("RCode") _ GetRApply applies an R function to arguments .UsedRange taken from Excel ranges and puts the result in a range. These arguments can be scalars, vectors, matrices or RInterface.GetRApply _ dataframes. In the above example, the function sin is "function(" _ applied to the value from cell A1 and the result is put & "trainingdata,groupvarname," _ in cell A2. The R function given as the first argument & "newdata)predictResult(fitApp("_ to GetRApply does not necessarily have to be a named & "trainingdata,groupvarname)," _ function, any function expression can be used. & "newdata)", _ RExcel has several other functions that may ActiveWorkbook.Worksheets( _ be called in VBA macros. RRun runs any code "predict").Range("R1"), _ given as string. RPut and RGet transfer matri- AsSimpleDF(DownRightFrom( _ ces, and RPutDataframe and RGetDataframe transfer ThisWorkbook.Worksheets( _ dataframes. RunRCall similar to GetRApply calls an "database").Range("A1"))), _ R function but does not transfer any return value to "V16", _ Excel. A typical use of RunRCall is for calling plot AsSimpleDF(ActiveWorkbook _ functions in R. InsertCurrentRPlot embeds the cur- .Worksheets("predict").Range( _ rent R plot into Excel as an image embedded in a "A1").CurrentRegion) worksheet. In many cases, we need to define one or more RInterface.StopRServer R functions to be used with GetRApply or the other Set outRange = ActiveWorkbook _ VBA support functions and subroutines. RExcel has .Worksheets("predict").Range("R1") _ a mechanism for that. When RExcel connects to R (us- .CurrentRegion ing the command RInterface.StartRServer), it will check HighLight outRange whether the directory containing the active workbook End Sub also contains a file named RExcelStart.R. If it finds such a file, it will read its contents and evaluate them The function predictResult is defined in the with R (source command in R). After doing this, REx- worksheet RCode. cel will check if the active workbook contains a work- sheet named RCode. If such a worksheet exists, its contents also will be read and evaluated using R. Packaging the application Some notes on handling R errors in Excel: The R im- plementation should check for all errors which are So far, both our code (the R commands) and data have expected to occur in the application. The application been part of the same spreadsheet. This may be conve- itself is required to pass correctly typed arguments nient while developing the RExcel-based application

The R Journal Vol. 3/2, December 2011 ISSN 2073-4859 10 CONTRIBUTED RESEARCH ARTICLES or if you are only using the application for yourself, age and then immediately close the the connection to but it has to be changed for redistribution to a wider R. audience. The statconnDCOM server can reside on another We will show how to divide the application into machine than the one where the Excel application two parts, the first being the Excel part, the second is running. The server to be used by RExcel can be being the R part. With a clear interface between these, configured in a configuration dialog or even from it will be easy to update one of them without affecting within VBA macros. So with minimal changes, the the other. This will make testing and updating easier. application created can be turned into an application Separating the R implementation from the Excel (or which uses a remote server. A further advantage of RExcel) implementation will also allow an R expert this approach is that the R functions and data used to work on the R part and an Excel expert to work on by the application can be managed centrally on one the Excel part. server. If any changes or updates are necessary, the As an additional benefit, exchanging the Excel Excel workbooks installed on the end users’ machines front-end with a custom application (e.g., written in a do not need to be changed. programming language like #) will be easier, too, as the R implementation will not have to be changed (or Building a VBA add-in for Excel tested) in this case. The application’s R code interface is simple, yet The implementation using an still has some powerful. RExcel will only have to call a single R shortcomings. The end user Excel workbook contains function called approval. Everything else is hidden macros, and often IT security policies do to not allow from the user (including the training data). approval end user workbooks to contain any code takes a data frame with the data for the cases to be (macros). Choosing a slightly different approach, we decided upon as input and returns a data frame con- can put all the VBA macro code in an Excel add-in. In taining the group classification and the probabilities the end user workbook, we just place buttons which for all possible groups. Of course, the return value trigger the macros from the add-in. When opening a can also be shown in a figure. new spreadsheet to be used with this add-in, Excel’s template mechanism can be used to create the buttons Creating an R package on the new worksheet. The code from the workbook described in section “Tidying up the spreadsheet” can- The macro-based application built in section “Tidy- not be used “as is” since it is written under the as- ing up the spreadsheet” still contains the R code and sumption that both the training data and the data for the training data. We will now separate the imple- the new cases are contained in the same workbook. mentation of the methodology and the user interface The necessary changes, however, are minimal. Con- by putting all our R functions and the training data verting the workbook from section “Tidying up the into an R package. This R package will only expose a spreadsheet” again poses the problem that the data single function which gets data for the new cases as and the methodology are now deployed on the end input and returns the predicted group membership as users’ machines. Therefore updating implies replac- result. Using this approach, our application now has ing the add-in on all these machines. Combining the a clear architecture. The end user workbook contains add-in approach with the packaging approach from the following macro: section “Creating an R package” increases modular- ization. With this approach we have: Sub PredictApp() Dim outRange As Range • End user workbooks without any macro code. ClearOutput • Methodology and base data residing on a cen- RInterface.StartRServer trally maintained server. RInterface.RRun "library(RExcelrpart)" • Connection technology for end users installed RInterface.GetRApply "approval", _ for all users in one place, not separately for each Range("'predict'!R1"), _ user. AsSimpleDF(Range("predict!A1") _ .CurrentRegion) Deploying the Application RInterface.StopRServer Set outRange = Range("predict!R1") _ Using the application on a computer requires installa- .CurrentRegion tion and configuration of various components. HighLight outRange The required (major) components are: End Sub • Microsoft Excel, including RExcel and statconn- DCOM In this macro, we start a connection to R, load the R package and call the function provided by this pack- • R, including rscproxy

The R Journal Vol. 3/2, December 2011 ISSN 2073-4859 CONTRIBUTED RESEARCH ARTICLES 11

• The VBA Add-In and the R package created T. Hastie, R. Tibshirani, and J. Friedman. The Elements throughout this article of Statistical Learning: , Inference, and Prediction. Springer, 2009. For the simplest setup, all compontents are in- stalled locally. As an alternative, you can also install INRIA. Scilab. INRIA ENPC, 1989. URL http: Excel, RExcel and our newly built VBA Add-In on //www.scilab.org. every workstation locally and install everything else on a (centralized) server machine. In this case, R, . D. McCullough and B. Wilson. On the accuracy of rscproxy, our application’s R package and statconn- statistical procedures in Microsoft Excel 2000 and DCOM are installed on the server machine and one Excel XP. and , has to configure RExcel to use R via statconnDCOM 40:713–721, 2002. on a remote server machine. Please beware that this kind of setup can be a bit tricky, as it requires a cor- Microsoft Corporation. Microsoft Office 2000/Vi- rect DCOM security setup (using the Windows tool sual Basic ’s Guide. In MSDN Li- dcomcnfg). brary, volume Office 2000 Documentation. Mi- crosoft Corporation, October 2001. URL http: //msdn.microsoft.com/. Downloading Microsoft Corporation and Digital Equipment Corpo- All examples are available for download from the ration. The component object model specification. Download page on http://rcom.univie.ac.at. Technical Report 0.9, Microsoft Corporation, Octo- ber 1995. Draft.

Bibliography OpenOffice.org. OpenOffice, 2006. URL http://www. openoffice.org/. E. I. Altman. Bankruptcy, Credit Risk, and High Yield Junk Bonds. Blackwell Publishers Inc., 2002. OpenOffice.org. Uno, 2009. URL http://wiki. services.openoffice.org/wiki/Uno/. T. Baier. rcom: R COM Client Interface and internal COM Server, 2007. R package version 1.5-1. H. Zhang and B. Singer. Recursive Partitioning in the T. Baier and E. Neuwirth. R (D)COM Server V2.00, Health Sciences. Springer-Verlag, New York, 1999. 2005. URL http://cran.r-project.org/other/ ISBN 0-387-98671-5. DCOM.

T. Baier and E. Neuwirth. Excel :: COM :: R. Com- Thomas Baier putational Statistics, 22(1):91–108, April 2007. Department of Scientific Computing URL http://www.springerlink.com/content/ Universitity of Vienna uv6667814108258m/. 1010 Vienna Basel Committee on Banking Supervision. Interna- Austria [email protected] tional Convergence of Capital Measurement and Capital Standards. Technical report, Bank for In- ternational Settlements, June 2006. URL http: Erich Neuwirth //www.bis.org/publ/bcbs128.pdf. Department of Scientific Computing Universitity of Vienna L. Breiman, J. Friedman, R. Olshen, and C. Stone. 1010 Vienna Classification and Regression Trees. Wadsworth and Austria Brooks, Monterey, CA, 1984. [email protected] J. M. Chambers. Programming with Data. Springer, Michele De Meo New York, 1998. URL http://cm.bell-labs. Venere Net Spa com/cm/ms/departments/sia/Sbook/. ISBN 0-387- Via della Camilluccia, 693 98503-4. 00135 Roma R. Drexel. ROOo, 2009. URL http://rcom.univie.ac. Italy at/. [email protected]

The R Journal Vol. 3/2, December 2011 ISSN 2073-4859