Creating and Deploying an Application with (R)Excel and R
Total Page:16
File Type:pdf, Size:1020Kb
CONTRIBUTED RESEARCH ARTICLES 5 Creating and Deploying an Application with (R)Excel and R Thomas Baier, Erich Neuwirth and Michele De Meo Prediction of a quantitative or categorical variable, is done through a tree structure, which even non- Abstract We present some ways of using R in professionals can read and understand easily. The Excel and build an example application using the application of a computationally complex algorithm package rpart. Starting with simple interactive thus results in an intuitive and easy to use tool. Pre- use of rpart in Excel, we eventually package the diction of a categorical variable is performed by a code into an Excel-based application, hiding all classification tree, while the term regression tree is used details (including R itself) from the end user. In for the estimation of a quantitative variable. the end, our application implements a service- Our application will be built for Microsoft Excel oriented architecture (SOA) with a clean separa- and will make use of R and rpart to implement the tion of presentation and computation layer. functionality. We have chosen Excel as the primary tool for performing the analysis because of various advantages: Motivation • Excel has a familiar and easy-to-use user inter- Building an application for end users is a very chal- face. lenging goal. Building a statistics application nor- • Excel is already installed on most of the work- mally involves three different roles: application de- stations in the industries we mentioned. veloper, statistician, and user. Often, statisticians are programmers too, but are only (or mostly) fa- • In many cases, data collection has been per- miliar with statistical programming (languages) and formed using Excel, so using Excel for the anal- definitely are not experts in creating rich user inter- ysis seems to be the logical choice. faces/applications for (casual) users. • Excel provides many features to allow a For many—maybe even for most—applications of high-quality presentation of the results. Pre- statistics, Microsoft Excel is used as the primary user configured presentation options can easily interface. Users are familiar with performing simple adapted even by the casual user. computations using the spreadsheet and can easily format the result of the analyses for printing or inclu- • Output data (mostly graphical or tabular pre- sion in reports or presentations. Unfortunately, Excel sentation of the results) can easily be used in does not provide support for doing more complex further processing— e.g., embedded in Power- computations and analyses and also has documented Point slides or Word documents using OLE (a weaknesses for certain numerical calculations. subset of COM, as in Microsoft Corporation and Statisticians know a solution for this problem, and Digital Equipment Corporation(1995)). this solution is called R. R is a very powerful pro- gramming language for statistics with lots of meth- We are using R for the following reasons: ods from different statistical areas implemented in • One cannot rely on Microsoft Excel’s numerical various packages by thousands of contributors. But and statistical functions (they do not even give unfortunately, R’s user interface is not what everyday the same results when run in different versions users of statistics in business expect. of Excel). See McCullough and Wilson(2002) The following sections will show a very simple for more information. approach allowing a statistician to develop an easy- to-use and maintainable end-user application. Our • We are re-using an already existing, tested and example will make use of R and the package rpart proven package for doing the statistics. and the resulting application will completely hide the • Statistical programmers often use R for perform- complexity and the R user interface from the user. ing statistical analysis and implementing func- rpart implements recursive partitioning and re- tions. gression trees. These methods have become powerful tools for analyzing complex data structures and have Our goal for creating an RExcel-based applica- been employed in the most varied fields: CRM, fi- tion is to enable any user to be able to perform the nancial risk management, insurance, pharmaceuticals computations and use the results without any special and so on (for example, see: Altman(2002), Hastie knowledge of R (or even of RExcel). See Figure1 for et al.(2009), Zhang and Singer(1999)). an example of the application’s user interface. The The main reason for the wide distribution of tree- results of running the application are shown in Figure based methods is their simplicity and intuitiveness. 2. The R Journal Vol. 3/2, December 2011 ISSN 2073-4859 6 CONTRIBUTED RESEARCH ARTICLES Figure 1: Basic user interface of the RExcel based end-user application. Figure 2: Computation results. There are a few alternatives to RExcel for connect- scripting engine combined with a highly productive ing Excel with R: visual development environment. This environment is called Visual Basic for Applications, or VBA for XLLoop provides an Excel Add-In which is able to short (see Microsoft Corporation(2001)). VBA allows access various programming languages from creation of “worksheet-functions” (also called user within formulae. It supports languages like R, defined functions or UDFs), which can be used in Javascript and others. The “backend” languages a similar way as Excel’s built-in functions (like, e.g., are accessed using TCP/IP communication. In “SUM” or “MEAN”) to dynamically compute values to be contrast to this, RExcel uses COM, which has a shown in spreadsheet cells, and Subs which allow ar- very low latency (and hence is very fast). Ad- bitrary computations, putting results into spreadsheet ditionally, the architecture of RExcel supports ranges. VBA code can be bundled as an “Add-In” for a completely invisible R server process (which Microsoft Excel which provides an extension both in is what we need for deploying the application), the user interface (e.g., new menus or dialogs) and or can provide access to the R window while in functionality (additional sheet-functions similar to developing/testing. the built-in sheet functions). inference for R provides similar functionality as Extensions written in VBA can access third-party RExcel. From the website, the latest supported components compatible with Microsoft’s Component R version is 2.9.1, while RExcel supports current Object Model (COM, see Microsoft Corporation and R versions immediately after release (typically Digital Equipment Corporation(1995)). The link be- new versions work right out of the box without tween R and Excel is built on top of two components: an update in RExcel or statconnDCOM). RExcel is an Add-In for Microsoft Excel implement- ing a spreadsheet-style interface to R and statconn- DCOM. statconnDCOM exposes a COM component Integrating R and Microsoft Excel to any Windows application which encapsulates R’s functionality in an easy-to-use way. Work on integration of Excel and R is ongoing since statconnDCOM is built on an extensible design 1998. Since 2001, a stable environment is available with exchangeable front-end and back-end parts. In for building Excel documents based upon statistical the context of this article, the back-end is R. A back- methods implemented in R. This environment con- end implementation for Scilab (INRIA(1989)) is also sists of plain R (for Windows) and a toolbox called available. The front-end component is the COM in- statconnDCOM with its full integration with Excel terface implemented by statconnDCOM. This imple- using RExcel (which is implemented as an add-in for mentation is used to integrate R or Scilab into Win- Microsoft Excel). dows applications. A first version of an Uno (OpenOf- Like all Microsoft Office applications and some fice.org(2009)) front-end has already been released third-party applications, Excel provides a powerful for testing and download. Using this front-end R and The R Journal Vol. 3/2, December 2011 ISSN 2073-4859 CONTRIBUTED RESEARCH ARTICLES 7 Scilab can easily be integrated into OpenOffice.org In the example below, we have a dataset on the applications, like Calc (see OpenOffice.org(2006)). credit approval process for 690 subjects. For each ROOo is available via Drexel(2009) and already sup- record, we have 15 input variables (qualitative and ports Windows, Linux and MacOS X. quantitative) while variable 16 indicates the outcome RExcel supports various user interaction modes: of the credit application: positive (+) or negative ( ). Based on this training dataset, we develop our classi-− Scratchpad and data transfer mode: Menus control fication tree which is used to show the influence of 15 data transfer from R to Excel and back; com- variables on the loan application and to distinguish mands can be executed immediately, either from “good” applicants from “risky” applicants (so as to Excel cells or from R command line estimate variable 16). The risk manager will use this Macro mode: Macros, invisible to the user, control model fully integrated into Excel for further credit data transfer and R command execution approval process. The goal of the code snippet below is to estimate Spreadsheet mode: Formulas in Excel cells control the model and display the chart with the classification data transfer and command execution, auto- tree. The risk manager can easily view the binary tree matic recalculation is controlled by Excel and use this chart to highlight the splits. This will help discover the most significant variables for the credit Throughout the rest of this article, we will describe approval process. For example, in this case it is clear how to use scratchpad mode and macro mode for proto- that the predictor variables V9 determines the best typing and implementing the application. binary partition in terms of minimizing the “impurity For more information on RExcel and statconn- measure”. In addition, the risk manager will notice DCOM see Baier and Neuwirth(2007).