Introduction to R in IBM SPSS Modeler:

A guide for SPSS Users

Wannes Rosius/Belgium/IBM

Goal of this guide

Although there are several very good articles and blogs related to IBM SPSS Modeler, in my role as technical professional for IBM Analytical solutions, I still see lots of people struggling with both R and the integration between IBM SPSS Modeler and R.

The idea of this document is certainly not to replace these very useful links listed below, but to enhance these in a way that people knowing IBM SPSS Modeler, with only a very limited knowledge of R, can use this integration.

After going through sections 2, 3 and 4, the reader should understand the R integration within SPSS at a high level and be able to (re)create some very basic R models within SPSS, even with only a basic knowledge of R.

Section 5 contains more detailed tips and tricks. This part is for the experienced user and can be read as a loose list of items that might help you get up to speed with some more detailed functionalities of the integration, and understand some pitfalls.

Throughout the document, we try to include R examples that the reader can easily copy into the appropriate R node in IBM SPSS Modeler. Unless specified otherwise, these code snippets are always based on the telco.sav dataset, which can be found in the demo folder of your SPSS Modeler installation. After the source node, attach a type node, and then the appropriate R node. Sometimes, however, only a fragment of code is given to show the idea; it will be clearly mentioned when the code is incomplete. You will find these code snippets in code frames throughout this document. Furthermore, all the SPSS streams and assets are embedded in the PDF, each marked by an icon. You can access them by right-clicking the icon within the PDF document.

Some useful links

• Essentials for R - Installation Instructions
• User Guide: IBM SPSS Modeler 18 R Nodes
• Modeler essentials for R Downloads
• SPSS Modeler and R integration - Getting started

Contents

1 System Setup
    1.1 Installing R
    1.2 Enabling the R nodes

2 R basics

3 The basics of R nodes in IBM SPSS Modeler
    3.1 The R nodes
    3.2 Simple R code example
        3.2.1 modelerData
        3.2.2 modelerDataModel
        3.2.3 modelerModel
    3.3 Some general remarks
    3.4 Read data options

4 Custom Dialog builder
    4.1 Tools
    4.2 Custom dialog
    4.3 Simple example

5 Tips & tricks: Some more detailed
    5.1 R code
        5.1.1 ibmspsscf70 library
        5.1.2 Some useful parts of R code
    5.2 Custom Dialog builder
        5.2.1 How to save and share a custom dialog?
        5.2.2 Link to dialog and script
    5.3 What about SQL Pushback? Hadoop pushback?
    5.4 What about real-time scoring? and Solution Publisher?
    5.5 Something more about the metadata in modeler and the consequences on R integration

Page 2 of 20 IBM SPSS Modeler and R

1 System Setup

Let us start with the setup of your system. For now, we assume that you have a valid installation of IBM SPSS Modeler on your machine. For more installation topics we refer to the Installation Instructions.

1.1 Installing R

Depending on the version of your IBM SPSS Modeler, you will now have to install different versions of R:

SPSS Version   R version   R download link
16.0           2.15.2      https://cran.r-project.org/bin/windows/base/old/2.15.2/
17.0           3.1         https://cran.r-project.org/bin/windows/base/old/3.1.0/
17.1           3.1         https://cran.r-project.org/bin/windows/base/old/3.1.0/
18.0           3.2         https://cran.r-project.org/bin/windows/base/old/3.2.0/

Once you have downloaded and installed R, you will have a working R instance on your machine/server. As with SPSS Modeler, you can have several versions of R installed on your machine without any problem.
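As a quick check after installation, the following sketch compares the R version that is currently running with the version listed in the table above. The mapping is copied from that table; the names requiredR, runningR and matchesModeler are our own and purely illustrative.

```r
# Mapping copied from the table above: Modeler version -> required R version
# (major.minor only).
requiredR <- c("16.0" = "2.15", "17.0" = "3.1", "17.1" = "3.1", "18.0" = "3.2")

# Major.minor part of the R version that is currently running, e.g. "3.2".
runningR <- paste(R.version$major,
                  sub("\\..*$", "", R.version$minor),
                  sep = ".")

# TRUE when the given R version is the one the given Modeler version expects.
matchesModeler <- function(modelerVersion, rVersion = runningR) {
  if (!modelerVersion %in% names(requiredR)) {
    stop("Unknown Modeler version: ", modelerVersion)
  }
  unname(requiredR[modelerVersion]) == rVersion
}

print(runningR)
print(matchesModeler("18.0"))
```

Run this in the R console of the installation you intend to point the Essentials installer at; a FALSE result suggests that a different R version from the table is needed.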

1.2 Enabling the R nodes

You will need to install the IBM SPSS Modeler essentials for R. You can find these here, on the SPSS Community Downloads page. Click 2. "Get Essentials for SPSS" and then click the button "Get R Essentials for SPSS Modeler". This will take you to GitHub, where you can select and download the Modeler 18 Essentials for R for a variety of platforms. If you require Essentials for R for earlier Modeler versions, there is also a link to legacy versions.

Run this executable file. The installation will ask you for the path of your R installation and the path to the bin files of your SPSS Modeler installation. (Note that the prefilled path is the default path to a Modeler Server, and you will need to change this if you want to configure your client.) This installation will place the R nodes in your SPSS Modeler node palette, and it will also add the necessary R libraries to your R installation folder.

2 R basics

There is already an overflow of R courses (publicly) available through several channels, so we would certainly not want to replace these. It is also not very important that you are an R expert to follow this document. However, there are still some basics of R code and R terminology that users need to understand in order to exploit the integration of R and IBM SPSS Modeler. For this section, let us open R in its original GUI. To do so, go to the R installation folder and open \bin\x64\RGUI.exe. A window will be opened looking like this:


This is the R console, ready for commands to run. You might often hear the term RStudio, which is nothing more than a development environment on top of this R GUI. Installation of RStudio is not required for this introduction, but might be handy for further use. We will start the R introduction by stating that R is a powerful language and environment for statistical computing and graphics. An important part of that last phrase is that R is a programming language, unlike IBM SPSS Modeler. That means it is built on objects that are defined by the user. As an example, assume the following R code (feel free to type it within the R console to see the R outputs):

    x <- 1 + 1
    y <- 2 * x
    xyVector <- c(x, y)
    z <- mean(xyVector)
    print(z)

Here x is an object. This statement will fill the object x with the value of the evaluated formula 1 + 1, being 2. So whenever the program refers to x, it will be interpreted as 2. In the second line we define y as twice the value of x. In the third line, we create a vector containing the content of x and y, in order to calculate the mean of these 2 objects and place it in an object z.

The operator "<-" could also be replaced by "=", but for various reasons, lots of R users prefer this way of writing (actually it is not exactly the same, but that could be ignored for the purpose of this document). If you feel more comfortable in using "=", please do so.
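For readers who are curious about the difference, here is a minimal sketch of the one situation where the two operators behave differently: inside a function call. The object names are arbitrary.

```r
x <- c(4, 8, 15)

# "=" inside a call only names an argument; the object x is left untouched.
m1 <- mean(x = c(1, 2, 3))
xAfterEquals <- x

# "<-" inside a call first assigns, so x is overwritten, and then passes the
# new value on to the function.
m2 <- mean(x <- c(10, 20, 30))

print(m1)            # 2
print(xAfterEquals)  # 4 8 15
print(m2)            # 20
print(x)             # 10 20 30
```

At the top level of a script, as in all the examples in this document, the two operators can be used interchangeably.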

Just as we filled x, y and z with some numbers, any R object can be filled with a variety of types. Here is a list of the most important for our purposes:

Vector is a sequence of data elements of the same type (eg. numeric or character). This includes vectors of length 1, which can be interpreted as just being numbers. You can create a vector with the R function c(). So in the example code above, all the values of x, y and z are vectors of length one. xyVector is a vector of length 2, containing the values of (the vector) x followed by (the vector) y. Trying to link it back to SPSS, you can interpret a vector as the values of a single data column.

Data frame is a list of vectors of equal length. If you look at a vector as the values of a variable, a data frame could be interpreted as a 2-dimensional dataset with columns (the number of vectors) and lines (the size of each vector).

    n <- c(2, 3, 5, 3, 9)                      # A first vector of 5 numeric values
    n2 <- c(1, 3, 2, 5, 4)                     # A second vector of 5 numeric values
    s <- c("aa", "bb", "cc", "aa", "zz")       # A third vector of 5 string values
    b <- c(TRUE, FALSE, TRUE, TRUE, TRUE)      # A fourth vector of 5 flag values
    Data <- data.frame(n, s, b, New = n + n2)  # A data frame containing 4 vectors
    # Note: n + n2 will be a new vector called "New" with the sum of n and n2: c(3, 6, 7, 8, 13)

    dim(Data)       # Will show you it is a 5x4 dataset.
    Data[2, 4]      # Will give back the value on the 2nd line, the 4th column
    colnames(Data)  # Will give the column names as a vector ("n", "s", "b", "New")
    Data$n[1]       # Will give back the first value of the vector n within the data frame.

    iris            # predefined data frame.

There are also several predefined data frames installed within R. One of them is called iris. Sometimes this document will refer back to iris.

Model class is actually a specific list containing predefined objects defining a statistical model. For example, a linear model class will be a list containing, among others, the coefficients of the regression model.

List is an ordered collection of objects. As an example, you can have a list where the first element is a vector, the second is a data frame, and the third is a model. Note that a data frame is a special type of a list, where all the elements are vectors of equal sizes.
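The object types described above can be tried out in the R console with the following short sketch. It only uses base R and the predefined iris data frame; the object names (mixed, fit, myList) are arbitrary.

```r
# A vector holds one type only: mixing numbers and text turns everything into text.
mixed <- c(1, "a", TRUE)
print(class(mixed))   # "character"

# A model class: a linear model fitted on the predefined iris data frame.
fit <- lm(Sepal.Length ~ Petal.Length, data = iris)
print(class(fit))     # "lm"
print(is.list(fit))   # TRUE: a model object is a specific kind of list
print(coef(fit))      # the coefficients stored inside the model object

# A list can mix a vector, a data frame and a model.
myList <- list(numbers = c(1, 2, 3), data = head(iris), model = fit)
print(names(myList))

# A data frame is itself a special list of equally long vectors.
print(is.list(iris))  # TRUE
```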

3 The basics of R nodes in IBM SPSS Modeler

3.1 The R nodes

Once the installation of the R essentials is done, you will see 3 new nodes in your node palettes. There is also a 4th R node, which is the R nugget. Understanding these 4 objects and the differences between them is essential!

Output: with this node, data will be sent to R, but it will never go back to SPSS (as it is a terminal node). The only thing that can go back to SPSS is the outputs generated by R, which will be presented within an SPSS output window.

Transform: data will go from SPSS to R, but will also go back to SPSS, after which the SPSS process can be continued.

Model: like the output node this is a terminal node, so data will not go back to SPSS. However, there will be a reusable R object created within a nugget.

Nugget: similar to the Transform node, with the difference that there is a reusable R object that can be used in the R code.

Node Name            R output node   R transform node   R model node   R syntax node
Palette tab          Output          Record Ops         Model          NA
Data back to SPSS    No              Yes                No             Yes
Reusable R object    No              No                 Create         Use

3.2 Simple R code example

Let us start by saying that all the examples in this section are intentionally kept very simple, so as to explain the interaction in a functional and structured way and to be simple enough for non-R programmers. We are certainly aware that most of the R code snippets we show in this chapter could also easily be implemented using standard SPSS Modeler functionality.

There are 3 very important and reserved R objects that you should keep in mind when you use the SPSS Modeler R integration. Here is a brief description of these 3, after which we will go into more detail for each of them:

modelerData This is an R data frame that will be filled by the data entering this R node. This data frame can be used and changed within your R code. Eventually, it will also be the data frame that will be sent back to SPSS Modeler as a dataset. Note that it will only contain the content of the data, not (necessarily) the data column names and other metadata items.

modelerDataModel This is also an R data frame, containing the metadata of the data that is sent to R and back to SPSS Modeler. It contains most of the information that you may expect within an SPSS Modeler type node. This will be the object that will be most strange for experienced R users.

modelerModel This is an R object that can be filled by the user with any type of object you want. It does not need to have a certain structure. It will be calculated in the R model node, after which it will be saved within the R nugget, where it can be used in the R syntax.

Note that R code is case-sensitive and therefore so are these object names. In the following sections, we will explain the usage of these objects.

3.2.1 modelerData

modelerData is the R data frame that will be filled by the dataset entering the R node from the upstream SPSS Modeler node. So you can use this data frame to perform the desired calculations, transformations and outputs in R.

Place the following code in an R output node.

    # Print the first 6 lines of the data
    head(modelerData)

    # Give a summary of the data
    summary(modelerData)

    # create a histogram of the variable tenure
    hist(modelerData$tenure, xlab = "months", main = "Tenure histogram")

    # change the tenure unit from months to years
    modelerData$tenure <- modelerData$tenure / 12

    # recreate the histogram, now in years
    hist(modelerData$tenure, xlab = "years", main = "Tenure histogram")

Execution of this node will result in an SPSS Modeler output window in which all the R outputs will be assembled. These will always be divided into 2 tabs: Text output and Graph output.


In this case the text output is linked to the code on lines 2 and 5 (the calls to head and summary). First it prints the first 6 lines (head) of the data, next it gives summary statistics for each column.

The graph output consists of two histograms: one for the tenure in months, the other for the same column after redefining it by dividing the original value by twelve to give the tenure in years (note the X-axis scale).

As shown in the example stream "Explain modelerData.str", you can also copy exactly this same code into a transform node and attach a table node to it. After running this table node you will not see any R output (as none is expected). That means that even though the output code has run, no outputs will be shown. However, the data frame modelerData will be sent back to SPSS Modeler. In this case you will see the value of tenure being divided by 12.

3.2.2 modelerDataModel

Metadata is very important in SPSS Modeler. Let us for simplicity say that within Modeler, metadata is represented by the type node. By metadata we mean the type of each of the variables in the dataset (numeric, flag, string, storage, ...). At all times, Modeler will know exactly all the metadata at every step in the stream.

R does not handle the metadata in a similar way as SPSS Modeler. We already explained modelerDataModel taking over the role of the "type node". This is done by a data frame (=dataset) of the following structure:

                X1                     X2                    ...   Xn
fieldName       region                 tenure                ...   age
fieldLabel      Geographic indicator   Months with service   ...   Age in years
fieldStorage    real                   real                  ...   real
fieldMeasure    nominal                continuous            ...   continuous
fieldFormat     standard               standard              ...   standard
fieldRole       input                  input                 ...   input

So this means that this dataset will always have 6 lines with fixed names (yes, in R the lines also have names). The thing with this dataset is that it is completely the responsibility of the user to align this metadata with the appropriate data. So if we would like to add a variable with R, the user must also manually add a column in modelerDataModel to make sure modelerData correctly goes back to SPSS Modeler. In the earlier example, we did not make any changes to modelerDataModel, and none were needed as the metadata did not change (dividing a number by 12 will not change the metadata). Now, let us continue with the previous example. But now, rather than changing the value of tenure in the same data variable, we will create another one. As a result, we will have to update the metadata.

    # Create the vector of tenure in years:
    Rcolumn <- modelerData$tenure / 12

    # Paste this vector to the right of the dataset:
    modelerData <- cbind(modelerData, Rcolumn)

    # create the metadata for the column to add
    newVar <- c(fieldName = "tenureYears", fieldLabel = "", fieldStorage = "real",
                fieldMeasure = "", fieldFormat = "", fieldRole = "")

    # paste the new column metadata to the existing metadata
    modelerDataModel <- cbind(modelerDataModel, newVar)

Running a table node downstream of this transform node, will show you the new variable with the name tenureYears. There are some important things to realize in this:


• fieldName and fieldStorage are the only 2 required rows that need to be filled in for any new column. In the code we left all the other lines empty, meaning they will be filled in by the stream default. For a list of available values, we refer to the user guide.

• As modelerDataModel is only useful when you go back to SPSS Modeler, you will generally only use/change this object in non-terminal R-nodes. It might still be handy to use it in terminal nodes, if the value of the modelerDataModel is important for your output. (eg. run a histogram of all continuous variables)

• When data goes back to SPSS Modeler, it will be the content of the data frame excluding the column and row names. That means that even though the column in modelerData is called Rcolumn, the name of the column in SPSS will only be defined by the metadata within the row fieldName. In this case it is called tenureYears.

• The only link between modelerData and modelerDataModel is the order of the columns. It will not look by name. The first column in the data will be given the metadata of the first column of modelerDataModel. In case the metadata (modelerDataModel) does not match modelerData, an error is thrown. The table below shows schematically how this works.

In R:

    modelerData                     modelerDataModel
    RName1   ...   RNamen                           X1      ...   Xn
    x1,1     ...   x1,n             fieldName       Name1   ...   Namen
    x2,1     ...   x2,n             fieldLabel      ...
    ...                             fieldStorage    xxx     ...   xxx
    xm-1,1   ...   xm-1,n           fieldMeasure    ...
    xm,1     ...   xm,n             fieldFormat     ...
                                    fieldRole       ...

In SPSS:

    Name1   ...   Namen
    x1,1    ...   x1,n
    x2,1    ...   x2,n
    ...
    xm,1    ...   xm,n

Note that only the names of the modelerDataModel are used.

Since this concept is very strange to standard R users, we found this part the most difficult to explain. To people who know SPSS you can summarize it as modelerDataModel taking over the role of the type node.
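The positional link between the two objects, and the earlier idea of drawing a histogram of all continuous variables in a terminal node, can be sketched as follows. The two data frames at the top are made-up stand-ins so that the sketch also runs outside Modeler; inside an R node, Modeler provides modelerData and modelerDataModel and those two assignments should be left out.

```r
# Stand-ins for the objects Modeler normally provides (values are made up).
modelerData <- data.frame(region = c(1, 2, 3, 2),
                          tenure = c(13, 11, 68, 33),
                          age    = c(44, 33, 52, 33))
modelerDataModel <- data.frame(
  X1 = c("region", "Geographic indicator", "real", "nominal", "standard", "input"),
  X2 = c("tenure", "Months with service", "real", "continuous", "standard", "input"),
  X3 = c("age", "Age in years", "real", "continuous", "standard", "input"),
  row.names = c("fieldName", "fieldLabel", "fieldStorage",
                "fieldMeasure", "fieldFormat", "fieldRole"),
  stringsAsFactors = FALSE)

# Data and metadata are matched by position, so the column counts must agree.
stopifnot(ncol(modelerData) == ncol(modelerDataModel))

# Positions of the columns whose measurement level is continuous.
measures <- sapply(modelerDataModel["fieldMeasure", ], as.character)
continuousIdx <- unname(which(measures == "continuous"))

# One histogram per continuous column, titled with the Modeler field name.
for (i in continuousIdx) {
  hist(modelerData[[i]],
       main = as.character(modelerDataModel["fieldName", i]),
       xlab = as.character(modelerDataModel["fieldLabel", i]))
}
```

Because the lookup is done by column position, the sketch keeps working when the R column names differ from the Modeler field names.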

You can find all streams and R scripts explaining modelerDataModel here:

3.2.3 modelerModel

modelerModel is the R object that is stored within the R nugget. This object will be populated within the R model node, after which you can use modelerModel within the R nugget for scoring. This works very much the same way as IBM SPSS Modeler itself: you ask a model node to calculate a formula, after which that formula is stored within the nugget, together with the way it should be used to calculate a score.

You will only use this object within the R model node and nugget. Note that, within the R model node, there are 2 syntax windows:


R model building is the syntax that calculates whatever you want to store within modelerModel, so that it can be used within your nugget calculations to score your data. This can be any object within R. As with any SPSS model builder node, this is terminal code: no data will go back to SPSS Modeler from it, apart from the outputs and whatever is stored within the modelerModel object.

R model scoring is the syntax that defines how you will use the object modelerModel, containing the content you stored in it in the R model building syntax, to derive the new data. Apart from the use of modelerModel, this is very similar to the R transform node.

Let us start with a simple example, where we would like to create a basic linear model for the variable tenure. The formula of this model should be saved in modelerModel, after which it can be used in the scoring.

    # Create the model and save it in modelerModel
    modelerModel <- lm(tenure ~ age + region + ed + income, data = modelerData)

    # Add some summary of the model in the nugget
    summary(modelerModel)

    # together with a histogram of the residuals
    hist(modelerModel$residuals, main = "residual histogram")

    # and the predicted vs actual scatterplot
    plot(modelerData$tenure, modelerModel$fitted.values, xlab = "actual", ylab = "predicted")

    # All of these outputs will be stored in the modeler nugget tabs

    # Use the model to make a prediction, and add it to the existing data.
    pred <- predict(modelerModel, modelerData)
    modelerData <- cbind(modelerData, pred)

    # Take care of the metadata!
    newVar <- c(fieldName = "$L-tenure", fieldLabel = "", fieldStorage = "real",
                fieldMeasure = "", fieldFormat = "", fieldRole = "")
    modelerDataModel <- cbind(modelerDataModel, newVar)

It is important to note that modelerModel can be filled with any type of object. It will very often be of a model class, but it does not have to be. In the previous example, the object stored in it was clearly a (statistical) model. In the next example we will just save 2 numbers within the modelerModel object. Imagine we want to calculate the z-values of a certain variable. In order to create the z-values, we need the mean and the standard deviation of the column. We will store both of these within modelerModel, after which we will use them in the scoring syntax (see footnote 1). This example shows that you do not need to store a "statistical" model within your modelerModel object; it really can be any R object.

    # calculate mean and standard deviation
    M <- mean(modelerData$tenure)
    SD <- sd(modelerData$tenure)

    # and save it in a list called modelerModel.
    modelerModel <- list(avg = M, sDev = SD)

    # calculate z scores using the elements in modelerModel
    zTenure <- (modelerData$tenure - modelerModel$avg) / modelerModel$sDev
    modelerData <- cbind(modelerData, zTenure)

    # define new metadata column and add it.
    newVar <- c(fieldName = "zTenure", fieldLabel = "", fieldStorage = "real",
                fieldMeasure = "", fieldFormat = "", fieldRole = "")
    modelerDataModel <- cbind(modelerDataModel, newVar)

Footnote 1: Note that there is a very good reason this is not combined into an R Transform node, explained in 3.4.


You can find all streams and R scripts explaining modelerModel here:
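The examples above repeat the same two steps every time a field is added: one cbind on the data and one cbind on the metadata. As a minimal sketch, these steps could be wrapped in a helper so that they cannot get out of sync. The function addField is our own hypothetical helper, not part of the IBM library, and the two data frames below are made-up stand-ins so that the sketch runs outside Modeler.

```r
# Hypothetical helper: adds a column to the data and the matching metadata
# column in one go. Only fieldName and fieldStorage are filled in; the other
# rows are left empty so that the stream defaults apply.
addField <- function(data, dataModel, values, name, storage = "real") {
  stopifnot(length(values) == nrow(data))
  newVar <- c(fieldName = name, fieldLabel = "", fieldStorage = storage,
              fieldMeasure = "", fieldFormat = "", fieldRole = "")
  list(data = cbind(data, values),
       dataModel = cbind(dataModel, newVar))
}

# Stand-ins for the objects Modeler normally provides (values are made up).
modelerData <- data.frame(tenure = c(12, 24, 60))
modelerDataModel <- data.frame(
  X1 = c("tenure", "Months with service", "real", "continuous", "standard", "input"),
  row.names = c("fieldName", "fieldLabel", "fieldStorage",
                "fieldMeasure", "fieldFormat", "fieldRole"),
  stringsAsFactors = FALSE)

# Add tenure in years to both objects.
updated <- addField(modelerData, modelerDataModel,
                    modelerData$tenure / 12, "tenureYears")
modelerData <- updated$data
modelerDataModel <- updated$dataModel

print(modelerData)
print(modelerDataModel)
```

In a transform node or nugget, the helper definition would go at the top of the script, followed by one addField call per new field.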

3.3 Some general remarks

• Although it might seem this way, you are not required to build modelerData from the existing data within that frame. modelerData will be filled with the dataset you have in Modeler; however, nothing stops you from throwing that data away in R and defining some new data coming from another data source in R. As an example, imagine this link from the Weather Company website. This will give the weather history in Brussels, Belgium, for the month of November 2015. Now we can use R code as a source node by just overwriting modelerData and redefining modelerDataModel.

    # Define the link
    linkPath <- "http://www.wunderground.com/history/airport/EBBR/2015/11/01/MonthlyHistory.html?req_city=Brussels&format=1"
    # Read the data as csv
    modelerData <- read.csv(linkPath)
    modelerData[, 1] <- as.Date(modelerData[, 1])

    # Redefining modelerDataModel: all are real numbers, except the first column, which is the date.
    modelerDataModel <- as.data.frame(t(data.frame(
      fieldName = colnames(modelerData),
      fieldLabel = "",
      fieldStorage = c("date", rep("real", ncol(modelerData) - 1)),
      fieldMeasure = "",
      fieldFormat = "",
      fieldRole = "")))

As you can see, this code does not use the old definition of the defined R objects, but completely redefines them. Placing this in an R transform node will give back this new dataset to Modeler. So in this way you can use this approach to create an R input node. You can find an example here.

• Within an R model node, there is place for 2 scripts. The scoring script is the script that will be populated within the R nugget. It will not be run when you run the model node. As a result, these 2 scripts are independent. The only thing they share is the value of the object modelerModel, which is saved within the nugget when running the building syntax, and picked up within the R scoring syntax. This also means that any R libraries that are required should be loaded in both scripts. Take for example a model for a random forest.

    # Load the library
    library("randomForest")
    # Create the model and save it in modelerModel
    modelerModel <- randomForest(tenure ~ age + region + ed + income, data = modelerData, ntree = 50)

    # Load the library
    library("randomForest")

    # Use the model to make a prediction, and add it to the existing data.
    pred <- predict(modelerModel, modelerData)
    modelerData <- cbind(modelerData, pred)

    # Take care of the metadata!
    newVar <- c(fieldName = "$RF-tenure", fieldLabel = "", fieldStorage = "real",
                fieldMeasure = "", fieldFormat = "", fieldRole = "")
    modelerDataModel <- cbind(modelerDataModel, newVar)

• Talking about libraries and packages: a package is a collection of R objects defined for a certain purpose. These often are specific statistical functionalities, like randomForest in the example above. A basic R installation comes with the standard packages; however, many more packages are made available by the R community on CRAN.


Packages need to be installed and made locally available in libraries. Once the package is installed on the system as a library, you can load this library in any R session with the code library(). To install a package, you have several options. The easiest is probably to run code like install.packages("randomForest") within R. You will have to select a CRAN mirror from which this library will be downloaded, and the download will go automatically. Normally you will only have to do this once. Although possible, it is not recommended to run this package installation command from within SPSS. The reason is that these libraries will then be saved in a temporary folder, and afterwards be deleted. If you still want to do this through SPSS, you will have to hard code the installation path.

3.4 Read data options

Something we ignored until now are the settings within the node under Read data options. The basics of the R integration with Modeler can be done without the knowledge of this, as it requires some more advanced R knowledge. The user guide still has a good explanation about these items.

However, there is one more thing that might be important. For Modeler version 17 and lower, the R integration of non-terminal nodes (i.e. transform nodes and nuggets) will by default be done in batches of 1000. The reason for this was to allow these R nodes to work on Hadoop and other clustered environments. As a result, it is very important to realize that any R code that spans multiple lines of data can lead to false results. For a workaround, we refer to 5.1.1.

Take as an example the z scores above. If we calculated the mean and the standard deviation of the variable in a non-terminal node, it would start by running this code for the first 1000 lines of data. That leads to a specific mean, deviation and corresponding z-scores. However, for the next 1000 lines, a new mean and deviation would be calculated, and the z scores would be based on these.

As a solution, the means and standard deviations are calculated in the R model (i.e. a terminal node) over all the data and used in the R nugget to calculate the z scores.
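The effect of batch processing on the z scores can be reproduced in a plain R session. The sketch below uses made-up tenure values and imitates batches of 1000 lines with a grouping vector; it does not involve Modeler itself, so it only illustrates the arithmetic.

```r
set.seed(1)

# Made-up tenure values, sorted so that consecutive batches clearly differ.
tenure <- sort(round(runif(2500, min = 1, max = 72)))

# Batch number of every line when the data arrives in batches of 1000.
batchSize <- 1000
batch <- ceiling(seq_along(tenure) / batchSize)

# z scores computed separately per batch, as a non-terminal node would do.
zPerBatch <- ave(tenure, batch, FUN = function(v) (v - mean(v)) / sd(v))

# z scores based on the mean and standard deviation of all the data,
# as in the model node plus nugget approach.
zGlobal <- (tenure - mean(tenure)) / sd(tenure)

# Every batch has its own mean, so the two results disagree.
print(tapply(tenure, batch, mean))
print(max(abs(zPerBatch - zGlobal)))
```

The per-batch means printed at the end differ from each other, which is exactly why the z scores of a line depend on the batch it happens to fall in.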

Note that this approach may lead to a very slow integration between SPSS and R in the case of streaming R nodes in a local, non-clustered environment. However, as from IBM SPSS Modeler version 17.1, there is a default option not to use this approach of batch processing or to increase the batch size. For the lower versions, there is a workaround possible if you still want to increase this batch size or turn it off (see section 5.1.1).

4 Custom Dialog builder

The Custom Dialog Builder allows you to create and manage R nodes with prefilled R code to use inside IBM SPSS Modeler streams. In this way, users can create their own nodes. You can start the Custom Dialog Builder in the Tools menu under "Custom Dialog Builder".

When opening the Custom Dialog Builder, you will see two windows. One of them is the custom dialog itself, the other is the toolset to populate the dialog.


4.1 Tools

The tools window is a list of items you can place within your dialog. This includes, among others, the field chooser, check and combo boxes, text and number controls, and tabs. You can select any of them and drag it onto the dialog itself. Once you have an item in the dialog, you can select it, and you will see the item properties. These are the properties of this specific item, and they will change depending on the type of item. The most important are the identifier (the way the item will be referenced within the script) and the title (the one that will be visible in the dialog).

4.2 Custom dialog

The big gray window is the dialog itself. For the moment it is empty, as it should be populated with items from the Tools list. Clicking on this gray dialog will show you the dialog properties below. The main items are the name and title of the dialog, the script itself, and the type and position of the created node. With regard to the script, the general rule is that you reference the items within the dialog by their identifier between double percent signs (for example %%TARGET%% for an item with the identifier TARGET). Once you have finished creating the custom node, you can install it by pressing the green arrow in the toolbar. You can also save intermediate versions to disk.

4.3 Simple example

Let us create a custom dialog for the randomForest model created earlier in section 3.3. Below you will find a step-by-step approach. The most important question is which parts of this code we want to make flexible for the user. In the case of this model, there might be 3 things that we want flexible: the input variables, the target and the number of trees in our forest.

First fill in the Custom dialog properties as indicated


In our fixed example, the target is tenure, but a user might choose any other field. As a result we will place a field chooser on the dialog. Change the properties as shown. The variable filter property allows you to select only categorical variables.

In our fixed example, the inputs are age, income, ..., but a user might choose any other field. As a result we will place a field chooser on the dialog. The biggest difference is that now we can select several variables, as there might be different inputs. To make it easier we will separate these values by a +. Therefore change the properties like shown.

As a third custom choice we would like to add the number of trees in our forest. In the original script it was 50, so we will choose this as the default. However, users may choose any integer value between 1 and 1000. Add a number control to the dialog and change the properties.

So now the dialog is ready, and we need to add the script to it. Go to the Edit options, and choose "Script Template". This will bring you to an empty window for the script. In this case (as we selected that we wanted 2 scripts), there is a tab for the building script and one for the scoring script. If the scoring script is greyed out, you did not set the dialog property "Score from the Model" to True.


Let us start with the scoring script, as this is easier. The only thing which needs to be adapted for custom input is the variable name that will be sent back to SPSS. So copy the scoring code, and change the name tenure to %%TARGET%% (this is the identifier of the target).

Fill in the code for the building script, and change the values for the target and input variables, together with the number of trees, in a similar way as above. Afterwards, press OK to close the script window.
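As a sketch of what the templated building script could look like, and of what it turns into once the user has made a choice, consider the code below. TARGET is the identifier mentioned above; INPUTS and NTREE are hypothetical identifiers that you would replace by the ones you gave to your own field chooser and number control. The substitution itself is done by Modeler; the gsub loop only imitates it so that the result can be inspected in a plain R session.

```r
# Building script as it might be typed into the Script Template window.
# INPUTS and NTREE are hypothetical identifiers.
template <- "modelerModel <- randomForest(%%TARGET%% ~ %%INPUTS%%, data = modelerData, ntree = %%NTREE%%)"

# Values a user could pick in the dialog (several inputs are separated by "+").
choices <- c(TARGET = "tenure", INPUTS = "age+region+ed+income", NTREE = "50")

# Imitate the substitution: replace every %%identifier%% by the chosen value.
script <- template
for (id in names(choices)) {
  script <- gsub(paste0("%%", id, "%%"), choices[[id]], script, fixed = TRUE)
}

cat(script, "\n")
```

With these choices the resulting line is the same randomForest call as in the fixed example of section 3.3, apart from the spacing around the plus signs.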

Back in the Custom Dialog Builder, save the dialog in any appropriate location. Also deploy the dialog by clicking on the green deploy arrow in the toolbar. Close the Dialog Builder.

Back to the stream, you will now see the new node in the model palette. You can use this node within your stream.

You can find the resulting cfe file here: (place this in the correct location, see 5.2.1) and a stream where it is deployed:

5 Tips & tricks: Some more detailed

5.1 R code

5.1.1 ibmspsscf70 library

Let us now take a more detailed look at what actually happens with the code. First of all, it is worth checking what happens when you do the R installation correctly. This will install the R package ibmspsscf70, delivered by IBM, in the library folder of your R installation. This library contains several functions to handle the data traffic between SPSS and R.

Running any R node in SPSS will not only run the code you write, but will also run some extra code behind the scenes. You can see this code in the "Console output" window of the R node. Looking for example at this tab for an R nugget, you will see that your code is wrapped in something like:

    modelerModel <- ibmspsscfoutput.GetModel()
    while(ibmspsscfdata.HasMoreData()){
      modelerDataModel <- ibmspsscfdatamodel.GetDataModel()
      modelerData <- ibmspsscfdata.GetData(rowCount=1000, missing=NA, rDate="None", logicalFields=FALSE)

      # Your R code

      ibmspsscfdatamodel.SetDataModel(modelerDataModel)
      ibmspsscfdata.SetData(modelerData)
    }

All the functions starting with ibmspsscf... are functions within this library. This part of the code is responsible for the transfer of the data and the metadata to R. You will also see a while loop, indicating that the data goes to R in batches of 1000 rows. The other arguments are the values of the data read options within the node. The last lines of the code prepare the data to be sent back to SPSS. Also note the closing brace, which is the end of the while loop.

Now, since this library is loaded with every interaction between SPSS and R, you are free to use these functions within your own code as well. So if you would like to avoid these runs in batches of 1000 rows and you do not have version 17.1 available, you can start your R code with another loop that first continues filling modelerData, and only then start your actual code. In practice that means that in your R node you start with:

    while(ibmspsscfdata.HasMoreData())
    {
      modelerData <- rbind(modelerData, ibmspsscfdata.GetData(rowCount=100000, missing=NA, rDate="None", logicalFields=FALSE))
    }

    # Some more code using modelerData, which is now the complete dataset
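The accumulation pattern above can be tried outside Modeler with a small simulation. In the sketch below, hasMoreData and getData are hypothetical stand-ins for ibmspsscfdata.HasMoreData() and ibmspsscfdata.GetData(); they are not part of the IBM library and only mimic batch-wise delivery, here from the built-in iris data.

```r
# Hypothetical stand-ins for the IBM functions (assumption: illustration only).
source_data <- iris
position <- 0
batch_size <- 40

hasMoreData <- function() position < nrow(source_data)

getData <- function(rowCount) {
  first <- position + 1
  last <- min(position + rowCount, nrow(source_data))
  position <<- last                      # remember how far we have read
  source_data[first:last, , drop = FALSE]
}

# The first batch is what the wrapper code would hand to your script.
modelerData <- getData(batch_size)

# Keep appending batches until nothing is left.
while (hasMoreData()) {
  modelerData <- rbind(modelerData, getData(batch_size))
}

print(nrow(modelerData))
```

Growing a data frame with rbind inside a loop copies the data on every pass, so a larger rowCount (as in the snippet above) keeps the number of passes small.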

5.1.2 Some useful parts of R code

Make sure a package is installed

Whenever you have created an R node that uses packages, you have to make sure that anyone using this node has these packages installed on their machine, preferably without too much user interference. For this you can use the following code:

    packages <- function(x){
      x <- as.character(match.call()[[2]])
      if(!require(x,character.only=TRUE)){
        install.packages(pkgs=x,repos="http://cran.r-project.org")
        require(x,character.only=TRUE)
      }
    }

    packages(rpart)

The packages function first verifies whether the library is installed on the system; if not, it silently installs it from the given CRAN mirror (you can change this to another mirror or a local repository if needed). Installation only happens the first time the node is used!

Create the metadata corresponding to the R data frame

Sometimes your data is transformed so much compared to the original data that it is difficult to build your metadata starting from the original one. So you might want to create metadata that matches the data in R, independent of the original. This is particularly useful if you want to use R as a sort node. Below is a function that takes a data frame and creates modelerData and modelerDataModel accordingly.

    sendToModeler <- function(dataFrame) {
      if(is.null(dim(dataFrame))){
        stop("Invalid data received: not a data.frame")
      }
      if(dim(dataFrame)[1]<=0) {
        print("Warning: modelerData has no line, all fieldStorage fields set to strings")
        getStorage <- function(x){return("string")}
      } else {
        getStorage <- function(x) {
          x <- unlist(x)
          res <- NULL
          # if x is a factor, typeof will return 'integer', so we handle this case first
          if(is.factor(x)) {
            res <- "string"
          } else {
            res <- switch(typeof(x), integer="integer", double="real", "string")
          }
          return(res)
        }
      }
      col = vector("list", dim(dataFrame)[2])
      for (i in 1:dim(dataFrame)[2]) {
        col[[i]] <- c(fieldName=names(dataFrame[i]), fieldLabel="", fieldStorage=getStorage(dataFrame[i]), fieldMeasure="", fieldFormat="", fieldRole="")
      }
      mdm <- do.call(cbind, col)

      modelerDataModel <<- data.frame(mdm)
      modelerData <<- dataFrame
    }

    sendToModeler(iris)
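To see which storage each R column type ends up with, the mapping rule used inside sendToModeler can be run on a small data frame. This is only a sketch: storageOf is a hypothetical helper name, and the example data frame is made up for illustration.

```r
# Same mapping rule as getStorage above, written as a standalone helper
# (storageOf is a hypothetical name used only in this sketch).
storageOf <- function(x) {
  if (is.factor(x)) {
    return("string")   # factors report typeof "integer", so test them first
  }
  switch(typeof(x), integer = "integer", double = "real", "string")
}

example <- data.frame(
  id     = 1:3,                       # integer
  amount = c(1.5, 2.25, 3),           # double
  label  = c("a", "b", "c"),          # character
  group  = factor(c("x", "y", "x")),  # factor
  stringsAsFactors = FALSE
)

storages <- vapply(example, storageOf, character(1))
print(storages)
```

Any type that is neither integer nor double (character, logical, factor) falls through to "string". This block only illustrates the mapping; sendToModeler above remains the code to place in the R node.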

If you use the sendToModeler function, make sure you only use it on data frames that do not depend much on the content of the original modelerData. If the results are ever not as expected, you might find an answer in 5.5.

Looping through several variables

This is a very common functionality which is relatively easy in R. However, with the Custom Dialog Builder it might be more difficult, as the string %%INPUTS%% will be replaced exactly by the string age + income + gender or something similar (the "+" sign can be replaced by commas or spaces, depending on the separator chosen). The problem is that in order to loop in R, we need to transform this string into the R vector c("age", "income", "gender"). You can do that using this code:

    # Create a function to remove leading and trailing spaces
    trim <- function(x) gsub("^\\s+|\\s+$", "", x)

    # Create the vector of inputs using the strsplit function
    # (fixed=TRUE makes "+" a literal separator instead of a regular expression)
    inputAsVector <- trim(strsplit("%%INPUTS%%", "+", fixed=TRUE)[[1]])

    for (input in inputAsVector){
      # Some more code to run for every field defined
    }
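The splitting step can be checked in plain R by putting a literal string where Modeler would substitute %%INPUTS%%. The sketch below assumes the "+" separator and uses made-up data; inputsString and mockData are illustrative names, not objects provided by Modeler.

```r
# Stand-in for what Modeler substitutes for %%INPUTS%% (assumed example value).
inputsString <- "age + income + gender"

trim <- function(x) gsub("^\\s+|\\s+$", "", x)

# fixed = TRUE makes strsplit treat "+" as a literal character, not a regex.
inputAsVector <- trim(strsplit(inputsString, "+", fixed = TRUE)[[1]])
print(inputAsVector)

# Loop over the fields, here just printing the mean of each one on mock data.
mockData <- data.frame(age = c(30, 40), income = c(10, 20), gender = c(0, 1))
for (input in inputAsVector) {
  cat(input, ":", mean(mockData[[input]]), "\n")
}
```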

One important remark is that this method will never work if you have variable names containing trailing spaces or + symbols. The code treats every + as a separator between variables and removes the surrounding spaces, so it cannot distinguish between a + coming from a variable name and a + used as a separator.

Use predefined roles

This is an option SPSS Modeler users are used to, and one you might want to extend to R usage. The idea is to distinguish between inputs and targets (and others) merely in the Type node. Once this is defined, all the modeling nodes will use these settings and variables by default. So you might want to use some R code to distinguish between inputs and targets. As an example, the code below looks for the flag target(s) and for all the input variables:

    # Row 1 holds the field names, row 4 the measurement level, row 6 the role
    TARGET <- modelerDataModel[1,(modelerDataModel[6,] == "target" & modelerDataModel[4,] == "flag")]
    INPUTS <- as.vector(t(modelerDataModel[1,(modelerDataModel[6,] == "input")]))

    # Some more code
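To make the row-based selection concrete, here is a minimal sketch on a mock metadata object. The layout (one column per field, six rows in the order fieldName, fieldLabel, fieldStorage, fieldMeasure, fieldFormat, fieldRole) follows the snippets in this guide; the measurement values "continuous" and "nominal" are assumptions made for the example.

```r
# Mock metadata: one column per field, six rows of properties (illustrative values).
mockDataModel <- data.frame(
  age    = c("age",    "", "integer", "continuous", "", "input"),
  gender = c("gender", "", "string",  "nominal",    "", "input"),
  churn  = c("churn",  "", "integer", "flag",       "", "target"),
  stringsAsFactors = FALSE
)
rownames(mockDataModel) <- c("fieldName", "fieldLabel", "fieldStorage",
                             "fieldMeasure", "fieldFormat", "fieldRole")

# Pull the relevant rows out as plain character vectors.
fieldNames <- unlist(mockDataModel["fieldName", ])
roles      <- unlist(mockDataModel["fieldRole", ])
measures   <- unlist(mockDataModel["fieldMeasure", ])

inputs  <- unname(fieldNames[roles == "input"])
targets <- unname(fieldNames[roles == "target" & measures == "flag"])

print(inputs)
print(targets)
```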

Removing columns

Sometimes you just want to remove a column in both modelerData and modelerDataModel. You should make sure to delete the appropriate column in each, as the link between data and metadata is merely the column order! You can use this R code to remove the column tenure:

    # Define the remove function
    removeColumn <- function(name){
      modelerDataModel[,modelerDataModel[1,]==name] <<- NULL
      modelerData[,colnames(modelerData)==name] <<- NULL
    }

    # Apply the function to the tenure variable
    removeColumn("tenure")

    # Some more code
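Because data and metadata are linked only by column order, one way to keep them aligned is to apply the same column selection to both. The sketch below shows this on mock objects; dropField is a hypothetical helper that returns both objects instead of assigning them globally, and the mock values are made up.

```r
# Mock data and metadata; in Modeler these objects are provided for you.
mockData <- data.frame(tenure = c(12, 24), age = c(30, 40), churn = c(0, 1))
mockDataModel <- data.frame(
  tenure = c("tenure", "", "real", "", "", "input"),
  age    = c("age",    "", "real", "", "", "input"),
  churn  = c("churn",  "", "real", "", "", "target"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drops one field from both objects using the same index.
dropField <- function(data, dataModel, name) {
  keep <- colnames(data) != name
  list(data      = data[, keep, drop = FALSE],
       dataModel = dataModel[, keep, drop = FALSE])
}

result <- dropField(mockData, mockDataModel, "tenure")
print(colnames(result$data))
print(unlist(result$dataModel[1, ]))
```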

5.2 Custom Dialog builder

5.2.1 How to save and share a custom dialog?

The specification of a custom dialog can be saved to an external file with the extension .cfd. This file can be saved and reopened through the general save and open buttons on the custom dialog toolbar.

Once you deploy the dialog to your palette, it is also saved as a local file, under a slightly different extension, .cfe. You can find this cfe file in the path C:\ProgramData\IBM\SPSS\Modeler\XX\CDB (see footnote 2).

In order to share this node with others, the cfe file needs to be copied into the same folder on the other SPSS Modeler installation.

5.2.2 Link to dialog and script

Regarding the Custom Dialog Builder, there are several more things to say. We stated earlier that, in the code of the custom dialog builder, an identifier enclosed in %% is replaced by the value filled in on the dialog. That means that if you have an identifier called TARGET and you fill in a variable churn in the dialog, all the references to %%TARGET%% in your code will be replaced by churn. If you have multiple variables selected (say age and income in the identifier INPUTS) and you select "+" as the separator, then within the code %%INPUTS%% will be replaced by the verbatim age + income.

Although this is valid in default cases, it is not entirely true. There is still another level. The replacement also depends on the control property called "R script". Within this property, the value %%ThisValue%% will be replaced by whatever you fill in on the dialog as described before. And it is the value of this R script property that will verbatim replace your identifier in the main R code. Now, because the value of R script is often just %%ThisValue%%, this extra level usually makes no visible difference.

However, this becomes useful when you work with radio buttons, where you want to run different code for each button. In general there are 2 ways to go:

• hard/dirty way: you create your R script using if statements: If '%%choice%%' == 'A' then ... else if '%%choice%%' == 'B' then ...
• easy/clean way: within your radio buttons, write the full code that has to be run when that button is selected.

Imagine the following scenario: the user selects some variables, and some computations are done on each of them. Depending on the outcome of the computations and defined cutoffs, the user can choose between any of these 3 actions:

• to remove the columns
• to keep them as data, but automatically set the role to "None"
• to do nothing, and just let the data flow back to Modeler without changes.

So now there are different levels you can work with. You will probably first add a radio button group to the dialog and call it WHATTODO.

Footnote 2: Replace 'XX' by the version of your IBM SPSS Modeler installation.


Now you see that the value of the R script variable is not changed. Clicking through to the radio button itself, you want it to look something like this:

So there are the 3 options, each with its corresponding R code:

1. removeColumn(input), referring to a function written to remove columns.
2. modelerDataModel[6,modelerDataModel[1,]==input] <- "none", which changes the role of that field to "None".
3. Basically an empty string.

With these settings, %%WHATTODO%% in your code will be replaced by the corresponding R code, depending on which radio button the user selects. So the R script behind the dialog will be something like:

    trim <- function(x) gsub("^\\s+|\\s+$", "", x)
    inputAsVector <- trim(strsplit("%%INPUTS%%", ",")[[1]])

    for (input in inputAsVector){
      %%WHATTODO%%
    }

    # Some more code

This approach makes it simpler to run different R code depending on the end user's choice.
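The verbatim replacement described in this section can be mimicked in plain R to preview what the script looks like once the dialog values are filled in. This is only a simulation under stated assumptions: fillTemplate is a hypothetical helper, not something Modeler exposes, and the dialog values (two fields, comma separator, second radio button) are made up for the example.

```r
# Script template as it would be typed into the Custom Dialog Builder.
template <- c(
  'inputAsVector <- trim(strsplit("%%INPUTS%%", ",")[[1]])',
  'for (input in inputAsVector){',
  '  %%WHATTODO%%',
  '}'
)

# Assumed dialog state: two fields selected, second radio button chosen.
dialogValues <- c(
  INPUTS   = "age,income",
  WHATTODO = 'modelerDataModel[6, modelerDataModel[1,]==input] <- "none"'
)

# Hypothetical helper that mimics the verbatim replacement of %%NAME%% tokens.
fillTemplate <- function(lines, values) {
  for (name in names(values)) {
    token <- paste0("%%", name, "%%")
    lines <- gsub(token, values[[name]], lines, fixed = TRUE)
  }
  lines
}

script <- fillTemplate(template, dialogValues)
cat(script, sep = "\n")
```

Printing the filled-in script this way is a quick check that the code attached to each radio button still forms valid R once it is pasted into the loop.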

5.3 What about SQL Pushback? Hadoop pushback?

SQL pushback supports R nodes for Pure Data for Analytics (PDA), SAP Hana and Oracle by utilizing their R support. The databases need to have the appropriate vendor-provided R extensions installed. Of course, depending on the vendor, only a subset of libraries or scripts is supported.

We will show an example of R usage on PDA. The first important thing to note is that only R nuggets are available for SQL pushback. The reason for this is that the data is divided over the several SPUs of PDA. R code will therefore be run independently on the different SPUs, and never on the entire data together (this is the main reason for the issues discussed in 3.4).

    # This code cannot be pushed back to PDA, as it is a model building node

    # Create the model and save it in modelerModel
    modelerModel <- lm(tenure ~ ., data = modelerData)

    # This code can be pushed back to PDA, as it is a model scoring node

    # Make sure they are considered as factors, as PDA will by default only have numerics
    modelerData$marital <- as.factor(modelerData$marital)
    modelerData$ed <- as.factor(modelerData$ed)
    modelerData$region <- as.factor(modelerData$region)
    modelerData$churn <- as.factor(modelerData$churn)

    # Use the model to make a prediction, and add it to the existing data
    pred <- predict(modelerModel, modelerData)
    modelerData <- cbind(modelerData, pred)

    # Take care of the metadata!
    newVar <- c(fieldName="TenureSQLScore", fieldLabel="", fieldStorage="real", fieldMeasure="", fieldFormat="", fieldRole="")
    modelerDataModel <- cbind(modelerDataModel, newVar)


One more thing to note here is that modelerModel will always be a local object. When the R code is pushed back, this object has to be transferred to PDA behind the scenes. This is not always a problem; however, modelerModel can often be quite big. As an example, a linear model for the telco dataset will be approx. 370 kB. (As a comparison, the data that the model is run on will be approx. 50 kB.) So random forests or other types of models can be huge. This is something that needs to be considered when using this approach.
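A quick way to get a feeling for these sizes is base R's object.size function. The sketch below uses randomly generated mock data rather than telco.sav, so the numbers it prints will differ from the figures quoted above; mockData and model are illustrative names.

```r
# Mock data standing in for modelerData (assumption: random values, not telco.sav).
set.seed(1)
mockData <- data.frame(
  tenure = runif(1000, 1, 72),
  age    = runif(1000, 18, 80),
  income = runif(1000, 10, 200)
)

model <- lm(tenure ~ ., data = mockData)

# object.size reports an estimate of the memory used by an R object.
print(object.size(mockData), units = "Kb")
print(object.size(model), units = "Kb")
```

Running the same two print statements on modelerData and modelerModel in a model building node gives an indication of what would have to be transferred to the database.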

Since everything runs on the different SPUs, this also means that R and all the necessary libraries have to be installed on both the host and every SPU of PDA. For the R libraries, you can generally not do that in the same way as on a local system, as PDA will generally not be connected to the internet. How to solve this is more a question of the R-PDA link than of SPSS. However, here is a small R script to install a package on PDA from a local R instance:

    # Note this will be a local R script.

    # Load PDA local R libraries
    library(nzr)
    library(nza)

    # Connect to the appropriate DSN
    nzConnectDSN('PDA-DSN')

    # Install the library on PDA
    nzInstallPackages("http://cran.r-project.org/src/contrib/rpart_4.1-10.tar.gz")

This will install the package rpart onto PDA (assuming you are logged on with appropriate credentials). You can choose to run this code in native R or within SPSS in an R output node. Both will work. Once it is installed on PDA, you can use the libraries as normal.

A stream showing all of these scripts is attached here.

5.4 What about real-time scoring? and Solution Publisher?

There is really not much to say about these, except that they are supported! There is just one minor point to mention: you should make sure that R, as well as the R extensions, are installed in some additional locations:

• For Solution Publisher, install the R extensions in the /ext/bin directory of your Solution Publisher. Install R on the machine where your Solution Publisher is installed.
• For real-time scoring, ensure that the R extensions are installed in the /components/modeler/ext/bin directory of both your server and your scoring server. Install R on the machine where your application server (WebSphere) is installed.

Of course, functionally, only R transform nodes and R nuggets are relevant here.

5.5 Something more about the metadata in Modeler and the consequences on R integration

Metadata in SPSS Modeler is something particular. It is very important for the way SPSS Modeler works. So important, in fact, that SPSS Modeler knows at all times all the metadata of the data at every node within the stream. You may have already noticed that when you add a new field (with a derive node), all the type nodes downstream immediately take this extra field into account. In order to do this, behind the scenes, Modeler lets some small dummy data flow around. This data only has the purpose of verifying the metadata in near real time.

If you want to know what this dummy data looks like, you can add an R transform node just after the source node, and use the following syntax to write out the data that is passed by SPSS through R and back. This code doesn’t do anything with the data, it just writes it back into a file.

    path <- "C:/test.txt"
    sink(path)
    writeLines(as.character(Sys.time()))
    writeLines("Data:")
    print(modelerData)
    # Stop diverting output to the file
    sink()

Once these lines are added to the R transform node and we continue creating the stream, we will see that this file is already populated without even running anything. You can see that the data passed through the node only contains 5 lines of data with 1-2-3-4-5 and "a-b-c-d-e" (some enclosed in quotes, for the string variables), depending on the metadata. These 5 lines of dummy data are the ones going around the stream every time SPSS Modeler needs or wants to check the metadata. You can even extend this script with print(modelerModel) to see that the value of modelerModel is not yet assigned. That also means that modelerModel cannot play any role in the assignment of modelerDataModel.

Now this approach has some consequences for the R integration. We placed it within this document as it explains a lot of "strange" behavior in your projects, where the reason is not always obvious! As an example, let us assume the following exercise: we have a multinomial logistic model with n different categories, and we want to get back a column for each of the categories. A naive approach would be the following:

    library("rpart")
    modelerModel <- rpart(custcat ~ tenure+age+income, data = modelerData)
    print(summary(modelerModel))

    library("rpart")
    probs <- predict(modelerModel,modelerData,type="prob")
    modelerData <- cbind(modelerData,probs)

    for (x in colnames(probs)){
      modelerDataModel <- cbind(modelerDataModel, c(fieldName=paste("$P-",x,sep=""), fieldLabel="", fieldStorage="real", fieldMeasure="", fieldFormat="", fieldRole=""))
    }

Running this approach in R natively will produce the correct modelerData and modelerDataModel objects. However, this approach will not work in SPSS because modelerModel is not assigned when Modeler assesses the metadata. Therefore, this code will be run with an empty modelerModel and the 5 dummy records. As a result probs will be empty, so nothing within the for-loop will run.

A workaround in this case is to derive the number of columns not from the modelerModel , but in another way. One approach can be:

    library(rpart)
    probs <- predict(modelerModel,modelerData,type="prob")
    modelerData <- cbind(modelerData,probs)

    for (x in c(1,2,3,4)){
      modelerDataModel <- cbind(modelerDataModel, c(fieldName=paste("$P-",x,sep=""), fieldLabel="", fieldStorage="real", fieldMeasure="", fieldFormat="", fieldRole=""))
    }

However, this approach also has a problem, as the values of the for loop need to be hardcoded.

The reason behind this approach of SPSS Modeler is that the metadata should be available in near real time. However, modelerData and modelerModel are objects that can be very big, so using them would lead to a big delay in obtaining this metadata.
