<<

60 CONTRIBUTED RESEARCH ARTICLES

PMML: An for Sharing Models by Alex Guazzelli, Michael Zeller, Wen-Ching Lin and series, as well as different ways to explain models, Graham Williams including statistical and graphical measures. Ver- sion 4.0 also expands on existing PMML elements and their capabilities to further represent the diverse Introduction world of predictive analytics. The PMML package exports a variety of predic- tive and descriptive models from R to the Predic- tive Model Markup Language (Data Mining Group, PMML Structure 2008). PMML is an XML-based language and has become the de-facto standard to represent not only PMML follows a very intuitive structure to describe predictive and descriptive models, but also data pre- a data mining model. As depicted in Figure1, it and post-processing. In so doing, it allows for the is composed of many elements which encapsulate interchange of models among different tools and en- different functionality as it relates to the input data, vironments, mostly avoiding proprietary issues and model, and outputs. incompatibilities. The PMML package itself (Williams et al., 2009) was conceived at first as part of Togaware’s data min- ing toolkit Rattle, the R Analytical Tool To Learn Eas- ily (Williams, 2009). Although it can easily be ac- cessed through Rattle’s GUI, it has been separated from Rattle so that it can also be accessed directly in R. In the next section, we describe PMML and its overall structure. This is followed by a description of the functionality supported by the PMML pack- age and how this can be used in R. We then discuss the importance of working with a valid PMML file and finish by highlighting some of the debate sur- rounding the adoption of PMML by the data mining community at large.

A PMML primer

Developed by the Data Mining Group, an indepen- dent, vendor-led committee (http://www.dmg.org), PMML provides an open standard for representing data mining models. In this way, models can easily Figure 1: PMML overall structure described sequen- be shared between different applications, avoiding tially from top to bottom. proprietary issues and incompatibilities. Not only can PMML represent a wide range of statistical tech- niques, but it can also be used to represent their input Sequentially, PMML can be described by the fol- data as well as the data transformations necessary to lowing elements: transform these into meaningful features. PMML has established itself as the lingua franca for the sharing of predictive analytics solutions be- Header tween applications. This enables data mining scien- tists to use different statistical packages, including R, The header element contains general information to build, visualize, and execute their solutions. about the PMML document, such as copyright in- PMML is an XML-based language. Its current 3.2 formation for the model, its description, and infor- version was released in 2007. Version 4.0, the next mation about the application used to generate the version, which is to be released in 2009, will expand model such as name and version. It also contains an the list of available techniques covered by the stan- attribute for a timestamp which can be used to spec- dard. For example, it will offer support for time ify the date of model creation.

The R Journal Vol. 1/1, May 2009 ISSN 2073-4859 CONTRIBUTED RESEARCH ARTICLES 61

Data dictionary – Outlier Treatment: Defines the outlier treat- ment to be used. In PMML, outliers can be The data dictionary element contains definitions for treated as missing values, as extreme val- all the possible fields used by the model. It is in the ues (based on the definition of high and data dictionary that a field is defined as continuous, low values for a particular field), or as is. categorical, or ordinal. Depending on this definition, the appropriate value ranges are then defined as well – Missing Value Replacement Policy: If this at- as the data type (such as string or double). The data tribute is specified then a missing value is dictionary is also used for describing the list of valid, automatically replaced by the given val- invalid, and missing values relating to every single ues. input data. – Missing Value Treatment: Indicates how the missing value replacement was derived Data transformations (e.g. as value, mean or median). Transformations allow for the mapping of user data • Targets: The targets element allows for the scal- into a more desirable form to be used by the mining ing of predicted variables. model. PMML defines several kinds of data transfor- mation: • Model Specifics: Once we have the data schema in place we can specify the details of the actual • Normalization: Maps values to numbers, the in- model. put can be continuous or discrete. A multi-layered feed-forward neural network, for example, is represented in PMML by its • Discretization: Maps continuous values to dis- activation function and number of layers, fol- crete values. lowed by the description of all the network • Value mapping: Maps discrete values to discrete components, such as connectivity and connec- values. tion weights. A decision tree is represented in PMML, recur- • Functions: Derive a value by applying a func- sively, by the nodes in the tree. For each node, tion to one or more parameters. a test on one variable is recorded and the sub- nodes then correspond to the result of the test. • Aggregation: Summarizes or collects groups of Each sub-node contains another test, or the fi- values. nal decision. The ability to represent data transformations (as well Besides neural networks and decision trees, as outlier and missing value treatment methods) in PMML allows for the representation of many conjunction with the parameters that define the mod- other data mining models, including linear re- els themselves is a key concept of PMML. gression, generalised linear models, random forests and other ensembles, association rules, Model cluster models, näive Bayes models, support vector machines, and more. The model element contains the definition of the data mining model. Models usually have a model PMML also offers many other elements such as name, function name (classification or regression) built-in functions, statistics, model composition and and technique-specific attributes. verification, etc. The model representation begins with a mining schema and then continues with the actual represen- tation of the model: Exporting PMML from R

• Mining Schema: The mining schema (which is The PMML package in R provides a generic pmml embedded in the model element) lists all fields function to generate PMML 3.2 for an object. Using used in the model. This can be a subset of the an S3 generic function, the appropriate method for fields defined in the data dictionary. It contains the class of the supplied object is dispatched. specific information about each field, such as name and usage type. Usage type defines the An example way a field is to be used in the model. Typ- ical values are: active, predicted, and supple- A simple example illustrates the steps involved in mentary. Predicted fields are those whose val- generating PMML. We will build a decision tree, us- ues are predicted by the model. It is also in the ing rpart, to illustrate. With a standard installation mining schema that special values are treated. of R with the PMML package installed, the follow- These involve: ing should be easily repeatable.

The R Journal Vol. 1/1, May 2009 ISSN 2073-4859 62 CONTRIBUTED RESEARCH ARTICLES

First, we load the appropriate packages and dataset. We will use the well known Iris dataset that varieties of iris. > library(pmml) library(rpart) optype="continuous" > data(iris) dataType="double"/> We can build a simple decision tree to classify ex- ... amples into iris varieties. We see the resulting struc- ture of the tree below. The root node test is on the The header and the data dictionary are common variable Petal.Length against a value of 2.45. The to all PMML, irrespective of the model. “left” branch tests for Petal.Length < 2.45 and deliv- Next we record the details of the model itself. In ers a decision of setosa. The “right” branch splits on a this case we have a tree model, and so a ‘TreeModel’ further variable, Petal.Width, to distinguish between element is defined. Within the tree model we begin versicolor and virginica. with the mining schema.

> my.rpart <- rpart(Species ~ ., data=iris) my.rpart functionName="classification" algorithmName="rpart" n= 150 ...> node), split, n, loss, yval, (yprob) 2) Petal.Length< 2.45 50 0 setosa (0.33 ...... 3) Petal.Length>=2.45 100 50 versicolor ... This is followed by the actual nodes of the tree. 6) Petal.Width< 1.75 54 5 versicolor ... We can see that the first test involves a test on the 7) Petal.Width>=1.75 46 1 virginica ... Petal.Length. Once again, it is testing the value against 2.45. To convert this model to PMML we simply call the pmml function. Below we see some of the output. contains information about the model. As indicated above, this will contain general information about to build the model, and a timestamp. Other informa- pmml(my.rpart) recordCount="50" confidence="0.33"/>

description="RPart Decision Tree"> value="2009-02-15 06:51:50" ... extender="Rattle"/> value="iris tree" extender="Rattle"/> here than displayed with R’s print method for the
rpart object. The rpart object in fact includes infor- mation about surrogate splits which is also captured Next, the data dictionary records information in the PMML representation. about the data fields from which the model was built. Usually, we will want to save the PMML to a file, Here we see the definition of a categoric and a nu- and this can easily be done using the saveXML func- meric data field. tion from the XML package (Temple Lang, 2009).

The R Journal Vol. 1/1, May 2009 ISSN 2073-4859 CONTRIBUTED RESEARCH ARTICLES 63

> saveXML(pmml(my.rpart), the model. The PMML package uses the data + file="my_rpart.") object to be able to retrieve the values for cate- gorical variables which are used internally by ksvm Supported Models to create dummy variables, but are not part of the resulting ksvm object.

2. Neural Networks — nnet: PMML export for neural networks implementing multi-class or binary classification as well as regression mod- els built with the nnet package, available through the VR bundle (Venables and Ripley, 2002). Some details that are worth mentioning are:

Figure 2: PMML Export functionality is available for • Scaling of input variables: Since nnet does several predictive algorithms in R. not automatically implement scaling of numerical inputs, it needs to be added to Although coverage is constantly being expanded, the generated PMML file by hand if one PMML exporter functionality is currently available is planning to use the model to compute for the following data mining algorithms: scores/results from raw, unscaled data. • The PMML exporter uses transformations 1. Support Vector Machines — kernlab (Karat- to create dummy variables for categorical zoglou et al., 2008): PMML export for SVMs inputs. These are expressed in the ‘Neu- implementing multi-class and binary classifi- ralInputs’ element of the resulting PMML cation as well as regression for objects of class file. ksvm (package kernlab). It also exports trans- formations for the input variables by following • PMML 3.2 does not support the censored the scaling scheme used by ksvm for numerical variant of softmax. variables, as well as transformations to create • Given that nnet uses a single output node dummy variables for categorical variables. to represent binary classification, the re- For multi-class classification, ksvm uses the sulting PMML file contains a discretizer one-against-one approach, in which the appro- with a threshold set to 0.5. priate class is found by a voting scheme. Given that PMML 3.2 does not support this approach, 3. Classification and Regression Trees — rpart the export functionality comes with an exten- (Therneau and Atkinson. R port by B. Ripley, sion piece to handle such cases. The next ver- 2008): PMML export functionality for decision sion of PMML to be released in early 2009 will trees built with the rpart package is able to ex- be equipped to handle different classification port classification as well as regression trees. methods for SVMs, including one-against-one. It also exports surrogate predicate information The example below shows how to train a sup- and missing value strategy (default child strat- port vector machine to perform binary classifi- egy). cation using the audit dataset — see Williams lm glm (2009). 4. Regression Models — and from stats: PMML export for linear regression models for objects of class "lm" and binary logistic regres- > library(kernlab) sion models for objects of class "glm" built with > audit <- read.csv(file( the binomial family. Note that this function +"http://rattle.togaware.com/audit.csv") currently does not support multinomial logis- + ) tic regression models or any other regression > myksvm <- ksvm(as.factor(Adjusted) ~ ., models built using the VGAM package. + data=audit[,c(2:10,13)], + kernel = "rbfdot", 5. Clustering Models — hclust and kmeans from + prob.model=TRUE) stats: PMML export functionality for cluster- > pmml(myksvm, data=audit) ing models for objects of class "hclust" and "kmeans". Note that the PMML package is being invoked in the example above with two parameters: 1) 6. Association Rules — arules (Hahsler et al., the ksvm object which contains the model rep- 2008): PMML export for association rules built resentation; and 2) the data object used to train with the arules package.

The R Journal Vol. 1/1, May 2009 ISSN 2073-4859 64 CONTRIBUTED RESEARCH ARTICLES

7. Random Forest (and randomSurvivalForest) — randomForest (Breiman and Cutler. R 2009): PMML export of a randomSurvivalFor- the ability to export PMML containing the ge- ometry of a forest. PMML validation ... Since PMML is an XML-based standard, the speci- fication comes in the form of an XML Schema. Ze- mentis (http://www.zementis.com) has built a tool that can be used to validate any PMML file against Discussion the current and previous PMML schemas. The tool can also be used to convert older versions of PMML Although PMML is an open standard for represent- (2.1, 3.0, 3.1) to version 3.2 (the current version). The ing predictive models, the lack of awareness has PMML Converter will validate any PMML file and made its adoption slow. Its usefulness was also lim- is currently able to convert the following modeling ited because until recently predictive models were in elements: general built and deployed using the same data min- ing tool. But this is now quickly changing with mod- 1. Association Rules els built in R, for example, now able to be deployed 2. Neural Networks in other model engines, including data warehouses that support PMML. 3. Decision Trees For those interested in joining an on-going dis- cussion on PMML, we have created a PMML dis- 4. Regression Models cussion group under the AnalyticBridge community (http://www.analyticbridge.com/group/pmml). In 5. Support Vector Machines one of the discussion forums, the issue of the lack 6. Cluster Models of support for the export of data transformations into PMML is discussed. Unless the model is built The PMML converter is free to use and can be directly from raw data (usually not the case), the accessed directly from the Zementis website or in- PMML file that most tools will export is incomplete. stalled as a gadget in an iGoogle console. This also extends to R. As we saw in the previ- It is very important to validate a PMML file. ous section, the extent to which data transforma- Many vendors claim to support PMML but such sup- tions are exported to PMML in R depends on the port is often not very complete. Also, exported files amount of pre-processing carried out by the pack- do not always conform to the PMML specification. age/class used to build the model. However, if any Therefore, between tools can be a data massaging is done previous to model building, challenge at times. this needs to be manually converted to PMML if one More importantly, strict adherence to the schema wants that to be part of the resulting file. PMML of- is necessary for the standard to flourish. If this is not fers coverage for many commonly used data trans- enforced, PMML becomes more of a blueprint than a formations, including mapping and normalization standard, which then defeats its purpose as an open as well as several built-in functions for string, date standard supporting interoperability. By validating and time manipulation. Built-in functions also of- a PMML file against the schema, one can make sure fer several mathematical operations as depicted in that it can be moved around successfully. The PMML the PMML example below, which implements: maxi- Converter is able to pinpoint schema violations, em- mum(round(inputVar/1.3)). powering users and giving commercial tools, as well as the PMML package available for R, an easy way to validate their export functionality. The example below shows part of the data dictionary element for the iris classification tree discussed previously. This time, however, the PMML is missing the required 1.3 dataType attribute for variable Sepal.Length. The PMML Converter flags the problem by embedding an XML comment into the PMML file itself.

The R Journal Vol. 1/1, May 2009 ISSN 2073-4859 CONTRIBUTED RESEARCH ARTICLES 65

Integrated tools that export PMML should allow SIGKDD Explorations, June 2009. URL http://www. not only for the exporting of models, but also data sigkdd.org/explorations. transformations. One question we pose in our dis- cussion forum is how to implement this in R. M. Hahsler, C. Buchta, B. Gruen, and K. Hornik. More recently, KNIME has implemented exten- arules: Mining Association Rules and Frequent Item- http://cran.r-project.org/ sive support for PMML. KNIME is an open-source sets, 2008. URL package=arules framework that allows users to visually create data . R package version 0.6-8. flows (http://www.knime.org). It implements plug- H. Ishwaran and U. Kogalur. randomSurvivalFor- ins that allow R-scripts to be run inside its Eclipse- est: Ishwaran and Kogalur’s Random Surival Forest, based interface. KNIME can import and export 2009. URL http://cran.r-project.org/package= PMML for some of its predictive modules. In do- randomSurvivalForest. R package version 3.5.1. ing so, it allows for models built in R to be loaded in KNIME for data flow visualization. Given that it A. Karatzoglou, A. Smola, and K. Hornik. The kernlab exports PMML as well, it could potentially be used package, 2008. URL http://cran.R-project.org/ to integrate data pre- and post-processing into a sin- package=kernlab. R package version 0.9-8. gle PMML file which would then contain the entire data and model processing steps. D. Temple Lang. XML: Tools for parsing and gener- ating XML within R and S-Plus, 2009. URL http: PMML offers a way for models to be exported //cran.R-project.org/package=xml. R package out of R and deployed in a production environ- version 2.3-0. ment, quickly and effectively. Along these lines, an interesting development we are involved in is the T. M. Therneau and B. Atkinson. R port by B. Rip- ADAPA scoring engine (Guazzelli et al., 2009), pro- ley. rpart: Recursive Partitioning, 2008. URL http: duced by Zementis. This PMML consumer is able //cran.r-project.org/package=rpart. R pack- to upload PMML models over the Internet and then age version 3.1-42. execute/score them with any size dataset in batch- mode or real-time. This is implemented as a service W. N. Venables and B. D. Ripley. Modern Ap- through the Amazon Compute Cloud. Thus, models plied Statistics with S. Statistics and Computing. developed in R can be put to work in of min- Springer, New York, 4th edition, 2002. URL http: utes, via PMML, and accessed in real-time through //cran.r-project.org/package=VR. R package web-service calls from anywhere in the world while version 7.2-45. leveraging a highly scalable and cost-effective cloud G. Williams. rattle: A graphical user interface for data computing infrastructure. mining in R, 2009. URL http://cran.r-project. org/package=rattle. R package version 2.4.31. Bibliography G. Williams, M. Hahsler, A. Guazzelli, M. Zeller, W. Lin, H. Ishwaran, U. B. Kogalur, and R. Guha. L. Breiman and A. Cutler. R port by A. Liaw and pmml: Generate PMML for various models, 2009. M. Wiener. randomForest: Breiman and Cutler’s URL http://cran.r-project.org/package=pmml. Random Forests for Classification and Regression, R package version 1.2.7. 2009. URL http://cran.r-project.org/package= randomForest. R package version 4.5-30. Alex Guazzelli, Michael Zeller, Wen-Ching Lin Data Mining Group. PMML version 3.2. WWW, Zementis Inc 2008. URL http://www.dmg.org/pmml-v3-2.html. [email protected]

A. Guazzelli, K. Stathatos, and M. Zeller. Efficient Graham Williams deployment of predictive analytics through open Togaware Pty Ltd standards and cloud computing. To appear in ACM [email protected]

The R Journal Vol. 1/1, May 2009 ISSN 2073-4859