PMML: an Open Standard for Sharing Models

60 CONTRIBUTED RESEARCH ARTICLES PMML: An Open Standard for Sharing Models by Alex Guazzelli, Michael Zeller, Wen-Ching Lin and series, as well as different ways to explain models, Graham Williams including statistical and graphical measures. Ver- sion 4.0 also expands on existing PMML elements and their capabilities to further represent the diverse Introduction world of predictive analytics. The PMML package exports a variety of predictive and descriptive models from R to the Predic- tive Model Markup Language (Data Mining Group, PMML Structure 2008). PMML is an XML-based language and has become the de-facto standard to represent not only PMML follows a very intuitive structure to describe predictive and descriptive models, but also data pre- a data mining model. As depicted in Figure1, it and post-processing. In so doing, it allows for the is composed of many elements which encapsulate interchange of models among different tools and en- different functionality as it relates to the input data, vironments, mostly avoiding proprietary issues and model, and outputs. incompatibilities. The PMML package itself (Williams et al., 2009) was conceived at first as part of Togaware’s data mining toolkit Rattle, the R Analytical Tool To Learn Eas- ily (Williams, 2009). Although it can easily be accessed through Rattle’s GUI, it has been separated from Rattle so that it can also be accessed directly in R. In the next section, we describe PMML and its overall structure. This is followed by a description of the functionality supported by the PMML package and how this can be used in R. We then discuss the importance of working with a valid PMML file and finish by highlighting some of the debate sur- rounding the adoption of PMML by the data mining community at large. A PMML primer Developed by the Data Mining Group, an indepen- dent, vendor-led committee (http://www.dmg.org), PMML provides an open standard for representing data mining models. In this way, models can easily Figure 1: PMML overall structure described sequen- be shared between different applications, avoiding tially from top to bottom. proprietary issues and incompatibilities. Not only can PMML represent a wide range of statistical techniques, but it can also be used to represent their input Sequentially, PMML can be described by the fol- data as well as the data transformations necessary to lowing elements: transform these into meaningful features. PMML has established itself as the lingua franca for the sharing of predictive analytics solutions be- Header tween applications. This enables data mining scien- tists to use different statistical packages, including R, The header element contains general information to build, visualize, and execute their solutions. about the PMML document, such as copyright in- PMML is an XML-based language. Its current 3.2 formation for the model, its description, and infor- version was released in 2007. Version 4.0, the next mation about the application used to generate the version, which is to be released in 2009, will expand model such as name and version. It also contains an the list of available techniques covered by the stan- attribute for a timestamp which can be used to spec- dard. For example, it will offer support for time ify the date of model creation. The R Journal Vol. 1/1, May 2009 ISSN 2073-4859 CONTRIBUTED RESEARCH ARTICLES 61 Data dictionary – Outlier Treatment: Defines the outlier treatment to be used. In PMML, outliers can be The data dictionary element contains definitions for treated as missing values, as extreme val- all the possible fields used by the model. It is in the ues (based on the definition of high and data dictionary that a field is defined as continuous, low values for a particular field), or as is. categorical, or ordinal. Depending on this definition, the appropriate value ranges are then defined as well – Missing Value Replacement Policy: If this at- as the data type (such as string or double). The data tribute is specified then a missing value is dictionary is also used for describing the list of valid, automatically replaced by the given val- invalid, and missing values relating to every single ues. input data. – Missing Value Treatment: Indicates how the missing value replacement was derived Data transformations (e.g. as value, mean or median). Transformations allow for the mapping of user data • Targets: The targets element allows for the scal- into a more desirable form to be used by the mining ing of predicted variables. model. PMML defines several kinds of data transfor- mation: • Model Specifics: Once we have the data schema in place we can specify the details of the actual • Normalization: Maps values to numbers, the in- model. put can be continuous or discrete. A multi-layered feed-forward neural network, for example, is represented in PMML by its • Discretization: Maps continuous values to dis- activation function and number of layers, fol- crete values. lowed by the description of all the network • Value mapping: Maps discrete values to discrete components, such as connectivity and connec- values. tion weights. A decision tree is represented in PMML, recur- • Functions: Derive a value by applying a func- sively, by the nodes in the tree. For each node, tion to one or more parameters. a test on one variable is recorded and the sub- nodes then correspond to the result of the test. • Aggregation: Summarizes or collects groups of Each sub-node contains another test, or the fi- values. nal decision. The ability to represent data transformations (as well Besides neural networks and decision trees, as outlier and missing value treatment methods) in PMML allows for the representation of many conjunction with the parameters that define the mod- other data mining models, including linear re- els themselves is a key concept of PMML. gression, generalised linear models, random forests and other ensembles, association rules, Model cluster models, näive Bayes models, support vector machines, and more. The model element contains the definition of the data mining model. Models usually have a model PMML also offers many other elements such as name, function name (classification or regression) built-in functions, statistics, model composition and and technique-specific attributes. verification, etc. The model representation begins with a mining schema and then continues with the actual representation of the model: Exporting PMML from R • Mining Schema: The mining schema (which is The PMML package in R provides a generic pmml embedded in the model element) lists all fields function to generate PMML 3.2 for an object. Using used in the model. This can be a subset of the an S3 generic function, the appropriate method for fields defined in the data dictionary. It contains the class of the supplied object is dispatched. specific information about each field, such as name and usage type. Usage type defines the An example way a field is to be used in the model. Typ- ical values are: active, predicted, and supple- A simple example illustrates the steps involved in mentary. Predicted fields are those whose val- generating PMML. We will build a decision tree, us- ues are predicted by the model. It is also in the ing rpart, to illustrate. With a standard installation mining schema that special values are treated. of R with the PMML package installed, the follow- These involve: ing should be easily repeatable. The R Journal Vol. 1/1, May 2009 ISSN 2073-4859 62 CONTRIBUTED RESEARCH ARTICLES First, we load the appropriate packages and <DataDictionary numberOfFields="5"> dataset. We will use the well known Iris dataset that <DataField name="Species" ... records sepal and petal characteristics for different <Value value="setosa"/> varieties of iris. <Value value="versicolor"/> <Value value="virginica"/> > library(pmml) <DataField name="Sepal.Length" > library(rpart) optype="continuous" > data(iris) dataType="double"/> </DataField> We can build a simple decision tree to classify ex- ... amples into iris varieties. We see the resulting structure of the tree below. The root node test is on the The header and the data dictionary are common variable Petal.Length against a value of 2.45. The to all PMML, irrespective of the model. “left” branch tests for Petal.Length < 2.45 and deliv- Next we record the details of the model itself. In ers a decision of setosa. The “right” branch splits on a this case we have a tree model, and so a ‘TreeModel’ further variable, Petal.Width, to distinguish between element is defined. Within the tree model we begin versicolor and virginica. with the mining schema. > my.rpart <- rpart(Species ~ ., data=iris) <TreeModel modelName="RPart_Model" > my.rpart functionName="classification" algorithmName="rpart" n= 150 ...> <MiningSchema> node), split, n, loss, yval, (yprob) <MiningField name="Species" * denotes terminal node usageType="predicted"/> <MiningField name="Sepal.Length" 1) root 150 100 setosa ... usageType="active"/> 2) Petal.Length< 2.45 50 0 setosa (0.33 ... ... 3) Petal.Length>=2.45 100 50 versicolor ... This is followed by the actual nodes of the tree. 6) Petal.Width< 1.75 54 5 versicolor ... We can see that the first test involves a test on the 7) Petal.Width>=1.75 46 1 virginica ... Petal.Length. Once again, it is testing the value against 2.45. To convert this model to PMML we simply call the pmml function. Below we see some of the output. <Node id="1" score="setosa" The exported PMML begins with a header that recordCount="150" defaultChild="3"> contains information about the model. As indicated <True/> above, this will contain general information about <ScoreDistribution value="setosa" the model, including copyright, the application used recordCount="50" confidence="0.33"/> to build the model, and a timestamp. Other informa- <ScoreDistribution value="versicolor" tion might also be included.

Load more