DATA MINING STANDARDS INITIATIVES

Lacking standards for statistical and data mining models, applications cannot leverage the benefits of data mining.

BY ROBERT L. GROSSMAN, MARK F. HORNICK, AND GREGOR MEYER

The data mining and statistical models generated by commercial data mining applications are often used as components in other systems, including those in customer relationship management, enterprise resource planning, risk management, and intrusion detection. In the research community, data mining is used in systems processing scientific and engineering data. Employing common data mining standards greatly simplifies the integration, updating, and maintenance of the applications and systems containing the models.

Established and emerging standards address various aspects of data mining, including:

Models. For representing data mining and statistical data.
Attributes. For representing the cleaning, transforming, and aggregating of attributes used as input in the models.
Interfaces and APIs. For linking to other languages and systems.
Settings. For representing the internal parameters required for building and using the models.
Process. For producing, deploying, and using the models.
Remote and distributed data. For analyzing and mining remote and distributed data.

The parameters of a parameterized data mining model, such as a neural network, can be represented using the Extensible Markup Language (XML); for example, a single XML tag can indicate that a neural network node with id 10 has an input from a node with id 0 and a weight of 2.08148. The standards for defining parameterized models using XML are relatively mature. They assume the inputs to the models are given explicitly, as in the example. In practice, however, inputs are generally not explicit; the data must first be cleaned and transformed. But standards for cleaning and transforming data are only beginning to emerge. Standards related to the broader process of using data mining in operational processes and systems are relatively immature; for example, what is the business implication of a particular credit risk score produced by a credit card fraud model?

XML Standards

The Predictive Model Markup Language (PMML) is an XML standard being developed by the Data Mining Group (www.dmg.org), a vendor-led consortium established in 1998 to develop data mining standards [7]. PMML represents and describes data mining and statistical models, as well as some of the operations required for cleaning and transforming data prior to modeling. PMML aims to provide enough infrastructure for one application to produce a model (the PMML producer) and another application to consume it (the PMML consumer) simply by reading the PMML XML data file.
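As a rough illustration of the producer/consumer idea, the following Python sketch writes a tiny XML fragment describing one neural-network connection and then reads it back. The element and attribute names (`Neuron`, `Con`, `id`, `from`, `weight`) and both function names are assumptions chosen for illustration; a real PMML file carries considerably more structure than this fragment.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def produce_fragment(node_id: str, source_id: str, weight: float) -> str:
    """Producer side: serialize one connection as an XML string.

    Element and attribute names are hypothetical, for illustration only.
    """
    neuron = ET.Element("Neuron", {"id": node_id})
    ET.SubElement(neuron, "Con", {"from": source_id, "weight": repr(weight)})
    return ET.tostring(neuron, encoding="unicode")


def consume_fragment(xml_text: str) -> List[Tuple[str, str, float]]:
    """Consumer side: recover (target, source, weight) triples from the XML."""
    neuron = ET.fromstring(xml_text)
    return [
        (neuron.get("id"), con.get("from"), float(con.get("weight")))
        for con in neuron.findall("Con")
    ]


# Node 10 takes an input from node 0 with weight 2.08148.
fragment = produce_fragment("10", "0", 2.08148)
connections = consume_fragment(fragment)
print(fragment)
print(connections)
```

The consumer needs no knowledge of the producer beyond the agreed element names, which is the kind of decoupling a shared model format is meant to provide.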

COMMUNICATIONS OF THE ACM August 2002/Vol. 45, No. 8 59

PMML consists of the following components:

Data dictionary. Defines the input attributes to models and specifies each one's type and value range.
Mining schema. Precisely one in each model, listing the schema's attributes and their role in the model; these attributes are a subset of the attributes in the data dictionary. The schema contains information specific to a certain model, while the data dictionary contains data definitions that do not vary by model. It also specifies an attribute's usage type, which can be active (an input of the model), predicted (an output of the model), or supplementary (holding descriptive information and ignored by the model).
Transformation dictionary. Can contain any of the following transformations: normalization (mapping continuous or discrete values to numbers); discretization (mapping continuous values to discrete values); value mapping (mapping discrete values to discrete values); and aggregation (summarizing or collecting groups of values, such as by computing averages).
Model statistics. Univariate statistics about the attributes in the model.
Models. Model parameters specified by tags. PMML v.2.0 includes regression models, cluster models, trees, neural networks, Bayesian models, association rules, and sequence models.

The first major release of PMML (v.1.0 in 1999) focused on defining XML representations for some of the most common statistical and data mining models. The assumption built into PMML v.1.0 was that the inputs to the models (called DataFields) were already defined. In practice, however, defining such inputs can be highly complex. The next major release of PMML (v.2.0 in 2001) introduced a mechanism, the transformation dictionary, to define model inputs more flexibly. In PMML v.2.0, inputs to PMML models can be DataFields defined in a data dictionary or DerivedFields defined in the transformation dictionary. The consensus among Data Mining Group members is that the transformation dictionary is powerful enough for capturing the process of preparing data for statistical and data mining models.

Standard APIs

To facilitate integration of data mining with application software, several data mining APIs have been developed for the following types of application:

SQL. The SQL Multimedia and Applications Packages Standard (SQL/MM) includes a specification called SQL/MM Part 6: Data Mining, which specifies a SQL interface to data mining applications and services. It provides an API for data mining applications to access data from SQL/MM-compliant relational databases.
Java. The Java Specification Request-73 (JSR-73) defines a pure Java API supporting the building of data mining models and the scoring of data using models, as well as the creation, storage, and maintenance of and access to data and metadata supporting data mining results [5].
Microsoft. The Microsoft-supported OLE DB for DM defines an API for data mining for Microsoft-based applications [6]. Released in 2000, OLE DB for DM was especially noteworthy for introducing several new capabilities, variants of which are now part of other standards, including PMML v.2.0; included are taxonomies for data and a mechanism for transforming data. Earlier this year, however, OLE DB for DM was subsumed by Microsoft's Analysis Services for SQL Server 2000 [9]; Analysis Services provide APIs to Microsoft's SQL Server 2000 for data transformations, data mining, and online analytical processing (OLAP).

Other Standards Efforts

Standards have also been developed for defining the software objects used in data mining, the business processes used in data mining, and Web-based services for mining remote and distributed data.
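Returning to the transformation dictionary described among the PMML components, a minimal Python sketch of its four kinds of transformation might look as follows. The function names, the min-max form of normalization, the bin edges, and the sample values are all assumptions made for illustration; PMML expresses such transformations declaratively in XML rather than as code.

```python
from bisect import bisect_right
from statistics import mean


def normalize(x, lo, hi):
    """Normalization: map a continuous value onto [0, 1] (min-max, assumed)."""
    return (x - lo) / (hi - lo)


def discretize(x, edges, labels):
    """Discretization: map a continuous value to a discrete label.

    `edges` are ascending interior bin boundaries, so
    len(labels) must equal len(edges) + 1.
    """
    return labels[bisect_right(edges, x)]


def map_value(value, table, default=None):
    """Value mapping: map one discrete value to another via a lookup table."""
    return table.get(value, default)


def aggregate(rows, key, field):
    """Aggregation: average `field` over groups of rows sharing `key`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row[field])
    return {k: mean(v) for k, v in groups.items()}


# Hypothetical sample data.
purchases = [
    {"customer": "a", "amount": 20.0},
    {"customer": "a", "amount": 40.0},
    {"customer": "b", "amount": 10.0},
]
print(normalize(35, 18, 86))                                    # 0.25
print(discretize(35, [30, 60], ["young", "middle", "senior"]))  # middle
print(map_value("M", {"M": "male", "F": "female"}))             # male
print(aggregate(purchases, "customer", "amount"))
```

Each function produces what the article calls a derived input: a value computed from raw fields before the model sees it.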

Data mining metadata. In 2000, the Object Management Group defined the Common Warehouse Metamodel for Data Mining (CWM DM) [1] for metadata specifying model building settings, model representations, and results from model operations, along with other data mining-related objects. Models are defined through the Unified Modeling Language [10] using tools to generate XML Document Type Definitions, which formally specify XML documents.
Process standards. The CRoss-Industry Standard Process for Data Mining (CRISP-DM) was developed in 1997 by two vendors (ISL and NCR) along with two industrial partners. Designed to capture the data mining process, it begins with business problems, then captures and understands data, applies data mining techniques, interprets results, and deploys the knowledge gained in operations [2].
Web standards. The semantic Web includes the open standards being developed by the World Wide Web Consortium (W3C) for defining and working with knowledge through XML, the Resource Description Framework (RDF), and related standards [8]. RDF can be thought of informally as a way to code triples consisting of subjects, verbs, and objects. The semantic Web can in principle be used to store knowledge extracted from data through data mining systems, though this capability is, today, more a goal than an achievement.

The W3C is also standardizing Web services based on XML and a protocol for working with remote objects called the Simple Object Access Protocol (SOAP). The services describe themselves to applications using the Web Services Description Language [11].

Data webs are Web-based infrastructures employing Web services and other open Web protocols and standards for analyzing and mining remote and distributed data [4]. In addition to standard Web protocols, some data webs also use protocols designed to transport remote and distributed data, such as the Data Space Transfer Protocol (DSTP) [3] being developed by the National Center for Data Mining at the University of Illinois at Chicago and standardized by the Data Mining Group.

Meanwhile, earlier this year, Hyperion, a software vendor, and Microsoft announced a set of XML message interfaces using SOAP to define the data-access interaction between a client application and an OLAP or other data mining data provider [12].

Conclusion

The main reason so many different data representation and data communication standards exist today is that data mining is used in so many different ways and in combination with so many different systems and services, many requiring their own separate, often-incompatible standards. Although some vendor-led efforts have sought to homogenize terminology and concepts among standards, more work is indeed required.

Relatively narrow XML standards, such as PMML, serve as common ground for several emerging standards. For example, SQL/MM Part 6: Data Mining, JSR-73, CWM, and Microsoft's Analysis Services all use PMML in their specifications, providing a base level of compatibility among them all.

Meanwhile, two major challenges top the data mining standards agenda: agreeing on a common standard for cleaning, transforming, and preparing data for data mining (PMML v.2.0 represents a first step in this direction); and agreeing on a common set of Web services for working with remote and distributed data (an effort only just beginning).

References

1. Common Warehouse Metamodel: Data Mining. Object Management Group; see cgi.omg.org/cgi-bin/doclist.pl.
2. Cross Industry Standard Process for Data Mining (CRISP-DM); see www.crisp-dm.org.
3. Data Space Transfer Protocol. National Center for Data Mining; see www.ncdm.uic.edu.
4. Grossman, R. and Mazzucco, M. DataSpace: A data Web for the exploratory analysis and mining of data. IEEE Comput. Sci. Eng. (2002).
5. Java Specification Request 73; see jcp.org/jsr/detail/073.jsp.
6. OLE DB for Data Mining Specification 1.0. Microsoft; see www.microsoft.com/data/oledb/default.htm.
7. Predictive Model Markup Language (PMML). Data Mining Group; see www.dmg.org.
8. Semantic Web. World Wide Web Consortium; see www.w3c.org/2001/sw.
9. SQL Server 2000 Analysis Services. Microsoft; see www.microsoft.com/SQL/techinfo/bi/analysis.asp.
10. Unified Modeling Language. Object Management Group; see www.uml.org.
11. Web Services Activity. World Wide Web Consortium; see www.w3c.org/2002/ws.
12. XML for Analysis (XMLA); see www.xmla.org.

Robert L. Grossman is director of the Laboratory of Advanced Computing and the National Center for Data Mining at the University of Illinois at Chicago and president of the Two Cultures Group, Chicago.
Mark F. Hornick is a senior manager in the Data Mining Technologies unit of Oracle Corp., Burlington, MA.
Gregor Meyer is a senior software engineer at IBM Corp., San Jose, CA.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

© 2002 ACM 0002-0782/02/0800 $5.00
