Deploying Analytics with the Portable Format for Analytics (PFA)
Jim Pivarski, Collin Bennett, and Robert L. Grossman*
Open Data Group Inc., 400 Lathrop Ave Suite 90, River Forest, IL, USA

ABSTRACT

We introduce a new language for deploying analytic models into products, services, and operational systems called the Portable Format for Analytics (PFA). PFA is an example of what is sometimes called a model interchange format: a language for describing analytic models that is independent of specific tools, applications, or systems. Model interchange formats allow one application (the model producer) to export models and another application (the model consumer, or scoring engine) to import them. The core idea behind PFA is to support the safe execution of statistical functions, mathematical functions, and machine learning algorithms, and their compositions, within a safe execution environment. With this approach, the common analytic models used in data science can be implemented, as well as the data transformations and data aggregations required for pre- and post-processing data. PFA-compliant scoring engines can be extended by adding new user-defined functions described in PFA. We describe the design of PFA. A Data Mining Group (DMG) Working Group is developing the PFA standard. The current version is 0.8.1 and contains many of the commonly used statistical and machine learning models, including regression, clustering, support vector machines, and neural networks. We also describe two implementations of Hadrian, one in Scala and one in Python. We discuss four case studies that use PFA and Hadrian to specify analytic models, including two that are deployed in operations at client sites.

Keywords: model producers, scoring engines, Portable Format for Analytics, PFA, PMML, deploying analytics

1. INTRODUCTION

The KDD research community has traditionally focused on developing new algorithms and on building systems for managing and analyzing data. This is one of the core activities of data science. Technology is also required for deploying algorithms and statistical models in analytic products, analytic services, and operational systems; the term analytic operations is sometimes used for this component of the KDD process.

A problem that has haunted the KDD community for the last twenty years is how to efficiently deploy the analytic models developed by data scientists into products and services that must live in operational environments, which usually have stringent service level requirements. Because of these requirements, most organizations do not allow code to be integrated into operations without careful and extensive testing. See Figure 1.

In this paper, we introduce a new language called the Portable Format for Analytics (PFA), describe some of the ways that PFA differs from the Predictive Model Markup Language (PMML), the current dominant standard for describing analytic models [1], and discuss some of the lessons that we have learned working with PFA over the past two years in various deployed analytic applications.

Contributions of this Paper. This paper makes four main contributions to the KDD community. The first contribution is a language (PFA) for describing statistical and data mining models that overcomes some of the important limitations of PMML. The second contribution is two implementations of PFA-compliant scoring engines (one in Scala and one in Python), which we have developed, deployed, and made available to the KDD community for personal and research use. The third contribution is a description of two deployments of the Scala scoring engine at production sites and a discussion of some of the lessons learned from these deployments. The fourth contribution is calling attention to PFA and to the Data Mining Group (DMG) PFA Working Group so that others in the KDD research community can contribute to the further development of the PFA standard.

*RLG is also a faculty member at the University of Chicago.

KDD '16, August 13-17, 2016, San Francisco, CA, USA. Copyright is held by the owner/author(s); publication rights licensed to ACM. ISBN 978-1-4503-4232-2/16/08.

2. ARCHITECTURES FOR DEPLOYING ANALYTICS

An analytic architecture was developed over fifteen years ago that separated the systems, applications, and tools that produce analytic models from those that consume analytic models [2, 3].
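The producer/consumer separation described above can be illustrated with a minimal sketch. This is not code from the paper; the function names and the JSON document layout are hypothetical, chosen only to show the idea that a producer exports a model in a tool-independent format and a scoring engine imports and evaluates it without any dependency on the producing tool.

```python
import json

# --- Model producer side (the modeling environment) ---
# Export a trained linear model as a tool-independent JSON document.
# The document layout here is hypothetical, for illustration only.
def export_model(coefficients, intercept):
    return json.dumps({
        "modelType": "linearRegression",
        "coefficients": coefficients,
        "intercept": intercept,
    })

# --- Model consumer side (a scoring engine in operations) ---
# Import the document and score incoming records.  Any producer that
# emits this format can feed this one consumer.
def make_scoring_engine(document):
    model = json.loads(document)
    coefs, intercept = model["coefficients"], model["intercept"]

    def score(features):
        return intercept + sum(c * x for c, x in zip(coefs, features))

    return score

doc = export_model([2.0, -1.0], 0.5)   # producer exports the model
score = make_scoring_engine(doc)       # consumer imports the model
print(score([3.0, 1.0]))               # 0.5 + 2.0*3.0 + (-1.0)*1.0 = 5.5
```

The design point is that the document, not the code, crosses the boundary between the modeling environment and operations, which is what lets many producers share a single scoring engine.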
DOI: http://dx.doi.org/10.1145/2939672.2939731

Model producers are part of the development or data modeling environment, while model consumers are part of the operational environment. See Figure 1.

Figure 1: Analytic model producers and analytic model consumers (scoring engines). [The figure contrasts the modeling environment (exploratory data analysis; getting and cleaning the data; building the model in a dev/modeling environment; developing improved models) with the enterprise IT environment (deploying the model in operational systems with scoring engines; monitoring performance via log files and a champion-challenger methodology; retiring the model and deploying an improved model).]

Model producers export models in a language for analytic models (sometimes called a model interchange format), while model consumers import analytic models. With this approach, multiple systems, applications, and tools can be used by data scientists for exporting models, and a single scoring engine can consume all of them in an operational environment.

Over time, the term scoring engine began to be used as a synonym for model consumer. As shown in Figure 3, the scoring engine is usually embedded in a deployment environment that includes the inputs and outputs of the scoring engine. For example, scoring engines may be integrated with web applications and services, with databases, with enterprise service buses, with distributed data processing systems, with real-time distributed messaging systems, etc.

It is important to note that with this architecture, the analytic models described in the model interchange format are designed to be written and read by programs, not by humans. In other words, models in the model interchange format are designed to be machine readable, not human readable.

3. LANGUAGES FOR DEPLOYING ANALYTICS

The model interchange format that has gained the most adoption is the Predictive Model Markup Language (PMML) [3], an XML-based markup language for describing statistical and data mining models. The PMML standard [1, 4] supports the models most commonly used by data scientists, including regression, clustering, support vector machines, and neural networks.

PMML represented a significant step forward in providing a mechanism for different systems, applications, and tools to exchange analytic models. On the other hand, over time limitations of PMML became apparent. Perhaps the most common was that PMML was not extensible and did not support all the code required to deploy analytic models: it has relatively limited functionality for the pre- and post-processing required by most deployed applications, and many deployed applications did not use a single PMML model, but rather workflows built from multiple PMML models.

PMML is a specification for a model interchange format. Once the PMML Working Group approves a specification for a new PMML model, PMML-compliant scoring engines are updated by their vendors or developers to add support for that model. Although the PMML Working Group works hard on the standard, over time the specification grew more and more complicated, raising the work required to develop a PMML-compliant scoring engine.

The goals for PFA were to define a language for analytic models with the following properties:

1. The language should be extensible so that users and system developers can define their own models and pre- and post-processing code.

2. The language should allow models to be composed, so that chains of models and hierarchical models can be supported and multiple models can be combined into workflows.

3. The language should be easy to integrate into today's distributed and event-based data processing platforms, such as Hadoop [5], Spark and Storm [6].

4. The language should be "safe" to deploy in IT operational environments.

As mentioned, PMML struggled with the first three requirements. The approach taken with PFA was to define a language in which JSON-based functions can be safely composed, and in which the functions are rich enough to define the common analytic models as well as the common pre- and post-processing used by data scientists. The language and standard support user-defined functions, which supports Requirements 1 and 2. It is also straightforward to add PFA output to existing data analysis software so that it can generate PFA-based scoring engines.

The smallest possible PFA document is:

{"input": "null", "output": "null", "action": null}
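Because PFA documents are plain JSON, they can be built and round-tripped with standard tools. The sketch below shows the minimal document quoted above alongside a slightly larger, PFA-style document for an engine that adds 10 to its input; the latter follows the common PFA tutorial pattern (Avro type names for "input" and "output", an expression tree for "action"), but the exact call forms should be checked against the PFA specification rather than taken as authoritative.

```python
import json

# The smallest possible PFA document, as given in the text.
smallest = {"input": "null", "output": "null", "action": None}

# A slightly larger PFA-style document: an engine that adds 10 to each
# input number.  "double" is an Avro type name, and the "action" field
# holds an expression tree encoded in JSON.  (Illustrative only; consult
# the PFA specification for the authoritative forms.)
add_ten = {
    "input": "double",
    "output": "double",
    "action": [{"+": ["input", 10]}],
}

# PFA documents survive a serialization round trip unchanged, which is
# the property that makes them machine readable and tool independent.
for doc in (smallest, add_ten):
    assert json.loads(json.dumps(doc)) == doc

print(json.dumps(add_ten))
```

A PFA-compliant scoring engine such as Hadrian (Scala) or its Python counterpart would consume a document like `add_ten` and turn it into an executable scoring function.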