The KDD Process for Extracting Useful Knowledge from Volumes of Data

Knowledge Discovery in Databases creates the context for developing the tools needed to control the flood of data facing organizations that depend on ever-growing databases of business, manufacturing, scientific, and personal information.

Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth

AS WE MARCH INTO THE AGE of digital information, the problem of data overload looms ominously ahead. Our ability to analyze and understand massive datasets lags far behind our ability to gather and store the data. A new generation of computational techniques and tools is required to support the extraction of useful knowledge from the rapidly growing volumes of data. These techniques and tools are the subject of the emerging field of knowledge discovery in databases (KDD) and data mining.

Large databases of digital information are ubiquitous. Data from the neighborhood store's checkout register, your bank's credit card authorization device, records in your doctor's office, patterns in your telephone calls, and many more applications generate streams of digital records archived in huge databases, sometimes in so-called data warehouses.

Current hardware and database technology allow efficient and inexpensive reliable data storage and access. However, whether the context is business, medicine, science, or government, the datasets themselves (in raw form) are of little direct value. What is of value is the knowledge that can be inferred from the data and put to use. For example, the marketing database of a consumer goods company may yield knowledge of correlations between sales of certain items and certain demographic groupings.

This knowledge can be used to introduce new targeted marketing campaigns with predictable financial return relative to unfocused campaigns. Databases are often a dormant potential resource that, tapped, can yield substantial benefits.

This article gives an overview of the emerging field of KDD and data mining, including links with related fields, a definition of the knowledge discovery process, dissection of basic data mining algorithms, and an analysis of the challenges facing practitioners.

Impractical Manual Data Analysis

The traditional method of turning data into knowledge relies on manual analysis and interpretation. For example, in the health-care industry, it is common for specialists to analyze current trends and changes in health-care data on a quarterly basis. The specialists then provide a report detailing the analysis to the sponsoring health-care organization; the report is then used as the basis for future decision making and planning for health-care management. In a totally different type of application, planetary geologists sift through remotely sensed images of planets and asteroids, carefully locating and cataloging geologic objects of interest, such as impact craters.

For these (and many other) applications, such manual probing of a dataset is slow, expensive, and highly subjective. In fact, such manual data analysis is becoming impractical in many domains as data volumes grow exponentially. Databases are increasing in size in two ways: the number N of records, or objects, in the database, and the number d of fields, or attributes, per object. Databases containing on the order of N = 10^9 objects are increasingly common in, for example, the astronomical sciences. The number d of fields can easily be on the order of 10^2 or even 10^3 in medical diagnostic applications. Who could be expected to digest billions of records, each with tens or hundreds of fields?

Yet the true value of such data lies in the users' ability to extract useful reports, spot interesting events and trends, support decisions and policy based on statistical analysis and inference, and exploit the data to achieve business, operational, or scientific goals. When the scale of data manipulation, exploration, and inference grows beyond human capacities, people look to computer technology to automate the bookkeeping. The problem of knowledge extraction from large databases involves many steps, ranging from data manipulation and retrieval to fundamental mathematical and statistical inference, search, and reasoning. Researchers and practitioners interested in these problems have been meeting since the first KDD Workshop in 1989. Although the problem of extracting knowledge from data (or observations) is not new, automation in the context of large databases opens up many new unsolved problems.

Definitions

Finding useful patterns in data is known by different names (including data mining) in different communities (e.g., knowledge extraction, information discovery, information harvesting, data archeology, and data pattern processing). The term "data mining" is used most by statisticians, database researchers, and more recently by the MIS and business communities. Here we use the term "KDD" to refer to the overall process of discovering useful knowledge from data. Data mining is a particular step in this process—application of specific algorithms for extracting patterns (models) from data. The additional steps in the KDD process, such as data preparation, data selection, data cleaning, incorporation of appropriate prior knowledge, and proper interpretation of the results of mining, ensure that useful knowledge is derived from the data. Blind application of data mining methods (rightly criticized as data dredging in the statistical literature) can be a dangerous activity leading to discovery of meaningless patterns.

KDD has evolved, and continues to evolve, from the intersection of research in such fields as databases, machine learning, pattern recognition, statistics, artificial intelligence and reasoning with uncertainty, knowledge acquisition for expert systems, data visualization, machine discovery [7], scientific discovery, information retrieval, and high-performance computing. KDD software systems incorporate theories, algorithms, and methods from all of these fields.

Database theories and tools provide the necessary infrastructure to store, access, and manipulate data. Data warehousing, a recently popularized term, refers to the current business trend of collecting and cleaning transactional data to make them available for online analysis and decision support. A popular approach for analysis of data warehouses is called online analytical processing (OLAP).(1) OLAP tools focus on providing multidimensional data analysis, which is superior to SQL (a standard data manipulation language) in computing summaries and breakdowns along many dimensions. While current OLAP tools target interactive data analysis, we expect they will also include more automated discovery components in the near future.

(1) See Providing OLAP to User Analysts: An IT Mandate by E.F. Codd and Associates (1993).

Fields concerned with inferring models from data—including statistical pattern recognition, applied statistics, machine learning, and neural networks—were the impetus for much early KDD work. KDD largely relies on methods from these fields to find patterns from data in the data mining step of the KDD process. A natural question is: How is KDD different from these other fields? KDD focuses on the overall process of knowledge discovery from data, including how the data is stored and accessed, how algorithms can be scaled to massive datasets and still run efficiently, how results can be interpreted and visualized, and how the overall human-machine interaction can be modeled and supported. KDD places a special emphasis on finding understandable patterns that can be interpreted as useful or interesting knowledge. Scaling and robustness properties of modeling algorithms for large noisy datasets are also of fundamental interest.

Statistics has much in common with KDD. Inference of knowledge from data has a fundamental statistical component (see [2] and the article by Glymour on statistical inference in this special section for more detailed discussions of the relationship between KDD and statistics). Statistics provides a language and framework for quantifying the uncertainty resulting when one tries to infer general patterns from a particular sample of an overall population. As mentioned earlier, the term data mining has had negative connotations in statistics since the 1960s, when computer-based data analysis techniques were first introduced. The concern arose over the fact that if one searches long enough in any dataset (even randomly generated data), one can find patterns that appear to be statistically significant but in fact are not. This issue is of fundamental importance to KDD. There has been substantial progress in understanding such issues in statistics in recent years, much directly relevant to KDD. Thus, data mining is a legitimate activity as long as one understands how to do it correctly. KDD can also be viewed as encompassing a broader view of modeling than statistics, aiming to provide tools to automate (to the degree possible) the entire process of data analysis, including the statistician's art of hypothesis selection.
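The data-dredging concern is easy to reproduce. The short sketch that follows is our illustration, not an example from the article; the record and field counts are arbitrary assumptions. It builds a completely random binary outcome and hundreds of completely random fields, then reports the single strongest correlation found. With enough fields, that best correlation looks noteworthy even though every relationship in the data is spurious, which is exactly why corrections for searching many candidate patterns matter.

    import random

    random.seed(1)
    n_records, n_fields = 200, 500          # many candidate attributes, as in wide databases
    outcome = [random.randint(0, 1) for _ in range(n_records)]   # purely random "class label"

    def correlation(xs, ys):
        # Plain Pearson correlation, standard library only.
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs) ** 0.5
        vy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (vx * vy) if vx and vy else 0.0

    # Every field is pure noise, yet the best of 500 fields looks "related" to the outcome.
    best = max(
        (abs(correlation([random.random() for _ in range(n_records)], outcome)), f)
        for f in range(n_fields)
    )
    print("strongest spurious correlation: %.2f (field %d of %d)" % (best[0], best[1], n_fields))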

Figure 1. Overview of the steps constituting the KDD process

[Figure 1 flow: Data -> (Selection) -> Target Data -> (Preprocessing) -> Preprocessed Data -> (Transformation) -> Transformed Data -> (Data Mining) -> Patterns -> (Interpretation/Evaluation) -> Knowledge]

The KDD Process

Here we present our (necessarily subjective) perspective of a unifying process-centric framework for KDD. The goal is to provide an overview of the variety of activities in this multidisciplinary field and how they fit together. We define the KDD process [4] as:

The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

Throughout this article, the term pattern goes beyond its traditional sense to include models or structure in data. In this definition, data comprises a set of facts (e.g., cases in a database), and pattern is an expression in some language describing a subset of the data (or a model applicable to that subset). The term process implies there are many steps involving data preparation, search for patterns, knowledge evaluation, and refinement—all repeated in multiple iterations. The process is assumed to be nontrivial in that it goes beyond computing closed-form quantities; that is, it must involve search for structure, models, patterns, or parameters. The discovered patterns should be valid for new data with some degree of certainty. We also want patterns to be novel (at least to the system, and preferably to the user) and potentially useful for the user or task. Finally, the patterns should be understandable—if not immediately, then after some postprocessing.

This definition implies we can define quantitative measures for evaluating extracted patterns. In many cases, it is possible to define measures of certainty (e.g., estimated classification accuracy) or utility (e.g., gain, perhaps in dollars saved due to better predictions or speed-up in a system's response time). Such notions as novelty and understandability are much more subjective. In certain contexts, understandability can be estimated through simplicity (e.g., number of bits needed to describe a pattern). An important notion, called interestingness, is usually taken as an overall measure of pattern value, combining validity, novelty, usefulness, and simplicity. Interestingness functions can be explicitly defined or can be manifested implicitly through an ordering placed by the KDD system on the discovered patterns or models.
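As a purely hypothetical illustration of an "explicitly defined" interestingness function, the sketch below combines the four ingredients just named into one score. The weights and the component values are invented for the example; the article does not prescribe any particular formula.

    def interestingness(validity, novelty, usefulness, simplicity,
                        weights=(0.4, 0.2, 0.3, 0.1)):
        # Toy overall measure of pattern value: a weighted combination of
        # component scores, each assumed to be normalized to [0, 1].
        scores = (validity, novelty, usefulness, simplicity)
        return sum(w * s for w, s in zip(weights, scores))

    # A valid, useful, but unsurprising pattern versus a novel, weakly supported one.
    print(interestingness(validity=0.95, novelty=0.10, usefulness=0.80, simplicity=0.90))
    print(interestingness(validity=0.60, novelty=0.90, usefulness=0.40, simplicity=0.70))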
Data mining is a step in the KDD process consisting of an enumeration of patterns (or models) over the data, subject to some acceptable computational-efficiency limitations. Since the patterns enumerable over any finite dataset are potentially infinite, and because the enumeration of patterns involves some form of search in a large space, computational constraints place severe limits on the subspace that can be explored by a data mining algorithm.

The KDD process is outlined in Figure 1. (We did not show all the possible arrows to indicate that loops can, and do, occur between any two steps in the process; also not shown is the system's performance element, which uses knowledge to make decisions or take actions.) The KDD process is interactive and iterative (with many decisions made by the user), involving numerous steps, summarized as:

1. Learning the application domain: includes relevant prior knowledge and the goals of the application
2. Creating a target dataset: includes selecting a dataset or focusing on a subset of variables or data samples on which discovery is to be performed
3. Data cleaning and preprocessing: includes basic operations, such as removing noise or outliers if appropriate, collecting the necessary information to model or account for noise, deciding on strategies for handling missing data fields, and accounting for time sequence information and known changes, as well as deciding DBMS issues, such as data types, schema, and mapping of missing and unknown values
4. Data reduction and projection: includes finding useful features to represent the data, depending on the goal of the task, and using dimensionality reduction or transformation methods to reduce the effective number of variables under consideration or to find invariant representations for the data
5. Choosing the function of data mining: includes deciding the purpose of the model derived by the data mining algorithm (e.g., summarization, classification, regression, and clustering)
6. Choosing the data mining algorithm(s): includes selecting method(s) to be used for searching for patterns in the data, such as deciding which models and parameters may be appropriate (e.g., models for categorical data are different from models on vectors over reals) and matching a particular data mining method with the overall criteria of the KDD process (e.g., the user may be more interested in understanding the model than in its predictive capabilities)
7. Data mining: includes searching for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, clustering, sequence modeling, dependency, and link analysis
8. Interpretation: includes interpreting the discovered patterns and possibly returning to any of the previous steps, as well as possible visualization of the extracted patterns, removing redundant or irrelevant patterns, and translating the useful ones into terms understandable by users
9. Using discovered knowledge: includes incorporating this knowledge into the performance system, taking actions based on the knowledge, or simply documenting it and reporting it to interested parties, as well as checking for and resolving potential conflicts with previously believed (or extracted) knowledge.
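Read as a pipeline, the nine steps compose in the order Figure 1 shows. The sketch below is a minimal, invented illustration of that composition; the function names, the dictionary-based records, and the trivial "pattern" are our assumptions, not an interface from the article.

    # Each stage consumes the previous stage's output, mirroring Figure 1:
    # data -> target data -> preprocessed data -> transformed data -> patterns -> knowledge.

    def select(data, fields):                 # step 2: create a target dataset
        return [{f: rec[f] for f in fields} for rec in data]

    def preprocess(target):                   # step 3: cleaning, e.g., drop records with missing values
        return [rec for rec in target if all(v is not None for v in rec.values())]

    def transform(clean, key):                # step 4: reduction/projection, here a single derived field
        return [rec[key] for rec in clean]

    def mine(values, threshold):              # step 7: a (trivial) pattern search
        mean = sum(values) / len(values)
        return {"mean": mean, "high": [v for v in values if v > threshold]}

    def interpret(patterns):                  # step 8: keep only patterns judged useful
        return {k: v for k, v in patterns.items() if v}

    data = [{"age": 34, "spend": 120.0}, {"age": None, "spend": 80.0}, {"age": 51, "spend": 300.0}]
    knowledge = interpret(mine(transform(preprocess(select(data, ["age", "spend"])), "spend"), 100.0))
    print(knowledge)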

Most previous work on KDD focused primarily on the data mining step. However, the other steps are equally if not more important for the successful application of KDD in practice. We now focus on the data mining component, which has received by far the most attention in the literature.

Data Mining

Data mining involves fitting models to or determining patterns from observed data. The fitted models play the role of inferred knowledge. Deciding whether or not the models reflect useful knowledge is a part of the overall interactive KDD process for which subjective human judgment is usually required. A wide variety and number of data mining algorithms are described in the literature—from the fields of statistics, pattern recognition, machine learning, and databases. Thus, an overview discussion can often consist of long lists of seemingly unrelated, and highly specific algorithms. Here we take a somewhat reductionist viewpoint. Most data mining algorithms can be viewed as compositions of a few basic techniques and principles. In particular, data mining algorithms consist largely of some specific mix of three components:

• The model. There are two relevant factors: the function of the model (e.g., classification and clustering) and the representational form of the model (e.g., a linear function of multiple variables and a Gaussian probability density function). A model contains parameters that are to be determined from the data.
• The preference criterion. A basis for preference of one model or set of parameters over another, depending on the given data. The criterion is usually some form of goodness-of-fit function of the model to the data, perhaps tempered by a smoothing term to avoid overfitting, or generating a model with too many degrees of freedom to be constrained by the given data.
• The search algorithm. The specification of an algorithm for finding particular models and parameters, given data, a model (or family of models), and a preference criterion.

A particular data mining algorithm is usually an instantiation of the model/preference/search components (e.g., a classification model based on a decision-tree representation, model preference based on data likelihood, determined by greedy search using a particular heuristic). Algorithms often differ largely in terms of the model representation (e.g., linear and hierarchical), and model preference or search methods are often similar across different algorithms. The literature on learning algorithms frequently does not state clearly the model representation, preference criterion, or search method used; these are often mixed up in a description of a particular algorithm. The reductionist view clarifies the independent contributions of each component.
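To make the decomposition concrete, here is one deliberately tiny instantiation of the three components: a one-dimensional threshold classifier as the model, training-set error count as the preference criterion, and exhaustive search over candidate thresholds as the search algorithm. Everything in it (names, data, the choice of components) is an invented illustration, not an algorithm from the article.

    # Model: predict class 1 when x > theta; theta is the parameter fit from data.
    def model(x, theta):
        return 1 if x > theta else 0

    # Preference criterion: goodness of fit, here simply the number of training errors.
    def errors(theta, data):
        return sum(1 for x, y in data if model(x, theta) != y)

    # Search algorithm: exhaustive search over candidate thresholds (midpoints between points).
    def fit(data):
        xs = sorted(x for x, _ in data)
        candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
        return min(candidates, key=lambda t: errors(t, data))

    training = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1)]
    theta = fit(training)
    print(theta, errors(theta, training))   # threshold 2.5 with zero training errors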

Model Functions

The more common model functions in current data mining practice include:

• Classification: maps (or classifies) a data item into one of several predefined categorical classes.
• Regression: maps a data item to a real-value prediction variable.
• Clustering: maps a data item into one of several categorical classes (or clusters) in which the classes must be determined from the data—unlike classification, in which the classes are predefined. Clusters are defined by finding natural groupings of data items based on similarity metrics or probability density models.
• Summarization: provides a compact description for a subset of data. A simple example would be the mean and standard deviations for all fields. More sophisticated functions involve summary rules, multivariate visualization techniques, and functional relationships between variables. Summarization functions are often used in interactive exploratory data analysis and automated report generation.
• Dependency modeling: describes significant dependencies among variables. Dependency models exist at two levels: structured and quantitative. The structural level of the model specifies (often in graphical form) which variables are locally dependent; the quantitative level specifies the strengths of the dependencies using some numerical scale.
• Link analysis: determines relations between fields in the database (e.g., association rules [1] to describe which items are commonly purchased with other items in grocery stores). The focus is on deriving multifield correlations satisfying support and confidence thresholds (see the example following this list).
• Sequence analysis: models sequential patterns (e.g., in data with time dependence, such as time-series analysis). The goal is to model the states of the process generating the sequence or to extract and report deviation and trends over time.
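The support and confidence thresholds mentioned under link analysis are simple frequency ratios, and computing them for a candidate rule takes only a few lines. The transactions and the rule below are invented for illustration:

    # Support({A} -> {B}) = fraction of transactions containing both A and B;
    # confidence = that count divided by the number of transactions containing A.
    transactions = [
        {"bread", "milk"},
        {"bread", "diapers", "beer"},
        {"milk", "diapers", "beer"},
        {"bread", "milk", "diapers", "beer"},
        {"bread", "milk", "diapers"},
    ]

    def support_confidence(antecedent, consequent, baskets):
        both = sum(1 for b in baskets if antecedent <= b and consequent <= b)
        ante = sum(1 for b in baskets if antecedent <= b)
        return both / len(baskets), (both / ante if ante else 0.0)

    s, c = support_confidence({"diapers"}, {"beer"}, transactions)
    print("support %.2f, confidence %.2f" % (s, c))   # 3/5 baskets; 3 of the 4 diaper baskets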
Model Representation

Popular model representations include decision trees and rules, linear models, nonlinear models (e.g., neural networks), example-based methods (e.g., nearest-neighbor and case-based reasoning methods), probabilistic graphical dependency models (e.g., Bayesian networks [6]), and relational attribute models. Model representation determines both the flexibility of the model in representing the data and the interpretability of the model in human terms. Typically, the more complex models may fit the data better but may also be more difficult to understand and to fit reliably. While researchers tend to advocate complex models, practitioners involved in successful applications often use simpler models due to their robustness and interpretability [3, 5].

Model preference criteria determine how well a particular model and its parameters meet the criteria of the KDD process. Typically, there is an explicit quantitative criterion embedded in the search algorithm (e.g., the maximum likelihood criterion of finding the parameters that maximize the probability of the observed data). Also, an implicit criterion (reflecting the subjective bias of the analyst in terms of which models are initially chosen for consideration) is often used in the outer loops of the KDD process.

Search algorithms are of two types: parameter search, given a model, and model search over model space. Finding the best parameters is often reduced to an optimization problem (e.g., finding the global maximum of a nonlinear function in parameter space). Data mining algorithms tend to rely on relatively simple optimization techniques (e.g., gradient descent), although in principle more sophisticated optimization techniques are also used. Problems with local minima are common and dealt with in the usual manner (e.g., multiple random restarts and searching for multiple models). Search over model space is usually carried out in a greedy fashion.

A brief review of specific popular data mining algorithms can be found in [4, 5]. An important point is that each technique typically suits some problems better than others. For example, decision-tree classifiers can be very useful for finding structure in high-dimensional spaces and are also useful in problems with mixed continuous and categorical data (since tree methods do not require distance metrics). However, classification trees with univariate threshold decision boundaries may not be suitable for problems where the true decision boundaries are nonlinear multivariate functions. Thus, there is no universally best data mining method; choosing a particular algorithm for a particular application is something of an art. In practice, a large portion of the applications effort can go into properly formulating the problem (asking the right question) rather than into optimizing the algorithmic details of a particular data mining method.
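The parameter-search strategy described above (plain gradient descent, repeated from multiple random starting points to sidestep local minima) can be sketched directly. The objective function, step size, and number of restarts below are arbitrary illustrative choices:

    import random

    def objective(x):
        # A one-dimensional function with more than one local minimum.
        return (x ** 2 - 4) ** 2 + x

    def gradient(x, h=1e-5):
        # Numerical derivative; a real algorithm might use an analytic gradient.
        return (objective(x + h) - objective(x - h)) / (2 * h)

    def descend(x, rate=0.01, steps=2000):
        # Plain gradient descent from one starting point.
        for _ in range(steps):
            x -= rate * gradient(x)
        return x

    random.seed(0)
    # Multiple random restarts: run local search from several starting points, keep the best.
    starts = [random.uniform(-4.0, 4.0) for _ in range(10)]
    best = min((descend(x0) for x0 in starts), key=objective)
    print("best parameter %.3f, objective %.3f" % (best, objective(best)))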

The high-level goals of data mining tend to be predictive, descriptive, or a combination of predictive and descriptive. A purely predictive goal focuses on accuracy in predictive ability. A purely descriptive goal focuses on understanding the underlying data-generating process—a subtle but important distinction. In prediction, a user may not care whether the model reflects reality as long as it has predictive power (e.g., a model combining current financial indicators in some nonlinear manner to predict future dollar-to-deutsche-mark exchange rates). A descriptive model, on the other hand, is interpreted as a reflection of reality (e.g., a model relating economic and demographic variables to educational achievements used as the basis for social policy recommendations to cause change). In practice, most KDD applications demand some degree of both predictive and descriptive modeling.

Research Issues and Challenges

Current primary research and application challenges for KDD [4, 5] include:

• Massive datasets and high dimensionality. Multigigabyte databases with millions of records and large numbers of fields (attributes and variables) are commonplace. These datasets create combinatorially explosive search spaces for model induction and increase the chances that a data mining algorithm will find spurious patterns that are not generally valid. Possible solutions include very efficient algorithms, sampling, approximation methods, massively parallel processing, dimensionality reduction techniques, and incorporation of prior knowledge.
• User interaction and prior knowledge. An analyst is usually not a KDD expert but a person responsible for making sense of the data using available KDD techniques. Since the KDD process is by definition interactive and iterative, it is a challenge to provide a high-performance, rapid-response environment that also assists users in the proper selection and matching of appropriate tools and techniques to achieve their goals. There needs to be more emphasis on human-computer interaction and less emphasis on total automation—with the aim of supporting both expert and novice users. Many current KDD methods and tools are not truly interactive and do not easily incorporate prior knowledge about a problem except in simple ways. Use of domain knowledge is important in all steps of the KDD process. For example, Bayesian approaches use prior probabilities over data and distributions as one way of encoding prior knowledge (see [6] and Glymour's article on statistical inference in this special section). Others employ deductive database capabilities to discover knowledge that is then used to guide the data mining search.
• Overfitting and assessing statistical significance. When an algorithm searches for the best parameters for one particular model using a limited set of data, it may overfit the data, resulting in poor performance of the model on test data. Possible solutions include cross-validation, regularization, and other sophisticated statistical strategies (a minimal cross-validation sketch follows this list). Proper assessment of statistical significance is often missed when the system searches many possible models. Simple methods to handle this problem include adjusting the test statistic as a function of the search (e.g., Bonferroni adjustments for independent tests) and randomization testing, although this area is largely unexplored.
• Missing data. This problem is especially acute in business databases. Important attributes may be missing if the database was not designed with discovery in mind. Missing data can result from operator error, actual system and measurement failures, or from a revision of the data collection process over time (e.g., new variables are measured, but they were considered unimportant a few months before). Possible solutions include more sophisticated statistical strategies to identify hidden variables and dependencies.
• Understandability of patterns. In many applications, it is important to make the discoveries more understandable by humans. Possible solutions include graphical representations, rule structuring, natural language generation, and techniques for visualization of data and knowledge. Rule refinement strategies can also help address a related problem: discovered knowledge may be implicitly or explicitly redundant.
• Managing changing data and knowledge. Rapidly changing (nonstationary) data may make previously discovered patterns invalid. In addition, the variables measured in a given application database may be modified, deleted, or augmented with new measurements over time. Possible solutions include incremental methods for updating the patterns and treating change as an opportunity for discovery by using it to cue the search for patterns of change.
• Integration. A standalone discovery system may not be very useful. Typical integration issues include integration with a DBMS (e.g., via a query interface), integration with spreadsheets and visualization tools, and accommodation of real-time sensor readings. Highly interactive human-computer environments as outlined by the KDD process permit both human-assisted computer discovery and computer-assisted human discovery. Development of tools for visualization, interpretation, and analysis of discovered patterns is of paramount importance. Such interactive environments can enable practical solutions to many real-world problems far more rapidly than humans or computers operating independently. There is both an opportunity and a challenge in developing techniques to integrate the OLAP tools of the database community and the data mining tools of the machine learning and statistical communities.
• Nonstandard, multimedia, and object-oriented data. A significant trend is that databases contain not just numeric data but large quantities of nonstandard and multimedia data. Nonstandard data types include nonnumeric, nontextual, geometric, and graphical data, as well as nonstationary, temporal, spatial, and relational data, and a mixture of categorical and numeric fields in the data. Multimedia data include free-form multilingual text as well as digitized images, video, and speech and audio data. These data types are largely beyond the scope of current KDD technology.
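As promised in the overfitting item, here is a minimal cross-validation sketch. The synthetic data, the 1-nearest-neighbor model, and the five-fold split are all invented for the example; the point is only that a model that memorizes its training data reports zero training error, while held-out folds reveal the true error rate.

    import random

    random.seed(2)
    # Synthetic labeled data: x near 0 is class 0, x near 1 is class 1, with heavy overlap.
    data = [(random.gauss(cls, 0.6), cls) for cls in (0, 1) for _ in range(50)]
    random.shuffle(data)

    def predict_1nn(train, x):
        # 1-nearest-neighbor: copy the label of the closest training point.
        return min(train, key=lambda p: abs(p[0] - x))[1]

    def cv_error(points, k=5):
        # k-fold cross-validation: each fold is held out once and used only for testing.
        folds = [points[i::k] for i in range(k)]
        errs = 0
        for i, test in enumerate(folds):
            train = [p for j, fold in enumerate(folds) if j != i for p in fold]
            errs += sum(1 for x, y in test if predict_1nn(train, x) != y)
        return errs / len(points)

    # Training error of 1-NN on its own training set is zero by construction (it memorizes),
    # while cross-validation exposes the noise-limited error rate.
    print("training error: %.2f" % (sum(1 for x, y in data if predict_1nn(data, x) != y) / len(data)))
    print("cross-validated error: %.2f" % cv_error(data))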

Conclusions

Despite its rapid growth, the KDD field is still in its infancy. There are many challenges to overcome, but some successes have been achieved (see the articles by Brachman on business applications and by Fayyad on science applications in this special section). Because the potential payoffs of KDD applications are high, there has been a rush to offer products and services in the market. A great challenge facing the field is how to avoid the kind of false expectations plaguing other nascent (and related) technologies (e.g., artificial intelligence and neural networks). It is the responsibility of researchers and practitioners in this field to ensure that the potential contributions of KDD are not overstated and that users understand the true nature of the contributions along with their limitations.

Fundamental problems at the heart of the field remain unsolved. For example, the basic problems of statistical inference and discovery remain as difficult and challenging as they always have been. Capturing the art of analysis and the ability of the human brain to synthesize new knowledge from data is still unsurpassed by any machine. However, the volumes of data to be analyzed make machines a necessity. This niche for using machines as an aid to analysis and the hope that the massive datasets contain nuggets of valuable knowledge drive interest and research in the field. Bringing together a set of varied fields, KDD creates fertile ground for the growth of new tools for managing, analyzing, and eventually gaining the upper hand over the flood of data facing modern society. The fact that the field is driven by strong social and economic needs is the impetus to its continued growth. The reality check of real applications will act as a filter to sift the good theories and techniques from those less useful.
References
1. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., and Verkamo, I. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. AAAI/MIT Press, Cambridge, Mass., 1996.
2. Elder, J., and Pregibon, D. A statistical perspective on KDD. In Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. AAAI/MIT Press, Cambridge, Mass., 1996.
3. Fayyad, U., and Uthurusamy, R., Eds. Proceedings of KDD-95: The First International Conference on Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, Calif., 1995.
4. Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. From data mining to knowledge discovery: An overview. In Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. AAAI/MIT Press, Cambridge, Mass., 1996.
5. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R., Eds. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, Cambridge, Mass., 1996.
6. Heckerman, D. Bayesian networks for knowledge discovery. In Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. AAAI/MIT Press, Cambridge, Mass., 1996.
7. Langley, P., and Simon, H.A. Applications of machine learning and rule induction. Commun. ACM 38, 11 (Nov. 1995), 55-64.

Additional references for this article can be found at http://www.research.microsoft.com/research/datamine/CACM-DM-refs/.

USAMA FAYYAD is a senior researcher at Microsoft Research and a distinguished visiting scientist at the Jet Propulsion Laboratory of the California Institute of Technology. He can be reached at [email protected].

GREGORY PIATETSKY-SHAPIRO is a principal member of the technical staff at GTE Laboratories. He can be reached at [email protected].

PADHRAIC SMYTH is an assistant professor of computer science at the University of California, Irvine, and technical group leader at the Jet Propulsion Laboratory of the California Institute of Technology. He can be reached at [email protected].

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.

© ACM 0002-0782/96/1100 $3.50
