The KDD Process for Extracting Useful Knowledge from Volumes of Data
Total Page:16
File Type:pdf, Size:1020Kb
Knowledge Discovery in Databases creates the context for developing the tools needed to control the flood of data facing organizations that depend on ever-growing databases of business, manufacturing, scientific, and personal information. The KDD Process for Extracting Useful Knowledge from Volumes of Data AS WE MARCH INTO THE AGE of digital information, the problem of data overload Usama Fayyad, looms ominously ahead. Our ability to analyze and Gregory Piatetsky-Shapiro, understand massive datasets lags far behind and Padhraic Smyth our ability to gather and store the data. A new gen- eration of computational techniques and many more applications generate and tools is required to support the streams of digital records archived in extraction of useful knowledge from huge databases, sometimes in so-called the rapidly growing volumes of data. data warehouses. These techniques and tools are the Current hardware and database tech- subject of the emerging field of knowl- nology allow efficient and inexpensive edge discovery in databases (KDD) and reliable data storage and access. Howev- data mining. er, whether the context is business, Large databases of digital informa- medicine, science, or government, the tion are ubiquitous. Data from the datasets themselves (in raw form) are of neighborhood store’s checkout regis- little direct value. What is of value is the ter, your bank’s credit card authoriza- knowledge that can be inferred from tion device, records in your doctor’s the data and put to use. For example, office, patterns in your telephone calls, the marketing database of a consumer TERRY WIDENER COMMUNICATIONS OF THE ACM November 1996/Vol. 39, No. 11 27 goods company may yield knowledge Yet the true value of such data lies of correlations between sales of certain in the users' ability to extract use- items and certain demographic group- ful reports, spot interesting events ings. This knowledge can be used to The value of and trends, support decisions and introduce new targeted marketing policy based on statistical analysis campaigns with predictable financial storing volumes of and inference, and exploit the return relative to unfocused cam- data to achieve business, opera- paigns. Databases are often a dormant data depends on tional, or scientific goals. potential resource that, tapped, can When the scale of data manip- yield substantial benefits. our ability to ulation, exploration, and infer- This article gives an overview of the ence grows beyond human emerging field of KDD and data min- extract useful capacities, people look to comput- ing, including links with related fields, er technology to automate the a definition of the knowledge discov- reports, spot bookkeeping. The problem of ery process, dissection of basic data knowledge extraction from large mining algorithms, and an analysis of interesting events databases involves many steps, the challenges facing practitioners. ranging from data manipulation and trends, and retrieval to fundamental Impractical Manual Data Analysis mathematical and statistical infer- The traditional method of turning support decisions ence, search, and reasoning. data into knowledge relies on manual Researchers and practitioners analysis and interpretation. For exam- and policy based interested in these problems have ple, in the health-care industry, it is been meeting since the first KDD common for specialists to analyze cur- on statistical Workshop in 1989. Although the rent trends and changes in health-care problem of extracting knowledge data on a quarterly basis. The special- analysis and from data (or observations) is not ists then provide a report detailing the new, automation in the context of analysis to the sponsoring health-care inference, and large databases opens up many organization; the report is then used new unsolved problems. as the basis for future decision making exploit the data to and planning for health-care manage- Definitions ment. In a totally different type of achieve business, Finding useful patterns in data is application, planetary geologists sift known by different names (includ- through remotely sensed images of operational, or ing data mining) in different com- planets and asteroids, carefully locat- munities (e.g., knowledge ing and cataloging geologic objects of scientific goals. extraction, information discovery, interest, such as impact craters. information harvesting, data For these (and many other) appli- archeology, and data pattern pro- cations, such manual probing of a cessing). The term “data mining” dataset is slow, expensive, and highly is used most by statisticians, data- subjective. In fact, such manual data analysis is becom- base researchers, and more recently by the MIS and ing impractical in many domains as data volumes grow business communities. Here we use the term “KDD” to exponentially. Databases are increasing in size in two refer to the overall process of discovering useful knowl- ways: the number N of records, or objects, in the data- edge from data. Data mining is a particular step in this base, and the number d of fields, or attributes, per process—application of specific algorithms for extract- object. Databases containing on the order of N = 109 ing patterns (models) from data. The additional steps in objects are increasingly common in, for example, the the KDD process, such as data preparation, data selec- astronomical sciences. The number d of fields can easi- tion, data cleaning, incorporation of appropriate prior ly be on the order of 102 or even 103 in medical diag- knowledge, and proper interpretation of the results of nostic applications. Who could be expected to digest mining ensure that useful knowledge is derived from billions of records, each with tens or hundreds of fields? the data. Blind application of data mining methods 28 November 1996/Vol. 39, No. 11 COMMUNICATIONS OF THE ACM (rightly criticized as data dredging in the statistical liter- ported. KDD places a special empha- ature) can be a dangerous activity leading to discovery sis on finding understandable pat- of meaningless patterns. terns that can be interpreted as KDD has evolved, and continues to evolve, from the useful or interesting knowledge. intersection of research in such fields as databases, Scaling and robustness properties of machine learning, pattern recognition, statistics, artifi- modeling algorithms for large noisy cial intelligence and reasoning with uncertainty, knowl- datasets are also of fundamental interest. edge acquisition for expert systems, data visualization, Statistics has much in common with KDD. machine discovery [7], scientific discovery, information Inference of knowledge from data has a fundamental retrieval, and high-performance computing. KDD soft- statistical component (see [2] and the article by Gly- ware systems incorporate theories, algorithms, and mour on statistical inference in this special section for methods from all of these fields. more detailed discussions of the relationship between Database theories and tools provide the necessary KDD and statistics). Statistics provides a language and infrastructure to store, access, and manipulate data. framework for quantifying the uncertainty resulting Data warehousing, a recently popularized term, refers when one tries to infer general patterns from a partic- to the current business trend of collecting and cleaning ular sample of an overall population. As mentioned transactional data to make them available for online earlier, the term data mining has had negative conno- analysis and decision support. A popular approach for tations in statistics since the 1960s, when computer- analysis of data warehouses is called online analytical based data analysis techniques were first introduced. processing (OLAP).1 OLAP tools focus on providing The concern arose over the fact that if one searches Figure 1. Overview of the steps constituting the KDD process Pre- Trans- Data Interpretation/ Selection processing formation Mining Evaluation Target Preprocessed Transformed Patterns Knowledge Data Data Data Data multidimensional data analysis, which is superior to long enough in any dataset (even randomly generated SQL (a standard data manipulation language) in com- data), one can find patterns that appear to be statisti- puting summaries and breakdowns along many dimen- cally significant but in fact are not. This issue is of fun- sions. While current OLAP tools target interactive data damental importance to KDD. There has been analysis, we expect they will also include more auto- substantial progress in understanding such issues in sta- mated discovery components in the near future. tistics in recent years, much directly relevant to KDD. Fields concerned with inferring models from data— Thus, data mining is a legitimate activity as long as one including statistical pattern recognition, applied statis- understands how to do it correctly. KDD can also be tics, machine learning, and neural networks—were the viewed as encompassing a broader view of modeling impetus for much early KDD work. KDD largely relies than statistics, aiming to provide tools to automate (to on methods from these fields to find patterns from data the degree possible) the entire process of data analysis, in the data mining step of the KDD process. A natural including the statistician’s art of hypothesis selection. question is: How is KDD different from these other fields? KDD focuses on the overall process of knowl- The KDD Process edge discovery from data, including how the data is Here we present our (necessarily