From Data Mining to Knowledge Discovery in Databases


AI Magazine Volume 17 Number 3 (1996) (© AAAI)

Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth

Data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media attention of late. What is all the excitement about? This article provides an overview of this emerging field, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases. The article mentions particular real-world applications, specific data-mining techniques, challenges involved in real-world applications of knowledge discovery, and current and future research directions in the field.

Across a wide variety of fields, data are being collected and accumulated at a dramatic pace. There is an urgent need for a new generation of computational theories and tools to assist humans in extracting useful information (knowledge) from the rapidly growing volumes of digital data. These theories and tools are the subject of the emerging field of knowledge discovery in databases (KDD).

At an abstract level, the KDD field is concerned with the development of methods and techniques for making sense of data. The basic problem addressed by the KDD process is one of mapping low-level data (which are typically too voluminous to understand and digest easily) into other forms that might be more compact (for example, a short report), more abstract (for example, a descriptive approximation or model of the process that generated the data), or more useful (for example, a predictive model for estimating the value of future cases). At the core of the process is the application of specific data-mining methods for pattern discovery and extraction.¹

This article begins by discussing the historical context of KDD and data mining and their intersection with other related fields. A brief summary of recent real-world KDD applications is provided. Definitions of KDD and data mining are given, and the general multistep KDD process is outlined; the application of data-mining algorithms constitutes one particular step in that process. The data-mining step is then discussed in more detail in the context of specific data-mining algorithms and their application. Practical issues arising in real-world applications are also outlined. Finally, the article enumerates challenges for future research and development and, in particular, discusses potential opportunities for AI technology in KDD systems.
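The multistep process described here is easiest to see as a chain in which data mining is only one link. The following minimal Python sketch is illustrative only and is not from the article; every function is a hypothetical placeholder standing in for the selection, preprocessing, transformation, data-mining, and interpretation/evaluation steps the article names.

    from collections import Counter

    def select(raw_data, attributes):
        # Selection: create a target data set focused on the task.
        return [{k: rec[k] for k in attributes} for rec in raw_data]

    def preprocess(records):
        # Preprocessing: basic noise handling -- drop records with missing fields.
        return [r for r in records if all(v is not None for v in r.values())]

    def transform(records):
        # Transformation: re-represent data for the mining step
        # (identity here; real systems project features, discretize, etc.).
        return records

    def mine(records):
        # Data mining: one particular step in the chain -- here a
        # trivial pattern search that counts identical records.
        return Counter(tuple(sorted(r.items())) for r in records)

    def interpret(patterns, min_support=2):
        # Interpretation/evaluation: keep only patterns worth reporting.
        return {p: n for p, n in patterns.items() if n >= min_support}

    def kdd_pipeline(raw_data, attributes):
        return interpret(mine(transform(preprocess(select(raw_data, attributes)))))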
Why Do We Need KDD?

The traditional method of turning data into knowledge relies on manual analysis and interpretation. For example, in the health-care industry, it is common for specialists to periodically analyze current trends and changes in health-care data, say, on a quarterly basis. The specialists then provide a report detailing the analysis to the sponsoring health-care organization; this report becomes the basis for future decision making and planning for health-care management. In a totally different type of application, planetary geologists sift through remotely sensed images of planets and asteroids, carefully locating and cataloging such geologic objects of interest as impact craters. Be it science, marketing, finance, health care, retail, or any other field, the classical approach to data analysis relies fundamentally on one or more analysts becoming intimately familiar with the data and serving as an interface between the data and the users and products.

For these (and many other) applications, this form of manual probing of a data set is slow, expensive, and highly subjective. In fact, as data volumes grow dramatically, this type of manual data analysis is becoming completely impractical in many domains. Databases are increasing in size in two ways: (1) the number N of records or objects in the database and (2) the number d of fields or attributes to an object. Databases containing on the order of N = 10⁹ objects are becoming increasingly common, for example, in the astronomical sciences. Similarly, the number of fields d can easily be on the order of 10² or even 10³, for example, in medical diagnostic applications. Who could be expected to digest millions of records, each having tens or hundreds of fields? We believe that this job is certainly not one for humans; hence, analysis work needs to be automated, at least partially.

The need to scale up human analysis capabilities to handling the large number of bytes that we can collect is both economic and scientific. Businesses use data to gain competitive advantage, increase efficiency, and provide more valuable services to customers. Data we capture about our environment are the basic evidence we use to build theories and models of the universe we live in. Because computers have enabled humans to gather more data than we can digest, it is only natural to turn to computational techniques to help us unearth meaningful patterns and structures from the massive volumes of data. Hence, KDD is an attempt to address a problem that the digital information era made a fact of life for all of us: data overload.

Data Mining and Knowledge Discovery in the Real World

A large degree of the current interest in KDD is the result of the media interest surrounding successful KDD applications, for example, the focus articles within the last two years in Business Week, Newsweek, Byte, PC Week, and other large-circulation periodicals. Unfortunately, it is not always easy to separate fact from media hype. Nonetheless, several well-documented examples of successful systems can rightly be referred to as KDD applications and have been deployed in operational use on large-scale real-world problems in science and in business.

In science, one of the primary application areas is astronomy. Here, a notable success was achieved by SKICAT, a system used by astronomers to perform image analysis, classification, and cataloging of sky objects from sky-survey images (Fayyad, Djorgovski, and Weir 1996). In its first application, the system was used to process the 3 terabytes (10¹² bytes) of image data resulting from the Second Palomar Observatory Sky Survey, where it is estimated that on the order of 10⁹ sky objects are detectable. SKICAT can outperform humans and traditional computational techniques in classifying faint sky objects. See Fayyad, Haussler, and Stolorz (1996) for a survey of scientific applications.

In business, the main KDD application areas include marketing, finance (especially investment), fraud detection, manufacturing, telecommunications, and Internet agents.

Marketing: In marketing, the primary application is database marketing systems, which analyze customer databases to identify different customer groups and forecast their behavior. Business Week (Berry 1994) estimated that over half of all retailers are using or planning to use database marketing, and those who do use it have good results; for example, American Express reports a 10- to 15-percent increase in credit-card use. Another notable marketing application is market-basket analysis (Agrawal et al. 1996) systems, which find patterns such as, "If customer bought X, he/she is also likely to buy Y and Z." Such patterns are valuable to retailers.
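As a toy illustration of the market-basket idea just described (not from the article; the baskets and thresholds are invented for the example), the following sketch brute-force counts item pairs across baskets and prints rules of the form "if bought X, also likely to buy Y" with their support and confidence:

    from collections import Counter
    from itertools import combinations

    baskets = [
        {"bread", "milk"},
        {"bread", "milk", "eggs"},
        {"milk", "eggs"},
        {"bread", "milk", "eggs"},
    ]

    item_counts, pair_counts = Counter(), Counter()
    for basket in baskets:
        item_counts.update(basket)
        pair_counts.update(combinations(sorted(basket), 2))

    MIN_SUPPORT, MIN_CONFIDENCE = 2, 0.6
    for (x, y), n in pair_counts.items():
        if n < MIN_SUPPORT:
            continue
        for antecedent, consequent in ((x, y), (y, x)):
            # Confidence estimates P(consequent | antecedent).
            confidence = n / item_counts[antecedent]
            if confidence >= MIN_CONFIDENCE:
                print(f"bought {antecedent} -> likely buys {consequent} "
                      f"(support={n}, confidence={confidence:.2f})")

Production market-basket systems make the same support/confidence search tractable over millions of baskets by pruning infrequent itemsets early, the key idea behind the Apriori family of algorithms.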
Investment: Numerous companies use data mining for investment, but most do not describe their systems. One exception is LBS Capital Management. Its system uses expert systems, neural nets, and genetic algorithms to manage portfolios totaling $600 million; since its start in 1993, the system has outperformed the broad stock market (Hall, Mani, and Barr 1996).

Fraud detection: HNC Falcon and Nestor PRISM systems are used for monitoring credit-card fraud, watching over millions of accounts. The FAIS system (Senator et al. 1995), from the U.S. Treasury Financial Crimes Enforcement Network, is used to identify financial transactions that might indicate money-laundering activity.

Manufacturing: The CASSIOPEE troubleshooting system, developed as part of a joint venture between General Electric and SNECMA, was applied by three major European airlines to diagnose and predict problems for the Boeing 737. Clustering methods are used to derive families of faults. CASSIOPEE received the European first prize for innovative applications (Manago and Auriol 1996).

Telecommunications: The telecommunications alarm-sequence analyzer (TASA) was built in cooperation with a manufacturer of telecommunications equipment and three telephone networks (Mannila, Toivonen, and Verkamo 1995). The system uses a novel framework for locating frequently occurring alarm episodes in the alarm stream and presenting them as rules. Large sets of discovered rules can be explored with flexible information-retrieval tools supporting interactivity and iteration. In this way, TASA offers pruning, grouping, and ordering tools to refine the results of a basic brute-force search for rules.

Data Mining and KDD

Historically, the notion of finding useful patterns in data has been given a variety of names, including data mining, knowledge extraction, information discovery, information harvesting, data archaeology, and data pattern processing. The term data mining has mostly been used by statisticians, data analysts, and the management information systems (MIS) communities. It has also gained popularity in the database field. The phrase knowledge discovery in databases was coined at the first KDD workshop in 1989 (Piatetsky-Shapiro 1991) to emphasize that knowledge is the end product of a data-driven discovery.