Machine Learning and Data Mining Machine Learning Algorithms Enable Discovery of Important “Regularities” in Large Data Sets
Total Page:16
File Type:pdf, Size:1020Kb
General Domains Machine Learning and Data Mining Machine learning algorithms enable discovery of important “regularities” in large data sets. Over the past Tom M. Mitchell decade, many organizations have begun to routinely cap- ture huge volumes of historical data describing their operations, products, and customers. At the same time, scientists and engineers in many fields have been captur- ing increasingly complex experimental data sets, such as gigabytes of functional mag- netic resonance imaging (MRI) data describing brain activity in humans. The field of data mining addresses the question of how best to use this historical data to discover general regularities and improve PHILIP NORTHOVER/PNORTHOV.FUTURE.EASYSPACE.COM/INDEX.HTML the process of making decisions. COMMUNICATIONS OF THE ACM November 1999/Vol. 42, No. 11 31 he increasing interest in Figure 1. Data mining application. A historical set of 9,714 medical data mining, or the use of records describes pregnant women over time. The top portion is a historical data to discover typical patient record (“?” indicates the feature value is unknown). regularities and improve The task for the algorithm is to discover rules that predict which T future patients will be at high risk of requiring an emergency future decisions, follows from the confluence of several recent trends: C-section delivery. The bottom portion shows one of many rules the falling cost of large data storage discovered from this data. Whereas 7% of all pregnant women in devices and the increasing ease of the data set received emergency C-sections, the rule identifies a collecting data over networks; the subclass at 60% at risk for needing C-sections. development of robust and efficient Data machine learning algorithms to Patient103 Patient103 Patient103 process this data; and the falling time=1 time=2 time=n cost of computational power, Age: 23 Age: 23 Age: 23 enabling use of computationally FirstPregnancy: No FirstPregnancy: No FirstPregnancy: No intensive methods for data analysis. Anemia: No Anemia: No Anemia: No Diabetes: No Diabetes: Yes Diabetes: No The field of data mining, some- PreviousPrematureBirth: No PreviousPrematureBirth: No PreviousPrematureBirth: No times called “knowledge discovery Ultrasound: ? Ultrasound: Abnormal Ultrasound: ? from databases,” “advanced data Elective C–Section: ? Elective C–Section: ? Elective C–Section: No analysis,” and machine learning, has Emergency C–Section: ? Emergency C–Section: ? Emergency C–Section: Yes already produced practical applica- tions in such areas as analyzing med- Rules learned ical outcomes, detecting credit card fraud, predicting customer purchase If No previous vaginal delivery, and behavior, predicting the personal Abnormal 2nd Trimester Ultrasound, and interests of Web users, and optimiz- Malpresentation at admission ing manufacturing processes. It has Then Probability of Emergency C-Section is 0.6 also led to a set of fascinating scien- tific questions about how comput- Training set accuracy: 26/41 = .63 ers might automatically learn from Test set accuracy: 12/20 = .60 past experience. given birth by emergency C-section. As summarized Prototypical Applications at the bottom of the figure, this regularity holds over Figure 1 shows a prototypical example of a data both the training data used to formulate the rule and mining problem. Given a set of historical data, we a separate set of test data used to verify the reliabil- are asked to use it to improve our medical decision ity of the rule over new data. Physicians may want making. The data consists of a set of medical records to consider this rule as a useful factual statement describing 9,714 pregnant women. We want to about past patients when weighing treatment for improve our ability to identify future high-risk preg- similar new patients. nancies—specifically, those at high risk of requiring What algorithms can be used to learn rules like the an emergency Cesarean-section delivery. In this one in the figure? This rule was learned through a database, each pregnant woman is described in symbolic rule-learning algorithm similar to Clark’s terms of 215 distinct features, such as her age, and Nisbett’s CN2 [3]. Decision-tree learning algo- whether she is diabetic, and whether this is her first rithms, such as Quinlan’s C4.5 [9], are also fre- pregnancy. These features (in the top portion of the quently used to formulate rules of this type. When figure) together describe the evolution of each preg- rules have to be learned from extremely large data nancy over time. sets, specialized algorithms stressing computational The bottom portion of the figure illustrates a typ- efficiency may also be used [1, 4]. Other machine ical result of data mining, including one of the rules learning algorithms commonly applied to this kind learned automatically from this data set. This partic- of data mining problem include neural networks ular rule predicts a 60% risk of emergency C-section [2], inductive logic programming [8], and Bayesian for mothers exhibiting a particular combination of learning algorithms [5]. Mitchell’s 1997 textbook three features, out of the 215 possible features. [7] describes a broad range of machine learning Among the women known to exhibit these three fea- algorithms used for data mining, as well as the sta- tures, the data indicates that 60% have historically tistical principles on which they are based. 32 November 1999/Vol. 42, No. 11 COMMUNICATIONS OF THE ACM Figure 2. Typical data and rules for analyzing credit risk. which data mining has been applied successfully and in which Data further research promises even Customer 103: (time=t0) Customer 103: (time=t1) Customer 103: (time=tn) more effective techniques. Years of credit: 9 Years of credit: 9 Years of credit: 9 Loan balance: $2,400 Loan balance: $3,200 Loan balance: $4,500 The State of the Art Income: $52k Income: ? Income: ? Own House: Yes Own House: Yes Own House: Yes and Beyond Other delinquent accts: 2 Other delinquent accts: 2 Other delinquent accts: 3 The field of data mining is at an Max billing cycles late: 3 Max billing cycles late: 4 Max billing cycles late: 6 interesting crossroads; we now Repay loan?: ? Repay loan?: ? Repay Loan?: No have a first generation of machine learning algorithms (such as those Rules learned from synthesized data: for learning decision trees, rules, neural networks, Bayesian net- If Other-Delinquent-Accounts > 2, and works, and logistic regressions) Number-Delinquent-Billing-Cycles > 1 Then Repay-Loan? = No that have been demonstrated to be of significant value in a variety of If Other-Delinquent-Accounts = 0, and real-world data mining applica- (Income > $30k) OR (Years-of-Credit > 3) tions. Dozens of companies Then Repay-Loan? = Yes around the world now provide commercial implementations of these algorithms (see www. Although machine learning algorithms are central kdnuggets.com), along with efficient interfaces to to the data mining process, it is important to note commercial databases and well-designed user inter- that the process also involves other important steps, faces. But these first-generation algorithms also have including building and maintaining the database, significant limitations. They typically assume the data formatting and cleansing, data visualization and data contains only numeric and symbolic features summarization, the use of human expert knowledge and no text, image features, or raw sensor data. They to formulate the inputs to the learning algorithm assume the data has been carefully collected into a and evaluate the empirical regularities it discovers, single database with a specific data mining task in and determining how to deploy the results. Thus, mind. Furthermore, today’s algorithms tend to be data mining bridges many technical areas, including fully automatic and therefore fail to allow guidance databases, human-computer interaction, statistical from knowledgeable users at key stages in the search analysis, and machine learning algorithms. My focus for data regularities. here is on the role of machine learning algorithms in Given these limitations, and the strong commercial the data mining process. interest despite them, and the accelerating university The patient-medical-records application example research in machine learning and data mining, we in Figure 1 represents a prototypical data mining might well expect the next decade to produce an order problem in which the data consists of a collection of of magnitude advance in the state of the art. Such an time-series descriptions; we use the data to learn to advance could be motivated by development of new predict later events in the series—emergency C-sec- algorithms that accommodate dramatically more tions—based on earlier events—symptoms before diverse sources and types of data, a broader range of delivery. Although I use a medical example to illus- automated steps in the data mining process, and trate these ideas, I could have given an analogous mixed-initiative data mining in which human example of learning to predict, say, which bank-loan experts collaborate more closely with the computer applicants are at high risk of failing to repay their to form hypotheses and test them against the data. loans (see Figure 2). As shown in this figure, data in To illustrate one important research issue, consider such applications typically consists of time-series again the problem of predicting the risk of an emer- descriptions of customer bank balances and other gency C-section for pregnant women. One key lim- demographic information, rather than medical itation of current data mining methods is that they symptoms. cannot utilize the full patient record that is today Other data mining applications include predicting routinely captured in hospital medical records. This customer purchase behavior, customer retention, and is because hospital records for pregnant women the quality of goods produced by a particular manu- often contain sequences of images (such as the ultra- facturing line (see Figure 3). All are applications for sound images taken during pregnancy), other raw COMMUNICATIONS OF THE ACM November 1999/Vol.