The Quest Data Mining System

From: KDD-96 Proceedings. CopyrightThe © 199 6Quest, AAAI (www.aaai.org). Data All rightsMining reserved. System Rakesh Agrawal and Manish Mehta and John Shafer and Ramakrishnan Srikant* IBM Almaden Research Center San Jose, California 95120, U.S.A. Andreas Arning and Toni Bollinger IBM German Software Development Laboratory Boeblingen, Germany Introduction Association Rules The goal of the Quest project at the IBM Almaden We introduced the problem of discovering association Research center is to develop technology to enable a rules in (Agrawal, Imielinski, & Swami 1993b). Given new breed of da&intensive decision-support applica- a set of transactions, where each transaction is a set tions. This paper is a capsule summary of the current of literals (called items), an association rule is an ex- ~ functionality and architecture of the Quest data min- pression of the form X + Y, where X and Y are sets ing System. of items. The intuitive meaning of such a rule is that Our overall approach has been to identify basic data transactions of the database which contain X tend to mining operations that cut across applications and contain Y. An example of an association rule is: “30% develop fast, scalable algorithms for their execution of transactions that contain beer also contain diapers; (Agrawal, Imielinski, & Swami 1993a). We wanted our 2% of all transactions contain both of these items”. algorithms to: Here 30% is called the conjdence of the rule, and 2% the support of the rule. The problem is to find all l discoveT patterns in very large databases, rather association rules that satisfy user-specified minimum than simply verify that a pattern exists; support and minimum confidence constraints. Appli- cations include discovering affinities for market bas- l have a completeness property that guarantees that ket analysis and cross-marketing, catalog design, loss- all patterns of certain types have been discovered; leader analysis, store layout, customer segmentation based on buying patterns, etc. See (Nearhos, Roth- l have high performance and near-linear scaling on man, & Viveros 1996) for a case study of a successful very large (multiple gigabytes) real-life databases. application in health insurance. We discuss the operations of discovering associ- Apriori Algorithm ation rules, sequential patterns, time-series cluster- The problem of mining association rules is decomposed ing, classification, and incremental mining. Due into two subproblems (Agrawal, Imielinski, & Swami to space limitation, we only give highlights and 1993b): point the reader to the relevant information for details. Unfortunately, for the same reason, we have Find all combinations of items that have transaction not been able to include a discussion of the re- support above minimum support. Call those combi- lated work. Besides proceedings of the KDD, SIG- nations frequent itemsets. MOD, VLDB, and Data Engineering Conferences, other excellent sources of information about the data Use the frequent itemsets to generate the desired mining systems and algorithms include (Piatetsky- rules. The general idea is that if, say, ABCD and Shapiro & Frawley 1991) (Fayyad et aZ. 1995). AB are frequent itemsets, then we can determine if Further information about Quest can be obtained the rule AB + CD holds by computing the ratio T = from http://www.almaden.ibm.com/cs/quest. IBM support(ABCD)/support(AB). The rule holds only is making the Quest technology commercially avail- if T > minimum confidence. Note that the rule will able through the data mining product, IBM Intelligent have minimum support because ABCD is frequent. Miner. The Apriori algorithm (Agrawal & Srikant 1994) *Current members of the Quest group. used in Quest for finding all frequent itemsets is Q 244 KDD-96 procedure AprioriAlg() Generalizations begin Very often, taxonomies (is-a hierarchies) over the items .Ll := {frequent 1-itemsets); for ( k := 2; Lk-1 # 0; k++ ) do { are available. An example of a taxonomy is shown ck := apriO&gen(Lk-1); // New candidates in Figure 2: this taxonomy says that Jacket is-a forall transactions t in the dataset do { Outerwear, Ski Pants is-a Outerwear, Outerwear is- forall candidates c E ck contained in t do a Clothes, etc. Users are often interested in generating c.colmt ++; rules that span different levels of the taxonomy. FOI 1 example, we may infer a rule that people who buy Out- erwear tend to buy Hiking Boots from the fact that &swer:= u, Lk; people bought Jackets with Hiking Boots and and Ski end Pants with Hiking Boots. However, the support for the rule “Outerwear + Hiking Boots” may not be the Figure 1: Apriori Algorithm sum of the supports for the rules “Jackets j Hiking Boots” and “Ski Pants $ Hiking Boots” since some people may have bought Jackets, Ski Pants and Hik- Clothes Footwear ing Boots in the same transaction. Also, “Outerwear j Hiking Boots” may be a valid rule, while “Jackets /\ J\ =+ Hiking Boots” and “Clothes + Hiking Boots” may Outerwear Shirts Shoes Hiking Boots not. The former may not have minimum support, and the latter may not have minimum confidence. This /a - generalization of association rules and the algorithm Jackets ski pants used in Quest for finding such rules are described in (Srikant & Agrawal 1995). Figure 2: Example of a Taxonomy Another generalization of the problem of mining association rules is to discover rules in data containing both quantitative and categorical attributes. An example of such a “quantitative” association rule might given in Figure 1. It makes multiple passes over the be that “10% of married people between age 50 and database. In the first pass, the algorithm simply counts 60 have at least 2 cars”. We deal with quantitative at- item occurrences to determine the frequent 1-itemsets tributes by fine-partitioning the values of the attribute (itemsets with 1 item). A subsequent pass, say pass and then combining adjacent partitions as necessary. k, consists of two phases. First, the frequent itemsets We also have measures of partial completeness that Lk-r (the set of all frequent (k - l)-itemsets) found in quantify the information loss due to partitioning. This the (k - 1)th pass are used to generate the candidate generalization and the algorithm for finding such rules itemsets ck, using the apriori-gen() function. This used in Quest are presented in (Srikant & Agrawal function first joins &k-r with Lk-1, the joining condi- 1996a). tion being that the lexicographically ordered first k - 2 One potential problem that users experience in items are the same. Next, it deletes all those itemsets applying association rules to real problems is that from the join result that have some (k-1)-subset that many uninteresting or redundant rules may be gen- is not in Lk - 1, yielding Ck . For example, let LS be { { 1 erated along with the interesting rules. In (Srikant 2 31, {l 2 4}, (1 3 41, (1 3 53, (2 3 4)). After the join & Agrawal 1995) (further generalized in (Srikant & step, Cd will be ((12 3 4}, (13 4 5) }. The prune step Agrawal 1996a)), a “greater-than-expected-value” in- will delete the itemset {1 3 4 5) because the itemset terest measure was introduced, which is used in Quest (1 4 5) is not in L3. We will then be left with only (1 to prune redundant rules. 2 3 41 in C4. The algorithm now scans the database. For each Sequential Patterns transaction, it determines which of the candidates in We introduced the problem of discovering sequential ck are contained in the transaction using a hash-tree patterns in (Agrawal & Srikant 1995). The input data data structure and increments the count of those can- is a set of sequences, called data-sequences. Each data- didates. At the end of the pass, ck is examined to sequence is a list of transactions, where each trans- determine which of the candidates are frequent, yield- action is a sets of items (literals). Typically there is ing .&. The algorithm terminates when LI, becomes a transaction-time associated with each transaction. empty. A sequential pattern also consists of a list of sets of Systemsfor Mining Large Databases 245 items. The problem is to find all sequential patterns contained in the union of the items bought in a set with a user-specified minimum support, where the sup- of transactions, as long as the difference between port of a sequential pattern is the percentage of data- the maximum and minimum transaction-times is less sequences that contain the pattern. than the size of a sliding time window. For exam- For example, in the database of a book-club, each ple, if the book-club specifies a time window of a data-sequence may correspond to all book selections week, a customer who ordered the “Foundation” on of a customer, and each transaction to the books se- Monday, “Ringworld” on Saturday, and then “Foun- lected by the customer in one order. A sequential pat- dation and Empire” and “Ringworld Engineers” in tern might be “5% of customers bought ‘Foundation’, a single order a few weeks later would still support then ‘Foundation and Empire’, and then ‘Second Foun- the pattern “‘Foundation’ and ‘Ringworld’, followed dation’ “. The data-sequence corresponding to a cus- by ‘Foundation and Empire’ and ‘Ringworld Engi- tomer who bought some other books in between these neers’ “. books still contains this sequential pattern; the data- sequence may also have other books in the same trans- In addition, if there were taxonomies (is-a hierar- action as one of the books in the pattern. Elements of chies) over the items in the data, the sequential pat- a sequential pattern can be sets of items, for example, terns could now include items across different levels of “ ‘Foundation’ and ‘Ringworld’, followed by ‘Founda- the taxonomy.

The Quest Data Mining System

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support