Association Rule Discovery Is a Popular Data Mining Method

3. Association Rules

Association rule discovery is a popular data mining method.

The data typically consist of two variables: a transaction and an item. For each transaction, there is a list of items. Typically, a transaction is a single customer purchase and the items are the things that were bought. An association rule is a statement of the form (item set A) ⇒ (item set B).

The strength of the association is measured by the support and confidence of the rule.

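The support of A ⇒ B is the fraction of all transactions that contain every item in A and B:

$$\text{support}(A \Rightarrow B) = \frac{\text{number of transactions that contain every item in } A \text{ and } B}{\text{number of all transactions}} = P(A \cap B)$$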

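The confidence of A ⇒ B is the fraction of the transactions containing A that also contain B:

$$\text{confidence}(A \Rightarrow B) = \frac{\text{number of transactions that contain every item in } A \text{ and } B}{\text{number of transactions that contain the items in } A} = P(B \mid A)$$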

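The lift of A ⇒ B is the confidence of the rule divided by the confidence expected if A and B were independent:

$$\text{lift}(A \Rightarrow B) = \frac{\text{confidence of } A \Rightarrow B}{\text{expected confidence of } A \Rightarrow B \text{ given that } A \text{ and } B \text{ are independent}} = \frac{P(B \mid A)}{P(B)}$$

Note: the lift can be interpreted as a general measure of association between the two item sets. Values greater than one indicate positive correlation; values less than one indicate negative correlation; and values equal to one indicate zero correlation.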

The lift is symmetric: the lift of A ⇒ B equals the lift of B ⇒ A.
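The three measures can be sketched in a few lines of Python. This is only an illustration, assuming each transaction is represented as a set of item names; the function names and the toy baskets are hypothetical and do not come from the course data.

```python
import math

def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)

def confidence(transactions, lhs, rhs):
    """P(rhs | lhs): support of lhs and rhs together, divided by support of lhs."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

def lift(transactions, lhs, rhs):
    """Confidence divided by the confidence expected under independence, P(rhs)."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)

# Hypothetical toy baskets, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

s = support(baskets, {"bread", "milk"})
c = confidence(baskets, {"bread"}, {"milk"})
l_ab = lift(baskets, {"bread"}, {"milk"})
l_ba = lift(baskets, {"milk"}, {"bread"})
print(s, c, l_ab, l_ba)

# Lift is symmetric in the two item sets.
assert math.isclose(l_ab, l_ba)
```

In this toy data the lift is below one, so bread and milk appear together slightly less often than independence would predict.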

Example 2.3 Consider the correlation between owning a saving account and a checking account at a financial institution. The data are displayed in the following two-way contingency table.

                     Checking Account
Saving Account       No       Yes      Total
No                  500     3,500      4,000
Yes               1,000     5,000      6,000
Total             1,500     8,500     10,000

Consider the association rule (saving account) ⇒ (checking account).

$$\text{support} = \frac{5{,}000}{10{,}000} = 50\%$$

$$\text{confidence} = \frac{5{,}000}{6{,}000} \approx 83\%$$

$$\text{lift} = \frac{5{,}000/6{,}000}{8{,}500/10{,}000} \approx \frac{0.83}{0.85} < 1$$

Comments:
a. High confidence and support do not imply cause and effect. The rule is not necessarily interesting. The two items might not even be related.
b. Based on these measures, this might be considered a strong rule. However, those without a saving account are even more likely to have a checking account (confidence = 3,500/4,000 = 87.5%). Saving and checking are in fact negatively correlated.
c. If the two accounts were independent, knowing that a customer has a saving account would not help in knowing whether that customer has a checking account. The expected confidence if the two accounts were independent is 8,500/10,000 = 85%. This is higher than the confidence of SVG ⇒ CK.
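The arithmetic in Example 2.3 can be checked with a short Python sketch. The counts are taken from the contingency table; the dictionary layout and variable names are illustrative choices, not part of the original example.

```python
# Counts from the two-way table, keyed by (has saving, has checking).
counts = {
    ("no", "no"): 500,
    ("no", "yes"): 3500,
    ("yes", "no"): 1000,
    ("yes", "yes"): 5000,
}

total = sum(counts.values())
saving_yes = counts[("yes", "no")] + counts[("yes", "yes")]
saving_no = counts[("no", "no")] + counts[("no", "yes")]
checking_yes = counts[("no", "yes")] + counts[("yes", "yes")]

# Rule: (saving account) => (checking account)
support = counts[("yes", "yes")] / total
confidence = counts[("yes", "yes")] / saving_yes
expected_confidence = checking_yes / total   # P(checking) under independence
lift = confidence / expected_confidence

# Confidence of checking among customers WITHOUT a saving account.
confidence_no_saving = counts[("no", "yes")] / saving_no

print(f"support             = {support:.1%}")
print(f"confidence          = {confidence:.1%}")
print(f"expected confidence = {expected_confidence:.1%}")
print(f"lift                = {lift:.3f}")
print(f"confidence (no SVG) = {confidence_no_saving:.1%}")
```

The lift comes out slightly below one, which is the negative correlation described in comment b.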

Example 2.4 A bank seeks to examine its customer base and understand which of its products the same customer owns. It has chosen to conduct a market-basket analysis of a sample of its customer base.

The BNKSERV data set lists the banking products/services used by 7,991 customers. Thirteen possible services are represented.

ATM      Automated teller machine debit card
AUTO     Automobile installment loan
CCRD     Credit card
CD       Certificate of deposit
CKCRD    Check/debit card
CKING    Checking account
HMEQLC   Home equity line of credit
IRA      Individual retirement account
MMDA     Money market deposit account
MTG      Mortgage
PLOAN    Personal/consumer installment loan
SVG      Saving account
TRUST    Personal trust account

There are 24,375 rows in the data set. Each row of the data set represents a customer-service combination. The median number of services per customer is three.
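To make the row layout concrete, the following Python sketch groups customer-service rows into one basket per customer and computes the median basket size. The account identifiers and rows here are hypothetical stand-ins, not records from BNKSERV; only the service codes are taken from the list above.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (account, service) rows in the same long format as the data set:
# one row per customer-service combination.
rows = [
    ("A001", "CKING"), ("A001", "SVG"), ("A001", "ATM"),
    ("A002", "CKING"),
    ("A003", "CKING"), ("A003", "SVG"), ("A003", "CD"),
    ("A003", "IRA"), ("A003", "MMDA"),
]

# Collect the set of services held by each customer.
baskets = defaultdict(set)
for acct, service in rows:
    baskets[acct].add(service)

sizes = [len(services) for services in baskets.values()]
median_services = median(sizes)

print(len(rows), "rows for", len(baskets), "customers")
print("median services per customer:", median_services)
```

Each resulting set plays the role of one transaction in the definitions of support, confidence, and lift.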

Construct the following diagram.

Specify the settings for the Input Data Source node.
1. Open the Input Data Source node.
2. Select the BNKSERV data set from the Neuralnt library.
3. Set Model Role for ACCT to id and for SERVICE to target.
4. Close and save changes to the Input Data Source node.

Open the Association node. The Variables tab is active by default and lists the same information that is found in the Variables tab in the Input Data Source node.

Select the General tab. This tab enables you to modify the analysis mode and control how many rules are generated.

Inspect the Analysis mode options.

The Analysis mode is set to By context by default. This mode uses information specified in the input data source to determine the appropriate analysis. If the input data set contains an id variable and a target variable, the node automatically performs an association analysis.

Other options include:
- Minimum Transaction Frequency to Support Associations: specifies a minimum level of support to claim that items are associated (that is, occur together in the database). The default frequency is 5%.
- Maximum number of items in an association: determines the maximum size of the item set to be considered. For example, the default of 4 items indicates that up to 4-way associations are considered.
- Minimum confidence for rule generation: specifies the minimum confidence level to generate a rule. The default is 10%.
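The effect of these three thresholds can be illustrated with a brute-force Python sketch. This is not the algorithm used by the Association node, whose internals are not described here; it simply enumerates item sets and applies the same kinds of cut-offs. The function name `mine_rules`, its parameters, and the toy baskets are hypothetical; the default values mirror the 5%, 4-item, and 10% defaults mentioned above.

```python
from itertools import combinations

def mine_rules(baskets, min_support=0.05, max_items=4, min_confidence=0.10):
    """Brute-force rule enumeration (illustrative only, not efficient).

    Returns tuples (lhs, rhs, support, confidence, lift).
    """
    n = len(baskets)
    items = sorted(set().union(*baskets))

    def supp(itemset):
        s = set(itemset)
        return sum(1 for b in baskets if s <= b) / n

    rules = []
    for size in range(2, max_items + 1):
        for itemset in combinations(items, size):
            s_all = supp(itemset)
            # Skip item sets that never occur or fall below the support cut-off.
            if s_all == 0 or s_all < min_support:
                continue
            # Split the item set into every non-empty LHS / RHS pair.
            for k in range(1, size):
                for lhs in combinations(itemset, k):
                    rhs = tuple(i for i in itemset if i not in lhs)
                    conf = s_all / supp(lhs)
                    if conf >= min_confidence:
                        rules.append((lhs, rhs, s_all, conf, conf / supp(rhs)))
    return rules

# Hypothetical toy baskets using service codes from the list of services.
toy = [
    {"CKING", "SVG", "ATM"},
    {"CKING", "SVG"},
    {"CKING"},
    {"SVG", "CD"},
]

rules = mine_rules(toy, min_support=0.5, max_items=2, min_confidence=0.6)
for lhs, rhs, s, c, l in rules:
    print(lhs, "=>", rhs, f"support={s:.2f} confidence={c:.2f} lift={l:.2f}")
```

Raising the minimum support or confidence shrinks the rule list, while raising the maximum item-set size makes the enumeration grow quickly, which is why such limits are exposed as settings.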

Use the default Association settings. Run the diagram from the Association node and view the results.

Click on the Support(%) column with the right mouse button and select Sort Descending.

The support is the percentage of customers that have all the services involved in the rule. For example, 54% of the 7,991 customers have both a checking account and a saving account, and 25% have a checking account, a saving account, and an ATM card.

Click on the Confidence(%) column with the right mouse button and select Sort Descending.

The confidence represents the percentage of customers who have the right-hand-side (RHS) item among those who have the left-hand-side (LHS) item. For example, all customers who have a check card also have a checking account, and 97.81% of those with a mortgage also have a checking account.

Lift, in the context of association rules, is the ratio of the confidence of a rule to the confidence expected if the RHS were independent of the LHS. For example, a customer who has a saving account is 17% more likely to have a checking account than a randomly selected customer.
