Final Exam Review, Spring 2003

ACCTG 6901, University of Utah

Types of questions
- Concepts: short answers, comparisons, definitions, descriptions, interpretations and examples
- Methods: problem solving, analyses, applications and selections of data mining tasks/methods, and attributes

Topics
- Data mining basics and the KDD process
  - Why data mining? Different views of data mining. Related terms and disciplines. Definitions.
  - Readings: slides and Ch. 1.1 and Ch. 1.2 in T2
- Association rule discovery
  - Objectives
  - Fundamentals: association rules, item sets, support, confidence, and lift
  - Apriori property and algorithm
  - Applications
  - Types of association rules
- Sequential pattern mining
  - Objectives
  - Fundamentals: sequence, scope within which sequences are mined, large sequence, relationships among sequences, support, distance measure and calculations, and comparative criteria
  - Apriori property and algorithm
  - Applications
- Clustering
  - Objectives
  - Fundamentals: cluster, distance measure and calculations, and comparative criteria
  - Bottom-up hierarchical algorithm
  - Applications and comparative criteria
- Classification and prediction
  - Objectives
  - Fundamentals: classification, different views of prediction, data required, and comparative criteria
  - Decision tree induction, measure of diversity and pruning
  - Neural network classifier, components, steps and tuning parameters
  - Applications and comparative criteria

Sample Exam Questions

Question 1. Association rules and sequential patterns

The following is an example of a customer purchase transaction data set.

CID  TID  Date        Items Purchased
1    1    01/01/2001  10, 20
1    2    01/02/2001  10, 30, 50, 70
1    3    01/03/2001  10, 20, 30, 40
2    4    01/03/2001  20, 30
2    5    01/04/2001  20, 40, 70
3    6    01/04/2001  10, 30, 60, 70
3    7    01/05/2001  10, 50, 70
4    8    01/05/2001  10, 20, 30
4    9    01/06/2001  20, 40, 60
5    10   01/11/2001  10, 20, 30, 60

Note: CID = Customer ID and TID = Transaction ID

a) Calculate the support, confidence and lift of the following association rule. Indicate whether the items in the association rule are independent of each other or have a negative or positive impact on each other. (8 points)

{10} -> {50,70}

b) The following is the list of large two-item sets. Show the steps to apply the Apriori property to generate and prune the candidates for large three-item sets. Describe how the Apriori property is used in the steps. Give the final list of candidate large three-item sets. (10 points)

{10,20} {10,30} {20,30} {20,40}

c) Does customer 1 support the sequence <{20} {50,70} {10}>? Justify your answer. (5 points)

d) Calculate the support of <{10} {30}>. (4 points)

e) Based on the types of association rules discussed in class, identify which type(s) of rule {10} -> {50,70} is. (3 points)

Question 2. Clustering and Classification

The following is the star schema for the Sales Department of your company.

CUSTOMER: Name, Date of Birth, Annual Income, City, State
PRODUCT: Name, Type, Introduction Date
SALESFACT: TransactionID, Quantity, Amount
DATE: Day of Year, Year

Note: 1. TransactionID is used as the primary key in the fact table because there might be more than one transaction for each customer and product in a given day.

2. The Introduction Date for a product is the date when it is first introduced into the market.

a) The clustering task was selected to identify customer segmentation. Suggest the attributes, including derived attributes, to be used in the clustering task and justify your answer. (10 points)

b) Recommend a standardization or normalization method for the attributes in a distance function. (10 points)

c) You are asked to recommend a classification/prediction task to be performed on the above data set.

i. Specify the input and class label attributes you choose for this classification/prediction task. Give an example of business decision(s) that can benefit from the classification/prediction results using the input and class label attributes of your choice. (10 points)

ii. Define and give an example of noise using the data set above. (5 points)

iii. Assume that you will use a decision tree classifier. Specify and compare the different tree pruning approaches. (10 points)

iv. Suppose you are using a neural network instead of a decision tree. List at least three possible parameters you want to tune to improve its performance during the training period. (5 points)

Question 3: Selecting data mining tasks

The task attributes of the four data mining tasks discussed in class are briefly described below:
- Association rule and sequential pattern mining: Customer ID, Transaction ID and Item
- Classification/prediction: input and class label attributes
- Clustering: input attributes

The following are the data fields in the data mining server log: User ID, Session ID, Dataset ID, MiningTask ID, Parameter Value, Accuracy

a) Which task will you perform to identify the data mining tasks that tend to be performed in the same session? Describe the attributes you choose and how they are mapped to the data mining task attributes listed above. (6 points)

b) Which task will you perform to identify the sequence of data mining tasks that users tend to perform on the same data set over time? Describe the attributes you choose and how they are mapped to the data mining task attributes listed above. (6 points)

c) Which task will you perform to determine if the Parameter Value level (low, medium or high) and the level of Parameter Value adjustment (small, moderate or large) tend to have a positive or negative impact on Accuracy? Describe the attributes you choose and how they are mapped to the data mining task attributes listed above. (8 points)

Answers to Sample Exam Questions

Question 1:

a) Support = Support({10,50,70}) = 2/10 = 20%
Confidence = Support({10,50,70}) / Support({10}) = 0.2/0.7 = 2/7 ≈ 29%
Lift = Confidence / Support({50,70}) = (2/7)/0.2 = 10/7 ≈ 1.43 > 1

Since lift is larger than 1, {10} and {50,70} are not independent: they have a positive impact on each other, so this is a positive rule.

b) Large two-item sets: {10,20} {10,30} {20,30} {20,40}

Note: describe how the Apriori property is used to decide which two large item sets are joined together and to determine which three-item sets should be pruned.

Join: {10,20} and {10,30} give {10,20,30}; {20,30} and {20,40} give {20,30,40}.
Prune: {10,20,30} is kept because all of its two-item subsets ({10,20}, {10,30} and {20,30}) are large. {20,30,40} is pruned because its subset {30,40} is not a large two-item set.
Final list: {10,20,30}

c) The sequence of customer 1 is: <{10,20} {10,30,50,70} {10,20,30,40}>

Since {20} ⊆ {10,20}, {50,70} ⊆ {10,30,50,70}, and {10} ⊆ {10,20,30,40}, <{20} {50,70} {10}> is contained in the sequence of customer 1. Therefore, customer 1 supports the sequence <{20} {50,70} {10}>.

d) Only customer 1 supports the sequence <{10} {30}> and there are 5 customers. Therefore,

Support = 1/5 = 20%

e) The association rule {10} -> {50,70} is a single-level, single-dimensional and Boolean association rule.
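
The Question 1 calculations can be checked with a short script. The following is only a minimal sketch: the function names are made up for illustration, the join step assumes the usual Apriori convention of joining two-item sets that share their first item, and sequence containment is tested without any time window or distance constraint, which may differ from the scope used in class.

```python
from itertools import combinations

# Transactions from Question 1: (customer ID, transaction ID, items),
# listed in date order within each customer.
TRANSACTIONS = [
    (1, 1, {10, 20}),
    (1, 2, {10, 30, 50, 70}),
    (1, 3, {10, 20, 30, 40}),
    (2, 4, {20, 30}),
    (2, 5, {20, 40, 70}),
    (3, 6, {10, 30, 60, 70}),
    (3, 7, {10, 50, 70}),
    (4, 8, {10, 20, 30}),
    (4, 9, {20, 40, 60}),
    (5, 10, {10, 20, 30, 60}),
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for _, _, items in TRANSACTIONS if set(itemset) <= items)
    return hits / len(TRANSACTIONS)


def rule_metrics(lhs, rhs):
    """Return (support, confidence, lift) for the rule lhs -> rhs."""
    both = support(set(lhs) | set(rhs))
    confidence = both / support(lhs)
    lift = confidence / support(rhs)
    return both, confidence, lift


def candidate_3_itemsets(large_2):
    """Join large 2-itemsets, then prune with the Apriori property."""
    large = {tuple(sorted(s)) for s in large_2}
    kept, pruned = [], []
    for a, b in combinations(sorted(large), 2):
        if a[0] == b[0]:  # join step: the two sets share their first item
            cand = (a[0], a[1], b[1])
            # prune step: every 2-item subset of a candidate must be large
            if all(sub in large for sub in combinations(cand, 2)):
                kept.append(cand)
            else:
                pruned.append(cand)
    return kept, pruned


def customer_sequences():
    """Group each customer's transactions into an ordered sequence."""
    seqs = {}
    for cid, _, items in TRANSACTIONS:
        seqs.setdefault(cid, []).append(items)
    return seqs


def contains(sequence, pattern):
    """True if each pattern element is a subset of a later and later
    element of sequence (assumption: no time-window constraint)."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not set(element) <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def sequence_support(pattern):
    """Fraction of customers whose sequence contains pattern."""
    seqs = customer_sequences()
    return sum(contains(s, pattern) for s in seqs.values()) / len(seqs)


if __name__ == "__main__":
    print(rule_metrics({10}, {50, 70}))
    print(candidate_3_itemsets([{10, 20}, {10, 30}, {20, 30}, {20, 40}]))
    print(contains(customer_sequences()[1], [{20}, {50, 70}, {10}]))
    print(sequence_support([{10}, {30}]))
```

Support is counted per transaction for association rules and per customer for sequences, which is why the denominators in the answers are 10 and 5 respectively.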

Question 2.

a) Customer age, customer’s annual income, number of days since a customer’s last purchase, the total number of a customer’s sales transactions, and the total amount of a customer’s purchases. I am interested in groups of similar customers based on customer age, income and lifetime value, with lifetime value measured by the last three (derived) attributes.

b) For each attribute, calculate the mean value and the mean absolute deviation. Then calculate the standardized value = (original value – mean value) / mean absolute deviation.

c) i) Input: customer city, state, age and income. Output: product type.
Business analysis: These choices for input and output attributes enable us to understand the impact of customer demographics on product type preference. In marketing, this is called “customer segmentation.”

ii) Noise refers to records with the same input attribute values but different class labels. For example, in the customer table, the same customer name with a different city and state may be noise: is it because of erroneous input, or is it because the customer just moved?

iii) 1. Prepruning: halting the creation of unreliable branches by statistically determining the goodness of further tree splits.

2. Postpruning: removing unreliable branches from a full tree by minimizing error rates or required encoding bits.

iv) Number of hidden layers, number of hidden layer nodes, learning rate, momentum, number of epochs, and accuracy threshold.
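
The standardization in answer b) can be sketched in a few lines. This is an illustration only: the function name and the income figures are hypothetical, and returning zeros for a constant attribute is an added assumption to avoid division by zero.

```python
def standardize(values):
    """Standardize values using the mean absolute deviation as the spread."""
    n = len(values)
    mean = sum(values) / n
    mean_abs_dev = sum(abs(v - mean) for v in values) / n
    if mean_abs_dev == 0:
        # Assumption: a constant attribute carries no information,
        # so every standardized value is set to 0.
        return [0.0 for _ in values]
    return [(v - mean) / mean_abs_dev for v in values]


# Hypothetical annual incomes for five customers (not from the exam data).
incomes = [30000, 45000, 60000, 75000, 90000]

if __name__ == "__main__":
    print(standardize(incomes))
```

After standardization, each attribute has mean 0 and mean absolute deviation 1, so attributes measured on large scales (such as income) no longer dominate attributes on small scales (such as age) in a distance function.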

Question 3:

a) I suggest using association rule mining. In this data mining task, Session ID can be mapped to Transaction ID and MiningTask ID can be mapped to Item.

b) I suggest using sequential pattern mining. In this data mining task, Dataset ID can be mapped to Customer ID, Session ID can be mapped to Transaction ID and MiningTask ID can be mapped to Item.

c) I suggest using classification. In this data mining task, the input attributes include the parameter value level and the level of parameter value adjustment, and the class label attribute is the impact on accuracy (i.e., positive, negative, or no impact).
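
The mappings in answers a) and b) amount to regrouping the log records. The sketch below uses made-up log records and hypothetical function names. Because the log has no timestamp field, it also assumes that Session IDs increase over time.

```python
from collections import defaultdict

# Hypothetical log records: (user ID, session ID, dataset ID, mining task ID).
LOG = [
    ("u1", 1, "d1", "clustering"),
    ("u1", 1, "d1", "classification"),
    ("u2", 2, "d1", "association"),
    ("u1", 3, "d2", "clustering"),
    ("u1", 3, "d2", "classification"),
]


def session_transactions(log):
    """Answer a): Session ID -> Transaction ID, MiningTask ID -> Item."""
    baskets = defaultdict(set)
    for _, session, _, task in log:
        baskets[session].add(task)
    return dict(baskets)


def dataset_sequences(log):
    """Answer b): Dataset ID -> Customer ID, Session ID -> Transaction ID,
    MiningTask ID -> Item."""
    per_dataset = defaultdict(lambda: defaultdict(set))
    for _, session, dataset, task in log:
        per_dataset[dataset][session].add(task)
    # Assumption: sorting by Session ID puts the sessions in time order.
    return {
        dataset: [sessions[s] for s in sorted(sessions)]
        for dataset, sessions in per_dataset.items()
    }


if __name__ == "__main__":
    print(session_transactions(LOG))
    print(dataset_sequences(LOG))
```

The output of `session_transactions` has the shape that association rule mining expects, and the output of `dataset_sequences` has the shape that sequential pattern mining expects.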