Data Mining and Knowledge Discovery (KSE525)

Assignment #1 (March 20, 2017; due: 10:30 a.m., April 3)

1. [10 points] Although its sales are reasonable, an online fashion retailer is struggling to generate the volume of sales that it had originally hoped for when launching its site. List at least two ways in which predictive data analytics (i.e., classification / prediction) could be used to help address this business problem. For each proposed approach, describe the predictive (classification) model that will be built, how the model will be used by the business, and how using the model will help address the original business problem.

2. [10 points] Please prove that the variance is an algebraic measure. In addition, please write an example showing how the variance can be obtained from smaller subsets without recomputing it from the entire data set.

3. [10 points] Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70. (Do not use any software; do the math by yourself.)
1) What is the mean of the data? What is the median?
2) What is the mode of the data? Comment on the data's modality (e.g., bimodal).
3) Can you find the first quartile (Q1) and the third quartile (Q3) of the data? Please use the first method covered in class.
4) Give the five-number summary of the data.
5) Show a boxplot of the data.

4. [10 points] It is important to define or select similarity measures in data analysis. However, there is no commonly accepted subjective similarity measure. Results can vary depending on the similarity measures used. Nonetheless, seemingly different similarity measures may be equivalent after some transformation.
Suppose we have the following 2-D data set:
(1.5, 1.7), (2, 1.9), (1.6, 1.8), (1.2, 1.5), (1.5, 1.0)
1) Consider the data as 2-D data points. Given a new data point as a query, rank the database points based on similarity with the query using Euclidean distance, Manhattan distance, supremum distance, and cosine similarity.
2) Normalize the data set to make the norm of each data point equal to 1. Use Euclidean distance on the transformed data to rank the data points.

5. [10 points] The following table shows the IQs for a group of people who applied to take part in a television general knowledge quiz (e.g., Problematic Men).

ID   1    2    3    4    5    6    7    8    9    10
IQ   92   107  83   101  107  92   99   119  93   106

ID   11   12   13   14   15   16   17   18   19   20
IQ   105  88   106  90   97   118  120  72   100  104

Using this data set, generate the following binned versions of the IQ attribute. That is, replace the original values with the bin numbers (1, 2, 3, 4, and 5).
1) An equal-width binning using 5 bins
2) An equal-frequency binning using 5 bins
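As a hedged illustration of the idea behind Question 2, the sketch below keeps three distributive summaries per subset (count, sum, and sum of squares) and combines them to obtain the variance of the union. It assumes the population variance (dividing by n); the names `Summary`, `summarize`, `merge`, and `variance` and the two sample subsets are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Distributive summaries of one subset (hypothetical helper type)."""
    n: int           # number of values
    total: float     # sum of values
    total_sq: float  # sum of squared values


def summarize(values):
    """Compute the three summaries for a single subset."""
    return Summary(len(values), sum(values), sum(v * v for v in values))


def merge(a, b):
    """Combine two subset summaries without touching the raw data."""
    return Summary(a.n + b.n, a.total + b.total, a.total_sq + b.total_sq)


def variance(s):
    """Population variance from summaries: E[x^2] - (E[x])^2."""
    mean = s.total / s.n
    return s.total_sq / s.n - mean * mean


# Hypothetical subsets; their union is [2, 4, 4, 4, 5, 5, 7, 9].
part_a = [2.0, 4.0, 4.0]
part_b = [4.0, 5.0, 5.0, 7.0, 9.0]

merged = merge(summarize(part_a), summarize(part_b))
print(variance(merged))  # 4.0
```

Because each summary is a plain sum, a subset can be summarized once and merged any number of times. Note that this sum-of-squares form can lose precision when the values are very large relative to their spread.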
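The remark in Question 4 that different measures may be equivalent after a transformation can be checked numerically: for unit-length vectors, the squared Euclidean distance equals 2 − 2·cos(x, y), so ranking by Euclidean distance on normalized data should match ranking by cosine similarity. The sketch below uses the five data points from Question 4; the query point `(1.0, 1.0)` is a hypothetical placeholder, not the assignment's query, and the function names are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def supremum(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalize(a):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in a))
    return tuple(x / norm for x in a)


def rank(points, query, score, descending=False):
    """Return point indices ordered from most to least similar."""
    return sorted(range(len(points)),
                  key=lambda i: score(points[i], query),
                  reverse=descending)


points = [(1.5, 1.7), (2.0, 1.9), (1.6, 1.8), (1.2, 1.5), (1.5, 1.0)]
query = (1.0, 1.0)  # hypothetical query, NOT the one from the assignment

# Distances: smaller is more similar. Cosine: larger is more similar.
euclid_rank = rank(points, query, euclidean)
manhattan_rank = rank(points, query, manhattan)
supremum_rank = rank(points, query, supremum)
cos_rank = rank(points, query, cosine_similarity, descending=True)

# Euclidean distance on unit-normalized data.
unit_points = [normalize(p) for p in points]
unit_rank = rank(unit_points, normalize(query), euclidean)

print(cos_rank == unit_rank)
```

The indices are zero-based positions in `points`. Substituting the actual query from the assignment only requires changing the `query` tuple.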
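The two binning schemes in Question 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not necessarily the convention used in class: equal-width bins are half-open intervals with the maximum value placed in the last bin, and equal-frequency binning assigns bins by sorted rank, so tied values may land in different bins. The function names and the small sample list are hypothetical.

```python
def equal_width_bins(values, k):
    """Assign bin numbers 1..k using k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        # All values identical: everything goes into bin 1.
        return [1] * len(values)
    # int(...) picks the interval; min(...) keeps the maximum in bin k.
    return [min(int((v - lo) / width) + 1, k) for v in values]


def equal_frequency_bins(values, k):
    """Assign bin numbers 1..k so each bin holds about len(values)/k items."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n + 1
    return bins


# Hypothetical skewed sample, chosen to show how the schemes differ.
sample = [1, 2, 3, 4, 5, 30]

print(equal_width_bins(sample, 3))      # [1, 1, 1, 1, 1, 3]
print(equal_frequency_bins(sample, 3))  # [1, 1, 2, 2, 3, 3]
```

With a skewed sample, equal-width binning leaves the middle bin empty, while equal-frequency binning places two values in every bin.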
