STAT/IST 557: Data Mining, Fall 2011

Class: Tuesday, Thursday 9:45-11:00am, IST 202A

Instructor: Le Bao
Office: Wartik 514C
E-mail: [email protected]
Office Hours: Tuesday 11:00am-12:00pm at IST, Friday 10:00am-12:00pm at Thomas 421A, or by appointment

Class Cancellation: There will be no class on Oct. 20.

Textbook: The Elements of Statistical Learning by T. Hastie, R. Tibshirani, and J. Friedman

Recommended:

Pattern Recognition and Machine Learning by C. M. Bishop
Classification and Regression Trees by L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone
Pattern Recognition and Neural Networks by B. Ripley

Course Objective:

This course covers methodology, major software tools, and applications in data mining. By introducing the principal ideas of statistical learning, the course will help students understand the conceptual underpinnings of data mining methods. The emphasis is on the use of existing software packages (mainly in R). Students will be required to work on projects to practice applying existing software. The topics include:

Exploratory data analysis
Linear regression and regularization (ridge, lasso)
Dimension reduction: stepwise model selection, PCA
Regression methods for classification: regression on indicators, logistic regression
Discriminant analysis: linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), regularized discriminant analysis (RDA), and reduced-rank LDA
Clustering methods: k-means, hierarchical clustering, model-based clustering, etc.
Nonparametric methods: k-nearest neighbor
Classification and regression trees (CART)
Random forest
Boosting and bagging
Mixture models
Model selection: cross-validation, bootstrap, Bayesian model averaging

Website: Class announcements and materials (lecture notes, homework and lab assignments, homework and exam solutions, etc.) will be posted regularly on ANGEL, so it is recommended that you check the site frequently.

Prerequisites: STAT 414, 415, 416, 418, or similar courses that cover the basics of probability, expectation, and conditional distribution; matrix algebra and multivariate calculus; basic programming skills in R, SAS, or Matlab. STAT 511 is recommended for students from the statistics department, and STAT 501 is recommended for all other students. Recommended book: All of Statistics: A Concise Course in Statistical Inference by L. Wasserman.

Evaluation:
Attendance: 10%
Project 1: 15%
Project 2: 15%
Project 3 (Contest): 60% (20% on test data result, 15% on report, 15% on presentation, 10% on contribution)
Late reports will not be accepted.

Grading Scale: A: 93%; A-: 90%; B+: 87%; B: 83%; B-: 80%; C+: 77%; C: 70%; D: 60%; F: <60%
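
As an illustration of how the evaluation weights and the grading scale combine, here is a minimal Python sketch. It rests on assumptions the syllabus does not state: each component is scored from 0 to 100, each listed percentage is an inclusive lower bound for its letter grade, and no rounding is applied. The names WEIGHTS, CUTOFFS, final_percentage, and letter_grade are hypothetical and chosen only for this example.

```python
# Component weights (percent of the final grade), taken from the Evaluation section.
# Project 3 is split into its four stated parts (20 + 15 + 15 + 10 = 60).
WEIGHTS = {
    "attendance": 10,
    "project1": 15,
    "project2": 15,
    "project3_test_result": 20,
    "project3_report": 15,
    "project3_presentation": 15,
    "project3_contribution": 10,
}

# Letter-grade cutoffs from the Grading Scale, highest first.
# Assumption: each cutoff is an inclusive lower bound.
CUTOFFS = [
    (93, "A"), (90, "A-"), (87, "B+"), (83, "B"),
    (80, "B-"), (77, "C+"), (70, "C"), (60, "D"),
]


def final_percentage(scores):
    """Weighted average of component scores, each given on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_grade(percentage):
    """Return the first letter whose cutoff the percentage meets, else F."""
    for cutoff, letter in CUTOFFS:
        if percentage >= cutoff:
            return letter
    return "F"


# Hypothetical scores for one student.
example_scores = {
    "attendance": 100,
    "project1": 90,
    "project2": 80,
    "project3_test_result": 85,
    "project3_report": 85,
    "project3_presentation": 85,
    "project3_contribution": 85,
}
example_total = final_percentage(example_scores)
print(example_total, letter_grade(example_total))  # prints: 86.5 B
```

How borderline scores are actually handled is at the instructor's discretion; the sketch only shows the arithmetic implied by the stated weights.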

Academic Integrity: All Penn State and Eberly College of Science policies regarding academic integrity apply to this course. See http://www.science.psu.edu/academic/Integrity/index.html for details.

ECOS Code of Mutual Respect and Cooperation: The Eberly College of Science Code of Mutual Respect and Cooperation (http://www.science.psu.edu/climate/Code-of-Mutual-Respect final.pdf) embodies the values that we hope our faculty, staff, and students possess and will endorse to make the Eberly College of Science a place where every individual feels respected and valued, as well as challenged and rewarded.