1 Introduction

CS 189 / 289A [Spring 2017]
Machine Learning
Jonathan Shewchuk
http://www.cs.berkeley.edu/~jrs/189

TAs: Daylen Yang, Ting-Chun Wang, Moulos Vrettos, Mostafa Rohaninejad, Michael Zhang, Anurag Ajay, Alvin Wan, Soroush Nasiriany, Garrett Thomas, Noah Golmant, Adam Villaflor, Raul Puri, Alex Francis

Questions: Please use Piazza, not email. [Piazza has an option for private questions, but please use public for most questions so other people can benefit.] For personal matters only, [email protected]

Discussion sections: 12 now; more will be added. Attend any section. If the room is too full, please go to another one. [However, to get into the course, you have to pick some section with space. Doesn’t matter which one!] No sections this week.

[Enrollment: We’re trying to raise it to 540. After enough students drop, it’s possible that everyone might get in. Concurrent enrollment students have the lowest priority; non-CS grad students the second-lowest.]

[Textbooks: Available free online. Linked from class web page.]

[Book covers shown: An Introduction to Statistical Learning with Applications in R (Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani; Springer Texts in Statistics) and The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edition (Trevor Hastie, Robert Tibshirani, Jerome Friedman; Springer Series in Statistics).]
Prerequisites
Math 53 (vector calculus)
Math 54 or 110 (linear algebra)
CS 70 (probability)
NOT CS 188 [Might still be listed as a prerequisite, but we’re having it removed.]
[BUT be aware that the 189 midterm starts 10 minutes after the 188 midterm ends.]

Grading: 189
40% 7 Homeworks. Late policy: 5 slip days total
20% Midterm: Wednesday, March 15, in class (6:30–8 pm)
40% Final Exam: MOVED to Monday, May 8, 3–6 PM (Exam group 3)
[Good news for some of you who had final exam conflicts.]

Grading: 289A
40% HW
20% Midterm
20% Final
20% Project

Cheating
– Discussion of HW problems is encouraged.
– All homeworks, including programming, must be written individually.
– We will actively check for plagiarism.
– Typical penalty is a large NEGATIVE score, but I reserve the right to give an instant F for even one violation, and I will always give an F for two.
[Last time I taught CS 61B, we had to punish roughly 100 people for cheating. It was very painful. Please don’t put me through that again.]

CORE MATERIAL
– Finding patterns in data; using them to make predictions.
– Models and statistics help us understand patterns.
– Optimization algorithms “learn” the patterns.
[The most important part of this is the data. Data drives everything else. You cannot learn much if you don’t have enough data. You cannot learn much if your data sucks. But it’s amazing what you can do if you have lots of good data. Machine learning has changed a lot in the last decade because the internet has made truly vast quantities of data available. For instance, with a little patience you can download tens of millions of photographs. Then you can build a 3D model of Paris. Some techniques that had fallen out of favor, like neural nets, have come back big in the last few years because researchers found that they work so much better when you have vast quantities of data.]

CLASSIFICATION

creditcards.pdf (ISL, Figure 4.1)
[Figure 4.1: The Default data set. Left: the annual incomes and monthly credit card balances of a number of individuals; those who defaulted on their credit card payments are shown in orange, those who did not in blue. Center: boxplots of balance as a function of default status. Right: boxplots of income as a function of default status.]

[The problem of classification. We are given data points, each belonging to one of two classes. Then we are given additional points whose class is unknown, and we are asked to predict what class each new point is in. Given the credit card balance and annual income of a cardholder, predict whether they will default on their debt.]
– Collect training data: reliable debtors & defaulted debtors
– Evaluate new applicants (prediction)

[Figure: decision boundary]

Why Not Linear Regression? (ISL, Section 4.2)
We have stated that linear regression is not appropriate in the case of a qualitative response. Why not? Suppose that we are trying to predict the medical condition of a patient in the emergency room on the basis of her symptoms. In this simplified example, there are three possible diagnoses: stroke, drug overdose, and epileptic seizure. We could consider encoding these values as a quantitative response variable, $Y$, as follows:

$$Y = \begin{cases} 1 & \text{if stroke;} \\ 2 & \text{if drug overdose;} \\ 3 & \text{if epileptic seizure.} \end{cases}$$
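The trouble with this encoding can be made concrete. The sketch below is a minimal illustration, assuming NumPy; the one-feature toy data and all numbers are invented, not from the lecture or ISL. It fits ordinary least squares to the coded response, and the fit necessarily treats the diagnoses as ordered and evenly spaced, neither of which is true of the underlying classes:

```python
import numpy as np

# Hypothetical toy data: one symptom feature x, diagnosis coded as
# above: 1 = stroke, 2 = drug overdose, 3 = epileptic seizure.
x = np.array([0.9, 1.1, 1.0, 4.8, 5.2, 5.0, 9.1, 8.9, 9.0])
y = np.array([1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0])

# Ordinary least squares fit y = a*x + b, via the design matrix [x, 1].
A = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

# The fitted line imposes an ordering (stroke < overdose < seizure)
# and equal spacing between the conditions.
print(a * 5.0 + b)   # about 2.0
print(a * 7.0 + b)   # about 2.5: "halfway between overdose and seizure"?
```

A fractional prediction like 2.5 has no medical meaning, and an equally arbitrary coding (say, swapping stroke and drug overdose) would yield a completely different fitted model. That is the heart of the objection.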
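For contrast, treating the credit card problem above as pure classification asks only for a discrete label. Here is a minimal nearest-neighbor sketch of the "evaluate new applicants" step; this is an illustrative baseline I am supplying, not a method the lecture has introduced, and the data values are invented (NumPy assumed):

```python
import numpy as np

# Hypothetical training data: (balance, income) per cardholder,
# with label 0 = reliable debtor, 1 = defaulted debtor.
train_X = np.array([[ 300.0, 60000.0],
                    [ 500.0, 40000.0],
                    [1800.0, 30000.0],
                    [2000.0, 35000.0]])
train_y = np.array([0, 0, 1, 1])

def predict(x):
    """Nearest-neighbor rule: copy the label of the closest training point."""
    dists = np.linalg.norm(train_X - x, axis=1)
    return train_y[np.argmin(dists)]

# Evaluate a new applicant (the prediction step from the notes above):
print(predict(np.array([1900.0, 32000.0])))   # -> 1, i.e. predict default
```

The output is a class, not a number on an artificial scale. (Caveat: balance and income live on very different scales, so a real implementation would standardize the features before measuring Euclidean distance.)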