Machine Learning with R – Brett Lantz
Total Page:16
File Type:pdf, Size:1020Kb
Machine Learning with R Learn how to use R to apply powerful machine learning methods and gain an insight into real-world applications Brett Lantz BIRMINGHAM - MUMBAI Machine Learning with R Copyright © 2013 Packt Publishing All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews. Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book. Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information. First published: October 2013 Production Reference: 1211013 Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK. ISBN 978-1-78216-214-8 www.packtpub.com Cover Image by Abhishek Pandey ([email protected]) Credits Author Project Coordinator Brett Lantz Anugya Khurana Reviewers Proofreaders Jia Liu Simran Bhogal Mzabalazo Z. Ngwenya Ameesha Green Abhinav Upadhyay Paul Hindle Acquisition Editor Indexer James Jones Tejal Soni Lead Technical Editor Graphics Azharuddin Sheikh Ronak Dhruv Technical Editors Production Coordinator Pooja Arondekar Nilesh R. Mohite Pratik More Anusri Ramchandran Cover Work Nilesh R. Mohite Harshad Vairat About the Author Brett Lantz has spent the past 10 years using innovative data methods to understand human behavior. A sociologist by training, he was first enchanted by machine learning while studying a large database of teenagers' social networking website profiles. Since then, he has worked on interdisciplinary studies of cellular telephone calls, medical billing data, and philanthropic activity, among others. When he's not spending time with family, following college sports, or being entertained by his dachshunds, he maintains dataspelunking.com, a website dedicated to sharing knowledge about the search for insight in data. This book could not have been written without the support of my family and friends. In particular, my wife Jessica deserves many thanks for her patience and encouragement throughout the past year. My son Will (who was born while Chapter 10 was underway), also deserves special mention for his role in the writing process; without his gracious ability to sleep through the night, I could not have strung together a coherent sentence the next morning. I dedicate this book to him in the hope that one day he is inspired to follow his curiosity wherever it may lead. I am also indebted to many others who supported this book indirectly. My interactions with educators, peers, and collaborators at the University of Michigan, the University of Notre Dame, and the University of Central Florida seeded many of the ideas I attempted to express in the text. Additionally, without the work of researchers who shared their expertise in publications, lectures, and source code, this book might not exist at all. Finally, I appreciate the efforts of the R team and all those who have contributed to R packages, whose work ultimately brought machine learning to the masses. About the Reviewers Jia Liu holds a Master's degree in Statistics from the University of Maryland, Baltimore County, and is presently a PhD candidate in statistics from Iowa State University. Her research interests include mixed-effects model, Bayesian method, Boostrap method, reliability, design of experiments, machine learning and data mining. She has two year's experience as a student consultant in statistics and two year's internship experience in agriculture and pharmaceutical industry. Mzabalazo Z. Ngwenya has worked extensively in the field of statistical consulting and currently works as a biometrician. He holds an MSc in Mathematical Statistics from the University of Cape Town and is at present studying for a PhD (at the School of Information Technology, University of Pretoria), in the field of Computational Intelligence. His research interests include statistical computing, machine learning, and spatial statistics. Previously, he was involved in reviewing Learning RStudio for R Statistical Computing (Van de Loo and de Jong, 2012), and R Statistical Application Development by Example beginner's guide (Prabhanjan Narayanachar Tattar , 2013). Abhinav Upadhyay finished his Bachelor's degree in 2011 with a major in Information Technology. His main areas of interest include machine learning and information retrieval. In 2011, he worked for the NetBSD Foundation as part of the Google Summer of Code program. During that period, he wrote a search engine for Unix manual pages. This project resulted in a new implementation of the apropos utility for NetBSD. Currently, he is working as a Development Engineer for SocialTwist. His day-to-day work involves writing system level tools and frameworks to manage the product infrastructure. He is also an open source enthusiast and quite active in the community. In his free time, he maintains and contributes to several open source projects. www.PacktPub.com Support files, eBooks, discount offers and more You might want to visit www.PacktPub.com for support files and downloads related to your book. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks. TM http://PacktLib.PacktPub.com Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books. Why Subscribe? • Fully searchable across every book published by Packt • Copy and paste, print and bookmark content • On demand and accessible via web browser Free Access for Packt account holders If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access. Table of Contents Preface 1 Chapter 1: Introducing Machine Learning 5 The origins of machine learning 6 Uses and abuses of machine learning 8 Ethical considerations 9 How do machines learn? 10 Abstraction and knowledge representation 11 Generalization 14 Assessing the success of learning 16 Steps to apply machine learning to your data 17 Choosing a machine learning algorithm 18 Thinking about the input data 18 Thinking about types of machine learning algorithms 20 Matching your data to an appropriate algorithm 22 Using R for machine learning 23 Installing and loading R packages 24 Installing an R package 24 Installing a package using the point-and-click interface 25 Loading an R package 27 Summary 27 Chapter 2: Managing and Understanding Data 29 R data structures 30 Vectors 30 Factors 31 Lists 32 Data frames 35 Matrixes and arrays 37 Table of Contents Managing data with R 39 Saving and loading R data structures 39 Importing and saving data from CSV files 40 Importing data from SQL databases 41 Exploring and understanding data 42 Exploring the structure of data 43 Exploring numeric variables 44 Measuring the central tendency – mean and median 45 Measuring spread – quartiles and the five-number summary 47 Visualizing numeric variables – boxplots 49 Visualizing numeric variables – histograms 51 Understanding numeric data – uniform and normal distributions 53 Measuring spread – variance and standard deviation 54 Exploring categorical variables 56 Measuring the central tendency – the mode 57 Exploring relationships between variables 58 Visualizing relationships – scatterplots 59 Examining relationships – two-way cross-tabulations 61 Summary 63 Chapter 3: Lazy Learning – Classification Using Nearest Neighbors 65 Understanding classification using nearest neighbors 66 The kNN algorithm 67 Calculating distance 70 Choosing an appropriate k 71 Preparing data for use with kNN 72 Why is the kNN algorithm lazy? 74 Diagnosing breast cancer with the kNN algorithm 75 Step 1 – collecting data 76 Step 2 – exploring and preparing the data 77 Transformation – normalizing numeric data 79 Data preparation – creating training and test datasets 80 Step 3 – training a model on the data 81 Step 4 – evaluating model performance 83 Step 5 – improving model performance 84 Transformation – z-score standardization 84 Testing alternative values of k 86 Summary 87 Chapter 4: Probabilistic Learning – Classification Using Naive Bayes 89 Understanding naive Bayes 90 Basic concepts of Bayesian methods 91 Probability 91 Joint probability 92 [ ii ] Table of Contents Conditional probability with Bayes' theorem 93 The naive Bayes algorithm 95 The naive Bayes classification 96 The Laplace estimator 98 Using numeric features with naive Bayes 100 Example – filtering mobile phone spam with the naive Bayes algorithm 101 Step