Hui Lin and Ming Li

Introduction to Data Science

Contents

List of Tables

List of Figures

Preface

About the Authors

1 Introduction
  1.1 A brief history of data science
  1.2 Data science role and skill tracks
    1.2.1 Engineering
    1.2.2 Analysis
    1.2.3 Modeling
  1.3 What kind of questions can data science solve?
    1.3.1 Prerequisites
    1.3.2 Problem type
  1.4 Structure data science team
  1.5 List of potential data science careers

2 Soft Skills for Data Scientists
  2.1 Comparison between Statistician and Data Scientist
  2.2 Beyond Data and Analytics
  2.3 Three Pillars of Knowledge
  2.4 Data Science Project Cycle
    2.4.1 Types of Data Science Projects
    2.4.2 Problem Formulation and Project Planning Stage
    2.4.3 Project Modeling Stage
    2.4.4 Model Implementation and Post Production Stage
    2.4.5 Project Cycle Summary
  2.5 Common Mistakes in Data Science
    2.5.1 Problem Formulation Stage
    2.5.2 Project Planning Stage
    2.5.3 Project Modeling Stage
    2.5.4 Model Implementation and Post Production Stage
    2.5.5 Summary of Common Mistakes

3 Introduction to The Data
  3.1 Customer Data for A Clothing Company
  3.2 Swine Disease Breakout Data
  3.3 MNIST Dataset
  3.4 IMDB Dataset

4 Big Data Cloud Platform
  4.1 Power of Cluster of Computers
  4.2 Evolution of Cluster Computing
    4.2.1 Hadoop
    4.2.2 Spark
  4.3 Introduction of Cloud Environment
    4.3.1 Open Account and Create a Cluster
    4.3.2 R Notebook
    4.3.3 Markdown Cells
  4.4 Leverage Spark Using R Notebook
  4.5 Databases and SQL
    4.5.1 History
    4.5.2 Database, Table and View
    4.5.3 Basic SQL Statement
    4.5.4 Advanced Topics in Database

5 Data Pre-processing
  5.1 Data Cleaning
  5.2 Missing Values
    5.2.1 Impute missing values with median/mode
    5.2.2 K-nearest neighbors
    5.2.3 Bagging Tree
  5.3 Centering and Scaling
  5.4 Resolve Skewness
  5.5 Resolve Outliers
  5.6 Collinearity
  5.7 Sparse Variables
  5.8 Re-encode Dummy Variables

6 Data Wrangling
  6.1 Summarize Data
    6.1.1 dplyr package
    6.1.2 apply(), lapply() and sapply() in base R
  6.2 Tidy and Reshape Data

7 Model Tuning Strategy
  7.1 Variance-Bias Trade-Off
  7.2 Data Splitting and Resampling
    7.2.1 Data Splitting
    7.2.2 Resampling

8 Measuring Performance
  8.1 Regression Model Performance
  8.2 Classification Model Performance
    8.2.1 Confusion Matrix
    8.2.2 Kappa Statistic
    8.2.3 ROC
    8.2.4 Gain and Lift Charts

9 Regression Models
  9.1 Ordinary Least Square
    9.1.1 The Magic P-value
    9.1.2 Diagnostics for Linear Regression
  9.2 Principal Component Regression and Partial Least Square

10 Regularization Methods
  10.1 Ridge Regression
  10.2 LASSO
  10.3 Variable selection property of the lasso
  10.4 Elastic Net
  10.5 Penalized Generalized Linear Model
    10.5.1 Introduction to glmnet package
    10.5.2 Penalized logistic regression

11 Tree-Based Methods
  11.1 Tree Basics
  11.2 Splitting Criteria
  11.3 Tree Pruning
  11.4 Regression and Decision Tree Basic
    11.4.1 Regression Tree
    11.4.2 Decision Tree
  11.5 Bagging Tree
  11.6 Random Forest
  11.7 Gradient Boosted Machine
    11.7.1 Adaptive Boosting
    11.7.2 Stochastic Gradient Boosting
    11.7.3 Boosting as Additive Model

12 Deep Learning
  12.1 Projection Pursuit Regression
  12.2 Feedforward Neural Network
    12.2.1 Logistic Regression as Neural Network
    12.2.2 Stochastic Gradient Descent
    12.2.3 Deep Neural Network
    12.2.4 Activation Function
    12.2.5 Optimization
    12.2.6 Deal with Overfitting
    12.2.7 Image Recognition Using FFNN
  12.3 Convolutional Neural Network
    12.3.1 Convolution Layer
    12.3.2 Padding Layer
    12.3.3 Pooling Layer
    12.3.4 Convolution Over Volume
    12.3.5 Image Recognition Using CNN
  12.4 Recurrent Neural Network
    12.4.1 RNN Model
    12.4.2 Word Embedding
    12.4.3 Long Short Term Memory
    12.4.4 Sentiment Analysis Using RNN

Appendix

A Handling Large Local Data
  A.1 readr
  A.2 data.table - enhanced data.frame

B R code for data simulation
  B.1 Customer Data for Clothing Company
  B.2 Swine Disease Breakout Data

Bibliography

Index

List of Tables


List of Figures

1.1 Different roles in data science and the skill requirements

2.1 Comparison of Statistician and Data Scientist
2.2 Three Pillars of Knowledge

3.1 Sample of MNIST dataset

4.1 Single computer (left) and a cluster of computers (right)
4.2 Example of R Notebook
4.3 Example of Markdown Cell

5.1 Data Pre-processing Outline
5.2 Example of skewed distributions

7.1 Types of Model Error
7.2 High bias model
7.3 High variance model
7.4 Parameter Tuning Process
7.5 Compare Maximum Dissimilarity Sampling with Random Sampling
7.6 Divide data according to time
7.7 5-fold cross-validation

9.1 Correlation Matrix Plot for Explanatory Variables
9.2 Linear Regression Diagnostic Plots: residual plot (top left), Q-Q plot (top right), standardized residuals plot (lower left), Cook's distance (lower right)

10.1 Test mean squared error for the ridge regression
10.2 Test mean squared error for the lasso regression
10.3 Contours of the RSS and constraint functions for the lasso (left) and ridge regression (right)

11.1 This is a classification tree trained from passenger survival data from the Titanic. Survival probability is predicted using sex and age.
11.2 Example of decision regions for different types of flowers based on petal length and width. Three different types of flowers are classified using decisions about length and width.

12.1 Feedforward Neural Network
12.2 1-layer Neural Network
12.3 Sigmoid Function
12.4 Hyperbolic Tangent Function
12.5 Rectified Linear Unit Function
12.6 Rectified Linear Unit Function
12.7 The intuition behind the weighted average gradient
12.8 Exponentially weighted averages with and without correction
12.9 Grayscale image is a set of pixels on 2-d space. Each pixel has a value range from 0 to 255.
12.10 Color image is a set of pixels on 3-d space. Each pixel has a value range from 0 to 255.
12.11 There are an input image (left), a filter (middle), and an output image (right).
12.12 Convolution step by step
12.13 Edge Detection Example
12.14 Simple Edge Detection Example
12.15 Convolution Layer Parameters
12.16 Padding Layer
12.17 Pooling Layer
12.18 Example of max and mean pooling
12.19 Example of convolution over volume
12.20 Recurrent Neural Network Unit
12.21 An Unrolled Recurrent Neural Network

Preface

During the early years of our careers as data scientists, we were bewildered by all kinds of data science hype. Many basic terms, such as "big data," "artificial intelligence," and "data science," lacked clear definitions. How big is big data? What is data science? What is the difference between the sexy title "Data Scientist" and the traditional "Data Analyst?" The term data science stirs up many associations: machine learning, deep learning, data mining, pattern recognition. As real-world data scientists, we found all of this confusing and vague. However, we always felt there was something tangible in data science applications, and the field has been developing very fast. After applying data science for many years, we now have a much better idea of what data science is in general. This book is our endeavor to make data science a more concrete and legitimate field. In addition to the "hard" technical aspects, the book also covers soft skills and career development in data science.

Goal of the Book

This is a book on data science with a specific focus on industrial experience. Data science is a cross-disciplinary subject that requires hands-on experience and exposure to business problem solving. The majority of existing introductory books on data science focus on modeling techniques and the implementation of models using R or Python. However, many of these books lack the context of the industrial environment. Moreover, a crucial part, the art of data science in practice, is often missing. This book intends to fill that gap.


Some key features of this book are as follows:

• It covers both technical and soft skills.
• It has a chapter dedicated to the big data cloud environment. In the industry, the practice of data science often takes place in such an environment.
• It is hands-on. We provide the data and repeatable R and Python code in notebooks. Readers can repeat the analysis in the book using the data and code provided. We also suggest that readers modify the notebooks to perform their own analyses with their own data and problems whenever possible. The best way to learn data science is to do it!
• It focuses on the skills needed to solve real-world industrial problems rather than serving as an academic text.

What This Book Covers

This book provides a comprehensive introduction to various data science fields, the soft and programming skills needed in data science projects, and potential career paths for data scientists. There are many existing data science books, including:

• An Introduction to Data Science by Saltz and Stanton
• A Hands-On Introduction to Data Science by Chirag Shah
• Introduction to Data Science: Data Analysis and Prediction Algorithms with R by Rafael Irizarry
• Build a Career in Data Science by Robinson and Nolis
• Data Science (The MIT Press Essential Knowledge series) by Kelleher and Tierney
• The Data Science Handbook by Field Cady
• Data Science from Scratch: First Principles with Python (2nd Edition) by Joel Grus

As data science fields are fast-growing, we did not find a book covering all the contents we feel are essential for a data scientist.

The following table compares the data science books mentioned above with this book for content in Big Data (Spark), R Code, Python Code, Data Preprocessing, Deep Learning, ML Models, Career Path, and Project Cycle. We use 0 for not covered, 1 for minimal coverage, 2 for some coverage, and 3 for extensive coverage.

Author      Big Data (Spark)  R Code  Python Code  Data Preprocessing  Deep Learning  ML Models  Career Path  Project Cycle
Lin and Li  3                 3       3            3                   3              3          3            3
Saltz       1                 3       0            1                   0              3          0            0
Shah        0                 3       3            1                   0              3          1            0
Irizarry    0                 3       0            3                   0              3          0            0
Robinson    0                 0       0            0                   0              0          3            3
Kelleher    1                 0       0            1                   0              2          3            3
Cady        0                 0       3            3                   0              3          1            1
Grus        1                 0       3            2                   3              3          1            0

The book is organized as follows.

• Chapters 1-3 discuss various aspects of data science: different tracks, career paths, project cycles, soft skills, and common pitfalls. Chapter 3 is an overview of the data sets used in the book.
• Chapter 4 introduces typical big data cloud platforms and uses the R library sparklyr as an interface to the big data analytics engine Spark.
• Chapters 5-6 cover the essential skills to prepare the data for further analysis and modeling, i.e., data preprocessing and wrangling.
• Chapter 7 illustrates the practical aspects of model tuning. It covers different types of model error, sources of model error, hyperparameter tuning, how to set up your data, and how to make sure your model implementation is correct. In practice, applying machine learning is a highly iterative process. We discuss this before introducing specific machine learning algorithms because it applies to nearly all models. You will use cross-validation or training/development/testing splits to tune the models presented in later chapters.

• Chapters 8-12 introduce different types of models. There is a myriad of learning algorithms to learn the data patterns. This book doesn't cover all of them but presents the most common ones or the foundational methods.

Who This Book Is For

This book is for readers who want to explore potential data science career paths and eventually become data scientists. Traditional data-related practitioners such as statisticians, business analysts, and data analysts will find this book helpful for expanding their skills toward future data science careers. Undergraduate and graduate students from analytics-related areas will find this book beneficial for learning real-world data science applications. Non-mathematical readers will appreciate the reproducibility of the companion R and Python code.

How to Use This Book

What the book assumes

The first two chapters have no prerequisites. The rest of the chapters require R or Python programming experience and undergraduate-level statistics. This book does NOT try to teach the readers to program in the basic sense. It assumes that readers have experience with R or Python. If you are new to these programming languages, you may find the code obscure. We provide some references in the Complementary Reading section that can help you fill the gap.

For some chapters (5, 7-12), readers need to know elementary linear algebra (such as matrix manipulations) and understand basic statistical concepts (such as correlation and simple linear regression). While the book avoids complex equations where possible, a mathematical background is helpful for a deeper dive into the mechanisms behind the advanced topics and their applications.

How to run R and Python code

This book uses R in the main text and provides most of the Python code on GitHub.

Use R code. You should be able to repeat the R code in your local R console or RStudio in all the chapters except for Chapter 4. The code in each chapter is self-sufficient, and you don't need to run the code in previous chapters first to run the code in the current chapter. Within a chapter, however, you do need to run the code from the beginning. At the beginning of each chapter, there is a code block for installing and loading all required packages. We also provide the .rmd notebooks that include the code to make it easier for you to repeat it. Refer to this page http://bit.ly/3r7cV4s for a table with the links to the notebooks.
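As a quick, hedged illustration of that per-chapter setup block, here is a minimal sketch of the install-and-load pattern in R; the package names below (dplyr and ggplot2) are placeholders for illustration, not the actual list required by any particular chapter.

```r
# A minimal sketch of a chapter setup block: install any missing packages,
# then load them. The package names here are placeholders for illustration.
pkgs <- c("dplyr", "ggplot2")
missing <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if (length(missing) > 0) install.packages(missing)
# load all required packages
invisible(lapply(pkgs, library, character.only = TRUE))
```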

To repeat the code on big data and cloud platforms (Chapter 4), you need to use Databricks, a cloud data platform. We use Databricks because:

• It provides a user-friendly web-based notebook environment that can create a Spark cluster on the fly to run R/Python/Scala/SQL scripts.
• It has a free community edition that is convenient for teaching purposes.

Follow the instructions in section 4.3 on the process of setting up and using the Spark environment.

Use Python code. We provide Python notebooks for all the chapters on GitHub. Refer to this page http://bit.ly/3r7cV4s for a table with the links to the notebooks. Like the R notebooks, you should be able to repeat all notebooks on your local machine except for Chapter 4, for the reasons stated above. An easy way to repeat a notebook is to import and run it in Google Colab. To use Colab, you only need to log in to your Google account in the Chrome browser. To load a notebook into your Colab, you can do any of the following:

• Click the "Open in Colab" icon at the top of each linked notebook using the Chrome browser. It should load the notebook and open it in your Colab.

• In your Colab, choose File -> Upload notebook -> GitHub. Copy and paste the notebook's link in the box, search, and select the notebook to load it. For example, you can load the Python notebook for data preprocessing this way.

To repeat the code for big data, as with the R notebooks, you need to set up Spark in Databricks. Follow the instructions in section 4.3 on the process of setting up and using the Spark environment. Then, run the "Create Spark Data" notebook to create Spark data frames. After that, you can run the pyspark notebook to learn how to use pyspark.

Complementary Reading

If you are new to R, we recommend R for Marketing Research and Analytics by Chris Chapman and Elea McDonnell Feit. The book is practical and provides repeatable R code. Parts I and II of the book cover the basics of R programming and foundational statistics. It is an excellent book on marketing analytics.

If you are new to Python, we recommend the Python version of the book mentioned above, Python for Marketing Research and Analytics by Jason Schwarz, Chris Chapman, and Elea McDonnell Feit.

If you want to dive deeper into some of the book's topics, there are many places to learn more.

• For machine learning, Python Machine Learning (3rd Edition) by Raschka and Mirjalili is a good book on implementing machine learning in Python. Applied Predictive Modeling by Kuhn and Johnson is an applied, practitioner-friendly text using the R package caret.
• For statistical models in R, a recommended book is An Introduction to Statistical Learning (ISL) by James, Witten, Hastie, and Tibshirani. A more advanced treatment of the topics in ISL is The Elements of Statistical Learning by Friedman, Tibshirani, and Hastie.

Acknowledgements

About the Authors

Hui Lin is currently a Quant Researcher at Google. Before joining Google, Hui held different roles in data science. She was the head of data science at Netlify, where she built and led the data science team, and a Data Scientist at DuPont, where she did a broad range of predictive analytics and market research analysis. She is the co-founder of the Central Iowa R User Group, blogger at https://scientistcafe.com/, and the 2018 Program Chair of the ASA Statistics in Marketing Section. She enjoys making analytics accessible to a broad audience and teaches tutorials and workshops for practitioners on data science. She holds an MS and a Ph.D. in statistics from Iowa State University.

Ming Li is currently a Senior Research Scientist at Amazon and an Adjunct Instructor at the University of Washington. He was the Chair of the Quality & Productivity Section of the American Statistical Association for 2017. He was a Data Scientist at Walmart and a Statistical Leader at General Electric Global Research Center. He obtained his Ph.D. in Statistics from Iowa State University in 2010. With a deep statistics background and several years of experience in data science, he has trained and mentored numerous junior data scientists with different backgrounds, such as statisticians, programmers, software developers, and business analysts. He was also an instructor of Amazon's internal Machine Learning University and was one of the key founding members of Walmart's Analytics Rotational Program.


1 Introduction

Data science is a fast-developing area. This chapter discusses various aspects of data science. What are the different career tracks and skills required in data science? How is the data science team structured? We focus on the demand side of data science, that is, on business and industry requirements and applications. We hope this chapter will provide a complementary perspective to other data science books.

1.1 A brief history of data science

Interest in data science-related careers is at an all-time high and has exploded in popularity in the last few years. Data scientists today come from various backgrounds. If someone ran into you and asked what data science is all about, what would you tell them? It is not an easy question to answer. Data science is one of those areas that everyone is talking about, but no one can define well. The media have been hyping "Data Science," "Big Data," and "Artificial Intelligence" over the past few years. There is an amusing statement from the internet:

“When you’re fundraising, it’s AI. When you’re hiring, it’s ML. When you’re implementing, it’s logistic regression.”


For outsiders, data science is whatever magic can get useful information out of data. Everyone has heard about big data, and data science trainees now need the skills to cope with such big data sets. What are those skills? You may hear about Hadoop, a system that uses Map/Reduce to process large data sets distributed across a cluster of computers, or about Spark, a system built atop Hadoop that speeds things up by loading massive datasets into shared memory (RAM) across clusters, with an additional suite of machine learning functions for big data. These new skills are for dealing with the organizational artifacts of large data sets, those beyond a single computer's memory or hard disk, and with large-scale cluster computing; they are not for solving the real problem better. A lot of data means more sophisticated tinkering with computers, especially with a cluster of computers. The computing and programming skills needed to handle big data were the biggest hurdle for traditional analysis practitioners to become successful data scientists. However, this hurdle has been significantly reduced by the cloud computing revolution, as described in Chapter 4. After all, it isn't the size of the data that's important; it's what you do with it. Your first reaction to all of this might be some combination of skepticism and confusion. We want to address this upfront: we had that exact reaction.

To declutter, let's start with a brief history of data science. If you hit up the Google Trends website, which shows search keyword information over time, and check the term "data science," you will find that the history of data science goes back a little further than 2004. The way the media describe it, you may feel that machine learning algorithms are new and that there was never "big" data before Google. That is not true. There are new and exciting developments in data science, but many of the techniques we are using are based on decades of work by statisticians, computer scientists, mathematicians, and scientists of many other fields.

In the early 19th century, when Legendre and Gauss came up with the least-squares method for linear regression, probably only physicists would use it to fit linear regressions to their data. Now, nearly anyone can build a linear regression in Excel with just a little bit of self-guided online training. In 1936, Fisher came up with linear discriminant analysis. In the 1940s, we had another widely used model, logistic regression. In the 1970s, Nelder and Wedderburn formulated the "generalized linear model (GLM)," which:

“generalized linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.” [from Wikipedia]

By the end of the 1970s, there was a range of analytical models, and most of them were linear because computers were not powerful enough to fit non-linear models until the 1980s. In 1984, Breiman introduced the classification and regression tree (CART), one of the oldest and most utilized classification and regression techniques (Breiman et al., 1984). After that, Ross Quinlan came up with more tree algorithms, such as ID3, C4.5, and C5.0. In the 1990s, ensemble techniques (methods that combine many models' predictions) began to appear. Bagging is a general approach that uses bootstrapping in conjunction with a regression or classification model to construct an ensemble. Based on the ensemble idea, Breiman came up with the random forest model in 2001 (Breiman, 2001a). In the same year, Leo Breiman published the paper "Statistical Modeling: The Two Cultures"1 (Breiman, 2001b), in which he pointed out two cultures in the use of statistical modeling to get information from data:

(1) Data is from a given stochastic data model

(2) The data mechanism is unknown, and people approach the data using algorithmic models

1 http://www2.math.uu.se/~thulin/mm/breiman.pdf

Most of the classical statistical models are of the first type, stochastic data models. Black-box models, such as random forest, GBM, and deep learning, are algorithmic models. As Breiman pointed out, algorithmic models can be used on large, complex data as a more accurate and informative alternative to stochastic data modeling on smaller data sets. Those algorithms have developed rapidly, with much-expanded applications in fields outside traditional statistics. That is one of the most important reasons that statisticians are not the mainstream of today's data science, both in theory and in practice. We observe that Python is passing R as the most commonly used language in data science, mainly due to many data scientists' backgrounds. Since 2000, the approaches to getting information out of data have shifted from traditional statistical models to a more diverse toolbox that includes machine learning and deep learning models. To help readers who are traditional data practitioners, we provide both R and Python code.

What is the driving force behind this shifting trend? John Tukey identified four forces driving data analysis (there was no "data science" back in 1962):

1. The formal theories of math and statistics
2. Acceleration of developments in computers and display devices
3. The challenge, in many fields, of more and ever larger bodies of data
4. The emphasis on quantification in an ever-wider variety of disciplines

Tukey's 1962 list is surprisingly modern. Let's inspect those points in today's context. People usually develop theories way before they find potential applications. In the past 50 years, statisticians, mathematicians, and computer scientists have laid the theoretical groundwork for constructing today's "data science." The development of computers enables us to apply the algorithmic models (which can be very computationally expensive) and deliver results in a friendly and intuitive way. The striking transition to the internet and the internet of things generates vast amounts of commercial data, and industries have sensed the value of exploiting that data. Data science seems sure to be a significant preoccupation of commercial life in the coming decades. All four forces John identified exist today and have been driving data science. The toolbox and the applications have been expanding fast, benefiting from the increasing availability of digitized information and the possibility of distributing it through the internet. Today, people apply data science in many areas, including business, health, biology, social science, politics, etc. Data science is now everywhere. But what is today's data science?

1.2 Data science role and skill tracks

There is a widely diffused Chinese parable about a group of blind men conceptualizing what an elephant is like by touching it. The first person, whose hand landed on the trunk, said: "This being is like a thick snake." For another, whose hand reached its ear, it seemed like a fan. Another person, whose hand was upon its leg, said the elephant is a pillar, like a tree trunk. The blind man who placed his hand upon its side said: "The elephant is a wall." Another, who felt its tail, described it as a rope. The last felt its tusk and stated that the elephant is hard, smooth, and like a spear.

Data science is the elephant. With the data science hype picking up steam, many professionals changed their titles to "Data Scientist" without any necessary qualifications. Today's data scientists have vastly different backgrounds, yet each conceptualizes the elephant based on his or her professional training and application area. And to make matters worse, most of us are not even fully aware of our conceptualizations, much less the uniqueness of the experience from which they are derived.

“We don’t see things as they are, we see them as we are. [by Anais Nin]”

It is annoying but true. So, the answer to the question "what is data science?" depends on who you are talking to. Who might you be talking to, then? Data science has three main skill tracks: engineering, analysis, and modeling (and yes, the order matters!). There are some representative skills in each track. Different tracks and combinations of tracks define different roles in data science.2

When people talk about all the machine learning and AI algorithms, they often overlook the critical data engineering part that makes everything possible. Data engineering is the unseen iceberg under the water's surface. Does your company need a data scientist? You are not ready for a data scientist if you don't have a data engineer yet. You need the ability to get data before making sense of it. If you only deal with small datasets of formatted data, you may be able to get by with plain text files such as CSV (i.e., comma-separated values) or even Excel spreadsheets. As data increases in volume, variety, and velocity, data engineering becomes a sophisticated discipline in its own right.

1.2.1 Engineering

Data engineering is the foundation that makes everything else possible. It mainly involves building the data pipeline infrastructure. In the (not so distant) old days, when data was stored on local servers, computers, or other devices, building the data infrastructure could be a massive IT project. It involved the software and hardware used to store the data and perform the data ETL (i.e., extract, transform, and load) process. With the development of cloud services, it has become the new norm to store and compute data on the cloud. Data engineering today, at its core, is software engineering with data flow as the focus. The fundamental building block for automation is maintaining the data pipeline through modular, well-commented code and version control.

2 This is based on Industry recommendations for academic data science programs (https://github.com/brohrer/academic_advisory), with modifications. It is a collection of thoughts from different data scientists across industries about what a data scientist does and what differentiates an exceptional data scientist.

(1) Data environment

Designing and setting up the entire environment to support the data science workflow is the prerequisite for data science projects. It may include setting up storage in the cloud, a Kafka platform, Hadoop and Spark clusters, etc. Each company has unique data conditions and needs. The data environment will differ depending on the size of the data, the update frequency, the complexity of the analytics, compatibility with the back-end infrastructure, and (of course) the budget.

(2) Data management

Automated data collection is a common task that includes parsing logs (depending on the stage of the company and the type of industry you are in), web scraping, API queries, and interrogating data streams. Data management also means determining and constructing the data schema to support analytical and modeling needs, and using tools, processes, and guidelines to ensure data is correct, standardized, and documented.

(3) Production

If you want to integrate the model or analysis into the production system, you have to automate all the data handling steps. This involves the whole pipeline, from data access and preprocessing to modeling and final deployment. It is necessary to make the system work smoothly with all existing software stacks, so it requires monitoring the system through robust measures, such as rigorous error handling, fault tolerance, and graceful degradation, to make sure the system is running smoothly and the users are happy.

1.2.2 Analysis

Analysis turns raw information into insights in a fast and often exploratory way. In general, an analyst needs to have decent domain knowledge, do exploratory analysis efficiently, and present the results using storytelling.

(1) Domain knowledge

Domain knowledge is the understanding of the organization or industry where you apply data science. You can't make sense of data without context. Some questions about the context are:

• What are the critical metrics for this kind of business?
• What are the business questions?
• What type of data do they have, and what does the data represent?
• How do we translate a business need into a data problem?
• What has been tried before, and with what results?
• What are the accuracy-cost-time trade-offs?
• How can things fail?
• What are other factors not accounted for?
• What are the reasonable assumptions, and which are faulty?

In the end, domain knowledge helps you deliver the results in an audience-friendly way, with the right solution to the right problem.

(2) Exploratory analysis

This type of analysis is about exploration and discovery. Rigorous conclusions are not a concern, which means the goal is to get insights driven by correlation, not causation. The latter requires more advanced statistical skills and is hence more expensive in time and resources. Instead, this role helps your team look at as much data as possible so that the decision-makers can get a sense of what's worth pursuing further. It often involves different ways to slice and aggregate data. An important thing to note here is that you should be careful not to draw conclusions beyond what the data support. You don't need to write production-level, robust code to perform well in this role.

(3) Storytelling

Storytelling with data is critical to delivering insights and driving better decision making. It is the art of telling people what the numbers signify. It usually requires data summarization, aggregation, and visualization. It is crucial to answer the following questions before you begin down the path of creating a data story.

• Who is your audience?
• What do you want your audience to know or do?
• How can you use data to help make your point?

A business-friendly report or an interactive dashboard is the typical outcome of the analysis.

1.2.3 Modeling

Modeling is a process that dives deeper into the data to discover patterns we don't readily see. A fancy machine learning model is the first thing that comes to mind when the general public thinks about data science. Unfortunately, fancy models occupy only a small part of a typical data scientist's day-to-day time. Nevertheless, many of those models are powerful tools.

(1) Supervised learning

In supervised learning, each sample corresponds to a response measurement. There are two flavors of supervised learning: regression and classification. In regression, the response is a real number, such as the total net sales in 2017 for a company or the corn yield for a state in a given year. The goal of regression is to approximate the response measurement as closely as possible. In classification, the response is a class label, such as a dichotomous response of yes/no. The response can also have more than two categories, such as four segments of customers. A supervised learning model is a function that maps some input variables (X), with corresponding parameters (beta), to a response (y). The modeling process adjusts the value of the parameters to make the mapping fit the given response; in other words, it minimizes the discrepancy between the given responses and the model output. When the response y is a real-valued number, it is intuitive to define the discrepancy as the squared difference between the model output and the response. When y is categorical, there are other ways to measure the difference, such as the area under the receiver operating characteristic curve (i.e., AUC) or information gain.
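As a minimal illustration of this idea (not code from the later chapters), the sketch below fits a one-predictor linear model in R on simulated, made-up data; lm() chooses the parameter values that minimize the squared discrepancy just described.

```r
# Simulated example: supervised learning as discrepancy minimization.
set.seed(1)
x <- runif(100, 0, 10)             # input variable X
y <- 2 + 0.5 * x + rnorm(100)      # response y generated with noise
fit <- lm(y ~ x)                   # least squares: pick beta0, beta1 to
                                   # minimize sum((y - beta0 - beta1*x)^2)
coef(fit)                          # estimated parameters, near 2 and 0.5
sum(residuals(fit)^2)              # the minimized squared discrepancy
```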

(2) Unsupervised learning

In unsupervised learning, there is no response variable. For a long time, the machine learning community overlooked unsupervised learning, except for one method called clustering. Moreover, many researchers thought that clustering was the only form of unsupervised learning. One reason is that it is hard to define the goal of unsupervised learning explicitly. Unsupervised learning can be used to do the following:

• Identify a good internal representation or pattern of the input that is useful for subsequent supervised or reinforcement learning, such as finding clusters.
• Reduce dimensionality by providing compact, low-dimensional representations of the input, such as factor analysis.
• Provide a reduced number of uncorrelated learned features from the original variables, such as in principal component regression.
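As a small illustration of the dimension-reduction use case, the sketch below runs principal component analysis on R's built-in mtcars data; the dataset and the decision to inspect two components are arbitrary and only for demonstration.

```r
# Unsupervised learning sketch: PCA as a dimension reduction tool.
# mtcars is a built-in R dataset; the choice is arbitrary for illustration.
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)
summary(pca)                 # variance explained by each component
head(pca$x[, 1:2])           # first two uncorrelated learned features
```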

(3) Customized model development

In most cases, after a business problem is fully translated into a data science problem, a data scientist needs to use out-of-the-box algorithms to solve the problem with the right data. But in some situations, there isn't enough data to use any machine learning model, or the question doesn't fit neatly into the specifications of existing tools, or the model needs to incorporate some prior domain knowledge. A data scientist may then need to develop new models to accommodate the subtleties of the problem at hand. For example, people may use Bayesian models to include domain knowledge as the prior distribution of the modeling process.

Here is a list of questions that can help you decide the type of technique to use:

• Is your data labeled? This one is straightforward, since supervised learning needs labeled data.
• Do you want to deploy your model at scale? There is a fundamental difference between building and deploying models. It is like the difference between making bread and making a bread machine. One is a baker who mixes and bakes ingredients according to recipes to make a variety of bread. The other is a machine builder who builds a machine to automate the process and produce bread at scale.
• Is your data easy to collect? One of the major sources of cost in deploying machine learning is collecting, preparing, and cleaning the data. Because model maintenance includes continuously collecting data to keep the model updated, the maintenance cost can be too high if the data collection process requires too much human labor.
• Does your problem have a unique context? If so, you may not be able to find any off-the-shelf method that directly applies to your question, and you may need to customize the model.

What other skills are needed? There are some common skills to have, regardless of the role people have in data science.

• Data preprocessing: the process nobody wants to go through yet nobody can avoid.

No matter what role you hold in the data science team, you will have to do some data cleaning, which tends to be the least enjoyable part of anyone's job. Data preprocessing is the process of converting raw data into clean data that is proper to use.

(1) Data preprocessing for data engineer

Getting data from different sources and dumping it into a data lake, a dumping ground of amorphous data, is far from producing the data schema an analyst or scientist would use. A data lake is a storage repository that stores a vast amount of raw data in its native format, including XML, JSON, CSV, Parquet, etc. Without further work, it is a data cesspool rather than a data lake. The data engineer's job is to get a clean schema out of the data lake by transforming and formatting the data. Some common problems to resolve are:

• Enforce new tables' schema to be the desired one
• Repair broken records in newly inserted data
• Aggregate the data to form tables with a proper granularity (see the sketch below)
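A minimal sketch of that aggregation step using dplyr (the package introduced in Chapter 6); the table, column names, and numbers below are hypothetical and made up for illustration.

```r
# Sketch: aggregating raw event-level data to a proper granularity
# (hypothetical column names; data is made up for illustration).
library(dplyr)
events <- data.frame(
  customer_id = c(1, 1, 2, 2, 2),
  order_date  = as.Date(c("2020-01-03", "2020-01-20", "2020-01-05",
                          "2020-01-06", "2020-02-01")),
  amount      = c(20, 35, 15, 40, 25)
)
# One row per customer per month instead of one row per event.
monthly <- events %>%
  mutate(month = format(order_date, "%Y-%m")) %>%
  group_by(customer_id, month) %>%
  summarise(orders = n(), total_amount = sum(amount), .groups = "drop")
monthly
```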

(2) Data preprocessing for data analyst and scientist

Preprocessing is not just for data engineers; it also occupies a large portion of data analysts' and scientists' working hours. A facility with, and a willingness to do, these tasks are a prerequisite for a good data scientist. If you are lucky as a data scientist, you may end up spending 50% of your time doing this. If you are like most of us, you will probably spend over 80% of your working hours wrangling data.

The data a data scientist gets can still be very rough, even if it comes from a nice and clean database that a data engineer set up. For example, dates and times are notorious for having many representations and time zone ambiguity. You may also get market survey responses from your clients in an Excel file where the table title spans multiple lines, or where the format does not meet the requirements, such as using 50% to represent a percentage rather than 0.5. In many cases, you need to put the data into the right format before moving on to analysis.
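As a small, hedged example of this kind of formatting work, the sketch below cleans a made-up column of date strings and a made-up column of percentage strings in base R; the column names and values are invented for illustration only.

```r
# Hypothetical raw survey data with formatting issues (values made up).
raw <- data.frame(
  response_date = c("2020/01/05", "2020/02/17"),   # dates stored as strings
  discount      = c("50%", "25%"),                 # percentages stored as text
  stringsAsFactors = FALSE
)
# Convert to proper types before analysis.
raw$response_date <- as.Date(raw$response_date, format = "%Y/%m/%d")
raw$discount <- as.numeric(sub("%", "", raw$discount)) / 100   # "50%" -> 0.5
str(raw)
```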

Even if the data is in the right format, there are other issues to solve before or during analysis and modeling. For example, variables can have missing values. Knowledge about the data collection process and what the data will be used for is necessary to decide how to handle the missing values. Also, different models have different requirements for the data. For example, some models may require a consistent scale; some may be susceptible to outliers or collinearity; some may not be able to handle categorical variables; and so on. The modeler has to preprocess the data to make it proper for the specific model.

Most people in data science today focus on one of the tracks. A small number of people are experts in two tracks. People who are proficient in all three? They are unicorns!

1.3 What kind of questions can data science solve?

1.3.1 Prerequisites

Data science is not a panacea, and there are problems data science can't help with. It is best to make this judgment as early in the analytical cycle as possible. First and foremost, we need to tell customers, clients, and stakeholders honestly and clearly when we think data analytics can't answer their question, after careful evaluation of the request, data availability, computational resources, and modeling details. Often, we can tell them what we can do as an alternative. It is essential to "negotiate" with others what data science can do specifically; simply answering "we cannot do what you want" will end the collaboration. Now let's see what kind of questions data science can solve:

1. The question needs to be specific enough

Let us look at the two examples below:

• Question 1: How can I increase product sales?
• Question 2: Is the new promotional tool introduced at the beginning of this year boosting the annual sales of P1197 in Iowa and Wisconsin? (P1197 is an impressive corn seed product from DuPont Pioneer.)

It is easy to see the difference between the two questions. Question 1 is grammatically correct, but it is not suitable for data analysis to answer. Why? It is too general. What is the response variable here? Product sales? Which product? Is it annual sales or monthly sales? What are the candidate predictors? We can hardly get any useful information from the question.

In contrast, question 2 is much more specific. From the analysis point of view, the response variable is clearly the "annual sales of P1197 in Iowa and Wisconsin." Even though we don't know all the predictors, the variable of interest, "the new promotional tool introduced early this year," is clear. We want to study the impact of the promotion on sales. We can start there and figure out, through further communication, the other variables that need to be included in the model.

As data scientists, we may start with general questions from customers, clients, or stakeholders and eventually get to more specific, data-science-solvable questions through a series of communication, evaluation, and negotiation. Effective communication and in-depth knowledge about the business problem are essential to converting a general business question into a solvable analytical problem. Domain knowledge helps data scientists communicate in a language other people can understand and obtain the required information.

However, defining the question and the variables involved doesn't guarantee that we can answer it. For example, we could encounter this situation with a well-defined supply chain problem. The client may ask us to estimate the stock needed for a product in a particular area. Why can't this question be answered? From a modeling perspective, we can try to fit various models, such as a Multivariate Adaptive Regression Spline (MARS) model, and find a reasonable solution. But it can turn out later that the client's data are estimated values, not actual observations. There is no good way for data science to solve the problem with the desired accuracy from inaccurate data.

2. You need to have sound and relevant data

One cannot make a silk purse out of a sow's ear. Data scientists need sound and relevant data. The supply chain problem mentioned above is a case in point: the data was relevant, but not sound, and all the later analytics based on that data was building on sand. Of course, data almost always has noise, but the noise has to be within a certain range. Generally speaking, the accuracy requirement for the independent variables of interest and the response variable is higher than for other variables. For question 2 above, these are the variables related to the "new promotion" and the "sales of P1197."

The data also has to be helpful for the question. If we want to predict which products consumers are most likely to buy in the next three months, we need to have historical purchasing data: the last purchase time, the invoice amount, coupons used, etc. Information about customers' credit card numbers, ID numbers, and email addresses will not help much.

Often, data quality is more important than quantity, but you cannot overlook quantity. If you can guarantee the quality, then usually the more data, the better. If we have a specific and reasonable question with sound and relevant data, then congratulations, we can start doing data science!

1.3.2 Problem type

Many data science books classify various models from a technical point of view, such as supervised vs. unsupervised models, linear vs. nonlinear models, parametric vs. non-parametric models, and so on. Here we will take a "problem-oriented" track instead. We first introduce different groups of real-world problems and then present which models can answer the corresponding category of questions.

1. Description

The primary analytic problem is to summarize and explore a data set with descriptive statistics (mean, standard deviation, and so forth) and visualization methods. It is the most straightforward problem and yet the most crucial and common one. We need to describe and explore the dataset before moving on to more complex analysis. For problems such as customer segmentation, after we cluster the sample, the next step is to figure out each class's profile by comparing the descriptive statistics of various variables. Questions of this kind are:

• What is the annual income distribution?
• Are there any outliers?
• What are the mean active days of different accounts?

Data description is often used to check data, find the appropriate data preprocessing method, and demonstrate the model results.
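A minimal sketch of such a description in R, using the built-in iris data purely as a stand-in for a real customer table:

```r
# Descriptive statistics and a quick look for outliers.
# iris is a built-in R dataset, used here only as a placeholder.
summary(iris$Sepal.Length)            # mean, quartiles, min/max
sd(iris$Sepal.Length)                 # standard deviation
boxplot(iris$Sepal.Length)            # visual check for outliers
aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)  # group profiles
```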

2. Comparison

The first common modeling problem is to compare different groups. Is A better in some way than B? Or, with more groups: is there any difference among A, B, and C in a particular aspect? Here are some examples:

• Are males more inclined to buy our products than females?
• Are there any differences in customer satisfaction across different business districts?
• Do soybeans carrying a particular gene have higher oil content?

For these problems, we usually start with some summary statistics and visualization by group. After a preliminary visualization, you can test the differences between the treatment and control groups statistically. The most commonly used statistical tests are the chi-square test, the t-test, and ANOVA. There are also Bayesian alternatives. In biology-related industries, such as new drug development and crop breeding, fixed/random/mixed effect models are standard techniques.
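For instance, a two-sample comparison takes one line in R; the sketch below uses simulated, made-up oil-content values only to show the mechanics of a t-test.

```r
# Simulated comparison of oil content between two soybean groups
# (numbers are made up for illustration).
set.seed(2)
gene_present <- rnorm(30, mean = 20.5, sd = 1.2)
gene_absent  <- rnorm(30, mean = 19.8, sd = 1.2)
t.test(gene_present, gene_absent)   # two-sample t-test for a mean difference
```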

3. Clustering

Clustering is a widespread problem, and it can answer questions like:

• How many reasonable customer segments are there based on historical purchase patterns?
• How are the customer segments different from each other?

Please note that clustering is unsupervised learning; there are no response variables. The most common clustering algorithms include K-means and hierarchical clustering.
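A hedged sketch of K-means in R on simulated purchase features; the number of segments (here 3) and the variables are arbitrary choices for illustration.

```r
# K-means clustering on simulated purchase features (made-up data).
set.seed(3)
purchases <- data.frame(
  frequency = c(rnorm(50, 2), rnorm(50, 6), rnorm(50, 12)),      # orders per month
  amount    = c(rnorm(50, 30, 5), rnorm(50, 80, 5), rnorm(50, 150, 5))  # avg spend
)
km <- kmeans(scale(purchases), centers = 3, nstart = 25)
table(km$cluster)          # segment sizes
km$centers                 # segment profiles on the scaled features
```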

4. Classification

For classification problems, there are one or more label columns to define the ground truth of the classes. We use the other features of the training dataset as explanatory variables for model training. We can then use the trained classifier to predict the labels of new observations. Here are some example questions:

• Is this customer likely to buy our product?
• Is the borrower going to pay us back?
• Is this email spam or not?

There are hundreds of different classifiers. In practice, we do not need to try all of them, only several models that generally perform well. For example, the random forest algorithm is usually used as a baseline model to set model performance expectations.
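A hedged sketch of such a baseline in R using the randomForest package (which must be installed separately), with the built-in iris data standing in for a real labeled business dataset:

```r
# Random forest as a quick baseline classifier.
# iris is a built-in dataset; a real problem would use business data.
library(randomForest)
set.seed(4)
fit <- randomForest(Species ~ ., data = iris, ntree = 500)
fit                                   # out-of-bag error as a baseline
predict(fit, newdata = head(iris))    # predicted labels for new observations
```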

5. Regression

In general, regression deals with questions like "how much is it?" and returns a numerical answer. In some cases, it is necessary to coerce the model results to be 0 or to round them to the nearest integer. Regression is still the most common type of problem in the data science world.

• What will the temperature be tomorrow?
• What is the projected net income for the next season?
• How much inventory should we have?

6. Optimization

Optimization is another common type of problem in data science: finding an optimal solution by tuning a few tunable variables, given other non-controllable environmental variables. It is an expansion of the comparison problem and can solve problems such as:

• What is the best route to deliver the packages?
• What is the optimal advertisement strategy to promote a new product?

1.4 Structure data science team

A vast amount of data has become available and readily accessible for analysis in many companies across different business sectors during the past decade. The size, complexity, and growth speed of the data are suddenly beyond the traditional scope of statistical analysis or business intelligence (i.e., BI) reporting. To leverage the big data being collected, do you need an internal data science team as a core competency, or can you outsource it? The answer depends on the problems you want to solve using data. If they are critical to the business, you can't afford to outsource them. Also, each company has its own business context, and it needs new kinds of data as the business grows and uses the results in novel ways. Being a data-driven organization requires cross-organization commitments to identify what data each department needs to collect, to establish the infrastructure and processes for collecting and maintaining that data, and to standardize how to deliver analytical results. Unfortunately, it is unlikely that an off-the-shelf solution will be flexible enough to adapt to the specific business context. In general, most companies establish their own data science team.

Where should the data science team fit? In general, the data science team is organized in one of three ways.

(1) A standalone team

Data science is an autonomous unit, parallel to the other organizations (such as engineering, product, etc.). The head of data science reports directly to senior leadership, ideally to the CEO or at least to someone who understands data strategy and is willing to invest in it and give it what it needs. The advantages of this type of data organization are:

• The data science team has autonomy and is well-positioned to tackle whatever problems it deems important to the company.
• It is advantageous for people in the data science team to share knowledge and grow professionally.
• It provides a clear career path for data science professionals and shows that the company treats data as a first-class asset. So, it tends to attract and retain top talent.

The biggest concern with this type of organization is the risk of marginalization. Data science only has value if data drives action, which requires collaboration among data scientists, engineers, product managers, and other business stakeholders across the organization. If you have a standalone data science team, it is critical to choose a data science leader who is knowledgeable about the applications of data science in different areas and has strong inter-disciplinary communication skills. The head of data science needs to build strong collaboration with other departments.

As companies grow, each department prefers to be self-sufficient and tries to hire its own data personnel under different titles, even when it can get support from the standalone data science team. That is why it is unlikely for an already mature company to have a standalone data science team. If you start your data science team in the early stage as a startup, it is important that the CEO sets a clear vision from the beginning and sends a strong message to the whole company about how to access data support.

(2) An embedded model

There is still a head of data science, but his or her role is mostly that of a hiring manager and coach, and he or she may report to a senior manager in the engineering department. The data science team brings in talented people and farms them out to the rest of the company. In other words, it gives up autonomy to ensure utility. The advantages are:

• Data science is closer to its applications.
• There is still a data science group, so it is easy to share knowledge.
• There is high flexibility to allocate data science resources across the company.

However, there are also concerns.

• It makes management difficult, since the designated team's lead is not responsible for the data science professionals' growth and happiness, while the data science managers are not directly vested in their work.
• Data scientists are second-class citizens everywhere, and it is hard to attract and retain top talent.

(3) Integrated team

There is no separate data science team. Each team hires its own data science people. For example, a marketing analytics group may consist of a data engineer, a data analyst, and a data scientist. The team leader is a marketing manager who has an analytical mind and in-depth business knowledge. The advantages are apparent.

• Data science resources align with the organization very well.
• Data science professionals are first-class members and are valued in their team. The manager is responsible for the data science professionals' growth and happiness.
• The insights from the data are quickly put into action.

It works well in the short term for both the company and the data science hires. However, there are also many concerns.

• It sacrifices the data science hires' professional growth, since they work in silos and specialize in a specific application. It is also difficult to share knowledge across different applied areas.
• It is harder to move people around, since they are highly associated with the specific function of a specific organization.
• There is no career path for data science people, and it is difficult to retain talent.

There is no universal answer to the best way to organize the data science team; it depends on the answers to many other questions. How important do you think the data science team is for your company? At what stage is your company when you start to build the data science team? Are you a startup or a relatively mature company? How valuable is it to use data to tell the truth, and how dangerous is it to use data to affirm existing opinions?

Data science has its own skill set, workflow, tooling, integration process, and culture. If it is critical to your organization, it is best not to bury it under any one part of the organization. Otherwise, data science will only serve the needs of a specific branch. No matter which way you choose, be aware of both sides of the coin. If you are looking for a data science position, it is crucial to know where the data science team fits in.

1.5 List of potential data science careers

As companies learn to use data to help the business, different data science roles continue to specialize. As a result, the old "data scientist" title is fading, and other data science job titles are emerging. A misunderstanding of data science's fundamental work leads to confusing job postings and frustration for both stakeholders and data scientists: stakeholders are frustrated that they aren't getting what they expect, and data scientists are frustrated that their talent is not appreciated. We are glad to see that change is underway. Here is a list of today's data science job titles. Some of them are relatively new; the others have been around for some time but are now better defined.

Role                          Skills
Data infrastructure engineer  Go, Python, AWS/Google Cloud/Azure, Logstash, Kafka, Hadoop
Data engineer                 Spark/Scala, Python, SQL, AWS/Google Cloud/Azure, data modeling
BI engineer                   Tableau/Looker/Mode, etc., data visualization, SQL, Python
Data analyst                  SQL, basic statistics, data visualization
Data scientist                R/Python, SQL, basic + applied statistics, data visualization, experimental design
Research scientist            R/Python, advanced statistics + experimental design, ML, research background, publications, conference contributions, algorithms
Applied scientist             ML algorithm design, often with an expectation of fundamental software engineering skills
Machine learning engineer     More advanced software engineering skillset, algorithms, machine learning algorithm design, system design

The above table shows some data science roles and common technical keywords in job descriptions. Those roles differ in the following key aspects:

• How much business knowledge is required?
• Does it need to deploy code in the production environment?
• How frequently is data updated?
• How much engineering skill is required?
• How much math/stat knowledge is needed?
• Does the role work with structured or unstructured data?

FIGURE 1.1: Different roles in data science and the skill requirements

Data infrastructure engineers work at the beginning of the data pipeline. They are hardcore engineers who work in the production system and usually handle high-frequency data. They are responsible for bringing in data of different forms and formats and ensuring the data comes in smoothly and correctly. They typically don't need to know the data's business context or how data scientists will use the data. For example, they integrate the company's services with AWS/GCP/Azure services and set up an Apache Kafka environment to stream the events. They work directly with other engineers (for example, data engineers and backend engineers). People call the pool of data they put together a data lake, a storage repository that holds a vast amount of raw data in its native format until needed. As the number of data sources multiplies, having data scattered all over in various formats prevents the organization from using the data to help with business decisions or building products. That is when data engineers come to help.

Data engineers transform, clean, and organize the data from the data lake. They commonly design the schemas, store data in queryable forms, and build and maintain data warehouses. Since the database is for non-engineers, data engineers need to know a little more about the business and how analytical personnel use the data. They use technologies like Hadoop/Spark. Some of them may have a basic understanding of machine learning so they can deploy models developed by data/research scientists.

BI engineers and data analysts are close to the business, and hence they need to know the business context well. The critical difference is that BI engineers build automated dashboards, so they are engineers. They are usually experts in SQL and have the engineering skill to write production-level code that builds the downstream data pipeline and automates their work. Data analysts are technical but not engineers. They analyze ad hoc data and deliver the results through presentations. The data is, most of the time, structured. They need to know coding basics (SQL or sometimes R/Python) but are rarely asked to write production-level code. This role was mixed with "data scientist" by many companies but is now much better refined in mature companies.

The most significant difference between a data analyst and a data scientist is the requirement of mathematics and statistics. Data analysts usually don't need to have a quantitative background or an advanced degree. The analytics they do are mostly descriptive with visualizations. Most data scientists have a quantitative background, run A/B experiments, and sometimes build machine learning models. They mainly handle structured and ad hoc data.

Research scientists are experts who have a research background. They do rigorous analysis and make causal inferences by framing experiments, developing hypotheses, and proving whether they are true or not. They are researchers who can create new models and publish peer-reviewed papers. Most small/mid-size companies don't have this role.

Applied scientist is the role that aims to fill the gap between data/research scientists and data engineers. They have a decent scientific background, but they are also experts in applying their knowledge and implementing solutions at scale. They have a different focus than research scientists: instead of scientific discovery, they focus on real-life applications. They usually need to pass a coding bar.

Machine learning engineers have a more advanced software engineering skillset and understand the efficiency of different algorithms and system design. They focus on deploying the models. They are in a niche position. They collaborate with data scientists and applied scientists to read, reinterpret, and rewrite their proof-of-concept notebooks or scripts into software that can be deployed.

2 Soft Skills for Data Scientists

We want to start the book with soft skills for data scientists. There are many university courses, online self-learning modules, and excellent books that teach technical skills, but far fewer resources discuss soft skills for data scientists in detail. Yet soft skills are essential for data scientists to succeed in their careers, especially at an early stage. This chapter also introduces the project cycle and some common pitfalls of data science projects in real life.

2.1 Comparison between Statistician and Data Scientist

Statistics as a scientific area can be traced back to 1749, and statistician as a career has been around for hundreds of years with well-established theory and application. Data scientist has become an attractive career only in the last few years, driven by data whose size and variety are beyond the traditional statistician's toolbox and by fast-growing computation power. Statisticians and data scientists have a lot in common, but there are also significant differences, as highlighted in figure 2.1 below.

Both statisticians and data scientists work closely with data. For typical traditional statisticians, the data set is usually a well-formatted text file with numbers (i.e., numerical variables) and labels (i.e., categorical variables). The data's size is typically small enough to be loaded into a PC's memory or saved on a PC's hard disk. Compared to statisticians, data scientists need to deal with more varieties of data:


FIGURE 2.1: Comparison of Statistician and Data Scientist

• well-formatted data stored in a database system with a size much larger than a PC's memory or hard disk;
• a huge amount of verbatim text, voice, image, and video;
• real-time streaming data and other types of records.

One unique power of statistics is to make statistical inferences based on a small set of data. Statisticians, especially in academia, usually spend most of their time developing models and don't need to put too much effort into data cleaning. However, data has become relatively abundant recently, and modeling is an (often small) part of the overall effort. Due to open source communities' active development, fitting standard models is not far from button-pushing. Data scientists in industry instead spend a lot of time preprocessing and wrangling the data before feeding it to the model.

Unlike statisticians, data scientists often focus on delivering actionable results and sometimes need to deploy the model to the production system. The data available for model training can be too large to be processed on a single computer. Across the entire problem-solving cycle, statisticians are usually not well integrated with the production system where data is obtained in real time, while data scientists are more embedded in the production system and closer to the data generation procedures. In summary, statisticians focus more on modeling and usually bring data to models, while data scientists focus more on data and usually bring models to data.

2.2 Beyond Data and Analytics

Data scientists usually have a good sense of data and analytics, but data science projects are much more than that. A data science project may involve people with different roles, especially in a large company:

• the business owner or leader who identifies the business problem and value;
• the data owner and computation resource/infrastructure owner from the IT department;
• a dedicated policy owner to make sure the data and model are under model governance, security and privacy guidelines, and laws;
• a dedicated engineering team to implement, maintain, and refresh the model;
• a program manager to ensure the data science project fits into the overall technical program development and to coordinate all involved parties to set periodic tasks so that the project meets the preset milestones and results.

The entire team usually has multiple rounds of discussion on resource allocation among groups (i.e., who pays for the data science project) at the beginning of the project and during the project.

Effective communication and in-depth domain knowledge about the business problem are essential requirements for a successful data scientist. A data scientist may interact with people at various levels, from senior leaders who set the corporate strategies to front-line employees who do the daily work. A data scientist needs the capability to view the problem from 10,000 feet above the ground and to drill down to the very bottom of the details. To convert a business question into a data science problem, a data scientist needs to communicate in a language other people can understand and obtain the required information through formal and informal conversations.

In the entire data science project cycle, including defining, planning, developing, and implementing, every step needs a data scientist involved to ensure the whole team can correctly determine the business problem and reasonably evaluate the business value and success. Corporations are investing heavily in data science and machine learning, and there is a very high expectation of return on the investment.

However, it is easy to set an unrealistic goal and an inflated estimate of a data science project's business impact. The team's data scientist should lead and navigate the discussions to ensure that data and analytics, not wishful thinking, back the goal. Many data science projects over-promise on business value and are too optimistic about the delivery timeline. These projects eventually fail by not delivering the pre-set business impact within the promised timeline. As data scientists, we need to identify these issues early and communicate with the entire team to ensure the project has a realistic deliverable and timeline.

The data science team also needs to work closely with data owners on different things: for example, identify relevant internal and external data sources, evaluate the data's quality and relevancy to the project, and work closely with the infrastructure team to understand the availability of computation resources (i.e., hardware and software). It is easy to create scalable computation resources through cloud infrastructure for a data science project. However, you need to evaluate the cost of the dedicated computation resources and make sure it fits the budget.

In summary, data science projects are much more than data and analytics. A successful project requires a data scientist to lead many aspects of the project.

2.3 Three Pillars of Knowledge

It is well known that there are three pillars of essential knowledge for a successful data scientist.

(1) Analytics knowledge and toolsets

A successful data scientist needs to have a strong technical background in data mining, statistics, and machine learning. An in-depth understanding of modeling, along with insight about data, enables a data scientist to convert a business problem into a data science problem. Many chapters of this book focus on analytics knowledge and toolsets.

(2) Domain knowledge and collaboration

A successful data scientist needs in-depth domain knowledge to understand the business problem well. For any data science project, the data scientist needs to collaborate with other team members. Communication and leadership skills are critical for data scientists during the entire project cycle, especially when there is only one scientist on the team, who then needs to decide the timeline and impact under uncertainty.

(3) (Big) data management and (new) IT skills

The last pillar is about the computation environment and model implementation in a big data platform. It used to be the most difficult one for a data scientist with a statistics background (i.e., lacking computer science knowledge or programming skills). The good news is that, with the rise of big data platforms in the cloud, it is easier for a statistician to overcome this barrier. The "Big Data Cloud Platform" chapter of this book describes this pillar in detail.

FIGURE 2.2: Three Pillars of Knowledge

2.4 Data Science Project Cycle

A data science project has various stages. Many textbooks and blogs focus on one or two specific stages, and it is rare to see an end-to-end life cycle of a data science project. Getting a good grasp of the end-to-end process requires years of real-world experience. Seeing a holistic picture of the whole cycle helps data scientists better prepare for real-world applications. We will walk through the full project cycle in this section.

2.4.1 Types of Data Science Projects

People often use "data science project" to describe any project that uses data to solve a business problem, including traditional business analytics, data visualization, or machine learning modeling. Here we limit our discussion to data science projects that involve data and some statistical or machine learning models, and exclude basic analytics or visualization. The business problem itself gives us the flavor of the project. We can view data as the raw ingredient to start with, and the machine learning model makes the dish.

The types of data used and the final model development define the different kinds of data science projects.

2.4.1.1 Offline and Online Data

There are offline and online data. Offline data are historical data stored in databases or data warehouses. With the development of data storage techniques, the cost to store a large amount of data is low. Offline data are versatile and rich in general (for example, websites may track and keep each user's mouse position, click, and typing information while the user is visiting the website). The data is usually stored in a distributed system, and it can be extracted in batch to create features used in model training.

Online data are real-time information that flows to models to make automatic actions. Real-time data can frequently change (for example, the keywords a customer is searching for can change at any given time). Capturing and using real-time online data requires the integration of a machine learning model into the production infrastructure. It used to be a steep learning curve for data scientists not familiar with computer engineering, but the cloud infrastructure makes it much more manageable. Based on the offline and online data and model properties, we can separate data science projects into three categories, as described below.

2.4.1.2 Offline Training and Offline Application

This type of data science project is for a specific business problem that needs to be solved once or multiple times. But the dynamic and disruptive nature of this type of business problem requires substantial work every time. One example of such a project is "whether a brand-new business workflow is going to improve efficiency." In this case, we often use internal/external offline data and business insight to build models. The final results are delivered as a report to answer the specific business question. It is similar to a traditional business intelligence project but with more focus on data and models. Sometimes the data size and model complexity are beyond the capacity of a single computer. Then we need to use distributed storage and computation. Since the model uses historical data and the output is a report, there is no need for real-time execution. Usually, there is no run-time constraint on the machine learning model unless the model runs beyond a reasonable time frame, such as a few days. We call this type of data science project an "offline training, offline application" project.

2.4.1.3 Offline Training and Online Application

Another type of data science project uses offline data for training and applies the trained model to real-time online data in the production environment. For example, we can use historical data to train a personalized advertisement recommendation model that provides real-time ad recommendations. The model training uses historical offline data. The trained model then takes customers' online real-time data as input features and runs in real time to provide an automatic action. The model training is very similar to the "offline training, offline application" project. But to put the trained model into production, there are specific requirements. For example, features used in the offline training have to be available online in real time, and the model's online run time has to be short enough not to impact user experience. In most cases, data science projects in this category create continuous and scalable business value as the model could run millions of times a day. We will use this type of data science project to describe the project cycle.

2.4.1.4 Online Training and Online Application

Some business problems are so dynamic that even yesterday's data is out of date. In this case, we can use online data to train the model and apply it in real time. We call this type of data science project "online training, online application." It requires high automation and low latency.

2.4.2 Problem Formulation and Project Planning Stage

A data-driven and fact-based planning stage is essential to ensure a successful data science project. With the recent big data and data science hype, there is a high demand for data science projects to create business value across different business sectors. Usually, the leaders of an organization are those who initiate the data science project proposals. This top-down style of data science project typically has high visibility, with some human and computation resources pre-allocated. However, it is crucial to understand the business problem first and align the goal across different teams, including:

(1) the business team, which may include members from the business operation team, business analytics, insight, and metrics reporting team;
(2) the technology team, which may include members from the database and data warehouse team, data engineering team, infrastructure team, core machine learning team, and software development team;
(3) the project, program, and product management team, depending on the scope of the data science project.

To start the conversation, we can ask everyone in the team the following questions:

• What are the pain points in the current business operation?
• What data are available, and how are the quality and quantity of the data?
• What might be the most significant impacts of a data science project?
• Is there any positive or negative impact on other teams?
• What computation resources are available for model training and model execution?
• Can we define key metrics to compare and quantify business value?
• Are there any data security, privacy, and legal concerns?
• What are the desired milestones, checkpoints, and timeline?
• Is the final application online or offline?
• Are the data sources online or offline?

It is likely to take a series of intense meetings and heated discussions to frame the project reasonably. After the planning stage, we should be able to define a set of key metrics related to the project, identify some offline and online data sources, request needed computation resources, draft a tentative timeline and milestones, and form a team of data scientists, data engineers, software developers, a project manager, and members from the business operation. Data scientists should play a significant role in these discussions. If data scientists do not lead the project formulation and planning, the project may not meet the desired timeline and milestones.

2.4.3 Project Modeling Stage

Even though we already set some strategies, milestones, and timelines at the problem formulation and project planning stage, data science projects are dynamic. There can be uncertainties along the road. As a data scientist, communicating any newly encountered difficulties or opportunities during the modeling stage to the entire team is essential to keep the data science project progressing. Data cleaning, data wrangling, and exploratory data analysis are great starting points toward modeling with the available data sources identified at the planning stage. Meanwhile, abstracting the business problem into a set of statistical and machine learning problems is an iterative process. Business problems can rarely be solved by using just one statistical or machine learning model. Using a sequence of methods to decompose the business problem is one of the critical responsibilities of a senior data scientist. The process requires iterative rounds of discussions with the business and data engineering teams based on each iteration's new learnings. Each iteration includes both data-related and model-related parts.

2.4.3.1 Data Related Part

Data cleaning, data preprocessing, and feature engineering are related procedures that aim to create usable variables or features for statistical and machine learning models. A critical aspect of data-related procedures is to make sure the data source we are using is a good representation of the situation where the final trained model will be applied. Exactly the same representation is rarely possible, and it is ok to start with a reasonable approximation. A data scientist must be clear on the assumptions, communicate the limitations of biased data with the team, and quantify its impact on the application. In the data-related part, sometimes the available data is not relevant to the business problem we want to solve, and we have to collect more, and more relevant, data before modeling.

2.4.3.2 Model Related Part

There are different types of statistical and machine learning models, such as supervised learning, unsupervised learning, and causal inference. For each type, there are various algorithms, libraries, or packages readily available. To solve a business problem, we sometimes need to piece together a few methods at the model exploring and developing stage. This stage also includes model training, validation, and testing to ensure the model works well in the production environment (i.e., it generalizes well and does not overfit). The model selection follows Occam's razor: choose the simplest among a set of comparable models. Before we try complicated models, it is good to get some benchmarks using simple business rules, common-sense decisions, or standard models (such as random forest for classification and regression problems).

2.4.4 Model Implementation and Post Production Stage

For offline application data science projects, the end product is often a detailed report with model results and output. However, for online application projects, a trained model is just halfway to the finish line. The offline data is stored and processed in a different environment from the online production environment. Building the online data pipeline and implementing machine learning models in a production environment requires lots of additional work. Even though recent advances in cloud infrastructure lower the barrier dramatically, it still takes effort to implement an offline model in the online production system. Before we promote the model to production, there are two more steps to go:

1. Shadow mode
2. A/B testing

A shadow mode is like an observation period when the data pipeline and machine learning models run fully functional, but we only record the model output without taking any action. Some people call it proof of concept (POC). During the shadow mode, people frequently check the data pipeline and model and detect bugs such as a timeout, missing features, version conflicts (for example, Python 2 vs. Python 3), data type mismatches, etc.

Once the online model passes the shadow mode, A/B testing is the next stage. During A/B testing, all the incoming observations are randomly separated into two groups: control and treatment. The control group skips the machine learning model, while the treatment group goes through the machine learning model. After that, people monitor a list of pre-defined key metrics during a specific period to compare the control and treatment groups. The differences in these metrics determine whether the machine learning model provides business value or not. Real applications can be complicated. For example, there can be multiple treatment groups, or hundreds, even thousands, of A/B tests running by different teams at any given time in the same production environment.

Once the A/B testing shows that the model provides significant business value, we can put it into full production. Ideally, the model runs as expected and continues to offer scalable value. However, the business can change, and a machine learning model that works now can break tomorrow, and features available now may not be available tomorrow. We need a monitoring system to notify us when one or multiple features change. When the model performance degrades below a pre-defined level, we need to fine-tune the parameters and thresholds, re-train the model with more recent data, or add or remove features to improve model performance. Eventually, any model will fail or retire at some time, ideally with a pre-defined model retirement plan.
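To make the A/B comparison step concrete, below is a minimal, hypothetical R sketch of evaluating one pre-defined key metric. The data frame ab_log, its columns, and the simulated lift are made up purely for illustration and are not part of any production system described in this chapter.

# Hypothetical A/B log: one row per observation, randomly assigned to a group
set.seed(1)
ab_log <- data.frame(
    group  = sample(c("control", "treatment"), 2000, replace = TRUE),
    metric = rnorm(2000, mean = 10)
)
# Pretend the model-driven treatment lifts the key metric slightly
ab_log$metric[ab_log$group == "treatment"] <-
    ab_log$metric[ab_log$group == "treatment"] + 0.2
# Compare the two groups on the pre-defined metric
aggregate(metric ~ group, data = ab_log, FUN = mean)
t.test(metric ~ group, data = ab_log)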

2.4.5 Project Cycle Summary

The data science end-to-end project cycle is a complicated process that requires close collaboration among many teams. The data scientist, perhaps the only scientist on the team, has to lead the planning discussion and model development based on the data available and communicate key assumptions and uncertainties. A data science project may fail at any stage, and a clear end-to-end view of the project cycle helps avoid some mistakes.

2.5 Common Mistakes in Data Science

Data science projects can go wrong at different stages in many ways. Most textbooks and online blogs focus on technical mistakes about machine learning models, algorithms, or theories, such as detecting outliers and overfitting. It is important to avoid these technical mistakes. However, there are common systematic mistakes across data science projects that are rarely discussed in textbooks. In this section, we describe these common mistakes in detail so that readers can proactively identify and avoid them in their data science projects.

2.5.1 Problem Formulation Stage

The most challenging part of a data science project is problem formulation. Data science projects stem from the pain points of the business. The draft version of the project's goal is relatively vague without much quantification, or is the gut feeling of the leadership team. Often multiple teams are involved in the initial project formulation stage, and they have different views. It is easy to have misalignment across teams on resource allocation, milestone deliverables, and timeline. Data science team members with technical backgrounds are sometimes not even invited to the initial discussion at the problem formulation stage. It sounds ridiculous, but it is sadly true that a lot of resources are spent on solving the wrong problem, the number one systematic common mistake in data science. Formulating a business problem into the right data science project requires an in-depth understanding of the business context, data availability and quality, computation infrastructure, and methodology to leverage the data to quantify business value.

We see people over-promise on business value all the time, another common mistake that will fail the project at the beginning. With the big data and machine learning hype, leaders across many industries often have unrealistically high expectations of data science. This is especially true during enterprise transformation when there is a strong push to adopt new technology to get value out of the data. The unrealistic expectations are based on assumptions that are way off the chart without checking the data availability, data quality, computation resources, and current best practices in the field. Even when there is some exploratory analysis by the data science team at the problem formulation stage, project leaders sometimes ignore their data-driven voice.

These two systematic mistakes undermine the organization's data science strategy. The higher the expectation, the bigger the disappointment when the project cannot deliver business value. Data and business context are essential to formulate the business problem and set reachable business value. It helps avoid mistakes to have a strong data science leader with a broad technical background and to let data scientists coordinate and drive the problem formulation and set realistic goals based on data and business context.

2.5.2 Project Planning Stage

Now suppose the data science project is formulated correctly with a reasonable expectation of the business value. The next step is to plan the project by allocating resources, setting up milestones and timelines, and defining deliverables. In most cases, project managers coordinate the different teams involved in the project and use agile project management tools similar to those in software development. Unfortunately, the project management team may not have experience with data science projects and hence fail to account for the uncertainties at the planning stage. The fundamental difference between data science projects and other projects leads to another common mistake: being too optimistic about the timeline. For example, data exploration and data preparation may take 60% to 80% of the total time for a given data science project, but people often don't realize that.

When there are a lot of data already collected across the organization, people assume we have enough data for everything. It leads to another mistake: being too optimistic about data availability and quality. We do not need "big data," but data that can help us solve the problem. The data available may be of low quality, and we need to put substantial effort into cleaning the data before we can use it. There are "unexpected" efforts to bring the right and relevant data for a specific data science project. To ensure smooth delivery of data science projects, we need to account for the "unexpected" work at the planning stage. Data scientists all know that data preprocessing and feature engineering is usually the most time-consuming part of a data science project. However, people outside data science are often not aware of it, and we need to educate other team members and the leadership team.

2.5.3 Project Modeling Stage

Finally, we start to look at the data and fit some models. One common mistake at this stage is unrepresentative data. The model trained using historical data may not generalize to the future. There is always a problem with biased or unrepresentative data. As data scientists, we need to use data that are closer to the situation where the model will apply and quantify the impact of the model output in production. Another mistake at this stage is overfitting and obsession with complicated models. Now, we can easily get hundreds or even thousands of features, and the machine learning models are getting more complicated. People can use open source libraries to try all kinds of models and are sometimes obsessed with complicated models instead of using the simplest among a set of comparable models with similar results. The data used to build the models is always somewhat biased or unrepresentative. Simpler models generalize better and have a higher chance of providing consistent business value once the model passes the test and is finally implemented in the production environment.

The existing data and methods at hand may be insufficient to solve the business problem. In that case, we can try to collect more data, do feature engineering, or develop new models. However, if there is a fundamental gap between the data and the business problem, the data scientist must make the tough decision to unplug the project. On the other hand, data science projects usually have high visibility and may be initiated by senior leadership. Even after the data science team has provided enough evidence that they can't deliver the expected business value, people may not want to stop the project, which leads to another common mistake at the modeling stage: taking too long to fail. The earlier we can stop a failing project, the better, because we can put valuable resources into other promising projects. A long data science project that is doomed to fail damages the data science strategy and hurts everyone involved.

2.5.4 Model Implementation and Post Production Stage

Now suppose we have found a model that works great for the training and testing data. If it is an online application, we are only halfway there. The next step is to implement the model, which sounds like alien work for a data scientist without software engineering experience in the production system. The data engineering team can help with model production. However, as data scientists, we need to know the potential mistakes at this stage. One big mistake is missing shadow mode and A/B testing and assuming that the model performance at the model training/testing stage stays the same in the production environment. Unfortunately, a model trained and evaluated using historical data nearly never performs the same in the production environment. The data used in the offline training may be significantly different from online data, and the business context may have changed. If possible, machine learning models in production should always go through shadow mode and A/B testing to evaluate performance.

In the model training stage, people usually focus on model performance, such as accuracy, without paying too much attention to the model execution time. When a model runs online in real time, each instance's total run time (i.e., model latency) should not impact the customer's user experience. Nobody wants to wait even one second to see the results after clicking the "search" button. In the production stage, feature availability is crucial to running a real-time model. Engineering resources are essential for model production. However, in traditional companies, it is common for a data science project to fail to scale in real-time applications due to a lack of computation capacity, engineering resources, or a non-tech culture and environment.

As the business problem evolves rapidly, the data and model in the production environment need to change accordingly, or the model's performance deteriorates over time. The online production environment is more complicated than model training and testing. For example, when we pull online features from different resources, some may be missing at a specific time; the model may run into a time-out zone; and various software can cause version problems. We need regular checkups during the entire life cycle of the model from implementation to retirement. Unfortunately, people often don't set up a monitoring system for data science projects, and it is another common mistake: missing necessary online checkups. It is essential to set up a monitoring dashboard and automatic alarms, and to create model tuning, re-training, and retirement plans.

2.5.5 Summary of Common Mistakes

The data science project is a combination of art, science, and engineering. A data science project may fail in different ways. However, a data science project can provide significant business value if we put data and business context at the center of the project, get familiar with the data science project cycle, and proactively identify and avoid these potential mistakes. Here is a summary of the mistakes:

• Solving the wrong problem
• Overpromising on business value
• Being too optimistic about the timeline
• Being too optimistic about data availability and quality
• Unrepresentative data
• Overfitting and obsession with complicated models
• Taking too long to fail
• Missing A/B testing
• Failing to scale in real-time applications
• Missing necessary online checkups

3 Introduction to The Data

Before tackling the analytics problems, we start by introducing the data sets to be analyzed in later chapters.

3.1 Customer Data for A Clothing Company

Our first data set represents customers of a clothing company that sells products in physical stores and online. This data is typical of what one might get from a company's marketing database (the database will have more data than what we show here). The data includes 1,000 customers:

1. Demography
   • age: age of the respondent
   • gender: male/female
   • house: 0/1 variable indicating if the customer owns a house or not
2. Sales in the past year
   • store_exp: expense in store
   • online_exp: expense online
   • store_trans: number of store purchases
   • online_trans: number of online purchases
3. Survey on product preference

It is common for companies to survey their customers and draw insights to guide future marketing activities. The survey is as below:


How strongly do you agree or disagree with the following statements:

1. Strongly disagree
2. Disagree
3. Neither agree nor disagree
4. Agree
5. Strongly agree

• Q1. I like to buy clothes from different brands
• Q2. I buy almost all my clothes from some of my favorite brands
• Q3. I like to buy premium brands
• Q4. Quality is the most important factor in my purchasing decision
• Q5. Style is the most important factor in my purchasing decision
• Q6. I prefer to buy clothes in store
• Q7. I prefer to buy clothes online
• Q8. Price is important
• Q9. I like to try different styles
• Q10. I like to make decisions myself and don't need too many suggestions from others

There are 4 segments of customers:

1. Price
2. Conspicuous
3. Quality
4. Style

Let’s check it:

str(sim.dat,vec.len=3)

## 'data.frame': 1000 obs. of 19 variables:
##  $ age         : int 57 63 59 60 51 59 57 57 ...
##  $ gender      : chr "Female" "Female" "Male" ...
##  $ income      : num 120963 122008 114202 113616 ...
##  $ house       : chr "Yes" "Yes" "Yes" ...
##  $ store_exp   : num 529 478 491 348 ...
##  $ online_exp  : num 304 110 279 142 ...
##  $ store_trans : int 2 4 7 10 4 4 5 11 ...
##  $ online_trans: int 2 2 2 2 4 5 3 5 ...
##  $ Q1          : int 4 4 5 5 4 4 4 5 ...
##  $ Q2          : int 2 1 2 2 1 2 1 2 ...
##  $ Q3          : int 1 1 1 1 1 1 1 1 ...
##  $ Q4          : int 2 2 2 3 3 2 2 3 ...
##  $ Q5          : int 1 1 1 1 1 1 1 1 ...
##  $ Q6          : int 4 4 4 4 4 4 4 4 ...
##  $ Q7          : int 1 1 1 1 1 1 1 1 ...
##  $ Q8          : int 4 4 4 4 4 4 4 4 ...
##  $ Q9          : int 2 1 1 2 2 1 1 2 ...
##  $ Q10         : int 4 4 4 4 4 4 4 4 ...
##  $ segment     : chr "Price" "Price" "Price" ...

Refer to Appendix for the simulation code.
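As a quick follow-up check (assuming sim.dat has been loaded as above), we can count how the 1,000 customers fall into the four segments:

# Number of customers in each of the four segments
table(sim.dat$segment)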

3.2 Swine Disease Breakout Data

The swine disease data includes 120 simulated survey questions from 800 farms. There are three choices for each question. The outbreak status for the $i^{th}$ farm is generated from a $Bernoulli(1, p_i)$ distribution, with $p_i$ being a function of the question answers:

$$ln\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \sum_{g=1}^{G} \mathbf{x}_{i,g}^{T}\boldsymbol{\beta}_{g}$$

where $\beta_0$ is the intercept, $\mathbf{x}_{i,g}$ is a three-dimensional indicator vector for the question answer, and $\boldsymbol{\beta}_{g}$ is the parameter vector corresponding to the $g^{th}$ predictor. Three types of questions are considered regarding their effects on the outcome. The first forty survey questions are important questions such that the coefficients of the three answers to these questions are all different:

$$\boldsymbol{\beta}_{g} = (1, 0, -1) \times \gamma, \quad g = 1, \dots, 40$$

The second forty survey questions are also important questions but only one answer has a coefficient that is different from the other two answers:

$$\boldsymbol{\beta}_{g} = (1, 0, 0) \times \gamma, \quad g = 41, \dots, 80$$

The last forty survey questions are unimportant questions such that all three answers have the same coefficients:

$$\boldsymbol{\beta}_{g} = (0, 0, 0) \times \gamma, \quad g = 81, \dots, 120$$

The baseline coefficient $\beta_0$ is set to $-\frac{40}{3}\gamma$ so that, on average, a farm has a 50% chance of having an outbreak. The parameter $\gamma$ in the above simulation controls the strength of the questions' effect on the outcome. In this simulation study, we consider the situations where $\gamma = 0.1, 0.25, 0.5, 1, 2$. So the parameter settings are:

$$\boldsymbol{\beta}^{T} = \left(-\frac{40}{3}, \underbrace{1, 0, -1}_{question\ 1}, \dots, \underbrace{1, 0, 0}_{question\ 41}, \dots, \underbrace{0, 0, 0}_{question\ 81}, \dots, \underbrace{0, 0, 0}_{question\ 120}\right) \times \gamma$$
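As a quick check of the 50% claim, assume the three answers to each question are equally likely. The expected contribution of the question terms is then

$$E\left[\sum_{g=1}^{120}\mathbf{x}_{i,g}^{T}\boldsymbol{\beta}_{g}\right] = 40 \cdot \frac{1+0-1}{3}\gamma + 40 \cdot \frac{1+0+0}{3}\gamma + 40 \cdot 0 = \frac{40}{3}\gamma,$$

which is exactly cancelled by $\beta_0 = -\frac{40}{3}\gamma$, so the expected log-odds is zero and the average outbreak probability is centered around 50%.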

For each value of $\gamma$, 20 data sets are simulated. The bigger $\gamma$ is, the larger the corresponding parameters. We provide the data sets with $\gamma = 2$. Let's check the data:

disease_dat <- read.csv("http://bit.ly/2KXb1Qi")
# only show the last 7 columns here
head(subset(disease_dat, select = c("Q118.A", "Q118.B", "Q119.A",
                                    "Q119.B", "Q120.A", "Q120.B", "y")))

##   Q118.A Q118.B Q119.A Q119.B Q120.A Q120.B y
## 1      1      0      0      0      0      1 1
## 2      0      1      0      1      0      0 1
## 3      1      0      0      0      1      0 1
## 4      1      0      0      0      0      1 1
## 5      1      0      0      0      1      0 0
## 6      1      0      0      1      1      0 1

Here y indicates the outbreak situation of the farms: y = 1 means there was an outbreak within 5 years after the survey. The remaining columns are the survey responses. For example, Q120.A = 1 means the respondent chose A in Q120. We consider C as the baseline. Refer to Appendix for the simulation code.
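For readers who want a feel for the data-generating process before consulting the Appendix, below is a minimal R sketch of one simulated data set following the formulas above. It is an illustration only and is not the authors' Appendix code; answer choices are assumed to be equally likely.

set.seed(2019)
n_farm <- 800; n_q <- 120; gamma <- 2
# Coefficients for the three answers (A, B, C) of each question
beta <- rbind(
    matrix(rep(c(1, 0, -1), 40), ncol = 3, byrow = TRUE),  # questions 1-40
    matrix(rep(c(1, 0, 0), 40), ncol = 3, byrow = TRUE),   # questions 41-80
    matrix(0, 40, 3)                                       # questions 81-120
) * gamma
beta0 <- -40 / 3 * gamma
# Each farm picks answer 1 (A), 2 (B), or 3 (C) for each question
ans <- matrix(sample(1:3, n_farm * n_q, replace = TRUE), n_farm, n_q)
# Linear predictor, outbreak probability, and outbreak indicator
eta <- beta0 + sapply(1:n_farm, function(i) sum(beta[cbind(1:n_q, ans[i, ])]))
p <- 1 / (1 + exp(-eta))
y <- rbinom(n_farm, 1, p)
mean(y)  # roughly 0.5 across repeated simulations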

3.3 MNIST Dataset

The MNIST dataset is a popular dataset for image classification machine learning tutorials. It is conveniently included in the Keras library and ready to be loaded with built-in functions for analysis. The MNIST Wikipedia page provides a detailed description of the dataset: https://en.wikipedia.org/wiki/MNIST_database. It contains 70,000 images of handwritten digits from American Census Bureau employees and American high school students. There are 60,000 training images and 10,000 testing images. Each image has a resolution of 28x28, and the numerical pixel values are in greyscale: each image is represented by a 28x28 matrix in which each element is an integer between 0 and 255. The label of each image is the intended digit of the handwritten image, between 0 and 9. We cover the detailed steps to explore the MNIST dataset in the R and Python notebooks. A sample of the dataset is illustrated in Figure 3.1 below¹.

FIGURE 3.1: Sample of MNIST dataset

¹ The image is from https://en.wikipedia.org/wiki/File:MnistExamples.png
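As a quick preview of the loading step (the full exploration is in the accompanying notebooks), the data can be pulled in with the keras R package's built-in loader; this sketch assumes keras and a backend are already installed:

library(keras)
mnist <- dataset_mnist()
# 60,000 training and 10,000 testing images, each a 28x28 matrix of 0-255
dim(mnist$train$x)    # 60000 28 28
range(mnist$train$x)  # 0 255
table(mnist$train$y)  # counts of the digit labels 0-9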

3.4 IMDB Dataset

The IMDB dataset (http://ai.stanford.edu/~amaas/data/sentiment/) is a popular dataset for text and language-related machine learning tutorials. It is also conveniently included in the Keras library, and there are a few built-in functions in Keras for data loading and pre-processing. It contains 50,000 movie reviews (25,000 for training and 25,000 for testing) from IMDB, as well as each movie review's sentiment: positive or negative. The raw data contains the text of each movie review, and it has to be pre-processed before being fitted to any machine learning models. Using Keras's built-in functions, we can easily get a processed dataset (i.e., a numerical data frame) ready for machine learning algorithms. Keras' built-in functions perform the following tasks to convert the raw review text into a data frame:

1. Convert text data into numerical data. Machine learning models cannot work with raw text data directly, and we have to convert text into numbers. There are many different ways to do the conversion, and Keras' built-in function uses each word's rank of frequency in the entire training dataset to replace the raw text in both the training and testing datasets. For example, the 10th most frequent word is replaced by the integer 10. There are a few additional setups for this process, including:

   a. Skip top frequent words. We usually skip a few of the most frequent words as they are mainly stopwords like "the," "and," or "a," which usually do not provide much information. There is a parameter in the built-in function to specify how many top words to skip.

   b. Set the maximum number of unique words. The entire vocabulary of unique words in the training dataset may be large, and many of them have very low frequencies, such as appearing just once in the entire training dataset. To limit the size of the vocabulary, we can set the maximum number of unique words using Keras' built-in function, such that the least frequent words are replaced with a special index such as "2".

2. Padding or truncation to keep all the reviews the same length. For most machine learning models, the algorithms expect to see the same number of features (i.e., the same number of input columns in the data frame). There is a parameter in the Keras built-in function to set the maximum number of words in each review (i.e., max_length). For reviews that have fewer than max_length words, we pad them with "0"; for reviews that have more than max_length words, we truncate them.

After the above pre-processing, each review is represented by one row in the data frame. There is one column for the binary positive/negative sentiment, and max_length columns of input features converted from the raw review text. In the corresponding R and Python notebooks, we will go over the details of the data pre-processing using Keras' built-in functions.
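A minimal sketch of this pre-processing with the keras R package's built-in functions is shown below; the specific values of num_words, skip_top, and maxlen are illustrative choices, not recommendations from the text:

library(keras)
# Keep the 10,000 most frequent words and skip the 10 most frequent ones;
# other words are replaced by a special out-of-vocabulary index
imdb <- dataset_imdb(num_words = 10000, skip_top = 10)
# Pad or truncate every review to exactly 200 integers
x_train <- pad_sequences(imdb$train$x, maxlen = 200)
y_train <- imdb$train$y
dim(x_train)  # 25000 200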

4 Big Data Cloud Platform

Data has been statisticians' and analysts' friend for hundreds of years. Tabulated data are the most common format that we use daily. People used to store data on paper, tapes, diskettes, or hard drives. Only recently, with the development of computer hardware and software, have the volume, variety, and speed of the data exceeded the capacity of a traditional statistician or analyst. So using data becomes a science that focuses on the question: how can we store, access, process, and analyze the massive amount of data and provide actionable insights? In the past few years, by utilizing commodity hardware and open-source software, people created a big data ecosystem for data storage, data retrieval, and parallel computation. Hadoop and Spark have become popular platforms that enable data scientists, statisticians, and analysts to access the data and to build models. Programming skills in the big data platform have been an obstacle for a traditional statistician or analyst to become a successful data scientist. However, cloud computing reduces the difficulty significantly. The user interface of the data platform is much friendlier today, and people push much of the technical details to the background. Today's cloud systems also enable quick implementation of the production environment. Now data science emphasizes more the data itself, and the models and algorithms on top of the data, rather than the platform, infrastructure, and low-level programming such as Java.


4.1 Power of Cluster of Computers

We are familiar with our laptop/desktop computers, which have three main components to do data computation: (1) hard disk, (2) memory, and (3) CPU. The data and code stored on the hard disk have specific features such as slow read and write speed and large capacity, around a few TB in today's market. Memory is fast to read and write but has a small capacity, on the order of a few dozen GB in today's market. The CPU is where all the computation happens.

FIGURE 4.1: Single computer (left) and a cluster of computers (right)

For statistical software such as R, the amount of data it can process is limited by the computer's memory. The memory of computers before 2000 was less than 1 GB. The memory capacity grows much slower than the amount of data. Now it is common that we need to analyze data far beyond the capacity of a single computer's memory, especially in an enterprise environment. Meanwhile, as the data size increases, the computation time to solve the same problem (such as regressions) grows faster than linearly. Using a cluster of computers becomes a common way to solve a big data problem. In figure 4.1 (right), a cluster of computers can be viewed as one powerful machine with memory, hard disk, and CPU equivalent to the sum of the individual computers. It is common to have hundreds or even thousands of nodes in a cluster.

In the past, users needed to write code (such as MPI) to distribute data and do parallel computing. Fortunately, with recent developments, the cloud environment for big data analysis is more user-friendly. As data is often beyond the size of the hard disk, the dataset itself is stored across different nodes (i.e., the Hadoop system). When doing analysis, the data is distributed across different nodes, and algorithms are parallelized to leverage the corresponding nodes' CPUs to compute (i.e., the Spark system).
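As a rough illustration of the memory constraint (numbers are approximate and depend on the platform), a data frame with just two numeric columns of one million rows already takes about 15 MB of memory in R:

# Approximate in-memory size of a modest data frame
x <- data.frame(a = rnorm(1e6), b = rnorm(1e6))
print(object.size(x), units = "MB")
# Scaling this to billions of rows quickly exceeds a single machine's memory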

4.2 Evolution of Cluster Computing

Using computer clusters to solve general-purpose data and analytics problems requires a lot of effort if we have to specifically control every element and step, such as data storage, memory allocation, and parallel computation. Fortunately, high-tech companies and open source communities have developed the entire ecosystem based on Hadoop and Spark. Users only need to know high-level scripting languages such as Python and R to leverage computer clusters' distributed storage, memory, and parallel computation power.

4.2.1 Hadoop

The very first problem internet companies faced was how to better store the large amount of data they had collected for future analysis. Google developed its own file system to provide efficient, reliable access to data using large clusters of commodity hardware. The open-source version is known as the Hadoop Distributed File System (HDFS). Both systems use Map-Reduce to allocate computation across computation nodes on top of the file system. Hadoop is written in Java, and writing map-reduce jobs in Java is a direct way to interact with Hadoop, which is not familiar to many in the data and analytics community. To help people better use the Hadoop system, an SQL-like data warehouse system called Hive and a scripting language for an analytics interface called Pig were introduced for people with an analytics background to interact with the Hadoop system. Within Hive, we can create user-defined functions through R or Python to leverage the distributed and parallel computing infrastructure. Map-reduce on top of HDFS is the main concept of the Hadoop ecosystem. Each map-reduce operation requires retrieving data from the hard disk, performing the computation, and storing the result onto the disk again. So, jobs on top of Hadoop require a lot of disk operations, which may slow down the entire computation process.

4.2.2 Spark

Spark works on top of a distributed file system, including HDFS, with better data and analytics efficiency by leveraging in-memory operations. Spark is more tailored for data processing and analytics, and the need to interact with Hadoop directly is greatly reduced. The Spark system includes an SQL-like framework called Spark SQL and a parallel machine learning library called MLlib. Fortunately for many in the analytics community, Spark also supports R and Python. We can interact with data stored in a distributed file system using parallel computing across nodes easily with R and Python through the Spark API and do not need to worry about lower-level details of distributed computing. We will introduce how to use an R notebook to drive Spark computations.

4.3 Introduction of Cloud Environment

Even though Spark provides a solution for big data analytics, the maintenance of the computing cluster and Spark system requires a dedicated team. Historically, each organization's IT department owns the hardware and handles the regular maintenance. It usually takes months for a new environment to be built, and the cost is high. Luckily, the time to deployment and the cost have dropped dramatically with the cloud computing trend. Now we can create a Spark computing cluster in the cloud in a few minutes with the desired configuration, and the user only pays when the cluster is up. Cloud computing environments enable smaller organizations to adopt big data analytics.

There are many cloud computing environments, such as Amazon's AWS, Google Cloud, and Microsoft Azure, which provide a complete list of functions for heavy-duty enterprise applications. For example, Netflix runs its business entirely on AWS without owning any data centers. For beginners, however, Databricks provides an easy-to-use cloud system for learning purposes. Databricks is a company founded by the creators of Apache Spark, and it provides a user-friendly web-based notebook environment that can create a Spark cluster on the fly to run R/Python/Scala/SQL scripts. We will use Databricks' free community edition to run demos in this book. Please note, to help readers get familiar with the Databricks cloud system, the content of this section is partially adopted from the following web pages:

• https://docs.databricks.com/user-guide/faq/sparklyr.html
• http://spark.rstudio.com/index.html

4.3.1 Open Account and Create a Cluster

Anyone can apply for a free Databricks account through https://databricks.com/try-databricks; please make sure to choose the "COMMUNITY EDITION," which does not require credit card information and will always be free. Once the community edition account is open and activated, users can create a cluster computation environment with Spark. The computing cluster associated with a community edition account is relatively small, but it is good enough for running all the examples in this book. The main user interface to the computing environment is the notebook: a collection of cells that contain formatted text or code. When a new notebook is created, the user needs to choose the default programming language (i.e., Python, R, Scala, or SQL), and every cell in the notebook will assume the default programming language. However, the user can easily override the default selection by adding %sql, %python, %r, or %scala at the first line of a cell to indicate the programming language in that cell. Running different cells with different programming languages in the same notebook gives the user the flexibility to choose the best tool for each task. The user can also define a cell to be a markdown cell by adding %md at the first line of the cell. A markdown cell does not perform computation; it is just a cell to show formatted text. Well-separated cells with computation, graphs, and formatted text enable the user to create easy-to-maintain, reproducible reports. A link to a video showing how to open a Databricks account, how to create a cluster, and how to create notebooks is included on the book's website.

4.3.2 R Notebook

For this book, we will use R notebooks for examples and demos, and the corresponding Python notebooks will be available online too. An R notebook contains multiple cells, and, by default, the content within each cell is R script. Usually, each cell is a well-managed segment of a few lines of code that accomplishes a specific task. For example, Figure 4.2 shows the default cell for an R notebook. We can type in R scripts and comments just as if we were using the R console. By default, only the result from the last line will be shown following the cell. However, you can use the print() function to output results for any line. If we move the mouse to the middle of the lower edge of the cell, below the results, a "+" symbol will show up, and clicking on the symbol will insert a new cell below. When we click any area within a cell, it becomes editable and you will see a few icons on the top right corner of the cell where we can run the cell, as well as add a cell below or above, copy the cell, cut the cell, etc. One quick way to run the cell is Shift+Enter when the cell is selected. Users become familiar with the notebook environment quickly.

FIGURE 4.2: Example of R Notebook
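For example, a typical cell might contain a few lines like the following; only the last expression is displayed automatically, so print() is used to show the intermediate result as well:

# A small, self-contained cell: summarize one column, then count rows
s <- summary(iris$Sepal.Length)
print(s)      # shown because of the explicit print()
nrow(iris)    # last line, displayed automatically below the cell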

4.3.3 Markdown Cells

For an R notebook, every cell by default will contain R scripts. But if we put %md, %sql or %python at the first line of a cell, that cell becomes a markdown cell, a SQL script cell, or a Python script cell accordingly.

FIGURE 4.2: Example of R Notebook

For example, Figure 4.3 shows a markdown cell with scripts and the actual appearance when exiting editing mode. A markdown cell provides a straightforward way to describe what each cell is doing as well as what the entire notebook is about. It is a better way than a simple comment within the code.

FIGURE 4.3: Example of Markdown Cell
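Since the screenshot itself cannot be reproduced in text, the raw content of a markdown cell might look like the following hypothetical example; the %md magic on the first line is what turns the cell into formatted text:

%md
### Step 1: Load the data
This notebook reads the customer data and summarizes it by segment.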

4.4 Leverage Spark Using R Notebook

R is a powerful tool for data analysis provided the data can fit into memory. Because of this memory limit, R by itself cannot be used directly for big data analysis where the data is likely stored in a Hadoop/Spark system. By leveraging the sparklyr package created by RStudio, we can use Databricks' R notebook to analyze data stored in the Spark system. As the data are stored across different nodes, Spark enables parallel computation using the collection of memory and CPU across all nodes. The fundamental data element in the Spark system is called a Spark DataFrame (SDF). In this section, we will illustrate how to use Databricks' R notebook for big data analysis on top of the Spark environment through the sparklyr package.

Install package

First, we need to install the sparklyr package, which enables the connection between the master or local node and Spark cluster environments. As it will install more than 10 dependencies, it may take more than 10 minutes to finish. Be patient while it is installing! Once the installation finishes, load the sparklyr package as illustrated by the following code:

# Install sparklyr
if (!require("sparklyr")) {
  install.packages("sparklyr")
}
# Load sparklyr package
library(sparklyr)

Create a Spark Connection

Once the library is loaded, we need to create a Spark Connection to link the computing node (i.e. local node) running the R notebook to the Spark environment. Here we use the "databricks" option for the parameter method, which is specific to Databricks' cloud system. In other enterprise environments, please consult your administrator for details. The Spark Connection (i.e. sc) is the pipe that connects the R notebook in the local node with the Spark cluster. We can think of the R notebook as running on a local node that has its own memory and CPU; the Spark system has a cluster of connected computation nodes, and the Spark Connection creates a mechanism to connect both systems. The Spark Connection can be established with:

# create a sparklyr connection
sc <- spark_connect(method = "databricks")

To simplify the learning process, let us use a very familiar small dataset: the iris dataset. It is part of the dplyr library, and we can load that library to use the iris data frame. Right now the iris dataset is still on the local node where the R notebook is running. We can check the first few lines of the iris dataset using the code below:

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2

## 4          4.6         3.1          1.5         0.2
## 5          5.0         3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4
##   Species
## 1  setosa
## 2  setosa
## 3  setosa
## 4  setosa
## 5  setosa
## 6  setosa

IMPORTANT - Copy Data to Spark Environment

In real applications, the data set may be massive and cannot fit in a single hard disk, and most likely such data are already stored in the Spark system. If the data is already in the Hadoop/Spark ecosystem in the form of an SDF, we can create a local R object to link to the SDF by the tbl() function, where my_sdf is the SDF in the Spark system and my_sdf_tbl is the local R object that refers to my_sdf:

my_sdf_tbl <- tbl(sc = sc, my_sdf)

As we just created a brand new Spark computing environment, there is no SDF in the system yet. We will need to copy a local dataset to the Spark environment. As we have already created the Spark Connection sc, it is easy to copy data to the Spark system using the sdf_copy_to() function as below:

iris_tbl <- sdf_copy_to(sc = sc, x = iris, overwrite = T)

The above one-line code copies the iris dataset from the local node to the Spark cluster environment. "sc" is the Spark Connection we just created; "x" is the data frame that we want to copy; "overwrite" is the option for whether we want to overwrite the target object if an SDF with the same name already exists in the Spark environment. Finally, sdf_copy_to() returns an R object representing the copied SDF (i.e. it creates a "pointer" to the SDF such that we can refer to iris_tbl in the R notebook to operate on the iris SDF). Now iris_tbl in the local R environment can be used to refer to the iris SDF in the Spark system.

To check whether the iris data was copied to the Spark environment successfully or not, we can apply the src_tbls() function to the Spark Connection (sc):

## code to return all the data frames associated with sc
src_tbls(sc)

Analyzing the Data

Now we have successfully copied the iris dataset to the Spark environment as an SDF. This means that iris_tbl is an R object representing the iris SDF, and we can use iris_tbl in R to refer to the iris dataset in the Spark system (i.e. the iris SDF). With the sparklyr package, we can apply nearly all the dplyr functions to a Spark DataFrame directly through iris_tbl, the same way as we apply dplyr functions to a local R data frame on our laptop. For example, we can use the %>% operator to pass iris_tbl to the count() function:

iris_tbl %>% count

or use the head() function to return the first few rows of iris_tbl:

head(iris_tbl)

or apply more advanced data manipulation directly to iris_tbl:

iris_tbl %>%
  mutate(Sepal_Width = round(Sepal_Width * 2) / 2) %>%
  group_by(Species, Sepal_Width) %>%
  summarize(count = n(),
            Sepal_Length = mean(Sepal_Length),
            stdev = sd(Sepal_Length))

Collect Results Back to Master Node

Even though we can run nearly all of the dplyr functions on an SDF, we cannot apply functions from other packages directly to an SDF (such as ggplot()). For functions that can only work on local R data frames, we must copy the SDF back to the local node as an R data frame. To copy an SDF back to the local node, we use the collect() function. The following code will collect the results of a few operations and assign the collected data to iris_summary, a local R data frame:

iris_summary <- iris_tbl %>%
  mutate(Sepal_Width = round(Sepal_Width * 2) / 2) %>%
  group_by(Species, Sepal_Width) %>%
  summarize(count = n(),
            Sepal_Length = mean(Sepal_Length),
            stdev = sd(Sepal_Length)) %>%
  collect()

Now, iris_summary is a local R object in the R notebook, and we can apply any R packages and functions to it. In the following code, we apply ggplot() to it, exactly the same as in a stand-alone R console:

library(ggplot2)
ggplot(iris_summary,
       aes(Sepal_Width, Sepal_Length, color = Species)) +
  geom_line(size = 1.2) +
  geom_errorbar(aes(ymin = Sepal_Length - stdev,
                    ymax = Sepal_Length + stdev),
                width = 0.05) +
  geom_text(aes(label = count),
            vjust = -0.2, hjust = 1.2, color = "black") +
  theme(legend.position = "top")

In most cases, the heavy-duty data preprocessing and aggregation is done in Spark using functions in dplyr. Once the data is aggregated, the size is usually dramatically reduced, and such reduced data can be collected to a local R object for downstream analysis.

Fit Regression to SDF

One of the advantages of the Spark system is its parallel machine learning algorithms. There are many statistical and machine learning algorithms developed to run in parallel across many CPUs with data across many memory units for SDFs. In this example, we have already uploaded the iris data to the Spark system, and the data in the SDF can be referred to through iris_tbl as in the last section. The linear regression algorithm implemented in the Spark system can be called through the ml_linear_regression() function. The syntax to call the function is to define the local R object that represents the SDF (i.e. iris_tbl (local R object) for iris (SDF)), the response variable (i.e. the y variable of the linear regression in the SDF) and the features (i.e. the x variables of the linear regression in the SDF). Now we can easily fit a linear regression on a large dataset far beyond the memory limit of one single computer, and it is truly scalable and only constrained by the resources of the Spark cluster. Below is an illustration of how to fit a linear regression to an SDF using the R notebook:

fit1 <- ml_linear_regression(x = iris_tbl,
                             response = "Sepal_Length",
                             features = c("Sepal_Width",
                                          "Petal_Length",
                                          "Petal_Width"))
summary(fit1)

In the above code, x is the R object pointing to the SDF; response is the y-variable; features is the collection of explanatory variables. For this function, both the data and computation are in the Spark cluster, which leverages multiple CPUs, distributed memory and parallel computing.

Fit a K-means Cluster

Through the sparklyr package, we can use an R notebook to access many Spark Machine Learning Library (MLlib) algorithms such as Linear Regression, Logistic Regression, Survival Regression, Generalized Linear Regression, Decision Trees, Random Forests,

Gradient-Boosted Trees, Principal Components Analysis, Naive Bayes, K-Means Clustering, and a few other methods. The following code fits a k-means clustering algorithm:

## Now fit a k-means clustering using iris_tbl data
## with only two out of four features in iris_tbl
fit2 <- ml_kmeans(x = iris_tbl,
                  centers = 3,
                  features = c("Petal_Length", "Petal_Width"))
# print our model fit
print(fit2)

After fitting the k-means model, we can apply the model to predict other datasets through the sdf_predict() function. The following code applies the model to iris_tbl again to predict the cluster and collect the results as a local R object (i.e. prediction) using the collect() function:

prediction = collect(sdf_predict(fit2, iris_tbl))

As prediction is a local R object, we can apply any R functions from any libraries to it. For example:

prediction %>%
  ggplot(aes(Petal_Length, Petal_Width)) +
  geom_point(aes(Petal_Width, Petal_Length,
                 col = factor(prediction + 1)),
             size = 2, alpha = 0.5) +
  geom_point(data = fit2$centers,
             aes(Petal_Width, Petal_Length),
             col = scales::muted(c("red", "green", "blue")),
             pch = 'x', size = 12) +
  scale_color_discrete(name = "Predicted Cluster",
                       labels = paste("Cluster", 1:3)) +
  labs(x = "Petal Length",
       y = "Petal Width",
       title = "K-Means Clustering",
       subtitle = "Use Spark ML to predict cluster membership with the iris dataset")

So far, we have illustrated

1. the relationship between a local node (i.e. where the R notebook is running) and Spark clusters (i.e. where data are stored and computation is done);
2. how to copy a local data frame to a Spark DataFrame (please note, if your data is already in the Spark environment, there is no need to copy it and we only need to build the connection; this is likely to be the case in an enterprise environment);
3. how to manipulate Spark DataFrames for data cleaning and preprocessing through dplyr functions after installing the sparklyr package;
4. how to fit statistical and machine learning models to Spark DataFrames in a truly parallel manner;
5. how to collect information from Spark DataFrames back to a local R object (i.e. a local R data frame) for further analysis.

These procedures cover the basics of big data analysis that a data scientist needs to know as a beginner. We have an R notebook and a Python notebook on the book website that contain the contents of this chapter.

4.5 Databases and SQL

4.5.1 History

Databases have been around for many years to efficiently organize, store, retrieve, and update data systematically. In the past, statisticians and analysts usually dealt with small datasets stored in text or Excel files and often did not interact with database systems. Students from traditional statistics departments usually lack the necessary database knowledge. However, as data grow bigger, database knowledge becomes essential and required for statisticians, analysts and data scientists in an enterprise environment where data are stored in some form of database system. Databases often contain a collection of tables and the relationships among these tables (i.e. the schema). The table is the fundamental structure for databases; it contains rows and columns similar to data frames in R or Python. Database management systems (DBMS) ensure data integrity and security in real time operations. There are many different DBMS, such as Oracle, SQL Server, MySQL, Teradata, Hive, Redshift and Hana. The majority of database operations are very similar among different DBMS, and Structured Query Language (SQL) is the standard language to use these systems.

SQL became a standard of the American National Standards Institute (ANSI) in 1986, and of the International Organization for Standardization (ISO) in 1987. The most recent version was published in December 2016. For typical users, the fundamental knowledge is nearly the same across all database systems. In addition to the standard features, each DBMS provider includes its own specific functions and features. So, for the same query, there may be slightly different implementations (i.e. SQL scripts) for different systems. In this section, we use Databricks' SQL implementation (i.e. all the SQL scripts can run in a Databricks SQL notebook). More recently, data is often stored in a distributed system such as Hive for disk storage or Hana for in-memory storage. Most relational databases are row-based (i.e. data for each row are stored closely), whereas analytics workflows often favor column-based systems (i.e. data for each column are stored closely). Fortunately, as database users, we only need to learn how to write SQL scripts to retrieve and manipulate data. Even though there are different implementations of SQL across different DBMS, SQL is nearly universal across relational databases including Hive and Spark, which means once we know SQL, our knowledge can be transferred among different database systems. SQL is easy to learn even if you do not have previous experience. In this section, we will go over the key concepts in databases and SQL.

4.5.2 Database, Table and View

A database is a collection of tables that are related to each other. A database has its own database name and each table has its name as well. We can think of a database as a "folder" where tables within the database are "files" within the folder. A table has rows and columns exactly like an R or Python pandas data frame. Each row (also called a record) represents a unique instance of the subject and each column (also called a field or attribute) represents a characteristic of the subject of the table. For each table, there is a special column called the primary key which uniquely identifies each of its records. Tables within a specific database contain related information, and the schema of a database illustrates all fields in every table as well as how these tables and fields relate to each other (i.e. the structure of a database). Tables can be filtered, joined and aggregated to return specific information. A view is a virtual table composed of fields from one or more base tables. The view does not store data and only stores the table structure. The view is also referred to as a saved query. The view is typically used to protect the data stored in the table; users can only query information from a view and cannot change or update its contents. We will use two simple tables to illustrate basic SQL operations. These two tables are from an R dataset which contains the 50 states' population and income (https://stat.ethz.ch/R-manual/R-patched/library/datasets/html/state.html). The first table is called divisions, which has two columns: state and division. The first few rows are shown in the following table:

state        division
Alabama      East South Central
Alaska       Pacific
Arizona      Mountain
Arkansas     West South Central
California   Pacific

The second table is called metrics, which contains three columns: state, population and income. The first few rows of the table are shown below:

state        population   income
Alabama            3615     3624
Alaska              365     6315
Arizona            2212     4530
Arkansas           2110     3378
California        21198     5114

To illustrate missing information, three more rows are added at the end of the original divisions table with the states Alberta, Ontario, and Quebec, whose corresponding division is NULL. We first create these two tables and save them as csv files, and then we upload these two files as Databricks tables.

4.5.3 Basic SQL Statement After logging into Databricks and creating two tables, we can now create a notebook and choose the default language of the notebook to be SQL. There are a few very easy SQL statements to help us understand the database and table structure:

• show databases: show current databases in the system
• create database db_name: create a new database with name db_name
• drop database db_name: delete database db_name (be careful when using it!)
• use db_name: set up the current database to be used
• show tables: show all the tables within the currently used database

• describe tbl_name: show the structure of the table with name tbl_name (i.e. the list of column names and data types)
• drop table tbl_name: delete a table with name tbl_name (be careful when using it!)
• select * from metrics limit 10: show the first 10 rows of a table

If you are familiar with a procedural programming language such as C or FORTRAN, or scripting languages such as R and Python, you may find SQL code a little bit strange. We should view SQL code by each specific chunk where it defines a specific task. SQL code describes a specific task, and the DBMS will run and finish the task. SQL does not follow typical procedural programming rules; we can think of SQL as "descriptive" (i.e. we describe what we want using SQL and the DBMS figures out how to do it).
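For example, assuming the two tables from the previous section were saved under the names divisions and metrics in the current database, a quick way to orient yourself is:

show tables;
describe metrics;
select * from metrics limit 10;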

4.5.3.1 SELECT Statement

SELECT is the most used statement in SQL, especially for database users and business analysts. It is used to extract specific information (i.e. a column or columns) FROM one or multiple tables. It can be used to combine multiple tables. WHERE can be used in the SELECT statement to select rows with specific conditions (i.e. filters). ORDER BY can be used in the SELECT statement to order the results in descending or ascending order of one or multiple columns. We can use * after SELECT to represent all columns in the table, or specifically write the column names separated by commas. Below is the basic structure of a SELECT statement:

SELECT Col_Name1, Col_Name2
FROM Table_Name
WHERE Specific_Condition
ORDER BY Col_Name1, Col_Name2;

Here Specific_Condition is a typical logical condition, and only rows that are TRUE for this condition will be chosen. For example, if we want to choose states and their total income where the population is larger than 10000 and individual income is less than 5000, with the result ordered by state name, we can use the following query:

select state, income*population as total_income
from metrics
where population > 10000 and income < 5000
order by state

The SELECT statement is used to slice and dice the dataset as well as to create new columns of interest (such as total_income) using basic computation functions.

4.5.3.2 Aggregation Functions and GROUP BY

We can also use aggregation functions in the SELECT statement to summarize the data. For example, the count(col_name) function will return the total number of non-NULL rows for a specific column. Other aggregation functions on numerical values include min(col_name), max(col_name) and avg(col_name). Let's use the metrics table again to illustrate aggregation functions. An aggregation function takes all the rows that match the WHERE condition (if any) and returns one number. The following statement will calculate the sum, maximum, minimum, average and count of the population for all states starting with the letters A to E:

select sum(population) as sum_pop,
       max(population) as max_pop,
       min(population) as min_pop,
       avg(population) as avg_pop,
       count(population) as count_pop
from metrics
where substring(state, 1, 1) in ('A', 'B', 'C', 'D', 'E')

The results from the above query return only one row, as expected. Sometimes we want to find aggregated values based on groups that can be defined by one or more columns. Instead of writing multiple SQL queries to calculate the aggregated value for each group, we can easily use GROUP BY to calculate the aggregated value for each group in one SELECT statement. For example, if we want to find how many states are in each division, we can use the following:

select division, count(state) as number_of_states
from divisions
group by division

Another special aggregation operation is to return distinct values for one column or a combination of multiple columns. Simply use SELECT DISTINCT col_name1, col_name2 in the first line of the SELECT statement.
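For example, to list the distinct divisions that appear in the divisions table:

select distinct division
from divisions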

4.5.3.3 Join Multiple Tables

The database system is usually designed such that each table contains a piece of specific information, and oftentimes we need to JOIN multiple tables to achieve a specific task. There are a few typical types of JOINs: inner join (keep only rows that match the join condition from both tables), left outer join (rows from the inner join + unmatched rows from the first table), right outer join (rows from the inner join + unmatched rows from the second table) and full outer join (rows from the inner join + unmatched rows from both tables). A typical JOIN statement is illustrated below:

SELECT a.col_name1 as var1, b.col_name2 as var2
FROM tbl_one as a
LEFT JOIN tbl_two as b
ON a.col_to_match = b.col_to_match

For example, let us join the divisions table and the metrics table to find the average population and income for each division, with the results ordered by division name:

select a.division,
       avg(b.population) as avg_pop,
       avg(b.income) as avg_inc
from divisions as a
inner join metrics as b
on a.state = b.state
group by division
order by division
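Since a view is just a saved query (Section 4.5.2), the join above can also be stored as a view and then queried like a table. The following is a minimal sketch; the view name division_summary is arbitrary and the syntax follows the Databricks/Spark SQL used in this section:

create view division_summary as
select a.division,
       avg(b.population) as avg_pop,
       avg(b.income) as avg_inc
from divisions as a
inner join metrics as b
on a.state = b.state
group by division;

-- query the saved view like a regular table
select * from division_summary order by division;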

4.5.3.4 Add More Content into a Table

We can use the INSERT statement to add additional rows to a particular table. For example, we can add one more row to the metrics table using the following query:

insert into metrics values ('Alberta', 4146, 7370)

4.5.4 Advanced Topics in Database

There are many advanced topics, such as how to efficiently query data using indexes, how to take care of data integrity when multiple users are using the same table, the algorithms behind data storage (i.e. column-wise or row-wise data storage), and how to design the database schema. Users can learn these advanced topics gradually. We hope the basic knowledge covered in this section will kick off the initial momentum to learn SQL. As you can see, it is easy to write SQL statements to retrieve, join, slice, dice and aggregate data. The SQL notebook that contains all the above operations is included in the book's website.

5 Data Pre-processing

Many data analysis related books focus on models, algorithms and statistical inference. However, in practice, raw data is usually not directly used for modeling. Data preprocessing is the process of converting raw data into clean data that is proper for modeling. A model can fail for various reasons; one is that the modeler doesn't correctly preprocess the data before modeling. Data preprocessing can significantly impact model results; examples include imputing missing values and handling outliers. So data preprocessing is a very critical step.

FIGURE 5.1: Data Pre-processing Outline

In real life, depending on the stage of data cleanup, data has the following types:

1. Raw data
2. Technically correct data
3. Data that is proper for the model
4. Summarized data
5. Data with fixed format


The raw data is the first-hand data that analysts pull from the database, market survey responses from your clients, the experimental results collected by the R&D department, and so on. These data may be very rough, and R sometimes can't read them directly. The table title could be multi-line, or the format may not meet the requirements (a small cleanup sketch is given at the end of this discussion):

• 50% is used to represent the percentage rather than 0.5, so R will read it as a character;
• The missing value of the sales is represented by "-" instead of being left blank, so R will treat the variable as character or factor type;
• The data is in a slideshow document, or the spreadsheet is not ".csv" but ".xlsx";
• …

Most of the time, you need to clean the data so that R can import it. Some data formats require a specific package. Technically correct data is the data, after preliminary cleaning or format conversion, that R (or another tool you use) can successfully import.

Assume we have loaded the data into R with reasonable column names, variable formats and so on. That does not mean the data is entirely correct. There may be some observations that do not make sense, such as a negative age, a discount percentage greater than 1, or missing data. Depending on the situation, there may be a variety of problems with the data. It is necessary to clean the data before modeling. Moreover, different models have different requirements on the data. For example, some models may require the variables to be on a consistent scale; some may be susceptible to outliers or collinearity; some may not be able to handle categorical variables; and so on. The modeler has to preprocess the data to make it proper for the specific model.

Sometimes we need to aggregate the data. For example, add up the daily sales to get annual sales of a product at different locations. In customer segmentation, it is common practice to build a profile for each segment. It requires calculating some statistics such as average age, average income, age standard deviation, etc.
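As promised above, here is a minimal hedged sketch, with hypothetical column names, of the kind of conversion needed to turn such raw columns into technically correct data:

# a small sketch with hypothetical columns: "50%"-style strings and
# "-" used as a missing-value placeholder
raw <- data.frame(discount = c("50%", "20%", "-"),
                  sales = c("1200", "-", "980"),
                  stringsAsFactors = FALSE)
# treat "-" as missing
raw[raw == "-"] <- NA
# strip the "%" sign and convert to a numeric proportion
raw$discount <- as.numeric(sub("%", "", raw$discount)) / 100
# convert sales to numeric
raw$sales <- as.numeric(raw$sales)
str(raw)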

Data aggregation is also necessary for presentation or for data visualization. The final table of results for clients needs to be in a nicer format than what is used in the analysis. Usually, data analysts will take the results from data scientists and adjust the format, such as labels, cell colors, and highlights. It is important for a data scientist to make sure the results look consistent, which makes the next step easier for data analysts. It is highly recommended to store each step of the data and the R code, making the whole process as repeatable as possible. An R markdown reproducible report will be extremely helpful for that. If the data changes, it is easy to rerun the process. In the remainder of this chapter, we will show the most common data preprocessing methods. Load the R packages first:

# install packages from CRAN
p_needed <- c('imputeMissings', 'caret', 'e1071', 'psych',
              'car', 'corrplot')
packages <- rownames(installed.packages())
p_to_install <- p_needed[!(p_needed %in% packages)]
if (length(p_to_install) > 0){
  install.packages(p_to_install)
}
lapply(p_needed, require, character.only = TRUE)

5.1 Data Cleaning

After you load the data, the first thing is to check how many variables there are, the types of the variables, the distributions, and data errors. Let's read and check the data:

sim.dat <- read.csv("http://bit.ly/2P5gTw4")
summary(sim.dat)

      age          gender        income            house      store_exp
 Min.   : 16.0   Female:554   Min.   : 41776   No :432   Min.   :  -500
 1st Qu.: 25.0   Male  :446   1st Qu.: 85832   Yes:568   1st Qu.:   205
 Median : 36.0                Median : 93869             Median :   329
 Mean   : 38.8                Mean   :113543             Mean   :  1357
 3rd Qu.: 53.0                3rd Qu.:124572             3rd Qu.:   597
 Max.   :300.0                Max.   :319704             Max.   : 50000
                              NA's   :184

   online_exp    store_trans    online_trans        Q1            Q2
 Min.   :  69   Min.   : 1.00   Min.   : 1.0   Min.   :1.0   Min.   :1.00
 1st Qu.: 420   1st Qu.: 3.00   1st Qu.: 6.0   1st Qu.:2.0   1st Qu.:1.00
 Median :1942   Median : 4.00   Median :14.0   Median :3.0   Median :1.00
 Mean   :2120   Mean   : 5.35   Mean   :13.6   Mean   :3.1   Mean   :1.82
 3rd Qu.:2441   3rd Qu.: 7.00   3rd Qu.:20.0   3rd Qu.:4.0   3rd Qu.:2.00
 Max.   :9479   Max.   :20.00   Max.   :36.0   Max.   :5.0   Max.   :5.00

       Q3             Q4             Q5             Q6             Q7
 Min.   :1.00   Min.   :1.00   Min.   :1.00   Min.   :1.00   Min.   :1.00
 1st Qu.:1.00   1st Qu.:2.00   1st Qu.:1.75   1st Qu.:1.00   1st Qu.:2.50
 Median :1.00   Median :3.00   Median :4.00   Median :2.00   Median :4.00
 Mean   :1.99   Mean   :2.76   Mean   :2.94   Mean   :2.45   Mean   :3.43
 3rd Qu.:3.00   3rd Qu.:4.00   3rd Qu.:4.00   3rd Qu.:4.00   3rd Qu.:4.00
 Max.   :5.00   Max.   :5.00   Max.   :5.00   Max.   :5.00   Max.   :5.00

       Q8            Q9            Q10              segment
 Min.   :1.0   Min.   :1.00   Min.   :1.00   Conspicuous:200
 1st Qu.:1.0   1st Qu.:2.00   1st Qu.:1.00   Price      :250
 Median :2.0   Median :4.00   Median :2.00   Quality    :200
 Mean   :2.4   Mean   :3.08   Mean   :2.32   Style      :350
 3rd Qu.:3.0   3rd Qu.:4.00   3rd Qu.:3.00
 Max.   :5.0   Max.   :5.00   Max.   :5.00

Are there any problems? The questionnaire responses Q1-Q10 seem reasonable: the minimum is 1 and the maximum is 5. Recall that

the questionnaire score is 1-5. The number of store transactions (store_trans) and online transactions (online_trans) make sense too. Things to pay attention to are:

• There are some missing values.
• There are outliers for store expenses (store_exp). The maximum value is 50000. Who would spend $50000 a year buying clothes? Is it an input error?
• There is a negative value (-500) in store_exp, which is not logical.
• Someone is 300 years old.

How should we deal with these issues? Depending on the situation, if the sample size is large enough and the missingness happens randomly, it does not hurt to delete the problematic samples. Or we can set these values as missing and impute them instead of deleting the rows.

# set problematic values as missings
sim.dat$age[which(sim.dat$age > 100)] <- NA
sim.dat$store_exp[which(sim.dat$store_exp < 0)] <- NA
# see the results
summary(subset(sim.dat, select = c("age", "store_exp")))

      age          store_exp
 Min.   :16.00   Min.   :  155.8
 1st Qu.:25.00   1st Qu.:  205.1
 Median :36.00   Median :  329.8
 Mean   :38.58   Mean   : 1358.7
 3rd Qu.:53.00   3rd Qu.:  597.4
 Max.   :69.00   Max.   :50000.0
 NA's   :1       NA's   :1

Now let's deal with the missing values in the data.

5.2 Missing Values

You can write a whole book about missing values. This section will only show some of the most commonly used methods without getting too deep into the topic. Chapter 7 of the book by De Waal, Pannekoek and Scholtus (de Waal et al., 2011) gives a concise overview of some of the existing imputation methods. The choice of a specific method depends on the actual situation. There is no best way.

One question to ask before imputation: is there any auxiliary information? Being aware of any auxiliary information is critical. For example, if the system sets customers who did not purchase as missing, then the real purchasing amount should be 0. Is the missingness a random occurrence? If so, it may be reasonable to impute with the mean or median. If not, is there a potential mechanism for the missing data? For example, older people may be more reluctant to disclose their ages in a questionnaire, so that the absence of age is not completely random. In this case, the missing values need to be estimated using the relationship between age and other independent variables. For example, use variables such as whether they have children, income, and other survey questions to build a model to predict age.

Also, the purpose of modeling is important for selecting imputation methods. If the goal is to interpret the parameter estimates or statistical inference, then it is important to study the missing mechanism carefully and to estimate the missing values using the non-missing information as much as possible. If the goal is to predict, people usually will not study the absence mechanism rigorously (but sometimes the mechanism is obvious). If the absence mechanism is not clear, treat it as missing at random and use the mean, median, or k-nearest neighbors to impute. Since statistical inference is sensitive to missing values, researchers from survey statistics have conducted in-depth studies of various imputation schemes which focus on valid statistical inference. The problem of missing values in a prediction model is different from that in the traditional survey context. Therefore, there are not many papers on missing value imputation for prediction models. Those who want to study further can refer to Saar-Tsechansky and Provost's comparison of different imputation methods (M and F, 2007b) and De Waal, Pannekoek and Scholtus' book (de Waal et al., 2011).
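Before choosing an imputation method, it usually helps to see how much is missing and where. A minimal sketch on sim.dat using base R:

# number of missing values per column
colSums(is.na(sim.dat))
# proportion of rows with at least one missing value
mean(!complete.cases(sim.dat))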

5.2.1 Impute missing values with median/mode

In the case of missing at random, a common method is to impute with the median (for continuous variables) or the mode (for categorical variables). You can use the impute() function in the imputeMissings package.

# save the result as another object
demo_imp <- impute(sim.dat, method = "median/mode")
# check the first 5 columns
# there are no missing values in the other columns
summary(demo_imp[, 1:5])

      age          gender        income            house      store_exp
 Min.   :16.00   Female:554   Min.   : 41776   No :432   Min.   :  155.8
 1st Qu.:25.00   Male  :446   1st Qu.: 87896   Yes:568   1st Qu.:  205.1
 Median :36.00                Median : 93869             Median :  329.8
 Mean   :38.58                Mean   :109923             Mean   : 1357.7
 3rd Qu.:53.00                3rd Qu.:119456             3rd Qu.:  597.3
 Max.   :69.00                Max.   :319704             Max.   :50000.0

After imputation, demo_imp has no missing values. This method is straightforward and widely used. The disadvantage is that it does not take into account the relationships between the variables. When there is a significant proportion of missing values, it will distort the data. In this case, it is better to consider the relationships between variables and study the missing mechanism. In the example here, the missing variables are numeric. If the missing variable is a categorical/factor variable, the impute() function will impute with the mode.
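For example, if we artificially blank out a few values of the factor variable gender, impute() fills them with the most frequent level. This is just a small sketch on a copy of the data:

# make a copy and remove a few gender values
demo_mis <- sim.dat
demo_mis$gender[1:3] <- NA
demo_imp3 <- impute(demo_mis, method = "median/mode")
# the blanked-out genders are filled with the most frequent level
table(demo_imp3$gender)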

You can also use preProcess() in the caret package, but it is only for numeric variables and cannot impute categorical variables. Since the missing values here are numeric, we can use the preProcess() function. The result is the same as that of the impute() function. preProcess() is a powerful function that can link to a variety of data preprocessing methods. We will use the function later for other data preprocessing.

imp <- preProcess(sim.dat, method = "medianImpute")
demo_imp2 <- predict(imp, sim.dat)
summary(demo_imp2[, 1:5])

      age          gender        income            house      store_exp
 Min.   :16.00   Female:554   Min.   : 41776   No :432   Min.   :  155.8
 1st Qu.:25.00   Male  :446   1st Qu.: 87896   Yes:568   1st Qu.:  205.1
 Median :36.00                Median : 93869             Median :  329.8
 Mean   :38.58                Mean   :109923             Mean   : 1357.7
 3rd Qu.:53.00                3rd Qu.:119456             3rd Qu.:  597.3
 Max.   :69.00                Max.   :319704             Max.   :50000.0

5.2.2 K-nearest neighbors

K-nearest neighbors (KNN) will find the k closest samples (by Euclidean distance) in the training set and impute the mean of those "neighbors."

Use preProcess() to conduct KNN:

imp <- preProcess(sim.dat, method = "knnImpute", k = 5)
# need to use predict() to get KNN result
demo_imp <- predict(imp, sim.dat)
# only show the first three elements
lapply(sim.dat, class)[1:3]

      age                 gender        income
 Min.   :-1.5910972   Female:554   Min.   :-1.43989
 1st Qu.:-0.9568733   Male  :446   1st Qu.:-0.53732
 Median :-0.1817107                Median :-0.37606
 Mean   : 0.0000156                Mean   : 0.02389
 3rd Qu.: 1.0162678                3rd Qu.: 0.21540
 Max.   : 2.1437770                Max.   : 4.13627

The preProcess() call in the first line will automatically ignore non-numeric columns.

Comparing the KNN result with the previous median imputation, the two are very different. This is because when you tell the preProcess() function to use KNN (the option method = "knnImpute"), it will automatically standardize the data. Another way is to use a bagging tree (in the next section). Note that KNN cannot impute samples where the entire row is missing. The reason is straightforward: the algorithm averages the values of a sample's neighbors, and if every predictor in the row is missing, there is nothing to locate the neighbors with or to calculate the mean from. Let's append a new row with all values missing to the original data frame to get a new object called temp. Then apply KNN to temp and see what happens:

temp <- rbind(sim.dat, rep(NA, ncol(sim.dat)))
imp <- preProcess(sim.dat, method = "knnImpute", k = 5)
demo_imp <- predict(imp, temp)

Error in FUN(newX[, i], ...) : cannot impute when all predictors are missing in the new data point

There is an error saying "cannot impute when all predictors are missing in the new data point". It is easy to fix by finding and removing the problematic row(s):

idx <- apply(temp, 1, function(x) sum(is.na(x)))
as.vector(which(idx == ncol(temp)))

It shows that row 1001 is problematic. You can go ahead and delete it.
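For example, one way to drop such rows is to reuse the idx vector computed above (a minimal sketch):

# keep only the rows that are not entirely missing
temp <- temp[idx != ncol(temp), ]
# the KNN imputation now runs without error
demo_imp <- predict(imp, temp)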

5.2.3 Bagging Tree

Bagging (bootstrap aggregating) was originally proposed by Leo Breiman. It is one of the earliest ensemble methods (L, 1996a).

When used for missing value imputation, it will use the remaining variables as predictors to train a bagging tree and then use the tree to predict the missing values. Although the method is theoretically powerful, the computation is much more intensive than KNN. In practice, there is a trade-off between computation time and effect. If a median or mean meets the modeling needs, then even though a bagging tree may improve the accuracy a little, the improvement may be so marginal that it does not deserve the extra time. The bagging tree itself is a model for regression and classification. Here we use preProcess() to impute sim.dat:

imp <- preProcess(sim.dat, method = "bagImpute")
demo_imp <- predict(imp, sim.dat)
summary(demo_imp[, 1:5])

      age          gender        income            house      store_exp
 Min.   :16.00   Female:554   Min.   : 41776   No :432   Min.   :  155.8
 1st Qu.:25.00   Male  :446   1st Qu.: 86762   Yes:568   1st Qu.:  205.1
 Median :36.00                Median : 94739             Median :  329.0
 Mean   :38.58                Mean   :114665             Mean   : 1357.7
 3rd Qu.:53.00                3rd Qu.:123726             3rd Qu.:  597.3
 Max.   :69.00                Max.   :319704             Max.   :50000.0

5.3 Centering and Scaling

Centering and scaling is the most straightforward data transformation. It centers and scales a variable to mean 0 and standard deviation 1. It ensures that the criterion for finding linear combinations of the predictors is based on how much variation they explain and therefore improves the numerical stability. Models that involve finding linear combinations of the predictors to explain response/predictor variation need data centering and scaling, such as PCA (Jolliffe, 2002), PLS (Geladi P, 1986) and EFA (Mulaik, 2009). You can quickly write code yourself to conduct this transformation.

Let’s standardize the variable income from sim.dat: income <- sim.dat$income # calculate the mean of income mux <- mean(income, na.rm = T) # calculate the standard deviation of income sdx <- sd(income, na.rm = T) # centering tr1 <- income - mux # scaling tr2 <- tr1/sdx

Or the function preProcess() can apply this transformation to a set of predictors.

sdat <- subset(sim.dat, select = c("age", "income"))
# set the 'method' option
trans <- preProcess(sdat, method = c("center", "scale"))
# use predict() function to get the final result
transformed <- predict(trans, sdat)

Now the two variables are on the same scale. You can check the result using summary(transformed). Note that there are missing values.

5.4 Resolve Skewness

Skewness (https://en.wikipedia.org/wiki/Skewness) is defined to be the third standardized central moment. The formula for the sample skewness statistic is:

$$skewness = \frac{\sum (x_i - \bar{x})^3}{(n-1)\, v^{3/2}}$$

$$v = \frac{\sum (x_i - \bar{x})^2}{n-1}$$


Skewness = 0 means that the distribution is symmetric, i.e. the probability of falling on either side of the distribution's mean is equal.

FIGURE 5.2: Example of skewed distributions (left panel: left skew, skewness = -1.88; right panel: right skew, skewness = 1.88)
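The sample skewness defined above can be computed directly, or with the skewness() function from the e1071 package loaded earlier; e1071 uses a slightly different denominator convention, so the two values are close but not identical. A quick sketch on a right-skewed variable:

x <- sim.dat$online_exp
n <- length(x)
# v as defined above
v <- sum((x - mean(x))^2) / (n - 1)
# sample skewness following the formula above
sum((x - mean(x))^3) / ((n - 1) * v^(3/2))
# closely related estimator from e1071
e1071::skewness(x)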

There are different transformations that may help to remove skewness, such as log, square root or inverse. However, it is often difficult to determine from plots which transformation is most appropriate for correcting skewness. The Box-Cox procedure automatically identifies a transformation from the family of power transformations indexed by a parameter λ (Box G, 1964).

$$x^{*} = \begin{cases} \dfrac{x^{\lambda}-1}{\lambda} & \text{if } \lambda \neq 0 \\ \log(x) & \text{if } \lambda = 0 \end{cases}$$

It is easy to see that this family includes the log transformation (λ = 0), square transformation (λ = 2), square root (λ = 0.5), inverse (λ = -1) and others in between. We can still use the function preProcess() in the caret package to apply this transformation by changing the method argument.

describe(sim.dat)

             vars    n      mean       sd median   trimmed      mad ...
age             1 1000     38.84    16.42     36     37.69     16.31
gender*         2 1000      1.45     0.50      1      1.43      0.00
income          3  816 113543.07 49842.29  93869 104841.94  28989.47
house*          4 1000      1.57     0.50      2      1.58      0.00
store_exp       5 1000   1356.85  2774.40    329    839.92    196.45
online_exp      6 1000   2120.18  1731.22   1942   1874.51   1015.21
store_trans     7 1000      5.35     3.70      4      4.89      2.97
online_trans    8 1000     13.55     7.96     14     13.42     10.38
...

It is easy to see the skewed variables. If the mean and trimmed mean differ a lot, there are very likely outliers. By default, trimmed reports the mean after dropping the top and bottom 10%. It can be adjusted by setting the argument trim=. It is clear that store_exp has outliers. As an example, we will apply the Box-Cox transformation on store_trans and online_trans:

# select the two columns and save them as dat_bc
dat_bc <- subset(sim.dat, select = c("store_trans", "online_trans"))
(trans <- preProcess(dat_bc, method = c("BoxCox")))

## Created from 1000 samples and 2 variables
## 
## Pre-processing:
##   - Box-Cox transformation (2)
##   - ignored (0)
## 
## Lambda estimates for Box-Cox transformation:
## 0.1, 0.7

The last line of the output shows the estimates of λ for each variable. As before, use predict() to get the transformed result:

transformed <- predict(trans, dat_bc)
par(mfrow = c(1, 2), oma = c(2, 2, 2, 2))
hist(dat_bc$store_trans, main = "Before Transformation",
     xlab = "store_trans")
hist(transformed$store_trans, main = "After Transformation",
     xlab = "store_trans")

(Histograms of store_trans before and after the Box-Cox transformation.)

Before the transformation, store_trans is skewed to the right. BoxCoxTrans() can also conduct the Box-Cox transformation. But note that BoxCoxTrans() can only be applied to a single variable; it is not possible to transform different columns of a data frame at the same time.

(trans <- BoxCoxTrans(dat_bc$store_trans))

## Box-Cox Transformation
## 
## 1000 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    4.00    5.35    7.00   20.00

## 
## Largest/Smallest: 20
## Sample Skewness: 1.11
## 
## Estimated Lambda: 0.1
## With fudge factor, Lambda = 0 will be used for transformations

transformed <- predict(trans, dat_bc$store_trans)
skewness(transformed)

## [1] -0.2155

The estimate of λ is the same as before (0.1). The skewness of the original observations is 1.1, and -0.2 after transformation. Although it is not strictly 0, it is greatly improved.

5.5 Resolve Outliers

Even though under certain assumptions we can statistically define outliers, they can be hard to define in some situations. Box plots, histograms and some other basic visualizations can be used to initially check whether there are outliers. For example, we can visualize the numerical non-survey variables in sim.dat:

# select numerical non-survey data
sdat <- subset(sim.dat,
               select = c("age", "income", "store_exp",
                          "online_exp", "store_trans", "online_trans"))
# use scatterplotMatrix() function from car package
par(oma = c(2, 2, 1, 2))
car::scatterplotMatrix(sdat, diagonal = TRUE)

(Scatterplot matrix of age, income, store_exp, online_exp, store_trans and online_trans.)

It is also easy to observe the pairwise relationships from the plot. age is negatively correlated with online_trans but positively correlated with store_trans. It seems that older people tend to purchase from the local store. The amount of expense is positively correlated with income. A scatterplot matrix like this can reveal lots of information before modeling. In addition to visualization, there are some statistical methods to define outliers, such as the commonly used Z-score. The Z-score for variable Y is defined as:

$$Z_i = \frac{Y_i - \bar{Y}}{s}$$

where $\bar{Y}$ and $s$ are the mean and standard deviation of $Y$. The Z-score is a measurement of the distance between each observation and the mean. This method may be misleading, especially when the sample size is small. Iglewicz and Hoaglin proposed to use the modified Z-score to determine outliers (Iglewicz and Hoaglin, 1993):

$$M_i = \frac{0.6745\,(Y_i - \bar{Y})}{MAD}$$

where $MAD$ is the median of the series $|Y_i - \bar{Y}|$, called the median of the absolute dispersion. Iglewicz and Hoaglin suggest that points with a modified Z-score greater than 3.5 computed as above are possible outliers. Let's apply it to income:

# calculate median of the absolute dispersion for income
ymad <- mad(na.omit(sdat$income))
# calculate z-score
zs <- (sdat$income - mean(na.omit(sdat$income)))/ymad
# count the number of outliers
sum(na.omit(zs > 3.5))

## [1] 59

According to the modified Z-score, the variable income has 59 outliers. Refer to (Iglewicz and Hoaglin, 1993) for other ways of detecting outliers.

The impact of outliers depends on the model. Some models are sensitive to outliers, such as linear regression and logistic regression. Some are pretty robust to outliers, such as tree models and support vector machines. Also, an outlier is not necessarily wrong data. It is a real observation, so it cannot be deleted at will. If a model is sensitive to outliers, we can use the spatial sign transformation (Serneels S, 2006) to minimize the problem. It projects the original sample points to the surface of a sphere by:

$$x_{ij}^{*} = \frac{x_{ij}}{\sqrt{\sum_{j=1}^{p} x_{ij}^2}}$$

where $x_{ij}$ represents the $i^{th}$ observation and the $j^{th}$ variable. As shown in the equation, every observation of sample $i$ is divided by its norm. The denominator is the Euclidean distance to the center of the p-dimensional predictor space. Three things to pay attention to here:

1. It is important to center and scale the predictor data before using this transformation

2. Unlike centering or scaling, this manipulation of the predictors transforms them as a group
3. If there are some variables to remove (for example, highly correlated variables), do it before the transformation

The function spatialSign() in the caret package can conduct the transformation. Take income and age as an example:

# KNN imputation
sdat <- sim.dat[, c("income", "age")]
imp <- preProcess(sdat, method = c("knnImpute"), k = 5)
sdat <- predict(imp, sdat)
transformed <- spatialSign(sdat)
transformed <- as.data.frame(transformed)
par(mfrow = c(1, 2), oma = c(2, 2, 2, 2))
plot(income ~ age, data = sdat, col = "blue", main = "Before")
plot(income ~ age, data = transformed, col = "blue", main = "After")

(Scatterplots of income vs. age before and after the spatial sign transformation.)

Some readers may have noticed that the above code does not seem to standardize the data before the transformation. Recall from the introduction of KNN that preProcess() with method = "knnImpute" will standardize the data by default.

5.6 Collinearity

Collinearity is probably the technical term best known by the most un-technical people. When two predictors are very strongly correlated, including both in a model may lead to confusion or problems with a singular matrix. There is an excellent function in the corrplot package with the same name, corrplot(), that can visualize the correlation structure of a set of predictors. The function has the option to reorder the variables in a way that reveals clusters of highly correlated ones.

# select non-survey numerical variables
sdat <- subset(sim.dat,
               select = c("age", "income", "store_exp",
                          "online_exp", "store_trans", "online_trans"))
# use bagging imputation here
imp <- preProcess(sdat, method = "bagImpute")
sdat <- predict(imp, sdat)
# get the correlation matrix
correlation <- cor(sdat)
# plot
par(oma = c(2, 2, 2, 2))
corrplot.mixed(correlation, order = "hclust",
               tl.pos = "lt", upper = "ellipse")

(Correlation plot produced by corrplot.mixed() for online_trans, age, online_exp, store_exp, income and store_trans; the lower triangle prints the pairwise correlation coefficients, for example cor(online_trans, age) = -0.74 and cor(income, store_trans) = 0.71.)

The closer the correlation is to 0, the lighter the color is and the closer the shape is to a circle. An ellipse means the correlation is not equal to 0 (because we set upper = "ellipse"); the greater the correlation, the narrower the ellipse. Blue represents a positive correlation; red represents a negative correlation. The direction of the ellipse also changes with the correlation. The correlation coefficients are shown in the lower triangle of the matrix.

The variable relationships from the previous scatterplot matrix are clear here: the negative correlation between age and online shopping, and the positive correlation between income and the amount of purchasing. Some correlations are very strong (such as the correlation between online_trans and age, which is -0.7414), which means the two variables contain duplicate information.

Section 3.5 of "Applied Predictive Modeling" (Kuhn and Johnston, 2013) presents a heuristic algorithm to remove a minimum number of predictors to ensure all pairwise correlations are below a certain threshold:

(1) Calculate the correlation matrix of the predictors.
(2) Determine the two predictors associated with the largest absolute pairwise correlation (call them predictors A and B).
(3) Determine the average correlation between A and the other variables. Do the same for predictor B.
(4) If A has a larger average correlation, remove it; otherwise, remove predictor B.
(5) Repeat steps 2-4 until no absolute correlations are above the threshold.

The findCorrelation() function in package caret will apply the above algorithm.

(highCorr <- findCorrelation(cor(sdat), cutoff = 0.7))

## [1] 2 6

It returns the indices of the columns that need to be deleted. It tells us that we need to remove columns 2 and 6 to make sure the correlations are all below 0.7.

# delete highly correlated columns
sdat <- sdat[-highCorr]
# check the new correlation matrix
(cor(sdat))

The absolute values of the elements in the correlation matrix after removal are all below 0.7. How strong does a correlation have to get before you should start worrying about multicollinearity? There is no easy answer to that question. You can treat the threshold as a tuning parameter and pick one that gives you the best prediction accuracy.

5.7 Sparse Variables

Other than highly correlated predictors, predictors with degenerate distributions can cause problems too. Removing those variables can significantly improve some models' performance and stability (such as linear regression and logistic regression; tree-based models are impervious to this type of predictor). One extreme example is a variable with a single value, which is called a zero-variance variable. Variables with a very low frequency of unique values are near-zero variance predictors. In general, detecting those variables follows two rules:

• The fraction of unique values over the sample size
• The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value

The nearZeroVar() function in the caret package can filter near-zero variance predictors according to the above rules. In order to show the usage of the function, let's arbitrarily add some problematic variables to the original data sim.dat:

# make a copy
zero_demo <- sim.dat
# add two sparse variables: zero1 only has one unique value;
# zero2 is a vector with the first element 1 and the rest 0s
zero_demo$zero1 <- rep(1, nrow(zero_demo))
zero_demo$zero2 <- c(1, rep(0, nrow(zero_demo) - 1))

The function will return a vector of integers indicating which columns to remove:

nearZeroVar(zero_demo, freqCut = 95/5, uniqueCut = 10)

## [1] 20 21

As expected, it returns the two columns we generated. You can go ahead and remove them. Note that the two arguments in the function,

freqCut = and uniqueCut =, correspond to the previous two rules:

• freqCut: the cutoff for the ratio of the most common value to the second most common value
• uniqueCut: the cutoff for the percentage of distinct values out of the number of total samples
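To see the numbers behind these two rules, nearZeroVar() can also return the diagnostics themselves instead of the column indices by setting saveMetrics = TRUE. A quick sketch on the toy data above:

# freqRatio: frequency of the most common value over that of the second most common value
# percentUnique: percentage of distinct values out of the total number of samples
metrics <- nearZeroVar(zero_demo, saveMetrics = TRUE)
tail(metrics, 3)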

5.8 Re-encode Dummy Variables

A dummy variable is a binary variable (0/1) that represents subgroups of the sample. Sometimes we need to recode categories into smaller bits of information named "dummy variables." For example, some questionnaires have five options for each question: A, B, C, D, and E. After you get the data, you will usually convert the corresponding categorical variable for each question into five dummy variables, and then use one of the options as the baseline.

Let’s encode gender and house from sim.dat to dummy variables. There are two ways to implement this. The first is to use class.ind() from nnet package. However, it only works on one vari- able at a time.

dumVar <- nnet::class.ind(sim.dat$gender)
head(dumVar)

##      Female Male
## [1,]      1    0
## [2,]      1    0
## [3,]      0    1
## [4,]      0    1
## [5,]      0    1
## [6,]      0    1

Since it is redundant to keep both, we need to remove one of them when modeling. Another, more powerful, function is dummyVars() from caret:

# use "original variable name + level" as new names
dumMod <- dummyVars(~gender + house + income,
                    data = sim.dat,
                    levelsOnly = F)
head(predict(dumMod, sim.dat))

##   genderFemale genderMale houseNo houseYes income
## 1            1          0       0        1 120963
## 2            1          0       0        1 122008
## 3            0          1       0        1 114202
## 4            0          1       0        1 113616
## 5            0          1       0        1 124253
## 6            0          1       0        1 107661

dummyVars() can also use the formula format. The variables on the right-hand side can be both categorical and numeric. For a numerical variable, the function will keep the variable unchanged. The advantage is that you can apply the function to a data frame without removing the numerical variables. In addition, the function can create interaction terms:

dumMod <- dummyVars(~gender + house + income + income:gender,
                    data = sim.dat,
                    levelsOnly = F)
head(predict(dumMod, sim.dat))

##   genderFemale genderMale houseNo houseYes income
## 1            1          0       0        1 120963
## 2            1          0       0        1 122008
## 3            0          1       0        1 114202
## 4            0          1       0        1 113616
## 5            0          1       0        1 124253
## 6            0          1       0        1 107661
##   genderFemale:income genderMale:income

## 1              120963                 0
## 2              122008                 0
## 3                   0            114202
## 4                   0            113616
## 5                   0            124253
## 6                   0            107661

If you think the impact of income levels on purchasing behavior is different for males and females, then you may add the interaction term between income and gender. You can do this by adding income:gender in the formula.

6 Data Wrangling

This chapter focuses on some of the most frequently used data manipulations and shows how to implement them in R and Python. It is critical to explore the data with descriptive statistics (mean, standard deviation, etc.) and data visualization before analysis. Transform the data so that the data structure is in line with the requirements of the model. You also need to summarize the results after analysis.

When the data is too large to fit in a computer's memory, we can use a big data analytics engine like Spark on a cloud platform (see Chapter 4). Even though the user interfaces of many data platforms are much more friendly now, it is still easier to manipulate the data as a local data frame. Spark's R and Python interfaces aim to keep the data manipulation syntax consistent with popular packages for local data frames. As shown in Section 4.4, we can run nearly all of the dplyr functions on a Spark data frame once the Spark environment is set up. And the Python interface pyspark uses a similar syntax as pandas. This chapter focuses on data manipulations on standard data frames, which is also the foundation of big data manipulation.

Even when the data can fit in memory, there may be a situation where it is slow to read and manipulate due to a relatively large size. Some R packages can make the process faster at the cost of familiarity, especially for data wrangling. But this avoids the hurdle of setting up a Spark cluster and working in an unfamiliar environment. It is not a topic of this chapter, but Appendix A briefly introduces some of the alternative R packages to read, write and wrangle a data set that is relatively large but not too big to fit in memory.


There are many fundamental data processing functions in R. They lack consistent coding conventions and can’t flow together easily. Learning all of them is a daunting task and unnecessary. RStudio developed a collection of packages and bundled them in tidyverse to systemize data wrangling and analysis tasks. You can see the package list on the tidyverse website (https://www.tidyverse.org/packages/). This chapter focuses on some of the tidyverse packages to do data wrangling for the following reasons:

• Those packages are widely used among R users in data science.
• The code is more efficient.
• The code syntax is consistent, which makes it easier to remember and read.

Section 6.1.2 introduces some base R functions outside the tidyverse universe, such as apply(), lapply() and sapply(). They are complementary functions when you are working with a data frame. Load the R packages first:

# install packages from CRAN
p_needed <- c('dplyr', 'tidyr')
packages <- rownames(installed.packages())
p_to_install <- p_needed[!(p_needed %in% packages)]
if (length(p_to_install) > 0) {
    install.packages(p_to_install)
}

lapply(p_needed, require, character.only = TRUE)

6.1 Summarize Data


6.1.1 dplyr package

dplyr provides a flexible grammar of data manipulation focusing on tools for working with data frames (hence the d in the name). It is faster and more friendly:

• It identifies the most important data manipulations and makes them easy to use from R.
• It performs faster for in-memory data by writing key pieces in C++ using Rcpp.
• The interface is the same for data frames, data tables or databases.

We will illustrate the following functions in order using the clothing company data:

1. Display
2. Subset
3. Summarize
4. Create new variable
5. Merge

# Read data
sim.dat <- read.csv("http://bit.ly/2P5gTw4")

Display

• tbl_df(): Convert the data to a tibble which offers better checking and printing capabilities than a traditional data frame. It will adjust the output width to fit the current window.

tbl_df(sim.dat)

• glimpse(): This is like a transposed version of tbl_df()

glimpse(sim.dat)

Subset

Get rows with income more than 300000:

filter(sim.dat, income > 300000) %>% tbl_df()

Here we use the operator %>%. It is called a “pipe operator”, which pipes a value forward into an expression or function call. What you get from the left operation becomes the first argument (or the only argument) of the right operation:

x %>% f(y) = f(x, y)
y %>% f(x, ., z) = f(x, y, z)
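For example, the two calls below are equivalent (a small sketch, not from the book, using the sim.dat data frame loaded above):

# head(sim.dat, 3) written without and with the pipe operator
head(sim.dat, 3)
sim.dat %>% head(3)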

It is an operator from magrittr which can be really beneficial. Look at the following code. Can you tell me what it does?

ave_exp <- filter(
    summarise(
        group_by(
            filter(
                sim.dat,
                !is.na(income)
            ),
            segment
        ),
        ave_online_exp = mean(online_exp),
        n = n()
    ),
    n > 200
)

Now look at the identical code using “%>%”:

ave_exp <- sim.dat %>%
    filter(!is.na(income)) %>%
    group_by(segment) %>%
    summarise(
        ave_online_exp = mean(online_exp),
        n = n()
    ) %>%
    filter(n > 200)

Isn’t it much more straightforward now? Let’s read it:

1. Delete observations from sim.dat with missing income values
2. Group the data from step 1 by the variable segment
3. Calculate the mean of online expense for each segment and save the result as a new variable named ave_online_exp
4. Calculate the size of each segment and save it as a new variable named n
5. Get segments with size larger than 200

You can use distinct() to delete duplicated rows:

dplyr::distinct(sim.dat)

sample_frac() randomly selects a specified fraction of rows; sample_n() randomly selects a specified number of rows:

dplyr::sample_frac(sim.dat, 0.5, replace = TRUE)
dplyr::sample_n(sim.dat, 10, replace = TRUE)

slice() will select rows by position:

dplyr::slice(sim.dat, 10:15)

It is equivalent to sim.dat[10:15, ]. top_n() will select the top n entries ordered by a given variable:

dplyr::top_n(sim.dat, 2, income)

If you want to select columns instead of rows, you can use select(). The following are some sample codes:

# select by column name
dplyr::select(sim.dat, income, age, store_exp)

# select columns whose name contains a character string dplyr::select(sim.dat, contains("_"))

# select columns whose name ends with a character string
# similarly, there is "starts_with"
dplyr::select(sim.dat, ends_with("e"))

# select columns Q1,Q2,Q3,Q4 and Q5 select(sim.dat, num_range("Q", 1:5))

# select columns whose names are in a group of names dplyr::select(sim.dat, one_of(c("age", "income")))

# select columns between age and online_exp dplyr::select(sim.dat, age:online_exp)

# select all columns except for age dplyr::select(sim.dat, -age)

Summarize

A standard marketing problem is customer segmentation. It usually starts with designing a survey and collecting data, and then running a cluster analysis on the data to get customer segments. Once we have different segments, the next step is to understand what each group of customers looks like by summarizing some key metrics. For example, we can do the following data aggregation for different segments of clothing customers.

dat_summary <- sim.dat %>%
    dplyr::group_by(segment) %>%
    dplyr::summarise(Age = round(mean(na.omit(age)), 0),
                     FemalePct = round(mean(gender == "Female"), 2),
                     HouseYes = round(mean(house == "Yes"), 2),
                     store_exp = round(mean(na.omit(store_exp), trim = 0.1), 0),
                     online_exp = round(mean(online_exp), 0),
                     store_trans = round(mean(store_trans), 1),
                     online_trans = round(mean(online_trans), 1))

# transpose the data frame for showing purpose
# due to the limit of output width
cnames <- dat_summary$segment
dat_summary <- dplyr::select(dat_summary, - segment)
tdat_summary <- t(dat_summary) %>% data.frame()
names(tdat_summary) <- cnames
tdat_summary

##              Conspicuous   Price Quality   Style
## Age                42.00   60.00   35.00   24.00
## FemalePct           0.32    0.45    0.47    0.81
## HouseYes            0.86    0.94    0.34    0.27
## store_exp        4990.00  501.00  301.00  200.00
## online_exp       4898.00  205.00 2013.00 1962.00
## store_trans        10.90    6.10    2.90    3.00
## online_trans       11.10    3.00   16.00   21.10

Now, let’s peel the onion.

The first line sim.dat is easy: it is the data you want to work on. The second line group_by(segment) tells R that in the following steps you want to summarise by the variable segment. Here we only summarize the data by one categorical variable, but you can group by multiple variables, such as group_by(segment, house). The third part, summarise(), tells R the manipulation(s) to do, listing the exact actions inside summarise(). For example, Age = round(mean(na.omit(age)), 0) tells R the following things:

1. Calculate the mean of column age ignoring missing values for each customer segment
2. Round the result to the specified number of decimal places
3. Store the result in a new variable named Age

The rest of the command above is similar. In the end, we calculate the following for each segment:

1. Age: average age for each segment
2. FemalePct: percentage of females in each segment
3. HouseYes: percentage of people who own a house
4. store_exp: average expense in store
5. online_exp: average expense online
6. store_trans: average number of transactions in store
7. online_trans: average number of online transactions

There is a lot of information you can extract from those simple averages.

• Conspicuous: The average age is about 40. It is a group of middle-aged wealthy people. 1/3 of them are female, and 2/3 are male. They buy regardless of the price. Almost all of them own a house (0.86). It makes us wonder what is going on with the remaining 14%.
• Price: They are older people with an average age of 60. Nearly all of them own a house (0.94). They are less likely to purchase online (store_trans = 6 while online_trans = 3). It is the only group that is less likely to buy online.
• Quality: The average age is 35. They are not very different from Conspicuous in age, but they spend much less. The percentages of males and females are similar. They prefer online shopping. More than half of them don’t own a house (0.66).
• Style: They are young people with an average age of 24. The majority of them are female (0.81). Most of them don’t own a house (0.73). They are very likely to be digital natives and prefer online shopping.

You may notice that the Style group purchases more frequently online (online_trans) but the expense (online_exp) is not higher. It makes us wonder what the average expense per purchase is, so we have a better idea about the price range of the group. The analytical process is iterative rather than a set of independent steps: the current step sheds new light on what to do next, and sometimes you need to go back and fix something in the previous steps. Let’s check the average one-time online and in-store purchase amounts:

sim.dat %>%
    group_by(segment) %>%
    summarise(avg_online = round(sum(online_exp)/sum(online_trans), 2),
              avg_store = round(sum(store_exp)/sum(store_trans), 2))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 4 x 3
##   segment     avg_online avg_store
##
## 1 Conspicuous      442.      479.
## 2 Price             69.3      81.3
## 3 Quality          126.      105.
## 4 Style             92.8     121.

The Price group has the lowest average one-time purchase amount, and the Conspicuous group pays the highest price. When we build customer profiles in real life, we will also need to look at the survey summarization. You may be surprised how much information simple data manipulations can provide.

Another common task is to check which columns have missing values. It requires the program to look at each column in the data. In this case you can use summarise_all():

# apply function anyNA() to each column
# you can also assign a function vector
# such as: c("anyNA", "is.factor")
dplyr::summarise_all(sim.dat, funs_(c("anyNA")))

## Warning: `funs_()` is deprecated as of dplyr 0.7.0.
## Please use `funs()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
##   # Simple named list:
##   list(mean = mean, median = median)
##
##   # Auto named with `tibble::lst()`:
##   tibble::lst(mean, median)
##
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
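The warnings above appear because funs_() and funs() are deprecated in newer versions of dplyr. As a hedged sketch (passing the function directly works in dplyr 0.8 or later; the across() variant assumes dplyr 1.0.0 or later), the same check can be written as:

# pass the function itself instead of funs_()
dplyr::summarise_all(sim.dat, anyNA)
# or, with dplyr >= 1.0.0
sim.dat %>% dplyr::summarise(dplyr::across(dplyr::everything(), anyNA))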

##     age gender income house store_exp online_exp
## 1 FALSE  FALSE   TRUE FALSE     FALSE      FALSE
##   store_trans online_trans    Q1    Q2    Q3    Q4
## 1       FALSE        FALSE FALSE FALSE FALSE FALSE
##      Q5    Q6    Q7    Q8    Q9   Q10 segment
## 1 FALSE FALSE FALSE FALSE FALSE FALSE   FALSE

The code returns a logical value for each column indicating whether it contains any missing values.

Create new variable

There are often situations where you need to create new variables. For example, adding online and store expense to get total expense.

In this case, you apply a window function to the columns and return a column of the same length. mutate() can do it for you and append one or more new columns:

dplyr::mutate(sim.dat, total_exp = store_exp + online_exp)

The above code sums up two columns and appends the result (total_exp) to sim.dat. Another similar function is transmute(). The difference is that transmute() will delete the original columns and only keep the new ones.

dplyr::transmute(sim.dat, total_exp = store_exp + online_exp)

Merge

Similar to SQL, there are different joins in dplyr. We create two baby data sets to show how the functions work.

(x <- data.frame(cbind(ID = c("A", "B", "C"), x1 = c(1, 2, 3))))

##   ID x1
## 1  A  1
## 2  B  2
## 3  C  3

(y <- data.frame(cbind(ID = c("B", "C", "D"), y1 = c(T, T, F))))

##   ID    y1
## 1  B  TRUE
## 2  C  TRUE
## 3  D FALSE

# join to the left
# keep all rows in x
left_join(x, y, by = "ID")

##   ID x1   y1
## 1  A  1 <NA>
## 2  B  2 TRUE
## 3  C  3 TRUE

# get rows matched in both data sets inner_join(x, y, by = "ID")

##   ID x1   y1
## 1  B  2 TRUE
## 2  C  3 TRUE

# get rows in either data set full_join(x, y, by = "ID")

##   ID   x1    y1
## 1  A    1  <NA>
## 2  B    2  TRUE
## 3  C    3  TRUE
## 4  D <NA> FALSE

# filter out rows in x that can be matched in y
# it doesn't bring in any values from y
semi_join(x, y, by = "ID")

# the opposite of semi_join()
# it gets rows in x that cannot be matched in y
# it doesn't bring in any values from y
anti_join(x, y, by = "ID")

There are other functions (intersect(), union() and setdiff()), as well as the data frame versions of rbind and cbind, which are bind_rows() and bind_cols(). We are not going to go through them all; you can try them yourself. If you understand the functions we have introduced so far, it should be easy for you to figure out the rest.
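For instance, a minimal sketch using the toy data frames x and y defined above (the x2 column below is made up for illustration):

# stack rows; columns are matched by name and missing ones filled with NA
bind_rows(x, y)
# combine columns side by side; the data frames must have the same number of rows
bind_cols(x, data.frame(x2 = c(10, 20, 30)))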

6.1.2 apply(), lapply() and sapply() in base R

There are some powerful functions to summarize data in base R, such as apply(), lapply() and sapply(). They do the same basic things and are all from the “apply” family: they apply a function over parts of the data. They differ in two important respects:

1. the type of object they apply to
2. the type of result they will return

When do we use apply()? When we want to apply a function to the margins of an array or matrix. That means our data needs to be structured. The operations can be very flexible. It returns a vector, array or list of values obtained by applying the function to the margins of the array or matrix. For example, you can compute row and column sums for a matrix:

## simulate a matrix
x <- cbind(x1 = 1:8, x2 = c(4:1, 2:5))
dimnames(x)[[1]] <- letters[1:8]
apply(x, 2, mean)

##  x1  x2
## 4.5 3.0

col.sums <- apply(x, 2, sum)
row.sums <- apply(x, 1, sum)

You can also apply other functions:

ma <- matrix(c(1:4, 1, 6:8), nrow = 2)
ma

##      [,1] [,2] [,3] [,4]
## [1,]    1    3    1    7
## [2,]    2    4    6    8

apply(ma, 1, table) #--> a list of length 2

## [[1]]
##
## 1 3 7
## 2 1 1
##
## [[2]]
##
## 2 4 6 8
## 1 1 1 1

apply(ma, 1, stats::quantile) # 5 x n matrix with rownames

##      [,1] [,2]
## 0%      1  2.0
## 25%     1  3.5
## 50%     2  5.0
## 75%     4  6.5
## 100%    7  8.0

Results can have different lengths for each call. This is a trickier example. What will you get?

## Example with different lengths for each call
z <- array(1:24, dim = 2:4)
zseq <- apply(z, 1:2, function(x) seq_len(max(x)))
zseq          ## a 2 x 3 matrix
typeof(zseq)  ## list
dim(zseq)     ## 2 3
zseq[1, ]
apply(z, 3, function(x) seq_len(max(x)))

• lapply() applies a function over a list, data frame or vector and returns a list of the same length.
• sapply() is a user-friendly version and wrapper of lapply(). By default it returns a vector or matrix, or, if simplify = "array", an array if appropriate. sapply(x, f, simplify = FALSE, USE.NAMES = FALSE) is the same as lapply(x, f). With the default simplify = TRUE, it tries to simplify the result into a vector or matrix instead of a list.

Let’s use some data with context to help you better understand the functions.

• Get the mean and standard deviation of all numerical variables in the dataset.

# Get numerical variables
sdat <- sim.dat[, lapply(sim.dat, class) %in% c("integer", "numeric")]
## Try the following code with the apply() function
apply(sim.dat, 2, class)
## What is the problem?

The data frame sdat only includes numeric columns. Now we can go ahead and use apply() to get the mean and standard deviation for each column:

apply(sdat, MARGIN = 2, function(x) mean(na.omit(x)))

##          age       income    store_exp   online_exp
##    3.884e+01    1.135e+05    1.357e+03    2.120e+03
##  store_trans online_trans           Q1           Q2
##    5.350e+00    1.355e+01    3.101e+00    1.823e+00
##           Q3           Q4           Q5           Q6
##    1.992e+00    2.763e+00    2.945e+00    2.448e+00
##           Q7           Q8           Q9          Q10
##    3.434e+00    2.396e+00    3.085e+00    2.320e+00

Here we defined a function using function(x) mean(na.omit(x)). It is a very simple function that tells R to ignore missing values when calculating the mean. MARGIN = 2 tells R to apply the function to each column; it is not hard to guess what MARGIN = 1 means. The result shows that the average online expense is much higher than the store expense. You can also compare the average scores across different questions. The command to calculate the standard deviation is very similar; the only difference is to change mean() to sd():

apply(sdat, MARGIN = 2, function(x) sd(na.omit(x)))

##          age       income    store_exp   online_exp
##       16.417    49842.287     2774.400     1731.224
##  store_trans online_trans           Q1           Q2
##        3.696        7.957        1.450        1.168
##           Q3           Q4           Q5           Q6
##        1.402        1.155        1.284        1.439
##           Q7           Q8           Q9          Q10
##        1.456        1.154        1.118        1.136

Even though the average online expense is higher than the store expense, the standard deviation for store expense is much higher than for online expense, which indicates there are very likely some unusually big or small in-store purchases. We can check it quickly:

summary(sdat$store_exp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    -500     205     329    1357     597   50000

summary(sdat$online_exp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##      69     420    1942    2120    2441    9479

There are some odd values in store expense. The minimum value is -500, which indicates that you should preprocess the data before analyzing it. Checking those simple statistics will help you better understand your data and gives you some idea of how to preprocess and analyze it. How about using lapply() and sapply()? Run the following code and compare the results:

lapply(sdat, function(x) sd(na.omit(x)))
sapply(sdat, function(x) sd(na.omit(x)))
sapply(sdat, function(x) sd(na.omit(x)), simplify = FALSE)

6.2 Tidy and Reshape Data

“Tidy data” represents the information from a dataset as a data frame where each row is an observation and each column contains the values of a variable (i.e., an attribute of what we are observing). Depending on the situation, the requirements on what to present as rows and columns may change. To make data easy to work with for the problem at hand, in practice we often need to convert data between the “wide” and the “long” format. The process feels like kneading dough.

In this section, we will show how to tidy and reshape data using the tidyr package, which is built to simplify the process of creating tidy data. We will go through four fundamental functions:

• gather(): reshape data from wide to long
• spread(): reshape data from long to wide
• separate(): split a column into multiple columns
• unite(): combine multiple columns into one column

Take a baby subset of our exemplary clothes consumers data to illustrate:

sdat <- sim.dat[1:5, 1:6]
sdat

##   age gender income house store_exp online_exp
## 1  57 Female 120963   Yes     529.1      303.5
## 2  63 Female 122008   Yes     478.0      109.5
## 3  59   Male 114202   Yes     490.8      279.2
## 4  60   Male 113616   Yes     347.8      141.7
## 5  51   Male 124253   Yes     379.6      112.2

For the above data sdat, what if we want to reshape the data to have a column indicating the purchasing channel (i.e. from store_exp or online_exp) and a second column with the corresponding expense amount? Assume we want to keep the rest of the columns the same. It is a task to change data from “wide” to “long”.

dat_long <- tidyr::gather(sdat, "Channel", "Expense",
                          store_exp, online_exp)
dat_long

##    age gender income house    Channel Expense
## 1   57 Female 120963   Yes  store_exp   529.1
## 2   63 Female 122008   Yes  store_exp   478.0
## 3   59   Male 114202   Yes  store_exp   490.8
## 4   60   Male 113616   Yes  store_exp   347.8
## 5   51   Male 124253   Yes  store_exp   379.6
## 6   57 Female 120963   Yes online_exp   303.5
## 7   63 Female 122008   Yes online_exp   109.5
## 8   59   Male 114202   Yes online_exp   279.2
## 9   60   Male 113616   Yes online_exp   141.7
## 10  51   Male 124253   Yes online_exp   112.2

The above code gathers two variables (store_exp and online_exp) and collapses them into key-value pairs (Channel and Expense), duplicating all other columns as needed. You can run a regression to study the effect of purchasing channel as follows:

# Here we use all observations from sim.dat
# Don't show result here
msdat <- tidyr::gather(sim.dat[, 1:6], "Channel", "Expense",
                       store_exp, online_exp)
fit <- lm(Expense ~ gender + house + income + Channel + age,
          data = msdat)
summary(fit)

Sometimes we want to reshape the data from “long” to “wide”. For example, you want to compare the online and in-store expense between males and females based on house ownership.

We need to reshape the long data frame dat_long to a wide format by spreading the key-value pairs across multiple columns, and then summarize the wide data frame dat_wide, grouping by house and gender.

dat_wide = tidyr::spread(dat_long, Channel, Expense)
# you can check what dat_wide looks like
dat_wide %>%
    dplyr::group_by(house, gender) %>%
    dplyr::summarise(total_online_exp = sum(online_exp),
                     total_store_exp = sum(store_exp))

## # A tibble: 2 x 4
## # Groups:   house [1]
##   house gender total_online_exp total_store_exp
##
## 1 Yes   Female             413.           1007.
## 2 Yes   Male               533.           1218.

The above code also uses functions in the dplyr package introduced in the previous section. Here we use package::function to make the package name clear; it is not necessary if the package is already loaded.

Another pair of functions that do opposite manipulations are separate() and unite().

sepdat <- dat_long %>%
    separate(Channel, c("Source", "Type"))
sepdat

##    age gender income house Source Type Expense
## 1   57 Female 120963   Yes  store  exp   529.1
## 2   63 Female 122008   Yes  store  exp   478.0
## 3   59   Male 114202   Yes  store  exp   490.8
## 4   60   Male 113616   Yes  store  exp   347.8
## 5   51   Male 124253   Yes  store  exp   379.6
## 6   57 Female 120963   Yes online  exp   303.5
## 7   63 Female 122008   Yes online  exp   109.5
## 8   59   Male 114202   Yes online  exp   279.2
## 9   60   Male 113616   Yes online  exp   141.7
## 10  51   Male 124253   Yes online  exp   112.2

You can see that the function separates the original column “Channel” into two new columns, “Source” and “Type”. You can use sep = to set the string or regular expression used to separate the column; by default it splits at any non-alphanumeric character, which here is the underscore “_”.
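For example, an equivalent call that is explicit about the separator (a sketch of the same operation as above):

# explicitly split the column at the underscore
dat_long %>% separate(Channel, c("Source", "Type"), sep = "_")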

The unite() function does the opposite: it combines two columns. It is the generalization of paste() to a data frame.

sepdat %>%
    unite("Channel", Source, Type, sep = "_")

##    age gender income house    Channel Expense
## 1   57 Female 120963   Yes  store_exp   529.1
## 2   63 Female 122008   Yes  store_exp   478.0
## 3   59   Male 114202   Yes  store_exp   490.8
## 4   60   Male 113616   Yes  store_exp   347.8
## 5   51   Male 124253   Yes  store_exp   379.6
## 6   57 Female 120963   Yes online_exp   303.5
## 7   63 Female 122008   Yes online_exp   109.5
## 8   59   Male 114202   Yes online_exp   279.2
## 9   60   Male 113616   Yes online_exp   141.7
## 10  51   Male 124253   Yes online_exp   112.2

The reshaping manipulations may be the trickiest part. You have to practice a lot to get familiar with those functions. Unfortunately, there is no shortcut.

7 Model Tuning Strategy

When training a machine learning model, there are many decisions to make. For example, when training a random forest, you need to decide the number of trees and the number of variables to use at each node. For the lasso, you need to determine the penalty parameter. There may be standard settings for some of the parameters, but you are unlikely to guess the right values for all of them. Beyond that, making good choices on how you split the data into training and testing sets can make a huge difference in helping you find a high-performance model efficiently. This chapter illustrates the practical aspects of model tuning. We will talk about different types of model error, sources of model error, hyperparameter tuning, how to set up your data, and how to make sure your model implementation is correct. In practice, applying machine learning is a highly iterative process. Load the R packages first:

# install packages from CRAN
p_needed <- c('ggplot2', 'tidyr', 'caret', 'dplyr',
              'lattice', 'proxy')
packages <- rownames(installed.packages())
p_to_install <- p_needed[!(p_needed %in% packages)]
if (length(p_to_install) > 0) {
    install.packages(p_to_install)
}
lapply(p_needed, require, character.only = TRUE)


7.1 Variance-Bias Trade-Off

Assume X is an $n \times p$ observation matrix and y is the response variable; we have:

$$\mathbf{y} = f(\mathbf{X}) + \boldsymbol{\epsilon} \tag{7.1}$$

where $\boldsymbol{\epsilon}$ is the random error with a mean of zero. The function $f(\cdot)$ is our modeling target, which represents the information in the response variable that predictors can explain. The main goal of estimating $f(\cdot)$ is inference or prediction, or sometimes both. In general, there is a trade-off between flexibility and interpretability of the model, so data scientists need to comprehend the delicate balance between these two.

Depending on the modeling purpose, the requirement for interpretability varies. If prediction is the only goal, then as long as the prediction is accurate enough, interpretability is not under consideration. In this case, people can use “black box” models, such as random forest, boosting tree, neural network and so on. These models are very flexible but nearly impossible to explain. Their accuracy is usually higher on the training set, but not necessarily when they predict. It is not surprising, since those models have a huge number of parameters and high flexibility so that they can “memorize” the entire training data. A paper by Chiyuan Zhang et al. in 2017 pointed out that “Deep neural networks (even just two-layer net) easily fit random labels” (Zhang et al., 2017). The traditional forms of regularization, such as weight decay, dropout, and data augmentation, fail to control generalization error. It poses a conceptual challenge to statistical theory and also calls for our attention when we use such black-box models.

There are two kinds of application problems: complete information problems and incomplete information problems. A complete information problem has all the information you need to know the correct response. Take the famous cat recognition, for example: all the information you need to identify a cat is in the picture. In this situation, the algorithm that penetrates the data the most wins. There are other similar problems such as the self-driving car, chess game, facial recognition and speech recognition. But in most data science applications, the information is incomplete. If you want to know whether a customer is going to purchase again or not, it is unlikely that you have a 360-degree view of the customer’s information. You may have their historical purchasing record, discounts and service received. But you don’t know if the customer sees your advertisement, has a friend recommend a competitor’s product, or encounters an unhappy purchasing experience somewhere. There could be a myriad of factors that influence the customer’s purchase decision, while what you have as data is only a small part. To make things worse, in many cases, you don’t even know what you don’t know. Deep learning doesn’t have any advantage in solving those problems. Instead, some parametric models often work better in this situation. You will comprehend this more after learning the different types of model error.

Assume we have $\hat{f}$, which is an estimator of $f$. Then we can further get $\hat{\mathbf{y}} = \hat{f}(\mathbf{X})$. The predicted error is divided into two parts, systematic error and random error:

$$E(\mathbf{y} - \hat{\mathbf{y}})^2 = E[f(\mathbf{X}) + \boldsymbol{\epsilon} - \hat{f}(\mathbf{X})]^2 = \underbrace{E[f(\mathbf{X}) - \hat{f}(\mathbf{X})]^2}_{(1)} + \underbrace{Var(\boldsymbol{\epsilon})}_{(2)} \tag{7.2}$$

It is also called Mean Square Error (MSE), where (1) is the systematic error. It exists because $\hat{f}$ usually does not entirely describe the “systematic relation” between X and y, which refers to the stable relationship that exists across different samples or time. Model improvement can help reduce this kind of error. (2) is the random error, which represents the part of y that cannot be explained by X. A more complex model does not reduce this error. There are three reasons for random error:

1. The current sample is not representative, so the pattern in one sample set does not generalize to a broader scale.
2. The information is incomplete. In other words, you don’t have all the variables needed to explain the response.
3. Measurement error in the variables.

Deep learning has significant success solving problems with complete information and usually low measurement error. As mentioned before, in a task like image recognition, all you need are the pixels in the pictures. So in deep learning applications, increasing the sample size can improve the model performance significantly. But it may not perform well in problems with incomplete information. The biggest problem with the black-box model is that it fits random error, i.e., it over-fits. The notable feature of random error is that it varies over different samples. So one way to determine whether overfitting happens is to reserve part of the data as the test set and then check the performance of the trained model on the test data. Note that overfitting is a general problem from which any model could suffer. However, since black-box models usually have a large number of parameters, they are much more susceptible to over-fitting.

FIGURE 7.1: Types of Model Error

The systematic error $E[f(\mathbf{X}) - \hat{f}(\mathbf{X})]^2$ can be further decomposed as:

$$\begin{array}{rcl}
E\left(f(\mathbf{X}) - E[\hat{f}(\mathbf{X})] + E[\hat{f}(\mathbf{X})] - \hat{f}(\mathbf{X})\right)^2 & = & E\left(E[\hat{f}(\mathbf{X})] - f(\mathbf{X})\right)^2 + E\left(\hat{f}(\mathbf{X}) - E[\hat{f}(\mathbf{X})]\right)^2 \\
 & = & [Bias(\hat{f}(\mathbf{X}))]^2 + Var(\hat{f}(\mathbf{X}))
\end{array} \tag{7.3}$$

The systematic error consists of two parts, $Bias(\hat{f}(\mathbf{X}))$ and $Var(\hat{f}(\mathbf{X}))$. To minimize the systematic error, we need to minimize both. The bias represents the error caused by the model’s approximation of the reality, i.e., the systematic relation, which may be very complex. For example, linear regression assumes a linear relationship between the predictors and the response, but rarely is there a perfect linear relationship in real life. So linear regression is more likely to have a high bias. In general, the more flexible the model is, the higher the variance. However, this does not guarantee that complex models will outperform a much simpler one, such as linear regression. If the real relationship $f$ is linear, then linear regression is unbiased and it is difficult for a more flexible model to compete. A good learning method requires both low variance and low bias. However, it is easy to find a model with extremely low bias but high variance (by fitting a tree) or a method with low variance but high bias (by fitting a straight line). That is why we call it a trade-off.

To explore bias and variance, let’s begin with a simple simulation. We will simulate data with a non-linear relationship and fit different models on it. An intuitive way to show these is to compare the plots of the various models.

The code below simulates one predictor (x) and one response variable (fx). The relationship between x and fx is non-linear. You need to load the multiplot function by running source('http://bit.ly/2KeEIg9'). The function assembles multiple plots on a canvas.

source('http://bit.ly/2KeEIg9')
# randomly simulate some non-linear samples
x = seq(1, 10, 0.01) * pi
e = rnorm(length(x), mean = 0, sd = 0.2)
fx <- sin(x) + e + sqrt(x)
dat = data.frame(x, fx)

Then fit a simple linear regression on these data:

# plot fitting result
library(ggplot2)
ggplot(dat, aes(x, fx)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

FIGURE 7.2: High bias model

Despite a large sample size, the trained linear regression cannot describe the relationship very well. In other words, in this case, the model has a high bias (Fig. 7.2). People also call it underfitting.

Since the estimated parameters will be somewhat different for different samples, there is variance in the estimates. Intuitively, it gives you some sense of how much the estimates would change if we fit the same model with different samples (presumably from the same population). Ideally, the change is trivial. For high variance models, small changes in the training data result in very different estimates. In general, a model with high flexibility also has high variance, such as the CART tree and the initial boosting method. To overcome that problem, the Random Forest and Gradient Boosting Model aim to reduce the variance by summarizing the results obtained from different samples.

Let’s fit the above data using a smoothing method which is highly flexible and can fit the current data tightly:

ggplot(dat, aes(x, fx)) +
    geom_smooth(span = 0.03)

FIGURE 7.3: High variance model

The resulting plot (Fig. 7.3) indicates that the smoothing method fits the data much better and it has a much smaller bias. However, this method has a high variance. If we simulate different subsets of the sample, the resulting curve will change significantly:

# set random seed
set.seed(2016)

# sample part of the data to fit model, sample 1
idx1 = sample(1:length(x), 100)
dat1 = data.frame(x1 = x[idx1], fx1 = fx[idx1])
p1 = ggplot(dat1, aes(x1, fx1)) +
    geom_smooth(span = 0.03) +
    geom_point()

# sample 2
idx2 = sample(1:length(x), 100)
dat2 = data.frame(x2 = x[idx2], fx2 = fx[idx2])
p2 = ggplot(dat2, aes(x2, fx2)) +
    geom_smooth(span = 0.03) +
    geom_point()

# sample 3
idx3 = sample(1:length(x), 100)
dat3 = data.frame(x3 = x[idx3], fx3 = fx[idx3])
p3 = ggplot(dat3, aes(x3, fx3)) +
    geom_smooth(span = 0.03) +
    geom_point()

# sample 4
idx4 = sample(1:length(x), 100)
dat4 = data.frame(x4 = x[idx4], fx4 = fx[idx4])
p4 = ggplot(dat4, aes(x4, fx4)) +
    geom_smooth(span = 0.03) +
    geom_point()

multiplot(p1, p2, p3, p4, cols = 2)


The fitted lines (blue) change over the different samples, which means the model has high variance. People also call it overfitting. Fitting the linear model using the same four subsets, the result barely changes:

p1 = ggplot(dat1, aes(x1, fx1)) +
    geom_smooth(method = "lm", se = FALSE) +
    geom_point()
p2 = ggplot(dat2, aes(x2, fx2)) +
    geom_smooth(method = "lm", se = FALSE) +
    geom_point()
p3 = ggplot(dat3, aes(x3, fx3)) +
    geom_smooth(method = "lm", se = FALSE) +
    geom_point()
p4 = ggplot(dat4, aes(x4, fx4)) +
    geom_smooth(method = "lm", se = FALSE) +
    geom_point()
multiplot(p1, p2, p3, p4, cols = 2)


In general, the variance ($Var(\hat{f}(\mathbf{X}))$) increases and the bias ($Bias(\hat{f}(\mathbf{X}))$) decreases as the model flexibility increases. Variance and bias together determine the systematic error. As we increase the flexibility of the model, at first the rate at which $Bias(\hat{f}(\mathbf{X}))$ decreases is faster than that of $Var(\hat{f}(\mathbf{X}))$, so the MSE decreases. However, beyond some point, higher flexibility has little effect on $Bias(\hat{f}(\mathbf{X}))$ but $Var(\hat{f}(\mathbf{X}))$ increases significantly, so the MSE increases.

7.2 Data Splitting and Resampling

Highly adaptable models can model complex relationships. However, they tend to overfit, which leads to poor prediction by learning too much from the data: the model becomes susceptible to the specific sample used to fit it. When future data is not exactly like the past data, the model prediction may have big mistakes. A simple model like ordinary linear regression tends instead to underfit, which leads to bad prediction by learning too little from the data: it systematically over-predicts or under-predicts the data regardless of how well future data resemble past data. Without evaluating models, the modeler will not know about the problem before the future samples arrive. Data splitting and resampling are fundamental techniques to build sound models for prediction.

7.2.1 Data Splitting

Data splitting is to put part of the data aside as a test set (or hold-out, out-of-bag samples) and use the rest for model training. Training samples are also called in-sample. Model performance metrics evaluated on the in-sample data are retrodictive, not predictive.

Traditional business intelligence usually handles data description: answering simple questions by querying and summarizing the data, such as:

• What are the monthly sales of a product in 2020?
• What is the number of visits to our site in the past month?
• What is the sales difference in 2021 for two different product designs?

There is no need to go through the tedious process of splitting the data, tuning and testing a model to answer questions of this kind. Instead, people usually use data as complete as possible and then sum or average the parts of interest.

Many models have parameters which cannot be directly estimated from the data, such as $\lambda$ in the lasso (the penalty parameter) and the number of trees in the random forest. This type of model parameter is called a tuning parameter, and there is no analytical formula available to calculate its optimal value. Tuning parameters often control the complexity of the model. A poor choice can result in over-fitting or under-fitting. A standard approach to estimate tuning parameters is through cross-validation, which is a data resampling approach.

To get a reasonable precision of the performance based on a single test set, the size of the test set may need to be large. So a conventional approach is to use a subset of samples to fit the model and use the rest to evaluate model performance. This process repeats multiple times to get a performance profile. In that sense, resampling is based on splitting. The general steps are:

Algorithm 1 General resampling steps
 1: Define a set of candidate values for tuning parameter(s)
 2: for each candidate value in the set do
 3:     Resample data
 4:     Fit model
 5:     Predict hold-out
 6:     Calculate performance
 7: end for
 8: Aggregate the results
 9: Determine the final tuning parameter
10: Refit the model with the entire data set
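As a concrete illustration of these steps, caret::train() automates them. The sketch below is not from the book: it tunes the penalty of a lasso-type model for income with 10-fold cross-validation, assuming the glmnet package is installed; the formula and the candidate grid are made up for illustration.

# candidate values for the tuning parameter (step 1)
grid <- expand.grid(alpha = 1, lambda = seq(0.01, 1, length.out = 20))
# resampling scheme used to evaluate each candidate (steps 2-7)
ctrl <- caret::trainControl(method = "cv", number = 10)
# aggregate the results, pick the best lambda and refit on all data (steps 8-10)
lasso_fit <- caret::train(income ~ store_exp + online_exp,
                          data = na.omit(sim.dat),
                          method = "glmnet",
                          trControl = ctrl,
                          tuneGrid = grid)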

The algorithm above is an outline of the general procedure to tune parameters. Now let’s focus on the critical part of the process: data splitting. Ideally, we should evaluate the model using samples that were not used to build or fine-tune it, so that they provide an unbiased sense of model effectiveness. When the sample size is large, it is good practice to set aside part of the samples to evaluate the final model. People use “training” data to indicate the samples used to fit or fine-tune the model, and the “test” or “validation” data set is used to validate performance.

FIGURE 7.4: Parameter Tuning Process

The first decision to make for data splitting is the proportion of data in the test set. There are two factors to consider here: (1) sample size; (2) computation intensity. If the sample size is large enough, which is the most common situation according to my experience, you can try to use 20%, 30% and 40% of the data as the test set and see which one works best. If the model is computationally intense, then you may consider starting from a smaller sample of data to train the model and hence have a higher portion of data in the test set. Depending on how it performs, you may need to increase the training set. If the sample size is small, you can use cross-validation or the bootstrap, which is the topic of the next section.

The next decision is which samples are in the test set. There is a desire to make the training and test sets as similar as possible. A simple way is to split the data by random sampling, which, however, does not control for any of the data attributes, such as the percentage of retained customers in the data. So it is possible that the distribution of outcomes is substantially different between the training and test sets. There are three main ways to split the data that account for the similarity of the resulting data sets. We will describe the three approaches using the clothing company customer data as examples.

(1) Split data according to the outcome variable

Assume the outcome variable is the customer segment (column segment) and we decide to use 80% of the data for training and 20% for testing. The goal is to make the proportions of the categories in the two sets as similar as possible. The createDataPartition() function in caret will return a balanced splitting based on the assigned variable.

# load data
sim.dat <- read.csv("http://bit.ly/2P5gTw4")
library(caret)
# set random seed to make sure reproducibility
set.seed(3456)
trainIndex <- createDataPartition(sim.dat$segment,
                                  p = 0.8,
                                  list = FALSE,
                                  times = 1)
head(trainIndex)

##      Resample1
## [1,]         1
## [2,]         2
## [3,]         3
## [4,]         4
## [5,]         6
## [6,]         7

The list = FALSE in the call to createDataPartition is to return a matrix of indices instead of a list. The times = 1 tells R how many times you want to split the data. Here we only do it once, but you can repeat the splitting multiple times. In that case, the function will return multiple vectors indicating the rows used for training. You can set times = 2 and rerun the above code to see the result. Then we can use the returned indicator vector trainIndex to get the training and test sets:

# get training set
datTrain <- sim.dat[trainIndex, ]
# get test set
datTest <- sim.dat[-trainIndex, ]

According to the setting, there are 800 samples in the training set and 200 in the test set. Let’s check the distribution of the two sets:

datTrain %>%
    dplyr::group_by(segment) %>%
    dplyr::summarise(count = n(),
                     percentage = round(length(segment)/nrow(datTrain), 2))

## # A tibble: 4 x 3
##   segment     count percentage
##
## 1 Conspicuous   160       0.2
## 2 Price         200       0.25
## 3 Quality       160       0.2
## 4 Style         280       0.35

datTest %>%
    dplyr::group_by(segment) %>%
    dplyr::summarise(count = n(),
                     percentage = round(length(segment)/nrow(datTrain), 2))

## # A tibble: 4 x 3
##   segment     count percentage
##
## 1 Conspicuous    40       0.05
## 2 Price          50       0.06
## 3 Quality        40       0.05
## 4 Style          70       0.09

The class proportions are the same in the two sets (the test-set percentages above are computed against nrow(datTrain); relative to the 200 test samples, 40/200 = 0.2, 50/200 = 0.25 and so on, matching the training set). In practice, it is possible that the distributions are not exactly identical, but they should be close.

(2) Divide data according to predictors

An alternative way is to split the data based on the predictors. The goal is to get a diverse subset from a dataset so that the sample is representative. In other words, we need an algorithm to identify the $n$ most diverse samples from a dataset of size $N$. However, the task is generally infeasible for non-trivial values of $n$ and $N$ (Willett, 2004), and hence practicable approaches to dissimilarity-based selection involve approximate methods that are sub-optimal. A major class of algorithms splits the data using maximum dissimilarity sampling. The process starts from:

• Initialize a single sample as the starting test set
• Calculate the dissimilarity between this initial sample and each remaining sample in the dataset
• Add the most dissimilar unallocated sample to the test set

To move forward, we need to define the dissimilarity between groups. Each definition results in a different version of the algorithm and hence a different subset. It is the same problem as in hierarchical clustering, where you need to define a way to measure the distance between clusters. The possible approaches are to use the minimum, the maximum, the sum of all distances, the average of all distances, etc. Unfortunately, there is no single best choice, and you may have to try multiple methods and check the resulting sample sets. R users can implement the algorithm using the maxDissim() function from the caret package. The obj argument sets the definition of dissimilarity. Refer to the help documentation for more details (?maxDissim).

Let’s use two variables (age and income) from the customer data as an example to illustrate how it works in R and compare maximum dissimilarity sampling with random sampling.

library(lattice)
# select variables
testing <- subset(sim.dat, select = c("age", "income"))

Randomly select 5 samples as the initial subset (start); the rest will be in samplePool:

set.seed(5)
# select 5 random samples
startSet <- sample(1:dim(testing)[1], 5)
start <- testing[startSet, ]
# save the rest in data frame 'samplePool'
samplePool <- testing[-startSet, ]

Use maxDissim() to select another 5 samples from samplePool that are as different as possible from the initial set start:

selectId <- maxDissim(start, samplePool, obj = minDiss, n = 5)
minDissSet <- samplePool[selectId, ]

The obj = minDiss in the above code tells R to use minimum dissimilarity to define the distance between groups. Next, randomly select 5 samples from samplePool into the data frame RandomSet:

selectId <- sample(1:dim(samplePool)[1], 5)
RandomSet <- samplePool[selectId, ]

Plot the resulting sets to compare the different sampling methods:

start$group <- rep("Initial Set", nrow(start))
minDissSet$group <- rep("Maximum Dissimilarity Sampling",
                        nrow(minDissSet))
RandomSet$group <- rep("Random Sampling", nrow(RandomSet))
xyplot(age ~ income,
       data = rbind(start, minDissSet, RandomSet),
       grid = TRUE,
       group = group,
       auto.key = TRUE)

FIGURE 7.5: Compare Maximum Dissimilarity Sampling with Random Sampling

The points from maximum dissimilarity sampling are far away from the initial samples (Fig. 7.5), while the random samples are much closer to the initial ones. Why do we need a diverse subset? Because we hope the test set to be representative. If all test set samples are from respondents younger than 30, the model performance on the test set has a high risk of failing to tell you how the model will perform on the more general population.

(3) Divide data according to time

For time series data, random sampling is usually not the best way. There is an approach to divide data according to the time series. Since time series is beyond the scope of this book, there is not much discussion here; for more detail of this method, see (Hyndman and Athanasopoulos, 2013). We will use simulated first-order autoregressive [AR(1)] time-series data with 100 observations to show how to implement it using the function createTimeSlices() in the caret package.

# simulate AR(1) time series samples
timedata = arima.sim(list(order = c(1, 0, 0), ar = -.9), n = 100)
# plot time series
plot(timedata, main = (expression(AR(1)~~~phi == -.9)))

FIGURE 7.6: Divide data according to time

Fig. 7.6 shows the 100 simulated time series observations. The goal is to make sure both the training and test sets cover the whole period.

timeSlices <- createTimeSlices(1:length(timedata),
                               initialWindow = 36,
                               horizon = 12,
                               fixedWindow = T)
str(timeSlices, max.level = 1)

## List of 2
##  $ train:List of 53
##  $ test :List of 53

There are three arguments in the above createTimeSlices().

• initialWindow: the initial number of consecutive values in each training set sample
• horizon: the number of consecutive values in each test set sample
• fixedWindow: if FALSE, all training samples start at 1

The function returns two lists, one for the training set, the other for the test set. Let’s look at the first training sample:

# get result for the 1st training set
trainSlices <- timeSlices[[1]]
# get result for the 1st test set
testSlices <- timeSlices[[2]]
# check the index for the 1st training and test set
trainSlices[[1]]

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
## [18] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## [35] 35 36

testSlices[[1]]

## [1] 37 38 39 40 41 42 43 44 45 46 47 48

The first training set consists of samples 1-36 in the dataset (initialWindow = 36). Then samples 37-48 are in the first test set (horizon = 12). Type head(trainSlices) or head(testSlices) to check the later samples. If you are not clear about the argument fixedWindow, try to change the setting to F and check the change in trainSlices and testSlices.

Understanding and implementing data splitting is not difficult, but there are two things to note:

1. The randomness in the splitting process will lead to uncertainty in performance measurement.
2. When the dataset is small, it can be too expensive to leave out a test set. In this situation, if collecting more data is just not possible, the best shot is to use leave-one-out cross-validation, which is described in the next section.

7.2.2 Resampling

You can consider resampling as repeated splitting. The basic idea is: use part of the data to fit the model and then use the rest of the data to calculate model performance. Repeat the process multiple times and aggregate the results. The differences among resampling techniques usually center around the ways to choose subsamples. There are two main reasons that we may need resampling:

1. To estimate tuning parameters through resampling. Some examples of models with such parameters are the Support Vector Machine (SVM), models including a penalty (LASSO), and the random forest.
2. For models without tuning parameters, such as ordinary linear regression and partial least square regression, the model fitting doesn’t require resampling, but you can study the model stability through resampling.

We will introduce the three most common resampling techniques: k-fold cross-validation, repeated training/test splitting, and the bootstrap.

7.2.2.1 k-fold cross-validation

k-fold cross-validation partitions the original sample into $k$ equal size subsamples (folds). Use one of the $k$ folds to validate the model and the remaining $k - 1$ folds to train the model. Then repeat the process $k$ times, with each of the $k$ folds used once as the test set. Aggregate the results into a performance profile.

Denote by $\hat{f}^{-\kappa}(X)$ the fitted function, computed with the $\kappa^{th}$ fold removed, and by $x_i^{\kappa}$ the predictors for samples in the left-out fold. The process of k-fold cross-validation is as follows:

Algorithm 2 k-fold cross-validation
1: Partition the original sample into $k$ equal size folds
2: for $\kappa = 1 \dots k$ do
3:     Use data other than fold $\kappa$ to train the model $\hat{f}^{-\kappa}(X)$
4:     Apply $\hat{f}^{-\kappa}(X)$ to predict fold $\kappa$ to get $\hat{f}^{-\kappa}(x_i^{\kappa})$
5: end for
6: Aggregate the results: $\widehat{Error} = \frac{1}{N} \sum_{\kappa=1}^{k} \sum_{x_i^{\kappa}} L\left(y_i^{\kappa}, \hat{f}^{-\kappa}(x_i^{\kappa})\right)$

It is a standard way to find the value of the tuning parameter that gives you the best performance. It is also a way to study the variability of model performance. The following figure represents a 5-fold cross-validation example.

FIGURE 7.7: 5-fold cross-validation

A special case of k-fold cross-validation is Leave One Out Cross Validation (LOOCV), where $k$ equals the number of observations. When the sample size is small, it is desirable to use as much data as possible to train the model. Most of the functions have a default setting of $k = 10$. The choice is usually 5-10 in practice, but there is no standard rule. The more folds to use, the more samples are used to fit the model, and then the performance estimate is closer to the theoretical performance. Meanwhile, the variance of the performance is larger since the samples used to fit the model in different iterations are more similar. However, LOOCV has a high computational cost since the number of iterations is the same as the sample size and each model fit uses a subset that is nearly the same size as the training set. On the other hand, when k is small (such as 2 or 3), the computation is more efficient, but the bias will increase. When the sample size is large, the impact of $k$ becomes marginal.

Chapter 7 of (Hastie T, 2008) presents a more in-depth and more detailed discussion about the bias-variance trade-off in k-fold cross-validation.
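As a side note, if you tune models with caret::train() (an assumption here, not a step from this section), the resampling scheme is declared through trainControl(); a minimal sketch:

# 10-fold cross-validation
ctrl_cv10 <- caret::trainControl(method = "cv", number = 10)
# leave-one-out cross-validation
ctrl_loocv <- caret::trainControl(method = "LOOCV")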

You can implement k-fold cross-validation using createFolds() in caret:

library(caret)
class <- sim.dat$segment
# create k-folds
set.seed(1)
cv <- createFolds(class, k = 10, returnTrain = T)
str(cv)

## List of 10
##  $ Fold01: int [1:900] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Fold02: int [1:900] 1 2 3 4 5 6 7 9 10 11 ...
##  $ Fold03: int [1:900] 1 2 3 4 5 6 7 8 10 11 ...
##  $ Fold04: int [1:900] 1 2 3 4 5 6 7 8 9 11 ...
##  $ Fold05: int [1:900] 1 3 4 6 7 8 9 10 11 12 ...
##  $ Fold06: int [1:900] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Fold07: int [1:900] 2 3 4 5 6 7 8 9 10 11 ...
##  $ Fold08: int [1:900] 1 2 3 4 5 8 9 10 11 12 ...
##  $ Fold09: int [1:900] 1 2 4 5 6 7 8 9 10 11 ...
##  $ Fold10: int [1:900] 1 2 3 5 6 7 8 9 10 11 ...

The above code creates ten folds (k = 10) according to the customer segments (we set class to be the categorical variable segment). The function returns a list of 10 elements with the indices of the rows in each training set.
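To make the algorithm concrete, the sketch below (not from the book) loops over the folds to estimate the cross-validated RMSE of a simple linear model for income; the model formula is an assumption for illustration:

cv_rmse <- sapply(cv, function(train_idx) {
    train <- sim.dat[train_idx, ]
    test <- sim.dat[-train_idx, ]
    fold_fit <- lm(income ~ store_exp + online_exp, data = train)
    # RMSE on the left-out fold, ignoring rows with missing income
    sqrt(mean((test$income - predict(fold_fit, test))^2, na.rm = TRUE))
})
mean(cv_rmse)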

7.2.2.2 Repeated Training/Test Splits

In fact, this method is nothing but repeating the training/test set division on the original data: fit the model with the training set and evaluate the model with the test set. Unlike k-fold cross-validation, the test sets generated by this procedure may have duplicate samples; a sample usually shows up in more than one test set. There is no standard rule for the split ratio and the number of repetitions. The most common choice in practice is to use 75% to 80% of the total sample for training; the remaining samples are for validation. The more samples in the training set, the less biased the model performance estimate is. Increasing the repetitions can reduce the uncertainty in the performance estimates, of course at the cost of computational time when the model is complex. The number of repetitions is also related to the sample size of the test set. If the size is small, the performance estimate is more volatile. In this case, the number of repetitions needs to be higher to deal with the uncertainty of the evaluation results.

We can use the same function (createDataPartition()) as before. If you look back, you will see times = 1. The only thing to change is to set it to the number of repetitions.

trainIndex <- createDataPartition(sim.dat$segment,
                                  p = .8,
                                  list = FALSE,
                                  times = 5)
dplyr::glimpse(trainIndex)

##  int [1:800, 1:5] 1 3 4 5 6 7 8 9 10 11 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:5] "Resample1" "Resample2" "Resample3" "Resample4" ...

Once you know how to split the data, the repetition comes naturally.
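For example, a sketch (not from the book) that evaluates a simple income model on each of the five repeated splits generated above:

rmse_by_split <- apply(trainIndex, 2, function(idx) {
    train <- sim.dat[idx, ]
    test <- sim.dat[-idx, ]
    split_fit <- lm(income ~ store_exp + online_exp, data = train)
    # RMSE on the held-out samples, ignoring rows with missing income
    sqrt(mean((test$income - predict(split_fit, test))^2, na.rm = TRUE))
})
rmse_by_split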

7.2.2.3 Bootstrap Methods

The bootstrap is a powerful statistical tool (a little magic too). It can be used to analyze the uncertainty of parameter estimates quantitatively (Efron and Tibshirani, 1986), for example, to estimate the standard deviation of linear regression coefficients. The power of this method is that the concept is so simple that it can be easily applied to any model as long as the computation allows, whereas for some models you can hardly obtain the standard deviation using traditional statistical inference.

A bootstrap sample is a random sample of the data taken with replacement. Since it is with replacement, an observation can be selected multiple times, and the bootstrap sample size is the same as the original data. So for every bootstrap set, there are some left-out samples, which are also called “out-of-bag samples.” The out-of-bag samples are used to evaluate the model. Efron points out that under normal circumstances (Efron, 1983), the bootstrap estimates the error rate of the model with more certainty. The probability of an observation $i$ being in bootstrap sample $B$ is:

$$Pr(i \in B) = 1 - \left(1 - \frac{1}{N}\right)^N \approx 1 - e^{-1} = 0.632$$

On average, 63.2% of the observations appear at least once in a bootstrap sample, so the estimation bias is similar to that of 2-fold cross-validation. As mentioned earlier, the smaller the number of folds, the larger the bias. Increasing the sample size will ease the problem. In general, the bootstrap has larger bias and smaller uncertainty than cross-validation. Efron came up with the following “.632 estimator” to alleviate this bias:

$$(0.632 \times \text{original bootstrap estimate}) + (0.368 \times \text{apparent error rate})$$

The apparent error rate is the error rate when the data is used twice, both to fit the model and to check its accuracy, and it is apparently over-optimistic. The modified bootstrap estimate reduces the bias but can be unstable with small sample sizes. This estimate can also be unduly optimistic when the model severely over-fits, since the apparent error rate will be close to zero. Efron and Tibshirani (Efron and Tibshirani, 1997) discuss another technique, called the “632+ method,” for adjusting the bootstrap estimates.
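As a quick numerical check of the 0.632 figure above (a sketch, not from the book), you can draw one bootstrap sample of the customer data and look at its out-of-bag rows:

set.seed(1)
N <- nrow(sim.dat)
boot_idx <- sample(1:N, N, replace = TRUE)   # one bootstrap sample
oob_idx <- setdiff(1:N, boot_idx)            # out-of-bag rows
length(unique(boot_idx)) / N                 # roughly 0.632
length(oob_idx) / N                          # roughly 0.368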

8 Measuring Performance

To compare different models, we need a way to measure model performance. There are various metrics to use. To better understand the strengths and weaknesses of a model, you need to look at it through multiple metrics. In this chapter, we will introduce some of the most common performance measurement metrics. Load the R packages first:

# install packages from CRAN
p_needed <- c('caret', 'dplyr', 'randomForest',
              'readr', 'car', 'pROC', 'fmsb')
packages <- rownames(installed.packages())
p_to_install <- p_needed[!(p_needed %in% packages)]
if (length(p_to_install) > 0) {
    install.packages(p_to_install)
}
lapply(p_needed, require, character.only = TRUE)

8.1 Regression Model Performance

MSE and RMSE

Mean Squared Error (MSE) measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual values. The Root Mean Squared Error (RMSE) is the square root of the MSE.

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$

Both are common measurements of regression model performance. Let's use the previous income prediction as an example and fit a simple linear model:

sim.dat <- read.csv("http://bit.ly/2P5gTw4")
fit <- lm(formula = income ~ store_exp + online_exp +
              store_trans + online_trans,
          data = sim.dat)
summary(fit)

##
## Call:
## lm(formula = income ~ store_exp + online_exp + store_trans +
##     online_trans, data = sim.dat)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -128768  -15804     441   13375  150945
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  85711.680   3651.599   23.47  < 2e-16 ***
## store_exp        3.198      0.475    6.73  3.3e-11 ***
## online_exp       8.995      0.894   10.06  < 2e-16 ***
## store_trans   4631.751    436.478   10.61  < 2e-16 ***
## online_trans -1451.162    178.835   -8.11  1.8e-15 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##
## Residual standard error: 31500 on 811 degrees of freedom
##   (184 observations deleted due to missingness)
## Multiple R-squared:  0.602, Adjusted R-squared:  0.6
## F-statistic: 306 on 4 and 811 DF,  p-value: <2e-16
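As a quick cross-check (a sketch using the fitted object fit from above, not part of the original example; caret's RMSE() and R2() offer similar helpers), the two metrics can be computed directly from the predictions. The value differs slightly from the residual standard error because it divides by n rather than the residual degrees of freedom.

pred <- predict(fit)                 # fitted values (rows with NA removed)
obs <- model.frame(fit)$income       # observed response used in the fit
sqrt(mean((obs - pred)^2))           # RMSE
cor(obs, pred)^2                     # R-squared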

The fitted result fit shows the RMSE is about $3.15 \times 10^4$ (shown at the bottom of the output after Residual standard error:). Another common performance measure for regression models is R-Squared, often denoted as $R^2$. It is the square of the correlation between the fitted values and the observed values. It is often explained as the percentage of the information in the data that the model can explain. The above model returns an R-squared of about 0.6, which indicates the model can explain 60% of the variance in the variable income. While $R^2$ is easy to explain, it is not a direct measure of model accuracy but of correlation. Here the $R^2$ value is not low, but the RMSE is about $3.15 \times 10^4$, which means the average difference between the model fit and the observation is about $3.15 \times 10^4$. That can be a big discrepancy from an application point of view. When the response variable has a large scale and high variance, a high $R^2$ doesn't mean the model has enough accuracy. It is also important to remember that $R^2$ depends on the variation of the outcome variable: if the data has a response variable with a higher variance, a model based on it tends to have a higher $R^2$. We used $R^2$ to show the impact of the error from independent and response variables in Chapter 7, where we didn't consider the impact of the number of parameters (because the number of parameters is very small compared to the number of observations). However, $R^2$ increases as the number of parameters increases, so people usually use the Adjusted R-squared, which is designed to mitigate this issue. The original $R^2$ is defined as:

$$R^2 = 1 - \frac{RSS}{TSS}$$

where $RSS = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ and $TSS = \sum_{i=1}^{n}(y_i - \bar{y})^2$. Since RSS always decreases as the number of parameters increases, $R^2$ increases as a result. For a model with $p$ parameters, the adjusted $R^2$ is defined as:

$$Adjusted\ R^2 = 1 - \frac{RSS/(n - p - 1)}{TSS/(n - 1)}$$

Maximizing the adjusted $R^2$ is identical to minimizing $RSS/(n-p-1)$. Since the number of parameters $p$ appears in the equation, $RSS/(n-p-1)$ can increase or decrease as $p$ increases. The idea behind this is that the adjusted R-squared increases if a new variable improves the model more than would be expected by chance, and decreases when a predictor improves the model by less than expected by chance. While values are usually positive, they can be negative as well.

Another measurement is $C_p$. For a least squares model with $p$ parameters:

$$C_p = \frac{1}{n}(RSS + 2p\hat{\sigma}^2)$$

where $\hat{\sigma}^2$ is an estimate of the variance of the random error $\epsilon$. $C_p$ adds the penalty $2p\hat{\sigma}^2$ to the training set $RSS$. The goal is to adjust the over-optimistic measurement based on training data. As the number of parameters increases, the penalty increases, which counteracts the decrease of $RSS$ due to the larger number of parameters. We choose the model with a smaller $C_p$. Both AIC and BIC are based on the maximum likelihood. In linear regression, the maximum likelihood estimate is the least squares estimate. The definitions of the two are:

$$AIC = n + n\log(2\pi) + n\log(RSS/n) + 2(p + 1)$$

$$BIC = n + n\log(2\pi) + n\log(RSS/n) + \log(n)(p + 1)$$

The R functions AIC() and BIC() calculate the AIC and BIC values according to the above equations. Many textbooks ignore the constant term $n + n\log(2\pi)$ and use $p$ instead of $p + 1$. Those slightly different versions give the same results since we are only interested in the relative value. Compared to AIC, BIC puts a heavier penalty on the number of parameters.
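For example (a short sketch using the fitted object fit from the income model above), these quantities are reported or computed directly in base R:

summary(fit)$adj.r.squared    # adjusted R-squared from the model summary
AIC(fit)                      # Akaike information criterion
BIC(fit)                      # Bayesian information criterion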

8.2 Classification Model Performance

This section focuses on performance measurement for models with a categorical response. The metrics in the previous section are for models with a continuous response and are not appropriate in the context of classification. Most classification problems are dichotomous, such as an outbreak of disease, spam email, etc. There are also cases with more than two categories, such as the segments in the clothing company data. We use the swine disease data to illustrate different metrics. Let's train a random forest model as an example. We will discuss the model in Chapter 11.

disease_dat <- read.csv("http://bit.ly/2KXb1Qi")
# you can check the data using glimpse()
# glimpse(disease_dat)

Separate the data into training and testing sets. We fit the model using the training data (xTrain and yTrain) and apply the trained model to the testing data (xTest and yTest) to evaluate model performance. Here we use 80% of the sample for training and the remaining 20% for testing.

set.seed(100)
# separate the data into training and testing
trainIndex <- createDataPartition(disease_dat$y, p = 0.8,
                                  list = F, times = 1)
xTrain <- disease_dat[trainIndex, ] %>% dplyr::select(-y)
xTest <- disease_dat[-trainIndex, ] %>% dplyr::select(-y)
# the response variable needs to be a factor

yTrain <- disease_dat$y[trainIndex] %>% as.factor()
yTest <- disease_dat$y[-trainIndex] %>% as.factor()

Train a random forest model:

train_rf <- randomForest(yTrain ~ .,
                         data = xTrain,
                         mtry = trunc(sqrt(ncol(xTrain) - 1)),
                         ntree = 1000,
                         importance = T)

Apply the trained random forest model to the testing data to get two types of predictions:

• probability (a value between 0 and 1)

yhatprob <- predict(train_rf, xTest, "prob")
set.seed(100)
car::some(yhatprob)

##         0     1
## 47  0.831 0.169
## 101 0.177 0.823
## 196 0.543 0.457
## 258 0.858 0.142
## 274 0.534 0.466
## 369 0.827 0.173
## 389 0.852 0.148
## 416 0.183 0.817
## 440 0.523 0.477
## 642 0.836 0.164

• category prediction (0 or 1)

yhat <- predict(train_rf, xTest)
car::some(yhat)

## 146 232 269 302 500 520 521 575 738 781
##   0   0   1   0   0   0   1   0   0   0
## Levels: 0 1

We will use the above two types of predictions to show different performance metrics.

8.2.1 Confusion Matrix

A confusion matrix is a counting table that describes the performance of a classification model. For the true response yTest and prediction yhat, the confusion matrix is:

yhat = as.factor(yhat) %>% relevel("1")
yTest = as.factor(yTest) %>% relevel("1")
table(yhat, yTest)

##     yTest
## yhat  1  0
##    1 56  1
##    0 15 88

The top-left and bottom-right cells are the numbers of correctly classified samples. The top-right and bottom-left cells are the numbers of wrongly classified samples. A general confusion matrix for a binary classifier is as follows:

              Predicted Yes    Predicted No
Actual Yes    TP               FN
Actual No     FP               TN

where TP is true positive, FP is false positive, TN is true negative, and FN is false negative. The cells along the diagonal from top-left to bottom-right contain the counts of correctly classified samples. The cells along the other diagonal contain the counts of wrongly classified samples. The most straightforward performance measure is the total accuracy, which is the percentage of correctly classified samples:

$$Total\ accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

You can also calculate the total accuracy when there are more than two categories. This statistic is straightforward but has some disadvantages. First, it doesn't differentiate between error types. In a real application, different types of error may have different impacts. For example, it is much worse to tag an important email as spam and miss it than to fail to filter out a spam email. Provost et al. (Provost F, 1998) discussed in detail the problem of using total accuracy on different classifiers. There are other metrics based on the confusion matrix that measure different types of error.

Precision measures how accurate positive predictions are (i.e., among those emails predicted as spam, what percentage are actually spam?):

$$precision = \frac{TP}{TP + FP}$$

Sensitivity measures the coverage of actual positive samples (i.e., among those spam emails, what percentage are predicted as spam?):

$$Sensitivity = \frac{TP}{TP + FN}$$

Specificity measures the coverage of actual negative samples (i.e., among those non-spam emails, what percentage pass the filter?):

$$Specificity = \frac{TN}{TN + FP}$$

Since wrongly tagging an important email as spam has a bigger impact, in the spam email case we want to make sure the model specificity is high enough. Second, total accuracy doesn't reflect the natural frequencies of each class. For example, the percentage of fraud cases in insurance may be very low, like 0.1%. A model can achieve nearly perfect accuracy (99.9%) by predicting all samples to be negative. The percentage of the largest class in the training set is also called the no-information rate. In this example, the no-information rate is 99.9%, and you need a model that at least beats this rate.
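Rather than computing these quantities by hand, caret's confusionMatrix() reports accuracy, the no-information rate, sensitivity, specificity, and more in one call (a sketch using the yhat and yTest vectors from above, with "1" as the positive class):

caret::confusionMatrix(yhat, yTest, positive = "1")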

8.2.2 Kappa Statistic

Another metric is the Kappa statistic. It measures the agreement between the observed and predicted classes and was originally proposed by Cohen (J, 1960). Kappa takes into account the accuracy generated simply by chance. It is defined as:

$$Kappa = \frac{P_0 - P_e}{1 - P_e}$$

Let $n = TP + TN + FP + FN$ be the total number of samples, $P_0 = \frac{TP + TN}{n}$ the observed accuracy, and $P_e = \frac{(TP + FP)(TP + FN) + (FN + TN)(FP + TN)}{n^2}$ the expected accuracy based on the marginal totals of the confusion matrix. Kappa can take on a value from -1 to 1. The higher the value, the higher the agreement. A value of 0 means there is no agreement between the observed and predicted classes, while a value of 1 indicates perfect agreement. A negative value indicates that the prediction is in the opposite direction of the observed value. The following table may help you "visualize" the interpretation of kappa (Landis JR, 1977):

Kappa        Agreement
< 0          Less than chance agreement
0.01–0.20    Slight agreement
0.21–0.40    Fair agreement
0.41–0.60    Moderate agreement
0.61–0.80    Substantial agreement
0.81–0.99    Almost perfect agreement

In general, a value between 0.3 and 0.5 indicates reasonable agreement. If a model has a high accuracy of 90% while the expected accuracy is also high, say 85%, the Kappa statistic is $\frac{0.90 - 0.85}{1 - 0.85} = \frac{1}{3}$, which means the prediction and the observation have only fair agreement. You can also calculate Kappa when the number of categories is larger than 2. The package fmsb has a function Kappa.test() to calculate Cohen's Kappa statistic. The function also returns the hypothesis test result and a confidence interval. Using the above observation vector yTest and prediction vector yhat as an example, you can calculate the statistic:

# install.packages("fmsb")
kt <- fmsb::Kappa.test(table(yhat, yTest))
kt$Result

##
## Estimate Cohen's kappa statistics and test the
## null hypothesis that the extent of agreement is
## same as random (kappa=0)
##
## data:  table(yhat, yTest)
## Z = 9.7, p-value <2e-16
## 95 percent confidence interval:
##  0.6972 0.8894
## sample estimates:
## [1] 0.7933

The output of the above function also contains an object named Judgement:

kt$Judgement

## [1] "Substantial agreement"
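As a cross-check (a sketch, not part of the original example), the same Kappa value can be computed by hand from the confusion matrix using the formula above:

cm <- table(yhat, yTest)                     # rows: predicted, columns: observed
n <- sum(cm)
p0 <- sum(diag(cm)) / n                      # observed accuracy
pe <- sum(rowSums(cm) * colSums(cm)) / n^2   # expected accuracy by chance
(p0 - pe) / (1 - pe)                         # Kappa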

8.2.3 ROC

The Receiver Operating Characteristic (ROC) curve uses the predicted class probabilities and determines an effective threshold such that values above the threshold are indicative of a specific event. We have shown the definitions of sensitivity and specificity above: sensitivity is the true positive rate and specificity is the true negative rate, so "1 - specificity" is the false positive rate. ROC is a graph of pairs of true positive rate (sensitivity) and false positive rate (1 - specificity) values that result as the test's cutoff value is varied. The Area Under the Curve (AUC) is a common measure for the two-class problem. There is usually a trade-off between sensitivity and specificity. If the threshold is set lower, then more samples are predicted as positive and hence the sensitivity is higher. Let's look at the predicted probability yhatprob in the swine disease example. The predicted probability object yhatprob has two columns: one is the predicted probability that a farm will have an outbreak, the other is the probability that a farm will NOT have an outbreak, so the two add up to 1. We use the probability of outbreak (the 2nd column) for further illustration. You can use the roc() function to get an ROC object (rocCurve) and then apply different functions to that object to get the needed plot or ROC statistics. For example, the following code produces the ROC curve:

rocCurve <- pROC::roc(response = yTest,
                      predictor = yhatprob[, 2])

## Setting levels: control = 1, case = 0

## Setting direction: controls > cases

plot(1 - rocCurve$specificities, rocCurve$sensitivities,
     type = 'l',
     xlab = '1 - Specificities',
     ylab = 'Sensitivities')


The first argument of roc() is response, the observation vector. The second argument, predictor, is the continuous prediction (probability or link function value). The x-axis of the ROC curve is "1 - specificity" and the y-axis is "sensitivity". The ROC curve starts from (0, 0) and ends at (1, 1). A perfect model that correctly identifies all the samples has 100% sensitivity and specificity, which corresponds to a curve that also goes through (0, 1); the area under the perfect curve is 1. A model that is totally useless corresponds to a curve close to the diagonal line with an area under the curve of about 0.5.

You can visually compare different models by putting their ROC curves on one plot, or use the AUC to compare them. DeLong et al. came up with a statistical test to compare AUC based on U-statistics (E.R. DeLong, 1988), which gives a p-value and confidence interval. You can also use the bootstrap to get a confidence interval for AUC (Hall P, 2004). We can use the following code in R to get an estimate of AUC and its confidence interval:

# get the estimate of AUC
auc(rocCurve)

## Area under the curve: 0.989

# get a confidence interval based on DeLong et al.
ci.auc(rocCurve)

## 95% CI: 0.979-0.999 (DeLong)

AUC is robust to class imbalance (Provost F, 1998; T, 2006) and hence a popular measurement. But it still boils a lot of information down to one number, so there is inevitably a loss of information; it is better to double-check by comparing the curves at the same time. If you care more about getting a model with high specificity, which is the lower part of the curve, as in the spam filtering case, you can use the area of the lower part of the curve as the performance measurement (D, 1989). ROC is only for the two-class case, though some researchers have generalized it to situations with more than two categories (Hand D, 2001; Lachiche N, 2003; Li J, 2008).
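For example (a sketch, not from the original text), pROC's roc.test() implements DeLong's test, so two models' AUCs can be compared directly; the random score rand_pred below is a hypothetical baseline introduced only for illustration:

set.seed(1)
rand_pred <- runif(length(yTest))        # a useless random predictor
rocRandom <- pROC::roc(response = yTest, predictor = rand_pred)
# DeLong's test for the difference between the two AUCs
pROC::roc.test(rocCurve, rocRandom, method = "delong")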

8.2.4 Gain and Lift Charts

The gain and lift chart is a visual tool for evaluating the performance of a classification model. In the previous swine disease example, there are 160 samples in the testing data and 71 of them have a positive outcome.

table(yTest)

## yTest
##  1  0
## 71 89

If we order the testing samples by the predicted probability, one would hope that the positive samples are ranked higher than the negative ones. That is what lift charts do: rank the samples by their scores and calculate the cumulative positive rate as more samples are evaluated. In the perfect scenario, the highest-ranked 71 samples would contain all 71 positive samples. When the model is totally random, the highest-ranked x% of the data would contain about x% of the positive samples. The gain/lift charts compare the ratio between the results obtained with and without a model. Let's plot the lift charts to compare the predicted outbreak probability (modelscore <- yhatprob[, 2]) from the random forest model we fit before with some random scores generated from a uniform distribution (randomscore <- runif(length(yTest))).

# predicted outbreak probability
modelscore <- yhatprob[, 2]
# randomly sample from a uniform distribution
randomscore <- runif(length(yTest))
labs <- c(modelscore = "Random Forest Prediction",
          randomscore = "Random Number")
liftCurve <- caret::lift(yTest ~ modelscore + randomscore,
                         class = "1", labels = labs)
xyplot(liftCurve, auto.key = list(columns = 2,
                                  lines = T, points = F))

(Lift chart: % Samples Found versus % Samples Tested for the Random Forest Prediction and the Random Number baseline.)

The x-axis is the percentage of samples tested and the y-axis is the percentage of positive samples that are detected by the model.

For example, the point (8.125, 18.31) on the random forest prediction curve indicates that if you order the testing samples by the predicted probability from high to low, the top 8.125% of the samples contain 18.31% of the total positive outcomes. Similar to the ROC curve, we can choose the model by comparing their lift charts. Some parts of the lift curves may be more interesting than the rest. For example, if you only have a budget to clean 50% of the farms, then you should pick the model that gives the highest point when the x-axis is 50%.
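As a rough manual check of how such points are computed (a sketch, not the book's code; ties are broken arbitrarily here, so the numbers may differ slightly from caret's lift() output):

# order samples by model score, high to low
ord <- order(modelscore, decreasing = TRUE)
# cumulative share of positives captured as more samples are tested
cum_found <- cumsum(yTest[ord] == "1") / sum(yTest == "1") * 100
pct_tested <- seq_along(ord) / length(ord) * 100
head(data.frame(pct_tested, cum_found), 13)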

9 Regression Models

In this chapter, we will cover ordinary linear regression and a few more advanced regression methods. The linear combination of variables seems simple compared to many of today's machine learning models. However, many advanced models use linear combinations of variables as one of their major components or steps. For example, for each neuron in a deep neural network, all the input signals are first linearly combined before feeding into a non-linear activation function. To understand many of today's machine learning models, it is helpful to understand the key ideas across different modeling frameworks.

First, we will introduce multivariate linear regression (i.e., the typical least squares regression), which is one of the simplest supervised learning methods. Even though it is simple, the general ideas and procedures of fitting a regression model apply to a broader scope. Having a solid understanding of the basic linear regression model enables us to learn more advanced models easily. For example, we will introduce two "shrinkage" versions of linear regression: ridge regression and LASSO regression. While the parameters are still fitted by the least squares method, the extra penalty can effectively shrink model parameters towards zero. It mitigates overfitting and maintains the robustness of the model when the data size is small compared to the number of explanatory variables. We first introduce the basic knowledge of each model and then provide R code to show how to fit it. We only cover the major properties of these models, and the listed references provide more in-depth discussion.

We will use the clothing company data as an example. We want to answer business questions such as "which variables are the driving factors of total revenue (both online and in-store purchases)?" The answer to this question can help the company decide where to invest (such as design, quality, etc.). Note that a driving factor here does not guarantee a causal relationship. Linear regression models reveal correlation rather than causation. For example, if a survey on car purchases shows a positive correlation between price and customer satisfaction, does it suggest the car dealer should increase the price? Probably not! It is more likely that customer satisfaction is driven by quality, and a higher quality car tends to be more expensive. Causal inference is much more difficult to establish, and we have to be very careful when interpreting regression model results.

Load the R packages first:

# install packages from CRAN
p_needed <- c('caret', 'dplyr', 'lattice', 'elasticnet',
              'lars', 'corrplot', 'pls')
packages <- rownames(installed.packages())
p_to_install <- p_needed[!(p_needed %in% packages)]
if (length(p_to_install) > 0) {
    install.packages(p_to_install)
}
lapply(p_needed, require, character.only = TRUE)

9.1 Ordinary Least Square

For a typical linear regression with $p$ explanatory variables, we have a linear combination of these variables:

$$f(\mathbf{X}) = \mathbf{X}\boldsymbol{\beta} = \beta_0 + \sum_{j=1}^{p} \mathbf{x}_{.j}\beta_j$$

where $\boldsymbol{\beta}$ is the parameter vector with length $p + 1$. Here we use $\mathbf{x}_{.j}$ for a column vector and $\mathbf{x}_{i.}$ for a row vector. Least squares is the method to find a set of values for $\boldsymbol{\beta} = (\beta_0, \beta_1, ..., \beta_p)^{T}$ that minimizes the residual sum of squares (RSS):

$$RSS(\beta) = \sum_{i=1}^{N}(y_i - f(\mathbf{x}_{i.}))^2 = \sum_{i=1}^{N}\left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2$$

The process of finding such a set of values has been implemented in R. Now let's load the data:

dat <- read.csv("http://bit.ly/2P5gTw4")

Before fitting the model, we need to clean the data, such as removing bad data points that are not logical (negative expense).

dat <- subset(dat, store_exp > 0 & online_exp > 0)

Use the 10 survey question variables as our explanatory variables.

modeldat <- dat[, grep("Q", names(dat))]

The response variable is the sum of in-store spending and online spending.

# total expense = in store expense + online expense
modeldat$total_exp <- dat$store_exp + dat$online_exp

To fit a linear regression model, let us first check if there are any missing values or outliers:

par(mfrow = c(1, 2))
hist(modeldat$total_exp, main = "", xlab = "total_exp")
boxplot(modeldat$total_exp)


There is no missing value in the response variable, but there are outliers. Outliers are often best identified by the problem itself, in the sense that domain knowledge tells us such values are not possible. We can also use a statistical threshold to remove extremely large or small outlier values from the data. Here we use the Z-score approach to find and remove outliers, as described in section 5.5; readers can refer to that section for more detail.

y <- modeldat$total_exp
# Find data points with Z-score larger than 3.5
zs <- (y - mean(y)) / mad(y)
modeldat <- modeldat[-which(zs > 3.5), ]

We will not perform a log-transformation of the response variable at this stage. Let us first check the correlation among explanatory variables:

correlation <- cor(modeldat[, grep("Q", names(modeldat))])
corrplot::corrplot.mixed(correlation, order = "hclust",
                         tl.pos = "lt", upper = "ellipse")

FIGURE 9.1: Correlation Matrix Plot for Explanatory Variables

As shown in figure 9.1, there are some highly correlated variables. Let us use the method described in section 5.6 to find potentially highly correlated explanatory variables to remove, with a threshold of 0.75:

highcor <- findCorrelation(correlation, cutoff = 0.75)

modeldat <- modeldat[, -highcor]

The dataset is now ready to fit a linear regression model. The standard format to define a regression in R is:

(1) the response variable is on the left side of ~;

(2) the explanatory variables are on the right side of ~;
(3) if all the variables in the dataset except the response variable are included in the model, we can use . on the right side of ~;
(4) if we want to consider the interaction between two variables such as Q1 and Q2, we can add an interaction term Q1*Q2;
(5) transformations of variables can be added directly to variable names, such as log(total_exp). A few short examples follow this list.
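For illustration (hypothetical calls on the modeldat data, not models we fit in this chapter), the formula syntax above looks like this:

# main effects of Q1 and Q2 plus their interaction
lm(total_exp ~ Q1 + Q2 + Q1:Q2, data = modeldat)
# Q1 * Q2 is shorthand for the same model as above
lm(total_exp ~ Q1 * Q2, data = modeldat)
# a transformation applied directly in the formula
lm(log(total_exp) ~ Q1 + Q2, data = modeldat)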

lmfit <- lm(log(total_exp) ~ ., data = modeldat)
summary(lmfit)

##
## Call:
## lm(formula = log(total_exp) ~ ., data = modeldat)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.1749 -0.1372  0.0128  0.1416  0.5623
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  8.09831    0.05429  149.18  < 2e-16 ***
## Q1          -0.14534    0.00882  -16.47  < 2e-16 ***
## Q2           0.10228    0.01949    5.25  2.0e-07 ***
## Q3           0.25445    0.01835   13.87  < 2e-16 ***
## Q6          -0.22768    0.01152  -19.76  < 2e-16 ***
## Q8          -0.09071    0.01650   -5.50  5.2e-08 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.226 on 805 degrees of freedom
## Multiple R-squared:  0.854, Adjusted R-squared:  0.853

## F-statistic: 943 on 5 and 805 DF, p-value: <2e-16

The summary(lmfit) presents a summary of the model fit. It shows the point estimate of each explanatory variable (the Estimate column), their corresponding standard errors (the Std. Error column), t values (t value), and p values (Pr(>|t|)).

9.1.1 The Magic P-value

Let us pause a little to have a short discussion about the p-value. Misuse of the p-value is common in many research fields, and there have been heated discussions about it. Siegfried commented in his 2010 Science News article:

“It’s science’s dirtiest secret: The scientific method of testing hypotheses by statistical analysis stands on a flimsy foundation.”

The American Statistical Association (ASA) released an official statement on the p-value in 2016 (Ronald L. Wassersteina, 2016). It was the first organization-level announcement about the p-value. ASA stated that the goal of releasing this guidance was to

“improve the conduct and interpretation of quantitative science and inform the growing emphasis on reproducibility of science research.”

The statement also noted that

"the increased quantification of scientific research and a proliferation of large, complex data sets has expanded the scope for statistics and the importance of appropriately chosen techniques, properly conducted analyses, and correct interpretation."

The statement's six principles, many of which address misconceptions and misuse of the p-value, are the following:

1. P-values can indicate how incompatible the data are with a specified statistical model.
2. P-values do not measure the probability that the studied hypothesis is true or the probability that the data were produced by random chance alone.
3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

The $p = 0.05$ threshold is not based on any scientific calculation but is an arbitrary number. It means that practitioners can use a different threshold if they think it better fits the problem to solve. We do not promote the p-value in this book. However, the p-value is hard to avoid in classical statistical inference. In practice, when making classical statistical inferences, we recommend reporting the confidence interval whenever possible instead of the p-value. The Bayesian paradigm is an alternative to the classical paradigm: a Bayesian can state probabilities about the parameters, which are considered random variables, something that is not possible in the classical paradigm. In our work, we use hierarchical (generalized) linear models in practice instead of classical linear regression. Hierarchical models pool information across clusters (for example, you can treat each customer segment as a cluster). This pooling tends to improve the estimates for each cluster, especially when sampling is imbalanced. Because the models automatically cope with the differing uncertainty introduced by sampling imbalance (a bigger cluster has smaller variance), they prevent over-sampled clusters from unfairly dominating the inference.

This book does not cover the Bayesian framework. The best applied Bayesian book is Statistical Rethinking¹ by Richard McElreath (McElreath, 2020). The book provides outstanding conceptual explanations and a wide range of models from simple to advanced with detailed, repeatable code. The text uses R, but there are code examples for Python and Julia on the book website.

Now let us come back to our example. We will not spend too much time on p-values; instead, we will focus on the confidence interval of the parameter estimate for each explanatory variable. In R, the function confint() produces the confidence interval for each parameter:

confint(lmfit, level = 0.9)

##                  5 %     95 %
## (Intercept)  8.00892  8.18771
## Q1          -0.15987 -0.13081
## Q2           0.07018  0.13437
## Q3           0.22424  0.28466
## Q6          -0.24665 -0.20871
## Q8          -0.11787 -0.06354

The above output is for a 90% confidence level, as level = 0.9 indicates in the function call. We can change the confidence level by adjusting the level setting. Fitting a linear regression is so easy using R that many analysts directly write reports without thinking about whether the model is meaningful. On the other hand, we can easily use R to check model assumptions. In the following sections, we will introduce a few commonly used diagnostic methods for linear regression to check whether the model assumptions are reasonable.

¹ https://xcelab.net/rm/statistical-rethinking/

9.1.2 Diagnostics for Linear Regression

In linear regression, we would like the Ordinary Least Square (OLS) estimate to be the Best Linear Unbiased Estimate (BLUE). In other words, we hope the expected value of the estimate equals the actual parameter value (i.e., unbiased) and that it has the smallest variance among all linear unbiased estimates (i.e., best). Based on the Gauss-Markov theorem, the OLS estimate is BLUE under the following conditions:

1. Explanatory variables ($\mathbf{x}_{.j}$) and random error ($\boldsymbol{\epsilon}$) are independent: $cov(\mathbf{x}_{.j}, \boldsymbol{\epsilon}) = 0$ for all $j = 1, \dots, p$.
2. The expected value of the random error is zero: $E(\boldsymbol{\epsilon}|\mathbf{X}) = 0$.
3. Random errors are independent of each other, and the variance of the random error is constant: $Var(\boldsymbol{\epsilon}) = \sigma^2 I$, where $\sigma^2$ is positive and $I$ is an $n \times n$ identity matrix.

We will introduce four graphic diagnostics for the above assumptions.

(1) Residual plot

It is a scatter plot with the residuals on the Y-axis and the fitted values on the X-axis. We can also put any of the explanatory variables on the X-axis. Under the assumptions, residuals are randomly distributed, and we need to check the following:

• Are residuals centered around zero?
• Are there any patterns in the residual plot (such as residuals with x-values farther from $\bar{x}$ having greater variance than residuals with x-values closer to $\bar{x}$)?
• Are the variances of the residuals consistent across the range of fitted values?

Please note that even if the variance is not consistent, the regression parameter's point estimates are still unbiased. However, the variance estimate is not unbiased. Because the significance tests for regression parameters are based on the random error distribution, these tests are no longer valid if the variance is not constant.

(2) Normal quantile-quantile Plot (Q-Q Plot)

The Q-Q plot is used to check the normality assumption for the residuals. For normally distributed residuals, the data points should follow a straight line in the Q-Q plot. The more the points depart from a straight line, the more the residuals depart from a normal distribution.

(3) Standardized residuals plot

A standardized residual is the residual normalized by an estimate of its standard deviation. Like the residual plot, the X-axis is still the fitted value, but the Y-axis is now the standardized residual. Because of the normalization, the Y-axis shows the number of standard deviations from zero. A value greater than 2 or less than -2 indicates an observation with a large standardized residual. The plot is useful because when the variance is not consistent, it can be difficult to detect outliers using the raw residual plot.

(4) Cook’s distance

Cook's distance can be used to check for influential points in OLS-based linear regression models. In general, we need to pay attention to data points with Cook's distance > 0.5.

In R, these diagnostic graphs are built into the plot() function.

par(mfrow = c(2, 2))
plot(lmfit, which = 1)
plot(lmfit, which = 2)
plot(lmfit, which = 3)
plot(lmfit, which = 4)

FIGURE 9.2: Linear Regression Diagnostic Plots: residual plot (top left), Q-Q plot (top right), standardized residuals plot (lower left), Cook's distance (lower right)

The above diagnostic plot examples show:

• Residual plot: residuals are generally distributed around the $y = 0$ horizontal line. There are no significant trends or patterns in this residual plot (there are two bumps, but they do not seem too severe). So the linear relationship assumption between the response variable and the explanatory variables is reasonable.
• Q-Q plot: data points are pretty much along the diagonal line $Y = X$, indicating no significant departure from the normality assumption for the residuals. Because we simulated the data, we know the response variable before log transformation follows a normal distribution, and the shape of the distribution does not deviate from a normal distribution too much after log transformation.
• Standardized residual plot: if the constant variance assumption is valid, then the plot's data points should be randomly distributed around the horizontal line. We can see there are three outliers on the plot. Let us check those points:

modeldat[which(row.names(modeldat) %in% c(960, 678, 155)), ]

##     Q1 Q2 Q3 Q6 Q8 total_exp
## 155  4  2  1  4  4     351.9
## 678  2  1  1  1  2    1087.3
## 960  2  1  1  1  3     658.3

It is not easy to see from the above output alone why those records are outliers. It becomes clearer once we condition on the independent variables (Q1, Q2, Q3, Q6, and Q8). Let us examine the value of total_exp for samples with the same Q1, Q2, Q3, Q6, and Q8 answers as the 3rd row above.

datcheck = modeldat %>%
    filter(Q1 == 2 & Q2 == 1 & Q3 == 1 & Q6 == 1 & Q8 == 3)
nrow(datcheck)

## [1] 87

There are 87 samples with the same values of the independent variables. The distribution of the response variable (total_exp) among them is:

summary(datcheck$total_exp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     658    1884    2215    2204    2554    3197

Now it is easy to see why row 960 with total_exp = 658.3 is an outlier: all the other 86 records with the same survey responses have a much higher total expense!

• Cook's distance: the maximum Cook's distance is around 0.05. Even though the graph does not have any point with a Cook's distance of more than 0.5, we can still spot some potential outliers.

The graphs suggest some outliers, but it is our decision what to do with them. We can either remove them or investigate them further. If the values are not due to any data error, we should consider them in our analysis.
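As a quick numeric companion to the plot (a sketch, not part of the original code), the largest Cook's distances can be listed directly:

cd <- cooks.distance(lmfit)          # one value per observation
head(sort(cd, decreasing = TRUE))    # the most influential points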

9.2 Principal Component Regression and Partial Least Square

In real-life applications, explanatory variables are usually related to each other and contain similar information. For example, in the previous chapter, we used expenditure variables to predict consumer income. In that model, store expenditure (store_exp), online expenditure (online_exp), number of store transactions (store_trans), and number of online transactions (online_trans) are correlated to a certain extent, especially the number of transactions and the corresponding expenditure. If there is a strong correlation among explanatory variables, then the least squares based linear regression model may not be robust. If the number of observations is less than the number of explanatory variables, the standard least squares method cannot provide a unique set of coefficient estimates. For such cases, we can perform data preprocessing, such as removing highly correlated variables with a preset threshold. However, this approach cannot guarantee that linear combinations of the remaining variables have low correlation with other variables, in which case the standard least squares method will still not be robust. We need to be aware that removing highly correlated variables does not always guarantee a robust solution.

We can also apply feature engineering procedures to the explanatory variables, such as principal component analysis (PCA). By using principal components, we can ensure they are uncorrelated with each other. However, the drawback of using PCA is that these components are linear combinations of the original variables, and it is difficult to explain the fitted model. Principal component regression (PCR) is described in more detail in (W, 1965). It can be used when there are strong correlations among variables or when the number of observations is less than the number of variables. In theory, we can use PCR to reduce the number of variables used in a linear model, but the results are often not good, because the first few principal components may not be good candidates to model the response variable. PCA is unsupervised learning: the entire process does not consider the response variable and only focuses on the variability of the explanatory variables. When the independent variables and the response variable are related, PCR can identify the systematic relationship between them well. However, when there exist independent variables not associated with the response variable, it will undermine PCR's performance. We also need to be aware that PCR does not perform feature selection, since each principal component is a combination of the original variables.

Partial least squares regression (PLS) is the supervised version of PCR. Similar to PCR, PLS can also reduce the number of variables in the regression model. As PLS is also related to the variables' variance, we usually standardize or normalize the variables before PLS. Suppose we have a list of explanatory variables $\mathbf{X} = [X_1, X_2, ..., X_p]^T$ with variance-covariance matrix $\Sigma$. PLS also transforms the original variables using linear combinations into new uncorrelated variables $(Z_1, Z_2, \dots, Z_m)$. When $m = p$, the result of PLS is the same as OLS. The main difference between PCR and PLS is the process of creating the new variables: PLS considers the response variable. PLS comes from Herman Wold's Nonlinear Iterative Partial Least Squares (NIPALS) algorithm (Wold, 1973; Wold and Jöreskog, 1982). Later, NIPALS was applied to regression problems, which was then called PLS.
PLS is a method of linearizing nonlinear relationships through hidden layers. It is similar to PCR, except that PCR does not take into account the information of the dependent variable when selecting the components; its purpose is to find the linear combinations (i.e., unsupervised) that capture the most variance of the independent variables, whereas PLS chooses linear combinations that are most strongly related to the response variable. In the current case, the more complicated PLS does not perform better than simple linear regression. We will not discuss the details of the PLS algorithm; the references mentioned above provide a more detailed description.

We focus on using the R library caret to fit PCR and PLS models. Let us use the 10 survey questions (Q1-Q10) as the explanatory variables and income as the response variable. First, load and preprocess the data:

library(lattice)
library(caret)
library(dplyr)
library(elasticnet)
library(lars)

# Load Data
sim.dat <- read.csv("http://bit.ly/2P5gTw4")
ymad <- mad(na.omit(sim.dat$income))

# Calculate Z values
zs <- (sim.dat$income - mean(na.omit(sim.dat$income))) / ymad
# which(na.omit(zs > 3.5)) : find outliers
# which(is.na(zs)) : find missing values
idex <- c(which(na.omit(zs > 3.5)), which(is.na(zs)))
# Remove rows with outliers and missing values
sim.dat <- sim.dat[-idex, ]

Now let's define the explanatory variable matrix (xtrain) by selecting the 10 survey question columns, and define the response variable (ytrain):

xtrain = dplyr::select(sim.dat, Q1:Q10)
ytrain = sim.dat$income

We also set a random seed and 10-fold cross-validation:

set.seed(100)
ctrl <- trainControl(method = "cv", number = 10)

Fit the PLS model using the number of PLS components (ncomp) as the hyper-parameter to tune. As there are at most 10 explanatory variables in the model, we set the hyper-parameter tuning range to be 1 to 10:

plsTune <- train(xtrain, ytrain,
                 method = "pls",
                 # set hyper-parameter tuning range
                 tuneGrid = expand.grid(.ncomp = 1:10),
                 trControl = ctrl)
plsTune

## Partial Least Squares
##
## 772 samples
##  10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 696, 693, 694, 694, 696, 695, ...
## Resampling results across tuning parameters:
##
##   ncomp  RMSE   Rsquared  MAE
##    1     27777  0.6534    19845
##    2     24420  0.7320    15976
##    3     23175  0.7590    14384
##    4     23011  0.7625    13808
##    5     22977  0.7631    13737
##    6     22978  0.7631    13729
##    7     22976  0.7631    13726
##    8     22976  0.7631    13726
##    9     22976  0.7631    13726
##   10     22976  0.7631    13726
##
## RMSE was used to select the optimal model using
## the smallest value.
## The final value used for the model was ncomp = 7.

From the result, we can see that the optimal number of components is 7.

However, if we pay attention to the RMSE improvement, we will find only minimal improvement after three components. In practice, we could choose the model with three components if the improvement does not make a practical difference and we would rather have a simpler model. We can also find the relative importance of each variable during the PLS model tuning process using the following code:

plsImp <- varImp(plsTune, scale = FALSE)
plot(plsImp, top = 10, scales = list(y = list(cex = 0.95)))

(Variable importance plot for the PLS model, ordered top to bottom: Q6, Q2, Q3, Q1, Q4, Q7, Q8, Q9, Q5, Q10.)

The above plot shows that Q1, Q2, Q3, and Q6 are more important than the other variables. Now let's fit a PCR model with the number of principal components as the hyper-parameter:

# Set random seed
set.seed(100)
pcrTune <- train(x = xtrain, y = ytrain,
                 method = "pcr",
                 # set hyper-parameter tuning range
                 tuneGrid = expand.grid(.ncomp = 1:10),
                 trControl = ctrl)
pcrTune

## Principal Component Analysis
##
## 772 samples
##  10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 696, 693, 694, 694, 696, 695, ...
## Resampling results across tuning parameters:
##
##   ncomp  RMSE   Rsquared  MAE
##    1     45958  0.03243   36599
##    2     32460  0.52200   24041
##    3     23235  0.75774   14516
##    4     23262  0.75735   14545
##    5     23152  0.75957   14232
##    6     23133  0.76004   14130
##    7     23114  0.76049   14129
##    8     23115  0.76045   14130
##    9     22991  0.76283   13801
##   10     22976  0.76308   13726
##
## RMSE was used to select the optimal model using
## the smallest value.
## The final value used for the model was ncomp = 10.

From the output, the default recommendation is ten components. However, if we pay attention to the RMSE improvement with more components, we will find little difference after the model with three components. Again, in practice, we can keep the model with three components.

Now let’s compare the hyper-parameter tuning process for PLS and PCR:

# Save PLS model tuning information to plsResamples
plsResamples <- plsTune$results
plsResamples$Model <- "PLS"
# Save PCR model tuning information to pcrResamples
pcrResamples <- pcrTune$results
pcrResamples$Model <- "PCR"
# Combine both outputs for plotting
plsPlotData <- rbind(plsResamples, pcrResamples)
# Leverage the xyplot() function from the lattice library
xyplot(RMSE ~ ncomp,
       data = plsPlotData,
       xlab = "# Components",
       ylab = "RMSE (Cross-Validation)",
       auto.key = list(columns = 2),
       groups = Model,
       type = c("o", "g"))

(Plot: cross-validated RMSE versus number of components for PCR and PLS.)

The plot confirms our choice of using a model with three components for both PLS and PCR.
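If we do settle on the simpler model, one way to obtain it (a sketch using the objects defined above, not code from the book) is to refit with the tuning grid fixed at three components:

plsFinal <- train(xtrain, ytrain,
                  method = "pls",
                  # a single-value grid fixes the model at three components
                  tuneGrid = expand.grid(.ncomp = 3),
                  trControl = ctrl)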

10 Regularization Methods

The regularization method is also known as the shrinkage method. It is a technique that constrains or regularizes the coefficient estimates. By imposing a penalty on the size of the coefficients, it shrinks the coefficient estimates towards zero. It also intrinsically conducts feature selection and is naturally resistant to non-informative predictors. It may not be obvious why this technique improves model performance, but it turns out to be a very effective modeling technique. In this chapter, we will introduce the two best-known regularization methods: ridge regression and lasso. The elastic net is a combination of ridge and lasso, or a general representation of the two.

We talked about the variance-bias trade-off in section 7.1. The variance of a learning model is the amount by which $\hat{f}$ would change if we estimated it using a different training data set. In general, model variance increases as flexibility increases. The regularization technique decreases the model flexibility by shrinking the coefficients and hence significantly reduces the model variance.

Load the R packages:

# install packages from CRAN
p_needed <- c('caret', 'elasticnet', 'glmnet', 'devtools',
              'MASS', 'grplasso')
packages <- rownames(installed.packages())
p_to_install <- p_needed[!(p_needed %in% packages)]
if (length(p_to_install) > 0) {
    install.packages(p_to_install)
}


lapply(p_needed, require, character.only = TRUE)

# install packages from GitHub
p_needed_gh <- c('NetlifyDS')
if (! p_needed_gh %in% packages) {
    devtools::install_github("netlify/NetlifyDS")
}
library(NetlifyDS)

10.1 Ridge Regression

Recall that the least squares estimates minimize the RSS:

$$RSS = \Sigma_{i=1}^{n}(y_i - \beta_0 - \Sigma_{j=1}^{p}\beta_j x_{ij})^2$$

Ridge regression (Hoerl and Kennard, 1970) is similar, but it finds $\hat{\beta}^{R}$ that optimizes a slightly different function:

$$\Sigma_{i=1}^{n}(y_i - \beta_0 - \Sigma_{j=1}^{p}\beta_j x_{ij})^2 + \lambda\Sigma_{j=1}^{p}\beta_j^2 = RSS + \lambda\Sigma_{j=1}^{p}\beta_j^2 \tag{10.1}$$

where $\lambda > 0$ is a tuning parameter. As with least squares, ridge regression considers minimizing RSS. However, it adds a shrinkage penalty $\lambda\Sigma_{j=1}^{p}\beta_j^2$ that takes account of the number of parameters in the model. When $\lambda = 0$, it is identical to least squares. As $\lambda$ gets larger, the coefficients start shrinking towards 0, and when $\lambda \rightarrow \infty$, the coefficients $\beta_1, ..., \beta_p$ approach 0. Note that the penalty is not applied to $\beta_0$. The tuning parameter $\lambda$ adjusts the impact of the two parts in equation (10.1), and every value of $\lambda$ corresponds to a set of parameter estimates.

There are many R functions for ridge regression, such as lm.ridge() from MASS and enet() from elasticnet. If you know the value of $\lambda$, you can use either of the functions to fit a ridge regression. A more convenient way is to use the train() function from caret. Let's use the 10 survey questions to predict the total purchase amount (sum of online and store purchases).

dat <- read.csv("http://bit.ly/2P5gTw4")
# data cleaning: delete wrong observations
# expense can't be negative
dat <- subset(dat, store_exp > 0 & online_exp > 0)
# get predictors
trainx <- dat[, grep("Q", names(dat))]
# get response
trainy <- dat$store_exp + dat$online_exp

Use the train() function to tune the parameter. Since ridge regression adds the penalty parameter $\lambda$ in front of the sum of squares of the parameters, the scale of the parameters matters, so here it is better to center and scale the predictors. This preprocessing is generally recommended for all techniques that put a penalty on parameter estimates. In this example, the 10 survey questions are already on the same scale, so the preprocessing doesn't make much difference, but it is a good idea to make it standard practice.

# set cross validation
ctrl <- trainControl(method = "cv", number = 10)
# set the parameter range
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 20))
set.seed(100)
ridgeRegTune <- train(trainx, trainy,
                      method = "ridge",
                      tuneGrid = ridgeGrid,
                      trControl = ctrl,
                      ## center and scale predictors
                      preProc = c("center", "scale"))
ridgeRegTune

## Ridge Regression
##
## 999 samples
##  10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 899, 899, 899, 899, 899, 900, ...
## Resampling results across tuning parameters:
##
##   lambda    RMSE  Rsquared  MAE
##   0.000000  1744  0.7952    754.0
##   0.005263  1744  0.7954    754.9
##   0.010526  1744  0.7955    755.9
##   0.015789  1744  0.7955    757.3
##   0.021053  1745  0.7956    758.8
##   0.026316  1746  0.7956    760.6
##   0.031579  1747  0.7956    762.4
##   0.036842  1748  0.7956    764.3
##   0.042105  1750  0.7956    766.4
##   0.047368  1751  0.7956    768.5
##   0.052632  1753  0.7956    770.6
##   0.057895  1755  0.7956    772.7
##   0.063158  1757  0.7956    774.9
##   0.068421  1759  0.7956    777.2
##   0.073684  1762  0.7956    779.6
##   0.078947  1764  0.7955    782.1
##   0.084211  1767  0.7955    784.8
##   0.089474  1769  0.7955    787.6
##   0.094737  1772  0.7955    790.4
##   0.100000  1775  0.7954    793.3
##
## RMSE was used to select the optimal model using
## the smallest value.
## The final value used for the model was lambda
## = 0.005263.

The results show that the best value of $\lambda$ is 0.005 and the corresponding RMSE and $R^2$ are 1744 and 0.7954. You can see from figure 10.1 that as $\lambda$ increases, the RMSE first slightly decreases and then increases.

plot(ridgeRegTune)

FIGURE 10.1: Test mean squared error for the ridge regression

Once you have the tuning parameter value, there are different functions to fit a ridge regression. Let's look at how to use enet() in the elasticnet package.

ridgefit = enet(x = as.matrix(trainx), y = trainy,
                lambda = 0.01,
                # center and scale predictors
                normalize = TRUE)

Note that ridgefit above only assigns the value of the tuning parameter for ridge regression. Since the elastic net model includes both the ridge and lasso penalties, we need to use the predict() function to get the model fit. You can get the fitted results by setting s = 1 and mode = "fraction". Here s = 1 means we only use the ridge penalty. We will come back to this when we get to lasso regression.

ridgePred <- predict(ridgefit, newx = as.matrix(trainx), s = 1, mode = "fraction", type = "fit")

By setting type = "fit", the above returns a list object. The fit item has the predictions:

names(ridgePred)

## [1] "s"        "fraction" "mode"     "fit"

head(ridgePred$fit)

##      1      2      3      4      5      6
## 1290.5  224.2  591.4 1220.6  853.4  908.2

If you want to check the estimated coefficients, you can set type = "coefficients":

ridgeCoef <- predict(ridgefit, newx = as.matrix(trainx),
                     s = 1, mode = "fraction",
                     type = "coefficients")

It also returns a list and the estimates are in the coefficients item:

# didn't show the results
RidgeCoef = ridgeCoef$coefficients

Compared to least squares regression, ridge regression performs better because of the bias-variance trade-off we mentioned in section 7.1. As the penalty parameter $\lambda$ increases, the flexibility of the ridge regression decreases, which decreases the variance of the model but increases the bias at the same time.

10.2 LASSO

Even though ridge regression shrinks the parameter estimates towards 0, it won't shrink any estimate to exactly 0, which means it includes all predictors in the final model, so it can't select variables. That may not be a problem for prediction, but it is a huge disadvantage if you want to interpret the model, especially when the number of variables is large. A popular alternative to the ridge penalty is the Least Absolute Shrinkage and Selection Operator (LASSO) (R, 1996).

Similar to ridge regression, lasso adds a penalty. The lasso coefficients $\hat{\beta}^{L}_{\lambda}$ minimize the following:

$$\Sigma_{i=1}^{n}(y_i - \beta_0 - \Sigma_{j=1}^{p}\beta_j x_{ij})^2 + \lambda\Sigma_{j=1}^{p}|\beta_j| = RSS + \lambda\Sigma_{j=1}^{p}|\beta_j| \tag{10.2}$$

The only difference between lasso and ridge is the penalty. In statistical parlance, ridge uses the $L_2$ penalty ($\beta_j^2$) and lasso uses the $L_1$ penalty ($|\beta_j|$). The $L_1$ penalty can shrink the estimates to 0 when $\lambda$ is big enough, so lasso can be used as a feature selection tool. That is a huge advantage because it leads to a more explainable model.

Similar to other models with tuning parameters, lasso regression requires cross-validation to tune the parameter. You can use train() in a similar way as we showed in the ridge regression section. To tune the parameter, we need to set up cross-validation and the parameter range. Also, it is advised to standardize the predictors:

ctrl <- trainControl(method = "cv", number = 10)
lassoGrid <- data.frame(fraction = seq(.8, 1, length = 20))
set.seed(100)
lassoTune <- train(trainx, trainy,
                   ## set the method to be lasso
                   method = "lars",
                   tuneGrid = lassoGrid,
                   trControl = ctrl,

                   ## standardize the predictors
                   preProc = c("center", "scale"))
lassoTune

## Least Angle Regression
##
## 999 samples
##  10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 899, 899, 899, 899, 899, 900, ...
## Resampling results across tuning parameters:
##
##   fraction  RMSE  Rsquared  MAE
##   0.8000    1763  0.7921    787.5
##   0.8105    1760  0.7924    784.1
##   0.8211    1758  0.7927    780.8
##   0.8316    1756  0.7930    777.7
##   0.8421    1754  0.7933    774.6
##   0.8526    1753  0.7936    771.8
##   0.8632    1751  0.7939    769.1
##   0.8737    1749  0.7942    766.6
##   0.8842    1748  0.7944    764.3
##   0.8947    1746  0.7947    762.2
##   0.9053    1745  0.7949    760.1
##   0.9158    1744  0.7951    758.3
##   0.9263    1743  0.7952    756.7
##   0.9368    1743  0.7953    755.5
##   0.9474    1742  0.7954    754.5
##   0.9579    1742  0.7954    754.0
##   0.9684    1742  0.7954    753.6
##   0.9789    1743  0.7953    753.4
##   0.9895    1743  0.7953    753.5
##   1.0000    1744  0.7952    754.0
##

## RMSE was used to select the optimal model using
## the smallest value.
## The final value used for the model was fraction
## = 0.9579.

The results show that the best value of the tuning parameter (fraction in the output) is 0.957 and the corresponding RMSE and $R^2$ are 1742 and 0.7954. The performance is nearly the same as ridge regression. You can see from figure 10.2 that as the fraction increases, the RMSE first decreases and then increases.

plot(lassoTune)

FIGURE 10.2: Test mean squared error for the lasso regression

Once you select a value for the tuning parameter, there are different functions to fit a lasso regression, such as lars() in lars, enet() in elasticnet, and glmnet() in glmnet. They all have very similar syntax.

Here we continue using enet(). The syntax is similar to ridge regression. The only difference is that you need to set lambda = 0, because the argument lambda here controls the ridge penalty; when it is 0, the function returns the lasso model object.

lassoModel <- enet(x = as.matrix(trainx), y = trainy,
                   lambda = 0, normalize = TRUE)

Set the fraction value to be 0.957 (the value we got above):

lassoFit <- predict(lassoModel, newx = as.matrix(trainx),
                    s = 0.957, mode = "fraction", type = "fit")

Again, by setting type = "fit", the above returns a list object. The fit item has the predictions:

head(lassoFit$fit)

## 1 2 3 4 5 6 ## 1357.3 300.5 690.2 1228.2 838.4 1010.1

You need to set type = "coefficients" to get parameter estimates:

lassoCoef <- predict(lassoModel, newx = as.matrix(trainx),
                     s = 0.95, mode = "fraction",
                     type = "coefficients")

It also returns a list and the estimates are in the coefficients item:

# didn't show the results
LassoCoef = lassoCoef$coefficients

Many researchers have applied lasso to other learning methods, such as linear discriminant analysis (Line Clemmensen and Ersbøll, 2011) and partial least squares regression (Chun and Keleş, 2010). However, since the $L_1$ norm is not differentiable, optimization for lasso regression is more complicated, and people have come up with different algorithms to solve the computation problem. The biggest breakthrough is Least Angle Regression (LARS) from Bradley Efron et al. This algorithm works well for lasso regression, especially when the dimension is high.

10.3 Variable selection property of the lasso

You may ask why lasso (the $L_1$ penalty) has the feature selection property but ridge (the $L_2$ penalty) does not. To answer that question, let's look at the alternative representations of the optimization problem for lasso and ridge. For lasso regression, the following two formulations are equivalent:

$$\Sigma_{i=1}^{n}(y_i - \beta_0 - \Sigma_{j=1}^{p}\beta_j x_{ij})^2 + \lambda\Sigma_{j=1}^{p}|\beta_j| = RSS + \lambda\Sigma_{j=1}^{p}|\beta_j| \tag{10.3}$$

$$\underset{\beta}{min}\left\{\Sigma_{i=1}^{n}\left(y_i - \beta_0 - \Sigma_{j=1}^{p}\beta_j x_{ij}\right)^2\right\}, \ \ \Sigma_{j=1}^{p}|\beta_j| \leq s \tag{10.4}$$

For any value of the tuning parameter $\lambda$, there exists an $s$ such that the coefficient estimates that optimize equation (10.3) also optimize equation (10.4). Similarly, for ridge regression, the two representations are identical:

$$\Sigma_{i=1}^{n}(y_i - \beta_0 - \Sigma_{j=1}^{p}\beta_j x_{ij})^2 + \lambda\Sigma_{j=1}^{p}\beta_j^2 = RSS + \lambda\Sigma_{j=1}^{p}\beta_j^2 \tag{10.5}$$

$$\underset{\beta}{min}\left\{\Sigma_{i=1}^{n}\left(y_i - \beta_0 - \Sigma_{j=1}^{p}\beta_j x_{ij}\right)^2\right\}, \ \ \Sigma_{j=1}^{p}\beta_j^2 \leq s \tag{10.6}$$

When $p = 2$, the lasso estimates ($\hat{\beta}_{lasso}$ in figure 10.3) have the smallest RSS among all points that satisfy $|\beta_1| + |\beta_2| \leq s$ (i.e. within the diamond in figure 10.3). The ridge estimates have the smallest RSS among all points that satisfy $\beta_1^2 + \beta_2^2 \leq s$ (i.e. within the circle in figure 10.3). As s increases, the diamond and circle regions expand and get less restrictive. If s is large enough, the restrictive region will cover the least squares estimate ($\hat{\beta}_{LSE}$). Then equations (10.4) and (10.6) will simply yield the least squares estimate.

In contrast, if s is small, the grey region in figure 10.3 will be small and hence restrict the magnitude of $\beta$. With the alternative formulations, we can answer the question of the feature selection property.

FIGURE 10.3: Contours of the RSS and constraint functions for the lasso (left) and ridge regression (right)

Figure 10.3 illustrates the situations for lasso (left) and ridge (right). The least squares estimate is marked as $\hat{\beta}$. The restrictive regions are in grey; the diamond region is for the lasso and the circle region is for the ridge. The least squares estimates lie outside the grey region, so they are different from the lasso and ridge estimates. The ellipses centered around $\hat{\beta}$ represent contours of the RSS. All the points on a given ellipse share the same RSS, and the RSS increases as the ellipses expand. The lasso and ridge estimates are the first points at which an ellipse contacts the grey region. Since ridge regression has a circular restrictive region without a sharp point, the contact point cannot fall on an axis. But it is possible for lasso since the diamond has corners on the axes. When the contact point is on an axis, one of the parameter estimates is 0. If p > 2, the ridge restrictive region becomes a sphere or hypersphere, while the lasso region becomes a polytope with corners on the axes. In that case, when the contact point falls on an axis, multiple coefficient estimates can equal 0 simultaneously.

10.4 Elastic Net

Elastic Net is a generalization of lasso and ridge regression (Zou and Hastie, 2005). It combines the two penalties. The estimates of the coefficients optimize the following function:

$$\Sigma_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda_1\Sigma_{j=1}^{p}\beta_j^2 + \lambda_2\Sigma_{j=1}^{p}|\beta_j| \tag{10.7}$$

The ridge penalty shrinks the coefficients of correlated predictors towards each other, while the lasso tends to pick one and discard the others, so lasso estimates have a higher variance. However, ridge regression doesn't have the variable selection property. The advantage of the elastic net is that it keeps the feature selection quality of the lasso penalty as well as the effectiveness of the ridge penalty. Zou and Hastie (2005) suggest that it deals with highly correlated variables more effectively. We can still use the train() function to tune the parameters in the elastic net. As before, set the cross-validation and parameter range, and standardize the predictors:

enetGrid <- expand.grid(.lambda = seq(0, 0.2, length = 20),
                        .fraction = seq(.8, 1, length = 20))
set.seed(100)
enetTune <- train(trainx, trainy,
                  method = "enet",
                  tuneGrid = enetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"))
enetTune

Elasticnet

999 samples
 10 predictor

Pre-processing: centered (10), scaled (10)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 899, 899, 899, 899, 899, 900, ...
Resampling results across tuning parameters:

  lambda   fraction  RMSE  Rsquared  MAE
  0.00000  0.8000    1763  0.7921    787.5
  0.00000  0.8105    1760  0.7924    784.1
  .
  .
  .
  0.09474  0.9158    1760  0.7945    782.5
  0.09474  0.9263    1761  0.7947    782.5
  0.09474  0.9368    1761  0.7949    782.7
  0.09474  0.9474    1763  0.7950    783.3
  0.09474  0.9579    1764  0.7951    784.3
  0.09474  0.9684    1766  0.7953    785.7
  0.09474  0.9789    1768  0.7954    787.1
  0.09474  0.9895    1770  0.7954    788.8
  0.09474  1.0000    1772  0.7955    790.4
 [ reached getOption("max.print") -- omitted 200 rows ]

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were fraction = 0.9579 and lambda = 0.

The results show that the best values of the tuning parameters are fraction = 0.9579 and lambda = 0. It also indicates that the final model is lasso only (the ridge penalty parameter lambda is 0). The corresponding RMSE and $R^2$ are 1742.2843 and 0.7954.

10.5 Penalized Generalized Linear Model

Adding penalties is a general technique that can be applied to many methods other than linear regression. In this section, we will introduce the penalized generalized linear model, which fits the generalized linear model by minimizing a penalized negative log-likelihood. The penalty can be $L_1$, $L_2$ or a combination of the two. The estimates of the coefficients minimize the following:

$$\underset{\beta_0, \boldsymbol{\beta}}{min}\frac{1}{N}\Sigma_{i=1}^{N} w_i l(y_i, \beta_0 + \boldsymbol{\beta}^{T}\mathbf{x_i}) + \lambda[(1-\alpha)\parallel\boldsymbol{\beta}\parallel_2^2/2 + \alpha\parallel\boldsymbol{\beta}\parallel_1]$$

where

$$l(y_i, \beta_0 + \boldsymbol{\beta}^{T}\mathbf{x_i}) = -log[\mathcal{L}(y_i, \beta_0 + \boldsymbol{\beta}^{T}\mathbf{x_i})]$$

is the negative logarithm of the likelihood $\mathcal{L}(y_i, \beta_0 + \boldsymbol{\beta}^{T}\mathbf{x_i})$. Maximizing the likelihood is equivalent to minimizing $l(y_i, \beta_0 + \boldsymbol{\beta}^{T}\mathbf{x_i})$.

The parameter $\alpha$ decides the type of penalty, i.e., between $L_2$ ($\alpha = 0$) and $L_1$ ($\alpha = 1$), and $\lambda$ controls the weight of the whole penalty term. The higher $\lambda$ is, the more weight the penalty carries compared to the likelihood. As discussed above, the ridge penalty shrinks the coefficients towards 0 but never exactly to 0, while the lasso penalty can set estimates to exactly 0 and therefore has the feature selection property. The elastic net combines both. Here we have two tuning parameters, $\alpha$ and $\lambda$.
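To see how the two parameters map to the penalties in code, here is a minimal sketch using glmnet() with simulated data (the x and y objects here are hypothetical placeholders, not the book's data):

library(glmnet)
set.seed(1)
# hypothetical example data; replace with your own predictor matrix and response
x <- matrix(rnorm(100 * 10), nrow = 100)
y <- rnorm(100)
fit_ridge <- glmnet(x, y, alpha = 0)    # L2 penalty only (ridge)
fit_lasso <- glmnet(x, y, alpha = 1)    # L1 penalty only (lasso), the default
fit_enet  <- glmnet(x, y, alpha = 0.5)  # a mix of L1 and L2 (elastic net)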

10.5.1 Introduction to glmnet package

glmnet is a package that fits a penalized generalized linear model using cyclical coordinate descent. It successively optimizes the objective function over each parameter with the others fixed, and cycles repeatedly until convergence. Since the linear model is a special case of the generalized linear model, glmnet can also fit a penalized linear model. Other than that, it can also fit penalized logistic regression, multinomial, Poisson, and Cox regression models.

The default family option in the function glmnet() is gaussian. It is the linear regression we have discussed so far in this chapter, but the parameterization is a little different in the generalized linear model framework (we have $\alpha$ and $\lambda$). Let's start from our previous example, using the same training data but glmnet() to fit the model:

dat <- read.csv("http://bit.ly/2P5gTw4")
# data cleaning: delete wrong observations with expense < 0
dat <- subset(dat, store_exp > 0 & online_exp > 0)
# get predictors
trainx <- dat[, grep("Q", names(dat))]
# get response
trainy <- dat$store_exp + dat$online_exp
glmfit = glmnet::glmnet(as.matrix(trainx), trainy)

The object glmfit returned by glmnet() has the information of the fitted model for later operations. An easy way to extract the components is through various functions on glmfit, such as plot(), print(), coef() and predict(). For example, the following code visualizes the path of coefficients as the penalty increases:

plot(glmfit, label = T)


Each curve in the plot represents one predictor. The default setting is $\alpha = 1$, which means there is only the lasso penalty. From left to right, the $L_1$ norm is increasing, which means $\lambda$ is decreasing. The bottom x-axis is the $L_1$ norm (i.e. $\parallel\boldsymbol{\beta}\parallel_1$). The upper x-axis is the effective degrees of freedom (df) for the lasso. You can check the detail for every step by:

print(glmfit)

Call: glmnet(x = as.matrix(trainx), y = trainy)

   Df  %Dev Lambda
1   0 0.000   3040
2   2 0.104   2770
3   2 0.192   2530
4   2 0.265   2300
5   3 0.326   2100
6   3 0.389   1910
7   3 0.442   1740
8   3 0.485   1590
9   3 0.521   1450
...

The first column Df is the degrees of freedom (i.e. the number of non-zero coefficients), %Dev is the percentage of deviance explained, and Lambda is the value of the tuning parameter $\lambda$. By default, the function tries 100 different values of $\lambda$. However, if the %Dev doesn't change sufficiently as $\lambda$ changes, the algorithm stops before it goes through all the values of $\lambda$. We didn't show the full output above, but it only uses 68 different values of $\lambda$. You can also set the value of $\lambda$ using s =:

coef(glmfit, s = 1200)

## 11 x 1 sparse Matrix of class "dgCMatrix"
##                  1
## (Intercept) 2255.2
## Q1          -390.9
## Q2           653.6
## Q3           624.4
## Q4             .

## Q5             .
## Q6             .
## Q7             .
## Q8             .
## Q9             .
## Q10            .

When $\lambda = 1200$, there are three coefficients with non-zero estimates (Q1, Q2 and Q3). You can apply models with different values of the tuning parameter to new data using predict():

newdat = matrix(sample(1:9, 30, replace = T), nrow = 3)
predict(glmfit, newdat, s = c(1741, 2000))

##        1    2
## [1,] 6004 5968
## [2,] 7101 6674
## [3,] 9158 8411

Each column corresponds to a value of $\lambda$. To tune the value of $\lambda$, we can easily use the cv.glmnet() function to do cross-validation. cv.glmnet() returns the cross-validation results as a list object. We store the object in cvfit and use it for further operations.

cvfit = cv.glmnet(as.matrix(trainx), trainy)

We can plot the object using plot(). The red dotted line is the cross-validation curve. Each red point is the cross-validation mean squared error for a value of $\lambda$. The grey bars around the red points indicate the upper and lower standard deviation. The two grey dotted vertical lines represent the two selected values of $\lambda$: one gives the minimum mean cross-validated error (lambda.min), the other gives the error that is within one standard error of the minimum (lambda.1se).

plot(cvfit)

You can check the two selected $\lambda$ values by:

# lambda with minimum mean cross-validated error
cvfit$lambda.min

## [1] 12.57

# lambda with one standard error of the minimum
cvfit$lambda.1se

## [1] 1200

You can look at the coefficient estimates for different $\lambda$ by:

# coefficient estimates for model with the error
# that is within one standard error of the minimum
coef(cvfit, s = "lambda.1se")

## 11 x 1 sparse Matrix of class "dgCMatrix"
##                  1
## (Intercept) 2255.3
## Q1          -391.1

## Q2           653.7
## Q3           624.5
## Q4             .
## Q5             .
## Q6             .
## Q7             .
## Q8             .
## Q9             .
## Q10            .

10.5.2 Penalized logistic regression

10.5.2.1 Multivariate logistic regression model

Logistic regression is a traditional statistical method for a two-category classification problem. It is simple yet useful. Here we use the swine disease breakout data as an example to illustrate the learning method and code implementation. Refer to section 3.2 for more details about the dataset. The goal is to predict whether a farm will have a swine disease outbreak (i.e., build a risk scoring system).

Consider risk scoring system construction using a sample of $n$ observations, with information collected for $G$ categorical predictors and one binary response variable for each observation. The predictors are 120 survey questions (i.e. G = 120). There were three possible answers for each question (A, B and C), so each predictor is encoded into two dummy variables (we consider C as the baseline). Let $\mathbf{x_{i,g}}$ be the vector of dummy variables associated with the $g^{th}$ categorical predictor for the $i^{th}$ observation, where $i = 1, \cdots, n$, $g = 1, \cdots, G$. For example, if the first farm chooses B for question 2, then the corresponding observation is $\mathbf{x_{12}} = (0, 1)^T$. Each question has 2 degrees of freedom.

We denote the degrees of freedom of the $g^{th}$ predictor by $df_g$, which is also the length of the vector $\mathbf{x_{i,g}}$. Let $y_i$ (= 1, diseased; or 0, not diseased) be the binary response for the $i$th observation. Denote the probability of disease for the $i$th subject by $\theta_i$; the model can be formulated as:

$$y_i \sim Bernoulli(\theta_i)$$

$$log\left(\frac{\theta_i}{1-\theta_i}\right) = \eta_{\boldsymbol{\beta}}(x_i) = \beta_0 + \sum_{g=1}^{G}\mathbf{x_{i,g}}^{T}\boldsymbol{\beta_g}$$

where $\beta_0$ is the intercept and $\boldsymbol{\beta_g}$ is the parameter vector corresponding to the $g^{th}$ predictor. As we mentioned, here $\boldsymbol{\beta_g}$ has length 2. Traditional estimation of the logistic parameters $\boldsymbol{\beta} = (\beta_0^{T}, \boldsymbol{\beta_1}^{T}, \boldsymbol{\beta_2}^{T}, ..., \boldsymbol{\beta_G}^{T})^{T}$ is done through maximizing the log-likelihood

$$\begin{array}{rcl} l(\boldsymbol{\beta}) & = & log[\prod_{i=1}^{n}\theta_i^{y_i}(1-\theta_i)^{1-y_i}] \\ & = & \sum_{i=1}^{n}\{y_i log(\theta_i) + (1-y_i)log(1-\theta_i)\} \\ & = & \sum_{i=1}^{n}\{y_i\eta_{\boldsymbol{\beta}}(\mathbf{x_i}) - log[1+exp(\eta_{\boldsymbol{\beta}}(\mathbf{x_i}))]\} \end{array}$$

For logistic regression analysis with a large number of explanatory variables, complete or quasi-complete separation may occur, which makes the maximum likelihood estimation unstable (Wedderburn, 1976), (A and J, 1984). For example:

library(MASS)
dat <- read.csv("http://bit.ly/2KXb1Qi")
fit <- glm(y ~ ., dat, family = "binomial")

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or ## 1 occurred

There is a warning saying "algorithm did not converge" because there is complete separation. It happens when there are a large number of explanatory variables, which makes the estimation of the coefficients unstable. To stabilize the estimation of the parameter coefficients, one popular approach is the lasso algorithm with the $L_1$ norm penalty proposed by Tibshirani (R, 1996). Because the lasso algorithm can estimate some variable coefficients to be 0, it can also be used as a variable selection tool.

10.5.2.2 Penalized logistic regression

Penalized logistic regression adds a penalty to the likelihood function:

$$\sum_{i=1}^{n}\{y_i\eta_{\boldsymbol{\beta}}(\mathbf{x_i}) - log[1+exp(\eta_{\boldsymbol{\beta}}(\mathbf{x_i}))]\} + \lambda[(1-\alpha)\frac{\parallel\boldsymbol{\beta}\parallel_2^2}{2} + \alpha\parallel\boldsymbol{\beta}\parallel_1]$$

dat <- read.csv("http://bit.ly/2KXb1Qi")
trainx = dplyr::select(dat, -y)
trainy = dat$y
fit <- glmnet(as.matrix(trainx), trainy, family = "binomial")

The warning message is gone when we use penalized regression. We can visualize the shrinking path of the coefficients as the penalty increases. The use of the predict() function is a little different. For the generalized linear model, you can return different results by setting the type argument. The choices are:

• link: return the link function value
• response: return the probability
• class: return the category (0/1)
• coefficients: return the coefficient estimates
• nonzero: return an indicator for non-zero estimates (i.e. which variables are selected)

The default setting is to predict the probability of the second level of the response variable. For example, the second level of the response variable for trainy here is:

levels(as.factor(trainy))

## [1] "0" "1" The first column of the above output is the predicted link function value when 휆 = 0.02833. The second column of the output is the predicted link function when 휆 = 0.0311. Similarly, you can change the setting for type to produce different outputs. You can use the cv.glmnet() function to tune parameters. The parameter setting is nearly the same as before, the only difference is the setting of type.measure. Since the response is categorical, not continuous, we have different performance measurements. The most common settings of type.measure for classification are:

• class: error rate • auc: it is the area under the ROC for the dichotomous problem So the model is to predict the probability of outcome “1”. Take a baby example of 3 observations and 2 values of 휆 to show the usage of predict() function:

newdat = as.matrix(trainx[1:3, ]) predict(fit, newdat, type = "link", s = c(2.833e-02, 3.110e-02))

## 1 2 ## 1 0.1943 0.1443 ## 2 -0.9913 -1.0077 ## 3 -0.5841 -0.5496 For example:

cvfit = cv.glmnet(as.matrix(trainx), trainy, family = "binomial", type.measure = "class") plot(cvfit) 210 10 Regularization Methods

The above uses the error rate as the performance criterion and 10-fold cross-validation. Similarly, you can get the $\lambda$ value for the minimum error rate and the error rate that is 1 standard error from the minimum:

cvfit$lambda.min

## [1] 2.643e-05

cvfit$lambda.1se

## [1] 0.003334

You can get the parameter estimates and make predictions in the same way as before.
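For example, here is a minimal sketch reusing the cvfit object above; the first three training rows are reused only as illustrative "new" samples:

# coefficient estimates at the lambda with the minimum cross-validated error
coef(cvfit, s = "lambda.min")
# predicted class labels for three "new" samples at lambda.1se
newdat = as.matrix(trainx[1:3, ])
predict(cvfit, newx = newdat, type = "class", s = "lambda.1se")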

10.5.2.3 Group lasso logistic regression

For models with categorical survey questions (explanatory variables), however, the original lasso algorithm only selects individual dummy variables instead of sets of dummy variables grouped by the question in the survey. Another disadvantage of applying lasso to grouped variables is that the estimates are affected by the way the dummy variables are encoded. Thus the group lasso (Yuan and Lin, 2007) method has been proposed to enable variable selection in linear regression models on groups of variables, instead of on single variables. For logistic regression models, the group lasso algorithm was first studied by Kim et al. (Y. Kim and Kim, 2006). They proposed a gradient descent algorithm to solve the corresponding constrained problem, which does, however, depend on unknown constants. Meier et al. (L Meier and Buhlmann, 2008) proposed a new algorithm that works directly on the penalized problem and whose convergence property does not depend on unknown constants. The algorithm is especially suitable for high-dimensional problems. It can also be applied to solve the corresponding convex optimization problem in generalized linear models. The group lasso estimator proposed by Meier et al. (L Meier and Buhlmann, 2008) for logistic regression has been shown to be statistically consistent, even with a large number of categorical predictors.

In this section, we illustrate how to use the logistic group lasso algorithm to construct risk scoring systems for predicting disease. Instead of maximizing the log-likelihood as in the maximum likelihood method, the logistic group lasso estimates are calculated by minimizing the convex function:

$$S_{\lambda}(\boldsymbol{\beta}) = -l(\boldsymbol{\beta}) + \lambda\sum_{g=1}^{G}s(df_g)\parallel\boldsymbol{\beta_g}\parallel_2$$

where $\lambda$ is a tuning parameter for the penalty and $s(\cdot)$ is a function to rescale the penalty. In lasso algorithms, the selection of $\lambda$ is usually determined by cross-validation using data. For $s(\cdot)$, we use the square root function $s(df_g) = df_g^{0.5}$ as suggested in Meier et al. (L Meier and Buhlmann, 2008). It ensures the penalty is of the order of the number of parameters $df_g$ as used in (Yuan and Lin, 2007). Here we consider selection of the tuning parameter $\lambda$ from a multiplicative grid of 100 values

$\{0.96\lambda_{max}, 0.96^2\lambda_{max}, 0.96^3\lambda_{max}, ..., 0.96^{100}\lambda_{max}\}$. Here $\lambda_{max}$ is defined as

$$\lambda_{max} = \underset{g\in\{1,...,G\}}{max}\left\{\frac{1}{s(df_g)}\parallel\mathbf{x_g}^{T}(\mathbf{y} - \bar{\mathbf{y}})\parallel_2\right\} \tag{10.8}$$

such that when $\lambda = \lambda_{max}$, only the intercept is in the model. When $\lambda$ goes to 0, the model is equivalent to ordinary logistic regression.

Three criteria may be used to select the optimal value of $\lambda$. One is AUC, which you should have seen many times in this book by now. The log-likelihood score used in Meier et al. (L Meier and Buhlmann, 2008) is taken as the average of the log-likelihood of the validation data over all cross-validation sets. Another one is the maximum correlation coefficient in Yeo and Burge (Yeo and Burge, 2004), which is defined as:

$$\rho_{max} = max\{\rho_{\tau}|\tau \in (0, 1)\},$$

where $\tau \in (0, 1)$ is a threshold to classify the predicted probability into binary disease status and $\rho_{\tau}$ is the Pearson correlation coefficient between the true binary disease status and the predicted disease status with threshold $\tau$.

You can use the following package to implement the model. Install the package using:

devtools::install_github("netlify/NetlifyDS")

Load the package:

library("NetlifyDS")

The package includes the swine disease breakout data and you can load the data by:

data("sim1_da1")

You can use the cv_glasso() function to tune the parameters:

# the last column of sim1_da1 is the response variable y
# trainx is the explanatory variable matrix
trainx = dplyr::select(sim1_da1, -y)
# save the response variable as trainy
trainy = sim1_da1$y
# get the group indicator
index <- gsub("\\..*", "", names(trainx))

Dummy variables from the same question are in the same group:

index[1:50]

## [1] "Q1" "Q1" "Q2" "Q2" "Q3" "Q3" "Q4" "Q4" ## [9] "Q5" "Q5" "Q6" "Q6" "Q7" "Q7" "Q8" "Q8" ## [17] "Q9" "Q9" "Q10" "Q10" "Q11" "Q11" "Q12" "Q12" ## [25] "Q13" "Q13" "Q14" "Q14" "Q15" "Q15" "Q16" "Q16" ## [33] "Q17" "Q17" "Q18" "Q18" "Q19" "Q19" "Q20" "Q20" ## [41] "Q21" "Q21" "Q22" "Q22" "Q23" "Q23" "Q24" "Q24" ## [49] "Q25" "Q25"

Set a series of tuning parameter values. nlam is the number of values we want to tune; it is the number of terms $m$ in $\{0.96\lambda_{max}, 0.96^2\lambda_{max}, 0.96^3\lambda_{max}, ..., 0.96^m\lambda_{max}\}$. The tuning process returns a long output and we will not report all of it:

# Tune over 100 values
nlam <- 100
# set the type of prediction
# - `link`: return the predicted link function
# - `response`: return the predicted probability
# number of cross-validation folds
kfold <- 10

cv_fit <- cv_glasso(trainx, trainy,
                    nlam = nlam, kfold = kfold, type = "link")
# only show part of the results
str(cv_fit)

Here we only show part of the output:

...
$ auc               : num [1:100] 0.573 0.567 0.535 ...
$ log_likelihood    : num [1:100] -554 -554 -553 ...
$ maxrho            : num [1:100] -0.0519 0.00666 ...
$ lambda.max.auc    : Named num [1:2] 0.922 0.94
 ..- attr(*, "names")= chr [1:2] "lambda" "auc"
$ lambda.1se.auc    : Named num [1:2] 16.74 0.81
 ..- attr(*, "names")= chr [1:2] "" "se.auc"
$ lambda.max.loglike: Named num [1:2] 1.77 -248.86
 ..- attr(*, "names")= chr [1:2] "lambda" "loglike"
$ lambda.1se.loglike: Named num [1:2] 9.45 -360.13
 ..- attr(*, "names")= chr [1:2] "lambda" "se.loglike"
$ lambda.max.maxco  : Named num [1:2] 0.922 0.708
 ..- attr(*, "names")= chr [1:2] "lambda" "maxco"
$ lambda.1se.maxco  : Named num [1:2] 14.216 0.504
 ..- attr(*, "names")= chr [1:2] "lambda" "se.maxco"

In the returned results:

• $ auc: the AUC values
• $ log_likelihood: log-likelihood
• $ maxrho: maximum correlation coefficient
• $ lambda.max.auc: the max AUC and the corresponding value of $\lambda$
• $ lambda.1se.auc: one standard error to the max AUC and the corresponding $\lambda$
• $ lambda.max.loglike: max log-likelihood and the corresponding $\lambda$
• $ lambda.1se.loglike: one standard error to the max log-likelihood and the corresponding $\lambda$

• $ lambda.max.maxco: maximum correlation coefficient and the corresponding $\lambda$
• $ lambda.1se.maxco: one standard error to the maximum correlation coefficient and the corresponding $\lambda$

The most common criterion is AUC. You can compare the selections from the different criteria. If they all point to the same value of the tuning parameter, you can have more confidence in the choice. If they suggest very different values, then you need to question whether the tuning process is stable. You can visualize the cross-validation result:

plot(cv_fit)

The x-axis is the value of the tuning parameter and the y-axis is AUC. The two dashed lines mark the value of $\lambda$ for the max AUC and the value at one standard deviation from the max AUC. Once you choose the value of the tuning parameter, you can use fitglasso() to fit the model. For example, we can fit the model using the parameter value that gives the max AUC, which is $\lambda = 0.922$:

fitgl <- fitglasso(trainx, trainy,
                   lambda = 0.922, na_action = na.pass)

Lambda: 0.922 nr.var: 229

You can use coef() to get the estimates of the coefficients:

coef(fitgl)

                 0.922
Intercept   -5.318e+01
Q1.A         1.757e+00
Q1.B         1.719e+00
Q2.A         2.170e+00
Q2.B         6.939e-01
Q3.A         2.102e+00
Q3.B         1.359e+00
...

Use predict_glasso() to predict new samples:

predict_glasso(fitgl, trainx)

11 Tree-Based Methods

Tree-based models such as random forest and gradient boosted trees are frequent winners in data challenges and competitions that use standard numerical and categorical datasets. These methods, in general, provide a good baseline for model performance. This chapter describes the fundamentals of tree-based models and provides a set of standard modeling procedures.

Load R packages:

# install packages from CRAN
p_needed <- c('rpart', 'caret', 'partykit', 'pROC', 'dplyr',
              'ipred', 'e1071', 'randomForest', 'gbm')
packages <- rownames(installed.packages())
p_to_install <- p_needed[!(p_needed %in% packages)]
if (length(p_to_install) > 0) {
  install.packages(p_to_install)
}
lapply(p_needed, require, character.only = TRUE)

11.1 Tree Basics

Tree-based models can be used for regression and classification. The goal is to stratify or segment the predictor space into a number of sub-regions. For a given observation, the prediction is the mean (regression) or the mode (classification) of the training observations in the sub-region. Tree-based methods are conceptually simple yet powerful. This type of model is often referred to as Classification And Regression Trees (CART). They are popular tools for many reasons:

1. Do not require the user to specify the form of the relationship between predictors and response
2. Do not require (or require only very limited) data pre-processing and can handle different types of predictors (sparse, skewed, continuous, categorical, etc.)
3. Robust to co-linearity
4. Can handle missing data
5. Many pre-built packages make implementation as easy as a button push

CART can refer to the tree model in general, but most of the time it represents the algorithm initially proposed by Breiman (Breiman et al., 1984). After Breiman, there are many new algorithms, such as ID3, C4.5, and C5.0. C5.0 is an improved version of C4.5, but since C5.0 is not open source, the C4.5 algorithm is more popular. C4.5 was a major competitor of CART. But now, all those seem outdated. The most popular tree models are Random Forest (RF) and Gradient Boosting Machine (GBM). Despite being out of favor in application, it is important to understand the mechanism of the basic tree algorithm, because the later models are built on the same foundation.

The original CART algorithm targets binary classification, and the later algorithms can handle multi-category classification. A single tree is easy to explain but has poor accuracy. More complicated tree models, such as RF and GBM, can provide much better prediction at the cost of explainability. As the model becomes more complicated, it is more like a black box, which makes it very difficult to explain the relationship among predictors. There is always a trade-off between explainability and predictability.

The reason why it is called a "tree" is of course because the structure

FIGURE 11.1: This is a classification tree trained from passenger survival data from the Titanic. Survival probability is predicted using sex and age.

has similarities. But the direction of a decision tree is the opposite of a real tree: the root is on the top, and the leaves are on the bottom (figure 11.1). From the root node, a decision tree divides into different branches and generates more nodes. The new nodes are child nodes, and the previous node is the parent node. At each child node, the algorithm decides whether to continue dividing. If it stops, the node is called a leaf node. If it continues, then the node becomes the new parent node and splits to produce the next layer of child nodes. At each non-leaf node, the algorithm needs to decide if it will split into branches. A leaf node contains the final "decision" on the sample's value. Here are the important definitions in the tree model:

• Classification tree: the outcome is discrete
• Regression tree: the outcome is continuous (e.g. the price of a house)
• Non-leaf node (or split node): the algorithm needs to decide a split at each non-leaf node (e.g. age >= 9.5)
• Root node: the beginning node where the tree starts

• Leaf node (or Terminal node): the node stops splitting. It has the final decision of the model
• Degree of a node: the number of subtrees of a node
• Degree of the tree: the maximum degree of a node in the tree
• Pruning: remove parts of the tree that do not provide power to classify instances
• Branch (or Subtree): the whole part under a non-leaf node
• Child: the node directly after and connected to another node
• Parent: the converse notion of a child

A single tree is easy to explain, but it can be very non-robust, which means a slight change in the data can significantly change the fitted tree. The predictive accuracy is not as good as other regression and classification approaches in this book, since a series of rectangular decision regions (figure 11.2) defined by a single tree is often too naive to represent the relationship between the dependent variable and the predictors. To overcome these shortcomings, researchers have proposed ensemble methods which combine many trees. Ensemble tree models typically have much better predictive performance than a single tree. We will introduce those models in later sections.

11.2 Splitting Criteria

The splitting criteria used by the regression tree and the classification tree are different. Like the regression tree, the goal of the classification tree is to divide the data into smaller, more homogeneous groups. Homogeneity means that most of the samples at each node are from one class. The original CART algorithm uses Gini impurity as the splitting criterion; the later ID3, C4.5, and C5.0 use entropy. We will look at the three most common splitting criteria.

Gini impurity

Gini impurity (Breiman et al., 1984) is a measure of non-homogeneity. It is widely used in classification trees. It is defined

FIGURE 11.2: Example of decision regions for different types of flowers based on petal length and width. Three different types of flowers are classified using decisions about length and width.

as:

$$Gini\ Index = \Sigma_i p_i(1-p_i)$$

where $p_i$ is the probability of class $i$ and the interval of Gini is [0, 0.5]. For a two-class problem, the Gini impurity for a given node is:

$$p_1(1-p_1) + p_2(1-p_2)$$

It is easy to see that when the sample set is pure, one of the probabilities is 0 and the Gini score is the smallest. Conversely, when $p_1 = p_2 = 0.5$, the Gini score is the largest, in which case the purity of the node is the smallest. Let's look at an example. Suppose we want to determine which students are computer science (CS) majors. Here is the simple hypothetical classification tree result obtained with the gender variable.

Let’s calculate the Gini impurity for splitting node “Gender”:

1. Gini impurity for "Female" = $\frac{1}{6} \times \frac{5}{6} + \frac{5}{6} \times \frac{1}{6} = \frac{5}{18}$
2. Gini impurity for "Male" = $0 \times 1 + 1 \times 0 = 0$

The Gini impurity for the node “Gender” is the following weighted average of the above two scores:

$$\frac{3}{5} \times \frac{5}{18} + \frac{2}{5} \times 0 = \frac{1}{6}$$

The Gini impurity for the 50 samples in the parent node is $\frac{1}{2}$. It is easy to calculate that the Gini impurity drops from $\frac{1}{2}$ to $\frac{1}{6}$ after splitting, so the split using "gender" causes a Gini impurity decrease of $\frac{1}{3}$. The algorithm will use different variables to split the data and choose the one that causes the most substantial Gini impurity decrease.

Information Gain (IG)

Looking at the samples in the following three nodes, which one is the easiest to describe? It is obviously C, because all the samples in C are of the same type, so the description requires the least amount of information. On the contrary, B needs more information, and A needs the most information. In other words, C has the highest purity, B is the second, and A has the lowest purity. We need less information to describe nodes with higher purity.

A measure of the degree of disorder is entropy which is defined as:

$$Entropy = -\Sigma_i p_i log_2(p_i)$$

where $p_i$ is the probability of class $i$ and the interval of entropy is [0, 1]. For a two-class problem:

$$Entropy = -p\ log_2 p - (1-p)log_2(1-p)$$

where p is the percentage of one type of samples. If all the samples in one node are of one type (such as C), the entropy is 0. If the proportion of each type in a node is 50% - 50%, the entropy is 1. We can use entropy as splitting criteria. The goal is to decrease entropy as the tree grows. As an analogy, entropy in physics quantifies the level of disorder, and the goal here is to have the least disorder.

Similarly, the entropy of a splitting node is the weighted average of the entropy of each child. In the above tree for the students, the entropy of the root node with all 50 students is $-\frac{25}{50}log_2\frac{25}{50} - \frac{25}{50}log_2\frac{25}{50} = 1$. Here an entropy of 1 indicates that the purity of the node is the lowest, that is, each type takes up half of the samples. The entropy of the split using variable "gender" can be calculated in three steps:

1. Entropy for "Female" = $-\frac{5}{30}log_2\frac{5}{30} - \frac{25}{30}log_2\frac{25}{30} = 0.65$
2. Entropy for "Male" = $0 \times 1 + 1 \times 0 = 0$

3. Entropy for the node "Gender" is the weighted average of the above two entropy numbers: $\frac{3}{5} \times 0.65 + \frac{2}{5} \times 0 = 0.39$

So entropy decreases from 1 to 0.39 after the split and the IG for "Gender" is 0.61.

Information Gain Ratio (IGR)

ID3 uses information gain as the splitting criterion to train the classification tree. A drawback of information gain is that it is biased towards choosing attributes with many values, resulting in overfitting (selecting a feature that is non-optimal for prediction) (HSSINA et al., 2014).

To understand why, let's look at another hypothetical scenario. Assume that the training set has students' birth month as a feature. You might say that the birth month should not be considered in this case because it intuitively doesn't help tell the student's major. Yes, you're right. However, practically, we may have a much more complicated dataset, and we may not have such intuition for all the features. So, we may not always be able to determine whether a feature makes sense or not. If we use the birth month to split the data, the corresponding entropy of the node "Birth Month" is 0.24 (the sum of the column "Weighted Entropy" in the table), and the information gain is 0.76, which is larger than the IG of "Gender" (0.61). So between the two features, IG would choose "Birth Month" to split the data.

To overcome this problem, C4.5 uses the “information gain ratio” instead of “information gain.” The gain ratio is defined as:

$$Gain\ Ratio = \frac{Information\ Gain}{Split\ Information}$$

where split information is:

$$Split\ Information = -\Sigma_{c=1}^{C}p_c log(p_c)$$

푝푐 is the proportion of samples in category 푐. For example, there are three students with the birth month in Jan, 6% of the total 50 students. So the 푝푐 for “Birth Month = Jan” is 0.06. The split information measures the intrinsic information that is independent of the sample distribution inside different categories. The gain ratio corrects the IG by taking the intrinsic information of a split into account.

The split information for the birth month is 3.4, and the gain ratio is 0.22, which is smaller than that of gender (0.63). So the gain ratio prefers gender as the splitting feature rather than the birth month. The gain ratio favors attributes with fewer categories and leads to better generalization (less overfitting).
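The small sketch below reproduces the gender-split numbers in this section, assuming a hypothetical class of 50 students, 30 female (5 of whom are CS majors) and 20 male (all CS majors), which is consistent with the counts used in the calculations above:

# hypothetical class of 50 students: 30 female (5 CS majors), 20 male (all CS majors)
gini    <- function(p) sum(p * (1 - p))
entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

p_female <- c(5, 25) / 30
p_male   <- c(20, 0) / 20
p_parent <- c(25, 25) / 50
w        <- c(30, 20) / 50                              # node weights for the "Gender" split

gini(p_parent)                                          # 1/2 in the parent node
sum(w * c(gini(p_female), gini(p_male)))                # 1/6 after the split
entropy(p_parent)                                       # 1 in the parent node
ig <- entropy(p_parent) -
  sum(w * c(entropy(p_female), entropy(p_male)))        # information gain, about 0.61
ig / entropy(w)                                         # gain ratio, about 0.63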

Sum of Squared Error (SSE)

The previous two metrics are for classification trees. SSE is the most widely used splitting metric for regression. Suppose you want to divide the data set $S$ into two groups $S_1$ and $S_2$, where the selection of $S_1$ and $S_2$ needs to minimize the sum of squared errors:

$$SSE = \Sigma_{i \in S_1}(y_i - \bar{y}_1)^2 + \Sigma_{i \in S_2}(y_i - \bar{y}_2)^2 \tag{11.1}$$

In equation (11.1), $\bar{y}_1$ and $\bar{y}_2$ are the averages of the samples in $S_1$ and $S_2$. The regression tree grows by automatically deciding on the splitting variables and split points that maximize the SSE reduction. Since this process is essentially a recursive segmentation, the approach is also called recursive partitioning. Take a look at this simple regression tree for the height of 10 students:

You can calculate the SSE using the following code:
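A minimal sketch, using the same heights that appear in the regression-tree example of section 11.4.1 (seven female and three male students):

# heights of the 10 students, split by gender (values match section 11.4.1)
y_female <- c(156, 167, 165, 163, 160, 170, 160)
y_male   <- c(172, 180, 176)
sse <- function(y) sum((y - mean(y))^2)
sse(y_female)                    # 136
sse(y_male)                      # 32
sse(y_female) + sse(y_male)      # 168, SSE for the split on "Gender"
sse(c(y_female, y_male))         # 522.9, SSE in the root node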

1. SSE for "Female" is 136
2. SSE for "Male" is 32
3. SSE for the splitting node "Gender" is the sum of the above two numbers, which is 168

The SSE for the 10 students in the root node is 522.9, so after the split, the SSE decreases from 522.9 to 168.

Another possible way of splitting is to divide the data by major, as follows:

In this situation:

1. SSE for "Math" is 184
2. SSE for "English" is 302.8
3. SSE for the splitting node "Major" is the sum of the above two numbers, which is 486.8

Splitting the data using variable "gender" reduced SSE from 522.9 to 168; using variable "major" reduced SSE from 522.9 to 486.8. Based on SSE reduction, you should use gender to split the data. The three splitting criteria mentioned above are the basis for building a tree model.

11.3 Tree Pruning

Pruning is the process that reduces the size of decision trees. It reduces the risk of overfitting by limiting the size of the tree or removing sections of the tree that provide little power.

Limit the size

You can limit the tree size by setting some parameters.

• Minimum sample size at each node: Defining the minimum sample size at the node helps to prevent leaf nodes having only one sample. The sample size can be a tuning parameter. If it is too large, the model tends to under-fit. If it is too small, the model tends to over-fit. In the case of severe class imbalance, the minimum sample size may need to be smaller because the number of samples in a particular class is small.
• Maximum depth of the tree: If the tree grows too deep, the model tends to over-fit. It can be a tuning parameter.
• Maximum number of terminal nodes: A limit on the terminal nodes works the same as the limit on the depth of the tree. They are proportional.
• The number of variables considered for each split: the algorithm randomly selects variables used in finding the optimal split point at each level. In general, the square root of the number of all variables works best, which is also the default setting for many functions. However, people often treat it as a tuning parameter.

Remove branches

Another way is to first let the tree grow as much as possible and then go back to remove insignificant branches. The process reduces the depth of the tree. The idea is to overfit the training set and then correct using cross-validation. There are different implementations.

• cost/complexity penalty

The idea is that the pruning minimizes the penalized error 푆푆퐸휆 with a certain value of tuning parameter 휆.

푆푆퐸휆 = 푆푆퐸 + 휆 × (푐표푚푝푙푒푥푖푡푦)

Here complexity is a function of the number of leaves. For every given $\lambda$, we want to find the tree that minimizes this penalized error. Breiman presents the algorithm to solve the optimization (Breiman et al., 1984).

To find the optimal pruning tree, you need to iterate through a series of values of $\lambda$ and calculate the corresponding SSE. For the same $\lambda$, the SSE changes over different samples. Breiman et al. suggested using cross-validation (Breiman et al., 1984) to study the variation of SSE under each $\lambda$ value. They also proposed a standard deviation criterion to give the simplest tree: within one standard deviation, find the simplest tree that minimizes the absolute error. Another method is to choose the tree size that minimizes the numerical error (Hastie T, 2008).

• Error-based pruning

This method was first proposed by Quinlan (Quinlan, 1999). The idea behind it is intuitive. All split nodes of the tree are included in the initial candidate pool. Pruning a split node means removing the entire subtree under the node and setting the node as a terminal node. The data is divided into 3 subsets for:

(1) training a complete tree; (2) pruning; (3) testing the final model.

You train a complete tree using subset (1) and apply the tree on subset (2) to calculate the accuracy. Then prune the tree based on a node and apply that on subset (2) to calculate another accuracy. If the accuracy after pruning is higher than or equal to that from the complete tree, then we set the node as a terminal node. Otherwise, keep the subtree under the node. The advantage of this method is that it is easy to compute. However, when the size of subset (2) is much smaller than that of subset (1), there is a risk of over-pruning. Some researchers found that this method results in more accurate trees than pruning processes based on tree size (F. Espoito and Semeraro, 1997).

• Error-complexity pruning

This method searches for a trade-off between error and complexity. Assume we have a splitting node $t$ and the corresponding subtree $T_t$. The error cost of the node is defined as:

$$R(t) = r(t) \times p(t) = \frac{misclassified\ sample\ size\ of\ the\ node}{total\ sample\ size}$$

where $r(t)$ is the error rate associated with the node as if it were a terminal node:

$$r(t) = \frac{misclassified\ sample\ size\ of\ the\ node}{sample\ size\ of\ the\ node}$$

and $p(t)$ is the ratio of the sample size of the node to the total sample size:

$$p(t) = \frac{sample\ size\ of\ the\ node}{total\ sample\ size}$$

The multiplication $r(t) \times p(t)$ cancels out the sample size of the node. If we keep node $t$, the error cost of the subtree $T_t$ is:

$$R(T_t) = \Sigma_{i=1}^{no.\ of\ leaves\ in\ subtree\ T_t} R(i)$$

The error-complexity measure of the node is:

$$a(t) = \frac{R(t) - R(T_t)}{no.\ of\ leaves - 1}$$
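As a tiny numeric illustration with made-up counts (not from the book's data): suppose node t misclassifies 40 of 1000 total samples when treated as a terminal node, while its subtree has 4 leaves that together misclassify 10 of the 1000 samples.

R_t  <- 40 / 1000                 # error cost of node t as a terminal node
R_Tt <- 10 / 1000                 # error cost of the subtree under t
a_t  <- (R_t - R_Tt) / (4 - 1)    # error-complexity measure: 0.01 per extra leaf
a_t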

Based on the metrics above, the pruning process is (Patel and Upadhyay, 2012):

1. Calculate $a$ for every node $t$
2. Prune the node with the lowest value
3. Repeat 1 and 2. It produces a pruned tree each time and they form a forest.
4. Select the tree with the best overall accuracy

• Minimum error pruning

Niblett and Bratko came up with this method in 1991 (Cestnik and Bratko, 1991). The process is bottom-up and seeks a single tree that minimizes the expected error rate on new samples. If we prune a splitting point $t$, all the samples under $t$ will be classified as from one category, say category $c$. If we prune the subtree, the expected error rate is:

$$E(t) = \frac{n_t - n_{t,c} + k - 1}{n_t + k}$$

where:

$k$ = number of categories

$n_t$ = sample size under node $t$

$n_{t,c}$ = number of samples under $t$ that belong to category $c$

Based on the above definition, the pruning process is (F. Espoito and Semeraro, 1997):

• Calculate the expected error rate for each non-leaf node if that subtree is pruned
• Calculate the expected error rate for each non-leaf node if that subtree is not pruned
• If pruning the node leads to a higher expected error rate, then keep the subtree; otherwise, prune it.

11.4 Regression and Decision Tree Basic

11.4.1 Regression Tree

Let's look at the process of building a regression tree (Gareth James and Tibshirani, 2015). There are two steps:

1. Divide the predictor space, that is, the set of possible values of $X_1, X_2, \ldots, X_p$, into $J$ distinct and non-overlapping regions: $R_1, R_2, \ldots, R_J$
2. For every observation that falls into the region $R_j$, the prediction is the mean of the response values for the training observations in $R_j$

Let’s go back to the previous baby example. If we use the variable “Gender” to divide the observations, we obtain two regions 푅1 (female) and 푅2 (male).

y1 <- c(156, 167, 165, 163, 160, 170, 160)
y2 <- c(172, 180, 176)

The sample average for region $R_1$ is 163 and for region $R_2$ is 176. For a new observation, if it is female, the model predicts the height to be 163; if it is male, the predicted height is 176. Calculating the mean is easy. Let's look at the first step in more detail, which is to divide the space into $R_1, R_2, \ldots, R_J$.

In theory, the regions can be any shape. However, to simplify the problem, we divide the predictor space into high-dimensional rectangles. The goal is to divide the space in a way that minimizes the RSS. Practically, it is nearly impossible to consider all possible partitions of the feature space. So we use an approach named recursive binary splitting, a top-down, greedy algorithm. The process starts from the top of the tree (root node) and then successively splits the predictor space. Each split produces two branches (hence binary). At each step of the process, it chooses the best split at that particular step, rather than looking ahead and picking a split that leads to a better tree in general (hence greedy).

$$R_1(j, s) = \{X|X_j < s\} \ and \ R_2(j, s) = \{X|X_j \geq s\}$$

Calculate the RSS decrease after the split. For different (푗, 푠), search for the combination that minimizes the RSS, that is to minimize the following:

$$\Sigma_{i:\ x_i \in R_1(j,s)}(y_i - \hat{y}_{R_1})^2 + \Sigma_{i:\ x_i \in R_2(j,s)}(y_i - \hat{y}_{R_2})^2$$

where $\hat{y}_{R_1}$ is the mean of all samples in $R_1$ and $\hat{y}_{R_2}$ is the mean of the samples in $R_2$. The equation above can be optimized quickly, especially when $p$ is not too large. Next, we continue to search for the split that optimizes the RSS. Note that the optimization is limited to the sub-region. The process keeps going until a stopping criterion is reached. For example, continue until no region contains more than 5 samples or the RSS decreases by less than 1%. The process is like a tree growing.
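To make the greedy search concrete, here is a small self-contained sketch (with simulated x and y, not the book's data) that scans candidate split points for a single numeric predictor and returns the one that minimizes the RSS:

# greedy search for the best split point of one numeric predictor
best_split <- function(x, y) {
  cuts <- sort(unique(x))
  cuts <- (head(cuts, -1) + tail(cuts, -1)) / 2   # midpoints as candidate split points
  rss  <- sapply(cuts, function(s) {
    left  <- y[x < s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  cuts[which.min(rss)]
}
set.seed(1)
x <- runif(50)
y <- ifelse(x < 0.4, 1, 3) + rnorm(50, sd = 0.2)
best_split(x, y)   # should be close to 0.4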

There are multiple R packages for building regression trees, such as ctree, rpart and tree. rpart is widely used for building a single tree. The split is based on the CART algorithm, using the rpart() function from the package. There are some parameters that control the model fitting, such as the minimum number of observations that must exist in a node in order for a split to be attempted, the minimum number of observations in any leaf node, etc. You can set those parameters using rpart.control, as shown in the sketch below.
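A minimal sketch with hypothetical control values, assuming the trainx and trainy objects created in the next code chunk:

library(rpart)
ctrl <- rpart.control(minsplit = 20,   # minimum observations needed to attempt a split
                      minbucket = 7,   # minimum observations in any leaf node
                      maxdepth = 5)    # maximum depth of any node of the final tree
rpartFit <- rpart(trainy ~ ., data = trainx, control = ctrl)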

A more convenient way is to use the train() function in the caret package. The package can call the rpart() function and train the model through cross-validation. In this case, the most common parameters are cp (complexity parameter) and maxdepth (the maximum depth of any node of the final tree). To tune the complexity parameter, set method = "rpart". To tune the maximum tree depth, set method = "rpart2":

dat <- read.csv("http://bit.ly/2P5gTw4")
# data cleaning: delete wrong observations
dat <- subset(dat, store_exp > 0 & online_exp > 0)
# use the 10 survey questions as predictors
trainx <- dat[, grep("Q", names(dat))]
# use the sum of store and online expenditure as response variable
# total expenditure = store expenditure + online expenditure
trainy <- dat$store_exp + dat$online_exp
set.seed(100)
rpartTune <- train(trainx, trainy,
                   method = "rpart2",
                   tuneLength = 10,
                   trControl = trainControl(method = "cv"))
plot(rpartTune)


RMSE doesn’t change much when the maximum is larger than 2. So we set the maximum depth to be 2 and refit the model: rpartTree <- rpart(trainy ~ ., data = trainx, maxdepth = 2)

You can check the result using print():

print(rpartTree)

## n= 999
##
## node), split, n, deviance, yval
##       * denotes terminal node
##
## 1) root 999 1.581e+10  3479.0
##   2) Q3< 3.5 799 2.374e+09  1819.0
##     4) Q5< 1.5 250 3.534e+06   705.2 *
##     5) Q5>=1.5 549 1.919e+09  2326.0 *
##   3) Q3>=3.5 200 2.436e+09 10110.0 *

You can see that the final model picks Q3 and Q5 to predict total expenditure. To visualize the tree, you can convert the rpart object to a party object using partykit and then use the plot() function:

rpartTree2 <- as.party(rpartTree)
plot(rpartTree2)


11.4.2 Decision Tree

Similar to a regression tree, the goal of a classification tree is to stratify the predictor space into a number of sub-regions that are more homogeneous. The difference is that a classification tree is used to predict a categorical response rather than a continuous one. For a classification tree, the prediction is the most commonly occurring class of training observations in the region to which an observation belongs. The splitting criteria for a classification tree are different. The most common criteria are entropy and Gini impurity. CART uses Gini impurity and C4.5 uses entropy.

When the predictor is continuous, the splitting process is straightforward. When the predictor is categorical, the process can take different approaches:

1. Keep the variable as categorical and group some categories on either side of the split. In this way, the model can make more dynamic splits but must treat the categorical predictor as an ordered set of bits.
2. Encode the categorical variable as a set of dummy variables. In this way, the information in the categorical variable is decomposed into independent bits of information. The model considers these dummy variables separately and evaluates each of them for one split point (because there are only two possible values: 0/1).

When fitting tree models, people need to choose how to treat categorical predictors. If you know some of the categories have higher predictability, then the first approach may be better. In the rest of this section, we will build tree models using the above two approaches and compare them.

Let's build a classification model to identify the gender of the customer. The population is relatively balanced (55.4% male and 44.6% female).

dat <- read.csv("http://bit.ly/2P5gTw4")
# use the 10 survey questions as predictors
trainx1 <- dat[, grep("Q", names(dat))]
# add a categorical predictor
# use two ways to treat categorical predictor
# trainx1: use approach 1, without encoding
trainx1$segment <- dat$segment

# trainx2: use approach 2, encode it to a set of dummy variables
dumMod <- dummyVars(~ .,
                    data = trainx1,
                    # Combine the previous variable and the level name
                    # as the new dummy variable name
                    levelsOnly = F)
trainx2 <- predict(dumMod, trainx1)
# the response variable is gender
trainy <- dat$gender

Here we use the train() function in the caret package to call rpart to build the model. We can compare the model results from the two approaches:
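A sketch of the first fit, keeping segment as a factor and mirroring the settings of the dummy-encoded call shown later in this section (the object name rpartTune1 is the one used in the ROC comparison below; the exact settings here are assumptions):

rpartTune1 <- caret::train(
  trainx1, trainy,
  method = "rpart",
  tuneLength = 30,
  metric = "ROC",
  trControl = trainControl(method = "cv",
                           summaryFunction = twoClassSummary,
                           classProbs = TRUE,
                           savePredictions = TRUE))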

CART

1000 samples
  11 predictor
   2 classes: 'Female', 'Male'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 901, 899, 900, 900, 901, 900, ...
Resampling results across tuning parameters:

  cp       ROC     Sens    Spec
  0.00000  0.6937  0.6517  0.6884
  0.00835  0.7026  0.6119  0.7355
  0.01670  0.6852  0.5324  0.8205
  0.02505  0.6803  0.5107  0.8498
  0.03340  0.6803  0.5107  0.8498

......

  0.23380  0.6341  0.5936  0.6745
  0.24215  0.5556  0.7873  0.3240

ROC was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.00835.

The above keeps the variable as categorical without encoding. Here cp is the complexity parameter. It is used to decide when to stop growing the tree. cp = 0.01 means the algorithm only keeps the split that improves the corresponding metric by more than 0.01. Next, let's encode the categorical variable as a set of dummy variables and fit the model again:

rpartTune2 <- caret::train(
  trainx2, trainy,
  method = "rpart",
  tuneLength = 30,
  metric = "ROC",
  trControl = trainControl(method = "cv",
                           summaryFunction = twoClassSummary,
                           classProbs = TRUE,
                           savePredictions = TRUE))

Compare the results of the two approaches.

rpartRoc <- pROC::roc(response = rpartTune1$pred$obs,
                      predictor = rpartTune1$pred$Female,
                      levels = rev(levels(rpartTune1$pred$obs)))
rpartFactorRoc <- pROC::roc(response = rpartTune2$pred$obs,
                            predictor = rpartTune2$pred$Female,
                            levels = rev(levels(rpartTune1$pred$obs)))
plot.roc(rpartRoc,
         type = "s",
         print.thres = c(.5),
         print.thres.pch = 3,
         print.thres.pattern = "",
         print.thres.cex = 1.2,
         col = "red",
         legacy.axes = TRUE,
         print.thres.col = "red")
plot.roc(rpartFactorRoc,
         type = "s",
         add = TRUE,
         print.thres = c(.5),
         print.thres.pch = 16,
         legacy.axes = TRUE,
         print.thres.pattern = "",
         print.thres.cex = 1.2)
legend(.75, .2,
       c("Grouped Categories", "Independent Categories"),
       lwd = c(1, 1),
       col = c("black", "red"),
       pch = c(16, 3))

In this case, the two approaches lead to similar model performance. A single tree is straightforward and easy to interpret, but it has problems:

1. Low accuracy
2. Unstable: a little change in the training data leads to very different trees.

One way to overcome those problems is to use an ensemble of trees. In the rest of this chapter, we will introduce three ensemble methods (which combine many models' predictions): bagging tree, random forest, and gradient boosted machine. Those ensemble approaches have significantly higher accuracy and stability. However, this comes at the cost of interpretability.

11.5 Bagging Tree

As mentioned before, a single tree is unstable. If you randomly separate the sample into two parts and fit a tree model on each, you can get two very different trees. A stable model should give a similar result on different random samples. Some traditional statistical models have high stability, such as linear regression. Ensemble methods appeared in the 1990s and can effectively stabilize the model. Bootstrapping is a process where you repeatedly draw samples of the same size from a single original sample with replacement (B and R, 1986). Bootstrap aggregation (bagging) is an ensemble technique proposed by Leo Breiman (Breiman 1996a). It uses bootstrapping in conjunction with any model to construct an ensemble. The process is very straightforward:

Algorithm 3 Bagging tree
1: Build a model on different bootstrap samples to form an ensemble, say $B$ samples
2: For a new sample, each model will give a prediction: $\hat{f}_1(x), \hat{f}_2(x), \ldots, \hat{f}_B(x)$
3: The bagged model's prediction is the average of all the predictions: $\hat{f}_{avg}(x) = \frac{1}{B}\Sigma_{b=1}^{B}\hat{f}_b(x)$

Assume that there are $n$ independent random variables $Z_1, \ldots, Z_n$ with variance $\sigma^2$. Then the variance of the mean $\bar{Z}$ is $\frac{\sigma^2}{n}$. It is easy to see why bagged models have less variance. Since bootstrapping samples with replacement, some samples are selected multiple times and some not at all. Those left-out samples are called out-of-bag. You can use the out-of-bag samples to assess the model performance. For regression, the prediction is a simple

average. For classification, the prediction is the category with the most "votes". Here, the number of trees, $B$, is a parameter you need to decide, i.e. a tuning parameter. Bagging is a general approach that can be applied to different learners. Here we only discuss it in the context of decision trees.

The advantages of bagging tree are:

• Bagging stabilizes the model predictions by averaging the results. If we have 10 bootstrap samples and fit a single tree on each of those, we may get 10 trees with very different structures, leading to different predictions for a new sample. But if we use the average of the 10 predictions as the final prediction, then the result is much more stable. It means if we have another 10 samples and do it all over again, we will get a very similar averaged prediction.
• Bagging provides more accurate predictions. If the goal is to predict rather than interpret, then the ensemble approach definitely has an advantage, especially for unstable models. However, for stable models (such as regression, MARS), bagging may bring only marginal improvement in model performance.
• Bagging can use out-of-bag samples to evaluate model performance. For each model in the ensemble, we can calculate the value of the model performance metric (you can decide what metric to use). You can use the average of all the out-of-bag performance values to gauge the predictive performance of the entire ensemble. This correlates well with either cross-validation estimates or test set estimates. On average, each tree uses about 2/3 of the samples, and the rest 1/3 is used as out-of-bag. When the number of bootstrap samples is large enough, the out-of-bag performance estimate approximates that from leave-one-out cross-validation.

You need to choose the number of bootstrap samples. The author of "Applied Predictive Modeling" (Kuhn and Johnston, 2013) points out that often people see an exponential decrease in predictive improvement as the number of iterations increases. Most of the predictive power is from a small portion of the trees. Based

on their experience, model performance can have small improvements up to 50 bagging iterations. If that is still not satisfying, they suggest trying other, more powerful ensemble methods such as random forests and boosting, which are described in the following sections. The disadvantages of bagging trees are:

• As the number of bootstrap samples increases, the computation and memory requirements increase as well. You can mitigate this disadvantage with parallel computing. Since each bootstrap sample and its model are independent of the other samples and models, you can easily parallelize the bagging process by building the models separately and combining the results in the end to generate the prediction.

• The bagged model is difficult to explain, which is common for all ensemble approaches. However, you can still get variable importance by combining measures of importance across the ensemble. For example, we can calculate the RSS decrease for each variable across all trees and use the average as the measure of importance.

• Since the bagging tree uses all of the original predictors at every split of every tree, the trees are correlated with each other. The tree correlation prevents bagging from optimally reducing the variance of the predicted values. See (Hastie T, 2008) for a mathematical illustration of the tree correlation phenomenon.

Let's look at how to use R to build a bagging tree. Get the predictors and the response variable first:

dat <- read.csv("http://bit.ly/2P5gTw4")
# use the 10 survey questions as predictors
trainx <- dat[, grep("Q", names(dat))]
# add segment as a predictor
# don't need to encode it to dummy variables
trainx$segment <- as.factor(dat$segment)

# use gender as the response variable
trainy <- as.factor(dat$gender)

Then fit the model using the train() function in the caret package. Here we simply set the number of trees to be 1000; you can tune that parameter.

set.seed(100)
bagTune <- caret::train(trainx, trainy,
    method = "treebag",
    nbagg = 1000,
    metric = "ROC",
    trControl = trainControl(method = "cv",
        summaryFunction = twoClassSummary,
        classProbs = TRUE,
        savePredictions = TRUE))

The model results are:

bagTune

## Bagged CART
##
## 1000 samples
## 11 predictor
## 2 classes: 'Female', 'Male'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 901, 899, 900, 900, 901, 900, ...
## Resampling results:
##
## ROC     Sens    Spec
## 0.7093  0.6533  0.6774

Since we only have a handful of variables in this example, the maximum AUC doesn't improve by using the bagging tree. But it makes a difference when we have more predictors.

11.6 Random Forest

Since the tree correlation prevents bagging from optimally reducing the variance of the predicted values, a natural way to improve model performance is to reduce the correlation among trees. That is what random forest aims to do: improve the performance of bagging by de-correlating the trees.

From a statistical perspective, you can de-correlate trees by introducing randomness when you build each tree. One approach (T, 1998; Y and D, 1997) is to randomly choose $m$ variables to use each time you build a tree. Dietterich (T, 2000) came up with the idea of random split selection, which is to randomly choose $m$ variables to use at each splitting node. Based on these different generalizations of the original bagging algorithm, Breiman (Breiman, 2001a) came up with a unified algorithm called random forest.

When building a tree, the algorithm randomly chooses $m$ variables at each splitting node and then chooses the best one of the $m$ to split on at that node. In general, people use $m = \sqrt{p}$. For example, if we use 10 questions from the questionnaire as predictors, then at each node the algorithm will randomly choose 4 candidate variables. Since the trees in the forest don't always use the same variables, the tree correlation is less than that in bagging. Random forest tends to work better when there are more predictors. Since we only have 10 predictors here, the improvement from the random forest is marginal. The number of randomly selected predictors is a tuning parameter in the random forest. Since the random forest is computationally intensive, we suggest starting with 5 values around $m = \sqrt{p}$. Another tuning parameter is the number of trees in the forest. You can start with 1000 trees and then increase the number until performance levels off. The basic random forest is shown in Algorithm 4.

Algorithm 4 Random forest

1: Select the number of trees, B
2: for i = 1 to B do
3:   Generate a bootstrap sample of the original data
4:   Train a tree on this sample:
5:   for each split do
6:     Randomly select m (< p) predictors
7:     Choose the best one of the m and partition the data
8:   end for
9:   Use typical tree model stopping criteria to determine when a tree is complete, without pruning
10: end for

When $m = p$, the random forest is equivalent to the bagging tree. When the predictors are highly correlated, a smaller $m$ tends to work better. Let's use the caret package to train a random forest:

# tune across a list of numbers of predictors
mtryValues <- c(1:5)
set.seed(100)
rfTune <- train(x = trainx, y = trainy,
    # set the model to be random forest
    method = "rf",
    ntree = 1000,
    tuneGrid = data.frame(.mtry = mtryValues),
    importance = TRUE,
    metric = "ROC",
    trControl = trainControl(method = "cv",
        summaryFunction = twoClassSummary,
        classProbs = TRUE,
        savePredictions = TRUE))

rfTune

## Random Forest
##
## 1000 samples
## 11 predictor
## 2 classes: 'Female', 'Male'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 899, 900, 900, 899, 899, 901, ...
## Resampling results across tuning parameters:
##
## mtry  ROC     Sens    Spec
## 1     0.7169  0.5341  0.8205
## 2     0.7137  0.6334  0.7175
## 3     0.7150  0.6478  0.6995
## 4     0.7114  0.6550  0.6950
## 5     0.7092  0.6514  0.6882
##
## ROC was used to select the optimal model using
## the largest value.
## The final value used for the model was mtry = 1.

In this example, since the number of predictors is small, the result indicates that the optimal number of candidate variables at each node is 1. The optimal AUC is not much higher than that from the bagging tree.

If you have already selected the values of the tuning parameters, you can also use the randomForest package to fit a random forest directly:

library(randomForest)
rfit <- randomForest(trainy ~ ., trainx, mtry = 1, ntree = 1000)

Since the bagging tree is a special case of the random forest, you can fit a bagging tree by setting mtry = p. The importance() function returns the importance of each predictor:

importance(rfit)

##         MeanDecreaseGini
## Q1                 9.056
## Q2                 7.582
## Q3                 7.611
## Q4                12.308
## Q5                 5.628
## Q6                 9.740
## Q7                 6.638
## Q8                 7.829
## Q9                 5.955
## Q10                4.781
## segment           11.185

You can use the varImpPlot() function to visualize the predictor importance:

varImpPlot(rfit)

(Variable importance plot for rfit: predictors ranked by MeanDecreaseGini, with Q4 and segment at the top.)

It is easy to see from the plot that segment and Q4 are the top two variables to classify gender.

11.7 Gradient Boosted Machine

Boosting models were developed in the 1980s (L, 1984; M and L, 1989) and were originally for classification problems. Due to their excellent model performance, they have been widely used for a variety of applications, such as gene expression (Dudoit S and T, 2002; et. al, 2000), chemical substructure classification (Varmuza K and K, 2003), and music classification (et. al, 2006). The first effective implementation of boosting was the Adaptive Boosting (AdaBoost) algorithm proposed by Yoav Freund and Robert Schapire in 1996 (YFR, 1999). After that, some researchers (Friedman et al., 2000) started to connect the boosting algorithm with statistical concepts such as loss functions, additive models, and logistic regression. Friedman pointed out that boosting can be considered as a forward stagewise additive model that minimizes exponential loss. This new view of boosting in a statistical framework enabled the method to be extended to regression problems.

The idea is to combine a group of weak learners (classifiers that are marginally better than a random guess) to produce a strong learner. Like bagging, boosting is a general approach that can be applied to different learners. Here we focus on decision trees. Recall that both bagging and random forest create multiple copies of the original training data using the bootstrap, fit a separate decision tree to each copy, and combine all the results to create a single prediction. Boosting also creates different trees, but the trees are grown sequentially and each tree is a weak learner. Any modeling technique with tuning parameters can produce a range of learners, from weak to strong. You can easily make a weak learner by restricting the depth of the tree. There are different types of boosting. Here we introduce two main types: adaptive boosting and stochastic gradient boosting.

11.7.1 Adaptive Boosting

Yoav Freund and Robert Schapire (Freund and Schapire, 1997) proposed the AdaBoost.M1 algorithm. Consider a binary classification problem where the response variable has two categories $Y \in \{-1, 1\}$. Given a predictor matrix $X$, construct a classifier $G(X)$ that predicts 1 or $-1$. The corresponding error rate in the training set is:

$$\overline{err} = \frac{1}{N}\sum_{i=1}^{N} I(y_i \neq G(x_i))$$

The algorithm produces a series of classifiers $G_m(x)$, $m = 1, 2, \ldots, M$, from different iterations. In each iteration, it finds the best classifier based on the current weights. The samples misclassified in the $m^{th}$ iteration will have higher weights in the $(m+1)^{th}$ iteration, and the correctly classified samples will have lower weights. As it moves on, the algorithm puts more effort into the "difficult" samples until it can correctly classify them, so the algorithm changes focus at each iteration. At each iteration, the algorithm also calculates a stage weight based on the error rate. The final prediction is a weighted average of all those weak classifiers using the stage weights from all the iterations:

$$G(x) = sign\left(\sum_{m=1}^{M}\alpha_m G_m(x)\right)$$

where $\alpha_1, \alpha_2, \ldots, \alpha_M$ are the weights from the different iterations.

Since the classifier $G_m(x)$ returns a discrete value, the AdaBoost.M1 algorithm is known as "Discrete AdaBoost" (Friedman et al., 2000). You can revise the above algorithm if it returns a continuous value, for example a probability (Friedman et al., 2000). As mentioned before, boosting is a general approach that can be applied to different learners. Since you can easily create weak learners by limiting the depth of the tree, the boosting tree is a common method. Since the classification tree is a low-bias/high-variance technique, the ensemble decreases the model variance and leads to a low-bias/low-variance model. See Breiman (Breiman, 1998) for more explanation of why the boosting tree performs well in general.

Algorithm 5 AdaBoost.M1

1: Response variables have two values: +1 and -1
2: Initialize the observations to have the same weights: $w_i = \frac{1}{N}$, $i = 1, \ldots, N$
3: for m = 1 to M do
4:   Fit a classifier $G_m(x)$ to the training data using weights $w_i$
5:   Compute the error rate for the $m^{th}$ classifier: $err_m = \frac{\sum_{i=1}^{N} w_i I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_i}$
6:   Compute the stage weight for the $m^{th}$ iteration: $\alpha_m = log\frac{1-err_m}{err_m}$
7:   Update $w_i = w_i \cdot exp[\alpha_m \cdot I(y_i \neq G_m(x_i))]$, $i = 1, 2, \ldots, N$
8: end for
9: Calculate the prediction $G(x) = sign[\sum_{m=1}^{M}\alpha_m G_m(x)]$, where $sign(\cdot)$ means that if the argument is positive, the sample is classified as +1, and -1 otherwise.

However, boosting cannot significantly improve a low-variance model. So applying boosting to linear discriminant analysis (LDA) or K-Nearest Neighbors (KNN) doesn't lead to as much improvement as applying boosting to statistical learning methods like naive Bayes (Bauer and Kohavi, 1999).
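To make Algorithm 5 concrete, the following is a minimal sketch of AdaBoost.M1 with depth-1 trees (stumps) as the weak learners, on simulated data; the rpart package, the simulated data, and the 50 iterations are illustrative assumptions, not part of the original text.

library(rpart)

set.seed(1)
n <- 200
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- ifelse(x$x1 + x$x2 + rnorm(n, sd = 0.5) > 0, 1, -1)

M <- 50                        # number of boosting iterations
w <- rep(1 / n, n)             # step 2: equal initial weights
learners <- vector("list", M)
alpha <- numeric(M)

for (m in 1:M) {
  # step 4: fit a stump using the current weights
  fit <- rpart(factor(y) ~ ., data = x, weights = w,
               control = rpart.control(maxdepth = 1, cp = 0))
  pred <- ifelse(predict(fit, x, type = "class") == "1", 1, -1)
  err <- sum(w * (pred != y)) / sum(w)     # step 5: weighted error rate
  alpha[m] <- log((1 - err) / err)         # step 6: stage weight
  w <- w * exp(alpha[m] * (pred != y))     # step 7: up-weight misclassified samples
  learners[[m]] <- fit
}

# step 9: final prediction is the sign of the weighted vote
score <- Reduce(`+`, lapply(1:M, function(m) {
  alpha[m] * ifelse(predict(learners[[m]], x, type = "class") == "1", 1, -1)
}))
table(predicted = sign(score), truth = y)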

11.7.2 Stochastic Gradient Boosting

As mentioned before, Friedman (Friedman et al., 2000) provided a statistical framework for the AdaBoost algorithm and pointed out that boosting can be considered as a forward stagewise additive model that minimizes exponential loss. The framework led to some generalized algorithms such as Real AdaBoost, Gentle AdaBoost, and LogitBoost. Those algorithms were later unified under a framework called the gradient boosting machine. The last section of this chapter illustrates how boosting can be considered as an additive model.

Consider a 2-class classification problem. You have the response $y \in \{-1, 1\}$ and the sample proportion of class 1 from the training set is $p$. $f(x)$ is the model prediction in the range of $[-\infty, +\infty]$, and the predicted event probability is $\hat{p} = \frac{1}{1 + exp[-f(x)]}$. The gradient boosting for this problem is as follows:

Algorithm 6 Stochastic gradient boosting for 2-class classification

1: Response variables have two values: +1 and -1
2: Initialize all predictions to the sample log-odds: $f_i = log\frac{\hat{p}}{1-\hat{p}}$
3: for j = 1 ... M do
4:   Compute the predicted event probability: $\hat{p}_i = \frac{1}{1+exp[-f_i(x)]}$
5:   Compute the residual (i.e. gradient): $z_i = y_i - \hat{p}_i$
6:   Randomly sample the training data
7:   Train a tree model on the random subset using the residuals as the outcome
8:   Compute the terminal node estimates of the Pearson residuals: $r_i = \frac{1/n\sum_{i}^{n}(y_i - \hat{p}_i)}{1/n\sum_{i}^{n}\hat{p}_i(1-\hat{p}_i)}$
9:   Update $f_i = f_i + \lambda f_i^{(j)}$
10: end for

When using a tree as the base learner, basic gradient boosting has two tuning parameters: tree depth and the number of iterations. You can further customize the algorithm by selecting a different loss function and gradient (Hastie T, 2008). The final line of the loop includes a regularization strategy. Instead of adding $f_i^{(j)}$ to the previous iteration's $f_i$, only a fraction of the value is added. This fraction is called the learning rate, which is $\lambda$ in the algorithm. It can take values between 0 and 1, and it is another tuning parameter of the model.

The way to calculate variable importance in boosting is similar to a bagging model: you get variable importance by combining measures of importance across the ensemble. For example, we can calculate the Gini index improvement for each variable across all trees and use the average as the measure of importance.

Boosting is a very popular method for classification. It is one of the methods that can be directly applied to the data without requiring a great deal of time-consuming data preprocessing. Applying boosting to tree models significantly improves predictive accuracy. Some advantages of trees that are sacrificed by boosting are speed and interpretability. Let's look at the R implementation.

gbmGrid <- expand.grid(interaction.depth = c(1, 3, 5, 7, 9),
    n.trees = 1:5,
    shrinkage = c(.01, .1),
    n.minobsinnode = c(1:10))
set.seed(100)
gbmTune <- caret::train(x = trainx, y = trainy,
    method = "gbm",
    tuneGrid = gbmGrid,
    metric = "ROC",
    verbose = FALSE,
    trControl = trainControl(method = "cv",
        classProbs = TRUE,
        savePredictions = TRUE))

# only show part of the output
gbmTune

Stochastic Gradient Boosting

1000 samples
11 predictor
2 classes: 'Female', 'Male'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 899, 900, 900, 899, 899, 901, ...
Resampling results across tuning parameters:

shrinkage  interaction.depth  n.minobsinnode  n.trees  ROC     Sens  Spec
0.01       1                  1               1        0.6821  1.00  0.00
0.01       1                  1               2        0.6882  1.00  0.00
. . .
0.01       5                  8               4        0.7096  1.00  0.00
0.01       5                  8               5        0.7100  1.00  0.00
0.01       5                  9               1        0.7006  1.00  0.00
0.01       5                  9               2        0.7055  1.00  0.00
[ reached getOption("max.print") -- omitted 358 rows ]

ROC was used to select the optimal model using the largest value. The final values used for the model were n.trees = 4, interaction.depth = 3, shrinkage = 0.01 and n.minobsinnode = 6.

The results show that the tuning parameter settings that lead to the best ROC are n.trees = 4 (number of trees), interaction.depth = 3 (tree depth), shrinkage = 0.01 (learning rate), and n.minobsinnode = 6 (minimum number of observations in each node).

Now, let's compare the results from the three tree models.

treebagRoc <- pROC::roc(response = bagTune$pred$obs,
    predictor = bagTune$pred$Female,
    levels = rev(levels(bagTune$pred$obs)))
rfRoc <- pROC::roc(response = rfTune$pred$obs,
    predictor = rfTune$pred$Female,
    levels = rev(levels(rfTune$pred$obs)))
gbmRoc <- pROC::roc(response = gbmTune$pred$obs,
    predictor = gbmTune$pred$Female,
    levels = rev(levels(gbmTune$pred$obs)))

plot.roc(rpartRoc, type = "s", print.thres = c(.5),
    print.thres.pch = 16, print.thres.pattern = "",
    print.thres.cex = 1.2, col = "black",
    legacy.axes = TRUE, print.thres.col = "black")

plot.roc(treebagRoc, type = "s", add = TRUE,
    print.thres = c(.5), print.thres.pch = 3,
    legacy.axes = TRUE, print.thres.pattern = "",
    print.thres.cex = 1.2, col = "red",
    print.thres.col = "red")

plot.roc(rfRoc, type = "s", add = TRUE,
    print.thres = c(.5), print.thres.pch = 1,
    legacy.axes = TRUE, print.thres.pattern = "",
    print.thres.cex = 1.2, col = "green",
    print.thres.col = "green")

plot.roc(gbmRoc, type = "s", add = TRUE,
    print.thres = c(.5), print.thres.pch = 10,
    legacy.axes = TRUE, print.thres.pattern = "",
    print.thres.cex = 1.2, col = "blue",
    print.thres.col = "blue")

legend(0.2, 0.5, cex = 0.8,
    c("Single Tree", "Bagged Tree", "Random Forest", "Boosted Tree"),
    lwd = c(1, 1, 1, 1),
    col = c("black", "red", "green", "blue"),
    pch = c(16, 3, 1, 10))

(ROC curves for the single tree, bagged tree, random forest, and boosted tree models; sensitivity versus 1 - specificity.)

Since the data here doesn't have many variables, we don't see a significant difference among the models. But you can still see that the ensemble methods are better than a single tree. In most real applications, ensemble methods perform much better. Random forest and boosting trees can serve as baseline models. Before exploring different models, you can quickly run a random forest to check the performance and then try to improve on it. If the performance you get from the random forest is not much better than guessing, you should consider collecting more data or reviewing the problem to frame it in a different way instead of trying other models, because it usually means the current data is not enough to solve the problem.

11.7.3 Boosting as Additive Model

This section illustrates how boosting is a forward stagewise additive model that minimizes exponential loss (Friedman et al.,

2000). Many seemingly different models can be represented as a basis expansion model. Recall the classifier obtained by the AdaBoost.M1 algorithm:

$$G(x) = sign\left(\sum_{m=1}^{M}\alpha_m G_m(x)\right)$$

The above expression fits in the framework of basis expansion, which is as follows:

$$f(x) = \sum_{m=1}^{M}\beta_m b(x; \gamma_m) \ \ \ (11.2)$$

where $\beta_m$, $m = 1, \ldots, M$, are the expansion coefficients and $b(x, \gamma) \in \mathbb{R}$ are the basis functions. Many learning methods fit into this additive framework. Hastie et al. discuss basis expansion in detail in Chapter 5 of Hastie T (2008) and cover the additive expansion for different learning techniques, such as single-hidden-layer neural networks (Chapter 11) and MARS (Section 9.4). These models are fit by minimizing a designated loss function averaged over the training data:

$$\min_{\{\beta_m, \gamma_m\}_1^M} \sum_{i=1}^{N} L\left(y_i, \sum_{m=1}^{M}\beta_m b(x_i; \gamma_m)\right) \ \ \ (11.3)$$

Different models have different basis functions $b(x_i; \gamma_m)$. You can also customize the loss function $L(\cdot)$, such as squared-error loss or likelihood-based loss. Optimizing the loss function across the whole training set is usually computationally costly no matter the choice of basis and loss. The good news is that the problem can be simplified to fitting a single basis function:

$$\min_{\beta, \gamma} \sum_{i=1}^{N} L(y_i, \beta b(x_i; \gamma))$$

The forward stagewise algorithm approximates the optimal solution of equation (11.3). It adds new basis functions to the expansion without adjusting the previous ones. The forward stagewise additive algorithm is as follows (Section 10.2 of Hastie T (2008)):

Algorithm 7 Forward stagewise algorithm

1: Initialize $f_0(x) = 0$
2: for $m = 1, \ldots, M$ do
3:   Compute $(\beta_m, \gamma_m) = \underset{\beta,\gamma}{argmin}\sum_{i=1}^{N} L(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma))$
4:   Set $f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m)$
5: end for

At iteration $m$, the algorithm searches for the optimal $b(x; \gamma_m)$ and $\beta_m$ based on the previous function $f_{m-1}(x)$. It then adds the new term $\beta_m b(x; \gamma_m)$ to the previous function to get $f_m(x)$, without changing any parameters from previous steps. Assume we use squared-error loss:

$$L(y, f(x)) = (y - f(x))^2$$

Then we have:

$$L(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma)) = (y_i - f_{m-1}(x_i) - \beta b(x_i; \gamma))^2$$

where $y_i - f_{m-1}(x_i)$ is the residual for sample $i$ based on the previous model. That is to say, the algorithm fits the new basis function to the residuals of the previous step. However, the squared-error loss is generally not a good choice. For a regression problem, it is susceptible to outliers. It doesn't fit the classification problem since the response is categorical. Hence we often consider other loss functions.
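To make the residual-fitting view concrete, here is a minimal sketch of forward stagewise fitting with squared-error loss on simulated data, where each new basis function is a regression stump fit to the current residuals; the rpart package, the simulated data, and the 100 iterations are illustrative assumptions, not part of the original text.

library(rpart)

set.seed(1)
sim_dat <- data.frame(x1 = runif(200, -3, 3))
sim_dat$y <- sin(sim_dat$x1) + rnorm(200, sd = 0.3)

M <- 100
f_hat <- rep(0, nrow(sim_dat))          # f_0(x) = 0
for (m in 1:M) {
  sim_dat$resid <- sim_dat$y - f_hat    # residuals from the previous model
  # a depth-1 tree (stump) plays the role of the basis function b(x; gamma_m)
  stump <- rpart(resid ~ x1, data = sim_dat,
                 control = rpart.control(maxdepth = 1, cp = 0))
  f_hat <- f_hat + predict(stump, sim_dat)   # add the new basis function
}
mean((sim_dat$y - f_hat)^2)             # training error shrinks as terms are added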

Now let us go back to the AdaBoost.M1 algorithm. It is actually a special case of the above forward stagewise model when the loss function is:

$$L(y, f(x)) = exp(-yf(x))$$

In AdaBoost.M1, the basis function is the classifier from each iteration, $G_m(x) \in \{-1, 1\}$. If we use the exponential loss, the optimization problem is

$$\begin{array}{rl} (\beta_m, G_m) & = \underset{\beta, G}{argmin}\sum_{i=1}^{N} exp[-y_i(f_{m-1}(x_i) + \beta G(x_i))] \\ & = \underset{\beta, G}{argmin}\sum_{i=1}^{N} exp[-y_i \beta G(x_i)] \cdot exp[-y_i f_{m-1}(x_i)] \\ & = \underset{\beta, G}{argmin}\sum_{i=1}^{N} w_i^m\, exp[-y_i \beta G(x_i)] \end{array} \ \ \ (11.4)$$

where $w_i^m = exp[-y_i f_{m-1}(x_i)]$. It does not depend on $\beta$ or $G(x)$, so we can consider it as the weight for each sample. Since the weight is related to $f_{m-1}(x_i)$, it changes at each iteration. We can further decompose equation (11.4):

$$\begin{array}{l} \underset{\beta, G}{argmin}\sum_{i=1}^{N} w_i^m\, exp[-y_i \beta G(x_i)] \\ = \underset{\beta, G}{argmin}\sum_{i=1}^{N}\left\{w_i^m e^{-\beta} I(y_i = G(x_i)) + w_i^m e^{\beta} I(y_i \neq G(x_i))\right\} \\ = \underset{\beta, G}{argmin}\sum_{i=1}^{N}\left\{w_i^m e^{-\beta}[1 - I(y_i \neq G(x_i))] + w_i^m e^{\beta} I(y_i \neq G(x_i))\right\} \\ = \underset{\beta, G}{argmin}\left\{(e^{\beta} - e^{-\beta})\cdot\sum_{i=1}^{N} w_i^m I(y_i \neq G(x_i)) + e^{-\beta}\cdot\sum_{i=1}^{N} w_i^m\right\} \end{array} \ \ \ (11.5)$$

When $\beta > 0$, the solution to equation (11.5) is:

$$G_m = \underset{G}{argmin}\sum_{i=1}^{N} w_i^m I(y_i \neq G(x_i))$$

It is the classifier that minimizes the weighted error. Plug the above $G_m$ into equation (11.5), take the derivative with respect to $\beta$, and set it to 0 to solve for the optimal $\beta_m$:

$$\beta_m = \frac{1}{2} ln\frac{1 - err_m}{err_m}$$

where

$$err_m = \frac{\sum_{i=1}^{N} w_i^m I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_i^m}$$

According to the forward stagewise algorithm, the result is updated as:

$$f_m(x) = f_{m-1}(x) + \beta_m G_m(x)$$

We can go ahead and get the weight for the next iteration:

$$\begin{array}{rl} w_i^{m+1} & = exp[-y_i f_m(x_i)] \\ & = exp[-y_i f_{m-1}(x_i) - y_i\beta_m G_m(x_i)] \\ & = w_i^m \cdot exp[-\beta_m y_i G_m(x_i)] \end{array} \ \ \ (11.6)$$

Since $-y_i G_m(x_i) = 2 \cdot I(y_i \neq G_m(x_i)) - 1$, equation (11.6) can be written as:

$$w_i^{m+1} = w_i^m \cdot exp[\alpha_m I(y_i \neq G_m(x_i))] \cdot exp[-\beta_m]$$

where $\alpha_m = 2\beta_m = log\frac{1 - err_m}{err_m}$ is the same as the $\alpha_m$ in the AdaBoost.M1 algorithm we showed before. So AdaBoost.M1 is a special case of a forward stagewise additive model using exponential loss. For comparing and selecting different loss functions, you can refer to sections 10.5 and 10.6 in (Hastie T, 2008).

12 Deep Learning

With everyday applications in language, voice, image, and self-driving cars, deep learning has become a popular concept to the general public in the past few years. However, many of the concepts of deep learning started as early as the 1940s. For example, the binary perceptron classifier, invented in the late 1950s, uses a linear combination of input signals and a step activation function. This is the same as a single neuron in a modern deep learning network that uses the same linear combination of input signals from neurons at the previous layer and a more efficient nonlinear activation function. The perceptron model was further defined by minimizing the classification error and trained by using one data point at a time to update the model parameters during the optimization process. Modern neural networks are trained similarly by minimizing a loss function, but with more modern optimization algorithms such as stochastic gradient descent and its variations.

Even though the theoretical foundation of deep learning has been continually developed in the past few decades, real-world applications of deep learning are fairly recent due to some real-world constraints: data, network structure, algorithm, and computation power.

Data

We are all familiar with all sorts of data today: structured tabulated data in database tables or CSV files, free-form text, images, and other unstructured datasets. However, historical datasets are relatively small in size, especially for data with accurately labeled ground truth. Statisticians have been working on datasets that only have a few thousand rows and a few dozen columns for decades to solve business problems. Even with modern computers, the size

of the data is usually limited to the memory of a computer. Now we know that, to enable deep learning applications, the algorithm needs a much larger dataset than traditional machine learning methods. It is usually on the order of millions of samples with high-quality ground truth labels for supervised deep learning models.

The first widely used large dataset with accurate labels was the ImageNet dataset, which was first created in 2010. It now contains more than 14 million images with more than 20k synsets (i.e. meaningful categories). Every image in the dataset was human-annotated with quality control to ensure the ground truth labels are accurate. One of the direct results of ImageNet was the Large Scale Visual Recognition Challenge (ILSVRC), which evaluated different algorithms in image-related tasks. The ILSVRC competition provided a perfect stage for deep learning applications to debut to the general public. For 2010 and 2011, the best error rate from traditional image classification methods was around 26%. In 2012, a method based on the convolutional neural network became the state of the art with an error rate of around 16%, a dramatic improvement over the traditional methods.

With the prevalence of the modern internet, the amount of text, voice, image, and video data increased exponentially. The quality and quantity of data for deep learning applications enabled the use of deep learning for applications such as image classification, voice recognition, and natural language understanding. Data is the fuel for deep learning engines. With more and more varieties of data created, captured, and saved, there will be more applications of deep learning discovered every day.

Network Structure

Lacking high-quality, high-volume data was not the only constraint in the early deep learning years. A perceptron with just one single neuron is just a linear classifier, and real applications are nearly always non-linear. To solve this problem, we have to grow from one neuron to multiple layers with multiple neurons per layer. This multilayer perceptron (MLP) is also referred to as a feedforward neural network. In the 1990s, the universal approximation theorem was proven, and it assured us that a feedforward network with a single hidden layer containing a finite number of neurons can approximate continuous functions. Even though a one-layer neural network can theoretically solve a general non-linear problem, in practice we have grown the neural network to many layers of neurons. The number of layers in the network is the "depth" of the network. Loosely speaking, deep learning is a neural network with many layers (i.e. the depth is deep).

The MLP is the basic structure for modern deep learning applications. MLP can be used for classification problems or regression problems with response variables as the output and a collection of explanatory variables as the input (i.e. the traditional structured datasets). Many of the problems that can be solved using classical classification methods such as random forest can be solved by MLP. However, MLP is not the best option for image and language-related tasks. For image-related tasks, pixels in a local neighborhood collectively provide useful information to solve a task. To take advantage of the 2D spatial relationship among pixels, the convolutional neural network (CNN) structure is a better choice.
For language-related tasks, the sequence of the text provides more information than just a collection of single words. The recurrent neural network (RNN) is a better structure for such sequence-related data. There are other, more complicated neural network structures, and it is still a fast-developing area. MLP, CNN, and RNN are just the starting point of deep learning methods.

Algorithm

In addition to data and neural network structure, there were a few key algorithm breakthroughs that enabled the widespread adoption of deep learning. For an entry-level neural network structure, there are hundreds of thousands of parameters to be estimated from the data. With a large amount of training data, stochastic gradient descent and mini-batch gradient descent are efficient ways to utilize a subset of the training data to update the model parameters. Within this process, one of the key steps is back-propagation, which was introduced in the 1980s for efficient weight updates. There is a non-linear activation for each neuron in deep learning models, and sigmoid or hyperbolic tangent functions were often used. However, they have the problem of vanishing gradients when the number of layers in the network grows large (i.e. a deeper network). To solve this problem, the rectified linear unit (ReLU) was introduced to deep learning in the 2000s, and it increases the convergence speed dramatically. ReLU is very simple (i.e. y = x when x >= 0 and y = 0 otherwise), but it cleverly solved one of the big headaches in deep learning. We will talk more about activation functions in section 12.2.4.

With hundreds of thousands of parameters in the model, deep learning is easy to overfit. In order to mitigate this, dropout, a form of regularization, was introduced in 2012. It randomly drops out a certain percentage of neurons in the network during the optimization process to achieve more robust model performance. It is similar to the concept of random forest, where features and training data are randomly chosen. There are many other algorithm improvements to get better models, such as batch normalization and using residuals from previous layers. With back-propagation in stochastic gradient descent, the ReLU activation function, dropout, and other techniques, modern deep learning methods begin to outperform traditional machine learning methods.

Computation Power

With data, network structure, and algorithms ready, modern deep learning still requires a certain amount of computation power for training. The entire framework involves heavy linear algebra operations with large matrices and tensors. These types of operations are much faster on modern graphical processing units (GPUs) than on the computer's central processing unit (CPU).

With the vast potential application of deep learning, major tech companies contribute heavily to open-source deep learning frameworks. For example, Google has open-sourced its TensorFlow framework; Facebook has open-sourced its PyTorch framework; and Amazon has significantly contributed to the MXNet open-source framework. With thousands of software developers and scientists behind these deep learning frameworks, users can confidently pick one framework and start training their deep learning models right away in popular cloud environments. Much of the heavy lifting to train a deep learning model has been embedded in these open-source frameworks, and there are also many pre-trained models available for users to adopt.
Users can now enjoy relatively easy access to software and hardware to develop their own deep learning applications. In this book, we will demonstrate deep learning examples using Keras, a high-level abstraction of TensorFlow, on the Databricks Community Edition platform.

In summary, deep learning has not just developed in the past few years but has in fact been ongoing research for the past few decades. The accumulation of data, the advancement of new optimization algorithms, and the improvement of computation power have finally enabled everyday deep learning applications. In the foreseeable future, deep learning will continue to revolutionize machine learning methods across many more areas.

12.1 Projection Pursuit Regression

Before moving on to neural networks, let us start with a broader framework, Projection Pursuit Regression (PPR). It is an additive model of derived features rather than of the inputs themselves. Another widely used algorithm, AdaBoost, also fits an additive model in a base learner.

Assume $X = (X_1, X_2, \ldots, X_p)^T$ is a vector with $p$ variables and $Y$ is the corresponding response variable. $\omega_m$, $m = 1, 2, \ldots, M$, are parameter vectors with $p$ elements.

$$f(X) = \sum_{m=1}^{M} g_m(\omega_m^T X)$$

The new feature $v_m = \omega_m^T X$ is a linear combination of the input variables $X$. The additive model is based on the new features. Here $\omega_m$ is a unit vector, and the new feature $v_m$ is the projection of $X$ on $\omega_m$. It projects the $p$-dimensional independent variable space onto a new $M$-dimensional feature space. This is similar to principal component analysis, except that the principal component is an orthogonal projection while the projection here is not necessarily orthogonal.

I know it is very abstract. Let's look at some examples. Assume $p = 2$, i.e. there are two variables $x_1$ and $x_2$. If $M = 1$ and $\omega^T = (\frac{1}{2}, \frac{\sqrt{3}}{2})$, then the corresponding $v = \frac{1}{2}x_1 + \frac{\sqrt{3}}{2}x_2$. Let's try different settings and compare the results:

1. $\omega^T = (\frac{1}{2}, \frac{\sqrt{3}}{2})$, $v = \frac{1}{2}x_1 + \frac{\sqrt{3}}{2}x_2$, $g(v) = \frac{1}{1+e^{-v}}$
2. $\omega^T = (1, 0)$, $v = x_1$, $g(v) = (v + 5)\,sin(\frac{1}{v/3 + 0.1})$

3. $\omega^T = (0, 1)$, $v = x_2$, $g(v) = e^{v^2/5}$
4. $\omega^T = (1, 0)$, $v = x_1$, $g(v) = (v + 0.1)\,sin(\frac{1}{v/3 + 0.1})$

Here is how you can simulate the data and plot it using R:

# use plot3D package to generate 3D plot
library(plot3D)


# get x1 and x2. note here x1 and x2 need to be matrix.
# if you check the two objects, you will find:
#   columns in x1 are identical; rows in x2 are identical.
# mesh() is a function from the plot3D package.
# you may need to think a little here
M <- mesh(seq(-13.2, 13.2, length.out = 50),
          seq(-37.4, 37.4, length.out = 50))

x1 <- M$x
x2 <- M$y

## setting 1: map X using w to get v
v <- (1/2) * x1 + (sqrt(3)/2) * x2
# apply g() on v
g1 <- 1/(1 + exp(-v))

par(mfrow = c(2, 2), mar = c(0, 0, 1, 0))
surf3D(x1, x2, g1, colvar = g1, border = "black",
       colkey = FALSE, box = FALSE, main = "Setting 1")

## setting 2
v <- x1
g2 <- (v + 5) * sin(1/(v/3 + 0.1))
surf3D(x1, x2, g2, colvar = g2, border = "black",
       colkey = FALSE, box = FALSE, main = "Setting 2")

## setting 3
v <- x2
g3 <- exp(v^2/5)
surf3D(x1, x2, g3, colvar = g3, border = "black",
       colkey = FALSE, box = FALSE, main = "Setting 3")

## setting 4
v <- x1
g4 <- (v + 0.1) * sin(1/(v/3 + 0.1))
surf3D(x1, x2, g4, colvar = g4, border = "black",
       colkey = FALSE, box = FALSE, main = "Setting 4")

(Surface plots of g(v) for Settings 1-4.)

You can see that this framework is very flexible. In essence, it applies a non-linear transformation to a linear combination of the inputs. You can use it to capture a variety of relationships. For example, $x_1 x_2$ can be written as $\frac{(x_1 + x_2)^2 - (x_1 - x_2)^2}{4}$, where $M = 2$. All the higher-order terms of $x_1$ and $x_2$ can be represented similarly. If $M$ is large enough, this framework can approximate any continuous function on $\mathbb{R}^p$. So the model family covers a broad area, but at a price: interpretability, because the number of parameters increases with $M$ and the model is nested. PPR was a new idea in 1981 which led to the debut of the neural network model.

The basic technical idea behind deep learning has been around for decades. However, why did it take off in recent years? Here are some of the main drivers behind the rise. First, thanks to digitalization, many human activities are now in the digital realm and captured as data. So in the last 10 years, for many problems, we went from having a relatively small amount of data to accumulating a large amount of data. Traditional learning algorithms, like logistic regression, support vector machines, and random forests, cannot effectively take advantage of such big data. Second, the increasing computation power enables us to train a large neural network either on a CPU or GPU using big data. The scale of data and computation ability led to much progress, but tremendous algorithmic innovation is also an important driver. Many of the innovations are about speeding up the optimization of neural networks. One example is to use ReLU as the intermediate layer activation function instead of the previous sigmoid function. The change has made the optimization process much faster because the sigmoid function suffers from vanishing gradients. We will talk more about that in the following sections. Here we just want to show an example of the impact of algorithmic innovation.
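As a side note, base R provides ppr() in the stats package for fitting projection pursuit regression. Here is a minimal sketch on simulated data built around the x1*x2 example above; the simulated data and the choice of two ridge terms are illustrative assumptions, not part of the original text.

set.seed(1)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- x1 * x2 + rnorm(n, sd = 0.1)     # the x1 * x2 relationship discussed above
dat_ppr <- data.frame(x1 = x1, x2 = x2, y = y)

# M = 2 ridge terms are enough to represent x1 * x2
fit_ppr <- ppr(y ~ x1 + x2, data = dat_ppr, nterms = 2)
summary(fit_ppr)   # shows the estimated projection directions and ridge term coefficients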

12.2 Feedforward Neural Network

12.2.1 Logistic Regression as Neural Network

Let's look at logistic regression through the lens of the neural network. For a binary classification problem, for example a spam classifier, given $m$ samples $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\}$, we need to use the input features $x^{(i)}$ (they may be the frequency of various words such as "money", special characters like dollar signs, the use of capital letters in the message, etc.) to predict the output $y^{(i)}$ (whether it is a spam email). Assume that for each sample $i$, there are $n_x$ input features. Then we have:

$$X = \begin{bmatrix} x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(m)} \\ x_2^{(1)} & x_2^{(2)} & \cdots & x_2^{(m)} \\ \vdots & \vdots & \vdots & \vdots \\ x_{n_x}^{(1)} & x_{n_x}^{(2)} & \cdots & x_{n_x}^{(m)} \end{bmatrix} \in \mathbb{R}^{n_x \times m} \ \ \ (12.1)$$

$$y = [y^{(1)}, y^{(2)}, \ldots, y^{(m)}] \in \mathbb{R}^{1 \times m}$$

To predict whether sample $i$ is a spam email, we first get the inactivated neuron $z^{(i)}$ by a linear transformation of the input $x^{(i)}$: $z^{(i)} = w^T x^{(i)} + b$. Then we apply a function to "activate" the neuron $z^{(i)}$; we call it the "activation function." In logistic regression, the activation function is the sigmoid function, and the "activated" $z^{(i)}$ is the prediction:

$$\hat{y}^{(i)} = \sigma(w^T x^{(i)} + b)$$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$. The following figure summarizes the process:

There are two types of layers. The last layer connects directly to the output. All the rest are intermediate layers. Depending on your definition, we call this a "0-layer neural network," where the layer count only considers intermediate layers. To train the model, you need a cost function, which is defined as equation (12.2):

$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) \ \ \ (12.2)$$

where

$$L(\hat{y}^{(i)}, y^{(i)}) = -y^{(i)} log(\hat{y}^{(i)}) - (1 - y^{(i)}) log(1 - \hat{y}^{(i)})$$

To fit the model is to minimize the cost function.

12.2.2 Stochastic Gradient Descent

The general approach to minimize $J(w, b)$ is gradient descent, with the gradients computed by back-propagation. The optimization process is a forward and backward sweep over the network.

The forward propagation takes the current weights and calculates the prediction and cost. The backward propagation computes the gradients of the parameters using the chain rule for differentiation. In logistic regression, it is easy to calculate the gradients with respect to the parameters $(w, b)$.

Let's look at Stochastic Gradient Descent (SGD) for logistic regression across $m$ samples. SGD updates the parameters using one sample at a time. The detailed process is as follows.

First initialize $w_1, w_2, \ldots, w_{n_x}$, and $b$. Then plug the initialized values into the forward and backward propagation. The forward propagation takes the current weights and calculates the prediction $\hat{y}^{(i)}$ and cost $J^{(i)}$. The backward propagation calculates the gradients of the parameters. After iterating through all $m$ samples, you can calculate the gradients of the parameters and then update them by:

$$w := w - \gamma\frac{\partial J}{\partial w}$$
$$b := b - \gamma\frac{\partial J}{\partial b}$$

Repeat the forward and backward process using the updated parameters until the cost $J$ stabilizes.
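The following is a minimal sketch of this procedure in R on simulated data, written with vectorized (batch) updates over all m samples for brevity; a stochastic version would apply the same update one sample at a time. The simulated data, learning rate, and number of iterations are illustrative assumptions.

set.seed(1)
m <- 500
n_x <- 3
x <- matrix(rnorm(m * n_x), nrow = n_x)            # n_x by m, as in the notation above
y <- as.numeric(runif(m) < 1 / (1 + exp(-(x[1, ] - x[2, ]))))

sigmoid <- function(z) 1 / (1 + exp(-z))
w <- rep(0, n_x)
b <- 0
gamma <- 0.1                                       # learning rate

for (iter in 1:2000) {
  z <- as.vector(t(w) %*% x) + b                   # forward: linear part
  y_hat <- sigmoid(z)                              # forward: activation
  dz <- y_hat - y                                  # gradient of the loss w.r.t. z
  dw <- as.vector(x %*% dz) / m                    # dJ/dw
  db <- mean(dz)                                   # dJ/db
  w <- w - gamma * dw                              # backward: parameter update
  b <- b - gamma * db
}
round(w, 2)
round(b, 2)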

12.2.3 Deep Neural Network

Before people coined the term deep learning, a neural network usually referred to a network with a single hidden layer. Neural networks with more than one hidden layer are called deep learning. A network with the structure in figure 12.1 is a multilayer perceptron (MLP) or feedforward neural network (FFNN).

FIGURE 12.1: Feedforward Neural Network

Let's look at a simple one-hidden-layer neural network (figure 12.2). First, only consider one sample. From left to right, there is an input layer with 3 features ($x_1, x_2, x_3$), a hidden layer with four neurons, and an output layer to produce a prediction $\hat{y}$.

FIGURE 12.2: 1-layer Neural Network

From input to the first hidden layer

Each inactivated neuron in the first hidden layer is a linear transformation of the input vector $x$. For example, $z_1^{[1]} = w_1^{[1]T} x^{(i)} + b_1^{[1]}$ is the first inactivated neuron of hidden layer one. We use the superscript [l] to denote a quantity associated with the $l^{th}$ layer and the subscript i to denote the $i^{th}$ entry of a vector (a neuron or feature). Here $w^{[1]}$ and $b_1^{[1]}$ are the weight and bias parameters for layer 1. $w_1^{[1]}$ is a $3 \times 1$ vector, and hence $w_1^{[1]T} x^{(i)}$ is a linear combination of the three input features. Then use a sigmoid function $\sigma(\cdot)$ to activate the neuron $z_1^{[1]}$ to get $a_1^{[1]}$.

From the first hidden layer to the output

Next, do a linear combination of the activated neurons from the first layer to get the inactivated output, $z_1^{[2]}$, and then activate the neuron to get the predicted output $\hat{y}$. The parameters to estimate in this step are $w^{[2]}$ and $b_1^{[2]}$.

If you fully write out the process, it is the bottom right of figure 12.2. When you implement a neural network, you need to do a similar calculation four times to get the activated neurons in the first hidden layer. Doing this with a for loop is inefficient, so people vectorize the four equations: take an input and compute the corresponding $z$ and $a$ as vectors. You can vectorize each step and get the following representation:

$$z^{[1]} = W^{[1]} x + b^{[1]}, \quad a^{[1]} = \sigma^{[1]}(z^{[1]})$$
$$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}, \quad a^{[2]} = \sigma^{[2]}(z^{[2]}) = \hat{y}$$

$b^{[1]}$ is the column vector of the four bias parameters shown above. $z^{[1]}$ is a column vector of the four inactivated neurons. When you apply an activation function to a matrix or vector, you apply it element-wise. $W^{[1]}$ is the matrix formed by stacking the four row vectors:

$$W^{[1]} = \begin{bmatrix} w_1^{[1]T} \\ w_2^{[1]T} \\ w_3^{[1]T} \\ w_4^{[1]T} \end{bmatrix}$$

So if you have one sample, you can go through the above forward propagation process to calculate the output $\hat{y}$ for that sample. If you have $m$ training samples, you need to repeat this process for each of the $m$ samples. We use the superscript (i) to denote a quantity associated with the $i^{th}$ sample. You need to do the same calculation for all $m$ samples. For i = 1 to m, do:

$$z^{[1](i)} = W^{[1]} x^{(i)} + b^{[1]}, \quad a^{[1](i)} = \sigma^{[1]}(z^{[1](i)})$$
$$z^{[2](i)} = W^{[2]} a^{[1](i)} + b^{[2]}, \quad a^{[2](i)} = \sigma^{[2]}(z^{[2](i)}) = \hat{y}^{(i)}$$

Recall that we defined the matrix $X$ by stacking the training samples as column vectors in equation (12.1). We do a similar thing here and stack the vectors with superscript (i) together across the $m$ samples. This way, the neural network computes the outputs for all the samples at the same time:

$$Z^{[1]} = W^{[1]} X + b^{[1]}, \quad A^{[1]} = \sigma^{[1]}(Z^{[1]})$$
$$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}, \quad A^{[2]} = \sigma^{[2]}(Z^{[2]}) = \hat{Y}$$

where

$$X = \begin{bmatrix} | & | & & | \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ | & | & & | \end{bmatrix}, \quad A^{[l]} = \begin{bmatrix} | & | & & | \\ a^{[l](1)} & a^{[l](2)} & \cdots & a^{[l](m)} \\ | & | & & | \end{bmatrix}, \quad Z^{[l]} = \begin{bmatrix} | & | & & | \\ z^{[l](1)} & z^{[l](2)} & \cdots & z^{[l](m)} \\ | & | & & | \end{bmatrix}, \quad l = 1 \text{ or } 2$$

You can add layers like this to get a deeper neural network as shown in the bottom right of figure 12.1.
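As an illustration of the vectorized notation, here is a minimal sketch of the forward propagation for the one-hidden-layer network above (3 inputs, 4 hidden neurons, 1 output) in R; the random weights and the 5 samples are placeholders for illustration, since in practice the weights are learned by gradient descent.

set.seed(1)
m <- 5                                   # number of samples
X <- matrix(rnorm(3 * m), nrow = 3)      # 3 features stacked as columns

sigmoid <- function(z) 1 / (1 + exp(-z))

W1 <- matrix(rnorm(4 * 3), nrow = 4)     # W[1]: 4 x 3
b1 <- rep(0, 4)                          # b[1]: 4 x 1
W2 <- matrix(rnorm(1 * 4), nrow = 1)     # W[2]: 1 x 4
b2 <- 0

Z1 <- W1 %*% X + b1                      # Z[1] = W[1] X + b[1]
A1 <- sigmoid(Z1)                        # A[1] = sigma(Z[1])
Z2 <- W2 %*% A1 + b2                     # Z[2] = W[2] A[1] + b[2]
Y_hat <- sigmoid(Z2)                     # predicted probabilities, 1 x m
Y_hat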

12.2.4 Activation Function

• Sigmoid and Softmax Function

We have used the sigmoid (or logistic) activation function. The function is S-shaped with an output value between 0 and 1. Therefore it is used as the output layer activation function to predict the probability when the response $y$ is binary. However, it is rarely used as an intermediate layer activation function. One of the main reasons is that when $z$ is away from 0, the derivative of the function drops quickly, which slows down the optimization process through gradient descent. Even though the fact that it is differentiable provides some convenience, the decreasing slope can cause a neural network to get stuck at training time.

FIGURE 12.3: Sigmoid Function, $\sigma(z) = \frac{1}{1+e^{-z}}$

When the output has more than 2 categories, people use softmax function as the output layer activation function.

$$f_i(\mathbf{z}) = \frac{e^{z_i}}{\sum_{j=1}^{J} e^{z_j}} \ \ \ (12.3)$$

where $\mathbf{z}$ is a vector.

• Hyperbolic Tangent Function (tanh)

Another activation function with a similar S-shape is the hyperbolic tangent function. It often works better than the sigmoid function as an intermediate layer activation function.

$$tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} \ \ \ (12.4)$$

FIGURE 12.4: Hyperbolic Tangent Function

The tanh function crosses the point (0, 0), and the value of the function is between -1 and 1, which makes the mean of the activated neurons closer to 0. The sigmoid function doesn't have that property. When you preprocess the training input data, you sometimes center the data so that the mean is 0. The tanh function does that data processing in some way, which makes learning for the next layer a little easier. This activation function is used a lot in recurrent neural networks, where you want to polarize the results.

• Rectified Linear Unit (ReLU) Function

The most popular activation function is the Rectified Linear Unit (ReLU) function. It is a piecewise, or half-rectified, function:

$$R(z) = max(0, z) \ \ \ (12.5)$$

The derivative is 1 when z is positive and 0 when z is negative. You can define the derivative as either 0 or 1 when z is 0. When you implement this, it is unlikely that z equals exactly 0, even though it can be very close to 0.

FIGURE 12.5: Rectified Linear Unit Function

The advantage of the ReLU is that when z is positive, the derivative doesn't vanish as z gets larger, so it leads to faster computation than sigmoid or tanh. It is non-linear with an unconstrained response. However, the disadvantage is that when z is negative, the derivative is 0, so it may not map the negative values appropriately. In practice, this doesn't cause too much trouble, but there is another version of ReLU, called leaky ReLU, that attempts to solve the dying ReLU problem. The leaky ReLU is

$$R(z)_{Leaky} = \begin{cases} z & z \geq 0 \\ az & z < 0 \end{cases}$$

Instead of being 0 when z is negative, it adds a slight slope such as $a = 0.01$, as shown in figure 12.6 (can you see the leaky part there? :).

FIGURE 12.6: Leaky Rectified Linear Unit Function

You may notice that all these activation functions are non-linear. Since the composition of two linear functions is still linear, using a linear activation function doesn't help to capture more information. That is why you don't see people use a linear activation function in the intermediate layers. One exception is when the output $y$ is continuous; then you may use a linear activation function at the output layer. To sum up, for intermediate layers:

• ReLU is usually a good choice. If you don't know what to choose, then start with ReLU. Leaky ReLU usually works a little better than ReLU but is not used as much in practice; either one works fine. Also, people usually use a = 0.01 as the slope for leaky ReLU. You can try different values, but most people use a = 0.01.
• tanh is used sometimes, especially in recurrent neural networks. But you nearly never see people use the sigmoid function as an intermediate layer activation function.

For the output layer:

• When it is binary classification, use the sigmoid function with binary cross-entropy as the loss function

• When there are multiple classes, use the softmax function with categorical cross-entropy as the loss function
• When the response is continuous, use the identity function (i.e. y = x)
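For reference, here is a minimal sketch that defines and plots the activation functions discussed above in R; the plotting range and layout are arbitrary choices for illustration.

sigmoid    <- function(z) 1 / (1 + exp(-z))
softmax    <- function(z) exp(z) / sum(exp(z))      # z is a vector
tanh_fun   <- function(z) (exp(z) - exp(-z)) / (exp(z) + exp(-z))   # same as base tanh()
relu       <- function(z) pmax(0, z)
leaky_relu <- function(z, a = 0.01) ifelse(z >= 0, z, a * z)

z <- seq(-5, 5, by = 0.1)
par(mfrow = c(2, 2))
plot(z, sigmoid(z), type = "l", main = "Sigmoid")
plot(z, tanh_fun(z), type = "l", main = "tanh")
plot(z, relu(z), type = "l", main = "ReLU")
plot(z, leaky_relu(z), type = "l", main = "Leaky ReLU")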

12.2.5 Optimization

So far, we have introduced the core components of the deep learning architecture: layers, weights, activation functions, and loss functions. With the architecture in place, we need to determine how to update the network based on a loss function (a.k.a. objective function). In this section, we will look at variants of optimization algorithms that improve the training speed.

12.2.5.1 Batch, Mini-batch, Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) updates the parameters using one sample at a time. We showed SGD for logistic regression across $m$ samples in section 12.2.2. If you process the entire training set each time to calculate the gradient, it is Batch Gradient Descent (BGD). The vectorized representation in section 12.2.3 using all $m$ samples is an example of BGD. In deep learning applications, the training set is often huge, with hundreds of thousands or millions of samples. If processing all the samples only leads to one step of gradient descent, it can be slow. An intuitive way to improve the algorithm is to make some progress before finishing the entire data set. In particular, split up the training set into smaller subsets and fit the model using one subset each time. The subsets are mini-batches.

Mini-batch Gradient Descent (MGD) splits the training set into smaller mini-batches. For example, if the mini-batch size is 1000, the algorithm processes 1000 samples each time, calculates the gradient, and updates the parameters. Then it moves on to the next mini-batch until it goes through the whole training set. We call one pass through the training set using mini-batch gradient descent an epoch. Epoch is a common keyword in deep learning, which means a single pass through the training set. If the training set has 60,000 samples and the mini-batch size is 1000, one epoch leads to 60 gradient descent steps. Then

it will start over and take another pass through the training set. It means one more decision to make, the optimal number of epochs. It is decided by looking at the trends of performance metrics on a holdout set of training data. We discussed the data splitting and sampling in section 7.2. People often use a single holdout set to tune the model in deep learning. But it is important to use a big enough holdout set to give high confidence in your model’s overall performance. Then you can evaluate the chosen model on your test set that is not used in the training process. MGD is what everyone in deep learning uses when training on a large data set.

$$X = [\underbrace{x^{(1)}, x^{(2)}, \cdots, x^{(1000)}}_{mini\text{-}batch\ 1} / \cdots / \cdots x^{(m)}] \quad (n_x, m)$$

$$y = [\underbrace{y^{(1)}, y^{(2)}, \cdots, y^{(1000)}}_{mini\text{-}batch\ 1} / \cdots / \cdots y^{(m)}] \quad (1, m)$$

• Mini-batch size = m: batch gradient descent; each iteration takes too long
• Mini-batch size = 1: stochastic gradient descent; loses the speed gain from vectorization
• Mini-batch size in between: mini-batch gradient descent; makes progress without processing the entire training set. Typical batch sizes are $2^6 = 64$, $2^7 = 128$, $2^8 = 256$, $2^9 = 512$
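A minimal sketch of how a training set of 60,000 samples could be shuffled and split into mini-batches of 1000 in R is shown below; update_parameters(), X, and y are hypothetical placeholders for the gradient step and the data, not real functions or objects from this book.

m <- 60000
batch_size <- 1000
idx <- sample(m)                              # shuffle before each epoch
batches <- split(idx, ceiling(seq_along(idx) / batch_size))
length(batches)                               # 60 gradient descent steps per epoch

# one epoch of mini-batch gradient descent (pseudo-usage):
# for (batch in batches) {
#   update_parameters(X[, batch], y[batch])   # one mini-batch gradient step
# }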

12.2.5.2 Optimization Algorithms

In the history of deep learning, researchers proposed different optimization algorithms and showed that they worked well in specific scenarios. But the optimization algorithms didn't generalize well to a wide range of neural networks, so you will need to try different optimizers in your application. We will introduce three commonly used optimizers here, and they are all based on something called exponentially weighted averages. To understand the intuition behind them, let's look at a hypothetical example shown in figure 12.7. We have two parameters, b and w. The blue dot represents the current parameter value, and the red point is the optimum value we want to reach. The current value is close to the target vertically but far away horizontally.

FIGURE 12.7: The intuition behind the weighted average gradient

In this situation, we hope to have slower learning on b and faster learning on w. A way to achieve this is to use the average of the gradients from different iterations, instead of only the current iteration's gradient, to update the parameters. Vertically, since the current value is close to the target value of parameter b, the gradient of b is likely to jump between positive and negative values. The average tends to cancel out the positive and negative derivatives along the vertical direction and slow down the oscillations. Horizontally, since it is still far away from the target value of parameter w, all the gradients are likely pointing in the same direction, and using an average won't have much impact there. That is the fundamental idea behind many different optimizers: adjust the learning rate using a weighted average of the gradients from various iterations.

Exponentially Weighted Averages

Before we get to the more sophisticated optimization algorithms that implement this idea, let's first introduce the basic weighted moving average framework. Suppose we have the following 100 days' temperature data:

$$\theta_1 = 49F, \ \theta_2 = 53F, \ \ldots, \ \theta_{99} = 70F, \ \theta_{100} = 69F$$

The weighted average is defined as:

$$V_t = \beta V_{t-1} + (1 - \beta)\theta_t$$

And we have:

$$V_0 = 0$$
$$V_1 = \beta V_0 + (1 - \beta)\theta_1$$
$$V_2 = \beta V_1 + (1 - \beta)\theta_2$$
$$\vdots$$
$$V_{100} = \beta V_{99} + (1 - \beta)\theta_{100}$$

For example, for 훽 = 0.95:

$$V_0 = 0$$
$$V_1 = 0.05\theta_1$$
$$V_2 = 0.0475\theta_1 + 0.05\theta_2$$

The black line in the right plot of figure 12.8 shows the exponentially weighted averages of simulated temperature data with $\beta = 0.95$. $V_t$ is approximately an average over the previous $\frac{1}{1-\beta}$ days, so $\beta = 0.95$ approximates a 20-day average. The red line corresponds to $\beta = 0.8$, which approximates a 5-day average. As $\beta$ increases, the average is taken over a larger window of previous values, and hence the curve gets smoother. A larger $\beta$ also means that the current value $\theta_t$ gets less weight $(1 - \beta)$, and the average adapts more slowly. It is easy to see from the plot that the averages at the beginning are more biased. Bias correction can help to achieve a better estimate:

$$V_t^{corrected} = \frac{V_t}{1 - \beta^t}$$

$$V_1^{corrected} = \frac{V_1}{1 - 0.95} = \theta_1$$

$$V_2^{corrected} = \frac{V_2}{1 - 0.95^2} = 0.4872\theta_1 + 0.5128\theta_2$$

For $\beta = 0.95$, the original $V_2 = 0.0475\theta_1 + 0.05\theta_2$ is a small fraction of both $\theta_1$ and $\theta_2$. That is why it starts so much lower, with a big bias. After correction, $V_2^{corrected} = 0.4872\theta_1 + 0.5128\theta_2$ is a weighted average with two weights that add up to 1, which removes the bias. Notice that as $t$ increases, $\beta^t$ converges to 0 and $V_t^{corrected}$ converges to $V_t$. The following code simulates a set of temperature data and applies the exponentially weighted average with and without correction using various $\beta$ values (0.5, 0.8, 0.95).
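The book's exact code is not reproduced here; a minimal sketch of this type of simulation (with an assumed temperature-generating process, not the original data) could look like the following:

set.seed(1)
days <- 1:100
theta <- 60 + 20 * sin(days / 15) + rnorm(100, sd = 5)   # simulated daily temperatures

ewa <- function(theta, beta, correct = FALSE) {
  v <- numeric(length(theta))
  v_prev <- 0
  for (t in seq_along(theta)) {
    v_prev <- beta * v_prev + (1 - beta) * theta[t]
    v[t] <- if (correct) v_prev / (1 - beta^t) else v_prev
  }
  v
}

par(mfrow = c(1, 2))
betas <- c(0.95, 0.8, 0.5)
cols <- c("black", "red", "blue")
plot(days, theta, col = "gray", pch = 16, main = "Without Correction",
     xlab = "Days", ylab = "Temperature")
for (i in seq_along(betas)) lines(days, ewa(theta, betas[i]), col = cols[i], lwd = 2)
plot(days, theta, col = "gray", pch = 16, main = "With Correction",
     xlab = "Days", ylab = "Temperature")
for (i in seq_along(betas)) lines(days, ewa(theta, betas[i], correct = TRUE),
                                  col = cols[i], lwd = 2)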

FIGURE 12.8: Exponentially weighted averages with and without correction

How do we apply the exponentially weighted average to optimization? Instead of using the gradients ($dw$ and $db$) directly to update the parameters, we use the gradients' exponentially weighted average. There are various optimizers built on top of this idea. We will look at three: Momentum, Root Mean Square Propagation (RMSprop), and Adaptive Moment Estimation (Adam).

Momentum

The momentum algorithm uses the exponentially weighted average of gradients to update the parameters. On iteration t, compute dw, db using samples in one mini-batch and update the parameters as follows:

$$V_{dw} = \beta V_{dw} + (1 - \beta)dw$$

$$V_{db} = \beta V_{db} + (1 - \beta)db$$

$$w = w - \alpha V_{dw}; \quad b = b - \alpha V_{db}$$

The learning rate $\alpha$ and the weighted average parameter $\beta$ are hyperparameters. The most common and robust choice is $\beta = 0.9$. This algorithm does not use bias correction because the average warms up after a dozen iterations and is no longer biased. The momentum algorithm, in general, works better than the original gradient descent without any averaging.

Root Mean Square Propagation (RMSprop)

Root Mean Square Propagation (RMSprop) is another algorithm that applies the idea of exponentially weighted averages. On iteration t, compute dw and db using the current mini-batch. Instead of V, it calculates the weighted average of the squared gradients. When dw and db are vectors, the squaring is an element-wise operation.

$$S_{dw} = \beta S_{dw} + (1-\beta)dw^2$$

$$S_{db} = \beta S_{db} + (1-\beta)db^2$$

Then, update the parameters as follows:

$$w = w - \alpha \frac{dw}{\sqrt{S_{dw}}}; \quad b = b - \alpha \frac{db}{\sqrt{S_{db}}}$$

The RMSprop algorithm divides the learning rate for a parameter by a weighted average of recent gradient magnitudes for that parameter. The goal is still to adjust the learning speed. Recall the example that illustrates the intuition behind it: when parameter b is close to its target value, we want to decrease the oscillations along the vertical direction.

Adaptive Moment Estimation (Adam)

The Adaptive Moment Estimation (Adam) algorithm is, in some way, a combination of momentum and RMSprop. On iteration t, compute dw and db using the current mini-batch. Then calculate both V and S using the gradients.

$$\begin{aligned} V_{dw} &= \beta_1 V_{dw} + (1-\beta_1)dw \\ V_{db} &= \beta_1 V_{db} + (1-\beta_1)db \end{aligned} \quad \text{(momentum update with } \beta_1\text{)}$$

$$\begin{aligned} S_{dw} &= \beta_2 S_{dw} + (1-\beta_2)dw^2 \\ S_{db} &= \beta_2 S_{db} + (1-\beta_2)db^2 \end{aligned} \quad \text{(RMSprop update with } \beta_2\text{)}$$

The Adam algorithm implements bias correction.

$$V_{dw}^{corrected} = \frac{V_{dw}}{1-\beta_1^t}, \quad V_{db}^{corrected} = \frac{V_{db}}{1-\beta_1^t}; \qquad S_{dw}^{corrected} = \frac{S_{dw}}{1-\beta_2^t}, \quad S_{db}^{corrected} = \frac{S_{db}}{1-\beta_2^t}$$

And it updates the parameters using both the corrected V and S. 휖 here is a tiny positive number to make sure the denominator is larger than zero. The choice of 휖 doesn't matter much, and the inventors of the Adam algorithm recommended $\epsilon = 10^{-8}$. For the hyperparameters $\beta_1$ and $\beta_2$, the common settings are $\beta_1 = 0.9$ and $\beta_2 = 0.999$.

$$w = w - \alpha \frac{V_{dw}^{corrected}}{\sqrt{S_{dw}^{corrected}} + \epsilon}; \quad b = b - \alpha \frac{V_{db}^{corrected}}{\sqrt{S_{db}^{corrected}} + \epsilon}$$
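To make the updates above concrete, here is a minimal sketch of a single Adam step for one parameter in base R. The function name and the way the running averages are stored in a list are our own choices for illustration, not code from the book or from keras.

adam_update <- function(w, dw, state, t,
                        alpha = 0.001, beta1 = 0.9, beta2 = 0.999, eps = 1e-8) {
    # exponentially weighted averages of the gradient and the squared gradient
    state$v <- beta1 * state$v + (1 - beta1) * dw
    state$s <- beta2 * state$s + (1 - beta2) * dw^2
    # bias correction
    v_hat <- state$v / (1 - beta1^t)
    s_hat <- state$s / (1 - beta2^t)
    # parameter update
    w <- w - alpha * v_hat / (sqrt(s_hat) + eps)
    list(w = w, state = state)
}

# one illustrative step, pretending the current mini-batch gradient is 0.3
step <- adam_update(w = 1, dw = 0.3, state = list(v = 0, s = 0), t = 1)
step$w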

12.2.6 Deal with Overfitting

The biggest problem for deep learning is overfitting. It happens when the model learns too much from the data. We discussed this in more detail in section 7.1. A common way to diagnose the problem is to use cross-validation (section 7.2). You can recognize the problem when the model fits the training data well but gives poor predictions on the testing data. One way to prevent the model from over-learning the data is to limit model complexity. There are several approaches to that.

12.2.6.1 Regularization

For logistic regression,

$$\min_{w,b} J(w,b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \text{penalty}$$

Common penalties are L1 or L2 as follows:

$$L_2 \text{ penalty} = \frac{\lambda}{2m}\| w \|_2^2 = \frac{\lambda}{2m}\sum_{i=1}^{n_x} w_i^2$$

$$L_1 \text{ penalty} = \frac{\lambda}{m}\sum_{i=1}^{n_x} |w_i|$$

For neural networks,

$$J(w^{[1]}, b^{[1]}, \ldots, w^{[L]}, b^{[L]}) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2}\sum_{l=1}^{L} \| w^{[l]} \|_F^2$$

where

$$\| w^{[l]} \|_F^2 = \sum_{i=1}^{n^{[l]}}\sum_{j=1}^{n^{[l-1]}} \left(w_{ij}^{[l]}\right)^2$$

Many people call it “Frobenius Norm” instead of L2-norm.
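To make the two penalties concrete, here is a minimal sketch that computes them for a hypothetical coefficient vector; the values of w, lambda, and m below are made up for illustration.

w <- c(0.5, -1.2, 0.3)   # hypothetical coefficients
lambda <- 0.1            # hypothetical regularization strength
m <- 100                 # hypothetical number of training samples

l2_penalty <- lambda / (2 * m) * sum(w^2)
l1_penalty <- lambda / m * sum(abs(w))
c(l2_penalty, l1_penalty)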

12.2.6.2 Dropout

Another powerful regularization technique is "dropout." In the chapter on tree models, we mentioned that the random forest model de-correlates the trees by randomly choosing a subset of features. Dropout uses a similar idea in the parameter estimation process. It temporarily freezes a randomly selected subset of nodes at a specific layer in the neural network during the optimization process to reduce overfitting. When applying dropout to a particular layer and mini-batch, we pre-set a percentage, for example, 30%, and randomly remove 30% of the layer's nodes. The output from the 30% removed nodes is zero. During backpropagation, we update the remaining parameters for a much-diminished network. We need to do the random dropout again for the next mini-batch. To normalize the output with all nodes in this layer, we scale up the remaining outputs by the same percentage so that the dropped-out nodes do not reduce the overall signal. Please note, the dropout process randomly turns off different nodes for each mini-batch. Dropout is more efficient in reducing overfitting in deep learning than L1 or L2 regularization.
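The following is a minimal sketch of the dropout idea just described ("inverted dropout") applied to a hypothetical matrix of layer outputs. It is for intuition only and is not the internal implementation of keras' layer_dropout().

set.seed(1)
a <- matrix(rnorm(6 * 4), nrow = 6)   # hypothetical activations: 6 samples x 4 nodes
keep_prob <- 0.7                      # i.e., drop about 30% of the outputs

mask <- matrix(runif(length(a)) < keep_prob, nrow = nrow(a))
a_dropped <- a * mask / keep_prob     # zero out dropped outputs, scale up the rest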

12.2.7 Image Recognition Using FFNN

In this section, we will walk through a toy example of an image classification problem using the keras package. We use R in this section to illustrate the process and also provide the Python notebook on the book website. Please check the keras R package website¹ for the most recent development.

What is an image as data? You can consider a digital image as a set of points in 2-d or 3-d space; each point is a pixel with a value between 0 and 255. Figure 12.9 shows an example of a grayscale image. It is a set of pixels in 2-d space, and each pixel has a value between 0 and 255. You can process the image as a 2-d array input if you use a Convolutional Neural Network (CNN). Or, you can vectorize the array as the input for FFNN as shown in the figure. A color image is a set of pixels in 3-d space, and each pixel has a value between 0 and 255. There are three 2-d panels, which represent the colors red, blue, and green, respectively.

1 https://keras.rstudio.com/

FIGURE 12.9: Grayscale image is a set of pixels on 2-d space. Each pixel has a value ranging from 0 to 255.

Similarly, you can process a color image as a 3-d array, or you can vectorize the array as shown in figure 12.10.

FIGURE 12.10: Color image is a set of pixels on 3-d space. Each pixel has a value ranging from 0 to 255.

Let's look at how to use the keras R package for a toy example in deep learning with the handwritten digits image dataset (i.e., MNIST). keras has many dependent packages, so it takes a few minutes to install. Be patient! In a production cloud environment such as the paid version of Databricks, you can save what you have and resume from where you left off.

install.packages("keras")

As keras is just an interface to popular deep learning frameworks, we have to install the deep learning backend. The default and recommended backend is TensorFlow. By calling install_keras(), it will install all the needed dependencies for TensorFlow.

library(keras)
install_keras()

You can run the code in this section in the Databricks community edition with R as the interface. Refer to section 4.3 for how to set up an account, create a notebook (R or Python), and start a cluster. For an audience with a statistical background, using a well-managed cloud environment has the following benefits:

• Minimum language barrier in coding for most statisticians
• Zero setup to save time using the cloud environment
• Get familiar with the current trend of cloud computing in the industrial context

You can also run the code on your local machine with R and the required Python packages (keras uses the Python TensorFlow backend engine). Different versions of Python may cause some errors when running install_keras(). Here are the things you could do when you encounter a Python backend issue on your local machine:

• Run reticulate::py_config() to check the current Python configuration to see if anything needs to be changed.
• By default, install_keras() uses the virtual environment ~/.virtualenvs/r-reticulate. If you don't know how to set the right environment, try to set the installation method to conda (install_keras(method = "conda")).
• Refer to this document for more details on how to install keras and the TensorFlow backend².

Now we are all set to explore deep learning! It is as simple as three lines of R code, but there is quite a lot going on behind the scenes. If you are using a cloud environment, you do not need to worry about this behind-the-scenes setup and maintenance.

We will use the widely used MNIST handwritten digit image dataset. More information about the dataset and benchmark results from various machine learning methods can be found at http://yann.lecun.com/exdb/mnist/ and https://en.wikipedia.org/wiki/MNIST_database.

This dataset is already included in the keras/TensorFlow installation, and we can simply load it as described in the following cell. It takes less than a minute to load the dataset.

mnist <- dataset_mnist()

The data structure of the MNIST dataset is straightforward and well prepared for R. It has two pieces:

(1) training set: x (i.e. features): a 60000x28x28 tensor which corresponds to 60000 28x28 pixel greyscale images (i.e. all the values are integers between 0 and 255 in each 28x28 matrix), and y (i.e. responses): a length 60000 vector which contains the corresponding digits with integer values between 0 and 9.

(2) testing set: same as the training set, but with only 10000 images and responses.

The detailed structure of the dataset can be seen with str(mnist) below.

2 https://tensorflow.rstudio.com/reference/keras/install_keras/

str(mnist)

List of 2
 $ train:List of 2
  ..$ x: int [1:60000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ y: int [1:60000(1d)] 5 0 4 1 9 2 1 3 1 4 ...
 $ test :List of 2
  ..$ x: int [1:10000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ y: int [1:10000(1d)] 7 2 1 0 4 1 4 9 5 9 ...

Now we prepare the features (x) and the response variable (y) for both the training and testing dataset, and we can check the structure of x_train and y_train using the str() function.

x_train <- mnist$train$x
y_train <- mnist$train$y
x_test <- mnist$test$x
y_test <- mnist$test$y
str(x_train)
str(y_train)

int [1:60000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
int [1:60000(1d)] 5 0 4 1 9 2 1 3 1 4 ...

Now let's plot a chosen 28x28 matrix as an image using R's image() function. In R's image() function, the way of showing an image is rotated 90 degrees from the matrix representation. So there are additional steps to rearrange the matrix such that we can use the image() function to show it in the actual orientation.

index_image = 28 ## change this index to see different image.
input_matrix <- x_train[index_image, 1:28, 1:28]
output_matrix <- apply(input_matrix, 2, rev)
output_matrix <- t(output_matrix)
image(1:28, 1:28, output_matrix, col = gray.colors(256),
      xlab = paste("Image for digit of: ", y_train[index_image]),
      ylab = "")

Here is the original 28x28 matrix for the above image:

input_matrix

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
 [1,]    0    0    0    0    0    0    0    0    0     0     0     0     0
 [2,]    0    0    0    0    0    0    0    0    0     0     0     0     0
 [3,]    0    0    0    0    0    0    0    0    0     0     0     0     0
 [4,]    0    0    0    0    0    0    0    0    0     0     0     0     0
 [5,]    0    0    0    0    0    0    0    0    0     0     0     0     0
 [6,]    0    0    0    0    0    0    0    0    0     0     9    80   207
 [7,]    0    0    0    0    0    0   39  158  158   158   168   253   253
 [8,]    0    0    0    0    0    0  226  253  253   253   253   253   253
 [9,]    0    0    0    0    0    0  139  253  253   253   238   113   215
[10,]    0    0    0    0    0    0   39   34   34    34    30     0    31
[11,]    0    0    0    0    0    0   91    0    0     0     0     0     0
[12,]    0    0    0    0    0    0    0    0    0     0     0    11    33
[13,]    0    0    0    0    0    0    0    0    0     0    11   167   253
[14,]    0    0    0    0    0    0    0    0    0     0    27   253   253
[15,]    0    0    0    0    0    0    0    0    0     0    18   201   253
[16,]    0    0    0    0    0    0    0    0    0     0     0    36    87
...

There are multiple deep learning methods to solve the handwritten digits problem, and we will start with a simple and generic one: the feedforward neural network (FFNN). An FFNN contains a few fully connected layers, and information flows from a front layer to a back layer without any feedback loop from the back layer to the front layer. It is the most common deep learning model to start with.

12.2.7.1 Data preprocessing

In this section, we will walk through the needed steps of data preprocessing. For the MNIST dataset that we just loaded, some preprocessing is already done, so we have relatively "clean" data. But before we feed the data into the FFNN, we still need some additional preparation. First, for each digit, we have a scalar response and a 28x28 integer matrix with values between 0 and 255. To use the out-of-the-box DNN functions, each response needs to correspond to a single row of features. For an image in the MNIST dataset, the input for one response y is a 28x28 matrix, not a single row with many columns, so we need to convert the 28x28 matrix into a single row by appending every row of the matrix to the first row using the array_reshape() function.

In addition, we also need to scale all features to be between (0, 1) or (-1, 1) or close to the (-1, 1) range. Scaling or normalizing every feature will improve numerical stability in the optimization procedure, as there are a lot of parameters to be optimized.

We first reshape the 28x28 image for each digit (i.e., each row) into 784 columns (i.e., features), and then rescale the values to be between 0 and 1 by dividing the original pixel values by 255, as described in the cell below.

# step 1: reshape
x_train <- array_reshape(x_train, c(nrow(x_train), 784))
x_test <- array_reshape(x_test, c(nrow(x_test), 784))

# step 2: rescale
x_train <- x_train / 255
x_test <- x_test / 255

And here is the structure of the reshaped and rescaled features for the training and testing datasets. Now for each digit, there are 784 columns of features.

str(x_train)
str(x_test)

num [1:60000, 1:784] 0 0 0 0 0 0 0 0 0 0 ...
num [1:10000, 1:784] 0 0 0 0 0 0 0 0 0 0 ...

In this example, though the response variable is an integer (i.e. the corresponding digit for an image), there is no order or rank for these integers; they are just an indication of one of the 10 categories. So we also convert the response variable y to be categorical.

y_train <- to_categorical(y_train, 10)
y_test <- to_categorical(y_test, 10)
str(y_train)

num [1:60000, 1:10] 0 1 0 0 0 0 0 0 0 0 ...

12.2.7.2 Fit model

Now we are ready to fit the model. It is straightforward to build a deep neural network using keras. For this example, the number of input features is 784 (i.e., the scaled value of each pixel in the 28x28 image) and the number of classes for the output is 10 (i.e., one of the ten categories). So the input size for the first layer is 784 and the output size for the last layer is 10. We can add any number of compatible layers in between.

In keras, it is easy to define a DNN model: (1) use keras_model_sequential() to initiate a model placeholder; all model structures are attached to this model object; (2) layers are added in sequence by calling the layer_dense() function; (3) add an arbitrary number of layers to the model based on the sequence of calling layer_dense(). In a dense layer, every node from the previous layer is connected with every node in the next layer. In the layer_dense() function, we can define how many nodes are in that layer through the units parameter. The activation function can be defined through the activation parameter. For the first layer, we also need to define the input features' dimension through the input_shape parameter. For our preprocessed MNIST dataset, there are 784 columns in the input data. A common way to reduce overfitting is to use the dropout method, which randomly drops a proportion of the nodes in a layer. We can define the dropout proportion through the layer_dropout() function immediately after the layer_dense() function.

dnn_model <- keras_model_sequential()
dnn_model %>%
    layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>%
    layer_dropout(rate = 0.4) %>%
    layer_dense(units = 128, activation = 'relu') %>%
    layer_dropout(rate = 0.3) %>%
    layer_dense(units = 64, activation = 'relu') %>%
    layer_dropout(rate = 0.3) %>%
    layer_dense(units = 10, activation = 'softmax')

The above dnn_model has 4 layers: the first layer has 256 nodes, the 2nd layer 128 nodes, the 3rd layer 64 nodes, and the last layer 10 nodes. The activation function for the first 3 layers is relu, and the activation function for the last layer is softmax, which is typical for classification problems. The model details can be obtained through the summary() function. The number of parameters of each layer can be calculated as: (number of input features + 1) times (number of nodes in the layer). For example, the first layer has (784+1)x256 = 200960 parameters; the 2nd layer has (256+1)x128 = 32896 parameters. Please note, dropout only randomly drops a certain proportion of nodes for each batch; it does not reduce the number of parameters in the model. The dnn_model we just defined has a total of 242762 parameters to be estimated.

summary(dnn_model)

______Layer (type) Output Shape Param # ======dense_1 (Dense) (None, 256) 200960 ______dropout_1 (Dropout) (None, 256) 0 ______dense_2 (Dense) (None, 128) 32896 ______dropout_2 (Dropout) (None, 128) 0 ______dense_3 (Dense) (None, 64) 8256 ______dropout_3 (Dropout) (None, 64) 0 ______dense_4 (Dense) (None, 10) 650 ======Total params: 242,762 Trainable params: 242,762 298 12 Deep Learning

Non-trainable params: 0 ______
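As a quick sanity check of the parameter counts in the summary above, the "(inputs + 1) x nodes" rule can be verified directly in R:

(784 + 1) * 256                # first hidden layer: 200960
(256 + 1) * 128                # second hidden layer: 32896
(128 + 1) * 64                 # third hidden layer: 8256
(64 + 1) * 10                  # output layer: 650
200960 + 32896 + 8256 + 650    # total: 242762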

Once a model is defined, we need to compile the model with a few other hyperparameters, including (1) the loss function, (2) the optimizer, and (3) the performance metrics. For multi-class classification problems, people usually use the categorical_crossentropy loss function and optimizer_rmsprop() as the optimizer, which performs mini-batch gradient descent with the RMSprop adjustment.

dnn_model %>% compile(
    loss = 'categorical_crossentropy',
    optimizer = optimizer_rmsprop(),
    metrics = c('accuracy')
)

Now we can feed the data (x and y) into the neural network that we just built to estimate all the parameters in the model. Here we define three hyperparameters for this model: epochs, batch_size, and validation_split. It just takes a couple of minutes to finish.

dnn_history <- dnn_model %>% fit(
    x_train, y_train,
    epochs = 15,
    batch_size = 128,
    validation_split = 0.2
)

There is some useful information stored in the output object dnn_history, and the details can be shown by using str(). We can plot the training and validation accuracy and loss as a function of epoch by simply calling plot(dnn_history).

str(dnn_history)

List of 2
 $ params :List of 8
  ..$ metrics           : chr [1:4] "loss" "acc" "val_loss" "val_acc"
  ..$ epochs            : int 15
  ..$ steps             : NULL
  ..$ do_validation     : logi TRUE
  ..$ samples           : int 48000
  ..$ batch_size        : int 128
  ..$ verbose           : int 1
  ..$ validation_samples: int 12000
 $ metrics:List of 4
  ..$ acc     : num [1:15] 0.83 0.929 0.945 0.954 0.959 ...
  ..$ loss    : num [1:15] 0.559 0.254 0.195 0.165 0.148 ...
  ..$ val_acc : num [1:15] 0.946 0.961 0.967 0.969 0.973 ...
  ..$ val_loss: num [1:15] 0.182 0.137 0.122 0.113 0.104 ...
 - attr(*, "class")= chr "keras_training_history"

plot(dnn_history)

12.2.7.3 Prediction

dnn_model %>% evaluate(x_test, y_test)

10000/10000 [==============================] - 1s 128us/step
$loss
[1] 0.093325

$acc
[1] 0.981

dnn_pred <- dnn_model %>% predict_classes(x_test)
head(dnn_pred)

[1] 7 2 1 0 4 1

Let's check a few misclassified images. The number of misclassified images can be found using the following code, and we can plot these misclassified images to see whether a human could read them correctly.

## total number of misclassified images
sum(dnn_pred != mnist$test$y)

[1] 190

missed_image = mnist$test$x[dnn_pred != mnist$test$y,,]
missed_digit = mnist$test$y[dnn_pred != mnist$test$y]
missed_pred = dnn_pred[dnn_pred != mnist$test$y]
index_image = 34

## change this index to see different image.
input_matrix <- missed_image[index_image, 1:28, 1:28]
output_matrix <- apply(input_matrix, 2, rev)
output_matrix <- t(output_matrix)
image(1:28, 1:28, output_matrix, col = gray.colors(256),
      xlab = paste("Image for digit ", missed_digit[index_image],
                   ", wrongly predicted as ", missed_pred[index_image]),
      ylab = "")

Now we finish this simple tutorial of using deep neural networks for handwritten digit recognition with the MNIST dataset. We illustrated how to reshape the original data into the right format and scale it; how to define a deep neural network with an arbitrary number of layers; how to choose the activation function, optimizer, and loss function; how to use dropout to limit overfitting; how to set up hyperparameters; and how to fit the model and use the fitted model to predict. Finally, we illustrated how to plot the accuracy/loss as functions of the epoch. It shows the end-to-end cycle of how to fit a deep neural network model.

On the other hand, images can be better handled by a Convolutional Neural Network (CNN), and we are going to walk through the exact same problem using CNN in the next section.

12.3 Convolutional Neural Network

There are some challenges in using a feedforward neural network to solve computer vision problems. One of the challenges is that the inputs can get really big after you vectorize the image array. A 64 x 64 x 3 image gives you an input vector of length 12288! And that is a very small image. So you can expect the number of parameters to grow fast, and it is difficult to get enough data to fit the model. Also, as the input image size grows, the computational requirements to train a feedforward neural network soon become infeasible. In addition, after vectorization, you lose most of the spatial information of the image. To overcome these issues, people instead use the convolutional neural network for computer vision problems.

This section introduces the Convolutional Neural Network (CNN), the deep learning model that is almost universally used in computer vision applications. Computer vision has been advancing rapidly, which enables many new applications such as self-driving cars and unlocking a phone using your face. The applications are not limited to the tech industry but extend to traditional industries such as agriculture and healthcare. Precision agriculture utilizes advanced hardware and computer vision algorithms to increase efficiency and reduce costs for farmers. For example, analyzing images from cameras in the greenhouse to track plant growth, or using sensory data from drones, satellites, and tractors to track soil conditions, detect weeds and pests, automate irrigation, etc. In health care, computer vision helps clinicians diagnose disease and identify cancer sites with high accuracy (Kwak and Hui, 2019). Even if you don't work on computer vision, you may find some of the ideas inspiring and borrow them for your own area.

Some popular computer vision problems are:

• Image classification (or image recognition): Recognize the object in the image. Is there a cat in the image? Who is this person?
• Object detection: Detect the position and border of a specific object. For example, if you are building a self-driving car, you need to know the positions of other cars around you.
• Neural Style Transfer (NST): Given a "content" image and a "style" image, generate an image that merges the two.

12.3.1 Convolution Layer

A fundamental building block of the convolutional neural network is, as the name indicates, the convolution operation. In this chapter, we illustrate the fundamentals of CNN using the example of image classification.

How do you do convolution? For example, you have a 5 x 5 2-d image (figure 12.11). You apply a 3 x 3 filter and convolve over the image. The output of this convolution operation is a 3 x 3 matrix, which you can consider as a 3 x 3 image and visualize (top right of figure 12.11).

FIGURE 12.11: There are an input image (left), a filter (middle), and an output image (right).

You start from the top left corner of the image, put the filter on the top left 3 x 3 sub-matrix of the input image, and take the element-wise product. Then you add up the 9 numbers. Move forward one step each time until you get to the bottom right. The detailed process is shown in figure 12.12.

Let's use edge detection as an example to see how the convolution operation works. Given a picture as in the left of figure 12.13, you want to detect the vertical lines. For example, there are vertical lines along the hair and the edges of the bookcase. How do you do that?

FIGURE 12.12: Convolution step by step

There are standard filters for operations like blurring, sharpening, and edge detection. To get the edges, you can use the following 3 x 3 filter to convolve over the image.

kernel_vertical = matrix(c(1, 1, 1, 0, 0, 0, -1, -1, -1),
                         nrow = 3, ncol = 3)
kernel_vertical

##      [,1] [,2] [,3]
## [1,]    1    0   -1
## [2,]    1    0   -1
## [3,]    1    0   -1

The following code implements the convolution process. The result is shown as the middle of figure 12.13.

image = magick::image_read("http://bit.ly/2Nh5ANX")
kernel_vertical = matrix(c(1, 1, 1, 0, 0, 0, -1, -1, -1),
                         nrow = 3, ncol = 3)
kernel_horizontal = matrix(c(1, 1, 1, 0, 0, 0, -1, -1, -1),
                           nrow = 3, ncol = 3, byrow = T)

image_edge_vertical = magick::image_convolve(image, kernel_vertical)
image_edge_horizontal = magick::image_convolve(image, kernel_horizontal)

par(mfrow = c(1, 3))
plot(image)
plot(image_edge_vertical)
plot(image_edge_horizontal)

FIGURE 12.13: Edge Detection Example

Why can kernel_vertical detect vertical edges? Let's look at a simpler example. Consider the following 8 x 8 matrix where half of the matrix is 200 and the other half is 0. The corresponding image is shown as the left of figure 12.14.

input_image = matrix(rep(c(200, 200, 200, 200, 0, 0, 0, 0), 8),
                     nrow = 8, byrow = T)
input_image

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]  200  200  200  200    0    0    0    0
## [2,]  200  200  200  200    0    0    0    0
## [3,]  200  200  200  200    0    0    0    0
## [4,]  200  200  200  200    0    0    0    0
## [5,]  200  200  200  200    0    0    0    0
## [6,]  200  200  200  200    0    0    0    0
## [7,]  200  200  200  200    0    0    0    0
## [8,]  200  200  200  200    0    0    0    0

If we use the above filter kernel_vertical, the output matrix is shown below, and the corresponding image is shown as the right of figure 12.14.

output_image = matrix(rep(c(0, 0, 200, 200, 0, 0), 6),
                      nrow = 6, byrow = T)
output_image

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    0    0  200  200    0    0
## [2,]    0    0  200  200    0    0
## [3,]    0    0  200  200    0    0
## [4,]    0    0  200  200    0    0
## [5,]    0    0  200  200    0    0
## [6,]    0    0  200  200    0    0

FIGURE 12.14: Simple Edge Detection Example

So the output image has a lighter region in the middle that corresponds to the vertical edge of the input image. When the input image is large, such as the image in figure 12.13, which is 1020 x 711, the edge will not seem as thick as it is in this small example. To detect the horizontal edges, you only need to rotate the filter by 90 degrees. The right image in figure 12.13 shows the horizontal edge detection result. You can see how the convolution operator detects a specific feature from the image.

The parameters for the convolution operation are the elements in the filter. For a 3x3 filter shown below, the parameters to estimate are $w_1$ to $w_9$. So far, we have moved the filter one step at a time when we convolve. You can move more than 1 step as well. For example, you can hop 2 steps each time after the sum of the element-wise product. This is called strided convolution. Using stride $s$ means the output is downsized by a factor of $s$. It is rarely used in practice but it is good to be familiar with the concept.

FIGURE 12.15: Convolution Layer Parameters

12.3.2 Padding Layer

Assume the stride is $s$ and we have an $n \times n$ input image to convolve with an $f \times f$ filter; the output image is $(\frac{n-f}{s}+1) \times (\frac{n-f}{s}+1)$. After each convolution, the dimension of the output shrinks. Depending on the size of the input image, the output size may get too small after a few rounds. Also, the pixels at the corners are used less than the pixels in the middle, so some information in the image is overlooked. To overcome these problems, you can add a padding layer during the process. To keep the output the same size as the input image in figure 12.15, you can pad two pixels on each side of the image with 0 (figure 12.16).

If the stride is $s$ and we have an $n \times n$ input image to convolve with an $f \times f$ filter, and this time we pad $p$ pixels on each side, then the output size becomes $(\frac{n+2p-f}{s}+1) \times (\frac{n+2p-f}{s}+1)$.

FIGURE 12.16: Padding Layer

You can specify the value for p and also the pixel value used. Or you can just use 0 to pad and make the output the same size as the input.
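The output-size formula above is easy to wrap in a small helper function. This is a sketch we add for illustration; it is not a function from keras, and the function name is our own.

conv_output_size <- function(n, f, p = 0, s = 1) {
    (n + 2 * p - f) / s + 1
}
conv_output_size(5, 3)          # the 5 x 5 image with a 3 x 3 filter -> 3
conv_output_size(5, 3, p = 1)   # padding one pixel on each side keeps the output at 5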

12.3.3 Pooling Layer

People sometimes use a pooling layer to reduce the size of the representation and make some of the feature detection more robust. If you have a 4×4 input, the max and mean pooling operations are shown in figure 12.17. The process is quite simple. In the example, the filter is 2 × 2 and the stride is 2, so we break the input into four 2 × 2 regions (shown in the figure with different shaded colors). For max pooling, each of the outputs is the maximum from the corresponding shaded sub-region. The mean pooling layer works in the same way, except it takes the mean instead of the maximum of the sub-region. The pooling layer has hyperparameters ($f$ and $s$) but it has no parameters for gradient descent to learn.

Let's go through an example of pooling a 2D grayscale image. Hopefully it gives you some intuition about what pooling does. Read the image and convert the original color image (a 3D array) to grayscale (a 2D matrix).

FIGURE 12.17: Pooling Layer

library(EBImage)
library(dplyr)

eggshell <- readImage("https://scientistcafe.com/images/eggshell.jpeg") %>%
    # make it smaller
    resize(560, 420) %>%
    # rotate image
    rotate(90)

# convert to 2D grayscale
gray_eggshell = apply(eggshell, c(1, 2), mean)

The following function takes an image matrix or array and applies the pooling operation.

pooling <- function(type = "max", image, filter, stride) {
    f <- filter
    s <- stride

    if (length(dim(image)) == 3) {
        # get image dimensions
        col <- dim(image[, , 1])[2]
        row <- dim(image[, , 1])[1]
        # calculate new dimension size
        c <- (col - f)/s + 1
        r <- (row - f)/s + 1
        # create new image object
        newImage <- array(0, c(c, r, 3))
        # loops in RGB layers
        for (rgb in 1:3) {
            m <- image[, , rgb]
            m3 <- matrix(0, ncol = c, nrow = r)
            i <- 1
            if (type == "mean")
                for (ii in 1:r) {
                    j <- 1
                    for (jj in 1:c) {
                        m3[ii, jj] <- mean(as.numeric(m[i:(i + (f - 1)),
                                                        j:(j + (f - 1))]))
                        j <- j + s
                    }
                    i <- i + s
                }
            else
                for (ii in 1:r) {
                    j <- 1
                    for (jj in 1:c) {
                        m3[ii, jj] <- max(as.numeric(m[i:(i + (f - 1)),
                                                       j:(j + (f - 1))]))
                        j <- j + s
                    }
                    i <- i + s
                }
            newImage[, , rgb] <- m3
        }
    } else if (length(dim(image)) == 2) {

        # get image dimensions
        col <- dim(image)[2]
        row <- dim(image)[1]
        # calculate new dimension size
        c <- (col - f)/s + 1
        r <- (row - f)/s + 1
        m3 <- matrix(0, ncol = c, nrow = r)
        i <- 1
        if (type == "mean")
            for (ii in 1:r) {
                j <- 1
                for (jj in 1:c) {
                    m3[ii, jj] <- mean(as.numeric(image[i:(i + (f - 1)),
                                                        j:(j + (f - 1))]))
                    j <- j + s
                }
                i <- i + s
            }
        else
            for (ii in 1:r) {
                j <- 1
                for (jj in 1:c) {
                    m3[ii, jj] <- max(as.numeric(image[i:(i + (f - 1)),
                                                       j:(j + (f - 1))]))
                    j <- j + s
                }
                i <- i + s
            }
        newImage <- m3
    }
    return(newImage)
}

Let's apply both max and mean pooling with filter size 10 ($f = 10$) and stride 10 ($s = 10$).

gray_eggshell_max = pooling(type = "max", image = gray_eggshell,
                            filter = 10, stride = 10)
gray_eggshell_mean = pooling(type = "mean", image = gray_eggshell,
                             filter = 10, stride = 10)

You can see the result by plotting the output images (figure 12.18). The top left is the original color picture. The top right is the 2D grayscale picture. The bottom left is the result of max pooling. The bottom right is the result of mean pooling. Max-pooling gives you the value of the largest pixel, and mean-pooling gives the average of the patch. You can consider it as a representation of features, looking at the maximal or average presence of different features. In general, max-pooling works better. You can gain some intuition from the example (figure 12.18): max-pooling "picks" the more distinct features, while average-pooling blurs out features evenly.

par(mfrow = c(2, 2), oma = c(1, 1, 1, 1))
plot(eggshell)
plot(as.Image(gray_eggshell))
plot(as.Image(gray_eggshell_max))
plot(as.Image(gray_eggshell_mean))

FIGURE 12.18: Example of max and mean pooling

12.3.4 Convolution Over Volume

So far, we have shown different types of layers on 2D inputs. If you have a 3D input (such as a color image), then the filters have 3 channels too. For example, if you have a 6 × 6 color image, the input dimension is 6 × 6 × 3. We call these the height, width, and number of channels. The filter itself has 3 channels corresponding to the red, green, and blue channels of the input. You can consider it as a 3D cube with 27 parameters. Apply each channel of the filter to the corresponding channel of the input: multiply each of the 27 numbers with the corresponding numbers from the top left region of the color input image and add them up. Add a bias parameter and apply an activation function, which gives you the first number of the output image. Then slide the filter over to calculate the next one. The final output is 2D, 4 × 4. If you want to detect features in the red channel only, you can use a filter whose second and third channels are all 0s. With different choices of the parameters, you can get different feature detectors.

You can use more than one filter, and each filter has multiple channels. For example, you can use one 3 × 3 × 3 filter to detect the horizontal edge and another to detect the vertical edge. Figure 12.19 shows an example of one layer with two filters. Each filter has a dimension of 3 × 3 × 3. The output dimension is 4 × 4 × 2. The output has two channels because we have two filters on the layer. The total number of parameters is 56 (each filter has one bias parameter b).

We use the following notations for layer $l$:

FIGURE 12.19: Example of convolution over volume

• Use $n_W^{[l]}$, $n_H^{[l]}$ and $n_C^{[l]}$ to denote the width, height and number of channels of the output of layer $l$
• $f^{[l]}$ is the filter size
• $p^{[l]}$ is the padding size
• $s^{[l]}$ is the stride

The number of filters for layer $l$ equals the number of channels of the output of layer $l$, i.e. $n_C^{[l]}$; since the output of layer $l$ is the input of layer $l+1$, it is also the number of input channels of layer $l+1$. The number of channels of the filter of layer $l$ should be equal to the number of channels of the input of layer $l$, which is the output of layer $l-1$ (i.e. $n_C^{[l-1]}$). So the dimensions of some key elements of layer $l$ are:

• Each filter: $f^{[l]} \times f^{[l]} \times n_C^{[l-1]}$
• Activations: $a^{[l]} \to n_H^{[l]} \times n_W^{[l]} \times n_C^{[l]}$
• Weights: $f^{[l]} \times f^{[l]} \times n_C^{[l-1]} \times n_C^{[l]}$
• Bias: $b^{[l]}$, one per filter, i.e. $n_C^{[l]}$ in total
• Input: $n_H^{[l-1]} \times n_W^{[l-1]} \times n_C^{[l-1]}$
• Output: $n_H^{[l]} \times n_W^{[l]} \times n_C^{[l]}$

After a series of 3D convolutional layers, we need to 'flatten' the 3D tensor to a 1D tensor, and add one or several dense layers to connect the output to the response variable.

Now you know the basic building blocks of CNN. Let's look at how to use the keras R package to solve the same handwritten digits image recognition problem as in section 12.2.7. You will see that the CNN is better at handling the image recognition problem.
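As a quick check of the parameter count for the two-filter example above, the arithmetic can be done directly in R (a sketch we add for illustration):

n_filters <- 2
weights_per_filter <- 3 * 3 * 3          # 27 weights in each 3 x 3 x 3 filter
n_filters * (weights_per_filter + 1)     # plus one bias per filter: 56 in total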

12.3.5 Image Recognition Using CNN

CNN leverages the relationship among neighboring pixels in the 2D image for better performance. It also avoids generating thousands or millions of features for high-resolution, full-color images. Now let's import the MNIST dataset again, as we did some preprocessing specifically for FFNN before and CNN requires different preprocessing steps. Let's start with a few parameters to be used later.

# Load the mnist data's training and testing datasets
mnist <- dataset_mnist()
x_train <- mnist$train$x
y_train <- mnist$train$y
x_test <- mnist$test$x
y_test <- mnist$test$y

# Define a few parameters to be used in the CNN model
batch_size <- 128
num_classes <- 10
epochs <- 10

# Input image dimensions
img_rows <- 28
img_cols <- 28

12.3.5.1 Data preprocessing

For CNN, the input is an $n_H \times n_W \times n_C$ 3D array with $n_C$ channels. For example, a greyscale $n_H \times n_W$ image has only one channel, and the input is an $n_H \times n_W \times 1$ tensor. An $n_H \times n_W$ 8-bit-per-channel RGB image has three channels, i.e., three $n_H \times n_W$ arrays with values between 0 and 255, so the input is an $n_H \times n_W \times 3$ tensor. For the problem that we have now, the image is greyscale, but we need to specifically define that there is one channel by reshaping the 2D array into a 3D tensor using array_reshape(). The input_shape variable will be used in the CNN model later. For an RGB color image, the number of channels is 3, and we need to replace "1" with "3" in the code cell below if the input image is in RGB format.

x_train <- array_reshape(x_train, c(nrow(x_train), img_rows, img_cols, 1))
x_test <- array_reshape(x_test, c(nrow(x_test), img_rows, img_cols, 1))
input_shape <- c(img_rows, img_cols, 1)

Here is the structure of the reshaped images: the first dimension is the image index, and dimensions 2-4 form a 3D tensor, even though there is only one channel.

str(x_train)

int [1:60000, 1:28, 1:28, 1] 0 0 0 0 0 0 0 0 0 0 ...

Same as the FFNN model, we scale the input values to be between 0 and 1 for the same numerical stability consideration in the optimization process.

x_train <- x_train / 255
x_test <- x_test / 255

Encode the response variable to binary vectors.

# Convert class vectors to binary class matrices
y_train <- to_categorical(y_train, num_classes)
y_test <- to_categorical(y_test, num_classes)

12.3.5.2 Fit model

A CNN model contains a series of 3D convolutional layers, each of which has a few parameters:

(1) the kernel_size, which is typically 3x3 or 5x5;
(2) the number of filters, which is equal to the number of channels of the output;
(3) the activation function.

For the first layer, there is also an input_shape parameter, which is the input image size and channel. To prevent overfitting and speed up computation, a pooling layer is usually applied after one or a few convolutional layers. A maximum pooling layer with pool_size = 2x2 reduces the size by half. Dropout can be used as well, in addition to pooling neighboring values. After a few 3D convolutional layers, we also need to 'flatten' the 3D tensor output into a 1D tensor, and then add one or a couple of dense layers to connect the output to the target response classes.

Let's define the CNN model structure. We define a CNN model with two convolutional layers, two max-pooling layers, and two dropout layers to mitigate overfitting. There are three dense layers after flattening the 3D tensor. The last layer is a dense layer that connects to the response.

# define model structure
cnn_model <- keras_model_sequential() %>%
    layer_conv_2d(filters = 32, kernel_size = c(5, 5),
                  activation = 'relu',
                  input_shape = input_shape) %>%
    layer_max_pooling_2d(pool_size = c(2, 2)) %>%
    layer_conv_2d(filters = 64, kernel_size = c(5, 5),
                  activation = 'relu') %>%
    layer_max_pooling_2d(pool_size = c(2, 2)) %>%
    layer_dropout(rate = 0.2) %>%
    layer_flatten() %>%
    layer_dense(units = 120, activation = 'relu') %>%
    layer_dropout(rate = 0.5) %>%
    layer_dense(units = 84, activation = 'relu') %>%
    layer_dense(units = num_classes, activation = 'softmax')

summary(cnn_model)

Similar to before, we need to compile the defined CNN model.

cnn_model %>% compile(
    loss = loss_categorical_crossentropy,
    optimizer = optimizer_adadelta(),
    metrics = c('accuracy')
)

Train the model and save each epoch's history. Please note, as we are not using a GPU, it takes a few minutes to finish. Please be patient while waiting for the results. The training time can be significantly reduced if running on a GPU.

cnn_history <- cnn_model %>% fit(
    x_train, y_train,
    batch_size = batch_size,
    epochs = epochs,
    validation_split = 0.2
)

The trained model's accuracy can be evaluated on the testing dataset, and it is pretty good.

cnn_model %>% evaluate(x_test, y_test)

##       loss   accuracy
## 0.02153786 0.99290001

12.3.5.3 Prediction

We can apply the trained model to predict new images.

# model prediction
cnn_pred <- cnn_model %>% predict_classes(x_test)
head(cnn_pred)

[1] 7 2 1 0 4 1

Now let’s check a few misclassified images to see whether a human can do a better job than this simple CNN model.

## number of misclassified images
sum(cnn_pred != mnist$test$y)

[1] 72

missed_image = mnist$test$x[cnn_pred != mnist$test$y,,]
missed_digit = mnist$test$y[cnn_pred != mnist$test$y]
missed_pred = cnn_pred[cnn_pred != mnist$test$y]

index_image = 10 ## change this index to see different image.
input_matrix <- missed_image[index_image, 1:28, 1:28]
output_matrix <- apply(input_matrix, 2, rev)
output_matrix <- t(output_matrix)
image(1:28, 1:28, output_matrix, col = gray.colors(256),
      xlab = paste('Image for digit ', missed_digit[index_image],
                   ', wrongly predicted as ', missed_pred[index_image]),
      ylab = "")

12.4 Recurrent Neural Network

Traditional neural networks don't have a framework that can handle sequential events where the later events are based on the previous ones. For example, mapping an input audio clip to a text transcript where the input is voice over time, and the output is the corresponding sequence of words over time. A Recurrent Neural Network (RNN) is a deep-learning model that can process this type of sequential data.

The recurrent neural network allows information to flow from one step to the next with a repetitive structure. Figure 12.20 shows the basic chunk of an RNN network. You combine the activated neurons from the previous step with the current input $x^{<t>}$ to produce an output $\hat{y}^{<t>}$ and updated activated neurons to support the next input at $t+1$. So the whole process repeats a similar pattern. If we unroll the loop (figure 12.21), the chain-like recurrent nature makes it the natural architecture for sequential data. There has been incredible success applying RNNs to this type of problem:

• Machine translation
• Voice recognition

FIGURE 12.20: Recurrent Neural Network Unit

FIGURE 12.21: An Unrolled Recurrent Neural Network

• Music generation
• Sentiment analysis

A trained CNN accepts a fixed-sized vector as input (such as a 28 × 28 image) and produces a fixed-sized vector as output (such as the probabilities of being one of the ten digits). RNN has a much more flexible structure: it can operate over sequences of vectors and produce sequences of outputs, and they can vary in size. To understand what that means, let's look at some RNN structure examples.

The rectangle represents a vector and the arrow represents matrix multiplications. The input vector is in green and the output vector is in blue. The red rectangle holds the intermediate state. From left to right:

• one-to-one: the model takes a fixed-size input and produces a fixed-size output, such as CNN. It is not sequential.
• one-to-many: the model takes one input and generates a sequence of outputs, such as in music generation.
• many-to-one: the model takes a sequence of inputs and produces a single output, such as in sentiment analysis.
• many-to-many: the model takes a sequence of inputs and produces a sequence of outputs. The input size can be the same as the output size (such as in named entity recognition), or it can be different (such as in machine translation).

12.4.1 RNN Model

To further understand the RNN model, let's look at an entity recognition example. Assume you want to build a sequence model to recognize the company or computer language names in a sentence like this: "Use Netlify and Hugo". It is a name recognition problem, which is used, for example, by research companies to index different company names in articles. For tasks like this, we need a model that can learn and share the learning across different texts. The position of a word carries important information about the word. For example, the word before "to" is more likely to be a verb than a noun. Such models are also used in materials science to tag chemicals mentioned in the most recent journals to find any indication of the next research topic.

Given an input sentence x, you want a model to produce one output for each word in x that tells you if that word is the name for something. So in this example, the input is a sequence of 5 words, including the period at the end. The output is a sequence of 0/1 with the same length that indicates whether the input word is a name (1) or not (0). We use superscript $<t>$ to denote the element position of input and output; use superscript $(i)$ to denote the $i^{th}$ sample (you will have different sentences in the training data); and use $T_x^{(i)}$ to denote the length of the $i^{th}$ input, and $T_y^{(i)}$ for the output. In this case, $T_x^{(i)}$ is equal to $T_y^{(i)}$.

Before we build a neural network, we need to decide on a way to represent individual words in numbers. What should $x^{<1>}$ be? In practice, people use word embedding, which we will discuss in a later section. Here, for illustration, we use the one-hot encoding word representation. Assume we have a dictionary of 10,000 unique words. You can build the dictionary by finding the top 10,000 occurring words in your training set. Each word in your training set will have a position in the dictionary sequence. For example, "use" is the 8320th element of the dictionary sequence. So $x^{<1>}$ is a vector with all zeros except for a one at position 8320. Using this one-hot representation, each input $x^{<t>}$ is a vector with all zeros except for one element.

Given this representation of the input words, the goal is to learn a sequence model that maps the input words to output y, indicating if the word is an entity (1) or not (0). Let us build a one-layer recurrent neural network. The model starts from the first word "use" ($x^{<1>}$) and builds a neural network to predict the output. To start the process, we also need to initialize the activation at time 0, $a^{<0>}$. The most common choice is to use a vector of zeros. The common activation function for the intermediate layer is the hyperbolic tangent function (tanh). RNN uses other methods to prevent the vanishing gradient problem, discussed in section 12.4.3. Similar to FFNN, the output layer activation function depends on the output type. The current example is a binary classification, so we use the sigmoid function ($\sigma$).

$$a^{<0>} = 0; \quad a^{<1>} = \tanh(W_{aa}a^{<0>} + W_{ax}x^{<1>} + b_a)$$
$$\hat{y}^{<1>} = \sigma(W_{ya}a^{<1>} + b_y)$$

And when it takes the second word $x^{<2>}$, it also gets information from the previous step using the non-activated neurons.

$$a^{<2>} = \tanh(W_{aa}a^{<1>} + W_{ax}x^{<2>} + b_a)$$
$$\hat{y}^{<2>} = \sigma(W_{ya}a^{<2>} + b_y)$$

For the $t^{th}$ word:

$$a^{<t>} = \tanh(W_{aa}a^{<t-1>} + W_{ax}x^{<t>} + b_a)$$
$$\hat{y}^{<t>} = \sigma(W_{ya}a^{<t>} + b_y)$$

The information flows from one step to the next with a repetitive structure until the last time step input $x^{<T_x>}$, and then it outputs $\hat{y}^{<T_y>}$. In this example, $T_x = T_y$. The architecture changes when $T_x$ and $T_y$ are not the same. The model shares the parameters $W_{ya}$, $W_{aa}$, $W_{ax}$, $b_a$, $b_y$ for all time steps of the input.
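Here is a minimal sketch of one forward step of this recurrent unit in base R, just to make the recursion concrete. The function name, matrix names, and the tiny dimensions below are our own illustrative assumptions.

sigmoid <- function(z) 1 / (1 + exp(-z))

rnn_cell_forward <- function(x_t, a_prev, W_aa, W_ax, W_ya, b_a, b_y) {
    a_t <- tanh(W_aa %*% a_prev + W_ax %*% x_t + b_a)   # hidden activation
    y_t <- sigmoid(W_ya %*% a_t + b_y)                  # binary output
    list(a = a_t, y_hat = y_t)
}

# tiny illustrative dimensions: 3 hidden units, 5 input features
set.seed(1)
W_aa <- matrix(rnorm(3 * 3), 3); W_ax <- matrix(rnorm(3 * 5), 3)
W_ya <- matrix(rnorm(1 * 3), 1); b_a <- rnorm(3); b_y <- rnorm(1)
step1 <- rnn_cell_forward(x_t = rnorm(5), a_prev = rep(0, 3),
                          W_aa, W_ax, W_ya, b_a, b_y)
step2 <- rnn_cell_forward(x_t = rnorm(5), a_prev = step1$a,
                          W_aa, W_ax, W_ya, b_a, b_y)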

Calculate the loss function:

$$L^{<t>}(\hat{y}^{<t>}, y^{<t>}) = -y^{<t>}\log(\hat{y}^{<t>}) - (1-y^{<t>})\log(1-\hat{y}^{<t>})$$

$$L(\hat{y}, y) = \sum_{t=1}^{T_y} L^{<t>}(\hat{y}^{<t>}, y^{<t>})$$

The above defines the forward process. Same as before, the backward propagation computes the gradients for the parameters by the chain rule for differentiation.

In this RNN structure, the information only flows from left to right. So at any position, it only uses data from earlier in the sequence to make a prediction. It does not work when predicting the current word requires information from later words. For example, consider the following two sentences:

1. Do you like April Kepner in Grey’s Anatomy? 2. Do you like April in Los Angeles? It is not too hot.

Knowing just the first three words is not enough to tell whether the word "April" is part of a person's name. It is a person's name in sentence 1 but not in sentence 2, and the two sentences have the same first three words. In this case, we need a model that allows the information to flow in both directions. A bidirectional RNN takes data from both earlier and later in the sequence. The disadvantage is that it needs the entire word sequence to predict at any position. For a speech recognition application that requires capturing the speech in real-time, we need a more complex method called the attention model. We will not get into those models here. Deep Learning with R (Chollet and Allaire, 2018) provides a high-level introduction to bidirectional RNNs with applicable code. It teaches both the intuition and the practical, computational usage of deep learning models. For Python users, refer to Deep Learning with Python (Chollet, 2017). A standard text with heavy mathematics is Deep Learning (Goodfellow et al., 2016).

12.4.2 Word Embedding

So far, we have been using one-hot encoding to represent the words. This representation is sparse and doesn't capture the relationship between words. For example, if your model learns from the training data that the word after "pumpkin" in the first sentence is "pie," can it fill the second sentence's blank with "pie"?

1. [training data] My favorite Christmas dessert is pumpkin pie. 2. [testing data] My favorite Christmas dessert is apple ____.

The algorithm can not learn the relationship between "pumpkin" and "apple" from the distance between the one-hot encodings of the two words. If we can find a way to create features to represent the words, we may teach the model to learn that pumpkin and apple are food, and that the distance between the feature representations of these two words is closer than, for example, that between "apple" and "nurse." So when the model sees "apple" in a new sentence and needs to predict the word after, it is more likely to choose "pie" since it saw "pumpkin pie" in the training data. The idea of word embedding is to learn a set of features to represent words. For example, you can score each word in the dictionary according to a group of features like this:

The word "male" has a score of -1 for the "gender" feature, and "female" has a score of 1. Both "apple" and "pumpkin" have a high score for the "food" feature and much lower scores for the rest. You can set the number of features to learn, usually more than what we list in the above figure. If you use 200 features to represent the words, then the learned embedding for each word is a vector with a length of 200.

For language-related applications, text embedding is the most critical step. It converts raw text into a meaningful vector representation. Once we have a vector representation, it is easy to calculate typical numerical metrics such as cosine similarity. There are many pre-trained text embeddings available for us to use. We will briefly introduce some of these popular embeddings.

The first widely used embedding is word2vec. It was first introduced in 2013 and was trained on a large collection of text in an unsupervised way. Training the word2vec embedding vector uses bag-of-words or skip-gram. In the bag-of-words architecture, the model predicts the current word based on a window of surrounding context words. In the skip-gram architecture, the model uses the current word to predict the surrounding window of context words. There are pre-trained word2vec embeddings based on a large amount of text (such as wiki pages, news reports, etc.) for general applications.

GloVe (Global Vectors) embedding is an extension of word2vec and performs better. It uses a unique version of the square loss function. However, words are composites of meaningful components such as radicals: "eat" and "eaten" are different forms of the same word. Both word2vec and GloVe use word-level information, and they treat each word uniquely based on its context.

The fastText embedding was introduced to use the word's internal structure to make the process more efficient. It uses morphological information to extend the skip-gram model. New words that are not in the training data can be represented well. It also supports 150+ different languages. The above-mentioned embeddings (word2vec, GloVe, and fastText) do not consider the words' context (i.e., the same word has the same embedding vector). However, the same word may have different meanings in a different context. BERT (Bidirectional Encoder Representations from Transformers) was introduced to add context-level information in text-related applications. As of early 2021, BERT is generally considered the best language model for common application tasks.
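As an illustration of the cosine similarity mentioned above, here is a minimal sketch on hypothetical embedding vectors; the vectors and their values are made up and are not from any trained embedding.

cosine_similarity <- function(u, v) {
    sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))
}

# hypothetical 4-dimensional embeddings
apple   <- c(0.1, 0.9, 0.1, 0.2)
pumpkin <- c(0.2, 0.8, 0.1, 0.1)
nurse   <- c(0.9, 0.1, 0.8, 0.1)

cosine_similarity(apple, pumpkin)   # close to 1
cosine_similarity(apple, nurse)     # noticeably smaller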

12.4.3 Long Short Term Memory

The sequence in an RNN can be very long, which leads to the vanishing gradient problem even when the RNN network is not deep. Think about the following examples:

1. The girl walked away, sat down in the shade of a tree, and began to read a new book which she bought the day before. 2. The boy walked away, sat down in the shade of a tree, and began to read a new book which he bought the day before.

For sentence 1, you need to use "she" in the adjective clause after "which" because it is a girl. For sentence 2, you need to use "he" because it is a boy. This is a long-term dependency example where the information at the beginning can affect what needs to come much later in the sentence. RNN needs to forward propagate information from left to right and then backpropagate from right to left. It can be difficult for the error associated with the later part of the sequence to affect the optimization of the earlier part. So in practice, it means the model might fail to do the task mentioned above. People came up with different methods to mitigate this issue, such as Gated Recurrent Units (GRU) (Chung et al., 2014) and Long Short Term Memory Units (LSTM) (Hochreiter and Schmidhuber, 1997). The goal is to help the model memorize information from earlier in the sequence. We are going to walk through LSTM step by step.

The first step of LSTM is to decide what information to forget. This decision is made by the "forget gate", a sigmoid function ($\Gamma_f$). It looks at $a^{<t-1>}$ and $x^{<t>}$ and outputs a number between 0 and 1 for each number in the cell state $c^{<t-1>}$. A value of 1 means "completely remember the state", while 0 means "completely forget the state".

The next step is to decide what new information we’re going to add to the cell state. This step includes two parts:

1. input gate ($\Gamma_u$): a sigmoid function that decides how much we want to update;
2. a vector of new candidate values ($\tilde{c}^{<t>}$).

The multiplication of the above two parts, $\Gamma_u * \tilde{c}^{<t>}$, is the new candidate scaled by the input gate. We then combine the results we have so far to get the new cell state $c^{<t>}$.

Finally, we need to decide what we are going to output. The output is a filtered version of the new cell state $c^{<t>}$.
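Putting the gates together, here is a minimal sketch of one LSTM forward step in base R. The weight matrices and function name are hypothetical, and the output-gate step follows the standard LSTM formulation used above.

sigmoid <- function(z) 1 / (1 + exp(-z))

lstm_cell_forward <- function(x_t, a_prev, c_prev,
                              W_f, W_u, W_c, W_o, b_f, b_u, b_c, b_o) {
    z <- c(a_prev, x_t)                  # stack previous activation and current input
    gamma_f <- sigmoid(W_f %*% z + b_f)  # forget gate
    gamma_u <- sigmoid(W_u %*% z + b_u)  # input (update) gate
    c_tilde <- tanh(W_c %*% z + b_c)     # candidate values
    c_t <- gamma_f * c_prev + gamma_u * c_tilde
    gamma_o <- sigmoid(W_o %*% z + b_o)  # output gate
    a_t <- gamma_o * tanh(c_t)           # filtered version of the new cell state
    list(a = a_t, c = c_t)
}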

12.4.4 Sentiment Analysis Using RNN

In this section, we will walk through an example of text sentiment analysis using RNN. Refer to section 4.3 to set up an account, create a notebook (R or Python), and start a cluster. Refer to section 12.2.7 for package installation.

We will use the IMDB movie review data. It is one of the most used datasets for text-related machine learning methods. The dataset's inputs are movie reviews published at IMDB in their raw text format, and the output is a binary sentiment indicator ("1" for positive and "0" for negative) created through human evaluation. The training and testing data have 25,000 records each. Each review varies in length.

12.4.4.1 Data preprocessing

Machine learning algorithms can not deal with raw text, and we have to convert text into numbers before feeding it into an algorithm. Tokenization is one way to convert text data into a numerical representation. For example, suppose we have 500 unique words for all reviews in the training dataset. We can label each word by the rank (i.e., from 1 to 500) of its frequency in the training data. Then each word is replaced by an integer between 1 and 500. This way, we can map each movie review from its raw text format to a sequence of integers.

As reviews can have different lengths, the sequences of integers will have different sizes too. So another important step is to make sure each input has the same length by padding or truncating. For example, we can set a length of 50 words; for any review with fewer than 50 words, we can pad 0s to make it 50 in length, and for reviews with more than 50 words, we can truncate the sequence to 50 by keeping only the first 50 words. After padding and truncating, we have a typical data frame: each row is an observation, and each column is a feature. The number of features is the number of words designed for each review (i.e., 50 in this example). A small illustration of this tokenize-and-pad step is shown at the end of this subsection.

After tokenization, the numerical input is just a naive mapping of the original words, and the integers do not have their usual numerical meanings. We need to use embedding to convert these categorical integers to more meaningful representations. The word embedding captures the inherited relationship of words and dramatically reduces the input dimension (see section 12.4.2). The dimension is a vector space representing the entire vocabulary. It can be 128 or 256, and the vector space dimension stays the same when the vocabulary changes. It has a lower dimension, and each vector is filled with real numbers. The embedding vectors can be learned from the training data, or we can use pre-trained embedding models. There are many pre-trained embeddings for us to use, such as word2vec and BERT.
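As promised above, here is a minimal sketch of the tokenize-and-pad step on made-up toy reviews; the texts and parameter values are illustrative assumptions (the actual IMDB data used next comes pre-tokenized).

library(keras)

toy_reviews <- c("this movie was great",
                 "this movie was far too long and quite boring")

tok <- text_tokenizer(num_words = 100) %>%
    fit_text_tokenizer(toy_reviews)
seqs <- texts_to_sequences(tok, toy_reviews)

# pad or truncate every review to a length of 5
pad_sequences(seqs, maxlen = 5)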

12.4.4.2 R code for IMDB dataset

The IMDB dataset is preloaded for keras, and we can call dataset_imdb() to load a partially pre-processed version of it. We can set a few parameters in that function. num_words is the maximum number of unique words to keep: all unique words are ranked by their frequency counts in the training dataset, dataset_imdb() keeps only the top num_words words, and any other word is replaced with a default value of 2. Text is represented by integers, where the most frequent word is replaced by 3 and the values 0, 1, and 2 are reserved for "padding", "start of the sequence", and "unknown".

# load the keras package
library(keras)

# consider only the top 2,500 unique words in the dataset
max_unique_word <- 2500
# cut off reviews after 100 words
max_review_len <- 100

Now we load the IMDB dataset, and we can check the structure of the loaded object using the str() command.

my_imdb <- dataset_imdb(num_words = max_unique_word)
str(my_imdb)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz

17465344/17464789 [==============================] - 0s 0us/step
List of 2
 $ train:List of 2
  ..$ x:List of 25000
  .. ..$ : int [1:218] 1 14 22 16 43 530 973 1622 1385 65 ...
  .. ..$ : int [1:189] 1 194 1153 194 2 78 228 5 6 1463 ...

*** skipped some output ***

x_train <- my_imdb$train$x
y_train <- my_imdb$train$y
x_test <- my_imdb$test$x
y_test <- my_imdb$test$y

Next, we do the padding and truncating process.

x_train <- pad_sequences(x_train, maxlen = max_review_len)
x_test <- pad_sequences(x_test, maxlen = max_review_len)

The x_train and x_test objects are now numerical arrays ready to be used for recurrent neural network models.

Simple Recurrent Neural Network

Like the DNN and CNN models we trained before, RNN models are relatively easy to train using keras after the pre-processing stage. In the following example, we use layer_embedding() to fit an embedding layer based on the training dataset, with two parameters: input_dim (the number of unique words) and output_dim (the length of the dense vectors). Then, we add a simple RNN layer by calling layer_simple_rnn(), followed by a dense layer layer_dense() to connect to the binary response variable.

rnn_model <- keras_model_sequential()

rnn_model %>%
  layer_embedding(input_dim = max_unique_word, output_dim = 128) %>%
  layer_simple_rnn(units = 64, dropout = 0.2, recurrent_dropout = 0.2) %>%
  layer_dense(units = 1, activation = 'sigmoid')

We compile the RNN model by defining the loss function, the optimizer, and the metrics to track, the same way as for DNN and CNN models.

rnn_model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = c('accuracy')
)

Let us define a few more variables before fitting the model: batch_size, epochs, and validation_split. These variables have the same meaning as in the DNN and CNN models we have seen before.

batch_size = 128
epochs = 5
validation_split = 0.2

rnn_history <- rnn_model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs = epochs,
  validation_split = validation_split
)

Epoch 1/5

  1/157 [......] - ETA: 0s - loss: 0.7348 - accuracy: 0.4766
  2/157 [......] - ETA: 7s - loss: 0.7279 - accuracy: 0.4961
  3/157 [......] - ETA: 10s - loss: 0.7290 - accuracy: 0.4896

*** skipped some output ***

154/157 [======>.] - ETA: 0s - loss: 0.4456 - accuracy: 0.7991
155/157 [======>.] - ETA: 0s - loss: 0.4460 - accuracy: 0.7991
156/157 [======>.] - ETA: 0s - loss: 0.4457 - accuracy: 0.7995
157/157 [======] - 17s 109ms/step - loss: 0.4457 - accuracy: 0.7995 - val_loss: 0.4444 - val_accuracy: 0.7908

plot(rnn_history)

rnn_model %>% evaluate(x_test, y_test)

  1/782 [......] - ETA: 0s - loss: 0.3019 - accuracy: 0.8438
  8/782 [......] - ETA: 5s - loss: 0.4328 - accuracy: 0.8008
 15/782 [......] - ETA: 5s - loss: 0.4415 - accuracy: 0.7937
 23/782 [......] - ETA: 5s - loss: 0.4247 - accuracy: 0.8043

*** skipped some output ***

775/782 [======>.] - ETA: 0s - loss: 0.4371 - accuracy: 0.8007
782/782 [======] - ETA: 0s - loss: 0.4365 - accuracy: 0.8010
782/782 [======] - 6s 8ms/step - loss: 0.4365 - accuracy: 0.8010

     loss  accuracy
0.4365373 0.8010000

LSTM RNN Model

A simple RNN layer is a good starting point for learning RNN, but its performance is usually not that good because long-term dependencies are hard to learn due to the vanishing gradient problem. A Long Short Term Memory (LSTM) RNN model can carry useful information from earlier words to later words. In keras, it is easy to replace a simple RNN layer with an LSTM layer by using layer_lstm().

lstm_model <- keras_model_sequential()

lstm_model %>%
  layer_embedding(input_dim = max_unique_word, output_dim = 128) %>%
  layer_lstm(units = 64, dropout = 0.2, recurrent_dropout = 0.2) %>%
  layer_dense(units = 1, activation = 'sigmoid')

lstm_model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = c('accuracy')
)

batch_size = 128
epochs = 5
validation_split = 0.2

lstm_history <- lstm_model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs = epochs,
  validation_split = validation_split
)

Epoch 1/5

  1/157 [......] - ETA: 0s - loss: 0.6939 - accuracy: 0.4766
  2/157 [......] - ETA: 31s - loss: 0.6940 - accuracy: 0.4766
  3/157 [......] - ETA: 43s - loss: 0.6937 - accuracy: 0.4896

*** skipped some output ***

155/157 [======>.] - ETA: 0s - loss: 0.2610 - accuracy: 0.8918
156/157 [======>.] - ETA: 0s - loss: 0.2607 - accuracy: 0.8920
157/157 [======] - ETA: 0s - loss: 0.2609 - accuracy: 0.8917
157/157 [======] - 67s 424ms/step - loss: 0.2609 - accuracy: 0.8917 - val_loss: 0.3754 - val_accuracy: 0.8328

plot(lstm_history)

lstm_model %>% evaluate(x_test, y_test)

  1/782 [......] - ETA: 0s - loss: 0.2332 - accuracy: 0.9062
  4/782 [......] - ETA: 12s - loss: 0.3536 - accuracy: 0.8750
  7/782 [......] - ETA: 14s - loss: 0.3409 - accuracy: 0.8705
 10/782 [......] - ETA: 14s - loss: 0.3508 - accuracy: 0.8625

*** skipped some output ***

775/782 [======>.] - ETA: 0s - loss: 0.3640 - accuracy: 0.8415
778/782 [======>.] - ETA: 0s - loss: 0.3637 - accuracy: 0.8417
780/782 [======>.] - ETA: 0s - loss: 0.3632 - accuracy: 0.8419
782/782 [======] - 18s 22ms/step - loss: 0.3631 - accuracy: 0.8420

     loss  accuracy
0.3631141 0.8420000

This simple example shows that LSTM's performance improves dramatically over the simple RNN model. The computation time for LSTM is roughly double that of the simple RNN model for this small dataset.
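To put the two models side by side (a small sketch, assuming both fitted models from above are still in memory; evaluate() returns the tracked metrics, which we flatten with unlist() for easy comparison):

# compare test-set loss and accuracy of the two models fitted above
rbind(
  simple_rnn = unlist(rnn_model %>% evaluate(x_test, y_test, verbose = 0)),
  lstm       = unlist(lstm_model %>% evaluate(x_test, y_test, verbose = 0))
)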

A Handling Large Local Data

When the data is too large to fit in a computer's memory, we can use a big data analytics engine such as Spark on a cloud platform (see Chapter 4). However, even when the data fits in memory, it may still be slow to read and manipulate due to its relatively large size. Some R packages can make the process faster, at the cost of learning some less familiar syntax, especially for data wrangling. But they avoid the hurdle of setting up a Spark cluster and working in an unfamiliar environment. This appendix presents some alternative R packages to read, write, and wrangle a data set that is relatively large but not too big to fit in memory. Load the R packages first:

# install the packages from CRAN if you haven't
library(readr)
library(data.table)

A.1 readr

You must be familiar with read.csv(), read.table() and write.csv() in base R. Here we will introduce a more efficient package for reading and writing data: the readr package. The corresponding functions are read_csv(), read_table() and write_csv(). The commands look quite similar, but readr is different in the following respects:

1. It is 10x faster. The trick is that readr uses C++ to process the data quickly.


2. It doesn’t change the column names. The names can start with a number and “.” will not be substituted to “_”. For example:

read_csv("2015,2016,2017 1,2,3 4,5,6")

## # A tibble: 2 x 3
##   `2015` `2016` `2017`
##    <dbl>  <dbl>  <dbl>
## 1      1      2      3
## 2      4      5      6

3. readr functions do not convert strings to factors by default, are able to parse dates and times, and can automatically determine the data type of each column (see the short sketch after this list).

4. The killer feature, in my opinion, is that readr provides a progress bar. What makes you feel worse than waiting is not knowing how long you have to wait.
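For instance (a minimal sketch with made-up inline data), readr guesses column types automatically and parses ISO dates without turning strings into factors:

dat <- read_csv("date,flag,label
2016-01-01,TRUE,a
2016-02-01,FALSE,b")
# `date` is parsed as a date column, `flag` as logical, and
# `label` stays character (not a factor)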

The major function of readr is to turn flat files into data frames:

• read_csv(): reads comma delimited files
• read_csv2(): reads semicolon separated files (common in countries where "," is used as the decimal separator)
• read_tsv(): reads tab delimited files
• read_delim(): reads in files with any delimiter
• read_fwf(): reads fixed width files. You can specify fields either by their widths with fwf_widths() or by their position with fwf_positions() (see the short sketch after this list)

• read_table(): reads a common variation of fixed width files where columns are separated by white space

• read_log(): reads Apache style log files

The good thing is that these functions have similar syntax. Once you learn one, the others become easy. Here we will focus on read_csv().
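As promised above, here is a quick read_fwf() sketch (made-up inline data; it relies on the readr convention, also used below, that a string containing a newline is treated as literal data):

# two fixed-width fields: a 10-character name and a 2-character age
dat <- read_fwf(
    "alice     25\nbob       31\n",
    fwf_widths(c(10, 2), c("name", "age")))
print(dat)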

The most important information for read_csv() is the path to your data:

sim.dat <- read_csv("http://bit.ly/2P5gTw4")
head(sim.dat)

# A tibble: 6 x 19
    age gender income house store_exp online_exp store_trans online_trans    Q1
1    57 Female 1.21e5 Yes        529.       304.           2            2     4
2    63 Female 1.22e5 Yes        478.       110.           4            2     4
3    59 Male   1.14e5 Yes        491.       279.           7            2     5
4    60 Male   1.14e5 Yes        348.       142.          10            2     5
5    51 Male   1.24e5 Yes        380.       112.           4            4     4
6    59 Male   1.08e5 Yes        338.       196.           4            5     4
# ... with 10 more variables: Q2 <dbl>, Q3 <dbl>, Q4 <dbl>, Q5 <dbl>,
#   Q6 <dbl>, Q7 <dbl>, Q8 <dbl>, Q9 <dbl>, Q10 <dbl>, segment <chr>

The function reads the file into R as a tibble. You can consider a tibble as the next iteration of the data frame. Tibbles differ from data frames in the following respects:

• It never changes an input's type (i.e., no more stringsAsFactors = FALSE!)
• It never adjusts the names of variables
• It has a refined print method that shows only the first 10 rows and all the columns that fit on the screen. You can also control the default print behavior by setting options.

Refer to http://r4ds.had.co.nz/tibbles.html for more information about ‘tibble’.

When you run read_csv(), it prints out a column specification that gives the name and type of each column. To better understand how readr works, it is helpful to type in a baby data set and check the results:

dat <- read_csv("2015,2016,2017
100,200,300
canola,soybean,corn")

print(dat)

## # A tibble: 2 x 3
##   `2015`  `2016`  `2017`
##   <chr>   <chr>   <chr>
## 1 100     200     300
## 2 canola  soybean corn

You can also add comments at the top and tell R to skip those lines:

dat <- read_csv("# I will never let you know that
# my favorite food is carrot
Date,Food,Mood
Monday,carrot,happy
Tuesday,carrot,happy
Wednesday,carrot,happy
Thursday,carrot,happy
Friday,carrot,happy
Saturday,carrot,extremely happy
Sunday,carrot,extremely happy", skip = 2)

print(dat)

## # A tibble: 7 x 3
##   Date      Food   Mood
##   <chr>     <chr>  <chr>
## 1 Monday    carrot happy
## 2 Tuesday   carrot happy
## 3 Wednesday carrot happy
## 4 Thursday  carrot happy
## 5 Friday    carrot happy
## 6 Saturday  carrot extremely happy
## 7 Sunday    carrot extremely happy

If you don’t have column names, set col_names = FALSE then R will assign names “X1”,“X2”… to the columns: dat <- read_csv("Saturday,carrot,extremely happy Sunday,carrot,extremely happy", col_names = FALSE) print(dat)

## # A tibble: 2 x 3
##   X1       X2     X3
##   <chr>    <chr>  <chr>
## 1 Saturday carrot extremely happy
## 2 Sunday   carrot extremely happy

You can also pass col_names a character vector, which will be used as the column names. Try to replace col_names = FALSE with col_names = c("Date", "Food", "Mood") and see what happens, as in the sketch below.
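For example (a small sketch of the suggested exercise):

dat <- read_csv("Saturday,carrot,extremely happy
Sunday,carrot,extremely happy",
    col_names = c("Date", "Food", "Mood"))
print(dat)
# the two lines are kept as data rows, and the supplied
# names "Date", "Food" and "Mood" become the column names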

As mentioned before, you can use read_csv2() to read semicolon separated files:

dat <- read_csv2("Saturday; carrot; extremely happy \n Sunday; carrot; extremely happy", col_names = FALSE)

print(dat)

## # A tibble: 2 x 3
##   X1       X2     X3
##   <chr>    <chr>  <chr>
## 1 Saturday carrot extremely happy
## 2 Sunday   carrot extremely happy

Here “\n” is a convenient shortcut for adding a new line.

You can use read_tsv() to read tab delimited files:

dat <- read_tsv("every\tman\tis\ta\tpoet\twhen\the\tis\tin\tlove\n",
    col_names = FALSE)

print(dat)

## # A tibble: 1 x 10
##   X1    X2    X3    X4    X5    X6    X7    X8    X9
##   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 every man   is    a     poet  when  he    is    in
## # ... with 1 more variable: X10 <chr>

Or, more generally, you can use read_delim() and assign the separating character:

dat <- read_delim("THE|UNBEARABLE|RANDOMNESS|OF|LIFE\n",
    delim = "|", col_names = FALSE)

print(dat)

## # A tibble: 1 x 5
##   X1    X2         X3         X4    X5
##   <chr> <chr>      <chr>      <chr> <chr>
## 1 THE   UNBEARABLE RANDOMNESS OF    LIFE

Another situation you will often run into is missing values. In marketing surveys, people like to use "99" to represent missing. You can tell R to set all observations with the value "99" to missing when you read the data:

dat <- read_csv("Q1,Q2,Q3
5, 4,99", na = "99")

print(dat)

## # A tibble: 1 x 3
##      Q1    Q2 Q3
## 1     5     4 NA

For writing data back to disk, you can use write_csv() and write_tsv(). Two characteristics of these functions increase the chances of the output file being read back in correctly:

• They encode strings in UTF-8
• They save dates and date-times in ISO 8601 format so they are easily parsed elsewhere

For example:

write_csv(sim.dat, "sim_dat.csv")

For other data types, you can use the following packages:

• haven: SPSS, Stata, and SAS data (a short sketch follows this list)
• readxl and xlsx: Excel data (.xls and .xlsx)
• DBI: together with a database-specific backend such as RMySQL, RSQLite, or RPostgreSQL, reads data directly from a database using SQL

Some other useful materials:

• For getting data from the internet, you can refer to the book "XML and Web Technologies for Data Sciences with R".

• R data import/export manual: https://cran.r-project.org/doc/manuals/r-release/R-data.html
• rio package: https://github.com/leeper/rio
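A minimal sketch of the first two bullets (the file names here are hypothetical placeholders, not files shipped with the book):

# Excel files
library(readxl)
survey <- read_excel("survey_results.xlsx", sheet = 1)

# SPSS / Stata files
library(haven)
spss_dat  <- read_sav("survey_results.sav")
stata_dat <- read_dta("survey_results.dta")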

A.2 data.table — enhanced data.frame

What is data.table? It is an R package that provides an enhanced version of data.frame, the most used data object in R. Before we move on, let's briefly review some basic characteristics and manipulations of data.frame:

• It is a set of rows and columns.


• Each row has the same length but can contain different data types
• Each column has the same length and a single data type
• It has characteristics of both a matrix and a list
• It uses [] to subset data

We will use the clothing customer data to illustrate. There are two dimensions in []: the first indicates the row and the second indicates the column, separated by a comma.

# read data
sim.dat <- readr::read_csv("http://bit.ly/2P5gTw4")

## Parsed with column specification:
## cols(
##   age = col_double(),
##   gender = col_character(),
##   income = col_double(),
##   house = col_character(),
##   store_exp = col_double(),
##   online_exp = col_double(),
##   store_trans = col_double(),
##   online_trans = col_double(),
##   Q1 = col_double(),
##   Q2 = col_double(),
##   Q3 = col_double(),
##   Q4 = col_double(),
##   Q5 = col_double(),
##   Q6 = col_double(),
##   Q7 = col_double(),
##   Q8 = col_double(),
##   Q9 = col_double(),
##   Q10 = col_double(),
##   segment = col_character()
## )

# subset the first two rows
sim.dat[1:2, ]
# subset the first two rows and columns 3 and 5
sim.dat[1:2, c(3, 5)]
# get all rows with age > 70
sim.dat[sim.dat$age > 70, ]
# get rows with age > 68 and gender "Male"; select columns 3 and 4
sim.dat[sim.dat$age > 68 & sim.dat$gender == "Male", 3:4]

Remember that there are usually different ways to conduct the same manipulation. For example, the following code presents three ways to calculate the average number of online transactions for males and females:

tapply(sim.dat$online_trans, sim.dat$gender, mean)

aggregate(online_trans ~ gender, data = sim.dat, mean)

sim.dat %>%
  group_by(gender) %>%
  summarise(Avg_online_trans = mean(online_trans))

There is no gold standard for choosing a specific function to manipulate data. The goal is to solve the real problem, not to master the tool itself. So just use whatever tool is convenient for you.

The way to use [] is straightforward, but the manipulations it supports are limited. If you need more complicated data reshaping or aggregation, there are other packages to use, such as dplyr, reshape2, tidyr, etc. But the usage of those packages is not as straightforward as [], and you often need to switch between functions. Keeping related operations together, such as subset, group, update, and join, allows for:

• concise, consistent and readable syntax irrespective of the set of operations you would like to perform to achieve your end goal
• performing data manipulation fluidly without the cognitive burden of having to switch among different functions

• by knowing precisely the data required for each operation, you can automatically optimize operations effectively

data.table is the package for that. If you are not familiar with other data manipulation packages and are interested in reducing programming time tremendously, then this package is for you.

Other than extending the functionality of [], data.table has the following advantages:

• It offers fast import, subsetting, grouping, updating, and joins for large data files
• It is easy to turn a data frame into a data table
• It can behave just like a data frame

You need to install and load the package:
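# install from CRAN if needed (only once), then load
# install.packages("data.table")
library(data.table)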

Use data.table() to convert the existing data frame sim.dat to a data table:

dt <- data.table(sim.dat)
class(dt)

## [1] "data.table" "data.frame" Calculate mean for counts of online transactions:

dt[, mean(online_trans)]

## [1] 13.55

You can't do the same thing using a data frame:

sim.dat[,mean(online_trans)]

Error in mean(online_trans) : object 'online_trans' not found

If you want to calculate the mean by group as before, set the "by =" argument:

dt[ , mean(online_trans), by = gender]

##    gender    V1
## 1: Female 15.38
## 2:   Male 11.26

You can group by more than one variable. For example, group by "gender" and "house":

dt[ , mean(online_trans), by = .(gender, house)]

##    gender house     V1
## 1: Female   Yes 11.312
## 2:   Male   Yes  8.772
## 3: Female    No 19.146
## 4:   Male    No 16.486

Assign column names for aggregated variables:

dt[ , .(avg = mean(online_trans)), by = .(gender, house)]

##    gender house    avg
## 1: Female   Yes 11.312
## 2:   Male   Yes  8.772
## 3: Female    No 19.146
## 4:   Male    No 16.486

data.table can accomplish all operations that aggregate() and tapply() can do for a data frame.

• General syntax of data.table

Different from a data frame, there are three arguments for a data table:
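The general form (a sketch of the standard data.table convention) is:

# the general data.table form:
#   dt[i, j, by]
# i  : which rows to use (like WHERE in SQL)
# j  : what to compute or select (like SELECT)
# by : how to group the rows (like GROUP BY)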

This is analogous to SQL. You don't have to know SQL to learn data.table, but experience with SQL will help you understand it. In SQL, you select column j (using SELECT) for rows i (using WHERE), and GROUP BY specifies the variable used to group the observations.

Let’s review our previous code: dt[ , mean(online_trans), by = gender]

The code above is equivalent to the following SQL:

SELECT
  gender,
  avg(online_trans)
FROM sim.dat
GROUP BY gender

R code:

dt[ , .(avg = mean(online_trans)), by = .(gender, house)]

is equivalent to the SQL:

SELECT
  gender,
  house,
  avg(online_trans) AS avg
FROM sim.dat
GROUP BY gender, house

R code:

dt[ age < 40, .(avg = mean(online_trans)), by = .(gender, house)]

is equivalent to the SQL:

SELECT
  gender,
  house,
  avg(online_trans) AS avg
FROM sim.dat
WHERE age < 40
GROUP BY gender, house

You can see the analogy between data.table and SQL. Now let's focus on operations in data.table.

• Select rows

# select rows with age < 20 and income > 80000
dt[age < 20 & income > 80000]

##    age gender income house store_exp online_exp
## 1:  19 Female  83535    No     227.7       1491
## 2:  18 Female  89416   Yes     209.5       1926
## 3:  19 Female  92813    No     186.7       1042
##    store_trans online_trans Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9
## 1:           1           22  2  1  1  2  4  1  4  2  4
## 2:           3           28  2  1  1  1  4  1  4  2  4
## 3:           2           18  3  1  1  2  4  1  4  3  4
##    Q10 segment
## 1:   1   Style
## 2:   1   Style
## 3:   1   Style

# select the first two rows
dt[1:2]

##    age gender income house store_exp online_exp
## 1:  57 Female 120963   Yes     529.1      303.5
## 2:  63 Female 122008   Yes     478.0      109.5
##    store_trans online_trans Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9
## 1:           2            2  4  2  1  2  1  4  1  4  2
## 2:           4            2  4  1  1  2  1  4  1  4  1
##    Q10 segment
## 1:   4   Price
## 2:   4   Price

• Select columns

Selecting columns in data.table doesn't need $:

# select column "age" but return it as a vector
# the row argument is empty, so the result
# will include all observations

ans <- dt[, age]
head(ans)

## [1] 57 63 59 60 51 59

To return a data.table object, put the column names in list():

# select age and online_exp columns
# and return them as a data.table instead
ans <- dt[, list(age, online_exp)]
head(ans)

Or you can also put column names in .():

ans <- dt[, .(age, online_exp)]

To select all columns from “age” to “income”:

ans <- dt[, age:income, with = FALSE]
head(ans, 2)

##    age gender income
## 1:  57 Female 120963
## 2:  63 Female 122008

Delete columns using - or !:

# delete columns from age to online_exp
ans <- dt[, -(age:online_exp), with = FALSE]
ans <- dt[, !(age:online_exp), with = FALSE]

• Tabulation

In data.table, .N means to count:

# row count
dt[, .N]

## [1] 1000

If you assign the group variable, then it will count by group:

# counts by gender
dt[, .N, by = gender]

##    gender   N
## 1: Female 554
## 2:   Male 446

# for those younger than 30, count by gender
dt[age < 30, .(count = .N), by = gender]

##    gender count
## 1: Female   292
## 2:   Male    86

Order the table:

# get the 5 records with the highest online expense:
head(dt[order(-online_exp)], 5)

   age gender   income house store_exp online_exp store_trans ...
1:  40 Female 217599.7    No  7023.684   9479.442          10
2:  41 Female       NA   Yes  3786.740   8638.239          14
3:  36   Male 228550.1   Yes  3279.621   8220.555           8
4:  31 Female 159508.1   Yes  5177.081   8005.932          11
5:  43 Female 190407.4   Yes  4694.922   7875.562           6
...

Since a data table keeps some characteristics of a data frame, they share some operations:

dt[order(-online_exp)][1:5]

You can also order the table by more than one variable. The following code will order the table by gender, then order within gender by online_exp:

dt[order(gender, -online_exp)][1:5]

• Use fread() to import data

Other than read.csv() in base R, we have introduced read_csv() in readr. read_csv() is much faster and provides a progress bar, which makes the user feel much better (at least it makes me feel better). fread() in data.table further increases the efficiency of reading data. The following are three examples of reading the same data file topic.csv. The file includes text data scraped from an agriculture forum, with 209,670 rows and 6 columns:

system.time(topic <- read.csv("http://bit.ly/2zam5ny"))

   user  system elapsed
  3.561   0.051   4.888

system.time(topic <- readr::read_csv("http://bit.ly/2zam5ny"))

   user  system elapsed
  0.409   0.032   2.233

system.time(topic <- data.table::fread("http://bit.ly/2zam5ny"))

   user  system elapsed
  0.276   0.096   1.117

It is clear that read_csv() is much faster than read.csv(), and fread() is a little faster than read_csv(). As the file size increases, the difference becomes more significant. Note that fread() reads the file as a data.table by default.

B R code for data simulation

B.1 Customer Data for Clothing Company

The simulation is not very straightforward, so we will break it into three parts:

1. Define the data structure: variable names, variable distributions, customer segment names, and segment sizes
2. Define the variable distribution parameters: mean and standard deviation
3. Iterate across segments and variables, and simulate data according to the assigned parameters

Organizing the code this way makes it easy for us to change specific parts of the simulation. For example, if we want to change the distribution of one variable, we can just change the corresponding part of the code. Here is the code to define the data structure:

# set a random number seed to
# make the process repeatable
set.seed(12345)
# define the number of observations
ncust <- 1000
# create a data frame for simulated data
seg_dat <- data.frame(id = as.factor(c(1:ncust)))
# assign the variable names
vars <- c("age", "gender", "income", "house", "store_exp",
          "online_exp", "store_trans", "online_trans")


# assign a distribution to each variable
vartype <- c("norm", "binom", "norm", "binom", "norm", "norm",
             "pois", "pois")
# names of the 4 segments
group_name <- c("Price", "Conspicuous", "Quality", "Style")
# size of each segment
group_size <- c(250, 200, 200, 350)

The next step is to define the variable distribution parameters. There are 4 customer segments and 8 variables, and different segments correspond to different parameter values. Let's store the parameters in a 4×8 matrix:

# matrix for means
mus <- matrix(
  c(
    # Price
    60, 0.5, 120000, 0.9, 500, 200, 5, 2,
    # Conspicuous
    40, 0.7, 200000, 0.9, 5000, 5000, 10, 10,
    # Quality
    36, 0.5, 70000, 0.4, 300, 2000, 2, 15,
    # Style
    25, 0.2, 90000, 0.2, 200, 2000, 2, 20),
  ncol = length(vars), byrow = TRUE)

# matrix for standard deviations (used as sd in rnorm() below)
sds <- matrix(
  c(
    # Price
    3, NA, 8000, NA, 100, 50, NA, NA,
    # Conspicuous
    5, NA, 50000, NA, 1000, 1500, NA, NA,
    # Quality
    7, NA, 10000, NA, 50, 200, NA, NA,
    # Style
    2, NA, 5000, NA, 10, 500, NA, NA),
  ncol = length(vars), byrow = TRUE)

Now we are ready to simulate data using the parameters defined above:

# simulate non-survey data
sim.dat <- NULL
set.seed(2016)

# loop over customer segments (i)
for (i in seq_along(group_name)) {

  # add this line in order to monitor the progress
  cat(i, group_name[i], "\n")

  # create an empty data frame to store the relevant data
  seg <- data.frame(matrix(NA, nrow = group_size[i], ncol = length(vars)))

  # simulate data within segment i, looping over every variable (j)
  for (j in seq_along(vars)) {

    if (vartype[j] == "norm") {
      # simulate a normal distribution
      seg[, j] <- rnorm(group_size[i], mean = mus[i, j], sd = sds[i, j])
    } else if (vartype[j] == "pois") {
      # simulate a Poisson distribution
      seg[, j] <- rpois(group_size[i], lambda = mus[i, j])
    } else if (vartype[j] == "binom") {
      # simulate a binomial distribution
      seg[, j] <- rbinom(group_size[i], size = 1, prob = mus[i, j])
    } else {
      # if the distribution name is not one of the above, stop
      # and return a message
      stop("Don't have type:", vartype[j])
    }
  }
  sim.dat <- rbind(sim.dat, seg)
}

Now let’s edit the data we just simulated a little by adding tags to 0/1 binomial variables:

# assign variable names
names(sim.dat) <- vars
# assign factor levels to the segment variable
sim.dat$segment <- factor(rep(group_name, times = group_size))
# recode the gender and house variables
sim.dat$gender <- factor(sim.dat$gender, labels = c("Female", "Male"))
sim.dat$house <- factor(sim.dat$house, labels = c("No", "Yes"))
# store_trans and online_trans are at least 1
sim.dat$store_trans <- sim.dat$store_trans + 1
sim.dat$online_trans <- sim.dat$online_trans + 1
# age is integer
sim.dat$age <- floor(sim.dat$age)

In the real world, data always includes some noise, such as missing values and wrong imputations. So we will add some noise to the data:

# add missing values
idxm <- as.logical(rbinom(ncust, size = 1, prob = sim.dat$age / 200))
sim.dat$income[idxm] <- NA
# add wrong imputations and outliers
set.seed(123)
idx <- sample(1:ncust, 5)
sim.dat$age[idx[1]] <- 300
sim.dat$store_exp[idx[2]] <- -500
sim.dat$store_exp[idx[3:5]] <- c(50000, 30000, 30000)

So far we have created part of the data. You can check it using summary(sim.dat). Next, we will move on to simulate survey data.

# number of survey questions
nq <- 10

# mean matrix for the different segments
mus2 <- matrix(
  c(5, 2, 1, 3, 1, 4, 1, 4, 2, 4,   # Price
    1, 4, 5, 4, 4, 4, 4, 1, 4, 2,   # Conspicuous
    5, 2, 3, 4, 3, 2, 4, 2, 3, 3,   # Quality
    3, 1, 1, 2, 4, 1, 5, 3, 4, 2),  # Style
  ncol = nq, byrow = TRUE)

# assume the standard deviation is 0.2 for all questions
sd2 <- 0.2
sim.dat2 <- NULL
set.seed(1000)

# loop over customer segments (i)
for (i in seq_along(group_name)) {
  # the following line is used for checking the progress
  cat(i, group_name[i], "\n")
  # create an empty data frame to store data
  seg <- data.frame(matrix(NA, nrow = group_size[i], ncol = nq))
  # simulate data within the segment
  for (j in 1:nq) {
    # simulate a normal distribution
    res <- rnorm(group_size[i], mean = mus2[i, j], sd = sd2)
    # set upper and lower limits
    res[res > 5] <- 5
    res[res < 1] <- 1
    # convert continuous values to discrete integers
    seg[, j] <- floor(res)
  }
  sim.dat2 <- rbind(sim.dat2, seg)
}

names(sim.dat2) <- paste("Q", 1:10, sep = "")
sim.dat <- cbind(sim.dat, sim.dat2)
sim.dat$segment <- factor(rep(group_name, times = group_size))

B.2 Swine Disease Breakout Data

# sim1_da1.csv: the 1st simulated data (sim1_da2 and sim1_da3 are similar)
# sim1.csv: simulated data from the first simulation
# dummy.sim1.csv: dummy variables for the first simulated data with all
# the baseline levels
# code for the simulation:

nf <- 800

for (j in 1:20) {
  set.seed(19870 + j)
  x <- c("A", "B", "C")
  sim.da1 <- NULL
  for (i in 1:nf) {
    # sample(x, 120, replace = TRUE) -> sam
    sim.da1 <- rbind(sim.da1, sample(x, 120, replace = TRUE))
  }

  sim.da1 <- data.frame(sim.da1)
  col <- paste("Q", 1:120, sep = "")
  row <- paste("Farm", 1:nf, sep = "")
  colnames(sim.da1) <- col
  rownames(sim.da1) <- row

  # use the class.ind() function in the nnet package to encode
  # dummy variables
  library(nnet)
  dummy.sim1 <- NULL
  for (k in 1:ncol(sim.da1)) {
    tmp = class.ind(sim.da1[, k])
    colnames(tmp) = paste(col[k], colnames(tmp))
    dummy.sim1 = cbind(dummy.sim1, tmp)
  }
  dummy.sim1 <- data.frame(dummy.sim1)

  # set 'C' as the baseline and delete the baseline dummy variable

  base.idx <- 3 * c(1:120)
  dummy1 <- dummy.sim1[, -base.idx]

  # simulate the independent variables for different values of r,
  # based on one value of r each time; for r = 0.1,
  # get the link function

  s1 <- c(rep(c(1/10, 0, -1/10), 40),
          rep(c(1/10, 0, 0), 40),
          rep(c(0, 0, 0), 40))
  link1 <- as.matrix(dummy.sim1) %*% s1 - 40/3/10

  # Other settings --------------------------------------------
  # r = 0.25
  # s1 <- c(rep(c(1/4, 0, -1/4), 40),
  #         rep(c(1/4, 0, 0), 40),
  #         rep(c(0, 0, 0), 40))
  # link1 <- as.matrix(dummy.sim1) %*% s1 - 40/3/4

  # r = 0.5
  # s1 <- c(rep(c(1/2, 0, -1/2), 40),
  #         rep(c(1/2, 0, 0), 40),
  #         rep(c(0, 0, 0), 40))
  # link1 <- as.matrix(dummy.sim1) %*% s1 - 40/3/2

  # r = 1
  # s1 <- c(rep(c(1, 0, -1), 40),
  #         rep(c(1, 0, 0), 40),
  #         rep(c(0, 0, 0), 40))
  # link1 <- as.matrix(dummy.sim1) %*% s1 - 40/3

  # r = 2
  # s1 <- c(rep(c(2, 0, -2), 40),
  #         rep(c(2, 0, 0), 40),
  #         rep(c(0, 0, 0), 40))
  # link1 <- as.matrix(dummy.sim1) %*% s1 - 40/3/0.5

  # calculate the outbreak probability
  hp1 <- exp(link1)/(exp(link1) + 1)

  # based on the probability hp1, simulate the response
  # variable: res
  res <- rep(9, nf)
  for (i in 1:nf) {
    res[i] <- sample(c(1, 0), 1, prob = c(hp1[i], 1 - hp1[i]))
  }

  # da1: with response variable, without group indicator
  # da2: without response variable, with group indicator
  # da3: without response variable, without group indicator

  dummy1$y <- res
  da1 <- dummy1
  y <- da1$y
  ind <- NULL
  for (i in 1:120) {
    ind <- c(ind, rep(i, 2))
  }

  da2 <- rbind(da1[, 1:240], ind)
  da3 <- da1[, 1:240]

  # save the simulated data
  write.csv(da1, paste("sim", j, "_da", 1, ".csv", sep = ""),
            row.names = F)
  write.csv(da2, paste("sim", j, "_da", 2, ".csv", sep = ""),
            row.names = F)
  write.csv(da3, paste("sim", j, "_da", 3, ".csv", sep = ""),
            row.names = F)
  write.csv(sim.da1, paste("sim", j, ".csv", sep = ""),
            row.names = F)
  write.csv(dummy.sim1, paste("dummy.sim", j, ".csv", sep = ""),
            row.names = F)
}

Bibliography

A, A. and J, A. A. (1984). On the existence of the maximum likelihood estimates in logistic regression models. Biometrika, 71(1):1–10.

B, E. and R, T. (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, pages 54–75.

Bauer, E. and Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36:105–142.

Box G, C. D. (1964). An analysis of transformations. Journal of the Royal Statistical Society, pages 211–252.

Breiman, L. (1998). Arcing classifiers. The Annals of Statistics, 26:123–140.

Breiman, L. (2001a). Random forests. Machine Learning, 45:5–32.

Breiman, L. (2001b). Statistical modeling: The two cultures. Statistical Science, 16(3):199–231.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and regression trees. Number ISBN 978-0412048418. CRC.

Cestnik, B. and Bratko, I. (1991). Estimating probabilities in tree pruning. EWSL, pages 138–150.

Chollet, F. (2017). Deep Learning with Python. Manning.

Chollet, F. and Allaire, J. J. (2018). Deep Learning with R. Manning.


Chun, H. and Keleş, S. (2010). Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(1):3–25.

Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling.

D, M. (1989). Analyzing a portion of the roc curve. Medical Decision Making, 9:190–195.

de Waal, T., Pannekoek, J., and Scholtus, S. (2011). Handbook of Statistical Data Editing and Imputation. John Wiley and Sons.

Dudoit S, F. J. and T, S. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97(457):77–87.

Efron, B. (1983). Estimating the error rate of a prediction rule: Improvement on cross-validation. Journal of the American Statistical Association, pages 316–331.

Efron, B. and Tibshirani, R. (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, pages 54–75.

Efron, B. and Tibshirani, R. (1997). Improvements on cross-validation: The 632+ bootstrap method. Journal of the American Statistical Association, 92(438):548–560.

E.R. DeLong, D.M. DeLong, D. C.-P. (1988). Comparing the areas under two or more correlated receiver operating characteristics curves: a nonparametric approach. Biometrics, 44:837–845.

et. al, B. A. (2000). Tissue classification with gene expression profiles. Journal of Computational Biology, 7(3):559–583.

et. al, B. J. (2006). Aggregate features and adaboost for music classification. Machine Learning, 65:473–484.

F. Espoito, D. M. and Semeraro, G. (1997). A comparative analysis of methods for pruning decision trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(5):476–491.

Freund, Y. and Schapire, R. (1997). A decision-theoretic generalization of online learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139.

Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Annals of Statistics, 38:337–374.

Gareth James, Daniela Witten, T. H. and Tibshirani, R. (2015). An Introduction to Statistical Learning. Springer, 6th edition.

Geladi P, K. B. (1986). Partial least squares regression: A tutorial. Analytica Chimica Acta, (185):1–17.

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.

Hall P, Hyndman R, F. Y. (2004). Nonparametric confidence intervals for receiver operating characteristic curves. Biometrika, 91:743–750.

Hand D, T. R. (2001). A simple generalisation of the area under the roc curve for multiple class classification problems. Machine Learning, 45(2):171–186.

Hastie T, Tibshirani R, F. J. (2008). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2nd edition.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.

Hoerl, A. and Kennard, R. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.

HSSINA, B., MERBOUHA, A., EZZIKOURI, H., and ERRITALI, M. (2014). A comparative study of decision tree id3 and c4.5. International Journal of Advanced Computer Science and Applications (IJACSA), Special Issue on Advances in Vehicular Ad Hoc Networking and Applications 2014, 4(2).

Hyndman, R. and Athanasopoulos, G. (2013). Forecasting: Principles and Practice, volume Section 2/5. OTexts: Melbourne, Australia.

Iglewicz, B. and Hoaglin, D. (1993). How to detect and handle outliers. The ASQC Basic References in Quality Control: Statistical Techniques, 16.

J, C. (1960). A coefficient of agreement for nominal data. Educational and Psychological Measurement, 20:37–46.

Jolliffe, I. (2002). Principal component analysis. Springer, 2nd edition.

Kuhn, M. and Johnston, K. (2013). Applied Predictive Modeling. Springer.

Kwak, G. H. J. and Hui, P. (2019). Deephealth: Deep learning for health informatics reviews, challenges, and opportunities on medical imaging, electronic health records, genomics, sensing, and online communication health.

L, B. (1966a). Bagging predictors. Machine Learning, 24(2):123–140.

L, V. (1984). A theory of the learnable. Communications of the ACM, 27:1134–1142.

L Meier, S. v. d. G. and Buhlmann, P. (2008). The group lasso for logistic regression. J. R. Stat. Soc. Ser. B Stat. Methodol, 70:53–71.

Lachiche N, F. P. (2003). Improving accuracy and cost of two-class and multi-class probabilistic classifiers using roc curves. In Proceedings of the Twentieth International Conference on Machine Learning, 20(416–424).

Landis JR, K. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33:159–174.

Li J, F. J. (2008). Roc analysis with multiple classes and multiple tests: Methodology and its application in microarray studies. Biostatistics, 9(3):566–576.

Line Clemmensen, Trevor Hastie, D. W. and Ersbøll, B. (2011). Sparse discriminant analysis. Technometrics, 53(4):406–413.

M, K. and L, V. (1989). Cryptographic limitations on learning boolean formulae and finite automata. In Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing.

M, S. T. and F, P. (2007b). Handling missing values when applying classification models. Journal of Machine Learning Research, 8:1625–1657.

McElreath, R. (2020). Statistical Rethinking: A Bayesian Course with Examples in R and STAN. Chapman and Hall/CRC.

Mulaik, S. (2009). Foundations of factor analysis. Chapman Hall/CRC, 2nd edition.

Patel, N. and Upadhyay, S. (2012). Study of various decision tree pruning methods with their empirical comparison in weka. International Journal of Computer Applications, 60(12).

Provost F, Fawcett T, K. R. (1998). The case against accuracy estimation for comparing induction algorithms. Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453.

Quinlan, J. (1999). Simplifying decision trees. International Journal of Human-Computer Studies, 61(2).

R, T. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B (Methodological), 58(1):267–288.

Ronald L. Wassersteina, N. A. L. (2016). Position on p-values: context, process, and purpose.

Serneels S, Nolf ED, E. P. (2006). Spatial sign preprocessing: A simple way to impart moderate robustness to multivariate estimators. Journal of Chemical Information and Modeling, 46(3):1402–1409.

T, D. (2000). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40:139–158.

T, F. (2006). An introduction to roc analysis. Pattern Recognition Letters, 27(8):861–874.

T, H. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13:340–354.

Varmuza K, H. P. and K, F. (2003). Boosting applied to classification of mass spectral data. Journal of Data Science, 1(391–404).

W, M. (1965). Principal components regression in exploratory statistical research. Journal of the American Statistical Association, 60:234–246.

Wedderburn, R. W. M. (1976). On the existence and uniqueness of the maximum likelihood estimates for certain generalized linear models. Biometrika, 63:27–32.

Willett, P. (2004). Dissimilarity-based algorithms for selecting structurally diverse sets of compounds. Journal of Computational Biology, 6(3-4)(doi:10.1089/106652799318382):447–457.

Wold, H. (1973). Nonlinear iterative partial least squares (nipals) modelling: Some current developments. Academic Press, pages 383–407.

Wold, H. and Jöreskog, K. G. (1982). Systems Under Indirect Observation: Causality, Structure, Prediction. North Holland, Amsterdam.

Y, A. and D, G. (1997). Shape quantization and recognition with randomized trees. Neural Computation, 9:1545–1588.

Y. Kim, J. K. and Kim, Y. (2006). Blockwise sparse regression. Statist. Sin, 16:375–390.

Yeo, G. and Burge, C. (2004). Maximum entropy modeling of short sequence motifs with applications to rna splicing signals. Journal of Computational Biology, pages 475–494.

YFR, S. (1999). Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79–103.

Yuan, M. and Lin, Y. (2007). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol, 68:49–67.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. arXiv:1611.03530.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301–320.

Index

Best Linear Unbiased Estimate (BLUE), 174
BLUE, 174
CART, 3
classification and regression tree, 3
Cook's distance, 175
deep learning, 4
ETL, 6
extract, transform, and load, 7
generalized linear model, 3
GLM, 3
GMB, 4
Hadoop, 7
Kafka, 7
linear regression, 2, 174
logistic regression, 3
Nonlinear Iterative Partial Least Squares (NIPALS), 179
OLS, 174
Ordinary Least Square (OLS), 174
P-value, 171
Partial least square regression (PLS), 179
PCR, 179, 182
PLS, 179, 182
principal component analysis (PCA), 178
Principal component regression (PCR), 178
Q-Q Plot, 175
random forest, 3, 4
residual plot, 174
RMSE, 182
Spark, 7
sparklyr, xi
Standardized residuals plot, 175