
Index

A
aes() function, 152
ANOVA
  between-group variability, 199–200
  grand mean, 198
  hypothesis, 198
  one-way, 202–203
  two-way, 204, 206
  within-group variability, 201–202
Apache Spark, 14–15, 18
apply() function, 173, 175–176

B
Bar chart, 130–134
barplot() function, 130
Big data
  Apache Spark, 14
  challenges, 13, 17
  formats and types, 14
  Hadoop, 14
  IoT devices, 14
  properties, 17
  relational databases and desktop statistics, 14
  velocity, 14
  volume, 13
Binomial distribution, 121–124
Boolean operators, 68
Boxplot, 143–144
Break keyword, 72–75
Business understanding, 8

C
Calculator R script
  add(), subtract(), product(), and division() functions, 81
  readline() function, 81
  running in RStudio IDE, 82–83
Categorical data, 104
Central limit theorem, 87
Central tendency, 87, 105, 124–125
Chi-square test, 197
  contingency test, 196–198
  goodness of fit test, 194–196
Code editor, 42–45
Comma-separated values (CSV) file, 88
  reading, 89–90
  writing, 91
Common charts
  bar chart, 158–159
  boxplot, 163–166
  density plot, 161

Common charts (cont.)
  histogram, 160
  line chart, 162–163
  scatterplot, 161–162
Computing Machinery and Intelligence, 12
Contingency test, 196–198
coord_flip() function, 159
Correlations, 183–184
Covariance, 185–186
Cross-industry standard process of data mining (CRISP-DM), 7–8
Cumulative distribution function (CDF), 118

D
Data acquisition, 10
Data frame, 63–67
Data mining, 1–2, 15–17
  business understanding, 8
  CRISP-DM, 7–8
  data preparation, 8
  data understanding, 8
  definition, 6
  deployment, 9
  evaluation, 9
  modeling, 9
  Nayes theorem, 7
  statistical learning and machine learning algorithms, 7
Data preparation, 8
Data processing
  data selection, 97–99
  filtering, 101–102
  removing
    duplicates, 103
    missing values, 102
  sorting, 99–101
Data science, 16
  data product, 4
  diagram, 6
  domain expertise, 5
  history of, 5
  linear regression, 5
  product design and engineering knowledge, 6
  statistics, 5
Data types, 48–50
Data understanding, 8, 17
Data visualization, 129
Descriptive analytics, 12, 17
Descriptive statistics, 2–3, 173
  central limit theorem, 88
  central tendency, 87–88
  data and variables, 88
dplyr library, 181–182

E
element_text() function, 157
Excel file
  reading, 92–93
  writing, 93

F
Facebook, 15, 18
Fit test, 194–196
For loop, 69–70
Functions, 75–77, 79

G
GATE, 15
geom_point() function, 153
getwd() function, 89
ggplot2
  common charts (see Common charts)
  geometric objects, 152–155
  grammar of graphics, 150–151
  labels, 155–156
  setup, 151–152
  themes, 156–157
ggplotly() function, 168
ggsave() function, 165
GNU package, 1
Google, 15, 18

H
Hadoop, 14
High-level programming language (HLL), 2–3, 16
hist() function, 135
Histogram, 135–136
Hypothesis testing, 186

I
Inferential statistics, 2, 4, 174
Integrated development environment (IDE), 2, 19
  code editors, 20
  Dartmouth BASIC, 21
  features, 21
  NetBeans, 21
  RStudio and R (see RStudio IDE)
  Softbench, 21
Interquartile range, 111–112
IQR() and quantile() functions, 112

J
JSON file, 96–97

K
Kruskal-Wallis test, 216, 218

L
labs() function, 155
lapply() function, 173, 177
library() function, 141, 148, 151, 211
Linear regression, 5
Line chart, 137–138
lines() function, 136, 138
Lists
  data structure type, 54
  length() function, 54
  syntax, create, 53

Lists (cont.)
  value/element
    delete, 57
    modification, 56
  values retrieve
    integer vector, 54
    logical vector, 55
    negative integer, 55
lm() function, 220
Logical statements, 67–69
Loops
  break and next keyword, 72–74
  for loop, 69–70
  repeat loop, 74–75
  while loop, 71–72
Lower-level programming language (LLL), 2, 16

M
MANOVA, 206–209
Matrix
  attributes() function, 59
  cbind() function, 62
  class() function, 59
  colnames() and rownames() functions, 59
  logical vector, 61
  rbind() function, 62
  syntax, creation, 58
  t() function, 63
mean() function, 175–176
Mean, 109
Median, 109
Mode, 105–108

N, O
Natural language processing (NLP), 11–12, 17
Nayes theorem, 7
NetBeans, 21
Next keyword, 72–74
Nonparametric test
  Kruskal-Wallis, 216–218
  Wilcoxon-Mann-Whitney, 213–215
  Wilcoxon Signed Rank, 209–210, 212
Normal distribution
  bell curve, 115
  bins, 116
  hist() function, 116
  inverse CDF, 118
  modality, 119
  p-th quantile, 118
  qqnorm() and qqline() functions, 116
  rnorm() function, 117
  Shapiro Test, 117
  skewness, 119–120
  standard deviation, 118
Numeric data, 104

P, Q
pairs() function, 146
Pie chart, 139–141
pie3D() function, 141
plot() function, 137, 142
Plotly JS, 166–169


Prediction model, 9
Predictive analytics, 12–13, 17
Predictive modelling techniques, 218
Prescriptive analytics, 12–13, 17
Programming languages, 15, 17
P-value, 186

R
RapidMiner, 15, 17
R console, 39–42
Reading data files
  CSV file
    class() function, 90
    read.csv() function, 89
    write.csv() function, 91
  Excel file
    data frame data type, 93
    read.xlsx() function, 92
    require() function, 92
    View() function, 92
    write.xlsx() function, 93
  JSON, 96–97
  SPSS file
    help() function, 95
    install.packages() function, 94
    read.spss() function, 95
    write.foreign() function, 96
Regressions, 2, 4
  definition, 175
  linear, 218–222
  multiple linear, 223
Repeat loop, 74–75
require() function, 141, 151, 168
R programming
  definition, 19
  GNU package, 20
  IDE (see Integrated development environment (IDE))
  RGui interface, 20
  statistical and data visualization techniques, 20
RStudio IDE
  Choose R Installation dialog, 28–29
  code editor, 33
  console results, 45
  downloading, Linux and Mac OS, 23
  Environment tab, 45
  Hello World application, 25
  installation, 23–24, 26
  intelligent code completion, 21–22, 33, 37
  interface, 22, 27, 32–33
  latest version, downloading, 26
  loaded data, 35–36
  options, 28
  plot() function, 32
  R console, 22
  read.csv() function, 30–31
  results, 35
  RGui interface, 24
  R project website, 22–23
  running script, 34–35
  summary() function, 31


RStudio IDE (cont.)
  Tools menu, 27
  version changing, 29–30
  website, 25–26

S
Sampling
  cluster, 179–183
  SRS, 178
  stratified, 179
sapply() function, 173, 177
SAS Enterprise Miner, 15, 17
SAS programming, 15, 18
Scatterplot matrix, 146–147
Scripts, 16
setwd() function, 89
Simple random sampling (SRS), 178
Skewness, 119–120
Social network analysis graph, 147–149
Softbench IDE, 21
SPSS file
  reading, 94–95
  writing, 96
SPSS Modeler, 15, 17
SPSS Statistics, 15, 17
Standard deviation, 114–115
Stanford NLP, 15
Statistical computing, 1, 36
Statistics, 3–5, 15–16
  binomial distribution, 121–124
  categorical data, 104
  interquartile range, 111–112
  mean, 109
  median, 109
  mode, 105–108
  normal distribution (see Normal distribution)
  numeric data, 104
  observation, 104
  population, 104
  range, 110–111
  sample, 104
  standard deviation, 114–115
  variable, 104
  variance, 112–114
str() function, 123
summary() function, 123, 203, 228
Syntax of R programming
  code editor, 42–45
  code with comments, 46–47
  data frame, 63–67
  data types, 48–50
  functions, 75–77, 79
  list (see Lists)
  logical statements, 67–69
  loops (see Loops)
  matrix, 58
  R console, 39–42
  variables, 47–48
  vectors, 50–53

T, U
Tableau, 15
Text mining, 15, 17
  applications, 11
  data acquisition, 10

Text mining (cont.)
  data mining CRISP-DM model, 10
  definition, 9
  evaluation/validation, 11
  modeling, 11
  text preprocessing, 10
theme() function, 156
TIOBE, 1, 18
T-test
  errors, type I and II, 188
  one-sample, 188–189
  two-sample dependent, 193–194
  two-sample independent, 190–193
  types, 187

V
Variables, 47–48
Variance, 112–114
Vectors, 50–53
Velocity, 14
Volume, 13

W, X, Y, Z
Weka, 15, 17
Welch t-test formula, 192
While loop, 71–72, 75
Wilcoxon-Mann-Whitney test, 213–215
Wilcoxon Signed Rank Test, 209–210, 212
wilcox.test() function, 212, 215
