<<

Applied Using SPSS, STATISTICA, MATLAB and Joaquim P. Marques de Sá Applied Statistics Using SPSS, STATISTICA, MATLAB and R

With 195 Figures and a CD

123 Editors

Prof. Dr. Joaquim P. Marques de Sá Universidade do Porto Fac. Engenharia Rua Dr. Roberto Frias s/n 4200-465 Porto Portugal e-mail: [email protected]

Library of Congress Control Number: 2007926024

ISBN 978-3-540-71971-7 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable for prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2007

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant pro- tective laws and regulations and therefore free for general use.

Typesetting: by the editors Production: Integra Software Services Pvt. Ltd., India Cover design: WMX design, Heidelberg Printed on acid-free paper SPIN: 11908944 42/3100/Integra 5 4 3 2 1 0

To Wiesje and Carlos. Contents

Preface to the Second Edition xv

Preface to the First Edition xvii

Symbols and Abbreviations xix

1 Introduction 1

1.1 Deterministic Data and Random Data...... 1 1.2 Population, Sample and Statistics ...... 5 1.3 Random Variables...... 8 1.4 Probabilities and Distributions...... 10 1.4.1 Discrete Variables ...... 10 1.4.2 Continuous Variables ...... 12 1.5 Beyond a Reasonable Doubt...... 13 1.6 Statistical Significance and Other Significances...... 17 1.7 Datasets...... 19 1.8 Software Tools...... 19 1.8.1 SPSS and STATISTICA...... 20 1.8.2 MATLAB and R...... 22

2 Presenting and Summarising the Data 29

2.1 Preliminaries ...... 29 2.1.1 Reading in the Data ...... 29 2.1.2 Operating with the Data...... 34 2.2 Presenting the Data ...... 39 2.2.1 Counts and Bar Graphs...... 40 2.2.2 Frequencies and ...... 47 2.2.3 Multivariate Tables, Scatter Plots and 3D Plots ...... 52 2.2.4 Categorised Plots...... 56 2.3 Summarising the Data...... 58 2.3.1 Measures of Location ...... 58 2.3.2 Measures of Spread ...... 62 2.3.3 Measures of Shape...... 64 viii Contents

2.3.4 Measures of Association for Continuous Variables...... 66 2.3.5 Measures of Association for Ordinal Variables...... 69 2.3.6 Measures of Association for Nominal Variables...... 73 Exercises...... 77

3 Estimating Data Parameters 81

3.1 Point Estimation and Interval Estimation...... 81 3.2 Estimating a Mean ...... 85 3.3 Estimating a Proportion ...... 92 3.4 Estimating a Variance ...... 95 3.5 Estimating a Variance Ratio...... 97 3.6 Bootstrap Estimation...... 99 Exercises...... 107

4 Parametric Tests of Hypotheses 111

4.1 Hypothesis Test Procedure...... 111 4.2 Test Errors and Test Power...... 115 4.3 Inference on One Population...... 121 4.3.1 Testing a Mean ...... 121 4.3.2 Testing a Variance...... 125 4.4 Inference on Two Populations ...... 126 4.4.1 Testing a Correlation ...... 126 4.4.2 Comparing Two Variances...... 129 4.4.3 Comparing Two Means ...... 132 4.5 Inference on More than Two Populations...... 141 4.5.1 Introduction to the Analysis of Variance...... 141 4.5.2 One-Way ANOVA ...... 143 4.5.3 Two-Way ANOVA ...... 156 Exercises...... 166

5 Non-Parametric Tests of Hypotheses 171

5.1 Inference on One Population...... 172 5.1.1 The Runs Test...... 172 5.1.2 The Binomial Test ...... 174 5.1.3 The Chi-Square Goodness of Fit Test ...... 179 5.1.4 The Kolmogorov-Smirnov Goodness of Fit Test ...... 183 5.1.5 The Lilliefors Test for Normality ...... 187 5.1.6 The Shapiro-Wilk Test for Normality ...... 187 5.2 Contingency Tables...... 189 5.2.1 The 2×2 Contingency Table ...... 189 5.2.2 The rxc Contingency Table ...... 193 Contents ix

5.2.3 The Chi-Square Test of Independence ...... 195 5.2.4 Measures of Association Revisited...... 197 5.3 Inference on Two Populations ...... 200 5.3.1 Tests for Two Independent Samples...... 201 5.3.2 Tests for Two Paired Samples ...... 205 5.4 Inference on More Than Two Populations...... 212 5.4.1 The Kruskal-Wallis Test for Independent Samples...... 212 5.4.2 The Friedmann Test for Paired Samples ...... 215 5.4.3 The Cochran Q test...... 217 Exercises...... 218

6 Statistical Classification 223

6.1 Decision Regions and Functions...... 223 6.2 Linear Discriminants...... 225 6.2.1 Minimum Euclidian Distance Discriminant ...... 225 6.2.2 Minimum Mahalanobis Distance Discriminant...... 228 6.3 Bayesian Classification...... 234 6.3.1 Bayes Rule for Minimum Risk...... 234 6.3.2 Normal Bayesian Classification ...... 240 6.3.3 Dimensionality Ratio and Error Estimation...... 243 6.4 The ROC Curve ...... 246 6.5 Feature Selection...... 253 6.6 Classifier Evaluation...... 256 6.7 Tree Classifiers ...... 259 Exercises...... 268

7 Data Regression 271

7.1 Simple Linear Regression ...... 272 7.1.1 Simple Linear Regression Model ...... 272 7.1.2 Estimating the Regression Function ...... 273 7.1.3 Inferences in Regression Analysis...... 279 7.1.4 ANOVA Tests ...... 285 7.2 Multiple Regression ...... 289 7.2.1 General Linear Regression Model...... 289 7.2.2 General Linear Regression in Matrix Terms ...... 289 7.2.3 Multiple Correlation ...... 292 7.2.4 Inferences on Regression Parameters ...... 294 7.2.5 ANOVA and Extra Sums of Squares...... 296 7.2.6 Polynomial Regression and Other Models ...... 300 7.3 Building and Evaluating the Regression Model...... 303 7.3.1 Building the Model...... 303 7.3.2 Evaluating the Model ...... 306 7.3.3 Case Study...... 308 7.4 Regression Through the Origin...... 314 x Contents

7.5 Ridge Regression ...... 316 7.6 Logit and Probit Models ...... 322 Exercises...... 327

8 Data Structure Analysis 329

8.1 Principal Components...... 329 8.2 Dimensional Reduction...... 337 8.3 Principal Components of Correlation Matrices...... 339 8.4 Factor Analysis ...... 347 Exercises...... 350

9 Survival Analysis 353

9.1 Survivor Function and Hazard Function ...... 353 9.2 Non-Parametric Analysis of Survival Data...... 354 9.2.1 The Life Table Analysis ...... 354 9.2.2 The Kaplan-Meier Analysis...... 359 9.2.3 Statistics for Non-Parametric Analysis...... 362 9.3 Comparing Two Groups of Survival Data ...... 364 9.4 Models for Survival Data...... 367 9.4.1 The Exponential Model ...... 367 9.4.2 The Weibull Model...... 369 9.4.3 The Cox Regression Model ...... 371 Exercises...... 373

10 Directional Data 375

10.1 Representing Directional Data ...... 375 10.2 Descriptive Statistics...... 380 10.3 The von Mises Distributions ...... 383 10.4 Assessing the Distribution of Directional Data...... 387 10.4.1 Graphical Assessment of Uniformity ...... 387 10.4.2 The Rayleigh Test of Uniformity ...... 389 10.4.3 The Watson Goodness of Fit Test ...... 392 10.4.4 Assessing the von Misesness of Spherical Distributions...... 393 10.5 Tests on von Mises Distributions...... 395 10.5.1 One-Sample Mean Test ...... 395 10.5.2 Mean Test for Two Independent Samples ...... 396 10.6 Non-Parametric Tests...... 397 10.6.1 The Uniform Scores Test for Circular Data...... 397 10.6.2 The Watson Test for Spherical Data...... 398 10.6.3 Testing Two Paired Samples ...... 399 Exercises...... 400 Contents xi

Appendix A - Short Survey on Probability Theory 403

A.1 Basic Notions...... 403 A.1.1 Events and Frequencies ...... 403 A.1.2 Probability Axioms...... 404 A.2 Conditional Probability and Independence ...... 406 A.2.1 Conditional Probability and Intersection Rule...... 406 A.2.2 Independent Events ...... 406 A.3 Compound Experiments...... 408 A.4 Bayes’ Theorem ...... 409 A.5 Random Variables and Distributions ...... 410 A.5.1 Definition of Random Variable ...... 410 A.5.2 Distribution and Density Functions...... 411 A.5.3 Transformation of a Random Variable...... 413 A.6 Expectation, Variance and Moments ...... 414 A.6.1 Definitions and Properties ...... 414 A.6.2 Moment-Generating Function ...... 417 A.6.3 Chebyshev Theorem...... 418 A.7 The Binomial and Normal Distributions...... 418 A.7.1 The ...... 418 A.7.2 The Laws of Large Numbers...... 419 A.7.3 The ...... 420 A.8 Multivariate Distributions...... 422 A.8.1 Definitions...... 422 A.8.2 Moments...... 425 A.8.3 Conditional Densities and Independence...... 425 A.8.4 Sums of Random Variables ...... 427 A.8.5 ...... 428

Appendix B - Distributions 431

B.1 Discrete Distributions ...... 431 B.1.1 ...... 431 B.1.2 Uniform Distribution...... 432 B.1.3 ...... 433 B.1.4 Hypergeometric Distribution...... 434 B.1.5 Binomial Distribution...... 435 B.1.6 ...... 436 B.1.7 ...... 438 B.2 Continuous Distributions ...... 439 B.2.1 Uniform Distribution...... 439 B.2.2 Normal Distribution...... 441 B.2.3 ...... 442 B.2.4 ...... 444 B.2.5 ...... 445 B.2.6 ...... 446 B.2.7 Chi-Square Distribution...... 448 xii Contents

B.2.8 Student’s t Distribution...... 449 B.2.9 F Distribution ...... 451 B.2.10 Von Mises Distributions...... 452

Appendix C - Point Estimation 455

C.1 Definitions...... 455 C.2 Estimation of Mean and Variance...... 457

Appendix D - Tables 459

D.1 Binomial Distribution ...... 459 D.2 Normal Distribution ...... 465 D.3 Student´s t Distribution ...... 466 D.4 Chi-Square Distribution ...... 467 D.5 Critical Values for the F Distribution ...... 468

Appendix E - Datasets 469

E.1 Breast Tissue...... 469 E.2 Car Sale...... 469 E.3 Cells ...... 470 E.4 Clays ...... 470 E.5 Cork Stoppers...... 471 E.6 CTG ...... 472 E.7 Culture ...... 473 E.8 Fatigue ...... 473 E.9 FHR...... 474 E.10 FHR-Apgar ...... 474 E.11 Firms ...... 475 E.12 Flow Rate...... 475 E.13 Foetal Weight...... 475 E.14 Forest Fires...... 476 E.15 Freshmen...... 476 E.16 Heart Valve ...... 477 E.17 Infarct...... 478 E.18 Joints ...... 478 E.19 Metal Firms...... 479 E.20 Meteo ...... 479 E.21 Moulds ...... 479 E.22 Neonatal...... 480 E.23 Programming...... 480 E.24 Rocks ...... 481 E.25 Signal & Noise...... 481 Contents xiii

E.26 Soil Pollution ...... 482 E.27 Stars ...... 482 E.28 Stock Exchange...... 483 E.29 VCG...... 484 E.30 Wave ...... 484 E.31 Weather...... 484 E.32 Wines ...... 485

Appendix F - Tools 487

F.1 MATLAB Functions...... 487 F.2 R Functions ...... 488 F.3 Tools EXCEL File ...... 489 F.4 SCSize Program...... 489

References 491

Index 499

Preface to the Second Edition

Four years have passed since the first edition of this book. During this time I have had the opportunity to apply it in classes obtaining feedback from students and inspiration for improvements. I have also benefited from many comments by users of the book. For the present second edition large parts of the book have undergone major revision, although the basic concept – concise but sufficiently rigorous mathematical treatment with emphasis on computer applications to real datasets –, has been retained. The second edition improvements are as follows:

• Inclusion of R as an application tool. As a matter of fact, R is a free software product which has nowadays reached a high level of maturity and is being increasingly used by many people as a statistical analysis tool. • Chapter 3 has an added section on bootstrap estimation methods, which have gained a large popularity in practical applications. • A revised explanation and treatment of tree classifiers in Chapter 6 with the inclusion of the QUEST approach. • Several improvements of Chapter 7 (regression), namely: details concerning the meaning and computation of multiple and partial correlation coefficients, with examples; a more thorough treatment and exemplification of the ridge regression topic; more attention dedicated to model evaluation. • Inclusion in the book CD of additional MATLAB functions as well as a set of R functions. • Extra examples and exercises have been added in several chapters. • The bibliography has been revised and new references added.

I have also tried to improve the quality and clarity of the text as well as notation. Regarding notation I follow in this second edition the more widespread use of denoting random variables with italicised capital letters, instead of using small cursive font as in the first edition. Finally, I have also paid much attention to correcting errors, misprints and obscurities of the first edition.

J.P. Marques de Sá Porto, 2007 Preface to the First Edition

This book is intended as a reference book for students, professionals and research workers who need to apply statistical analysis to a large variety of practical problems using STATISTICA, SPSS and MATLAB. The book chapters provide a comprehensive coverage of the main statistical analysis topics (data description, statistical inference, classification and regression, factor analysis, survival data, directional statistics) that one faces in practical problems, discussing their solutions with the mentioned software packages. The only prerequisite to use the book is an undergraduate knowledge level of mathematics. While it is expected that most readers employing the book will have already some knowledge of elementary statistics, no previous course in probability or statistics is needed in order to study and use the book. The first two chapters introduce the basic needed notions on probability and statistics. In addition, the first two Appendices provide a short survey on Probability Theory and Distributions for the reader needing further clarification on the theoretical foundations of the statistical methods described. The book is partly based on tutorial notes and materials used in data analysis disciplines taught at the Faculty of Engineering, Porto University. One of these disciplines is attended by students of a Master’s Degree course on information management. The students in this course have a variety of educational backgrounds and professional interests, which generated and brought about datasets and analysis objectives which are quite challenging concerning the methods to be applied and the interpretation of the results. The datasets used in the book examples and exercises were collected from these courses as well as from research. They are included in the book CD and cover a broad spectrum of areas: engineering, medicine, biology, psychology, economy, geology, and astronomy. Every chapter explains the relevant notions and methods concisely, and is illustrated with practical examples using real data, presented with the distinct intention of clarifying sensible practical issues. The solutions presented in the examples are obtained with one of the software packages STATISTICA, SPSS or MATLAB; therefore, the reader has the opportunity to closely follow what is being done. The book is not intended as a substitute for the STATISTICA, SPSS and MATLAB user manuals. It does, however, provide the necessary guidance for applying the methods taught without having to delve into the manuals. This includes, for each topic explained in the book, a clear indication of which STATISTICA, SPSS or MATLAB tools to be applied. These indications appear in specific “Commands” frames together with a complementary description on how to use the tools, whenever necessary. In this way, a comparative perspective of the xviii Preface to the First Edition capabilities of those software packages is also provided, which can be quite useful for practical purposes. STATISTICA, SPSS or MATLAB do not provide specific tools for some of the statistical topics described in the book. These range from such basic issues as the choice of the optimal number of bins to more advanced topics such as directional statistics. The book CD provides these tools, including a set of MATLAB functions for directional statistics. I am grateful to many people who helped me during the preparation of the book. Professor Luís Alexandre provided help in reviewing the book contents. Professor Willem van Meurs provided constructive comments on several topics. Professor Joaquim Góis contributed with many interesting discussions and suggestions, namely on the topic of data structure analysis. Dr. Carlos Felgueiras and Paulo Sousa gave valuable assistance in several software issues and in the development of some software tools included in the book CD. My gratitude also to Professor Pimenta Monteiro for his support in elucidating some software tricks during the preparation of the text files. A lot of people contributed with datasets. Their names are mentioned in Appendix E. I express my deepest thanks to all of them. Finally, I would also like to thank Alan Weed for his thorough revision of the texts and the clarification of many editing issues.

J.P. Marques de Sá Porto, 2003 Symbols and Abbreviations

Sample Sets A event A set (of events)

{A1, A2,…} set constituted of events A1, A2,… A complement of {A}

AU B union of {A} with {B} AI B intersection of {A} with {B} E set of all events (universe) φ empty set

Functional Analysis ∃ there is ∀ for every ∈ belongs to ∉ doesn’t belong to ≡ equivalent to || || Euclidian norm (vector length) ⇒ implies → converges to ℜ real number set ℜ + [0, +∞ [ [a, b] closed interval between and including a and b ]a, b] interval between a and b, excluding a [a, b[ interval between a and b, excluding b xx Symbols and Abbreviations

]a, b[ open interval between a and b (excluding a and b)

n ∑i=1 sum for index i = 1,…, n n ∏ product for index i = 1,…, n i=1 b integral from a to b ∫a k! factorial of k, k! = k(k−1)(k−2)...2.1 n (k ) combinations of n elements taken k at a time | x | absolute value of x x largest integer smaller or equal to x gX(a) function g of variable X evaluated at a dg derivative of function g with respect to X dX d n g derivative of order n of g evaluated at a n dX a ln(x) natural logarithm of x log(x) logarithm of x in base 10 sgn(x) sign of x mod(x,y) remainder of the integer division of x by y

Vectors and Matrices x vector (column vector), multidimensional random vector x' transpose vector (row vector)

[x1 x2…xn] row vector whose components are x1, x2,…,xn xi i-th component of vector x xk,i i-th component of vector xk ∆x vector x increment x'y inner (dot) product of x and y A matrix aij i-th row, j-th column element of matrix A A' transpose of matrix A A−1 inverse of matrix A Symbols and Abbreviations xxi

|A| determinant of matrix A tr(A) trace of A (sum of the diagonal elements) I unit matrix

λi eigenvalue i

Probabilities and Distributions X random variable (with value denoted by the same lower case letter, x) P(A) probability of event A P(A|B) probability of event A conditioned on B having occurred P(x) discrete probability of random vector x

P(ωi|x) discrete conditional probability of ωi given x f(x) probability density function f evaluated at x f(x |ωi) conditional probability density function f evaluated at x given ωi X ~ f X has probability density function f X ~ F X has function (is distributed as) F Pe probability of misclassification (error) Pc probability of correct classification df degrees of freedom xdf,α α-percentile of X distributed with df degrees of freedom bn,p binomial probability for n trials and probability p of success

Bn,p binomial distribution for n trials and probability p of success u uniform probability or density function U uniform distribution gp geometric probability (Bernoulli trial with probability p)

Gp geometric distribution (Bernoulli trial with probability p) hN,D,n hypergeometric probability (sample of n out of N with D items)

HN,D,n hypergeometric distribution (sample of n out of N with D items) pλ Poisson probability with event rate λ

Pλ Poisson distribution with event rate λ nµ,σ normal density with mean µ and standard deviation σ xxii Symbols and Abbreviations

Nµ,σ normal distribution with mean µ and standard deviation σ

ελ exponential density with spread factor λ

Ελ exponential distribution with spread factor λ wα,β Weibull density with parameters α, β

Wα,β Weibull distribution with parameters α, β

γa,p Gamma density with parameters a, p

Γa,p Gamma distribution with parameters a, p

βp,q Beta density with parameters p, q

Βp,q Beta distribution with parameters p, q

2 χ df Chi-square density with df degrees of freedom

2 Χ df Chi-square distribution with df degrees of freedom tdf Student’s t density with df degrees of freedom

Tdf Student’s t distribution with df degrees of freedom f F density with df , df degrees of freedom df1,df2 1 2 F F distribution with df , df degrees of freedom df1,df2 1 2

Statistics xˆ estimate of x Ε[]X expected value (average, mean) of X V[]X variance of X Ε[x | y] expected value of x given y (conditional expectation)

mk central moment of order k µ mean value σ standard deviation

σ XY covariance of X and Y ρ correlation coefficient µ mean vector Symbols and Abbreviations xxiii

Σ covariance matrix x arithmetic mean v sample variance s sample standard deviation xα α-quantile of X ( FX (xα ) = α ) med(X) median of X (same as x0.5) S sample covariance matrix α significance level (1−α is the confidence level) xα α-percentile of X ε tolerance

Abbreviations FNR False Negative Ratio FPR False Positive Ratio iff if an only if i.i.d. independent and identically distributed IRQ inter-quartile range pdf probability density function LSE Least Square Error ML Maximum Likelihood MSE Mean Square Error PDF probability distribution function RMS Root Mean Square Error r.v. Random variable ROC Receiver Operating Characteristic SSB Between-group Sum of Squares SSE Error Sum of Squares SSLF Lack of Fit Sum of Squares SSPE Pure Error Sum of Squares SSR Regression Sum of Squares xxiv Symbols and Abbreviations

SST Total Sum of Squares SSW Within-group Sum of Squares TNR True Negative Ratio TPR True Positive Ratio VIF Variance Inflation Factor

Tradenames EXCEL Microsoft Corporation MATLAB The MathWorks, Inc. SPSS SPSS, Inc. STATISTICA Statsoft, Inc. WINDOWS Microsoft Corporation