Principal Component Analysis, Second Edition
Total Page:16
File Type:pdf, Size:1020Kb
Principal Component Analysis, Second Edition I.T. Jolliffe Springer Preface to the Second Edition Since the first edition of the book was published, a great deal of new ma- terial on principal component analysis (PCA) and related topics has been published, and the time is now ripe for a new edition. Although the size of the book has nearly doubled, there are only two additional chapters. All the chapters in the first edition have been preserved, although two have been renumbered. All have been updated, some extensively. In this updat- ing process I have endeavoured to be as comprehensive as possible. This is reflected in the number of new references, which substantially exceeds those in the first edition. Given the range of areas in which PCA is used, it is certain that I have missed some topics, and my coverage of others will be too brief for the taste of some readers. The choice of which new topics to emphasize is inevitably a personal one, reflecting my own interests and biases. In particular, atmospheric science is a rich source of both applica- tions and methodological developments, but its large contribution to the new material is partly due to my long-standing links with the area, and not because of a lack of interesting developments and examples in other fields. For example, there are large literatures in psychometrics, chemometrics and computer science that are only partially represented. Due to consid- erations of space, not everything could be included. The main changes are now described. Chapters 1 to 4 describing the basic theory and providing a set of exam- ples are the least changed. It would have been possible to substitute more recent examples for those of Chapter 4, but as the present ones give nice illustrations of the various aspects of PCA, there was no good reason to do so. One of these examples has been moved to Chapter 1. One extra prop- vi Preface to the Second Edition erty (A6) has been added to Chapter 2, with Property A6 in Chapter 3 becoming A7. Chapter 5 has been extended by further discussion of a number of ordina- tion and scaling methods linked to PCA, in particular varieties of the biplot. Chapter 6 has seen a major expansion. There are two parts of Chapter 6 concerned with deciding how many principal components (PCs) to retain and with using PCA to choose a subset of variables. Both of these topics have been the subject of considerable research in recent years, although a regrettably high proportion of this research confuses PCA with factor anal- ysis, the subject of Chapter 7. Neither Chapter 7 nor 8 have been expanded as much as Chapter 6 or Chapters 9 and 10. Chapter 9 in the first edition contained three sections describing the use of PCA in conjunction with discriminant analysis, cluster analysis and canonical correlation analysis (CCA). All three sections have been updated, but the greatest expansion is in the third section, where a number of other techniques have been included, which, like CCA, deal with relationships be- tween two groups of variables. As elsewhere in the book, Chapter 9 includes yet other interesting related methods not discussed in detail. In general, the line is drawn between inclusion and exclusion once the link with PCA becomes too tenuous. Chapter 10 also included three sections in first edition on outlier de- tection, influence and robustness. All have been the subject of substantial research interest since the first edition; this is reflected in expanded cover- age. A fourth section, on other types of stability and sensitivity, has been added. Some of this material has been moved from Section 12.4 of the first edition; other material is new. The next two chapters are also new and reflect my own research interests more closely than other parts of the book. An important aspect of PCA is interpretation of the components once they have been obtained. This may not be easy, and a number of approaches have been suggested for simplifying PCs to aid interpretation. Chapter 11 discusses these, covering the well- established idea of rotation as well recently developed techniques. These techniques either replace PCA by alternative procedures that give simpler results, or approximate the PCs once they have been obtained. A small amount of this material comes from Section 12.4 of the first edition, but the great majority is new. The chapter also includes a section on physical interpretation of components. My involvement in the developments described in Chapter 12 is less direct than in Chapter 11, but a substantial part of the chapter describes method- ology and applications in atmospheric science and reflects my long-standing interest in that field. In the first edition, Section 11.2 was concerned with ‘non-independent and time series data.’ This section has been expanded to a full chapter (Chapter 12). There have been major developments in this area, including functional PCA for time series, and various techniques appropriate for data involving spatial and temporal variation, such as (mul- Preface to the Second Edition vii tichannel) singular spectrum analysis, complex PCA, principal oscillation pattern analysis, and extended empirical orthogonal functions (EOFs). Many of these techniques were developed by atmospheric scientists and are little known in many other disciplines. The last two chapters of the first edition are greatly expanded and be- come Chapters 13 and 14 in the new edition. There is some transfer of material elsewhere, but also new sections. In Chapter 13 there are three new sections, on size/shape data, on quality control and a final ‘odds-and- ends’ section, which includes vector, directional and complex data, interval data, species abundance data and large data sets. All other sections have been expanded, that on common principal component analysis and related topics especially so. The first section of Chapter 14 deals with varieties of non-linear PCA. This section has grown substantially compared to its counterpart (Sec- tion 12.2) in the first edition. It includes material on the Gifi system of multivariate analysis, principal curves, and neural networks. Section 14.2 on weights, metrics and centerings combines, and considerably expands, the material of the first and third sections of the old Chapter 12. The content of the old Section 12.4 has been transferred to an earlier part in the book (Chapter 10), but the remaining old sections survive and are updated. The section on non-normal data includes independent compo- nent analysis (ICA), and the section on three-mode analysis also discusses techniques for three or more groups of variables. The penultimate section is new and contains material on sweep-out components, extended com- ponents, subjective components, goodness-of-fit, and further discussion of neural nets. The appendix on numerical computation of PCs has been retained and updated, but, the appendix on PCA in computer packages has been dropped from this edition mainly because such material becomes out-of-date very rapidly. The preface to the first edition noted three general texts on multivariate analysis. Since 1986 a number of excellent multivariate texts have appeared, including Everitt and Dunn (2001), Krzanowski (2000), Krzanowski and Marriott (1994) and Rencher (1995, 1998), to name just a few. Two large specialist texts on principal component analysis have also been published. Jackson (1991) gives a good, comprehensive, coverage of principal com- ponent analysis from a somewhat different perspective than the present book, although it, too, is aimed at a general audience of statisticians and users of PCA. The other text, by Preisendorfer and Mobley (1988), con- centrates on meteorology and oceanography. Because of this, the notation in Preisendorfer and Mobley differs considerably from that used in main- stream statistical sources. Nevertheless, as we shall see in later chapters, especially Chapter 12, atmospheric science is a field where much devel- opment of PCA and related topics has occurred, and Preisendorfer and Mobley’s book brings together a great deal of relevant material. viii Preface to the Second Edition A much shorter book on PCA (Dunteman, 1989), which is targeted at social scientists, has also appeared since 1986. Like the slim volume by Daultrey (1976), written mainly for geographers, it contains little technical material. The preface to the first edition noted some variations in terminology. Likewise, the notation used in the literature on PCA varies quite widely. Appendix D of Jackson (1991) provides a useful table of notation for some of the main quantities in PCA collected from 34 references (mainly textbooks on multivariate analysis). Where possible, the current book uses notation adopted by a majority of authors where a consensus exists. To end this Preface, I include a slightly frivolous, but nevertheless in- teresting, aside on both the increasing popularity of PCA and on its terminology. It was noted in the preface to the first edition that both terms ‘principal component analysis’ and ‘principal components analysis’ are widely used. I have always preferred the singular form as it is compati- ble with ‘factor analysis,’ ‘cluster analysis,’ ‘canonical correlation analysis’ and so on, but had no clear idea whether the singular or plural form was more frequently used. A search for references to the two forms in key words or titles of articles using the Web of Science for the six years 1995–2000, re- vealed that the number of singular to plural occurrences were, respectively, 1017 to 527 in 1995–1996; 1330 to 620 in 1997–1998; and 1634 to 635 in 1999–2000. Thus, there has been nearly a 50 percent increase in citations of PCA in one form or another in that period, but most of that increase has been in the singular form, which now accounts for 72% of occurrences.