Chapter 2 Principal Components Analysis

Math 3210, Dr. Zeng

Outline
- What is Principal Components Analysis (PCA)?
- Data examples
- Ideas behind PCA
- PC loadings and scores
- Performing PCA in R
- Practice

What is Principal Components Analysis (PCA)?

Principal component analysis is performed in order to simplify the description of a set of interrelated variables. The technique can be summarized as a method of transforming the original variables into new, uncorrelated variables. The new variables are called the principal components, and each principal component is a linear combination of the original variables. The goal of PCA is to find the best low-dimensional representation of the variation in a multivariate data set; it is therefore one of the classical methods of dimension reduction. It reduces the number of variables/dimensions without losing much of the information.

Data Example

In the case of the wine data set, we have 13 chemical concentrations describing wine samples from three different cultivars. We can carry out a principal component analysis to investigate whether we can capture most of the variation between samples using a smaller number of new variables (principal components), where each of these new variables is a linear combination of all or some of the 13 chemical concentrations.

Ideas behind PCA

Suppose that there are p variables x1, x2, ..., xp in the dataset. In PCA, we create a set of p new variables, C1, C2, ..., Cp, each of which is a linear combination of x1, x2, ..., xp. These linear combinations C1, C2, ..., Cp are called the principal components:

C1 = a11 x1 + a12 x2 + ... + a1p xp
C2 = a21 x1 + a22 x2 + ... + a2p xp
...
Cp = ap1 x1 + ap2 x2 + ... + app xp

The coefficients a11, a12, ..., app are chosen to satisfy three requirements:
- The variances of the components are as large as possible and non-increasing: var(C1) >= var(C2) >= ... >= var(Cp).
- The values of any two principal components are uncorrelated.
- For any principal component, the sum of the squared coefficients is one; that is, ai1^2 + ai2^2 + ... + aip^2 = 1 for every i = 1, ..., p.

The mathematical solution for the coefficients was originally derived by Hotelling (1933). The solution is illustrated graphically in the next figure.

[Figure: Illustration of two principal components. Scatter plot of V5 (vertical axis) against V4 (horizontal axis).]

Question: how do we locate the two principal components C1 and C2 in this figure?

Principal Component Loadings
- C1 = a11 x1 + a12 x2 + ... + a1p xp is called the first principal component of x1, x2, ..., xp.
- The coefficients a11, ..., a1p are called the loadings of the first principal component.
- The loadings jointly make up the principal component loading vector, denoted by φ1 = (a11, ..., a1p)'.

Principal Component Scores
- Once we obtain the loadings aij for the principal components, we can calculate the principal component scores for each observation.
- The first PC score for observation 1 (row 1) is obtained by plugging the observed variable values for that observation into the first PC linear combination C1 = a11 x1 + a12 x2 + ... + a1p xp.
- You will have n PC scores for the first principal component because there is a total of n observations.
- You can repeat the same process to obtain n PC scores for the second principal component, and so on.

Questions to consider:
- How do we estimate the PC loadings (coefficients aij)? [use R]
- How do we estimate the corresponding PC scores? [use R]
- How do we determine how many principal components we essentially need? [numerically and graphically]
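The loadings and scores defined above can be computed directly from an eigen-decomposition of the covariance (here, correlation) matrix of the standardised variables. The short R sketch below is not part of the original notes and uses the built-in iris measurements purely as stand-in data; it verifies the three requirements numerically.

x <- scale(iris[, 1:4])        # standardise the p = 4 example variables
eig <- eigen(cov(x))           # cov(x) is the correlation matrix since x is standardised
loadings <- eig$vectors        # column i holds the loadings (ai1, ..., aip) of Ci
colSums(loadings^2)            # each column of squared coefficients sums to 1
scores <- x %*% loadings       # row j gives the scores C1, ..., Cp for observation j
round(cor(scores), 3)          # any two components are (numerically) uncorrelated
apply(scores, 2, var)          # var(C1) >= var(C2) >= ... >= var(Cp)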
Performing PCA in R

To carry out a principal component analysis (PCA) on a multivariate data set, the first step is often to standardise the variables under study using the scale() function. This is necessary if the input variables have very different variances, which is true in this case, as the concentrations of the 13 chemicals have very different variances. For example, to standardise the concentrations of the 13 chemicals in the wine samples:

stdconcentrations <- as.data.frame(scale(wine[2:14]))  # standardise the variables

Once you have standardised your variables, you can carry out a principal component analysis on the standardised concentrations using the princomp() function in R.

wine.pca <- princomp(stdconcentrations)  # do a PCA
summary(wine.pca)

Importance of components:
                           Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
Standard deviation      2.1631951 1.5757366 1.1991447 0.9559347 0.92110518
Proportion of Variance  0.3619885 0.1920749 0.1112363 0.0706903 0.06563294
Cumulative Proportion   0.3619885 0.5540634 0.6652997 0.7359900 0.80162293
                            Comp.6     Comp.7     Comp.8     Comp.9
Standard deviation      0.79878171 0.74022473 0.58867607 0.53596364
Proportion of Variance  0.04935823 0.04238679 0.02680749 0.02222153
Cumulative Proportion   0.85098116 0.89336795 0.92017544 0.94239698
                           Comp.10    Comp.11    Comp.12     Comp.13
Standard deviation      0.49949266 0.47383559 0.40966094 0.320619963
Proportion of Variance  0.01930019 0.01736836 0.01298233 0.007952149
Cumulative Proportion   0.96169717 0.97906553 0.99204785 1.000000000

Note:
- The summary above gives us the standard deviation of each component and the proportion of the total variance explained by each component.
- The first five principal components account for over 80% of the variability in the data. Subsequent principal components each contribute smaller amounts.
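The "Proportion of Variance" and "Cumulative Proportion" rows of this summary can be reproduced directly from the component standard deviations. A brief sketch of the arithmetic, assuming the wine.pca object created above:

pc.var <- wine.pca$sdev^2        # variance of each principal component
prop   <- pc.var / sum(pc.var)   # proportion of the total variance explained
cumsum(prop)                     # cumulative proportion, as in the summary table
which(cumsum(prop) >= 0.80)[1]   # smallest number of components reaching 80% (5 here)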
The standard deviation of the components is stored in a named element called "sdev" of the output object made by princomp():

wine.pca$sdev  # standard deviation of the PCs

   Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7
2.1631951 1.5757366 1.1991447 0.9559347 0.9211052 0.7987817 0.7402247
   Comp.8    Comp.9   Comp.10   Comp.11   Comp.12   Comp.13
0.5886761 0.5359636 0.4994927 0.4738356 0.4096609 0.3206200

We can also extract the PC loadings from the object wine.pca by using

wine.pca$loadings  # loadings

Loadings:
    Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
V2  -0.144 -0.484 -0.207 0.266 0.214 0.396 0.509 0.212
V3  0.245 -0.225 0.537 0.537 -0.421 -0.309
V4  -0.316 0.626 -0.214 0.143 0.154 0.149 -0.170 -0.308
V5  0.239 0.612 -0.101 0.287 0.428 0.200
V6  -0.142 -0.300 0.131 -0.352 -0.727 -0.323 -0.156 0.271
V7  -0.395 0.146 0.198 0.149 -0.406 0.286 -0.320
V8  -0.423 0.151 0.152 0.109 -0.187 -0.163
V9  0.299 0.170 -0.203 0.501 -0.259 -0.595 -0.233 0.196 0.216
V10 -0.313 0.149 0.399 -0.137 -0.534 -0.372 0.368 -0.209 0.134
V11 -0.530 -0.137 -0.419 0.228 -0.291
V12 -0.297 0.279 -0.428 0.174 0.106 -0.232 0.437 -0.522
V13 -0.376 0.164 0.166 0.184 0.101 0.266 0.137 0.524
V14 -0.287 -0.365 -0.127 -0.232 0.158 0.120 0.120 -0.576 0.162
    Comp.11 Comp.12 Comp.13
V2  -0.226 -0.266
V3  0.122
V4  -0.499 0.141
V5  0.479
V6
V7  0.304 -0.304 0.464
V8  -0.832
V9  0.117 -0.114
V10 -0.237 0.117
V11 0.604
V12 0.259
V13 0.601 0.157
V14 0.539

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
Proportion Var  0.077  0.077  0.077  0.077  0.077  0.077  0.077  0.077
Cumulative Var  0.077  0.154  0.231  0.308  0.385  0.462  0.538  0.615
               Comp.9 Comp.10 Comp.11 Comp.12 Comp.13
SS loadings     1.000   1.000   1.000   1.000   1.000
Proportion Var  0.077   0.077   0.077   0.077   0.077
Cumulative Var  0.692   0.769   0.846   0.923   1.000

Note:
- The output above gives the estimated loadings of each principal component.
- Some entries are blank. These loadings are not zero; they are simply smaller than the default print cutoff of 0.1 in absolute value and are suppressed, since we are mainly concerned with the loadings that are largest in absolute value.
- Taking the first PC as an example, the three largest loadings in absolute value are -0.423 for V8, -0.395 for V7, and -0.376 for V13.
- The first PC accounts for 36.19% of the variability in the original data.

We can also extract the PC scores from the object wine.pca by using

winescores <- wine.pca$scores  # PC scores
winescores[1:5, 1]  # extract the PC scores of the first 5 observations for the first PC

[1] -3.307421 -2.203250 -2.509661 -3.746497 -1.006070

Determine how many principal components to retain

In order to decide how many principal components should be retained, it is common to summarise the results of a principal components analysis by making a scree plot, which we can do in R using the screeplot() function:

screeplot(wine.pca, main = "scree plot")

[Figure: bar-style scree plot of the component variances, Comp.1 through Comp.10.]

screeplot(wine.pca, type = "lines")

[Figure: line-style scree plot of the component variances, Comp.1 through Comp.10.]

There are three ways to decide how many principal components to retain.
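As a check on how the scores are built from the loadings, and as one simple numerical complement to the scree plot, the sketch below (not part of the original notes; it assumes the stdconcentrations and wine.pca objects created above) recomputes the PC scores by hand and applies a common eigenvalue-greater-than-one retention rule.

L <- unclass(wine.pca$loadings)                      # 13 x 13 matrix of loadings
scores.manual <- as.matrix(stdconcentrations) %*% L  # Ci = ai1*x1 + ... + aip*xp for each row
scores.manual[1:5, 1]                                # should match winescores[1:5, 1] above,
                                                     # since stdconcentrations already has mean 0
sum(wine.pca$sdev^2 > 1)                             # components with variance above 1
                                                     # (three, from the sdev values printed above)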