CHAPTER 4 Exploratory Factor Analysis and Principal Components
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Choosing Principal Components: a New Graphical Method Based on Bayesian Model Selection
Choosing principal components: a new graphical method based on Bayesian model selection Philipp Auer Daniel Gervini1 University of Ulm University of Wisconsin–Milwaukee October 31, 2007 1Corresponding author. Email: [email protected]. Address: P.O. Box 413, Milwaukee, WI 53201, USA. Abstract This article approaches the problem of selecting significant principal components from a Bayesian model selection perspective. The resulting Bayes rule provides a simple graphical technique that can be used instead of (or together with) the popular scree plot to determine the number of significant components to retain. We study the theoretical properties of the new method and show, by examples and simulation, that it provides more clear-cut answers than the scree plot in many interesting situations. Key Words: Dimension reduction; scree plot; factor analysis; singular value decomposition. MSC: Primary 62H25; secondary 62-09. 1 Introduction Multivariate datasets usually contain redundant information, due to high correla- tions among the variables. Sometimes they also contain irrelevant information; that is, sources of variability that can be considered random noise for practical purposes. To obtain uncorrelated, significant variables, several methods have been proposed over the years. Undoubtedly, the most popular is Principal Component Analysis (PCA), introduced by Hotelling (1933); for a comprehensive account of the methodology and its applications, see Jolliffe (2002). In many practical situations the first few components accumulate most of the variability, so it is common practice to discard the remaining components, thus re- ducing the dimensionality of the data. This has many advantages both from an inferential and from a descriptive point of view, since the leading components are often interpretable indices within the context of the problem and they are statis- tically more stable than subsequent components, which tend to be more unstruc- tured and harder to estimate. -
Arxiv:1908.07390V1 [Stat.AP] 19 Aug 2019
An Overview of Statistical Data Analysis Rui Portocarrero Sarmento Vera Costa LIAAD-INESC TEC FEUP, University of Porto PRODEI - FEUP, University of Porto [email protected] [email protected] August 21, 2019 Abstract The use of statistical software in academia and enterprises has been evolving over the last years. More often than not, students, professors, workers and users in general have all had, at some point, exposure to statistical software. Sometimes, difficulties are felt when dealing with such type of software. Very few persons have theoretical knowledge to clearly understand software configurations or settings, and sometimes even the presented results. Very often, the users are required by academies or enterprises to present reports, without the time to explore or understand the results or tasks required to do a optimal prepara- tion of data or software settings. In this work, we present a statistical overview of some theoretical concepts, to provide a fast access to some concepts. Keywords Reporting Mathematics Statistics and Applications · · 1 Introduction Statistics is a set of methods used to analyze data. The statistic is present in all areas of science involving the collection, handling and sorting of data, given the insight of a particular phenomenon and the possibility that, from that knowledge, inferring possible new results. One of the goals with statistics is to extract information from data to get a better understanding of the situations they represent. Thus, the statistics can be thought of as the science of learning from data. Currently, the high competitiveness in search technologies and markets has caused a constant race for the information. -
Discriminant Function Analysis
Discriminant Function Analysis Discriminant Function Analysis ● General Purpose ● Computational Approach ● Stepwise Discriminant Analysis ● Interpreting a Two-Group Discriminant Function ● Discriminant Functions for Multiple Groups ● Assumptions ● Classification General Purpose Discriminant function analysis is used to determine which variables discriminate between two or more naturally occurring groups. For example, an educational researcher may want to investigate which variables discriminate between high school graduates who decide (1) to go to college, (2) to attend a trade or professional school, or (3) to seek no further training or education. For that purpose the researcher could collect data on numerous variables prior to students' graduation. After graduation, most students will naturally fall into one of the three categories. Discriminant Analysis could then be used to determine which variable(s) are the best predictors of students' subsequent educational choice. A medical researcher may record different variables relating to patients' backgrounds in order to learn which variables best predict whether a patient is likely to recover completely (group 1), partially (group 2), or not at all (group 3). A biologist could record different characteristics of similar types (groups) of flowers, and then perform a discriminant function analysis to determine the set of characteristics that allows for the best discrimination between the types. To index Computational Approach Computationally, discriminant function analysis is very similar to analysis of variance (ANOVA ). Let us consider a simple example. Suppose we measure height in a random sample of 50 males and 50 females. Females are, on the average, not as tall as http://www.statsoft.com/textbook/stdiscan.html (1 of 11) [2/11/2008 8:56:30 AM] Discriminant Function Analysis males, and this difference will be reflected in the difference in means (for the variable Height ). -
Factor Analysis
Factor Analysis Qian-Li Xue Biostatistics Program Harvard Catalyst | The Harvard Clinical & Translational Science Center Short course, October 27, 2016 1 Well-used latent variable models Latent Observed variable scale variable scale Continuous Discrete Continuous Factor Discrete FA analysis IRT (item response) LISREL Discrete Latent profile Latent class Growth mixture analysis, regression General software: MPlus, Latent Gold, WinBugs (Bayesian), NLMIXED (SAS) Objectives § What is factor analysis? § What do we need factor analysis for? § What are the modeling assumptions? § How to specify, fit, and interpret factor models? § What is the difference between exploratory and confirmatory factor analysis? § What is and how to assess model identifiability? 3 What is factor analysis § Factor analysis is a theory driven statistical data reduction technique used to explain covariance among observed random variables in terms of fewer unobserved random variables named factors 4 An Example: General Intelligence (Charles Spearman, 1904) Y1 δ1 δ Y2 2 δ General Y3 3 Intelligence Y4 δ4 F Y5 δ5 5 Why Factor Analysis? 1. Testing of theory § Explain covariation among multiple observed variables by § Mapping variables to latent constructs (called “factors”) 2. Understanding the structure underlying a set of measures § Gain insight to dimensions § Construct validation (e.g., convergent validity) 3. Scale development § Exploit redundancy to improve scale’s validity and reliability 6 Part I. Exploratory Factor Analysis (EFA) 7 One Common Factor Model: Model Specification Y δ1 λ1 1 Y1 = λ1F +δ1 λ2 Y δ F 2 2 Y2 = λ2 F +δ 2 λ3 Y δ 3 3 Y3 = λ3F +δ3 § The factor F is not observed; only Y1, Y2, Y3 are observed § δi represent variability in the Yi NOT explained by F § Yi is a linear function of F and δi 8 One Common Factor Model: Model Assumptions λ Y δ1 1 1 Y1 = λ1F +δ1 λ2 Y δ F 2 2 Y2 = λ2 F +δ 2 λ3 Y δ 3 3 Y3 = λ3F +δ3 § Factorial causation § F is independent of δj, i.e. -
Principal Component Analysis (PCA) As a Statistical Tool for Identifying Key Indicators of Nuclear Power Plant Cable Insulation
Iowa State University Capstones, Theses and Graduate Theses and Dissertations Dissertations 2017 Principal component analysis (PCA) as a statistical tool for identifying key indicators of nuclear power plant cable insulation degradation Chamila Chandima De Silva Iowa State University Follow this and additional works at: https://lib.dr.iastate.edu/etd Part of the Materials Science and Engineering Commons, Mechanics of Materials Commons, and the Statistics and Probability Commons Recommended Citation De Silva, Chamila Chandima, "Principal component analysis (PCA) as a statistical tool for identifying key indicators of nuclear power plant cable insulation degradation" (2017). Graduate Theses and Dissertations. 16120. https://lib.dr.iastate.edu/etd/16120 This Thesis is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected]. Principal component analysis (PCA) as a statistical tool for identifying key indicators of nuclear power plant cable insulation degradation by Chamila C. De Silva A thesis submitted to the graduate faculty in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE Major: Materials Science and Engineering Program of Study Committee: Nicola Bowler, Major Professor Richard LeSar Steve Martin The student author and the program of study committee are solely responsible for the content of this thesis. The Graduate College will ensure this thesis is globally accessible and will not permit alterations after a degree is conferred. -
Covid-19 Epidemiological Factor Analysis: Identifying Principal Factors with Machine Learning
medRxiv preprint doi: https://doi.org/10.1101/2020.06.01.20119560; this version posted June 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . Covid-19 Epidemiological Factor Analysis: Identifying Principal Factors with Machine Learning Serge Dolgikh1[0000-0001-5929-8954] Abstract. Based on a subset of Covid-19 Wave 1 cases at a time point near TZ+3m (April, 2020), we perform an analysis of the influencing factors for the epidemics impacts with several different statistical methods. The consistent con- clusion of the analysis with the available data is that apart from the policy and management quality, being the dominant factor, the most influential factors among the considered were current or recent universal BCG immunization and the prevalence of smoking. Keywords: Infectious epidemiology, Covid-19 1 Introduction A possible link between the effects of Covid-19 pandemics such as the rate of spread and the severity of cases; and a universal immunization program against tuberculosis with BCG vaccine (UIP, hereinafter) was suggested in [1] and further investigated in [2]. Here we provide a factor analysis based on the available data for the first group of countries that encountered the epidemics (Wave 1) by applying several commonly used factor-ranking methods. The intent of the work is to repeat similar analysis at several different points in the time series of cases that would allow to make a confident conclusion about the epide- miological and social factors with strong influence on the course of the epidemics. -
T-Test & Factor Analysis
University of Brighton Parametric tests T-test & factor analysis Better than non parametric tests Stringent assumptions More strings attached Assumes population distribution of sample is normal Major problem Alternatives • Continue using parametric test if the sample is large or enough evidence available to prove the usage • Transforming and manipulating variables • Use non parametric test Dr. Paurav Shukla t-test Tests used for significant differences between Used when they are two groups only groups Compares the mean score on some continuous Independent sample t-test variable Paired sample t-test Largely used in measuring before – after effect One-way ANOVA (between groups) Also called intervention effect One-way ANOVA (repeated groups) Option 1: Same sample over a period of time Two-way ANOVA (between groups) Option 2: Two different groups at a single point of MANOVA time Assumptions associated with t-test Why t-test is important? Population distribution of sample is normal Highly used technique for hypothesis testing Probability (random) sampling Can lead to wrong conclusions Level of measurement Type 1 error Dependent variable is continuous • Reject null hypothesis instead of accepting – When we assume there is a difference between groups, when it Observed or measured data must be is not independent Type 2 error If interaction is unavoidable use a stringent alpha • Accept null hypothesis instead of rejecting value (p<.01) – When we assume there is no difference between groups, when Homogeneity of variance there is Assumes that samples are obtained from populations Solution of equal variance • Interdependency between both errors ANOVA is reasonably robust against this • Selection of alpha level Dr. -
Exam-Srm-Sample-Questions.Pdf
SOCIETY OF ACTUARIES EXAM SRM - STATISTICS FOR RISK MODELING EXAM SRM SAMPLE QUESTIONS AND SOLUTIONS These questions and solutions are representative of the types of questions that might be asked of candidates sitting for Exam SRM. These questions are intended to represent the depth of understanding required of candidates. The distribution of questions by topic is not intended to represent the distribution of questions on future exams. April 2019 update: Question 2, answer III was changed. While the statement was true, it was not directly supported by the required readings. July 2019 update: Questions 29-32 were added. September 2019 update: Questions 33-44 were added. December 2019 update: Question 40 item I was modified. January 2020 update: Question 41 item I was modified. November 2020 update: Questions 6 and 14 were modified and Questions 45-48 were added. February 2021 update: Questions 49-53 were added. July 2021 update: Questions 54-55 were added. Copyright 2021 by the Society of Actuaries SRM-02-18 1 QUESTIONS 1. You are given the following four pairs of observations: x1=−==−=( 1,0), xx 23 (1,1), (2, 1), and x 4(5,10). A hierarchical clustering algorithm is used with complete linkage and Euclidean distance. Calculate the intercluster dissimilarity between {,xx12 } and {}x4 . (A) 2.2 (B) 3.2 (C) 9.9 (D) 10.8 (E) 11.7 SRM-02-18 2 2. Determine which of the following statements is/are true. I. The number of clusters must be pre-specified for both K-means and hierarchical clustering. II. The K-means clustering algorithm is less sensitive to the presence of outliers than the hierarchical clustering algorithm. -
Dynamic Factor Models Catherine Doz, Peter Fuleky
Dynamic Factor Models Catherine Doz, Peter Fuleky To cite this version: Catherine Doz, Peter Fuleky. Dynamic Factor Models. 2019. halshs-02262202 HAL Id: halshs-02262202 https://halshs.archives-ouvertes.fr/halshs-02262202 Preprint submitted on 2 Aug 2019 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. WORKING PAPER N° 2019 – 45 Dynamic Factor Models Catherine Doz Peter Fuleky JEL Codes: C32, C38, C53, C55 Keywords : dynamic factor models; big data; two-step estimation; time domain; frequency domain; structural breaks DYNAMIC FACTOR MODELS ∗ Catherine Doz1,2 and Peter Fuleky3 1Paris School of Economics 2Université Paris 1 Panthéon-Sorbonne 3University of Hawaii at Manoa July 2019 ∗Prepared for Macroeconomic Forecasting in the Era of Big Data Theory and Practice (Peter Fuleky ed), Springer. 1 Chapter 2 Dynamic Factor Models Catherine Doz and Peter Fuleky Abstract Dynamic factor models are parsimonious representations of relationships among time series variables. With the surge in data availability, they have proven to be indispensable in macroeconomic forecasting. This chapter surveys the evolution of these models from their pre-big-data origins to the large-scale models of recent years. We review the associated estimation theory, forecasting approaches, and several extensions of the basic framework. -
Statistical Analysis in JASP
Copyright © 2018 by Mark A Goss-Sampson. All rights reserved. This book or any portion thereof may not be reproduced or used in any manner whatsoever without the express written permission of the author except for the purposes of research, education or private study. CONTENTS PREFACE .................................................................................................................................................. 1 USING THE JASP INTERFACE .................................................................................................................... 2 DESCRIPTIVE STATISTICS ......................................................................................................................... 8 EXPLORING DATA INTEGRITY ................................................................................................................ 15 ONE SAMPLE T-TEST ............................................................................................................................. 22 BINOMIAL TEST ..................................................................................................................................... 25 MULTINOMIAL TEST .............................................................................................................................. 28 CHI-SQUARE ‘GOODNESS-OF-FIT’ TEST............................................................................................. 30 MULTINOMIAL AND Χ2 ‘GOODNESS-OF-FIT’ TEST. .......................................................................... -
Principal Components Analysis Maths and Statistics Help Centre
Principal Components Analysis Maths and Statistics Help Centre Usage - Data reduction tool - Vital first step in the multivariate analysis of continuous data - Used to find patterns in high dimensional data - Expresses data such that similarities and differences are highlighted - Once the patterns in the data are found, PCA is used to reduce the dimensionality of the data without much loss of information (choosing the right number of principal components is very important) - If you are using PCA for modelling purposes (either subsequent gradient analyses or regression) then normality is ideal. If it is for data reduction or exploratory purposes, then normality is not a strict requirement How does PCA work? - Uses the covariance/correlations of the raw data to define principal components that combine many correlated variables - The principal components are uncorrelated - Similar concept to regression – in regression you use one line to describe a relationship between two variables; here the principal component represents the line. Given a point on the line, or a value of the principal component you can discover the values of the variables. Choosing the right number of principal components - This is one of the most important parts of PCA. The number you choose needs to be the ones that give you the most information without significant loss of information. - Scree plots show the eigenvalues. These are used to tell us how important the principal components are. - When the scree plot plateaus then no more principal components are needed. - The loadings are a measure of how much each original variable contributes to each of the principal components. -
Optimal Principal Component Analysis of STEM XEDS Spectrum Images Pavel Potapov1,2* and Axel Lubk2
Potapov and Lubk Adv Struct Chem Imag (2019) 5:4 https://doi.org/10.1186/s40679-019-0066-0 RESEARCH Open Access Optimal principal component analysis of STEM XEDS spectrum images Pavel Potapov1,2* and Axel Lubk2 Abstract STEM XEDS spectrum images can be drastically denoised by application of the principal component analysis (PCA). This paper looks inside the PCA workfow step by step on an example of a complex semiconductor structure con- sisting of a number of diferent phases. Typical problems distorting the principal components decomposition are highlighted and solutions for the successful PCA are described. Particular attention is paid to the optimal truncation of principal components in the course of reconstructing denoised data. A novel accurate and robust method, which overperforms the existing truncation methods is suggested for the frst time and described in details. Keywords: PCA, Spectrum image, Reconstruction, Denoising, STEM, XEDS, EDS, EDX Background components [3–9]. In general terms, PCA reduces the Scanning transmission electron microscopy (STEM) dimensionality of a large dataset by projecting it into an delivers images of nanostructures at high spatial resolu- orthogonal basic of lower dimension. It can be shown tion matching that of broad beam transmission electron that among all possible linear projections, PCA ensures microscopy (TEM). Additionally, modern STEM instru- the smallest Euclidean diference between the initial and ments are typically equipped with electron energy-loss projected datasets or, in other words, provides the mini- spectrometers (EELS) and/or X-rays energy-dispersive mal least squares errors when approximating data with a spectroscopy (XEDS, sometimes abbreviated as EDS smaller number of variables [10].