Principal Component Analysis, Second Edition

Total Pages: 16

File Type: PDF, Size: 1020 KB

Principal Component Analysis, Second Edition
I.T. Jolliffe
Springer

Preface to the Second Edition

Since the first edition of the book was published, a great deal of new material on principal component analysis (PCA) and related topics has been published, and the time is now ripe for a new edition. Although the size of the book has nearly doubled, there are only two additional chapters. All the chapters in the first edition have been preserved, although two have been renumbered. All have been updated, some extensively. In this updating process I have endeavoured to be as comprehensive as possible. This is reflected in the number of new references, which substantially exceeds those in the first edition. Given the range of areas in which PCA is used, it is certain that I have missed some topics, and my coverage of others will be too brief for the taste of some readers. The choice of which new topics to emphasize is inevitably a personal one, reflecting my own interests and biases. In particular, atmospheric science is a rich source of both applications and methodological developments, but its large contribution to the new material is partly due to my long-standing links with the area, and not because of a lack of interesting developments and examples in other fields. For example, there are large literatures in psychometrics, chemometrics and computer science that are only partially represented. Due to considerations of space, not everything could be included. The main changes are now described.

Chapters 1 to 4, describing the basic theory and providing a set of examples, are the least changed. It would have been possible to substitute more recent examples for those of Chapter 4, but as the present ones give nice illustrations of the various aspects of PCA, there was no good reason to do so. One of these examples has been moved to Chapter 1. One extra property (A6) has been added to Chapter 2, with Property A6 in Chapter 3 becoming A7.

Chapter 5 has been extended by further discussion of a number of ordination and scaling methods linked to PCA, in particular varieties of the biplot.

Chapter 6 has seen a major expansion. There are two parts of Chapter 6 concerned with deciding how many principal components (PCs) to retain and with using PCA to choose a subset of variables. Both of these topics have been the subject of considerable research in recent years, although a regrettably high proportion of this research confuses PCA with factor analysis, the subject of Chapter 7. Neither Chapter 7 nor Chapter 8 has been expanded as much as Chapter 6 or Chapters 9 and 10.

Chapter 9 in the first edition contained three sections describing the use of PCA in conjunction with discriminant analysis, cluster analysis and canonical correlation analysis (CCA). All three sections have been updated, but the greatest expansion is in the third section, where a number of other techniques have been included which, like CCA, deal with relationships between two groups of variables. As elsewhere in the book, Chapter 9 mentions other interesting related methods that are not discussed in detail. In general, the line is drawn between inclusion and exclusion once the link with PCA becomes too tenuous.

Chapter 10 also included three sections in the first edition, on outlier detection, influence and robustness. All have been the subject of substantial research interest since the first edition; this is reflected in expanded coverage.
A fourth section, on other types of stability and sensitivity, has been added. Some of this material has been moved from Section 12.4 of the first edition; other material is new.

The next two chapters are also new and reflect my own research interests more closely than other parts of the book. An important aspect of PCA is interpretation of the components once they have been obtained. This may not be easy, and a number of approaches have been suggested for simplifying PCs to aid interpretation. Chapter 11 discusses these, covering the well-established idea of rotation as well as recently developed techniques. These techniques either replace PCA by alternative procedures that give simpler results, or approximate the PCs once they have been obtained. A small amount of this material comes from Section 12.4 of the first edition, but the great majority is new. The chapter also includes a section on physical interpretation of components.

My involvement in the developments described in Chapter 12 is less direct than in Chapter 11, but a substantial part of the chapter describes methodology and applications in atmospheric science and reflects my long-standing interest in that field. In the first edition, Section 11.2 was concerned with ‘non-independent and time series data.’ This section has been expanded to a full chapter (Chapter 12). There have been major developments in this area, including functional PCA for time series, and various techniques appropriate for data involving spatial and temporal variation, such as (multichannel) singular spectrum analysis, complex PCA, principal oscillation pattern analysis, and extended empirical orthogonal functions (EOFs). Many of these techniques were developed by atmospheric scientists and are little known in many other disciplines.

The last two chapters of the first edition are greatly expanded and become Chapters 13 and 14 in the new edition. There is some transfer of material elsewhere, but also new sections. In Chapter 13 there are three new sections: on size/shape data, on quality control, and a final ‘odds-and-ends’ section, which includes vector, directional and complex data, interval data, species abundance data and large data sets. All other sections have been expanded, that on common principal component analysis and related topics especially so.

The first section of Chapter 14 deals with varieties of non-linear PCA. This section has grown substantially compared to its counterpart (Section 12.2) in the first edition. It includes material on the Gifi system of multivariate analysis, principal curves, and neural networks. Section 14.2, on weights, metrics and centerings, combines, and considerably expands, the material of the first and third sections of the old Chapter 12. The content of the old Section 12.4 has been transferred to an earlier part of the book (Chapter 10), but the remaining old sections survive and are updated. The section on non-normal data includes independent component analysis (ICA), and the section on three-mode analysis also discusses techniques for three or more groups of variables. The penultimate section is new and contains material on sweep-out components, extended components, subjective components, goodness-of-fit, and further discussion of neural nets.

The appendix on numerical computation of PCs has been retained and updated, but the appendix on PCA in computer packages has been dropped from this edition, mainly because such material becomes out-of-date very rapidly.
The preface to the first edition noted three general texts on multivariate analysis. Since 1986 a number of excellent multivariate texts have appeared, including Everitt and Dunn (2001), Krzanowski (2000), Krzanowski and Marriott (1994) and Rencher (1995, 1998), to name just a few. Two large specialist texts on principal component analysis have also been published. Jackson (1991) gives a good, comprehensive coverage of principal component analysis from a somewhat different perspective than the present book, although it, too, is aimed at a general audience of statisticians and users of PCA. The other text, by Preisendorfer and Mobley (1988), concentrates on meteorology and oceanography. Because of this, the notation in Preisendorfer and Mobley differs considerably from that used in mainstream statistical sources. Nevertheless, as we shall see in later chapters, especially Chapter 12, atmospheric science is a field where much development of PCA and related topics has occurred, and Preisendorfer and Mobley’s book brings together a great deal of relevant material.

A much shorter book on PCA (Dunteman, 1989), which is targeted at social scientists, has also appeared since 1986. Like the slim volume by Daultrey (1976), written mainly for geographers, it contains little technical material.

The preface to the first edition noted some variations in terminology. Likewise, the notation used in the literature on PCA varies quite widely. Appendix D of Jackson (1991) provides a useful table of notation for some of the main quantities in PCA, collected from 34 references (mainly textbooks on multivariate analysis). Where possible, the current book uses notation adopted by a majority of authors where a consensus exists.

To end this Preface, I include a slightly frivolous, but nevertheless interesting, aside on both the increasing popularity of PCA and on its terminology. It was noted in the preface to the first edition that both the terms ‘principal component analysis’ and ‘principal components analysis’ are widely used. I have always preferred the singular form, as it is compatible with ‘factor analysis,’ ‘cluster analysis,’ ‘canonical correlation analysis’ and so on, but had no clear idea whether the singular or plural form was more frequently used. A search for references to the two forms in key words or titles of articles, using the Web of Science for the six years 1995–2000, revealed that the numbers of singular to plural occurrences were, respectively, 1017 to 527 in 1995–1996; 1330 to 620 in 1997–1998; and 1634 to 635 in 1999–2000. Thus, there has been nearly a 50 percent increase in citations of PCA in one form or another in that period, but most of that increase has been in the singular form, which now accounts for 72% of occurrences.
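As a quick check on the quoted counts (an added aside in R, not part of the preface), the arithmetic behaves as claimed:

    singular <- c(1017, 1330, 1634)      # 1995-96, 1997-98, 1999-2000
    plural   <- c(527, 620, 635)
    total    <- singular + plural        # 1544 1950 2269
    (total[3] - total[1]) / total[1]     # 0.47: "nearly a 50 percent increase"
    singular[3] / total[3]               # 0.72: singular share in 1999-2000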
Recommended publications
  • Structural Equation Modeling and Canonical Correlation Analysis: What Do They Have in Common?
    DOCUMENT RESUME: ED 383 760, TM 023 293. AUTHOR: Fan, Xitao. TITLE: Structural Equation Modeling and Canonical Correlation Analysis: What Do They Have in Common? PUB DATE: Apr 95. NOTE: 28p.; paper presented at the Annual Meeting of the American Educational Research Association (San Francisco, CA, April 18-22, 1995). PUB TYPE: Reports, Evaluative/Feasibility (142); Speeches/Conference Papers (150). EDRS PRICE: MF01/PC02 Plus Postage. DESCRIPTORS: Comparative Analysis; *Correlation; Mathematical Models; *Research Methodology; *Structural Equation Models. ABSTRACT: This paper, in a fashion easy to follow, illustrates the interesting relationship between structural equation modeling and canonical correlation analysis. Although computationally somewhat inconvenient, representing canonical correlation as a structural equation model may provide some information which is not available from conventional canonical correlation analysis. Hierarchically, with regard to the degree of generality of the techniques, it is suggested that structural equation modeling is the more general approach. For researchers interested in these techniques, understanding the interrelationship among them is meaningful, since a judicious choice among these analytic techniques depends on such understanding. Three tables and two figures present the data. (Contains 25 references.)
  • Weighted Means and Grouped Data on the Graphing Calculator
    Weighted Means. Suppose that, in a recent month, a bank customer had $2500 in his account for 5 days, $1900 for 2 days, $1650 for 3 days, $1375 for 9 days, $1200 for 1 day, $900 for 6 days, and $675 for 4 days. To calculate the average balance for that month, you would use a weighted mean: x̄ = Σ(w·x) / Σw. To do this automatically on a calculator, enter the account balances in L1 and the number of days (the weights) in L2. As before, go to STAT CALC and 1:1-Var Stats. This time, type L1, L2 after 1-Var Stats and press ENTER. The weighted (sample) mean is x̄ ≈ 1430.83, so the average balance for the month is $1430.83. The display also reports the weighted sample standard deviation.
    Estimating the Mean and Standard Deviation from a Frequency Distribution. If your data is organized into a frequency distribution, you can still estimate the mean and standard deviation. For example, suppose that we are given only a frequency distribution of the heights of the 30 males instead of the list of individual heights:
    Height (in.)   Frequency (f)
    62-64          3
    65-67          7
    68-70          9
    71-73          8
    74-76          3
    Total          n = 30
    We can calculate the midpoint of each height class and use that midpoint to represent the class: (62 + 64)/2 = 63, (65 + 67)/2 = 66, then 69, 72, and 75. We then find the weighted mean and standard deviation for the distribution of midpoints with the given frequencies, as in the example above. The approximate sample mean of the distribution is x̄ = (3·63 + 7·66 + 9·69 + 8·72 + 3·75)/30 ≈ 69.1 inches, and the calculator likewise reports the approximate sample standard deviation.
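    The same two calculations can be reproduced in a few lines of R; this is a sketch using the numbers from the example above (R is not part of the original handout, which uses a graphing calculator):

      x <- c(2500, 1900, 1650, 1375, 1200, 900, 675)  # account balances
      w <- c(5, 2, 3, 9, 1, 6, 4)                     # days at each balance
      weighted.mean(x, w)                             # 1430.833

      mid <- c(63, 66, 69, 72, 75)                    # class midpoints
      f   <- c(3, 7, 9, 8, 3)                         # class frequencies
      sum(f * mid) / sum(f)                           # approx. 69.1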
  • Lecture 3: Measure of Central Tendency
    Lecture 3: Measure of Central Tendency. Donglei Du ([email protected]), Faculty of Business Administration, University of New Brunswick, Fredericton, NB, Canada, E3B 9Y2. ADM 2623: Business Statistics.
    Table of contents: 1. Measure of central tendency: location parameter (Introduction; Arithmetic Mean; Weighted Mean (WM); Median; Mode; Geometric Mean; Mean for grouped data; The Median for Grouped Data; The Mode for Grouped Data). 2. Discussion: How to lie with averages? Or how to defend yourself against those lying with averages?
    Introduction. These measures characterize the average or typical behavior of the data. There are many types of central tendency measures: arithmetic mean, weighted arithmetic mean, geometric mean, median, mode.
    Arithmetic Mean. The arithmetic mean of a set of n numbers is AM = (x_1 + ··· + x_n)/n. For a population, µ = (Σ_{i=1}^N x_i)/N; for a sample, x̄ = (Σ_{i=1}^n x_i)/n.
    Example: A sample of five executives received the following bonuses last year ($000): 14.0, 15.0, 17.0, 16.0, 15.0. Problem: determine the average bonus given last year. Solution: x̄ = (14 + 15 + 17 + 16 + 15)/5 = 77/5 = 15.4.
    Example: the weight example (weight.csv). The R code:
      weight <- read.csv("weight.csv")
      sec_01A <- weight$Weight.01A.2013Fall
      mean(sec_01A)  ## [1] 155.8548
    Will Rogers phenomenon: Consider two sets of IQ scores of famous people.
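    The three main measures can also be computed together in R (an added sketch, not from the slides; R has no built-in mode for data, so a small helper is defined):

      bonus <- c(14, 15, 17, 16, 15)  # executive bonuses ($000)
      mean(bonus)                     # arithmetic mean: 15.4
      median(bonus)                   # middle value: 15

      stat_mode <- function(x) {      # most frequent value(s)
        tab <- table(x)
        as.numeric(names(tab)[tab == max(tab)])
      }
      stat_mode(bonus)                # 15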
  • Multivariate Analyses
    Newsom, Psy 522/622 Multiple Regression and Multivariate Quantitative Methods, Winter 2021. Multivariate Analyses. The term "multivariate" in multivariate analysis has been defined variously by different authors and has no single definition. Most statistics books on multivariate statistics define multivariate statistics as tests that involve multiple dependent (or response) variables together (Pituch & Stevens, 2015; Tabachnick & Fidell, 2013; Tatsuoka, 1988). But common usage by some researchers and authors also includes analyses with multiple independent variables, such as multiple regression, in the definition of multivariate statistics (e.g., Huberty, 1994). Some analyses, such as principal components analysis or canonical correlation, really have no independent or dependent variable as such, but could be conceived of as analyses of multiple responses. A stricter statistical definition would define multivariate analysis in terms of two or more random variables and their multivariate distributions (e.g., Timm, 2002), in which joint distributions must be considered to understand the statistical tests. I suspect the statistical definition is what underlies the selection of analyses that are included in most multivariate statistics books. Multivariate statistics texts nearly always focus on continuous dependent variables, but continuous variables would not be required to consider an analysis multivariate. Huberty (1994) describes the goals of multivariate statistical tests as including prediction (of continuous outcomes or
  • Random Variables and Applications
    Random Variables and Applications. OPRE 6301. Random Variables: As noted earlier, variability is omnipresent in the business world. To model variability probabilistically, we need the concept of a random variable. A random variable is a numerically valued variable which takes on different values with given probabilities. Examples: the return on an investment in a one-year period; the price of an equity; the number of customers entering a store; the sales volume of a store on a particular day; the turnover rate at your organization next year.
    Types of Random Variables. A discrete random variable is one that takes on a countable number of possible values, e.g., the total of a roll of two dice (2, 3, ..., 12), the number of desktops sold (0, 1, ...), or a customer count (0, 1, ...). A continuous random variable is one that takes on an uncountable number of possible values, e.g., an interest rate (3.25%, 6.125%, ...), a task completion time (a nonnegative value), or the price of a stock (a nonnegative value). Basic concept: integer or rational numbers are discrete, while real numbers are continuous.
    Probability Distributions. The "randomness" of a random variable is described by a probability distribution. Informally, the probability distribution specifies the probability or likelihood for a random variable to assume a particular value. Formally, let X be a random variable and let x be a possible value of X. Then we have two cases. Discrete: the probability mass function of X specifies P(x) ≡ P(X = x) for all possible values of x. Continuous: the probability density function of X is a function f(x) such that f(x) · h ≈ P(x < X ≤ x + h) for small positive h.
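    A quick numerical illustration in R (an added sketch, not part of the notes): the pmf of the two-dice total, and the density approximation f(x) · h ≈ P(x < X ≤ x + h) for a standard normal X.

      total <- outer(1:6, 1:6, "+")   # all 36 equally likely outcomes
      pmf <- table(total) / 36        # P(X = x) for x = 2, ..., 12
      pmf["7"]                        # 6/36

      x <- 1; h <- 0.001
      dnorm(x) * h                    # f(x) * h
      pnorm(x + h) - pnorm(x)         # P(x < X <= x + h): nearly identical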
  • Mistaking the Forest for the Trees: The Mistreatment of Group-Level Treatments in the Study of American Politics
    Mistaking the Forest for the Trees: The Mistreatment of Group-Level Treatments in the Study of American Politics. Kelly T. Rader. Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences, Columbia University, 2012. © 2012 Kelly T. Rader, All Rights Reserved.
    ABSTRACT: Over the past few decades, the field of political science has become increasingly sophisticated in its use of empirical tests for theoretical claims. One particularly productive strain of this development has been the identification of the limitations of and challenges in using observational data. Making causal inferences with observational data is difficult for numerous reasons. One reason is that one can never be sure that the estimate of interest is un-confounded by omitted variable bias (or, in causal terms, that a given treatment is ignorable or conditionally random). However, when the ideal hypothetical experiment is impractical, illegal, or impossible, researchers can often use quasi-experimental approaches to identify causal effects more plausibly than with simple regression techniques. Another reason is that, even if all of the confounding factors are observed and properly controlled for in the model specification, one can never be sure that the unobserved (or error) component of the data generating process conforms to the assumptions one must make to use the model. If it does not, then this manifests itself in terms of bias in standard errors and incorrect inference on statistical significance of quantities of interest.
  • Principal Component Analysis: Application to Statistical Process Control
    Chapter 1. Principal Component Analysis: Application to Statistical Process Control. Chapter written by Gilbert SAPORTA and Ndèye NIANG.
    1.1. Introduction. Principal component analysis (PCA) is an exploratory statistical method for the graphical description of the information present in large datasets. In most applications, PCA consists of studying p variables measured on n individuals. When n and p are large, the aim is to synthesize the huge quantity of information into an easy and understandable form. Unidimensional or bidimensional studies can be performed on variables using graphical tools (histograms, box plots) or numerical summaries (mean, variance, correlation). However, these simple preliminary studies are insufficient in a multidimensional context, since they do not take into account the possible relationships between variables, which is often the most important point. Principal component analysis is often considered the basic method of factor analysis, which aims to find linear combinations of the p variables, called components, that can be used to visualize the observations in a simple way. Because it transforms a large number of correlated variables into a few uncorrelated principal components, PCA is a dimension reduction method. However, PCA can also be used as a multivariate outlier detection method, especially by studying the last principal components. This property is useful in multidimensional quality control.
    1.2. Data table and related subspaces. 1.2.1. Data and their characteristics. Data are generally represented in a rectangular table with n rows for the individuals and p columns corresponding to the variables. Choosing the individuals and variables to analyze is a crucial phase which has an important influence on PCA results.
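    For concreteness, a minimal R sketch of both uses described above (an added illustration, not code from the chapter; the built-in USArrests dataset stands in for the n × p data table):

      X <- scale(USArrests)          # n individuals x p standardized variables
      pca <- prcomp(X)

      summary(pca)                   # variance explained by each component
      head(pca$x[, 1:2])             # first two PCs, for dimension reduction/plots

      # Outlier screening via the *last* component: observations with extreme
      # scores on it fit the correlation structure of the data poorly.
      last <- pca$x[, ncol(pca$x)]
      head(sort(abs(last), decreasing = TRUE))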
  • Statistics, Measures of Central Tendency I
    Statistics, Measures of Central Tendency I. We are considering a random variable X with a probability distribution which has some parameters. We want to get an idea what these parameters are. We perform an experiment n times and record the outcome. This means we have X_1, ..., X_n i.i.d. random variables, with the same probability distribution as X. We want to use the outcome to infer what the parameters are.
    Mean: The outcomes are x_1, ..., x_n. The sample mean is x̄ := (x_1 + ··· + x_n)/n, also sometimes called the average. The expected value of X, E X, is also called the mean of X, often denoted by µ and sometimes called the population mean.
    Median: The number such that half the values are below and half above. If the sample is of even size, you take the average of the two middle terms.
    Mode: The number that occurs most frequently. There could be several modes, or no mode.
    Example: You have a coin for which you know that P(H) = p and P(T) = 1 − p. You would like to estimate p. You toss it n times and count the number of heads. The sample mean should be an estimate of p: E X = p, and E(X_1 + ··· + X_n) = np, so E((X_1 + ··· + X_n)/n) = p.
    Descriptive Statistics: Frequency Distribution. Divide the range into a number of equal disjoint intervals. For each interval, count the number of elements of the sample occurring in it.
    (Dan Barbasch, Math 1105, Chapter 9, week of September 25.)
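    A simulation sketch in R (an added illustration, not from the notes; the value p = 0.3 is an assumption): the sample mean of 0/1 tosses estimates p, as the identity above suggests.

      set.seed(1)
      p <- 0.3
      tosses <- rbinom(1000, size = 1, prob = p)  # 1 = heads, 0 = tails
      mean(tosses)                                # close to 0.3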
  • Lesson 16.5 Measures of Central Tendency and Grouped Data
    INTEGRATED ALGEBRA 1 (Mr. Thompson). LESSON 16.5: MEASURES OF CENTRAL TENDENCY AND GROUPED DATA. Homework assignment #126: pages 695-697, #2-12 evens.
    Example: In the table, the data indicate the heights, in inches, of 17 basketball players. For these data find: a. the mode; b. the median; c. the mean.
    Height (in.)   Frequency
    77             2
    76             0
    75             5
    74             3
    73             4
    72             2
    71             1
    Solution. a. The greatest frequency, 5, occurs for the height of 75 inches. The mode, or height appearing most often, is 75.
    b. For 17 players, the median is the 9th number, so there are 8 heights greater than or equal to the median and 8 heights less than or equal to the median. Counting the frequencies going down, we have 2 + 0 + 5 = 7. Since the frequency of the next interval is 3, the 8th, 9th, and 10th heights are in this interval, 74. Counting the frequencies going up, we have 1 + 2 + 4 = 7. Again, the frequency of the next interval is 3, and the 8th, 9th, and 10th heights are in this interval. The 9th height, the median, is 74.
    c. (1) Multiply each height by its corresponding frequency: 77 × 2 = 154; 76 × 0 = 0; 75 × 5 = 375; 74 × 3 = 222; 73 × 4 = 292; 72 × 2 = 144; 71 × 1 = 71. (2) Find the total of these products: 154 + 0 + 375 + 222 + 292 + 144 + 71 = 1258. (3) Divide this total, 1258, by the total frequency, 17, to obtain the mean: 1258 ÷ 17 = 74.
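    The same answers can be checked in R (an added sketch, not part of the lesson) by expanding the frequency table back into the 17 raw heights:

      height <- c(77, 76, 75, 74, 73, 72, 71)
      freq   <- c(2, 0, 5, 3, 4, 2, 1)
      raw    <- rep(height, freq)      # the 17 individual heights

      height[which.max(freq)]          # mode: 75
      median(raw)                      # median: 74
      mean(raw)                        # mean: 74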
  • Multimodal Representation Learning Using Deep Multiset Canonical Correlation Analysis
    MULTIMODAL REPRESENTATION LEARNING USING DEEP MULTISET CANONICAL CORRELATION ANALYSIS. Krishna Somandepalli, Naveen Kumar, Ruchir Travadi, Shrikanth Narayanan. Signal Analysis and Interpretation Laboratory, University of Southern California. April 4, 2019.
    ABSTRACT: We propose Deep Multiset Canonical Correlation Analysis (dMCCA) as an extension to representation learning using CCA when the underlying signal is observed across multiple (more than two) modalities. We use a deep learning framework to learn non-linear transformations from different modalities to a shared subspace such that the representations maximize the ratio of between- and within-modality covariance of the observations. Unlike linear discriminant analysis, we do not need class information to learn these representations, and we show that this model can be trained for complex data using mini-batches. Using synthetic data experiments, we show that dMCCA can effectively recover the common signal across the different modalities corrupted by multiplicative and additive noise. We also analyze the sensitivity of our model in recovering the correlated components with respect to mini-batch size and dimension of the embeddings. Performance evaluation on noisy handwritten datasets shows that our model outperforms other CCA-based approaches and is comparable to deep neural network models trained end-to-end on this dataset.
    1. Introduction and Background. In many signal processing applications, we often observe the underlying phenomenon through different modes of signal measurement, or views or modalities. For example, the act of running can be measured by electrodermal activity, heart rate monitoring and respiratory rate. We refer to these measurements as modalities. These observations are often corrupted by sensor noise, as well as by variability of the signal across people, influenced by interactions with other physiological processes.
  • Propensity Score Matching; Regression Discontinuity; Limited Dependent Variables
    Propensity Score Matching; Regression Discontinuity; Limited Dependent Variables. Christopher F Baum, EC 823: Applied Econometrics, Boston College, Spring 2013.
    Propensity score matching. Policy evaluation seeks to determine the effectiveness of a particular intervention. In economic policy analysis, we rarely can work with experimental data generated by purely random assignment of subjects to the treatment and control groups. Random assignment, analogous to the ‘randomized clinical trial’ in medicine, seeks to ensure that participation in the intervention, or treatment, is the only differentiating factor between treatment and control units. In non-experimental economic data, we observe whether subjects were treated or not, but in the absence of random assignment, we must be concerned with differences between the treated and non-treated. For instance, do those individuals with higher aptitude self-select into a job training program? If so, they are not similar to corresponding individuals along that dimension, even though they may be similar in other aspects.
    The key concern is that of similarity. How can we find individuals who are similar on all observable characteristics, in order to match treated and non-treated individuals (or plants, or firms, ...)? With a single measure, we can readily compute a measure of distance between a treated unit and each candidate match. With multiple measures defining similarity, how are we to balance similarity along each of those dimensions? The method of propensity score matching (PSM) allows this matching problem to be reduced to a single dimension: that of the propensity score.
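    A toy R sketch of this reduction (the simulated data and variable names are assumptions, not Baum's): estimate the propensity score by logistic regression, then match each treated unit to its nearest control on that single score.

      set.seed(42)
      n <- 200
      x1 <- rnorm(n); x2 <- rnorm(n)                   # observable covariates
      treat <- rbinom(n, 1, plogis(0.5 * x1 - 0.3 * x2))

      ps <- fitted(glm(treat ~ x1 + x2, family = binomial))  # propensity scores

      # Nearest-neighbor match on the one-dimensional score
      treated  <- which(treat == 1)
      controls <- which(treat == 0)
      match_for <- sapply(treated, function(i) {
        controls[which.min(abs(ps[controls] - ps[i]))]
      })
      head(data.frame(treated, matched_control = match_for))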
  • Causal Inference Using Educational Observational Data: Statistical Bias Reduction Methods and Multilevel Data Extensions
    Causal Inference Using Educational Observational Data: Statistical Bias Reduction Methods and Multilevel Data Extensions. Jose M. Hernandez. A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, University of Washington, 2015. Reading Committee: Robert Abbott (Chair), Elizabeth Sanders, Joe Lott. Program Authorized to Offer Degree: College of Education. © Copyright 2015 Jose M. Hernandez.
    Abstract (Chair of the Supervisory Committee: Professor Robert Abbott, Ph.D., College of Education). This study utilizes a data-driven simulation design, which deviates from the traditional model-based approaches most commonly adopted in quasi-experimental Monte Carlo (MC) simulation studies, to answer two main questions. First, this study explores the finite sample properties of the most utilized quasi-experimental methods that control for observable selection bias in the field of education and compares them to traditional regression methods. Second, this study lends an insight into the effects of ignoring the multilevel structure of data commonly found in the field when using quasi-experimental methods. Specifically, treatment effects were estimated using (1) Ordinary Least Squares (OLS) multiple linear regression (treatment effects, adjusted for mean differences on confounders), (2) Propensity Score Matching (PSM) using nearest neighbor 1:n with replacement,