Heteroscedasticity


Basic Econometrics in Transportation
Amir Samimi, Civil Engineering Department, Sharif University of Technology
Primary source: Basic Econometrics (Gujarati)

Outline
- What is the nature of heteroscedasticity?
- What are its consequences?
- How does one detect it?
- What are the remedial measures?

Nature of Heteroscedasticity

An important assumption in the classical linear regression model (CLRM) is that E(ui²) = σ² for all i. This is the assumption of equal (homo) spread (scedasticity). Example: higher-income families on average save more than lower-income families, but there is also more variability in their savings.

Possible Reasons

1. As people learn, their errors of behavior become smaller over time. As the number of hours of typing practice increases, the average number of typing errors as well as its variance decreases.

2. As incomes grow, people have more choices about the disposition of their income. Rich people have more choices about their savings behavior.

3. As data-collecting techniques improve, σi² is likely to decrease. Banks that have sophisticated data-processing equipment are likely to commit fewer errors.

4. Heteroscedasticity can arise when there are outliers, that is, observations much different from the other observations in the sample.

5. Heteroscedasticity can arise when the model is not correctly specified. Very often what looks like heteroscedasticity is really due to important variables being omitted from the model.

6. Skewness in the distribution of a regressor is another source. The distribution of income and wealth in most societies is uneven, with the bulk of income and wealth owned by a few at the top.

7. Other sources include incorrect data transformations (ratio or first-difference transformations) and incorrect functional form (linear versus log-linear models).

Cross-Sectional and Time Series Data

Heteroscedasticity is likely to be more common in cross-sectional than in time series data. In cross-sectional data one usually deals with members of a population at a given point in time, and these members may be of different sizes, incomes, and so on. In time series data the variables tend to be of similar orders of magnitude, because one generally collects the data for the same entity over a period of time.

OLS Estimation with Heteroscedasticity

What happens to the OLS estimators and their variances when we drop only the homoscedasticity assumption? Is OLS still BLUE?
- It is easy to prove that the OLS estimator is still linear and unbiased; we can also show that it is consistent.
- It is no longer best, however, and its minimum variance is not given by the usual homoscedastic formula.
- What, then, is BLUE in the presence of heteroscedasticity?

Method of Generalized Least Squares

Ideally, we would like to give less weight to the observations coming from populations with greater variability. Consider

  Yi = β1 + β2Xi + ui = β1X0i + β2Xi + ui,  where X0i = 1 for every i.

Assume the heteroscedastic variances σi² are known, and divide both sides by σi:

  Yi/σi = β1(X0i/σi) + β2(Xi/σi) + ui/σi.

The variance of the transformed disturbance term is now homoscedastic: E[(ui/σi)²] = E(ui²)/σi² = 1. Apply OLS to the transformed model and get BLUE estimators.

GLS Estimators

Minimize the weighted residual sum of squares Σwi(Yi − β̂1* − β̂2*Xi)², with weights wi = 1/σi². Following the standard calculus techniques, we have

  β̂2* = (Σwi ΣwiXiYi − ΣwiXi ΣwiYi) / (Σwi ΣwiXi² − (ΣwiXi)²),
  var(β̂2*) = Σwi / (Σwi ΣwiXi² − (ΣwiXi)²).
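The transformed regression is simply weighted least squares with weights wi = 1/σi². Below is a minimal sketch of that idea, not part of the original slides: it assumes simulated data, σi known by construction, and the Python statsmodels library.

```python
# Minimal WLS sketch with known error variances (simulated, illustrative data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 200
x = rng.uniform(1, 10, n)
sigma_i = 0.5 * x                        # error spread grows with X: heteroscedastic
y = 2.0 + 3.0 * x + rng.normal(0, sigma_i)

X = sm.add_constant(x)                   # adds the X0i = 1 column

ols = sm.OLS(y, X).fit()                 # still linear, unbiased, consistent
wls = sm.WLS(y, X, weights=1.0 / sigma_i**2).fit()   # wi = 1/sigma_i^2

print("OLS:", ols.params, ols.bse)       # usual standard errors are not reliable here
print("WLS:", wls.params, wls.bse)       # BLUE when the sigma_i^2 are truly known
```

In practice σi² is rarely known, so the weights come from an assumed variance structure (for example, σi² proportional to Xi²); that feasible version is taken up under the remedial measures below.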
Consequences of Using OLS

Under heteroscedasticity, the usual OLS estimator of the variance is biased:
- It overestimates or underestimates on average, and we cannot tell whether the bias is positive or negative.
- We can therefore no longer rely on the usual confidence intervals and t and F tests.

If we persist in using the usual testing procedures despite heteroscedasticity, whatever conclusions we draw may be very misleading. Heteroscedasticity is potentially a serious problem, and the researcher needs to know whether it is present in a given situation.

Detection

There are no hard-and-fast rules for detecting heteroscedasticity, only a few rules of thumb. This is inevitable, because σi² could be known only if we had the entire population of Y values corresponding to the chosen X's; more often than not there is only one sample Y value for a particular value of X, and there is no way to know σi² from a single observation. Thus heteroscedasticity may be a matter of intuition, educated guesswork, or prior empirical experience.

Most detection methods are based on examination of the OLS residuals ûi. Those are what we observe, not the disturbances ui. We hope they are good estimates of ui, a hope that may be fulfilled if the sample size is fairly large.

Informal Methods

Nature of the problem. The nature of the problem may itself suggest that heteroscedasticity is likely to be encountered; for example, the residual variance around the regression of consumption on income increases with income.

Graphical method. The squared residuals ûi² are plotted against the fitted values Ŷi: is the estimated mean value of Y systematically related to the squared residual? No systematic pattern suggests there is perhaps no heteroscedasticity; a definite pattern suggests the absence of homoscedasticity. Using such knowledge, one may transform the data to alleviate the problem.

Formal Methods

Park Test

Park formalizes the graphical method by suggesting the log-linear model

  ln σi² = ln σ² + β ln Xi + vi.

Since σi² is generally unknown, Park suggests using ûi² as a proxy and running

  ln ûi² = ln σ² + β ln Xi + vi.

If β turns out to be statistically insignificant, the homoscedasticity assumption may be accepted. The particular functional form chosen by Park is only suggestive. Note: the error term vi may itself not satisfy the OLS assumptions.

Glejser Test

Glejser suggests regressing the absolute value of the estimated residuals on the X variable. The following functional forms are suggested:

  |ûi| = β1 + β2Xi + vi
  |ûi| = β1 + β2√Xi + vi
  |ûi| = β1 + β2(1/Xi) + vi
  |ûi| = β1 + β2(1/√Xi) + vi
  |ûi| = √(β1 + β2Xi) + vi
  |ûi| = √(β1 + β2Xi²) + vi

For large samples the first four give generally satisfactory results; the last two are nonlinear in the parameters. Note: some have argued that vi does not have a zero expected value, is serially correlated, and is heteroscedastic.

Spearman's Rank Correlation Test

- Fit the regression to the data on Y and X and estimate the residuals.
- Rank both |ûi| and Xi (or the fitted Ŷi) and compute Spearman's rank correlation coefficient

    rs = 1 − 6Σdi² / [n(n² − 1)],

  where di is the difference between the two ranks of the i-th observation.
- Assuming the population rank correlation coefficient is zero and n > 8, the significance of the sample rs can be tested by the t test with df = n − 2:

    t = rs√(n − 2) / √(1 − rs²).

  If the computed t value exceeds the critical t value, we may accept the hypothesis of heteroscedasticity.

Goldfeld-Quandt Test

- Rank the observations according to the Xi values.
- Omit c central observations and divide the remaining observations into two groups of (n − c)/2 observations each.
- Fit separate OLS regressions to the first and last sets of observations and obtain the residual sums of squares RSS1 and RSS2.
- Compute the ratio

    λ = (RSS2/df) / (RSS1/df).

If the ui are normally distributed and the assumption of homoscedasticity is valid, it can be shown that λ follows the F distribution. The ability of the test depends on how c is chosen: Goldfeld and Quandt suggest c = 8 if n = 30 and c = 16 if n = 60, while Judge et al. note that c = 4 if n = 30 and c = 10 if n is about 60.
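These recipes translate directly into code. The sketch below, again with simulated data and assuming statsmodels and SciPy are available, runs the Park regression, the Spearman rank test, and the Goldfeld-Quandt split (with c = 16 for n = 60, per Goldfeld and Quandt's suggestion).

```python
# Sketch of three detection tests on simulated heteroscedastic data.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_goldfeldquandt
from scipy import stats

rng = np.random.default_rng(0)
n = 60
x = rng.uniform(1, 10, n)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5 * x)          # variance increases with X
X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid

# Park test: ln(u_i^2) = ln(sigma^2) + beta*ln(X_i) + v_i; significant beta flags trouble
park = sm.OLS(np.log(resid**2), sm.add_constant(np.log(x))).fit()
print("Park:     beta =", park.params[1], " p =", park.pvalues[1])

# Spearman test: rank-correlate |u_i| with X_i, then t = r*sqrt(n-2)/sqrt(1-r^2), df = n-2
r, _ = stats.spearmanr(np.abs(resid), x)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
print("Spearman: r =", r, " t =", t)

# Goldfeld-Quandt: sort by X (column 1), drop c = 16 central observations, compare RSS
F, p, _ = het_goldfeldquandt(y, X, idx=1, drop=16)
print("GQ:       F =", F, " p =", p)
```

With the spread rising in X by construction, all three statistics should point toward heteroscedasticity.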
Breusch–Pagan–Godfrey Test

The success of the Goldfeld-Quandt test depends on c and on the X with which the observations are ordered. The BPG test avoids this dependence:

- Estimate Yi = β1 + β2X2i + ··· + βkXki + ui by OLS and obtain the residuals ûi.
- Obtain σ̃² = Σûi²/n, the maximum-likelihood estimator of σ².
- Construct the variables pi = ûi²/σ̃².
- Regress pi on the Z's: pi = α1 + α2Z2i + ··· + αmZmi + vi. Here σi² is assumed to be a linear function of the Z's, and some or all of the X's can serve as Z's.
- Obtain the ESS (explained sum of squares) of this regression and define Θ = ESS/2.

Assuming the ui are normally distributed, one can show that under homoscedasticity, as the sample size n increases indefinitely, Θ ∼ χ²(m − 1). The BPG test is thus an asymptotic, or large-sample, test.

White's General Heteroscedasticity Test

White's test does not rely on the normality assumption and is easy to implement.

- Estimate Yi = β1 + β2X2i + β3X3i + ui and obtain the residuals ûi.
- Run the auxiliary regression

    ûi² = α1 + α2X2i + α3X3i + α4X2i² + α5X3i² + α6X2iX3i + vi.

  Higher powers of the regressors can also be introduced.
- Under the null hypothesis of homoscedasticity, as the sample size n increases indefinitely, nR² from the auxiliary regression ∼ χ² with df equal to the number of regressors (excluding the constant).

If the chi-square value exceeds the critical value, the conclusion is that there is heteroscedasticity; if it does not, α2 = α3 = α4 = α5 = α6 = 0. It has been argued that if the cross-product terms are present, this is a joint test of heteroscedasticity and specification bias.

Remedial Measures

Heteroscedasticity does not destroy the unbiasedness and consistency of the OLS estimators, but they are no longer efficient, not even asymptotically. There are two approaches to remediation, depending on whether σi² is known.

When σi² is known: the most straightforward correction is weighted least squares, which provides BLUE estimators.

When σi² is unknown: is there a way of obtaining consistent estimates of the variances and covariances of the OLS estimators even in the presence of heteroscedasticity? The answer is yes.

White's Correction

White suggested a procedure by which asymptotically valid statistical inferences can be made about the true parameter values. For a two-variable regression model Yi = β1 + β2X2i + ui we showed that, under heteroscedasticity,

  var(β̂2) = Σxi²σi² / (Σxi²)²,  where xi = X2i − X̄2.

White's procedure replaces the unknown σi² with the squared OLS residuals ûi², which yields a consistent estimator of var(β̂2) and hence heteroscedasticity-consistent standard errors.
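Both tests and White's correction are available off the shelf. The sketch below uses the same simulated-data assumptions as before; note that statsmodels' het_breuschpagan returns an nR²-type LM statistic rather than Θ = ESS/2, so its value can differ slightly from the slide formula.

```python
# Sketch: BPG test, White test, and White's heteroscedasticity-consistent errors.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, het_white

rng = np.random.default_rng(1)
n = 200
x2 = rng.uniform(1, 10, n)
x3 = rng.uniform(1, 10, n)
y = 1.0 + 2.0 * x2 + 0.5 * x3 + rng.normal(0, 0.4 * x2)
X = sm.add_constant(np.column_stack([x2, x3]))

fit = sm.OLS(y, X).fit()

lm, lm_p, f, f_p = het_breuschpagan(fit.resid, X)   # Z's = the X's here
print("BPG:  ", lm, lm_p)

lm, lm_p, f, f_p = het_white(fit.resid, X)          # adds squares and cross products
print("White:", lm, lm_p)

# White's correction: keep the OLS point estimates, swap in robust (HC) variances.
# "HC0" is White's original estimator; "HC1"-"HC3" add finite-sample refinements.
robust = sm.OLS(y, X).fit(cov_type="HC0")
print(fit.bse)      # usual standard errors (biased under heteroscedasticity)
print(robust.bse)   # heteroscedasticity-consistent standard errors
```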