
PubH 7405: REGRESSION ANALYSIS

DIAGNOSTICS IN MULTIPLE REGRESSION

The data are in the form:

\{(y_i;\; x_{1i}, x_{2i}, \ldots, x_{ki})\}, \quad i = 1, \ldots, n

Multiple regression model:

Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)

The error terms are independently and identically distributed as normal with a constant variance.

A simple strategy for building a regression model consists of some, most, or all of the following five steps or phases: (1) data collection and preparation, (2) preliminary model investigation, (3) reduction of the predictor variables, (4) model refinement and selection, and (5) model validation. Say we have finished step (3) and tentatively have a good model: it is time for some fine-tuning!

Diagnostics to identify possible violations of the model's assumptions are often focused on these major issues: (1) non-linearity, (2) outlying and influential cases, (3) multicollinearity, (4) non-constant variance, and (5) non-independent errors.

DETECTION OF NON-LINEARITY

A limitation of the usual residual plots (say, residuals against values of a predictor variable) is that they may not show the nature of the "additional contribution" of a predictor variable beyond the contributions of the variables already in the model. For example, consider a multiple regression model with two independent variables X1 and X2: is the relationship between Y and X1 linear? "Added-variable plots" (also called "partial regression plots" or "adjusted variable plots") are refined residual plots that provide graphic information about the marginal importance of a predictor variable given the other variables already in the model.

ADDED-VARIABLE PLOTS

In an added-variable plot, the response variable Y and the predictor variable under investigation (say, X1) are both regressed against the other predictor variables already in the regression model, and the residuals are obtained for each. These two sets of residuals reflect the part of each (Y and X1) that is not linearly associated with the other predictor variables. The plot of one set of residuals against the other shows the marginal contribution of the candidate predictor in reducing the residual variability, as well as information about the nature of that marginal contribution.

A SIMPLE & SPECIFIC EXAMPLE

Consider the case in which we already have a regression model of Y on the predictor variable X2 and are now considering whether to add X1 to the model (if we do, we would have a multiple regression model of Y on (X1, X2)). In order to decide, we investigate two simple linear regression models, (a) the regression of Y on X2 and (b) the regression of X1 on X2, and obtain two sets of residuals as follows.

Regression #1: Y on X2 (first, or "old", model):

\hat{Y}_i = b_0 + b_2 X_{i2}, \qquad e_i(Y \mid X_2) = Y_i - \hat{Y}_i

These residuals represent the part of Y not explained by X2.

Regression #2: X1 on X2 (second, or "new", model):

\hat{X}_{1i} = b_0^* + b_2^* X_{i2}, \qquad e_i(X_1 \mid X_2) = X_{1i} - \hat{X}_{1i}

These residuals represent the part of X1 not contained in X2.

We now regress e_i(Y | X2), as the new dependent variable, on e_i(X1 | X2), as the independent variable. That is, we check whether "the part of X1 not contained in X2" can further explain "the part of Y not explained by X2"; if it can, X1 should be added to the model for Y which already has X2 in it.
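The claim behind this residual-on-residual regression can be stated as a worked identity; the following sketch (a consequence of the Frisch-Waugh-Lovell theorem, written in the notation above and assuming ordinary least squares with an intercept in both auxiliary regressions) is added here for clarity:

b_1 = \frac{\sum_{i=1}^{n} e_i(X_1 \mid X_2)\, e_i(Y \mid X_2)}{\sum_{i=1}^{n} e_i(X_1 \mid X_2)^2}

This b_1 is the through-the-origin least squares slope of e(Y | X2) on e(X1 | X2), and it equals the coefficient of X1 in the multiple regression of Y on (X1, X2); because both sets of residuals have mean zero, nothing is lost by omitting the intercept.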
There are three possibilities:

(a) A "horizontal band" shows that X1 contains no additional information useful for the prediction of Y beyond that contained in, and provided for by, X2.

(b) A "linear band with a non-zero slope" indicates that "the part of X1 not contained in X2" is linearly related to "the part of Y not explained by X2" alone; that is, X1 should be added to form a two-variable model. Note that if we do a regression through the origin, the slope b_1 should be the same as the coefficient of X1 if it is added to the regression model already containing X2.

AN EXAMPLE

Suppose:

(1) \hat{Y}(X_2) = 50.70 + 15.54 X_2
(2) \hat{X}_1(X_2) = 40.78 + 1.72 X_2
(3) e(Y \mid X_2) = 6.29\, e(X_1 \mid X_2)

Then:

[Y - (50.70 + 15.54 X_2)] = 6.29\,[X_1 - (40.78 + 1.72 X_2)]

and we have the same result as the multiple regression fit:

Y = [50.70 - (6.29)(40.78)] + 6.29 X_1 + [15.54 - (6.29)(1.72)] X_2
Y = -205.81 + 6.29 X_1 + 4.72 X_2

(c) A "curvilinear band" indicates that, as in case (b), X1 should be added to the model already containing X2. However, it further suggests that although the inclusion of X1 is justified, some power terms - or some type of data transformation - are needed.

The fact that an added-variable plot may suggest "the nature of the functional form" in which a predictor variable should be added to the regression model is more important than what it says about the variable's possible inclusion (which can be resolved without graphical help). Added-variable plots play the role that scatter diagrams play in simple linear regression: they tell us whether a data transformation or a certain polynomial model is desirable.

ADDED-VARIABLE PLOT

PROC REG data=Origset;                 /* two simple linear regressions here */
   Model Y X1 = X2 / noprint;
   Output Out=SaveMe R=YResid X1Resid;
Run;
PROC PLOT data=SaveMe;
   Plot YResid*X1Resid;
Run;

Another example: suppose we have a dependent variable Y and five predictor variables X1-X5, and assume that we focus on X5, our primary predictor variable. Then: (1) we regress Y on X1-X4 and obtain residuals (say, YR); (2) we regress X5 on X1-X4 and obtain residuals (say, X5R); and (3) we regress YR on X5R (a simple linear regression with no intercept). The linearity assumption about X5 can be studied from this last regression and its added-variable plot.

"MARGINAL" SIMPLE LINEAR REGRESSION

PROC REG data=Origset;                 /* two multiple linear regressions here */
   Model Y X5 = X1 X2 X3 X4 / noprint;
   Output Out=SaveMe R=YR X5R;
Run;
PROC PLOT data=SaveMe;
   Plot YR*X5R;
Run;
PROC REG data=SaveMe;                  /* new: the marginal simple linear regression */
   Model YR = X5R / noint;
Run;

REMEDIAL MEASURES

If a linear model is found not appropriate for the added value of a certain predictor, there are two choices: (1) add a power term (quadratic), or (2) use some transformation of the data to create a fit for the transformed data. Each has advantages and disadvantages: the first approach (the quadratic model) may yield better insights but may lead to more technical difficulties; transformations (log, reciprocal, etc.) are simpler but may obscure the fundamental real relationship between Y and that predictor. One is hard "to do" and one is hard "to explain".

OUTLYING & INFLUENTIAL CASES

SEMI-STUDENTIZED RESIDUALS

e_i^* = \frac{e_i - \bar{e}}{\sqrt{MSE}} = \frac{e_i}{\sqrt{MSE}}

If \sqrt{MSE} were an estimate of the standard deviation of the residual e_i, we would call e_i^* a studentized (or standardized) residual. However, the standard deviation of the residual is complicated and varies for different residuals, and \sqrt{MSE} is only an approximation. Therefore, e_i^* is called a "semi-studentized residual". But exact standard deviations can be found!
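As a small illustration, here is a minimal SAS sketch for computing the semi-studentized residuals, assuming a hypothetical dataset Origset with response Y and predictors X1 and X2; the root mean squared error \sqrt{MSE} is read from PROC REG's OUTEST= dataset, where it is stored in the automatic variable _RMSE_:

PROC REG data=Origset outest=Est noprint;
   Model Y = X1 X2;
   Output Out=Resids R=ehat;               /* raw residuals e_i */
Run;

data Semi;                                 /* e* = e / sqrt(MSE) for every case */
   if _n_ = 1 then set Est(keep=_RMSE_);   /* carry sqrt(MSE) down all rows */
   set Resids;
   estar = ehat / _RMSE_;
run;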
THE HAT MATRIX

\hat{Y} = Xb = X[(X'X)^{-1}X'Y] = [X(X'X)^{-1}X']Y = HY

H = X(X'X)^{-1}X' is called the "hat matrix"; it can be obtained from the data.

RESIDUALS

Model: Y = X\beta + \varepsilon. Fitted values: \hat{Y} = Xb. Residuals:

e = Y - \hat{Y} = Y - Xb = Y - HY = (I - H)Y

Like the hat matrix H, (I - H) is symmetric and idempotent.

VARIANCE OF RESIDUALS

Since e = (I - H)Y,

\sigma^2(e) = (I - H)\,\sigma^2(Y)\,(I - H)' = (I - H)(\sigma^2 I)(I - H)' = \sigma^2 (I - H)(I - H) = \sigma^2 (I - H),

estimated by MSE(I - H). For the ith observation:

\sigma^2(e_i) = (1 - h_{ii})\sigma^2, \qquad s^2(e_i) = (1 - h_{ii})MSE
\sigma(e_i, e_j) = -h_{ij}\sigma^2, \qquad s(e_i, e_j) = -h_{ij}MSE, \qquad i \neq j

where h_{ii} is the ith element on the main diagonal and h_{ij} is the element in the ith row and jth column of the hat matrix.

STUDENTIZED RESIDUALS

r_i = \frac{e_i}{s(e_i)} = \frac{e_i}{\sqrt{(1 - h_{ii})MSE}}

The studentized residual is a refined version of the semi-studentized residual: in r_i we use the exact standard deviation of the residual e_i, not an approximation. Semi-studentized residuals and studentized residuals are tools for detecting outliers.

DELETED RESIDUALS

The second refinement to make residuals more effective for detecting outlying or extreme observations is to measure the ith residual when the fitted regression is based on all cases except the ith case (similar to the concept of jackknifing). The reason is to avoid the influence of the ith case - especially if it is an outlying observation - on the fitted value. If the ith case is an outlying observation, its exclusion will make the "deleted residual" larger and, therefore, more likely to confirm its outlying status:

d_i = Y_i - \hat{Y}_{i(i)}

STUDENTIZED DELETED RESIDUALS

Combining the two refinements, the studentized residual and the deleted residual, we get the "studentized deleted residual". A studentized residual is also called an internally studentized residual, and a studentized deleted residual an externally studentized residual. With MSE_{(i)} denoting the mean square error when the ith case is omitted in fitting the regression model,

t_i = \frac{d_i}{s(d_i)} = \frac{e_i}{\sqrt{(1 - h_{ii})MSE_{(i)}}}

The studentized deleted residuals can be calculated without having to fit a new regression function each time a different case is omitted; we need to fit the model only once, with all n cases, to obtain MSE - not the same model n times. With p denoting the number of parameters (p = k + 1),

(n - p)MSE = (n - p - 1)MSE_{(i)} + \frac{e_i^2}{1 - h_{ii}},

leading to

t_i = e_i \sqrt{\frac{n - p - 1}{SSE(1 - h_{ii}) - e_i^2}},

which is distributed as t with (n - p - 1) degrees of freedom under the null hypothesis of no outliers.
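All of these quantities are available from a single fit in SAS; a minimal sketch, again assuming the hypothetical dataset Origset with response Y and predictors X1 and X2 (H=, STUDENT=, and RSTUDENT= are standard PROC REG OUTPUT keywords):

PROC REG data=Origset;
   Model Y = X1 X2 / r influence;   /* R prints residual analysis; INFLUENCE prints h_ii and RSTUDENT */
   Output Out=Diag
          H        = hii            /* leverage: diagonal element of the hat matrix  */
          Student  = ri             /* internally studentized residual r_i           */
          Rstudent = ti;            /* externally studentized (deleted) residual t_i */
Run;

Under the null hypothesis of no outliers each t_i follows the t distribution with n - p - 1 degrees of freedom, so a case can be flagged when |t_i| exceeds the corresponding t critical value (often with a Bonferroni adjustment when all n cases are screened).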