Simple Linear Regression 80 60 Rating 40 20
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Lecture 12 Robust Estimation
Lecture 12 Robust Estimation Prof. Dr. Svetlozar Rachev Institute for Statistics and Mathematical Economics University of Karlsruhe Financial Econometrics, Summer Semester 2007 Prof. Dr. Svetlozar Rachev Institute for Statistics and MathematicalLecture Economics 12 Robust University Estimation of Karlsruhe Copyright These lecture-notes cannot be copied and/or distributed without permission. The material is based on the text-book: Financial Econometrics: From Basics to Advanced Modeling Techniques (Wiley-Finance, Frank J. Fabozzi Series) by Svetlozar T. Rachev, Stefan Mittnik, Frank Fabozzi, Sergio M. Focardi,Teo Jaˇsic`. Prof. Dr. Svetlozar Rachev Institute for Statistics and MathematicalLecture Economics 12 Robust University Estimation of Karlsruhe Outline I Robust statistics. I Robust estimators of regressions. I Illustration: robustness of the corporate bond yield spread model. Prof. Dr. Svetlozar Rachev Institute for Statistics and MathematicalLecture Economics 12 Robust University Estimation of Karlsruhe Robust Statistics I Robust statistics addresses the problem of making estimates that are insensitive to small changes in the basic assumptions of the statistical models employed. I The concepts and methods of robust statistics originated in the 1950s. However, the concepts of robust statistics had been used much earlier. I Robust statistics: 1. assesses the changes in estimates due to small changes in the basic assumptions; 2. creates new estimates that are insensitive to small changes in some of the assumptions. I Robust statistics is also useful to separate the contribution of the tails from the contribution of the body of the data. Prof. Dr. Svetlozar Rachev Institute for Statistics and MathematicalLecture Economics 12 Robust University Estimation of Karlsruhe Robust Statistics I Peter Huber observed, that robust, distribution-free, and nonparametrical actually are not closely related properties. -
Lecture 22: Bivariate Normal Distribution Distribution
6.5 Conditional Distributions General Bivariate Normal Let Z1; Z2 ∼ N (0; 1), which we will use to build a general bivariate normal Lecture 22: Bivariate Normal Distribution distribution. 1 1 2 2 f (z1; z2) = exp − (z1 + z2 ) Statistics 104 2π 2 We want to transform these unit normal distributions to have the follow Colin Rundel arbitrary parameters: µX ; µY ; σX ; σY ; ρ April 11, 2012 X = σX Z1 + µX p 2 Y = σY [ρZ1 + 1 − ρ Z2] + µY Statistics 104 (Colin Rundel) Lecture 22 April 11, 2012 1 / 22 6.5 Conditional Distributions 6.5 Conditional Distributions General Bivariate Normal - Marginals General Bivariate Normal - Cov/Corr First, lets examine the marginal distributions of X and Y , Second, we can find Cov(X ; Y ) and ρ(X ; Y ) Cov(X ; Y ) = E [(X − E(X ))(Y − E(Y ))] X = σX Z1 + µX h p i = E (σ Z + µ − µ )(σ [ρZ + 1 − ρ2Z ] + µ − µ ) = σX N (0; 1) + µX X 1 X X Y 1 2 Y Y 2 h p 2 i = N (µX ; σX ) = E (σX Z1)(σY [ρZ1 + 1 − ρ Z2]) h 2 p 2 i = σX σY E ρZ1 + 1 − ρ Z1Z2 p 2 2 Y = σY [ρZ1 + 1 − ρ Z2] + µY = σX σY ρE[Z1 ] p 2 = σX σY ρ = σY [ρN (0; 1) + 1 − ρ N (0; 1)] + µY = σ [N (0; ρ2) + N (0; 1 − ρ2)] + µ Y Y Cov(X ; Y ) ρ(X ; Y ) = = ρ = σY N (0; 1) + µY σX σY 2 = N (µY ; σY ) Statistics 104 (Colin Rundel) Lecture 22 April 11, 2012 2 / 22 Statistics 104 (Colin Rundel) Lecture 22 April 11, 2012 3 / 22 6.5 Conditional Distributions 6.5 Conditional Distributions General Bivariate Normal - RNG Multivariate Change of Variables Consequently, if we want to generate a Bivariate Normal random variable Let X1;:::; Xn have a continuous joint distribution with pdf f defined of S. -
Generalized Linear Models and Generalized Additive Models
00:34 Friday 27th February, 2015 Copyright ©Cosma Rohilla Shalizi; do not distribute without permission updates at http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ Chapter 13 Generalized Linear Models and Generalized Additive Models [[TODO: Merge GLM/GAM and Logistic Regression chap- 13.1 Generalized Linear Models and Iterative Least Squares ters]] [[ATTN: Keep as separate Logistic regression is a particular instance of a broader kind of model, called a gener- chapter, or merge with logis- alized linear model (GLM). You are familiar, of course, from your regression class tic?]] with the idea of transforming the response variable, what we’ve been calling Y , and then predicting the transformed variable from X . This was not what we did in logis- tic regression. Rather, we transformed the conditional expected value, and made that a linear function of X . This seems odd, because it is odd, but it turns out to be useful. Let’s be specific. Our usual focus in regression modeling has been the condi- tional expectation function, r (x)=E[Y X = x]. In plain linear regression, we try | to approximate r (x) by β0 + x β. In logistic regression, r (x)=E[Y X = x] = · | Pr(Y = 1 X = x), and it is a transformation of r (x) which is linear. The usual nota- tion says | ⌘(x)=β0 + x β (13.1) · r (x) ⌘(x)=log (13.2) 1 r (x) − = g(r (x)) (13.3) defining the logistic link function by g(m)=log m/(1 m). The function ⌘(x) is called the linear predictor. − Now, the first impulse for estimating this model would be to apply the transfor- mation g to the response. -
A Generalized Linear Model for Principal Component Analysis of Binary Data
A Generalized Linear Model for Principal Component Analysis of Binary Data Andrew I. Schein Lawrence K. Saul Lyle H. Ungar Department of Computer and Information Science University of Pennsylvania Moore School Building 200 South 33rd Street Philadelphia, PA 19104-6389 {ais,lsaul,ungar}@cis.upenn.edu Abstract they are not generally appropriate for other data types. Recently, Collins et al.[5] derived generalized criteria We investigate a generalized linear model for for dimensionality reduction by appealing to proper- dimensionality reduction of binary data. The ties of distributions in the exponential family. In their model is related to principal component anal- framework, the conventional PCA of real-valued data ysis (PCA) in the same way that logistic re- emerges naturally from assuming a Gaussian distribu- gression is related to linear regression. Thus tion over a set of observations, while generalized ver- we refer to the model as logistic PCA. In this sions of PCA for binary and nonnegative data emerge paper, we derive an alternating least squares respectively by substituting the Bernoulli and Pois- method to estimate the basis vectors and gen- son distributions for the Gaussian. For binary data, eralized linear coefficients of the logistic PCA the generalized model's relationship to PCA is anal- model. The resulting updates have a simple ogous to the relationship between logistic and linear closed form and are guaranteed at each iter- regression[12]. In particular, the model exploits the ation to improve the model's likelihood. We log-odds as the natural parameter of the Bernoulli dis- evaluate the performance of logistic PCA|as tribution and the logistic function as its canonical link. -
Applied Biostatistics Mean and Standard Deviation the Mean the Median Is Not the Only Measure of Central Value for a Distribution
Health Sciences M.Sc. Programme Applied Biostatistics Mean and Standard Deviation The mean The median is not the only measure of central value for a distribution. Another is the arithmetic mean or average, usually referred to simply as the mean. This is found by taking the sum of the observations and dividing by their number. The mean is often denoted by a little bar over the symbol for the variable, e.g. x . The sample mean has much nicer mathematical properties than the median and is thus more useful for the comparison methods described later. The median is a very useful descriptive statistic, but not much used for other purposes. Median, mean and skewness The sum of the 57 FEV1s is 231.51 and hence the mean is 231.51/57 = 4.06. This is very close to the median, 4.1, so the median is within 1% of the mean. This is not so for the triglyceride data. The median triglyceride is 0.46 but the mean is 0.51, which is higher. The median is 10% away from the mean. If the distribution is symmetrical the sample mean and median will be about the same, but in a skew distribution they will not. If the distribution is skew to the right, as for serum triglyceride, the mean will be greater, if it is skew to the left the median will be greater. This is because the values in the tails affect the mean but not the median. Figure 1 shows the positions of the mean and median on the histogram of triglyceride. -
Should We Think of a Different Median Estimator?
Comunicaciones en Estad´ıstica Junio 2014, Vol. 7, No. 1, pp. 11–17 Should we think of a different median estimator? ¿Debemos pensar en un estimator diferente para la mediana? Jorge Iv´an V´eleza Juan Carlos Correab [email protected] [email protected] Resumen La mediana, una de las medidas de tendencia central m´as populares y utilizadas en la pr´actica, es el valor num´erico que separa los datos en dos partes iguales. A pesar de su popularidad y aplicaciones, muchos desconocen la existencia de dife- rentes expresiones para calcular este par´ametro. A continuaci´on se presentan los resultados de un estudio de simulaci´on en el que se comparan el estimador cl´asi- co y el propuesto por Harrell & Davis (1982). Mostramos que, comparado con el estimador de Harrell–Davis, el estimador cl´asico no tiene un buen desempe˜no pa- ra tama˜nos de muestra peque˜nos. Basados en los resultados obtenidos, se sugiere promover la utilizaci´on de un mejor estimador para la mediana. Palabras clave: mediana, cuantiles, estimador Harrell-Davis, simulaci´on estad´ısti- ca. Abstract The median, one of the most popular measures of central tendency widely-used in the statistical practice, is often described as the numerical value separating the higher half of the sample from the lower half. Despite its popularity and applica- tions, many people are not aware of the existence of several formulas to estimate this parameter. We present the results of a simulation study comparing the classic and the Harrell-Davis (Harrell & Davis 1982) estimators of the median for eight continuous statistical distributions. -
Understanding Linear and Logistic Regression Analyses
EDUCATION • ÉDUCATION METHODOLOGY Understanding linear and logistic regression analyses Andrew Worster, MD, MSc;*† Jerome Fan, MD;* Afisi Ismaila, MSc† SEE RELATED ARTICLE PAGE 105 egression analysis, also termed regression modeling, come). For example, a researcher could evaluate the poten- Ris an increasingly common statistical method used to tial for injury severity score (ISS) to predict ED length-of- describe and quantify the relation between a clinical out- stay by first producing a scatter plot of ISS graphed against come of interest and one or more other variables. In this ED length-of-stay to determine whether an apparent linear issue of CJEM, Cummings and Mayes used linear and lo- relation exists, and then by deriving the best fit straight line gistic regression to determine whether the type of trauma for the data set using linear regression carried out by statis- team leader (TTL) impacts emergency department (ED) tical software. The mathematical formula for this relation length-of-stay or survival.1 The purpose of this educa- would be: ED length-of-stay = k(ISS) + c. In this equation, tional primer is to provide an easily understood overview k (the slope of the line) indicates the factor by which of these methods of statistical analysis. We hope that this length-of-stay changes as ISS changes and c (the “con- primer will not only help readers interpret the Cummings stant”) is the value of length-of-stay when ISS equals zero and Mayes study, but also other research that uses similar and crosses the vertical axis.2 In this hypothetical scenario, methodology. -
Generalized and Weighted Least Squares Estimation
LINEAR REGRESSION ANALYSIS MODULE – VII Lecture – 25 Generalized and Weighted Least Squares Estimation Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur 2 The usual linear regression model assumes that all the random error components are identically and independently distributed with constant variance. When this assumption is violated, then ordinary least squares estimator of regression coefficient looses its property of minimum variance in the class of linear and unbiased estimators. The violation of such assumption can arise in anyone of the following situations: 1. The variance of random error components is not constant. 2. The random error components are not independent. 3. The random error components do not have constant variance as well as they are not independent. In such cases, the covariance matrix of random error components does not remain in the form of an identity matrix but can be considered as any positive definite matrix. Under such assumption, the OLSE does not remain efficient as in the case of identity covariance matrix. The generalized or weighted least squares method is used in such situations to estimate the parameters of the model. In this method, the deviation between the observed and expected values of yi is multiplied by a weight ω i where ω i is chosen to be inversely proportional to the variance of yi. n 2 For simple linear regression model, the weighted least squares function is S(,)ββ01=∑ ωii( yx −− β0 β 1i) . ββ The least squares normal equations are obtained by differentiating S (,) ββ 01 with respect to 01 and and equating them to zero as nn n ˆˆ β01∑∑∑ ωβi+= ωiixy ω ii ii=11= i= 1 n nn ˆˆ2 βω01∑iix+= βω ∑∑ii x ω iii xy. -
Linear, Ridge Regression, and Principal Component Analysis
Linear, Ridge Regression, and Principal Component Analysis Linear, Ridge Regression, and Principal Component Analysis Jia Li Department of Statistics The Pennsylvania State University Email: [email protected] http://www.stat.psu.edu/∼jiali Jia Li http://www.stat.psu.edu/∼jiali Linear, Ridge Regression, and Principal Component Analysis Introduction to Regression I Input vector: X = (X1, X2, ..., Xp). I Output Y is real-valued. I Predict Y from X by f (X ) so that the expected loss function E(L(Y , f (X ))) is minimized. I Square loss: L(Y , f (X )) = (Y − f (X ))2 . I The optimal predictor ∗ 2 f (X ) = argminf (X )E(Y − f (X )) = E(Y | X ) . I The function E(Y | X ) is the regression function. Jia Li http://www.stat.psu.edu/∼jiali Linear, Ridge Regression, and Principal Component Analysis Example The number of active physicians in a Standard Metropolitan Statistical Area (SMSA), denoted by Y , is expected to be related to total population (X1, measured in thousands), land area (X2, measured in square miles), and total personal income (X3, measured in millions of dollars). Data are collected for 141 SMSAs, as shown in the following table. i : 1 2 3 ... 139 140 141 X1 9387 7031 7017 ... 233 232 231 X2 1348 4069 3719 ... 1011 813 654 X3 72100 52737 54542 ... 1337 1589 1148 Y 25627 15389 13326 ... 264 371 140 Goal: Predict Y from X1, X2, and X3. Jia Li http://www.stat.psu.edu/∼jiali Linear, Ridge Regression, and Principal Component Analysis Linear Methods I The linear regression model p X f (X ) = β0 + Xj βj . -
Simple Linear Regression with Least Square Estimation: an Overview
Aditya N More et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 7 (6) , 2016, 2394-2396 Simple Linear Regression with Least Square Estimation: An Overview Aditya N More#1, Puneet S Kohli*2, Kshitija H Kulkarni#3 #1-2Information Technology Department,#3 Electronics and Communication Department College of Engineering Pune Shivajinagar, Pune – 411005, Maharashtra, India Abstract— Linear Regression involves modelling a relationship amongst dependent and independent variables in the form of a (2.1) linear equation. Least Square Estimation is a method to determine the constants in a Linear model in the most accurate way without much complexity of solving. Metrics where such as Coefficient of Determination and Mean Square Error is the ith value of the sample data point determine how good the estimation is. Statistical Packages is the ith value of y on the predicted regression such as R and Microsoft Excel have built in tools to perform Least Square Estimation over a given data set. line The above equation can be geometrically depicted by Keywords— Linear Regression, Machine Learning, Least Squares Estimation, R programming figure 2.1. If we draw a square at each point whose length is equal to the absolute difference between the sample data point and the predicted value as shown, each of the square would then represent the residual error in placing the I. INTRODUCTION regression line. The aim of the least square method would Linear Regression involves establishing linear be to place the regression line so as to minimize the sum of relationships between dependent and independent variables. -
Application of General Linear Models (GLM) to Assess Nodule Abundance Based on a Photographic Survey (Case Study from IOM Area, Pacific Ocean)
minerals Article Application of General Linear Models (GLM) to Assess Nodule Abundance Based on a Photographic Survey (Case Study from IOM Area, Pacific Ocean) Monika Wasilewska-Błaszczyk * and Jacek Mucha Department of Geology of Mineral Deposits and Mining Geology, Faculty of Geology, Geophysics and Environmental Protection, AGH University of Science and Technology, 30-059 Cracow, Poland; [email protected] * Correspondence: [email protected] Abstract: The success of the future exploitation of the Pacific polymetallic nodule deposits depends on an accurate estimation of their resources, especially in small batches, scheduled for extraction in the short term. The estimation based only on the results of direct seafloor sampling using box corers is burdened with a large error due to the long sampling interval and high variability of the nodule abundance. Therefore, estimations should take into account the results of bottom photograph analyses performed systematically and in large numbers along the course of a research vessel. For photographs taken at the direct sampling sites, the relationship linking the nodule abundance with the independent variables (the percentage of seafloor nodule coverage, the genetic types of nodules in the context of their fraction distribution, and the degree of sediment coverage of nodules) was determined using the general linear model (GLM). Compared to the estimates obtained with a simple Citation: Wasilewska-Błaszczyk, M.; linear model linking this parameter only with the seafloor nodule coverage, a significant decrease Mucha, J. Application of General in the standard prediction error, from 4.2 to 2.5 kg/m2, was found. The use of the GLM for the Linear Models (GLM) to Assess assessment of nodule abundance in individual sites covered by bottom photographs, outside of Nodule Abundance Based on a direct sampling sites, should contribute to a significant increase in the accuracy of the estimation of Photographic Survey (Case Study nodule resources. -
Propensity Scores up to Head(Propensitymodel) – Modify the Number of Threads! • Inspect the PS Distribution Plot • Inspect the PS Model
Overview of the CohortMethod package Martijn Schuemie CohortMethod is part of the OHDSI Methods Library Cohort Method Self-Controlled Case Series Self-Controlled Cohort IC Temporal Pattern Disc. Case-control New-user cohort studies using Self-Controlled Case Series A self-controlled cohort A self-controlled design, but Case-control studies, large-scale regressions for analysis using fews or many design, where times preceding using temporals patterns matching controlss on age, propensity and outcome predictors, includes splines for exposure is used as control. around other exposures and gender, provider, and visit models age and seasonality. outcomes to correct for time- date. Allows nesting of the Estimation methods varying confounding. study in another cohort. Patient Level Prediction Feature Extraction Build and evaluate predictive Automatically extract large models for users- specified sets of featuress for user- outcomes, using a wide array specified cohorts using data in of machine learning the CDM. Prediction methods algorithms. Empirical Calibration Method Evaluation Use negative control Use real data and established exposure-outcomes pairs to reference sets ass well as profile and calibrate a simulations injected in real particular analysis design. data to evaluate the performance of methods. Method characterization Database Connector Sql Render Cyclops Ohdsi R Tools Connect directly to a wide Generate SQL on the fly for Highly efficient Support tools that didn’t fit range of databases platforms, the various SQLs dialects. implementations of regularized other categories,s including including SQL Server, Oracle, logistic, Poisson and Cox tools for maintaining R and PostgreSQL. regression. libraries. Supporting packages Under construction Technologies CohortMethod uses • DatabaseConnector and SqlRender to interact with the CDM data – SQL Server – Oracle – PostgreSQL – Amazon RedShift – Microsoft APS • ff to work with large data objects • Cyclops for large scale regularized regression Graham study steps 1.