Non-Linear & Logistic Regression

“If the statistics are boring, then you've got the wrong numbers.” — Edward R. Tufte (Statistics Professor, Yale University)

Regression Analyses

When do we use these?

PART 1: find a relationship between a response variable (Y) and a predictor variable (X) (e.g. Y~X)
PART 2: use that relationship to predict Y from X

Simple linear regression: y = b + m*x

y = β0 + β1 * x1

Multiple linear regression: y = β0 + β1*x1 + β2*x2 … + βn*xn

Non-linear regression: when a line just doesn’t fit our data

Logistic regression: when our data is binary (data is represented as 0 or 1)

Non-linear Regression

Curvilinear relationship between response and predictor variables

• The right type of non-linear model is usually conceptually determined based on biological considerations

• For a starting point we can plot the relationship between the 2 variables and “visually check” which model might be a good option

• There are obviously MANY curves you can generate to try and fit your data

Exponential Curve (non-linear regression option #1)

Exponential: y = a + b * c^x

• Rapid increasing/decreasing change in Y or X for a change in the other

Ex: bacteria growth/decay, human population growth, infection rates (humans, trees, etc.)

[Figure: exponential curves, response (y) vs. predictor (x); four panels combine +b or −b with 0 < c < 1 or c > 1, each levelling off at the asymptote y = a.]
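As a rough visual check, here is a minimal R sketch (parameter values are made up purely for illustration) of how the sign of b and the size of c change the exponential curve's shape:

# a = 2 (asymptote), b = 1; only c differs between the two curves
curve(2 + 1 * 1.5^x, from = 0, to = 5, ylim = c(0, 12),
      xlab = "predictor (x)", ylab = "response (y)")        # c > 1: accelerating increase
curve(2 + 1 * 0.5^x, from = 0, to = 5, add = TRUE, lty = 2) # 0 < c < 1: decay toward a
abline(h = 2, col = "grey")                                 # the asymptote y = a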

Logarithmic Curve (non-linear regression option #2)

Logarithmic: y = a + b * x^c

• Rapid increasing/decreasing change in Y or X for a change in the other

Ex: survival thresholds, resource optimization

[Figure: logarithmic curves, response (y) vs. predictor (x); four panels combine +b or −b with +c or −c, each anchored at y = a.]

Hyperbolic Curve (non-linear regression option #3)

Hyperbolic: y = a + b/(x + c)

• Rapid increasing/decreasing change in Y or X for a change in the other
Ex: survival as a function of population

• Similar to the exponential and logarithmic curves, but now we have 2 asymptotes

[Figure: hyperbolic curves, response (y) vs. predictor (x); panels show +b and −b, with a horizontal asymptote at y = a and a vertical asymptote at the x-value labelled c.]

Parabolic Curve (non-linear regression option #4)

Parabolic: y = a + b * (x − c)^2

• Rapid increasing/decreasing change in Y or X for a change in the other, followed by the reverse trend
Ex: survival as a function of an environmental variable

[Figure: upward (+b) and downward (−b) parabolas, response (y) vs. predictor (x), with vertex at (c, a).]

Gaussian Curve (non-linear regression option #5)

Gaussian: y = a * b^((x − c)^2)

• Resembles a normal distribution
Ex: survival as a function of an environmental variable

• Where 0 < b < 1

[Figure: Gaussian curve, response (y) vs. predictor (x); peak height a at x = c, with b controlling the spread.]

Sigmoidal Curve (non-linear regression option #6)

Sigmoidal: y = a / (1 + b^(x − c)) + d

• Stability in Y followed by rapid increase, then stability again
Ex: restricted growth, learning response, a threshold has to occur for a response effect

• Where b > 1 and c > 1

[Figure: sigmoidal curve, response (y) vs. predictor (x); plateaus set by a and d, transition centred near x = c, steepness set by b.]

Michaelis-Menten Curve (non-linear regression option #7)

Michaelis-Menten: y = (a * x) / (b + x)

• Rapid increasing/decreasing change in Y or X for a change in the other
Ex: biological process as a function of resource availability

• Similar to the exponential and logarithmic curves, but now we have 2 parameters – this model comes from kinetics/physiology

[Figure: Michaelis-Menten curve, response (y) vs. predictor (x); the curve saturates toward the asymptote y = a, and b is the predictor value at which the response reaches ½a.]

Non-Linear Regression Procedure:

1. Plot your variables to visualize the relationship
   a. What curve does the pattern resemble?
   b. What might alternative options be?

2. Decide on the curves you want to compare and run a non-linear regression curve fitting
   a. You will have to estimate your parameters from your curve to have starting values for your curve fitting function

3. Once you have parameters for your curves, compare the models with AIC

4. Plot the model with the lowest AIC on your point data to visualize fit

Non-linear regression curve fitting in R:
install.packages("minpack.lm")
library(minpack.lm)
nlsLM(responseY~MODEL, start=list(starting values for model parameters))
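As a concrete sketch using R's built-in Puromycin dataset (the Michaelis-Menten form and the starting values, eyeballed from a plot of the data, are illustrative; your variables and model will differ):

library(minpack.lm)
# Michaelis-Menten: rate = a*conc / (b + conc)
# Starting values: a ≈ the maximum observed rate, b ≈ the conc at half that rate
fit <- nlsLM(rate ~ a * conc / (b + conc), data = Puromycin,
             start = list(a = 200, b = 0.1))
summary(fit)   # parameter estimates, residual sum-of-squares, iterations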

Non-Linear Regression Output from R

The summary output from nlsLM() reports:
• The non-linear model that we fit (here, a simplified logarithmic with slope = 0)
• Estimates of the model parameters
• The residual sum-of-squares for your non-linear model
• The number of iterations needed to estimate the parameters


Akaike’s Information Criterion (AIC)

How do we decide which model is best? In the 1970s, Hirotugu Akaike used information theory to build a numerical equivalent of Occam's razor.

Occam’s razor: All else being equal, the simplest explanation is the best one
• For model selection, this means the simplest model is preferred to a more complex one
• Of course, this needs to be weighed against the ability of the model to actually predict anything

• AIC considers both the fit of the model and the model complexity
• Complexity is measured as the number of parameters or the use of higher-order polynomials
• Allows us to balance over- and under-fitting in our modelled relationships
  – We want a model that is as simple as possible, but no simpler
  – A reasonable amount of explanatory power is traded off against model complexity
  – AIC measures the balance of this for us
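For reference, the criterion itself is AIC = 2k − 2 * ln(L), where k is the number of estimated parameters and L is the maximized likelihood of the model: the −2 * ln(L) term rewards fit, while the 2k term penalizes complexity.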

Akaike’s Information Criterion (AIC) in R

• AIC is useful because it can be calculated for any kind of model, allowing comparisons across different modelling approaches and model fitting techniques

• The model with the lowest AIC value is the model that fits your data best (i.e. minimizes your model residuals)
  – The output from R is a single AIC value

Akaike’s Information Criterion in R to determine the best model:
AIC(nlsLM(responseY~MODEL1, start=list(starting values)))
AIC(nlsLM(responseY~MODEL2, start=list(starting values)))
AIC(nlsLM(responseY~MODEL3, start=list(starting values)))
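Continuing the illustrative Puromycin sketch from above (the curve choices and starting values are examples, not recommendations), comparing a Michaelis-Menten fit against a logarithmic fit and overlaying the better-supported curve:

library(minpack.lm)
fit.mm  <- nlsLM(rate ~ a * conc / (b + conc), data = Puromycin,
                 start = list(a = 200, b = 0.1))        # Michaelis-Menten
fit.log <- nlsLM(rate ~ a + b * conc^c, data = Puromycin,
                 start = list(a = 0, b = 200, c = 0.2)) # logarithmic
AIC(fit.mm)   # the lower AIC = better balance of fit and complexity
AIC(fit.log)
# Overlay the winning curve on the raw points using its parameter estimates
plot(rate ~ conc, data = Puromycin)
xs <- seq(min(Puromycin$conc), max(Puromycin$conc), length.out = 100)
cf <- coef(fit.mm)
lines(xs, cf["a"] * xs / (cf["b"] + xs))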

Non-Linear Regression Curve Fitting

• Use the parameter estimates output by nlsLM() to generate the curve for plotting, as sketched above

Non-Linear Regression Assumptions

• Non-linear regression makes no assumptions for normality, equal variances, or outliers

• However, the assumptions of independence (spatial & temporal) and design considerations (randomization, sufficient replicates, no pseudoreplication) still apply

• We don’t have to worry about statistical power here because we are fitting relationships
  – All we care about is if or how well we can model the relationship between our response and predictor variables

Non-Linear Regression R² for “Goodness of Fit”

• Calculating an R² is NOT APPROPRIATE for non-linear regression

• Why?
  – For linear models, the sums of squared errors always add up in a specific manner: SS_Regression + SS_Error = SS_Total
  – Therefore R² = SS_Regression / SS_Total, which mathematically must produce a value between 0 and 100%

  – But in nonlinear regression, SS_Regression + SS_Error ≠ SS_Total
  – Therefore the ratio used to construct R² is not valid in nonlinear regression

• Best to use the AIC value and the residual sum-of-squares to pick the best model, then plot the curve to visualize the fit

Logistic Regression (a.k.a. logit regression)

Relationship between a binary response variable and predictor variables

Logistic (Logit) Model: y = e^(β0 + β1*x1 + β2*x2 + … + βn*xn) / (1 + e^(β0 + β1*x1 + β2*x2 + … + βn*xn))

• Binary response variable can be considered a class (1 or 0)
  • Yes or No
  • Present or Absent

• The linear part of the logistic regression equation is used to find the probability of being in a category based on the combination of predictors

• Predictor variables are usually (but not necessarily) continuous
  • But it is harder to make inferences from regression outputs that use discrete or categorical variables
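A quick numeric illustration of how the linear part maps to a probability (the coefficients here are made up):

# Made-up coefficients: β0 = -1, β1 = 2, and a predictor value x1 = 1
# Linear part: -1 + 2*1 = 1
exp(1) / (1 + exp(1))   # probability of being in class 1: ~0.73
plogis(1)               # R's built-in logistic function gives the same value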

Binomial Distribution vs Normal Distribution

• Key difference: values are continuous (Normal) vs discrete (Binomial)

• As sample size increases the binomial distribution appears to resemble the normal distribution

• Binomial distribution is a family of distributions because the shape references both the number of observations and the probability of “getting a success” - a value of 1

“What is the probability of x successes in n independent and identically distributed Bernoulli trials?”

• Bernoulli trial (or binomial trial) - a random experiment with exactly two possible outcomes, "success" and "failure", in which the probability of success is the same every time the experiment is conducted
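That question is exactly what R's dbinom() computes; for example:

# P(exactly 3 successes in 10 independent trials, each with success probability 0.5)
dbinom(3, size = 10, prob = 0.5)   # = choose(10,3) * 0.5^3 * 0.5^7 ≈ 0.117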

Logistic Regression vs Linear Regression

• Linear Regression
  - references the Gaussian (normal) distribution
  - uses ordinary least squares to find a best-fitting line that estimates parameters predicting the change in the dependent variable for a change in the independent variable

• Logistic regression
  - references the Binomial distribution
  - estimates the probability (p) of an event occurring (y=1) rather than not occurring (y=0) from a knowledge of relevant independent variables (our data)
  - regression coefficients are estimated using maximum likelihood estimation (an iterative process)

Maximum Likelihood Estimation

How coefficients are estimated for logistic regression

• A complex iterative process to find the coefficient values that maximize the likelihood function

Likelihood function - the probability of the occurrence of an observed set of values X and Y given a function with defined parameters

Process:

1. Begin with a tentative solution for each coefficient
2. Revise it slightly to see if the likelihood function can be improved
3. Repeat this revision until the improvement is minute, at which point the process is said to have converged
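A minimal sketch of this idea on simulated data, using R's general-purpose optimizer (glm() actually converges via iteratively reweighted least squares, but the tentative-solution-then-revise logic is the same):

set.seed(1)
x <- rnorm(100)                               # simulated predictor
y <- rbinom(100, 1, plogis(-0.5 + 1.2 * x))   # simulated binary response
negLL <- function(beta) {                     # negative log-likelihood to minimize
  p <- plogis(beta[1] + beta[2] * x)          # P(y = 1) under candidate coefficients
  -sum(dbinom(y, 1, p, log = TRUE))
}
optim(c(0, 0), negLL)$par                     # starts tentative, revises until converged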


Simple Logistic Regression in R:
glm(response~predictor, family="binomial")
summary(glm(response~predictor, family="binomial"))

Multiple Logistic Regression in R:
glm(response~predictor1+predictor2+…+predictorN, family="binomial")
summary(glm(response~predictor1+predictor2+…+predictorN, family="binomial"))
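A runnable sketch with R's built-in mtcars data, predicting the binary transmission type (am, already coded 0/1) from car weight:

mod <- glm(am ~ wt, data = mtcars, family = "binomial")
summary(mod)   # coefficients, standard errors, p-values, and the AIC described below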

Logistic Regression (a.k.a. logit regression) Output from R

• Estimates of the model parameters (intercept and slope) and the standard errors of those estimates

• AIC value for the model

• p-values test the null hypothesis that a coefficient is equal to zero (no effect)
  – A predictor that has a low p-value is likely to be a meaningful addition to your model because changes in the predictor's value are related to changes in the response variable
  – A large p-value suggests that changes in the predictor are not associated with changes in the response

Logistic Regression (a.k.a. logit regression) Pseudo R² for “Goodness of Fit”

• In linear regression, the relationship between the dependent and the independent variables is linear

• However, this assumption is not made in logistic regression, so we cannot use the calculation R² = SS_Regression / SS_Total
  – REMEMBER: we are not using sums-of-squares to estimate our parameters – we are using maximum likelihood estimation

• We can, however, calculate a pseudo R²
  – There are lots of options for how to do this, but the best for logistic regression appears to be McFadden's calculation

Estimating McFadden’s pseudo R² in R:

R² = 1 − ln L(M_full) / ln L(M_intercept)

mod=glm(response~predictor, family="binomial")
mcF.r2=1-mod$deviance/mod$null.deviance

L = estimated likelihood

NOTE: Pseudo R² values will be MUCH lower than R² values!

Logistic Regression (a.k.a. logit regression) Assumptions

• Logistic regression makes no assumptions for normality, equal variances, or outliers

• However the assumptions of independence (spatial & temporal) and design considerations (randomization, sufficient replicates, no pseudoreplication) still apply

• Logistic regression assumes the response variable is binary (0 & 1)

• We don’t have to worry about statistical power here because we are fitting relationships
  – All we care about is if or how well we can model the relationship between our response and predictor variables

Important to Remember

A non-linear or logistic relationship DOES NOT imply causation!

A good AIC or pseudo R² implies a relationship, not that one or more factors cause changes in another factor's value

Be careful of your interpretations!