PubH 7405: REGRESSION ANALYSIS

INTRODUCTION TO POISSON REGRESSION

PROTOTYPE EXAMPLE #1

We have data for 44 physicians working in emergency medicine at a major hospital system. The concern is the number of complaints each physician received during the previous year. Why does it differ from physician to physician? Could the differences be explained by other factors? The data available consist of the number of patient visits and four covariates: revenue (in dollars per hour), work load at the emergency service (in hours), gender (Female/Male), and residency training in emergency medicine (No/Yes).

Question: Can we do regression using the Normal Error Regression Model?

Possible issue: The response, Number of Complaints, is not on a continuous scale; it is a count and so is not normally distributed.

PROTOTYPE EXAMPLE #2

Skin cancer data for different age groups in two metropolitan areas, Minneapolis-St. Paul and Dallas-Fort Worth:
(1) Is there any age effect? If so, is it the same for the two cities?
(2) Is there any weather effect (a difference between the two cities)? If so, is it the same for all age groups?

Skin Cancer Data

            Minneapolis-St. Paul    Dallas-Ft. Worth
Age Group   Cases  Population       Cases  Population
15-24           1     172,675           4     181,343
25-34          16     123,063          38     146,207
35-44          30      96,216         119     121,374
45-54          71      92,051         221     111,353
55-64         102      72,159         259      83,004
65-74         130      54,722         310      55,932
75-84         133      32,185         226      29,007
85+            40       8,328          65       7,538

These are regression-type questions (about main effects, or marginal contributions, and about possible effect modifications), and they raise the same possible issue: the response or dependent variable, the Number of Skin Cancer Cases, is a count; it is not on a continuous scale and is not normally distributed.

In regression analysis we usually impose the condition that the response variable Y is on a continuous scale, perhaps because of the popularity of the “Normal Error Model”, not because Y is always on a continuous scale.
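As a descriptive first look at the skin cancer table, the sketch below converts each count to a crude rate per 100,000 population. The variable names and the per-100,000 scaling are illustrative choices, not part of the lecture; the only point is that counts from units of different size are easier to compare as rates.

```python
# Skin cancer data from the table: (age group, cases, population).
MSP = [("15-24", 1, 172675), ("25-34", 16, 123063), ("35-44", 30, 96216),
       ("45-54", 71, 92051), ("55-64", 102, 72159), ("65-74", 130, 54722),
       ("75-84", 133, 32185), ("85+", 40, 8328)]
DFW = [("15-24", 4, 181343), ("25-34", 38, 146207), ("35-44", 119, 121374),
       ("45-54", 221, 111353), ("55-64", 259, 83004), ("65-74", 310, 55932),
       ("75-84", 226, 29007), ("85+", 65, 7538)]


def rate_per_100k(cases, population):
    """Crude rate: cases per 100,000 population."""
    return 1e5 * cases / population


# Map each age group to (Minneapolis-St. Paul rate, Dallas-Ft. Worth rate).
rates = {
    age: (rate_per_100k(c_msp, p_msp), rate_per_100k(c_dfw, p_dfw))
    for (age, c_msp, p_msp), (_, c_dfw, p_dfw) in zip(MSP, DFW)
}

for age, (r_msp, r_dfw) in rates.items():
    print(f"{age:>6}  MSP {r_msp:8.1f}  DFW {r_dfw:8.1f}  ratio {r_dfw / r_msp:5.2f}")
```

The rates rise with age in both cities and are higher in Dallas-Ft. Worth in every age group, which is what questions (1) and (2) ask a regression model to quantify.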
In a previous lecture, the dependent variable of interest was represented by a binary or indicator variable Y taking on the values 0 and 1. Its distribution is “Bernoulli”, a special case of the Binomial distribution, and we introduced “Logistic Regression”. We are now introducing a new form of regression, Poisson Regression, where the response or dependent variable Y represents “count” data: non-negative integers.

RARE EVENTS

It can be shown that the limiting form of the binomial distribution, when n is increasingly large (n → ∞) and π is increasingly small (π → 0) while θ = nπ (the mean) remains constant, is:

$$\Pr(X = x) = \frac{\theta^{x} e^{-\theta}}{x!} = p(x; \theta)$$

A random variable having this probability function is said to have the “Poisson distribution” P(θ). For example, with n = 48 and π = .05 (so that θ = nπ = 2.4):

b(x = 5; n, π) = .059
p(x = 5; θ) = .060

The Poisson model is often used when the random variable X is supposed to represent the number of occurrences of some random event in a unit interval of time or space, or in some volume of matter; numerous applications in the health sciences have been documented. Examples include the number of viruses in a solution, the number of defective teeth per individual, the number of focal lesions in virology, the number of victims of specific diseases, the number of cancer deaths per household, and the number of infant deaths in a certain locality during a given year, among others.

The mean and variance of the Poisson distribution are:

$$\mu = \theta, \qquad \sigma^{2} = \theta$$

(a very special and “strong” characteristic: the variance is equal to the mean).

REGRESSION NEEDS

As noted, the Poisson model is used for the number of occurrences of some random event in an interval of time or space, or in some volume of matter. In some of these applications, one may be interested in seeing whether the Poisson-distributed dependent variable Y can be predicted from, or explained by, other variables.
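The numerical comparison in the rare-events example (n = 48, π = .05, x = 5) can be checked directly. The sketch below uses only the Python standard library; the function names are illustrative.

```python
from math import comb, exp, factorial


def binomial_pmf(x, n, pi):
    """b(x; n, pi): probability of x successes in n Bernoulli trials."""
    return comb(n, x) * pi**x * (1 - pi) ** (n - x)


def poisson_pmf(x, theta):
    """p(x; theta) = theta^x * exp(-theta) / x!"""
    return theta**x * exp(-theta) / factorial(x)


n, pi = 48, 0.05
theta = n * pi  # Poisson mean matched to the binomial mean

b5 = binomial_pmf(5, n, pi)
p5 = poisson_pmf(5, theta)
print(f"b(5; n, pi)  = {b5:.3f}")
print(f"p(5; theta)  = {p5:.3f}")
```

The two values agree to within about .001, matching the .059 and .060 quoted in the example.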
The other variables are called predictors, or explanatory or independent variables. For example, we may be interested in the number of defective teeth Y per individual as a function of the gender and age of a child, the brand of toothpaste, and whether or not the family has dental insurance. Refer also to Example 1 (number of complaints) and Example 2 (number of skin cancer cases).

POISSON REGRESSION MODEL

When the dependent variable Y is assumed to follow a Poisson distribution with mean θ, the Poisson regression model expresses this mean as a function of certain independent variables $X_1, X_2, \ldots, X_k$, in addition to the size of the observation unit from which the count of interest was obtained. For example, if Y is the number of viruses in a solution, then the size is the volume of the solution; if Y is the number of defective teeth, then the size is the total number of teeth for that same individual.

In our framework, the dependent variable Y is assumed to follow a Poisson distribution; its values $y_i$ are available from n “observation units”, each of which is also characterized by an independent variable X. For observation unit i (1 ≤ i ≤ n), let $s_i$ be the size and $x_i$ be the covariate value. The Poisson regression model assumes that the relationship between the mean of Y and the covariate X is described by:

$$E(Y_i) = s_i \lambda(x_i) = s_i \exp(\beta_0 + \beta_1 x_i)$$

where $\lambda(x_i)$ is called the “risk” for observation unit i (1 ≤ i ≤ n).

The basic rationale for using the term “risk” is the approximation of the Binomial distribution by the Poisson distribution. Recall that when n goes to infinity and π tends to 0 while θ = nπ remains constant, the binomial distribution B(n, π) can be approximated by the Poisson distribution P(θ). The number n is the size of the observation unit, so the ratio between the mean and the size represents π (or λ(x) in the new model); that is the “probability” or “risk” (and the ratio of two risks is called the “risk ratio” or “relative risk”).
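To make the single-covariate model concrete, here is a minimal sketch that fits E(Y_i) = s_i exp(β0 + β1 x_i) to the skin cancer data of Example 2 by Newton-Raphson on the Poisson log-likelihood. Several things here are assumptions made for illustration and are not the lecture's analysis: city is used as the only covariate (x = 0 for Minneapolis-St. Paul, 1 for Dallas-Ft. Worth), age is ignored, population is taken as the size s_i, and the function name `fit_poisson` is hypothetical.

```python
from math import exp, log

# (cases y, population s, city x); x = 0 Minneapolis-St. Paul, 1 Dallas-Ft. Worth.
DATA = [
    (1, 172675, 0), (16, 123063, 0), (30, 96216, 0), (71, 92051, 0),
    (102, 72159, 0), (130, 54722, 0), (133, 32185, 0), (40, 8328, 0),
    (4, 181343, 1), (38, 146207, 1), (119, 121374, 1), (221, 111353, 1),
    (259, 83004, 1), (310, 55932, 1), (226, 29007, 1), (65, 7538, 1),
]


def fit_poisson(data, iterations=25):
    """Newton-Raphson for the model E(Y_i) = s_i * exp(b0 + b1 * x_i)."""
    total_y = sum(y for y, _, _ in data)
    total_s = sum(s for _, s, _ in data)
    b0, b1 = log(total_y / total_s), 0.0  # start from the overall crude rate
    for _ in range(iterations):
        # Score vector (u0, u1) and information matrix [[i00, i01], [i01, i11]].
        u0 = u1 = i00 = i01 = i11 = 0.0
        for y, s, x in data:
            mu = s * exp(b0 + b1 * x)  # fitted mean for this unit
            u0 += y - mu
            u1 += x * (y - mu)
            i00 += mu
            i01 += x * mu
            i11 += x * x * mu
        det = i00 * i11 - i01 * i01
        # Solve the 2x2 system: step = information^{-1} * score.
        b0 += (i11 * u0 - i01 * u1) / det
        b1 += (i00 * u1 - i01 * u0) / det
    return b0, b1


b0, b1 = fit_poisson(DATA)
print(f"beta0 = {b0:.4f}, beta1 = {b1:.4f}, risk ratio = exp(beta1) = {exp(b1):.3f}")
```

With a single binary covariate the fitted risk ratio exp(β1) coincides with the ratio of the two pooled crude rates (about 2.1 here), which gives a simple check on the fit. In practice one would use a GLM routine with a log link and log(size) as an offset rather than hand-coded iterations.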
Model with Several Covariates

Suppose we want to consider k covariates, $X_1, X_2, \ldots, X_k$, simultaneously. The simple Poisson regression model of the previous section can be easily generalized and expressed as:

$$E(Y_i) = s_i \lambda(x_{1i}, x_{2i}, \ldots, x_{ki}) = s_i \exp(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki}) = s_i \exp\Big(\beta_0 + \sum_{j=1}^{k} \beta_j x_{ji}\Big)$$

where $\lambda(x_{1i}, \ldots, x_{ki})$ is called the “risk” for observation unit i (1 ≤ i ≤ n), and $x_{ji}$ is the value of the covariate $X_j$ measured on unit i.

EXAMPLE

The purpose of this study was to examine the data for 44 physicians working in the emergency service of a major hospital, so as to determine which of four variables are related to the number of complaints received during the previous year. In addition to the number of complaints, which serves as the dependent variable, the data available consist of the number of visits (which serves as the size of the observation unit, the physician) and four covariates. Table 6.2 presents the complete data set. For each of the 44 physicians there are two continuous independent variables, revenue (dollars per hour) and work load at the emergency service (hours), and two binary variables, gender (Female/Male) and residency training in emergency services (No/Yes).

No. of Visits  Complaints  Residency  Gender  Revenue    Hours
         2014           2          Y       F   263.02  1287.25
         3091           3          N       M   334.94  1588.00
          879           1          Y       M   206.42   705.25
         1780           1          N       M   236.32  1005.50
         3646          11          N       M   288.91  1667.25
         2690           1          N       M   275.94  1517.75
         1864           2          Y       M   295.71   967.00
         2782           6          N       M   224.91  1609.25
         3071           9          N       F   249.32  1747.75
         1502           3          Y       M   269.00   906.25
         2438           2          N       F   225.61  1787.75
         2278           2          N       M   212.43  1480.50
         2458           5          N       M   211.05  1733.50
         2269           2          N       F   213.23  1847.25
         2431           7          N       M   257.30  1433.00
         3010           2          Y       M   326.49  1520.00
         2234           5          Y       M   290.53  1404.75
         2906           4          N       M   268.73  1608.50
         2043           2          Y       M   231.61  1220.00
         3022           7          N       M   241.04  1917.25
         2123           5          N       F   238.65  1506.25
         1029           1          Y       F   287.76   589.00
         3003           3          Y       F   280.52  1552.75
         2178           2          N       M   237.31  1518.00
         2504           1          Y       F   218.70  1793.75
         2211           1          N       F   250.01  1548.00
         2338           6          Y       M   251.54  1446.00
         3060           2          Y       M   270.52  1858.25
         2302           1          N       M   247.31  1486.25
         1486           1          Y       F   277.78   933.95
         1863           1          Y       M   259.68  1168.25
         1661           0          N       M   260.92   877.25
         2008           2          N       M   240.22  1387.25
         2138           2          N       M   217.49  1312.00
         2556           5          N       M   250.31  1551.50
         1451           3          Y       F   229.43   973.75
         3328           3          Y       M   313.48  1638.25
         2928           8          N       M   293.47  1668.25
         2701           8          N       M   275.40  1652.75
         2046           1          Y       M   289.56  1029.75
         2548           2          Y       M   305.67  1127.00
         2592           1          N       M   252.35  1547.25
         2741           1          Y       F   276.86  1499.25
         3763          10          Y       M   308.84  1747.50

The interpretation, or meaning, of the “Regression Coefficients” is similar to the case of the “Normal Error Regression Model” and can be seen as follows.
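As a sketch of one standard reading of the coefficients (not necessarily the derivation the lecture goes on to give), take the single-covariate model E(Y_i) = s_i exp(β0 + β1 x_i). On the log scale the model is linear in the covariate, and the size cancels when two risks are compared, so exp(β1) is a risk ratio:

```latex
% Log form of the model: the size s_i enters as a fixed additive term.
\log E(Y_i) = \log s_i + \beta_0 + \beta_1 x_i
\quad\Longrightarrow\quad
\log \lambda(x) = \beta_0 + \beta_1 x

% Binary covariate (x = 1 versus x = 0):
\frac{\lambda(1)}{\lambda(0)}
  = \frac{\exp(\beta_0 + \beta_1)}{\exp(\beta_0)}
  = e^{\beta_1}

% Continuous covariate (one-unit increase in x):
\frac{\lambda(x + 1)}{\lambda(x)}
  = \frac{\exp(\beta_0 + \beta_1 (x + 1))}{\exp(\beta_0 + \beta_1 x)}
  = e^{\beta_1}
```

Under this reading, β1 plays the role that a slope plays in the Normal Error Regression Model, except that it acts additively on the log risk and therefore multiplicatively on the risk itself. With several covariates, the same argument applies to each β_j with the other covariates held fixed.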
