<<

Thamerus: Modelling Count with Heteroscedastic Measurement Error in the Covariates

Sonderforschungsbereich 386, Paper 58 (1997)

Online unter: http://epub.ub.uni-muenchen.de/

Projektpartner

Mo delling with Heteroscedastic

Measurement Error in the Covariates

Markus Thamerus SFB

Institut fur Statistik Universitat Munc hen

Abstract

This pap er is concerned with the estimation of the regression co ecients for

a count data mo del when one of the explanatory variables is sub ject to hete

roscedastic measurement error The observed values W are related to the true

regressor X by the additive error mo del WXU The errors U are assumed to

b e normally distributed with zero but heteroscedastic which

are known or can b e estimated from rep eated measurements Inference is done

by using quasi likeliho o d metho ds where a mo del of the observed data is sp e

cied only through a mean and a function for the resp onse Y given W

and other correctly observed covariates Although this approachweakens the

assumption of a parametric regression mo del there is still the need to determine

the marginal distribution of the unobserved variable X which is treated as a

Provided appropriate functions for the mean and variance are

stated the regression can b e estimated consistentlyWe illustrate

our metho ds through an analysis of lung cancer rates in Switzerland One of

the covariates the regional radon averages cannot b e measured exactly due

to the strong dep endency of radon on geological conditions and various other

environmental sources of inuence The distribution of the unobserved true

radon measure is mo delled as a nite mixture of normals

Keywords measurement error quasi likelihood Poisson regression radon data

Intro duction

When ordinary regression techniques are applied to a mo del where one or several

predictors are sub ject to measurement error the regression estimates are

asymptotically biased For nonlinear mo dels the monograph of Carroll Rupp ert and

Stefanski gives a fundamental intro duction into the dierent metho ds to adjust

for this eect In this article we will fo cus on estimation and inference of a Poisson

regression mo del with heteroscedastic measurement error in one of the predictors

Let the true mo del relate the resp onse Y given in counts to the predictors X Z

where X denotes a continuous covariate that cannot b e measured directly and is only

observed through a proxy W andZ is a set of covariates measured without error

Throughout this pap er we will fo cus on a structural mo del for the unobserved pre

dictor X which that X is treated as a random variable and its distribution

is parametrically mo deled Furthermore we make the assumption of nondierential

measurement error which means that the conditional distribution of Y given X and

Z is indep endentof W f f The observed predictor W is then called

Y jZX W Y jZX

a surrogate This includes the frequently used additive measurement error mo del

W X U where the measurement error U O is indep endentofY X

u

Quasilikeliho o d metho ds for regression mo dels with covariate measurement error re

quire information on the p osterior distribution of the true predictor X given the

observed covariates WZ If validation data for X are at hand and an assumption

for the error distribution of U is made one can pro ceed to estimate the distribution

of X j WZThisisvery often not the case and one has to make a strong assumption

on the distribution of X and use the observations of W to estimate it Therefore some

kno wledge ab out the error pro cess U that generated the observations W is needed

In contrast to most applications which assume the error variances to b e constant we

i n allow for heteroscedastic measurement errors that is Var U

i

i

Our work was mainly motivated by a data set from a Swiss study Minder and Volkle

where registered mortal lung cancer cases Y were related to regional ave

rage radon measurements W and other predictors Z The observed mean values

W for regional radon exp osure have to b e regarded as proxy variables for an existing

true mean X of each region Since the numb er of individual radon measurements

that were used to compute the average W for each region ranged from to the

errors cannot b e assumed to b e homoscedastic

The aim of this pap er is to showhow a quasilikeliho o d approach can b e used for a

count data mo del when one of the explanatory variables is sub ject to heteroscedastic

measurement error The assumption of a nite mixture of normal distributions as the

marginal distribution for the latentvariable X is very exible and it is shown that

the derivation of a regression mo del in the observable variables remains tractable

In the following section wewillintro duce the quasilikeliho o d mo del for a Poisson re

ariable gression and derive appropriate mean and variance functions when the latentv

X follows a normal mixture distribution In section three we will apply this approach

to the Swiss data The impact of measurement error on the estimation results and

other related asp ects will b e discussed in the last section

The Quasilikeliho o d Approach

The use of quasilikeliho o d techniques for regression mo dels with covariate measu

rement error has b een widely discussed in the literature One of the rst general

approaches has b een describ ed by Armstrong Asymptotic results and a very

detailed discussion of quasilikeliho o d metho ds for dierentobserved data structures

can b e found in Carroll and Stefanski

For i n let Y and Z b e the resp onse and a vector of covariates measured

i i

without error X denotes the unobservable regressor variable and W the measured

i i

surrogate We assume an additive heteroscedastic error mo del

fori n W X U with U N

i i i i

i

where the U s are mutually indep endent U and Y X are indep endentandthe

i i i i

error variances are known or can b e estimated from indep endent replications of

i

W The quasilikeliho o d approach only requires the sp ecication of a mean and

i

for the regression mo del which will b e denoted by f and f The

m v

rst step to obtain a quasilikeliho o d mo del in the observable variables is to set up

the unobservable mean and variance function as it is implied by the distribution of

Y given Z and X We will write those rst two conditional moments as

i i i

EY j Z X Z X for the mean and

i i i i i

VY j Z X Z X for the variance

i i i i i

function where is the vector of the regression parameters In a more general formu

lation the variance function dep ends on additional variance parameters orand is

expressed as a function of but as we will concentrate on a Poisson regression

merely there is no need for a more general notation To pro ceed to the mean

and variance functions for the observed data f Z W EY j Z W and

m i i i i

f Z W VY j Z W one iterates exp ectations and uses the nondierential

v i i i i i

error prop erty A quasilikeliho o d mo del in the observable variables can therefore b e

stated as

f Z W EEY j Z X W j Z W

m i i i i i i i i

E Z X j Z W and

i i i i

f Z W VE Y j Z X W j Z W EVY j Z X W j Z W

v i i i i i i i i i i i i i i

V Z X j Z W E Z X j Z W

i i i i i i i i

is given by the quasi score An unbiased estimating equation for

X 

Z

function

n n

X X

Y f Z W f Z W

i m i i m i i

n

s s

i

f Z W

v i i

i i

and the consistent quasilikeliho o d estimator is found as the ro ot of the equation

ql

s Its asymptotic normality is also established via the theory of unbiased

and it holds that

a

  

N n F V F

ql

The parts of the asymptotic matrix of are given by

ql

n

n

X

s s

i

F lim E lim E and

n n

n n

i

n

X

n

cov s lim E s s V lim

i i

n n

n n

i

It is estimated by

  

codv n F V F with

ql ql ql ql

n n

X X

s

i

F s s and V

ql i i ql



 n n

ql 



ql

i i

Both mean and variance function make use of the conditional distribution

of X given Z and W If this distribution can b e sp ecied parametricallyitisin

principle p ossible to calculate them directly In our case we will state f and f

m v

under the assumption that the heteroscedastic error variances are given and that

i

the parameters of the marginal distribution of X can b e estimated

Mo del for the Poisson Regression

For i n let Y Po with exp Z X The underlying

i i i  Z X i

i

unobservable regression mo del is given through

Z X Z X exp Z X

i i i i  Z X i

i

and by using the formulas as given in and it is easily seen that the mean and

variance functions of the observable mo del is of the form

EY j Z W exp Z E exp X j Z W and

i i i  Z X i i i

i

VY j Z W exp Z E exp X j Z W

i i i  Z X i i i

i

exp Z E exp X j Z W

 Z X i i i

i

E exp X j Z W exp Z

X i i i  Z

i

Now exp ectations of the form E exp cX j Z W have to b e computed To derive

i i i

closed form expressions for f and f we will pro ceed in the following way Under

m v

the assumption of a structural mo del we state a parametric distribution for the latent

variables X From these we nd the conditional distribution of X given W and as

i i i

only normal distributions are involved it is then p ossible to compute the conditional

exp ectation E exp cX j Z W

i i i

We will mo del the distribution of the iid variables X i n parametrically

i

as a mixture of normal distributions and write its densityas

m

X

f x x j

X i k i k

i

k

k 

where j denotes the normal density function with parameters and

k k

k k

Finite mixture distributions provide a exible class of distributions and often repre

sent a more realistic choice in practice as they do not demand that the observed

variables come from one homogeneous p opulation As we will see later in the exam

ple the assumption of a mixture distribution was indicated by the observed data

Additionally it is assumed that the latentvariables X are indep endent of the other

i

covariates Z The random variable of the kth comp onent of the mixture distribution

i

of X will b e denoted by X with density x j Since wehave an additive

i ki i k

k

error mo del W X U U N it is easily seen that the distribution of

i i i i

i

W is a mixture of normal distributions as well Indeed we nd that on each comp o

i

nentvariance of that mixture distribution an heteroscedastic variance part induced

by the measurement error is added Therefore the density of the kth comp onent

W of W is given by w j In order to nd the conditional distribu

ki i i k

i

k

tion of X j W we simplify the notation and write the densities of X W and U

i i i i i

P P

m m

as f x f x f w f w and f u resp ectively

X i k X i W i k W i U i

i i i

ki ki

k  k 

By applying a linear transformation to the joint densityofX and W wend

i i

P

m

f x w f x f w x f x f w x

X W i i X i U i i k X i U i i

i i i i i

ki

k 

P

f x

i

X jW

m

i i

f w f w f w

W i W i k W i

i i

ki

k 

m m

X X

f x f w x f w

X i U i i k W i

i

ki ki

P

f x

ki C i

m

ki

f w f w

j W i W i

ji

ki

j 

k  k 

As can b e seen from the conditional distribution of X given W again is a

i i

mixture of normally distributed random variables C k m with its asso ciated

ki

densities found by conditioning X on W for each k The prop ortions given by

ki ki ki

w j

k i k

i

k

P

ki

m

w j

j i j

j i

j 

are the p osteriori probabilities that the unobserved variable X b elongs to comp o

i

withits nent k when W was observed Furthermore it holds that C N

i ki ki

ki

parameters dened as

k

W and

ki k i k

i

k

k

ki k

i

k

Kuchenho and Carroll used a similar argumentation for a homoscedastic

measurement error mo del and a marginal mixture distribution with two comp onents

Now since X given W is a mixture distribution we can rewrite the conditional

i i

exp ectations required for the denition of the mean and variance function of the

quasilikeliho o d mo del and it holds that

m

X

EexpcX j W Z E exp cX j W Z

ki ki ki i i i i

k 

The prop erties of the generating function for normal distributions enables

us to express these exp ectations as

E exp cX j W Z EexpcC exp c c

ki ki i ki ki

ki

With this result and plugged into and the derived mo del in the obser

vable variables is given by

m

X

f Z W exp Z exp

Z m i i  ki X ki

ki i X

k 

f Z W f Z W f Z W

v i i m i i m i i

m

X

exp Z exp

 Z ki X ki

i X ki

k 

This mo del is clearly dierent from the unobservable Poisson regression mo del

as stated in Estimation is carried out by the usual iteratively reweigh ted

least square algorithm for mean and variance mo dels and requires to dierentiate

f Z W with resp ect to For details on tting metho ds for such mo dels see

m i i

Carroll and Rupp ert

Lung Cancer Data

In a recent study Minder and Volkle the ob jectivewas to nd out if there

exists a p ositive asso ciation b etween regional average radon measurements and regi

stered mortal lung cancer cases The study was carried out in Switzerland which

was divided into dierent regions In each region the numb ers of registered lung

cancer cases were given for each of sixteen age groups Regional average values of ra

don were obtained by rep eated indo or measurements from dierent sites across each

region Besides lo cation the sites diered from each other by the typ e of building

and the chosen o or level As the latentcovariate X we dene the true average

i

radon concentration for region i For each of the regions a mean value W was

i

obtained through n single observations W r n The variances S

i ir i

i

from these rep eated measurementswere given as well The concentration of the radon

gas strongly dep ends on lo cal geological and atmospherical conditions Furthermore

the physical prop erty of radon to decomp ose into other substances makes it dicult

to obtain exact values The lo cation of the measuring devices and the instruments

themselves are thus p ossible sources of measurementerrorWe will state the following

additive mo del for the measurement error pro cess each observation W is a proxy

ir

variable for the true regional average X and therefore we dene for all i

i

W X with E for r n andVar

ir i ir ir i ir

i

So we do not assume a particular distribution for the errors weonly

ir

require that they have exp ectation zero and equal variances For the observed values

P P

n n

 

i i

the W X U with U n W n

ir ir i i i i

i i

r  r 

p ermits us to assume a normal error distribution and we write

W X U with U N fori

i i i i

i

n are dierent for each region even Its easily seen that the error variances

i

i

i

when for i since they dep end on the numb er of measurements

i

n as well This number n varies regionally from to observations and for the

i i

estimated error variances S n wend The estimates

i

i i i i

will b e treated as the variances whichwe formerly assumed to b e known Figure

i

shows a scatterplot of the regional radon averages versus the estimated standard de

viations of their error distributions Marked by triangles and squares are averages

i

computed from less resp ectively equal or more than one hundred single measure

ments The plot clearly shows the heteroscedastic pattern of the error variances and

although it is obvious that will tend to zero if n increases this data showenough

i

i

variability within each region to pro duce nonignorable measurement error In the ori

ginal study a number of Poisson regression mo dels for dierent subgroups of the Swiss

p opulation were estimated We will restrict our analysis on that mo del that includes

all Swiss women only The resp onse variable is the numb er of registered mortal lung

cancer cases in region i and age interval j and will b e denoted by Y As describ ed

ij

ab ove the predictor of main interest the regional average radon concentration X

i

could only b e observed through the surrogate W with known error variances

i

i

Figure Scatterplot of the regional radon averages against their error standard

deviations The triangles and squares indicate if n or if n

i i

The correctly observed covariables are the p opulation under risk N and age A

ij j

measured as the transformed midp ointofthe j th age interval which equals zero for

women b etween and years equals one for age b etween and years aso

Additionally an indicator variable C of the regional structure for urban for

i

rural is given The observed data structure for the regression mo del is summarized

in Table The unobservable loglinear oset regression mo del relates the prop ortions

Y N to the covariates Z A A C and X so that the of the

ij ij ij j i i

j

rst conditional momentof Y given N Z and X can b e written as

ij ij ij i

A A C X ln N Z X ln N

 A j A C i X i ij i i ij

j and holds

jth age group j A

j

ith region i Y registered lung cancer cases

ij

W with C N p opulation under risk

i i ij

i

Table Data structure of Swiss Study observed variables

Figure shows the regional radon averages plotted against ln Y N The regions

ij ij

considered as urban are marked by triangles The plot itself gives no clear hint for

the presence of an eect of radon on the o ccurrence of lung cancer Markedly visible

is the characteristic of the radon averages to app ear in three distinct clusters The



main part of the data clusters around Bqm the second group scatters around



Bqm and on the right hand side of the plot are four regions with averages ab ove



Bqm

Figure Scatterplot of ln Y N against the observed radon averages W

ij ij i

Regions considered as urbanrural C are marked by trianglesstars

i

As the marginal distribution of the true radon averages X we assume a normal

i

mixture distribution with three comp onents Maximum likeliho o d estimates of the

parameters were obtained by applying an EM algorithm to the observed radon means

W The results are shown in Table for more details see Thamerus

i

Comp onent Comp onent Comp onent

prop ortions

k

means

k

stand dev

k

Table Estimation results for a three comp onent normal mixture distribu

tion of the true radon means Standard errors are given in brackets

A naive estimator for was originally obtained by replacing

 A A C X

X with the observed averages W Itiswell known that this metho d yields incon

i i

sistent estimates The estimated regression co ecients of the quasilikeliho o d mo del

are found by applying an IRLS algorithm to the mo del given through the mean and

variance functions and These estimates are presented in Table together

with those of the naive approach

naive mo del quasilikeliho o d mo del

variable se p se p

age

age

urban

radon

Table Estimation results for the regression mo del of the Swiss lung cancer

data Given are the estimated regression co ecients their standard errors and

asso ciated p values

For the naive pro cedure we found that the null hyp othesis for the presence of a

radon eect could not b e rejected on a signicance level Note that the statisti

cal inference is dierent for the quasilikeliho o d mo del that considers the individual

measurement errors Relative to their standard errors b oth mo dels pro duce similar

results for the correctly observed covariates age age squared and the urbanization

indicator The estimated radon eect of the quasilikeliho o d mo del however is greater

than the one obtained from the naive mo del and its accompanying p value conrms a

signicant eect for the radon variable at the level This dierence in the p values

of the two mo dels is explained by the almost identical values of their standard errors

As a result wemay state that for this particular mo del the naive estimation me

tho d nds a nonsignicant radon eect and that in comparison the quasilikelihood

approach leads to a dierent result

Discussion

Most epidemiologists will conrm that age and smoking status have the strongest

eects on the o ccurrence of lung cancer and that in this data set the absence of an

appropriate smoking variable pro duces misleading results This issue is also discussed

in the original pap er of Minder and Volkle They compared their estimation

results of separate mo dels for distinct age groups under the alternative assumptions

whether the overall smoking b ehavior of the p opulation remained constantorwas

dynamic Since there is no information that smoking will b e a factor for

radon wecannotcontribute anything new to this discussion

We will rather concentrate on two other topics The rst one is ab out the asymptotic

of In our mo del the parameters of the distribution of X for

ql i

simplicity denoted by are treated as known The sandwich estimator that was

used to estimate the covariance of do es not consider the estimation of According

ql

to Liang and Liu an estimator of similar form as for the covariance can b e

constructed if V is replaced by a term that contains one part for the estimation

ql

of and an additional part for the estimation of It remains op en whether the

estimated standard errors for would increase signicantly if the additional part

ql

was used

Avery common metho d to describ e the degree of attenuation of the estimated regres

sion co ecients in the presence of measurement error is the denition of a ratio that re

lates the error variance to the variance of the latentvariable X If the error variance is

homoscedastic the so called noisetosignal ratio of X that is Var U Var X

i i

is often used see eg Fuller As our error mo del W X U U N

i i i i

i

is heteroscedastic we will use this idea to dene a mean noisetosignal ratio of X

P

n



as n Var X and estimate it by replacing Var X with an

m i i

i X

i

estimate and make use of the known error variances

i

The latentvariables X i n were assumed to b e iid variables of a mixture

i

of normal distributions so we write X M ixN V

i  m  m

 m

and suggest twoways of estimating The rst metho d uses the estimated para

X

meters of the mixture distribution from the EM algorithm Let H b e a classication

i

variable that denes to which comp onent of the mixture X b elongs Then the va

i

riance of X can b e found by

i

m m

X X

Var X EVar X j H Var E X j H

i i i i i k k k

k

k  k 

P

m

where EX The ML estimate is simply obtained by replacing

i k k

X

k 

the distribution parameters with its estimates The metho d of moments uses the

sample variance S of the observed variables W and an estimator for is found by

i

W X

n

X

n

S s

i W X

n

i

It is easily seen that s is unbiased This estimator is of great practical use since it

X

can b e computed without any knowledge of the distribution of the latentvariable X

Most variation in X is caused by the four radon means that constitute the third

comp onent of the mixture distribution To get an idea of the measurement error

eect on the estimation results we p erformed an exp eriment and removed the four



regions with radon averages ab ove Bqm from the data and tted a normal

mixture distribution with two comp onents to the remaining averages Table gives

the estimated variances and mean signaltonoise ratios for the original Swiss radon

data three comp onents and the reduced data two comp onents

three comp onents n two comp onents n

variance ratio variance ratio

X

s

X

Table Estimated variances and mean noisetosignal ratios of X for the

three comp onents mixture mo del full data and the two comp onents mixture

mo dell four data p oints omitted

The ratios for the full data mo del are rather small a fact which is mainly caused by

the dierent lo cations of the three comp onents of the mixture Therefore the impact

of measurement error on the estimated radon eect is small and the naive estimator

is only little biased That the error variances inuence the estimation results can b e

seen from the mo del of the observed data given in and Both functions

dep end nonlinearily on the ratios and the lo cation parameters through the

k

i

k

conditional moments and

ki ki

The mean noisetosignal ratios for the reduced data two comp onents are appro

ximately ve times bigger than those for the original data and the biasing eect of

measurement error on the naive estimates should b e seen more clearly Indeed we

computed the regression co ecients for the naive and the quasilikeliho o d regression

for those data and got estimated radon eects of for the

X naiv e

naive and for the quasilikeliho o d mo del with standard

Xql

errors given in brackets Relative to their standard errors those two estimates dier

from each other by a factor around two Not surprisingly this example also reveals

as found by the full data mo del disapp ears that the p ositive eect of radon as it w once the four highest radon exp osed regions are not considered

Quasilikeliho o d mo dels are useful to ols to analyze regression mo dels when some of the

covariates are sub ject to measurementerror Collecting rep eated measurements of

the erroneous regressor variable provides additional information on the measurement

error pro cess and is recommended to the researchers If the marginal distribution of

the latentvariable is normal or a mixture of normal distributions even a heterosce

dastic error structure can b e emb edded into a quasilikeliho o d mo del for count data

Esp ecially weak eects like the discussed eect of radon exp osure on lung cancer can

b e detected by a mo del that considers the individual measurementerror

Acknowledgement

Iwant to thank WA Fuller H Schneeweiss and H Kuc henho for their useful

comments and Ch E Minder for providing the problem

References

Armstrong B Measurement Error in the Generalized Linear Mo del Com

munications in Statistics Series B

Carroll RJ and Rupp ert D Transformation and Weighting in Regression

Chapman and Hall London

Carroll RJ Rupp ert D and Stefanski LA Measurement Error in Non

linear Mo dels Chapman and Hall London

Carroll RJ and Stefanski LA Approximate Quasilikeliho o d Estimation

in Mo dels with Surrogate Predictors Journal of the American Statistical As

so ciation

Fuller WA Measurement Error Mo dels John Wiley New York

Kuchenho H and Carroll R J with Errors in

Predictors SemiParametric and Parametric Metho ds Statistics in Medicine

Vol

Liang K Y and Liu X Estimating Equations in Generalized Linear Mo dels

with Measurement Error In Estimating Functions VP Go damb e ed New

York Oxford University Press

Minder Ch E and Volkle H Radon und Lungenkrebssterblichkeit in der

Schweiz Bericht der mathematischstatistischen Sektion der Forschungs

gesellschaft Johanneum

Thamerus M Fitting a Finite Mixture Distribution to a Variable Sub ject

to Heteroscedastic Measurement Error University of Munich Discussion Pap er

List of the SFB Nr