Determining Appropriate Sample Size for Cases in a Case-Control Study
UNIVERSITY OF CINCINNATI

Date: 3-May-2010

I, Karen Weyer, hereby submit this original work as part of the requirements for the degree of: Master of Science in Biostatistics (Environmental Health)

It is entitled: Determining Appropriate Sample Size for Cases in a Case-Control Study Utilizing Proxy Respondents

Student Signature: Karen Weyer

This work and its defense approved by:
Committee Chair: Paul Succop, PhD
Tania Carreon-Valencia, PhD
5/18/2010

Determining Appropriate Sample Size for Cases in a Case-Control Study Utilizing Proxy Respondents

A thesis submitted to the Division of Research and Advanced Studies of the University of Cincinnati in partial fulfillment of the requirements for the degree of Master of Science in the Department of Environmental Health of the College of Medicine

by Karen Weyer
B.A. Agnes Scott College
June 2010
Committee Chair: Paul Succop, PhD

ABSTRACT

There are many situations in research when the subject cannot be interviewed about their medical, family, or work history because they have died or are incapable of providing responses on their own. In these situations the researcher must rely on a proxy. Using data from the Upper Midwest Health Study (UMHS), two power simulation analyses were completed to determine the sample size of proxy respondents required to maintain a power of 0.80 or higher. Results show that proxy responses can be relied upon to agree with subject responses for demographic questions, but not for study-related questions. Additionally, the power analysis demonstrated that when the percentage of disagreement between the proxy and subject responses is larger, fewer proxy respondents are required for a power of 0.80.

Acknowledgements

I would like to thank my academic advisor and committee chair Dr. Paul Succop for his guidance, wisdom, and patience. I would like to thank Dr.
Tania Carreon-Valencia for providing me the opportunity to work on this project, as well as for her support and efforts through its completion. I would like to thank Dr. Avima Ruder for creating the data subset from the Upper Midwest Health Study on which this analysis is based. I would like to thank those I work with at i3 statprobe who supported me during this project: John Lasley, Jennifer Savage-Sales, Ann Hemken, and Diana Cucos, and the many others who would ask me ‘How is your thesis coming, Karen?’ Finally, I would like to thank my family for their ongoing support: Mom and Dad for encouraging me to get it done, my brother Stephen for unknowingly challenging me to finish my degree at the same time he finished his, and my husband Scott and son Benjamin for providing me love, support, and quiet Sunday afternoons to get this paper finished.

Table of Contents

Abstract
Acknowledgements
List of Tables
List of Figures
Section I: Sample-Size and Power
Section II: Analysis of Upper Midwest Health Study Data
    Introduction
    Methods
    Results Part I: Analysis of UMHS Data
    Results Part II: Analysis of Simulation
    Discussion
Section III: Impact on Environmental Health Research
Bibliography
Tables
Figures
Appendix 1: Question List
Appendix 2: SAS Code

List of Tables

1. Percentage of Matched Responses Between Subject and Proxy Respondents
2. Z-test Scores (p-value) for Response Agreement Between Subjects and Proxies when Subject Answered “Yes”
3.
Power of Proxy Agreement to Subject Responses

List of Figures

1) Power versus Sample Size for Demographic Questions – alpha = 0.05
2) Power versus Sample Size for Study Related Questions – alpha = 0.05
3) Power versus Sample Size for Question Groups – alpha = 0.05
4) Power versus Sample Size for Demographic Questions – alpha = 0.01
5) Power versus Sample Size for Study Related Questions – alpha = 0.01
6) Power versus Sample Size for Question Groups – alpha = 0.01
7) Mean Change versus Sample Size for Demographic Questions: alpha = 0.05 and beta ≥ 0.80
8) Mean Change versus Sample Size for Study Related Questions: alpha = 0.05 and beta ≥ 0.80
9) Mean Change versus Sample Size for Question Groups: alpha = 0.05 and beta ≥ 0.80
10) Mean Change versus Sample Size for Demographic Questions: alpha = 0.01 and beta ≥ 0.80
11) Mean Change versus Sample Size for Study Related Questions: alpha = 0.01 and beta ≥ 0.80
12) Mean Change versus Sample Size for Question Groups: alpha = 0.01 and beta ≥ 0.80

Section I: Sample-Size and Power

Choosing the appropriate sample size for a study is critical. A sample that is too small may cause the study to show non-significant results even when the true mean differs from the null mean. A sample that is too large may be costly without providing much more information than a smaller sample. Four factors determine the sample size: the standard deviation of the hypothesized effect (σ, which enters the calculation as the variance σ²), the probability of a type I error (α-level), power (1-β), and the effect size (i.e., the distance between the means under the null and alternative hypotheses) (Rosner, 2000). When these factors are set according to the specifications of the study, the ideal sample size can be determined.

Power is defined as one minus the probability of making a type II error. Power tells us the likelihood of rejecting the null hypothesis given that the alternative hypothesis is true (Rosner, 2000).
It is influenced by the same factors: sample size, alpha, effect size, and the standard deviation of the hypothesized effect. Power decreases when alpha decreases or when the standard deviation of the hypothesized effect increases. Power increases when the difference between the true mean and the null mean increases or when the sample size increases (Rosner, 2000). Most researchers set power at 0.80 or higher when estimating sample size.

Section II: Analysis of Upper Midwest Health Study Data

Introduction:

There are often situations in research when the subject cannot provide a reliable medical, occupational, or lifestyle history. Researchers must then rely on the history provided by the subject’s proxy respondent. In epidemiological studies, proxy responses increase study power by increasing the sample size, and they improve representation of the case group (Campbell et al., 2007). However, the accuracy of the proxy’s information can be influenced by multiple factors, including length of relationship, bias, and quality of relationship, and proxy information has the potential to misclassify the exposure (Johnson et al., 1993).

Previous studies (Campbell et al., 2007; Johnson et al., 1993) have compared proxy responses for accuracy against self-respondents in case-control studies. Campbell et al. (2007) looked at the completeness of information from proxy respondents in a population-based case-control study that used self-administered mailed questionnaires. They found that parents and spouses provided more complete information than did children, siblings, friends, or others for many questions commonly asked in epidemiological research. Johnson et al. (1993) examined the quality of pesticide exposure data provided by self- and proxy respondents and concluded that the pesticide data from proxy and self-respondents do not lead to the same conclusions or the same estimate of risk.
If the information from self- and proxy respondents does not lead to the same conclusions because their responses differ, then adding proxy respondents to increase study power may actually lead to incorrect conclusions during study analysis.

This paper uses data from the Upper Midwest Health Study (UMHS) to determine the optimal sample size for studies examining pesticide exposure when using data from proxies. The analysis looks at the agreement between proxy and respondent answers and at whether a sample of proxies provides responses similar to those of the subject/self-respondent. Explicitly, the null hypothesis is that the proxies will respond the same as the subject. A power analysis is then completed, using the standard deviation and the difference in subject-proxy agreement, to determine the appropriate sample size for similar studies.

Methods:

Data from the Upper Midwest Health Study, a case-control study, were analyzed to determine the optimal sample size when collecting proxy data. Researchers from the National Institute for Occupational Safety and Health (NIOSH) collected pesticide usage data from residents in non-urban areas in four upper Midwestern states: Michigan, Minnesota, Iowa, and Wisconsin (Ruder et al., 2006). The 872 cases were diagnosed with gliomas (the most common type of brain cancer) from January 1995 through January 1997. The 1669 control participants had no diagnosis of glioma. All participants (or their proxies) were interviewed and answered questions regarding their occupational experiences, ethnicity, education, and lifestyle. A subset of case respondents had a next of kin (proxy) interviewed 1-2 years after the original interview. This subset of 105 matched responses was used for this analysis.

Seven questions from the administered questionnaire were analyzed (Appendix 1). For this analysis, the questions were grouped into two categories, ‘demographic’ and ‘study-related’.
The questions were analyzed separately and across the two groups to determine whether the responses of subjects and their proxies differed significantly and whether recall bias was possible. Additional analyses compared the number of subjects who answered ‘Yes’ to a question with the number of proxies who also answered ‘Yes’ to the same question, to determine whether the responses differed significantly. The subject’s response was considered the ‘gold-standard’, and a z-test was used to compare the proxy’s response to the subject’s response.
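
To make the z-test comparison and its link to power concrete, the following is a minimal sketch in Python rather than the SAS code used for the thesis (Appendix 2). It rests on assumptions not stated in the text: the subject 'Yes' proportion is treated as a fixed reference value p0, the proxy 'Yes' proportion is tested against it with a two-sided one-sample z-test, and power is taken from the normal approximation. The function names and the example counts are hypothetical and are not UMHS results.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def _check_proportion(p):
    """Reject proportions of 0 or 1, where the standard error is zero."""
    if not 0.0 < p < 1.0:
        raise ValueError("proportions must lie strictly between 0 and 1")


def proportion_z_test(proxy_yes, n, p0):
    """Two-sided one-sample z-test of the proxy 'Yes' proportion against p0.

    Assumption: p0 (the subject 'Yes' proportion) is a fixed gold standard.
    Returns the z statistic and the two-sided p-value.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    _check_proportion(p0)
    p_hat = proxy_yes / n
    se0 = math.sqrt(p0 * (1.0 - p0) / n)  # standard error under the null
    z = (p_hat - p0) / se0
    p_value = 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))
    return z, p_value


def approx_power(n, p0, p1, alpha=0.05):
    """Normal-approximation power of the test when the true proportion is p1."""
    _check_proportion(p0)
    _check_proportion(p1)
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    se0 = math.sqrt(p0 * (1.0 - p0) / n)  # spread under the null
    se1 = math.sqrt(p1 * (1.0 - p1) / n)  # spread under the alternative
    # Rejection region limits for p_hat, standardized under the alternative.
    upper = (p0 + z_crit * se0 - p1) / se1
    lower = (p0 - z_crit * se0 - p1) / se1
    return (1.0 - STD_NORMAL.cdf(upper)) + STD_NORMAL.cdf(lower)


def smallest_n(p0, p1, alpha=0.05, target=0.80, n_max=100_000):
    """Smallest n whose approximate power reaches the target, or None."""
    for n in range(2, n_max + 1):
        if approx_power(n, p0, p1, alpha) >= target:
            return n
    return None


if __name__ == "__main__":
    # Hypothetical counts: 40 of 105 proxies answer 'Yes'; subjects at 50%.
    z, p = proportion_z_test(proxy_yes=40, n=105, p0=0.50)
    print(f"z = {z:.3f}, two-sided p = {p:.4f}")
    for p1 in (0.60, 0.70):
        print(f"true proxy proportion {p1:.2f}: n = {smallest_n(0.50, p1)}")
```

Under these assumptions, a larger gap between the proxy and subject proportions reaches a power of 0.80 with fewer proxies, which matches the direction of the finding reported in the abstract. A simulation-based power analysis such as the one described in this thesis could give different numbers from this closed-form approximation.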