Reanalysis of Ecological Data from the 1989 China Study Using SAS® Enterprise Guide®

Thomas E. Billings, MUFG Union Bank, N.A., San Francisco, California

This work by Thomas E. Billings is licensed (2014) under a Creative Commons Attribution 4.0 International License.

Abstract

The China Studies are a series of large-scale ecological and epidemiological studies conducted in China in the 1980s, with Taiwan added in 1989. Over 600 parameters were captured, and most raw data are in the form of averages for up to 69 counties. The data are freely available online. We import the 1989 data into SAS® Enterprise Guide®, clean and transform them, and validate them against the published data. The metadata are also imported. Simple analyses are demonstrated using the data: correlations, Hoeffding’s D statistic, partial correlations, regressions, etc. The Correlation Task is used to create a large correlation matrix for all 639 variables in the study, which is then unrolled and reformatted for reporting. (Note, for contrast, that the published 1989 China Study monograph provides correlations for fewer than 400 of the variables.) Possible issues that may constrain analyses of the data are discussed, and we end with another sample exploratory analysis: a by-county plot of principal components derived from diet variables for animal foods vs. plant foods consumption.

Background -- the raw data:

A series of large-scale ecological and epidemiological studies were conducted in China in the 1980s, with Taiwan added in 1989.
The surveys involved were:

  1973-5 mainland China mortality: mortality data recompiled from Chinese government data
  1983 mainland China survey: survey data on diet, lifestyle, and geographic variables; also analysis of plasma, red blood cell and urine parameters for survey participants
  1986-8 mainland China mortality: mortality data similar to data from 1973-5 collection
  1989 mainland China resurvey: data similar to 1983 collection
  1993 mainland China resurvey: data similar to 1983, 1989 collections
  1986-8 Taiwan mortality: mortality data similar to mainland 1983 collection
  1989 Taiwan survey: data similar to mainland 1983 collection but no diet data.

The 1983 mainland survey covered 65 mostly rural counties in China; the 1989 survey covered the same 65 and added 4 counties to the base. In each county, 2 xiangs (communes) were sampled and data were collected for men and women (when relevant). The 1993 mainland resurvey sampled a smaller number of counties. The data are in the form of averages for the individuals sampled in each county, commune, and gender classification. The result is that for the 2 principal mainland China studies, the data consist of n=65 (1983) or n=69 (1989) means. These data are ecological, an important point.

The data from the surveys above were analyzed, and summary statistics and graphs (with an emphasis on Pearson correlations) were published in 2 monographs:

Data from 1973-5 mortality, 1983 mainland surveys:
Chen J. et al. (1990). Diet, Lifestyle and Mortality in China: a study of the characteristics of 65 Chinese counties. Oxford: Oxford University Press. Google Books URL: http://books.google.com/books?id=kv1WAAAAYAAJ

The rest of the data, including some 1983 vs. 1989 comparisons, are published in a lengthy monograph, for which some text is available online:
Chen J. et al. (2006).
Mortality, biochemistry, diet, and lifestyle in rural China: geographic study of the characteristics of 69 counties in mainland China and 16 areas in Taiwan. Oxford: Oxford University Press. URL: http://www.ctsu.ox.ac.uk/~china/monograph/index.htm

Accessing the raw data and scope of this project:

The data from all of the above surveys are freely available to the public online via the URL: http://www.ctsu.ox.ac.uk/~china/monograph/chdata.htm

There are 13 main data sets (all .csv files), plus a supplementary data set that contains metadata (.txt file). This paper uses data from only 1 time period, specifically the 1986-1989 mainland China mortality, diet, lifestyle, and geographic variables. Combining the surveys for this time period yields a data set with 642 variables. The raw data were downloaded from the URL above and saved into text and .csv files for import into the SAS® system. The web page above contains the following advice concerning the data: “NB These files are not particularly user-friendly”. We shall see that this is indeed true; the data are discussed in detail below.

Background -- Constraints of ecological data: the ecological fallacy

With very few exceptions, the data from the subject surveys are provided in the form of group averages. Most variables were collected from samples drawn from 2 different communes (xiang) per county. This yields averages at these hierarchical levels:

  Xiang 1 and xiang 2 (each separately): males, females, total/combined
  Overall averages, xiang 1 & 2 combined: males, females, all subjects.

The end result is that we have 9 rows of averages for each of the counties in the study. Note that some data variables were captured/recorded for males only, others for females only. The data include a large number of missing values, the result of variables being captured at fewer than 9 levels.

Data that are averages for groups are defined as ecological data.
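Why averaging matters can be seen in a toy example. The sketch below uses synthetic numbers (not China Study data) and is written in Python rather than SAS purely for illustration; the `pearson` helper and the five hypothetical "counties" are assumptions introduced here. Within every county the two variables are perfectly negatively correlated across individuals, yet the county means are perfectly positively correlated.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: 5 "counties" with 5 individuals each.
# Inside a county, x rises exactly as y falls; across counties,
# both x and y shift upward together by 10 per county.
counties = {}
for g in range(5):
    xs = [10 * g + i for i in range(-2, 3)]
    ys = [10 * g - i for i in range(-2, 3)]
    counties[g] = (xs, ys)

# Individual-level correlation, computed separately within each county.
within = [pearson(xs, ys) for xs, ys in counties.values()]

# Ecological correlation: one (mean x, mean y) pair per county.
mean_x = [mean(xs) for xs, _ in counties.values()]
mean_y = [mean(ys) for _, ys in counties.values()]
ecological = pearson(mean_x, mean_y)

print("within-county correlations:", within)       # -1.0 in every county
print("correlation of county means:", ecological)  # 1.0
```

The example is deliberately extreme. Real data rarely reverse sign so cleanly, but it shows that a correlation computed on group means says nothing, by itself, about the relationship among individuals within the groups.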
There are major constraints on the interpretation of ecological data, summarized in the ecological fallacy: “the ecological fallacy consists in thinking that relationships observed for groups necessarily hold for individuals.” Source: Freedman, D (1999). Ecological Inference and the Ecological Fallacy. Freely available at URL: http://www.stat.berkeley.edu/~census/549.pdf

Billings (1999) provides a discussion of other issues and concerns in using data from the China studies. Anyone planning to do serious research with these data should also read Greenland & Robins (1994).

Additionally, since the emphasis in the research monographs is on correlations, we note that correlation is not proof of a causal relationship, and the same statement can be made more generally with regard to statistical models. Correlations and statistical models are evidence of possible associations, not necessarily (by themselves) proof of causation. Accordingly, ecological data are useful primarily for exploring relationships and proposing hypotheses, which can then be investigated using other research methods, i.e., something other than ecological surveys.

Motivation and objectives:

In late 2006, one of the researchers involved in the China studies co-authored a popular-press, non-peer-reviewed book titled The China Study (i.e., “gray literature”; Campbell & Campbell (2006)). Despite the frankly misleading title, only 1 chapter in the book directly addressed the China studies listed above. The book made a number of controversial claims that inspired crowdsourced research by interested parties on the Internet. At the time, the debates on the Internet were harsh and intense, and some commenters went to the significant effort of (manually) importing some of the raw data from the 1983 and 1989 China studies into Microsoft Excel for statistical analysis.
The native statistical analysis capabilities of Excel are quite limited. Add-ins are available for Excel, e.g., SAS Add-In for Microsoft Excel (and other, competing add-ins), but the individuals doing the analyses were, for the most part, not using add-ins to boost the statistical abilities of Excel (presumably they did not have access to add-in products). The results of some of these analyses showed that data from the 1983 and 1989 China studies did not necessarily support, and in some cases appeared to contradict, some of the claims in The China Study book.

A major motivation for this paper is to make the China Study data available in SAS, where serious analyses can be done, so that those who wish to explore the data do not have to rely on the limited statistical capabilities of Excel. We also provide some sample exploratory results to demonstrate what can be done directly in SAS Enterprise Guide.

In more detail, the objectives of the present paper are as follows:

  To import the 1989 raw data (and metadata) into SAS using SAS Enterprise Guide®, with some user-written code, in a way that allows reconstruction of many of the analyses reported in the research monographs for the 1983, 1989 China studies, and to support additional analyses as appropriate
  To provide some sample exploratory/demo analyses using the 1989 raw data
  To compute correlations for all 600+ variables in the 1989 data (the relevant monograph provided correlations for <400 variables) and to provide this data in a form that supports reporting
  To do a sample demo analysis of select 1989 lifestyle variables using principal components, ending with
  A discussion of the limitations on analyses of these data: how complex should such an analysis be?

Along the way we will find that the data, and the SAS system, have some interesting surprises and challenges in store for us.
Because the data used here were the topic of contentious debate, and to avoid any misunderstandings, readers should be aware that this paper does not attempt to:

  Reproduce every statistic computed in the monographs for the 1983, 1989 studies
  Derive detailed, advanced statistical models using the data
  Be “totally comprehensive” or the “last word” on the data or research involving these data.

Applied statistical research and programming involve a myriad of decisions regarding how to process the data (and, by the way, some of these decisions are subjective). As suggested in Speed (2014), and so that researchers fully understand the target data, we provide documentation on many of the processing and design decisions made in the completion of this paper.

Four SAS Enterprise Guide projects were created for this paper; those projects are described in the sections that follow.

Project #1: import raw data and perform sample demo analyses

Project #1 is the “most essential” project and may be the only project of interest to those wishing to directly work with/analyze the raw data, i.e., those who have no interest in correlations.