
Reanalysis of Ecological Data from the 1989 Study Using SAS® Enterprise Guide® Thomas E. Billings, MUFG Union Bank, N.A., San Francisco, California

This work by Thomas E. Billings is licensed (2014) under a Creative Commons Attribution 4.0 International License.

Abstract

The China Studies are a series of large-scale ecological and epidemiological studies conducted in China in the 1980’s, with Taiwan added in 1989. Over 600 parameters were captured and most raw data are in the form of averages for up to 69 counties. The data are freely available online. We import the 1989 data into SAS® Enterprise Guide®, clean/transform it, and validate against the published data. The metadata are also imported. Simple analyses are demonstrated using the data: correlations, Hoeffding’s D statistic, partial correlations, regressions, etc. The Correlation Task is used to create a large correlation matrix for all 639 variables in the study, which is then unrolled and reformatted for reporting. (Note, for contrast, that the published 1989 China Study monograph provides correlations for fewer than 400 of the variables.) Possible issues that may constrain analyses of the data are discussed, and we end with another sample exploratory analysis: a by-county plot of principal components derived from variables for animal vs. plant food consumption.

Background -- the raw data:

A series of large-scale ecological and epidemiological studies were conducted in China in the 1980’s, with Taiwan added in 1989. The surveys involved were:

 1973-5 mainland China mortality: mortality data recompiled from Chinese government data  1983 mainland China survey: survey data on diet, lifestyle, and geographic variables; also analysis of plasma, red blood cell and urine parameters for survey participants  1986-8 mainland China mortality: mortality data similar to data from 1973-5 collection  1989 mainland China resurvey: data similar to 1983 collection  1993 mainland China resurvey: data similar to 1983, 1989 collections  1986-8 Taiwan mortality: mortality data similar to mainland 1983 collection  1989 Taiwan survey: data similar to mainland 1983 collection but no diet data.

The 1983 mainland survey covered 65 mostly rural counties in China; the 1989 survey covered the same 65 and added 4 additional counties to the base. In each county, 2 xiangs (communes) were sampled and data were collected for men and women (when relevant). The 1993 mainland resurvey sampled a smaller number of counties.

The data are in the form of averages for the individuals sampled in each county, commune, and gender classification. The result is that for the 2 principal mainland China studies, the data consist of n=65 (1983) or n=69 (1989) means. These data are ecological, an important point.

The data from the surveys above were analyzed, and summary statistics and graphs (with an emphasis on Pearson correlations) were published in 2 monographs:

Data from 1973-5 mortality, 1983 mainland surveys: Chen J. et al. (1990). Diet, Lifestyle and Mortality in China: a study of the characteristics of 65 Chinese counties. Oxford : Oxford University Press. Google Books URL: http://books.google.com/books?id=kv1WAAAAYAAJ

The rest of the data, including some 1983 vs. 1989 comparisons, are published in a lengthy monograph, for which some text is available online:


Chen J. et al. (2006). Mortality, biochemistry, diet, and lifestyle in rural China: geographic study of the characteristics of 69 counties in mainland China and 16 areas in Taiwan. Oxford: Oxford University Press. URL: http://www.ctsu.ox.ac.uk/~china/monograph/index.htm

Accessing the raw data and scope of this project:

The data from all of the above surveys are freely available to the public online via the URL: http://www.ctsu.ox.ac.uk/~china/monograph/chdata.htm

There are 13 main data sets (all .csv files), plus a supplementary data set that contains metadata (.txt file). This paper focuses on/uses data from only 1 time period, specifically the 1986-1989 mainland China mortality, diet, lifestyle, and geographic variables. Combining the surveys for this time period yields a data set with 642 variables. The raw data were acquired via download from the URL above, and saved into text and .csv files for import into the SAS® system.

The web page above contains the following advice concerning the data: “NB These files are not particularly user-friendly”. We shall see that, indeed, this is true; the data are discussed in detail below.

Background -- Constraints of ecological data: the ecological fallacy

With very few exceptions, the data from the subject surveys are provided in the form of group averages. Most variables were collected from samples drawn from 2 different communes (xiang) per county. This yields averages at these hierarchical levels:

- Xiang 1 & 2: males, females, total/combined
- Overall averages, xiang 1 & 2 combined: males, females, all subjects.

The end result is that we have 9 rows of averages for each of the counties in the study. Note that some data variables were captured/recorded for males only, others for females only. The data include a large number of missing values because many variables were captured at fewer than 9 levels.

Data that are averages for groups are defined as ecological data. There are major constraints on the interpretation of ecological data, summarized in the ecological fallacy:

“the ecological fallacy consists in thinking that relationships observed for groups necessarily hold for individuals.” Source: Freedman, D (1999). Ecological Inference and the Ecological Fallacy. Freely available at URL: http://www.stat.berkeley.edu/~census/549.pdf

Billings (1999) provides a discussion of other issues and concerns in using data from the China studies. Anyone planning to do serious research with these data should also read Greenland & Robins (1994).

Additionally, since the emphasis in the research monographs is on correlations, we note that correlation is not proof of a causal relationship, and the same statement can be made more generally in regards to statistical models. Correlations and statistical models are evidence of possible associations, not necessarily (by themselves) proof of causation.

Accordingly, ecological data are useful primarily for exploring relationships and proposing hypotheses, which can then be explored using other research methods, i.e., something other than ecological surveys.

Motivation and objectives:

In late 2006, one of the researchers involved in the China studies co-authored a popular-press, non-peer-reviewed book (i.e., “gray literature”) titled The China Study (Campbell & Campbell (2006)). Despite the frankly misleading title, only 1 chapter in the book directly addressed the China studies listed above. The book made a number of controversial claims that inspired crowdsourced research by interested parties on the Internet.


At the time, the debates on the Internet were harsh/intense, and some commenters went to the significant effort of (manually) importing some of the raw data from the 1983, 1989 China studies into Microsoft Excel for statistical analysis. The native statistical analysis capabilities of Excel are quite limited: add-ins are available for Excel, e.g., the SAS Add-In for Microsoft Excel (and other, competing add-ins), but the individuals doing the analyses were, for the most part, not using add-ins to boost the statistical abilities of Excel (presumably they did not have access to add-in products). The results of some of these analyses showed that data from the 1983, 1989 China studies did not necessarily support, and in some cases appeared to contradict, some of the claims in The China Study book.

A major motivation for this paper is to make the China Study data available in SAS – where serious analyses can be done – so that those who wish to explore the data do not have to rely on the limited statistical capabilities of Excel. We also provide some sample exploratory results to demonstrate what can be done directly in SAS Enterprise Guide.

In more detail, the objectives of the present paper are as follows:

- To import the 1989 raw data (and metadata) into SAS using SAS Enterprise Guide®, with some user-written code, in a way that allows reconstruction of many of the analyses reported in the research monographs for the 1983, 1989 China studies, and to support additional analyses as appropriate
- To provide some sample exploratory/demo analyses using the 1989 raw data
- To compute correlations for all 600+ variables in the 1989 data (the relevant monograph provided correlations for <400 variables) and to provide this data in a form that supports reporting
- To do a sample demo analysis of select 1989 lifestyle variables using principal components, ending with
- A discussion of the limitations on analyses of these data: how complex should such an analysis be?

Along the way we will find that the data – and the SAS system – have some interesting surprises and challenges in store for us.

Recognizing that the data used here were the topic of contentious debate, and to avoid any misunderstandings, readers should be aware that this paper does not attempt to:

- Reproduce every statistic computed in the monographs for the 1983, 1989 studies
- Derive detailed, advanced statistical models using the data
- Be “totally comprehensive” or the “last word” on the data or research involving these data.

Applied statistical research and programming involve a myriad of decisions regarding how to process the data (and, by the way, some of these decisions are subjective). As suggested in Speed (2014), and so that researchers fully understand the target data, we provide documentation on many of the processing and design decisions made in the completion of this paper. Four SAS Enterprise Guide projects were created for this paper; those projects are described in the sections that follow.

Project #1: import raw data and perform sample demo analyses

Project #1 is the “most essential” project and may be the only project of interest to those wishing to directly work with/analyze the raw data, i.e., those who have no interest in the full correlation matrix produced by Projects #2 and #3. This project imports the raw data from .csv files and the metadata from .txt files. Multiple output files are created, which gives the user the option of working with relatively raw data that has no metadata, raw data plus metadata, or data transformed to reproduce the correlations found in the China Study monographs.

The first step – before running the project - is to create a directory that will contain the downloaded raw data files (C:\CSData). Then download these files: CH89M.CSV, CH89PRU.CSV, CH89DG.CSV, CH89Q.CSV, and CHNAME.TXT from the URL listed above. To accomplish this, access the URL for each file – i.e., open the file in a browser, right click on the browser page and then “Save as:” into the target directory, C:\CSData.

The project does the following processing. A Program Task is defined to specify OPTIONS of interest. Then the Import Wizard is used on each of the 4 .csv files to create a SAS file containing the raw data. The .csv files contain an extraneous comma at the end of the first line that contains the variable names. This generates an extraneous variable when importing the .csv files; this was edited out of the list of input variables in the Import Wizard.
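For those who prefer code to the Wizard, a rough PROC IMPORT equivalent for one of the files is sketched below. The path and options are assumptions, and the Wizard itself generates different (DATA step) code; this is only a sketch of the same step.

proc import datafile="C:\CSData\CH89M.CSV"
            out=work.ch89m
            dbms=csv
            replace;
   getnames=yes;
   guessingrows=700;   /* scan enough rows (69 counties x 9 rows = 621) to set variable types */
run;
/* the trailing comma on the header line still yields one extra, empty variable, */
/* which can simply be dropped afterwards                                         */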


The 4 files, 1 from each .csv, are then joined using the Query Builder Task into a single file; join by: county, sex, xiang (commune). A left join is used but all 4 files have the same cardinality so it could be another type of join. This gives us a single file, CS_1989_all_raw, with the combined raw data from the .csv files. A series of Tasks are then used to extract the raw data at the county average level, i.e., WHERE sex=T and xiang=3. Additional tests are used to determine how many of the values are missing, for each of the 600+ variables, in the 69 counties. A large number of variables are identified in this analysis as being always missing at the county average level.
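A minimal sketch of an equivalent coded join follows; the imported data set names are assumptions, and the Query Builder generates different, though logically similar, code.

proc sql;
   create table work.cs_1989_all_raw as
   select m.*, p.*, d.*, q.*
   from work.ch89m as m
        left join work.ch89pru as p
          on m.county = p.county and m.sex = p.sex and m.xiang = p.xiang
        left join work.ch89dg as d
          on m.county = d.county and m.sex = d.sex and m.xiang = d.xiang
        left join work.ch89q as q
          on m.county = q.county and m.sex = q.sex and m.xiang = q.xiang;
   /* the shared key columns appear in every input, so SAS logs "variable     */
   /* already exists" warnings and keeps the first copy of each               */
quit;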

Examination of the raw data showed that averages were reported for these variables at either the Male or Female level, but not combined. The research monograph confirms this (observed) data structure. What this means is that in the 9 rows per county structure of the raw data, the value was missing in the county average row but populated in another row. Recognizing that we want to be able to do correlations for all of the variables, we need for all of the county averages to be populated in a single row. This requires moving these data – the variables with missing county averages – across rows.

The relevant Male/Female averages are mapped into the county averages, i.e., across rows, by a series of Tasks:

- Use the Query Builder to create an extract containing the target Male averages
- Use the Query Builder to create an extract containing the target Female averages
- In a Program Task, use a DATA step MERGE operation to create 2 data sets:
  - An updated version of CS_1989_all_raw (name = CS_1989_V2) that contains the raw data with the 3 county average rows, i.e., WHERE sex=T and xiang=3, overlaid with the Male or Female averages (here missing values are overlaid with the raw Male/Female data)
  - A separate data set containing the county averages: WHERE sex=T and xiang=3; this is file CTY_MN_1989_NOMETA.

The processing done in a Program Task could have been done using multiple Query Builder Tasks. Instead, a DATA step is used as the code required is much shorter, and 2 data sets can be created in a single DATA step. The code used is very simple:

* use extract files in DATA step MERGE to overwrite missing values         ;
* in sex = "T" rows, all 3 xiang values. create new version of             ;
* omnibus/all rows file, plus an extract file with overall county averages ;
data WORK.CS_1989_v2 work.cty_mn_1989_nometa;
   merge WORK.CS_1989_all_raw male_1989 female_1989;
   by county sex xiang;
   output WORK.CS_1989_v2;
   if (sex = "T") and (xiang = 3) then output work.cty_mn_1989_nometa;
run;

At this stage, we have a file with the raw data, but no metadata have been imported. The following is a short extract from the PROC CONTENTS/Task=Data Set Attributes output; the file actually has 642 variables, and only a few are shown in the extract:

Alphabetic List of Variables and Attributes

  #  Variable  Type  Len  Format   Informat  Label
  1  County    Char    2  $CHAR2.  $CHAR2.
230  D001      Num     8  BEST4.   BEST4.
231  D002      Num     8  BEST4.   BEST4.
232  D003      Num     8  BEST4.   BEST4.
233  D004      Num     8  BEST3.   BEST3.
234  D005      Num     8  BEST4.   BEST4.

These 2 files (CS_1989_V2, CTY_MN_1989_NOMETA) are complete/have all of the raw data but no metadata. Despite this, these files can be used as-is for analyses if desired.


The research monographs use, and the metadata file provides, short, descriptive variable names that are more informative than names like, e.g., D009. The metadata file also provides a longer description of the variable. In order to provide more informative SAS output, we want to use the short variable names from the metadata file, and provide the question ID/number appended to the long variable description, as a SAS variable label.

However, in order to use the metadata, we must overcome some limitations:

1. The short variable names do not conform to SAS variable name standards, hence we have to use OPTIONS VALIDVARNAME=ANY and the naming convention “var name”n or ‘var name’n.
2. It turns out that SAS Enterprise Guide uses the convention “var name”n for this type of variable name in system-generated code. In early runs, this caused downstream tasks to fail because some of the short variable names include the % character, which the SAS macro language tries to interpret. To get around this, we use the TRANWRD function to replace “%” with “_pct” in variable names that contain the % character (see the sketch below).
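A minimal sketch of these two workarounds follows; the data set and column names are placeholders rather than the project’s exact code.

options validvarname=any;   /* permit "non-standard name"n name literals */

data work.chname_fixed;
   set work.chname_raw;     /* assumed: one row per variable, read from CHNAME.TXT */
   /* replace % so that EG-generated "name"n references are not parsed as     */
   /* macro calls downstream                                                  */
   varname = tranwrd(varname, '%', '_pct');
run;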

There is an additional complication to consider. Although this paper is not interested in comparing 1983 vs. 1989 data, male vs. female, xiang 1 vs. xiang 2, others who wish to work with the data may be interested in doing so. If, for example, you have 2 files, one with 1983 data, one with 1989 data, with the same variable names in both files, then trying to do 1983 vs. 1989 correlations would require major renaming of the variables in one file, prior to or during a join of the 2 files. In order to support this, we construct short variable names using the metadata file that are the concatenation of:

- Short variable name from metadata (processed to change % signs)
- An underscore: _
- The year of the data: 1989, 1983, or TAIW for Taiwan data
- An underscore: _
- 2 characters, the concatenation of (M, F, T: male, female, total) for sex and (1, 2, 3) for xiang 1, 2, and combined.

So for example, for question M001, mortality M001: ALL CAUSES AGE 0-4 (rate/1,000), with short name in metadata mALL0-4, we construct the short variable name: mALL0-4_1989_T3. This particular short name includes a nonstandard character (-) so in program code it must be referred to as “mALL0-4_1989_T3”n. This approach is more cumbersome than using, say, M001 for the variable but it is unambiguous and prevents mistakes. (If you instead choose to use the file with variable names like M001, D001, then you might want to rename those variables and add a similar suffix to avoid confusion.)

Attention: you will need to copy the metadata from the chname.txt file into the source code after the DATALINES4; statement in the Project #1 Program Task labelled “metadata from CHNAME file”. The raw data were deleted before the project was uploaded, for legal reasons. This is a precaution, in case the owners of the data claim copyright or sui generis property rights on the data. (The data are on a website in the U.K. so sui generis rights may be applicable; the U.S. does not recognize sui generis rights.) Having users copy/paste the raw metadata file avoids republication of the raw data in the project, hence there are no copyright or sui generis issues. It also lets you avoid having to write a SAS FILENAME statement.

A metadata file is created which can be exported as .csv or .xlsx. Exporting the file into a spreadsheet and formatting it for easy reading is highly recommended: this will let you search the file’s descriptions to identify variables of interest. The metadata file, in spreadsheet form, looks like this – here are a few rows for illustration:

vartype  qnum  varname    category   var_label
m        M001  mALL0-4    mortality  M001: ALL CAUSES AGE 0-4 (rate/1,000)
m        M002  mALL5-14   mortality  M002: ALL CAUSES AGE 5-14 (stand. rate/100,000)
m        M003  mALL15-34  mortality  M003: ALL CAUSES AGE 15-34 (stand. rate/100,000)
m        M004  mALL0-34   mortality  M004: ALL CAUSES AGE 0-34 (stand. rate/100,000)


A Program Task is then used with a SAS Macro to:

- Extract the target row from each 9-rows-per-county set, i.e., extract overall county averages
- Generate the source code for PROC DATASETS to rename the variables to the preferred short names, and assign a label to each variable.

The source code for PROC DATASETS is generated in a DATA _NULL_ step instead of using macro variables/a macro loop. This is because the 1st version of the code worked with short variable names that included % signs, and handling % signs in the macro language is doable but messy. The DATA _NULL_ approach provided a clean, 1-step solution. (A macro loop is used in a later project, for similar purposes.)

Remark. The SAS macro language is an excellent tool for code generation, but it is not the only way to generate code. The SAS DATA step can be used to generate code by writing the code to a separate file, then processing it via %INCLUDE. The SAS DATA step is more powerful than the SAS macro language; programmers should be able to use either method.
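A minimal sketch of this code-generation pattern follows; it is not the project’s exact Program Task, and the metadata data set, its columns (qnum, varname), and the fixed suffix are assumptions for illustration.

filename gencode temp;

data _null_;
   set work.cs_metadata end=last;
   file gencode;
   length newname $64;
   /* new name = short name (with % changed to _pct) plus year/sex/xiang suffix */
   newname = cats(tranwrd(strip(varname), '%', '_pct'), '_1989_T3');
   if _n_ = 1 then put 'proc datasets library=WORK nolist;'
                      / '  modify cty_mn_1989_meta;'
                      / '  rename';
   put '    ' qnum '= "' newname +(-1) '"n';
   if last then put '  ;' / 'quit;';
run;

%include gencode;   /* execute the generated renames; labels can be generated similarly */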

The macro options let you pull any one of the 9 rows per county, and in turn this facilitates the creation of data sets that will support 1983 vs. 1989 analyses, male vs. female, xiang 1 vs. 2, etc. The end product of all this work is a single data set named CTY_MN_1989_META with 69 rows, 1 row per county, containing county averages for 600+ variables, in the form (from a PROC CONTENTS/Task=Data Set Attributes):

Alphabetic List of Variables and Attributes

  #  Variable       Type  Len  Format   Informat  Label
  1  County         Char    2  $CHAR2.  $CHAR2.
  2  Sex            Char    1  $CHAR1.  $CHAR1.   Sex
  3  Xiang          Num     8  BEST1.   BEST1.
329  d10:0_1989_T3  Num     8  BEST5.   BEST5.    D100: INTAKE OF 10:0 (mg/day/reference man)
330  d11:0_1989_T3  Num     8  BEST5.   BEST5.    D101: INTAKE OF 11:0 (mg/day/reference man)
331  d12:0_1989_T3  Num     8  BEST5.   BEST5.    D102: INTAKE OF 12:0 (mg/day/reference man)
332  d13:0_1989_T3  Num     8  BEST5.   BEST5.    D103: INTAKE OF 13:0 (mg/day/reference man)
333  d14:0_1989_T3  Num     8  BEST4.   BEST4.    D104: INTAKE OF 14:0 (g/day/reference man)

Compare the above with the file that has variable names D001, D002, etc. Injecting the metadata into the files produces a data set for which – in this writer’s opinion – the SAS procedure output will be better labelled and easier to understand (at the expense of having to use the cumbersome “variable name”n convention for at least some variable names). Other analysts might disagree and prefer to work with variable names like D001_1989_T3, M071, Q101, etc. The project produces both types of files so the user has a choice of which raw data set to work with.

From this point forward we will work with the file above, CTY_MN_1989_META. Considering the work required to reach this point, we note that the statement on the web page with the China studies raw data, “NB These files are not particularly user-friendly”, is clearly true. Now that we have 69 rows with 1989 county averages, we can analyze them using SAS Enterprise Guide and the SAS System. Project #1 copies this raw data file to a (meta-) LIBNAME for use in other projects. Project #1 also includes some sample analyses, for illustration purposes. Let’s look at a few of these analyses. First, we use PROC CORR (Task: Multivariate > Correlations) to get the Pearson correlations and the values of Hoeffding’s D statistic for a few select variables; a sketch of the equivalent code and a small section of the output follow.
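The Task generates code roughly along the lines of this sketch (the library holding CTY_MN_1989_META is assumed to be WORK here); the variable list matches the output extract below.

proc corr data=work.cty_mn_1989_meta pearson hoeffding;
   var pTOTCHOL_1989_T3 pHDLCHOL_1989_T3 dPLNTPROT_1989_T3 dANIMPROT_1989_T3
       dSTCHSUGAR_1989_T3 dPUFA_1989_T3 dSATFA_1989_T3;
run;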


Pearson Correlation Coefficients, N = 69
Prob > |r| under H0: Rho=0

                           pTOTCHOL    pHDLCHOL    dPLNTPROT   dANIMPROT   dSTCHSUGAR  dPUFA       dSATFA
                           _1989_T3    _1989_T3    _1989_T3    _1989_T3    _1989_T3    _1989_T3    _1989_T3

pTOTCHOL_1989_T3           1           0.55569     -0.44208    0.6858      0.03331     0.11763     0.49639
P001: TOTAL CHOLESTEROL                <.0001      0.0001      <.0001      0.7858      0.3357      <.0001
(mg/dL)

pHDLCHOL_1989_T3           0.55569     1           -0.46081    0.58517     0.12553     0.22288     0.44492
P002: HIGH DENSITY         <.0001                  <.0001      <.0001      0.3041      0.0657      0.0001
LIPOPROTEIN CHOLESTEROL
(mg/dL)

Hoeffding Dependence Coefficients, N = 69
Prob > D under H0: D=0

                           pTOTCHOL    pHDLCHOL    dPLNTPROT   dANIMPROT   dSTCHSUGAR  dPUFA       dSATFA
                           _1989_T3    _1989_T3    _1989_T3    _1989_T3    _1989_T3    _1989_T3    _1989_T3

pTOTCHOL_1989_T3           0.90564     0.09999     0.0484      0.14734     0.00157     -0.00205    0.0668
P001: TOTAL CHOLESTEROL    <.0001      <.0001      0.0007      <.0001      0.282       0.5102      <.0001
(mg/dL)

pHDLCHOL_1989_T3           0.09999     0.98596     0.05961     0.10455     -0.00369    0.0045      0.06319
P002: HIGH DENSITY         <.0001      <.0001      0.0002      <.0001      0.6616      0.1786      0.0001
LIPOPROTEIN CHOLESTEROL
(mg/dL)


In the Correlation Task, we request separate output data sets for Pearson correlations and Hoeffding’s D statistic. These data sets are in fact created, but examination reveals that they do not include the significance levels displayed in the output above! The data set produced by the Task is TYPE=CORR, a special-purpose data set type that predates the Output Delivery System (ODS) feature of SAS. We show a way to get around this limitation in Project #2.

Project #1 also includes a number of other sample analyses that illustrate what can be done using SAS Enterprise Guide Tasks:

- Partial correlations
- Scatter plot
- Distribution analysis (histogram)
- Regression analysis.

Output from those Tasks is provided in the project. The sample histogram output illustrates a possible issue with the raw data:

[Figure: distribution analysis (histogram) of dANIMFAT_1989_T3]


The large value for 0 (zero) is suspicious. A check of the raw data suggests that 0 has been entered for missing values in some rows/counties. A few other variables show a similar pattern. Note that it is not hard to edit the data and replace 0 with missing. In general, academicians are quick to edit raw data, sometimes for weak/unsupported reasons. (The 1983 research monograph excluded from analysis the diet data from one county, claiming the survey was done on a feast day. Now that we have 1989 data, a comparison/analysis may be possible to determine if that exclusion is supported by the data.) Those of us who work with data are usually very reluctant to edit it. We have chosen to leave the zeroes in for this analysis/paper, recognizing that a future analysis might reset some of them to missing, yielding possibly different results.
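If a future analysis does recode zeroes to missing, a DATA step with an ARRAY and a DO loop is enough; a minimal sketch follows, with an illustrative (not vetted) variable list.

data work.cty_mn_1989_recode;
   set work.cty_mn_1989_meta;
   /* illustrative variable list only; confirm which zeroes are implausible   */
   /* before recoding, since a true zero intake is possible for some foods    */
   array chk {*} dANIMFAT_1989_T3 dMILK_1989_T3 dEGGS_1989_T3;
   do i = 1 to dim(chk);
      if chk{i} = 0 then chk{i} = .;   /* treat 0 as "not recorded" */
   end;
   drop i;
run;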

Validation. The raw data files created here were verified by spot-comparison (of around 20 cells) to the .csv files when viewed in a spreadsheet.

This concludes our discussion of project #1; the next 2 projects derive correlations for all 600+ variables.

Projects #2, 3: Pearson correlations for all 639 variables

The 1983, 1989 research monographs provide correlations for <400 variables while the 1989 project generated data for 639 variables. It should be noted that you probably are not interested in correlations for all 639 variables, because:

1. Some variables are the sum of other variables
2. Some variables are derived on a strict subset of data that are also reflected in another variable
3. A few variables are general features and might not be of interest.

In cases 1, 2 above there is a clear, pre-existing relationship between the variables, hence correlations provide little new information. Here, we derive all of the possible correlations because it is easier to ignore correlations that are not of interest than to lack a correlation that you are interested in. We don’t derive Hoeffding’s D statistic for all 639 variables (it would be straightforward to do so); the reasons for this are discussed in a later section.

Computing a correlation matrix with 3 statistics (correlation, significance level, N) for all 639 variables means we are computing 639 X 639 X 3 = 1,224,963 statistics. Computing that many statistics plus the voluminous output for display requires more system resources than are allocated to a typical SAS Enterprise Guide user/session. To get around this, Project #2 breaks the 639 X 639 matrix into 4 quadrants, each of which must be run manually.

Project #2 includes a “readme” note that users should read and follow carefully to be able to reproduce the runs. Recall that the PROC CORR/Correlations Task, as-is, does not provide what we need: an output data set that contains the correlations, significance levels, and N for each cell. To get such a data set, we must use the ODS OUTPUT statement, which must be manually inserted into the program code generated by the Correlation Task. This code is inserted just before the PROC CORR in the code:

ods trace on;
ods output Corr.PearsonCorr=CDMLNKS.pearson_part#;

where # is a number from 1 to 4, and this code is inserted right after the PROC CORR:

ods output close;
ods trace off;

The code above directs the SAS ODS subsystem to create a SAS data set that contains the statistics computed in the PROC CORR.
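Assembled, one quadrant’s run looks roughly like the following sketch. The input library/data set and the VAR/WITH variable ranges are assumptions (the Task-generated code lists the variables explicitly and carries more options); only the structure matters here.

ods trace on;
ods output Corr.PearsonCorr=CDMLNKS.pearson_part1;

proc corr data=CDMLNKS.cty_mn_1989_meta pearson;
   /* quadrant 1 of 4: (roughly) the first half of the 639 variables in both  */
   /* VAR and WITH; the endpoint names below are placeholders, and the        */
   /* double-dash selects variables by their position in the data set         */
   var  "mALL0-4_1989_T3"n -- "dVITE_1989_T3"n;
   with "mALL0-4_1989_T3"n -- "dVITE_1989_T3"n;
run;

ods output close;
ods trace off;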

Four Correlations Tasks are used, and they barely work: you may get a message from the system saying the results are very large and asking whether the process should continue to run. Respond yes to the message and follow the readme instructions to delete the display output immediately after the Task completes. You have to run the project manually 4 times, 1 Correlations Task per run/quadrant. The result is 4 SAS data sets that contain the desired Pearson correlations and other statistics.

Project #3 takes the 4 files from Project #2, i.e., the 4 quadrants of the 639 X 639 correlation matrix, and combines them into a single file containing the complete correlation matrix. The file contains 639 rows (1 per raw data variable) and 1,920 variables per row: 3 ID variables (county, sex, xiang) plus 3 statistics for each of the 639 raw data variables: correlation, significance level, and N.

The 639 X 639 correlation matrix/file is then unrolled into a “flat format” for use in reporting. The unrolling of the data is done using macro loops and a “subscripted” macro variable, facilitated via the use of &&; a simplified sketch of this pattern appears after the variable list below. The unrolled file (ALL_CORRS) has 1 row per correlation cell, with the variables:

- Row variable from correlation matrix: short variable name
- Row ID: question number like D001 (also appended with row_variable for listings/PROC PRINTs)
- Column variable
- Column ID: question number
- Correlation
- Significance level (raw value)
- An indicator of significance level
- N.
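The following is a simplified sketch of the && pattern, not the project’s exact code; it assumes a matrix-style data set WORK.CORRMAT with a character ROW_VARIABLE column plus one numeric column per variable, and it unrolls only the correlation value itself.

proc sql noprint;
   /* load the numeric column names into macro variables cvar1, cvar2, ... */
   select name
     into :cvar1 -
     from dictionary.columns
    where libname = 'WORK' and memname = 'CORRMAT' and type = 'num';
quit;
%let ncols = &sqlobs;

%macro unroll;
   data work.all_corrs(keep=row_variable column_variable correlation);
      set work.corrmat;
      length column_variable $40;
      %do i = 1 %to &ncols;
         column_variable = "&&cvar&i";   /* && resolves to the i-th column name */
         correlation     = &&cvar&i;     /* that column's value on this row     */
         output;                         /* one output row per matrix cell      */
      %end;
   run;
%mend unroll;
%unroll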

Although this structure has some similarity to the (controversial and criticized) EAV (entity-attribute-value) data structure, it is used here primarily as an intermediate file structure to support/simplify reporting and analysis (see Billings et al. (2013) for a discussion of EAV structures). The use here is analogous (in some respects) to the R melt and cast functions, and is similar to an output file produced by PROC TABULATE.

At this point we have a file with all possible correlations for the 1989 China study data, 1 row per correlation matrix cell: 408,321 rows = 639 X 639. The next step is to filter this file to create an extract that contains all significant correlations that are of interest. The 1983, 1989 monographs identified as significant those correlations for which 2P < 0.05.

Now P=0.05 is a common standard for testing significance level, but this standard is completely arbitrary, and it dates back to early work by R.A. Fisher, one of the founders of modern mathematical statistics. There is a large (and growing) statistical literature on P-values, much of it raising questions about the P=0.05 standard; see Mudge et al. (2012) for an entry into the literature.

Johnson (2013A, B) defines UMPBTs: uniformly most powerful Bayesian tests for exponential family distributions that have the same rejection regions as frequentist UMPTs: uniformly most powerful tests. This correspondence establishes an approximate equivalence between type I errors for UMPTs and the evidence thresholds (Bayes factors) from UMPBTs. Based on this, he suggests doing tests at the (maximum) P=0.005 level, to encourage reproducibility of results.

The China study data are ecological and are suitable only for generating hypotheses; reproducibility is to-be-determined. However, with so many correlations (>200K unique), it makes sense to limit the number designated as being possibly “significant”. Hence in Project #3 we create an indicator variable with 3 levels:

- 2P < 0.005
- 2P < 0.001
- 2P < 0.0001.

It should be noted that the significance level to use in this context is a point on which reasonable people can disagree.

The Query Builder Task is used to create an extract of all correlations for which 2P<0.005. This is then sorted and output in 4 segments, based on the type of survey data. For each row variable, the significant correlations are sorted by increasing significance level and the list of variables is displayed.
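A coded equivalent of the filter and sort is sketched below; the column names follow the listing shown later (column_variable, P_value, significance, N), and the split into 4 survey-type segments is omitted.

proc sql;
   /* keep only cells flagged at one of the three 2P thresholds, then order   */
   /* the significant correlations within each row variable by p-value        */
   create table work.sig_corrs as
   select *
     from work.all_corrs
    where significance in ('2P<0.0001', '2P<0.001', '2P<0.005')
    order by row_id, p_value;
quit;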

A sample page of output from the list of all significant correlations follows.

1989 China Study data -- correlations
Diet Survey Intakes
row_id=D001: dKCAL_1989_T3


column_variable        column_id  correlation   P_value      significance  N
dSOLCARB_1989_T3       D004       0.879767592   2.54E-23     2P<0.0001     69
dTRYPTOPH_1989_T3      D079       0.721470944   2.69E-12     2P<0.0001     69
dTOTFOOD_1989_T3       D030       0.666432865   4.10E-10     2P<0.0001     69
dTYROSINE_1989_T3      D080       0.661535001   6.09E-10     2P<0.0001     69
dMETH+CYS_1989_T3      D074       0.645389125   2.14E-09     2P<0.0001     69
dPLNTFOOD_1989_T3      D028       0.636509044   4.14E-09     2P<0.0001     69
dVALINE_1989_T3        D081       0.618376275   1.50E-08     2P<0.0001     69
dPHENYLALA_1989_T3     D075       0.596379037   6.40E-08     2P<0.0001     69
dZn_1989_T3            D027       0.570237188   3.14E-07     2P<0.0001     69
dCYSTINE_1989_T3       D066       0.568588719   3.46E-07     2P<0.0001     69
dSERINE_1989_T3        D077       0.563646249   4.59E-07     2P<0.0001     69
dARGININE_1989_T3      D064       0.555969933   7.07E-07     2P<0.0001     69
dPLNTPROT_1989_T3      D033       0.536950334   1.97E-06     2P<0.0001     69
dFe_1989_T3            D019       0.518558349   5.01E-06     2P<0.0001     69
dTOTPROT_1989_T3       D003       0.515220385   5.90E-06     2P<0.0001     69
dTHREONINE_1989_T3     D078       0.514382129   6.15E-06     2P<0.0001     69
dMETHIONIN_1989_T3     D073       0.510047036   7.58E-06     2P<0.0001     69
dHISTIDINE_1989_T3     D069       0.499156685   1.26663E-05  2P<0.0001     69
dCu_1989_T3            D020       0.497469548   1.36937E-05  2P<0.0001     69
dGLYCINE_1989_T3       D068       0.496518834   1.43063E-05  2P<0.0001     69
dISOLEUCIN_1989_T3     D070       0.494794101   1.54834E-05  2P<0.0001     69
dMn_1989_T3            D023       0.472063897   4.22232E-05  2P<0.0001     69
d_pctPROTKCAL_1989_T3  D006       -0.46292746   6.19687E-05  2P<0.001      69
dALANINE_1989_T3       D063       0.459833501   7.03938E-05  2P<0.001      69
dADDEDFAT_1989_T3      D055       0.45526679    8.47784E-05  2P<0.001      69
dASPARTATE_1989_T3     D065       0.454036131   8.90946E-05  2P<0.001      69
dLEUCINE_1989_T3       D071       0.452096369   9.63114E-05  2P<0.001      69
uTAUR/cre_1989_T3      U009       -0.41857121   0.000344604  2P<0.001      69
d18:2_1989_T3          D115       0.408436127   0.000493953  2P<0.001      69
dTOTn6_1989_T3         D093       0.406734514   0.000524155  2P<0.005      69
qbPIGS_1989_T3         Q083       0.399950939   0.00066201   2P<0.005      69
dPUFA_1989_T3          D083       0.397280535   0.00072477   2P<0.005      69
qdBUDDHIST_1989_T3     Q012       -0.39573952   0.000763399  2P<0.005      69
dMg_1989_T3            D022       0.394422367   0.000797886  2P<0.005      69
qdWTLOSS_1989_T3       Q126       0.389501731   0.000939585  2P<0.005      69
dVITE_1989_T3          D013       0.38711843    0.001016091  2P<0.005      69
dVEGOIL_1989_T3        D054       0.381432403   0.00122187   2P<0.005      69
dCa_1989_T3            D018       0.374434657   0.001526356  2P<0.005      69


Limiting correlations to those with significance level 2P < 0.005 yields 31,072 rows. Because the correlation matrix is symmetric, we have 31,072/2 = 15,536 unique correlations. The project splits the listings by categories of the surveys: mortality, laboratory, diet + general, and questions. The complete file of 400K+ correlations is produced in the project; it can be filtered for different significance level values, used for reporting, exported as .csv and downloaded, etc.

Validation. Checking the correlations against the 1989 research monograph shows that most agree (the monograph provides only 2 digits of accuracy). A few of the correlations disagree by 1 in the 2nd digit; this may be due to differences in rounding. There are a very small number that disagree by >1 in the 2nd digit; this bears further investigation. (This may be due to transcription errors in compiling the monograph, or possibly to zeroes being recoded in the monograph to missing values for a few variables; this paper does not recode any zeroes.)

Project #4: Plots of principal components

Project #4 is a sample analysis that illustrates what can be done with very little user-written code and most processing done via SAS Enterprise Guide tasks. The first task in the project is a Program Task that copies the file with the 69 county averages from a (meta-)LIBNAME into a WORK file. (You will need to set up the LIBNAME statement for the project to work.) From this point on, all processing is done using (non-Program) SAS Enterprise Guide tasks.
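A minimal sketch of that first Program Task follows; the physical path is a placeholder, and the CDMLNKS libref is an assumption based on its use elsewhere in this paper.

libname cdmlnks "C:\CSData\saslib";   /* placeholder path: point at your permanent library */

data work.cty_mn_1989_meta;
   set cdmlnks.cty_mn_1989_meta;      /* copy the 69-row county-averages file to WORK */
run;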

Two Query Builder Tasks are used to create extracts from the file containing the 69 county averages. One extract contains variables related to plant-based consumption; the other extract contains multiple variables related to animal food consumption. Strictly speaking these tasks are not necessary; however they make the processing easier to follow.

The plant-food-related variables used are:

dPLNTFOOD   D028: PLANT FOOD INTAKE (g/day/reference man)
dPLNTPROT   D033: PLANT PROTEIN INTAKE (g/day/reference man)
dRICE       D037: RICE INTAKE (g/day/reference man, air-dry basis)
dWHTFLOUR   D038: WHEAT FLOUR INTAKE (g/day/reference man, air-dry basis)
dOTHCEREAL  D039: OTHER CEREAL INTAKE (g/day/reference man, air-dry basis)
dSTCHTUBER  D040: STARCHY TUBER INTAKE (g/day/reference man, fresh weight)
dLEGUME     D041: LEGUME AND LEGUME PRODUCT INTAKE (g/day/reference man, fresh weight)
dLIGHTVEG   D042: LIGHT COLOURED VEGETABLE INTAKE (g/day/reference man, fresh weight)
dGREENVEG   D043: GREEN VEGETABLE INTAKE (g/day/reference man, fresh weight)
dSALTVEG    D044: DRIED AND SALT-PRESERVED VEGETABLE INTAKE (g/day/reference man, as-consumed basis)
dFRUIT      D045: FRUIT INTAKE (g/day/reference man, fresh weight)
dNUTS       D046: NUT INTAKE (g/day/reference man, as-consumed basis)
dVEGOIL     D054: ADDED VEGETABLE OIL (for cooking etc) INTAKE (g/day/reference man)
dSTCHSUGAR  D056: PROCESSED STARCH AND SUGAR INTAKE (g/day/reference man, as-consumed basis)

The animal-food-related variables used are:

dANIMFOOD  D029: ANIMAL FOOD INTAKE (g/day/reference man)
dANIMPROT  D034: ANIMAL PROTEIN INTAKE (g/day/reference man)
dMILK      D047: MILK AND DAIRY PRODUCTS INTAKE (g/day/reference man, as-consumed basis)
dEGGS      D048: EGG INTAKE (g/day/reference man, as-consumed basis)
dREDMEAT   D050: RED MEAT (pork, beef, mutton) INTAKE (g/day/reference man, as-consumed basis)
dPOULTRY   D051: POULTRY INTAKE (g/day/reference man, as-consumed basis)
dFISH      D052: FISH INTAKE (g/day/reference man, as-consumed basis)
dANIMFAT   D053: ADDED ANIMAL FAT (for cooking, spreading etc) INTAKE (g/day/reference man)

These extracts are then used as input for the Principal Components task: principal components are computed for each set of input variables. In the project output, the variance plots for the sets of principal components indicate that the first principal component of the plant food variables explains slightly over 20% of the variance, and the first principal component of the animal food variables explains over 40% of the variance.
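The Task is roughly equivalent to a PROC PRINCOMP call like the sketch below, shown for the animal-food extract; the extract and output data set names are assumptions, and the plant-food extract is handled the same way.

proc princomp data=work.animal_extract
              out=work.animal_scores(keep=County Prin1)   /* scores for the later join/plot */
              n=3;
   /* assumes County is carried along in the extract so it survives the KEEP= */
   var dANIMFOOD_1989_T3 dANIMPROT_1989_T3 dMILK_1989_T3 dEGGS_1989_T3
       dREDMEAT_1989_T3 dPOULTRY_1989_T3 dFISH_1989_T3 dANIMFAT_1989_T3;
run;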


The Query Builder is used to join (and rename) the 1st principal component for each of the extracts. The Scatter Plot task is then used to generate a scatter plot:

Scatter Plot: 1st principal components, plant vs. animal extracts

The plot is interesting in that there are very few points in the upper right quadrant, and the 3-4 points in that quadrant could be studied and compared against the other points. The plot suggests that the remaining points could be classified into 2-3 groups. It would also be interesting to do a direct clustering of the data, and compare the results. SAS Enterprise Guide has a task for that purpose (task: Multivariate > Cluster Analysis), and that work is left as an exercise for the interested reader.

Limitations on analyses of averages/ecological data

Confronted with a large number of variables for which Pearson correlations are the predominant statistic that has been calculated so far, one might want to do more sophisticated analyses. To start with, recall that Pearson correlation is a measure of linear dependence, that independence implies Pearson correlation is zero, but the converse is not true, i.e., Pearson correlation can be zero when the variables are dependent.

So one might think of (first) using some of the alternatives to Pearson correlation; i.e., alternative measures of dependence that are not specialized for linear dependence like Pearson correlation. Statistics that come to mind here include:


- Hoeffding’s D statistic – available in SAS PROC CORR/Correlations Task
- Distance correlation – not available in SAS
- Maximal information coefficient – not available in SAS
- Mutual information (which can be normalized; not available in SAS).

The statistics above are measures of dependence that are not limited to linear relationships. Unfortunately, only Hoeffding’s D is available in SAS (all 4 measures above are available in R). One might use the statistics above, in combination with Pearson correlation, partial correlations, and/or with scatter plots, to identify possible non-linear relationships. If any such relationships are suspected, models of non-linear functions could be fitted, e.g.,

Y = a + bX + cX² + dX³ + e.   (∆)

The above seems eminently reasonable until we remember that most of the data here are by-county averages: males, females, xiang 1 & 2, all combined. In particular, this paper has processed the overall averages for each of the 69 counties. These (1989) averages are based on counts (N) that vary by survey type, specifically:

- diet survey: N = 60/county
- plasma, red blood cell: N = 120/county
- questionnaire (varies by questionnaire; there are multiple questionnaires): N = up to 200/county.

The Central Limit Theorems (plural, as there are multiple versions, e.g., one for i.i.d. = independent and identically distributed random variables and others that relax the i.i.d. requirement) tell us that the distribution of a sample mean is approximately normal when N is sufficiently large.

Given the sizes for N above, there is reason to believe that, with the possible exception of the mortality and general features variables, the variables in the 1989 study have approximately normal distributions. Furthermore, linear combinations of random variables with normal distributions also have a normal distribution, i.e., the family is closed under linear operations.

Now consider the model ∆ above. If X is approximately normal, it is unlikely that a + bX + cX² + dX³ is also approximately normally distributed. If Y is also approximately normally distributed, is it appropriate to fit a model for Y where said model has a different distribution from Y? Which is better: a simple, linear model where all the predictor variables have exponent = 1, hence the estimator has a distribution in the same family as Y, or a more complex model that may be better by a particular mathematical optimality criterion, but that has a different distribution than Y?

This writer does not have a quick, definitive answer to this interesting question. However, in the spirit of Occam’s razor, it is suggested that linear models with exponents all =1 be tried first, and that nonlinear models should not be used unless you have reason to believe the “approximate normal” distribution is a very poor approximation for the target dependent variable, and can justify models that are not linear. Because of this, we don’t do Hoeffding’s D for all variables, and demonstrate only linear methods herein.

The mortality statistics are recorded as deaths per 100,000 over specific age ranges. The research monograph “standardizes” these by taking averages over the age ranges. Once again we have averages but N = 2 to 5, so the Central Limit Theorem is less of a consideration. If you have no reason to believe the mortality statistics follow a normal or approximately normal distribution, then you can consider using the alternatives to Pearson correlation and investigate non-linear models. The principle of parsimony/Occam’s razor still applies, and one should consider complex models only if reasonable, simple models are insufficient.

Summary:

1. The China Studies are a series of large ecological/epidemiological studies conducted in China and Taiwan, 1983-1993. The results of the research were reported in 2 large research monographs, and Pearson correlations between the variables were the predominant statistic reported.
2. The China Study data are averages by county; this is ecological data. The ecological fallacy applies: results based on data for groups don’t necessarily apply to individuals.
3. The raw data for the studies are freely available online. However, they are in a format that makes the data relatively difficult to use. Because of this, extensive information is provided on the data processing aspects.
4. Four SAS Enterprise Guide projects are presented herein.


5. Project #1 inputs the (1989) raw data and metadata and creates a variety of SAS data sets, some with metadata, others with no metadata. A variety of simple analyses are illustrated using the data: Pearson correlations and Hoeffding’s D, partial correlations, scatter plots, histograms, and regression.
6. Projects #2 & 3 take the raw data file from #1 and construct, in multiple steps, a 639 X 639 correlation matrix (all possible correlations) with significance levels and N. This matrix is then unrolled into a 1-cell-per-row structure, for filtering and reporting. An output is produced showing all significant correlations, using a different significance threshold than the one used in the research monographs.
7. Project #4 is another sample analysis. Sets of variables representing plant food and animal food consumption are selected, principal components are derived for each set, and the 1st principal components, plant foods vs. animal foods, are plotted.
8. Discussion of the limitations on processing data that are averages.

Not the last word

As mentioned above, this paper does not pretend to be “totally comprehensive” or the “last word” on the data or research involving these data. The China study data sets are large and rich, offering many opportunities to interested researchers.

As statistical papers go, this one is somewhat atypical in that it includes extensive discussion on how to process the data and metadata. However, this is necessary to enable others to access and prepare the data for statistical analysis. Also, instead of embracing the (totally arbitrary but adopted as standard anyway) practice of doing statistical tests at P < 0.05, we use 2P < 0.005, following the research monographs’ use of 2P but with the threshold modified per Johnson (2013A, B). Instead of encouraging you to build complex models (while ignoring the distributions of the independent variables), we discuss the limitations of the data and why simpler models may, at least in some cases, be more appropriate.

There are some unresolved issues. A few of the variables in the raw data appear to use 0 (zero) in some cases for missing values. This needs to be investigated, with possible follow-up of setting those values to missing and then rerunning the correlation matrix. (Note: it is easy to recode the variables in a SAS DATA step using an ARRAY and a DO loop.) The omission of one county’s 1983 data (in the research monograph-reported correlations) because of the timing of data collection also deserves follow-up. The distribution of the mortality statistics is also a question of interest.

Epilogue

This paper was inspired not only by the intense debates on the Internet in which the China Studies raw data were the subject of crowdsourced statistical analysis, but also by the development of the maximal information coefficient (MIC), an alternative to the Pearson correlation coefficient; see Reshef et al. (2011). Because the published China studies results are predominantly Pearson correlation coefficients, they presented an opportunity for reanalysis using the maximal information coefficient. Unfortunately, subsequent research by Kinney & Atwal (2013) shows that the maximal information coefficient has major issues; this limits the relevance of the MIC. Additionally, the MIC is not available in SAS at present, though it is available in R and MATLAB. The weakness of the MIC is not a disaster, as there are other, competing statistics that can be used instead, e.g., distance correlation.

Finally, if anyone reading this has processed the China study data in Excel, I encourage you to switch to SAS. This will open up a world of processing that is not available to you in Excel.

Appendix – Administrative details

Software used. The test projects here were run using SAS Enterprise Guide 5.1 (5.100.0.12019), 64-bit, on Windows 7 Professional, Service Pack 1, connecting to a Linux 64-bit server running SAS 9.3_M2.

Supplementary materials. A page for this article will be created on sascommunity.org; see the author’s page for a link: http://www.sascommunity.org/wiki/Presentations:Tebillings_Papers_and_Presentations


and the article page will contain links to supplementary materials on sascommunity.org and/or other sites as appropriate.

References

Note: all URLs quoted or cited herein were accessed in July 2014.

Billings, T (1999). The Cornell China Project: Authoritative Proof, or Misinterpretation by Dietary Advocates? URL: http://www.beyondveg.com/billings-t/comp-anat/comp-anat-8e.shtml

Billings T, Anjappan B, Ji J (2013) Converting Entity-Attribute-Value (EAV) Source Tables into Useful Data with SAS® DATA Steps. WUSS Conference Proceedings, 2013. URL: http://wuss.org/Proceedings13/76_Paper.pdf

Campbell TC, Campbell TM (2006). The China Study: The Most Comprehensive Study of Nutrition Ever Conducted. BenBella Books, Dallas, Texas, USA. Google Books URL: http://books.google.com/books?id=KgRR12F0RPAC&printsec=frontcover#v=onepage&q&f=false

Chen J. et al. (1990). Diet, Lifestyle and Mortality in China: a study of the characteristics of 65 Chinese counties. Oxford : Oxford University Press. Google Books URL: http://books.google.com/books?id=kv1WAAAAYAAJ

Chen J. et al. (2006). Mortality, biochemistry, diet, and lifestyle in rural China: geographic study of the characteristics of 69 counties in mainland China and 16 areas in Taiwan. Oxford: Oxford University Press. URL: http://www.ctsu.ox.ac.uk/~china/monograph/index.htm

Greenland S, Robins J (1994). "Invited commentary: ecologic studies--biases, misconceptions, and counterexamples." American Journal of Epidemiology, vol. 139, pp. 747-760. Abstract URL: http://aje.oxfordjournals.org/content/139/8/747.short

Johnson V (2013A). Uniformly most powerful Bayesian tests. Annals of Statistics, 41(4): 1716–1741. URL: http://arxiv.org/pdf/1309.4656

Johnson V (2013B). Revised standards for statistical evidence. Proceedings of the National Academy of Sciences. 110(48): 19313–19317. URL: http://www.pnas.org/content/110/48/19313.long

Kinney JB, Atwal GS (2013). Equitability, mutual information, and the maximal information coefficient. Proceedings of the National Academy of Sciences. 111(9): 3354–3359. URL: http://www.pnas.org/content/111/9/3354.long

Mudge JF et al. (2012). Setting an optimal α that minimizes errors in null hypothesis significance tests. PLoS ONE 7(2): e32734. URL: http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0032734#pone-0032734-g003

Reshef DN et al. (2011). Detecting Novel Associations in Large Datasets. Science. 334(6062): 1518–1524. URL: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3325791/

Speed T (2014). Terence’s Stuff: Creativity in Statistics. IMS Bulletin. 43(4): 17. URL: http://bulletin.imstat.org/2014/05/terence%e2%80%99s-stuff-creativity-in-statistics/


Contact Information:

Thomas E. Billings MUFG Union Bank, N.A. Basel II - Credit BTMU 350 California St.; 9th floor MC H-925 San Francisco, CA 94104

Phone: 415-273-2522 Email: [email protected]

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.
