Load Stomach Cancer Data By

Practical 2: Analysis of general clustering and point source investigations using the Splus software

Introduction

In this practical you will be using the Splus software to carry out some of the tests for disease clustering and investigation of disease risk around a point source that were discussed in the lecture. Splus is a commercial software package for data analysis and graphical display; it also provides an object-oriented programming language. During the practical, you will be using Splus to produce graphical displays of the data to help visualise e.g. locations of possible disease clusters and relationships between disease risk and distance from a point source, and to carry out data manipulation and regression analysis using standard Splus functions. We have also written a library of functions using the Splus programming language to implement some of the statistical methods for cluster and point source investigations that were discussed in the lecture. If you do not have access to Splus after you finish this course, don’t worry! The graphical analysis and regression models are easily implemented in most statistical packages. You can download the code for the special functions we have written as ascii text files from: http://stats.ma.ic.ac.uk/~ngb30 Most of the functions are quite simple and the code is (hopefully!) well documented, so it should be reasonably straightforward to translate the functions into whichever language is appropriate (e.g. re-write them as Stata functions, or as C++ code). Another option is to download the free software package called R from http://www.r-project.org. R is a language and environment for statistical computing and graphics; it is a GNU project which is similar to the Splus language and environment, and much of the code written for Splus runs unaltered in R.

How to use Splus The Windows version of S-plus that you will be using offers both a menu-based interface and a command-line interface. The command-line interface is more flexible, and allows the user to write his/her own functions. Most of the functions you will be using during this practical have been specially written and are not supplied as standard with the software, so most of the instructions for running the practical will focus on using the command-line interface. However, pointers on how to carry out the same procedure using the menu interface will be given where appropriate. Documentation to help you get started with Splus:  You can download a document entitled “Introduction to Splus” from the course web page (see below). This is a set of summary notes on some of the key features of Splus. You don’t need to read all of this before starting the practical, but it might help to refer to relevant sections if there are things you don’t understand as you go along.  A summary of useful Splus commands and menu options is provided in Appendix 3 of this practical exercise sheet.  Splus also provides extensive on-line documentation via the Help menu, so please use this as well.  A brief on-line demonstration of how to use the menu interface for loading data, plotting and regression can be run by selecting the Help menu, then Visual Demo.

To start Splus, select Start (bottom left of screen)  Programs  Splus2000. To use the command-line language, first select File  New  Script File then click on OK. This opens a text window where you can type in Splus commands. Once you have typed in a command, highlight the whole command and press the F10 key to execute the command. Note that you can save this file as an ascii file in order to keep a record of your Splus session, and re-run/edit the same commands in a future session.

All data files and the library of Splus functions for cluster detection and point source analysis can be downloaded from the course web page http://stats.ma.ic.ac.uk/~ngb30.

Copy these to your home directory before starting Splus. Exercise 2A: Area-level tests for general clustering of a disease

There are 2 datasets consisting of incident cases of (a) larynx cancer and (b) pancreatic cancer diagnosed during the 10-year period 1982-1991, in the Mersey and West Lancashire districts of northwest England. There are 876 and 2019 cases of larynx and pancreas cancer, respectively, in 144 electoral wards. Expected numbers were calculated using external standardisation based on national age- and sex-specific reference rates and population counts from the 1991 census. The data files also contain the x- and y-coordinates of the centroid of each ward. Maps showing the SMRs for each cancer are shown below.

(Note: see Chapter 8 of the Elliott et al (2000) book on Spatial Epidemiology for analysis of lung and brain cancer risk in the same study region).

Fig 1: SMR for pancreas cancer in Mersey and West Lancashire

SMR (7) < 0.5

N (12) 0.5 - 0.7

(19) 0.7 - 0.9

(37) 0.9 - 1.1

(37) 1.1 - 1.3

(17) 1.3 - 1.5

(15) >= 1.5

20.0km Fig 2: SMR for larynx cancer in Mersey and West Lancashire

SMR (31) < 0.5

N (21) 0.5 - 0.7

(10) 0.7 - 0.9

(18) 0.9 - 1.1

(11) 1.1 - 1.3

(11) 1.3 - 1.5

(42) >= 1.5

20.0km

Questions 1. The data are in files larynx.dat and pancreas.dat. Load these into Splus and create new columns containing the SMR (O/E) for each area. 2. For each cancer site a) use the summary() command in Splus to produce summaries of the distribution of SMRs across wards in the study region. b) Use the hist() command to produce histograms of the SMRs for each cancer. How variable are the SMRs? How does the spread of values compare for the two cancers? Does it look like there is evidence of heterogeneity in risk across the study region? 3. Carry out the Pothoff-Whittinghill test for heterogeneity. The file containing the source code is called potwhit.S –you should have already copied this file from the website (see end of previous page). To load the function into Splus, enter the command source(“potwhit.S”). Then have a look at the function by typing the function name without brackets (potwhit.test), and make sure you understand what it is doing. a) Run the function for the larynx and pancreas cancer datasets and record your results in the table given in Appendix 1. Is there evidence of heterogeneity in either of the two diseases? 4. Carry out Moran’s test for spatial autocorrelation. The Splus function is called moran.test() and the file containing the source code is called moran.S. Load the function into Splus, and have a look at it to make sure you understand what it is doing. One of the arguments to the moran.test() function is a weight matrix, W, that provides a measure of the ‘closeness’ of each pair of areas. ‘Closeness’ may be defined in many ways: here you will use 2 alternative measures, one based on adjacency and one based on distance. Adjacency weights: The adjacency data is in file mersy-adj.dat. Load this file by entering the command mersey.adj <- source(“mersey-adj.dat”). Have a look at this data object by typing the object name. It is a list containing 2 vectors. The second vector is called nneigh; the length of this vector is 144 (i.e. the total number of wards in the study region) and gives the number of neighbouring (i.e. adjacent) areas for each ward. For example, ward 1 has 3 neighbours; ward 2 has 6 neighbours; ward 3 has 7 neighbours etc.. The first vector is called adjacent and gives the ID of the neighbours for each ward. The first 3 values of adjacent are the IDs of the wards neighbouring ward 1; the next 6 values of adjacent are the IDs of the wards neighbouring ward 2; and so on. This adjacency data can be used with the function create.adjweights() (source code in file adj-weights.S) to create a 0-1 weight matrix measuring the closeness of each pair of areas: Wij = 1 if areas i and j are adjacent; Wij = 0 otherwise, with Wii also defined as 0. Distance weights: The function create.distmatrix() (source code in file distmatrix.S) can be used to create a distance matrix (Wij = distance between areas i and j) from the vectors of x- and y-co-ordinates of each area centroid. There are many choices of the function of this distance matrix that you can then use to define the weights for the Moran test, for example, 1/distance, 1/distance2, or an exponential decay function of distance. Some of these are implemented as optional arguments to the moran.test() function provided – see the relevant Splus code for each function for more details. a) Run Moran’s test for the larynx and pancreas cancer datasets using adjacency weights. b) Re-run each of the test using distance weights with different choices of the distance function. Record your results in the table given in Appendix 1. Is there evidence of spatial autocorreation in either of the two diseases? Howe sensitive are the results to the choice of weights?

5. Carry out Tango’s test for spatial autocorrelation. The function is called tango.test() and the file containing the source code is called tango.S. Load the function into Splus, and have a look at it to make sure you understand what it is doing. Tango’s test also uses a weight matrix, A, measuring the closeness of each pair of areas. Tango recommends using weights equal to exponential decay with distance,

i.e. Aij = exp(-dij/). You will need to specify a value for  as one of the arguments to the tango.test() function.

As a guide, choose  such that log 2 x  = dhalf where dhalf is the distance at which you believe the correlation between the relative risk in two areas is 0.5. a) Run Tango’s test for the larynx and pancreas cancer datasets using a range of different values for . Note that distance is measured in metres for these data, and the study region is about 40,000 m by 40,000m in size. Record your results in the table given in Appendix 1. Is there evidence of spatial autocorrelation in either of the two diseases? How sensitive are the results to the choice of ? 6. The tests you have carried out so far provide methods for detecting clustering, but do not allow you to identify the locations of possible clusters. The method of Besag and Newell is designed to do the latter. The source code for this function is in besag-newell-unfoc.S. Load this function into Splus and have a look at the code to understand what it does. You will notice that one of the function arguments is called map.bdry, which is a matrix giving the co-ordinates of the study region boundary; this matrix is used by the function to produce maps showing the location of detected clusters. The boundary file for the Mersey and West Lancashire data is called mersey-bdry.dat which you should read into Splus. a) Use the code to carry out Besag and Newell’s test for the larynx and pancreas cancer data, using different values for the cluster size, k, and the significance level alpha for detecting clusters – you should find that you need quite a large cluster size (say > 20) and ‘high’ significance level (say alpha = 0.05) in order to detect many clusters. 7. A more appropriate geographical scale for Besag and Newell’s test is to use census enumeration districts (EDs) rather than wards as the areal units. EDs are much smaller than wards and so the test will have greater power to detect small clusters of disease. There are 3096 EDs in the Mersey and West Lancashire study region. The ED-level larynx and pancreas cancer data can be found in files larynxED.dat and pancreasED.dat; the file format is the same as for the ward-level data. a) Load these data into Splus and re-run the Besag and Newell test at ED level using a range of values for k (say 4,8,12, 16 and 32). Record your results in the table given in Appendix 2.

Exercise 2B: Analysis of disease risk in relation to a pre-specified point source exposure

These data are taken from a paper by Sans et al (1995) on cancer incidence near Baglan Bay petrochemical works in South Wales.

Sans S, Elliott P, Kleinschmidt I, Shaddick G, Pattenden S, Walls P, Grundy C, Dolk H. Cancer incidence and mortality near the Baglan Bay petrochemical works. South Wales. Occ Environ Mod, 1995;52:217-224.

The data are in baglan-bay.dat and contain:  counts of observed (O) and expected (E) cases of incident larynx cancer occurring between 1974 and 1984 in the 80 EDs whose centroids fall within a radius of 7.5km from the point source (note: expected counts were adjusted for age and sex using national (GB) incidence rates; the data also differ slightly from those published in the paper to preserve confidentiality);  the x- and y-coordinates of the ED centroid;  the distance (d) of the ED centroid from the point source;  a variable called CAR giving the Carstairs deprivation score for each ED (high values = more deprived). The original study was carried out in response to local concerns about an alleged cluster of cancer (especially larynx and leukaemia) near the Baglan Bay petrochemical works over a period of about 6 years (1984-89).

Record your results for questions 4-8 in the tables provided in Appendix 2.

Questions 1. Load the data into Splus and create new columns containing the SMR (O/E) for each ED. 2. Produce a scatter plot of SMR versus distance from source and add a loess smooth line to your plot (see Splus summary sheet). The loess smooth line is a nonparametric scatter-plot smoother, which is based on a generalisation of running means. The primary parameter affecting the smoothness of the fit is the span, which controls the amount by which the fit at any particular point is influenced by the other data points in the graph. Span takes values between 0 and 1; small values result in less smoothing. The default in Splus is to automatically choose the span by means of cross validation. However, you may want to experiment with choosing your own values – reasonable span values are from 0.3 – 0.5. Note: there are a number of other scatter plot smoothing options available in Splus, for example, splines, Friedman super-smoothers and kernel smoothers, which you could also experiment with using if you wish. Does there appear to be a relationship between risk of cancer and distance from the petrochemical plant? 3. Now produce scatter plots with loess smoothers of a) Carstairs versus distance b) SMR versus Carstairs. What do the plots suggest about the relationship between the 3 variables? 4. Carry out a near-versus-reference test for increased risk around the petrochemical plant. The source code you will need is in files grp-near.S and near-vs-ref.S; load these files into Splus and have a look at the code. Then use the function near.vs.ref() to carry out the test for a) the whole study region (i.e. all 80 EDs); b) just a ‘near’ region (in the original paper, this was defined to be 3km from the source, although you could choose a different distance to define the ‘near’ region if you wish). You will need to use the function grp.near() to define this ‘near’ region and count the observed and expected cases before applying the near-versus-reference test. 5. Carry out a near-versus-far test for increased risk around the petrochemical plant. The source code is in file near-vs-far.S. a) First use the 3km threshold suggested by Sans et al. to define the ‘near’ region. b) Examine the sensitivity of the near versus far comparisons by carrying out reanalysis using different near/far thresholds of say, 0.5km, 1km, 1.5km, 2km and 2.5km. 6. The cluster test proposed by Besag and Newell can also be used as a focused test of clustering around a pre-specified location. The source code for the focused version of the test is in files besag-newell-foc.S and besag-newell-multiple.S. The first file contains the code to implement the test for a specified cluster size; the second file implements the test multiple times, each for a different cluster size, and produces a plot of significance (p-value) versus cluster size, with Bonferroni adjustment for multiple testing. Load these functions into Splus and take a look at the code. a) Use the besag-newell-foc function to run the test on the Baglan Bay data for a single cluster size (say, k=4). b) Use the besag-newell-multiple function to run the test for a range of cluster sizes from, say, 2 to 20. Is there evidence of excess cases close to the source? 7. Use Stone’s test to test the hypothesis that risk of larynx cancer is a non-increasing function of distance from the Baglan Bay petrochemical plant, and to obtain estimates of the relative risk at various distances under the constraint of non-increasing risk. The source code for this function is in stones.S – loading this file will install the following functions: stones.test() [used to carry out the test]; isoreg() [called by stones.test to estimate the constrained relative risks]; mlrstat() [called by stones.test to calculate the maximum likelihood ratio test statistic] and band.count() [used before running stones.test to aggregate data from a set of small areas e.g. EDs into distance bands around the point source]. a) Aggregate the ED-level data into the 8 distance bands used by Sans et al. (the radii for these bands are 0.5km, 1km, 2km, 3km, 4.6km, 5.7km, 6.7km and 7.5km) and run Stone’s unconditional test (recall that the null hypothesis for the unconditional test

assumes that the relative risk in each band, i = 1). b) Repeat part (a) but run the conditional version of Stone’s test (recall that the null

hypothesis for the conditional test assumes that the relative risk in each band, i =

constant, but not necessarily i = 1. This test is more appropriate if the risks in the whole study region are elevated – or reduced – compared to the reference rates used to calculate the expected counts). c) Now run the unconditional version of Stone’s test on the same data, but adjust your expected counts so that observed = expected (the easiest way to do this is to multiply your current set of expected counts by sum(baglan$O)/sum(baglan$E) [assuming you have called your data object baglan]. Compare your results to those obtained using the conditional test in part (b). 8. Recall the plots you produced in questions 2 and 3 showing the relationship between SMR, distance from source and deprivation. One way to attempt to reduce the potential confounding between deprivation and distance from the petrochemical plant is to compute a new set of expected counts which are standarised for Carstairs quintile as well as age and sex. These expected numbers are available in a file called baglan-e.dat. a) Load these into Splus, compute the SMR using these new expected counts and repeat the plots in questions 2 and 3. How has standardising for deprivation affected the relationships? b) Now repeat the near-versus-reference, near-versus-far, Besag and Newell and Stone’s tests using expected counts standardised for Carstairs quintile. How does this affect your conclusions about the risk of cancer around the Baglan Bay petrochemical plant? Appendix 1: Results tables for Practical A Record your results in the tables below: A3-5. Results of heterogeneity and autocorrelation tests: Test Larynx cancer Pancreas cancer Test Statistic p-value Test Statistic p-value Pothoff-Whittinghill Moran Adjacency weights Distance weights (specify function):

Tango  =  =  =  =

A7. Results of Besag and Newell focused test using ED data: Larynx Cancer Pancreas cancer Significance level, p < Significance level, p < Cluster size k Significant clusters Cluster size k Significant clusters Number % Number % 4 8 12 16 32 Appendix 2: Results tables for Practical B Record your results in the tables below: B4. Results for near versus reference test: Definition of Observed Expected SMR p-value 95% CI ‘near’ region All 80 EDs Within 3km

B5. Results for near versus far test: Near Far O E SMR O E SMR RR p-value region region (nr:far) 3km 7.5km 0.5km 7.5km 1km 7.5km 1.5km 7.5km 2km 7.5km 2.5km 7.5km

B6. Results for Besag and Newell focused test (multiple k) Cluster size # of areas in Radius of O E p-value k cluster cluster 2 4 6 8 10 12 15 20 B7. Results for Stone’s test Type of test Likelihood Ratio (MLR) p-value Unconditional Conditional Unconditional with O = E

Distance band O E SMR Lambda.hat 0.5km 1km 2km 3km 4.6km 5.7km 6.7km 7.5km

B8a. Results for near versus reference test with Carstairs-adjusted expected counts: Definition of Observed Expected SMR p-value 95% CI ‘near’ region All 80 EDs Within 3km

B8b. Results for near versus far test with Carstairs-adjusted expected counts: Near Far O E SMR O E SMR RR p-value region region (nr:far) 3km 7.5km 0.5km 7.5km 1km 7.5km 1.5km 7.5km 2km 7.5km 2.5km 7.5km B8c. Results for Besag and Newell focused test (multiple k) with Carstairs-adjusted E’s Cluster size # of areas in Radius of O E p-value k cluster cluster 2 4 6 8 10 12 15 20

B8d. Results for Stone’s test with Carstairs-adjusted expected counts Type of test Likelihood Ratio (MLR) p-value Unconditional Conditional Unconditional with O = E

Distance band O E SMR Lambda.hat 0.5km 1km 2km 3km 4.6km 5.7km 6.7km 7.5km Appendix 3: Summary of useful Splus functions for Practical 2

Command line syntax is given in Courier type face Menu / dialog box options are given in Arial type face

1. Loading an array of data from a file into Splus mydata <- read.table(“”, header=T)

File  Import Data  From File Then type in box and select Open

2. Adding a new column to an existing data frame in Splus

If new column is a function of existing columns: mydata$newname <- [enter function of existing columns here, e.g. mydata$AA + mydata$BB]

Data  Transform Then select mydata as the Data Set. Type name of new column as the Target Column Type the function of existing columns (e.g. AA + BB) in the Expression box and select OK

If new column is an existing Splus vector called xxx: mydata$newname <- xxx

Data  Merge Then enter mydata as Dataset 1 and xxx as Data set 2. Select Match by: Row names and type mydata for the name of the object to Save the results in. Then click on OK. Column xxx should be added to your mydata data object.

3. Producing a scatter plot of columns AA versus BB plot(mydata$AA, mydata$BB)

Select the 2D Plot button, then select the scatter plot button from the dialog box (1st row, 1st column – see Plots2D Palette option in Splus Help menu) A new dialog box will now appear: Enter mydata as the dataset Enter AA as the x Column Enter BB as the y Column Click on OK

4. Adding a straight line through the data on a scatter plot abline(lsfit(mydata$AA, mydata$BB))

Click once on the current plot so that the border of the graph is highlighted. Then hold down the shift key and select the linear fit from the Plots2D dialog box (1st column, 5th row). Enter mydata as the dataset Enter AA as the x Column Enter BB as the y Column Click on OK 5. Adding a nonparametric smooth line through the data on a scatter plot lines(loess.smooth(mydata$AA, mydata$BB))

You can change the degree of smoothness of the line by changing the ‘span’ argument to this function. Span must be a number between 0 and 1; small values of span imply less smoothing; values in the range 0.3 to 0.5 are usually sensible: lines(loess.smooth(mydata$AA, mydata$BB, span=0.3))

Click once on the current plot so that the border of the graph is highlighted. Then hold down the shift key and select the loess button from the Plots2D dialog box (1st column, 4th row). Enter mydata as the dataset Enter AA as the x Column Enter BB as the y Column Click on OK

You can change the span of the loess smooth by select the Smooth/Sort dialog box and entering, say, 0.3 as the value for Span before clicking on OK.

6. Customising a plot

You can add axis labels, change the plotting symbol, line type and colour etc. using the following options, which are specified as arguments to the plot() or lines() function: xlab=”” ylab=”” pch=”+” # Uses character in quotes as the plotting symbol lty=2 # plots dotted lines rather than solid lines lty=4 # plots dashed lines rather than solid col=n # where n is an integer (selects colour n for plotting)

See Splus Help menu for ‘par’ for more details

Colour, plotting symbols and line types can be customised by selecting the Line and Symbol options from the plot dialog box. Axis labels can be edited by clicking once on the current axis label, so that it is highlighted in green; then click again on the green area and typing in the required text.