
Environmental Forensics (2001) 2, 359–362 doi:10.1006/enfo.2001.0061, available online at http://www.idealibrary.com

Detecting Trends Using Spearman's Rank Correlation Coefficient

Thomas D. Gauthier

Sciences International Inc, 6150 State Road 70, East Bradenton, FL 34203, U.S.A.

(Received 23 August 2001, Revised manuscript accepted 1 October 2001)

Spearman's rank correlation coefficient is a useful tool for exploratory analysis in environmental forensic investigations. In this application it is used to detect monotonic trends in chemical concentration with time or space. © 2001 AEHS

Keywords: Spearman rank correlation coefficient; trend analysis; soil; groundwater.

Environmental forensic investigations can involve a variety of statistical analysis techniques. A relatively simple technique that can be used for exploratory data analysis is the Spearman rank correlation coefficient. In this technical note, I describe how the Spearman rank correlation coefficient can be used as a statistical tool to detect monotonic trends in chemical concentrations with time or space.

Trend analysis can be useful for detecting chemical "footprints" at a site or evaluating the effectiveness of an installed or natural attenuation remedy. Just as ratios of chemical concentrations in a single sample can be viewed as a chemical "fingerprint," the overall pattern created by a single chemical detected at multiple locations at a site can be viewed as a chemical "footprint." Point sources of contamination often leave footprints at a site that can be detected using the Spearman rank correlation coefficient. The footprints are created by advection and dispersion processes that lead to decreasing concentrations with distance from the source. The Spearman rank correlation coefficient can also be used to determine if concentrations are increasing or decreasing with time at a given well or location to monitor water quality or evaluate the effectiveness of a remedy.

The Spearman rank correlation coefficient is a nonparametric technique for evaluating the degree of linear association or correlation between two independent variables. It is similar to Pearson's product moment correlation coefficient except that it operates on the ranks of the data rather than the raw data.

There are advantages to using Spearman's rank correlation coefficient over the more common product moment correlation coefficient. It is a nonparametric technique, so it is unaffected by the distribution of the population. Because the technique operates on the ranks of the data, it is relatively insensitive to outliers and there is no requirement that the data be collected over regularly spaced intervals. It can be used with very small sample sizes and it is easy to apply. The disadvantage is that there is a loss of information when the data are converted to ranks and, if the data are normally distributed, it is less powerful than the Pearson correlation coefficient.

The idea behind the rank correlation coefficient is simple. Each variable is ranked separately from lowest to highest (e.g. 1, 2, 3, etc.) and the difference between ranks for each data pair is recorded. If the data are positively correlated, then the sum of the squared differences between ranks will be small. The magnitude of the sum is related to the significance of the correlation.

The Spearman rank correlation coefficient is calculated according to the following equation:

$$ r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n^3 - n} \tag{1} $$

where $d_i$ is the difference between ranks for each $x_i$, $y_i$ data pair and $n$ is the number of data pairs. If ties are involved, the equation is more complicated but only makes an appreciable difference if there are a large number of ties (Zar, 1984).

$$ r_s = \frac{\dfrac{n^3 - n}{6} - \sum_{i=1}^{n} d_i^2 - \sum T_x - \sum T_y}{\sqrt{\left[\dfrac{n^3 - n}{6} - 2\sum T_x\right]\left[\dfrac{n^3 - n}{6} - 2\sum T_y\right]}} \tag{2} $$

where $g$ is the number of tied groups, $t_j$ is the number of tied data in the $j$th group, and

$$ \sum T_x = \frac{\sum_{j=1}^{g} \left(t_j^3 - t_j\right)}{12} \quad \text{(for $x$ values)} \tag{3} $$

and

$$ \sum T_y = \frac{\sum_{j=1}^{g} \left(t_j^3 - t_j\right)}{12} \quad \text{(for $y$ values)} \tag{4} $$

Equation (1) is constructed so that it gives $r_s = 1$ when the data pairs have a perfect positive correlation ($d_i = 0$) and $r_s = -1$ for a perfect negative correlation. For example, with $n = 5$, a perfect negative correlation yields the following values for $d_i$: 4, 2, 0, −2, and −4.
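Equations (1) through (4) can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names (`average_ranks`, `tie_term`, `spearman_rs`) are hypothetical, tied values are assumed to share the mean of the ranks they occupy (a common convention that the text does not spell out), and each variable is assumed to contain at least two distinct values.

```python
from collections import Counter
from math import sqrt


def average_ranks(values):
    """Rank from lowest (1) to highest (n); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def tie_term(values):
    """Sum over tied groups of (t^3 - t) / 12, as in equations (3) and (4)."""
    return sum(t**3 - t for t in Counter(values).values()) / 12


def spearman_rs(x, y):
    """Equation (2); with no ties both tie terms vanish and it equals equation (1)."""
    n = len(x)
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    t_x, t_y = tie_term(x), tie_term(y)
    base = (n**3 - n) / 6
    return (base - sum_d2 - t_x - t_y) / sqrt((base - 2 * t_x) * (base - 2 * t_y))


# Perfect negative correlation with n = 5 (made-up values; only the order matters).
print(spearman_rs([1, 2, 3, 4, 5], [50, 40, 30, 20, 10]))
# One tied pair in y, to exercise the tie correction.
print(spearman_rs([1, 2, 3, 4], [1, 2, 2, 3]))
```

With no ties the function reduces to equation (1), so the five-point perfectly negative series returns −1.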

For the $n = 5$ example, the sum of the squares of these $d_i$ values is 40 and the rank correlation coefficient is

$$ r_s = 1 - \frac{6 \cdot 40}{125 - 5} = -1 \tag{5} $$

In order to test the significance of a correlation, we assume that there is no correlation between the two variables and then determine the probability of getting a correlation coefficient as large or larger (or as small or smaller for a negative correlation) than the calculated value by chance. This is done by comparing the calculated value to a table of critical values. If the absolute value of the calculated value is larger than the critical value, then the correlation is significant. A partial table of critical values is presented in Table 1. Alternatively, if the number of data pairs is large (greater than about 30), a $t$ statistic can be calculated using equation (6) and then compared to a table of $t$ values with $n - 2$ degrees of freedom as a good approximation.

$$ t = \frac{r_s \sqrt{n - 2}}{\sqrt{1 - r_s^2}} \tag{6} $$

In order to demonstrate the technique, consider the hypothetical data set presented in Table 2. Concentrations of lead in soil have been measured as a function of distance north of a suspected source, an onsite landfill at a former battery-breaking site. The data are plotted in Figure 1 and clearly indicate a trend of decreasing concentration with distance from the landfill. The data ranks and the squares of the differences between ranks are included in Table 2. Notice that the distance measurements are easily ranked and can be irregularly spaced. The sum of the squares of the differences between ranks is 108. The Spearman rank correlation coefficient can be calculated using equation (1) as illustrated below.
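The calculation for the lead data can also be sketched in Python. The helper names are hypothetical, the data are the values listed in Table 2, and the ranking helper assumes there are no tied values (true for this data set). The `t_statistic` helper implements equation (6) and is called with made-up inputs only, because the approximation is intended for more than about 30 data pairs.

```python
from math import sqrt

# Hypothetical data set 1 (Table 2): distance north (feet) and lead (ppm).
distance = [0, 50, 300, 800, 1000, 2000, 5000]
lead = [510, 380, 450, 300, 170, 45, 89]


def ranks(values):
    """Rank from lowest (1) to highest (n); assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_rs_no_ties(x, y):
    """Equation (1); returns the sum of squared rank differences and r_s."""
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return sum_d2, 1 - 6 * sum_d2 / (n**3 - n)


def t_statistic(rs, n):
    """Equation (6): large-sample t value with n - 2 degrees of freedom."""
    return rs * sqrt(n - 2) / sqrt(1 - rs**2)


sum_d2, rs = spearman_rs_no_ties(distance, lead)
R_CRIT = 0.714  # Table 1, n = 7, one-tailed significance level 0.05
print(sum_d2, round(rs, 4), abs(rs) > R_CRIT)  # 108 -0.9286 True

# Made-up inputs, only to show the call for a large sample.
print(t_statistic(0.5, 32))
```

Only the ordering of the distances enters the result, which is why irregular spacing does not matter here.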

$$ r_s = 1 - \frac{6(108)}{7^3 - 7} = 1 - \frac{648}{336} = -0.928 \tag{7} $$

In order to test if our calculated value of −0.928 is significant at the 95% probability level, we can compare that value to the critical value for $n = 7$ and $\alpha = 0.05$ from Table 1. From the table we find $r_{crit} = 0.714$. Since our calculated value (neglecting the direction of the correlation) of 0.928 exceeds the critical value, we can conclude that the trend is significant at the 95% probability level.

Table 1. Table of critical values for Spearman's rank correlation coefficient. Columns give the level of significance (α) for a one-tailed test.

| Number of data pairs | α = 0.100 | α = 0.050 | α = 0.025 | α = 0.010 |
|---|---|---|---|---|
| 4 | 1.000 | 1.000 | - | - |
| 5 | 0.800 | 0.900 | 1.000 | 1.000 |
| 6 | 0.657 | 0.829 | 0.886 | 0.943 |
| 7 | 0.571 | 0.714 | 0.786 | 0.893 |
| 8 | 0.524 | 0.643 | 0.738 | 0.833 |
| 9 | 0.483 | 0.600 | 0.700 | 0.783 |
| 10 | 0.455 | 0.564 | 0.648 | 0.745 |
| 11 | 0.427 | 0.536 | 0.618 | 0.709 |
| 12 | 0.406 | 0.503 | 0.587 | 0.671 |
| 13 | 0.385 | 0.484 | 0.560 | 0.648 |
| 14 | 0.367 | 0.464 | 0.538 | 0.622 |
| 15 | 0.354 | 0.443 | 0.521 | 0.604 |
| 16 | 0.341 | 0.429 | 0.503 | 0.582 |
| 17 | 0.328 | 0.414 | 0.485 | 0.566 |
| 18 | 0.317 | 0.401 | 0.472 | 0.550 |
| 19 | 0.309 | 0.391 | 0.460 | 0.535 |
| 20 | 0.299 | 0.380 | 0.447 | 0.520 |
| 21 | 0.292 | 0.370 | 0.435 | 0.508 |
| 22 | 0.284 | 0.361 | 0.425 | 0.496 |
| 23 | 0.278 | 0.353 | 0.415 | 0.486 |
| 24 | 0.271 | 0.344 | 0.406 | 0.476 |
| 25 | 0.265 | 0.337 | 0.398 | 0.466 |
| 26 | 0.259 | 0.331 | 0.390 | 0.457 |
| 27 | 0.255 | 0.324 | 0.382 | 0.448 |
| 28 | 0.250 | 0.317 | 0.375 | 0.440 |
| 29 | 0.245 | 0.312 | 0.368 | 0.433 |
| 30 | 0.240 | 0.306 | 0.362 | 0.425 |

Figure 1. Concentration of lead as a function of distance north for hypothetical data set 1. [Plot: concentration (ppm) versus distance north (feet).]

Table 2. Hypothetical data set 1

| Distance north (feet) | Lead concentration (ppm) | Distance rank | Concentration rank | Squared difference between ranks |
|---|---|---|---|---|
| 0 | 510 | 1 | 7 | 36 |
| 50 | 380 | 2 | 5 | 9 |
| 300 | 450 | 3 | 6 | 9 |
| 800 | 300 | 4 | 4 | 0 |
| 1000 | 170 | 5 | 3 | 4 |
| 2000 | 45 | 6 | 1 | 25 |
| 5000 | 89 | 7 | 2 | 25 |

The prior example was rather obvious. Now consider the data presented in Figure 2. The trend of decreasing concentration with distance from the source is not as obvious but appears to be present. However, the calculated rank correlation coefficient of −0.393 is not significant at the 95% probability level. There is sufficient scatter in the data to suggest that the observed spatial distribution of the lead concentrations could have occurred by chance. Of course, proper interpretation of the results should include other considerations such as soil type, species of lead, and presence of other metals. At the least, the results of the rank correlation test should prompt a closer inspection of the data.

Figure 2. Concentration of lead as a function of distance north for hypothetical data set 2. [Plot: concentration (ppm) versus distance north (feet).]

In this final example, the Spearman rank correlation coefficient is applied to a series of data reported by Robb and Moyer (2001). In this recent paper, Robb and Moyer describe use of the Mann–Kendall test for evaluating trends in benzene and MTBE concentrations with time at four Midwestern retail gasoline outlets in order to evaluate the effectiveness of natural attenuation at each of the sites. The Mann–Kendall test for trend is a nonparametric test that is based on the number of times the data increase or decrease when data points are compared to points that follow in time. The technique, described by Gilbert (1987), is similar to the Spearman rank correlation coefficient in that it is nonparametric, can be used on small data sets, is relatively insensitive to outliers, and can be used with irregularly spaced data. One advantage of the Spearman rank correlation coefficient over the Mann–Kendall test is that it is easier to calculate, particularly for larger data sets.

MTBE concentrations recorded quarterly and then semi-annually at two wells (MW-1 and MW-4) over a period of about 3.5 years are reported in Table 3 and plotted on a relative concentration scale in Figure 3. Using the Mann–Kendall test, the authors reported a decreasing trend in MTBE concentration over time, at a 90% confidence level, for MW-1 but not MW-4. Although absolute concentrations vary significantly between MW-1 and MW-4, relative changes in concentration over time are similar, as shown in Figure 3. The only real difference occurs in April 1998, when the MTBE concentration spikes up from a previous value of 2 to 520 mg/L in MW-4, whereas in MW-1 MTBE concentrations remain essentially constant at about 4,800 mg/L.

Table 3. MTBE concentration data as a function of time (Robb and Moyer, 2001)

| Sampling date | MW-1 MTBE concentration (mg/L) | MW-4 MTBE concentration (mg/L) |
|---|---|---|
| July 1997 | 15,000 | 460 |
| October 1997 | 39,000 | 670 |
| January 1998 | 4800 | 2 |
| April 1998 | 4700 | 520 |
| July 1998 | 1500 | 180 |
| October 1998 | 7000 | 140 |
| May 1999 | 4400 | 110 |
| November 1999 | 5000 | 190 |
| June 2000 | 2200 | 62 |
| November 2000 | 1300 | 150 |

The Spearman rank correlation coefficients for MW-1 and MW-4 are −0.685 and −0.430, respectively. The critical value from Table 1 for $n = 10$ and $\alpha = 0.1$ is 0.455. Thus, at the 90% probability level, we conclude that the trend of decreasing concentration with time is significant in MW-1 but not in MW-4, similar to the Mann–Kendall test results reported by Robb and Moyer. Once again, it should be stressed that the Spearman rank correlation coefficient should be used as an exploratory tool in conjunction with other data, and conclusions should be based on all the data. The fact that MTBE concentrations were decreasing in two other wells in addition to MW-1 suggests that natural attenuation is occurring at the site. The fact that no significant trend was detected in MW-4 is not necessarily attributable to the spike observed in April 1998. Neglecting this data point, the rank correlation coefficient is calculated to be −0.417, which is still not significant compared to the critical value of 0.483.
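The same calculation can be repeated for the MTBE data. The sketch below is illustrative only: the helper names are hypothetical, sampling order stands in for time (only the order of the dates affects the ranks), and the concentrations are the values as transcribed in Table 3. With those values it gives about −0.685 for MW-1, matching the reported coefficient, and about −0.44 for MW-4 (about −0.40 without the April 1998 sample), slightly different from the reported −0.430 and −0.417. The comparison against the critical values comes out the same either way.

```python
# MTBE concentrations from Table 3, in sampling order (July 1997 to November 2000).
mw1 = [15000, 39000, 4800, 4700, 1500, 7000, 4400, 5000, 2200, 1300]
mw4 = [460, 670, 2, 520, 180, 140, 110, 190, 62, 150]


def ranks(values):
    """Rank from lowest (1) to highest (n); assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_vs_time(conc):
    """Equation (1) with time ranks 1..n, since samples are already in date order."""
    n = len(conc)
    time_ranks = range(1, n + 1)
    sum_d2 = sum((t - c) ** 2 for t, c in zip(time_ranks, ranks(conc)))
    return 1 - 6 * sum_d2 / (n**3 - n)


rs_mw1 = spearman_vs_time(mw1)
rs_mw4 = spearman_vs_time(mw4)

# Drop the April 1998 sample (fourth entry) from MW-4 and recompute with n = 9.
mw4_no_spike = mw4[:3] + mw4[4:]
rs_mw4_no_spike = spearman_vs_time(mw4_no_spike)

# Critical values from Table 1 for a one-tailed significance level of 0.1.
R_CRIT_N10, R_CRIT_N9 = 0.455, 0.483
print("MW-1:", round(rs_mw1, 3), abs(rs_mw1) > R_CRIT_N10)
print("MW-4:", round(rs_mw4, 3), abs(rs_mw4) > R_CRIT_N10)
print("MW-4 without April 1998:", round(rs_mw4_no_spike, 3), abs(rs_mw4_no_spike) > R_CRIT_N9)
```

Writing the subset step as a simple list slice makes it easy to see which samples entered each coefficient.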

Figure 3. Relative concentration of MTBE detected in wells MW-1 and MW-4 (Robb and Moyer, 2001). [Plot: relative concentration versus sampling date for MW-1 and MW-4.]

What if the first two sampling dates (July 1997 and October 1997) are omitted? Then the rank correlation coefficients are calculated to be −0.405 and 0.0 for MW-1 and MW-4, respectively. The critical value for $n = 8$ and $\alpha = 0.1$ is 0.524. Thus, the decreasing trend in MW-1 is not significant at the 90% confidence level and no trend is detected in MW-4. This analysis suggests that concentrations have not decreased significantly since January 1998. Thus, in reviewing statistical analysis results, it is important to consider which data have been included in the analysis and which data have been omitted. This is always a concern when the statistical tests are designed after the data have been collected. Ideally, the statistical tests should be designed prior to any data collection. Unfortunately, in most forensic applications, such is rarely the case.

Two other issues warrant consideration when using the Spearman rank correlation coefficient for detecting trends in time series data: serial correlation and seasonal effects. A common characteristic of time series data is that the data tend to be serially correlated; that is, the value of a data point tends to be related to previous data points collected in the series even after the underlying trend and any seasonal effects are accounted for (Montgomery and Reckhow, 1984; Gilbert, 1987; Berthouex and Brown, 1994). Serial correlation or autocorrelation is detected by examining the residuals after modeled characteristics such as trend and seasonality are removed (Montgomery and Reckhow, 1984). Serial correlation is difficult to detect in small samples, but tests such as the Durbin–Watson test are available (Berthouex and Brown, 1994). Serial correlation in time series data tends to be an issue when the time steps are small relative to the residence time of the medium being sampled.

Seasonal variations in a data set can hinder the detection of long-term trends. According to Gilbert (1987), there are two possible solutions to this problem: remove the seasonal variation before applying the test, or use a test that is unaffected by seasonal variation. If the data set is large enough, the seasonal variation could be removed by examining data collected in the same month over several years. For example, El-Shaarawi, Esterby and Kuntz (1983) examined trends in water quality parameters measured in the Niagara River using the Spearman rank correlation coefficient applied to monthly data, one month at a time. Six years of data were available. An alternative approach is to use a test such as the seasonal Kendall test, described by Gilbert (1987) and examined in detail by Hirsch, Slack and Smith (1982), that is able to handle seasonal variations.

The Spearman rank correlation coefficient can be a useful tool for exploratory data analysis. Potential applications are numerous. We have used this technique to establish trends in soil lead concentrations as a function of distance from the home (paint lead), nearby roadways (automobile emissions) and a distant air emission source to help allocate source contributions. Each source should be associated with a unique footprint. The technique can also be used to evaluate ground water contamination data to help identify plumes when the data appear to be random. Concentrations in a single well can also be evaluated as a function of time to show whether concentrations are increasing or decreasing with time.

References

Berthouex, P.M. and Brown, L.C. 1994. Statistics for Environmental Engineers. Boca Raton, FL, CRC Press.

El-Shaarawi, A.H., Esterby, S.R. and Kuntz, K.W. 1983. A statistical evaluation of trends in the water quality of the Niagara River. J. Great Lakes Res 9, 234–240.

Gilbert, R.O. 1987. Statistical Methods for Environmental Pollution Monitoring. New York, NY, Van Nostrand Reinhold.

Hirsch, R.M., Slack, J.R. and Smith, R.A. 1982. Techniques of trend analysis for monthly water quality data. Water Resources Research 18(1), 107–121.

Montgomery, R.H. and Reckhow, K.H. 1984. Techniques for detecting trends in lake water quality. Water Resources Bulletin 20(1), 43–52.

Robb, J. and Moyer, E. 2001. Natural attenuation of benzene and MTBE at four midwestern retail gasoline marketing outlets. Contaminated Soil Sediment and Water. Spring 2001 Special Issue, 64–71.

Zar, J.H. 1984. Biostatistical Analysis. Second Edition. Englewood Cliffs, NJ, Prentice Hall.