Chapter 11 Relationship Between Monitoring Variables. Correlation and Regression Analysis
This chapter shows you how to analyse the relationship between two or more variables from your monitoring programme (influent and effluent concentrations, environmental conditions, removal efficiencies, applied loading rates, or others). The topics include correlation and regression analysis between variables. Correlation studies encompass correlation coefficients, correlation matrices, cross-correlation, and autocorrelation, including the parametric Pearson and non-parametric Spearman correlation methods. For regression analysis, we place emphasis on the linear regression model, which is covered in detail. Other regression models (multiple linear regression and non-linear regression) are also addressed in this chapter. The contents of this chapter are applicable to both treatment plant monitoring and water quality monitoring.

CHAPTER CONTENTS
11.1 Introduction
11.2 Correlation Coefficient
11.3 Correlation Matrix
11.4 Cross-correlation and Autocorrelation
11.5 Simple Linear Regression
11.6 Multiple Linear Regression
11.7 Non-linear Regression
11.8 Check-List for Your Report

© 2020 The Authors. This is an Open Access book chapter distributed under the terms of the Creative Commons Attribution Licence (CC BY-NC-ND 4.0), which permits copying and redistribution for non-commercial purposes with no derivatives, provided the original work is properly cited (https://creativecommons.org/licenses/by-nc-nd/4.0/). This does not affect the rights licensed or assigned from any third party in this book. The chapter is from the book Assessment of Treatment Plant Performance and Water Quality Data: A Guide for Students, Researchers and Practitioners, Marcos von Sperling, Matthew E. Verbyla and Sílvia M. A. C. Oliveira (Authors).
doi: 10.2166/9781780409320_0397
Downloaded from http://iwaponline.com/ebooks/chapter-pdf/643404/9781780409320_0397.pdf by guest on 01 October 2021

11.1 INTRODUCTION – Basic

In our book, we have encouraged you to do more with your data: instead of simply reporting monitoring data, we advise you to try to gain a deeper understanding of the behaviour of the system you are studying. As an example, in Chapter 10, we described how you could compare two variables to know whether their central values (means or medians) are equal. In Chapters 12–15, we will show you how to integrate statistics with process analysis, covering water and mass balances, loading rates, reaction kinetics, reactor hydraulics, and process modelling.

In this chapter, we describe how you can study the relationship between two or more variables that are part of your monitoring programme. These variables can be influent and effluent concentrations, environmental conditions, removal efficiencies (see Chapter 7), applied loading rates (see Chapter 13), or any other variable that may be considered to play an important role in your water body or treatment plant.

We will cover ‘correlation’ and ‘regression analysis’ in this chapter, including the following items:

Correlation
• Correlation (simple correlation) (Section 11.2)
• Correlation matrix (Section 11.3)
• Autocorrelation and cross-correlation (Section 11.4)

Regression analysis
• Simple linear regression (Section 11.5)
• Multiple linear regression (Section 11.6)
• Non-linear regression (Section 11.7)

Note that we use the expressions correlation and regression. A simplified difference between them can be stated as follows:
• Correlation: Used to represent the strength of the linear relationship between two variables. In a correlation, there is no concept of dependent and independent variables, that is, the correlation between x and y is the same as the correlation between y and x.
• Regression analysis: Describes how a dependent variable (y) is numerically related to the independent variable (x) or independent variables (x1, x2, …, xn) via a regression equation with coefficients in its structure. The regression model may be linear or non-linear.

Figure 11.1 shows the concept of correlation between two variables. A scatter plot is always a useful way of visually analysing the type of relationship between the variables. If the data points seem to be positioned along an ‘imaginary’ straight line (even if not perfectly), then we can suppose that there may be a linear relationship between the two variables. We measure the strength of the linear relationship by a linear correlation coefficient (r). In Section 11.2, we will show you how to calculate and interpret the correlation coefficient.

Now we will introduce the concept of regression analysis, which is illustrated in Figure 11.2 for the same data points from Figure 11.1. The figure also shows several elements of importance in a regression analysis. You can see clearly that the major difference from correlation is that now we have a line fitted to the data points and an associated equation, which allows us to predict the value of Y (dependent variable or response variable) based on a value of X (independent variable or predictor variable). A linear regression (fitting of a straight line) is illustrated. Since the data points are the same as in Figure 11.1, the coefficient of correlation (r) is the same. Because we have a model and the resulting predictions, we may also have prediction errors if the fit is not perfect. We assess the goodness of fit using the coefficient of determination, which for a simple linear regression equals the correlation coefficient squared (r² or R²).
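As a minimal sketch of the two ideas above (the symmetry of correlation and the link between r and the coefficient of determination), the following Python snippet computes the Pearson correlation coefficient from its textbook definition. The function name `pearson_r` and the influent and effluent values are hypothetical, chosen only for illustration; they do not come from the book's data sets, and Section 11.2 remains the reference for calculation and interpretation.

```python
import math


def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and of squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical monitoring data (illustrative values only, not from the book)
influent = [210.0, 250.0, 300.0, 340.0, 410.0, 460.0]
effluent = [22.0, 30.0, 31.0, 41.0, 44.0, 55.0]

r_xy = pearson_r(influent, effluent)  # correlation between x and y
r_yx = pearson_r(effluent, influent)  # correlation between y and x
r_squared = r_xy ** 2                 # r squared, for a simple linear fit

print(f"r(x, y) = {r_xy:.4f}")
print(f"r(y, x) = {r_yx:.4f}")
print(f"r squared = {r_squared:.4f}")
```

Swapping the two arguments leaves the result unchanged, which is what is meant by correlation having no dependent or independent variable.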
Figure 11.1 Example of the concept of correlation between two variables X and Y.

Figure 11.2 Example of the concept of regression analysis between X and Y and important elements associated with it. A linear regression is illustrated.

The figure may seem, at this stage, a bit complex, with many elements, but all of them will be duly explained in Section 11.5.

Let us discuss some basic concepts of regression analysis. Regression analysis is a statistical technique to model and investigate the relationship between two or more variables. It is mainly used for forecasting purposes (i.e., predicting future responses). A statistical model is developed to predict the values of a dependent variable or a response variable, based on the values of at least one independent or explanatory variable (Montgomery, 2005).

The way the pairs of data points are related defines the type of relationship between the variables and the type of regression model. The purpose of the analysis is to fit a curve to the data points, and by fitting a curve, we mean defining a curve that passes as close to the points as possible. After fitting a curve, we can determine the values of the coefficients of the model. Thus, it will be possible to evaluate a possible dependence of y on x and to express this behaviour mathematically by means of an equation.

There are several models that can be tried to fit the data, involving one or more independent variables (Table 11.1). Figure 11.3 illustrates examples of a linear and a non-linear model.

Most of the concepts in this chapter are related to linear regression models (see Section 11.5 for simple linear regression and Section 11.6 for multiple linear regression). Linear models, especially simple linear regression, are extremely important for the assessment of monitoring data. They are usually the first model we attempt to fit to the data, in order to explore the quantitative relationship between our variables. But remember that this approach assumes a linear relationship between the variables, which frequently may not be the case, especially considering that we are dealing with environmental systems. For some environmental systems, non-linear relationships may be more applicable. Non-linear regression is also covered in this chapter (Section 11.7).

In other chapters of this book, we discuss other modelling approaches not directly associated with regression analysis (non-regression-based models). These mechanistic models require an understanding of the principles of mass balance (Chapter 12), the kinetics of the reactions (Chapter 14), the reactor hydraulics (Chapter 14), and other process-based considerations (Chapter 15).

Table 11.1 Types of regression analysis.

Factor: relationship between each independent variable x and the dependent variable y
Linear regression
• The variables are linearly related.
• In an x–y scatter plot, the points should lie approximately in a straight line.
• The least squares method (minimization of the sum of the squared errors) leads to a unique solution.
Non-linear regression
• The variables are not linearly related.
• In the scatter plot, the points are not distributed over a straight line.
• The equation of a straight line is not used.
• If no explicit solution can be obtained by transforming the variables (e.g. log-transformation), the solution must be obtained by iterative numerical methods that minimize the error function (sum of the squared errors). There is no guarantee that the numerical methods will converge to the same solution.

Factor: number of independent variables
Simple regression
• There is only one independent variable (x).
Multiple regression
• There is more than one independent variable (x1, x2, …, xn).

Figure 11.3 Example of the concepts of a linear regression (top) and a non-linear regression (bottom).
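The distinction in Table 11.1 between a linear fit and a non-linear relationship that can be linearized by transformation can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the procedure detailed in Sections 11.5 and 11.7: the helper names `fit_line` and `coefficient_of_determination` are hypothetical, the data are made up, and the exponential model y = a·exp(b·x) is simply one example of a model that a log-transformation turns into a straight line.

```python
import math


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x (unique closed-form solution)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def coefficient_of_determination(y, y_pred):
    """R squared = 1 - (sum of squared errors) / (total sum of squares)."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical data (illustrative values only, not from the book)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_linear = [3.1, 4.9, 7.2, 8.8, 11.0]

# 1) Simple linear regression
b0, b1 = fit_line(x, y_linear)
y_hat = [b0 + b1 * xi for xi in x]
r2 = coefficient_of_determination(y_linear, y_hat)
print(f"linear fit: y = {b0:.3f} + {b1:.3f} x, R squared = {r2:.4f}")

# 2) Non-linear model y = a * exp(b * x), linearized as ln(y) = ln(a) + b * x.
#    The y values are generated exactly with a = 2 and b = 0.5 (an assumption
#    made so that the recovered coefficients can be checked).
y_exp = [2.0 * math.exp(0.5 * xi) for xi in x]
ln_a, b = fit_line(x, [math.log(v) for v in y_exp])
a = math.exp(ln_a)
print(f"exponential fit: y = {a:.3f} * exp({b:.3f} x)")
```

When no such transformation exists, the coefficients have to be found by an iterative minimization of the sum of squared errors, which is the case the table warns may not always converge to the same solution.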