minerals
Article Application of General Linear Models (GLM) to Assess Nodule Abundance Based on a Photographic Survey (Case Study from IOM Area, Pacific Ocean)
Monika Wasilewska-Błaszczyk * and Jacek Mucha
Department of Geology of Mineral Deposits and Mining Geology, Faculty of Geology, Geophysics and Environmental Protection, AGH University of Science and Technology, 30-059 Cracow, Poland; [email protected] * Correspondence: [email protected]
Abstract: The success of the future exploitation of the Pacific polymetallic nodule deposits depends on an accurate estimation of their resources, especially in small batches, scheduled for extraction in the short term. The estimation based only on the results of direct seafloor sampling using box corers is burdened with a large error due to the long sampling interval and high variability of the nodule abundance. Therefore, estimations should take into account the results of bottom photograph analyses performed systematically and in large numbers along the course of a research vessel. For photographs taken at the direct sampling sites, the relationship linking the nodule abundance with the independent variables (the percentage of seafloor nodule coverage, the genetic types of nodules in the context of their fraction distribution, and the degree of sediment coverage of nodules) was determined using the general linear model (GLM). Compared to the estimates obtained with a simple Citation: Wasilewska-Błaszczyk, M.; linear model linking this parameter only with the seafloor nodule coverage, a significant decrease Mucha, J. Application of General in the standard prediction error, from 4.2 to 2.5 kg/m2, was found. The use of the GLM for the Linear Models (GLM) to Assess assessment of nodule abundance in individual sites covered by bottom photographs, outside of Nodule Abundance Based on a direct sampling sites, should contribute to a significant increase in the accuracy of the estimation of Photographic Survey (Case Study nodule resources. from IOM Area, Pacific Ocean). Minerals 2021, 11, 427. https:// doi.org/10.3390/min11040427 Keywords: polymetallic nodules; nodule abundance; general linear models; linear regression; nodule coverage of seafloor; sediment coverage of nodules; Clarion-Clipperton Zone (CCZ); image analysis Academic Editors: Pedro Madureira and Tomasz Abramowski
Received: 12 March 2021 1. Introduction Accepted: 12 April 2021 Deposits of polymetallic (manganese-bearing) nodules occurring at the bottom of all Published: 17 April 2021 oceans are an attractive alternative to onshore deposits from the point of view of metal resources such as Ni, Co, Cu, Mn, Li, REE (the rare earth elements), and others. The Publisher’s Note: MDPI stays neutral Clarion–Clipperton Fracture Zone (CCZ) in the tropical NE Pacific is the area of greatest with regard to jurisdictional claims in economic interest for nodules [1–3] (Figure1A). It is expected that the demand for some published maps and institutional affil- metals will soon exceed their supply due to the depletion of onshore ore deposits resulting iations. from the intensive development of modern branches of the economy (high technology, green technology, emerging industries, and military applications). In the future, the shortage of some metals can be covered by the exploitation of offshore deposits. This issue was widely discussed in many publications, e.g., [1,2,4–14]. The exploitation of Copyright: © 2021 by the authors. these deposits, apart from their proper recognition and estimation of their resource [15], Licensee MDPI, Basel, Switzerland. requires solving a number of problems related to the technique of mineral extraction and This article is an open access article ore processing [7,16–18]. distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).
Minerals 2021, 11, 427. https://doi.org/10.3390/min11040427 https://www.mdpi.com/journal/minerals MineralsMinerals 20212021, 11,,11 x ,FOR 427 PEER REVIEW 2 of2 17of 17
FigureFigure 1. Location 1. Location of ofB2 B2 Interoceanmetal Interoceanmetal Joint Joint Organization Organization (IOM)(IOM) explorationexploration area area for for polymetallic polymetallic nodules nodules against against the the backgroundbackground of ofthe the Clarion–Clipperton Clarion–Clipperton zone zone [19] [19]( (AA)) and and H22 exploration blockblock ( B(B);); the the location location of of the the box box corer corer sampling sampling sites and seafloor photographs in the H22 exploration block (three variants of training and test subsets) (C). sites and seafloor photographs in the H22 exploration block (three variants of training and test subsets) (C).
TheThe development development of of an an appropriate scenarioscenarioand and schedule schedule for for the the future future short short and and medium-termmedium-term exploitation exploitation ofof nodules, nodules, after afte meetingr meeting the environmentalthe environmental protection protection require- re- quirementsments and and solving solving technical technical problems problems of mining, of mining, depends depends on a on detailed a detailed recognition recognition of ofthe the distribution distribution of of nodule nodule abundance abundance and and resources resources and and an an analysis analysis of of the the ocean ocean floor floor topographytopography based based on on reliable reliable contour maps.maps. AnAn accurate accurate assessment assessment of the abundanceabundance ofof polymetallic polymetallic nodules nodules at at seafloor seafloor sites sites locatedlocated far far away away from from direct direct sampling sampling stationsstations causes causes many many problems. problems. They They result result mainly mainly fromfrom the the large large distances distances between between samplingsampling stationsstations (which (which vary vary depending depending on on the the stage stage of recognition of the nodule-bearing areas), high variability of nodule abundance, and of recognition of the nodule-bearing areas), high variability of nodule abundance, and to to a lesser extent, from the inevitable errors during the sampling process [20]. Therefore, a lesser extent, from the inevitable errors during the sampling process [20]. Therefore, the the attempts to combine the estimation of nodule abundance based on the classical direct attemptssampling to (e.g.,combine using the box estimation corers) [of15 ,nodule21] with abundance indirect methods, based on such the classical as photographic direct sam- plingsurveys (e.g., [ 22using–25] box or widelycorers) understood[15,21] with hydro-acousticindirect methods, methods such as [26 photographic,27], seem natural surveys [22–25]and rational. or widely understood hydro-acoustic methods [26,27], seem natural and rational. TheThe routine routine and and continuous continuous video andand photo-profilingphoto-profiling (photographic (photographic survey) survey) of theof the oceanocean floor floor along along the the course course of a research vesselvessel from from which which direct direct sampling sampling is carriedis carried out out providesprovides a ahuge huge number number of photographs. TheirTheir analysis analysis provides provides indirect, indirect, approximate approximate informationinformation on on the the percentage percentage of seafloorseafloor nodulenodule coverage coverage (i.e., (i.e., the the percentage percentage of of seafloor seafloor coveredcovered by by the the nodules, nodules, hereinafter hereinafter abbreviatedabbreviated NC-S), NC-S), the the degree degree of of sediment sediment coverage coverage of nodules (SC), and the dominant genetic type of nodules (GT) between sampling stations [28]. The data obtained from the photographs in sampling sites are correlated with various strengths with the nodule abundance based on box core samples. The previous attempts
Minerals 2021, 11, 427 3 of 17
of nodules (SC), and the dominant genetic type of nodules (GT) between sampling sta- tions [28]. The data obtained from the photographs in sampling sites are correlated with various strengths with the nodule abundance based on box core samples. The previous attempts to develop a simple linear regression between the nodule abundance and the percentage of seafloor nodule coverage did not yield unequivocal results. In some parts of the studied area of the Interoceanmetal Joint Organization (IOM), a statistically significant and strong linear correlation between these parameters was found (with linear correlation coefficients of 0.6–0.7), while in other parts there was no statistically significant correlation and the coefficients of linear correlation were close to zero [20,28,29]. The main reasons for the weaker correlation of both parameters can be seen in the varying the degree of sediment coverage of nodules [28–30], diversity of the nodule genotypes [31,32], small scale variability of nodule abundance, different geometrical basis of measurements (area of the bottom covered by the photograph several times larger as compared to the area of the horizontal section of the box corer), and the variation in the quality of seafloor pho- tographs. For these reasons, some researchers introduced various coefficients correcting the relationship between the nodule abundance and the percentage of seafloor nodule coverage determined on the basis of photographs [30,33]. Improvements in the accuracy of the estimates of nodule abundance can be expected when additional independent qualitative variables (ordinal variable), determined based on the photographs, such as the distribution of nodule fractions associated with the genetic type of nodules and the degree of sediment coverage of nodules, are introduced to the relationship model. For the data from the H22 exploration block in the IOM area (Figure1B ), a statistically significant and relatively strong correlation was found between the nodule abundance and the genetic type of nodules, while the correlation between the nodule abundance and the degree of covering the nodules with bottom sediments was weak [28]. The coexistence of independent variables of different types (continuous and ordinal) requires the use of an appropriate mathematical model linking them with the nodule abundance used as a dependent variable. This can be achieved using general linear models (GLM). The results of their application to a dataset from a part of the area administered by the IOM [34] in the Clarion–Clipperton Zone in the Pacific are the subject of this article.
2. Research Objective The main aim of the research was to assess the accuracy of the prediction of nodule abundance (APN) at ocean floor points outside the sampling stations based on the multi- variate regression technique called General Linear Models (GLM) (Figure2). In the present case study, GLM was used to determine the form of the relationship linking the nodule abundance (continuous dependent variable) with the percentage of seafloor nodule cover- age NC-S (independent continuous variable) and two qualitative variables (independent ordinal variables)—the genetic type of nodules in the context of their fraction distribution (hereinafter referred to in the text abbreviated as the genetic type of nodules and marked as GT) and the degree of sediment coverage of SC nodules. The values of all independent variables were determined based on photographs of the ocean floor at box corer stations. The values of the NC-S variable were determined automatically with the use of computer software, while the values of the GT and SC variables were determined visually based on expert evaluation. Minerals 2021, 11, 427 4 of 17 Minerals 2021, 11, x FOR PEER REVIEW 4 of 17
Figure 2. FigureDiagram 2. ofDiagram the types of ofthe data types (variables) of data obtained(variables) from obta photographsined from photographs of the seafloor of andthe laboratoryseafloor and grid used in the generallaboratory linear model grid (GLM) used in and the the general simple linear linear modelmodel (SLM)(GLM) to and predict the simple the nodule linear abundance. model (SLM) Explanations: to pre- H, HD, D—geneticdict types the ofnodule nodules abundance. in the context Explanations: of their fraction H, HD distribution, D—genetic (H—hydrogenetic, types of nodules HD—hydrogenetic-diagenetic, in the context of their fraction distribution (H—hydrogenetic, HD—hydrogenetic-diagenetic, D—diagenetic). D—diagenetic).
To evaluate theThe effectiveness fraction distribution of GLM as of a genetic method type for of predicting nodules (GT nodule symbol, abundance, ordinal variable) acts the obtained resultsas an were identifier compared and expresses with the differencesresults of the in theprediction probability for the distributions simple linear of nodule sizes model (SLM) linkingcharacteristic the nodule for the abundance distinguished APN genetic and the types percentage of nodules. coverage The exact of ocean characteristics of floor NC-S. For bothcomparative ordinal variables purposes, and analogous other continuous analyses variables were also describing performed the nodule-bearingfor nod- areas ules from the boxin thecorer H22 samples exploration after blocktheir removal were presented and washing by Wasilewska-Błaszczyk and arranging on a and la- Mucha [28]. boratory grid, i.e.,The for analysis the percentage of variables of grid included coverage in the with cited nodules article NC-T was the (Figure basis for2). the selection of independent variables in GLM. 3. Materials To evaluate the effectiveness of GLM as a method for predicting nodule abundance, the obtained results were compared with the results of the prediction for the simple linear The usefulnessmodel of (SLM)GLM was linking assessed the nodule based abundance on 68 measurements APN and the of percentagethe nodule coverageabun- of ocean dance in samplesfloor collected NC-S. For from comparative the ocean purposes, floor using analogous box corers analyses and were the alsoresults performed of the for nodules analysis of 68 photographsfrom the box corertaken samples at the dire afterct their sampling removal sites. and washingThe datasets and arranging are derived on a laboratory from the H22 explorationgrid, i.e., for block the percentage (4151 km2 of) (located grid coverage in the with central-eastern nodules NC-T part (Figure of the2). B2 sector), best recognized in the area administered by the IOM (Figure 1). The data were3. Materials obtained during two cruises of a research vessel in 2014 and 2019. Both cruises used the sameThe methods usefulness of sampling of GLM was and assessed photographing based on of 68 the measurements ocean floor and of the for nodule abun- computer determinationdance in samplesof the nodule collected covera fromge the of oceanthe seafloor floor using based box on corers bottom and photo- the results of the graphs. Therefore,analysis from ofthe 68 point photographs of view of taken the accuracy at the direct of determining sampling sites. the Thevalues datasets of the are derived variables, the homogeneity of the initial dataset can be assumed. The box corer sampling covered a 0.25 m2 (0.5 m × 0.5 m) square section of the seafloor, while 62 photographs covered a bottom section of approximately 1.6 m2. In 6 out of 68 sampling sites, no bottom photographs were taken before the box corer sample was collected; therefore, the seafloor
Minerals 2021, 11, 427 5 of 17
from the H22 exploration block (4151 km2) (located in the central-eastern part of the B2 sector), best recognized in the area administered by the IOM (Figure1). The data were obtained during two cruises of a research vessel in 2014 and 2019. Both cruises used the same methods of sampling and photographing of the ocean floor and for computer determination of the nodule coverage of the seafloor based on bottom photographs. Therefore, from the point of view of the accuracy of determining the values of Minerals 2021, 11, x FOR PEER REVIEW 5 of 17 the variables, the homogeneity of the initial dataset can be assumed. The box corer sampling covered a 0.25 m2 (0.5 m × 0.5 m) square section of the seafloor, while 62 photographs covered a bottom section of approximately 1.6 m2. In 6 out of 68 sampling sites, no bottom photographsphotographs were obtained taken from before photo-profiling the box corer sample(the device was Neptun collected; C-M1, therefore, Russia the [35], seafloor cover- photographsing an area of obtained about 5 from m2, taken photo-profiling approximately (the device 5–50 m Neptun from the C-M1, box Russia corer sampling [35], covering sites, anwere area used of about instead. 5 m 2, taken approximately 5–50 m from the box corer sampling sites, were used instead.The statistics of continuous variables, i.e., the nodule abundance based on wet nodule weightThe (APN statistics symbol, of continuous dependent variables, variable) i.e., and the th nodulee percentage abundance of seafl basedoor nodule on wet coverage nodule weight(NC-S (APNsymbol, symbol, independent dependent variable), variable) are presen and theted percentage in Table 1, ofand seafloor their empirical nodule cover- distri- agebutions (NC-S in symbol,graphical independent form are presented variable), in areFigure presented 3. in Table1, and their empirical distributions in graphical form are presented in Figure3. Table 1. Statistics of the nodule abundance (APN) in the box core and the percentage of seafloor Tablenodule 1. coverageStatistics (NC-S) of the noduledeterm abundanceined from the (APN) photographs. in the box core and the percentage of seafloor nodule coverage (NC-S) determined from the photographs. Statistics APN (kg/m2) NC-S (%) StatisticsCount APN (kg/m2)68 NC-S (%)68 CountAverage 6813.47 68 39.51 AverageMedian 13.4713.55 39.5142.0 20%Median Trimmed mean 13.55 13.81 42.0 41.16 20%Standard Trimmed meandeviation 13.81 4.64 41.16 12.65 Standard deviation 4.64 12.65 Coeff.Coeff. of variation of variation 34.4% 34.4% 32.0% 32.0% MinimumMinimum 1.51.5 7.0 7.0 MaximumMaximum 23.123.1 72.0 72.0 RangeRange 21.621.6 65.0 65.0 Stnd. skewness −1.30 −1.89 Stnd.Stnd. kurtosis skewness −0.10−1.30 0.77−1.89 p-value (Shapiro–WilkStnd. kurtosis test) 0.428−0.10 0.0290.77 p-value (Shapiro–Wilk test) 0.428 0.029
FigureFigure 3. 3.Normal Normal probability probability plots plots (above) (above) and and box box and and whisker whisker plots plots (below) (below) for for nodule nodule abundance abun- (APN)dance and(APN) nodule and coveragenodule coverage of the bottom of the (NC-S).bottom (NC-S).
The values of all statistical measures of central tendency (average, median, 20% trimmed mean) both within the APN and NC-S sets differ only slightly (Table 1). The variability of both parameters (APN and NC-S) measured by the coefficient of variation is similar (32–35%) and can be described as moderate. Standardized skewness and kurto- sis are within the range expected for the data from a normal distribution (the range is from −2 to 2) [36]. The more precise normality test (Shapiro–Wilk test) did not provide grounds for rejecting the hypothesis of the normality of the distribution of the nodule abundance at the significance level of 0.05. An examination of the empirical distributions of APN and NC-S using the box and whisker method did not show the presence of outliers, which allowed us to consider both datasets as homogeneous (Figure 3).
Minerals 2021, 11, 427 6 of 17
The values of all statistical measures of central tendency (average, median, 20% trimmed mean) both within the APN and NC-S sets differ only slightly (Table1). The variability of both parameters (APN and NC-S) measured by the coefficient of variation is similar (32–35%) and can be described as moderate. Standardized skewness and kurtosis are within the range expected for the data from a normal distribution (the range is from −2 to 2) [36]. The more precise normality test (Shapiro–Wilk test) did not provide grounds for rejecting the hypothesis of the normality of the distribution of the nodule abundance at the Minerals 2021, 11, x FOR PEER REVIEWsignificance level of 0.05. An examination of the empirical distributions of APN and NC-S6 of 17
using the box and whisker method did not show the presence of outliers, which allowed us to consider both datasets as homogeneous (Figure3). TheThe otherother twotwo independentindependent variablesvariables usedused ininGLM GLMwere were categorical categorical (ordinal) (ordinal) and and defineddefined asas factorsfactors with some some nu numbermber of of assigned assigned levels. levels. In Inthe the case case of the of thesediment sediment cov- coverageerage of nodules of nodules (SC), (SC), four four levels levels of ofthis this factor factor were were distinguished distinguished based based on ona visual a visual as- assessmentsessment of of the the seafloor seafloor photographs photographs (Figure (Figure 44),), withwith numbersnumbers fromfrom 11 toto 4 4 in in the the order order correspondingcorresponding toto the the increasing increasing degree degree of of coverage coverage (low (low coverage coverage (1), (1), medium medium coverage coverage (2),(2), high high coverage coverage (3), (3), and and very very high high coverage coverage (4)) (4)) assigned assigned as as identifiers. identifiers.
FigureFigure 4. 4.Examples Examples of of photographs photographs (seafloor (seafloor and and box box corer) corer) from from the the H22 H22 exploration exploration block block showing show- ing different levels of sediment coverage of the nodules; the values (codes) of the ordinal variables different levels of sediment coverage of the nodules; the values (codes) of the ordinal variables are are given in parentheses. given in parentheses.
TheThe nodules nodules occurring occurring in in the the CCZ CCZ are are most most commonly commonly classified classified into into the threethe three genetic ge- typesnetic [types37,38 ]:[37,38]: • • HH (hydrogenetic)—small(hydrogenetic)—small nodules up up to to 3 3 cm cm [37] [37 ]or or 4 4cm cm [38] [38 in] indiameter, diameter, most most fre- frequentlyquently spheroidal spheroidal and and with with smooth smooth surfaces; surfaces; • • HDHD (hydrogenetic-diagenetic)—nodules(hydrogenetic-diagenetic)—nodules intermediateintermediate in size (by (by convention, convention, from from 3 3to to 6 6cm cm in in diameter) diameter) with with a smooth a smooth upper upper surface surface and and a rough a rough lower lower surface, surface, pre- predominantlydominantly ellipsoidal, ellipsoidal, flattened, flattened, and and plate-shaped; plate-shaped; • D (diagenetic)—large nodules, 6–12 cm in diameter, predominantly discoidal and el- lipsoidal in shape and with rough surfaces. Based on the scaled seafloor photographs, it is possible to determine the dominant fractions of nodules, and thus, with high probability, the genetic type of the nodules [28]. In the IOM area, the correctness of the determination of the genetic type of the nodules dominant in the photograph or its part (in the case of only locally increased sediment coverage of nodules) is usually not questionable. With regard to genotypes (GT), three levels of the factor were distinguished based on the nodule fractions dominating in the bottom photographs (Figure 5A): 1-H (hydrogenetic), 2-HD (hydrogenetic-diagenetic), and 3-D (diagenetic). Figure 5B presents the nodule fraction distributions for the different dominant genetic types of nodules with the example of the three box core samples. To
Minerals 2021, 11, 427 7 of 17
• D (diagenetic)—large nodules, 6–12 cm in diameter, predominantly discoidal and ellipsoidal in shape and with rough surfaces. Based on the scaled seafloor photographs, it is possible to determine the dominant fractions of nodules, and thus, with high probability, the genetic type of the nodules [28]. In the IOM area, the correctness of the determination of the genetic type of the nodules dominant in the photograph or its part (in the case of only locally increased sediment coverage of nodules) is usually not questionable. With regard to genotypes (GT), three levels of the factor were distinguished based on the nodule fractions dominating in the bottom photographs (Figure5A): 1-H (hydrogenetic), 2-HD (hydrogenetic-diagenetic), and 3-D (diagenetic). Figure5B presents the nodule fraction distributions for the different dominant genetic types of nodules with the example of the three box core samples. To illustrate the specificity of the fraction distributions of a given genetic type, mean fraction distributions averaged based on 8 (H), 17 (HD), and 43 (D) of 68 all box core samples in the H22 exploration block were used (Figure5B). The fraction distributions of nodules for different dominant genetic types usually have characteristic shapes (Figure5B): strongly skewed to the right (H), moderately skewed to the right (HD), and close to symmetric (D). This factor (GT) related to the differentiation in nodule sizes directly translates into the value of nodule abundance (APN), which is confirmed by a strong positive nonlinear correlation between weight (mass) of the nodules and their surface area [28,29]. Their statistical description was limited to providing the number of observations for individual levels of factors due to the specificity of both ordinal variables (Table2, Figure6). The SC factor levels are dominated by moderate sediment coverage (level 2), which constitutes 50% of all observations, while the GT factor levels are dominated by the diagenetic type D (level 3), slightly exceeding 63% (Figure6, Table2).
Table 2. Frequency and relative frequency for different levels of factors (ordinal variables): genetic type of nodules (GT), sediment coverage (SC).
Level of Ordinal Variable (Factor) Code Frequency Relative Frequency Factor Low 1 24 35.3% Medium 2 34 50.0% Sediment coverage (SC) High 3 6 8.8% Very high 4 4 5.9% H 1 8 11.8% Genetic type of nodules (GT) HD 2 17 25.0% D 3 43 63.2% Minerals 2021, 11, x FOR PEER REVIEW 7 of 17
illustrate the specificity of the fraction distributions of a given genetic type, mean fraction distributions averaged based on 8 (H), 17 (HD), and 43 (D) of 68 all box core samples in the H22 exploration block were used (Figure 5B). The fraction distributions of nodules for different dominant genetic types usually have characteristic shapes (Figure 5B): strongly skewed to the right (H), moderately skewed to the right (HD), and close to symmetric (D). Minerals 2021, 11, 427 This factor (GT) related to the differentiation in nodule sizes directly translates into the 8 of 17 value of nodule abundance (APN), which is confirmed by a strong positive nonlinear cor- relation between weight (mass) of the nodules and their surface area [28,29].
Minerals 2021, 11, x FOR PEER REVIEW 8 of 17
Their statistical description was limited to providing the number of observations for individual levels of factors due to the specificity of both ordinal variables (Table 2, Figure 6). The SC factor levels are dominated by moderate sediment coverage (level 2), which constitutes 50% of all observations, while the GT factor levels are dominated by the dia- genetic type D (level 3), slightly exceeding 63% (Figure 6, Table 2).
Table 2. Frequency and relative frequency for different levels of factors (ordinal variables): genetic type of nodules (GT), sediment coverage (SC).
Ordinal Variable (Factor) Level of Factor Code Frequency Relative Frequency Low 1 24 35.3% Medium 2 34 50.0% Sediment coverage (SC) High 3 6 8.8% Very high 4 4 5.9% FigureFigure 5. 5. ExamplesExamples of of box box corer corer photographs photographs for for three three genetic genetic types types of nodules of nodules (A); fraction (A); fraction distri- distribu- butions of the nodules defined based on the presented (on the left) box core samples (1—green tions of the nodules defined based on the presentedH (on the left)1 box core8 samples (1—green11.8% bars) and bars)Genetic and the type mean of number nodules of nodules in the fractions based on the box core samples from H22 the mean number of nodules in the fractionsHD based on the2 box core samples17 from H22 exploration 25.0% exploration block(GT) (2—dashed red line) (B). block (2—dashed red line) (B). D 3 43 63.2%
Figure 6.6. PiePie chartscharts for for sediment sediment coverage coverage (SC) (SC) and and genetic genetic type type of nodules of nodules (GT). (GT).
4. Methods The general linear models (GLM) procedure is used to construct statistical model de- scribing the impact of any set of explanatory variables (X), quantitative (continuous) or qualitative (categorical), on one or more dependent variables (Y). An independent cate- gorical variable (nominal or ordinal) is called a factor, and its categories are called the levels of the factor [39]. In its simplest form, GLM determines the (linear) relationship between the one de- pendent (response) variable Y and the set of predictors (explanatory variables) Xi and is expressed by the general formula:
Y = b0 + b1X1 + b2X2 + … + bkXk (1)
where b0—intercept, b1–k—partial regression coefficients. This particular case of a general linear model limited (restricted) to one dependent variable was used in the research. In terms of the obtained results, the method is equiva- lent to the multiple regression model. For use in regression, a categorical variable with k categories (levels) must be trans- formed (coded) into a set of (k–1) indicator variables also known as a dummy variable [40]. An example of such a transformation is given in the caption of the Table 3.
Minerals 2021, 11, 427 9 of 17
4. Methods The general linear models (GLM) procedure is used to construct statistical model describing the impact of any set of explanatory variables (X), quantitative (continuous) or qualitative (categorical), on one or more dependent variables (Y). An independent categorical variable (nominal or ordinal) is called a factor, and its categories are called the levels of the factor [39]. In its simplest form, GLM determines the (linear) relationship between the one de- pendent (response) variable Y and the set of predictors (explanatory variables) Xi and is expressed by the general formula:
Y = b0 + b1X1 + b2X2 + . . . + bkXk (1)
where b0—intercept, b1–k—partial regression coefficients. This particular case of a general linear model limited (restricted) to one dependent variable was used in the research. In terms of the obtained results, the method is equivalent to the multiple regression model. For use in regression, a categorical variable with k categories (levels) must be trans- formed (coded) into a set of (k–1) indicator variables also known as a dummy variable [40]. An example of such a transformation is given in the caption of the Table3.
Table 3. Regression models between nodule abundance (APN) and seafloor nodule coverage (NC-S) or grid nodule coverage (NC-T) supported by the categorical variables: level of sediment coverage of nodules (SC) and genetic type of nodules (GT) based on photographs (count of data = 68).
R2 Data Set (Count Regression Independent Equation of Estimated adj SEE MAE MPE MAPE of Data = 68) Method Variables Model p-Value 2 59.2 NC-T APN[kg/m ] = 1.33 + 0.27 2.96 2.23 −7.3 20.9 NC-T[%] (0.0130) SLM 2 − 52.6 APN[kg/m ] = 14.17 + 3.19 2.58 −2.0 28.0 ln(NC-T) (0.0000) Grid 7.38 ln(NC-T[%]) photographs APN[kg/m2] = 0.57 − NC-T, GT 80.7 2.04 1.53 −6.8 17.3 2.70I1(1) − 0.42I1(2) + (0.0000) 0.25NC-T[%] GLM 2 − − APN[kg/m ] = 15.82 81.7 ln(NC-T), 1.98 1.56 0.1 14.8 GT 3.20I1(1) − 0.40I1(2) + (0.0000) 7.34ln(NC-T[%]) 2 15.4 NC-S APN[kg/m ] = 7.55 + 0.15 4.27 3.56 −21.4 41.7 NC-S (0.0000) SLM 2 − 23.3 ln(NC-S) APN[kg/m ] = 5.46 + 4.06 3.39 −15.5 34.9 5.25ln(NC-S[%]) (0.0000) APN[kg/m2] = 2.65 − 60.2 − NC-S, GT 4.16I2(1) − 0.48I2(2) + (0.0000) 2.92 2.26 14.6 29.8 0.22NC-S[%] APN[kg/m2] = 0.95 − Seafloor NC-S, GT, 61.0 1.73I1(1) − 0.01I1(2) + 2.90 2.14 −13.6 28.0 photographs SC 0.02I1(3) − 4.30I2(1) − (0.0000) 0.46I2(2) + 0.27NC-S[%] GLM APN[kg/m2] = −13.25 − ln(NC-S), 67.4 2.65 2.05 −8.4 22.3 GT 3.82I2(1) − 0.73I2(2) + (0.0000) 6.79ln(NC-S[%]) APN[kg/m2] = −20.02 − 2.10I1(1) − 0.60I1(2) − 70.4 ln(NC-S), 2.52 1.88 −6.2 18.8 GT, SC 0.83I1(3) − 4.10I2(1) − (0.0000) 0.61I2(2) + 8.90ln(NC-S[%]) 2 Explanations: SLM—simple linear regression model, GLM—general linear models, Radj—adjusted coefficient of determination, SEE— standard error of estimation, MAE—mean absolute error, MPE—mean percentage error, MAPE—mean absolute percentage error. The values of indicator variables in GLM regression were determined based on the values of categorical (ordinal) variables according to the following scheme: I1(1) = 1 if SC = 1, −1 if SC = 4, 0 otherwise, I1(2) = 1 if SC = 2, −1 if SC = 4, 0 otherwise, I1(3) = 1 if SC = 3, −1 if SC = 4, 0 otherwise, I2(1) = 1 if GT = 1, −1 if GT = 3, 0 otherwise, I2(2) = 1 if GT = 2, −1 if GT = 3, 0 otherwise. Minerals 2021, 11, 427 10 of 17
Each of the determined regression models was tested for its statistical significance by calculating the so-called p-value. When the p-value ≤0.05, the model can be considered statistically significant with a risk of error not greater than 5%. In addition, five measurements were determined to indicate the model’s goodness of fit (the strength of relationships): 2 • The adjusted coefficient of determination Radj expresses the percentage of the variabil- ity in the dependent variable, which has been explained by the fitted model, ranging from 0% (lack of the dependency) to 100% (ideal, full relationship), adjusted for the number of coefficients in the model:
" 2 # n − 1 ∑n (yˆ − y ) R2 = 1 − × i=1 i i × 100% (2) adj n − p n 2 ∑i=1(yi − y)
where n—count of data, p—the number of estimated model coefficients, yˆi—theoretical value of the dependent variable Y determined from the model equation for the obser- vation “i”, yi—empirical value of the dependent variable Y for the observations “i”, y—arithmetic mean of the empirical values of the dependent variable Y. • The standard (prediction) error of estimation (SEE) characterizing the average scatter of the measured values of the dependent variable in the regression model:
v u 2 u n (yˆ − y ) SEE = t∑ i i (3) i=1 n
• The mean absolute error (MAE) characterizing the mean absolute deviation of the measured Y values from the values indicated by the model:
1 n MAE = ∑|yˆi − yi| (4) n i=1 • Mean percentage error (MPE):
100% n yˆ − y MPE = ∑ i i (5) n i=1 yi • Mean absolute percentage error (MAPE):
n 100% yˆi − yi MAPE = ∑ (6) n i=1 yi For comparison purposes, simple linear models (SLM) of the general form:
Y = b0 + b1·X1, (7)
linking the nodule abundance (Y) to the percentage of seafloor nodule coverage (NC-S) and box corer (NC-T) and their natural logarithms, were also analyzed. Logarithmically transforming variables in a regression model is a very common way of handling situations where a non-linear relationship exists between the independent and dependent variables [41]. The use of the natural logarithm in the analyzed cases resulted from preliminary studies which showed a slightly stronger correlation of APN with ln(NC-S) than with (NC-S) (Figure7). Minerals 2021, 11, x FOR PEER REVIEW 10 of 17
• The standard (prediction) error of estimation (SEE) characterizing the average scatter of the measured values of the dependent variable in the regression model: