Soil Mapping Based on Globally Optimal Decision Trees and Digital Imitations of Traditional Approaches

International Journal of Geo-Information

Article Soil Mapping Based on Globally Optimal Decision Trees and Digital Imitations of Traditional Approaches

Arseniy Zhogolev 1,* and Igor Savin 1,2 1 V.V. Dokuchaev Soil Science Institute, 119017 Moscow, Russia; [email protected] 2 Ecological Faculty, Peoples Friendship University of Russia (RUDN University), 117198 Moscow, Russia * Correspondence: [email protected]; Tel.: +7-495-953-7154

Received: 24 September 2020; Accepted: 2 November 2020; Published: 4 November 2020

Abstract: Most digital soil mapping (DSM) approaches aim at complete statistical model extraction. The value of the explicit rules of soil delineation formulated by soil-mapping experts is often underestimated. These rules can be used for expert testing of the notional consistency of soil maps, soil trend prediction, soil geography investigations, and other applications. We propose an approach that imitates traditional soil mapping by constructing compact globally optimal decision trees (EVTREE) for the covariates of traditionally used soil formation factor maps. We evaluated our approach by regional-scale soil mapping at a test site in the Belgorod region of Russia. The notional consistency and compactness of the decision trees created by EVTREE were found to be suitable for expert-based analysis and improvement. With a large sample set, the accuracy of the predictions was slightly lower for EVTREE (59%) than for CART (67%) and much lower than for Random Forest (87%). With smaller sample sets of 1785 and 1000 points, EVTREE produced comparable or more accurate predictions and much more accurate models of soil geography than CART or Random Forest.

Keywords: digital soil mapping; traditional soil mapping; machine learning; EVTREE; Belgorod region

1. Introduction Global Soil Map projects related to obtaining spatial soil information require new techniques for constructing soil maps that can use legacy data in digital soil mapping [1–3]. Digital soil mapping is the process of creating and populating geographically referenced soil databases generated at specific resolutions via field and laboratory observation methods coupled with environmental data through quantitative relationships [4]. This soil-mapping paradigm is based on the use of modelling methods for establishing the soil–landscape relationships needed to estimate the probability of soil occurrence or its properties described by environmental or remote sensing data [5–10]. The digital soil mapping paradigm is gradually replacing the traditional soil mapping paradigm; however, huge archives of traditional soil maps exist and were created by expert soil-mappers in accordance with the qualitative paradigm [5]. To completely utilize these archives, digital soil mapping approaches should consider the qualitative nature of these traditional soil mapping processes. When utilizing legacy soil maps, it is most beneficial to use statistical methods that provide pedological knowledge in a form close to traditional forms [11]. There should also be possibilities for extracting and exploiting traditional qualitative knowledge about spatial soil propagation in the statistical models in an objective form [12,13]. It is also important in digital soil mapping to adopt “sleeping beauty” ideas and techniques from the traditional processes of soil mapping [14]. In traditional soil mapping, the values of other products besides soil maps are often underestimated. This product involves explicit rules for the delineation of soil class units or soil properties. These explicit rules were formulated by soil-mapping experts and have rarely been expressed in their complete form. These rules were evaluated by experts to determine their notional consistency and have been

ISPRS Int. J. Geo-Inf. 2020, 9, 664; doi:10.3390/ijgi9110664 www.mdpi.com/journal/ijgi ISPRS Int. J. Geo-Inf. 2020, 9, 664 2 of 22 used to predict soil trends and investigate soil geography and many other applications. For digital soil mapping, qualitative knowledge can be incorporated into quantitative soil models to yield more accurate soil maps in soil geography. For example, alluvial soils are associated with the floodplains of rivers, whereas organic soils are associated with swamps and marshes. In addition to these models, such expert qualitative information could provide substantial improvements in the accuracy of soil maps [15]. There are many soil mapping frameworks that are based on statistical methods and provide satisfactory prediction results: DSMART—resampled classification trees [16]; SoilGRIDs—Random Forest, multiple linear regression, and ensemble methods [7]; and DoSoReMi—Random Forest [17], among others. If a complete sample set is used for modelling, ensemble methods, such as a Random Forest, yield the highest prediction accuracy [18–20]. However, there are several basic limitations for obtaining accurate statistical models based on “pure” quantitative statistical approaches without using additional qualitative knowledge [21]. In addition, extrapolation issues exist due to a dependence on the quality of spatial sampling and over-fitting [20]. Few developments in digital soil mapping represent the soil mapping rules in an explicit form. One such approach uses SoLIM (Soil Land Inference Mode) technology, which is based on fuzzy logic [22,23]. This approach differs from the traditional method of formulating rules for soil mapping. In 2003, Qi and Zhu developed another approach for formulating tree-based rules like traditional soil mapping using the See5 algorithm [11]. Subsequently, this algorithm was integrated with the statistical fuzzy-logic approach [24]. Therefore, the SoLIM framework integrates many traditional methods for soil mapping, enabling the development of pedologically sound soil maps. Most existing frameworks integrate only some traditional methods for soil mapping. However, the traditional process of soil mapping compilation can be imitated using the achievements of digital soil mapping. Many useful results could be obtained if the entire work process of a traditional soil mapper was imitated, including the process of soil map compilation via the cartographic methods used for covariate analysis [25]. In addition, there are approaches for incorporating expert knowledge into automatically created models, as well as extracting and transferring explicit knowledge [26–28]. New approaches enable the combination of expert-based knowledge and mathematical modelling to fuse the quantitative and qualitative descriptions of the relationships between soils and soil formation factors. Structural equation modelling (SEM) is a contemporary approach that facilitates the integration of knowledge about soils into a statistical model and provides an opportunity to assess the inter-relationships of various soil properties during modeling [21]. According to the authors, “It bridges the gap between empirical and mechanistic methods for soil–landscape modelling and is a tool that can help produce pedologically sound soil maps” [12]. However, the SEM approach is more aimed at incorporating expert knowledge than extracting knowledge from legacy data. Moreover, full theoretical models of SEM are presently more complex than simply representing practical rules for soil delineation in the traditional soil mapping process. Another method of extracting the soil–landscape relationship is to use methods based on classification and regression tree algorithms. Representation with decision trees involves using the practical rules that were considered by the soil-mapping experts for soil mapping. Philippe Lagacherie implemented one of the most widely used methods, the classification and regression tree algorithm, for soil mapping in 1992 [29]. Feng Qi and A-Xing Zhu wrote a broad review about using the See5 algorithm to acquire knowledge for soil mapping [11]. Most of the classification and regression tree algorithms use locally optimal splitting criteria such as Gini or entropy for recursive stepwise partitioning [30]. These methods are aimed at obtaining satisfactory prediction accuracies but not at developing an optimal model. This leads to the substantial overlearning of a tree. Such a tree can be large and have poor notional consistency for some rules while yielding satisfactory prediction accuracy for the key site. These locally optimal methods include CART (Classification And Regression Trees), CTREE (Conditional Inference Trees), CHAID (Chi-square Automatic Interaction Detection), See5, and CUBIST (Rule- And Instance-Based Regression Modeling), along with many others [31]. ISPRS Int. J. Geo-Inf. 2020, 9, 664 3 of 22

Classification and regression tree algorithms of a new type have been proposed and aim at finding the optimal and most meaningful classification trees, e.g., evolution trees [31], neural decision trees [32], and deep neural decision trees [33]. Such globally optimal methods potentially allow much more meaningful knowledge to be obtained about the relationships between soil and soil formation factors. Intrinsically, the methods for the extraction of soil–landscape relationships may only yield reliable models when the input data for modelling are available with good quality. In traditional soil mapping, soil mappers utilize a limited number of the most significant soil formation factor maps to formulate compact and comprehensive rules for soil delineation. The same strategy of using only meaningful covariates forms the basis of our approach that we propose for constructing meaningful statistical models that can be easily analyzed by experts [15,25,34]. In this study, we propose and evaluate a framework for constructing regional soil maps based on a digital imitation of traditional soil mapping and test this approach at a site in the Belgorod region of Russia.

2. Materials and Methods

2.1. Test Site The test site is situated in the Belgorod region of Russia (Figure1). The square size of the 2 test site was 1560 km . The coordinates of the corners of the study area are 50◦100 N, 37◦460 E, 50◦580 N, and 38◦120 E. Chernozems Pachic and Chernozems Albic are the dominant soils in the region. The terrain is a heavily dissected wavy-flat hilly plain. The parent materials are loess loam and clay, chalk sediments, alluvial and fluvioglacial sands, and sandy loams. The climate is continental with a hot summer and a moderately cold winter. The natural vegetation is represented by isolated oak and pine forests, meadows in dry valleys and floodplains, and calciphytic vegetation on steep slopes. However, croplands are dominant. Large-scale soil maps created using the legend for the Russian classification of soils 1977—RCS1977 [35], were used as the basis for modelling. Further, translations were added to the modern Russian classification of soils 2004—RCS2004 [36] and the World Reference Base for Soil Resources 2006—WRB2006 [37]. The main relationships between soils and the soil formation factors are shown in the Table1. For the test site, a sample set with a mean distance of 120 m between key points was made using large-scale soil maps. In the Integrated Land and Water Information System (ILWIS) program, points were set randomly, and the soil classes from the traditional large-scale soil map were assigned to them. The sample-set size of 5001 points covered the central part of the study area and represented all soils (Figure1). A distance of 120 m between points was chosen for the margins considering that one soil point should relate to the smallest possible soil polygon 2 mm by 2 mm in size on a traditional soil map at a medium scale of 1:200,000 [38–40]. The accuracy of the boundary positioning at this scale is the closest to the 90 m resolution of the covariates [41]. The whole approach for making the sample set was based on the classic method for the compilation of medium-scale soil maps. The smallest polygon on the map should be set using one soil pit or using information collected from a large-scale soil map.

2.2. Proposed Approach for Soil Mapping The main strategy of the proposed approach is to use statistical methods and remote sensing data to imitate traditional soil mapping methods for the compilation of a regional scale soil map. Classiﬁcation and regression tree algorithms were used to identify the relationships between the soil and soil formation factors in the form of an explicit decision tree model. This model was trained on the key sites in the soil region and then extrapolated to the area of the soil region where there was a lack of information about the soil. The obtained decision tree model can be traced and corrected manually. An appropriate statistical method was selected to ﬁnd the optimal decision tree. The selected method was integrated into the digital soil mapping framework based on the stages of the traditional soil mapping process (Table2). ISPRS Int. J. Geo-Inf. 2020, 9, x FOR PEER REVIEW 4 of 23

ISPRS Int. J. Geo-Inf. 2020, 9, 664 floodplain terraces,4 of 22 floodplains with lightly Sod Alluvial (SAL) Sod Alluvial Laminate Haplic Fluvisols drained water–glacial A model for the delineation of soils was created for each soil region of the test area, in which the deposits earth–landscape relationships between the ground and soil formation factors were considered the same. Alluvial Meadow Alluvial Dark-humus Umbric Fluvisols This strategy was used in traditional soil mapping (TSM) for soil features [11,38,42river]. The valleys soil region is (AL) soils Oxyaquic deﬁned as a lithologic–geomorphologic region with the same geological history. The regionalization beams, ravines, and wasGully obtained complex based (GC) on landform Gully or complex river watershed delineations - [42]. valleys of small rivers

Figure 1. Study area Novy Oskol.

ToFor develop the test site, the model,a sample maps set with of the a mean soil forming distance factors of 120 (covariates)m between key that points are the was same made as using those usedlarge-scale in traditional soil maps. soil In mapping the Integrated (Figure Land2) were and added Water to Information the covariates. System These (ILWIS) covariates program, are based points on thewere digital set randomly, elevation modeland the SRTM soil classes [43]—the from distance the traditional to the nearest large-scale river weighted soil map with were the assigned slopes and to verticalthem. The terrain sample-set variation size [25 of] 5001 and thepoints flow covered accumulation the central index part [44 of]. Usingthe study the area weighted and represented distance to theall soils nearest (Figure river 1). enabled A distance us to of imitate 120 m isoline between (contour) points frequencywas chosen traditional for the margins expert analysis,considering and that the distanceone soil point to rivers should from relate paper to topographicthe smallest possible maps was soil used polygon to delineate 2 mm by the 2 geomorphologicalmm in size on a traditional units as riversoil map floodplains, at a medium river scale terraces of 1:200,000 and slopes, [38–40]. and watershed The accuracy areas. of In the comparison boundary topositioning the widely at used this topographyscale is the closest wetness to indexthe 90 [ 45m], resolution the weighted of the distance covariates is less [41]. dependent The whole on approach noise in digital for making elevation the models.sample set Derivative was based maps on thatthe classic are based method on the for digital the compilation elevation model of medium-scale (slopes and flowsoil directions)maps. The were calculated using the median filtered elevation model. A vegetation map was created using satellite image classification via the Random Forest method for a manually prepared learning sample

ISPRS Int. J. Geo-Inf. 2020, 9, 664 5 of 22 set (Table3). Overall, the producer and user accuracies of the vegetation class delineations exceeded 99.5%. The following covariates were prepared for modeling (Table3). The proposed models for soil mapping were represented as decision trees. The main advantage of using decision trees is the possibility of intuitive manual correction. It is also possible to construct a seedling of a tree with branches that represent expert knowledge about soils (Figure3). Such branches can be created to delineate large taxonomic groups of soils and specific soil properties. All unknown relationships can be extracted automatically via classification and regression tree algorithms, such as via the evolutionary learning of a globally optimal tree. The classification and regression tree algorithms should obtain an interpretable model. Consequently, this model needs to be compact and stable. After that, the full decision tree will contain all the required information for constructing the soil map (Figure3). Such classification trees can be built for various soil regions and incorporated into a database to store the relevant knowledge of soils in a compact and explicit form. The obtained decision trees can also be linked to soil diagnostic features in soil classification systems to improve the soil classification systems or the decision trees themselves. Therefore, we considered two soil data sources: statistical knowledge in an interpretable form from decision trees extracted from legacy soil maps and expertISPRS Int. knowledge J. Geo-Inf. 2020 from, 9, x manualFOR PEER expert REVIEW correction of the trees. 6 of 23

Figure 2. Covariates from the traditional methodology of soil mapping with additional maps of the elevation model derivatives (*—optionally used maps and **—historical information about soils that are required for correct monosemanticmonosemantic soilsoil delineationdelineation viavia thethe soilsoil formationformation factorfactor approach).approach).

Table 3. Description of the developed covariates.

№ Covariate Basic Data Method of Preparation 1 Elevation map (DEM) DEM SRTM v.4.1 Original data [43] Elevation map median filtered 2 DEM SRTM v.4.1 Median filtration [25] through fifteen pixels (DEM_md15) 3 Slope map (SLP) Calculation [25] 4 Flow direction map (FLOWDIR) Calculation [25] Slope map median filtered through 5 Median filtration [25] fifteen pixels (SLP_md15) Flow direction map median filtered 6 through fifteen pixels Median filtration [25] DEM SRTM v.4.1 (FLOWDIR_md15) Weighted with slope distance to 7 Calculation [25] rivers (RIVDIST) Calculation of the 8 Flow accumulation (FLOWAC) number of input pixels of any outlet Vegetation map: coniferous forest, Classification via the 9 Landsat 8OLI leaf forest, water, built-up area, Random Forest

ISPRS Int. J. Geo-Inf. 2020, 9, x FOR PEER REVIEW 7 of 23

cropping areas and grasslands, and algorithm using a peat bogs (LNDCVR) prepared sample set Quaternary sediment map at a scale of Type of quaternary sediments 10 1:500,000 (based on Manual vectorization (QGEOL) maps at a scale of 1:200,000)

The proposed models for soil mapping were represented as decision trees. The main advantage of using decision trees is the possibility of intuitive manual correction. It is also possible to construct a seedling of a tree with branches that represent expert knowledge about soils (Figure 3). Such branches can be created to delineate large taxonomic groups of soils and specific soil properties. All unknown relationships can be extracted automatically via classification and regression tree algorithms, such as via the evolutionary learning of a globally optimal tree. The classification and regression tree algorithms should obtain an interpretable model. Consequently, this model needs to be compact and stable. After that, the full decision tree will contain all the required information for constructing the soil map (Figure 3). Such classification trees can be built for various soil regions and incorporated into a database to store the relevant knowledge of soils in a compact and explicit form. The obtained decision trees can also be linked to soil diagnostic features in soil classification systems to improve the soil classification systems or the decision trees themselves. Therefore, we considered two ISPRSsoil data Int. J. Geo-Inf.sources:2020 ,statistical9, 664 knowledge in an interpretable form from decision trees 6extracted of 22 from legacy soil maps and expert knowledge from manual expert correction of the trees.

FigureFigure 3. Completing 3. Completing a decision a decision tree tree of confident of conﬁdent expe expertrt knowledge knowledge with with additional additional branches via via machine learning (evtree package). Dist—distance to the nearest river weighted by slopes, machine learning (evtree package). Dist—distance to the nearest river weighted by slopes, Elev— Elev—elevation, Alluv—alluvial deposits in ﬂoodplains, and Histic—appearance of the Histic soil elevation, Alluv—alluvial deposits in floodplains, and Histic—appearance of the Histic soil horizon horizon with organic material (Accuracy: 90% refers to the accuracy of recognizing the soil organic withsurfaces organic based material on Landsat (Accuracy: satellite 90% images). refers to the accuracy of recognizing the soil organic surfaces based on Landsat satellite images).

The implementation of the evolutionary learning of the globally optimal classification and regression tree algorithm from the EVTREE package in R was selected to construct the classification and regression trees [31]. The EVTREE algorithm aims at finding the optimal decision tree rather than

ISPRS Int. J. Geo-Inf. 2020, 9, 664 7 of 22

Table 1. The main relationships between soils and the soil formation factors.

Soil name in RCS1977 Classification [35] Soil Name in RCS2004 Classification [36] Soil Name in WRB2006 Classification [37] Main Soil Forming Factors Chernozems Typical (CHt) Agrochernozem Migrational-Mycelial Voronic Chernozems Pachic flat interfluve on watersheds and riverine flat interfluves Chernozems Leached (CHv) Agrochernozems Clay-Illuvial Voronic Chernozems Pachic depressions of flat interfluves Chernozems Residual-Calcareous (CHrc) Agrochernozems Textural-Carbonate Leptic Chernozems steep slopes with calcareous sediment Chernozems Podzolic (CHo) Agrochernozems Podzolic Clay-Illuvial Luvic Phaeozems flat interfluves where oak forests were cut 1–2 hundred years ago Grey and Dark Grey forest (G23) Grey and Dark Grey soils Greyic Phaeozems Albic under oak forests Meadow Chernozems (CHl) Agro-Dark-humus Gley soils Voronic Chernozems Pachic ancient alluvial terraces Sod Alluvial (SAL) Sod Alluvial Laminate Haplic Fluvisols floodplain terraces, floodplains with lightly drained water–glacial deposits Alluvial Meadow (AL) Alluvial Dark-humus soils Umbric Fluvisols Oxyaquic river valleys Gully complex (GC) Gully complex - beams, ravines, and valleys of small rivers

Table 2. Traditional and proposed digital soil mapping frameworks.

Traditional Soil Mapping Framework [25,38] Digital Soil Mapping Framework Using the Proposed Approach 1. Data preparation 1. Data preparation 2. Compilation of the soil regionalization map 2. Development of a work programme 3. Development of the legend of the soil map 3. Creation of maps for soil formation factors 4. Construction of maps for soil formation factors 4. Compilation of a preliminary map at a detailed scale 5. Creation of the decision trees seedlings based on expert knowledge 5. Field surveys 6. Completion of the decision trees via machine learning 6. Generalization of the soil units using lithologic–geomorphological maps (regionalization maps) 7. Compilation of the soil map 7. Compilation of the legend and pre-final soil map 8. Verification of the soil map 8. Compilation of the final soil map 9. Correction of the soil map

Table 3. Description of the developed covariates.

№ Covariate Basic Data Method of Preparation 1 Elevation map (DEM) DEM SRTM v.4.1 Original data [43] 2 Elevation map median filtered through fifteen pixels (DEM_md15) DEM SRTM v.4.1 Median filtration [25] 3 Slope map (SLP) Calculation [25] 4 Flow direction map (FLOWDIR) Calculation [25] 5 Slope map median filtered through fifteen pixels (SLP_md15) Median filtration [25] DEM SRTM v.4.1 6 Flow direction map median filtered through fifteen pixels (FLOWDIR_md15) Median filtration [25] 7 Weighted with slope distance to rivers (RIVDIST) Calculation [25] 8 Flow accumulation (FLOWAC) Calculation of the number of input pixels of any outlet Vegetation map: coniferous forest, leaf forest, water, built-up area, cropping Classification via the Random Forest algorithm using a 9 Landsat 8OLI areas and grasslands, and peat bogs (LNDCVR) prepared sample set Quaternary sediment map at a scale of 1:500,000 10 Type of quaternary sediments (QGEOL) Manual vectorization (based on maps at a scale of 1:200,000) ISPRS Int. J. Geo-Inf. 2020, 9, 664 8 of 22

The implementation of the evolutionary learning of the globally optimal classification and regression tree algorithm from the EVTREE package in R was selected to construct the classification and regression trees [31]. The EVTREE algorithm aims at finding the optimal decision tree rather than one of the possible decision trees. This is important when trying to extract pedologically meaningful rules for soil mapping from legacy soil maps or the sample points obtained during soil surveys. This algorithm is based on concepts such as inheritance, mutation, and natural selection. These concepts are population-based, and a whole collection of candidate solutions—trees in this case—is processed simultaneously and iteratively modified by variation operators (namely, mutation (applied to single solutions) and crossover (merging multiple solutions) operators). Finally, a survivor selection process is used and favors solutions that perform well according to a specified quality criterion, which is usually called the evaluation function. In this evolutionary process, the mean quality of the population increases over time. A notable difference from comparable algorithms is the survivor selection mechanism here, with which it is essential to avoid premature convergence [31]. For most applications, globally optimal algorithms, in contrast to locally optimal algorithms, enable much more complete coverage of the parameter space, as follows:

Θ = (v1, s1, ... , vM 1, sM 1), (1) − − where vr denotes the splitting variables, sr denotes the associated splitting rules for the internal nodes r, and M denotes the number of terminal nodes. Equation (2) seeks trees Θˆ that minimize the misclassiﬁcation loss under a Bayesian information criterion (BIC) trade-oﬀ with the number of terminal nodes:

Θˆ = argmin loss Y, f (X, Θ) + comp(Θ), (2) Θ Θ { } ∈ where loss( , ) is a suitable loss function; X the predictor variable vector; Y is the response variable · · vector; comp( ) is a complexity function that is monotonically non-decreasing based on the number · Mmax of terminal nodes M of the tree Θ from ΘM; and the overall parameter space is Θ = ΘM; ΘM, ∪M=1 the space of conceivable trees with M terminal nodes. For EVTREE, the quality of a classification tree is most commonly measured as a function of its misclassification rate (MC) and the complexity of a tree is determined by a function of the number of terminal nodes M: loss(Y, f (X, Θ)) = 2N MC(Y, f (X, Θ)) · PN = 2 I(Yn , f (Xn, Θ)), (3) · n=1 comp(Θ) = α M log (N), · · where N is the number of distinct values of predictor variables, comp describes the complexity function of a tree, and α is a user-specified parameter (default value α = 1). Notably, there are other possible rational values of α [15]. Using EVTREE as a method for finding the relationships between soil and soil formation factors, we implemented our approach of imitating the traditional soil mapping process in the R software packages (Figure4). The main package, “imsoil”, was developed as a working experimental version. Development of the R package for the interactive manual correction of the “constparty” decision tree models from the “partykit” package was also required [46]. We used the model form “constparty” because it is widely used in various statistical packages and can be easily transformed into the classical format, “.pmml”. ISPRS Int. J. Geo-Inf. 2020, 9, 664 9 of 22 ISPRS Int. J. Geo-Inf. 2020, 9, x FOR PEER REVIEW 9 of 23

Figure 4. Structure of our ongoing R packages with implementationimplementation of the traditional soil mapping imitation approach.

To evaluate the real performance of EVTREE in the context of the proposed approach, thethe trees created byby EVTREEEVTREE werewere compared compared to to trees trees created created by by the the classic classic locally locally optimal optimal algorithm, algorithm, CART CART [47]. [47].Currently, Currently, ensemble ensemble methods methods are the are most the common most co approachesmmon approaches for digital for soil digital mapping soil mapping due to their due high to theirpredictive high performance;predictive performance; hence, a comparison hence, a withcomp Randomarison with Forest Random is also provided.Forest is Randomalso provided. Forest wasRandom selected Forest as onewas ofselected the best as methods one of the in best terms methods of prediction in terms accuracy of prediction [20]. accuracy [20]. The trees produced by CART were pruned to the smallest values of their complexity parameters (CPs). TheThe numbernumber of of minimum minimum cases cases in thein splitthe split for CART for CART was equal was toequal 10. Theto 10. Gini The splitting Gini splitting criterion wascriterion used was for used CART. for For CART. EVTREE For EVTREE modelling, modelling, we used we the used following the following parameters: parameters: minbucket minbucket= 10, =minsplit 10, minsplit= 20, = niterations 20, niterations= 2000, = 2000, ntrees ntrees= 50, = alpha50, alpha= 1, = and 1, and maxdepth maxdepth= 10. = Prior10. Prior probabilities probabilities for forthe branchesthe branches of the of CART the CART and EVTREE and EVTREE decision decision trees were trees calculated were calculated via the 10-fold via the cross-validation 10-fold cross- validationprocedure. procedure. Random Random Forest analysis Forest analysis with default with default settings settings from thefrom randomForest the randomForest package package was wasthen then conducted. conducted. The sample set we used consisted of 5001 points, as described above. Versions Versions of the sample set with reducedreduced numbers numbers of of sample sample points points were were also createdalso created to compare to compare the performance the performance of the methods of the methodsfor diﬀerent for sampledifferent set sample sizes. set sizes. The following criteria were used to assess the quality of soil–landscape relationshiprelationship models:models: • Stability: determined via an analys analysisis of the dispersion of the numb numberer of branches and via expert • analysis for 10 runs of an algorithm; • Complexity: the number of conditions in the trees; • • Notional consistency: expert-based interpretation of one tree of each algorithm;

ISPRS Int. J. Geo-Inf. 2020, 9, 664 10 of 22

Notional consistency: expert-based interpretation of one tree of each algorithm; • The mean and minimum prior probabilities of proper classification were calculated as the mean • and minimum number of properly classified points of a class from the training sample set in each tree terminal node divided by the total number of properly and improperly classified points in the same node; Spatial uncertainty in pixels: the mean probability of the most likely class for 10 runs of • an algorithm; Map accuracy assessment: overall, the producer’s, user’s, and kappa accuracy [48]. The overall • accuracy shows the number of similar depicted pixels in the predicted and reference maps multiplied by the total number of pixels. The producer’s accuracy represents the number of pixels that were depicted as part of the same class multiplied by the total number of that class’s pixels in the reference map. The user’s accuracy is the number of pixels that were depicted in the similar class but multiplied by the total number of that class’s pixels in the predicted map. The producer’s accuracy is associated with the error of misclassification, and the user’s accuracy is dedicated to the error of over-classification. The Cohen’s kappa index represents the overall po pe accuracy considering the possibility of the agreement occurring by chance: κ − , where p 1 pe o ≡ − is the relative observed agreement among raters (identical to overall accuracy), and pe is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly seeing each category. For k categories, N observations to categorize P n i k p = 1 n n and ki the number of times rater predicted category : e N2 k k1 k2. The ten obtained decision trees were then expertly analyzed, and qualitative descriptions of the relationships between the soil and soil formation factors from the soil geography perspective were extracted from the trees. Then, qualitative decision trees were manually created, and statistical quantitative decision trees were packaged into them. This combination of qualitative descriptions and quantitative rules enabled further improvement of the prediction accuracy and the notional consistency of the soil mapping. This was accomplished by improving the list of covariates used for modelling and manually correcting the quantitative rules in accordance with qualitative information on the soil geography. Knowledge in the form of decision trees can be flexibly changed by the individual parameters and structures.

3. Results Decision trees with rules for soil delineation were created via the EVTREE and CART methods (Figure5). All trees that were produced by EVTREE were of a similar size; the trees that were produced by CART were also of a similar size, and the trees that were produced by EVTREE were more compact than the CART trees. Moreover, the decision trees that were created by EVTREE had different sequences of similar rules for soil class delineation (Figure5). All EVTREE trees, despite their formally di fferent structures, presented similar relationships. The decision trees created by CART had similar rules for soil class delineation and similar sequences in the top part but different rules in the bottom part and, consequently, different sets of rules for soil class delineation. The next example of rules for soil delineation is expressed in text form as follows: 3_G23—Grey and Dark Grey forest/Greyic Phaeozems Albic: EVTREE_№5: (1) DEM_md15 < 203 m ˆ RIVDIST >= 454505 m ˆ DEM >= 171 m ˆ LNDCVR = 45_Foret ˆ DEM_md15 >= 203 m ˆ 3_G23 (Probability: 12/15); (2) DEM_md15 >= 203 m ˆ RIVDIST >= 333451 m ˆ RIVDIST >= 371027 m ˆ LNDCVR = 3_Water, 4_Leaf_Forest ˆ 3_G23 (Probability: 47/58). CART_№2: DEM < 171 m ˆ RIVDIST < 337000 m ˆ LNDCVR = 1_Grass, 2_Build_up_soils, 3_Water, 5_Coniferous_Forest ˆ DEM_md15 < 203 m ˆ (1) RIVDIST <= 539000 m ˆ 3_G23 (Probability: 35/37); (2) DEM_ < 208 m ˆ 3_G23 (Probability: 19/24); (3) DEM ->= 208 m ˆ FLOWDIR = 2_E, 3_SE, 7_NW ˆ 3_G23 (Probability: 4/10). ISPRS Int. J. Geo-Inf. 2020, 9, 664 11 of 22 ISPRS Int. J. Geo-Inf. 2020, 9, x FOR PEER REVIEW 11 of 23

Figure 5. The appearance of the trees that were created via EVTREE and CART. Figure 5. The appearance of the trees that were created via EVTREE and CART. Description: EVTREE created branches for Phaeozem delineation in forests far from floodplains, whereThe only next leaf example forests exist of rules (Figure for6 soil). CART delineation created is many expressed branches in text with form various as follows: elevation derivatives, including3_G23—Grey elevations, and slopes, Dark Grey and flowforest/Greyic directions. Phaeozems Hence, the Albic: rules for Phaeozem delineation were overlearnedEVTREE_ by№ CART.5: (1) DEM_md15 < 203 m ^ RIVDIST >= 454505 m ^ DEM >= 171 m ^ LNDCVR = 45_ForetAnother ^ DEM_md15 example of >= the 203 rules m ^ for3_G23 soil (Probability: delineation is 12/15); expressed (2) DEM_md15 in text form >= as 203 follows: m ^ RIVDIST >= 33345121_AL—Alluvial m ^ RIVDIST >= Meadow 371027/ Umbricm ^ LNDCVR Fluvisols = 3_Wa Oxyaquic:ter, 4_Leaf_Forest ^ 3_G23 (Probability: 47/58). EVTREE_CART_№№2: 5:DEM DEM << 171203 m m ˆ^ RIVDISTRIVDIST< <454,505 337000 m m ˆ (1)^ LNDCVR == any1_Grass, except 2_Build_up_soils, 2_Build-up_soils or3_Water, 5_Coniferous_Forest 5_Coniferous_Forest ˆ DEM_md15 ^ DEM_md15< 160 m< 203 ˆ RIVDIST m ^ (1) RIVDIST< 23,679 m<= ˆ539000 21_AL m(Probability: ^ 3_G23 (Probability: 191/211); (2)35/37); LNDCVR (2) DEM_= any < 208 except m ^ 3_G23 2_Build-up_soils (Probability: and 19/24); 5_Coniferous_Forest (3) DEM ->= 208 m ˆ ^ DEM_md15 FLOWDIR =>= 2_E,160 3_SE, m ˆ7_NW FLOWAC ^ 3_G23< (Probability:380 pixels ˆ4/10). 21_AL (Probability: 33/64); (3) LNDCVR = 2_Build-up_soils or 5_Coniferous_ForestDescription: EVTREE ˆ ˆ RIVDIST created< branches8398 m ˆ 21_ALfor Phaeoz (Probability:em delineation 19/21). in forests far from floodplains, whereCART_ only№ leaf2: DEM forests< 171 exist m ˆ(Figure RIVDIST 6).< 9305CART m created ˆ 21_AL (Probability:many branches 190 /200);with (2)various DEM = 9305including m ˆ LNDCVR elevations,<> 1_Grass, slopes, 45_Mixed_Forest, and flow directions. 3_Water, Hence, 4_Leaf_Forest the rules ˆ FLOWAC for Phaeozem>= 68 ˆdelineation FLOWDIR were= 2_E, overlearned 8_N, 9_NE ˆby DEM_md15 CART. < 156 m ˆ FLOWDIR = 2_E, 9_NE ˆ RIVDIST < 212,000 m ˆ DEM_md15 > 120 m ˆ 21_AL (Probability: 3/10). Six other sets of rules for the delineation of Fluvisols (21_AL) were not included in the text because of their large size (more than eight branches). These sets were applied to the total number of 61 test points. Description: EVTREE delineated most Fluvisols directly according to their elevations and distances to rivers weighted with slopes. These are the correct rules for Fluvisol delineation. The CART algorithm created many overlearned sets of rules that led to large areas of Fluvisols in the river terraces (Figure6). Delineations by Random Forest were not correct as large areas of Fluvisols were identified in the river terraces instead of the floodplains (Figure6). The rules for the delineation of Fluvisols obtained by EVTREE were much more direct than those obtained by CART. The decision trees that were created using the evolutionary learning of globally optimal trees (EVTREE) had much higher notional consistency than the trees created using the classification and regression tree algorithm (CART).

ISPRS Int. J. Geo-Inf. 2020, 9, 664 12 of 22 ISPRS Int. J. Geo-Inf. 2020, 9, x FOR PEER REVIEW 12 of 23

Figure 6. EVTREE quantitative decision tree №5 for the Chernozems soil zone in the Belgorod region (Soil Types:Types: 3_G23—Greyic 3_G23—Greyic Phaeozems Phaeozems Albic, Albic, 5_CHv—Voronic 5_CHv—Voronic Chernozems Chernozems Pachic, Pachic, 6_CHt—Voronic 6_CHt— ChernozemsVoronic Chernozems Pachic, 10_CHrc—LepticPachic, 10_CHrc—Leptic Chernozems, Chernozems, 20_CHtk—Leptic 20_CHtk—Leptic Voronic Voronic Chernozems Chernozems Pachic, 21_AL—UmbricPachic, 21_AL—Umbric Fluvisols Fl Oxyaquic,uvisols Oxyaquic, 22_GC—Gully 22_GC—Gully Complex, Comple 25_SAL—Haplicx, 25_SAL—Haplic Fluvisols, N: Fluvisols, number N: of thenumber points of withthe points the soil, with Err: the misclassiﬁcation soil, Err: misclassification rate). rate).

TheAnother trees example for the study of the area rules that for were soil delineation produced by is EVTREEexpressed had in 20–29text form conditions as follows: in comparison to 142–16221_AL—Alluvial conditions forMeadow/Umbric the trees that wereFluvisols produced Oxyaquic: by CART (Table4). The mean and minimum priorEVTREE_ probabilities№5: wereDEM similar< 203 m for^ RIVDIST both methods < 454,505 (Table m ^4 (1)). TheLNDCVR minimum = any prior except probabilities 2_Build-up_soils were relatedor 5_Coniferous_Forest to typical Chernozem ^ DEM_md15 soil in both < 160 the m EVTREE ^ RIVDIST and CART< 23,679 methods. m ^ 21_AL Thus, (Probability: the EVTREE 191/211); models are(2) muchLNDCVR more = compact,any except which 2_Build-up_soils facilitates their and interpretation. 5_Coniferous_Forest ^ DEM_md15 >= 160 m ^ FLOWAC < 380 pixels ^ 21_AL (Probability: 33/64); (3) LNDCVR = 2_Build-up_soils or 5_Coniferous_Forest ^ ^ RIVDISTTable 4. Quality < 8398 assessmentm ^ 21_AL parameters (Probability: of the 19/21). models. Mean prior Minimum Prior Mean Number Notional Soil ModelCART_№2: DEM < 171 m ^ RIVDIST < 9305 mStability ^ 21_AL (Expert) (Probability: 190/200);Comments (2) DEM (Expert) < 171 m Probability, % Probability, % of Conditions Consistency (Expert) ^ RIVDISTEVTREE >= 9305 62.2 m ^ LNDCVR 32.4 <> 1_Grass, 24 45_Mixed_Forest,++/ 3_Water,High 4_Leaf_Forest correct Fluvisols ^delineation FLOWAC − CART 60.8 27.2 155 +/ Medium overlearned for Fluvisols and Phaeozems −− >=Random 68 ^ FLOWDIRnot applied to= 2_E,not applied8_N, to ensemble9_NE ^ DEM_md15not < applied 156 tom ^ FLOWDIRnot applied to = 2_E, 9_NE ^ RIVDIST < overlearned for Fluvisols 212,000Forest 1 m ensemble^ DEM_md15 methods 1 > 120methods m1 ^ 21_AL (Probability:→∞ ensemble 3/10). methods 1Sixensemble other methods sets 1of rules for the delineation 1 of FluvisolsRandom (21_AL) Forest does were not represent not included a model as in a separate the te classificationxt because and of regressiontheir large tree, insize contrast (more to EVTREEthan eight and CART. branches). These sets were applied to the total number of 61 test points. TheDescription: soil maps EVTREE were generally delineated similar most but exhibited Fluvisols some directly local diaccordingfferences (Figureto their7). elevationsFluvisols wereand correctlydistances mappedto rivers usingweighted only with EVTREE slopes. as theyThese were are alwaysthe correct associated rules for with Fluvisol the rivers’ delineation. floodplains The andCART not algorithm the rivers’ created terraces many or overlearned watershed areas. sets of EVTREErules that selectedled to large the areas simplest of Fluvisols rules for in Fluvisols the river thatterraces gave (Figure the best 6). accuracy.Delineations For by soils Random with Forest complex were soil–landscape not correct as relationships, large areas of EVTREEFluvisols owereffers identified in the river terraces instead of the floodplains (Figure 6). The rules for the delineation of Fluvisols obtained by EVTREE were much more direct than those obtained by CART.

ISPRS Int. J. Geo-Inf. 2020, 9, 664 13 of 22 less accuracy than CART. However, its method of selecting the simplest rules is much closer to the ISPRS Int. J. Geo-Inf. 2020, 9, x FOR PEER REVIEW 14 of 23 traditional approach.

Figure 7. Sample set and soil maps that were created using the three methods: EVTREE, CART, andFigure Random 7. Sample Forest. set and soil maps that were created using the three methods: EVTREE, CART, and Random Forest. The overall accuracy of the soil map created via Random Forest using the full sample set was 87%The (Table overall5). Possibly, accuracy a of part the of soil the map remaining created 13%via Random could be Forest mapped using using the informationfull sample set about was previous87% (Table soil 5). formation Possibly, factors a part (soil of formationthe remain factorsing 13% such could as the be vegetation mapped andusing climate information that existed about a thousandprevious soil years formation ago could factors substantially (soil formation aﬀect the factor propertiess such as of the soils vegetation but are not and currently climate representedthat existed bya thousand covariates years in digital ago soil could mapping). substantially The probability affect the map properties created viaof thesoils Random but are Forest not algorithmcurrently demonstratedrepresented by moderate covariates quality, in digital as the soil mean mapping). probability The wasprobability 0.60 (Figure map8 ).created Despite via the the moderate Random qualityForest algorithm of the model, demonstrated Random moderate Forest correctly quality, predicted as the mean 87% probability of the independent was 0.60 (Figure test sample 8). Despite set. Thethe moderate CART probability quality of map the showed model, a Random high mean Forest probability correctly of morepredicted than 80%87% forof the large independent areas, while test the realsample prediction set. The accuracy CART probability was only 67%. map Asshowed in other a high studies, mean the probability Random Forest of more model than demonstrated 80% for large theareas, best while prediction the real accuracy prediction for accuracy the full sample was only set 67%. [20]. As in other studies, the Random Forest model demonstrated the best prediction accuracy for the full sample set [20].

Table 5. Accuracy of the soil maps.

Soil Model Overall Accuracy, % Producer’s Accuracy, % User’s Accuracy, % Kappa EVTREE 59 56 65 0.45 CART 67 56 63 0.55 Random Forest 87 81 91 0.83

ISPRS Int. J. Geo-Inf. 2020, 9, 664 14 of 22

Table 5. Accuracy of the soil maps.

Soil Model Overall Accuracy, % Producer’s Accuracy, % User’s Accuracy, % Kappa EVTREE 59 56 65 0.45 CART 67 56 63 0.55 Random Forest 87 81 91 0.83 ISPRS Int. J. Geo-Inf. 2020, 9, x FOR PEER REVIEW 15 of 23

Figure 8. Probability maps of the most likely soil class created using three methods: EVTREE (stochastic), Figure 8. Probability maps of the most likely soil class created using three methods: EVTREE CART (stochastic), and Random Forest. (stochastic), CART (stochastic), and Random Forest. The EVTREE probability map yielded moderate and high accuracy for watershed areas and floodplains, whereThe the EVTREE delineation probability of soils was map easier. yielded The lowest moderate probabilities and high for EVTREEaccuracy were for observedwatershed for areas watershed and floodplains,slopes, where where the delineation the delineation of Chernozem of soils soils was was easier. not clear. The Forlowest the Randomprobabilities Forest for approach, EVTREE the were low observedaccuracy on for the watershed probability slopes, map corresponded where the todelineati occasionallyon of satisfactoryChernozem prediction soils was accuracy, not clear. as measured For the Randomby the test Forest set, despite approach, using manythe low test accuracy points (approximately on the probability 2000 points map ).corresponded Therefore, only to the occasionally probability satisfactorymap created byprediction EVTREE wasaccuracy, adequate. as measured Notably, the by limited the testnumber set, ofdespite covariates using provided many an test advantage points (approximatelyto EVTREE, while 2000 the points). approach Therefore, using stochastic only th calculatione probability of themap probability created by map EVTREE was not was optimal. adequate. Notably,After the reducing limited the number size of of the covariates training sample provided setfrom an advantage 5001 to 3304 to pointsEVTREE, byreducing while the the approach squared usingareas withstochastic sample calculation points (2 kmof 2the), the probability accuracy remainedmap was not the sameoptimal. for the CART and EVTREE methods, but forAfter Random reducing Forest, the thesize accuracy of the training decreased sample substantially set from by5001 20–30% to 3304 (Table points6). Afterby reducing increasing the 2 squaredthe mean areas distance with betweensample points points (2 by km reducing), the accuracy the number remained of points the same from for 5001 the to CART 1785, and the EVTREE accuracy methods,of all three but methods for Random decreased Forest, proportionally. the accuracy Whendecreased increasing substantially the mean by distance20–30% (Table between 6). points After increasingby reducing the the mean size distance of the sample between set points to 1000 by points, reducing the the Random number Forest of points algorithm from failed5001 to to 1785, run, theand accuracy EVTREE andof all CART three yielded methods similar decreased accuracy propor to thattionally. used forWhen 1785 increasing points. For the the mean small distance sample betweenset, EVTREE points and by CART reducing have the an size advantage of the sample over Random set to 1000 Forest points, because the theyRandom can dealForest with algorithm a small failednumber to ofrun, cases. and EVTREE and CART yielded similar accuracy to that used for 1785 points. For the small sample set, EVTREE and CART have an advantage over Random Forest because they can deal with a small number of cases.

Table 6. Accuracy of the soil maps after reducing the size of the training sample set.

Soil Model Overall Accuracy, % Producer’s Accuracy, % User’s Accuracy, % Kappa EVTREE 59 49 61 0.45 CART 61 53 55 0.49 Random Forest 62 51 71 0.50

ISPRS Int. J. Geo-Inf. 2020, 9, 664 15 of 22

Table 6. Accuracy of the soil maps after reducing the size of the training sample set.

Soil Model Overall Accuracy, % Producer’s Accuracy, % User’s Accuracy, % Kappa EVTREE 59 49 61 0.45 CART 61 53 55 0.49 Random Forest 62 51 71 0.50 ISPRS Int. J. Geo-Inf. 2020, 9, x FOR PEER REVIEW 16 of 23 The decision trees obtained with the quantitative cartographic rules of soil unit delineation were analyzedThe decision to extract trees the realobtained geographical with the qualitative quantitative relationships cartographic between rules of the soil soil unit and delineation soil formation were factors,analyzed like to thoseextract that the were real geographical formulated via qualitativ traditionale relationships soil mapping between (Figure the9). soil and soil formation factors, like those that were formulated via traditional soil mapping (Figure 9).

FigureFigure 9.9. TheThe qualitativequalitative decisiondecision treetree withwith thethe numbersnumbers ofof rulesrules fromfrom thethe quantitativequantitative decisiondecision treetree (in(in thethe square—thesquare—the partpart ofof thethe treetree that that was was expertly expertly corrected). corrected).

TheThe obtainedobtained qualitative qualitative decision decision tree tree was was represented represented in thein the form form of separate of separate rule chainsrule chains for each for soil.each Thesoil. rule The chainsrule chains were were also split also intosplit environmental into environmental niches niches (Figure (Figure 10). 10).

ISPRS Int. J. Geo-Inf. 2020, 9, 664 16 of 22 ISPRS Int. J. Geo-Inf. 2020, 9, x FOR PEER REVIEW 17 of 23

Figure 10. Packaging of the quantitative cartographic rules of soil unit delineation with qualitative rules. Figure 10. Packaging of the quantitative cartographic rules of soil unit delineation with qualitative rules. Analysis of the soil delineation showed that Fluvisols (21_AL) and Gully complexes (22_GC) were not correctlyAnalysis mapped. of the soil This delineation occurred showed because that of the Fluvisols overly (21_AL) simplified and river Gully map, complexes which (22_GC) led to a rough were notriver correctly distance mapped. map. The This quality occurred of thebecause soil predictionsof the overly can simplified be improved river map, by corrections which led to of a the rough soil riverformation distance factors map. maps. The Consequently,quality of the wesoil decided predictions to rebuild can be the improved river distance by corrections map. We of rebuilt the soil the formationriver and riverfactors distance maps. Consequent maps usingly, a parameterwe decided of to river rebuild length the river equal distance to 100 m map. instead We rebuilt of 1000 the m. riverThis parameterand river distancerepresented maps the using minimal a parameter length of of ri aver river length on the equal map. to 100 Then m instead we reran of the1000 EVTREE m. This parameteralgorithm withrepresented the new the input minimal river distancelength of map. a river In theon resultingthe map. soil Then map, we the reran visual the diEVTREEfference algorithmbetween Fluvisols with the(21_AL) new input and river the Gullydistance complex map.(22_GC) In the resulting soils was soil excellent. map, the In visual the rebuilt difference tree, betweeninformation Fluvisols about (21_AL)Chernozems and calcareous the Gullysoils complex (10_CHrc) (22_GC) was soils added. was excellent. On the large-scale In the rebuilt soil maps, tree, informationthese soils were about mapped Chernozems in the calcareous areas of outcrops. soils (10_CHrc) Such areas was were added. recognized On the usinglarge-scale Landsat soil satellite maps, theseimages soils and were the supervised mapped in classification the areas of algorithmoutcrops. Such with theareas maximum were recognized likelihood using method. Landsat For teaching, satellite imagesthe original and soilthe samplesupervised set for classification only outcrop algorithm areas was used.with the Then, maximum the rule forlikelihood accurately method. delineating For teaching,Chernozems the calcareous original (10_CHrc)soil sample soils set for in the only areas outcrop of the areas outcrops was wasused. added Then, to the the rule classification for accurately tree delineating(Figure 11). Chernozems calcareous (10_CHrc) soils in the areas of the outcrops was added to the classification tree (Figure 11). Therefore, there were at least four ways to correct the soil map:

ISPRS Int. J. Geo-Inf. 2020, 9, x FOR PEER REVIEW 18 of 23

1. The addition of more accurate covariate maps to the analysis; 2. manual tree correction; a. change in qualitative rules; ISPRSb. Int.change J. Geo-Inf. in2020 quantitative, 9, 664 rules; 17 of 22 3. manual soil map unit correction.

Figure 11. Expertly corrected classification tree using three methods: modeling with reworked factor Figure 11. Expertly corrected classiﬁcation tree using three methods: modeling with reworked factor maps, the manual addition of expert rules, and expert correction using direct recognition of remote maps, the manual addition of expert rules, and expert correction using direct recognition of remote sensing data. sensing data.

AnTherefore, accuracy there assessment were at was least provided four ways for to the correct corrected the soil soil map: map using another independent sample set of 1269 soil pits. The overall accuracy before manual correction was 65%, with 76% 1. The addition of more accurate covariate maps to the analysis; accuracy after correction. Therefore, based on soil geography knowledge, the covariates for the modelling2. manual and treequantitative correction; models were corrected by an expert to provide higher accuracy and notionala. consistencychange inof qualitativethe soil maps. rules; b. change in quantitative rules; 4. Discussion 3. manual soil map unit correction. We obtained accurate soil maps automatically and made them much more accurate manually. The accuracyAn accuracy of the assessmentmanually corrected was provided soil map for can the correctedreach one soilhundred map usingpercent, another but this independent requires laborioussample set work. of 1269 The soil developed pits. The framework overall accuracy allowed before us manualto implement correction the work was 65%, of a with traditional 76% accuracy soil- mapperafter correction. using the Therefore,statistical method based on of soil evolutionary geography trees knowledge, and thethe understandable covariates for representation the modelling andof soilquantitative mapping rules. models This were statistical corrected method by an was expert first to used provide for digital higher soil accuracy mapping. and notionalThe advantages consistency of thisof themethod soil maps.over the CART method are the compactness and reasonability of the model. The easy- to-use representation of soil–landscape relationships—as a decision tree and chains of rules—is an 4. Discussion advantage of the proposed approach compared to structural equation modelling [12,21]. However, for mappingWe obtained continuous accurate soil soilproperties, maps automatically these two approaches and made can them be much used moretogether. accurate This manually.process involvesThe accuracy mappingof the soil manually units with corrected classification soil map trees can and reach then one splitting hundred them percent, into continuous but this requires soil propertylaborious maps work. (Figure The 12). developed framework allowed us to implement the work of a traditional soil-mapper using the statistical method of evolutionary trees and the understandable representation of soil mapping rules. This statistical method was ﬁrst used for digital soil mapping. The advantages of this

method over the CART method are the compactness and reasonability of the model. The easy-to-use representation of soil–landscape relationships—as a decision tree and chains of rules—is an advantage of the proposed approach compared to structural equation modelling [12,21]. However, for mapping ISPRS Int. J. Geo-Inf. 2020, 9, 664 18 of 22 continuous soil properties, these two approaches can be used together. This process involves mapping ISPRSsoil units Int. J.with Geo-Inf. classiﬁcation 2020, 9, x FOR trees PEER and REVIEW then splitting them into continuous soil property maps (Figure 19 of 12 23).

Figure 12. ExampleExample of of the the compliance compliance of of the the classification classiﬁcation trees and SEM.

This framework didiffersffers from from the the SoLIM SoLIM because because it aimsit aims at theat the development development of approaches of approaches that fullythat fullyimitate imitate the work the work of a traditionalof a traditional soil mapper.soil mapper. Thus, Th theus, representation the representation of the of classificationthe classification trees trees and andrule chainsrule chains was closelywas closely investigated. investigated. In our methodology,In our methodology, we used we mostly used traditional mostly traditional soil formation soil formationfactor maps, factor such maps, as geological such as parent geological materials parent and materials simple derivatives and simple of thederivatives digital elevation of the digital model, elevationrather than model, difficult rather indexes. than difficult We also indexes. used a traditional We also used representative a traditional form representative of soil units form instead of soil of unitsa fuzzy instead logic of approach. a fuzzy logic Our frameworkapproach. Our and framework the package and we the developed package providewe developed a traditional provide soil a traditionalmapping alternative soil mapping to SoLIM alternative [22,23 ].to Realization SoLIM [22,23]. of this Realization framework of was this strongly framework based was on machinestrongly basedlearning on techniquesmachine learning and the techniques advantages and of the the R adva statisticalntages language, of the R whichstatistical make language, it congruent which with make the itSoilGRIDs congruent framework with the SoilGRIDs [7]. framework [7]. The proposed forms of the relationships between the soil and soil formation factors, classification trees, and rule chains appear to be very convenient for interpretation. Qualitatively describing the quantitative rules was also simple, despite some subjective factors due to the expert nature of the analysis. Using mostly traditional soil formation factor maps helped us interpret the quantitative model and easily obtain the qualitative model. The obtained qualitative classification

ISPRS Int. J. Geo-Inf. 2020, 9, 664 19 of 22

The proposed forms of the relationships between the soil and soil formation factors, classification trees, and rule chains appear to be very convenient for interpretation. Qualitatively describing the quantitative rules was also simple, despite some subjective factors due to the expert nature of the analysis. Using mostly traditional soil formation factor maps helped us interpret the quantitative model and easily obtain the qualitative model. The obtained qualitative classification tree allowed us to visualize the complex environmental relationships for the study area in a form that is understandable for traditional natural scientists, soil scientists, geologists, ecologists, environmentalists, and others. This form is useful for environmental analysis, such as determining the reasons to change soils over time and defining the environmental state of the study area. This is a traditional type of analysis that is neglected in most other DSM methodologies. The soil mapping chains of the rule form were convenient for making the soil map and were manually corrected. This simple form can be programmed to make soil maps in most GIS programs. The novelty (associated with the chains of the rules) of our methodology compared to SoLIM is that our method uses groups based on environmental niches that are important for investigating the underlying reasons for soil geography. A considerable problem with our framework is the laborious work associated with the process of translating the relationships from the form of classification trees to rule chains. This process, however, can be automated and allowed us to collect important expert soil geography information. Below we summarize the problems that we aimed to solve with the proposed framework.

1. The laborious manual work associated with producing two forms of soil rules as a general classification tree and soil-specific rule chains. A possible solution is using a package with an interface that allows one to make adjustments as easily as possible. Moreover, this method should permit automatic transformations between classification trees and chains of rules. 2. The significant changes in the classification trees when using updated covariate maps. This can be overcome by using more sophisticated statistical methods of finding classification rules and incorporating prior soil knowledge into the search for rules. 3. The mapping of soil classes when we need to map continuous soil properties or soil functions. Compliance with other methodologies provides a solution for this problem. In that case the proposed framework offers an interface with legacy soil maps. 4. Statistical improvements of the classification tree algorithm’s accuracy.

5. Conclusions 1. The overall accuracy of the soil maps in the study area was 59%, 67%, and 87% for EVTREE, CART, and Random Forest, respectively. The prediction accuracy of EVTREE for the study area was slightly lower than that for CART and substantially lower than that for Random Forest. After reducing the size of the sample set from 5001 to 1785 points, the accuracy remained the same for EVTREE (59%), decreased for CART (61%), and decreased substantially for Random Forest (62%). With a sample set of 1000 points, the Random Forest algorithm failed to run, and EVTREE and CART realized similar accuracy to 1785 points. The reduced sample sets covered all soils and most soil formation factor features of the original entire sample set. Therefore, the quality of soil classiﬁcation under EVTREE was much less dependent on the size of the sample set. 2. The mean size of the trees in the study area constructed by EVTREE included 24 conditions, compared to the 154 conditions for the trees constructed by CART. The compact decision trees that were constructed by EVTREE were convenient for both quantitative quality assessment and expert-based notional tracing. 3. The soil maps that were created using EVTREE had much higher notional consistency than the maps created using CART. The rules formulated by EVTREE for the delineation of Fluvisols and Phaeozems were much more direct than those formulated by CART. The direct models constructed by EVTREE for soil mapping are suitable for full or partial extrapolation to distant territories in the world with similar environmental conditions. ISPRS Int. J. Geo-Inf. 2020, 9, 664 20 of 22

4. The R package is required for the manual “constparty” decision tree model corrections, which can correct conditions and labels. The features for selecting tree branches and linking the tree with the associated soil map for the online monitoring of changes will be useful. 5. We developed a digital soil mapping framework that imitates the process of traditional soil mapping and incorporates qualitative knowledge about soil geography into the quantitative rules for soil delineation. This model in the form of a decision tree can be expertly analyzed, and improvements can be made to the covariate list and notional consistency of the model. 6. Challenges include the development of an easy-to-use and fast visual tree editor, automatic transformation of the classiﬁcation trees to soil-speciﬁc rule chains, automatic statistical incorporation of prior soil knowledge in the process of tree building, and compliance with other methods for mapping continuous soil properties and soil functions.

Author Contributions: Conceptualization, Igor Savin; methodology, Arseniy Zhogolev and Igor Savin; software, Arseniy Zhogolev; validation, Arseniy Zhogolev; formal analysis, Arseniy Zhogolev; investigation, Arseniy Zhogolev; resources, Arseniy Zhogolev and Igor Savin; data curation, Arseniy Zhogolev and Igor Savin; writing—original draft preparation, Arseniy Zhogolev and Igor Savin; writing—review and editing, Igor Savin; visualization, Arseniy Zhogolev; funding acquisition, Igor Savin. All authors have read and agreed to the published version of the manuscript. Funding: This research was funded bythe support of the “RUDN University Program 5-100” and the RFBR grant N 19-05-50063, as well as by the support of the Ministry of Science and Higher Education of Russian Federation (Agreement No. 075-15-2020-805). Acknowledgments: We thank Praskovya Grubina for help with preparing the sample set for modelling. Conﬂicts of Interest: The authors declare no conﬂict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

1. Arrouays, D.; McBratney,A.B.; Minasny,B.; Hempel, J.W.; Heuvelink, G.B.M.; MacMillan, R.A.; McKenzie, N.J. The GlobalSoilMap Project Speciﬁcations. In GlobalSoilMap: Basis of the Global Spatial Soil Information System; CRC Press: Boca Raton, FL, USA, 2014. 2. Arrouays, D.; Richer-de-Forges, A.C.; McBratney, A.B.; Hartemink, A.E.; Minasny, B.; Savin, I.; Lagacherie, P. The GlobalSoilMap project: Past, present, future, and national examples from France. Dokuchaev Soil Bull. 2018, 95, 3–23. [CrossRef] 3. Arrouays, D.; Savin, I.; Leenaars, J.; McBratney, A.B. (Eds.) GlobalSoilMap—Digital Soil Mapping from Country to Globe; CRC Press: London, UK, 2017. 4. Lagacherie, P.; McBratney, A.B. Chapter 1. Spatial soil information systems and spatial soil inference systems: Perspectives for digital soil mapping: Review article. Digital Soil Mapping. An introductory perspective. Dev. Soil Sci. 2007, 31, 137–150. 5. Arrouays, D.; Leenaars, J.G.; Richer-de-Forges, A.C.; Adhikari, K.; Ballabio, C.; Greve, M.; Heuvelink, G. Soil legacy data rescue via GlobalSoilMap and other international and national initiatives. GeoResJ 2017, 14, 1–19. [CrossRef] 6. Grunwald, S.; Thompson, J.A.; Boettinger, J.L. Digital soil mapping and modeling at continental scales: Finding solutions for global issues. Soil Sci. Soc. Am. J. 2011, 75, 1201–1213. [CrossRef] 7. Hengl, T.; de Jesus, J.M.; Heuvelink, G.B.; Gonzalez, M.R.; Kilibarda, M.; Blagoti´c,A.; Guevara, M.A. SoilGrids250m: Global gridded soil information based on machine learning. PLoS ONE 2017, 12, e0169748. [CrossRef] 8. Lagacherie, P. Digital soil mapping: To state-of-the-art. In Digital Soil Mapping with Limited Data; Springer: Dordrecht, The Netherlands, 2008; pp. 3–14. 9. McBratney, A.B.; de Gruijter, J.; Bryce, A. Pedometrics Timeline. Geoderma 2019, 338, 568–575. [CrossRef] 10. Mulder, V.L.; De Bruin, S.; Schaepman, M.E.; Mayr, T.R. The use of remote sensing in soil and terrain mapping—A review. Geoderma 2011, 162, 1–19. [CrossRef] ISPRS Int. J. Geo-Inf. 2020, 9, 664 21 of 22

11. Qi, F.; Zhu, A.X. Knowledge discovery from soil maps using inductive learning. Int. J. Geogr. Inf. Sci. 2003, 17, 771–795. [CrossRef] 12. Angelini, M.E.; Heuvelink, G.B.M.; Kempen, B.; Morrás, H.J.M. Mapping the soils of an Argentine Pampas region using structural equation modeling. Geoderma 2016, 281, 102–118. [CrossRef] 13. Bui, E.N.; Henderson, B.L.; Viergever, K. Knowledge discovery from models of soil properties developed through data mining. Ecol. Modeling 2006, 191, 431–446. [CrossRef] 14. McBratney, A.B.; Santos, M.M.; Minasny, B. On digital soil mapping. Geoderma 2003, 117, 3–52. [CrossRef] 15. Zhogolev, A.V. A comparison of SoilGRIDs with the Disaggregated State Soil Map of Russia. In GlobalSoilMap—Digital Soil Mapping from Country to Globe; CRC press: London, UK, 2017; pp. 49–52. 16. Odgers, N.P.; Sun, W.; McBratney, A.B.; Minasny, B.; Clifford, D. Disaggregating and harmonising soil map units through resampled classification trees. Geoderma 2014, 214, 91–100. [CrossRef] 17. Pásztor, L.; Laborczi, A.; Takács, K.; Szatmári, G.; Bakacsi, Z.; Szabó, J. DOSoReMI as the National Implementation of GlobalSoilMap for the Territory of Hungary. In GlobalSoilMap—Digital Soil Mapping from Country to Globe; CRC Press: London, UK, 2017; pp. 17–22. 18. Caubet, M.; Dobarco, M.R.; Arrouays, D.; Minasny, B.; Saby, N.P.A. Merging country, continental and global predictions of soil texture: Lessons from ensemble modeling in France. Geoderma 2019, 337, 99–110. [CrossRef] 19. Chinilin, A.V.; Savin, I.Y. The large-scale digital mapping of soil organic carbon using machine learning algorithms. Dokuchaev Soil Bull. 2018, 91, 46–62. [CrossRef] 20. Hengl, T.; Nussbaum, M.; Wright, M.N.; Heuvelink, G.B.M.; Gräler, B. Random Forest as a generic framework for predictive modelling of spatial and spatio-temporal variables. PeerJ 2018, 6, e5518. [CrossRef][PubMed] 21. Angelini, M.E.; Heuvelink, G.B.M.; Kempen, B. Multivariate mapping of soil with structural equation modeling. Eur. J. Soil Sci. 2017, 68, 575–591. [CrossRef] 22. Zhu, A.; Cheng-Zhi, Q.; Peng, L.; Fei, D. Digital soil mapping for smart agriculture: The SoLIM method and software platforms. Rudn J. Agron. Anim. Ind. 2018, 13, 317–335. [CrossRef] 23. Zhu, A.X.; Band, L.E.; Dutton, B.; Nimlos, T.J. Automated soil inference under fuzzy logic. Ecol. Modeling 1996, 90, 123–145. [CrossRef] 24. Qi, F.; Zhu, A.X. Comparing three methods for modeling the uncertainty in knowledge discovery from area-class soil maps. Comput. Geosci. 2011, 37, 1425–1436. [CrossRef] 25. Zhogolev, A.V.; Savin, I.Y. Automated updating of medium-scale soil maps. Eurasian Soil Sci. 2016, 49, 1241–1249. [CrossRef] 26. Bui, E.N. Data-driven Critical Zone science: A new paradigm. Sci. Total Environ. 2016, 568, 587–593. [CrossRef][PubMed] 27. Ma, Y.X.; Minasny, B.; Malone, B.P.; McBratney, A.B. Pedology and Digital Soil Mapping (DSM). Eur. J. Soil Sci. 2019, 70, 216–235. [CrossRef] 28. Padarian, J.; Morris, J.; Minasny, B.; McBratney, A.B. Pedotransfer Functions and Soil Inference Systems; Springer: Berlin/Heidelberg, Germany, 2018; pp. 195–220. 29. Lagacherie, P. Formalisation des Lois de Distribution des Sols Pour Automatiser la Cartographie Pédologique à Partir d’un Secteur Pris Comme Reference. Master’s Thesis, Université de Montpellier, Institut National de la Recherche Agronomique, Montpellier, France, 1992. 30. Therneau, T.M.; Atkinson, E.J. An Introduction to Recursive Partitioning Using the RPART Routines. 2018. Available online: https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf (accessed on 28 February 2019). 31. Grubinger, T.; Zeileis, A.; Pfeiffer, K.P. Evtree: Evolutionary Learning of Globally Optimal Classification and Regression Trees in R. J. Stat. Softw. 2014, 61, 1–29. [CrossRef] 32. Balestriero, R.; Baraniuk, R. A spline theory of deep learning. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 374–383. 33. Chimatapu, R.; Hagras, H.; Starkey, A.; Owusu, G. Explainable AI and Fuzzy Logic Systems. In International Conference on Theory and Practice of Natural Computing; Springer: Berlin/Heidelberg, Germany, 2018; pp. 3–20. 34. Savin, I.Y. Computer-Based Imitation of Soil Mapping. Digital Soil Mapping: Theoretical and Experimental Investigations; V.V. Dokuchaev Soil Science Institute: Moscow, Russia, 2012; pp. 26–34. (In Russian) 35. Egorov, V.V.; Fridland, V.M.; Ivanova, E.N.; Rozov, N.N. Classification and Diagnostics of Soils of the Soviet Union; Kolos: Moscow, Russia, 1977. (In Russian) ISPRS Int. J. Geo-Inf. 2020, 9, 664 22 of 22

36. Shishov, L.L.; Tonkonogov, V.D.; Lebedeva, I.I.; Gerasimova, V.I. (Eds.) Classification and Diagnostic System of Russian Soils; Oikumena: Smolensk, Russia, 2004. (In Russian) 37. IUSS Working Group WRB. World Reference Base for Soil Resources 2006; World Soil Resources Reports 103; FAO: Rome, Italy, 2006. 38. Ilyina, L.P.; Mihailova, R.P.; Simakova, M.S.; Shubina, I.G. Compilation of the Regional Medium-Scale Soil Maps with Representation of Patterns of Soil Cover; V.V. Dokuchaev Soil Science Institute: Moscow, Russia, 1990. 39. Soil Survey Staff. Soil Taxonomy: A Basic System of Soil Classification for Making and Interpreting Soil Surveys, 2nd ed.; Agriculture Handbook, 436; Natural Resources Conservation Service: Washington, DC, USA; U.S. Department of Agriculture: Washington, DC, USA, 1999. 40. Zinck, J.A. Physiography and Soils. In Soil Survey Courses. International Institute for Aerospace and Earth Sciences; ITC: Enschede, The Netherlands, 1988. 41. Hengl, T. Finding the right pixel size. Comput. Geosci. 2006, 32, 1283–1298. [CrossRef] 42. Xiong, L.Y.; Zhu, A.X.; Zhang, L.; Tang, G.A. Drainage basin object-based method for regional-scale landform classification: A case study of loess area in China. Phys. Geogr. 2018, 39, 523–541. [CrossRef] 43. Jarvis, A.; Reuter, H.I.; Nelson, A.; Guevara, E. Hole-filled SRTM for the Globe Version 4. Available online: http://srtm.csi.cgiar.org (accessed on 28 February 2019). 44. Qin, C.Z.; Zhu, A.X.; Pei, T.; Li, B.L.; Scholten, T.; Behrens, T.; Zhou, C.H. An approach to computing topographic wetness index based on maximum downslope gradient. Precis. Agric. 2011, 12, 32–43. [CrossRef] 45. Hengl, T.; Maathuis, B.H.P.; Wang, L. Geomorphometry in ILWIS. Dev. Soil Sci. 2009, 33, 309–331. 46. Hothorn, T.; Zeileis, A. Partykit: A modular toolkit for recursive partytioning in R. J. Mach. Learn. Res. 2015, 16, 3905–3909. 47. Breiman, L.; Friedman, J.; Olshen, R.; Stone, C. Classification and regression trees. Wadsworth Int. Group 1984, 37, 237–251. 48. Rossiter, D.G.; Zeng, R.; Zhang, G.L. Accounting for taxonomic distance in accuracy assessment of soil class predictions. Geoderma 2017, 292, 118–127. [CrossRef]

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional aﬃliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).