The application of digital soil mapping (DSM) techniques has expanded considerably because they can save much of the time and cost of collecting and analyzing soil data compared to conventional methods. This research assesses the potential for mapping soil types in a northern region of Vietnam by comparing two DSM methods: Multinomial Logistic Regression (MLR) and Artificial Neural Networks (ANNs).

Eight predictive variables were derived from ancillary data: land use, altitude, slope, NDVI, PVI, RVI, the Topographic Wetness Index and the SAGA Wetness Index. MLR and ANN models were constructed to predict soil classes at two levels: the WRB Reference Soil Group and an intermediate level between the Reference Soil Group and the full WRB soil name. Map quality was expressed as the soil map purity, estimated with an independent validation dataset. Diversity indices were calculated to assess the information content of the resulting maps. The best model was selected on the basis of the soil map purity, Shannon’s entropy and a combined index.

At both taxonomic levels, MLR yields higher map purity than ANNs. When the taxonomic level changes from the Reference Soil Group level to the intermediate level, map purity decreases while the diversity indices increase. Soil mapping with MLR is therefore more efficient when predicting Reference Soil Groups. At the intermediate level, however, the model predicts a more diverse soil map, and thus the informative value estimated by the combined index is higher.



The application of digital soil mapping (DSM) methods has increased strongly relative to more conventional methods because they save time and costs in the collection and analysis of soil data at the point scale. This research estimates the potential of two DSM methods for mapping soil classes in a region in the north of Vietnam: Multinomial Logistic Regression (MLR) and Artificial Neural Networks (ANNs).

Eight predictive variables, derived from ancillary data, were used to map soil classes at the level of the WRB Reference Soil Group and at a level between the Reference Soil Group and the full WRB name. These predictive variables comprise land use, terrain elevation, slope, NDVI, PVI, RVI, the ‘Topographic Wetness Index’ and the ‘SAGA Wetness Index’. Map quality was indicated by the map purity, which was estimated with an independent dataset. Diversity indices were calculated to estimate the information content of the resulting maps. The selection of the best model is based on the map purity, Shannon’s entropy and a combined index.

At both taxonomic levels, MLR gives a higher map purity than ANNs. When the taxonomic level changes from Reference Soil Group to the intermediate level, the map purity decreases while the value of the diversity indices increases. Soil mapping that uses MLR to map Reference Soil Groups will therefore be more efficient. At the intermediate level, however, the model predicts a higher diversity in the soil map, and the information content as estimated by the combined index is therefore higher.



Soil remains one of the most important, yet most abused, natural resources on the planet; indeed, responsible management of soil resources plays a critical role in the survival and prosperity of many nations around the world (White, 2005). Soil is limited in quantity and degradable in quality. It is an irreplaceable capital good in all human productive activities and plays a central role in the natural environment.

The understanding of soil properties and behavior strongly supports sustainable . During the last decade, increasing attention has been paid to the soil resource in order to understand the internal mechanisms that define its nature as well as its relationship with other environmental factors. One of the most helpful and functional tools for studying soil is soil mapping. Many countries have made maps of their soils to determine the range of soil types in their territory, where they occur and how they can be used efficiently. Soil mapping combines locating and identifying the different soil types by collecting information about their location, properties and potential use, and recording this information on maps and in all supporting documents.

Modern users of soil geo-information require maps at detailed scales. Technological and theoretical advances in the last 20 years have led to a number of methodological improvements in the field of soil mapping. Most of these belong to the domain of an emerging discipline, pedometrics, which concerns the quantitative, (geo)statistical production of soil geo-information. Pedometrics is strongly focused on predictive or digital soil mapping (DSM). DSM embraces a set of quantitative mapping methods that have developed from more traditional soil mapping techniques. Various case studies have demonstrated the application of DSM methods in mapping soil properties and classes, updating soil attribute maps or mapping soil features (Carré and Girard, 2002; Jafari et al., 2012; Kempen et al., 2009; Yang et al., 2011). Because traditional methodologies are costly and time-consuming, the use of DSM methods has increased and has resulted in improvements in and classification steps, also allowing the results to be applied in other similar landscapes (Resende, 2000).

Vietnam has a total land area of 32,924,064 ha (9,345,346 ha of agricultural land, 11,575,429 ha of forestry land, 1,532,843 ha of special-use land, 443,147 ha of residential land and 10,027,265 ha of unused land). Vietnam is a developing country in which agriculture plays a key role in the economy, alongside the rapid development of industry and services. Together these put great pressure on land resources. The government attaches great importance to the appropriate management and use of land to serve the needs of production and people’s lives on the basis of sustainable development.

In Vietnam, the national map of soil types at a scale of 1:1,000,000 was published by the Vietnamese Soil Science Association in 1996 using the FAO classification (FAO/ISRIC/CSIC, 1988). Although soil mapping has made some progress, conventional soil maps produced in past decades remain the major source of information on the spatial variation of soil. They are limited in both spatial detail and the accuracy of soil attributes, and they are costly and time-consuming to produce.



2.1. Objectives

The overall objective of this project is to propose a digital soil mapping method which is capable of mapping soil types of Vietnam in more detail and requires lower costs for soil survey. To do this, the following tasks were identified:

1. Apply two chosen methods: Multinomial Logistic Regression and Artificial Neural Networks on mapping soil types at 2 levels: WRB-Reference Soil Group and intermediate level of reference soil group with respect to the in the test area, respectively.

2. Validate the resultant maps using independent validation data.

3. Select the Digital Soil Mapping method based on the comparison of the two methods in the test area at the 2 classification levels by evaluation of the quality (taxonomy purity) and the information content of the resulting maps.
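As a rough illustration of the two evaluation criteria named in task 3, the sketch below computes map purity and Shannon's entropy in Python. It assumes that purity is the share of validation points whose predicted class matches the observed class, and that entropy is computed with natural logarithms from the class proportions of the mapped pixels. The function names and the sample labels are hypothetical and are not data from this study.

```python
import math
from collections import Counter


def map_purity(observed, predicted):
    """Fraction of validation points whose predicted class equals the observed class."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(o == p for o, p in zip(observed, predicted))
    return hits / len(observed)


def shannon_entropy(mapped_classes):
    """Shannon entropy H = -sum(p_k * ln(p_k)) over the class proportions p_k."""
    counts = Counter(mapped_classes)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# Hypothetical validation labels and mapped pixels (assumed for illustration).
observed = ["AC", "AC", "FL", "GL", "FL"]
predicted = ["AC", "FL", "FL", "GL", "FL"]
purity = map_purity(observed, predicted)
entropy = shannon_entropy(["AC", "AC", "FL", "FL"])
```

A map with a single class has zero entropy under this definition, while a map with more, evenly represented classes has higher entropy, which is why purity and diversity can pull in opposite directions when the taxonomic level becomes more detailed.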

2.2. Hypothesis

It is possible to sample a reference area including most of the soil types of a region. Based on this area, the prediction of soil distribution in other areas may be facilitated if there are enough data observations and other ancillary data.



3.1. Map of soil types

3.1.1. Soil classification

Classification of soils means finding common properties or behavior among individual soil profiles in order to form meaningful classes that help us organize our knowledge and simplify decision-making in soil management. Soil profiles are classified by grouping them into classes, for example soil series. These classes can in turn be grouped into still more general classes, e.g. Reference Soil Groups. This is a hierarchical classification, which is common in soil science.

Why do we classify soil?

Classification is an essential part of the data reduction process, whereby complex sets of observations are made understandable. Another obvious reason for classifying is to save time and simplify description. If many soil types have three or four properties in common, it is sensible to use one short name for them all, so that they are easier to remember and the relationships among them are easier to define. Classification studies the number and composition of the groups in a set of data, which allows the human mind to recall information and relate entities and attributes to each other. One important function of soil classification is to accommodate the apparent individuals in the most satisfactory manner so as to permit the compilation of legible and meaningful soil maps. It also facilitates the prediction of unknown soil types, based on the observed property ranges and the factors that govern soil formation. Finally, soil classification enables a concise description of the spatial variation of soil as a three-dimensional multivariate system.

Soil classification and soil maps are important basic documents for soil survey, soil evaluation, land management, land-use planning and agricultural planning. Depending on the characteristics and nature of each soil type, managers can allocate appropriate land uses economically and sustainably.

The development of many soil classification systems all over the world reflects different views on the concepts of soil formation and mirrors differences of opinion about the criteria to be used for classification. The two most important international scientific soil classification systems that are still being developed and maintained are both diagnostic systems, and hence are based on the absence or presence of diagnostic properties:

* USDA (United States Department of Agriculture): Soil Taxonomy.

* FAO (Food and Agriculture Organization) of the United Nations and UNESCO: World Reference Base for Soil Resources.

In these systems, diagnostic properties of the soil are derived from the subdivision of the soil profile into horizons and from the soil properties of each of these horizons. The (hierarchical) classification is done, as in other determination systems (flora, fauna), by means of determination keys (Finke, 2011).


In this project, the World Reference Base (WRB) was selected as the basis for the correlation of soil units in Vietnam. WRB is designed for such a purpose, to serve “as an easy means of communication amongst scientists to identify, characterize and name major types of soils, to be a tool for better correlation between national systems, to act as common denominator through which national system can be compared” (WRB, 2006).

3.1.2. World Reference Base for Soil Classification.

WRB is the international standard soil classification system sanctioned by the International Union of Soil Science (IUSS). It was developed by an international collaboration coordinated by the International Soil Reference and Information Center (ISRIC) and sponsored by the IUSS and the FAO. It replaces the FAO Legend for the Soil Map of the World.

Classification principles:

Stepwise, classification of a soil in WRB proceeds as follows:

1. and horizonation is described according to the FAO guidelines for soil description (FAO, 2006). Colors are recorded using the Munsell color charts (KIC, 1990). Chemical and physical characteristics are determined according to the Procedures for Soil Analysis (Van Reeuwijk and Houba, 1998).

2. Diagnostic horizons, properties and materials are inferred.

3. The classification itself is a 2-tier approach:

a. The first level is the classification into the Reference Soil Group (RSG), using the classification key. In a specified, mandatory order, each RSG is tested against the identified diagnostic horizons, properties and materials. The first RSG that gives a fully positive test result is the classified RSG.

b. At the second level, the RSG is further specified using prefix and suffix qualifiers. Prefix qualifiers, RSG name and suffix qualifiers together, in a prescribed order, make the full taxonomic name. Each RSG has a unique set of eligible prefix and suffix qualifiers. The selection of a qualifier is again based on diagnostic horizons, properties and materials (Finke, 2011).
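The first tier of this procedure is essentially a first-match search through an ordered key. The Python sketch below illustrates only that logic. The key shown is a hypothetical, heavily simplified stand-in: each RSG is reduced to a single required diagnostic, whereas the real WRB key covers all RSGs and uses far more detailed criteria. The names `SIMPLIFIED_KEY` and `classify_rsg` are assumptions made for this example.

```python
# Hypothetical, heavily simplified key (assumption for illustration only).
# The order of entries matters: groups are tested from top to bottom.
SIMPLIFIED_KEY = [
    ("Histosols", {"organic material"}),
    ("Vertisols", {"vertic horizon"}),
    ("Ferralsols", {"ferralic horizon"}),
    ("Cambisols", {"cambic horizon"}),
]


def classify_rsg(diagnostics, key=SIMPLIFIED_KEY, fallback="Regosols"):
    """Return the first RSG in the key whose required diagnostics are all present."""
    for rsg, required in key:
        if required <= diagnostics:  # subset test: every required diagnostic was inferred
            return rsg
    return fallback  # no group in the simplified key tested positive


# A profile with both a ferralic and a cambic horizon keys out at the first
# group that tests positive, which is the earlier entry in the key.
example = classify_rsg({"ferralic horizon", "cambic horizon"})
```

Because the order is mandatory, a profile that satisfies the tests of several groups is always assigned to the one that appears first in the key.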

Diagnostic horizons, properties and materials

Table 1 gives an overview of the diagnostic horizons, properties and materials used for classification (WRB, 2006). These diagnostics are inferred from the soil description and laboratory measurements by applying logical operators (AND, EITHER, OR) to diagnostic criteria based on profile data and measurements.


Table 1. Diagnostic horizons, properties and materials for classification into World Reference Base (WRB, 2006)

Diagnostic horizons: Albic, Anthraquic, Anthric, Argic, Calcic, Cambic, Cryic, Duric, Ferralic, Ferric, Folic, Fragic, Fulvic, Gypsic, Hydragric, Irragric, Melanic, Mollic, Natric, Nitic, Petrocalcic, Petroduric, Petrogypsic, Petroplinthic, Pisoplinthic, Plaggic, Plinthic, Salic, Sombric, Spodic, Takyric, Terric, Thionic, Umbric, Vertic, Voronic, Yermic

Diagnostic properties: Abrupt textural change, Albeluvic tonguing, Andic properties, Aridic properties, Continuous rock, Ferralic properties, Geric properties, Gleyic colour pattern, Lithological discontinuity, Reducing conditions, Secondary carbonates, Stagnic colour pattern, Vertic properties, Vitric properties

Diagnostic materials: Artefacts, Calcaric material, Colluvic material, Fluvic material, Gypsiric material, Limnic material, Mineral material, Organic material, Ornithogenic material, Sulphidic material, Technic hard rock, Tephric material

Elements for lower level units

Soil subunits can be identified in WRB at the second level of classification. At this level, so-called qualifiers are added to the RSG name. Each RSG has a unique list of qualifiers that may or may not be selected, based on the presence of diagnostic horizons, properties or materials. There are two groups of qualifiers:

1. Prefix qualifiers: these qualifiers describe properties of the RSG that are either:

a. Typically associated with the RSG;

b. Intergrades to other RSGs;

c. The haplic prefix qualifier is only used when no typically associated or intergrade qualifiers apply.

2. Suffix qualifiers: these qualifiers give additional information on the RSG and are related to either:

a. Diagnostic horizons, properties or materials,

b. Chemical properties,

c. Physical characteristics,

d. Mineralogical characteristics,

e. Surface characteristics,

f. Texture,

g. Colour,

h. Other characteristics.
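The way qualifiers combine with the RSG name can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the official naming procedure: it assumes the WRB 2006 convention of writing suffix qualifiers in parentheses after the RSG name, it takes the qualifiers in the order supplied rather than enforcing the prescribed order, and it does not check whether a qualifier is eligible for the given RSG. The function name `wrb_full_name` and the qualifier combinations are hypothetical.

```python
def wrb_full_name(rsg, prefixes=(), suffixes=()):
    """Assemble 'prefix qualifiers + RSG name + (suffix qualifiers)'."""
    # The haplic prefix is used only when no other prefix qualifier applies.
    prefix_part = " ".join(prefixes) if prefixes else "Haplic"
    name = f"{prefix_part} {rsg}"
    if suffixes:
        # Assumed convention: suffix qualifiers in parentheses after the RSG name.
        name += " (" + ", ".join(suffixes) + ")"
    return name


# Illustrative qualifier choices, not checked against the WRB qualifier lists.
with_qualifiers = wrb_full_name("Acrisol", prefixes=["Plinthic"], suffixes=["Humic"])
without_qualifiers = wrb_full_name("Acrisol")
```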

3.1.3. Major soil types in Vietnam

Vietnam has a total land area of 32,924,064 ha and had a population of 90,549,390 in 2011, with a growth rate of 1.02%. Vietnam can be divided into four physiographic regions: the Annamese Cordillera extending from north to south through west-central Vietnam, the Red River delta in the north, the Mekong River delta in the south, and the coastal plain in the east. The extremely rugged and densely forested Cordillera, a southward extension of the Yunnan Plateau, covers about two-thirds of the country. Parallel northwest-southeast ranges with several peaks rising to more than 1,800 meters dominate the northern half, and a series of heavily eroded longitudinal plateaus, averaging 750 to 1,500 meters in elevation, extends into the southern half.

According to the reports of the national project “Mapping soil types of Vietnam using the classification of World Reference Base for Soil Resources of FAO”, Vietnam has 21 soil groups with 61 soil units (Table 2). For easier evaluation, however, these soils can be grouped into two broad categories:

- Mountainous and hilly soils: Most are , Ferralsols or . Under annual cropping without reasonable improvement measures, the soil degrades rapidly. The mountainous and hilly soils should be reserved for forestation and for the cultivation of perennial crops and fruit crops with appropriate protection measures.

- Delta soils: The centers of food production are mainly the deltas of the Red River, the Mekong River and other rivers. These are regions with high levels of intensive cultivation and cropping intensity. With irrigation, moisture is sufficient and the rate of soil degradation is low; alluvial deposits bring fertility annually, and this is often augmented by organic and mineral fertilizers (Mui, 2006).


Table 2. Classification of Vietnam’s soil types according to FAO-UNESCO

No      Symbol  Name (FAO-UNESCO)
I       AR
1       ARl     Luvic Arenosols
2       ARr     Rhodic Arenosols
3       ARh     Haplic Arenosols
4       ARb     Cambic Arenosols
5       ARa     Albic Arenosols
6       ARg     Gleyic Arenosols
7       ARo     Ferralic Arenosols
II      SC
8       SCg     Gleyic Solonchaks
9       SCh     Haplic Solonchaks
10      SCm     Mollic Solonchaks
III     FLt     Thionic Fluvisols
11      GLtp    Proto-Thionic Gleysols
12      FLto    Orthi-Thionic Fluvisols
IV      FL
13      FLe     Eutric Fluvisols
14      FLd     Dystric Fluvisols
15      FLg     Gley Fluvisols
16      FLu     Umbric Fluvisols
17      FLb     Cambic Fluvisols
V       GL
18      GLe     Eutric Gleysol
19      GLd     Dystric Gleysol
20      GLu     Umbric Gleysol
VI      HS
21      HSf     Fibric Histosol
22      HSt     Thionic Histosol
VII     SN
23      SNh     Haplic Solonetz
24      SNg     Gleyic Solonetz
VIII    CM
25      CMe     Eutric Cambisols
26      CMd     Dystric Cambisols
IX      AN
27      ANh     Haplic Andosols
28      ANm     Mollic Andosols
X       LV      Luvisols
29      LVf     Ferric Luvisols
30      LVg     Gleyic Luvisols
31      LVk     Calcic Luvisols
32      LVx     Chromic Luvisols
33      LVq     Lithic Luvisols
XI      VR
34      VRe     Eutric Vertisols
35      VRd     Dystric Vertisols
XII     LX
36      LXh     Haplic Lixisols
37      LXx     Chromic Lixisols
38      LXh     Haplic Luvisols
XIII    CL
39      CLh     Haplic Calcisols
40      CLl     Luvic Calcisols
XIV     PT
41      PTd     Dystric Plinthosols
42      PTa     Albic Plinthosols
43      PTu     Humic Plinthosols
XV      PD      Podzoluvisols
44      PDd     Dystric Podzoluvisols
45      PDg     Gleyic Podzoluvisols
XVI     AC      Acrisols
46      ACh     Haplic Acrisols
47      ACp     Plinthic Acrisols
48      ACg     Gleyic Acrisols
49      ACf     Ferric Acrisols
50      ACu     Humic Acrisols
XVII    NT
51      NTh     Haplic Nitisols
52      NTr     Rhodic Nitisols
XVIII   FR      Ferralsols
53      FRr     Rhodic Ferralsols
54      FRx     Xanthic Ferralsols
55      FRp     Plinthic Ferralsols
56      FRu     Humic Ferralsols
XIX     AL      Alisols
57      ALh     Humic Alisols
58      ALg     Gleyic Alisols
59      ALu     Histic Alisols
XX      LP
60      LPq     Lithic Leptosols
XXI     AT
61      AT      Anthrosols

In recent years, the significant socio-economic development of Vietnam, along with the high growth rate of the population, has put intense pressure on soil resources. To manage this important resource sustainably and reasonably, a national soil map was needed. Such a map provides basic data about soil characteristics and properties, which are essential information for land use. In 1996, a national soil map of Vietnam at a scale of 1:1,000,000 was published by the Vietnamese Soil Science Association using the FAO classification (FAO, 1988; FAO, 1994). This soil map was built by conventional soil mapping methods, which generally rely on free survey. The soil surveyor employs a conceptual soil-landscape model to select observation locations at which the most useful information is likely to be obtained. The average area per observation was 1920 ha. The soil samples were then analyzed in the laboratory. Landscape features as seen in the field and expert experience were also taken into account to describe the soil profiles. The soil map is accompanied by a set of soil profile descriptions. Each map unit is characterized by one or more representative soil profiles of the soil types that comprise the map unit. These profiles are used for the interpretation of the soil map.

Subsequently, some regions and provinces have also obtained soil maps at larger scales, for example the soil map of the Tay Nguyen region at a scale of 1:100,000 and the soil maps of Nam Dinh and Ninh Binh at a scale of 1:50,000. Nevertheless, because of the lack of government funds, many regions and areas in Vietnam still have no soil map, a very important resource for managing and using land efficiently. A new method that can map soil in more detail at lower cost than conventional soil mapping is therefore needed in Vietnam.

3.2. Digital soil mapping

3.2.1. Soil mapping

Soil mapping, or soil survey, is the process of determining the spatial distribution of physical, chemical and descriptive soil properties and presenting it in a form that various users can understand and interpret (Beckett, 1976; Dent and Young, 1981). Traditional soil mapping consists of the following steps:

- Project planning;

- Preparation for fieldwork;

- Photo-interpretation and pre-processing of auxiliary data;

- Collecting field data and laboratory analysis;

- Data input and organization;

- Presentation and application of soil mapping products.

Project planning is an especially important step for the success of a soil survey project because it includes the definition of a sampling plan, inspection density, classification system and data organization system. Preparation for fieldwork typically includes a literature study and reconnaissance surveys. The end product of a soil mapping project is a soil resource inventory, i.e. a map showing the distribution of soils and their properties, accompanied by a soil survey report (Avery, 1987).

Owing to the significant development of informatics, soil resource inventory data are now organized into a thematic type of geographic information system called a Soil Information System (SIS), of which the major part is a Soil Geographical Database (SGDB) (Burrough, 1991). This is a combination of spatial data (polygon and point maps) closely linked with attribute data for profile observations, soil mapping units, soil classes and all relevant data. SIS is applied not only in soil science but also in a wide range of civil applications such as planning, urban administration and environmental management. It offers information not only on soils but also on their potential (and actual) use and the environmental risks involved (e.g. erosion risk), and it predicts soil behavior under intended management (Hengl, 2003).

Soil mapping projects vary in inspection intensity, purpose and the type of conceptual model used. In terms of intensity, soil mapping projects range from small-scale surveys (1:100 K to 1:1 M) through medium-scale surveys (1:50 K) to large-scale surveys (1:25 K to 1:5 K or larger). In terms of purpose, a soil mapping project can be classified as special purpose (commonly referred to as thematic) or general purpose. The former is completely demand-driven and focuses on a single soil variable or a limited set of soil variables; the latter is more holistic but also more complex, and thus more costly and often not affordable at large scale. The conceptual models of soils reflect the purpose of the mapping project: (i) special-purpose mapping projects commonly follow the continuous model of spatial variation, so geostatistical techniques are used to make predictions; (ii) general-purpose mapping projects commonly rely on photo-interpretation and profile descriptions, following the discrete model of spatial variation (Hengl, 2003).

Coping with soil variation has been difficult since the beginnings of soil mapping. Soil variables vary not only horizontally but also with depth, and not only continuously but also abruptly. Soil mapping requires much denser field inspections than vegetation or land-use mapping. Furthermore, soil horizons and soil types are often hard to distinguish or measure. In particular, the polygenetic nature of soils has always been a main problem in the description and classification of soils (Jenny, 1941). Many pioneering soil geographers wondered whether they would ever be able to describe the patterns of soil cover fully (Jenny, 1941). The quality and usefulness of polygon-type soil maps have been a subject of debate for decades (Webster and Beckett, 1968). However, technological and theoretical progress in the last 30 years has clearly led to a dramatic improvement in soil mapping methodology. Most of these advances belong to the emerging discipline of Digital Soil Mapping (DSM).

3.2.2. Overview of Digital Soil Mapping.

The great expansion in informatics has yielded huge amounts of data and tools in all fields of application. Soil science is no exception, with the ongoing development of regional, national, continental and worldwide databases. The challenge of understanding these large stores of data has led to the development of new tools in the field of statistics and spawned new areas such as data mining and machine learning (Hastie et al., 2001). In soil science, the development of GIS, GPS, remote sensing and data sources such as digital elevation models (DEMs) is leading to new ways forward. These techniques provide a wide range of soil data and information for environmental monitoring and modeling.

Worldwide, more and more researchers are investigating the potential of applying new techniques from information technology and science to soil survey and soil mapping. The main principle is soil assessment using GIS, for example producing digital soil property and class maps under the constraint of limited fieldwork and laboratory analysis, both of which are very expensive. DSM is the next great advancement in delivering soil survey information.


DSM is the creation of spatial soil information systems by numerical models that account for the spatial and temporal variations of soil properties based on soil information and related environmental variables (Lagacherie and McBratney, 2007). Pedologists working with DSM technology deal with various topics: the production and processing of covariates (soil-forming factors derived from remote sensing, digital terrain models, existing soil maps, et cetera), the collection of soil data, the development of soil predictions based on numerical models, the evaluation of the quality of digital soil maps and their representation. The recent advances and open questions within each of these topics have already been examined with some success.

Global population growth and the associated pressure on resources create an immediate need for reliable soil information, both to make informed decisions about the soil resource and to make people aware of existing and potential problems. There is not enough time or resources to canvass the earth and make soil surveys by traditional methods. DSM can deliver the needed information and may provide better and more accurate information. DSM is a credible alternative for meeting the increasing worldwide demand for spatial soil data because of its ability to (i) increase spatial resolutions and enlarge extents and (ii) convey relevant information. The first challenge requires developing a specific spatial data infrastructure for DSM, implementing DSM in existing soil survey programs and building soil spatial inference systems. The second challenge requires mapping soil functions and threats, developing a framework for the accuracy assessment of DSM products and introducing the time dimension (Lagacherie, 2006).

Figure 1: Generalized flowchart for Digital Soil Mapping. Soil observations and auxiliary data enter the soil spatial inference system, which yields spatially predicted soil properties and features and spatially predicted soil classes for the application domain.


DSM is a response to the demand for quantitative soil information for environmental monitoring and modeling. The environmental, or so-called scorpan, factors (scorpan is a mnemonic for the factors used to predict soil attributes: soil, climate, organisms, relief, parent material, age and spatial position, proposed by McBratney et al., 2003) are derived from digital elevation models (DEMs), remote sensing images, existing soil maps and similar sources. They are used to generate soil information in the form of a database in which most of the information consists of statistically optimal predictions. Figure 1 summarizes the process of digital soil mapping, where geo-referenced soil observations coupled with environmental variables form the input data. In the soil spatial inference system, soil properties over the whole area can be predicted and mapped using spatial soil prediction functions (such as regression, …). This prediction is based on correlations between the environmental variables and soil attributes, as well as on the spatial autocorrelation of the attributes themselves. These spatially inferred soil properties can be used to predict functional soil properties that are more difficult to measure, for example field capacity or available water capacity, using pedotransfer functions within the soil inference system. All of the predicted soil properties can be used to evaluate soil functions.
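The inference step can be sketched with the simplest possible spatial soil prediction function: an ordinary least-squares line fitted at the observation sites and then applied to every grid cell of a covariate. The Python example below is a minimal sketch under several assumptions: a single covariate (slope), a hypothetical soil property (topsoil clay content), invented sample values, and no use of spatial autocorrelation. It is not the model used in this study.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical observations: terrain slope (%) and topsoil clay (%) at four sites.
site_slope = [2.0, 5.0, 10.0, 20.0]
site_clay = [40.0, 37.0, 32.0, 22.0]
intercept, coefficient = fit_line(site_slope, site_clay)

# Hypothetical covariate grid covering the whole area (one value per pixel).
slope_grid = [[0.0, 4.0], [8.0, 16.0]]

# Apply the fitted function to every pixel to obtain the predicted property map.
clay_map = [[intercept + coefficient * s for s in row] for row in slope_grid]
```

In practice several covariates would enter the function, and the residuals could be interpolated to exploit the spatial autocorrelation mentioned above.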

Many case studies have demonstrated the application of DSM methods in mapping soil properties and classes, updating soil attribute maps, mapping soil features and examining spatio-temporal changes in land cover (Carré and Girard, 2002; Kempen et al., 2009; Turetta et al., 2006; Yang et al., 2011). This project, however, is concerned with the application of DSM to mapping soil types by two methods, Multinomial Logistic Regression and Artificial Neural Networks, chosen for their ability to predict soil classes such as WRB classes. These methods are discussed in more detail in the next part.

3.2.3. Digital soil mapping methods for mapping soil types.

Multinomial Logistic Regression

DSM involves the quantitative prediction of soils and their properties using observed data and auxiliary data on soil-forming factors. The major part of the prediction is to model quantitatively the relationship between the predictors and the dependent variable. Because it is complicated to build a non-linear model, a model that linearizes the relationship is preferred. One of the most suitable models is the logit model, which is built using logistic regression. The logit model relates the natural logarithm of the odds (the ratio of the probability of existence to that of non-existence) of a categorical variable to its predictor variables (Menard, 2002). The logit model is widely used in many other areas of research for analyzing categorical variables, and it is less demanding in terms of data characteristics such as normality and constant moments (Menard, 2002; Raimundo et al., 2006). Where the dependent categorical variable has more than two categories, multinomial logistic regression (MLR) is used; otherwise, the binomial version is used.
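To make the relation between probability, odds and logit concrete, the short Python sketch below converts between them for the binomial case. The function names `logit` and `inverse_logit` are chosen for this example only.

```python
import math


def logit(p):
    """Natural logarithm of the odds p / (1 - p) for a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("probability must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inverse_logit(value):
    """Recover the probability from a logit (the logistic function)."""
    return 1.0 / (1.0 + math.exp(-value))


# A probability of 0.8 corresponds to odds of 4 to 1, so its logit is ln(4).
logit_of_080 = logit(0.8)
# A probability of 0.5 corresponds to even odds, so its logit is 0.
logit_of_050 = logit(0.5)
```

The logit is unbounded in both directions, which is what allows it to be modeled as a linear function of the predictors even though the probability itself is confined to the interval from 0 to 1.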

The logit (ℓ) is the logarithmic function of the ratio between the probability (P) that a pixel (i) is a member of a class (j) and the probability that it is not (1−P). Its value can be directly predicted from the predictor values through a regression function, as adapted from Fagerland et al. (2008); Goeman and Cessie (2006); Menard (2002):

P ij ij = ln = a j + b1 j × X ij + b2 j × X 2i +...+ bnj × X ni (1) 1− Pij

Equation (1) shows how to calculate the logit (ℓ) of a category, e.g. soil group j, predicted from the values of a number of quantitative factors X1 … Xn, e.g. soil properties, of pixel i. Here aj is the intercept of the regression curve for soil class j, and b1j … bnj are the coefficients of the predictors X1 … Xn for the respective soil class j. The n stands for the total number of soil properties that correlate significantly with the given soil group j. From equation (1), equation (2), which estimates the probability that a given soil group j is present at pixel i (Pij), can be derived as:

$$P_{ij} = \frac{e^{\ell_{ij}}}{1 + \sum_{j=1}^{m-1} e^{\ell_{ij}}} \qquad (2)$$

where m stands for the total number of dependent categories and the Σ indicates the summation of the exponentiated logits of all the soil groups (except the reference group) for the particular pixel i. One of the categories, often the last in the list, is considered the reference (r), and its probability of presence is given as:

$$P_{r} = \frac{1}{1 + \sum_{k=1}^{m-1} e^{\ell_{ik}}} \tag{3}$$

The values of ‘a’ and ‘b’ have to be determined for each soil group from the empirical data. The logit models are then converted to probabilities with Equation (2), and Equation (3) is used to predict the probability of the reference category. The probabilities of the soil groups can then be used as inputs in, for instance, the raster calculator of ArcGIS to produce a map showing the likelihood of presence of each soil group at each pixel (Debella-Gilo and Etzelmuller, 2009).
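Equations (2) and (3) can be illustrated with a minimal Python sketch for a single pixel. The function name and the logit values are hypothetical and chosen only for illustration; in practice the logits would come from fitted coefficients as in Equation (1).

```python
import math

def class_probabilities(logits):
    """Turn the m-1 non-reference logits of one pixel into m probabilities.

    `logits` holds the logit of every non-reference soil group (equation 1).
    The returned list has one extra entry at the end: the reference group.
    """
    exp_logits = [math.exp(l) for l in logits]
    denom = 1.0 + sum(exp_logits)              # shared denominator of eqs. (2) and (3)
    probs = [e / denom for e in exp_logits]    # equation (2)
    probs.append(1.0 / denom)                  # equation (3), reference group
    return probs

# Hypothetical logits for three non-reference soil groups at one pixel.
pixel_probs = class_probabilities([0.4, -1.2, -2.0])
# Index of the most likely group (the last index is the reference group).
most_likely = max(range(len(pixel_probs)), key=pixel_probs.__getitem__)
```

Because the reference group shares the same denominator, the returned probabilities always sum to 1.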

A variety of studies have applied linear models to the prediction of soil classes: Gessler et al. (1995) used generalized linear models to predict the presence or absence of a bleached A2 horizon from digital terrain information; Campling et al. (2002) applied MLR to predict soil drainage classes using terrain attributes and vegetation indices; Debella-Gilo and Etzelmuller (2009) predicted the soil classes in Vestfold County, Norway using digital terrain analysis and MLR modeling integrated in GIS.

Artificial neural networks

Artificial neural networks (ANNs) attempt to build a mathematical model that works in a way analogous to the human brain. The design and the basic concept have been adopted from data processing in biological nervous systems, in which different groups of cells handle the reception, forwarding, storage and outward release of information. Neural networks consist of many elements or “neurons” interconnected by communication channels or “connectors”, which usually carry numeric data encoded by a variety of means, and which are organized into layers (McBratney et al., 2003). The application of an ANN consists of two stages. In the first stage, the network is trained to learn the conditions under which a certain feature (e.g. a soil class) occurs. Each input unit (cell or neuron) of the ANN represents a predictor variable (Figure 2): a terrain attribute (R1…Rn), a land-use unit (L1…Ln), and/or a geological unit (G1…Gn). The output represents the target variable as the desired output (the soil class).

Figure 2: Exemplified topology of a feed-forward multilayer ANN. Each cell or unit of the input layer represents one terrain attribute (R1…Rn), one land-use unit (L1…Ln), or one geological unit (G1…Gn). The input cells are connected to the cells of the output layer (S), each representing one soil unit, via hidden cells (H1…Hn). The knowledge of the relation between input and output is stored in the weights (w), which are adjusted during the learning process. i = input unit (i = 1, …, n; n = input units × hidden units), h = hidden unit (h = 1, …, n; n = hidden units × output units) (Behrens et al., 2005).

The connections, shown as arrows, are expressed by the weights wi (wi1…win). These weights are chosen randomly at the beginning, and their adjustment is the intrinsic learning process. As each attribute combination (in terms of pixels of a grid map) is fed into the network in succession, the weights are adjusted iteratively whenever the output (S) does not match the target output in the training data set.

The mean square error of the network (MSE) is used to test the performance of the ANNs and is continuously calculated during the learning process as equation (4):

$$MSE = \frac{1}{n} \sum (o - p)^{2} \tag{4}$$

where o represents the observed output value for each of the n pixels and p is the predicted output. Training has to be stopped when the average-error function and/or the gradient of the average-error function for the training set becomes small; otherwise, further iterations may cause overfitting, in which the network learns noise and its ability to generalize decreases (Sarle, 2002).
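As a minimal sketch of Equation (4) and of a stopping rule in the spirit described above, the following Python functions compute the MSE and flag when training could stop. The stopping rule, which tests whether the MSE improved by less than a tolerance in the last iteration, and the tolerance value are assumptions used as a simple proxy for "the gradient becomes small"; they are not taken from the cited sources.

```python
def mean_squared_error(observed, predicted):
    """Equation (4): average squared difference over the n pixels."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    n = len(observed)
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n

def should_stop(mse_history, tolerance=1e-4):
    """Assumed rule: stop once the last iteration improved the MSE by less than `tolerance`."""
    if len(mse_history) < 2:
        return False
    return (mse_history[-2] - mse_history[-1]) < tolerance
```

For a one-hot coded target such as `[1, 0, 0]` and a network output of `[0.5, 0, 0]`, only the first unit contributes to the error.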

During the second stage, the learned knowledge, in terms of the calibrated weights, can be applied to prediction areas for which the same input parameters (e.g. terrain attributes, land use, and geological units) are available but no soil map has been surveyed. The network then predicts the soil units based on the learned weights (Behrens et al., 2005).

Neural networks are widely applied in the soil science literature, mainly for predicting soil attributes. They can also be used to predict the probability of soil classes using a multi-logit transformation of the output. Zhu (2000) used neural networks to predict soil classes from soil environmental factors. Fidèncio et al. (2001) applied artificial neural networks to classify soils from Sao Paulo state by means of their near-infrared spectroscopy. Behrens et al. (2005) used artificial neural networks to spatially predict soil units based on terrain data.



4.1. Study area

The DSM methods are applied in Bac Ninh province in the Northern part of Vietnam. Bac Ninh is located at 21°05′ N latitude and 106°10′ E longitude and covers an area of about 82,300 ha. It lies in a tropical monsoon region; the average annual precipitation and temperature are 1500 mm and 23 °C, respectively. The terrain is rather level and flat, sloping mainly from North to South and from West to East. It is not much dissected: field areas lie 3-7 m above sea level, and hill and mountain areas 300-400 m. The area was selected because most of the necessary data were available and because it is representative of the deltaic region of Vietnam.

Figure 3: Location of study area in Vietnam

4.2. Data collection

4.2.1. Soil point data

The point dataset was collected during a soil survey project in 2010 and contains 537 observations. The observation locations were chosen based on topography and land use over the 47,000 ha of agricultural area. At the selected locations, soil profiles were dug, described and classified according to the WRB classification system. The soil was classified at 2 levels: the Reference Soil Groups (RSG), and the RSG combined with a set of uniquely defined qualifiers that describe its properties in detail (WRB, 2006). Five WRB Reference Soil Groups were found in the surveyed area:

- Fluvisols (402 samples): genetically young, azonal soils in alluvial deposits.
- Acrisols (58 samples): soils having a higher clay content in the subsoil than in the topsoil as a result of pedogenetic processes (especially clay migration) leading to an argic subsoil horizon. Acrisols have at certain depths a low base saturation and low-activity clays.
- Arenosols (7 samples): sandy soils, including both soils developed in residual sands after in situ weathering of usually quartz-rich sediments or rocks, and soils developed on recently deposited sands such as dunes in desert and beach lands.
- Gleysols (15 samples): wetland soils which, unless drained, are saturated with ground water for long enough periods to develop a characteristic gleyic color pattern.
- Plinthosols (55 samples): soils with plinthite, petroplinthite or pisoliths. Plinthite is an Fe-rich (Mn-rich), humus-poor mixture of kaolinitic clay (and other products of strong weathering such as gibbsite) with quartz and other constituents that changes irreversibly to a layer with hard nodules, a hardpan or irregular aggregates on exposure to repeated wetting and drying. Petroplinthite is a continuous, fractured or broken sheet of connected, strongly cemented to indurated nodules or mottles. Pisoliths are discrete strongly cemented to indurated nodules. Both petroplinthite and pisoliths develop from plinthite by hardening (WRB, 2006).

The 537 soil profiles in the surveyed area were also classified using qualifiers in addition to the WRB Reference Soil Group. This leads to 30 different soil categories, which was considered too many for digital soil mapping because many of the categories contain only a few samples. This is illustrated by table 3.

Table 3. Presence of soil profiles at the most detailed categorical level

| Soil category | Number of soil profiles | Soil category | Number of soil profiles |
|---|---|---|---|
| Abrupti-Dystric Fluvisol | 1 | Areni-Plinthic Acrisol | 10 |
| Areni-Eutric Fluvisol | 5 | Areni-Hyperdystric Acrisol | 3 |
| Dystric-Gleyic Fluvisol | 46 | Endoferri-Hyperdystric Acrisol | 2 |
| Dystric-Cambic Fluvisol | 54 | Hyperdystri-Arenic Acrisol | 4 |
| Endogleyi-Cambic Fluvisol | 8 | Hyperdystri-Plinthic Acrisol | 6 |
| Gleyi-Dystric Fluvisol | 52 | Plinthi-Hyperdystric Acrisol | 14 |
| Plinthi-Dystric Fluvisol | 33 | Skeleti-Haplic Acrisol | 2 |
| Silti-Eutric Fluvisol | 39 | Veti-Hyperdystric Acrisol | 2 |
| Silti-Dystric Fluvisol | 34 | Dystri-Haplic Arenosol | 5 |
| Endoplinthi-Dystric Fluvisol | 50 | Fluvi-Dystric Arenosol | 1 |
| Epigleyi-Cambic Fluvisol | 8 | Veti-Dystric Plinthosol | 22 |
| Epiplinthi-Dystric Fluvisol | 31 | Areni-Dystric Plinthosol | 22 |
| Eutri-Cambic Fluvisol | 2 | Dystri-Albic Plinthosol | 7 |
| Albi-Hyperdystric Acrisol | 8 | Endocamni-Dystric Gleysol | 4 |
| Anthraqui-Arenic Acrisol | 1 | Fluvi-Dystric Gleysol | 8 |


For this reason, the soil data points were also classified at an intermediate level. This intermediate classification was based on properties relevant for soil management: base saturation status, texture, and the presence of a hard layer in the soil profile. These properties were assigned from the qualifiers in the profile classifications: eutric = high base saturation; dystric = low base saturation; plinthic = plinthite present that may be or become a hardpan; epigleyic = reducing conditions in the upper 50 cm; endogleyic = reducing conditions between 50 and 100 cm. As a result, 15 intermediate-level soil units were distinguished, as summarized in table 4.

Table 4. Presence of soil profiles at the intermediate categorical level

| No | Intermediate level classification | Number of soil profiles | Properties |
|---|---|---|---|
| 1 | Acrisol00000 | 4 | Acrisols having no special property |
| 2 | Acrisol00001 | 11 | Acrisols having a hard subsurface horizon (plinthic horizon), which makes the soil more difficult to work |
| 3 | Acrisol10000 | 21 | Acrisols having a low base saturation (dystric qualifier), thus with a higher fertilizer need |
| 4 | Acrisol10001 | 22 | Acrisols having both a hard subsurface horizon (plinthic horizon) and a low base saturation |
| 5 | Arenosol10000 | 7 | Arenosols having a low base saturation |
| 6 | Fluvisol0001000 | 9 | Very wet Fluvisols having reducing conditions within 50 cm of the soil surface |
| 7 | Fluvisol0010000 | 9 | Wet Fluvisols having reducing conditions between 50 cm and 100 cm from the soil surface |
| 8 | Fluvisol0100010 | 42 | Fluvisols having a high base saturation and a texture of silt, silt loam, silty clay loam or silty clay |
| 9 | Fluvisol1000000 | 171 | Fluvisols having a low base saturation |
| 10 | Fluvisol1000010 | 38 | Fluvisols having a low base saturation and a texture of silt, silt loam, silty clay loam or silty clay |
| 11 | Fluvisol1000100 | 126 | Fluvisols having a low base saturation and a hard subsurface horizon |
| 12 | Fluvisol0100000 | 2 | Fluvisols having a high base saturation |
| 13 | Fluvisol0100001 | 5 | Fluvisols having a high base saturation and a texture of loamy fine sand or coarser |
| 14 | Gleysol10000 | 15 | Gleysols having a low base saturation |
| 15 | Plinthosol10000 | 55 | Plinthosols having a low base saturation |


4.2.2. Digital elevation model (DEM)

Topography is one of the most important factors affecting soil formation, and thus it may determine the soil types in an area. Landscape position causes localized changes in moisture and temperature. Therefore, a DEM of the area at a grid resolution of 25 m was created by digitizing the topographic map of the region. The DEM was used to derive four terrain attributes using SAGA GIS: altitude, slope, topographic wetness index and SAGA wetness index (Olaya, 2004). These attributes may reflect the soil-forming conditions in the study area.

4.2.3. Remote Sensing indices

A SPOT image from the Vietnam Space Technology Institute with a resolution of 20 m was used to compute three remote sensing indices in ArcGIS: the Normalized Difference Vegetation Index (NDVI), the Ratio Vegetation Index (RVI) and the Perpendicular Vegetation Index (PVI). This yielded three raster maps at a resolution of 20 m: an NDVI map, an RVI map and a PVI map. Subsequently, these maps were resampled in ArcGIS to a resolution of 25 m in order to obtain the same map extent and grid size as the DEM-derived attribute maps.

The vegetation indices are numerical indicators that use the visible and near-infrared bands of the electromagnetic spectrum to assess whether the target being observed contains live green vegetation or not. These indices are widely applied in vegetation studies and are often directly related to ground parameters such as percent ground cover, photosynthetic activity of the plant and surface water. The NDVI is computed by subtracting the red reflectance from the near-infrared (NIR) reflectance and dividing the difference by the sum of the near-infrared and red reflectances (Rouse et al., 1973).


The RVI is formed by dividing the NIR radiance by the red radiance (Pearson and Miller, 1972).
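The two verbal definitions above correspond to NDVI = (NIR − red) / (NIR + red) and RVI = NIR / red. A minimal per-pixel sketch in Python follows; the function names and the handling of zero denominators are assumptions made for illustration, and the study itself computed the indices in ArcGIS.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - red) / (NIR + red)."""
    total = nir + red
    if total == 0:
        return None  # assumed convention: undefined when both bands are zero
    return (nir - red) / total

def rvi(nir, red):
    """Ratio Vegetation Index: NIR / red."""
    if red == 0:
        return None  # assumed convention: undefined when the red band is zero
    return nir / red

# Hypothetical band values for one vegetated pixel.
example_ndvi = ndvi(0.5, 0.1)
example_rvi = rvi(0.5, 0.1)
```

NDVI is bounded between −1 and 1 for non-negative band values, whereas RVI is unbounded above, which is one reason the two indices can carry different information as predictors.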


4.2.4. Land use map

A land use map of Bac Ninh province for 2010 at a scale of 1:25,000 was used as a source of ancillary data. The study area is located in the biggest deltaic region of Vietnam, and paddy rice is the dominant crop. Because the observations were obtained only in the agricultural area, three main land use types were encountered: two rice crops per year (LUC), one rice crop per year (LUK) and annual crops (BHK). Annual crops include maize, potatoes, sweet potatoes, vegetables and cassava.


Table 5: Available ancillary data

| No | Data set | Predictor name | Resolution / Scale |
|---|---|---|---|
| 1 | Digital elevation model | ALTITUDE | 25 m |
| 2 | Map of slope | SLOPE | 25 m |
| 3 | Map of SAGA wetness index | SAGAWET | 25 m |
| 4 | Map of topographic wetness index | WETNESS | 25 m |
| 5 | Map of NDVI | NDVI | 25 m |
| 6 | Map of PVI | PVI | 25 m |
| 7 | Map of RVI | RVI | 25 m |
| 8 | Land use map | LU | 1:25,000 |

4.3. Multinomial logistic regression

4.3.1. The multinomial logistic regression model

Multinomial logistic regression was used to model the relationships between the Reference Soil Group or the intermediate-level soil groups (categorical dependent variables) and the terrain attributes, remote sensing indices and land use types in the research area (predictors), using the “nnet” package of R. This model belongs to the family of generalized linear models and is used when the response variable is categorical. Suppose that we want to model the probability πij that observation i belongs to class j of the m soil groups, j = 1 … m. In the model for predicting soil groups, the Fluvisols (j = 1) are taken as the reference class because they dominate the soil point data (402 of 537 samples). In the MLR model for the more detailed level, Fluvisol1000000 is the reference class for the same reason (171 of 537 samples). Consequently, the base probability πi1 is computed as the residual probability after the other classes πi2 … πim have been modeled.

Thus the model has k + 1 coefficients for each of the m − 1 non-reference classes: one intercept αj and one “slope” βlj for each predictor, where l = 1 … k indexes a column in the model matrix.

The fitted probabilities are then:

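$$\pi_{ij} = \frac{e^{\alpha_j + \beta_{1j} x_{i1} + \dots + \beta_{kj} x_{ik}}}{1 + \sum_{l=2}^{m} e^{\alpha_l + \beta_{1l} x_{i1} + \dots + \beta_{kl} x_{ik}}}, \quad j = 2, \dots, m$$

$$\pi_{i1} = 1 - \sum_{j=2}^{m} \pi_{ij}$$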

where xi is a vector of explanatory variables. This set of equations is fitted by maximizing the likelihood.

The fitted α and β can then be used to assess the log-odds of an observation being classified in each soil class, relative to the base class. That is, what is the chance that, instead of Fluvisols at the Soil Group level (or Fluvisol1000000 at the intermediate level), the observation is in another soil group? The log-odds are computed as:

$$\ln\frac{\pi_{ij}}{\pi_{i1}} = \alpha_j + \beta_{1j} x_{i1} + \dots + \beta_{kj} x_{ik}, \quad j = 2, \dots, m$$

So, once the model is fitted, we can predict the log-odds of each soil group (or intermediate-level soil group) relative to the reference one. To recover the probabilities from the log-odds, the inverse logistic transformation is used. In R, the predict function provides the probabilities of all the classes (which of course sum to 1).

4.3.2. Assessing model significance and contribution of predictors

To find the best model, one that provides the maximum fit with the fewest predictors, it is important to select the predictor variables that contribute most to the pattern in the categorical response variable. The criteria for comparing models include the deviance statistics and the Akaike Information Criterion (AIC) (Akaike, 1973). The AIC is a measure of the relative quality of a statistical model for a given data set. It deals with the trade-off between the complexity of the model and its goodness of fit, and thus provides a means for model selection. The AIC adjusts the residual deviance for the number of predictor variables:

AIC = 2K − 2 ln(L)

where K is the number of estimated parameters in the model and L is the maximized value of the likelihood function for the estimated model, which is readily available in the statistical output and reflects the overall fit of the model. In itself, the AIC value for a given data set has no meaning. It becomes informative when compared with the AIC of a series of models, the one with the lowest AIC being the best. If several models have similarly low AICs, the one with the fewest predictor variables should be chosen.

In this research, the stepwise-forward method was used for model selection. We begin with no variables in the model. A model is then fitted for each of the independent variables separately, the AIC of each model is computed, and the models are compared. The most influential predictor variable, the one giving the lowest AIC, is included in the model first; the other variables are added one by one in order of increasing AIC. Variable selection stops when the AIC of the fitted model increases. The selected model is the one with the lowest AIC and, among models with similar AIC, the fewest independent variables.
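The selection procedure can be sketched in Python as follows. This is the common greedy variant of forward selection (at each step, every remaining variable is tried and the one giving the lowest AIC is added), which may differ in detail from the procedure run in R. The `aic_of` callback, the `toy_aic` function and the log-likelihood gains are hypothetical stand-ins for actually fitting a model.

```python
def aic(n_params, log_likelihood):
    """AIC = 2K - 2 ln(L), with ln(L) passed in as `log_likelihood`."""
    return 2 * n_params - 2 * log_likelihood

def forward_select(candidates, aic_of):
    """Greedy forward selection on AIC.

    `aic_of(variables)` is a caller-supplied function that fits a model with
    the given tuple of predictors and returns its AIC.
    """
    selected = ()
    best_aic = aic_of(selected)            # model without predictors
    remaining = list(candidates)
    while remaining:
        scored = [(aic_of(selected + (v,)), v) for v in remaining]
        trial_aic, trial_var = min(scored)
        if trial_aic >= best_aic:          # no candidate lowers the AIC: stop
            break
        selected += (trial_var,)
        best_aic = trial_aic
        remaining.remove(trial_var)
    return selected, best_aic

# Hypothetical log-likelihood gains, used only to exercise the procedure.
gain = {"ALTITUDE": 12.0, "SLOPE": 5.0, "PVI": 0.3}

def toy_aic(variables):
    log_likelihood = -100.0 + sum(gain[v] for v in variables)
    return aic(len(variables) + 1, log_likelihood)   # +1 for the intercept

chosen, chosen_aic = forward_select(gain, toy_aic)
```

In this toy setting PVI is left out: its small gain in log-likelihood does not offset the penalty of 2 per additional parameter.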

4.4. Artificial neural network

ANNs are a standard technique in artificial intelligence and data mining in general. They are designed to learn rules from examples. In R, the ANNs were run using the “neuralnet” package (Fritsch and Guenther, 2012). The package contains a very flexible function to train feed-forward neural networks. It was built to train neural networks in the context of regression analysis and focuses on multilayer perceptrons, which are well suited to modeling functional relationships. For the “neuralnet” models, the predictors were selected using the stepwise-forward method described in 4.3.2.

In this study, a back-propagation ANN model was developed to predict soil types at both the Soil Group level and the intermediate level. Back-propagation networks are trained with a back-propagation technique that adjusts the weight and bias values along the negative gradient in an attempt to minimize the mean squared error (MSE) between the network outputs and the target outputs of the training data set (Sigillito and Hutton, 1990).

The application of an ANN consists of two stages. During the first stage, the network is trained, meaning that it learns the conditions under which a certain soil group occurs, using the calibration data set. Each input unit (cell or neuron) of the ANN represents a predictor variable: terrain attributes, remote sensing indices and land use units. The output unit represents the Soil Groups or the intermediate-level Soil Groups. The connections between neurons are described by the weights wi (wi1 … win). The adjustment of these weights constitutes the learning process. As each attribute combination (in terms of pixels of a grid map) is fed into the network in succession, the weights are adjusted iteratively whenever the predicted output does not match the output in the training data set. The other network parameters, including the number of iterations, the learning rate, the number of hidden layers and the transfer function, were adjusted after the learning stage. During the second stage, the learned knowledge, in terms of the calibrated weights, is applied to the whole study area, for which the same input parameters (terrain attributes, remote sensing indices and land use maps) are available but no soil map has been surveyed. The network then predicts the soil units based on the learned weights (Behrens et al., 2005).

4.5. Validation

The quality of a soil map can be determined by comparing the predictions at the calibration sites with the observed values. However, the accuracy thus obtained, referred to as the internal accuracy, often over-estimates the actual accuracy (Chatfield, 1995). Therefore, in this project, an independent validation data set of 53 observations was selected randomly from the data set. The models were fitted on the remaining observations, and their predictions were then compared with the validation data, which were not used in the modeling.
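A random hold-out of this kind can be sketched in Python as below. The function name and the fixed seed are assumptions added for reproducibility of the illustration; the text does not state how the random selection was implemented.

```python
import random

def holdout_split(n_observations, n_validation, seed=0):
    """Randomly set aside `n_validation` observation indices for validation."""
    rng = random.Random(seed)              # assumed seed, for a repeatable example
    indices = list(range(n_observations))
    validation = sorted(rng.sample(indices, n_validation))
    held_out = set(validation)
    calibration = [i for i in indices if i not in held_out]
    return calibration, validation

# 537 observations in total, 53 of them held out for validation.
calibration_idx, validation_idx = holdout_split(537, 53)
```

The two index lists are disjoint, so none of the validation points can influence the fitted models.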

For assessing the quality of the predicted soil maps, the map purity was used based on the confusion matrix (Brus et al., 2011). Table 6 shows an error matrix: the row margins (the area covered by the map units) of the matrix are known, whereas the column margins (the areas covered by the true classes) are unknown, and must be estimated from the samples.


Table 6: Confusion matrix

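| Map \ Field | 1 | 2 | ... | U | ∑ |
|---|---|---|---|---|---|
| 1 | A11 | A12 | ... | A1U | A1+ |
| 2 | A21 | A22 | ... | A2U | A2+ |
| ... | ... | ... | ... | ... | ... |
| U | AU1 | AU2 | ... | AUU | AU+ |
| ∑ | A+1 | A+2 | ... | A+U | A |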

Aij = number of observations mapped as class Ci with observed soil class Cj

The overall purity is defined as the proportion of the mapped samples in which the predicted soil class, which is the soil class as depicted on the map, equals the true soil class as determined on validation points. In other words, it is the proportion correctly classified:

$$p = \frac{\sum_{u=1}^{U} A_{uu}}{A}$$

where U denotes the number of classes, Auu denotes the number of correctly classified observations of map unit u, and A denotes the total number of observations. A good map has a map purity close to 1 (Finke, 2011).
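The purity calculation can be sketched in Python as follows. The confusion matrix is hypothetical and only illustrates the arithmetic; it does not reproduce the validation results of this study.

```python
def map_purity(confusion):
    """Overall purity: correctly classified observations / all observations.

    `confusion[i][j]` counts validation points mapped as class i and
    observed in the field as class j.
    """
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    correct = sum(confusion[u][u] for u in range(len(confusion)))
    return correct / total

# Hypothetical 3-class confusion matrix (rows: map, columns: field).
example = [
    [30, 4, 2],
    [3, 6, 1],
    [2, 1, 4],
]
purity = map_purity(example)   # 40 of the 53 points lie on the diagonal
```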

4.6. Soil diversity indices

The diversity indices were calculated to assess the variation in the predicted soil maps. In this research, three indices, Shannon’s entropy H’, richness S and evenness E, were calculated for each predicted map.

• Richness (S): the number of soil classes that exist in an area.

• Shannon’s entropy (H’): the most commonly used measure of pedodiversity (Guo et al., 2003; Ibáñez et al., 1998):

$$H' = -\sum_{i=1}^{S} p_i \ln p_i$$

where pi is the proportion of the total map area covered by the i-th unit. When a single class covers the whole area, p = 1 and thus Hmin = 0. The closer the values of p are to 1/S, the more even the distribution of p and the more diverse the class composition. The maximum value of H is Hmax = ln S; a value close to Hmax indicates a nearly equal proportional contribution of all classes (Martín and Rey, 2000).

• Evenness (E): the relative abundance of each soil class in the area. It can be defined as:

$$E = \frac{H'}{H_{\max}} = \frac{H'}{\ln S}$$

If all soil classes are equally abundant, the evenness is high; conversely, an area in which the abundance of the soil classes differs greatly has a low evenness (McBratney and Minasny, 2007).

The diversity of a map indicates the amount of information depicted on the map: a high diversity corresponds to a high information content.

4.7. Combined index for practical management

The map purity is the indicator of map quality, whereas soil diversity indicates the information content of the map. Thus, both aspects can be used to express how useful the map is. In terms of management practices, the goal of soil mapping is to construct a map with high purity that adequately represents soil diversity. Therefore, the combination of map purity and Shannon’s entropy is an important index for assessing the performance of soil mapping. The combined index for accuracy and depicted diversity was defined as the product of H’ and map purity.
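The diversity indices of section 4.6 and the combined index can be sketched together in Python. The function names are hypothetical, and returning an evenness of 0 for a map with a single class (where ln S = 0 makes the ratio undefined) is an assumed convention, not one stated in the text.

```python
import math

def diversity_indices(proportions):
    """Return richness S, Shannon's entropy H' and evenness E for one map."""
    present = [p for p in proportions if p > 0]   # classes that occur on the map
    richness = len(present)
    entropy = -sum(p * math.log(p) for p in present)
    # Assumed convention: evenness is 0 when only one class is present.
    evenness = entropy / math.log(richness) if richness > 1 else 0.0
    return richness, entropy, evenness

def combined_index(purity, entropy):
    """Combined index: map purity multiplied by Shannon's entropy."""
    return purity * entropy

# Hypothetical map with four classes of equal area and a purity of 0.5.
s, h, e = diversity_indices([0.25, 0.25, 0.25, 0.25])
index = combined_index(0.5, h)
```

With four equally sized classes the entropy reaches its maximum ln 4 and the evenness equals 1, so the combined index is limited only by the purity.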



5.1. The soil maps modeled by multinomial logistic regression

In the MLR model for predicting the Reference Soil Group in Bac Ninh, the stepwise-forward method selected five predictor variables: altitude, NDVI, slope, topographic wetness index and SAGA wetness index (Table 7). Wetness indices are frequently used to simulate moisture conditions in a watershed quantitatively, and altitude and slope are very important terrain attributes. This suggests that the combination of relief and the distribution of water over the area significantly affects soil formation at the higher level of classification. The effects of terrain attributes on the distribution of soil groups were shown by Debella-Gilo and Etzelmuller (2009) using Multinomial Logistic Regression. In addition, Jafari et al. (2012) found with the same method that the degree of wetness plays a role in the identification of soil types in a semi-arid area.

The MLR model for predicting the intermediate level of Soil Group consists of the same variables as the model above (altitude, NDVI, slope, topographic wetness index, SAGA wetness index) plus land use (Table 7). It is reasonable to expect that predicting soil classes in more detail requires more predictor variables, because the relationship between the soil class and the covariates is more complex at lower categorical levels. In addition, because the more detailed level was defined on the basis of soil management properties, land use also has a considerable influence on the soil classes.

Table 7: The variable used to predict soil group and intermediate level of soil group in multinomial logistic regression.

| Method | Soil class | Variables in the model |
|---|---|---|
| MLR | Reference Soil Group | ALTITUDE+NDVI+SLOPE+WETNESSIN+SAGAWETNET |
| MLR | Intermediate level of Soil Group | LU+ALTITUDE+NDVI+SLOPE+WETNESSIN+SAGAWETNET |

Multinomial logistic regression predicts the soil classes directly from the predictors. Figure 4 shows the occurrence of the Reference Soil Groups predicted by MLR. As can be seen from the map, Fluvisols are the dominant class over the area. This can be explained by the fact that Bac Ninh is located in the Red River delta, the biggest delta in the North of Vietnam. Fluvisols are genetically young soils in alluvial deposits, and thus this soil group accounts for the largest part of the study area. The good natural fertility of this soil group makes Bac Ninh one of the most productive paddy rice regions in Vietnam.

Besides Fluvisols, the MLR method predicts Acrisols, Arenosols and Plinthosols in very limited proportions. However, the model did not predict any Gleysols, even though the dataset contains samples belonging to this group. Looking back at the input observations, Fluvisols account for more than 70% of the samples and the four other soil groups together for only about 25%. This explains the excessive appearance of Fluvisols compared to the other groups and the absence of Gleysols from the model output (Gleysols have only 15 samples out of a total of 537).


Figure 4: Map of Reference Soil Group predicted by Multinomial Logistic Regression

Acrisols occur in high landscape positions in the study area when compared with the topographic map (Figure 6). This is a plausible prediction, because this soil group is often associated with hilly or undulating topography in wet tropical climates (FAO, 2001).

Figure 5 illustrates the distribution of the intermediate-level soil groups predicted by MLR. At this level, the Reference Soil Groups were reclassified based on soil management properties to avoid the predominance of one soil class in the input sample. As expected, the model predicted more detailed soil classes: 11 soil classes appear in the resultant map. Nevertheless, Gleysols are again absent, so the model misses this information just as it does at the soil group level.


Figure 5: Map of intermediate level of Soil Group predicted by Multinomial Logistic Regression

Fluvisol1000000 (Fluvisols having a low base saturation) occurs in most of Bac Ninh. Generally this is a fertile alluvial soil, distributed over different types of terrain, but long exploitation for cultivation without appropriate land treatment has reduced its fertility. The second dominant soil class over the study area is Fluvisol1000100 (Fluvisols having a low base saturation and a hard subsurface horizon).

Fluvisols having a high base saturation and fine texture (Fluvisol0100010) appear on both sides of the Red River. This soil class has a high fertility because the river annually deposits sediment on the area around it.

The model also predicts Acrisol00000 over the hilly region, but over a more extensive area than Acrisols occupy at the Reference Soil Group level. The other soil classes are predicted by the MLR model over very small areas only.


Figure 6: Digital Elevation Model of Bac Ninh

5.2. The soil maps modeled by artificial neural network

Artificial Neural Networks were used to estimate the probabilities of occurrence of each soil class at the nodes of a 25 m raster covering Bac Ninh. Subsequently, the class with the largest probability at each pixel was used to construct a prediction map. Therefore, at the Reference Soil Group level, 5 models were constructed to predict the 5 Reference Soil Groups appearing in the study area. Similarly, there are 15 ANN models corresponding to the 15 intermediate-level Soil Groups.
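The step from per-class probabilities to a class map can be sketched in Python as follows. The function names and the probability values are hypothetical. Because each class has its own model, the probabilities at a pixel need not sum to 1; only their ranking matters for the prediction.

```python
def predict_class(probabilities_by_class):
    """Return the class whose model gives the largest probability at one pixel."""
    return max(probabilities_by_class, key=probabilities_by_class.get)

def predict_map(pixels):
    """Apply the largest-probability rule to every pixel in a list."""
    return [predict_class(p) for p in pixels]

# Hypothetical outputs of three single-class models at two pixels.
pixels = [
    {"Fluvisol": 0.71, "Acrisol": 0.20, "Plinthosol": 0.09},
    {"Fluvisol": 0.35, "Acrisol": 0.55, "Plinthosol": 0.12},
]
predicted = predict_map(pixels)
```

Under this rule a class never appears on the map unless its model assigns it the largest probability at one pixel or more, which is consistent with rarely sampled groups dropping out of the prediction.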

The parsimonious model for prediction was selected in a similar way as for Multinomial Logistic Regression, based on the smallest AIC and residual deviance. However, as shown in Table 8, every model selected by the ANNs for a soil class has only one predictor variable. Surprisingly, increasing the number of covariates increased the AIC for all models, even though including more variables could be expected to describe the relationship between the target variable and the covariates better.


Table 8: The variable used to predict soil group and intermediate level of soil group in artificial neural networks.

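| Level | Soil class | Variable in modeling |
|---|---|---|
| Reference Soil Group | Acrisols | SAGAWET |
| Reference Soil Group | Fluvisols | SAGAWET |
| Reference Soil Group | Arenosol | WETNESS |
| Reference Soil Group | Gleysol | WETNESS |
| Reference Soil Group | Plinthosol | NDVI |
| Intermediate level of Soil Group | Acrisol00000 | ALTITUDE |
| Intermediate level of Soil Group | Acrisol00001 | NDVI |
| Intermediate level of Soil Group | Acrisol10000 | SAGAWET |
| Intermediate level of Soil Group | Acrisol10001 | ALTITUDE |
| Intermediate level of Soil Group | Arenosol10000 | NDVI |
| Intermediate level of Soil Group | Fluvisol0001000 | WETNESS |
| Intermediate level of Soil Group | Fluvisol0010000 | LU |
| Intermediate level of Soil Group | Fluvisol0100000 | ALTITUDE |
| Intermediate level of Soil Group | Fluvisol0100001 | LU |
| Intermediate level of Soil Group | Fluvisol0100010 | LU |
| Intermediate level of Soil Group | Fluvisol1000000 | PVI |
| Intermediate level of Soil Group | Fluvisol1000010 | ALTITUDE |
| Intermediate level of Soil Group | Fluvisol1000100 | PVI |
| Intermediate level of Soil Group | Gleysol10000 | LU |
| Intermediate level of Soil Group | Plinthosol10000 | NDVI |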

Figure 7 shows the map of Reference Soil Groups constructed by the ANN models. Three out of the five Reference Soil Groups were predicted: Fluvisols, Acrisols and Plinthosols. The ANNs predicted Fluvisols in about 99% of the total area (Table 9). This is again attributed to the unequal presence of the soil types in the observation data: more than 400 samples were Fluvisols in a dataset of 537 points. Acrisols and Plinthosols, with 58 and 55 samples respectively, occur in the resultant map in very limited proportions. Arenosols and Gleysols, which have the lowest numbers of observations, are not present in the predicted map.


Figure 7: Map of Reference Soil Group predicted by Artificial Neural Networks

For the intermediate level of Soil Groups, the ANNs predicted six soil classes, which belong to the same Soil Groups as at the higher level: Acrisols, Fluvisols and Plinthosols. Similarly, no soil classes belonging to Gleysols or Arenosols were predicted. This map shows a similar pattern to the map produced by MLR: Fluvisols with a low base saturation cover most of the area (78.8%), Fluvisols with a high base saturation and fine texture are located on both sides of the Red River, and Acrisols are distributed in the hilly regions.


Figure 8: Map of intermediate level of Soil Group predicted by Artificial Neural Networks


Table 9: Distribution of soil classes predicted by Multinomial Logistic Regression and Artificial Neural Networks

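| Model | Soil class | Area (m²) | Proportion |
|---|---|---|---|
| MLR for Reference Soil Group | Fluvisol | 533214375 | 0.961 |
| MLR for Reference Soil Group | Acrisol | 7025000 | 0.013 |
| MLR for Reference Soil Group | Arenosol | 11722500 | 0.021 |
| MLR for Reference Soil Group | Plinthosol | 3138125 | 0.006 |
| MLR for intermediate level of Soil Group | Acrisol00000 | 12662500 | 0.023 |
| MLR for intermediate level of Soil Group | Acrisol10000 | 6856250 | 0.012 |
| MLR for intermediate level of Soil Group | Arenosol10000 | 23313125 | 0.042 |
| MLR for intermediate level of Soil Group | Fluvisol0010000 | 4143750 | 0.007 |
| MLR for intermediate level of Soil Group | Fluvisol0100000 | 3651250 | 0.007 |
| MLR for intermediate level of Soil Group | Fluvisol0100001 | 938125 | 0.002 |
| MLR for intermediate level of Soil Group | Fluvisol0100010 | 25088125 | 0.045 |
| MLR for intermediate level of Soil Group | Fluvisol1000000 | 318666250 | 0.574 |
| MLR for intermediate level of Soil Group | Fluvisol1000010 | 36875 | 0.000 |
| MLR for intermediate level of Soil Group | Fluvisol1000100 | 154650625 | 0.279 |
| MLR for intermediate level of Soil Group | Plinthosol10000 | 5093125 | 0.009 |
| ANNs for Reference Soil Group | Plinthosol | 832500 | 0.001 |
| ANNs for Reference Soil Group | Acrisol | 6525625 | 0.012 |
| ANNs for Reference Soil Group | Fluvisol | 547741875 | 0.987 |
| ANNs for intermediate level of Soil Group | Acrisol00000 | 13628125 | 0.025 |
| ANNs for intermediate level of Soil Group | Acrisol10000 | 3309375 | 0.006 |
| ANNs for intermediate level of Soil Group | Fluvisol0100010 | 25390000 | 0.046 |
| ANNs for intermediate level of Soil Group | Fluvisol1000000 | 437148750 | 0.788 |
| ANNs for intermediate level of Soil Group | Fluvisol1000100 | 67043125 | 0.121 |
| ANNs for intermediate level of Soil Group | Plinthosol10000 | 8580625 | 0.015 |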

5.3. Comparison of predictive methods

5.3.1. Soil map purity

The predictive soil maps were validated with an independent set of 53 points drawn from the dataset by simple random sampling. The overall purity of each map was calculated from the confusion matrix. Map purity has been used as a quality criterion for many soil maps. Many survey reports state that the soil survey aimed for a map purity of ca. 70%, which means that the soils should be classified correctly on about 70% of the map (Finke, 2011).
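The purity calculation described above can be sketched as follows. This is a minimal illustration, assuming that purity is the number of correctly classified validation points (the diagonal of the confusion matrix) divided by the total number of validation points. The function name `map_purity` and the counts in the example matrix are hypothetical: the counts were invented so that they sum to 53 and are not the confusion matrix of this study.

```python
def map_purity(confusion):
    """Overall purity: share of validation points on the matrix diagonal.

    `confusion` is a square list of lists; rows are observed classes and
    columns are predicted classes (an assumed convention).
    """
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix contains no validation points")
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


# Hypothetical 3-class confusion matrix with 53 validation points in total.
example = [
    [36, 2, 1],
    [5, 2, 0],
    [6, 0, 1],
]
purity = map_purity(example)  # 39 correct points out of 53
```

A dominant class can carry most of the purity on its own, as in this invented matrix, which is one reason to read purity together with the diversity indices.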

Table 10 presents the estimated purity of the soil maps predicted by Multinomial Logistic Regression and Artificial Neural Networks at both levels. Both methods reach the same map purity (0.73) at the higher level of soil class, which indicates that both perform well in predicting Reference Soil Groups.

As expected, at the lower level of soil class the map purity drops dramatically, to 0.39 for MLR and 0.37 for ANNs. The MLR map thus has a slightly higher purity than the ANNs map. Descending in classification level introduces more properties that may be related to local conditions and natural selection, which adds to the complexity of the system (Toomanian et al., 2006). Some of these properties may not be captured by the applied covariates, so the link between soil classes and covariates weakens at the lower level. Digital soil mapping relies on the relationships between soil samples and the environmental factors of the target area. Weak relationships result in weak predictions, as seen in the performance of both methods at the intermediate level of Soil Groups. Jafari et al. (2013) also found that soil map purity decreased toward the lower taxonomic categories. Another reason is that the number of soil units at Reference Soil Group level is much smaller than at the intermediate level (5 Reference Soil Groups compared to 15 intermediate-level classes). The soil map purity decreases because soil units at the lower level contrast less with one another. Olaniyan and Ogunkunle (2007) reported that soil mapping units with high purity included very contrasting soil types.

Table 10: Map purity, diversity indices and combined index of maps predicted by MLR and ANNs

Method  Level                             Map purity  Richness  Shannon H'  Evenness E  Purity * Shannon
--------------------------------------------------------------------------------------------------------
MLR     Reference Soil Group              0.73        5         0.20        0.12        0.15
MLR     Intermediate level of Soil Group  0.39        15        1.21        0.44        0.41
ANNs    Reference Soil Group              0.73        5         0.07        0.04        0.05
ANNs    Intermediate level of Soil Group  0.37        15        0.77        0.28        0.29

5.3.2. Soil diversity

Table 10 shows the richness, Shannon index and evenness of the maps produced by the two methods at both taxonomic levels. As the number of soil units increases from the Reference Soil Group level to the intermediate level, both the diversity and the evenness rise sharply. The greater number of soil units at the lower taxonomic level corresponds to a higher diversity.

At the same taxonomic level, MLR always yields a higher value of Shannon's index than ANNs. With the same richness, the higher values of H' from MLR indicate that the maps predicted by MLR have a higher soil diversity. This is confirmed by Table 9: Fluvisols are less abundant in the MLR prediction than in the ANNs prediction, even though both methods have a very low diversity index at Reference Soil Group level (0.20 for MLR and 0.07 for ANNs). The lower level of classification has a higher value of Shannon's index: 1.21 for MLR and 0.77 for ANNs. This can be attributed to the larger number of soil map units at this level, which increases the diversity of the predicted map. As at Reference Soil Group level, the diversity is higher in the maps made with MLR than in those made with ANNs.

In addition, Figure 9 and Figure 10 illustrate the relationship between the map purity, the Shannon index and the combined index for the MLR and ANNs models respectively. The diversity index always shows the opposite trend to the soil map purity: when the soil map purity decreases, the diversity index increases. The number of different soil units (richness) at each classification level may explain this. H' is closely related to the number of soil units: if the number of different soil classes increases, a greater number of fractions are summed in H'.
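The dependence of H' on the class fractions can be checked with a short calculation. The sketch below assumes the usual definitions, which are not restated in this section: H' = -Σ p_i ln(p_i) with natural logarithms, and evenness E = H' / ln(S), where S is the richness listed in Table 10 (5 or 15) rather than the number of classes actually predicted. The function names are illustrative, and the proportions are the rounded values from Table 9. Under these assumptions the results agree with Table 10 to within about 0.01.

```python
import math


def shannon_index(proportions):
    """Shannon entropy H' = -sum(p * ln p), skipping classes with zero area."""
    return -sum(p * math.log(p) for p in proportions if p > 0)


def evenness(h, richness):
    """Evenness E = H' / ln(S); S is the assumed legend richness."""
    return h / math.log(richness)


# Rounded area proportions taken from Table 9.
maps = {
    "MLR, Reference Soil Group": ([0.961, 0.013, 0.021, 0.006], 5),
    "MLR, intermediate level": (
        [0.023, 0.012, 0.042, 0.007, 0.007, 0.002,
         0.045, 0.574, 0.000, 0.279, 0.009], 15),
    "ANNs, Reference Soil Group": ([0.001, 0.012, 0.987], 5),
    "ANNs, intermediate level": (
        [0.025, 0.006, 0.046, 0.788, 0.121, 0.015], 15),
}

results = {}
for name, (proportions, richness) in maps.items():
    h = shannon_index(proportions)
    results[name] = (h, evenness(h, richness))
    print(f"{name}: H' = {h:.2f}, E = {results[name][1]:.2f}")
```

Because each additional class with a non-zero share adds a positive term to the sum, the eleven classes predicted by MLR at the intermediate level give a much larger H' than the four predicted at Reference Soil Group level.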




[Figure: purity, Shannon index and purity * Shannon plotted for RSG - MLR and InterSG - MLR]

Figure 9: Variation of the purity, Shannon index and the combined index for the map predicted by MLR at two levels of soil class

The diversity indices, including richness, Shannon's index and evenness, represent the deterministic soil complexity (Jafari et al., 2013). The increase of entropy in the study area from Reference Soil Group level to the lower level therefore indicates a higher complexity of the soil system. In addition, an increase in entropy associated with a larger number of soil classes influences the predictive ability of the model. When the system complexity increases, there are more soil classes in the area, so the model has to be trained for a larger number of classes. This means that fewer observations per class are available for training the model. The uncertainty of the prediction for each soil class therefore rises, and the soil map purity decreases at the intermediate level of Soil Groups. Soil diversity reflects the intricacy of soil maps and may therefore influence the soil map purity (Minasny et al., 2010).






[Figure: purity, Shannon index and purity * Shannon plotted for RSG - ANN and InterSG - ANN]

Figure 10: Variation of the purity, Shannon index and the combined index for the map predicted by ANNs at two levels of soil class

The combined index, defined as the product of Shannon's entropy and map purity, increases from the Reference Soil Group level to the intermediate level for both the MLR and the ANNs approach. However, MLR shows a higher value than ANNs at both levels, as shown in Table 10.

For management practices, we need a soil map with high purity that adequately represents soil diversity. Pedodiversity measurements are related to the density of the soil map or the presence of various soil units (Jafari et al., 2013). A soil mapping method should achieve a high map purity and should also represent the real soil diversity. In this research, although the differences in map purity between the two predictive methods are small, MLR shows a higher pedodiversity than ANNs at both mapping levels. It therefore seems that soil mapping is more efficient with Multinomial Logistic Regression than with Artificial Neural Networks. For MLR, the map purity at Reference Soil Group level is much higher than at the intermediate level of Soil Groups, so the model performs much better in predicting Reference Soil Groups. However, at the lower level the model predicts a higher diversity of the soil map, and thus the informative value of the intermediate-level maps, as estimated by the combined index, is higher.



Some main conclusions can be drawn from the results of this study:

1. Multinomial Logistic Regression can successfully be used to predict a soil type map directly.

2. The soil map purity shows the opposite trend to the mapped soil diversity: as the purity decreases from Reference Soil Groups to the intermediate level of Soil Groups, the soil diversity increases.

3. Based on the map purity and the combined index, Multinomial Logistic Regression performed better than Artificial Neural Networks in predicting soil types. Soil mapping at the level of Reference Soil Group achieves a high map purity and a low diversity.

4. To improve the model performance, more observations are needed for Acrisols, Plinthosols, Arenosols and especially Gleysols, to reduce the dominance of Fluvisols in the dataset.



