International Journal of Environmental Research and Public Health

Article Rainfall-Induced Landslide Prediction Using Machine Learning Models: The Case of District,

Martin Kuradusenge 1,* , Santhi Kumaran 2 and Marco Zennaro 3

1 African Centre of Excellence in Internet of Things, University of Rwanda, 3900, Rwanda 2 School of ICT, The Copperbelt University, Kitwe 21692, Zambia; [email protected] 3 International Centre of Theoretical Physics, Strada Costiera, 11, I-34151 Trieste, Italy; [email protected] * Correspondence: [email protected]

 Received: 20 February 2020; Accepted: 27 April 2020; Published: 10 June 2020 

Abstract: Landslides fall under natural, unpredictable and most distractive disasters. Hence, early warning systems of such disasters can alert people and save lives. Some of the recent early warning models make use of Internet of Things to monitor the environmental parameters to predict the disasters. Some other models use machine learning techniques (MLT) to analyse rainfall data along with some internal parameters to predict these hazards. The prediction capability of the existing models and systems are limited in terms of their accuracy. In this research paper, two prediction modelling approaches, namely random forest (RF) and logistic regression (LR), are proposed. These approaches use rainfall datasets as well as various other internal and external parameters for landslide prediction and hence improve the accuracy. Moreover, the prediction performance of these approaches is further improved using antecedent cumulative rainfall data. These models are evaluated using the receiver operating characteristics, area under the curve (ROC-AUC) and false negative rate (FNR) to measure the landslide cases that were not reported. When antecedent rainfall data is included in the prediction, both models (RF and LR) performed better with an AUC of 0.995 and 0.997, respectively. The results proved that there is a good correlation between antecedent precipitation and landslide occurrence rather than between one-day rainfall and landslide occurrence. In terms of incorrect predictions, RF and LR improved FNR to 10.58% and 5.77% respectively. It is also noted that among the various internal factors used for prediction, slope angle has the highest impact than other factors. Comparing both the models, LR model’s performance is better in terms of FNR and it could be preferred for landslide prediction and early warning. LR model’s incorrect prediction rate FNR = 9.61% without including antecedent precipitation data and 3.84% including antecedent precipitation data.

Keywords: prediction; rainfall; antecedent rainfall; landslide; random forest; logistic regression

1. Introduction Landslides and floods are the common natural disasters that strike the northwestern due to its topographical, geological features and climatic profile [1,2]. According to The National Risk Atlas of Rwanda report published by the Ministry in Charge of Emergency Management (MINEMA), 42% of areas are classified as moderate to very high susceptible areas to landslides [3]. Statistics indicate that many people lose their lives in different disaster incidences, and the damages of their properties/infrastructure are worth millions of dollars, and much more money is spent on disaster recovery too [4]. Every year, during the rainfall period, landslides affect many people in mountainous regions. These disasters led to the loss of lives and left many homeless and without a livelihood. Since the establishment of an institution in charge of disaster management (MINEMA) in 2010, systematic

Int. J. Environ. Res. Public Health 2020, 17, 4147; doi:10.3390/ijerph17114147 www.mdpi.com/journal/ijerph Int. J. J. Environ. Environ. Res. Res. Public Public Health Health 20202020,, 1717,, x 4147 2 of 19 20 livelihood. Since the establishment of an institution in charge of disaster management (MINEMA) in records indicate that there were 227 deaths and 160 injuries from 2011–2018 (June). Also many houses 2010, systematic records indicate that there were 227 deaths and 160 injuries from 2011–2018 (June). collapsed and many hectares of crops have been washed away by landslides and floods [5]. Table1 Also many houses collapsed and many hectares of crops have been washed away by landslides and and Figure1 summarize the statistics of deaths caused by landslides during the period of 2011–2018 floods [5]. Table 1 and Figure 1 summarize the statistics of deaths caused by landslides during the in Rwanda. period of 2011–2018 in Rwanda. Table 1. Number of deaths caused by landslides in Rwanda (2011–2018). Table 1. Number of deaths caused by landslides in Rwanda (2011–2018). Year 2011 2012 2013 2014 2015 2016 2017 2018 Total Year 2011 2012 2013 2014 2015 2016 2017 2018 Total NumberNumber of death of death19 19 14 14 3131 00 10 10 64 64 7 7 77 77222 222

Figure 1. Number of deaths caused by landslides per district.

In most cases, landslides and floods floods in Rwanda occur in a cascading manner as debris is dumped dumped into rivers which in turn causes riverine floodsfloods [[66]. One example is the case of floodsfloods and landslides hazards that occurred in 2016 and 2018 where many people died after heavy rainfalls that caused widespread floodingflooding and landslides across parts of Rwanda. At least 222 died in different different landslide events since 2011–2018, where victims drowned in floodwaterfloodwater or others died after houses collapsed due to debris movement or landslides caused by heavyheavy rainfall. The type of floods floods most threatening Rwanda are riverine floodsfloods due to its densedense river network andand largelarge wetlandswetlands [[7].7]. The triggering factor of a landslide is rainfall infiltrationinfiltration into the soil, making the groundwater level increase, resulting in a reduct reductionion of the shear strength which is also closely related to antecedent rainfall, cumulativecumulative rainfall, rainfall, and and rainfall rainfall duration duration [8]. The[8]. slopeThe instabilityslope instability increases increases with high with intensity high intensityor long duration or long duration rainfall but rainfall does but not does relate not to relate rainfall to alone. rainfall It alone. is affected It is affected by other by factors other suchfactors as suchlithological as lithological material, material, type of soiltype and of soil depth, and thedep surroundingth, the surrounding vegetation, vegetation, slope inclination slope inclination or aspect, or aspect,curvature, curvature, altitude, altitude, land use land patterns, use patterns, and drainage and drainage networks networks [5,9,10]. [5,9,10]. Although the rainfall intensity may be the same in different different regions, landslides may or may not occur accordingaccording to to the the geological geological and and topographical topographical characteristics. characteristics. Therefore, Therefore, rainfall rainfall intensity intensity cannot cannotbe the onlybe the cause only of cause landslides of land [10slides]. According [10]. According to the study to the carried study out carried by MINEMA, out by MINEMA, the landslide the landslidecontributing contributing factors have factors been have assigned been weights assigned based weights on the based past on landslides the past characteristicslandslides characteristics in Rwanda: inrainfall Rwanda: (20%), rainfall slope (20%),(20%), altitudeslope (20%), (15%), altitude soil type (15%), (14%), soil lithology type (14%), (10%), lithology land cover (10%), (9%), land soil depthcover (7%),(9%), distancesoil depth to (7%), the main distance roads to (5%) the [7main]. Various roads machine(5%) [7]. learning Variousmodels machine and learning assessment models studies and assessmenthave been carriedstudies out have on been both carried landslide out susceptibility on both landslide assessments susceptibility and flood assessments predictions and [6, 10flood,11]. Thepredictions models [6,10,11]. are built The based models on the are assumption built based that on the landslides assumption are more that landslides likely to occur are more in situations likely to occursimilar in to situations those of pastsimilar landslides to those [ 4of]. past landslides [4]. Although therethere have have been been numerous numerous researches research undertakenes undertaken on disaster on susceptibilitydisaster susceptibility assessment assessmentand prediction and models,prediction the models, literature the about literature landslides about datalandslides for Rwanda data for is scarce.Rwanda Rwanda is scarce lacks [7]. Rwanda lacks efficient early warnings about disasters and hence, the risks of vulnerability by Int. J. Environ. Res. Public Health 2020, 17, 4147 3 of 20 efficient early warnings about disasters and hence, the risks of vulnerability by incidences are high [12]. Therefore, there is a need for continuous research on prediction of disasters due to landslides and floods to reduce the risks. Recently, a few studies were conducted on landslide susceptibility in Rwanda, [5] but none of them were about disaster prediction for early warning and risk reduction. Elsewhere, numerous machine learning techniques have been used to predict landslide occurrence but most of them have been focusing on disaster susceptibility mapping using internal (geological, topographical, environmental) factors without taking into consideration the triggering internal factors such as rainfall [12,13].Those that include rainfall in dataset do not consider antecedent rainfall or vice versa [5,14,15], others use both daily and cumulated previous precipitations without internal conditioning parameters [16]. False negative rate was not considered to measure the performance of the models, yet it is a crucial evaluation metric in landslide predictions. The main purpose of this research is to improve the performance of the prediction models by including the antecedent rainfall data among other parameters used in the previous studies such as daily rainfall, hill slope angle, soil type, soil depth, and land cover. Another objective is to minimize the incorrect predictions (FNR) as this is a very important metric to be considered since this one counts landslide cases that have not been seen by the prediction model and has a significance in early warning systems. The preferred MLTs are random forest (RF) and logistic regression (LR). These two MLTs were selected among others because of being among the most broadly used in landslide prediction [4] and due to their ability to deal with discrete and continuous data for classification problems. Besides, LR calculates regression coefficients while RF can show up the importance of different parameters, its training speed is high and the computational cost is low [4]. This study is aimed at: (1) to analyze the correlation between rainfall historical data and other topographical and geological factors impacting landslide occurrence in Rwanda and (2) propose a machine learning model for the prediction of this disaster which can be used for early warning. This study is geographically limited to the Ngororero district in Rwanda and the period of study was from 2011–2018.

2. Materials and Methods

2.1. Area of Study The district of Ngororero is one of the seven districts of the western province. The district is situated in the northwestern region of Rwanda. The district has a surface area of around 679 km2, and is composed of 13 administrative sectors, 73 administrative cells, and 419 villages. The district shares borders with five other districts. The district has a relief characterized by high mountains with very steep slopes that flow into valleys. The altitude varies between 1460 m and 2883 m above sea level, the highest point being on Bweru Mountain situated in Muhanda sector with 2883.4 m of altitude. The average annual temperature is 18 ◦C which varies with the altitude. The average altitude is 1500 m. The climate of the region is more of the tropical type with four distinct seasons (short rainy season of October–December corresponding to the agriculture season A; short dry season of January–February; long rainy season of March–June corresponding to the agriculture season B and long dry season of July–September corresponding to the swamp agriculture season C. Rainfall is regular, with a rainfall of 1527.7 mm per year, although irregularities are recorded sometimes with shortages or excess rainfall. Due to much rainfall and topographic structure which is characterized by a steep slope, the district is often affected by landslides (Figure2) and floods almost every year during the period of heavy rainfall. In the past, some of the victims drowned in floodwater, others died after houses collapsed under the heavy rainfall or landslide [17]. Int. J. Environ. Res. Public Health 2020, 17, 4147 4 of 20 Int. J. Environ. Res. Public Health 2020, 17, x 4 of 19

Figure 2. Photographs of some landslides in the study area.

2.2. Data Acquisition and Landslide Inventory 2.2. Data Acquisition and Landslide Inventory 2.2.1. Data Collection 2.2.1. Data Collection Site visits have been conducted in the study area for primary data collection where different Site visits have been conducted in the study area for primary data collection where different landslide-prone areas were visited during the period of 9 months (October 2018–June 2019) to collect landslide-prone areas were visited during the period of 9 months (October 2018–June 2019) to collect GPS coordinates of landslide locations and investigate contributory factors. In addition to the primary GPS coordinates of landslide locations and investigate contributory factors. In addition to the primary data collection, historical data about landslide incidence records have been collected from the ministry data collection, historical data about landslide incidence records have been collected from the in charge of emergency management (formerly called the ministry of disasters management and ministry in charge of emergency management (formerly called the ministry of disasters management refugees: MIDIMAR). Since this ministry was established in 2010, landslide data before 2011 could not and refugees: MIDIMAR). Since this ministry was established in 2010, landslide data before 2011 be found. Therefore, the chosen period for the study is 2011–2018. Daily rainfall data were collected could not be found. Therefore, the chosen period for the study is 2011–2018. Daily rainfall data were from Rwanda office of Meteorology. Soil type, land cover and slope inclination were collected from collected from Rwanda office of Meteorology. Soil type, land cover and slope inclination were different government institutions (Rwanda Land Management and Use Authority, Rwanda Agriculture collected from different government institutions (Rwanda Land Management and Use Authority, Board, and Centre for Geographic Information Systems and Remote Sensing (CGIS) at the University Rwanda Agriculture Board, and Centre for Geographic Information Systems and Remote Sensing of Rwanda. It has been realized that landslides occur under the same or similar conditions as those (CGIS) at the University of Rwanda. It has been realized that landslides occur under the same or that caused them in the past [9]. The common internal triggering factor of landslides in Rwanda is the similar conditions as those that caused them in the past [9]. The common internal triggering factor of precipitation, but depending on the selected disaster-prone area, the conditioning factors may vary. landslides in Rwanda is the precipitation, but depending on the selected disaster-prone area, the 2.2.2.conditioning Dataset factors may vary.

2.2.2.The Dataset internal factors most contributing to the landslide occurrence in Rwanda are slope angle, soil type/texture, soil depth, and land cover while rainfall is the external triggering factor [7]. The daily rainfallThe data internal and factors other parameters most contributing collected to fromthe landslide stage 1 occurrence have been arrangedin Rwanda in are one slope datasheet. angle, Geo-topographicalsoil type/texture, soil data depth, were and extracted land fromcover Geographic while rainfall Information is the external Systems triggering (GIS) data. factor Features [7]. The of thedaily dataset rainfall are data described and other as follows:parameters collected from stage 1 have been arranged in one datasheet. Geo-topographical data were extracted from Geographic Information Systems (GIS) data. Features of 2.2.3.the dataset Rainfall are described as follows: Daily historical rainfall data (2011–2018) for the study area were collected from the Rwanda Office 2.2.3. Rainfall of Meteorology. The dataset was made up of daily precipitation from Sovu and Muramba weather stationsDaily located historical in the rainfall Ngororero data (2011–2018) district. The for sizes the ofstudy rainfall area data were records collected from from each the station Rwanda are 1893Office and of 2077Meteorology. entries respectively. The dataset Sovu was stationmade up had of 978 daily rainfall precipitation instances from while Sovu Muramba and Muramba had 1305 (Figureweather3). stations Because located no landslides in the Ngororero occurred duringdistrict. the Th drye sizes seasons of rainfall in the data study records area, some from non-rainfalleach station daysare 1893 were and not 2077 included entries in respectively. this dataset. Sovu Therefore station 3970had 978 instances rainfall (rainfall instances and while non-rainfall) Muramba have had been1305 (Figure used. 3). Because no landslides occurred during the dry seasons in the study area, some non- rainfall days were not included in this dataset. Therefore 3970 instances (rainfall and non-rainfall) have been used. Int. J. Environ. Res. Public Health 2020, 17, 4147 5 of 20 Int. J. Environ. Res. Public Health 2020, 17, x 5 of 19

Figure 3. RainfallRainfall in in the the study study area. area.

The two rainfall gauge gauge stations stations were were not sufficient sufficient for rainfall data at every location in the the study study area. To To estimate estimate the the rainfall rainfall in in the the areas areas distan distantt from from the rainfall gauge stations, the data from neighboring weatherweather stations stations (Figure (Figure4) were4) were collected collected and dataand interpolated data interpolated using the using inverse the distanceinverse distanceweighting weighting (IDW) method. (IDW) method.

Figure 4. LandslideLandslide inventory inventory in in the the study study area area and and rain gauge stations.

The IDW was chosen because it is a commonly used technique for the estimation of missing data in hydrology and geographical sciences [18]. It is based on the functions of the inverse distances in Int. J. Environ. Res. Public Health 2020, 17, 4147 6 of 20

The IDW was chosen because it is a commonly used technique for the estimation of missing data in hydrology and geographical sciences [18]. It is based on the functions of the inverse distances in which the weights are defined by the opposite of the distance and normalized so that their sum equals one [19]: XN Z(S0) = λiZ(Si) (1) i=1 where Z(S0) represents the interpolated value at point S0, Z(Si) represents the observed value at point Si, n is the number of observations, and λi is the weight. The weights decrease as the distance increases. The weights λi can be calculated as follows:

p d− XN = i0 = λi PN p λi 1 (2) − ⇒ j=1 di0 i=1 where p is a power and di0 is the distance between a target and observations. Rainfall data from the two weather stations in the study area and those in three neighboring districts (Figure4) were used as input into the IDW model and GIS software tool was used for implementation.

2.2.4. Antecedent Rainfall Analysis indicates that landslides can be triggered by one-day prolonged rainfall or by many consecutive days. From a physical viewpoint, the antecedent rainfall determines the initial water content and matric suction of the soil, but effects of the antecedent rainfall lasts for a certain period [10]. From the rainfall dataset, 5-days antecedent rainfall data have been computed and used as an additional parameter. This parameter has the same size (number of records) as that of daily rainfall.

2.2.5. Slope One of the most important landslides’ causal factors is slope [20]. Increase in the slope decreases its stability if the soil depth is sufficient [21]. According to the “National risk atlas of Rwanda” by the ministry in charge of emergency management, the slope is classified into ranges from 0 to 10 according to the steepness angle [7].

2.2.6. Soil Type Based on the grain-size distribution analysis and prevailing range of particle sizes, soils in the study area are classified into three categories: sand, silt, and clay [7].

2.2.7. Soil Depth This was considered as one of the landslide causal factors. Three classes of soil depth were used in the dataset according to MINEMA soil depth classification [7].

2.2.8. Land Cover Various past studies have pointed out that permanently covered lands are more protected against landslides than non-covered ones. According to their potential influences, there are six main types of vegetation in Rwanda [7], but three are dominant in the area of study as shown in Table2. All the data in Table2 have been collected as secondary data. Slope and soil depth are continuous while on the other hand, the soil type and land cover are categorical. Both data were obtained from the source (government institutions) with the assigned scores (claimed to be FAO scores) for the purpose of getting qualitative data as defined in some official documents [7]. The standard scores were used as categorical data during model training process. Figure5 shows the geographical characteristics (as defined in the area of study). Int. J. Environ. Res. Public Health 2020, 17, 4147 7 of 20 Int. J. Environ. Res. Public Health 2020, 17, x 7 of 19

Table 2. >20–25Summary of classification6 of key parameters in the study area and their standardized scores >25–45 10 used in the model.

All theSlope data in Table 2 have been collected as Soil secondary data. Slope and soil depth are Land continuous whileSlope on Angle the other hand, the soil type and land coveSoilr are Depth categorical. Both data were obtained from Score Soil Type Score Score Land Cover Score the(Degree) source (government institutions) with the assigned(cm) scores (claimed to be FAO scores) for the purpose0–10 of getting qualitative 0 Claydata as defined 0 in some<50 official documents 1 [7]. Forest The plantation standard scores 3 were>10–15 used as categorical 1 data Sand during model 4 training>50–100 process. Figure 4 5 shows Agriculture the geographical 7 characteristics>15–20 (as defined 4 in the Silt area of study). 6 >100 10 Open land 10 >20–25 6 >25–45The complete dataset 10 was made of 3970 daily rainfall × 5 slope categories × 3 soil type categories × 3 soil depth categories × 3 land cover categories = 535,950 dataset records.

(a) (b)

(c) (d)

Figure 5. Geographical characteristics in the study area: (a) slope, (b) land cover, (c) soil type, (d) soil depth. Int.Int. J. Environ. J. Environ. Res. Res. Public Public Health Health 20202020, 17, 17, x, 41478 of 8 of19 20

Figure 5. Geographical characteristics in the study area: (a) slope, (b) land cover, (c) soil type, (d) soil The complete dataset was made of 3970 daily rainfall 5 slope categories 3 soil type categories depth. × × 3 soil depth categories 3 land cover categories = 535,950 dataset records. × × 2.2.9. Landslide Incidences 2.2.9. Landslide Incidences Data about landslide events with their respective date of occurrences were collected from the Data about landslide events with their respective date of occurrences were collected from the ministry in charge of emergency management (MINEMA) and were recorded as either zero (0) for ministry in charge of emergency management (MINEMA) and were recorded as either zero (0) for “no landside event” or as one (1) meaning “landslide event”. The full dataset used in this study is “no landside event” or as one (1) meaning “landslide event”. The full dataset used in this study is available online [22]. The datasheet was converted to the comma-separated value (CSV) format for available online [22]. The datasheet was converted to the comma-separated value (CSV) format for the reason of being used in the machine learning algorithm. At this stage, the missing values were the reason of being used in the machine learning algorithm. At this stage, the missing values were handled using the fill in missing values with the median technique and labels removed from features. handled using the fill in missing values with the median technique and labels removed from features.

2.2.10.2.2.10. Splitting Splitting Dataset Dataset into into a aTraining Training and and a aTest Test Dataset Dataset TheThe dataset dataset was was divided divided into into training training set set and and test test set set at at different different proportions. proportions.

2.2.11.2.2.11. Training Training and and Testing Testing the the Models Models TheThe training training dataset dataset was was used used to to train train two two mach machineine learning learning models models and and predictions predictions were were done done usingusing the the test test dataset. dataset. Test Test labels labels were were used used for for the the accuracy accuracy calculation. calculation.

2.2.12.2.2.12. Results Results Analysis Analysis BasedBased on on the the performances, performances, the the best best model model wi willll be be recommended recommended to to be be implemented implemented for for early early warningwarning system. system.

2.3.2.3. Machine Machine Learning Learning Models Models ThisThis study study aims aims to to apply apply two two machine machine learning learning techniques techniques and and choose choose the the best best performing performing one one forfor predicting predicting landslide landslide incidences incidences in in Rwanda, Rwanda, which which can can be be used used for for early early warning warning to to reduce reduce risks. risks. TwoTwo classification classification models models that that have have been been selected selected to to accomplish accomplish this this research research area area are are random random forest forest (RF)(RF) and and logistic logistic regression (LR).(LR). The The simplicity simplicity or or complexity complexity of eachof each MLT MLT differs differs from onefrom to one another. to another.Hyper-parameter Hyper-parameter tuning has tuning been done has tobeen optimize done the to performance optimize the of each.performance These classification of each. These models classificationwere chosen models because were this researchchosen because is dealing this with rese thearch binary is dealing classification with the problem. binary classification The complete problem.research The methodology complete research is indicated methodology in Figure6 .is indicated in Figure 6.

FigureFigure 6. 6.ProposedProposed landslide landslide prediction prediction flow flow chart. chart. Int. J. Environ. Res. Public Health 2020, 17, 4147 9 of 20 Int. J. Environ. Res. Public Health 2020, 17, x 9 of 19

2.3.1. Random Forest Random forest (RF)(RF) waswas firstfirst introducedintroduced by by Breiman Breiman [23 [23]] as as an an ensemble ensemble learning learning algorithm. algorithm. It isIt ais commonlya commonly used used machine machine learning learning technique technique for data for prediction data prediction for both classificationfor both classification and regression and problemsregression[ 23problems–25]. It is[23–25]. called It random is called forest random because forest of randomnessbecause of randomness in the selection in the of selection features onof eachfeatures decision on each tree decision and also tree training and also samples training from samp theles dataset from the [26]. dataset It operates [26]. It by operates building by up building several decisionup several trees decision from randomtrees from subsets random of datasetsubsets during of dataset the trainingduring the time training to make time up ato relationshipmake up a betweenrelationship features between and features labels [ 27and]. Decisionlabels [27]. or Decisi classificationon or classification tree (Figure tree7) is(F aigure tree-shaped 7) is a tree-shaped diagram whichdiagram is awhich basic structureis a basic block structure of RF block used toof decideRF used a route to decide of action. a route RF consistsof action. of theRF constructionconsists of the of classificationconstruction of rules classification (tree) from rules bootstrap (tree) samples from bootstrap of the dataset. samples The of decision the dataset. of the The final decision classification of the isfinal computed classification from theis computed majority votefrom of the multiple majority classification vote of multiple trees. classification trees.

Figure 7. Random Forest decision tree [[28].28].

The bigger number of trees in the model plays a role in improving the accuracy and predictive capability [29] [29] and requires definingdefining the number of trees (T) andand thethe numbernumber of predictivepredictive variablesvariables (m). RF can workwork withwith bothboth categorical categorical and and numerical numerical data data [20 [20].]. RF RF was was chosen chosen because because of of its its power power to dealto deal with with mixed mixed variables variables (categorical (categorical and numerical), and numerical), over-fitting over-fitting risks reduced, risks reduced, efficiently efficiently working onworking large databaseson large databases with high with accuracy, high accuracy, robustness robu withstness less trainingwith less time, training estimates time, estimates missing data, missing and featuredata, and importance feature importance indication indication [30]. [30].

2.3.2. Logistic Regression Logistic regression (LR) is a machine learning algorithm that was initially used in biological sciences (during the early 20th century) and later in otherother areas.areas. It is used in solvingsolving classificationclassification problems by measuring the relationshiprelationship between independentindependent variables and categoricalcategorical dependent variables. The technique has become one of the the most most used used in in landslide landslide susceptibility susceptibility assessment assessment [20]. [20]. It isis in in three three categories: categories: Binary Binary (dichotomous (dichotomous or has two or possiblehas two classes possible coded classes as 0/1: absencecoded/ present),as 0/1: Multinomialabsence/present), (more Multinomial than two categories (more than without two cate ordering),gories without and ordinal ordering), LR (more and ordinal than 2 categoriesLR (more withthan ordering).2 categories In with its simplest ordering). form In LR its model simplest can form be expressed LR model using can a be sigmoid expressed function using to a makeover sigmoid regression value in a range of to + producing a probability between 0 and 1 for non-landslide function to makeover regression−∞ value∞ in a range of −∞ to +∞ producing a probability between 0 and and1 for landslidenon-landslide respectively and landslide with 0.5 respec beingtively threshold with [0.531]: being threshold [31]: 1 𝑃(𝑌( = ) == 1) = (3) P Y 1 1+𝑒y (3) 1 + e− where P is the probability of event occurrence and y is the dependent variable (landslide or no- landslide) and is defined as: 𝑦 = 𝛽0 + 𝛽1𝑋1 + 𝛽𝑋2 +⋯𝛽𝑛𝑋𝑛 (4)

where y is dependent variable and X1, X2, … and Xn are explanatory variables. Then: Int. J. Environ. Res. Public Health 2020, 17, 4147 10 of 20 where P is the probability of event occurrence and y is the dependent variable (landslide or no-landslide) and is defined as: y = β0 + β1X1 + βX2 + βnXn (4) ··· where y is dependent variable and X1, X2, ... and Xn are explanatory variables. Then:

1 eβ0+β1X1+β2X2+...βnXn P(Y = 1) = = (5) (β0+β1X1+β2X2+...βnXn) β0+β1X1+β2X2+...βnXn 1 + e− 1 + e where β0 is an intercept of the model and β1, β2, ... .βn are regression coefficients that measure the contribution of x1,x2, ... xn to the prediction model. With LR mathematical equation can be generated from regression coefficients and intercept, hence the probability of landslide incidence can be computed.

2.4. Re-Sampling One of the challenges of various machine learning models is that their performance is affected by imbalanced data in the training dataset. To deal with this problem, three re-sampling techniques have been used: oversampling, under-sampling and synthetic minority oversampling technique (SMOTE) and the later found to be the best. SMOTE was initially proposed in 2001 by Nitesh [31]. It uses the nearest neighbor’s algorithm to create new instances of minority categories by forming a convex combination of neighboring instances.

2.5. Preliminary Analysis Using Exploratory Data Analysis (EDA) Exploratory data analysis (EDA) was used to get patterns and the relationship between landslide incidences and various parameters [31]. The scatter plot matrix tool was used as methods of EDA and helped to identify the relationship among variables.

2.6. Models Evaluation Proper evaluation of the performance of the model is a crucial aspect of predictive modelling. Wide ranges of performance metrics are available for classification and regression models. A common metric for binary classification is the area under the receiver operating curve (ROC). The most widely used evaluation metric is confusion metric, which distinguishes measures between errors and measures overall accuracy or percent correct classification. In this study, the statistical index-based method and ROC have been used to evaluate performance of the models. In this regard, the table of summary of classification errors (confusion matrix) was used to calculate the following evaluation indexes:

TP + TN Accuracy = (6) TP + FN + TN + FP

FP FPR = (7) FP + TN TP Recall = (8) TP + FN where Accuracy is the overall performance of the classifier, false positive rate (FPR) is the rate of incorrect predictions. TP or Recall denotes the true positive (correctly predicted incidences), TN represents true negative (correctly predicted as negative) [32]. FP is the false positive (predicted occurrences but did not occur), FN is the false negative (predicted negative but in reality it was positive). These performance metrics were selected based on their high impact on predicted landslide. Moreover, the metrics are most relevant for the proposed approach as the model will be combined with wireless sensor network (WSN) that will be used as early warning system to alert people about landslide occurrence and reduce risks. Int. J. Environ. Res. Public Health 2020, 17, 4147 11 of 20

The receiver operating curve is one of the most widely used models’ performance tools for classification problems. It summarizes all the confusion matrices that each threshold produced by plotting FPR (1 - sensitivity) on X-axis vs. Sensitivity or Recall (TPR) on Y-axis [33]. ROC curve makes Int.it easy J. Environ. to identify Res. Public the Health best 2020 threshold, 17, x to make a decision. The area under the curve (AUC) is also11 used of 19 to compare two or more ROC curves. AUC is used to decide which classification method is better. The following equation is used to calculate AUC(∑ [34𝑇𝑃]: + ∑ 𝑇𝑁) 𝐴𝑈𝐶 = (9) 𝑃+𝑁 Int. J. Environ. Res. Public Health 2020, 17, x P P 11 of 19 ( TP + TN) where P is the total number of positivesAUC = (landslides), N is the total number of negatives (No(9) P + N landslides). (∑ 𝑇𝑃 + ∑ 𝑇𝑁) 𝐴𝑈𝐶 = (9) where P is the total number of positives (landslides), 𝑃+𝑁N is the total number of negatives (No landslides). 3. Resultswhere P is the total number of positives (landslides), N is the total number of negatives (No 3. Resultslandslides). 3.1. Preliminary Analysis: Correlation Among Features Used in the Models 3.1. Preliminary3. Results Analysis: Correlation Among Features Used in the Models As shown in Figure 8, there is no specific threshold for 1-day rainfall inducing landslides. As shown in Figure8, there is no specific threshold for 1-day rainfall inducing landslides. Rainfall Rainfall3.1. Preliminaryfrom 0 mm Analysis: could Correlationcause landslide Among Features occurre Usednce independing the Models on how much it rained in the previousfrom 0 mm days. could As causeindicated landslide by graphs, occurrence very dependingfew landslide on hazards how much occurred it rained for in antecedent the previous rainfall days. As shown in Figure 8, there is no specific threshold for 1-day rainfall inducing landslides. lessAsindicated than 50 mm. by graphs, In most very cases, few landslides landslidehazards occurred occurred when previous for antecedent cumulative rainfall precipitation less than 50 mm.was In mostRainfall cases, from landslides 0 mm could occurred cause when landslide previous occurre cumulativence depending precipitation on how much was close it rained to 100 in mm.the close previousto 100 mm. days. As indicated by graphs, very few landslide hazards occurred for antecedent rainfall less than 50 mm. In most cases, landslides occurred when previous cumulative precipitation was close to 100 mm.

FigureFigure 8. CorrelationCorrelation between between landslide landslide and and rainfalls. rainfalls. Figure 8. Correlation between landslide and rainfalls. Also,Also, as as indicated indicated in in Figure Figure 99,, many landslides occurred inin hilly areas where thethe slope inclination Also, as indicated in Figure 9, many landslides occurred in hilly areas where the slope inclination isis 25–45° 25–45◦ (category(category 6 6and and 10) 10) and and very very few few landslides landslides occurred occurred in the in area the area where where the slope the slope is less is than less is 25–45° (category 6 and 10) and very few landslides occurred in the area where the slope is less than 15°than (category 15 (category 1 and 1 and4). The 4). Theland land covered covered by bythe the forest forest resists resists landslide landslide hazards hazards but, but, also also the the low low 15°◦ (category 1 and 4). The land covered by the forest resists landslide hazards but, also the low numbernumber of of landslide landslide incidences incidences in in that that area area can can be be justified justified by by the the low low dominance dominance of of that that category. category. number of landslide incidences in that area can be justified by the low dominance of that category. AgriculturalAgricultural land land is the is themost most affected aff affectedected by by slope slope failurefailure while the thethe forest forestforest is isisless lessless affected. aaffected.ffected.

FigureFigure 9. Correlation9. Correlation analysis analysis betweenbetween landslides, landslides, slope slope and and land land cover. cover. Figure 9. Correlation analysis between landslides, slope and land cover. The soil under the category of silt was much affected by landslides while areas of sand have few incidences just because of physical properties of this soil particles but also because not much of this The soil under the category of silt was much affected by landslides while areas of sand have few type of soil is found in the district. Figure 10 points out that few landslides incidences occurred in the incidences just because of physical properties of this soil particles but also because not much of this areas where the soil depth is thin (less than 50 cm) while more occurred in the regions with thick soil type of(depth soil isbigger found than in 50).the district. Figure 10 points out that few landslides incidences occurred in the areas where the soil depth is thin (less than 50 cm) while more occurred in the regions with thick soil (depth bigger than 50). Int. J. Environ. Res. Public Health 2020, 17, 4147 12 of 20

The soil under the category of silt was much affected by landslides while areas of sand have few incidences just because of physical properties of this soil particles but also because not much of this type of soil is found in the district. Figure 10 points out that few landslides incidences occurred in the areas where the soil depth is thin (less than 50 cm) while more occurred in the regions with thick soil (depthInt. J. Environ. bigger Res. than Public 50). Health 2020, 17, x 12 of 19

Figure 10. Correlation analysis between landslide and soil. Figure 10. Correlation analysis between landslide and soil. 3.2. Models Results 3.2. Models Results The performance evaluation of the classifiers was done by using five different metrics including RecallThe (or performance percent correlation evaluation classification), of the classifiers accuracy, was done FPR, by FNR, using and five the different ROC-AUC. metrics Four including internal parametersRecall (or percent and one correlation external (triggering classification), factor) accu haveracy, been FPR, used FNR, by and the models.the ROC-AUC. Initially, Four the internal models wereparameters trained and by usingone external 1-day rainfall (triggering followed factor) by ha trainingve been models used by with the the models. inclusion Initially, of the the cumulated models 5-dayswere trained antecedent by using precipitation 1-day rainfall as a newfollowed parameter by training into the models dataset. with the inclusion of the cumulated 5-daysInitially, antecedent the models precipitation have been as traineda new parameter using 5-fold into cross the validationdataset. and later tested using different train/Initially,test ratios the (0.8 models and 0.2, have 0.75 been and 0.25,trained 0.7 using and 0.3, 5-fold 0.65 cross and 0.35, validation 0.6 and and 0.4, later 0.55 andtested 0.45) using for determiningdifferent train/test which provideratios (0.8 better and results.0.2, 0.75 By and applying 0.25, 0.7 5-fold and cross0.3, 0.65 validation, and 0.35, the 0.6 variance and 0.4, is very0.55 low,and as0.45) indicated for determining by the insignificant which provide standard better deviation results. forBy bothapplying models 5-fold which cross is 0.0003validation, and 0.0004 the variance for RF andis very LR, low, respectively as indicated (Table by3 the). The insignificant overall prediction standard accuracy deviation was for 95.30%both models (RF) and which 94.63% is 0.0003 (LR) and by using0.0004 cross for RF validation and LR, while respectively 0.6 and 0.4(Table ratio 3). provided The overall better prediction results with accuracy an accuracy was 95.30% of 95.35% (RF) for and RF and94.63% 0.55 (LR) and by 0.45 using was cross the best validation ratio for while LR. On 0.6 the and other 0.4 ratio hand, provided if the antecedent better results rainfall with is an included accuracy as theof 95.35% new feature for RF in and the models,0.55 and the 0. overall45 was accuracythe best wasratio 98.74 for LR. and On 98.79% the other (RF, LR hand, respectively), if the antecedent the best ratiorainfall for is RF included cross validation as the new was feature 0.65 andin the 0.35 model withs, an the accuracy overall ofaccuracy 97.67%, was while 98.74 0.60 and and 98.79% 0.40 ratio (RF, looksLR respectively), to be better inthe LR best model ratio withfor RF 98.40% cross accuracy.validation was 0.65 and 0.35 with an accuracy of 97.67%, while 0.60 and 0.40 ratio looks to be better in LR model with 98.40% accuracy. Table 3. Models’ overall accuracy. Table 3. Models’ overall accuracy. 1-Day Rainfall (Without Antecedent Rainfall) 5-Days Antecedent Rainfall Included 1-Day RainfallAccuracy (Without (using Antecedent Rainfall) Accuracy 5-Days (using Antecedent Rainfall Included Fold Accuracy (using train/test ratios) Accuracy (using train/test ratios) Accuracycross-validation) (using Accuracy (using train/test cross-validation)Accuracy (using Accuracy (using train/test Fold cross-validation)RF LR Ratioratios) RF LRcross-validation) RF LR Ratio RFratios) LR 1 0.9530 0.9466 0.80 & 0.20 0.9414 0.9322 0.9875 0.9877 0.80 & 0.20 0.9742 0.9836 RF LR Ratio RF LR RF LR Ratio RF LR 2 0.9536 0.9466 0.75 & 0.25 0.9398 0.9330 0.9871 0.9882 0.75 & 0.25 0.9759 0.9835 0.80 & 0.80 & 1 0.95303 0.95310.9466 0.9459 0.70 & 0.300.9414 0.9399 0.9322 0.9352 0.9875 0.9876 0.98780.9877 0.70 & 0.30 0.97440.9742 0.98380.9836 4 0.952 0.94570.20 0.65 & 0.35 0.9394 0.9392 0.9870 0.9879 0.65 &0.20 0.35 0.9767 0.9838 5 0.9530 0.94670.75 0.60 & & 0.40 0.9469 0.9413 0.9876 0.9879 0.60 &0.75 0.40 & 0.9740 0.9840 2 Av.0.9536 0.95300.9466 0.9463 0.55 & 0.450.9398 0.9409 0.9330 0.9415 0.9871 0.9874 0.98790.9882 0.55 & 0.45 0.97440.9759 0.98370.9835 Std 0.0003 0.00040.25 0.0028 0.0041 0.0002 0.00010.25 0.001 0.0001 0.70 & 0.70 & 3 0.9531 0.9459 0.9399 0.9352 0.9876 0.9878 0.9744 0.9838 0.30 0.30 It has been observed0.65 that & there is no significant difference in applying di0.65fferent & train/test ratios as 4 0.952 0.9457 0.9394 0.9392 0.9870 0.9879 0.9767 0.9838 indicated by the standard deviation0.35 (0.0001). However, the ratio of 65% of the0.35 dataset was adopted as training, while 35% was used0.60 & as a test sample. This ratio was preferred for0.60 the & reason that LR 0.65 5 0.9530 0.9467 0.9469 0.9413 0.9876 0.9879 0.9740 0.9840 and 0.35 is the best ratio for0.40 incorrect predictions (FNR) and for maintaining0.40 the same ratios to both 0.55 & 0.55 & Av.models. 0.9530 The overall 0.9463 accuracy could0.9409 not be considered0.9415 as0.9874 only one0.9879 metric to judge the0.9744 performance 0.9837 0.45 0.45 Std 0.0003 0.0004 0.0028 0.0041 0.0002 0.0001 0.001 0.0001

It has been observed that there is no significant difference in applying different train/test ratios as indicated by the standard deviation (0.0001). However, the ratio of 65% of the dataset was adopted as training, while 35% was used as a test sample. This ratio was preferred for the reason that LR 0.65 and 0.35 is the best ratio for incorrect predictions (FNR) and for maintaining the same ratios to both models. The overall accuracy could not be considered as only one metric to judge the performance of Int. J. Environ. Res. Public Health 2020, 17, x 13 of 19

the models because of the imbalance of dependent values in the dataset. Therefore, to assess the performanceInt. J. Environ. Res. of Publicthe models, Health 2020 other, 17, 4147metric indexes have been used. FN and FP were used to evaluate 13 of 20 the models in terms of incorrect predictions such as landslide cases which have been considered as no landslides, yet landslides occurred. In addition to the prediction results presented in Table 4 RF generatedof the models features because of importance of the imbalance indicating of dependent how much values each inconditioning the dataset. factor Therefore, contributes to assess to the the landslideperformance incidence. of the models,As the triggering other metric factor, indexes rainfa havell has been the used. highest FN contribution and FP were rate used on to landslide evaluate occurrence.the models inIf terms1-day ofrainfall incorrect data predictions are used (without such as landslideantecedent cases prec whichipitation), have rainfall been considered contributes as 40.49%,no landslides, slope 32.74%, yet landslides soil type occurred. 16.48%, land In addition cover (land to the use) prediction 8.46%, and results soil depth presented 1.80% in(Figure Table 411a).RF generated features of importance indicating how much each conditioning factor contributes to the landslide incidence. As the triggeringTable factor, 4. Performance rainfall has results. the highest contribution rate on landslide occurrence. If 1-day rainfall data are used (without antecedent precipitation), rainfall contributes 1-Day Rainfall, Antecedent Rainfall 5-Days Antecedent Rainfall 40.49%,Performance slope 32.74%,Metric soil type 16.48%, land cover (land use) 8.46%, and soil depth 1.80% (Figure 11a). Excluded (%) Included (%) RF LR RF LR Table 4. Performance results. Recall (TPR) 84.61 90.38 95.19 96.15 Specificity (TNR) 93.911-Day Rainfall, Antecedent 93.92 5-Days 97.64 Antecedent Rainfall 98.38 Performance Metric False Positive Rate (FPR) 6.08 Rainfall Excluded 6.07 (%) 2.35Included (%) 1.61 False Negative Rate (FNR) 15.38 RF 9.61 LR RF 4.80 LR 3.84 Recall (TPR) 84.61 90.38 95.19 96.15 Upon Specificityincluding (TNR)antecedent cumulated 93.91 rainfall, the 93.92 most contributing 97.64 factor was an 98.38 antecedent rainfall Falseat the Positive rate of Rate 40.64%, (FPR) followed by 6.08 the rainfall on 6.07the day of landslide 2.35 occurrence 1.61 contributed 22.91%,False slope Negative with 22.50%, Rate (FNR) soil type 7.87%, 15.38 land cover 5.21%, 9.61 and soil depth 4.80 0.84% (Figure 3.84 11b).

FigureFigure 11. 11. RFRF features features importance importance ( (aa)) without without antecedent antecedent rainfall rainfall ( (bb)) with with antecedent antecedent rainfall. rainfall. Upon including antecedent cumulated rainfall, the most contributing factor was an antecedent Also, LR model provided the intercept and coefficients of variables form which the mathematical rainfall at the rate of 40.64%, followed by the rainfall on the day of landslide occurrence contributed equation establishing the relationship between labels (dependent variables) and features 22.91%, slope with 22.50%, soil type 7.87%, land cover 5.21%, and soil depth 0.84% (Figure 11b). (independent variables) was derived. Variables’ coefficients reveal which parameter has much or low Also, LR model provided the intercept and coefficients of variables form which the mathematical impact on landslide occurrence as indicated by positive or negative values in Table 5. equation establishing the relationship between labels (dependent variables) and features (independent variables) was derived. Variables’Table 5. LR coe interceptfficients and reveal coefficients which of parameter parameters. has much or low impact on landslide occurrence as indicated by positive or negative values in Table5. Daily Antecedent Rainfall Intercept Slope Soil Type Soil Depth Land Cover Rainfall Table(5-Days) 5. LR intercept and coefficients of parameters. −7.06 1.32 2.47 −9.34 −2.90 −4.10 −4.37 Daily Antecedent Intercept Slope−9.29 Soil Type−5.39 Soil Depth−1.85 Land Cover −1.12 Rainfall Rainfall (5-Days) 1.20 1.23 −1.10 −1.56 7.06 1.32 2.47 9.34 2.90 4.10 4.37 − − 3.87 − − − 9.29 5.39 1.85 1.12 − 6.49 − − − 1.20 1.23 1.10 1.56 − − 3.87 By using equations (3) and (4) and values of inte6.49rcept and coefficients in Table 5, the probability of landslide occurrence can be expressed by the following equation:

P(Y=1) = = (10) Int. J. Environ. Res. Public Health 2020, 17, 4147 14 of 20

By using Equations (3) and (4) and values of intercept and coefficients in Table5, the probability of landslide occurrence can be expressed by the following equation:

1 ey P(Y = 1) = y = y (10) 1 + e− 1 + e where P is the probability of landslide occurrence and:

Y = 7.06+ 1.32rf + 2.47Ar 9.34Sl0 10 9.29Sl11 15 + 1.2Sl16 20 + 3.87Sl21 25 − − − − − − − +6.49Sl26 45 2.9Stclay 5.39Stsand + 1.23Stsilt 4.10Sd<0.5 − − − − (11) 1.85Sd0.6 1.0 1.10Sd>1.0 4.37Lcforest 1.12Lcagri − − − − − 1.56Lcopen − where 7.09 is the intercept, Rf is one day rainfall, Ar is antecedent rainfall, Sl is a slope in degrees − (0–10, 11–15, 16–20, 21–25, 26–45) respectively, St is the type of soil (clay, sand and silt), Sd is soil with (>0.5 meter, 0.5–1.0 meter, more than 1.0 meter) depth and Lc is land cover (forest, agriculture and open land).

4. Discussion Prediction of landslide occurrence and early warning systems are the primary keys for the awareness and preparedness for risk reduction from this type of disaster. Different prediction models and warning systems (both local and regional) are available for this purpose [35]. The most popular machine learning models that have recently been used for landslide prediction are namely random forest [20,25,36], artificial neural network [37,38], support vector machine [4,25], logistic regression [36,38,39], etc. The results obtained from these different studies were good depending on the variables used and metrics used to evaluate their performance. In some studies, only internal factors were used to determine landslide susceptibility, some others have used rainfall, while a few of them included antecedent precipitation to determine its impact on the disaster occurrence. In this study, we preferred to use RF and LR for the reason explained earlier and more particularly to be applied in Rwanda as, to the best of our knowledge, there is no such study carried out on this territory. The use of both internal (geological and morphological) factors together with external (triggering) factor (rainfall: antecedent & current) led us to the better performance of the models compared to when either one of these factors is not considered. The results from a comparative analysis of one-day rainfall and 5-days antecedent cumulative rainfall before landslide occurrence revealed that antecedent cumulative rainfall has more impact on landslide occurrence than one-day rainfall (Table6). This is because the shear strength of the soil is degraded by rainfall and continuous if the interval between two or more rainfall events is short (few hours) but also depending on the hydraulic conductivity of the soil [40]. All performance metrics improve when the antecedent rainfall (Ar) is used for both models. For correct predictions, RF improves 10.58 and 3.73% for recall and specificity respectively while LR improves 5.77 and 4.46% on the same metrics. Incorrect predictions improve in the same manner as the correct predictions as ones are the opposite of the other ones. FN is a very important parameter to consider in the case of landslide prediction as its low value means few incidence cases were not predicted. In this case, LR performs better than RF as Table6 shows that only 3.84% cases were not predicted. The receiver operating curve was used to identify the performance of the models at different classification threshold and the AUC was used to compare two models by including or not including antecedent cumulated rainfall. Considering ROC-AUC, RF performs with AUC = 0.973 (Figure 12a) if only daily rainfall is considered as landslide triggering factor and AUC = 0.995 by including antecedent rainfall as additional landslide triggering factor (Figure 12b). This means that antecedent cumulated rainfall implicates better performance of the model to the rate of 2.2%. Likewise, Ar improved the LR model as before the inclusion of previous cumulative precipitations AUC was 0.979 (Figure 12a). Int. J. Environ. Res. Public Health 2020, 17, 4147 15 of 20

However, after including the Ar, AUC was equal to 0.997 (Figure 12b), hence this parameter contributed 1.8% to the model prediction.

Int. J. Environ. Res. Public HealthTable 2020 6., 17Models’, x performance summary (various metrics). 15 of 19

Random Forest Logistic Regression The receiver operating curve was used to identify the performance of the models at different Correct Predictions (%) classification threshold and the AUC was used to compare two models by including or not including Without With Without With Performance Improvement Improvement antecedent cumulatedAntecedent rainfall. ConsideringAntecedent ROC-AUC, RFAntecedent performs withAntecedent AUC = 0.973 (Figure 12a) Metric (%) (%) if only daily rainfallRainfall is consideredRainfall as landslide triggeringRainfall factor and RainfallAUC = 0.995 by including antecedentRecall (TPR)rainfall as additional 84.61 landslide 95.19 triggeri 10.58ng factor (Figure 90.38 12b). This 96.15 means that 5.77 antecedent Specificity cumulated rainfall implicates93.91 better 97.64 performance 3.73 of the model 93.92 to the rate 98.38 of 2.2%. Likewise, 4.46 Ar (TNR) improved the LR model as before the inclusion of previous cumulative precipitations AUC was 0.979 Incorrect Predictions (%) (Figure 12a). However, after including the Ar, AUC was equal to 0.997 (Figure 12b), hence this False Positives 6.08 2.35 3.73 6.07 1.61 4.46 parameterFalse Negatives contributed 15.38 1.8% to the 4.80model prediction. 10.58 9.61 3.84 5.77

FigureFigure 12.12. ROC-AUCROC-AUC (a) ( awith) with no no antecedent antecedent rainfall, rainfall, (b) (5-daysb) 5-days antecedent antecedent rainfall rainfall taken taken into intoconsideration. consideration. Both models indicated that 5-days antecedent precipitation has a high impact on the occurrence Both models indicated that 5-days antecedent precipitation has a high impact on the occurrence of landslides in the study area. Table5 shows that the features most triggering landslide incidences are of landslides in the study area. Table 5 shows that the features most triggering landslide incidences rainfall (one-day, 5-days antecedent) as indicated by their coefficients of 2.47 and 1.32, respectively. are rainfall (one-day, 5-days antecedent) as indicated by their coefficients of 2.47 and 1.32, Among the internal factors, the slope is the most affecting the occurrence of the disaster. The magnitude respectively. Among the internal factors, the slope is the most affecting the occurrence of the disaster. of the impact increases as slope increases or vice versa. For low slopes (less than 15 degrees coefficients The magnitude of the impact increases as slope increases or vice versa. For low slopes (less than 15 are negative while the probability of landslide incidence is very high on slopes of 25–45 degrees with a degrees coefficients are negative while the probability of landslide incidence is very high on slopes coefficient of 6.49. This means that high slope zones are prone to landslides and people should not of 25–45 degrees with a coefficient of 6.49. This means that high slope zones are prone to landslides reside in such areas to reduce risks. The silt soil, the soil of more than 1m deep, and agriculture land and people should not reside in such areas to reduce risks. The silt soil, the soil of more than 1m deep, are the most vulnerable to the landslide (Table5). and agriculture land are the most vulnerable to the landslide (Table 5). Table6 shows that both models performed well in terms of di fferent performance indicators such Table 6 shows that both models performed well in terms of different performance indicators such as as accuracy, errors, and AUC. The use of antecedent precipitation made the models predict better. accuracy, errors, and AUC. The use of antecedent precipitation made the models predict better. We We also compared the performance of these prediction models with other recently established models also compared the performance of these prediction models with other recently established models using the same machine learning algorithms, [26,41–44] and the ones used different models such as using the same machine learning algorithms, [26,41–44] and the ones used different models such as artificial neural networks (ANNs) [26], support vector machine (SVM) [26], event-class predictor [45], artificial neural networks (ANNs) [26], support vector machine (SVM) [26], event-class predictor [45], and rotation forest with alternating decision tree (RFADT) [41]. The comparison of the models was and rotation forest with alternating decision tree (RFADT) [41]. The comparison of the models was done using AUC which seems to be widely used by researchers and FNR which has been used by a done using AUC which seems to be widely used by researchers and FNR which has been used by a few. The ANN proposed by Wang et al. [26] performed better compared to the RF and LR models few. The ANN proposed by Wang et al. [26] performed better compared to the RF and LR models proposed in this study before the use of antecedent rainfall data, but after including the antecedent proposed in this study before the use of antecedent rainfall data, but after including the antecedent rainfall data in the dataset both RF and LR models predict better with good AUC results compared rainfall data in the dataset both RF and LR models predict better with good AUC results compared to ANN and other models. Thus, adding antecedent rainfall data has made the RF and LR models proposed in the current study achieve good results (Figure 13). A comparative analysis has also been done with other models on incorrect predictions. Figure 14 indicates that the model proposed by Utomo et al. [36] is the best to minimize false negative errors up to 1.60%, while the LR in our study can minimize FNR up to 3.84%. On the other hand, our proposed LR can minimize FP errors up to 1.61% against 2.01% of Utomo et al. [36]. Int. J. Environ. Res. Public Health 2020, 17, x 16 of 19

Comparing the two proposed models RF and LR of the current study among themselves, there is a slight difference in the prediction parameters, like for instance if ROC-AUC is considered both models are good, as indicated in Figure 12, but in terms of prediction errors, LR is considered the best Int.because J. Environ. its Res.FNR Public is 3.84%Health 2020 against, 17, x 4.80% of RF. This evaluation parameter (FNR) should be highly16 of 19 considered because it indicates how many landslide cases were predicted as NO landslides, which can beComparing a dangerous the outcome. two proposed Minimi modelszing the RF ratio and ofLR false of the negatives current isstudy crucial among for better themselves, performance there ofis athe slight system difference and disaster in the riskprediction reduction. parameters, In this regard, like for logistic instance regression if ROC-AUC is the is bestconsidered model toboth be usedmodels due are to good, its performance as indicated in in terms Figure of 12,error but reduct in termions of(false prediction negatives). errors, Besides LR is errorconsidered reduction, the best LR Int.performsbecause J. Environ. its faster Res.FNR Publicthan is 3.84% Health RF which 2020against, 17 ,is 4147 4.80%another of importantRF. This evaluation aspect that parameter LR to be (FNR)considered should in bethe highly 16 early of 20 warningconsidered system because as the it indicates delay is reduced.how many Though landslid 100%e cases perfect were prediction predicted is as not NO achievable landslides, through which thesecan be modelling a dangerous approaches, outcome. Minimithe priorityzing isthe to ratio mini ofmize false the negatives false negatives is crucial (warning for better that performance there will tobeof ANNtheno landslide system and other and yet disasterlandslides models. risk Thus, occur) redu adding becausection. antecedentIn it this has regard, a negative rainfall logistic impact data regression has on madelives ratheris the the RF bestthan and modelfalse LR models alarms to be proposed(warningused due tothat in its the there performance current will studybe a inlandslide achieve terms of goodwhile error results isreduct false). (Figureion (false 13 ).negatives). Besides error reduction, LR performs faster than RF which is another important aspect that LR to be considered in the early warning system as the delay is reduced. Though 100% perfect prediction is not achievable through these modelling approaches, the priority is to minimize the false negatives (warning that there will be no landslide yet landslides occur) because it has a negative impact on lives rather than false alarms (warning that there will be a landslide while is false).

Figure 13. Comparison of the models results (using AUC) vs. other studies.

A comparative analysis has also been done with other models on incorrect predictions. Figure 14 indicates that the model proposed by Utomo et al. [36] is the best to minimize false negative errors up to 1.60%, while the LR in our study can minimize FNR up to 3.84%. On the other hand, our proposed LR can minimizeFigure FP errors 13. Comparison up to 1.61% of againstthe models 2.01% results of Utomo(using AUC) et al. vs. [36 other]. studies.

Figure 14. Comparison of the models results (false predictions) vs. other studies.

Figure 14. Comparison of the models results (false(false predictions) vs. other studies.

Comparing the two proposed models RF and LR of the current study among themselves, there is a slight difference in the prediction parameters, like for instance if ROC-AUC is considered both models are good, as indicated in Figure 12, but in terms of prediction errors, LR is considered the Int. J. Environ. Res. Public Health 2020, 17, 4147 17 of 20 best because its FNR is 3.84% against 4.80% of RF. This evaluation parameter (FNR) should be highly considered because it indicates how many landslide cases were predicted as NO landslides, which can be a dangerous outcome. Minimizing the ratio of false negatives is crucial for better performance of the system and disaster risk reduction. In this regard, logistic regression is the best model to be used due to its performance in terms of error reduction (false negatives). Besides error reduction, LR performs faster than RF which is another important aspect that LR to be considered in the early warning system as the delay is reduced. Though 100% perfect prediction is not achievable through these modelling approaches, the priority is to minimize the false negatives (warning that there will be no landslide yet landslides occur) because it has a negative impact on lives rather than false alarms (warning that there will be a landslide while is false).

5. Conclusions In this research two approaches, random forest (RF) and logistic regression (LR), were applied to analyze the rainfall data along with other external and internal factors to develop a prediction model for landslide incidences for an early warning system. Performance parameters such as ROC-AUC, error rate (TP and FN) have been used to evaluate the best prediction models. The results prove that the prediction performance of these two models are better than those established in other research studies. Results from this study revealed that landslides are triggered due to the too much daily (or low intensity prolonged) rainfall, but in most cases they occurred after a few consecutive (like 2–5) days of precipitation. This was proved by the impact of antecedent rainfall on disaster occurrence as marked by prediction models used in this study. Both models indicated that 5-days antecedent precipitation has a high impact on the occurrence of landslides in the study area. In addition to the rainfall data, other parameters have been utilized to assess their impact on the disaster. The slope of hills is the most internal parameter affecting disaster occurrence after rainfalls (external factor), meaning that the areas of high slope angle are more susceptible to landslides than the regions where the terrain is almost a plateau. Agricultural land or non-protected land is the most susceptible to landslide occurrence while land covered by forest had few incidences. As previously discussed, landslides are triggered by current rainfall but very correlated with consecutive precipitation before the day of incidence. This is because the soil strength reduces with water content depending on the type of soil material and gets strong again during the evapotranspiration process or the sunny season. Based on the results, the logistic regression proves to be the best approach to be used for landslide prediction and early warning. LR model’s incorrect prediction rate FNR is 9.61% without including antecedent precipitation data and is 3.84% after including the antecedent precipitation data. Therefore, LR model can be used for the early warning system. Instead of antecedent cumulated rainfall, in our future work, we will study the correlation between landslide occurrences, rainfall and soil moisture level on different types of the soil which will be gathered by sensors and through WSN, landslides will be predicted and citizens warned earlier. The prediction capability combined with the Internet of Things (IoT) where rainfall gauges can be used to capture rainfall intensity in real-time and soil moisture sensors used to measure the water content in the soil will be used for rescuing citizens from landslide disasters.

Author Contributions: Data collection, methodology, software and writing original draft preparation, M.K.; Supervision and funding acquisition, S.K.; co-supervision, M.Z.; review and editing, S.K. and M.Z. All authors have read and agreed to the published version of the manuscript. Funding: This research was under the support of the African Centre of Excellence in the Internet of Things (ACEIoT) running under the University of Rwanda, College of Science and Technology (UR-CST) and the APC was funded by the same centre. Acknowledgments: The authors acknowledge the financial assistance of the ACEIoT, and the University of Rwanda. The contribution of Rwanda Meteorology, Ministry in Charge of Emergency Management, Ngororero District, the Centre for Geographic Information System (CGIS) under University of Rwanda in providing the needed data for this work was highly recognized and appreciated. The authors would like also to thank the Int. J. Environ. Res. Public Health 2020, 17, 4147 18 of 20 editorial office and anonymous reviewers for their constructive comments and contribution to improve the quality of this scientific study. Conflicts of Interest: The authors declare no conflict of interest.

References

1. Douglas, I.; Alam, K.; Maghenda, M.; Mcdonnell, Y.; McLean, L.; Campbell, J. Unjust waters: Climate change, flooding and the urban poor in Africa. Environ. Urban. 2008, 20, 187–205. [CrossRef] 2. Nahayo, L.; Li, L.; Mupenzi, C. Spatial Distribution of Landslides Vulnerability for the Risk Management in Ngororero District of Rwanda. EAJST Special 2018, 8, 1–14. 3. MIDIMAR. Republic of Rwanda Refugee Affairs National Contingency Plan for Floods; MIDIMAR: Kigali, Rwanda, 2014. 4. Huang, Y.; Zhao, L. Review on landslide susceptibility mapping using support vector machines. Catena 2018, 165, 520–529. [CrossRef] 5. Nsengiyumva, J.B.; Luo, G.; Nahayo, L.; Huang, X.; Cai, P. Landslide susceptibility assessment using spatial multi-criteria evaluation model in Rwanda. Int. J. Environ. Res. Public Health 2018, 15, 243. [CrossRef] [PubMed] 6. Zhang, K.; Xue, X.; Hong, Y.; Gourley, J.J.; Lu, N.; Wan, Z.; Hong, Z.; Wooten, R. iCRESTRIGRS: A coupled modeling system for cascading flood-landslide disaster forecasting. Hydrol. Earth Syst. Sci. 2016, 20, 1–23. [CrossRef] 7. Ministry in charge of emergency management. The National Risk Atlas of Rwanda. Available online: http: //minema.gov.rw/uploads/tx_download/National_Risk_Atlas_of_Rwanda_electronic_version.pdf (accessed on 15 March 2018). 8. Jeong, S.; Lee, K.; Kim, J.; Kim, Y. Analysis of rainfall-induced landslide on unsaturated soil slopes. Sustainability 2017, 9, 1280. [CrossRef] 9. Youssef, A.M.; Pourghasemi, H.R.; Pourtaghi, Z.S.; Al-Katheeri, M.M. Landslide susceptibility mapping using random forest, boosted regression tree, classification and regression tree, and general linear models and comparison of their performance at Wadi Tayyah Basin, Asir Region, Saudi Arabia. Landslides 2016, 13, 839–856. [CrossRef] 10. Hong, M.; Kim, J.; Jeong, S. Rainfall intensity-duration thresholds for landslide prediction in South Korea by considering the effects of antecedent rainfall. Landslides 2018, 15, 523–534. [CrossRef] 11. Diakakis, M. Rainfall thresholds for flood triggering. The case of Marathonas in Greece. Nat. Hazards 2012, 60, 789–800. [CrossRef] 12. Nahayo, L.; Mupenzi, C.; Kayiranga, A.; Karamage, F.; Ndayisaba, F.; Nyesheja, E.M.; Li, L. Early alert and community involvement: Approach for disaster risk reduction in Rwanda. Nat. Hazards 2017, 86, 505–517. [CrossRef] 13. Lee, S.; Ryu, J.H.; Won, J.S.; Park, H.J. Determination and application of the weights for landslide susceptibility mapping using an artificial neural network. Eng. Geol. 2004, 71, 289–302. [CrossRef] 14. Staley, D.M.; Kean, J.W.; Cannon, S.H.; Schmidt, K.M.; Laber, J.L. Objective definition of rainfall intensity-duration thresholds for the initiation of post-fire debris flows in southern California. Landslides 2013, 10, 547–562. [CrossRef] 15. Marjanovi´c,M.; Kovaˇcevi´c,M.; Bajat, B.; Voženílek, V. Landslide susceptibility assessment using SVM machine learning algorithm. Eng. Geol. 2011, 123, 225–234. [CrossRef] 16. Lee, M.-J. Rainfall and Landslide Correlation Analysis and Prediction of Future Rainfall Base on Climate Change. In Geohazards Caused by Human Activity; InTech: London, UK, 2016. 17. Republic of Rwanda Western Province Ngororero District. Available online: www.ngororero.gov.rw (accessed on 17 September 2019). 18. Modeling, H. Comparison of Spatial Interpolation Schemes for Rainfall Data and Application in hydrological modeling. Water 2017, 9, 342. [CrossRef] 19. Ly, S.; Charles, C.; Degré, A. Different methods for spatial interpolation of rainfall data for operational hydrology and hydrological modeling at watershed scale. A Review. 2013, 17, 392–406. Int. J. Environ. Res. Public Health 2020, 17, 4147 19 of 20

20. Trigila, A.; Iadanza, C.; Esposito, C.; Scarascia-Mugnozza, G. Comparison of Logistic Regression and Random Forests techniques for shallow landslide susceptibility assessment in Giampilieri (NE Sicily, Italy). Geomorphology 2015, 249, 119–136. [CrossRef] 21. Vorpahl, P.; Elsenbeer, H.; Märker, M.; Schröder, B. How can statistical models help to determine driving factors of landslides? Ecol. Model. 2012, 239, 27–39. [CrossRef] 22. Data Ngororero - African Center of Excellence in Internet of Things. Available online: https://aceiot.ur.ac.rw/ spip.php?article103 (accessed on 26 March 2020). 23. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [CrossRef] 24. Pourghasemi, H.R.; Kerle, N. Random forests and evidential belief function-based landslide susceptibility assessment in Western Mazandaran Province, Iran. Environ. Earth Sci. 2016, 75, 1–17. [CrossRef] 25. Pourghasemi, H.R.; Rahmati, O. Prediction of the landslide susceptibility: Which algorithm, which precision? Catena 2018, 162, 177–192. [CrossRef] 26. Wang, Y.; Wu, X.; Chen, Z.; Ren, F.; Feng, L.; Du, Q. Optimizing the predictive ability of machine learning methods for landslide susceptibility mapping using smote for Lishui city in Zhejiang province, China. Int. J. Environ. Res. Public Health 2019, 16, 368. [CrossRef][PubMed] 27. Kim, J.C.; Lee, S.; Jung, H.S.; Lee, S. Landslide susceptibility mapping using random forest and boosted tree models in Pyeong-Chang, Korea. Geocarto Int. 2018, 33, 1000–1015. [CrossRef] 28. Random Forest Classifier - Machine Learning Global Software Support. Available online: https://www. globalsoftwaresupport.com/random-forest-classifier/ (accessed on 26 March 2020). 29. Lin, W.; Wu, Z.; Lin, L.; Wen, A.; Li, J. An ensemble random forest algorithm for insurance big data analysis. IEEE Access 2017, 5, 16568–16575. [CrossRef] 30. Louppe, G. Understanding random forests: From theory to practice. Ph.D. Thesis, University of Liege, Liege, Belgium, July 2014. 31. Gorsevski, P.V.; Gessler, P.E.; Foltz, R.B.; Elliot, W.J. Spatial prediction of landslide hazard using logistic regression and ROC analysis. Trans. GIS 2006, 10, 395–415. [CrossRef] 32. Mathew, J.; Jha, V.K.; Rawat, G.S. Landslide susceptibility zonation mapping and its validation in part of Garhwal Lesser Himalaya, India, using binary logistic regression analysis and receiver operating characteristic curve method. Landslides 2009, 6, 17–26. [CrossRef] 33. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874. [CrossRef] 34. Frattini, P.; Crosta, G.; Carrara, A. Techniques for evaluating the performance of landslide susceptibility models. Eng. Geol. 2010, 111, 62–72. [CrossRef] 35. Piciullo, L.; Calvello, M.; Cepeda, J.M. Territorial early warning systems for rainfall-induced landslides. Earth-Sci. Rev. 2018, 179, 228–247. [CrossRef] 36. Al-Najjar, H.A.H.; Kalantar, B.; Pradhan, B.; Saeidi, V. Conditioning factor determination for mapping and prediction of landslide susceptibility using machine learning algorithms. Proc. SPIE 2019, 11156, 19. [CrossRef] 37. Pham, B.T.; Son, L.H.; Hoang, T.A.; Nguyen, D.M.; Bui, D.T. Prediction of shear strength of soft soil using machine learning methods. Catena 2018, 166, 181–191. [CrossRef] 38. Reichenbach, P.; Rossi, M.; Malamud, B.D.; Mihir, M.; Guzzetti, F. A review of statistically-based landslide susceptibility models. Earth-Sci. Rev. 2018, 180, 60–91. [CrossRef] 39. Lombardo, L.; Mai, P.M. Presenting logistic regression-based landslide susceptibility results. Eng. Geol. 2018, 244, 14–24. [CrossRef] 40. Capparelli, G.; Versace, P. FLaIR and SUSHI: Two mathematical models for early warning of landslides induced by rainfall. Landslides 2011, 8, 67–79. [CrossRef] 41. Thai Pham, B.; Shirzadi, A.; Shahabi, H.; Omidvar, E.; Singh, S.K.; Sahana, M.; Asl, D.T.; Ahmad, B.B.; Quoc, N.K.; Lee, S. Landslide susceptibility assessment by novel hybrid machine learning algorithms. Sustainability 2019, 11, 4386. [CrossRef] 42. Pourghasemi, H.R.; Gayen, A.; Park, S.; Lee, C.W.; Lee, S. Assessment of landslide-prone areas and their zonation using logistic regression, LogitBoost, and naïvebayes machine-learning algorithms. Sustainability 2018, 10, 3697. [CrossRef] 43. Park, S.; Kim, J. Landslide susceptibility mapping based on random forest and boosted regression tree models, and a comparison of their performance. Appl. Sci. 2019, 9, 942. [CrossRef] Int. J. Environ. Res. Public Health 2020, 17, 4147 20 of 20

44. Sun, X.; Chen, J.; Bao, Y.; Han, X.; Zhan, J.; Peng, W. Landslide susceptibility mapping using logistic regression analysis along the Jinsha river and its tributaries close to Derong and Deqin County, southwestern China. ISPRS Int. J. Geo-Information 2018, 7, 438. [CrossRef] 45. Utomo, D.; Chen, S.F.; Hsiung, P.A. Landslide prediction with model switching. Appl. Sci. 2019, 9, 1839. [CrossRef]

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).