ISSN 2090-3359 (Print) ISSN 2090-3367 (Online)

ΑΔΣ Advances in Decision Sciences

Volume 23 Issue 3 September 2019

Michael McAleer Editor-in-Chief University Chair Professor Asia University, Taiwan

Published by Asia University, Taiwan ADS@ASIAUNIVERSITY

Predictive Models for Classifying the Outcomes of Violence: Case Study for Thailand’s Deep South*

Bunjira Makond** Faculty of Commerce and Management Prince of Songkla University Trang, Thailand and Centre of Excellence in Mathematics Commission on Higher Education (CHE) Ministry of Education, Bangkok, Thailand

Mayuening Eso Faculty of Science and Technology Prince of Songkla University Pattani, Thailand and Centre of Excellence in Mathematics Commission on Higher Education (CHE) Ministry of Education, Bangkok, Thailand

Revised: August 2019

* The authors gratefully appreciate the assistance of Metta Kuning, former Director of DSCC, Prince of Songkla University, Pattani, Thailand, and a reviewer for helpful comments and suggestions. This research received much appreciated financial support from the Centre of Excellence in Mathematics, Commission on Higher Education, Thailand. ** Corresponding author: [email protected]


Abstract

Violence is now widely recognized as a public health problem because of its significant consequences on the health and wellness of people and it remains a growing problem in many countries including Thailand. Elucidating the factors related to violence can provide information that can help to prevent violence and decrease the number of injuries. This study explored predictive data mining models which have high interpretability and prediction accuracy in classifying the outcomes of violence. After data preprocessing, a set of 21,424 incidents occurring from 2004 to 2016 were obtained from the Deep South Coordination Centre database. A correlation-based feature subset selection and decision tree technique with embedded feature selection were used for variable selection and four data mining techniques were applied to classify the violent outcomes into physical injury and no physical injury. The findings revealed that regardless of the variable selection method, gun was selected as a risk factor of physical injury. Moreover, a decision tree model with three variables, gun, zone, and solid/sharp weapon outperformed a naive Bayes model in terms of accurate performance and interpretability. Decision tree and artificial neural network models have similar levels of performance in classifying the outcome of violence but in practical terms, a decision tree model is more interpretable than an artificial neural network model.

Keywords: Decision tree, naive Bayes, artificial neural network, logistic regression, violence in Thailand. JEL: C53, C55, C88, N35


1. Introduction

Violence is now widely recognized as a public health problem because of its significant effect on the health and wellness of people, and it remains a growing problem in many countries including Thailand. In the deep south of Thailand, comprising Pattani, Yala, and Narathiwat provinces and parts of Songkhla province (Nathawi, Sabayoi, Chana, and Thepa districts), violence has serious and extensive impacts on public health. From January 2004 to March 2013, nearly 13,000 violent events were recorded, resulting in 15,574 casualties (5,614 deaths and 9,960 injuries) (Burke et al., 2013).

Although loss of life is the highest cost of physical injury, non-fatal injuries result in functional losses and limitations, medical expenditures, lost work performance and disability compensation. Further, after injury, individuals experience psychological distress that occurs simultaneously with injury-related changes in function and in the quality of their lives (Duckworth and Iezzi, 2010). Moreover, violence not only affects the individuals directly involved, but also affects the healthcare system, the delivery of healthcare, and the people living in the affected areas (The World Medical Association, Inc., 2012).

From the perspective of public health, violence is preventable. Understanding the factors related to violence makes it possible to predict the likelihood of specific events resulting from specific causes and environments, and such predictions can be used as a means of preventing violence and decreasing the number of injuries. For several decades, traditional statistical approaches have been applied manually to data relating to violence in order to detect the characteristics of, or risk factors associated with, violence (Höhle et al., 2009). It is therefore important that relevant information is available and is provided to decision makers so that they can devise suitable prevention and intervention measures.

Due to advances in the development of computer technology, huge amounts of data relating to violence can now be easily and efficiently stored in databases at a reasonable cost. However, traditional statistical approaches alone are insufficient to discover the knowledge hidden within huge datasets (Karrar et al., 2016). Data mining techniques can, however, be employed to examine the factors affecting the outcome of violence and to provide better information for violence prevention and control, thereby helping to reduce the number of injuries. In practice, prediction is a goal of data mining that involves learning a model from independent variables, or attributes, in order to predict unknown variables.

In data mining activities, classification is a task to discover a predictive learning model that classifies a data item into one of several pre-classified classes (Rokach and Maimon, 2014). Recently, classification has been widely applied in mining data relating to violence in order to identify relationships and to generate data useful for the prevention of violence (Babcock and Cooper, 2018; Kumar et al., 2019; Liu et al., 2011; Özyirmidokuz and Kaya, 2014; Wijenayake et al., 2018).

This study explored various predictive models with high interpretability and performance in classifying the outcomes of violence, based on a comparative study of decision tree, logistic regression, naïve Bayes, and artificial neural network techniques. The feature selection method was used to select relevant variables, and the pruning method was applied to construct simple and accurate models. Analysis of variance (ANOVA) was employed to identify significant differences in the predictions made by these models and Tukey’s HSD test was used to identify which particular models produced significantly different predictions.

The rest of this paper is organized as follows. The literature related to the risk factors associated with and the characteristics of violence in Thailand’s deep south, studies relating to the application of data mining techniques in the domain of violence, and factors which influence the occurrence and outcome of violence are reviewed in section 2. Section 3 describes the variables, the data collected, the data preprocessing step, and the research methods. The experimental framework and the results are presented in section 4 and finally, the results are discussed and conclusions offered in section 5.

2. Literature Review

Several studies have explored the characteristics of the violence data from the southernmost provinces. Grid maps and statistical models were used to investigate the distribution of the terrorist event rate by location and time. The findings showed that violence occurred most frequently between 8 and 9 pm and that the most likely days were Wednesdays and Thursdays. Moreover, the rate of violence increased steadily during 2004 and stabilized in 2005, and the violence had spread to the neighboring districts in Songkhla (Marohabout et al., 2009). Lim et al. (2009) investigated the living conditions of the families of victims of the unrest in , and found that the majority of victims were male, the head of the family, of working age (45.9 ± 12.4 years), married, Muslim and had children. Further, the majority of victims had only primary school education and were farmers, and most were shot and died.

In addition, some of the victims suffered property damage and around three people per family relied on each victim. Khongmark and Kuning (2013) constructed and compared Poisson and negative binomial generalized linear models with zero-corrected log-transformed linear models for the incidence of adverse events over location, time and the demographic characteristics of the victims, including their gender and age group. The results showed that the incidence of injuries resulting from terrorism followed different patterns in different districts.

Komolmalai et al. (2012) employed negative binomial and log-normal models to analyze the incidence of injuries to civilian victims of violence from terrorism in Pattani, Yala and Narathiwat provinces and four eastern districts of Songkhla province. Their study concluded that while specific regions were at higher risk at different times the pattern of incidence could not be easily predicted and overall the risks among different demographic groups remained relatively constant. Chirtkiatsakul et al. (2014) studied the factors associated with casualties due to the unrest between 2004 and 2011 in the three southern border provinces and the surrounding districts of Songkhla using logistic regression. The results showed that gender, age, religion, occupation, type of weapon, and province were all factors in fatalities and injuries arising due to the violence in the south of Thailand.

It is notable that all of these researchers created statistical models based on data collected in the Deep South Coordination Center (DSCC) database. However, the usefulness of such statistical models is limited, and the use of data mining techniques, which have been more successful in classification tasks, has therefore grown rapidly. These techniques can be used for a wide range of purposes associated with violent events, as reviewed in the following paragraphs.


Kumar et al. (2019) studied trends in terrorist attacks around the world by applying lazy tree, multilayer perceptron, multiclass and naïve Bayes classifiers to a database created from various public and open access sources for the years 1970-2015, covering 156,772 reported attacks that caused substantial loss of life and damage to property. The results showed that all the techniques were able to draw patterns by extracting information about the terrorist attacks with a fair level of performance. Wijenayake et al. (2018) employed a decision tree to predict whether an offender would re-commit a domestic-violence-related offence within a period of 24 months after the end of their first court appearance. The dataset consisted of 14,776 records, and the data were analyzed based on eleven independent variables obtained from the re-offending database. The results showed that a decision tree using 3 of the 11 variables outperformed logistic regression in terms of both its usability and accuracy. Babcock and Cooper (2018) experimentally compared neural network models with traditional linear models in predicting the arrest history of men who had committed intimate-partner violence. They concluded that the neural network models were superior to the linear models in terms of their predictive power.

Özyirmidokuz and Kaya (2014) applied a decision tree to data relating to women in the 15-59 age group in Turkey to investigate and support the prevention of violence against women, and concluded that the proposed method produced very good predictive performance. Liu et al. (2011) compared the validity and predictive accuracy of logistic regression, classification and regression tree, and neural network models in the prediction of reconviction for violent offences, finding that the three models demonstrated similar levels of accuracy. As can be seen, decision tree, artificial neural network, logistic regression, and naïve Bayes models have all been effectively applied in the classification of data relating to violence. This study therefore also applied these models to data retrieved from the DSCC database to classify the outcome of violent incidents.

A number of recent studies have mentioned various factors which influence the occurrence and outcome of violence. Bilukha et al. (2013) studied fatal and non-fatal injuries due to intentionally caused explosions in Nepal occurring between 1 January 2008 and 31 December 2011 and showed that the number of casualties per incident differed by gender, place, age and type of bomb. Further, a report from the World Health Organization Center for Health Development in Kobe, Japan similarly noted that the number of injuries in violent incidents was influenced by gender, age, occupation, time and place of event, and method of assault (Nieves and Cruz, 2011), and Chirtkiatsakul et al. (2014) found that gender, age, religion, occupation, type of weapon, and province were factors related to fatalities and injuries due to the violence in the south of Thailand.

In addition, Tabish et al. (2013) found that the patterns of injuries encountered by health authorities differed depending on the type of weapon, and Khongmark and Kuning (2013) concluded that the incidence of injuries due to terrorism showed quite different patterns in different districts. Burke et al. (2013), who investigated violent events in the deep south of Thailand from January 2004 to March 2013, also found that within each province there were significant differences between the numbers of incidents in high-incident and low-incident districts, and that incidents mostly occurred in schools, government offices, and state-related bodies, with hit-and-run attacks, bombs, and arson being the methods most commonly used in violent events.

Moreover, Sornsrivichai (2007) showed that the number of people injured was different in different months. Finally, Kuning et al. (2014) found that the most common weapons used in violent events were guns, knives and bombs, with arson also being frequently used. Further, events usually occurred during the weekend rather than on weekdays and the number of events occurring was different at different periods of time during the day, with attacks most frequently happening on roads or highways, in shops, houses and schools, respectively.

3. Methods

3.1 Variables

Based on this review of previous literature, the determinant variables can be categorized into two groups: demographic factors of the victims and characteristics of the violent events. In the present study, only the characteristics of violent events were included, namely day, time, month, zone, province, district, arson, gun, bomb, and solid/sharp weapon. The target (dependent) variable was the outcome of a violent event, with yes indicating that the violence resulted in physical injury and no indicating otherwise.


3.2 Data

This was a retrospective study and relevant data was obtained from the DSCC database maintained at the Faculty of Science and Technology, Prince of Songkla University, Pattani Campus. The data stored on the database were retrieved from various sources, including the army, the police and newspaper reports (Kuning et al., 2014).

The current study focused on data relating to violent events recorded between 2004 and the beginning of January 2016. Although the dataset covered various types of violent events, they were all considered equally as violence. The original dataset contained 23,670 violent events and classified the information on them under 51 headings. However, before data mining techniques are applied to a dataset, it is essential to preprocess the data in order to reduce the amount of ‘noise’ and the number of missing values and discrepancies.

3.3 Data preprocessing

Data preprocessing is essential for data mining and incorporates data cleaning, data integration, data transformation, data reduction, and discretization. Although the variables adopted in this study were selected based on the literature, data preprocessing is needed to increase the quality of the variables and the accuracy of the classification and prediction process. Data cleaning was used to deal with missing values for certain variables, namely day, time, zone, and province; incidents lacking this information were removed from the data set. Data transformation was adopted to convert data from one form into another required form; for instance, the variable day was obtained from the date of the incident.

In addition, the values of two variables, time and zone were grouped based on the given definition for the variable concerned. Finally, the dataset was pre-classified according to the target variable into two classes as shown above, yes or no depending on whether or not the incident resulted in physical injury. All the steps of data pre-processing were performed using SQL and Microsoft Excel. After data preprocessing, the total number of incidents in the study was reduced to 21,424. The variables with descriptions and their values are presented in Table 1.


Table 1 Variables with descriptions and the values

Predictor variable (description)
  Value     Number of events  Represents

time (time of the day)
  t1        1,437             00:01 to 03:00
  t2        1,912             03:01 to 06:00
  t3        3,824             06:01 to 09:00
  t4        2,538             09:01 to 12:00
  t5        2,116             12:01 to 15:00
  t6        2,564             15:01 to 18:00
  t7        4,841             18:01 to 21:00
  t8        2,192             21:01 to 24:00

day (day of the week)
  Sun       2,773             Sunday
  Mon       3,316             Monday
  Tue       3,130             Tuesday
  Wed       3,358             Wednesday
  Thur      3,323             Thursday
  Fri       3,030             Friday
  Sat       2,494             Saturday

month (month of the year)
  Jan       1,598             January
  Feb       1,637             February
  Mar       1,846             March
  Apr       1,968             April
  May       1,789             May
  Jun       2,067             June
  Jul       1,747             July
  Aug       2,383             August
  Sep       1,469             September
  Oct       1,827             October
  Nov       1,620             November
  Dec       1,473             December

zone (place of the incident)
  zone1     3,540             resident area / personal area
  zone2     11,792            road / highway
  zone3     549               military barracks
  zone4     224               checkpoints / booth of police or soldiers, guardhouse
  zone5     749               school
  zone6     309               religious place
  zone7     626               business place, factory
  zone8     1,063             shop
  zone9     811               farm / plantation
  zone10    171               forest
  zone11    523               government office
  zone12    482               public place / market
  zone13    585               other / unspecified

district (district)
  dist1     1,752             Yala district
  dist2     1,200             Ra-ngae district
  dist3     1,174
  dist4     1,136
  dist5     1,141
  dist6     1,086
  dist7     969
  dist8     933
  dist9     880
  dist10    765
  dist11    731
  dist12    638               Mueang Narathiwat district
  dist13    648               Su-ngai Padi district
  dist14    562               Cho-airong district
  dist15    540               Su-ngai Kolok district
  dist16    556
  dist17    516
  dist18    515
  dist19    465
  dist20    474
  dist21    478
  dist22    444
  dist23    433
  dist24    414
  dist25    357
  dist26    343               Yi-ngo district
  dist27    327               Krong Pinang district
  dist28    330               Thung Yang Daeng district
  dist29    311               Kapho district
  dist30    246
  dist31    207
  dist32    215
  dist33    169               Mai Kaen district
  dist34    136
  dist35    114
  dist36    106
  dist37    113

province (province)
            7,614
            7,040             Pattani province
            1,092             Songkhla province
            5,678

arson (means used in the incident was arson)
  yes       2,538
  no        18,886

gun (weapon used in the incident was one or more guns)
  yes       10,342
  no        11,082

bomb (weapon used in the incident was one or more bombs)
  yes       3,987
  no        17,437

solid/sharp weapon (weapon used in the incident was one or more solid/sharp weapons)
  yes       496
  no        20,928

outcome (the outcome of violence)
  yes       10,740            physical injury
  no        10,684            otherwise


3.4 Classifiers

3.4.1 Decision tree (DT)

The C4.5 algorithm for building decision trees, which is implemented in the Weka software application as the J48 classifier, was used. The algorithm builds a tree from a set of training data using a top-down, recursive splitting technique based on the concept of information entropy. A tree structure consists of a root node, internal nodes, and leaf nodes. The root node contains all the input data. An internal node can have two or more branches and is associated with a decision function (Banu and Gomathy, 2014). Each leaf node is assigned to one class representing the most appropriate target value. Alternatively, a leaf node may indicate the probability of the target attribute having a certain value. Instances are classified by directing them from the root of the tree down to a leaf, according to the outcome of the tests along the path (Rokach and Maimon, 2005).
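As an illustration of entropy-based splitting, the following minimal Python sketch computes the entropy of a node and the information gain of a candidate split. The class counts are hypothetical and are not taken from the DSCC data, and the function names are illustrative only. C4.5 itself ranks splits by the gain ratio, which further normalizes the information gain; that step is omitted here for brevity.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """Entropy reduction obtained by splitting a parent node into child nodes."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - weighted

# Hypothetical counts in the form [injury, no injury]; not from the study data.
parent = [50, 50]
split_on_gun = [[40, 10], [10, 40]]  # gun = yes, gun = no
gain = information_gain(parent, split_on_gun)
print(f"information gain: {gain:.4f}")
```

A split that separates the classes more cleanly yields a larger gain, which is why an attribute such as gun can end up near the root of the tree.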

DT is an embedded technique which performs variable selection as part of the learning process (Saeys et al., 2007). Generally, there are two steps in obtaining the optimal tree: growing and pruning. In growing a tree, the training data are used to build the tree using certain criteria to find a sequence of splits that separate the training data into smaller subsets that have pure class labels. By the end of the growing phase, a complex tree has usually been built. The pruning step is commonly used to reduce the size of the tree while maintaining high performance (Beck et al., 2007).

3.4.2 Naive Bayes (NB)

NB is a classifier based on Bayes' theorem which employs a probabilistic learning method where prior knowledge and observed data can be merged (Swetapadma and Yadav, 2016). Given a set of attributes $A = (A_1, A_2, \ldots, A_j)$, where $j$ is an integer, and a class $C \in \{C_1, C_2\}$, Bayes' theorem is stated as follows:

$$P(C \mid A) = \frac{P(A \mid C)\,P(C)}{P(A)} \tag{1}$$

where $P(C)$ denotes prior probability, $P(A)$ denotes evidence, $P(A \mid C)$ denotes conditional probability, and $P(C \mid A)$ denotes posterior probability. The NB classifier assumes that all the attributes are completely independent given the class. Moreover, the class has no parents and each attribute has the class as its individual parent. The NB classifier learns from training data. The classification is carried out by applying Bayes' theorem to compute the posterior probability and then predicting the class with the highest posterior probability (bin Othman and Yau, 2007).
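The classification rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Weka implementation used in the study: the toy records are hypothetical, the function names are illustrative, and Laplace (add-one) smoothing is an added assumption to avoid zero probabilities.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class frequencies and per-attribute value frequencies within each class."""
    priors = Counter(labels)
    cond = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            cond[(j, label)][value] += 1
    return priors, cond

def predict_nb(row, priors, cond, n_values):
    """Return the class with the highest unnormalized posterior P(A | C) P(C)."""
    total = sum(priors.values())
    scores = {}
    for c, n_c in priors.items():
        score = n_c / total  # prior P(C)
        for j, value in enumerate(row):
            # Laplace-smoothed estimate of P(A_j = value | C) (an assumption)
            score *= (cond[(j, c)][value] + 1) / (n_c + n_values[j])
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical toy records (gun, bomb) with the outcome as the label.
rows = [("yes", "no"), ("yes", "no"), ("no", "yes"),
        ("no", "no"), ("yes", "yes"), ("no", "no")]
labels = ["yes", "yes", "no", "no", "yes", "no"]

priors, cond = train_nb(rows, labels)
prediction = predict_nb(("yes", "no"), priors, cond, n_values=[2, 2])
print(prediction)
```

The evidence term P(A) is the same for every class, so it can be dropped when only the most probable class is needed.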

3.4.3 Artificial Neural Networks (ANN)

ANN is a mathematical model that tries to simulate the structure, processing method and learning ability of biological neurons. An ANN model is composed of three layers: input, hidden, and output. Each layer has a number of nodes. Nodes in the input layer are connected to nodes in the hidden layer, and nodes in the hidden layer are connected to nodes in the output layer, in sequence. Those connections represent weights between nodes. ANN is an adaptive system that can change its structure based on information that affects the process during computation. In this study, the multilayer perceptron (MLP), which is the most popular ANN architecture, was employed. The representation of and knowledge about the data is acquired and stored in an ANN by adjusting the weights of the connections. There are several algorithms for ANN training, which usually try to adjust the ANN weights so that the outputs of the ANN approximate the desired outputs based on the training data, and the back propagation algorithm is based on this error-correcting concept (Lorena et al., 2011).

3.4.4 Logistic regression

Logistic regression (Hosmer and Lemeshow, 2000) is a well-known statistical method that is used to explain the relationship between a set of predictor variables denoted by $X = (x_1, x_2, \ldots, x_p)$ and a binary response variable that can take the values 1 and 0 (here, 1 represents physical injury and 0 represents no physical injury). The conditional probability of the response $Y$ given the predictor variables is $P(Y \mid X) = \pi(x)$, which is expressed by the following formula:

$$\pi(x) = \frac{e^{(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p)}}{1 + e^{(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p)}}, \quad \text{where } 0 \le \pi(x) \le 1 \tag{2}$$

Transformation of the probability (between 0 and 1) to a real number and transformation of the function into a linear form can be achieved by transforming the probability $\pi(x)$ to the odds, which can be expressed as:

$$\text{odds} = \frac{\pi(x)}{1 - \pi(x)} \tag{3}$$

then taking the natural logarithm of the odds, which is called the logit transformation, defined as:

$$\ln\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p \tag{4}$$

where $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$ represents a set of parameters obtained via the maximum likelihood method.

The model can be interpreted by an odds ratio (OR). The OR associated with the effect of a one unit change in $x_j$ when the other predictor variables in the model are held constant is represented as $e^{\beta_j}$ (Rodríguez, 2007).
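The relationship between the logistic function, the odds, and the odds ratio can be checked numerically with a short Python sketch. The coefficients below are hypothetical and are not estimates from the study; the function names are illustrative.

```python
import math

def predict_probability(beta, x):
    """pi(x): logistic function of the linear predictor beta_0 + sum(beta_j * x_j)."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))  # algebraically equal to e^z / (1 + e^z)

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

# Hypothetical coefficients: intercept and one binary predictor (gun = 1 or 0).
beta = [-1.0, 2.0]
p_gun = predict_probability(beta, [1])
p_no_gun = predict_probability(beta, [0])

# Ratio of the odds for a one-unit change in the predictor.
odds_ratio = odds(p_gun) / odds(p_no_gun)
print(round(odds_ratio, 4), round(math.exp(beta[1]), 4))
```

The two printed values agree, which illustrates why the exponentiated coefficient is read as the odds ratio for a one-unit change in the corresponding predictor.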

3.5 Model evaluation criteria

In order to evaluate the performance of the classifiers used in this paper, the 10-fold cross-validation method was applied. The full data set was divided into ten independent subsets, each consisting of approximately one tenth of the full data set. Each of the ten subsets was used once as the testing set to evaluate the performance of the models built from the combination of the remaining subsets, so the process was repeated ten times with a different subset serving as the testing set in each cycle. Using this method, the bias caused by random sampling for training and testing sets can be reduced (Witten and Frank, 2005). The performance of each classifier was given by the average of the performances observed in the test subsets.
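The partitioning step of 10-fold cross-validation can be sketched as follows. This is a plain, unstratified split written for illustration; the function name and the random seed are assumptions, and the sketch is not a reproduction of the procedure in the software used in the study.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 21,424 is the number of incidents remaining after preprocessing.
splits = list(k_fold_indices(21424, k=10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

Each record appears in exactly one testing set, and the reported performance is the average over the ten testing sets.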

The performance of the classifiers was measured using the following metrics:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\% \tag{5}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN} \times 100\% \tag{6}$$

$$\text{Specificity} = \frac{TN}{TN + FP} \times 100\% \tag{7}$$

where $TP$ represents true positives, $TN$ represents true negatives, $FP$ represents false positives, and $FN$ represents false negatives. These values are generally presented in a confusion matrix, as shown in Table 2. In this study, the class of interest is yes, which is therefore denoted as positive, with the other class denoted as negative.
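The three metrics can be computed directly from the four cells of a confusion matrix. The following Python sketch uses hypothetical counts, not results from the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity and specificity as percentages."""
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    sensitivity = 100.0 * tp / (tp + fn)  # share of actual positives found
    specificity = 100.0 * tn / (tn + fp)  # share of actual negatives found
    return accuracy, sensitivity, specificity

# Hypothetical confusion-matrix counts for illustration only.
acc, sens, spec = classification_metrics(tp=80, tn=70, fp=30, fn=20)
print(acc, sens, spec)
```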

Table 2 Confusion matrix

                    Predicted class
Actual class        yes       no
yes                 TP        FN
no                  FP        TN

3.6 Feature Selection

Feature selection is the process of removing irrelevant variables. This process yields a set of relevant variables that can produce classification results comparable to those obtained when all the variables are employed. In this study, a correlation-based feature subset selection (CFS) method was applied to select variables from the ten variables identified from the literature. The CFS method ranks subsets of variables using a correlation-based heuristic evaluation function (Hall, 1999).

The function favors subsets of variables that are highly correlated with the target variable but uncorrelated with each other. The CFS method ignores irrelevant variables, which are assumed to have a low correlation with the target variable. Moreover, redundant variables, which are highly correlated with the remaining variables, should be excluded. The evaluation function is expressed as follows:

$$M_s = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}} \tag{8}$$

where $M_s$ is the heuristic evaluation of a subset $s$ containing $k$ variables, $\overline{r_{cf}}$ is the mean correlation value between the predictor variables and the target variable, and $\overline{r_{ff}}$ is the mean correlation value between any two predictor variables. In this study, CFS was implemented by the Weka application based on a best-first search. The predictor variables selected are presented in Table 3.
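The behavior of the evaluation function can be illustrated with a short Python sketch. It follows the merit function of Hall (1999), in which the denominator is the square root of k + k(k - 1) times the mean inter-variable correlation. The correlation values below are hypothetical and the function name is illustrative.

```python
import math

def cfs_merit(k, r_cf, r_ff):
    """CFS merit of a subset of k variables.

    r_cf: mean correlation between the predictors and the target.
    r_ff: mean correlation between pairs of predictors.
    """
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# Hypothetical subsets with the same relevance but different redundancy.
low_redundancy = cfs_merit(k=3, r_cf=0.4, r_ff=0.1)
high_redundancy = cfs_merit(k=3, r_cf=0.4, r_ff=0.8)
print(round(low_redundancy, 4), round(high_redundancy, 4))
```

With the relevance held fixed, the subset whose variables are more strongly correlated with one another receives the lower merit, which is how redundant variables are penalized.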


Table 3 Selected predictor variables by feature selection

Predictor variable (description)
  Value     Number of events

arson (means used in the incident was arson)
  yes       2,538
  no        18,886

gun (weapon used in the incident was one or more guns)
  yes       10,342
  no        11,082

solid/sharp weapon (weapon used in the incident was one or more solid/sharp weapons)
  yes       496
  no        20,928


4. Experimental Framework and Results

After data preprocessing, the data consisting of 21,424 records classified by ten predictors and a target variable were converted into an Attribute-Relation File Format (ARFF) file by WEKA in order to construct a model. In the classification task, using a large number of variables to construct a model is not a guarantee of high performance if those variables are irrelevant to the target variable. In order to construct a predictive model which has interpretability as well as high prediction accuracy in classifying the outcomes of violence into physical injury and no physical injury, the experiment conducted in this study was composed of two steps: 1) forming decision tree models and obtaining a set of selected variables, and 2) forming models based on two sets of variables, the first selected by the decision tree technique and the second selected by the CFS method. These models, denoted by acronyms, are presented along with their descriptions in Table 4. Each model was used to conduct 100 trials. ANOVA was used to detect differences in the predictive performance among the models. Multiple comparison tests were also conducted using Tukey’s HSD test to identify which models differed from one another. The differences were considered to be significant at p < 0.05.
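The comparison of models by ANOVA rests on the F statistic, which contrasts the variation between the models with the variation within each model's trials. The following Python sketch computes that statistic for hypothetical accuracy values (three trials per model rather than 100); the function name is illustrative, and the p-value and Tukey's HSD test, which require the F and studentized range distributions, are left to a statistics package.

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA for a list of groups of observations."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical accuracies (%) for three models; not results from the study.
accuracies = [[71.0, 72.0, 73.0], [72.0, 73.0, 74.0], [75.0, 76.0, 77.0]]
f_stat = one_way_anova_f(accuracies)
print(f_stat)
```

A large F value indicates that at least one model differs from the others, after which a multiple comparison test is needed to identify which models differ.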

4.1 Decision tree model construction

Although the decision tree models created by J48 had high performance, they generally resulted in huge trees, leading to a problem of overfitting. To solve this problem, pruning was necessary in order to obtain small, accurate models, thus avoiding unnecessary complexity while obtaining models which optimized the classification accuracy. This study employed post-pruning as described by Drazin and Montag (2013). In the post-pruning process, the confidence factor was set to 0.3, 0.25, or 0.2, while the minimum number of instances per node (minNumObj) was held at 2. The comparative results for the accuracy, sensitivity, and specificity of the models are presented in Table 5.

The results of the ANOVA shown in Table 6 detected significant differences (p < 0.05) among these models for the sensitivity and specificity metrics, but not for accuracy. Tukey’s HSD test identified the specific differences; Tables 7 and 8 list models that differ in different columns, while models that do not differ are listed in the same column. Two interesting findings emerged. First, the DT_J48_0.25 and DT_J48_0.20 models needed less information (three variables) than the DT_J48_0.30 model (eight variables) in the construction of the models, yet the accuracy of the three models in predicting the outcomes of violent events was not different. Second, DT_J48_0.20 gave inferior performance to DT_J48_0.25 in terms of sensitivity, while DT_J48_0.20 outperformed DT_J48_0.25 in terms of specificity. However, a low value of the confidence factor corresponds to heavy pruning, whereas a high value corresponds to light pruning, so DT_J48_0.25, with its intermediate confidence factor, was selected as the optimum decision tree model. It was constructed using three variables, namely gun, solid/sharp weapon, and zone, and the tree is presented in Figure 1.
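The relationship between the confidence factor and the amount of pruning can be illustrated with the textbook pessimistic error estimate for C4.5 described by Witten and Frank (2005). The following Python sketch is an assumption-laden illustration rather than WEKA's exact J48 implementation: the function name and the leaf counts are hypothetical, and the z value is taken here as the standard normal quantile for the given confidence factor.

```python
from math import sqrt
from statistics import NormalDist


def pessimistic_error(errors, n, confidence_factor):
    """Upper confidence bound on the error rate of a node.

    errors: training instances misclassified at the node
    n: training instances reaching the node
    confidence_factor: pruning confidence (smaller means more pessimistic)
    """
    f = errors / n  # observed error rate at the node
    # Assumption: z is the one-sided standard normal quantile for the factor.
    z = NormalDist().inv_cdf(1.0 - confidence_factor)
    numerator = (
        f
        + z * z / (2 * n)
        + z * sqrt(f / n - f * f / n + z * z / (4 * n * n))
    )
    return numerator / (1 + z * z / n)


# Hypothetical leaf with 2 errors out of 6 instances.
for cf in (0.30, 0.25, 0.20):
    estimate = pessimistic_error(errors=2, n=6, confidence_factor=cf)
    print(f"confidence factor {cf:.2f}: estimated error {estimate:.3f}")
```

In this sketch a smaller confidence factor gives a larger estimated error for the same node, so subtrees are more readily replaced by leaves, which is consistent with a low confidence factor producing heavier pruning.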

4.2 Model construction based on two sets of variables

In order to derive the best-fitting model, the gun, solid/sharp weapon, and zone variables used in DT_J48_0.25 were also employed to form NB_J48_0.25, LR_J48_0.25, and BPNN_J48_0.25. Moreover, NB_cfs, LR_cfs, BPNN_cfs, and DT_cfs were constructed based on the variables selected by the CFS method, as presented in Table 3. The comparative results of all the models are presented in Table 9.

The ANOVA results shown in Table 10 identified significant differences among these models (p < 0.05) in terms of the accuracy, sensitivity, and specificity metrics. The differences among the models were identified by Tukey’s HSD test and are presented in Tables 11-13, which list models that differ in different columns, while models that do not differ are listed in the same column.

In terms of accuracy, Table 11 shows that there were no differences in the performance of the DT_J48_0.25, BPNN_J48_0.25, and LR_J48_0.25 models, and that those three models outperformed the other models. In regard to sensitivity, shown in Table 12, this study found that the BPNN_cfs and DT_cfs models were the best. For specificity, the results presented in Table 13 reveal that DT_J48_0.25 and BPNN_J48_0.25 had equally good performances and were superior to the other models.

Table 4 All the models studied in the form of acronyms and their descriptions

Acronym Description

DT_J48_0.30 Decision tree model created by the J48 with set confidence factor = 0.30

DT_J48_0.25 Decision tree model created by the J48 with set confidence factor = 0.25

DT_J48_0.20 Decision tree model created by the J48 with set confidence factor = 0.20

NB_J48_0.25 Naïve Bayes model constructed based on 3 selected variables used in DT_J48_0.25

LR_J48_0.25 Logistic regression model constructed based on 3 selected variables used in DT_J48_0.25

BPNN_J48_0.25 Artificial neural network model constructed based on 3 selected variables used in DT_J48_0.25

DT_cfs Decision tree model constructed based on 3 variables selected by CFS method

NB_cfs Naïve Bayes model constructed based on 3 variables selected by CFS method

LR_cfs Logistic regression model constructed based on 3 variables selected by CFS method

BPNN_cfs Artificial neural network model constructed based on 3 variables selected by CFS method

Table 5 Accuracy, sensitivity, and specificity of decision tree models

Models        Accuracy    Sensitivity  Specificity
DT_J48_0.25   85.294050   83.7728      86.8233
DT_J48_0.20   85.359863   83.1313      87.6001
DT_J48_0.30   85.483091   87.4078      83.5482

Table 6 F statistic and p value of ANOVA for all metrics

Metrics       F statistic  p value
Accuracy      2.032        0.133
Sensitivity   228.685      0.000
Specificity   163.075      0.000

Table 7 Tukey’s HSD test for the sensitivity of decision tree models

Different subset of models at α = 0.05

Models        1         2         3
DT_J48_0.20   83.1313
DT_J48_0.25             83.7728
DT_J48_0.30                       87.4078

Table 8 Tukey’s HSD test for the specificity of decision tree models

Different subset of models at α = 0.05

Model         1         2         3
DT_J48_0.30   83.5482
DT_J48_0.25             86.8233
DT_J48_0.20                       87.6001

Figure 1 Proposed decision tree model

Table 9 Accuracy, sensitivity, and specificity of all models

Models          Accuracy   Sensitivity  Specificity
DT_J48_0.25     85.4009    83.1155      87.6983
NB_J48_0.25     84.1397    83.1155      84.4955
BPNN_J48_0.25   85.4056    83.0251      87.7985
LR_J48_0.25     85.4648    83.6965      87.2425
NB_cfs          84.0039    83.6406      84.3691
LR_cfs          84.0692    83.7709      84.3691
BPNN_cfs        84.2727    84.5475      83.9966
DT_cfs          84.2746    84.5531      83.9947

Table 10 F statistic and p value of ANOVA test for all indices

Metrics       F statistic  p value
Accuracy      102.153      0.000
Sensitivity   31.130       0.000
Specificity   297.123      0.000

Table 11 Tukey’s HSD test for accuracy of all models

Different subset of models at α = 0.05

Model           1           2
NB_cfs          84.003913
LR_cfs          84.069261
NB_J48_0.25     84.139735
BPNN_cfs        84.272767
DT_cfs          84.274634
DT_J48_0.25                 85.390666
BPNN_J48_0.25               85.405607
LR_J48_0.25                 85.464883

Table 12 Tukey’s HSD test for sensitivity of all models

Different subset of models at α = 0.05

Models          1         2         3
BPNN_J48_0.25   83.0251
DT_J48_0.25     83.0875
NB_cfs                    83.6406
LR_J48_0.25               83.6965
LR_cfs                    83.7709
NB_J48_0.25               83.7858
BPNN_cfs                            84.5475
DT_cfs                              84.5531

Table 13 Tukey’s HSD test for specificity of all models

Different subset of models at α = 0.05

Model 1 2 3
DT_cfs 83.9947
BPNN_cfs 83.9966
NB_cfs 84.3691 84.3691
LR_cfs 84.3691 84.3691
NB_J48_0.25 84.4955
LR_J48_0.25 87.2425
DT_J48_0.25 87.7058
BPNN_J48_0.25 87.7985

5. Discussion and Conclusions

This paper explored predictive models with high interpretability and performance in classifying the outcome of violent events (i.e., physical injury and no physical injury) in the deep south of Thailand, based on data obtained from the DSCC database. The DT technique with embedded feature selection and the CFS method were applied to the original data set of ten variables. Using the DT technique, three variables, namely gun, solid/sharp weapon, and zone, were selected. The CFS method also selected three variables, namely arson, gun, and solid/sharp weapon. It is worth noting that, regardless of the method used to select the variables, the gun variable was selected, which is in agreement with the study of Chirtkiatsakul et al. (2014), which concluded that the use of a gun in a violent incident entailed a high risk of injury or death.

In terms of performance, the study reached the following conclusions: 1) creating models using the variables selected by DT (gun, solid/sharp weapon, zone) resulted in similarly high overall performance for the DT, BPNN, and LR models in predicting the outcomes of violent events; 2) the BPNN_cfs and DT_cfs models, which were created using the arson, gun, and solid/sharp weapon variables, performed well in predicting physical injury, as shown by their high sensitivity; and 3) the DT_J48_0.25 and BPNN_J48_0.25 models, which were built using the gun, solid/sharp weapon, and zone variables, were superior to the other models in predicting violent events not leading to physical injury, as shown by their high specificity.

The results showed that the performances of DT and BPNN were equal; however, the results obtained using the DT technique, presented as graphs, were easy to interpret, whereas BPNN is a so-called ‘black-box’ technique in which the route taken by the model is not apparent from its structure. DT can thus provide knowledge useful for preventing violence and for data surveillance in a graph that is easily understood by users.

References

Babcock, J. C., & Cooper, J. (2018). Testing the utility of the neural network model to predict history of arrest among intimate partner violent men. Safety, 5(2), 1-11.

Banu, M. A. N., & Gomathy, B. (2014). Disease Forecasting System Using Data Mining Methods. In Proceedings of the 2014 International Conference on Intelligent Computing Applications, Coimbatore, 130-133. doi: 10.1109/ICICA.2014.36

Beck, J. R., Garcia, M. E., Zhong, M., Georgiopoulos, M., & Anagnostopoulos, G. (2007). A backward adjusting strategy for the C4.5 decision tree classifier. AMALTHEA REU SITE, 1-15.

Bilukha, O. O., Becknell, K., Laurenge, H., Danee, L., & Subedi, K. P. (2013). Fatal and non-fatal injuries due to intentional explosions in Nepal, 2008-2011: analysis of surveillance data. Conflict and Health, 7(1), 1-9.

bin Othman, M. F., & Yau, T. M. S. (2007). Comparison of Different Classification Techniques Using WEKA for Breast Cancer. In Proceedings of the 3rd Kuala Lumpur International Conference on Biomedical Engineering 2006. IFMBE, vol 15. Springer, Berlin, Heidelberg.

Burke, A., Tweedie, P., & Poocharoen, O. (2013). Understanding the Subnational Conflict Area. The Contested Corners of Asia: Subnational Conflict and International Development Assistance: The Case of Southern Thailand (pp. 11-24). The Asia Foundation, San Francisco, CA, U.S.A.

Chirtkiatsakul, B., Kuning, M., McNeil, N., & Eso, M. (2014). Risk Factors for Mortality among Victims of Provincial Unrest in Southern Thailand. Kasetsart Journal of Social Sciences, 35, 84-91.

Duckworth, M. P., & Iezzi, T. (2010). Physical Injuries, Pain, and Psychological Trauma: Pathways to Disability. Psychol. Inj. and Law, 3, 241-253.

Drazin, S., & Montag, M. (2013). Decision tree analysis using weka. Machine learning-project II. University of Miami.

Hall, M. A. (1999). Correlation-based feature selection for machine learning. PhD Thesis. Department of Computer Science, Waikato University, New Zealand.

Höhle, M., Paul, M., & Held, L. (2009). Statistical Approaches to the Monitoring and Surveillance of Infectious Diseases for Veterinary Public Health. Preventive Veterinary Medicine, 91(1), 2-10. doi: 10.1016/j.prevetmed.2009.05.017

Karrar, A. E., Abdalrahman, M. A., & Ali, M. M. (2016). Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Dataset Using WEKA Tool. The International Journal of Engineering and Science, 5(10), 35-39.

Khongmark, S., & Kuning, M. (2013). Modeling Incidence Rates of Terrorism Injuries in Southern Thailand. Chiang Mai Journal of Science, 40(4), 743-749.

Komolmalai, W., Kuning, M., & McNeil, D. (2012). Muslim Victims of Terrorism Violence in Southern Thailand. International Journal of Business and Social Science, 3(12), 114-119.

Kumar, V., Mazzara, M., Gen, M., Messina, A., & Lee, J. Y. (2019). A conjoint application of data mining techniques for analysis of global terrorist attacks prevention and prediction for combating terrorism. arXiv:1901.06483v3 [cs.LG], 1-13.

Kuning, M., Eso, M., Sornsrivichai, V., & Chongsuvivatwong, V. (2014). Epidemiology of the Violence in the Deep South. In Chongsuvivatwong, V., Boegli, L. C., & Hasuwannakit, S. (Eds.), Healing under fire: the case of southern Thailand (pp. 41-49). The Deep South Relief and Reconciliation Foundation and the Rugiagli Initiative, Thailand, Bangkok.

Lim, A., Choonpradub, C., Tongkumchum, P., & Chesoh, S. (2009). Living conditions and the path to healing victim’s families after violence in southern Thailand: case study in Pattani province. Asian Social Science, 5(9), 84-92.

Liu, Y. Y., Yang, M., Ramsay, M., Li, X. S., & Coid, J. W. (2011). A comparison of logistic regression, classification and regression tree, and neural networks models in predicting violent re-offending. Journal of Quantitative Criminology, 1-27.

Lorena et al. (2011). Comparing machine learning classifiers in potential distribution modelling. Expert Systems with Applications, 38, 5268-5275.

Marohabout, P., Choonpradub, C., & Kuning, M. (2009). Terrorism Risk Modeling in Southern Border Provinces of Thailand during 2004 to 2005. Songklanakarin J. of Social Sciences & Humanities, 15(6), 883-895.

Nieves, S., & Cruz, A. (2011). Finding Patterns of Terrorist Groups in Iraq: A Knowledge Discovery Analysis. Proceedings of Ninth LACCEI Latin American and Caribbean Conference, Engineering for a Smart Planet, Innovation, Information Technology and Computational Tools for Sustainable Development, 1-10.

Özyirmidokuz, E. K., & Kaya, Y. (2014). Decision tree induction of emotional violence against women. In Proceedings of INTCESS14 - International Conference on Education and Social Sciences, 3-5 February 2014, Istanbul, Turkey, 847-856.

Rodríguez, G. (2007). Lecture Notes on Generalized Linear Models. Retrieved from http://data.princeton.edu/wws509/notes/

Rokach, L., & Maimon, O. (2005). Decision trees. In Oded Maimon and Lior Rokach (Eds.), The Data Mining and Knowledge Discovery Handbook, 165-192. Springer.

Rokach, L., & Maimon, O. (2014). Data mining with decision trees: theory and applications. 2nd edition. World Scientific Publishing Co. Pte. Ltd., Singapore.

Saeys, Y., Inza, I., & Larranaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19), 2507-2517.

Sornsrivichai, V. (2007). Violence-related Injury Surveillance (VIS) Deep South, Thailand, 2007. Epidemiology Unit, Prince of Songkla University, Hat Yai, Thailand.

Swetapadma, A., & Yadav, A. (2016). Protection of parallel transmission lines including inter-circuit faults using naïve Bayes classifier. Alexandria Engineering Journal, 55, 1411-1419.

Tabish, S. A., Wani, R. A., Ahmad, M., Thakur, N., GH, Y., & Wani, S. N. (2013). Profile and Outcome of Violence Related Injuries of Patients during Civilian Unrest in a Conflict zone. Emergency Medicine, 3(3), 1-5.

The World Medical Association, Inc. (2012). WMA statement on violence in the health sector by patients and those close to them. Adopted by the 63rd WMA General Assembly, Bangkok, Thailand, October 2012. Available from URL: https://www.med.or.jp/jma/jma_infoactivity/jma_activity/2012wma/2012_06e.pdf [Accessed 2018 Mar.]

Wijenayake, S., Graham, T., & Christen, P. (2018). A decision tree approach to predicting recidivism in domestic violence. arXiv:1803.09862v1 [cs.LG], 1-12.

Witten, I. H., & Frank, E. (2005). Data mining: practical machine learning tools and techniques. San Francisco, CA: Morgan Kaufmann.
