Risk Map of Dengue Case in the Southernmost Provinces of Thailand Using a Machine Learning
Turkish Journal of Physiotherapy and Rehabilitation; 32(3) ISSN 2651-4451 | e-ISSN 2651-446X

Teerawad Sriklin 1, Supattra Puttinaovarat 2, Siriwan Kajornkasirat 3
1,2,3 Faculty of Science and Industrial Technology, Prince of Songkla University, Surat Thani Campus, Surat Thani 84000, Thailand
Email: 1 16240320501@psu.ac.th, 2 supattra.p@psu.ac.th, 3 siriwan.wo@psu.ac.th

ABSTRACT
This study aimed to compare machine learning models suitable for spatial data on dengue cases in the three southernmost provinces of Thailand: Pattani, Yala, and Narathiwat. Data on dengue cases and weather data (rainfall, number of rainy days, temperature, relative humidity, and air pressure) for the period 2015 to 2019 were obtained from the Bureau of Epidemiology and the Meteorological Department of Southern Thailand, respectively. The machine learning models compared for prediction accuracy were ANN, Random Forest, SVM, and J48. The results showed that the ANN model with 2500 iterations gave the highest accuracy (98.89%) with the lowest root mean square error and mean absolute error. Narathiwat (Russo, Si Sakhon, and Chanae Districts) and Pattani (Yarang and Mayo Districts) provinces were identified as high-risk areas. Yala province was a low-risk area, consistent with the information obtained from the Public Health Office and with the risk map created from the patient information.

Keywords: dengue, ANN, Random Forest, SVM, J48.

I. INTRODUCTION
The World Health Organization [1] reported a rapid increase in the number of cases of Dengue Hemorrhagic Fever (DHF), a contagious disease caused by the dengue virus and transmitted by mosquitoes; the number of cases has increased 30-fold during the last 50 years.
The disease also threatens the health of more than 40 percent of the world's population (2,500 million people), especially in tropical and temperate countries [2], [3]. The southern region of Thailand differs from other regions because it is adjacent to the sea and high mountains and lies closer to the equator than the rest of the country. The area is therefore classified as tropical, with high temperatures and rainfall throughout the year. The average number of rainy days is 148.7 days per year, with 1,781.7 millimeters of rainfall per year [4]. Thailand reported 42,185 dengue cases and 29 deaths in 2019 [5]. In the southern region, 6,859 cases and 3 deaths were reported [5]. The dengue incidence rate in the three southernmost provinces (Pattani, Yala, and Narathiwat) was higher than in the neighboring areas of the lower South region of Thailand [5]. Previous studies used rainfall, temperature, humidity, and relative humidity [6]-[8], together with Geographic Information Systems (GIS) tools, to predict or determine risk areas by linking, analyzing, processing, and correlating data [9]-[11]. This study investigated appropriate models for dengue case prediction in the southernmost provinces of Thailand by comparing ANN, Random Forest, SVM, and J48. The prediction was performed using WEKA software. The objectives of this study are: 1) to explore the major factors of dengue fever, 2) to study models suitable for predicting dengue severity in the three southernmost provinces, and 3) to map outbreaks of dengue fever. The results will be used in planning and in preventing the spread of dengue fever in the area.

II. THEORY AND RELATED MATERIALS
The learning and testing process was performed using WEKA software. We chose the Decision Tree algorithm because it is widely known and suitable for solving complex problems.
It is an easy-to-understand model, represented here by Random Forest and J48; two other models, ANN and SVM, were also used. These four algorithms were chosen for the following reasons. ANN offers excellent performance and high accuracy in characterizing data, in a manner modeled on the human brain [12]. Random Forest is a popular method with excellent performance and accuracy in classification tasks. It has also been reported to outperform counterparts such as neural networks, discriminant analysis, and Support Vector Machines (SVM) [13]. SVM is a powerful supervised learning algorithm that has many applications in biophotonics, pattern recognition, and classification [14]-[16]. J48 can also be applied to fields unrelated to health; for example, it has been used in credit analysis, where it was compared with other algorithms.

A. Artificial Neural Networks (ANN)
The field of artificial neural networks tries to simulate and to fabricate networks and devices in the spirit of neurobiology, in order to solve useful computational problems of the kind that biology solves effortlessly [17]. Such networks can be trained to perform a particular task from available empirical data, even when the relationships between the data are unknown [12]. Many ANN variants exist; this study focuses on the Multi-Layer Perceptron (MLP). In our implementation, the MLP is defined by equations (1) and (2).

$$ s_j = \sum_{i=1}^{n} x_i \, w_{ij} + b \, w_{bj} \tag{1} $$

$$ y_j = \frac{1}{1 + e^{-s_j}} \tag{2} $$

In these equations,
$n$ = the total number of input nodes
$x_i$ = the sampled data present at the $i$-th node
$w_{ij}$ = the weight assigned to the link between the $i$-th and the $j$-th nodes
$b$, $w_{bj}$ = the bias and the weight linking the bias and the $j$-th node
$y_j$ = the response at the $j$-th node, obtained from the weighted sum $s_j$.

B. Random Forest
The Random Forest is an aggregation method used for classification and regression.
This supervised learning method builds multiple decision trees from randomly selected observations and variables [18]. For example, one rule might separate some dengue case locations from others by a rainfall threshold, while another rule might further split the data based on whether the population density falls within a specific range. The input data are categorized repeatedly according to the different classification structures, and the final forecast or classification is obtained by averaging over the trees [19].

C. SVM
SVM defines a boundary between data groups (a line, or a hyperplane in the case of multiple dimensions) and maximizes the distance between that boundary and the nearest data points. These closest data points, located on both sides of the line or hyperplane, are called support vectors. Maximizing this distance gives the classifier good generalization ability, so it may produce better results on unseen samples. For data that are not linearly separable, mathematical functions known as kernel functions are used to transform the data to a higher-dimensional space in which they can be linearly separated [20].

D. J48
J48 is a decision tree classification algorithm. Classification is the process of building a model of the classes from a set of records with class labels. The decision tree algorithm examines how the attribute vector behaves across a number of instances, and the class of a new instance is then determined from the training instances. The algorithm generates rules for predicting the target variable, and the tree structure makes the key distributions easy to understand [21].

III. METHODOLOGY
A. Study areas
In this study, we selected the three southernmost provinces of Thailand as the study areas: Pattani, Yala, and Narathiwat. Southern Thailand lies on the Malay Peninsula and has an area of around 70,714 km².
The study areas are located near the sea and mountainous areas, closer to the equator than other parts of Thailand. The climate is therefore tropical, with high temperatures and rainfall throughout the year. It is influenced by the northeast monsoon, which occurs from October to February. In the central and southern parts of the study areas, the Sankalakiri Mountain Range runs east-west through Yala and Narathiwat provinces and forms a border between Thailand and Malaysia. To the east, a river basin stretches to the coast of the Gulf of Thailand in Pattani and Narathiwat provinces.

B. Data collection
The data set used in this study consists of monthly dengue fever cases from January 2015 to December 2019 in the three southernmost provinces of Thailand. The data were obtained from the Bureau of Epidemiology, Ministry of Public Health. Weather data for the same period were collected from the Meteorological Department of Southern Thailand and consist of mean temperature, minimum temperature, maximum temperature, air pressure, relative humidity, rainfall, and number of rainy days. The data locations in this study were selected randomly from Google Earth, with a total of 180 points separated by province.

C. Data Preprocessing
The number of dengue cases was divided into three intervals of severity according to the principle of the class interval (Fig. 1). The severity rates define the outcome to be predicted. Severity rate data are used in conjunction with the weather data (i.e., air pressure, relative humidity, temperature, rainfall, and number of rainy days).

D. Training and Dengue Fever Predicting Processes
All data were reformatted into ARFF so that the MLP-configured ANN, Random Forest, SVM, and J48 models could be trained. The learning process was performed using WEKA software [22]. During training and prediction, the number of ANN iterations was set to 500, 1500, 2000, 2500, and 3000 rounds.
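The preprocessing and reformatting steps in Sections C and D (three class intervals of severity, then conversion to ARFF) can be sketched as follows. This is a minimal illustration rather than the procedure used in the study: it assumes equal-width class intervals, and the label names (`low`, `medium`, `high`), the attribute names, the helper functions `severity_labels` and `to_arff`, and all numeric values are hypothetical.

```python
def severity_labels(cases, labels=("low", "medium", "high")):
    """Assign each case count to one of len(labels) equal-width class intervals.

    Assumption: interval width = (max - min) / number of classes.
    """
    lo, hi = min(cases), max(cases)
    k = len(labels)
    width = (hi - lo) / k
    result = []
    for count in cases:
        if width == 0:
            index = 0  # all counts identical: everything falls in the first class
        else:
            # The maximum value would land in interval k, so clamp it to k - 1.
            index = min(int((count - lo) / width), k - 1)
        result.append(labels[index])
    return result


def to_arff(relation, numeric_names, rows, class_name, class_values):
    """Build ARFF text with numeric attributes and one nominal class attribute."""
    lines = ["@relation " + relation, ""]
    for name in numeric_names:
        lines.append("@attribute " + name + " numeric")
    lines.append("@attribute " + class_name + " {" + ",".join(class_values) + "}")
    lines += ["", "@data"]
    for features, label in rows:
        lines.append(",".join(str(v) for v in features) + "," + label)
    return "\n".join(lines) + "\n"


# Toy values for illustration only; these are not data from the study.
cases = [0, 4, 9, 15, 30]
weather = [
    (27.1, 80.0, 120.5),
    (27.8, 78.5, 95.0),
    (28.3, 76.0, 60.2),
    (27.5, 82.1, 210.7),
    (26.9, 85.4, 340.0),
]
labels = severity_labels(cases)
arff_text = to_arff(
    "dengue",
    ["mean_temperature", "relative_humidity", "rainfall"],
    list(zip(weather, labels)),
    "severity",
    ("low", "medium", "high"),
)
print(arff_text)
```

The resulting text can be saved with an `.arff` extension and opened in WEKA; the nominal `severity` attribute is placed last so that it can serve as the class to predict.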
For the ANN, the learning rate was 0.3 and the hidden layer configuration was set automatically. These settings were chosen to balance accuracy and training time.

Fig. 1 Dengue rate points

IV. RESULTS
The accuracy of the different models is compared in Table 1. The results indicated that ANN and Random Forest gave the highest accuracy (98.89 percent), with the lowest RMSE and MAE of 0.085 and 0.0141, respectively.
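To make the three reported measures concrete, the sketch below computes accuracy, MAE, and RMSE for a classifier that outputs class probabilities. It rests on an assumption: the errors are taken between each predicted class probability and a 0/1 indicator of the true class, then averaged over all instances and classes. Whether this matches exactly how the values in Table 1 were produced is not stated in the text. The function name `evaluate` and the toy probabilities are hypothetical.

```python
import math


def evaluate(prob_rows, actual, classes):
    """Return accuracy (percent), MAE, and RMSE for probabilistic predictions.

    prob_rows: one tuple of class probabilities per instance, ordered as `classes`.
    actual:    the true class label of each instance.
    """
    n = len(actual)
    k = len(classes)
    correct = 0
    abs_sum = 0.0
    sq_sum = 0.0
    for probs, truth in zip(prob_rows, actual):
        # The predicted class is the one with the highest probability.
        best = max(range(k), key=lambda j: probs[j])
        if classes[best] == truth:
            correct += 1
        # Compare each probability with the 0/1 indicator of the true class.
        for j, name in enumerate(classes):
            target = 1.0 if name == truth else 0.0
            diff = probs[j] - target
            abs_sum += abs(diff)
            sq_sum += diff * diff
    return {
        "accuracy": 100.0 * correct / n,
        "mae": abs_sum / (n * k),
        "rmse": math.sqrt(sq_sum / (n * k)),
    }


# Toy predictions for illustration only; these are not results from the study.
classes = ("low", "medium", "high")
prob_rows = [
    (0.9, 0.1, 0.0),
    (0.2, 0.7, 0.1),
    (0.1, 0.3, 0.6),
    (0.5, 0.4, 0.1),
]
actual = ["low", "medium", "high", "medium"]
metrics = evaluate(prob_rows, actual, classes)
print(metrics)
```

Under this convention a model can reach high accuracy while still having a noticeable MAE or RMSE if its probabilities are poorly calibrated, which is why the three measures are reported together.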