Turkish Journal of Physiotherapy and Rehabilitation; 32(3) ISSN 2651-4451 | e-ISSN 2651-446X

RISK MAP OF DENGUE CASES IN THE SOUTHERNMOST PROVINCES OF THAILAND USING MACHINE LEARNING

Teerawad Sriklin 1, Supattra Puttinaovarat 2, Siriwan Kajornkasirat 3
1,2,3 Faculty of Science and Industrial Technology, Prince of Songkla University, Surat Thani Campus, Surat Thani 84000, Thailand
Email: 1 [email protected], [email protected], [email protected]

ABSTRACT

This study aimed to compare machine learning models suitable for spatial data on dengue cases in the three southernmost provinces of Thailand: Pattani, Yala, and Narathiwat. Data on dengue cases and weather data, including rainfall, rainy days, temperature, relative humidity, and air pressure, for the period 2015 to 2019 were obtained from the Bureau of Epidemiology and the Meteorological Department of Southern Thailand, respectively. The machine learning models compared for prediction accuracy were ANN, Random Forest, SVM, and J48. The results showed that the ANN model trained for 2,500 iterations gave the highest accuracy (98.89%) with the lowest root mean square error and a low mean absolute error. Districts in Narathiwat (Russo, Si Sakhon, and Chanae) and Pattani (Yarang and Mayo) were identified as high-risk areas. Yala province was a low-risk area, consistent with the information obtained from the Public Health Office and the risk map created from the patient information.

Keywords: dengue, ANN, Random Forest, SVM, J48.

I. INTRODUCTION

The World Health Organization [1] reported a rapid increase in the number of cases of Dengue Hemorrhagic Fever (DHF), a contagious disease caused by the dengue virus and transmitted by mosquitoes; the number of cases has risen 30-fold during the last 50 years. The disease also threatens the health of more than 40 percent of the world's population (2,500 million people), especially in tropical and temperate countries [2], [3]. The southern region of Thailand differs from other regions because it is adjacent to the sea, has high mountains, and is closer to the equator than the rest of the country. The area is therefore classified as tropical, with high temperatures and rainfall throughout the year. The average number of rainy days is 148.7 per year, with 1,781.7 millimeters of rainfall per year [4].

In 2019, Thailand reported 42,185 dengue cases and 29 deaths [5]. In the southern region, 6,859 cases and 3 deaths were reported [5]. The dengue incidence rate in the three southernmost provinces (Pattani, Yala, and Narathiwat) was higher than in the neighboring areas of the lower South region of Thailand [5].

Previous studies have used rainfall, temperature, and relative humidity [6]-[8], together with Geographic Information Systems (GIS) tools, to predict or determine risk areas by linking, analyzing, processing, and correlating data [9]-[11]. This study investigated appropriate models for dengue case prediction in the southernmost provinces of Thailand using ANN, Random Forest, SVM, and J48. The prediction was performed using WEKA software. The objectives of this study are: 1) to explore the major factors of dengue fever, 2) to study models suitable for predicting dengue severity in the three southernmost provinces, and 3) to map outbreaks of dengue fever. The results can be used in planning for and preventing the spread of dengue fever in the area.

II. THEORY AND RELATED MATERIALS

The learning and testing process was performed using WEKA software. We chose decision tree algorithms, namely Random Forest and J48, because they are widely known, suitable for solving complex problems, and easy to understand, together with two other models, ANN and SVM. These four algorithms were chosen for the following reasons. ANN has excellent performance and high accuracy in characterizing data, in a manner resembling the human brain [12]. Random Forest is a popular method with excellent performance and accuracy in classification tasks; it has also been reported to outperform counterparts such as neural networks, discriminant analysis, and Support Vector Machines (SVM) [13]. SVM is a powerful supervised learning algorithm with many applications in biophotonics, pattern recognition, and classification [14]-[16]. J48 can also be applied in fields unrelated to health; for example, it has been used in credit analysis, where it was compared with other algorithms.

A. Artificial Neural Networks (ANN)

The field of artificial neural networks tries to simulate and to fabricate networks and devices in the spirit of neurobiology, to solve useful computational problems of the kind that biology does effortlessly [17]. Such networks can be trained to perform a particular task based on available empirical data when the relationships between the data are unknown [12]. Many ANN variants exist; this study focuses on the Multi-Layer Perceptron (MLP).

In our implementation, the MLP is defined by equations (1) and (2).

$$x_j(p) = \sum_{i=1}^{n} x_i(p)\, w_{ij}(p) + b_j(p)\, w_j(p) \tag{1}$$

$$y_j(p) = \frac{1}{1 + e^{-x_j(p)}} \tag{2}$$

where

$n$ = the total number of input nodes
$x_i(p)$ = the sampled data present at the $i$-th node
$w_{ij}(p)$ = the weight assigned to the link between the $i$-th and the $j$-th nodes
$b_j(p)$, $w_j(p)$ = the bias and the weight of the link between the bias and the $j$-th node
$y_j(p)$ = the response at the $j$-th node

B. Random Forest

Random Forest is an aggregation (ensemble) method used for classification and regression. This supervised learning method builds multiple decision trees from randomly sampled observations and variables [18]. For example, one rule might separate some dengue case locations from others by a rainfall threshold, while another rule might further split the data based on whether the population density falls within a specific range. The input data are classified repeatedly by the different trees, and the final forecast or classification is obtained by averaging the outputs of the trees [19].
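The aggregation step described for Random Forest can be illustrated with a minimal sketch. It assumes each tree is reduced to a single threshold rule, and that the final class is chosen by majority vote, the classification counterpart of averaging. The attribute names, thresholds, and the sample record are hypothetical and are not fitted to the study's data.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one random sample per tree)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(votes):
    """Return the class predicted by the largest number of trees."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical one-rule "trees": (attribute, threshold). Illustrative values only.
STUMPS = [("rainfall", 150.0), ("rainy_days", 12.0), ("humidity", 80.0)]

def forest_predict(record, stumps=STUMPS):
    """Each rule votes 'high' when its attribute exceeds the threshold."""
    votes = ["high" if record[attr] > thr else "low" for attr, thr in stumps]
    return majority_vote(votes)

# Hypothetical monthly weather record for one map point.
record = {"rainfall": 210.0, "rainy_days": 15.0, "humidity": 76.0}
prediction = forest_predict(record)   # two of three rules vote "high"

# Each tree would be trained on its own bootstrap sample of the records.
sample = bootstrap_sample([1, 2, 3, 4, 5], random.Random(0))
```

In a full Random Forest each tree is grown on its own bootstrap sample with a random subset of variables considered at each split; the sketch only shows the sampling and voting steps.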

C. SVM

SVM defines a boundary between data groups (a line, or a hyperplane in the case of multiple dimensions) and maximizes the distance from that boundary to the nearest data points. These closest data points, located on both sides of the line or hyperplane, are called support vectors. Maximizing this margin gives the classifier good generalization ability, so it may produce better results on unseen samples. When the data are not linearly separable, mathematical functions known as kernel functions are used to transform the data into a higher-dimensional space in which they can be linearly separated [20].
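The kernel idea can be illustrated with a small sketch. It assumes four hypothetical one-dimensional points that no single threshold on x can separate, and a hand-chosen feature map phi(x) = (x, x^2); this is not an SVM training routine, and the separating threshold is simply placed midway between the two classes in the mapped space.

```python
def feature_map(x):
    """Map a 1-D value into 2-D: phi(x) = (x, x^2)."""
    return (x, x * x)

def kernel(a, b):
    """Kernel equal to the dot product of the mapped points: ab + (ab)^2."""
    return a * b + (a * b) ** 2

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Hypothetical data: the outer points are class +1, the inner points class -1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [1, -1, -1, 1]

def classify(x, threshold=2.5):
    """In the mapped space the line x^2 = 2.5 separates the two classes."""
    return 1 if feature_map(x)[1] > threshold else -1

predictions = [classify(x) for x in xs]
```

The kernel function returns the same value as mapping both points and taking their dot product, which is why a kernel-based classifier never needs to construct the higher-dimensional coordinates explicitly.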

D. J48

Classification is the process of building a model of classes from a set of labeled records. J48, a decision tree algorithm, examines how the attribute vector behaves across a number of instances. Based on the training instances, a class is then assigned to each new instance. The algorithm generates rules for predicting the target variable, and with the help of the tree, the key distribution of the data can be easily understood [21].
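J48 is WEKA's implementation of the C4.5 algorithm, which chooses each split by its gain ratio. The following sketch shows that criterion for a numeric attribute with a binary threshold split; the rainfall values, labels, and candidate thresholds are hypothetical and serve only to illustrate the calculation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels, threshold):
    """Information gain of the split 'value <= threshold', divided by split info."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty carries no information
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    gain = entropy(labels) - remainder
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain / split_info

# Hypothetical monthly rainfall (mm) and severity labels.
rainfall = [50.0, 80.0, 120.0, 200.0, 260.0, 300.0]
labels = ["low", "low", "low", "high", "high", "high"]

ratio_at_150 = gain_ratio(rainfall, labels, 150.0)  # separates the classes perfectly
ratio_at_100 = gain_ratio(rainfall, labels, 100.0)  # leaves one "low" on the wrong side
```

The tree builder would evaluate every candidate threshold of every attribute in this way and split on the one with the highest gain ratio.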


III. METHODOLOGY

A. Study areas

In this study, the three southernmost provinces of Thailand (Pattani, Yala, and Narathiwat) were selected as the study areas. Southern Thailand lies on the Malay Peninsula and has an area of around 70,714 km². The study areas are located near the sea and mountains and are closer to the equator than other parts of Thailand. The climate is therefore tropical, with high temperatures and rainfall throughout the year, and is influenced by the northeast monsoon from October to February. Topographically, the Sankalakiri Mountain Range runs east-west through the central and southern parts of the study areas in Yala and Narathiwat provinces and forms the border between Thailand and Malaysia. To the east, a river basin stretches to the coast of the Gulf of Thailand in Pattani and Narathiwat provinces.

B. Data collection

The data set used in this study consists of monthly dengue fever cases from January 2015 to December 2019 in the three southernmost provinces of Thailand. The data were obtained from the Bureau of Epidemiology, Ministry of Public Health. Weather data for the same period were collected from the Meteorological Department of Southern Thailand and consist of mean, minimum, and maximum temperature, air pressure, relative humidity, rainfall, and rainy days. The data locations were selected randomly from Google Earth, with a total of 180 points separated by province.

C. Data Preprocessing

The number of dengue cases was divided into three severity intervals according to the principle of the class interval (Fig. 1). The severity rates define the outcome to be predicted. Severity rate data were used in conjunction with the weather data (i.e., air pressure, relative humidity, temperature, rainfall, and number of rainy days).
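The class-interval step can be sketched as follows, assuming equal-width intervals whose width is the range of the case counts divided by the number of classes. The case counts, the label names, and the rule that the maximum value falls in the top class are illustrative assumptions; the paper does not list its actual cut points.

```python
def class_width(values, n_classes=3):
    """Class interval width: range of the data divided by the number of classes."""
    return (max(values) - min(values)) / n_classes

def severity_level(cases, lowest, width, labels=("low", "moderate", "high")):
    """Map a case count to its interval; the maximum value goes in the top class."""
    index = min(int((cases - lowest) / width), len(labels) - 1)
    return labels[index]

# Hypothetical monthly case counts, for illustration only.
monthly_cases = [0, 4, 9, 15, 22, 30]
width = class_width(monthly_cases)  # (30 - 0) / 3
levels = [severity_level(c, min(monthly_cases), width) for c in monthly_cases]
```

With these example counts the intervals are 0 to under 10 (low), 10 to under 20 (moderate), and 20 to 30 (high).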

D. Training and Dengue Fever Predicting Processes

All data were reformatted into ARFF so that the MLP-configured ANN, Random Forest, SVM, and J48 could be trained. The learning process was performed using WEKA software [22]. During training and prediction, the number of ANN iterations was set to 500, 1500, 2000, 2500, and 3000 rounds. The learning rate was 0.3 and the hidden layer setting was automatic. These settings were chosen to find a balance between accuracy and time.
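To make the network configuration concrete, the sketch below runs one forward pass through an MLP whose nodes follow equations (1) and (2). It assumes seven weather attributes as inputs and three severity classes, and it sizes the hidden layer with WEKA's automatic rule, (attributes + classes) / 2. The weights are random and the input values are hypothetical; training with the 0.3 learning rate is not shown.

```python
import math
import random

def sigmoid(x):
    """Equation (2): logistic response of a node."""
    return 1.0 / (1.0 + math.exp(-x))

def node_output(inputs, weights, bias, bias_weight):
    """Equation (1) followed by equation (2) for a single node j."""
    x_j = sum(x_i * w_ij for x_i, w_ij in zip(inputs, weights)) + bias * bias_weight
    return sigmoid(x_j)

def layer_output(inputs, layer):
    """layer holds one (weights, bias_weight) pair per node; the bias input is 1."""
    return [node_output(inputs, w, 1.0, bw) for w, bw in layer]

def auto_hidden_size(n_attributes, n_classes):
    """WEKA's automatic ('a') hidden layer size: (attributes + classes) / 2."""
    return (n_attributes + n_classes) // 2

def random_layer(n_in, n_out, rng):
    """Untrained layer with small random weights, for illustration only."""
    return [([rng.uniform(-0.5, 0.5) for _ in range(n_in)], rng.uniform(-0.5, 0.5))
            for _ in range(n_out)]

N_ATTRIBUTES = 7  # assumption: the seven weather variables listed under Data collection
N_CLASSES = 3     # low, moderate, high severity

rng = random.Random(0)
n_hidden = auto_hidden_size(N_ATTRIBUTES, N_CLASSES)
hidden = random_layer(N_ATTRIBUTES, n_hidden, rng)
output = random_layer(n_hidden, N_CLASSES, rng)

sample = [0.2, -0.1, 0.4, 0.0, 0.3, 0.8, 0.6]  # hypothetical scaled weather values
scores = layer_output(layer_output(sample, hidden), output)
predicted_class = max(range(N_CLASSES), key=lambda k: scores[k])
```

Under these assumptions the automatic setting yields five hidden nodes; a different attribute count (for example, if coordinates were also used as inputs) would change that number.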

Fig. 1 Dengue rate points


IV. RESULTS

The accuracy of the different models is compared in Table 1. ANN and Random Forest gave the highest accuracy (98.89 percent); ANN also had the lowest RMSE (0.0805), with an MAE of 0.0141. The prediction accuracy of ANN, Random Forest, SVM, and J48 is shown in Fig. 2.

Table 1 Comparison of accuracy between classifier models

Classifier model    Accuracy (%)    RMSE      MAE
ANN                 98.8889         0.0805    0.0141
Random Forest       98.8889         0.0811    0.0277
SVM                 88.8889         0.2722    0.0279
J48                 98.3333         0.1054    0.0111
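The accuracy percentages in Table 1 can be read as counts of correctly classified points. The sketch below assumes that each model was evaluated on all 180 map points, which the text does not state explicitly; the assumption is plausible because every reported percentage corresponds to a whole number of points out of 180.

```python
N_POINTS = 180  # assumption: evaluation covers all 180 sampled locations

# Accuracy values copied from Table 1 (percent).
ACCURACY = {
    "ANN": 98.8889,
    "Random Forest": 98.8889,
    "SVM": 88.8889,
    "J48": 98.3333,
}

def correct_count(accuracy_percent, n=N_POINTS):
    """Number of correctly classified points implied by an accuracy percentage."""
    return round(accuracy_percent / 100 * n)

correct = {model: correct_count(acc) for model, acc in ACCURACY.items()}
errors = {model: N_POINTS - c for model, c in correct.items()}
```

Under this assumption, ANN and Random Forest each misclassify 2 points, J48 misclassifies 3, and SVM misclassifies 20.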

Fig. 2 Accuracy prediction percent of learning process

Because the accuracy of the ANN can improve with more training iterations, the number of iterations was varied to find the most accurate setting. The ANN was trained on the same data for 500, 1500, 2000, 2500, and 3000 rounds, and the results were compared. As reported in Table 2, the ANN with 2500 iterations is well balanced between accuracy and time complexity.

Table 2 Comparison of different numbers of iterations for ANN

Number of iterations    Accuracy (%)    RMSE      MAE
500                     96.6667         0.1050    0.0235
1000                    97.2222         0.0937    0.0141
2000                    98.8889         0.0805    0.0141
2500                    98.8889         0.0802    0.0134
3000                    98.8889         0.0802    0.0131
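One way to read the trade-off in Table 2 is to pick the smallest number of iterations that reaches both the best accuracy and the best RMSE. The sketch below applies that rule to the table values; the selection rule itself is an assumption used for illustration, as is the premise that training time grows with the number of iterations.

```python
# Rows copied from Table 2: (iterations, accuracy %, RMSE, MAE).
TABLE_2 = [
    (500, 96.6667, 0.1050, 0.0235),
    (1000, 97.2222, 0.0937, 0.0141),
    (2000, 98.8889, 0.0805, 0.0141),
    (2500, 98.8889, 0.0802, 0.0134),
    (3000, 98.8889, 0.0802, 0.0131),
]

def smallest_best(rows):
    """Fewest iterations among rows with the highest accuracy and lowest RMSE."""
    best_accuracy = max(row[1] for row in rows)
    top = [row for row in rows if row[1] == best_accuracy]
    best_rmse = min(row[2] for row in top)
    candidates = [row for row in top if row[2] == best_rmse]
    return min(candidates, key=lambda row: row[0])

chosen = smallest_best(TABLE_2)
chosen_iterations = chosen[0]
```

Applied to Table 2, the rule selects 2500 iterations: 3000 iterations gives the same accuracy and RMSE and only a slightly lower MAE.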

The risk points on the map of the three southernmost provinces of Thailand are shown in Fig. 3-6, which divide the risk areas for dengue fever into three severity levels: low risk (green), moderate risk (yellow), and high risk (red). Because the ANN, Random Forest, and J48 models have similar characteristics, their points on the map differ little, so the predicted risk is similar. The maps can be read as follows: districts in Narathiwat (Russo, Si Sakhon, and Chanae) and Pattani (Yarang and Mayo) are identified as high-risk areas, while Yala province is a low-risk area.

Prediction results are compared with actual data in the form of maps. A red point marks an incorrect prediction and a green point marks a correct prediction (Fig. 7-10).
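The two color schemes used on the maps can be summarized in a short sketch: one layer colors each point by its predicted severity, and the other by whether the prediction matches the actual class. The point records and field names are hypothetical; the actual maps were produced with GIS tools from the study's data.

```python
# Colors for the risk maps: low (green), moderate (yellow), high (red).
RISK_COLOR = {"low": "green", "moderate": "yellow", "high": "red"}

def comparison_color(predicted, actual):
    """Green for a correct prediction, red for an incorrect one."""
    return "green" if predicted == actual else "red"

# Hypothetical map points with predicted and actual severity classes.
points = [
    {"id": 1, "predicted": "high", "actual": "high"},
    {"id": 2, "predicted": "low", "actual": "moderate"},
    {"id": 3, "predicted": "low", "actual": "low"},
]

risk_layer = [RISK_COLOR[p["predicted"]] for p in points]
comparison_layer = [comparison_color(p["predicted"], p["actual"]) for p in points]

# Share of green points in the comparison layer equals the accuracy.
accuracy_percent = comparison_layer.count("green") / len(points) * 100
```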

Fig. 3 Rate point by ANN

Fig. 4 Rate point by Random Forest

Fig. 5 Rate point by SVM

Fig. 6 Rate point by J48

Fig. 7 Comparison results of ANN

Fig. 8 Comparison results of Random Forest


Fig. 9 Comparison results of SVM

Fig. 10 Comparison results of J48

V. CONCLUSION

By analyzing the importance of the factors affecting dengue fever cases in the three southernmost provinces of Thailand using GIS techniques and machine learning with spatial data, this study was able to identify the risk of dengue transmission from weather factors, including mean temperature, minimum temperature, air pressure, relative humidity, rainfall, and rainy days. The models with the highest accuracy are ANN (98.89%), Random Forest (98.89%), and J48 (98.33%). These three models are considered accurate and reliable because their accuracy exceeds 95 percent. Any of these three models, or the most accurate one (i.e., ANN), can be used to identify trends in dengue fever cases in support of local public health offices. Pre-planning with emergency mosquito control measures can help reduce the risk of dengue outbreaks.

This research suggests that increasing the temporal resolution of the factors may improve the efficiency of machine learning. For example, weekly data could be used to capture climate changes when predicting dengue fever transmission.

Acknowledgment The authors would like to thank an anonymous reviewer for constructive comments on the manuscript. The work was financially supported by Prince of Songkla University, Surat Thani Campus (contract no. 003/2562), Graduate School, Prince of Songkla University, Graduate School, Prince of Songkla University Surat Campus, and Prince of Songkla University, Surat Thani Campus. We also thank the Bureau of Epidemiology, Ministry of Public Health for dengue case data and the Meteorological Department of Southern Thailand for weather data in this study.

REFERENCES

1. World Health Organization, Dengue and severe dengue. [Online] Available: https://www.who.int/news-room/fact-sheets/detail/dengue-and-severe-dengue (2020, November), 2020.
2. Wongkoon S, Jaroensutasinee M, Jaroensutasinee K. Spatio-temporal climate-based model of dengue infection in Southern, Thailand. Trop Biomed. 2016 Mar 1;33(1):55-70.
3. Bureau of Epidemiology of Ministry of Public Health, Reported cases and deaths by Province and by Month Thailand 2020. [Online] Available: http://www.boe.moph.go.th/boedb/surdata/disease.php?dcontent=old&ds=26 (2020, November), 2020.
4. Thai Meteorological Department, Weather Summary in 2015. [Online] Available: http://www.tmd.go.th/climate/climate.php?FileID=5 (2020, November), 2020.
5. Bureau of Epidemiology, Department of Disease Control, Thailand. Communicable Disease Epidemiology Group (DHF). [Online] Available: http://www.boe.moph.go.th/fact/Dengue_Haemorrhagic_Fever.htm (2020, November), 2020.
6. Sammatat S, Boonsith N, Lekdee K. Spatial mathematical analysis: an application to mapping of dengue hemorrhagic fever in Thailand, 2014.
7. Zambrano LI, Sierra M, Lara B, Rodríguez-Núñez I, Medina MT, Lozada-Riascos CO, Rodríguez-Morales AJ. Estimating and mapping the incidence of dengue and chikungunya in Honduras during 2015 using Geographic Information Systems (GIS). Journal of infection and public health. 2017 Jul 1;10(4):446-56.
8. Minale AS, Alemu K. Mapping malaria risk using geographic information systems and remote sensing: The case of Bahir Dar City, Ethiopia. Geospatial health. 2018 May 7;13(1).
9. The Office of Disease Prevention and Control 3 ChonBuri, Prediction risky area of dengue hemorrhagic fever among eastern-seaboard part of Thailand (Disease Prevention and Control 3). [Online] Available: www.interfetpthailand.net/forecast/files/report2012/report_2012_11_no12.pdf (2020, November), 2020.
10. Preechapanich O, Thernmontri S. A Geographic Information System for Supporting the Surveillance of Dengue Infection in Songkhla Province. Thaksin University Journal. 2015;18(3):161-9.
11. Somard J, Suwanlee RS, Turnbull N, Phommat T. Analyzing dengue fever risk areas using geographic information systems in Dome Pradis Sub-district, Nam Yuen District, Ubon Ratchathani Province. Journal of Medicine and Health Sciences. 2017 Dec 29;24(3):65-76.
12. Lopez-Garcia TB, Coronado-Mendoza A, Domínguez-Navarro JA. Artificial neural networks in microgrids: a review. Engineering Applications of Artificial Intelligence. 2020 Oct 1;95:103894.
13. Ong J, Liu X, Rajarethinam J, Kok SY, Liang S, Tang CS, Cook AR, Ng LC, Yap G. Mapping dengue risk in Singapore using Random Forest. PLoS neglected tropical diseases. 2018 Jun 18;12(6):e0006587.
14. Vapnik V. The nature of statistical learning theory. New York: Springer; 1995.
15. Widjaja E, Zheng W, Huang Z. Classification of colonic tissues using near-infrared Raman spectroscopy and support vector machines. International journal of oncology. 2008 Mar 1;32(3):653-62.
16. Tahir M, Khan A, Majid A. Protein subcellular localization of fluorescence imagery using spatial and transform domain features. Bioinformatics. 2012 Jan 1;28(1):91-7.
17. Hopfield JJ. Artificial neural networks. IEEE Circuits and Devices Magazine. 1988 Sep;4(5):3-10.
18. Fawagreh K, Gaber MM, Elyan E. Random forests: from early developments to recent advancements. Systems Science & Control Engineering: An Open Access Journal. 2014 Dec 1;2(1):602-9.
19. Bostrom H. Estimating class probabilities in random forests. In: Sixth International Conference on Machine Learning and Applications (ICMLA 2007) 2007 Dec 13 (pp. 211-216). IEEE.
20. Tahir M. Protein Subcellular Classification using machine learning approaches (Doctoral dissertation, Pakistan Institute of Engineering and Applied Sciences Nilore Islamabad, Pakistan), 2014.
21. Kaur G, Chhabra A. Improved J48 classification algorithm for the prediction of diabetes. International journal of computer applications. 2014 Jan 1;98(22).
22. Bouckaert RR, Frank E, Hall MA, Holmes G, Pfahringer B, Reutemann P, Witten IH. WEKA---Experiences with a Java Open-Source Project. The Journal of Machine Learning Research. 2010 Dec 1;11:2533-41.

www.turkjphysiotherrehabil.org 10890