Type 2 Diabetes Prediction Using Machine Learning Algorithms
Total Page:16
File Type:pdf, Size:1020Kb
Type 2 Diabetes Prediction Using Machine Learning Algorithms Parisa Karimi Darabi*, Mohammad Jafar Tarokh Department of Industrial Engineering, K. N. Toosi University of Technology, Tehran, Iran Abstract Article Type: Background and Objective: Currently, diabetes is one of the leading causes of Original article death in the world. According to several factors diagnosis of this disease is Article History: complex and prone to human error. This study aimed to analyze the risk of Received: 15 Jul 2020 having diabetes based on laboratory information, life style and, family history Revised: 1 Aug 2020 with the help of machine learning algorithms. When the model is trained Accepted: 5 Aug 2020 properly, people can examine their risk of having diabetes. *Correspondence: Material and Methods: To classify patients, by using Python, eight different machine learning algorithms (Logistic Regression, Nearest Neighbor, Decision Parisa Karimi Darabi, Tree, Random Forest, Support Vector Machine, Naive Bayesian, Neural Department of Industrial Network and Gradient Boosting) were analyzed. Were evaluated by accuracy, Engineering, K. N. Toosi University sensitivity, specificity and ROC curve parameters. of Technology, Tehran, Iran Results: The model based on the gradient boosting algorithm showed the best [email protected] performance with a prediction accuracy of %95.50. Conclusion: In the future, this model can be used for diagnosis diabete. The basis of this study is to do more research and develop models such as other learning machine algorithms. Keywords: Prediction, diabetes, machine learning, gradient boosting, ROC curve DOI : 10.29252/jorjanibiomedj.8.3.4 people die every year from diabetes disease. Introduction By 2045, it will reach 629 million (1). According to the world health organization, This epidemic disease tends to grow. It was diabetes is a chronic disease. This disease is estimated that approximately 642 million due to either the pancreas not producing people will have diabetes by 2040. However, enough insulin or the body not responding the number of studies estimated that the properly to the insulin produced. annual cost of the disease from US$673 Diabetes has a serious global economic billion in 2015 reaches US$827 billion in impact. In 2012, this disease with 1.5 million 2016, which currently exceed the 2030 deaths, made the 8th leading cause of death. prediction (2). Based on statistics of 2017, nearly 425 Moreover, the economic burden of diabetes in million people had diabetes, about 2-5 million China has risen over time. In 2015, the total Copyright© 2018, Jorjani Biomedicine Journal has published this work as an open access article under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/4.0/) which permits non- commercial uses of the work while it is properly cited. Type 2 Diabetes Prediction Using Machine Learning Algorithms Karimi Darabi P. et al. annual expenditure was between US$51.1 and Despite human error probability, different US$88.4 billion (3). types of machine learning algorithms, Type 1 diabetes is also called “insulin- analytical and statistical methods, that use the dependent diabetes”. Type 1 diabetes results previous and current data for finding from the body’s inability to produce insulin. information and predicting future events, can Therefore, the injection of insulin is essential be used for solving this problem. for the patient. Type 2 diabetes occurs when The artificial intelligence field is growing the body’s cells do not react to insulin rapidly. Its functions in the field of diabetes properly. This type of diabetes appears most are diagnosis and management of this chronic often in middle-aged, and sometimes older disease. The techniques of machine learning people. Gestational diabetes is high blood are used for making models of diabetes sugar that develops during pregnancy in a diagnosis and its side effects (6). woman who has not been diagnosed with Results of empirical research articles, reflect diabetes before. the ability of analytical techniques to Today, the disease is also occurring in diagnose diabetes. Recent research shows that children. There are several causes of diabetes, machine learning techniques can describe the including allergy, genetics, body weight, diet, characteristics of patients and identify patients physical activity, and, lifestyle (4). at risk for type 2 diabetes (7, 8). Machine Since diabetes is a multifactorial disease, it is Learning is one of the most important features diagnosis is complicated. Therefore, for of artificial intelligence that supports the accurate diagnosis, the physician should development of computer systems with the examine the test results of a patient, compare ability to learn and improve from experiences, and the need to plan for every case (9). the test results with those of patients under the same condition, and analyze previous The purpose of this study is to use learning decisions. To care for a person, doctors are machine algorithms for making models of well prepared to identify people at risk for type 2 diabetes diagnosis. type 2 diabetes. Related work However, as we try to screen thousands of More than half of all adults worldwide are patients at high risk, the challenges that suffering from diabetes disease. Hence, early physicians face become apparent (5). and detection of the disease greatly improves Analysis of the factors influencing the patient quality of life. For the diagnosis of diagnosis may be influenced by human error. diabetes, a lot of researchers used learning Because it is subject to the physician’s machine methods and data analysis (10). interpretation. Many articles by using statistical analysis A Blood test is not discriminatory enough, methods explored connections and samples and it is not enough for an accurate diagnosis between Chinese patients’ data (11-14). of diabetes. Because among different characteristics of people it may be interpreted Gao et al. in (11) used statistical analysis and differently (4). Therefore, apart from multivariate Cox Regression algorithm to laboratory information, lifestyle, physical study the relationship between ALT and condition, and family history of patients diabetes. They concluded that people between should be examined. the age group 30-40 with elevated ALT are at 5| Jorjani Biomedicine Journal. 2020; 8(3): P 4-18. Type 2 Diabetes Prediction Using Machine Learning Algorithms Karimi Darabi P. et al. a greater risk compared to those with low Many researchers also by analyzing different ALT. types of machine learning algorithms, assess Chen et al. in (12) by using statistical analysis and compare their performance on a particular and multivariate Cox Regression algorithm data set, and, ultimately with particular demonstrated that ratio TG /HDL-C was evaluation methodologies, introduce the most positively associated with the incidence of appropriate predictive model for diabetes (20, diabetes in the Chinese population. 21). Qin et al. in (3) by exploring ratio TG/HDL-C Materials and Methods as an independent predictor of diabetes, found Brief description of Machine Learning that TG/HDL-C is a powerful predictor in Classification Techniques male patients compared to female patients. Logistic Regression Chen et al. in (13) investigated the relation of obesity and age with diabetes. Eventually, Logistic Regression (LR) is a sort of across all age groups, there was a linear supervised learning which estimates the association between BMI and diabetes. They connection between a binary dependent verified that BMI and age are a strong variable and at least one independent variable predictor of diabetes. by evaluating probabilities with the help of sigmoid function. In contrary to its name, LR Lin et al. in (14) designed a Nomogram based is not used for regression problems rather is a on the seven factors of diabetes in order to type of machine learning classification predict type 2 diabetes risk among people. problem where the dependent variable is Diabetes prediction should be under control. dichotomous (0/1, -1/1, true/false) and Therefore, supervised learning algorithms independent variable can binominal, ordinal, have attracted a great deal of attention from interval or ratio-level. The sigmoid/Logistic researchers. By using machine learning, many function is given as: diabetes research articles have been done on the Pima data set (15-18). (1) Wu et al. in (15) applied the Nearest Neighbor where, y is the output which is the result of and Regression Logistic algorithms to make a weighted sum of input variables x. If the diabetes prediction model with an accuracy of output is more than 0.5, the output is 1 else 95.42%. the output is 0 (21). Viloria et al. in (6) developed a diabetes K-Nearest Neighbor predictive model by applying the Support K-Nearest Neighbor (KNN) method can be Vector Machine algorithm. The algorithm used for both classification and regression was obtained with an accuracy of 99.2% with problems (19). However, it is more widely Colombian patients. used in classification problems in the Devi et al. in (19) by employing a Support industry. KNN is a simple algorithm that Vector Machine algorithm, managed to make stores all available cases and classifies new a diabetes prediction model with an accuracy cases by a majority vote of its k neighbors. of 99.4%. The case is assigned to the class is most common amongst its K nearest neighbors 6| Jorjani Biomedicine Journal. 2020; 8(3): P 4-18. Type 2 Diabetes Prediction Using Machine Learning Algorithms Karimi Darabi P. et al. measured by a distance function. These multidimensional space. A hyperplane is a distance functions can be Euclidean, decision boundary to classify data points. The Manhattan, Makowski, and Hamming hyperplane classifies the data points with the distance. maximum margin between the classes and the The first three functions are used for hyperplane (21). continuous function and the fourth one In this algorithm, we plot each data item as a (Hamming) for categorical variables. If K = 1, point in n-dimensional space (where n is the then the case is simply assigned to the class of number of features you have) with the value its nearest neighbor.