An Effective Machine Learning Approach for Disease Predictive Modelling in Medical Application
INTERNATIONAL JOURNAL OF SCIENTIFIC & TECHNOLOGY RESEARCH VOLUME 9, ISSUE 04, APRIL 2020 ISSN 2277-8616

Priya G, Radhika A

Priya G is currently pursuing a PhD (part-time) in the Department of Statistics, Periyar University, Salem.
Radhika A is currently working as Assistant Professor, Department of Statistics, Periyar University, Salem.
IJSTR©2020 www.ijstr.org

Abstract: Most illnesses are lethal if left untreated, and most people do not know whether they have a given disorder. It is therefore necessary to diagnose disease at an early stage to improve the life expectancy of affected individuals. Cancer is among the deadliest diseases and needs to be identified and diagnosed early. In particular, breast cancer is the most common cancer among women worldwide and has a high death rate, so it needs to be predicted early. In this paper, a novel predictive modelling technique combining logistic regression, random forest and a deep neural network is implemented to predict breast cancer. The performance of the proposed model is evaluated in terms of accuracy, precision, recall and F1-score. The proposed ensemble technique achieves an accuracy of 97.6%, a precision of 97%, a recall of 97% and an F1-score of 97%. These results compare favourably with those of a traditional method such as logistic regression and a conventional method such as random forest.

Index Terms: Breast Cancer, Deep Neural Network, Disease predictive modelling, Machine Learning algorithms, Principal component analysis

1. INTRODUCTION
In recent years, most non-communicable diseases, such as cancer and heart disease, have been attributed mainly to unhealthy lifestyles and morbidity [1]. These diseases account for roughly two-thirds of global deaths, and breast cancer is one of them. Breast cancer is also the most common cancer among women, and its prevalence is growing across the globe. According to the 2002 World Health Report, the causes of breast cancer are linked to five factors: sedentary lifestyle, smoking, insufficient breastfeeding, unhealthy diet, and excess alcohol usage.

Breast cancer develops from the epithelial cells of the glandular milk ducts of the breast. A cancerous tumour is called malignant, and a non-cancerous tumour is called benign. Several diagnostic procedures are required for the doctor or surgeon to decide whether a tumour is benign or malignant [2]. The doctor uses several measurements, such as cell type uniformity, clump density and cell size uniformity, to diagnose breast cancer, but the outcome is still not always accurate. This has led to an increase in the use of machine learning and computing as diagnostic tools. New diagnostic tools or techniques for predicting breast cancer therefore need to be developed so that the disease is detected earlier and human life is extended.

2. RELATED WORK
In this section we review research on breast cancer that uses predictive modelling with machine learning techniques such as fuzzy-based logistic regression, Support Vector Machine (SVM), Naive Bayes (NB), Linear Discriminant Analysis (LDA), AdaBoost, Random Forest (RF), Logistic Regression (LR), Least-square SVM (LSSVM) and Adabag. A brief summary of these methods and their limitations follows.

Nandagopal et al. [3] employed a novel regression method called LLR, derived from LASSO (Least Absolute Shrinkage and Selection Operator) Logistic Regression, to detect the victim gene by classifying cancer data. A machine learning technique, fuzzy-based logistic regression, was used to select the cancer gene data via feature selection. The average classification accuracy on a breast cancer dataset with 116 cases was 94.05%, and the model also achieved a lower error rate. The main drawback of the LLR system is that it has only been tested on 116 instances, which is a very small dataset; it should be evaluated on a larger dataset in future.

Singh [4] evaluated the ability of anthropometric and medical measurements to screen for breast cancer using a set of 116 cases. Different elements of the machine learning model were incorporated, and performance was analysed using feature selection, cross-validation and a classification method. Resistin, insulin, age, HOMA and glucose were used as biomarkers for the detection of breast cancer. More research is needed to verify these results with more factors on a larger, multi-centre database of anthropometric and medical measurements. In future, the accuracy could be compared with that of advanced techniques such as a deep neural network on a larger database.

Tapak et al. [5] compared six machine learning methods and two traditional methods for predicting breast cancer (BC) metastasis and survival: SVM, NB, LDA, AdaBoost, RF, LR, LSSVM and Adabag. The study used a database of 550 patients with breast cancer and evaluated accuracy, specificity, sensitivity and likelihood ratio. The highest specificity was 98% (RF), the accuracy of both SVM and LDA was 93%, and the highest sensitivity was 36% (NB).

Liu [6] used logistic regression from the Sklearn machine learning library to predict breast cancer by classifying the Wisconsin Diagnostic Breast Cancer (WDBC) dataset. Two characteristics, mean texture and mean radius, were used for classification. The accuracy was 90.48% when choosing the maximum texture and 96.5% when choosing the maximum radius, so the best result for this conventional machine learning method was 96.5%. A better combination of features must therefore be chosen in future to improve classification accuracy.

Jain and Bhaumik [7] proposed an Application Specific Integrated Circuit (ASIC) model, based on forward search, to diagnose cardiovascular disease on a smartphone. It is a diagnostic algorithm with low computational cost for processing the ECG signal, and it was evaluated on the Physionet PTB ECG diagnostic dataset. The algorithm accurately predicted the T wave parameters, P wave, ST-segment and QRS complex. The specificity for the P(T) wave was 91.07% and the sensitivity was 98.91%. In future, the proposed model needs to be tested on different datasets for predicting cardiovascular disease on a smartphone.

The literature reviewed shows that various machine learning techniques have been applied to diseases such as breast cancer [6], [8] and cardiovascular disease [7], [9]. Overall, earlier research has produced decent results in predicting breast cancer.
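The studies reviewed above are compared on accuracy, sensitivity and specificity, and this paper additionally reports precision, recall and F1-score. As a reminder of how these figures relate to one another, the following is a minimal Python sketch that derives them from the four cells of a binary confusion matrix. The class name `ConfusionMatrix`, the choice of "malignant" as the positive class and the counts used at the end are illustrative assumptions; they are not taken from any of the cited studies.

```python
from dataclasses import dataclass


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


@dataclass(frozen=True)
class ConfusionMatrix:
    """Binary confusion matrix; 'positive' is assumed to mean malignant."""
    tp: int  # malignant cases predicted as malignant
    fp: int  # benign cases predicted as malignant
    tn: int  # benign cases predicted as benign
    fn: int  # malignant cases predicted as benign

    def accuracy(self) -> float:
        # Share of all cases that were classified correctly.
        return _ratio(self.tp + self.tn, self.tp + self.fp + self.tn + self.fn)

    def precision(self) -> float:
        # Share of predicted-malignant cases that really are malignant.
        return _ratio(self.tp, self.tp + self.fp)

    def recall(self) -> float:
        # Also called sensitivity: share of malignant cases that were found.
        return _ratio(self.tp, self.tp + self.fn)

    def specificity(self) -> float:
        # Share of benign cases that were correctly left unflagged.
        return _ratio(self.tn, self.tn + self.fp)

    def f1_score(self) -> float:
        # Harmonic mean of precision and recall.
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts, chosen only to exercise the formulas.
cm = ConfusionMatrix(tp=40, fp=2, tn=70, fn=2)
for name in ("accuracy", "precision", "recall", "specificity", "f1_score"):
    print(f"{name}: {getattr(cm, name)():.3f}")
```

Sensitivity and recall are the same quantity, which is why a study can report a high specificity alongside a low sensitivity: the two are computed from different rows of the matrix.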
Nevertheless, earlier detection methods need improvement. We believe that an ensemble methodology using machine learning and data mining can help to predict breast cancer accurately and to reduce earlier diagnostic errors. As a result, the ensemble predictive method ultimately delivers good-quality services to the patient.

3. RESEARCH METHODOLOGY
The aim of the research is to predict whether the patient has a benign (non-cancerous) tumour or a malignant (cancerous) tumour. We have outlined an ensemble predictive model to arrive at the most accurate predictions. The system consists of the following major phases: system configuration, dataset selection, data scaling, feature selection, feature extraction, and classification. Finally, the result is assessed using the confusion matrix and performance metrics. The workflow below gives an overview of the entire research methodology:

1. System Configuration
2. Selection of dataset
3. Data Pre-processing
   Checking for missing values
   Data scaling (i.e. Standardization)
   Feature selection (i.e. Random Forest)
   Feature extraction (i.e. PCA)
4. Training and Testing by the classifier (i.e. Logistic Regression and Deep Neural Network)
5. The performance metric is evaluated using the confusion matrix;

Fig. 1. Proposed Method (LR, RF and DNN)

3.1 System Configuration
In this research work, an effective disease predictive model for medical application is implemented in Python using three machine learning techniques: logistic regression, random forest, and deep neural network. All experiments were conducted on a test machine with an Intel(R) Core(TM) i7 processor and 16 GB of RAM running the 64-bit Windows 10 operating system. Data transformation and model training were executed using Python 3.7. The system configuration is detailed in Table 1.

TABLE 1
System Configuration