INTERNATIONAL JOURNAL OF SCIENTIFIC & TECHNOLOGY RESEARCH VOLUME 9, ISSUE 04, APRIL 2020 ISSN 2277-8616

An Effective Approach For Disease Predictive Modelling In Medical Application

Priya G, Radhika A

Abstract— Many illnesses are lethal if left untreated, and most people do not know whether they have a given disorder. Diagnosing a disease at an early stage is therefore necessary to improve the life expectancy of affected individuals. Cancer is among the deadliest diseases and needs to be identified and diagnosed early. In particular, breast cancer is the most common cancer among women worldwide and has a high death rate, so it needs to be predicted early. In this paper, a novel predictive modelling technique that combines logistic regression, random forest and a deep neural network is implemented to predict breast cancer. The performance of the proposed model is evaluated in terms of accuracy, precision, recall and F1-score. The proposed ensemble technique achieves an accuracy of 97.6%, a precision of 97%, a recall of 97% and an F1-score of 97%. These results are better than those obtained with logistic regression or random forest alone.

Index Terms— Breast Cancer, Deep Neural Network, Disease predictive modelling, Machine Learning algorithms, Principal component analysis

• Priya G is currently pursuing a PhD (part-time) in the Department of Statistics, Periyar University, Salem.
• Radhika A is currently working as Assistant Professor, Department of Statistics, Periyar University, Salem.

1. INTRODUCTION
In recent years, most non-communicable diseases, such as cancer and heart disease, have been linked mainly to unhealthy lifestyles and morbidity [1]. Such diseases account for about two-thirds of global deaths, and breast cancer is one of them. Breast cancer is also the most common cancer among women, and its prevalence is growing across the globe. According to the 2002 World Health Report, the causes of breast cancer include five factors: sedentary lifestyle, smoking, insufficient breastfeeding, unhealthy diet, and excess alcohol usage. Breast cancer develops from the epithelial cells of the glandular milk ducts of the breast. A cancerous tumour is called malignant, and a non-cancerous tumour is called benign. Several diagnostic procedures are required for a doctor or surgeon to decide whether a tumour is benign or malignant [2]. Doctors use several measurements, such as cell type uniformity, clump density and cell size uniformity, to diagnose breast cancer, but the outcome is still not always accurate. This has led to an increase in the use of machine learning and computing as diagnostic tools. Therefore, new diagnostic tools or techniques that predict breast cancer earlier need to be developed in order to extend human life.

2. RELATED WORK
In this section we review research on breast cancer prediction that uses different machine learning techniques, such as fuzzy-based logistic regression, Support Vector Machine (SVM), Naive Bayes (NB), Linear Discriminant Analysis (LDA), AdaBoost, Random Forest (RF), Logistic Regression (LR), Least-square SVM (LSSVM) and Adabag. A brief summary of these methods and their limitations follows.

Nandagopal et al. [3] employed a novel regression method called LLR, derived from LASSO (Least Absolute Shrinkage and Selection Operator) logistic regression, to detect the victim gene by classifying cancer data. A fuzzy-based logistic regression technique was used to select the cancer gene data via feature selection. The average classification accuracy on a breast cancer dataset with 116 cases was 94.05%, and the model also achieved a lower error rate. A drawback of the LLR system is that it has only been tested on 116 instances, which is a very small dataset; it should be evaluated on a larger dataset in future.

Singh [4] evaluated the ability of anthropometric and medical measurements to screen for breast cancer using a set of 116 cases. Different components of a machine learning pipeline were combined, and performance was analysed using feature selection, cross-validation and classification methods. Resistin, insulin, age, HOMA and glucose were then used as biomarkers to assess the accuracy of breast cancer detection. More research is needed to verify these results with more factors on a larger, multi-centre database of anthropometric and medical measurements. In future, the accuracy could also be compared with advanced techniques such as a deep neural network on a larger database.

Tapak et al. [5] compared six machine learning methods and two traditional methods for predicting breast cancer (BC) metastasis and survival: SVM, NB, LDA, AdaBoost, RF, LR, LSSVM and Adabag. The study used a database of 550 patients with breast cancer and evaluated accuracy, specificity, sensitivity and likelihood ratio. RF had the highest specificity (98%), SVM and LDA each reached an accuracy of 93%, and NB had the highest sensitivity (36%).

Liu [6] used the logistic regression technique from the Sklearn machine learning library to predict breast cancer by classifying the Wisconsin Diagnostic Breast Cancer (WDBC) dataset. Two characteristics, mean texture and mean radius, were used for classification. The accuracy was 90.48% when choosing the maximum texture and 96.5% when choosing the maximum radius, so the best result for this conventional machine learning method was 96.5%. In future, a better combination of features must be chosen to improve classification accuracy.

Jain and Bhaumik [7] proposed an Application Specific Integrated Circuit (ASIC) model based on forward search to diagnose cardiovascular disease on a smartphone. It is a diagnostic algorithm for processing the ECG signal. The ASIC implements a low-complexity detection algorithm, and the Physionet PTB ECG diagnostic dataset was used to evaluate it. The algorithm accurately predicted the T wave parameters, P wave, ST-segment and QRS complex. The specificity for the P(T) wave was 91.07% and the sensitivity was 98.91%. In future, the proposed model needs to be tested on different datasets.

From this review of the literature, different machine learning techniques have been applied to study diseases such as breast cancer [6], [8] and cardiovascular disease [7], [9]. Overall, earlier research has produced decent results in predicting breast cancer. Nevertheless, the earlier detection methods need improvement. We believe that an ensemble methodology using machine learning can help to predict breast cancer accurately and to reduce previous diagnostic errors. As a result, the ensemble predictive method can ultimately provide the patient with good-quality services.

3. RESEARCH METHODOLOGY
The aim of the research is to predict whether the patient has a benign (non-cancerous) or malignant (cancerous) tumour. We have outlined an ensemble predictive model to arrive at the most accurate predictions. The system consists of the following major phases: system configuration, dataset selection, data scaling, feature selection, feature extraction, and classification. Finally, the result is obtained using the confusion matrix and performance metrics. The workflow below outlines the entire research methodology:

1. System Configuration
2. Selection of dataset
3. Data Pre-processing
    • Checking for missing value
    • Data scaling (i.e. Standardization)
    • Feature selection (i.e. Random Forest)
    • Feature extraction (i.e. PCA)
4. Training and Testing by the classifier (i.e. Logistic Regression and Deep Neural Network)
5. The performance metric is evaluated using the confusion matrix

Figure 1 shows the flowchart of the proposed method, and the following subsections describe each technique and the actions carried out for it.

Fig. 1. Proposed Method (LR, RF and DNN)

3.1 System Configuration
In this research work, disease predictive modelling for a medical application is implemented in Python using three machine learning techniques: logistic regression, random forest, and a deep neural network. All experiments were performed on a test machine with an Intel(R) Core(TM) i7 processor and 16 GB of RAM, running the 64-bit Windows 10 operating system. Data transformation and model training were executed using Python 3.7. The details of the system configuration are given in Table 1.

TABLE 1
System Configuration

3.2 Breast Cancer – Evaluation

3.2.1 Dataset
The breast cancer data used in this research were taken from the Wisconsin breast cancer data set (WBCD) [10], which is hosted by the machine learning repository of the University of California at Irvine. The data set contains 569 instances (rows) with 32 attributes (columns). Ten characteristics of the cell nuclei are computed from a digitised image of a fine-needle aspirate (FNA) of a breast lump, and each is summarised by its average value, variance and maximum value [11]. The resulting features include area, radius, perimeter, compactness, texture, symmetry, smoothness, fractal dimension, concave points, etc. Sample records are presented in Table 2.

3.2.2 Data Pre-processing

3.2.2.1 Checking for missing value
The first step of pre-processing is to check for missing values in the dataset. We analysed the Wisconsin breast cancer data set and found that it contains no null elements. Every feature in the dataset is numerical.

3.2.2.2 Data scaling (Standardization)
The next important step in pre-processing is data scaling, because the data need to be scaled before modelling. In our research, data scaling is done using the standardization method, also known as the z-score. Every data point in the WBCD dataset is standardised using the mean and standard deviation of its feature:

$$Z_i = \frac{X_i - \bar{X}}{S} \qquad (1)$$

where
$Z_i$ is the standardized value,
$X_i$ is a data point (x1, x2, x3…xn),
$\bar{X}$ is the mean, and
$S$ is the standard deviation.

3.2.2.3 Feature Selection
The third step of pre-processing is feature selection, which picks the most relevant features from the dataset. Feature selection is used to improve accuracy, reduce overfitting and reduce training time [12]. In this study, the random forest algorithm is used for feature selection; it identifies the important features automatically from the dataset. The selected features are those that contribute most to the prediction, so the technique keeps only a small subset of features rather than all of them. The random forest method uses concepts from information theory (the mathematical theory of communication) to pick the most significant features with respect to the prediction variable. In this study, the most significant features (nearly 15) were extracted from the WBCD dataset using the random forest classifier.
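The information-theoretic idea behind tree-based feature ranking can be illustrated with a small sketch. It computes the entropy of a set of labels and the information gain of a threshold split on a single feature; a random forest aggregates many such split scores across its trees. The toy values, the thresholds and the function names `entropy` and `information_gain` are illustrative assumptions, not the authors' code or data, and library implementations may use Gini impurity instead of entropy.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting the samples at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


# Hypothetical toy data (NOT taken from WBCD): one feature that separates
# the classes and one that does not, for six labelled samples.
labels = ["M", "M", "M", "B", "B", "B"]
radius_like = [20.1, 19.4, 18.7, 12.3, 11.8, 13.0]
noise_like = [0.5, 0.1, 0.9, 0.6, 0.2, 0.8]

gain_radius = information_gain(radius_like, labels, threshold=15.0)
gain_noise = information_gain(noise_like, labels, threshold=0.55)
print(f"gain (radius-like): {gain_radius:.3f}")
print(f"gain (noise-like):  {gain_noise:.3f}")
```

A feature whose splits yield a larger gain would be ranked as more important, which is the sense in which a small subset of features can be kept.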

TABLE 2
Sample Wisconsin breast cancer data set

ID        Diagnosis  Radius mean  Texture mean  Perimeter mean  Area mean  Smoothness mean  Compactness mean  Concavity mean  Concave points mean
842302    M          17.99        10.38         122.80          1001.0     0.11840          0.27760           0.3001          0.14710
842517    M          20.57        17.77         132.90          1326.0     0.08474          0.07864           0.0869          0.07017
84300903  M          19.69        21.25         130.00          1203.0     0.10960          0.15990           0.1974          0.12790
84348301  M          11.42        20.38         77.58           386.1      0.14250          0.28390           0.2414          0.10520
84358402  M          20.29        14.34         135.10          1297.0     0.10030          0.13280           0.1980          0.10430
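As a small worked illustration of the z-score standardisation described in Section 3.2.2.2, the sketch below scales the five radius mean values listed in Table 2. It assumes the population standard deviation (the convention used by scikit-learn's `StandardScaler`); the paper does not state which convention was used, and the helper name `standardize` is hypothetical.

```python
import statistics


def standardize(values):
    """Return the z-score (x - mean) / std for every value in `values`."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population std: an assumption
    if std == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(x - mean) / std for x in values]


# Radius mean column of the five sample records in Table 2.
radius_mean = [17.99, 20.57, 19.69, 11.42, 20.29]
z_scores = standardize(radius_mean)

for x, z in zip(radius_mean, z_scores):
    print(f"{x:6.2f} -> {z:+.3f}")
```

After scaling, the values have zero mean and unit standard deviation, so features measured on very different scales (for example area and smoothness) contribute comparably to the model.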

3.2.2.4 Feature Extraction
Feature extraction is the fourth step of pre-processing and is also known as dimension reduction. It reduces dimensionality by condensing the original set of raw information into a smaller, more manageable set of features that still describes the original data accurately. In our research, feature extraction is used to remove unwanted variables from the WBCD dataset, and Principal Component Analysis (PCA) is used for the dimension reduction. PCA is a statistical method that uses an orthogonal transformation to convert a set of possibly correlated observed variables into a set of linearly uncorrelated variables called principal components. The 32 attributes of the WBCD dataset are reduced to 2 principal components using PCA. Since the principal components are orthogonal to each other, they are uncorrelated, and the malignant and benign classes can be seen as distinct groups. The result of the PCA feature extraction is shown in Fig. 2.

Fig. 2. Feature Extraction(PCA)

3.2.3 Classification (Training and Testing the model)
Once the data is pre-processed, a model must be trained on it to predict the results accurately, which requires a machine learning classification algorithm. In our research, we used three classification algorithms, Logistic Regression, Random Forest and Deep Neural Network, to train the model. The trained models are then passed to the testing phase, where the same three algorithms are evaluated. The WBCD dataset is split into 70% for training and 30% for testing. Each classifier predicts whether or not the person has breast cancer: a person with breast cancer is predicted as having a malignant tumour, otherwise a benign tumour. For better accuracy, we tried an ensemble technique that combines the three algorithms, each of which predicts whether the patient has a malignant or a benign tumour.

4. RESULT
Once the model has been trained and tested on the breast cancer data, its performance is evaluated. The confusion matrix values, namely true positive (TP), true negative (TN), false positive (FP) and false negative (FN), are used to calculate the performance metrics accuracy, precision, recall and F1-score. The performance of the proposed model is evaluated by comparing the actual and predicted classifications, and the accuracy of the system is determined from the confusion matrix obtained by the classifier. Precision, recall, F1-score and accuracy are calculated using the formulas below.

Precision
Precision considers only the samples predicted as positive. It is the probability that a sample classified as positive truly belongs to the positive class, and is estimated as the ratio of true positives to the sum of true positives and false positives.

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (2)$$

Recall
Recall measures the sensitivity of the model in finding the positive class. It is evaluated as the ratio of true positives to the total of true positives and false negatives.

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (3)$$

F-measure
The F-measure, also known as the F1-score, is the harmonic mean of recall and precision. Its value ranges from 0 to 1, and a high value indicates good performance. A variant called the Fβ-measure is the weighted harmonic mean of recall and precision. The metric is sensitive to changes in the distribution of the data: for example, if the number of negative-class samples is increased α times, the value of the F-measure changes. The F-measure is calculated as

$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (4)$$

Accuracy
Accuracy measures how well the model identifies both the positive and the negative class. It is calculated as the ratio of the sum of true positives and true negatives to the total number of samples (true and false positives and negatives). A high accuracy means that the predictions are close to the real outputs.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (5)$$

The meanings of TP, TN, FP and FN are listed in Table 3.

TABLE 3
Performance Evaluation Metrics

Term  Meaning
TP    The model correctly predicts the positive class
TN    The model correctly predicts the negative class
FP    The model wrongly predicts the positive class
FN    The model wrongly predicts the negative class

4.1 Confusion matrix
The confusion matrix summarises the full performance of a model in matrix form. The confusion matrices of the three techniques, logistic regression, random forest and the ensemble, are described in detail below. From each confusion matrix, the true positive (TP), true negative (TN), false positive (FP) and false negative (FN) counts were calculated. Table 4 gives the confusion matrix for each technique.

TABLE 4
Confusion Matrix

Model     Class      Malignant  Benign
LR        Malignant  116        5
          Benign     0          50
RF        Malignant  115        7
          Benign     1          48
Ensemble  Malignant  116        3
          Benign     0          52

4.1.1 Logistic Regression
Using the logistic regression technique, the confusion matrix gives a true positive (TP) count of 116, a true negative (TN) count of 50, a false positive (FP) count of 5 and a false negative (FN) count of 0. These counts indicate that the algorithm is appropriately trained and that the result is effective. The confusion matrix for logistic regression is given in Table 4.

4.1.2 Random Forest
We also calculated the confusion matrix using the random forest technique. The true positive (TP) count is 115, the true negative (TN) count is 48, the false positive (FP) count is 7 and the false negative (FN) count is 1. These counts indicate that the algorithm is appropriately trained and that the result is effective. The confusion matrix for the random forest is given in Table 4.

4.1.3 Ensemble (LR, RF and DNN)
Further, we calculated the confusion matrix using the ensemble technique for the WBCD dataset. The true positive (TP) count is 116, the true negative (TN) count is 52, the false positive (FP) count is 3 and the false negative (FN) count is 0. The ensemble model therefore makes fewer errors than the other two techniques. The confusion matrix for the ensemble technique is given in Table 4.
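To make the metric definitions of Section 4 concrete, the following sketch computes accuracy, precision, recall and F1-score from the TP, TN, FP and FN counts reported in Table 4. The `ConfusionCounts` class and its method names are illustrative assumptions rather than the authors' code. The values are computed for the positive class only, so they need not coincide with every rounded figure reported by the authors, which may have been averaged over both classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # true positives
    tn: int  # true negatives
    fp: int  # false positives
    fn: int  # false negatives

    # All methods assume non-zero denominators, which holds for Table 4.
    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.tn + self.fp + self.fn)

    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    def recall(self) -> float:
        return self.tp / (self.tp + self.fn)

    def f1(self) -> float:
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r)  # harmonic mean of precision and recall


# Counts as reported in Table 4 of the paper.
models = {
    "LR": ConfusionCounts(tp=116, tn=50, fp=5, fn=0),
    "RF": ConfusionCounts(tp=115, tn=48, fp=7, fn=1),
    "Ensemble": ConfusionCounts(tp=116, tn=52, fp=3, fn=0),
}

for name, c in models.items():
    print(f"{name:8s} acc={c.accuracy():.3f} prec={c.precision():.3f} "
          f"rec={c.recall():.3f} f1={c.f1():.3f}")
```

For the LR counts the accuracy is 166/171 ≈ 0.971 and for the RF counts it is 163/171 ≈ 0.953, in line with the 97% and 95% accuracies reported for these two models.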

4.2 Performance metric
The performance of the proposed system is examined by comparing the actual and predicted classifications. The framework consists of three approaches: logistic regression, random forest and the ensemble technique. Each approach is run independently, and its accuracy, precision, recall and F1-score are calculated from its confusion matrix. Table 5 presents the performance evaluation of the three techniques.

TABLE 5
Comparison of LR, RF and DNN in Performance metric

Algorithm                  Accuracy  Precision  Recall  F1-Score
RF                         95%       97%        96%     96%
LR                         97%       96%        96%     96%
Ensemble (RF, LR and DNN)  97.6%     97%        97%     97%

The RF model achieved an accuracy of 95%, with a precision of 96%, a recall of 96% and an F1-score of 96%. The performance of the traditional RF approach is the lowest of the three methods. The LR model obtained an accuracy of 97%, a precision of 96%, a recall of 96% and an F1-score of 96%. Its accuracy is higher than that of the RF model but lower than that of the ensemble method.

The ensemble method obtained an accuracy of 97.6%, a precision of 97%, a recall of 97% and an F1-score of 97%. It shows the highest accuracy, precision, recall and F1-score of the three techniques, so we conclude that the ensemble classifier performs better than the other two classifiers used in this study. The comparison of accuracy, precision, recall and F1-score is shown in Fig. 3.

Fig. 3. Pictorial representation of Comparison chart for Performance metric – LR, RF and ensemble method

4.3 Comparison of Results
The accuracy of our ensemble method (RF, LR and DNN) is compared with results reported for other machine learning techniques: logistic regression, Support Vector Machine and fuzzy-based logistic regression. Liu [6] predicted breast cancer using logistic regression and obtained an accuracy of 90.48%. Tapak et al. [5] used a Support Vector Machine and obtained an accuracy of 93.00%. Nandagopal et al. [3] used fuzzy-based logistic regression and obtained an accuracy of 94.05%. The accuracy of these earlier studies therefore ranges from 90% to 94%. To improve prediction accuracy, we implemented an ensemble method that combines logistic regression, random forest and a deep neural network, which obtains an accuracy of 97.6%. Our proposed model thus provides a better result than the accuracies reported in the other studies. Publications relevant to machine learning methods for breast cancer prediction are listed in Table 6, and Fig. 4 compares the prediction accuracy of the proposed model with that of the other techniques. We conclude that the proposed method shows the best result compared with the results reported by the other authors.

Fig. 4. Comparison of accuracy with the proposed model and other studies

TABLE 6
Publications relevant to Machine Learning methods used for breast cancer survival prediction.

Author                       Methods                          Abbreviation  Accuracy
Liu [6]                      Logistic Regression              LR            90.48%
Tapak et al. [5]             Support Vector Machine           SVM           93.00%
Nandagopal et al. [3]        Fuzzy-based Logistic Regression  Fuzzy-LR      94.05%
Proposed Ensemble Technique  Random Forest,                   RF, LR, DNN   97.6%
                             Logistic Regression,
                             Deep Neural Network

5. CONCLUSION AND FUTURE WORK
Breast cancer is among the deadliest diseases and needs to be identified and diagnosed early. Hence, we developed a predictive model to predict breast cancer and to assist physicians in making optimal, timely and accurate decisions. In addition, we validated the predictions using supervised machine learning methods and achieved good accuracy. The model uses three different classifiers, logistic regression, random forest and a deep neural network, as an ensemble. As a result, the proposed ensemble predictive modelling system can help to enhance quality of life by predicting breast cancer at an early stage. To improve the performance of the classification techniques, further research in this field will be carried out so that more variables can be predicted. This research will help to make disease prediction and diagnostic systems more effective and reliable, leading to a better healthcare system with fewer fatalities. Furthermore, the proposed method could also be used to predict many other conditions, such as diabetes, trauma, heart disease and Alzheimer's disease, which will be a future focus of our research.

REFERENCES
[1] C. B. Johnson, M. K. Davis, A. Law, and J. Sulpher, "Shared Risk Factors for Cardiovascular Disease and Cancer: Implications for Preventive Health and Clinical Care in Oncology Patients," Can. J. Cardiol., vol. 32, no. 7, pp. 900–907, Jul. 2016.
[2] A. K. Biz, "Potential Novel Molecular Targets for Breast Cancer Diagnosis and Treatment," Karolinska Institute, 2016.
[3] V. Nandagopal, S. Geeitha, K. V. Kumar, and J. Anbarasi, "Feasible analysis of gene expression – a computational based classification for breast cancer," Measurement, vol. 140, pp. 120–125, Jul. 2019.
[4] B. K. Singh, "Determining relevant biomarkers for prediction of breast cancer using anthropometric and clinical features: A comparative investigation in machine learning paradigm," Biocybern. Biomed. Eng., vol. 39, no. 2, pp. 393–409, Apr. 2019.
[5] L. Tapak, N. Shirmohammadi-Khorram, P. Amini, B. Alafchi, O. Hamidi, and J. Poorolajal, "Prediction of survival and metastasis in breast cancer patients using machine learning classifiers," Clin. Epidemiol. Glob. Heal., Oct. 2018.
[6] L. Liu, "Research on Logistic Regression Algorithm of Breast Cancer Diagnose Data by Machine Learning," in 2018 International Conference on Robots & Intelligent System (ICRIS), 2018, pp. 157–160.
[7] S. K. Jain and B. Bhaumik, "An Energy Efficient ECG Signal Processor Detecting Cardiovascular Diseases on Smartphone," IEEE Trans. Biomed. Circuits Syst., vol. 11, no. 2, pp. 314–323, Apr. 2017.
[8] L. R. Marchand and J. A. Stewart, "Breast Cancer," in Integrative Medicine, Elsevier, 2018, pp. 772-784.e7.
[9] J. Müller-Nordhorn and S. N. Willich, "Coronary Heart Disease," in International Encyclopedia of Public Health, Elsevier, 2017, pp. 159–167.
[10] D. W. H. Wolberg, "Breast Cancer Wisconsin (Diagnostic) Data Set," 2019.
[11] L. Liu, "Research on Logistic Regression Algorithm of Breast Cancer Diagnose Data by Machine Learning," in 2018 International Conference on Robots & Intelligent System (ICRIS), 2018, pp. 157–160.
[12] M. Peker, A. Arslan, B. Sen, F. V. Celebi, and A. But, "A novel hybrid method for determining the depth of anesthesia level: Combining ReliefF feature selection and random forest algorithm (ReliefF+RF)," in 2015 International Symposium on Innovations in Intelligent SysTems and Applications (INISTA), 2015, pp. 1–8.