Correlation Analysis to Identify the Effective Data in Machine Learning: Prediction of Depressive Disorder and Emotion States

International Journal of Environmental Research and Public Health Article Correlation Analysis to Identify the Effective Data in Machine Learning: Prediction of Depressive Disorder and Emotion States Sunil Kumar and Ilyoung Chong * Department of Information and Communications Engineering, Hankuk University of Foreign Studies, Seoul 02450, Korea; [email protected] * Correspondence: [email protected]; Tel.: +82-10-3305-5904 Received: 31 October 2018; Accepted: 14 December 2018; Published: 19 December 2018 Abstract: Correlation analysis is an extensively used technique that identifies interesting relationships in data. These relationships help us realize the relevance of attributes with respect to the target class to be predicted. This study has exploited correlation analysis and machine learning-based approaches to identify relevant attributes in the dataset which have a significant impact on classifying a patient’s mental health status. For mental health situations, correlation analysis has been performed in Weka, which involves a dataset of depressive disorder symptoms and situations based on weather conditions, as well as emotion classification based on physiological sensor readings. Pearson’s product moment correlation and other different classification algorithms have been utilized for this analysis. The results show interesting correlations in weather attributes for bipolar patients, as well as in features extracted from physiological data for emotional states. Keywords: correlation analysis; health care; machine learning; data analytics 1. Introduction In order to analyze the causality in a dataset, various techniques can be used. One such technique in the domain of data analysis is frequent pattern mining. In this case, two types of analysis are widely used: identification of association rules and correlation analysis. In association rule mining, frequent pattern analysis is performed using the support count and confidence measures. An association rule will be considered strong if its support count and confidence are above some defined threshold. However, the identified rules are sometimes irrelevant and can often be misleading. This limitation can be handled by correlation analysis, which is used to determine the strength of a relationship between two item sets [1]. In this case, the strength can be identified based on three criteria: direction, form, and dispersion strength, as shown in Figure1. The limitation of this analysis is that in the case of two independent variables, it cannot determine the causality between the two. That is, we will not have any idea which variable is affecting and which one is getting affected. Another situation that can cause distort the correlation and association results is the confounding of another attribute [2]. If this attribute is not considered in the dataset, then its effect may not be highlighted in the experimentation and thus leading to spurious analyzed correlations. In this regard, determining predictor attributes plays a very critical role, as we try to include as many relevant attributes as possible, to avoid such circumstances. Correlation analysis is a widely used statistical measure through which different studies have efficiently identified interesting collinear relations among different attributes of datasets. This study has employed correlation analysis to identify such attributes which strongly affect depressive disorder severity and emotional states. Int. J. Environ. Res. Public Health 2018, 15, 2907; doi:10.3390/ijerph15122907 www.mdpi.com/journal/ijerph Int. J. Environ. Res. Public Health 2018, 15, 2907 2 of 24 Int. J. Environ. Res. Public Health 2018, 15, x 2 of 25 Figure 1. Correlation based on direction, form, and dispersion strength. There are are several several fac factorstors that that have have been been found found leading leading to a higher to a highernumber number of depressive of depressive disorder disordercases in casesSouth in Korea South Korea[3–5], [including3–5], including work work pressure pressure and and overtime, overtime, loneliness, loneliness, social social relations, relations, socioeconomicsocioeconomic status, etc. However, the effect of weather conditions, which may have a considerable impact on on depressive depressive disorder, disorder, has has often often been been ove overlookedrlooked in this in this region. region. With With respect respect to emotion to emotion data dataanalysis, analysis, the themost most commonly commonly-used-used methodologies methodologies in inpractic practicalal environments environments and and research approaches involveinvolve emotion emotion detection detection through through facial facial expressions, expressions, physiological physiological sensor sensor signals, signals, speech recognitionspeech recognition and analysis, and analysis, and text and analysis text analysis [6]. The [6] former. The former two techniques two techniques are more are commonlymore commonly used, becauseused, because facial expressionsfacial expressions and physiological and physiological responses responses are more are closely more related closely to related one’s emotional to one’s reactions,emotional whereas reactions limited, whereas information limited information can be extracted can be regarding extracted emotional regarding reactions emotional in the reactions latter two in techniquesthe latter two [7,8 techniques]. Besides, in[7, the8]. Besides, latter two in approaches, the latter two the approaches, responses of the emotional responses reactions of emotional can be manipulated,reactions can be hence manip resultingulated, inhe morence resulting noisy data in more and obstructingnoisy data and analysis obstructing [9,10]. Thisanalysis can [9,10] also be. This the casecan also with be facial the expressions, case with facial where expressions we have a lot, where of noisy we data, have as a compared lot of noisy to physiological data, as compared responses. to Inphysiological this work, we responses have considered. In this work, the first we twohave techniques considered for the identifying first two techniques a patient’s emotionalfor identifying state. a patient’sThis emotional work considers state. aspects of healthcare service provisioning using data analytics. Therefore, differentThis challenges work considers are addressed aspects andof healthcare overcome service in this study.provisioning One of theusing essential data analytics. problems Therefore, addressed bydifferent this work challenges is the processare addressed of identifying and overcome information in this that study. has a strongOne of impact the essential on depression problems for specifiedaddressed patients. by this Consideringwork is the differentprocess of studies identifying that have information identified that different has a types strong of depressiveimpact on disorderdepression situations, for specified each patients. patient Considering shows different different behavioral studies that trends have based identified on his different or her types specific of depressive disorder situations, type and itseach symptoms. patient sho Asws there different are differentbehavioral data trends sources based through on his whichor her wespecific can analyzedepressive behavioral disorder trends type (facialand its expressions, symptoms. physiologicalAs there are different sensor signals, data sources speech analysis,through textwhich analysis, we can self-reporting analyze behavioral questionnaires, trends (facial etc.), thereexpressions, is a strong physiological need for mechanisms sensor signals, that identifyspeech suchanalysis, parameters text analysis, with strongself-reporting relationships questionnaires, specific to etc.), certain there depressive is a strong disorder need for patients. mechanisms Another that challengeidentify such is identifying parameters prominent with strong and relationships persistent relationships specific to overcertain longer depressive periods disorder of time. patients. As there canAnother be different challenge levels is identifying of uncertainty prominent in the patternsand pers ofistent acquired relationships data (such over as longer high and periods low levels of time. of moodAs there in bipolarcan be different disorder), levels longer of periods uncertainty of data in arethe requiredpatterns toof analyzeacquired the data trends. (such A as good high prediction and low modellevels of requires mood in enough bipolar data disorder), readings longer to acquire periods sufficiently of data are steady required trends to analyze to have the reliable trends. prediction A good results.prediction Further, model a systemrequires is enough required data for on-demandreadings to acquire data acquisition sufficiently from steady different trends data to sources have reliable based onprediction the specific results. user Further, context. a Based system on is the required above discussion,for on-demand this data study acquisition aims to achieve from different the following data objectives:sources based on the specific user context. Based on the above discussion, this study aims to achieve the following objectives: • Analyzing the impact of weather on two types of depressive disorder; Bipolar and • MelancholiaAnalyzing the disorder. impact of weather on two types of depressive disorder; Bipolar and Melancholia • Analyzingdisorder. the impact of physiological data on emotions, as well as identifying patient’s health • status,Analyzing using the machine impact learningof physiological with strong data correlated on emotions,

Correlation Analysis to Identify the Effective Data in Machine Learning: Prediction of Depressive Disorder and Emotion States

2016 Gao, P. and Zhang, L., Determining Spurious Correlation

A COMPARISON of CAUSALITY TESTS APPLIED to the BILATERAL RELATIONSHIP BETWEEN CONSUMPTION and GDP in the USA and MEXICO GUISAN, M.Carmen*

Revisiting the Effect of Food Aid on Conflict

2010-06 Spurious Long-Horizon Regression in Econometrics* Antonio E

Spurious Regression and Econometric Trends1

Spurious Regressions

Cointegration Versus Spurious Regression in Heterogeneous Panels

Is the Phillips Curve Dead? a Vector Error Correction Model

An Error Correction∗

The Moderating Effect of Human Resource Management on the Relationship Between Ethical Leader and Employee Trust

1 Spurious Regressions with Time-Series Data

In This Chapter, We Cover Some More Advanced Topics in Time Series Econometrics. In