Deepening Well-Being Evaluation with Different Data Sources: a Bayesian Networks Approach
Total Page:16
File Type:pdf, Size:1020Kb
International Journal of Environmental Research and Public Health Article Deepening Well-Being Evaluation with Different Data Sources: A Bayesian Networks Approach Federica Cugnata 1 , Silvia Salini 2,* and Elena Siletti 3 1 University Centre of Statistics for Biomedical Sciences (CUSSB), Vita-Salute San Raffaele University, 20132 Milano, Italy; [email protected] 2 Department of Economics, Management and Quantitative Methods, Università degli Studi di Milano, 20122 Milano, Italy 3 Department of Economic and Political Sciences, Università della Valle d’Aosta, 11020 Saint-Christophe, Italy; [email protected] * Correspondence: [email protected] Abstract: In this paper, we focus on a Bayesian network s approach to combine traditional survey and social network data and official statistics to evaluate well-being. Bayesian networks permit the use of data with different geographical levels (provincial and regional) and time frequencies (daily, quarterly, and annual). The aim of this study was twofold: to describe the relationship between survey and social network data and to investigate the link between social network data and official statistics. Particularly, we focused on whether the big data anticipate the information provided by the official statistics. The applications, referring to Italy from 2012 to 2017, were performed using ISTAT’s survey data, some variables related to the considered time period or geographical levels, a composite index of well-being obtained by Twitter data, and official statistics that summarize the labor market. Keywords: Bayesian networks; big data, well-being; life satisfaction; sentiment analysis (list three to ten) Citation: Cugnata, F.; Salini, S.; Siletti, E. Deepening Well-Being Evaluation with Different Data 1. Introduction and Background Sources: A Bayesian Networks In recent decades, since the Stiglitz Commission’s suggestions advising to build a com- Approach. Int. J. Environ. Res. Public plementary system focused on social well-being that is suitable for measuring sustainability Health 2021, 18, 8110. https:// and that considers subjective assessment [1], different new measures of well-being have doi.org/10.3390/ijerph18158110 been proposed. In these new indices, the subjective dimensions are traditionally investi- gated using ad hoc surveys, but these data are not free of issues. Some issues are related to Received: 28 June 2021 the survey structure or plan, which, despite all efforts [2], still create some methodological Accepted: 28 July 2021 drawbacks [3,4]; other issues are related to the poor geographical disaggregation or low Published: 30 July 2021 time frequency of the data. An objective evaluation of the currently proposed indices demonstrates a limited Publisher’s Note: MDPI stays neutral and undersized presence of data in the the subjective and perceived dimension. Since with regard to jurisdictional claims in 2012, with the aim of finding complementary information to fill this information gap, published maps and institutional affil- a new subjective and perceived Italian well-being index from social networks data has iations. been proposed (Subjective Well-Being Index (SWBI) [5]). This is a composite index that uses the same framework considered by the New Economic Foundation for its Happy Planet Index [6], but it is the result of a human-supervised sentiment analysis (integrated sentiment analysis (iSA) [7]) of Twitter data. Recently, suggested using the SWBI to provide Copyright: © 2021 by the authors. information on subjective well-being at Italian sub-national levels and at different moments Licensee MDPI, Basel, Switzerland. in time [8]. This article is an open access article Despite social network data being defined as the largest available focus group in the distributed under the terms and world [9,10], as they cover several topics, are continuously updated, and are free or cheap, conditions of the Creative Commons these data also have disadvantages. Even if social media users are always increasing in Attribution (CC BY) license (https:// number (from http://wearesocial.com 27 January 2021: from January 2019 to January 2020 creativecommons.org/licenses/by/ the world growth was +7% in Internet access (59% of people in the world had Internet 4.0/). Int. J. Environ. Res. Public Health 2021, 18, 8110. https://doi.org/10.3390/ijerph18158110 https://www.mdpi.com/journal/ijerph Int. J. Environ. Res. Public Health 2021, 18, 8110 2 of 10 access) and +9.2% for active social media accounts (with a penetration of 49%)), not all are users; hence, one of the main issues with this kind of information is sampling bias. To enable the use of such a rich source of information, scholars are still working on the solution to these issues. Notably, suggested a new Italian well-being measure combining official statistics with Twitter data using a weighting procedure combined with a small area estimation (SAE) model to precisely consider sampling bias [11]. For a detailed description of all the well-being index in Italy, please refer to the work of [12]. This paper focuses on the Italian scenario due to the central role given to this topic by the Italian Parliament, which introduces equitable and sustainable well-being among the objectives of the government’s economic and social policy. The authors provided a detailed outline of the well-being indices useful for different scholars and practitioners, with the awareness that, for a good analysis, a complete and conscious description of disposable data is the starting point to further improve their usefulness, maximize their advantages, and reduce their limitations. In this study, using a a Bayesian network (BN) approach, we combined social net- works data, which are characterized by high frequency and geographical disaggregation, traditional survey data, and official statistics to evaluate well-being. Adopting this ap- proach, both categorical and continuous data can be used with different geographical area levels and time frequencies. The aims of this study were twofold: (1) to describe the relationship between survey and social network data; (2) to investigate social network data and official statistics. We focused on the forecasting power of the social media information. All analyses were performed using R version 4.0.4 (R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL: https://www.R-project.org/, downloaded 30 June 2021). Section 2.1 reports a brief presentation of BNs. Section 2.2 describes the data. Section3 provides a discussion of the results. 2. Materials and Methods 2.1. Bayesian Networks: A Short Refresher A Bayesian network (BN) is a probabilistic graph model that represents a set of stochas- tic variables with their conditional dependencies through the use of a direct acyclic graph (DAG). For example, a Bayesian network could represent the probabilistic relationship existing between symptoms and diseases. Given the symptoms, the network can be used to calculate the probability of the presence of different diseases. Formally, Bayesian networks are direct acyclic graphs whose nodes represent random variables in the Bayesian sense: they can be observable quantities, latent variables, un- known parameters, or hypotheses. The arcs represent conditions of dependence; the nodes that are not connected represent variables that are conditionally independent of each other. Each node is associated with a probability function which takes as input a particular set of values for the variables of the parent node and returns the probability of the variable represented by the node. For example, if the parents of the node are Boolean variables, then the probability function can be represented by a table in which each entry represents a possible combination of true or false values that its parents can assume. There are efficient algorithms that perform inference and learning starting from Bayesian networks. BNs are both mathematically rigorous and intuitively understandable data analytic tools. They implement a graphical model structure that is popular in statistics, machine learning, and artificial intelligence. They enable the effective representation and computa- tion of a joint probability distribution (JPD) over a set of random variables. The DAG’s structure is defined by a set of nodes, representing random variables and plotted by labeled circles, and a set of arcs, representing direct dependencies among the variables and plotted by arrows. Thus, an arrow from Xi to Xj indicates that a value taken by variable Xj depends on the value taken by variable Xi. Node Xi is referred to as a “parent” of Xj. Similarly, Xj is referred to as a “child” of Xi. An extension of these genealogical terms is often used to define sets of “descendants”, i.e., the set of nodes from Int. J. Environ. Res. Public Health 2021, 18, 8110 3 of 10 which the node can be reached on a direct path. The DAG guarantees that there is no node that can be its own ancestor (parent) or its own descendent. This condition is of vital importance to the factorization of the joint probability of a collection of nodes. Although the arrows represent direct causal connections between the variables, under the causal Markov condition, the reasoning process can operate on a BN by propagating information in any direction. A BN reflects a simple conditional independence statement, namely that each variable, given the state of its parents, is independent of its non-descendants in the graph. This property is used