Modeling and Management Big Data in Databases—A Systematic Literature Review
Total Page:16
File Type:pdf, Size:1020Kb
sustainability Review Modeling and Management Big Data in Databases—A Systematic Literature Review Diana Martinez-Mosquera 1,* , Rosa Navarrete 1 and Sergio Lujan-Mora 2 1 Department of Informatics and Computer Science, Escuela Politécnica Nacional, 170525 Quito, Ecuador; [email protected] 2 Department of Software and Computing Systems, University of Alicante, 03690 Alicante, Spain; [email protected] * Correspondence: [email protected]; Tel.: +593-02-2976300 Received: 25 November 2019; Accepted: 9 January 2020; Published: 15 January 2020 Abstract: The work presented in this paper is motivated by the acknowledgement that a complete and updated systematic literature review (SLR) that consolidates all the research efforts for Big Data modeling and management is missing. This study answers three research questions. The first question is how the number of published papers about Big Data modeling and management has evolved over time. The second question is whether the research is focused on semi-structured and/or unstructured data and what techniques are applied. Finally, the third question determines what trends and gaps exist according to three key concepts: the data source, the modeling and the database. As result, 36 studies, collected from the most important scientific digital libraries and covering the period between 2010 and 2019, were deemed relevant. Moreover, we present a complete bibliometric analysis in order to provide detailed information about the authors and the publication data in a single document. This SLR reveal very interesting facts. For instance, Entity Relationship and document-oriented are the most researched models at the conceptual and logical abstraction level respectively and MongoDB is the most frequent implementation at the physical. Furthermore, 2.78% studies have proposed approaches oriented to hybrid databases with a real case for structured, semi-structured and unstructured data. Keywords: big data; management; modeling; literature review 1. Introduction The Big Data modeling term became widespread in 2011, as is visible in Figure1. This figure shows searches in Google Trends related to Big Data modeling, which are intensified from 2011 onwards. Searches before 2004 are not presented, since Google Trends does not store earlier data. In recent years, researchers have consolidated their efforts to study new paradigms to deal with Big Data. Thus, novel Big Data modeling and management in databases approaches have emerged, in line with the new requirements. In consequence, new techniques in the database context have evolved towards Not Only SQL (NoSQL). The work presented in this paper is motivated by the acknowledgement that a complete systematic literature review (SLR) that consolidates all the research efforts for Big Data modeling and management in databases is missing. An SLR is the best way to collect, summarize and evaluate all scientific evidence about a topic [1]. It allows for the description of research areas shown the greatest and least interest by researchers. Considering the exposed issues, the SLR conducted in this work can contribute to solving this lack by collecting and analyzing details about the research published from 2010 to 2019. As a basis for our SLR, we adhered to the guidelines proposed by Kitchenham [1]. Moreover, this paper presents a complete bibliometric analysis and summarizes existing evidence about research Sustainability 2020, 12, 634; doi:10.3390/su12020634 www.mdpi.com/journal/sustainability Sustainability 2020, 12, 634 2 of 41 Sustainability 2019, 11, x FOR PEER REVIEW 2 of 44 on Big Data modeling and management in databases. With the information attained from the analysis performed,from the analysis we will performed, identify trends we wi andll identify gaps in thetrends published and gaps research in the topublished provide aresearch background to provide for new a research.background As result,for new 1376 research. papers As were result, obtained 1376 frompapers scientific were obtained libraries from and 36 scientific studies werelibraries selected and 36 as relevant.studies were All theselected research as relevant. efforts were All mappedthe research in order efforts to respondwere mapped the three in order research to respond questions the defined three inresearch this research. questions Our defined main goal in this is toresearch. consolidate Our main the main goal works is to consolidate to provide the an awarenessmain works of to the provide trends andan awareness the gaps relatedof the trends to Big Dataand the modeling. gaps related to Big Data modeling. The remainder ofof this this paper paper is is organized organized as as follows. follows. First First is theis the Introduction Introduction section, section, containing containing the meaningthe meaning of the of dithefferent different terms terms discussed discussed in this in study. this study. Second, Second, a Method a Method section section presents presents the process the usedprocess to performused to theperform planning, the planning, conducting conducting and reporting and of reporting the SLR. Theof the planning SLR. The phase planning describes phase the identificationdescribes the identification of the need for of a the SLR need study for and a SL theR study development and the ofdevelopment a review protocol, of a review objectives protocol, and justification,objectives and research justification, questions research and strategy. questions The conductingand strategy. phase The presents conducti theng inclusion phase andpresents selection the criteriainclusion for and the selection final corpus criteria of selected for the studies.final corpus of selected studies. Third, thethe Results Results section section answers answers the researchthe research questions questions in three in subsections. three subsections. The first subsection, The first thesubsection, Bibliometric the Bibliometric Analysis, comprises Analysis, comprises the reporting the reporting stage, including stage, including authors’ authors’ information, information, such as asuchffiliation as affiliation and country and andcountry relevant and data relevant about data their about works; their for instance,works; for publication instance, information,publication numberinformation, of citations, number funding of citations, source, funding year, digital source, library, year, impact digital factor, library, ranking. impact The factor, second ranking. subsection, The thesecond Systematic subsection, Literature the Systematic Review, presents Literature a mapping Review, ofpresents the selected a mapping studies of according the selected to three studies key conceptsaccording in to athree concept key matrixconcepts regarding in a concept to the matr datasetix regarding source, to modeling the dataset and source, database. modeling The third and subsection,database. The the third Discussion, subsection, highlights the Discussion, relevant findingshighlights in relevant the SLR findings study in in order the SLR to identify study in existing order trendsto identify and existing gaps. Finally, trends conclusions and gaps. Finally and future, conclusions works are and presented. future works are presented. Figure 1. TrendTrend of use of term Big Data [[2].2]. 1.1. Big Data Concepts In this part, we describe the main concepts related to Big Data, in order to provide a general overview for the reader and a backgroundbackground ofof thethe termsterms discusseddiscussed later.later. 1.1.1. A Brief History of Big Data 1.1.1. A Brief History of Big Data The production and processing of large volumes of data began to be of interest to researchers many The production and processing of large volumes of data began to be of interest to researchers years ago. By 1944, estimations for the size of libraries, which increased rapidly every year, were made many years ago. By 1944, estimations for the size of libraries, which increased rapidly every year, in American universities [3]. In 1997, at the Institute of Electrical and Electronics Engineers (IEEE) were made in American universities [3]. In 1997, at the Institute of Electrical and Electronics Engineers Conference on Visualization, the term “Big Data” was used for the first time during the presentation of (IEEE) Conference on Visualization, the term “Big Data” was used for the first time during the a study about large datasets’ visualization [4]. presentation of a study about large datasets’ visualization [4]. Big Data is the buzzword of recent years, that is, a fashionable expression in information systems. Big Data is the buzzword of recent years, that is, a fashionable expression in information The general population relates the term Big Data to its literal meaning of large volumes of data. systems. The general population relates the term Big Data to its literal meaning of large volumes of However, Big Data is a generic term used to refer to large and complex datasets that arise from the data. However, Big Data is a generic term used to refer to large and complex datasets that arise from combination of famous Big Data V’s that characterize it [5]. the combination of famous Big Data V’s that characterize it [5]. 1.1.2. Big Data Characterization 1.1.2. Big Data Characterization As mentioned before, Big Data does not refer only to high volumes of data to be processed. At the As mentioned before, Big Data does not refer only to high volumes of data to be processed. At beginning of the Big Data studies, their volume, velocity and variety were considered as fundamental the beginning of