
Article

A Scalable and Semantic Data Marketplace for Enhancing Cloud-Based Applications

Evangelos Psomakelis 1,2,* , Anastasios Nikolakopoulos 1, Achilleas Marinakis 1 , Alexandros Psychas 1, Vrettos Moulos 1 , Theodora Varvarigou 1 and Andreas Christou 1

1 School of Electrical and Computer Engineering, National Technical University of Athens, 15780 Athens, Greece; [email protected] (A.N.); [email protected] (A.M.); [email protected] (A.P.); [email protected] (V.M.); [email protected] (T.V.); [email protected] (A.C.) 2 Department of Informatics and Telematics, Harokopio University of Athens, 17778 Athens, Greece * Correspondence: [email protected]

Received: 31 March 2020; Accepted: 23 April 2020; Published: 25 April 2020

Abstract: Data handling and provisioning play a dominant role in the structure of modern cloud–fog-based architectures. Without a strict, fast, and deterministic method of exchanging data we cannot be sure about the performance and efficiency of transactions and applications. In the present work we propose an architecture for a Data as a Service (DaaS) Marketplace, hosted exclusively in a cloud environment. The architecture includes a storage management engine that ensures the Quality of Service (QoS) requirements, a monitoring component that enables real-time decisions about the resources used, and a resolution engine that provides semantic data discovery and ranking based on user queries. We show that the proposed system outperforms the classic ElasticSearch queries in data discovery use cases, providing more accurate results. Furthermore, the semantic enhancement of the process adds extra results which extend the user query with a more abstract definition of each notion. Finally, we show that the real-time scaling provided by the data storage manager component helps to meet QoS requirements by decreasing the latency of the read and write data requests.

Keywords: cloud; fog; mongodb; daas; data as a service; performance analysis; qos ensurance; content discovery; qos monitoring

Future Internet 2020, 12, 77; doi:10.3390/fi12050077; www.mdpi.com/journal/futureinternet

1. Introduction

1.1. General Concepts

One of the most-used expressions regarding data-driven IT is “data are the new oil”. Given the gravity of this statement, many concepts and technologies have arisen in order to address the needs of this new era. More specifically, this paper revolves around two main concepts: Data as a Service (DaaS) and the Data Marketplace. In order to fully exploit the data produced by any given sector, there is a need to streamline the process of collection, storage, access, utilization and, finally, delivery to the end user. DaaS aims at creating a service-based model that will cover the complete lifecycle of the data through distributed interconnected services that address the requirements of each aforementioned step of the streamlined process.

To better understand the involvement of DaaS in the lifecycle of data, it is important to mention the main characteristics of a DaaS solution. A DaaS solution must be able to provide data and deliver them at different levels of granularity depending on the end users’ needs. Given the heterogeneity of the data delivered, even from a single data producer, it is necessary to be able to handle different data sources with different types of data in terms of content, schema, and type of delivery (e.g., streaming, batch). Network heterogeneity, as well as the addition and removal of data sources, must also be taken into account in this type of solution. Last but not least, it is necessary to have a monitoring and evaluation mechanism in order to maintain the expected Quality of Service (QoS) and Quality of Data (QoD).

Although the DaaS model covers the requirements of the data lifecycle, data still need to be discoverable and presented to the end user in a uniform way. The Data Marketplace concept aims at creating a marketplace in which users are able to discover data with the desired content and quality. In order to create a Data Marketplace, the most important step is to describe and deliver the data as a product. Converting data into a product is a procedure that is well defined by the DaaS model, as described in the previous paragraph. In order to sell data as a product, specific metrics must be derived from them. Potential buyers need to know the volume, cost, value, quality, and other important metrics that evaluate the data [1].

Although there is significant work on the conversion of data into a product in the context of Data Marketplaces (also described in Section 2), in many sectors (critical infrastructure, edge, cloud, and fog) the data are evolving in a direction where static data lose their value. These types of sectors require a service-based model for data delivery in order to explore the huge amount of data that is created and processed in real time [2]. This new era of data has created the need for Data Marketplaces that not only deliver data as a product but, most importantly, create DaaS products [1]. DaaS product creation requires not only the definition of data as a product but also the definition of the service that delivers the data as a product.
The most important characteristic of the service that needs to be identified and described in order to achieve this goal is the QoS of the service that delivers the data. As far as the DaaS Marketplace is concerned, the production and delivery of QoD metrics are firmly established. One of the less evolved procedures is the QoS part of the product creation. This paper aims at describing and establishing specific software components in order to enable the delivery of DaaS products. The enhancements proposed in the context of the DaaS Marketplace are based on three main needs identified for these types of marketplace:

• Content discovery: A semantically enhanced content discovery component that helps the data buyer to find the most suitable data as far as the content is concerned. It also helps the data provider to better describe their data in order to make them more easily discoverable.
• QoS evaluation: An assessment tool that produces analytics and QoS metrics for the services that provide the data, in order to help the data buyer to assess not only the QoD but also the QoS with which the data are delivered.
• DaaS repository scalability: A scalability component that enables the dynamic scaling of the DaaS Marketplace repository in order to ensure business continuity and fault tolerance.

The remainder of this document is organized as follows: the remainder of the present section presents the current literature and related work on the domains we are exploring. Section 2 offers the general architecture of the proposed system, as well as descriptions of the components included in it. In Section 3 we present the evaluation tests, the metrics used, and the results of our evaluation. Finally, in Section 4 we present a short discussion on the methodologies and technologies used, the evaluation results, and possible future work.

1.2. Related Work

Since the beginning of the development and establishment of digital marketplaces, significant research has been undertaken into incorporating recommendation systems in order to enhance them [3]. One of the most recent research works is related to how ElasticSearch [4] can be used to improve a recommendation system [5]. In fact, ElasticSearch is a very powerful cutting-edge search engine, developed for text analysis. Given its capabilities, it has been included in many research publications that highlight the selection of ElasticSearch to evaluate and correlate semi-structured data with very rich text content, such as LinkedIn profiles [6], job market data [7], and healthcare data [8]. As far as numerical data is concerned, a very promising project has been developed, combining the usage of ElasticSearch with the well-known vector-based algorithm (K-NN), in order to produce image correlation software [9]. In addition, ElasticSearch has been used for large-scale image retrieval [10]. For these reasons we chose to implement the data discovery service of the DaaS Marketplace on top of an ElasticSearch installation.

The terms “Data Marketplace” and “Data as a Service (DaaS) Marketplace” have been around for quite some time. However, only in recent years have companies and researchers focused on these terms and on what they could mean to today’s society. The idea of a marketplace that sells data to people who want to obtain them and use them as they see fit is a concept that many scholars are studying today, and are even testing in real-life scenarios. The same applies to Marketplaces that sell services (components that contain data) to potential buyers. However, how will potential buyers know which Service from the DaaS Marketplace best fits their needs and suits their requirements?
For example, a buyer might have physical machine limitations, meaning that the computer system might have moderate specifications and therefore might not be able to properly run every service of the Marketplace. In short, the QoS of each and every one of the services available in a Marketplace should be analyzed. This concern raises a need for Marketplace assessment. In the case of Data Marketplaces, QoD should be the main concern. In the case of DaaS Marketplaces, however, QoS should come first, followed by data monitoring and evaluation.

The majority of proposals and concept implementations for Data/DaaS Marketplaces found in the literature do not specify a monitoring system. For example, proposals for Decentralized (IoT) [11] and Scientific [12] Data Marketplaces analyze their implementation, but do not cover the assessment part at all. However, we have to acknowledge that other monitoring and evaluation attempts have been proposed by a few other researchers/scholars. Some of those attempts are briefly presented below.

In a course called Advanced Services Engineering, held during the summer of 2018, the Technical University of Vienna (TU Wien) addressed some ways in which a data quality evaluation tool could be implemented for Data (and DaaS) Marketplaces [13]. More specifically, TU Wien proposed that an evaluation tool could be developed using cloud services, or by using human computation capabilities (which means that professionals and crowds can act as data concern evaluators). However, the aforementioned tool is more focused on the QoD spectrum, rather than the QoS, which means that a tool with QoS operations has yet to be found.

Another implementation is Trafiklab, an open data marketplace distributing open public transport data, linking together public transport authorities and open data users [14]. Trafiklab’s Technical Platform enables the usage of the API management system for monitoring and analyzing API performance and/or errors.
However, unlike the ideal tool that we are searching for, Trafiklab cannot extract metrics and monitoring data from the physical machine itself. It remains a suitable solution when it comes to API-only metrics, but not for a complete Data and DaaS Marketplace assessment scenario. Moreover, according to a recent study which analyses the implementation of a Data Marketplace for the Internet of Things [15], the researchers proceeded to evaluate the Marketplace, but the metrics they extracted were from the physical machine itself. This means that if the machine is also running other processes, the data would not be representative and therefore the whole evaluation would be invalid. This choice is suitable only if we decide that the physical machine on which the Marketplace is running will have all its resources “dedicated” to it.

In the field of service monitoring, there have been attempts to monitor individual services, for example, Docker Containers. After all, a Data/DaaS Marketplace would easily operate inside a Docker Container [16]. A group of researchers tried to measure the performance of Docker Containers by combining (at least) three monitoring tools. Their results were not promising. They concluded that there are no “dedicated tools that cover a wide range of performance metrics”. The tools used by the aforementioned researchers are therefore inadequate for a potential Data/DaaS Marketplace running in a Docker Container.

An excellent monitoring system for Docker containers is “CoMA” [17]. This monitoring agent aims to provide solid metrics from containers. It could be the perfect solution for most Docker clients who wish to retrieve the basic statistics from their containers. However, CoMA extracts data only for the monitoring of CPU, memory, and disk (block I/O). This means that it could not be used for a Data/DaaS Marketplace scenario, where more metrics (such as response time, jitter, etc.) are needed.
CoMA is indeed an excellent implementation of a Docker Container monitoring system, but is not a suitable solution to a containerized marketplace’s monitoring needs. There is another Docker Container monitoring implementation [18], which also extracts metrics only for the CPU and the memory. Note that this is one of the oldest implementations. This module’s main aim is to be helpful for multiple Docker providers (Kubernetes, Dockersh, Docker Swarm API etc.). However, it is not suitable for the Data/DaaS Marketplace scenario that we are describing. In our case we need more metrics than this module is able to provide, such as response time, lag, latencies, network requests, and other server-side metrics, for possible server-side implementations. Although ElasticSearch as a whole, or as a combination of its component tools, has not previously been used for DaaS monitoring, there has been an implementation in the fields of IaaS and scientific application monitoring [19]. Why is the “Elastic Stack” (as ElasticSearch and its extra tools are usually called) a good solution for monitoring? It is rather simple. ElasticSearch can run on remote physical machines and is easy to deploy, extremely lightweight, compatible with a variety of operating systems, and free, to name a few of its advantages. Therefore, using the Metricbeat module to constantly collect metrics and send them to a main ElasticSearch database (and perhaps filter them through Logstash or visualize them through Kibana) is one of the simplest and most efficient monitoring solutions. Nevertheless, we cannot just use the Elastic Stack because, as we mentioned, it does not cover all our needs. Thus, it is a good basis for our monitoring solution but it was necessary to build upon it. In conclusion, there have been a plethora of attempts to implement a monitoring system for Data and DaaS Marketplaces, but none offers a complete solution. 
A number of researchers and scholars have mentioned the need for a powerful monitoring system, but there is none that can extract sufficient valuable metrics from both the Marketplace (the container on which it is operating) and its physical machine for the developers to produce analytics and correlation data. As a result, one of this paper’s proposed solutions addresses a complete monitoring scenario that covers the needs of a containerized Data/DaaS Marketplace.

Before the development of any software in the present era, the need arises to plan how the relevant data will be stored, retrieved, and protected. This need is fundamental in the design of the software itself because each software application is currently created in order to somehow use or produce data, which is also the core functionality of a DaaS Marketplace. As stated by Google’s Eric Schmidt, “Every two days now, we create as much information as we did from the dawn of civilization up until 2003” [20]; thus, it is only natural that an increasing number of software applications are being created in order to discover, use, and/or sell this data.

When talking about data storage systems we have two main families: the traditional, ACID databases like SQL, and the newer NoSQL datastores, which are more chaotic and mostly free of the strict ACID rules. Both of these families have their strengths and weaknesses that will not be presented here as this is outside of the scope of this work. Instead, we will mention that according to the literature, for applications using Big Data, SQL databases are usually inadequate because they cannot handle the volume of data being written and read at any given time. In these cases, NoSQL datastores are often the only viable solution [21,22]. NoSQL datastores in turn are separated into four categories based on the data representation they are using. These categories are described in the following table (Table 1).

Table 1. NoSQL datastore families, examples and short descriptions [21,23].

| Data Format | Examples | Description |
| Key-Value | Riak, Redis, Couchbase | These databases use a key-value data format in order to create primary keys, which improve their performance when the single primary key architecture is possible. |
| Document | MongoDB, CouchDB, Terrastore | These databases provide the advantage of storing all information concerning an object in a single document, using nested documents and arrays. This improves the performance, minimizing the queries needed to retrieve all requested data and removing the need to join tables. It also provides the ability to create indexes based on any datafield in a document, not only primary keys. |
| Column family | HBase, Cassandra, Amazon DynamoDB | These systems initially resemble an SQL datastore, having columns and rows, but they provide the ability of inserting unstructured data. This is possible because instead of using regular columns, they use column families that are grouped together in order to form a traditional table. |
| Graph | Neo4J, FlockDB, InfiniteGraph | These datastores are based on the relations between each datapoint, which is called a node. These nodes and the relations between them form clusters of data in the form of graphs. The datastores perform well when the relation between the data points is more important than or at least as important as that of the actual data contained in the nodes. Their intolerance to horizontal partitioning makes them hard to use in data intensive production environments. |
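The Document row of Table 1 can be illustrated with a minimal sketch in plain Python. The document below is a hypothetical, heavily simplified stand-in for a stored record (all field names and values are assumptions made for illustration, not the actual Blueprint schema), and the relational layout is likewise invented for comparison. The point is only that a nested document answers in one lookup what a normalized layout answers with a join.

```python
import json

# Hypothetical, simplified record stored as ONE nested document
# (field names are illustrative assumptions, not the real schema).
blueprint_doc = {
    "_id": "vdc-001",
    "name": "ehealth-lab-results",
    "tags": ["Blood", "Urine"],
    "qos": {"availability": 0.99, "throughput_rps": 120},
}

# The same information in a normalized, relational-style layout:
# the tags live in a separate "table" keyed by the parent id.
blueprint_rows = [{"id": "vdc-001", "name": "ehealth-lab-results"}]
tag_rows = [
    {"blueprint_id": "vdc-001", "tag": "Blood"},
    {"blueprint_id": "vdc-001", "tag": "Urine"},
]


def tags_via_join(blueprint_id):
    """Emulate a join: scan the tag table for rows of the given parent."""
    return [row["tag"] for row in tag_rows if row["blueprint_id"] == blueprint_id]


def tags_from_document(doc):
    """With an embedded array, the tags come with the document itself."""
    return doc["tags"]


if __name__ == "__main__":
    # The nested document serializes directly to JSON.
    print(json.dumps(blueprint_doc, indent=2))
    print(tags_via_join("vdc-001"), tags_from_document(blueprint_doc))
```

Both functions return the same tags; the difference lies in how many collections must be consulted to obtain them.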

There are some clear advantages to NoSQL datastores, such as their inherent elastic scaling capabilities and ease of handling big data, the economics of their scaling out feature, and their flexible data models that can evolve over time as the application itself evolves [21]. NoSQL datastores do not need a pre-defined data model or a serious demand study. Both the data structure and the cluster architecture can be changed in real time or in semi-real time. Thus, if the application suddenly needs a different JSON schema to cover the clients’ needs, new data can simply be inserted into the datastore, without concerns about downtime due to long-running alter table queries. In addition, if the volume of users suddenly increases, more small, cheap machines can be added to the cluster to handle the extra volume. The traditional SQL datastores would require investment of a significant amount of money in order to buy stronger servers or scale up the existing data servers.

Our use case is a platform that provides Data as a Service, using Virtual Data Containers (VDCs). These containers are described in detail using a JSON document, called the Blueprint. More information about the VDC and the Blueprint can be found in Section 2.2.1. The fact that JSON is the dominant data format in the project and that the VDCs can be saved in binary format as part of BSON led us to choose the document datastore family, and specifically the MongoDB system. MongoDB has some clear advantages in data intensive applications that are clearly described by Jose and Abraham [21].
These advantages include the sharding functionality, which enables scaling out the database during runtime while retaining data availability and integrity, the high performance due to the embedded document structure, the high availability provided by the replica sets architecture of the cluster, and the flexibility of not having to pre-define a data schema, which is instead created dynamically following the needs of the application.

Docker is a tool providing virtualization technology that creates lightweight, programmable containers, based on pre-created images [24]. Combining the fast and efficient virtualization technology that Docker provides with the sharding capabilities of MongoDB, we can automate the scale out and scale in functions of the MongoDB cluster during runtime, according to the real-time performance data of the cluster. Docker has been used in order to automate the setup of MongoDB and Cassandra clusters [25,26] and to provide a micro-service architecture [27], but to our knowledge it has not been used to combine these two together to provide real-time modifications to the MongoDB cluster (creation, destruction, scaling out, and scaling in).

The cluster management schema that MongoDB provides is one of the reasons that it is so popular; it is used by multinational companies, such as MTV Network, Craigslist, SourceForge, EBay, and Square Enix, amongst many others [22]. MongoDB provides scaling out in three layers: the cluster layer, the application layer, and the data layer. The cluster layer consists of the Configuration servers, which handle the indexing and sharding capabilities of the cluster, and identify where data are located and which node to contact for each request. The application layer consists of the Mongos servers. These servers provide the access points of the cluster, and ensure that the cluster is accessible from the outside world, receiving queries and serving responses. Finally, the data layer consists of shard servers, which hold the raw data. They are usually grouped in replica sets in order to ensure data integrity and availability, as each chunk of data is replicated once in each shard server in the same replica set. This means that if an application needs to serve a large number of clients, more Mongos servers can be added to handle the queries. Similarly, if petabytes of data need to be stored, more shard servers can be added to enlarge our data storage capacity.

2. Materials and Methods

2.1. Architectural Conceptualization

The general concept under which the software components were developed was the advancement of DaaS Marketplaces in key processes of the complete lifecycle of the services they provide. As described in Section 1 and depicted in Figure 1, there are three main pillars in which the advancements took place. Firstly, in a DaaS Marketplace it is important that the services that provide the appropriate data in terms of content that the buyer needs can be found easily and intuitively. In order to achieve this goal, a semantically enhanced content discovery component was created. This component is able, through Domain-Specific Ontologies (DSO), to map simple tags to complex queries in order to retrieve the best candidate services from an ElasticSearch repository. Content discoverability is crucial for a Data Marketplace, but in a DaaS Marketplace it is also important to be able to find the most suitable service in terms of the quality it provides. For this reason, a QoS evaluation component was introduced in order to monitor and evaluate key metrics of running instances of the services. These metrics are then analyzed and a QoS assessment is delivered to the potential buyer. At this stage, it is important to mention that the computational load created due to the complexity of the queries for retrieving candidate services, as well as the network and database load created due to the files being sent to the user in order to deploy the service on a physical machine, create fluctuations in the resources needed to operate the backend of the DaaS Marketplace. To better handle the heterogeneity of the workload, a component responsible for scaling the repository in real time, during operation, was created.
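The idea of scaling the repository according to real-time performance data can be sketched as a simple threshold rule. The text above does not specify the actual policy, so everything in this sketch is an assumption made for illustration: the `ClusterMetrics` fields, the latency thresholds, the shard limits, and the function name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClusterMetrics:
    """Hypothetical snapshot of cluster performance (illustrative fields)."""
    avg_read_latency_ms: float
    avg_write_latency_ms: float
    shard_count: int


def scaling_decision(metrics, high_ms=50.0, low_ms=10.0,
                     min_shards=1, max_shards=8):
    """Return 'scale_out', 'scale_in', or 'hold'.

    All thresholds and limits are assumed values, not taken from the paper.
    """
    # Use the worse of the two latencies as the pressure signal.
    worst = max(metrics.avg_read_latency_ms, metrics.avg_write_latency_ms)
    if worst > high_ms and metrics.shard_count < max_shards:
        return "scale_out"   # latency too high: add a shard server
    if worst < low_ms and metrics.shard_count > min_shards:
        return "scale_in"    # cluster is idle: remove a shard server
    return "hold"            # within bounds, or already at a shard limit


if __name__ == "__main__":
    print(scaling_decision(ClusterMetrics(80.0, 20.0, 2)))
    print(scaling_decision(ClusterMetrics(5.0, 4.0, 3)))
```

Keeping a gap between the two thresholds avoids oscillating between scaling out and scaling in when the latency hovers around a single cut-off value.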

Figure 1. DaaS Marketplace general architecture.


2.2. Components Description

2.2.1. Semantic Resolution Engine

The use case DaaS platform adopts the Service-Oriented Computing principles [28], one of which is visibility. This requires the publication of the description of a service in order to make it available to all its potential buyers. In this case, this is fulfilled via the so-called VDC Blueprint that is created and published by the data owners. The Virtual Data Container (VDC) provides an abstraction layer that takes care of retrieving, processing, and delivering data with the proper quality level, while in parallel putting special emphasis on performance, security, privacy, and data protection issues. Acting as middleware, it lets the data consumers simply define their requirements of the needed data, expressed as data utility, and takes the responsibility for providing those data timely, securely, and accurately, while also hiding the complexity of the underlying infrastructure [29].

Following the general definition of the VDC, its corresponding Blueprint is a JSON file that describes, in addition to the implementation details and the API that the VDC exposes, all the attributes and properties of the VDC, such as availability, throughput, and price [30]. This part is critical for the VDC to be considered as a product with specific characteristics that could be used as information metadata by the data buyers. The Blueprint consists of five distinct sections, each one of which provides different kinds of information regarding the VDC. Although a thorough analysis of those sections is outside the scope of this paper, the separation is mentioned in order to highlight that the Semantic Resolution Engine parses only the specific section of the Blueprint that refers to the content of the data that the VDC provides, thus enhancing the scalability of the overall system. Inside this Blueprint section, the data owners use descriptive tags (keywords) in order to make their data more discoverable in terms of content.
Once the Blueprints have been created and stored in the repository, the engine is responsible for finding the most appropriate Blueprints based on the content of data that the VDC delivers. This resolution process takes as input some terms as free text from the data consumers and then provides a list of candidate Blueprints that match one or more of those terms. The returned Blueprints are ranked based on their scores that reflect the level of matching against the set of user terms. For the Blueprint retrieval, the engine relies on ElasticSearch [4], which is one of the leading solutions for content-based search. However, in order to enrich the resolution results with more Blueprints that the data consumer might be interested in, the Semantic Resolution Engine exploits the class hierarchy that the ontologies provide. When the consumer provides the input terms, the engine forms the query for the ElasticSearch database, also taking into account semantic relations of those terms. Those relations are extracted from Domain-Specific Ontologies (DSO), which domain experts provide to the marketplace. For instance, in the case of the e-health domain, when a doctor is looking for data about patients’ tests, the engine also returns Blueprints that contain the Blood, Oral, Urine, Ocular, etc. tags, since those are subclasses of the superclass Test (Figure 2).
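The subclass expansion described above can be sketched as follows. This is an illustration under stated assumptions rather than the engine's actual implementation: the `SUBCLASSES` dictionary stands in for a real ontology, the `tags` field name and the boost values are hypothetical, and only the SubClass relation from the Test example is covered. The resulting dictionary follows the standard ElasticSearch `bool`/`should`/`match` query structure.

```python
# Toy stand-in for a Domain-Specific Ontology: class -> direct subclasses.
# Taken from the Test example in the text; a real DSO would be far richer.
SUBCLASSES = {"Test": ["Blood", "Oral", "Urine", "Ocular"]}


def expand_terms(terms, subclasses=SUBCLASSES,
                 exact_boost=2.0, related_boost=1.0):
    """Map each term to a weight: user terms get exact_boost, their
    subclasses get related_boost (both weights are assumed values)."""
    weighted = {}
    for term in terms:
        weighted[term] = max(weighted.get(term, 0.0), exact_boost)
        for child in subclasses.get(term, []):
            weighted[child] = max(weighted.get(child, 0.0), related_boost)
    return weighted


def build_query(terms, field="tags"):
    """Build a bool/should query body; 'tags' is a hypothetical field name."""
    should = [
        {"match": {field: {"query": term, "boost": boost}}}
        for term, boost in expand_terms(terms).items()
    ]
    # At least one of the expanded terms must match a Blueprint.
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


if __name__ == "__main__":
    import json
    print(json.dumps(build_query(["Test"]), indent=2))
```

Giving the user's own term a higher boost than the inferred subclasses keeps exact matches at the top of the ranking while still surfacing the semantically related Blueprints.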


Figure 2. Test class hierarchy. FigureFigure 2. 2. TestTest class class hierarchy. hierarchy. Although validated against medical ontologies, the engine can be used in any domain, since the Although validated validated against against medical medical ontologies, ontologies, th thee engine engine can can be beused used in any in any domain, domain, since since the corresponding DSO, which actually serves as a reference vocabulary, is the only dependence that it correspondingthe corresponding DSO, DSO, which which actually actually serves serves as asa refe a referencerence vocabulary, vocabulary, is is the the only only dependence dependence that that it it has. The high-level architecture of the Semantic Resolution Engine is presented in Figure 3: has. The The high-level high-level architecture architecture of the Semantic Resolution Engine is presented in Figure3 3::

Figure 3. Semantic Resolution Engine architecture.

The implementation of the engine aims at enhancing the core functionalities offered by ElasticSearch towards two directions: both the discovery and the ranking. It is developed on one hand to extend the search results, capturing semantic relations between different terms, and on the other to revise the ElasticSearch default scoring algorithm, used to rank those results. Regarding the discovery aspect, ElasticSearch is capable of handling synonyms during the analysis process. Furthermore, the engine takes full advantage of ontologies to also handle two more types of semantic relation.
More precisely, in case the requested term is a class, then the following relation types are considered: Equivalent (for exact or synonym terms), SuperClass/SubClass, and Sibling. In case the requested term is an individual, then the corresponding types are: same (for exact or synonym terms), property (for individuals with property relation where the requested term is the subject), and siblings (for individuals that belong to the same classes). For example, if a doctor is searching for medical data about patients from Milan, the engine returns Blueprints that provide such data not only from Milan, but also from Italy, since Milan isPartOf Italy (property relation between individuals Milan and Italy), as well as from Rome, since both of the individuals Milan and Rome belong to the same class ItalianCity (Figures 4 and 5, respectively).
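The expansion of a query term through ontology relations can be illustrated with a minimal Python sketch. The `Ontology` structure, the `expand_term` function, and the relation labels are hypothetical names introduced here for illustration; the sketch covers only the SubClass relation for classes and the same/property/siblings relations for individuals, and it is not the engine's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Toy stand-in for a Domain-Specific Ontology (hypothetical structure)."""
    subclasses: dict = field(default_factory=dict)  # class -> direct subclasses
    members: dict = field(default_factory=dict)     # class -> individuals
    properties: dict = field(default_factory=dict)  # individual -> [(property, object)]


def expand_term(term, onto):
    """Return {related_term: relation_type} for a single query term."""
    if term in onto.subclasses:
        # The term is a class: keep it and add its direct subclasses.
        related = {term: "equivalent"}
        for sub in onto.subclasses[term]:
            related.setdefault(sub, "subclass")
        return related
    # The term is treated as an individual.
    related = {term: "same"}
    # Property relations in which the term is the subject.
    for _prop, obj in onto.properties.get(term, []):
        related.setdefault(obj, "property")
    # Siblings: individuals that share a class with the term.
    for individuals in onto.members.values():
        if term in individuals:
            for other in individuals:
                related.setdefault(other, "sibling")
    return related


# Snippets mirroring the examples in the text (Test subclasses, Milan, Rome).
dso = Ontology(
    subclasses={"Test": ["Blood", "Oral", "Urine", "Ocular"]},
    members={"ItalianCity": ["Milan", "Rome"]},
    properties={"Milan": [("isPartOf", "Italy")]},
)

print(expand_term("Test", dso))
print(expand_term("Milan", dso))
```

In a fuller version, the expanded terms would be added to the ElasticSearch query so that Blueprints tagged with Italy or Rome are also retrieved for a query about Milan.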

Figure 4. Milan individual property relations.

Figure 5. ItalianCity class hierarchy.

However, the score of each Blueprint reflects the relevance of the data that it offers, according to the different semantic relation type that each one captures. Consequently, the first aforementioned Blueprint is ranked higher than the other two and the third is ranked lower, if all the other factors that affect the overall score remain the same. In particular, regarding the ranking aspect, the Semantic Resolution Engine implements an algorithm that calculates an objective score for each returned Blueprint, instead of the relevance score that ElasticSearch uses by default. The goal is to enable the development of a fair and unbiased search engine as a core part of a DaaS Marketplace. In detail, the algorithm ignores factors that are involved in the ElasticSearch relevance score calculation, such as the term frequency (TF) and the inverse document frequency (IDF), which make no sense in the context of the proposed marketplace. Regarding IDF, if many Blueprints in the repository contain a term, this does not imply that the term is less important than another term which few Blueprints include. Thus, ignoring IDF, the score of a matched Blueprint is independent of other Blueprints given a specific query. To conclude, based on the introduced scoring algorithm, the factors that affect the score of a returned Blueprint are: the number of terms (T) from the query that are captured by the Blueprint tags, a boost coefficient (C) that corresponds to the semantic relation type of each captured term (16, 4, and 2 for the three different types), and the total number of tags (N) included in the Blueprint. The last is a constant factor, independent of the query. The formula for the score calculation is given in Equation (1):

$$\sum_{i=1}^{T} \frac{C_i}{\sqrt{N}} \qquad (1)$$
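As a minimal sketch of Equation (1) in Python: the function name `blueprint_score` and the `BOOST` table are hypothetical, and the assignment of 16, 4, and 2 to specific relation types is an assumption based on the order in which the text lists them (exact match, hierarchy or property relation, sibling).

```python
import math

# Boost coefficient per relation type. The source gives the values 16, 4,
# and 2 for the three types; this exact mapping is an assumption.
BOOST = {
    "equivalent": 16, "same": 16,
    "superclass": 4, "subclass": 4, "property": 4,
    "sibling": 2,
}


def blueprint_score(captured_relations, total_tags):
    """Equation (1): sum of C_i over the T captured terms, divided by sqrt(N)."""
    if total_tags <= 0:
        raise ValueError("a Blueprint must carry at least one tag")
    return sum(BOOST[r] for r in captured_relations) / math.sqrt(total_tags)


# A Blueprint with N = 9 tags that captures T = 2 query terms, one through a
# subclass relation (C = 4) and one through a sibling relation (C = 2).
print(blueprint_score(["subclass", "sibling"], total_tags=9))  # 2.0
```

Because N depends only on the Blueprint and each coefficient depends only on how a query term is matched, the score of one Blueprint does not change when other Blueprints are added to the repository, which is the independence property argued for above.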

In the following running example, the DaaS Marketplace aims at delivering medical data to various users such as doctors, researchers, or any developers that intend to create cloud-based applications which would consume those data. The storyline begins with a domain expert (e.g., doctor, biomedical engineer, etc.) that provides the relevant DSO to the marketplace. Thereafter, data owners that are willing to sell medical data via the Marketplace create and store their Blueprints in the repository. As mentioned before, a Blueprint contains, among other tags, those that describe the content of the data that the VDC provides. An example of those tags is depicted in Figure 6.

Figure 6. Virtual Data Container (VDC) Blueprint tags.

Finally, a user who is searching for medical test data about patients in Milan performs the query shown in Figure 7 to the Marketplace:

Figure 7. Data consumer query terms.

Then, the Semantic Resolution Engine is responsible for discovering those Blueprints from the repository that match the user’s preferences. It is worth noting that without the usage of the engine, but instead simply relying on ElasticSearch, the Blueprint depicted in Figure 6 would not be included in the result set, though it might be of some interest to the user. In fact, the engine returns that Blueprint, due to the semantic relation between the term test and the tag blood or the relation between Milan and Rome, according to the snippets of the DSO depicted in Figures 2 and 5, respectively. Applying Equation (1) to this specific Blueprint, its overall score is derived from the equality in Equation (2):

$$\frac{4}{\sqrt{9}} + \frac{2}{\sqrt{9}} = 2 \qquad (2)$$

2.2.2. SEMS

The Service’s Evaluation and Monitoring System (SEMS) aims to provide a complete solution to the Data as a Service Marketplace’s “complete assessment issue” (that is, the lack of a complete monitoring and evaluation system), and eventually further improve it as a component. The purpose of SEMS is to monitor both the service and the physical machine on which it operates, in order to provide real-time metrics and evaluation scores. Those metrics shall be obtainable by all the potential buyers, for them to decide whether the service is a good choice (according to their requirements) or not. In a few words, SEMS monitors both the physical infrastructure and the DaaS Marketplace that runs on it. Its functionality is quite simple, but manages to achieve the desirable result. Briefly, SEMS consists of the following:

(i) An ElasticSearch database storing the raw monitoring data.
(ii) ElasticSearch’s MetricBeat module. MetricBeat is a monitoring tool that is installed on the physical machine and monitors both the system and the DaaS Marketplace running on it. MetricBeat forwards all the monitoring data to ElasticSearch every 10 seconds.
(iii) ElasticSearch’s Kibana module. Kibana offers a user interface (UI) to ElasticSearch. It is useful in testing scenarios and for easily checking monitoring data.
(iv) A Java Spring Boot Application. This tool obtains the monitoring data from ElasticSearch, calculates the necessary metrics, and exposes them to end users through an HTTP GET endpoint.

At the outset, the existence of a database for storage of raw monitoring data (before the final metrics calculation process) is essential. The selection of ElasticSearch would appear to best fit the needs of SEMS. ElasticSearch is a distributed, RESTful search and analytics engine capable of addressing a growing number of use cases. The database is the “depository” for all the raw monitoring data being sent from the MetricBeat modules. MetricBeat is the next basic SEMS component. Each physical machine containing a DaaS Marketplace will also have a MetricBeat module installed and running.
The module will draw monitoring data from both the machine and the Marketplace every 10 seconds. Each MetricBeat dataset includes specific information (such as IDs) that is unique to every physical machine and Marketplace. In this manner, the correct data selection from the ElasticSearch database will be achieved. The Kibana UI, one of ElasticSearch’s prime tools, will serve as an easy method of testing and checking the data inserted into ElasticSearch. For example, in the case that an extra metric needs to be added, or an error appears in the data, browsing all the insertions through Kibana will help the process of adding or resolving.

Regarding the final component, SEMS needs a main module that orchestrates the collected raw monitoring data and undertakes the computation needed to create the final metrics. Furthermore, it should expose an HTTP GET endpoint through which the end user can retrieve the aforementioned metrics. This main module is, in practice, a Java Spring Boot Application. More specifically, its functionality can be split into two parts, as already mentioned. In the first part, the tool “listens” for an HTTP GET Request. The user only specifies the container ID of the Marketplace, which runs on a physical machine. As soon as the application detects the incoming request, it initiates the searching process through all the data stored in ElasticSearch. It uses the current time and the Marketplace’s ID as search variables. Note that the Marketplace’s data are already in the ElasticSearch database, since MetricBeat sends them every 10 seconds (as analyzed before). Then, we move to the second part where the tool proceeds to the necessary calculations for three main metrics to be generated: CPU percentage used, RAM percentage used, and response time. Each of these refers to both the Marketplace and its physical machine.
Finally, the application returns to the user the aforementioned three metrics, in JSON format (as a response to the request). If any error occurs, a message will be shown to the user. More metrics can be added to SEMS; we selected the three highlighted as initial examples.

SEMS also supports the Recommendation System via the real-time metrics it extracts. In a few words, raw monitoring data are drawn from Data Marketplace Products (DaaS components) and are collected into a central physical machine, where the metrics/results extraction takes place (using the aforementioned raw monitoring data). One can see the raw data collected through the Visualization Tool, i.e., a User Interface (such as Kibana). After the metrics extraction is complete, the results are ready for the end user to view in real time. Moreover, these results serve as recommenders, meaning data that automatically fill the Recommendation System.
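The metric-extraction step described above can be sketched as follows. This is an illustrative Python sketch rather than the Java Spring Boot application the text describes: the sample field names (`container_id`, `timestamp`, `cpu_pct`, `ram_pct`, `response_time_ms`), the `summarize` function, and the 60-second window are all assumptions, and the raw samples are passed in as plain dictionaries instead of being fetched from ElasticSearch.

```python
import json
from datetime import datetime, timedelta
from statistics import mean


def summarize(samples, container_id, now, window=timedelta(seconds=60)):
    """Average the raw samples of one container over the last `window`."""
    recent = [
        s for s in samples
        if s["container_id"] == container_id
        and now - window <= s["timestamp"] <= now
    ]
    if not recent:
        # Corresponds to the error message returned to the user.
        return {"error": f"no recent monitoring data for {container_id}"}
    return {
        "container_id": container_id,
        "cpu_pct": round(mean(s["cpu_pct"] for s in recent), 2),
        "ram_pct": round(mean(s["ram_pct"] for s in recent), 2),
        "response_time_ms": round(mean(s["response_time_ms"] for s in recent), 2),
    }


# Three hypothetical samples, 10 seconds apart, as MetricBeat would send them.
now = datetime(2020, 4, 1, 12, 0, 0)
samples = [
    {"container_id": "mkt-01", "timestamp": now - timedelta(seconds=10 * i),
     "cpu_pct": 20.0 + 10.0 * i, "ram_pct": 40.0, "response_time_ms": 12.0 + i}
    for i in range(3)
]

print(json.dumps(summarize(samples, "mkt-01", now)))
```

The search variables match the ones named in the text (the current time and the Marketplace's ID); how the real application aggregates the samples is not stated, so the plain average used here is only one plausible choice.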

The Recommendation System is the final step of the process. It takes as an input the results of SEMS, which are the calculated DaaS metrics, in order to produce the complex queries that will generate the user-based score (recommendation) of the DaaS Products’ Blueprints. For every DaaS Product’s Blueprint that is in the list of candidates, the Recommendation System queries the purchase repository in order to find the purchase history of every Blueprint and also the application requirements of the buyers that acquired it. After correlating the stored application requirements with the current requirements, as well as with the real-time metrics provided by SEMS, it produces a score that takes this correlation strongly into account. Thus, it generates a recommendation that depends on what the users wanted from the Blueprint of the particular product and how well this Blueprint served the application’s needs. By correlating the different requirements, the system takes the scores of users that had similar application requirements into strong consideration. This helps the Recommendation System to generate more user-centric recommendations instead of the technical filtering and ranking provided by the other components.

SEMS can also be used by the repository scaling service described in Section 2.2.3 for decision support. It is structured in a way that can be easily understood and used by both the developers (behind the tool) and the users. SEMS’s running sequence will not be disturbed if any changes should occur to the physical machine, the Data Marketplace, or the DaaS products. It can run on any system (both the Product and the physical machine), independent from its computational capabilities. Given a specific Marketplace’s DaaS product, SEMS draws monitoring data from two MetricBeat modules, one inside it and one in its physical machine. If any changes happen to the Marketplace and the MetricBeat inside it stops working, data from the physical machine’s MetricBeat will still be drawn. In both cases, SEMS will continue to function.

The technologies used are relatively simple. SEMS takes full advantage of ElasticSearch’s tools. It uses an ES Database, the Kibana User Interface for raw data (data in the database) visualization, and the MetricBeat monitoring tool. Finally, it uses the Spring Boot functionalities and capabilities, through a Java Spring Boot Application. A general architecture is shown in Figure 8.

Figure 8. Service’s Evaluation and Monitoring System (SEMS) general architecture.

2.2.3. Repository Manager

The Marketplace Repository Scaling Framework (MaRScaF) was developed in order to aid the DaaS Marketplace functionality, providing support to business continuity, datastore elasticity, and QoS assurance domains. It is a toolkit that contains a plethora of APIs readily available to serve the needs of a growing and evolving marketplace for DaaS platforms. It is developed in Python and it can support raw Docker and Kubernetes infrastructure, providing great flexibility in the choice of database topologies and architectures. In essence, any machine, located either in the cloud, at the physical location of the marketplace, or even at the edge of the network, can be used to store data as part of the repository cluster. It also provides real-time scaling capabilities, enabling the Marketplace to handle a rapidly changing number of users.

As we mentioned, MaRScaF is developed in Python 3, providing numerous APIs that are separated into four main categories:

1. The MongoDB APIs, which handle the configuration of the MongoDB cluster, as well as the communication needed to ensure that the configurations are correctly applied and the status of all machines is within acceptable thresholds.
2. The Docker APIs, which handle the Docker hosts, the images hosted in them, and the containers that are or will be parts of the MongoDB cluster.
3. The Kubernetes APIs, which also handle the Docker hosts, images, and containers, but on a higher level using the Kubernetes orchestration APIs.
4. The Scaling APIs, which provide the coordination APIs that combine all others in order to provide the services promised by MaRScaF.

We will focus on the Docker and Scaling APIs as these are being used in the evaluation run in the present work. As shown in Figure 9, the three APIs that are being used during the deployment phase of the MongoDB cluster are the Create Cluster API, the Build Image API and the Run Container API. The first receives a JSON file that contains the initial configuration of the cluster to be created. This includes the machines used, the ports, the IP addresses, their roles (Mongos server, Configuration server, or Shard server), and other relevant information. Then, this API uses the Build Image API in order to ensure that all necessary images are present in the Docker hosts and that they are ready to spawn new containers. If an image is not present it is automatically created using pre-created Dockerfiles that are populated by the configurations passed to the Create Cluster API. Then, the Run Container API is called, which starts the necessary containers and configures the MongoDB cluster, returning the final access point and a debug status message to the user who, in this case, is the DaaS Marketplace.

Figure 9. Data repository scaling deployment API architecture.

Figure 10 shows the real-time APIs, which provide scaling and cluster destruction capabilities. It can be seen that the monitoring module provided by SEMS monitors the MongoDB cluster and its Docker hosts in order to provide real-time information, identifying choke points and other critical situations. If such a situation arises, the Scale out API is called by providing a JSON file with configuration about the server to be added. This API then calls the Build Image API and the Run Container API again in order to create the new server and then reconfigures the MongoDB cluster, adding the new server dynamically during operation. This process takes less than 10 seconds if the image is already in the Docker host and about 2 minutes if the image needs to be built.
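The decision that leads to a scale-out call can be sketched in Python as below. The thresholds, the metric names, the payload fields, and the `scale_out_request` function are assumptions made for illustration; the source does not specify how a choke point is detected or what the JSON file for the Scale out API contains beyond a description of the server to add.

```python
import json

# Hypothetical limits beyond which the cluster is considered a choke point.
CPU_LIMIT_PCT = 85.0
LATENCY_LIMIT_MS = 200.0


def scale_out_request(metrics, next_shard_index, host, port=27018):
    """Return a JSON description of the shard to add, or None if none is needed."""
    overloaded = (
        metrics["cpu_pct"] > CPU_LIMIT_PCT
        or metrics["response_time_ms"] > LATENCY_LIMIT_MS
    )
    if not overloaded:
        return None
    return json.dumps({
        "role": "shard",
        "name": f"shard{next_shard_index}",
        "host": host,
        "port": port,
    })


print(scale_out_request({"cpu_pct": 50.0, "response_time_ms": 20.0}, 3, "10.0.0.5"))
print(scale_out_request({"cpu_pct": 95.0, "response_time_ms": 20.0}, 3, "10.0.0.5"))
```

In this sketch the metrics dictionary stands in for the real-time values that SEMS would provide, and the returned JSON string stands in for the file handed to the Scale out API.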


Figure 10. Data repository scaling real-time API architecture.

3. Results

We evaluated the performance of our data recommendation services versus the usage of simple ElasticSearch queries. This recommendation service, as discussed earlier, is used in order to improve the data discovery process in the DaaS environment and the overall quality of experience (QoE) of the user. Both the ElasticSearch and the recommendation services are also scalable in order to ensure that they will follow the increasing needs of the DaaS Marketplace, always performing at the same QoS levels.

ElasticSearch provides two configuration options during data retrieval queries: (a) the algorithm used and (b) the learning weight. For the first option we tested three different algorithms: the Gaussian (GA), the linear (LIN), and the exponential (EXP). For the learning weight we again tested three different values (10, 25, and 50), which in essence configure how quickly the scoring in the algorithms changes between two similar values that are examined by the internal scoring function.

In Figure 11, we can see that the proposed recommendation system more accurately fulfilled the user's search in most tests. In detail, we can see in blue the percentage of cases in which the proposed system returned a more fitting result than a simple ElasticSearch query and, in red, the percentage of cases in which the simple ElasticSearch query returned a more fitting result than the proposed system. The difference between the algorithms is small, but the best, being closest to the 50% threshold of equal performance between the two systems, is the GA10 algorithm, with 62.30% for the proposed system and 37.70% for ElasticSearch. Thus, we continued our experiment with the GA10 algorithm in order to provide ElasticSearch with the most favorable conditions.
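To make the two configuration options more concrete, the following is a minimal sketch of the three decay shapes. It assumes that the "learning weight" plays the role of the `scale` parameter of Elasticsearch's decay functions (`gauss`, `linear`, `exp`); the text does not confirm this mapping. The function name `decay_score` and the sample values are illustrative only, and the formulas follow the decay definitions in the Elasticsearch function score documentation as we understand them.

```python
import math


def decay_score(kind, value, origin, scale, decay=0.5, offset=0.0):
    """Score in [0, 1] that falls off as `value` moves away from `origin`.

    Assumption: `scale` stands in for the paper's "learning weight".
    At a distance of exactly `scale`, every shape returns `decay`.
    """
    distance = max(0.0, abs(value - origin) - offset)
    if kind == "gauss":
        sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
        return math.exp(-distance ** 2 / (2.0 * sigma_sq))
    if kind == "exp":
        lam = math.log(decay) / scale
        return math.exp(lam * distance)
    if kind == "linear":
        s = scale / (1.0 - decay)
        return max((s - distance) / s, 0.0)
    raise ValueError(f"unknown decay kind: {kind}")


# Compare the nine combinations (GA/LIN/EXP x 10/25/50) at one sample distance.
for kind in ("gauss", "linear", "exp"):
    for scale in (10, 25, 50):
        print(kind, scale, round(decay_score(kind, value=20, origin=0, scale=scale), 3))
```

Under this assumption, a smaller weight makes the score drop faster between two nearby values, which matches the description of the weight as controlling how quickly the scoring changes.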

Figure 11. Test results of different ElasticSearch algorithms versus the proposed system in a pre-rated dataset.

Future Internet 2020, 12, 77 15 of 21

We also performed the same experiment using the 10-fold cross-validation methodology on the GA10 algorithm of ElasticSearch versus the proposed system, to evaluate the behavior on queries without an exact match in the ElasticSearch data indexes. The results are now more in favor of ElasticSearch, lessening the difference in percentages. Nonetheless, there is a clear superiority of the proposed system in all cases, as shown in Figure 12.
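The 10-fold procedure can be sketched as follows. This is an illustrative outline, not the authors' evaluation code: it assumes that in each round nine folds are indexed and the held-out fold supplies the queries, so that no query has an exact match in the index. The function name `k_fold_splits` and the placeholder record names are hypothetical.

```python
import random


def k_fold_splits(items, k=10, seed=0):
    """Yield (indexed, held_out) pairs; every item is held out exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        indexed = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield indexed, held_out


# Hypothetical dataset of 100 rated records.
records = [f"record-{n}" for n in range(100)]
splits = list(k_fold_splits(records, k=10))
for indexed, held_out in splits:
    # In a real run: index `indexed`, then query both systems with `held_out`.
    print(len(indexed), len(held_out))
```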

Figure 12. Test results of 10-fold cross validation between the GA10 algorithm of ElasticSearch and the proposed system.

The repository scaling component was developed in order to maintain high QoS standards by automatically evolving the storage capabilities and architecture of the DaaS marketplace. For the evaluation and tests, we ran an artificial load experiment on a sharded MongoDB cluster that was created using the scaling component. The load was produced using a JMeter script that created clients performing read, update, and write operations on our cluster. We used the monitoring component described earlier in order to take snapshots of various QoS-related metrics every 10 seconds of operation. The cluster was running on a single machine using Docker containers as its nodes. We used five different cluster architectures:

1. Single shard cluster: 1 Mongos, 1 configuration server and 1 shard server
2. Double shard cluster: 1 Mongos, 1 configuration server and 2 shard servers
3. Triple shard cluster: 1 Mongos, 1 configuration server and 3 shard servers
4. Single to Double: A single shard cluster that was scaled to a double shard cluster
5. Single to Double to Triple: A single shard cluster that was scaled to a double shard and then to a triple shard.

We measured the latency for three categories of operations: (a) command operations, which include configurations, balancing, and other internal operations, as well as administrative commands on the cluster; (b) reads, which include the read operations on data hosted in the cluster; and (c) writes, which include all write, replication, and update operations. The artificial load and monitoring were executed for 180 seconds per experiment, providing sufficient time to build up load and test the scaling with up to two extra servers. Here we discuss only the single shard and the two scaling experiments in order to have a more compact presentation of results.

3.1. Single Shard

The first experiment is used as a baseline, and shows how a single shard cluster can handle the artificial load we created. The results show that as the operations gather, the latency (in milliseconds) for all three types of operations increases (Figures 13–15).

The same is not true for the average latency per operation. We notice that in read operations, the average latency increases as more requests come in (Figure 16), while in command operations it appears to be reducing (Figure 17). For write operations, we notice a logarithmic graph (Figure 18), possibly due to caching of the semi-randomized values created by the artificial load scripts.
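The difference between total latency and average latency per operation can be illustrated with a short sketch. It assumes that each 10-second snapshot has the shape of MongoDB's `serverStatus` `opLatencies` section, where every category carries a cumulative `latency` (in microseconds) and a cumulative `ops` count; whether the monitoring component reads exactly these fields is not stated in the text, and the snapshot values below are made up for illustration.

```python
def average_latency_ms(snapshots, category):
    """Average latency per operation (ms) at each snapshot, from cumulative counters."""
    averages = []
    for snap in snapshots:
        counters = snap[category]
        if counters["ops"] == 0:
            averages.append(0.0)  # no operations recorded yet
        else:
            # Cumulative microseconds / cumulative operations, converted to ms.
            averages.append(counters["latency"] / counters["ops"] / 1000.0)
    return averages


# Hypothetical snapshots taken 10 seconds apart (not measured data).
snapshots = [
    {"reads": {"latency": 2_000, "ops": 4}, "commands": {"latency": 9_000, "ops": 3}},
    {"reads": {"latency": 12_000, "ops": 12}, "commands": {"latency": 10_000, "ops": 5}},
]

for category in ("reads", "commands"):
    print(category, average_latency_ms(snapshots, category))
```

In this made-up example the total latency grows in both categories, yet the per-operation average rises for reads and falls for commands, which is the kind of divergence described above.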


Figure 13. Commands latency curve versus the command operations count.

Figure 14. Reads latency curve versus the command operations count.

Figure 15. Writes latency curve versus the command operations count.


Figure 16. Average reads latency (in ms) curve versus the time in seconds.


Figure 17. Average commands latency (in ms) curve versus the time in seconds.

Figure 18. Average writes latency (in ms) curve versus the time in seconds.

3.2. Single to Double Shard

This experiment shows the effects of scaling the cluster during operation. We notice a small spike in latencies during the addition and configuration of the new server, which is normal since many internal operations need to be executed in order to integrate the new server to the cluster. After the short spike, the curve returns to its normal behavior but at a lower level, achieving less peak latency, as shown in Figures 19 and 20, which provide a comparison between the baseline single shard cluster (grey) and the scaled cluster (black).

Figure 19. Average read latency of scaled cluster versus the baseline, single shard cluster.

Figure 20. Average write latency of scaled cluster versus the baseline, single shard cluster.

For the command operations we can see the same spike in activity but, since all of the overhead of the scaling is translated into internal operations and the Mongos and configuration servers are not scaled, we can also see that the latency in these operations is increased by a certain amount, as shown in Figure 21.

Figure 21. Average command latency of scaled cluster versus the baseline, single shard cluster.

3.3. Single to Double to Triple Shard

This experiment shows more clearly the effects of scaling a MongoDB cluster in real time, during operation. We notice a spike in activity during the scaling process, which is possibly the result of a number of internal operations in order to integrate the new servers in the cluster and put them to use, as discussed in the previous experiment. After the short spike, the latency drops below the level prior to the scaling and then starts to follow its normal curve for all three categories of operations, as shown in Figures 22 and 23, which provide a comparison between the baseline single shard cluster (grey) and the scaled cluster (black).

Figure 22. Average read latency of twice scaled cluster versus the baseline, single shard cluster.

Figure 23. Average write latency of twice scaled cluster versus the baseline, single shard cluster.

This means that we are achieving lower peak latencies for the monitored duration and load. The exception here is the command operations, where the latencies are following the same trend as the others, but with increased average latency, as shown in Figure 24, possibly due to the plethora of internal operations needed for the scaling process; these operations have an overhead on the Mongos and configuration servers, which were not scaled like the shard servers.
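The comparison of peak latencies between the baseline and a scaled cluster can be expressed as a small calculation. The sketch below is only illustrative: the function names, the choice of a "settle" time after which the scaling spike is ignored, and all sample values are assumptions, not measurements from the experiments.

```python
def peak_after(series, start_second):
    """Highest latency observed at or after `start_second`.

    `series` is a list of (second, latency_ms) pairs.
    """
    return max(latency for second, latency in series if second >= start_second)


def peak_reduction(baseline, scaled, settle_second):
    """Relative drop in peak latency of the scaled run once the spike has settled."""
    base_peak = peak_after(baseline, settle_second)
    scaled_peak = peak_after(scaled, settle_second)
    return (base_peak - scaled_peak) / base_peak


# Hypothetical series sampled every 10 seconds (not measured data).
baseline = [(0, 1.0), (10, 2.0), (20, 4.0)]
scaled = [(0, 1.0), (10, 5.0), (20, 3.0)]  # spike at 10 s while the shard is added

reduction = peak_reduction(baseline, scaled, settle_second=20)
print(f"peak latency reduced by {reduction:.0%}")
```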

Figure 24. Average command latency of twice scaled cluster versus the baseline, single shard cluster.

4. Discussion

Our results show a clear superiority of the proposed recommendation system over the simple ElasticSearch queries, enhancing the QoE of the DaaS Marketplace user. This, combined with the semantic enhancements, is a valuable tool in knowledge discovery both in our use case and in any cloud- or fog-based environment. Moreover, the tests conducted with the repository scaling framework prove that we can ensure the QoS requirements of a cloud-based application by adding more data servers as the need arises. This process takes less than 10 seconds, and is thus a real-time solution, tackling demand spikes that are common in real world scenarios.

A further experiment of interest would be to examine the way the scaling of the Mongos and Configuration servers of the MongoDB cluster affects the increasing load on the cluster in practice, in order to explore if this has any value regarding the QoE of the DaaS Marketplace. Moreover, we could also examine other data stores, such as HDFS or MySQL, in order to examine their scalability and other relevant features in comparison to MongoDB.

Author Contributions: Conceptualization, A.P. and V.M.; methodology, A.M. and A.P.; software, E.P., A.M., A.N. and A.C.; validation, E.P., A.P. and A.N.; formal analysis, A.M. and V.M.; investigation, E.P. and A.N.; resources, T.V. and V.M.; data curation, E.P., A.M. and A.N.; writing–original draft preparation, E.P., A.P., A.M., A.N. and V.M.; writing–review and editing, E.P., V.M. and T.V.; visualization, E.P., A.N., A.P. and A.C.; supervision, V.M. and T.V.; project administration, V.M. and T.V.; funding acquisition, V.M. and T.V. All authors have read and agreed to the published version of the manuscript.

Funding: The research leading to these results has received funding from the European Commission under the H2020 Programme's project DataPorts (grant agreement No. 871493).

Conflicts of Interest: The authors declare no conflict of interest.

Future Internet 2020, 12, 77 20 of 21

7. Mbah, R.B.; Rege, M.; Misra, B. Discovering Job Market Trends with Text Analytics. In Proceedings of the 2017 International Conference on Information Technology (ICIT), Bhubaneswar, India, 21–23 December 2017; pp. 137–142.
8. Chen, D.; Chen, Y.; Brownlow, B.N.; Kanjamala, P.P.; Arredondo, C.A.G.; Radspinner, B.L.; Raveling, M.A. Real-Time or Near Real-Time Persisting Daily Healthcare Data Into HDFS and ElasticSearch Index Inside a Big Data Platform. IEEE Trans. Ind. Inform. 2017, 13, 595–606.
9. Klibisz, A. Elastik-Nearest-Neighbors. 2019. Available online: https://github.com/alexklibisz/elastik-nearest-neighbors (accessed on 28 March 2020).
10. Amato, G.; Bolettieri, P.; Carrara, F.; Falchi, F.; Gennaro, C. Large-Scale Image Retrieval with Elasticsearch. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, USA, 8–12 June 2018; pp. 925–928.
11. Gupta, P.; Kanhere, S.S.; Jurdak, R. A Decentralized IoT Data Marketplace. In Proceedings of the 3rd Symposium on Distributed Ledger Technology, Gold Coast, Australia, 12 November 2018.
12. Ghosh, H. Data marketplace as a platform for sharing scientific data. In Data Science Landscape; Springer: Berlin/Heidelberg, Germany, 2018; pp. 99–105.
13. Wien, T. Data as a Service, Data Marketplace and Data Lake–Models, Data Concerns and Engineering. Available online: https://linhsolar.github.io/ase/pdfs/truong-ase-2018-lecture4-daas_datalake_datamarket_dataconcerns.pdf (accessed on 28 March 2020).
14. Smith, G.; Ofe, H.A.; Sandberg, J. Digital service innovation from open data: Exploring the value proposition of an open data marketplace. In Proceedings of the 2016 49th Hawaii International Conference on System Sciences (HICSS), Koloa, HI, USA, 5–8 January 2016; pp. 1277–1286.
15. Mišura, K.; Žagar, M. Data marketplace for Internet of Things. In Proceedings of the 2016 International Conference on Smart Systems and Technologies (SST), Osijek, Croatia, 12–14 October 2016; pp. 255–260.
16. Casalicchio, E.; Perciballi, V. Measuring docker performance: What a mess!!! In Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering Companion, L’Aquila, Italy, 22–26 April 2017; pp. 11–16.
17. Jiménez, L.L.; Simón, M.G.; Schelén, O.; Kristiansson, J.; Synnes, K.; Åhlund, C. CoMA: Resource Monitoring of Docker Containers. In Proceedings of the 5th International Conference on Cloud Computing and Services Science, Lisbon, Portugal, 20–22 May 2015; pp. 145–154.
18. Soam, A.K.; Jha, A.K.; Kumar, A.; Thakur, V.K.; Hore, P. Resource Monitoring of Docker Containers. IJAERD 2016, 3.
19. Bagnasco, S.; Berzano, D.; Guarise, A.; Lusso, S.; Masera, M.; Vallero, S. Monitoring of IaaS and scientific applications on the Cloud using the Elasticsearch ecosystem. In Journal of Physics: Conference Series; IOP Publishing: Bristol, UK, 2015; Volume 608.
20. Abraham, S.M. Comparative Analysis of MongoDB Deployments in Diverse Application Areas. Int. J. Eng. Manag. Res. (IJEMR) 2016, 6, 21–24.
21. Jose, B.; Abraham, S. Exploring the merits of nosql: A study based on mongodb. In Proceedings of the 2017 International Conference on Networks & Advances in Computational Technologies (NetACT), Trivandrum, Kerala, India, 20–22 July 2017; pp. 266–271.
22. Aboutorabi, S.H.; Rezapour, M.; Moradi, M.; Ghadiri, N. Performance evaluation of SQL and MongoDB databases for big e-commerce data. In Proceedings of the 2015 International Symposium on Computer Science and Software Engineering (CSSE), Tabriz, Iran, 18–19 August 2015; pp. 1–7.
23. Klein, J.; Gorton, I.; Ernst, N.; Donohoe, P.; Pham, K.; Matser, C. Performance evaluation of NoSQL databases: A case study. In Proceedings of the 1st Workshop on Performance Analysis of Big Data Systems, Austin, TX, USA, 1 February 2015; pp. 5–10.
24. Docker-Build, Ship, and Run Any App, Anywhere. Available online: https://www.docker.com/ (accessed on 17 November 2017).
25. Liu, Q.; Zheng, W.; Zhang, M.; Wang, Y.; Yu, K. Docker-based automatic deployment for nuclear fusion experimental data archive cluster. IEEE Trans. Plasma Sci. 2018, 46, 1281–1284.
26. Jovanov, G. Mongo + Docker Swarm (Fully Automated Cluster). In Medium; 2019; Available online: https://medium.com/@gjovanov/mongo-docker-swarm-fully-automated-cluster-9d42cddcaaf5 (accessed on 9 October 2019).
27. Alam, M.; Rufino, J.; Ferreira, J.; Ahmed, S.H.; Shah, N.; Chen, Y. Orchestration of microservices for iot using docker and edge computing. IEEE Commun. Mag. 2018, 56, 118–123.
28. MacKenzie, C.M.; Laskey, K.; McCabe, F.; Brown, P.F.; Metz, R.; Hamilton, B.A. Reference Model for Service Oriented Architecture v1.0. In Reference Model for Service Oriented Architecture 1.0; OASIS: Burlington, MA, USA, 2006; Available online: https://docs.oasis-open.org/soa-rm/v1.0/soa-rm.html (accessed on 28 March 2020).
29. Plebani, P.; Garcia-Perez, D.; Anderson, M.; Bermbach, D.; Cappiello, C.; Kat, R.I.; Marinakis, A.; Moulos, V.; Pallas, F.; Tai, S.; et al. Data and Computation Movement in Fog Environments: The DITAS Approach. In Fog Computing: Concepts, Frameworks and Technologies; Mahmood, Z., Ed.; Springer International Publishing: Cham, Switzerland, 2018; pp. 249–266.
30. Plebani, P.; Garcia-Perez, D.; Anderson, M.; Bermbach, D.; Cappiello, C.; Kat, R.I.; Marinakis, A.; Moulos, V.; Pallas, F.; Pernici, B.; et al. DITAS: Unleashing the Potential of Fog Computing to Improve Data-Intensive Applications. In Advances in Service-Oriented and Cloud Computing; Springer: Cham, Switzerland, 2018; pp. 154–158.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).