A Multi-criteria Approach for Large-object Storage

Uwe Hohenstein1, Michael C. Jaeger1 and Spyridon V. Gogouvitis2 1Corporate Technology, Siemens AG Munich, Germany 2Mobility Management, Siemens AG Munich, Germany

Keywords: , Federation, Multi-criteria.

Abstract: In the area of storage, various services and products are available from several providers. Each product pos- sesses particular advantages of its own. For example, some systems are offered as cloud services, while others can be installed on premises, some store redundantly to achieve high reliability while others are less reliable but cheaper. In order to benefit from the offerings at a broader scale, e.g., to use specific features in some cases while trying to reduce costs in others, a federation is beneficial to use several storage tools with their individual virtues in parallel in applications. The major task of a federation in this context is to handle the heterogeneity of involved systems. This work focuses on storing large objects, i.e., storage systems for videos, database archives, virtual machine images etc. A metadata-based approach is proposed that uses the metadata associated with objects and containers as a fundamental concept to set up and manage a federation and to con- trol storage locations. The overall goal is to relieve applications from the burden to find appropriate storage systems. Here a multi-criteria approach comes into play. We show how to extend the developed by the VISION Cloud project to support federation of various storage systems in the discussed sense.

1 INTRODUCTION servers. They heavily rely on distributing data across several computers and prefer a schema-less storage of The National Institute of Standards and Technology data with a relaxed consistency concept. Certainly, (NIST) states that ” is a model for the settled technology of relational databases - if de- enabling ubiquitous, convenient, on-demand network ployed in a cloud - is also a kind of cloud storage. access to a shared pool of configurable computing There are corresponding offerings from major Cloud resources (e.g., networks, servers, storage, applica- providers. And finally, Blob stores represent a further tions, and services) that can be rapidly provisioned type of storage that should be mentioned. and released with minimal management effort or ser- Hence, we notice an increasing heterogeneity vice provider interaction” (Mell and Grance, 2011). of storage technologies even within the same cate- Hence, cloud computing represents a provisioning gory, with further differences from vendor to ven- paradigm for resources in first place. dor, whether deployed on-premises or in the cloud, Cloud storage is certainly one important cloud re- whether used as Platform-as-a-Service (PaaS) or as a source that benefits from the major characteristics of special Virtual Machine on the IaaS level. Each indi- cloud computing (Fox et al., 2009) such as vidual cloud storage has virtues of its own. virtually unlimited storage space, To benefit at a broader scale, a combination of • no upfront commitment for investments into hard- storage solutions seems to be useful. Some impor- • ware and software licenses, and tant aspects in this context are in the sense of polyglot pay per use for the occupied storage. persistence (Sandalage and Fowler, 2012): • to use the most appropriate storage technology for The term cloud storage is mostly associated with • the recent technology of Not only SQL databases each specific problem; (NoSQL, 2017), which attained a lot of popularity. to reduce costs by using a public cloud, by choos- Implied by an adaptation to distributed systems and • ing an appropriate cloud provider, particularly un- cloud computing environments, NoSQL databases der consideration of the various and complex price follow a different approach than the traditional fix- schemes and underlying factors such as price/GB, schema based model provided by relational database data transfer or number of transactions;

75

Hohenstein, U., Jaeger, M. and Gogouvitis, S. A Multi-criteria Approach for Large-object Cloud Storage. DOI: 10.5220/0006432100750086 In Proceedings of the 6th International Conference on Data Science, Technology and Applications (DATA 2017), pages 75-86 ISBN: 978-989-758-255-4 Copyright © 2017 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved DATA 2017 - 6th International Conference on Data Science, Technology and Applications

to take into account access time and latency, e.g., 2 THE VISION CLOUD PROJECT • to use fast but expensive storage only when really needed but slow and cheap storage in other cases; VISION Cloud (Kolodner et al., 2011) was an EU to consider the differences between on-premises co-funded project for the development of metadata- • and cloud solutions with certain limitations, e.g., centric cloud storage solutions. The project developed the maximal database size, a reduced features set; a storage system and several domain applications to use a hybrid cloud for security and confiden- where the handling of rich metadata provides new in- • tiality issues, i.e., keeping confidential data in a novations. Domains targeted by VISION Cloud were private cloud while taking benefit from the public telecommunication, broadcasting and media, health cloud for non-confidential data; care, and IT application management. to shard data in general, e.g., according to geo- These domains share the need for an appropri- • graphic locations, costs etc. ate object storage system. The telecommunication, the broadcasting, and media domains envisage the These points are fully interleaved and demand storage of videos, the health care domain applica- for a multi-criteria approach. Therefore a model tion stores high resolutions diagnostics images, and is needed that captures the necessary information the IT application management stores virtual ma- and creates associations between all involved entities. chine images. They all share the need for grow- This information can be used to find an optimal data ing storage capacity and large storage consumption placement solution for an overall benefit. per object. The images of virtual machines in a In this paper, we base our work upon a metadata- grow bigger. The media domain is mov- based approach for a cloud storage scheme de- ing to Ultra High-Definition and 4K resolution con- veloped by the European funded VISION Cloud tent. And in the telecommunication domain, shar- project (Kolodner et al., 2011)(Kolodner (2) et al., ing of video messages turns into a trend as the mar- 2012). VISION Cloud aimed at developing next ket share of high-resolution camera equipped hand- generation technologies for storing large objects like sets grows (VISION-Cloud, 2011). All these domains videos and virtual machine images, accompanied by in VISION Cloud benefit from an object storage de- a content-centric access to storage. Following the veloped in the project and serve as a proof-of-concept: CDMI proposal (CDMI, 2010), the approach relies They require an increasing need of large capacity for upon objects and containers and offers first-class sup- the expectation to store a vast number of large objects port for metadata for these storage entities. and the ability to maintain rich metadata sets in order VISION Cloud has implemented a simple federa- to navigate and retrieve stored content. tion approach that provides some basic access to sev- Pursuing a metadata-centric approach, VISION eral storage solutions. We extend the original federa- Cloud stresses the ability to represent the type and for- tion approach to tackle the needs of a multi-criteria mat of the stored objects. With such an awareness of storage solution that attempts to combine different the storage, functionality can be triggered depending storage technologies. Indeed, having a federation, on the execution context and the currently processed it is possible to benefit from the advantages of vari- storage object. The automation based on the aware- ous storage solutions, private, on-premise and public ness also contributes to the ability to deal with a high clouds, access speed, best price offerings etc. while number of objects stored in an autonomous manner. the same way avoiding the disadvantages of a single storage. A federation approach can provide a unified 2.1 The VISION Concept of Metadata and location-independent access interface, i.e., trans- parency for data sources, while leaving the federation Classic approaches that handle large objects basically participants autonomous. organize files in a hierarchical structure in order to al- The remainder of this work is structured as fol- low navigating through the hierarchy and finally find- lows: Section 2 explains the VISION Cloud software ing a particular item. However, it can be quite dif- that is relevant and extended in this work: the concept, ficult to set up a hierarchy in an appropriate manner especially of using metadata, the storage interfaces, that provides flexible search options with acceptable and the storage architecture. The original cloud fed- access performance and intuitive categories for ever eration approach of VISION Cloud is also presented. increasing amounts of data. Moreover, such a hierar- We explain an extended federation approach, partic- chy has usually to be organized manually and is thus ularly the architectural setup, in Section 3, and con- prone to outdated or wrongly applied placements. In tinue with technical details in Section 4. Section 5 addition, the problem arises how to maintain a hierar- is concerned with related work. This work ends with chy changes in a distributed environment. Section 6 providing conclusions and future work.

76 A Multi-criteria Approach for Large-object Cloud Storage

One of the goals of VISION Cloud was to enhance metadata. The content-centric approach relieves a the object storage with rich metadata handling capa- user from establishing hierarchies in order to organize bilities. Looking at the cloud storage offerings that a high number of storage objects. VISION Cloud fol- existed from the popular vendors at project start, VI- lows the Cloud Data Management Interface (CDMI, SION Cloud decided to focus on the storage and re- 2010) specification. CDMI standardizes the interface trieval of objects based on metadata and the ability to object storage systems in general. However, the to perform autonomous actions on the storage node concept of the previously mentioned storlet, for exam- based on metadata (Kolodner et al., 2011). ple, is a CDMI extension not covered by the standard The content can be accessed based upon using at the time of developing the VISION Cloud. metadata. A lot of useful metadata is technical, such CDMI defines a standard for accessing and storing as the file format or the image compression algorithm. objects in a cloud specifying the typical CRUD (Cre- Looking at some video sharing plaforms, some obvi- ate, Retrieve, Update, Delete) operations in a REST ous metadata is also already available such as the title, style (Fielding and Taylor, 2002) using HTTP(S). The the author of the video, a description, the date of up- user can organize storage objects using containers. load, or a rating provided by users who have watched Containers can be compared to the concept of buckets the content. Of course, such (existing) metadata could in other storage solutions. be easily added during import, as one of the VISION In VISION Cloud, a container enables not only Cloud demonstrators has shown (Jaeger et al., 2012). the organization of storage objects, it allows also to Besides objects, metadata can be also attached to con- efficiently design queries and handle objects in gen- tainers, which hold several different objects. Using eral. The following REST examples give an impres- container metadata also enables storage handling in- sion about the CDMI-based interface of VISION: formation for the objects inside. PUT /CCS/MyTenant/MyContainer creates a In addition to such types of metadata that requires • new container for a specific tenant MyTenant. just the import of objects, the VISION Cloud ob- PUT /CCS/MyTenant/MyContainer/MyObject ject storage has the ability to derive metadata from • then stores an object MyOb ject in this container. processing storage objects. VISION Cloud uses so called storlets (Kolodner (2) et al., 2012): Storlets The payload of a HTTP PUT request contains the are software modules, similar to stored procedures metadata describing an object. To distinguish a con- or triggers in traditional databases, which contain ex- tainer from an object, the type of data for this request ecutable code to process uploaded storage objects. is indicated in the HTTP Content-Type header field They can analyze storage object, e.g., deriving meta- (cdmi-container). For example, a full request for data to be attached to objects. Hence, one applica- creating a new container looks as in Figure 1: tion of storlets is to run speech-to-text analyzers on Example: PUT /CCS/MyTenant/MyContainer video content in order to store the text resulting from X-CDMI-Specification-Version: 1.0 the audio track as metadata. Then indexing can pro- Content-Type: application/cdmi-container vide search terms as additional metadata attached to Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ== video content. Such an approach improves the abil- Accept: application/cdmi-container ity to navigate across a large number of video objects { metadata : { content : video, format : mpeg3 } } drastically. Figure 1: HTTP PUT request. From a technical perspective, VISION Cloud uses a key-value objects tree in a dedicated storage for keeping the metadata only, along with a further ba- 2.3 VISION Cloud Architecture sic object storage for the large objects themselves. It is the special ability of VISION Cloud to efficiently The general VISION Cloud architecture was pre- keep both the storage objects and their metadata in sented in previous work (Kolodner et al., 2011). Thus, synchronization when considering the node-based ar- we here focus on the storage and the metadata han- chitecture and envisaging horizontal scaling. dling. Figure 2 outlines the structure of layering: The foundation is the object storage, which includes the 2.2 The VISION Content Centric ability to store and manage the metadata. For ex- ample, replication of storage objects with their meta- Interface data across nodes is handled by this layer. On top of this layer, every node has the Content Centric Service Applications deal with metadata in VISION Cloud by (CCS) deployed, which offers higher services such using the so-called Content Centric Interface (CCI), as a relationship concept. The Content Centric Stor- an interface that maintains and allows for querying age implements the CCI. It extends CDMI by adding

77 DATA 2017 - 6th International Conference on Data Science, Technology and Applications

some additional operations in order to provide rich by implementing an adapter, other storage implemen- metadata handling for applications. The application, tations can be integrated with CCS as well, given that depicted on the top of the two other layers in Figure 2, the metadata handling requirements are provided in accesses the storage via the CCS. addition to a plain object storage. As part of the VI- Being a cloud service implementation, the VI- SION Cloud project, the open source CouchDB doc- SION cloud storage software was designed to run on ument database was also integrated with the CCS. common appliance dimensions allowing one to de- Using a specific storage adapter, the CCS connects ploy a massive number of parallel nodes in a data cen- to a storage server’s IP and port number, either refer- ter in order to enable horizontal scaling. An applica- ring to a single storage server or to a load balancer tion is principally able to access several nodes with within a cluster implementation. Technically, the ob- CCS and underlying object storage stacks deployed. ject storage and the CCS could reside on different In fact, the distribution over nodes is transparent to machines or nodes. Also, a client could access the the user and facilitated by load balancing in a VISION object storage directly, not using the CCS metadata Cloud storage system. handling capabilities. In the basic setup of VISION Cloud nodes, the CCS is deployed on every storage node, which is a (potentially virtual) server running a VISION Cloud node software stack. This avoids increased request response times resulting from the connections between different network nodes. More- over, the CCS is capable of avoiding node manage- ment functionality and keeping track of the current status of object nodes. This enables a horizontal scal- ing of VISION Cloud storage nodes in general. The request handling of CCS does not (need to) support sessions. As such, it can be easily deployed Figure 2: VISION architecture. on the storage node as a module. The CCS can also work with storage system on different machines. As an example, assume the application has a ref- erence of an object and would like to find similar 2.4 Basic VISION Cloud Support for objects in the storage. This is a popular use case when it comes to different media encoding types for Federations different end devices. The provider might store me- dia content optimized for hand held devices or smart Cloud federation aims at providing an access inter- phones along with media content optimized for high- face so that a transparency of data sources in differ- definition displays. The application uses the CCS to ent storage clouds of different provisioning modes is query for objects similar to the object already known. provided. At the same time it leaves the federation The similarity is a metadata feature that was provided participants autonomous. Clients are able to leverage especially as part of the ability to maintain relations a federation with a unified view of storage and data between objects as metadata. The CCS defines how to services across several systems and providers. encode relations by using key-value metadata on the In general, such a federation has to tackle hetero- object storage and decodes the relation-based query geneity of the units to be combined. In the context from the application into an internal metadata for- of storage federation, there are several types of het- mat which is not aware of relations. A corresponding erogeneity. At first and most obvious, each cloud query is sent to the underlying object storage. provider has management concepts and Application The underlying object storage was developed as Programming Interfaces (APIs) of its own, which may part of the VISION Cloud project. It provides a num- be proprietary or may implement industry specifica- ber of innovations in addition to the handling of meta- tions, e.g., (CDMI, 2010). And then at the next lower data, e.g., execution of storlets or resiliency of storage level, the federation has to take into account the het- items. If an application uses only the metadata han- erogeneity of data models of the cloud providers. In dling features and not the storlets, i.e, only the basic fact, the implementation of the content-centric storage features for storing and managing objects, the CCS service of VISION Cloud helps to handle heterogene- can work also with other storage systems that are ca- ity by means of adapters, thus allowing one to wrap pable of storing metadata attached to objects and con- heterogeneous units, each with a CCS interface. An tainers. The CCS uses an adapter concept to separate instance of a VISION Cloud CCS sits on top of a sin- the integration of different storage servers. Therefore gle storage system. Moreover, the CCS architecture

78 A Multi-criteria Approach for Large-object Cloud Storage

supports multiple cloud providers and underlying 2.5.1 On-boarding Federation storage system types, as long as a storage adapter is provided. Currently, CCS adapters are available The first scenario is a so-called on-boarding federa- for the proprietary VISION Cloud storage service or tion (Vernik et al., 2013). The purpose of this scenario CouchDB. is to migrate data from one storage system to another, The CCS communicates with the object storage the target. One important feature of the on-boarding using an IP connection. This means that the CCS can federation is to allow accessing all the data via the have several storage servers beneath with a load bal- target cloud while the migration is in progress, i.e., ancer in front. Hence, the VISION Cloud decision while data is being transferred in the background. The was to put CCS on top of these (homogeneous) cluster on-boarding scenario helps to avoid a vendor lock- solutions due to several benefits. CCS is just a bridge in, which is the second among top ten obstacles for between the load balancer and the client. All scalabil- growth in cloud computing according to (Fox et al., ity, elasticity, replication, duplication, and partition- 2009). With on-boarding being enabled to move data ing is done by the storage system itself. Therefore, without operational downtime, a client becomes inde- there is no need for CCS to re-implement features that pendent of a single cloud storage provider. are already available in numerous cloud storage im- To set up on-boarding, an administrator has to cre- plementations. In fact, current storage system types ate and maintain a federation of the two involved stor- usually have a built-in cluster implementation already. age systems. A federation is always defined between For instance, the open source project CouchDB has two containers: The administrator has to send a ”fed- an elastic cluster implementation named BigCouch. eration configuration”, which describes the federation MongoDB as another open source example, has vari- to the target container, because the target container ous strategies for deploying clusters of MongoDB in- will initiate a pull mechanism. stances. To our knowledge and published material by "federationinfo": { the vendors, we can assume that these cloud systems // information about remote cloud are able to deal with millions of customers and tens "eu.visioncloud.federation.status": "0", of thousands of servers located in many data centers "eu.visioncloud.federation.job_start_time": "1381393258.3", around the world. Hence, the CCS does not have to "eu.visioncloud.federation.remote_cloud_type": "S3", manage all the distribution, scalability, load balanc- "eu.visioncloud.federation.remote_container_name": ing, and elasticity. This would have tremendously in- "example_S3_bucket", "eu.visioncloud.federation.remote_region": "EU_WEST", creased the complexity of CCS. "eu.visioncloud.federation.type": "sharding", The basic federation functionality of VISION "eu.visioncloud.federation.is_active": "true", Cloud allows the CCS to actually distribute requests "eu.visioncloud.federation.local_cloud_port": "80", between multiple storage nodes. Depending on meta- // credentials to access remote cloud data, the CCS can route storage requests to different "eu.visioncloud.federation.remote_s3_access_key": storage services. This appears similar to the features "AFAIK3344key", of a load balancer. Although the CSS does not repre- "eu.visioncloud.federation.remote_s3_secret_key": "TGIF5566key", sent a load balancer, the role of the CCS in the soft- "eu.visioncloud.federation.status_time": "1381393337.72" } ware stack could be useful to service similar purpose: In fact, the CCS can distribute requests over different Figure 3: Sample payload. storage nodes not based on the classic load balancing algorithms for distributing load, but based on qual- The payload in Figure 3 describes a typical fed- ity characteristics and provided capabilities that are eration configuration in VISION Cloud. The data is matched with the available storage clouds. required for accessing a member’s cloud storage, i.e., Moreover, VISION Cloud was designed with se- the remote storage to be moved to the target cloud, curity in mind and provides fine granular access con- here for an member. With this configura- trol lists (ACLs). ACLs are attachable to tenants, con- tion request, a link between the clouds is created. tainers, and objects. A new REST service, the federation service, pro- vides the basic CRUD operations to configure and 2.5 Use of Federations handle federations. This federation service is de- ployed along with the CCS. PUT creates a new fed- The basic VISION Cloud federations features have eration instance by passing an id in the Uniform Re- been used and slightly extended in two federation sce- source Identifier (URI) and the federation configura- narios. The approaches are briefly presented now. tion in the body. GET gives access to a specific fed- eration instance and returns the federation progress or statistical data. A federation configuration can be

79 DATA 2017 - 6th International Conference on Data Science, Technology and Applications

deleted by DELETE. For details please refer to the vate cloud and vision-tb-2 the public cloud. Such a project deliverable (VISION-Cloud, 2012). specification is needed for any container (here logi- After the administrator has configured the federa- calContainer) to act as a shard. tion, the objects of federated containers will be trans- Clients are enabled to distribute data over the ferred in the background to the container in the target clouds and are provided with a unified view of the cloud. If a client asks for the contents of the container, data that resides in both the private and public cloud. all objects from all containers in the federation will be The hybrid cloud setup is completely transparent for returned. Thus, objects that have not been on-boarded the clients of a container, and the clienst might even yet will be fetched from the remote source, too. not be aware where the data resides. Data CRUD op- A special on-boarding handler of the federa- erations are performed in a sharded way; The possi- tion service intercepts GET-requests from the client bility to determine data confidentiality is given to the and redirects them to the remote system for non- client by a metadata item that indicates data confiden- transferred containers and schedules the background tiality. In order to store confidential data, one need to jobs for copying data from the remote cloud. perform the request in Figure 5.

PUT vision-tb-1.cloudapp.net:8080/CCS/siemens 2.5.2 Hybrid Cloud /logicalContainer/newObject { "confidential" : "true" } Going further, we demonstrated in (Hohenstein et al., 2014) how to set up a hybrid cloud scenario. A hy- Figure 5: PUT request for storing confidential data. brid cloud uses both a public and a private cloud. The motivation for hybrid clouds is often to keep critical PUT requests are handled as follows by a new or privacy data on private servers. One reason might ShardService in the CCS. The metadata of the ob- be regulatory certifications or legal restrictions forc- ject is checked for an item indicating confidentiality such as confidential : true false. The approach bene- ing one to store material that is legally relevant or | subject to possible confiscation on premises. How- fits from the ability of the CCS to connect to multiple ever, non-critical data could be routed to more effi- object storage nodes at the same time using the basic cient cloud offerings from external providers, which federation component of VISION Cloud. The shard- might be cheaper and offer better extensibility. The ing is performed in the CCS based on metadata val- idea is to control the location of objects according to ues. According to the metadata of the container, the metadata. connection information is determined for both clouds. The federation again occurs at container level: A The ShardService decides to store newObject in the logical container can be split across physical public private cloud. and private cloud containers. Every logical container Every GET request to a federated container is sent has to know its physical locations. To this end, we to all clouds participating in the federation unless con- make the two cloud containers aware of each other by fidentiality is part of the query. The results of requests a PUT request with the payload of Figure 4, which are combined, and the result is sent back to the client. has to be sent to the federation service of vision-tb- Each federated cloud can be an access point to the 1. An analogous PUT request is implicitly sent to the federation, i.e., can accept requests. Hence, there is second cloud, however, with an ”inverted” payload. no additional interface to which object creation oper- ations and queries need to be submitted. A request https://vision-tb-1.myserver.net:8080/MyTenant/vision1 can be sent to any shard in any cloud. Access to the { "target_cloud_url" : "vision-tb-2.myserver.net", "target_cloud_port" : "8080", container can be via the public and private cloud store, "target_container_name" : "logicalContainer", accesses are delegated to the right location. "local_container_name" : "logicalContainer", The basic federation service of VISION Cloud (cf. "local_cloud_url" : "vision-tb-1.myserver.net", 2.5.1) is not used for the hybrid cloud setup. Instead, "local_cloud_port" : "8080", the service has been technically implemented in the "type" : "sharding", CCS that provides the content-centric storage func- "private_cloud" : "vision1", "public_cloud" : "vision2" } tions. In fact, in order to enable the hybrid cloud in the CCS, several extensions have been made to the Figure 4: Federation payload. CCS: A new ShardService has been added to CCS the task of which is to intercept requests to the CCS and This configuration contains information regard- decide where to forward the request. The ShardSer- ing the private and public cloud types, URLs, vice implements a reduced CDMI interface and plays users, authorization information etc. Due to the key role to shard in hybrid environments. private/public cloud, vision-tb-1 will be the pri- To sum up, the development of the VISION Cloud

80 A Multi-criteria Approach for Large-object Cloud Storage

project mainly provides important mechanisms that properties are for the whole CSs, not for a specific can be used as a base for federations: container or object. The payload is then automat- 1. Metadata concept. All objects are allowed to con- ically distributed to all the members of the federa- tain user-definable metadata entries. Those key- tion. At this point, the difference to the previously value pairs can be used in several ways to query described VISION Cloud federation approaches be- for objects inside the cloud storage system. A comes obvious: instead of federating containers, the schema can be employed for these metadata en- multi-criteria approach spans across several contain- tries to require the existence of certain metadata ers globally. The extension of the federation con- fields and hence to enforce a certain metadata cept intends to use different cloud storages in dif- schema. ferent provisioning modes. Referencing a classical example, data with higher demands in terms of pri- 2. Adapters for storage clouds. VISION Cloud in- vacy should be routed to locally hosted cloud storage cludes an additional layer that abstracts from the systems, while data subject to less critical could be underlying storage and thus makes it possible to stored in a public cloud offering. As another example, integrate cloud storage systems, e.g a blob storage cloud storage providers have special features such as service of some larger public cloud offering. reduced resilience. Storage with reduced redundancy can be used to store data that is temporarily required, in such a cheaper cloud storage offering. 3 MULTI-CRITERIA STORAGE Technically, the intended goal is to establish meta- data conventions that provide rules for processing ob- APPROACH ject or containers metadata in order to take routing decisions to the according storage. Thus, CCI opera- In the previous Section 2, we have considered an on- tions obtain a higher level of abstraction since hiding boarding and a hybrid cloud setup where the location the various underlying storage systems. The behavior of objects is determined according to metadata. Being of existing CCI operations is affected as follows: able to work with several cloud storage systems in a sharded manner offers further opportunities which are To Create a New Container. A PUT request can • explained in this section. be sent to the CCI of any CSi specifying the de- sired properties according to a criteria model (cf. 3.1 The Approach and Concept 3.2.1): The receiving CSi decides where to create the container in order to satisfy the specified prop- The goal is to extend the existing VISION Cloud fed- erties. Usually, there will be one CSi, however, eration approach to offer a more flexible solution for several CSs might be suitable and chosen for stor- a so named multi-criteria storage. The term multi- ing data, e.g., for sharding data, load balancing, or criteria refers to the ability of the approach to orga- achieving higher availability by redundancy. Any nize the storage in shards based on multiple criteria at CSi is able to handle requests by involving other the same time. CSs, if necessary. The container keeps metadata This work uses the concept of federation (Vernik about its properties and a mapping to its list of et al., 2013) to define a heterogeneous cloud setup in- further containers in other storage systems. volving different cloud storage systems with various To Create a New Object (in a Container). properties and benefits. The metadata processing fa- • Again, a PUT request can be sent to the CCI of cilities are used to let clients specify storage criteria any CSi. We decided to allow for specifying cri- to be satisfied. The metadata processing takes places teria not only for a container but also an object. in the Content Centric Service component of VISION Thus, the container criteria lead to a default stor- Cloud (Jaeger et al., 2012), using the management age location for all its objects, which can be over- models and their associations. ridden on a per-object basis. This principle sup- Let us assume several cloud storage systems (CSs) ports use cases where a ”logical” container should named CS1, CS2, CS3 etc. Each CSi possesses specific store images, however, some of them are confi- properties and advantages with regard to privacy, ac- dential (e.g., non-anonymous press images) while cess speed, availability SLAs and other characteristics others are non-confidential (e.g., anonymous ver- following a resource model (cf. 3.2.2). sions of the same press images), leading to differ- The definition of CS properties requires an admin- ent CSi. Some objects are rarely accessed while istrative PUT request to the federation admin service others are used frequently. Some of them are rel- of a VISION Cloud installation with a payload that evant on the long term while other images could describes the capabilities of any added CSi. These be easily recovered (i.e., thumbnails of original

81 DATA 2017 - 6th International Conference on Data Science, Technology and Applications

images) and thus do not require high reliability. eration of the cloud service. This structure will The CSi takes the same decisions as above to de- be utilized during resource provisioning and will termine the appropriate CS(s). keep the desired resources for meeting the con- To Retrieve a Particular Object in a Container. straints specified by the application. • If a client wants to get an object, each CS must be Criteria that can be specified for creating contain- able to retrieve the object even it is not available ers or objects (and which determines the placement in on its CS. In that case, the object’s metadata is not a storage system) are for instance: available with such a request. Therefore, every Confidentiality (private vs. public cloud, high- CSi must determine the possible object locations • from the globally propagated properties. Anyway, secure data center) querying the CS’s is performed in parallel. Storage cost (at certain levels) • Reliability Please note VISION cloud deployment sets up a • federation admin service/interface for each cloud stor- Access speed (fast disk access and low latency • age. In general, all different cloud storage nodes pro- due to geo-location) vide the same service interfaces. Load balancing or replication properties • By using these two types of models, a user of a 3.2 Requirements and Resource Model cloud storage system is able to define his strategy re- garding placement of data and criteria-based provi- One of the main benefits of a federation approach is sioning. There are important issues, being combin- that data can be placed optimally on different storage able and ranked with a percentage. systems according to user and system specified func- tional and non-functional requirements. In order to 3.2.2 Resource Model achieve this goal, properly capturing the application requirements is needed as well as an accurate descrip- The purpose of the resource model is to describe in a tion of the underlying physical resources. Therefore uniform way different features of storage systems that the VISION Cloud management models (Gogouvitis make up a federation. The resource models consists et al., 2012), requirements and resource model, come among others of properties such as: into play in federation scenarios. Public/private cloud • 3.2.1 Requirements Model Cloud provider and type (e.g., Amazon S3 Blob • Store) In order for an application to run effectively over a Price scheme for storage, basically per GB/month, cloud infrastructure, the customer should be able to • specify requirements, which will be used by the in- number of transactions, data transfer etc. frastructure to drive the data access operations of the Redundancy factor • application. To this end, a requirements model is Disk access properties (e.g., SSD) necessary to capture the requirements emerging from • Latency of data center for locations application attributes modeling and the ones deriv- • ing directly from the user needs. In addition, such a model defines structures to describe lower level re- quirements for the service offerings of the cloud as 4 TECHNICAL ASPECTS OF well as resource requirements that are used for the re- source provisioning. More specifically the following MAPPING APPROACH models have been developed: Indeed, the usage-related criteria must be mapped to User Requirements Model. This model captures the physical storage properties by relating the con- • the user requirements in a formalized way. They cepts of the requirements and resource models. Then, can be high level requirements characterizing the it is possible to map high-level requirements, de- application data or describing the needed storage scribed as metadata, to low level resource require- service and can be translated to low-level storage ments and to finally find storage systems that can ful- characteristics. Examples of parameters described fill the specified user criteria. are metrics like number of users, durability and As the general approach of VISION Cloud was availability requirements. to implement the CDMI, a user of the system should Resource Requirements Model. This model aims perform the configuration of the system using the • at specifying the resource requirements for the op- same REST/JSON-based approach as CDMI does. As

82 A Multi-criteria Approach for Large-object Cloud Storage

such, the already federation service provides some configuration calls that are extended by a PUT opera- tion to let an administrator submit configurations. All users of the interface can use generic REST clients as well as implement their own front end using these REST calls. Inside the federation service, one important task is to map the criteria specified by users to the technical parameters of storage systems in such a way that a storage system fitting best to the criteria will be found. To this end, we implemented an appropriate mapping approach. The basic idea is as follows: Each storage system is described by certain cat- • egories according to the resource model: pub- lic/private, a redundancy factor, its location, ac- Figure 7: Table MAP – mapping table. cess speed, latency etc. (cf. Figure 6). The pay- load of Figure 4 obtains the properties of every mending appropriate storage systems. To determine newly added CS. a ranking, we apply a schema that forms the basis If a user request to create a container or object • for a Multi-Attribute Utility Theory (MAUT). Using with associated metadata requirements arrives at this theory, resource requirements can be assessed and the CCS, the CCS asks the federation service for ranked according to the dimensions of a particular in- the federation information. The request contains terest. Dimensions of interest are in our case user re- metadata according to the requirements model – quirements such as reliability or cost. independently of the physical resource model of To set up a MAUT schema, the properties of the CSs. The federation returns the referring storage storage systems are assigned to the dimensions of in- location(s) accordingly. All the metadata is part terest in a first step ending up in a matrix. Each entry of a request, similar to Figure 5. of the matrix contains a value that defines the rele- vance for the related dimension by associating a cer- tain weight. The larger the value is, the more con- tributes the property to the dimension of interest. Fig- ure 7 presents a sample mapping matrix. The matrix should be understood as follows. The first line (’pub- lic=yes’) specifies that a public infrastructure as a re- source requirement contributes to Figure 6: Table RESOURCE – storage system properties. confidentiality (second column) with a weight of • The basis for recommending appropriate Storage 1, thus being quite low; Systems (CSs) for given user requirements is first a reliability with a medium weight of 5; list of storage systems (cf. Figure 6) with their prop- • cost savings with a high weight of 8 etc. erties. The task of producing a recommendation is of- • ten formulated in knowledge-based systems as a tuple Similarly, ’private=no’ contributes to confidential- (R,E). R corresponds to the set of user requirements ity with a weight of 9, to reliability with a weight of and E is the set of elements that form the knowledge 2, to cost savings with a weight of 2 etc. base. In our case, the elements E are the features of The redundancy factor has impact on the reliabil- the resource model, i.e., entries in Table 6. ity (increasing with the factor), to the costs (decreas- The solution for a task (R,E) is a set S E that ingly) etc., but not on confidentiality. has to satisfy the following condition: ⊆ A user U then formulates his requirements, which e S: e σ (E) represent his preferences in a request. Finding ade- ∀ i ∈ i ∈ (R) σ(R)(E) is a selection on those elements in E that quate storage systems implies that these requirements satisfy R. As an example, if the user requirements are have to be satisfied. Moreover, a user can rank each defined by the concrete set R = r : access speed of its features with a percentage, i.e., a user has the { 1 15; r2 : public = yes , then obviously only CS5 possibility to define a WEIGHT(U,d) for each dimen- satisfies≤ the requirement R.} sion d of interest in a request. For instance, a user U We now propose a procedure that allows recom- can specify a weight of 50% for the dimension ”Con-

83 DATA 2017 - 6th International Conference on Data Science, Technology and Applications

fidentiality” and weights of 30% for the dimension within the area of federated database management ”Reliability” and 20% for the dimension ”Cost Sav- systems (Sheth and Larson, 1990). Sheth and Larson ings”. The storage CSi with the highest overall value define a federated database system as a ”collection is suited best to satisfy the demands. of cooperating but autonomous component database In order to formalize the approach, we assume a systems” including a ”software that provides con- function MAP : Properties x Values x Dimensions trolled and coordinated manipulation of the compo- Int that represents the mapping table. For instance,→ nent database system”. (Sheth and Larson, 1990) de- MAP(public,y,Reliablity) refers to the value 5 in the scribes a five-layer reference architecture for feder- first line in Figure 7. ated database systems. According to the definition, Another function RESOURCE : the federated database layer sits on top of the con- StorageSystemsxProperties Values corresponds tributing component database systems. to Figure 6. → One possible characterization of federated sys- The following formula then computes the rele- tems can be done according to the dimensions of dis- vance of a storage system in a weighted manner using tribution, heterogeneity, and autonomy. One can also those functions: differentiate between tightly coupled systems (where Relevance(U,CSi) = Σd Dimensions (Σp Properties administrators create and maintain a federation in ad- MAP(p,RESOURCE(CS∈ , p),d) WEIGHT∈ (U,d)) vance) and loosely coupled systems (where users cre- i × Using this formula, we can calculate for each CSi ate and maintain a federation on the fly). a value for each dimension and obtain the resulting Based upon App Engine, Bunch et table in Figure 8 for our example. That is, CS2 hav- al. (Bunch et al., 2010) present a unified API to sev- ing a value of 10.4 is best suited for the given set of eral data stores of different open source distributed requirements whilst CS3 and CS5 with a value of 7.7 database technologies. Such a unified API repre- are the worst options. sents a fundamental building block for working with cloud storage as well as on-premises NoSQL database servers. However, the implementation provides ac- cess only to a single storage system at a time. Hence compared to our CCS solution, the main focus is on portability and not on a federated access. Redundant Array of Cloud Storage (RACS) is a cloud storage system proposed by Abu Libdeh et al. (Abu-Libdeh et al., 2010). RACS can be Figure 8: Sample calculation. seen as a proxy tier on top of several cloud stor- age providers, and offers a concurrent use of differ- Due to the logic applied to the generation of a ent storage providers or systems. Adapters for three recommendation, it might happen that a given set of different storage interfaces are discussed in the pa- requirements leads to an empty set of recommenda- per, however, the approach can be expanded to further storage interfaces. Using erasure coding and distribu- tions: σ(R)(E) = 0/. In such a case, it is not very help- ful to inform the user only about such a conflict, but tion, the contents of a single PUT request are split also to give him a hint about what requirement should into parts and distributed over the participating stor- be relaxed in order to obtain a recommendation. age providers similar to a RAID system. Any opera- A set of conflicts is a set SoC R such that tion has to wait until the slowest provider has com- ⊆ pleted the request. While this approach splits data σSoC(E) = 0/. SoC is maximal in the sense that no 6 ( ) = across storage systems, our approach routes a PUT other conflict SoC’ fulfilling σSoC0 E 0/ SoC’ SoC exists. A set of conflicts SoC refers6 ∧ to a cer-⊃ request to the best suited storage system. tain elements e E under consideration of a set R of Brantner et al. (Brantner et al., 2008) build a requirements. SoC∈ gives a concrete hint about what database system on top of Amazon’s S3 cloud stor- condition to relax. age with the intention to include support for multiple cloud storage providers in the future. In fact, Amazon S3 is also one of storage layer options that VISION supports. 5 RELATED WORK There is a lot of ongoing work in the area of multi- cloud APIs and libraries. Their goal is also to enable Even if cloud federation is a research topic, the ba- a unified access to multiple different cloud storage sic concepts and architectures of data storage feder- systems. Among them, Apache Libcloud (Libcloud, ations have already been discussed many years ago

84 A Multi-criteria Approach for Large-object Cloud Storage

2017), Smestorage (SmeStorage, 2017) and Delta- commercial federated and hybrid cloud storage solu- cloud (Deltacloud, 2017) should be mentioned. They tions do provide a range of offerings to satisfy vari- provide unified access to different storage systems, ous customers demands, but pose the risk of a vendor and protect the user from API changes. However, lock-in due to the use of own infrastructure. only basic CRUD methods are supported, mostly The MetaStorage (Bermbach et al., 2011) system lacking of query functionality. Moreover, they have seems to be most comparable to our approach since administration features such as stopping and running it is a federated cloud storage system that is able to storage instances. integrate different cloud storage providers. MetaStor- Further notable approaches can be found in the age uses a distributed hash table service to replicate area of Content Delivery Networks (CDN). A con- data over diverse storage services by providing a uni- tent delivery network consists of a network of servers fied view between the participating storage services around the world which maintain copies of the same, or nodes. merely static data. When a user accesses the stor- age, the CDN infrastructure delivers the website from the closest servers. However, Broberg et al. (Broberg 6 CONCLUSION AND FUTURE et al., 2009) state that storage providers have emerged recently as a genuine alternative to CDNs. Moreover, WORK they propose a system called Meta CDN that makes use of several cloud storage providers. Hence, it is In this paper, we presented a multi-level federation not really a storage system. As CDNs in general, approach for storing large objects like videos. The Meta CDN mostly focuses on read performance and approach is based upon the metadata idea of the VI- neglects write performance, too. Anyway, the system SION Cloud project. Our approach provides a uni- provides cheaper solution by using cloud storage. form interface for accessing data. The key idea is to There are also a couple of hybrid cloud solutions use metadata for controlling data placement behav- in the literature. Most of them focus on transferring ior at a higher semantic level by specifying require- data from private to public, not providing a unified ments instead of physical properties. As a technical view to hybrid storages. For instance, (Na- basis, we benefit from the VISION Cloud software suni, 2017) is a form of network attached storage, stack (VISION-Cloud, 2011) where such a metadata which moves the user’s on-premise data to a cloud concept is an integral part for handling storage en- storage provider. Hence, this hybrid cloud approach tities. We show in detail how well-suited VISION gather and encrypt the data from on-premise stor- Cloud and its storage system architecture is to sup- age, afterwards sending the encrypted data to a public port new scenarios such as using several storage tech- cloud at either Azure or Amazon Web Ser- nologies with different properties for different pur- vices. Furthermore, a user has the option to distribute poses in parallel, e.g., handling confidential and non- data over multiple stores. Compared to our multi- confidential data, the first kept in an on-premise data level sharding approach, Nasuni is a migration ap- store, the later stored in a public cloud. proach that eventually moved data to the public cloud. In this respect, the overall approach allows for Another example is (Nimbula, 2017), adding various further sharding strategies, such as which provides a service allowing the migration of region-based, load balancing, or storage space balanc- existing private cloud applications to the public cloud ing, redundancy level control etc. using an API that permits the management of all re- In our future work, we are evaluating the ap- sources. CloudSwitch (CloudSwitch, 2017) has also proach. Furthermore, several points have not been developed a hybrid cloud that allows an application to tackled so far and are subject to future work. For in- migrate its data to a public cloud. stance, changing or adding properties of a storage sys- A hybrid cloud option has been developed by Nir- tem might lead to a redistribution, maybe taking ben- vanix (Nirvanix, 2017), however, based upon propri- efit from on-board federation/migration. Similarly, etary Nirvanix products and thus limiting usage due adding a new storage system node also affects the dis- to a danger of vendor lock. Nirvanix offers a private tribution to storage systems. And finally, we think of cloud on premises to their customers, and enables data extending the approach to a self-learning system that transfer to the public Nirvanix Cloud Storage Net- uses information about the user’s satisfaction to adopt work. Other public cloud platforms than Nirvanix are the mapping table between users’ requirements and not supported. Due to our adapter approach, VISION the properties of the storage system. is not limited to a specific public cloud service. As a general observation that can be made, most

85 DATA 2017 - 6th International Conference on Data Science, Technology and Applications

ACKNOWLEDGEMENTS Hohenstein, U., Jaeger, M., Dippl, S., Bahar, E., Vernik, G., and Kolodner, E. (2014). An approach for hy- The research leading to the results presented in brid clouds using vision cloud federation. In 5th Int. Conf. on Cloud Computing, GRIDs, and Virtualization this paper has received funding from the European (Cloud Computing), Venice 2014, pages 100–107. Union’s Seventh Framework Programme (FP7 2007- Jaeger, M. C., Messina, A., Lorenz, M., Gogouvitis, S. V., 2013) Project VISION Cloud under grant agreement Kyriazis, D., Kolodner, E. K., Suk, X., and Bahar, number 217019. E. (2012). Cloud-based content centric storage for large systems. In Federated Conference on Computer Science and Information Systems - FedCSIS 2012, Wroclaw, Poland, 9-12 September 2012, Proceedings, REFERENCES pages 987–994. Kolodner, E. et al. (2011). A cloud environment for data- Abu-Libdeh, H., Princehouse, L., and Weatherspoon, H. intensive storage services. In CloudCom, pages 357– (2010). Racs: a case for cloud storage diversity. 366. In Proceedings of the 1st ACM symposium on Cloud Kolodner (2), E. et al. (2012). Data intensive storage ser- computing, SoCC ’10, pages 229–240, New York, NY, vices on clouds: Limitations, challenges and enablers. USA. ACM. In Petcu, D. and Vazquez-Poletti, J. L., editors, Euro- Bermbach, D., Klems, M., Tai, S., and Menzel, M. (2011). pean Research Activities in Cloud Computing, pages Metastorage: A federated cloud storage system to 68–96. Cambridge Scholars Publishing. manage consistency-latency tradeoffs. In Proceed- Libcloud (2017). Apache libcloud: a unified interface to ings of the 2011 IEEE 4th International Conference the cloud. At: http://libcloud.apache.org/. [retrieved: on Cloud Computing, CLOUD ’11, pages 452–459, March, 2017]. Washington, DC, USA. IEEE Computer Society. Mell, P. and Grance, T. (2011). The nist definition of Brantner, M., Florescu, D., Graf, D., Kossmann, D., and cloud computing (draft). NIST special publication, Kraska, T. (2008). Building a database on s3. In 800(145):7. Proceedings of the 2008 ACM SIGMOD international Nasuni (2017). Nasuni. At: http://www.nasuni.com/. [re- conference on Management of data, page 251. trieved: March, 2017]. Broberg, J., Buyya, R., and Tari, Z. (2009). Service-oriented Nimbula (2017). Nimbula. At: http://en.wikipedia. computing — icsoc 2008 workshops. chapter Creat- org/wiki/Nimbula. [retrieved: March, 2017]. ing a ‘Cloud Storage’ Mashup for High Performance, Nirvanix (2017). Nirvanix. At: http://www.nirvanix.com/ Low Cost Content Delivery, pages 178–183. Springer- products-services/cloudcomplete-hybrid-cloud- Verlag, Berlin, Heidelberg. storage/index.aspx. [retrieved: March, 2017]. Bunch, C. et al. (2010). An evaluation of distributed data- NoSQL (2017). Nosql databases. At: http://nosql- stores using the cloud platform. In Proceed- database.org. [retrieved: March, 2017]. ings of the 2010 IEEE 3rd International Conference Sandalage, P. and Fowler, M. (2012). Nosql distilled: a brief on Cloud Computing, CLOUD ’10, pages 305–312, guide to the emerging world of polyglot persistence. Washington, DC, USA. IEEE Computer Society. Pearson Education. CDMI (2010). Cloud data management interface version Sheth, A. and Larson, J. (1990). Federated database sys- 1.0. At: http://snia.cloudfour.com/sites/default/files tems for managing distributed, heterogeneous, and au- /CDMI SNIA Architecture v1.0.pdf. [retrieved: tonomous databases. ACM Computing Surveys, (22 March, 2017]. (3):183–236. CloudSwitch (2017). Cloudswitch. At: http://www. SmeStorage (2017). Smestorage. At: cloudswitch.com. [retrieved: March, 2017]. https://code.google.com/p/smestorage/. [retrieved: Deltacloud (2017). Deltacloud. At: http://deltacloud. March, 2017]. apache.org/. [retrieved: March, 2017]. Vernik, G. et al. (2013). Data on-boarding in federated Fielding, R. T. and Taylor, R. N. (2002). Principled design storage clouds. In Proceedings of the 2013 IEEE of the modern web architecture. ACM Transactions on Sixth International Conference on Cloud Computing, Technologies, 2(2):115–150. CLOUD ’13, pages 244–251, Washington, DC, USA. Fox, A. et al. (2009). Above the clouds: A berkeley view IEEE Computer Society. of cloud computing. Dept. Electrical Eng. and Com- VISION-Cloud (2011). Vision cloud project consor- put. Sciences, University of California, Berkeley, Rep. tium, high level architectural specification release 1.0, UCB/EECS, 28. vision cloud project deliverable d10.2, june 2011. At: http://www.visioncloud.com. [retrieved: March, Gogouvitis, S. V., Katsaros, G., Kyriazis, D., Voulodimos, 2017]. A., Talyansky, R., and Varvarigou, T. (2012). Re- trieving, storing, correlating and distributing infor- VISION-Cloud (2012). Vision cloud project consor- mation for cloud management. In Vanmechelen, K., tium: Data access layer: Design and open speci- Altmann, J., and Rana, O. F., editors, Economics of fication release 2.0, deliverable d30.3b, sept 2012. Grids, Clouds, Systems, and Services, volume 7714 of At: http://www.visioncloud.com/. [retrieved: March, Lecture Notes in Computer Science, pages 114–124. 2017].

86