CLOUD COMPUTING STORAGE

From POSIX to key-value stores

Table of Signatures

Name / Function / Date / Organisation

Written by: Jean-Paul Smets, CEO, 2011-11-16, Nexedi SA
Written by: Romain Courteaud, Project Manager, 2011-11-16, Nexedi SA
Approved by: Jean-Paul Smets, CEO, 2011-11-16, Nexedi SA
Submitted by: Jean-Paul Smets, CEO, 2011-11-16, Nexedi SA

Distribution List

Smets, J-P., Nexedi SA
Fermigier, S., Nuxeo SAS
Brettnacher, T., Nexedi SA
Laisne, J-P., Bull SAS
Courteaud, R., Nexedi SA

Modification History

Version / Date / Pages / Description
001 / 2010-10-09 / All / First draft
002 / 2010-10-10 / All /
003 / 2011-04-19 / All / First preliminary version, submitted for review
004 / 2011-11-16 / All / Release Candidate

Table of Contents

1 Introduction
1.1 Goals and scope
1.2 Plan
1.3 References
1.3.1.Applicable Documents
1.3.2.Reference Documents
1.4 Abbreviations

2 Executive Summary

3 Storage for the Cloud
3.1 The Evolution of the Storage Paradigm
3.2 Scientific background
3.3 Cloud Storage Use Cases

4 Key-Value Storage: Blocks or Blobs
4.1 Block Device
4.2 NoSQL key-value stores

5 Distributed file systems: Shared Directories and Files
5.1 From file system to key-value
5.2 From key-value to file system

6 NoSQL beyond Mere Storage
6.1 Key-value Store Limitations
6.1.1.Authentication
6.1.2.Encryption
6.1.3.Multiple namespaces
6.1.4.Indexing
6.2 NoSQL Document Stores
6.2.1.Querying through map-reduce code injection
6.2.2.Querying through indices
6.3 NoSQL Columnar Stores
6.4 NoSQL Graph Databases
6.4.1.Querying through graph crawling
6.4.2.Querying through semantic indices

7 Storage Allocation API
7.1 Existing Storage Allocation APIs
7.2 SlapOS: A Universal Allocation API for Storage
7.2.1.Allocating Storage with slapos request in python
7.2.2.Allocating Storage with slapos command line
7.2.3.Multitenancy
7.2.4.Automating Storage Scalability Testing
7.3 The SLAP REST API
7.3.1.Exchange format
7.3.2.Response status code
7.3.3.Instance Methods
7.3.4.Deleting an instance
7.3.5.Get instance information
7.3.6.Get instance authentication certificates
7.3.7.Bang instance
7.3.8.Modifying instance

8 Recommendations
8.1 The End of POSIX

1 Introduction

1.1 Goals and scope

This document is a preliminary report for Task 2 of Work Package 1 (IaaS) of the COMPATIBLE ONE research project. It provides an overview of current practices related to Cloud Computing storage and recommendations for further research.

1.2 Plan

This document is composed of 8 chapters:

• Chapter 1 defines goals, scope, plan, referenced documents and abbreviations;

• Chapter 2 is an executive summary;

• Chapter 3 introduces Cloud storage evolution and scientific background;

• Chapter 4 provides an overview of key-value stores;

• Chapter 5 provides an overview of distributed file systems;

• Chapter 6 introduces desirable features for Cloud storage which go beyond mere storage;

• Chapter 7 proposes a universal API to allocate Cloud Storage;

• Chapter 8 proposes recommendations for further Cloud R&D.

1.3 References

This section describes applicable documents and reference documents.

1.3.1.Applicable Documents

ID / Title / Reference No / Version/Rev.

AD-1 / 12ème Appel à projets du FUI (Fonds Unique Interministériel), http://www.systematic-paris-region.org/fr/les-projets/appels-a-projets/appels-a-projets-du-fui / N/A / N/A

1.3.2.Reference Documents

ID / Title / Reference No / Version/Rev.

All reference documents below have no reference number and no version (N/A).

RD-1 QEMU-RBD. http://ceph.newdream.net/wiki/QEMU-RBD
RD-2 CEPH. http://ceph.newdream.net/ and http://ceph.newdream.net/publications/
RD-3 Sheepdog. http://www.osrg.net/sheepdog/
RD-4 Sage A. Weil. Ceph: Reliable, Scalable, and High-Performance Distributed Storage. Ph.D. thesis, University of California, Santa Cruz, December 2007.
RD-5 P2P-like Tahoe file system offers secure storage in the cloud. http://arstechnica.com/open-source/news/2009/08/p2p-like-tahoe-filesystem-offers-secure-storage-in-the-cloud.ars
RD-6 Brian Warner, Zooko Wilcox-O'Hearn, Rob Kinninmont. Tahoe: A Secure Distributed File System. http://tahoe-lafs.org/~warner/pycon-tahoe.html
RD-7 TAHOE LAFS. http://tahoe-lafs.org/trac/tahoe-lafs
RD-8 Zooko's Hack Log. http://insecure.tahoe-lafs.org/uri/URI:DIR2-RO:ixqhc4kdbjxc7o65xjnveoewym:5x6lwoxghrd5rxhwunzavft2qygfkt27oj3fbxlq4c6p45z5uneq/blog.html
RD-9 Memcachefs. http://memcachefs.sourceforge.net/
RD-10 Moxi. http://code.google.com/p/moxi/
RD-11 Moxi, a memcached proxy. http://www.slideshare.net/northscale/moxi-memcached-proxy
RD-12 Alexandre Lissy, Arnaud Laprévote. Espace de stockage avec XtreemFS.
RD-13 Memagent. http://code.google.com/p/memagent/
RD-14 Linux Block Devices. http://www.kernel.org/doc/htmldocs/kernel-api/blkdev.html
RD-15 Membase. http://www.membase.org/
RD-16 An introduction to block device drivers. http://www.linuxjournal.com/article/2890?page=0,0
RD-17 Caching and Performance Lessons from Facebook. http://www.scribd.com/doc/4069180/Caching-Performance-Lessons-from-Facebook
RD-18 BUP. https://github.com/apenwarr/bup
RD-19 Using MySQL as a NoSQL: A story for exceeding 750,000 qps on a commodity server. http://yoshinorimatsunobu.blogspot.com/2010/10/using-mysql-as-nosql-story-for.html
RD-20 UnQL Specification home. http://www.unqlspec.org/
RD-21 Brewer's CAP Theorem. http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
RD-22 Object-relational impedance mismatch. http://en.wikipedia.org/wiki/Object-relational_impedance_mismatch
RD-23 The ORM Swarm. http://blog.wekeroad.com/2007/06/06/the-orm-swarm
RD-24 The Vietnam of Computer Science. http://blogs.tedneward.com/2006/06/26/The+Vietnam+Of+Computer+Science.aspx
RD-25 Issues and Evaluations of Caching Solutions for Web Application Acceleration. Wen-Syan Li, Wang-Pin Hsiung, Dmitri V. Kalashnikov, Radu Sion, Oliver Po, Divyakant Agrawal, K. Selçuk Candan. Proceedings of the 28th International Conference on Very Large Data Bases, edited by Philip A. Bernstein.
RD-26 XtreemFS Roadmap. http://www.xtreemfs.org/roadmap.php
RD-27 NEOPPOD: Distributed Transactional NoSQL for the Cloud. http://www.neoppod.org/
RD-28 Achieving Business Agility with Search-Based Technologies. Exalead Whitepaper.
RD-29 ERP5. www.erp5.com
RD-30 Libérez vos bases de données ! http://blog.exalead.fr/2008/06/02/liberez-vos-bases-de-donnees/
RD-31 Linux Kernel API, Chapter 14. Block Devices. http://www.kernel.org/doc/htmldocs/kernel-api/blkdev.html
RD-32 Linux RAID. https://raid.wiki.kernel.org/
RD-33 Storage I/O benchmark. http://www.niftyname.org/2009/05/storage-io-benchmark/
RD-34 Scaling memcached at Facebook. http://www.facebook.com/note.php?note_id=39391378919
RD-35 NoSQL to InnoDB with Memcached. http://blogs.innodb.com/wp/2011/04/nosql-to-innodb-with-memcached/
RD-36 High Performance Zope with ERP5. http://www.erp5.com/enterprise-High.Performance.Zope/view
RD-37 Libcloud. http://libcloud.apache.org/supported_providers.html

1.4 Abbreviations

This section defines abbreviations used in the rest of the document.

ID / Abbreviation / Title / Description

AB-1 ERP: Enterprise Resource Planning. An enterprise application which combines all business processes related to accounting, inventory or human resources within the same consistent environment.

AB-2 SaaS: Software as a Service. A service consisting of providing enterprise applications to users through a Web-based registration and, usually, through a Web-based user interface.

AB-3 PaaS: Platform as a Service. A service consisting of providing a development environment to developers through a Web-based registration and, usually, through a Web-based user interface.

2 Executive Summary

Storage systems for Cloud Computing are based on the concept of a distributed key-value store running on top of a POSIX file system and standard hardware such as Solid State Disks. Key-value stores are used directly by applications such as web application servers, server virtualization software, content indexing engines and POSIX wrappers.

Figure 1. Typical cloud storage stack: application server or qemu, on top of a distributed key-value store (NoSQL), on top of POSIX, on top of the hardware block device.

The performance of modern key-value stores has reached levels which many SANs and distributed file systems are unable to match.

Figure 2. Kumofs (memcached based) performance

We recommend memcached as the standard protocol for key-value stores, because of its simple design focused on performance and because of the numerous implementations which exist to cover different types of use case. We recommend developing Cloud storage proxies to access memcached backends through other protocols such as S3 and through REST transport.
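The memcached text protocol itself is small enough to be spoken over a raw socket. The minimal Python sketch below stores and reads back one value; the server address is an assumption (any memcached-compatible backend listening on 127.0.0.1:11211 would do), and a real client would loop on recv instead of reading once.

import socket

def memcached_set(sock, key, value):
    # "set <key> <flags> <exptime> <bytes>\r\n<data>\r\n" answers b"STORED" on success
    payload = value.encode("utf-8")
    sock.sendall(b"set %s 0 0 %d\r\n%s\r\n" % (key.encode(), len(payload), payload))
    return sock.recv(1024).strip()

def memcached_get(sock, key):
    # "get <key>\r\n" answers "VALUE <key> <flags> <bytes>\r\n<data>\r\nEND\r\n"
    sock.sendall(b"get %s\r\n" % key.encode())
    reply = sock.recv(65536)
    if reply.startswith(b"END"):
        return None              # key not found
    header, rest = reply.split(b"\r\n", 1)
    size = int(header.split()[3])
    return rest[:size].decode("utf-8")

# Assumes a memcached-compatible server on 127.0.0.1:11211.
sock = socket.create_connection(("127.0.0.1", 11211))
print(memcached_set(sock, "page:42", "rendered fragment"))
print(memcached_get(sock, "page:42"))
sock.close()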

The approach consisting of turning a key-value store into a virtual block device by extending the qemu source code seems very efficient. It is used by both the sheepdog and ceph projects. It does not require any change to the guest operating system, yet brings the high performance of the underlying sheepdog or ceph key-value store to qemu. The same approach could be applied to turn any memcached backend into a virtual block device. We recommend embracing, contributing to and extending qemu to promote a standard, extensible virtual block device API capable of leveraging any new key-value store.

The fantastic performance and flexibility of distributed key-value stores make distributed file systems mostly irrelevant for Cloud applications besides desktop use cases. We thus recommend restricting the study of distributed file systems to desktop backup applications only, and in particular taking a close look at TAHOE. We consider that, besides desktop backup, it is a waste of time to try to develop distributed file systems for the Cloud when key-value stores have become the modern way to reach much better results. Extending existing backup tools to support a variety of key-value stores, as duplicity does with TAHOE, has in our opinion a much higher value for the market than trying to develop POSIX wrappers.

Educating developers to stop thinking in terms of file systems in order to design scalable applications on the Cloud should be the ultimate priority. Providing a simple and uniform way to allocate key-value stores on standard GNU/Linux, in a way which feels as intuitive as creating LVM partitions, could help accelerate this adoption. This is what the SLAP command line utilities actually do. We thus recommend embedding SLAP libraries into standard GNU/Linux operating systems and defining common terms for the most common key-value stores.

3 Storage for the Cloud

Cloud computing is often approached from a computation power point of view. Infrastructure as a Service (IaaS) and server virtualization are presented as a way to quickly allocate any number of CPUs to process a complex calculation in a shorter time and later release those CPUs whenever the calculation is finished. A typical example of this approach consists of renting one thousand virtual servers for one night to process one million scanned documents instead of purchasing ten servers and waiting 3 months. This example however does not show how Cloud is really different from grid computing.

Software as a Service (SaaS) is presented as a way to automatically instantiate an application and provide access to it through the Web, without having to install any software. Platform as a Service (PaaS) is either presented as a kind of SaaS for Integrated Development Environments (IDEs) or as a way to customize existing SaaS through Web-based development. For both SaaS and PaaS, what people tend to focus on is the front-end Web application server which generates web pages for thousands of users and thus needs to provide elastic scalability in one way or another for CPU-intensive tasks. However, besides the fact that applications are Web based, this presentation does not differ from Application Service Providing (ASP), a concept which has existed for 15 years.

What most presentations of Cloud computing tend to omit is the specific role of storage. Unlike grid computing, whose main focus for more than ten years has been distributing scientific calculations through so-called grid communities such as boinc, condor, Xtrem Web, Bonjour GRID, etc., Cloud Computing can be characterized by the wide adoption of various forms of high performance distributed virtual storage. Unlike ASP, which is based on traditional local file systems and relational databases, Cloud Computing introduces highly available and elastic storage technologies which remove any limits related to local hardware. The pioneers of Cloud Computing storage are GoogleFS (used by the Google search engine), MogileFS (used by the LiveJournal blog), Amazon S3 (used for backup of persistent data of EC2 machines), Amazon Dynamo (which powers parts of other Amazon Web Services), memcached (also used by the LiveJournal blog) and NEO (used by ERP5). They usually introduce a simple key-value approach to store data instead of implementing complex data structures such as POSIX hierarchical directories of files or relational structures. They often split indexing from storage so that the storage subsystem does not need to implement a complex query language such as SQL, XQuery, etc. and so that storage write performance is not hindered by index maintenance costs.

3.1 The Evolution of the Storage Paradigm

Just 10 years ago, the typical model everyone had in mind for storage could be represented this way:

Figure 3. Storage in the 90s: the application accesses POSIX or SQL, on top of the hardware block device.

Applications are developed to access files in directories or lines in tables through a specific query language or API. POSIX can be considered as the query language to list files and directories in a file system. SQL can be considered as the query language to list lines in tables.

At that time, software which was not developed according to this model was considered unprofessional if not weird. The author of this report is also the author of the open source ERP5, which is based on NoSQL transactional storage of business data and persistent storage of source code in a NoSQL database rather than on a file system. Until very recently, ERP5 was criticized by some system integrators for its NoSQL, non file system approach. Now it is considered state of the art, but many system integrators and developers still do not understand why this approach provides high performance and why the use of file systems and relational databases to store data on the Cloud cannot. We hope through this report to explain how storage architectures have evolved and to help developers understand how they can benefit from better performance and scalability by abandoning file systems and relational databases for storing data.

Today's state of the art is defined by the kind of architectures which had to be developed to power search engines, blogs, social networks and multi-tenant business applications. Such architectures are illustrated below:

Figure 4. Storage in 2000: application server or qemu, on top of a distributed key-value store (NoSQL), on top of POSIX, on top of the hardware block device.

A distributed key-value store engine is implemented on top of the file systems of a large number of cheap personal computers. Applications are then developed to directly access arbitrary values (in size and type) through long or arbitrary keys. Distributed key-value stores play the role of a kind of infinite, persistent and ubiquitous memory. Distributed key-value stores can be implemented in different ways, mostly through sharding and aggregation of local key-value stores or through distributed hash tables (DHT). Because the keys and values of key-value stores are mostly arbitrary, they are much more convenient to use than a block device API would be to store data. And because they provide a low-level approach to storage, performance can be much higher than accessing a file system or a database through some kind of query language. As an example, removing the SQL interpretation in MySQL and accessing data directly through its internal API can provide a 10 to 100 times performance increase [RD-19]. Also, since key-value store APIs are very similar to a block device API, it is possible to implement a device driver within a virtual machine such as qemu and access a key-value store as if it were a hard disk. This is what both sheepdog and ceph do.

Legacy applications can benefit from the approach of distributed key-value stores.

Figure 5. Legacy applications with modern key-value stores: the legacy application uses POSIX or another query language implemented over a distributed key-value store (NoSQL), itself running on POSIX and the hardware block device.

A POSIX file and directory querying API can be implemented over a key-value store. This is what ceph does to provide a distributed POSIX-compliant file system (in addition to providing a distributed block device). In theory, any query language or rich database could be implemented on top of a distributed key-value store. Some rich NoSQL databases (ex. RD-20) follow this approach to provide indexing in addition to data storage.

Complex web sites use a combination of all the approaches which we described: local SQL database, direct access to a key-value store, distributed file system, rich NoSQL database.

Figure 6. Combining all paradigms in large Web applications: complex web sites combine POSIX, other query languages and SQL with direct access to a distributed key-value store (NoSQL), all layered over POSIX and the hardware block device.

3.2 Scientific background

Three scientific results should guide us when it comes to analyzing storage: Brewer's CAP Theorem [RD-21], the storage bottleneck principle [RD-25] and the failure of object-relational mappers [RD-22, RD-23, RD-24].

Brewer's CAP Theorem explains that one has to choose at most two from Consistency (C), Availability (A) and Partitioning (P), thus the name CAP. In other terms, the Cassandra NoSQL database which powers systems such as Facebook, or the memcached NoSQL database behind LiveJournal, provide high availability and partitioning by not being always consistent. Because it did not feel so good to claim that a NoSQL database is "sometimes inconsistent", marketing found a better term: "eventually consistent". Whenever some consistency is required, it is often implemented at the expense of partitioning, since high availability is always a must for Web and business applications. In the xtreemfs distributed file system, the catalogs (MRC, DIR) are replicated but still centralized [RD-26]. In the NEO distributed transactional NoSQL database, the transaction ID generator is redundant yet centralized [RD-27].

The storage bottleneck principle shows that in any large scale transactional business application, the transactional database is always the bottleneck, whatever the application server chosen [RD-25]: Erlang OTP, Tomcat, Zope, Django, etc. It is thus useless to increase the number of application server nodes or the number of virtual servers if database storage access is the bottleneck. This bottleneck can however be quite wide: a site such as mediawiki is based on a well optimized LAMP stack, with perfectly crafted MySQL indices, and does not need any complex distributed database.

The scientific impossibility of perfect object-relational mapping has been well known since the early days of object languages in the 80s. It has been rediscovered lately [RD-22, RD-23, RD-24] and discussed extensively. Yet, after many discussions and proofs, people tend to forget it. Most modern web systems depart from object-relational mapping by storing objects directly in a serialized way. This is the case of Zope, an application server with built-in persistent storage created in 1998. Zope's ideas were inspired by the Smalltalk RAM-based image. Ideas similar to Zope's have been adopted in the Java community through Struts or Hibernate. However, the core idea that ORMs cannot be efficient was not understood in Hibernate. Hibernate tried, once more, to solve the scientific impossibility of perfect object-relational mapping and, once more, failed in some cases. Unlike Hibernate, recent trends in Web development tend to store data as persistent Javascript structures (JSON). Such an approach can be considered as a kind of reincarnation of the original 70s Smalltalk approach of storing persistent structures of a dynamic object language as a serialized string, rather than trying to map them to tables. It is very similar to Zope's way in 1998.

It is interesting to understand the importance of the CAP theorem. For example, any NoSQL database which tries to provide a rich feature set by combining storage and consistent indexing will simply not scale. Indices, to be consistent, require in one way or another a kind of centralization of data. Partitioning is thus lost, making write-intensive systems impossible to scale up. However, if indexing is provided in a "sometimes inconsistent" way, through asynchronism or index distribution, scalability becomes possible again. Object-Relational Mappers (ORM), which use transactional relational databases and tables with lots of index keys on columns, will thus mostly lead to business applications which are unable to scale up. This would obviously not be true if database indices were updated asynchronously, but we are not aware of any such database. This explains why search-based architectures, pioneered by Exalead [RD-28] or ERP5 [RD-29], which combine a storage-only database for persistent objects and an index-only database for querying and reporting in an asynchronous way, have become increasingly popular to develop large scale cloud applications [RD-30].

3.3 Cloud Storage Use Cases

Cloud Storage is used either to provide persistence to virtual machines in Infrastructure as a Service (IaaS) environments, to provide persistence to application servers in a Platform as a Service (PaaS) environment, or to provide a backup infrastructure.

In the IaaS use case, the virtual machine boots from a kernel file and an initrd file stored on a local storage device or downloaded from the network by the bootloader and copied into RAM. The initrd file is then loop mounted. The initrd shell scripts then mount the file system of the operating system. At this stage, two approaches are usually considered: one based on network block devices and the other based on a network file system. In the first approach, in which the initrd mounts a remote block device, cloud storage acts as a virtual hard disk and the Linux kernel of the virtual machine implements the mapping between blocks and files through the choice of any file system type such as ext4, btrfs, xfs, etc. In the second approach, in which the initrd mounts a network file system, the Linux kernel of the virtual machine does not handle the mapping between blocks and files. Instead, it relies on a file system module of the Linux kernel to access files. The first approach is usually preferred for performance, in particular for databases. The second approach is usually preferred to simplify file distribution and sharing in legacy environments.

In the PaaS use case, an application server needs to store binary large object (BLOB) data in different formats (JSON, XML, pickle, etc.) into some distributed and redundant network database by associating the BLOB value to a key (ex. date, integer, ID, etc.). This type of database is sometimes called a key-value store or NoSQL database. The PaaS use case is very similar to the first approach of IaaS, except that block size is fixed whereas blob size is more flexible.

In the backup use case, a distributed file system is implemented to provide a resilient and worldwide backup area for the files of a server or of a personal computer. Dropbox is an example of this third use case. In this use case, security and resilience are more important than performance. For desktop applications, presenting a standard hierarchical file system interface is a must. For server applications, command line tools might be sufficient.

We are going to study each use case: first by studying the characteristics of key-value stores (block storage and blob storage), and second by studying the different types of distributed file systems.

4 Key-Value Storage: Blocks or Blobs

The most common and simple case of storage for the Cloud is based on the idea of associating an arbitrary string value to an arbitrary key. The string value is supposed to be quite long and the key quite short. Depending on the nature of the storage, the key can be constrained in size (ex. less than 16 characters) or type (ex. 32-bit integer). The string value can also be constrained in size (ex. fixed size of 512 bytes) or type (ex. only ASCII). The example below shows two key-value associations: one between a 32-bit integer and an arbitrarily long UTF-8 string, the second between an ASCII string and a 16-byte long block.

3843623 → "A long string with accents such as 'é' and asian letters 本日"
'abs236-22' → 0xFD23A2B1FD23A2B1FD23A2B1FD23A2B1

The table below provides a list of examples of key-value stores, with their different properties. Although our focus is on bare key-value stores, we have also included some rich NoSQL storage in the list for completeness.

Software / Key (size, type, multi) / Value (size, type, index) / FLOSS / API / Transport / Main Purpose

memcached: key 65,536, binary, no multi; value 4 GB, binary, no index; FLOSS yes; API memcached; transport TCP/IP; distributed cache.
flare: key 65,536, binary, no multi; value 4 GB, binary, no index; FLOSS yes; API memcached; transport TCP/IP; persistent cache.
kumofs: key 65,536, binary, no multi; value 4 GB, binary, no index; FLOSS yes; API memcached; transport TCP/IP; distributed persistent cache, distributed NoSQL store.
InnoDB Memcached: key 65,536, binary, no multi; value 4 GB, binary, no index; FLOSS yes; API memcached; transport TCP/IP; persistent cache, NoSQL store.
memcachedb: key 65,536, binary, no multi; value 4 GB, binary, no index; FLOSS yes; API memcached; transport TCP/IP; persistent cache.
membase: key 65,536, binary, no multi; value 4 GB, binary, no index; FLOSS yes; API memcached; transport TCP/IP; distributed persistent cache.
Voldemort: no index; FLOSS yes; transport TCP/IP.
HandlerSocket: no index; FLOSS yes; API HandlerSocket; transport TCP/IP; persistent cache, NoSQL store.
MogileFS: index yes; FLOSS yes; API MogileFS; transport TCP/IP; NoSQL store.
NEO: no index; FLOSS yes; API ZODB; transport TCP/IP; distributed transactional NoSQL store.
ZEO: no index; FLOSS yes; API ZODB; transport TCP/IP; transactional NoSQL store.
RelStorage: no index; FLOSS yes; API ZODB; transport TCP/IP; transactional NoSQL store.
SSD: key fixed, binary; value fixed, binary, no index; FLOSS yes; API Linux Block Device; transport SATA; local block store.
SAN: key fixed, binary; value fixed, binary, no index; FLOSS no; API Linux Block Device; transport Fibre Channel; remote block store.
SAN: key fixed, binary; value fixed, binary, no index; FLOSS no; API Linux Block Device; transport iSCSI; remote block store.
nbd: key fixed; value fixed, binary, no index; FLOSS yes; API Linux Block Device; transport TCP/IP; network block store.
drbd: key fixed; value fixed, binary, no index; FLOSS yes; API Linux Block Device; transport TCP/IP; network redundant block store.
sheepdog: key fixed; value fixed, binary, no index; FLOSS yes; API qemu Block Device; transport TCP/IP; distributed block store.
CEPH qemu-rbd: (to be completed).
Ceph radosgw: (to be completed).
Amazon S3: no index; FLOSS no; API S3; transport TCP/IP; distributed NoSQL store.
Monstore: no index; FLOSS yes; API S3; transport TCP/IP; distributed NoSQL S3 store.
Riak: (to be completed).
CouchDB: FLOSS yes; (to be completed).
MongoDB: FLOSS yes; (to be completed).
Reddis: FLOSS yes; (to be completed).
Cassandra: FLOSS yes; (to be completed).

Table 1. Comparison table of key-value stores (XXX to be finalized)

Key-value stores in which the key size is constrained in range and the value size is fixed are sometimes called block storage. Key-value stores in which the value size is undetermined and the key mostly arbitrary are sometimes called NoSQL databases, and the value is called a Binary Large Object, also known as a blob.

Key-value stores implement 3 methods: load (always), store (always) and delete (often).

The load method tries to access the value v for key k and returns it. If no value exists, an exception can be raised or a NULL value is returned.

load(3843623) returns "A long string with accents such as 'é' and asian letters 本日"

The store method stores the value v with key k so that a future load can access it.

store(3843623, "hello")
load(3843623) returns "hello"

The delete method deletes any value stored for key k.

delete(3843623)
load(3843623) returns NULL or raises an exception

The delete method is not always implemented. Caching systems may use garbage collection to delete values implicitly after some time and thus prevent storage exhaustion. Storage systems which define a limited range for possible values of key k do not bear the risk of storage exhaustion and thus may not need to provide a delete method.
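As an illustration only, the three methods map onto a few lines of Python. The in-memory sketch below is not a real store, but it captures the whole contract:

class KeyValueStore:
    """Minimal in-memory sketch of the load/store/delete contract."""

    def __init__(self):
        self._data = {}

    def store(self, key, value):
        self._data[key] = value

    def load(self, key):
        # Return None for a missing key; a real store may raise instead.
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

kv = KeyValueStore()
kv.store(3843623, "hello")
assert kv.load(3843623) == "hello"
kv.delete(3843623)
assert kv.load(3843623) is None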

4.1 Linux Block Device

Block devices provide elementary access to persistent storage from the Linux kernel. Hard disks or RAM disks are all accessible through the Block Device API [RD-31].

Many modules following the Linux Block Device specification were added to the Linux kernel and provide access to a wide variety of virtual storage devices:

• network block device (nbd) is a simple way to provide access to a block device through a TCP/IP connection;

• md [RD-32] provides a software implementation of redundant local storage, a.k.a. RAID;

• drbd provides a simple way to mirror two block devices across the network;

• iscsi provides access to SAN storage through TCP/IP;

• fcoe provides access to SAN storage through Ethernet.

Until recently, the only available options to implement block devices for Cloud applications were to purchase a Storage Area Network (SAN) and attach it to virtual machines through iSCSI or Fibre Channel protocols, or to use a combination of nbd, md and drbd with standard hardware.

The combination of nbd, md and drbd does in open source something similar to what a SAN does with proprietary technology. Data redundancy on a single piece of hardware is provided by md. Network access to blocks is provided by nbd. Network mirroring between two or three similar machines is provided by drbd. Some companies tried to replace nbd with the iscsi or fcoe Linux kernel modules but performance was not as high [RD-33].

Nowadays, more options are available: sheepdog, qemu-rbd and proxying.

Sheepdog implements an API similar to the Linux Block Device but directly within qemu, thus eliminating the need to provide modules to the guest operating system and reducing the number of memory copies required to access blocks. Sheepdog provides a simple way to implement elastic, redundant and scalable block storage for virtual machines. It has been tested by NTT labs on a cluster of 100 servers. Qemu-rbd follows a similar approach using the ceph rados key-value store.

Proxying consists of mapping one protocol (ex. memcached) to an interface (ex. Linux Block Device, qemu internal block device API). Cleverscale, a startup company which specializes in Cloud storage, implements a proprietary universal proxying technology which combines any number of sources of storage with different latency and availability characteristics to produce a virtual storage accessible through a single protocol and with better latency and availability characteristics.

4.2 NoSQL key-value stores

NoSQL key-value stores are used by application servers as a way to cache data temporarily or to store it permanently. Memcached and MogileFS are good pioneering examples of the kind of ideas which led to replacing POSIX or SQL with a simple key-value protocol.

Memcached was created by Danga for LiveJournal to keep the results of partial web page calculations in RAM and make sure at the same time that a given calculation made on one node of a cluster does not need to be computed again by other nodes of the same cluster. Memcached is often presented as a kind of shared, distributed RAM. It solves a well known problem in the scalability of distributed Web applications and is thus used by most large audience web sites, including Facebook [RD-34].

The memcached protocol has evolved into a de-facto standard for caching and storing values attached to a key. Flare, KumoFS and membase are three examples of persistent implementations of memcached. NEO, the transactional distributed NoSQL database of the Systematic R&D Cluster, is also developing a memcached wrapper. MySQL itself provides a memcached front end which eliminates any SQL query processing and is thus capable of delivering unprecedented performance [RD-35, RD-19].

Other key-value store protocols are similar to memcached yet add a bit of complexity or a lot of latency. The S3 protocol by Amazon is also a key-value store protocol, just like Swift (OpenStack) and Dynamo (Amazon). However, none seem to reach the performance level of memcached implementations such as kumofs or membase.

5 Distributed file systems: Shared Directories and Files

In the early days of Cloud Computing, network file systems were used to distribute files across virtual machines and provide virtual storage. This approach was strongly supported by Enomaly, a company which contributed to the early implementation of Amazon EC2 and used to provide an open source Cloud Computing stack. During Cloud Camp at CONSEGIS 2010, Reuven Cohen explained that there was no better way than a distributed file system to serve files to virtual machines.

Traces of this approach can still be found in the OpenNebula implementation. However, because of the growing acceptance of NoSQL storage for application servers and because of the poor performance of network file systems for relational databases, network file systems are slowly being abandoned. Applications which need fast and scalable access to data tend to use a NoSQL database for this purpose. Applications which need high performance access to SQL databases tend to prefer block device storage, if not bare metal, and replication.

This leaves mostly no room for distributed file systems besides legacy applications, small data size applications or slow speed backup applications. A list of distributed file systems suitable for Cloud applications is provided below:

Software / Implementation / POSIX / Transport / Scope / Main Purpose

Lustre: POSIX yes; transport TCP/IP; scope LAN; scientific calculation.
Apache Hadoop HDFS: FUSE; transport TCP/IP.
CloudStore: (to be completed).
GfarmFS: scientific calculation.
GPFS: POSIX yes.
GlusterFS: POSIX yes.
POHMELFS: (to be completed).
XTreemFS: FUSE; POSIX yes; transport TCP/IP; scope WAN; world file system.
CEPH: kernel module and FUSE; POSIX yes; transport TCP/IP; scope WAN.
Tahoe: FUSE; POSIX yes; transport TCP/IP; scope WAN; secure and resilient world file system.

Table 2. Comparison table of distributed file systems

We are aware that LustreFS is still used for legacy scientific applications. We are also aware that companies which need to provide legacy file sharing through, for example, CIFS find in distributed file systems a practical way to implement worldwide file sharing. However, both applications are outside the scope of any modern Cloud Computing system. The first relates to scientific computing on local area networks and the second to enterprise computing on trusted networks.

One of the few remaining applications of distributed file systems in Cloud Computing environments was to share files across the different virtual machines of a cluster. Such files could be PHP files, python files, perl files, jar files, configuration files of daemons, etc. However, different system configuration or software distribution technologies are increasingly used instead for this purpose of file sharing. This includes Mandriva Pulse, AdminKit by Frédéric Lepied or buildout profiles in SlapOS. Pulse can for example update the configuration files of tens of thousands of servers automatically and reliably. AdminKit by Frédéric Lepied uses a shared git repository to update configuration files through HTTP. SlapOS automates all the configuration options of a distributed cluster through a persistent database.

Another common application of distributed file systems was to store temporary calculations or binary files such as images. Early implementations of photo albums used file systems to store images. However, with millions of uploads per day, a key-value store based on a distributed hash table is much more efficient, flexible, reliable and scalable.

The complexity of the POSIX overlay which is required in a distributed file system simply prevents considering the file system paradigm any longer as a relevant option for sharing data on the Cloud when distributed key-value stores do it much more efficiently.

This analysis may sound shocking. It is not new. MogileFS, whose acronym may sound like "file system", is in reality a key-value store combined with a distributed catalog. It eliminates the need to implement POSIX by providing instead a simpler, non-standard protocol. Just like memcached, it was created by the pioneering folks of Danga for LiveJournal and it is still a great source of inspiration for modern ways of sharing data across storage nodes on the Cloud.

5.1 From file system to key-value

Key-value store protocols may seem too simple for users who are used to the POSIX API or who expect object data to always be stored in tables and columns, even though it is now well known that this does not scale. Changing habits is difficult. Few people accept the idea that POSIX and SQL are just different querying languages which do not differ in their purpose from one another, and which both need to be implemented on top of the same family of data structures using a kind of btree or hash table.

It is beyond the scope of this document to describe search-based architectures and how, by splitting mere storage and relational (or any other form of) indexing of arbitrary data, application scalability can be greatly extended [RD-36]. Few people anyway consider relational databases as part of the "infrastructure layer" of Cloud, which this report is about.

However, most people believe that distributed file systems are part of the "infrastructure layer" of Cloud. It is in our opinion not only nonsense but also a dangerous idea. The reason why it is nonsense is that a file system is just a way to index the values of a key-value store with a hierarchical structure and some metadata. To be completely convinced, consider the following lines of code:

ls person/*
SELECT * FROM person

Both lines implement the same idea of querying a list of objects, one using POSIX and the other using SQL.
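The same query can be expressed directly against a key-value store. In the hypothetical layout sketched below, each person is stored under a key prefixed with "person/", and listing persons is a simple prefix scan:

# Hypothetical key layout: each person is stored under "person/<id>".
store = {
    "person/1": '{"name": "Alice"}',
    "person/2": '{"name": "Bob"}',
    "config/site": '{"lang": "fr"}',
}

# The equivalent of "ls person/*" or "SELECT * FROM person":
# scan the key space for the "person/" prefix.
persons = {k: v for k, v in store.items() if k.startswith("person/")}
print(sorted(persons))  # ['person/1', 'person/2']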

The reason why the idea of a distributed file system is dangerous in the context of Cloud is that Cloud is about scalability and there is no reason to add a POSIX layer between applications and key-value stores to implement data storage. The POSIX layer could even prevent achieving the kind of optimizations without which a web application does not scale. Keeping in people's minds the idea that a distributed file system is useful in a Cloud context will simply prevent many developers from changing their habits and from adopting modern ways of developing scalable applications.

Teaching how to stop using file systems and how to use key-value stores instead should thus be one of the highest priorities of any cloud computing organization such as COMPATIBLE ONE.

5.2 From key-value to file system

Abandoning POSIX is the most reasonable choice to build scalable applications. But for a few applications, POSIX and hierarchical file systems are still useful. One of these applications is backup, since high performance matters less and ease of use is essential. One reason why dropbox is so successful could be related to its user interface, which perfectly integrates into desktop environments and file browsers, preserving the concept of a hierarchy of folders.

Key-value stores can be presented to users as if they were file systems through two distinct approaches: by formatting a key-value store as if it were a block device, or by implementing a file-to-block mapping.

The first approach is illustrated by sheepdog qemu integration [RD-3] and ceph's qemu-rbd [RD-1]. It could be extended to other key-value stores such as memcached.
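To make the idea concrete, the sketch below shows how fixed-size blocks could be mapped onto key-value entries. The key layout and the backend interface (the load/store methods from chapter 4) are assumptions for illustration, not the actual sheepdog or qemu-rbd code:

BLOCK_SIZE = 4096  # fixed value size, as in a block store

class KeyValueBlockDevice:
    """Sketch: expose a key-value store as a virtual block device.

    'backend' is any object with load(key)/store(key, value) methods,
    for example a memcached-like client; the key layout is hypothetical.
    """

    def __init__(self, backend, volume_id):
        self.backend = backend
        self.volume_id = volume_id

    def _key(self, block_number):
        return "%s:%d" % (self.volume_id, block_number)

    def read_block(self, block_number):
        value = self.backend.load(self._key(block_number))
        # Unwritten blocks read back as zeroes, like a sparse disk.
        return value if value is not None else b"\x00" * BLOCK_SIZE

    def write_block(self, block_number, data):
        assert len(data) == BLOCK_SIZE
        self.backend.store(self._key(block_number), data)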

The second approach has been demonstrated by various implementations of POSIX through FUSE on top of existing key-value stores. The table below provides a list of POSIX wrappers for key-value stores:

Software / Implementation / Application

S3fs: FUSE; backup.
VoldFS: FUSE; experimental.
Memcachefs: FUSE; experimental.
ceph: kernel module; distributed file system.

Table 3. File system wrappers for key-value stores

Since the only relevant and remaining application for distributed file systems seems to be backup, another approach should be considered: integrating key-value stores into backup tools. Duplicity, rdiff-backup, Box Backup and Backup Manager all include multiple storage backends, including for S3 and ceph. Javascript implementations of file access are also provided in TAHOE to access files directly from the browser in a javascript application.

Figure 7. TAHOE supports many clients besides POSIX

6 NoSQL beyond Mere Storage

Authentication, encryption and indexing are not provided by many key-value stores. Pure key-value stores are designed for maximum performance on a trustable network (memcached) or on a network which supports broadcasting (sheepdog). They are often designed to be accessed by a single, high-performance application rather than shared. Amazon S3 on the other hand provides all the kinds of authentication and encryption which a web service can provide through HTTP and TLS, at the expense of performance.

When it comes to indexing, it is perfectly reasonable that pure key-value stores do not provide any way of querying metadata or the content of values, since values are completely arbitrary. It is the responsibility of the application to build indices, either by creating its own index data structure or by relying on an external index provider.

6.1 Key-value Store Limitations

6.1.1.Authentication

Most network protocols can be extended to support authentication through SASL. This was achieved recently by memcached. It should not be too difficult to extend protocols which do not yet support authentication natively with SASL, although it does not always make sense.

Involving a kind of virtual private network (VPN) can be a simple way to add authentication. Different approaches can be considered. One is to set up a centralized VPN such as OpenVPN. But this eventually increases the latency and bandwidth bottlenecks. It is thus preferable to rely on a peer-to-peer VPN approach. IPv6 provides this kind of feature natively with IPSEC. However, setting up IPSEC in the context of untrustable nodes on public Clouds is not easy. Another alternative is to use SSL tunneling with the stunnel user agent. The stunnel approach has another advantage: it can help quickly turn an IPv6-unaware software into an IPv6-aware one, and at the same time encrypts the transport.

6.1.2.Encryption

Most key-value store protocols provide no encryption of stored data and no encryption of transferred data. Encryption of transferred data can be obtained through the use of IPSEC or stunnel. Generally speaking, adding TLS support to protocols should be considered to simplify authentication and encryption of transferred data over untrusted networks.

Regarding stored data encryption, it is obviously more efficient to rely on the application, if only from the point of view of distributing CPU-intensive tasks over multiple nodes rather than increasing the load of the key-value store.

6.1.3.Multiple namespaces

Key-value stores do not always provide multiple namespaces, and sometimes provide them in a way which offers no authentication. Sheepdog for example does not differentiate between partition names from an authentication point of view. With memcached, multi-tenancy must often be implemented on the client side or on the application side (http://code.google.com/appengine/docs/python/multitenancy/multitenancy.html).
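A common client-side workaround is to prefix every key with a tenant identifier. The sketch below illustrates the idea; the client interface is a stand-in, and the isolation is purely conventional, with no authentication barrier:

class DictClient:
    """Stand-in for a memcached-like client with get/set methods."""
    def __init__(self):
        self.data = {}
    def set(self, key, value):
        self.data[key] = value
    def get(self, key):
        return self.data.get(key)

class NamespacedClient:
    """Emulate namespaces client-side by prefixing every key."""
    def __init__(self, client, tenant):
        self.client = client
        self.prefix = tenant + ":"
    def set(self, key, value):
        return self.client.set(self.prefix + key, value)
    def get(self, key):
        return self.client.get(self.prefix + key)

backend = DictClient()
a = NamespacedClient(backend, "tenant_a")
b = NamespacedClient(backend, "tenant_b")
a.set("counter", 1)
print(b.get("counter"))  # None: isolated by key prefix only, not by authentication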

In our opinion, multi-tenancy is not really an issue as long as there is a way, such as SlapOS provides, to allocate a new key-value store automatically and quickly. For large scale applications, developers anyway do not want their input-output access performance to be polluted by someone else.

6.1.4.Indexing

We consider that indexing is a topic out of the scope of this report. Adding indices to databases in a synchronous or centralized way simply kills scalability. Adding asynchronous indices to a key-value store is so dependent on the nature of the application and the nature of the data that it is better to rely on the application to implement or select how it indexes its data. We are aware that for many users, keeping an index on values is a must. In this case, we recommend using rich NoSQL databases instead of a simple key-value store, and paying great attention to the index updating process, which may just make their application non-scalable. Another point of view could be to state that it is possible to build an index-rich NoSQL database on top of a simple key-value store, just the same way as a file system is built on top of a block storage server. A file system is, after all, a way to index blocks.
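The sketch below illustrates what an application-side, asynchronous index can look like: writes go straight to the store, and a background worker updates the index later, so write performance is not hindered by index maintenance. The data structures are stand-ins, not a specific product:

import queue
import threading

store = {}                    # primary key-value store (stand-in)
index = {}                    # application-side index: tag -> set of keys
pending = queue.Queue()       # decouples indexing from the write path

def store_document(key, value, tags):
    store[key] = value        # the write returns immediately
    pending.put((key, tags))  # indexing happens later, asynchronously

def indexer():
    while True:
        key, tags = pending.get()
        for tag in tags:
            index.setdefault(tag, set()).add(key)
        pending.task_done()

threading.Thread(target=indexer, daemon=True).start()

store_document("doc-1", "invoice for ACME", tags=["invoice", "acme"])
pending.join()                # in a real system the index simply lags behind writes
print(index["invoice"])       # {'doc-1'}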

6.2 NoSQL Document Stores

We provide in this paragraph a short overview of NoSQL document stores, although they are not the primary focus of this report. A document store is a key-value store in which the value is a JSON structure. It was popularized by CouchDB and MongoDB.

6.2.1.Querying through map-reduce code injection

Searching a document store can be achieved by using a map-reduce approach. A javascript function is used to select documents which match a criterion defined in Javascript. The execution of the javascript function is achieved in parallel on multiple nodes of a cluster.
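Expressed in Python rather than Javascript, the principle looks like the sketch below: the same map function is applied to each shard in parallel and the partial results are merged. The shard layout is hypothetical:

from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards of a document store, one per node.
shards = [
    [{"id": 1, "type": "invoice", "total": 120},
     {"id": 2, "type": "order", "total": 40}],
    [{"id": 3, "type": "invoice", "total": 75}],
]

def map_shard(shard):
    # The "map" function: select matching documents on each node.
    return [doc for doc in shard if doc["type"] == "invoice"]

with ThreadPoolExecutor() as pool:
    partial_results = pool.map(map_shard, shards)

# The "reduce" step: merge the per-node selections.
invoices = [doc for part in partial_results for doc in part]
print(invoices)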

6.2.2.Querying through indices

However, map-reduce may not always be sufficient. Indices may be used to query documents more quickly or to limit the scope of the map-reduce scan.

6.3 NoSQL Columnar Stores

NoSQL columnar stores are key-value stores for which the data is a BLOB and/or a list of columns. This includes NoSQL databases such as Big Table, CMIS stores, SphinxSearch and Solr. Some columnar stores provide a fixed schema while others provide a dynamic schema with a variable number of columns.

Some columnar stores provide a querying syntax which is very close to relational SQL databases and even supports joins. Others provide a querying syntax which does not support joins.

6.4 NoSQL Graph Databases

NoSQL graph databases provide an optimized approach to store graphs such as social graphs or knowledge graphs. RDF databases are used for example for semantic web applications. Graph databases (Neo4J, GraphDB) are used to implement certain social networks.

6.4.1.Querying through graph crawling

Some graph databases provide an efficient querying process which is similar to map-reduce in NoSQL document stores. The parallel execution of a predicate function combined with a graph discovery function leads to high performance queries.

6.4.2.Querying through semantic indices

Other graph databases, in particular semantic databases, provide indices so that complex queries can be efficiently interpreted with large volumes of data.

7 Storage Allocation API

Automated and instant allocation of a service is what characterizes cloud compared to traditional hosting of managed storage. We propose in this chapter a unified API to allocate Cloud storage. We show how this unified API helps implement complex scalability testing.

7.1 Existing Storage Allocation APIs

Each vendor (ex. Amazon S3) usually provides its own API and protocol for storage allocation.

The Python based libcloud provides one of the most interesting approaches to unify the diversity of Cloud APIs for IaaS. Libcloud first provided a common API to allocate virtual machines. It now supports storage container allocation, load balancers and DNS management.

Another approach is the SLAP protocol. SLAP stands for "Simple Language for Accounting and Provisioning". Through a single promise-based API (the so-called request method), SLAP implements what requires, in imperative APIs such as libcloud, one API for virtual machines (create_node, deploy_node, reboot_node, destroy_node), one API for storage (create_container, upload_object, download_object, download_object_as_stream, delete_container, delete_object), one API for load balancing (create_balancer, destroy_balancer, balancer_attach_member, etc.) and one API for DNS (get_zone, get_record, create_zone, update_zone, create_record, update_record, delete_zone, delete_record).

Since the SLAP protocol already covers a wider scope of applications than libcloud, with a simpler API, we will obviously recommend the unification of storage allocation and access through the SLAP protocol and API.

7.2 SlapOS: A Universal Allocation API for Storage

7.2.1.Allocating Storage with slapos request in python

We follow here the SlapOS approach based on a single request method to allocate any storage. In pure SlapOS, the type of storage is defined by a URL to a buildout script. The request API is invoked in the python scripting language:

>>> storage = slapos.request(
...     'http://www.gitorious.org/slapos/slapos-software-memcached/software.cfg',
...     'memcached', 'main',
...     configuration=dict(node_count=10, redundancy=3),
...     computer=dict(same_lan=True))
>>> print storage.getConnectionDict()
{'ip': 127.0.0.1, 'port': 10020}

The long URL can be replaced with a short alias to simplify the syntax:

>>> storage = slapos.request('memcached', 'memcached', 'main',
...     configuration=dict(node_count=10, redundancy=3),
...     computer=dict(same_lan=True))
>>> print storage.getConnectionDict()
{'ip': 127.0.0.1, 'port': 10020}

As a result, the memcached daemon becomes a standard token so that GNU/Linux distributions may provide it if needed through their own packaging system instead of using buildout scripts to build memcached servers.

7.2.2.Allocating Storage with slapos command line

Slapos is a shell command which follows a syntax similar to git:

# slapos request memcached memcached main --configuration node_count=10 redundancy=3 \
    --computer same_host=true
IP: 127.0.0.1
port: 10020

Allocating multiple storages of different types with different configurations becomes as easy as this:

# slapos request memcached memcached alt1 --configuration node_count=2 redundancy=1 \
    --computer same_host=true
IP: 127.0.0.1
port: 10021
# slapos request memcached memcached alt2 --configuration node_count=2 redundancy=1 \
    --computer same_host=true
IP: 127.0.0.1
port: 10022
# slapos request kumofs memcached alt3 --configuration node_count=3 redundancy=2 \
    --computer same_host=false
IP: 2a01:e34:ec03:8610:221:5dff:fe77:d176/64
port: 10022

Dropping storage services can be performed by:

# slapos destroy memcached memcached alt1

tmpmtp6I4.odt Confidentiel © Nexedi SA / ISO 16016 7.2.3.Multitenancy

Some storage services (ex. MySQL) provide multiple namespaces, also known as multitenancy. This is supported in SlapOS this way:

# slapos request mariadb mysqld main --configuration database_count=100 --sla same_host=true
IP: 127.0.0.1
port: 10030
# slapos request mariadb mysqld database1 --sla same_host=true --slave=true
IP: 127.0.0.1
port: 10030
database: a3uy6bs78j

The first line allocates a MySQL database powered by the mariadb buildout. The second line requests a database in the 'main' mysqld server.

7.2.4.Automating Storage Scalability Testing

Thanks to the SlapOS python API, we could build an automated scalability testing framework: nosql-testbed. This framework allocates on the one hand a number of key-value store servers and on the other hand instances of the memstrike process.

This single command fires a sequence of request commands to allocate key-value stores and testing agents. All results of the test are aggregated and stored in the ERP5 Forge testing module. Thanks to the universal approach to storage allocation, the same framework could already be used by the Nexedi team to test both kumofs and sheepdog. The testing protocol derives from memstrike, a testing utility which is part of kumofs.
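A sketch of what such a testbed can look like with the request API of section 7.2.1 is given below; the memstrike software alias and its configuration parameters are assumptions used for illustration only:

import slapos  # SLAP client library, as used in section 7.2.1

# Hypothetical sketch: allocate one key-value store, then several memstrike
# agents pointed at it, all through the same request method.
storage = slapos.request('kumofs', 'memcached', 'main',
                         configuration=dict(node_count=10, redundancy=3))
endpoint = storage.getConnectionDict()

agents = [
    slapos.request('memstrike', 'memstrike', 'agent-%d' % i,
                   configuration=dict(target_ip=endpoint['ip'],
                                      target_port=endpoint['port'],
                                      clients=64))
    for i in range(5)
]
# Each agent runs the benchmark; results are aggregated in the ERP5 Forge
# testing module, as described above.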

7.3 The SLAP REST API

A detailed description of the SLAP protocol can be found on the Python Package Index:

http://packages.python.org/slapos.core/rest.html

We will provide here an excerpt of the SLAP protocol which relates to storage allocation.

7.3.1.Exchange format

SlapOS master will support both XML and JSON formats for input and output.

The Content-Type header is required and responsible for format selection.

7.3.2.Response status code

7.3.2.1 Success

GET requests will return a "200 OK" response if the resource is successfully retrieved.

POST requests which create a resource will return a "201 Created" response if successful.

POST requests which perform some other action such as sending a campaign will return a "200 OK" response if successful.

PUT requests will return a "200 OK" response if the resource is successfully updated.

DELETE requests will return a "200 OK" response if the resource is successfully deleted.

7.3.2.2 Common Error Responses

7.3.2.2.1 400 Bad Request

The request body does not follow the API (one argument is missing or malformed). The full information is available as the text body:

HTTP/1.1 400 Bad Request
Content-Type: application/json; charset=utf-8

{ "computer_id": "Parameter is missing" }

7.3.2.2.2 402 Payment Required

The request cannot be fulfilled because the account is locked.

7.3.2.2.3 403 Forbidden

Wrong SSL key used or access to invalid ID.

7.3.2.2.4 404 Not Found

A request was made to a non-existing resource.

7.3.2.2.5 500 Internal Server Error

Unexpected error.

7.3.3.Instance Methods

7.3.3.1 Requesting a new instance

Request a new instantiation of a software.

Request:

POST http://example.com/api/v1/request HTTP/1.1
Content-Type: application/json; charset=utf-8

Expected Request Body:

{
  "title": "My unique instance",
  "software_release": "http://example.com/example.cfg",
  "software_type": "type_provided_by_the_software",
  "slave": False,  # one of: True or False
  "status": "started",  # one of: started, stopped
  "parameter": {
    "Custom1": "one string",
    "Custom2": "one float",
    "Custom3": ["abc", "def"],
  },
  "sla": {
    "computer_id": "COMP-0",
  }
}

Expected Response:

HTTP/1.1 201 Created
Content-Type: application/json; charset=utf-8

{ "instance_id": "azevrvtrbt", "status": "started", "connection": { "custom_connection_parameter_1": "foo", "custom_connection_parameter_2": "bar" } } Additional Responses: HTTP/1.1 202 Accepted Content-Type: application/json; charset=utf-8

{ "instance_id": "azevrvtrbt", "status": "processing" } The request has been accepted for processing

Error Responses:

• 409 Conflict: the request can not be processed because of the current status of the instance (sla changed, instance is under deletion, software release can not be changed, ...).
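As an illustration only, the following sketch shows how a client could call this method with the standard Python 'requests' library; the master URL, software release and certificate file names are hypothetical placeholders.

import json
import requests

body = {
    "title": "My unique instance",
    "software_release": "http://example.com/example.cfg",
    "software_type": "type_provided_by_the_software",
    "slave": False,
    "status": "started",
    "parameter": {"Custom1": "one string"},
    "sla": {"computer_id": "COMP-0"},
}

response = requests.post(
    "http://example.com/api/v1/request",
    headers={"Content-Type": "application/json; charset=utf-8"},
    data=json.dumps(body),
    cert=("client.crt", "client.key"),  # SSL client authentication (see 7.3.2.2.3)
)

if response.status_code == 201:
    # Instance allocated: connection parameters are available immediately.
    instance = response.json()
    print(instance["instance_id"], instance["connection"])
elif response.status_code == 202:
    # Request accepted, instantiation still in progress.
    print(response.json()["status"])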

7.3.4.Deleting an instance

Request the deletion of an instance.

Request:

DELETE http://example.com/api/v1/instance/{instance_id} HTTP/1.1
Content-Type: application/json; charset=utf-8

Route values:

• instance_id: the ID of the instance

No Expected Request Body

Expected Response:

HTTP/1.1 202 Accepted
Content-Type: application/json; charset=utf-8

Error Responses:

• 409 Conflict: the request can not be processed because of the current status of the instance.

7.3.5.Get instance information

Request all instance information.

Request:

GET http://example.com/api/v1/instance/{instance_id} HTTP/1.1
Content-Type: application/json; charset=utf-8

Route values:

• instance_id: the ID of the instance

No Expected Request Body

Expected Response:

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{ "instance_id": "azevrvtrbt", "status": "start", # one of: start, stop, destroy "software_release": "http://example.com/example.cfg", "software_type": "type_provided_by_the_software", "slave": False, # one of: True, False "connection": { "custom_connection_parameter_1": "foo", "custom_connection_parameter_2": "bar" }, "parameter": { "Custom1": "one string", "Custom2": "one float", "Custom3": ["abc", "def"], }, "sla": { tmpmtp6I4.odt Confidentiel © Nexedi SA / ISO 16016 "computer_id": "COMP-0", } "children_id_list": ["subinstance1", "subinstance2"], "partition": { "public_ip": ["::1", "91.121.63.94"], "private_ip": ["127.0.0.1"], "tap_interface": "tap2", }, } Error Responses:

• 409 Conflict: the request can not be processed because of the current status of the instance.

7.3.6.Get instance authentication certificates

Request the instance certificates.

Request:

GET http://example.com/api/v1/instance/{instance_id}/certificate HTTP/1.1
Content-Type: application/json; charset=utf-8

Route values:

• instance_id: the ID of the instance

No Expected Request Body

Expected Response:

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{ "ssl_key": "-----BEGIN PRIVATE KEY-----\nMIIEvgIBADAN...h2VSZRlSN\n-----END PRIVATE KEY-----", "ssl_certificate": "-----BEGIN CERTIFICATE-----\nMIIEAzCCAuugAwIBAgICHQI...ulYdX- JabLOeCOA=\n-----END CERTIFICATE-----", } Error Responses:

• 409 Conflict: the request can not be processed because of the current status of the instance.
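For illustration, the sketch below retrieves the certificate of an instance and stores it on disk for later SSL client authentication; the instance identifier, the requester credentials and the file names are hypothetical placeholders.

import requests

response = requests.get(
    "http://example.com/api/v1/instance/azevrvtrbt/certificate",
    headers={"Content-Type": "application/json; charset=utf-8"},
    cert=("client.crt", "client.key"),  # existing credentials of the requester
)
payload = response.json()

# Persist the returned key pair so that the instance can authenticate itself later.
with open("instance.key", "w") as key_file:
    key_file.write(payload["ssl_key"])
with open("instance.crt", "w") as cert_file:
    cert_file.write(payload["ssl_certificate"])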

7.3.7.Bang instance

Trigger the re-instantiation of all partitions in the instance tree.

Request:

POST http://example.com/api/v1/instance/{instance_id}/bang HTTP/1.1
Content-Type: application/json; charset=utf-8

Route values:

• instance_id: the ID of the instance

Expected Request Body:

{
  "log": "Explain why this method was called",
}

Expected Response:

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

7.3.8.Modifying instance

Modify the instance information and status.

Request:

PUT http://example.com/api/v1/instance/{instance_id} HTTP/1.1
Content-Type: application/json; charset=utf-8

Expected Request Body:

{
  "status": "started",   # one of: started, stopped, updating, error
  "log": "explanation of the status",
  "connection": {
    "custom_connection_parameter_1": "foo",
    "custom_connection_parameter_2": "bar"
  }
}

The status field is required and must be accompanied by log; connection is optional and, when present, allows status and log to be omitted (see the sketch after this section).

Expected Response:

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

Error Responses:

• 409 Conflict: the request can not be processed because of the current status of the instance (sla changed, instance is under deletion, software release can not be changed, ...).
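To make the rule on status, log and connection concrete, here is a minimal sketch of the two accepted body shapes; the values are placeholders only. Either body would be sent with the PUT request described above.

# Variant 1: report a status change, which must be accompanied by a log entry.
status_update = {
    "status": "error",
    "log": "explanation of the status",
}

# Variant 2: publish connection parameters only; status and log may then be omitted.
connection_update = {
    "connection": {
        "custom_connection_parameter_1": "foo",
        "custom_connection_parameter_2": "bar",
    },
}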

8 Recommendations

If there was one standard to recommend for Cloud storage, we would definitely recommend memcached, because it is the only protocol which has been implemented by many independent parties both for caching and for persistent storage, and whose scalability has been proven by Facebook, NHN and Nico Nico. By proxying memcached through qemu for IaaS or through FUSE for POSIX, one could provide a universal Cloud storage solution with maximum choice and competition. The performance of kumofs, a distributed persistent implementation of memcached, is higher than the performance of a Solid State Disk (SSD), with an equivalent mean access time of 0.035 ms, and could be higher than that of many Storage Area Networks (SAN).

Figure 8. Kumofs can reach levels of performance which few SANs can match

If a practical, high performance alternative to SAN is needed, we recommend using sheepdog because it potentially provides the same performance as nbd and drbd, but with a much higher level of flexibility.

If ease of use on the Desktop is required, we recommend proxying key-value stores with FUSE or equivalent. We recommend assessing TAHOE for resilient backup applications. TAHOE acts, just like ceph, both as a key-value store and as a distributed file system. TAHOE was created for a company competing with Dropbox. It is now led by Zooko Wilcox-O'Hearn (born Bryce Wilcox), a peer-to-peer hacker and cypherpunk known for his work on DigiCash, Mojo Nation and Mnet.

For scientific applications which need shared storage, only Lustre seems to be recommendable currently. However, we also recommend migrating from POSIX to key-value stores for even higher performance.

We also recommend keeping a close eye on ceph, as a key-value store, as a SAN alternative and as a distributed file system. Ceph even includes an S3 gateway and thus provides a one-stop shop for all storage needs on the Cloud.

Generally speaking, we believe that proxying storage protocols will be increasingly used to combine properties (e.g. performance, resilience) of different solutions. Besides the memcached proxy, only proprietary solutions such as CleverScale exist. We thus recommend, as part of COMPATIBLE ONE, to study key-value proxying based on the memcached protocol, since a lot of work has already been achieved on this topic.

Building a memcached storage proxy compatible with all platforms (MacOS, Windows, Linux) could be a nice outcome of COMPATIBLE ONE. In order to illustrate this approach, it could be interesting, for example, to extend qemu and provide support for kumofs storage as if it were a block device, in the same spirit as what sheepdog or ceph do. This could provide some comparison between sheepdog and kumofs performance on real life applications based on the block device API. It would also be interesting to provide the S3 protocol on top of memcached, or to implement a RESTful version of the memcached protocol. Finally, creating a Dropbox-like application on top of memcached could be an interesting outcome of COMPATIBLE ONE.

8.1 The End of POSIX

After conducting this study, and in particular after reading an excellent report on XtreemFS performance [RD-12], we are now convinced that POSIX distributed file systems are irrelevant for Cloud Computing. None of the major, large scale applications of Cloud Computing are based on POSIX. POSIX seems to be used only to run legacy software on virtual machines or to run legacy scientific applications on trusted networks.

We thus recommend, as part of COMPATIBLE ONE, to reduce the effort on distributed file systems and to increase the effort considerably on key-value stores, including block storage. Any effort of compatibility, through proxying of one protocol into another, through the integration of shell commands to quickly allocate Cloud storage, through the integration of a GUI to browse data from the desktop or through the integration of key-value backends into backup modules, will increase the adoption of key-value stores in their native form by a wider community and provide a competitive advantage to early adopters.

We consider in particular that a perfect integration of distributed block device storage into the operating system for qemu virtual machines can have a great impact on the market and lead to the end of proprietary storage area networks (SAN), which currently prevent the emergence of competition to Red Hat and SuSE on the enterprise market. However, without considerable effort and the involvement of excellent C developers, this will not happen, or will happen outside of the COMPATIBLE ONE project members.
