2016 IEEE International Conference on Big Data (Big Data)

Container-Based Virtualization for Byte-Addressable NVM Data Storage

Ellis R. Giles
Rice University
Houston, Texas
[email protected]

Abstract—Container based virtualization is rapidly growing in popularity for cloud deployments and applications as a virtualization alternative due to the ease of deployment coupled with high performance. Emerging byte-addressable, non-volatile memories, commonly called Storage Class Memory or SCM, are promising both byte-addressability and persistence near DRAM speeds operating on the main memory bus. These new memory alternatives open up a new realm of applications that no longer have to rely on slow, block-based persistence, but can rather operate directly on persistent data using ordinary loads and stores through the cache hierarchy coupled with transaction techniques.

However, SCM presents a new challenge for container-based applications, which typically access persistent data through layers of block based file isolation. Traditional persistent data accesses in containers are performed through layered file access, which slows byte-addressable persistence and transactional guarantees, or through direct access to drivers, which do not provide isolation guarantees or security.

This paper presents a high-performance containerized version of byte-addressable, non-volatile memory (SCM) for applications running inside a container that solves performance challenges while providing isolation guarantees. We created an open-source container-aware loadable Kernel Module (LKM) called Containerized Storage Class Memory, or CSCM, that presents SCM for application isolation and ease of portability. We performed evaluation using micro-benchmarks, STREAM, and Redis, a popular in-memory data structure store, and found our CSCM driver has near the same memory throughput for SCM applications as a non-containerized application running on a host and much higher throughput than persistent in-memory applications accessing SCM through Docker Storage or Volumes.

I. INTRODUCTION

Docker [1] is a relatively new open-source implementation of container-based virtualization technology that has been gaining in popularity for quick and easy cloud deployments and for recent work in live migration of containers using Flocker [2]. Docker creates lightweight Linux based containers, with containerized applications having near the same performance as when the application is executed outside the container [3]. Containers offer lightweight virtualization for applications and services running on the same host. Containers are an alternative to full virtualization of a host operating system and devices, only isolating running applications using a chroot environment for persistent file accesses, Linux cgroups for CPU and memory usage, and I/O isolation.

Storage Class Memory, or SCM, is an exciting new memory technology with the potential of replacing hard drives and SSDs, as it offers high-speed, byte-addressable persistence on the main memory bus. Several technologies are currently under research and development, each with different performance, durability, and capacity characteristics. These include a ReRAM by Micron and Sony, a slower but very large capacity Phase Change Memory or PCM by Micron and others, and a fast, smaller spin-torque ST-MRAM by Everspin. High-speed, byte-addressable persistence will give rise to new applications that no longer have to rely on slow, block based storage devices and serialize data for persistence. When writing values into a persistent memory tier, programmers are faced with a dual edged problem of how to catch spurious cache evictions while atomically grouping stores to manage consistency guarantees in case of failure.

Figure 1 shows how new Storage Class Memory sits at the intersection of both byte-addressability and persistence, allowing applications to use ordinary loads and stores to quickly persist data without having to serialize data for block storage. SCM is timely for Big Data applications as it sits at the common challenges of Velocity, Volume, and Variety [4] [5] and recent extensions to additional characteristics [6]. Figure 1 also shows traditional container based storage in relation to Storage Class Memory. As noted in the figure, SCM sits at an interesting intersection of byte-addressable access and persistence. Access to SCM will be provided via a traditional mmap call to an underlying device or file [7]–[9].

Figure 1: Container based storage with byte-addressable, fast, and non-volatile Storage Class Memory.

This poses a problem for applications running inside an isolated container, because an mmap of a file through an isolation layer may not gain the performance benefits of SCM, due to cascading persistence consistency guarantees in layered file accesses. Additionally, exposing a shared device or volume to multiple containers can remove persistence isolation from containers and introduce security and portability issues. New Direct Access (DAX) support and eXecute-In-Place (XIP) support bypass the virtual-memory system and page caches, creating direct access to SCM without the need to copy data into internal buffers for persistence. Even with DAX support in file systems such as ext2 and ext4, containers will still have to access data through a Docker Storage Layer or Volume, each with the challenges above. Additionally, multi-threaded applications utilizing the Docker Volume driver do not scale well due to internal Docker synchronization.

We present a solution to these problems by introducing a Containerized Storage Class Memory or CSCM driver. It detects when applications are accessing the driver from within a container and presents an identical copy of SCM. This allows for ease in portability of containers coupled with high-speed, byte-addressable persistent memory accesses for storage. In addition, with isolated, direct access to persistent SCM, consistency guarantees that perform persistent memory fences in an application, such as an in-memory database, do not suffer from multiple layers each performing persistence consistency, thus allowing our CSCM implementation to achieve high scalability.

II. OVERVIEW

Virtualization was pioneered by Popek and Goldberg, who stated that a virtual machine should be "an efficient, isolated duplicate of the real machine" [10]. Virtualization has three key properties: 1) Isolation - guests should not be able to affect others; 2) Encapsulation - allowing for easy migration as an entire system can be represented as a single file; and 3) Interposition - the system can monitor or intercept guest operations.

Jails [11] was introduced for lightweight virtualization of environments to allow for the sharing of a machine between several customers or users while still allowing for isolation of files and services of the guests on the same machine through the use of chroot and I/O constraints. Chroot changes the root of the file system to a different location for application-level persistence isolation and security. This has many benefits since a system can be shared securely with little performance burden, but services such as CPU and memory were not isolated and could be abused by users. Linux Containers or LXC [12] were introduced as the Linux version of Jails. The implementation added features to restrict memory and CPU usage that extended its isolation features. Docker [1] is an open-source Linux project which automates the deployment of applications or bundled services inside of Linux Containers. Volatile memory is handled from inside a container using regular accesses, with limits imposed by the operating system if Linux cgroups are used.

Handling access to persistent data inside a Docker Container is not a choice to be taken lightly, as there are several choices available: using a container file or traditional Docker Storage, an external directory or Docker Volume, or direct access to a device. Each has its own advantages and disadvantages and needs to be specified up-front on container start if requiring a special device or volume. We summarize each method and then relate it to SCM storage access.

1. Docker Storage: Storage to traditional files inside a container is accessed using a pluggable storage driver architecture. File accesses are layered using AUFS (or Another Union File System), Device Mapper, OverlayFS, VFS, or ZFS. A layered access requires copying of data through multiple file or persistence layers; an example is shown in Figure 2. AUFS uses Copy-on-Write and copies the entire file on first update, which could waste space but be faster if all of the contents of a file are updated. Device Mapper is a new option that performs copies at the block layer, reducing space, but is slower on the writes to first blocks. Another drawback for Docker Storage is that the entire Docker service must be configured for the driver and cannot be changed unless reconfigured and re-installed.

2. Docker Volumes: Docker Volumes are used to share data between containers or between the container and the host, also shown in Figure 2. Volume specifications are passed as options on startup of a container, and cannot be added later. Data is also not isolated, so changes in one environment affect the other. If a container is paused and restarted elsewhere, volume data must be copied and managed. Additionally, we found that multi-threaded applications, with each thread stressing the persistence volume, did not scale well due to internal synchronization points in the volume driver.

3. Direct Device Access: Direct access to a device is specified and granted on the start of a container, and the exact same driver must be present on restart. If a new driver is installed later, the container must be restarted or the host may crash. Special privilege must also be granted to the container, which violates isolation principles, as it allows each container to access or overwrite data in the shared driver among containers launched with the same specification.

Figure 2: An example of Docker Storage and Volume Drivers accessing data through a persistent file layer.

Storage Class Memory presents a new situation for container based storage, as SCM offers byte-addressable memory that is also persistent. Access to persistent memory, or SCM, like regular file accesses, can be presented to a container using one of the options listed above. The three options for SCM accesses are shown in Figure 3.

Figure 3: Storage Class Memory exposed to a Docker Container through an SCM Filesystem or Driver.

Combining Docker Containers with SCM poses an interesting challenge. If SCM is exposed as Docker Storage or a Volume, then the high performance of SCM byte addressability is lost. Further, exposing persistent SCM as a dedicated, privileged device driver will cause a loss in application isolation and portability.

Additionally, Direct Access (DAX) support, which allows mmaped files to bypass the virtual-memory system and page caches to store values directly to byte-addressable persistent memory, still must be passed to a Docker container using one of the three avenues above. If a DAX supported filesystem is passed to a containerized application through Docker Storage, persistent memory is still exposed via a slow image layer and driver as in Figure 2; exposing it through a Volume or Direct Device still has the drawbacks above. Adding support for containers to the persistent memory DAX driver would introduce a new problem of managing containerized ranges of SCM inside a given /dev/pmem device range; and when migrating the container to another system, management tools would need to be created to access the persistent memory and recreate an identical copy on the new system.

We developed a new driver, which we call CSCM, that manages SCM at a file level, gaining ease of portability, while simultaneously presenting data to the container through the direct driver access avenue, thus allowing for byte-addressable speed with containerized persistence. It provides isolation through the driver and management through the file layer, while allowing direct access to SCM through loads and stores for performance and scalability without layered access.

A. Persistence Consistency and Transactions

Persistence consistency guarantees for transactions impose additional problems. Consider a simple application that is writing to a variable Z. The value might be caught in a number of places, from the processor cache to write buffers. If an atomic operation is needed for an in-memory data structure, the new value of Z should be visible and accessible to others on the completion of a transaction. If a power failure occurs, a DRAM based variable being located anywhere in the hardware is not problematic, as the variable value is cleared on restart. However, if the value is located in SCM and a power failure occurs, the result of the computation may not be committed to persistent SCM. Even if a value is written to a memory location using a cache-line flush, clflush, the value may not be persisted to SCM.

The new Intel architecture specification [13] specifies new instructions such as clwb, or cache-line write-back, which writes a cache line out to the write buffers without invalidation, and persistent memory store fences, which guarantee writes against loss but are global synchronous operations.

This complicates fast, reliable storage for Docker Containers, as multiple levels of synchronization, which are expensive, are required to maintain consistency through the layers. For instance, an application might be performing its own consistency guarantees through an Undo Log, which requires a synchronous operation making a copy of a value before writing a new value. Before each write, however, each value must be made persistent on the underlying medium. In the case of disk based persistence, this is accomplished through a disk flush. In SCM this can be accomplished by creating a log, writing the address of Z and the value of Z to the log, waiting for a persistent memory fence, and then writing the new value of Z. Once new values are written, the values may be flushed immediately or delayed. If the writes to the SCM locations are being performed through a Docker Storage Layer or Volume, then all of the writes to the variables, and the synchronization points, are also passed through, which can be a very expensive operation.

Therefore, it is advantageous to access SCM directly through the device driver to avoid cascading synchronization. Our CSCM driver allows for direct access to SCM in an isolated and managed manner so that containerized applications can have fast, scalable in-memory transactions.
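As an illustration of the ordering described above, the following minimal sketch (ours, not part of the CSCM library) logs the address and old value of Z, persists the log entry, and only then writes the new value. It assumes a pre-mapped SCM region and x86 clflush/sfence intrinsics; a clwb-based flush could be substituted on hardware that supports it, and the scm_log and pm_flush_fence names are hypothetical.

#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>   /* _mm_clflush, _mm_sfence */

struct undo_entry { uint64_t *addr; uint64_t old; };

/* Hypothetical undo log placed in the mmap'ed SCM region. */
static struct undo_entry *scm_log;
static size_t scm_log_len;

/* Flush the cache line holding p and fence so the write reaches the
 * memory subsystem before any later store. */
static void pm_flush_fence(const void *p) {
    _mm_clflush(p);
    _mm_sfence();
}

/* Durably update *z to new_val using one undo log entry. */
static void scm_update(uint64_t *z, uint64_t new_val) {
    struct undo_entry *e = &scm_log[scm_log_len++];
    e->addr = z;                 /* 1. record address and old value  */
    e->old  = *z;
    pm_flush_fence(e);           /* 2. persist the log entry first   */
    *z = new_val;                /* 3. write the new value           */
    pm_flush_fence(z);           /* 4. flush now, or lazily later    */
}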

III. CONTAINERIZING SCM

This section presents the Containerized Storage Class Memory design as a Linux loadable Kernel Module, or LKM. The overall system design is shown in Figure 4.

Figure 4: Containerized Storage Class Memory System Design.

The system implementation is comprised of three main components, described briefly in the list below and in more detail following:

• Docker CLI: The Docker Command Line Client integrations are simple additions we performed via the Docker API to add functionality when launching or saving a Docker Image.

• User Library: This can be any user-level library, such as pmem.io [8] or SoftWrAP [17], that provides a Storage Class Memory allocator and optional persistence mechanisms or transaction support. We provide a basic working implementation for accessing SCM and flushing updates through the cache hierarchy for persistence consistency, with configurations for evaluation purposes.

• CSCM Driver: Our Container Aware Linux loadable Kernel Module for Storage Class Memory, or CSCM. This is where most of the work is performed to provide isolated, scalable access to SCM through the Docker system.

Due to the lack of readily available SCM DIMMs for Docker testing, we used Linux 4.5.3 with the NVDIMM, PMEMFS, and Direct Access (DAX) supports that we chose as options when compiling the kernel, per [8], to simulate SCM data. Then, for a file system that performs persistence consistency, we used an ext4 file system in /dev/pmem0, the emulated PM or SCM data space. The file system contents for Docker and supporting libraries are installed into the SCM FS space, which we mount on /scmfs. The file system layout is shown in Figure 5 and details are described below.

Figure 5: Data Layout on /dev/pmem0 for CSCM.

The Docker container is executed with a privileged device mapping to our CSCM LKM driver, which gives the container privileged access to read and write our device driver /dev/cscm. Additional options for the volume tests can be passed on container creation. Even though the privileged flag is passed, our CSCM driver performs accesses to SCM via loads and stores in isolation as described below.

docker run --privileged -ti --device=/dev/cscm --name=test1 rhel:7

Docker CLI: Docker Command Line Integration

Docker 1.10.3 and the API for Client Version 1.22 are built in the Go Programming Language. The Docker API client allows for simple extensions to the Docker interface. It communicates with the Docker server daemon, and responses from the daemon can be handled through the Go language. On container start, the command line to launch an image is:

docker run [options] imageid

We modified the Docker API Client run code to initialize the data in a container on startup. The image id is obtained from the return value from the Docker daemon after the container image is created and before the container is run. The data in /scmfs/images for the image identifier is copied if an image exists for the new container being launched. If no image id is present, it need not be copied, since the CSCM driver creates it on the first access. Likewise, for a commit, the reverse procedure is followed; the file docker//client/commit.go was modified to copy the file /scmfs/container/value to the image directory. Response codes in the Docker CLI from communication with the Docker Server Daemon indicate success or failure of a commit and identifiers for the launching container.

Listing 1: Docker CLI Additions

func (cli *DockerCli) CmdCommit(args ...string) error {
    response, err := cli.client.ContainerCommit(...)
    // Commit container SCM data to /scmfs/images
    exec.Command("docker scm commit", name, response.ID)
    ...
}

func (cli *DockerCli) CmdRun(args ...string) error {
    ...
    // Initialize container SCM data in /scmfs/containers
    exec.Command("docker scm run", name, createResponse.ID)
    // Run command in container.
    ...
}

User Library

The user-level library may be any library, such as pmem.io [8] or SoftWrAP [17], that provides a Storage Class Memory allocator and optional persistence mechanisms or transaction support. Our library provides several functions, namely scmalloc and scmsync. The scmalloc call examines the environment and can be configured to provide access to any one of the configurations described previously: Docker Volumes, Docker Storage, our CSCM driver, or volatile DRAM. For volume or storage access, a named file can be specified and is then opened on first access and mmaped into the user address space.

For evaluation purposes, an environment variable, SCM, is first read on library initialization at application startup. If the value is 0 or the variable is not defined, then /dev/cscm is accessed, opened for reading and writing, and mmapped into user-space. This creates an interface to the /dev/cscm driver for the application. If the value of SCM is 1, then Docker Storage on /dfscmdata is opened. This is access to the Docker Storage, which is on /scmfs or any SCM file system. It provides byte-addressable and DAX supported file access to Docker, but it does not pass through entirely to the application layer, requiring access through the AUFS or other Docker Storage Layers.

A value of 2 for SCM indicates that our library should allocate memory from a file in the Docker Volume driver external to the Container and mmap it into user space. This volume is configured via the -v option on Docker Container launch to point to the /scmfs/volumes volume. Note that this volume is shared, as it would be by any container accessing volume data. To secure isolation in this mode, managers must configure each container with its own volume, and on container migration or restart, ensure that the volumes have persisted and have identical data for the container application accesses.

The resulting pointer from the mmap call is returned and saved statically in the function and incremented by the requested size. Subsequent calls to scmalloc use an offset of the pointer and increment the next available location as well. Note that sophisticated memory allocators for SCM are directly usable by this library or any application that initializes it with the /dev/cscm driver.

The scmsync function call simply calls the persistence mechanism on the file handle. When accessing SCM directly via /dev/cscm, we can use a persistent memory fence after first issuing a store fence to ensure proper ordering. We emulate the cost of a persistent memory fence using a global serializing instruction, CPUID.
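The following user-space sketch illustrates the bump-pointer allocation path described above. It is a minimal illustration rather than the library's actual code: the mapping size, the file name under /dfscmdata, and the omission of error handling are our assumptions.

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SCM_REGION_SIZE (1UL << 30)   /* assumed mapping size */

static char  *scm_base;    /* base of the mmap'ed SCM region */
static size_t scm_next;    /* bump-pointer allocation offset */
static int    scm_fd = -1;

/* Bump-pointer allocator: open /dev/cscm (or, per the SCM environment
 * variable, a file in Docker Storage or a Volume) and hand out offsets. */
void *scmalloc(size_t size) {
    if (scm_fd < 0) {
        const char *env  = getenv("SCM");
        const char *path = (env == NULL || strcmp(env, "0") == 0)
                               ? "/dev/cscm"        /* CSCM driver              */
                               : "/dfscmdata/scm";  /* illustrative storage file */
        scm_fd   = open(path, O_RDWR | O_CREAT, 0600);
        scm_base = mmap(NULL, SCM_REGION_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, scm_fd, 0);
    }
    void *p = scm_base + scm_next;
    scm_next += size;
    return p;
}

/* Persist outstanding stores.  This sketch simply msyncs the mapping;
 * the paper's /dev/cscm path instead issues a store fence followed by a
 * CPUID-serialized emulated persistent memory fence. */
void scmsync(void) {
    msync(scm_base, scm_next, MS_SYNC);
}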
CSCM: Container Aware SCM Linux Kernel Module

The Containerized SCM Linux loadable Kernel Module creates an SCM data area for the container and sets up the node and major version number. It registers our container supported /dev/cscm device appropriately and registers the mmap handler. On container restart, the container is attached through the active device to the persistent SCM data.

On an mmap, the following pseudo-code is executed:

Listing 2: CSCM LKM mmap flow

static int scm_mmap(struct file *filp, struct vm_area_struct *vma)
{
    ...
    // Get the file system root
    // Detect chroot (is the fs root / ?)
    // Get the SCM location:
    //   if root, use /scmdata/hostdata
    //   else examine /proc/1/cgroup for the container id
    // Lock process
    // Open the SCM file
    // Unlock process
    // Set up the VMA with generic_file_mmap
}

On driver installation, the /dev/cscm device is created for the Linux host operating system. Applications access the driver by using a regular file open call and then may call mmap to access and map data into user space.

When mmapping the data, the pseudo-code above is executed. The file system root is determined, container-level isolation is detected via OS calls to detect chroot, and then, based on the result and the operating system type, an offset into the base operating file system is determined. This allows our driver to manage SCM data on a file basis for containers for portability, while still serving DAX byte-level persistence through the driver and Docker system to container applications for high performance and scalability.

One challenge, after getting the generic Virtual Memory Address range, or VMA, set, was determining not just whether an application is in a container but also determining the container id. The container id dictates what underlying file to open. The file /proc/1/cgroup has different values depending on the OS. Once the correct SCM data file is opened, the file can be mapped, since it is using Direct Access Extensions underneath, and the resulting generic_file_mmap call value can be returned to the scmalloc call. If the SCM data file isn't present, it is created on the first access.
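For illustration, the user-space fragment below shows the kind of /proc/1/cgroup parsing this refers to; the exact line format varies across distributions and Docker versions, and the in-kernel CSCM module performs the equivalent lookup before choosing which backing file to open. The function name is ours.

#include <stdio.h>
#include <string.h>

/* Extract a container id from /proc/1/cgroup, e.g. from a line such as
 *   1:name=systemd:/docker/<64-hex-id>
 * Returns 0 and fills id[] on success, -1 if no docker entry is found. */
static int cscm_container_id(char *id, size_t len) {
    FILE *f = fopen("/proc/1/cgroup", "r");
    if (!f)
        return -1;
    char line[512];
    int found = -1;
    while (fgets(line, sizeof(line), f)) {
        char *p = strstr(line, "/docker/");   /* format varies by distro */
        if (p) {
            snprintf(id, len, "%s", p + strlen("/docker/"));
            id[strcspn(id, "\n")] = '\0';      /* strip trailing newline  */
            found = 0;
            break;
        }
    }
    fclose(f);
    return found;
}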

A. Security and Additional Operating Systems

One area of consideration is security for the privileged CSCM LKM driver. Chroot has legacy security issues, such as open file handles to parent files still being usable after issuing a chroot, thereby giving a child process access to a parent view of a file system. However, our CSCM LKM securely locks the file system and current process so a child can't open a file while the LKM has chrooted itself to open the appropriate SCM file in the parent system. It prevents any child from grabbing the file handle, as it is used internally, locked only briefly, and used only once.

Additionally, the CSCM driver locks the information stored in /proc/1/cgroup. It is possible that the host root could put some malicious content in that file to cause a parse error or, if known, direct the driver to a different file for SCM. While this could be a problem for any malicious application in a container, we are working on securing this further.

We also ported, built, and tested the system on another Linux distribution, Ubuntu Server 14.04 LTS, upgraded to Linux Kernel 4.2 and also built with DAX and PM support. The test machine was an Intel i7 4500U running at 3.0 GHz with 8GB of memory. Minor changes to the source code were performed to get the Linux loadable Kernel Module to work in the additional system and kernel. Such changes included parsing the /proc/1/cgroup file to handle changes in the way that containers are referred to. Additionally, different function calls were not exposed for the LKM and were implemented in the CSCM source.

IV. EVALUATION

For evaluation, we used an Intel Xeon(R) CPU E5-2697 v2 12 core processor at 2.70 GHz, with 32 GB DDR3 (4 x 8 GB) clocked at 1.867 GHz, running the Red Hat Enterprise Linux Server release 7.2 x64 distribution and a Linux 4.5.3 kernel compiled (per [8]) with NVDIMM, PMEMFS, and Direct Access (DAX) supports. A 6GB persistent memory emulation space for the pmem driver was created in DRAM, and an ext4 file system was created and mounted for persistent memory file system emulation. Our Containerized Storage Class Memory library, including the CSCM Linux loadable Kernel Module driver, was built using gcc 4.8.5. We used Docker Version 1.10.3 and API for Client Version 1.22. In evaluation, SCM is exposed as 1GB chunks on the /scmfs file system. A Docker container is executed with:

docker run --privileged -ti --device=/dev/cscm --name=test1 -v /scmfs:/vol testa

We tested two micro-benchmarks followed by tests using STREAM [14] and Redis [15] using each configuration of Host, our Containerized SCM Docker Device (CSCM), Docker Storage, and Docker Volumes. We also tested transactional guarantees to in-memory data structures and show the CSCM driver performs isolated SCM access to a shared device. We first show isolation, then show how each method performs similarly with little to no synchronization or transactional guarantees, and finally add in transactions and synchronization to show how our CSCM driver has near-native application performance and scalability.

A. Isolation Testing

Figure 6: Number of Elements Different on Each Portion of Random Array Update.

To first test the guarantee of isolation between concurrent Docker Containers accessing the device, we created a large array in SCM in a container's persistent memory space. We then update the array with a constant value, and test how many times it is different. On each pass over a percentage of randomly located elements in the array, the test displays the percent different. Figure 6 shows the tests for two Docker Containers simultaneously accessing /dev/cscm. When CSCM is enabled to use a containerized data segment, each array operates on its own copy of SCM, even though both containers are accessing the same privileged /dev/cscm device. The array differences go to zero quickly once all of the elements have been scanned and updated in random order.
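A condensed sketch of this isolation check is shown below; the array size, sample fraction, and pass count are illustrative, and scmalloc is assumed to come from the CSCM user library.

#include <stdio.h>
#include <stdlib.h>

#define N      (1 << 24)       /* array size (illustrative)         */
#define SAMPLE (N / 10)        /* elements checked/updated per pass */

extern void *scmalloc(size_t size);   /* CSCM user library allocator */

/* Each container writes its own tag into random slots and reports how
 * many sampled slots still hold a foreign value.  With CSCM isolation
 * the percentage falls to zero; with a shared volume it keeps bouncing. */
void isolation_pass(int my_tag) {
    static int *arr;
    if (!arr)
        arr = scmalloc(N * sizeof(int));
    for (int pass = 0; pass < 20; pass++) {
        long diff = 0;
        for (int i = 0; i < SAMPLE; i++) {
            int idx = rand() % N;
            if (arr[idx] != my_tag)
                diff++;
            arr[idx] = my_tag;
        }
        printf("pass %d: %.1f%% different\n", pass, 100.0 * diff / SAMPLE);
    }
}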

However, when CSCM support is disabled in the driver, or when accessing a volume mount to a shared data segment, the arrays always show differences, as they are not isolated. This is shown in the bouncing behavior of the disabled CSCM driver in processes P1 and P2 in separate Docker containers. These tests show that the CSCM driver provides isolation to the emulated SCM through the shared /dev/cscm device. Therefore, a containerized application is unaware it is running inside a container, appearing to itself as running on a host.

B. Micro-Benchmarks

Figure 7: Single Updates Per Second on Integer Array.

We used a set of micro-benchmarks to update elements in an SCM backed array, initially without any synchronization points. These tests show the similarity in performance of the different methods without any type of synchronized persistence or consistency guarantees.

The first micro-benchmark is an array update which creates an array in SCM using the CSCM user library scmalloc for 200M 4-byte integers and randomly updates elements in the array. We record the average throughput of updates per second for each configuration and average over 20 executions. Figure 7 shows how our Docker CSCM device has near the same memory throughput as when the test is running on the host, and only slightly better throughput than the other methods.
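A sketch of this array-update micro-benchmark is shown below; the timing harness and update count are ours, with scmalloc assumed from the CSCM user library.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ELEMS 200000000UL            /* 200M 4-byte integers */

extern void *scmalloc(size_t size);  /* CSCM user library allocator */

/* Randomly update elements of an SCM-backed array and report the
 * average update throughput; no synchronization or scmsync calls,
 * matching the unsynchronized micro-benchmark described above. */
int main(void) {
    int *arr = scmalloc(ELEMS * sizeof(int));
    unsigned long updates = 100000000UL;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < updates; i++)
        arr[rand() % ELEMS] = (int)i;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.1f million updates/sec\n", updates / secs / 1e6);
    return 0;
}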

Figure 8: Memory Copy Throughput in MB/s for STREAM.

We then modified the memory STREAM Benchmark [14] to allocate memory using the user-level library with scmalloc and linked it with the CSCM user library. We tested the Copy function, which sequentially copies data from one 90MB chunk to another location. Figure 8 shows how the Docker CSCM device has near the same memory throughput as when the test is running on the host. There are no consistency guarantees in this memory benchmark, and the sequential nature of the copy allows the methods to have similar performance due to caching and buffers.

Figure 9: Single Element Inserts per Second into B-Tree.

We also tested a micro-benchmark that adds elements to an in-memory initialized B-Tree. Figure 9 shows the average time required to insert additional elements. Docker Storage had the lowest value recorded. The roughly equal performance is due to many of the top level elements in the tree being in the processor cache, resulting in a constant cost. Consistency guarantees in the following experiments show the higher performance of the CSCM Device Driver.

C. Benchmarks

Figure 10: Redis Benchmark of Service Requests Per Second.

Next, we tested several benchmarks in Redis [15], an in-memory data structure store that offers lists, sets, and hash table values. We configured Redis to operate in memory and integrated it with the CSCM user library by modifying the memory allocator Jemalloc that ships with Redis to use our CSCM scmalloc routines. We executed the Redis benchmarks for set, get, and list push times and recorded the average requests serviced per second.

Figure 10 shows that the CSCM driver in a Docker container has equal requests-per-second throughput as the same benchmark running on the host through the driver. These have twice the throughput of the Docker Storage or Volume options, as requests do not have to flow through the Docker Storage or Volume drivers. The benchmark is scaled to use 40 threads performing simultaneous requests to Redis for the Set, Get, and LPush operations; in addition to poor write performance, the Volume and Storage drivers do not scale well. Our CSCM device scales well and provides near native host performance. Forward work includes investigating increasing client threads and additional in-memory database consistency techniques.

Consistency Testing:

Figure 11: Average Number of 10 Element Update Transactions Per Second Into an Array.
Figure 12: Average Number of B-Tree Random Element Insert Transactions Per Second.

If an application is doing a copy-on-write approach to persistence, then it will copy the old values and persist them before writing new values. We tested persistence consistency by adding synchronization using scmsync in the CSCM library to the array and B-Tree tests. Figure 11 shows the array throughput and Figure 12 shows the B-Tree throughput. In both cases the CSCM driver running in the Docker Container has the same throughput as when running on the host. When the SCM is accessed through volumes, however, performance degrades by a factor of 10. This is due to the additional level of synchronization required for the volume. Even slower is the Docker Storage driver, which is another factor of 10 slower than the volume driver, making it two orders of magnitude slower than the host or device. This is due to the storage driver going through additional layers. Future work includes exploring changes in the transaction size and persistence method, such as with SoftWrAP [17].

D. Non-Uniform Memory Access

Figure 13: Array Element Updates Per Second for Host and Docker SCM Devices Without Numactl, Per Execution.
Figure 14: Average Number of Array Element Updates Per Second Pinned to Core.

Non-Uniform Memory Access, or NUMA, testing was performed on a machine with two Intel(R) Xeon(R) CPU E5-2697 v2 processors running at 2.70GHz, each with 32 GB as described above, for a total of 64 GB and 48 cores.

Figure 13 shows the performance of consecutive executions of the array update over time. The performance can vary from 8 to almost 16 million updates per second. This is due to the CPU scheduler and where the process is placed. If it is on the same socket as the SCM driver memory accesses, it achieves 16 million updates per second, otherwise 8 million.

Figure 14 shows the performance when the array test thread is pinned to hardware threads using numactl. This produces higher performance if the SCM data is local, but can produce results that are twice as slow when the accesses are to another socket.

V. RELATED WORK

NVM programming models will include use of mmap to access SCM through loads and stores [7]. Docker [1] has a number of supported file systems and volumes, from AUFS, a layered union file system that uses Copy-on-Write when files are modified, to the Device Mapper, which uses thin provisioning to implement the layers. Recent work on Flocker [2] manages Docker containers themselves and integrates with a Docker Swarm manager, allowing Docker Data Volumes to follow migrated Docker Containers. However, this support is for block based devices and not byte-addressable, non-volatile memory, which faces persistent memory consistency problems.

Consistency models for persistent memory were considered in [16]. For SCM atomic consistency, numerous software logging approaches and new hardware-software methods have been proposed [17]–[23]. Specialized reliable SCM software structures like B-Trees [24], multi-versioned indexes [20], [25], and NV heaps [26] have been proposed recently; these often require changes to the underlying system architecture. ATLAS [19] uses a compiler pass to automatically generate transactional regions for atomic writes utilizing a synchronous undo log, but faces the same disadvantage of coupling distinct concerns in a single framework. Mnemosyne [22] uses software transactional memory (STM) to intercept transactional writes and reads and integrate concurrency control and atomicity. SoftWrAP [17] uses an asynchronous Redo-Log and fast aliasing technique; a similar approach is described in REWIND [27], which offers in-place updates with multilayered recovery aided by an Atomic Doubly Linked List. Pmem.io is a recently released persistent memory library by Intel [8] which offers several user-level libraries for direct access to SCM and persistent memory transactions using Copy-on-Write techniques. These all rely on atomic 8-64 byte writes to SCM and new instructions such as pcommit. Application-level control of consistency is discussed in RVM [28]. However, these models do not address virtual machine containers for accessing SCM, but they may be used as transaction mechanisms for writes by applications running in a container with direct access to SCM.

Several general purpose persistent memory file systems built on SCM have been proposed that can allow quick adoption of application use of SCM. Since these are general purpose file systems, they could support full virtual machines running on top of them, such as KVM [29]. BPFS, or Block-Persistent File System, [24] uses copy-on-write techniques for ordering of cache evictions but requires changes to the hardware. The Persistent Memory File System [9], or PMFS, is a complete file-system implementation built for SCM but doesn't address containers. DAX support for file systems, including Ext2 and Ext4, provides direct, unbuffered access to files, but doesn't scale when used with Docker Storage or Volumes or provide isolation as a device. SCMFS uses sequences of mfence and clflush operations to perform ordering and flushing of load and store instructions and requires garbage collection [30]. Aerie [31] exposes an SCM file system without kernel interaction in user mode. NOVA [32] is a hybrid file system for SCM and DRAM that provides consistency guarantees. These file systems could be mounted and accessed in a container; however, they would be accessed via regular Docker Storage or through Docker Volume management, suffering from poor performance and scalability and lack of isolation for volumes. Our CSCM driver, which uses a file system for our container-data management, may be utilized with any underlying SCM based file-system that supports mmap and DAX for high performance.

Research into Non-Volatile Memory allocators such as nvmalloc could be accessed from containers, but these do not support any sort of container isolation other than regular virtual memory isolation. Recent work to virtualize Non-Volatile RAM [33] was presented, but doesn't address containers or the consistency guarantees that might be needed by host applications.

VI. SUMMARY

Running applications inside containers using Docker is growing in popularity as it presents a low-cost, high performance method for isolating applications and services. Emerging byte-addressable non-volatile memory, commonly called Storage Class Memory, presents an interesting challenge for in-memory persistent applications running inside a container.

This paper investigated tradeoffs for presenting SCM persistence to a container based application through a memory-mapped file inside a container, mounted as a volume, and as a container-aware Linux loadable Kernel Module. We presented and evaluated our Containerized Storage Class Memory Driver, or CSCM, a containerized LKM with an mmap interface for in-memory applications and integration with Docker.

We found our CSCM driver to have the highest in-memory application throughput, with orders of magnitude higher throughput than transactional persistence through volumes and storage, while still achieving container persistence isolation for Storage Class Memory.

VII. ACKNOWLEDGEMENTS

I would like to give special thanks to Dr. Scott Rixner and Dr. Peter Varman for their inspiration and encouragement and to the National Science Foundation for support.

REFERENCES

[1] D. Merkel, "Docker: Lightweight linux containers for consistent development and deployment," Linux J., vol. 2014, no. 239, Mar. 2014.

[2] R. Peinl, F. Holzschuher, and F. Pfitzer, "Docker cluster management for the cloud-survey results and own solution," Journal of Grid Computing, pp. 1–18, 2016.

[3] W. Felter, A. Ferreira, R. Rajamony, and J. Rubio, "An updated performance comparison of virtual machines and linux containers," in Performance Analysis of Systems and Software (ISPASS), 2015 IEEE International Symposium On. IEEE, 2015, pp. 171–172.

[4] S. Singh and N. Singh, "Big data analytics," in Communication, Information Computing Technology (ICCICT), 2012 International Conference on, Oct 2012, pp. 1–4.

[5] P. Russom et al., "Big data analytics," TDWI Best Practices Report, Fourth Quarter, 2011.

[6] M. F. Uddin, N. Gupta et al., "Seven v's of big data understanding big data to extract value," in American Society for Engineering Education (ASEE Zone 1), 2014 Zone 1 Conference of the. IEEE, 2014, pp. 1–5.

[7] A. Rudoff, "Programming models for emerging non-volatile memory technologies," Login, vol. 83, no. 3, June 2013.

[8] Pmem.io. (2016) Persistent Memory Programming. [Online]. Available: http://pmem.io/

[9] S. R. Dulloor, S. Kumar, A. Keshavamurthy, P. Lantz, D. Reddy, R. Sankaran, and J. Jackson, "System software for persistent memory," in Proceedings of the Ninth European Conference on Computer Systems, ser. EuroSys '14. New York, NY, USA: ACM, 2014, pp. 15:1–15:15.

[10] G. J. Popek and R. P. Goldberg, "Formal requirements for virtualizable third generation architectures," Communications of the ACM, vol. 17, no. 7, pp. 412–421, 1974.

[11] P.-H. Kamp and R. N. Watson, "Jails: Confining the omnipotent root," in Proceedings of the 2nd International SANE Conference, vol. 43, 2000, p. 116.

[12] Linux, "Linux containers," 2012. [Online]. Available: http://lxc.sourceforge.net

[13] Intel Corporation, "Intel Architecture Instruction Set Extensions Programming Reference," October 2014, http://software.intel.com/.

[14] J. D. McCalpin, "A survey of memory bandwidth and machine balance in current high performance computers," IEEE TCCA Newsletter, pp. 19–25, 1995.

[15] Redis.io. (2016) Redis, a data structure store. [Online]. Available: http://redis.io/

[16] S. Pelley, P. M. Chen, and T. F. Wenisch, "Memory persistency," in ISCA'14, 2014, pp. 265–276.

[17] E. Giles, K. Doshi, and P. Varman, "Softwrap: A lightweight framework for transactional support of storage class memory," in Proc. 31st Symposium on Mass Storage Systems and Technologies, ser. MSST '15. IEEE, 2015.

[18] K. Doshi, E. Giles, and P. Varman, "Atomic persistence for scm with a non-intrusive backend controller," in 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2016, pp. 77–89.

[19] D. R. Chakrabarti, H.-J. Boehm, and K. Bhandari, "ATLAS: Leveraging locks for non-volatile memory consistency," in Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications, ser. OOPSLA '14. New York, NY, USA: ACM, 2014, pp. 433–452.

[20] S. Venkatraman, N. Tolia, P. Ranganathan, and R. H. Campbell, "Consistent and durable data structures for non-volatile byte addressable memory," in Proceedings of 9th Usenix Conference on File and Storage Technologies. ACM Press, 2011, pp. 61–76.

[21] J. Zhao, O. Mutlu, and Y. Xie, "FIRM: Fair and high-performance memory control for persistent memory systems," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-47. Washington, DC, USA: IEEE Computer Society, 2014, pp. 153–165.

[22] H. Volos, A. J. Tack, and M. M. Swift, "Mnemosyne: Lightweight persistent memory," in Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS XVI. New York, NY, USA: ACM, 2011, pp. 91–104.

[23] A. Chatzistergiou, M. Cintra, and S. D. Viglas, "REWIND: Recovery write-ahead system for in-memory non-volatile data-structures," Proceedings of the VLDB Endowment, vol. 8, no. 5, pp. 497–508, 2015.

[24] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. Lee, D. Burger, and D. Coetzee, "Better I/O through byte-addressable, persistent memory," in Proceedings of 22nd ACM SOSP. ACM Press, 2009.

[25] J. Moraru, D. Andersen, M. Kmainsky, N. Binkert, N. Tolia, R. Munz, and P. Ranganathan, "Persistent, protected and cached: Building blocks for main memory data stores," in CMU Parallel Data Lab Technical Report, CMU-PDL-11-114, Dec. 2011.

[26] J. Coburn, A. Caulfield, A. Akel, L. Frupp, R. Gupta, R. Jhala, and S. Swanson, "Nv-heaps: Making persistent objects fast and safe with next generation, non-volatile memories," in Proceedings of 16th ASPLOS. ACM Press, 2011, pp. 105–118.

[27] A. Chatzistergiou, M. Cintra, and S. D. Viglas, "Rewind: Recovery write-ahead system for in-memory non-volatile data-structures," VLDB'15, vol. 8, no. 5, pp. 497–508, 2015.

[28] M. Satyanarayanan, H. H. Mashburn, P. Kumar, D. C. Steere, and J. J. Kistler, "Lightweight recoverable virtual memory," ACM Trans. Comput. Syst., vol. 12, no. 1, pp. 33–57, Feb. 1994.

[29] A. Kivity, Y. Kamay, D. Laor, U. Lublin, and A. Liguori, "kvm: the linux virtual machine monitor," in Proceedings of the Linux symposium, vol. 1, 2007, pp. 225–230.

[30] X. Wu and A. L. N. Reddy, "Scmfs: a file system for storage class memory," in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '11. New York, NY, USA: ACM, 2011, pp. 39:1–39:11.

[31] H. Volos, S. Nalli, S. Panneerselvam, V. Varadarajan, P. Saxena, and M. M. Swift, "Aerie: Flexible file-system interfaces to storage-class memory," in Proceedings of the Ninth European Conference on Computer Systems. ACM, 2014, p. 14.

[32] J. Xu and S. Swanson, "Nova: a log-structured file system for hybrid volatile/non-volatile main memories," in 14th USENIX Conference on File and Storage Technologies (FAST 16), 2016, pp. 323–338.

[33] A. Ruia, "Virtualization of non-volatile ram," Ph.D. dissertation, Texas A&M University, 2015.
