Restoring Uniqueness in MicroVM Snapshots

Marc Brooker, Adrian Costin Catangiu, Mike Danilov, Alexander Graf, Colm MacCarthaigh, Andrei Sandu
Amazon Web Services

arXiv:2102.12892v1 [cs.CR] 4 Feb 2021

Abstract

Code initialization (the step of loading code, executing static code, filling caches, and forming re-used connections) tends to dominate cold-start time in serverless compute systems such as AWS Lambda. Post-initialization memory snapshots, cloned and restored on start, have emerged as a viable solution to this problem, with incremental snapshot and fast restore support in VMMs like Firecracker. Saving memory introduces the challenge of managing high-value memory contents, such as cryptographic secrets. Cloning introduces the challenge of restoring the uniqueness of the VMs, to allow them to do unique things like generate UUIDs, secrets, and nonces. This paper examines solutions to these problems in the every-microsecond-counts context of serverless cold-start, and discusses the state of the art of available solutions. We present two new interfaces aimed at solving this problem, MADV_WIPEONSUSPEND and VmGenId, and compare them to alternative solutions.

1 Introduction

When AWS Lambda receives a request to invoke a serverless function, the system attempts to match the request to an already-running microVM containing a copy of the function code. At scale, this warm start succeeds with high probability, both because of the averaging effects of scale, and because serverless systems attempt to predict the number of incoming requests using techniques such as reinforcement learning [6], LSTMs [16], and ARIMA models [26]. When warm start does not succeed, the system falls back on cold start, which involves starting a Firecracker [5] microVM (or taking one from a pool), loading the function code, and initializing the code to get it ready to handle the request. The last step of the cold start, initialization, is typically the one that takes the longest. Du et al. [11] found this initialization step to take between 25% and 95% of total start-up time, mirroring our own experiences. As an additional challenge for cloud providers, this initialization step is almost entirely within the customer's control, leaving little opportunity to optimize or improve it on behalf of the customer (beyond simply allocating additional resources to its execution).

[Figure 1: Time line of the life cycle of a microVM in AWS Lambda. Stages: MicroVM boot (~200ms); Download and Initialize ("cold start", 20ms to 60s); Invoke ("warm start", <10ms), repeated; Death.]

Figure 1 shows a typical time line of the life cycle of a microVM in Lambda. The microVM boot, which consists of booting a minimal Linux kernel in Firecracker with a minimal set of userspace daemons and setting up the network, takes approximately 200ms. This is hidden from the customer through the use of a small pool of pre-booted microVMs. The cold start portion, including downloading the function code into the microVM and initialization, frequently takes up to 60s but can be as little as 20ms for small static binaries. The download is typically short: at 25Gb/s, downloading Lambda's maximum package size of 250MB takes only 80ms. Finally, each invoke experiences a warm start delay caused by routing, typically less than 10ms. The same microVM is then re-used for future invokes of the same version of the same function, to amortize the cold start costs.

The time and resources taken by initialization depend strongly on the customer's choice of language and framework, and on additional work the customer has chosen to do at process start time. Of the popular languages, Java typically exhibits the longest cold-start times. This is driven by three main factors: the start-up time of the JVM itself, the decompression and loading of class code, and the execution of static code in the loaded classes. It is typically the last of these which drives long cold-start times, because Java allows for the execution of effectively arbitrary code at static initialization time. While Java code initialization and static execution is multithreaded, the specification [13] limits parallelism by defining a strict recursive order on class initialization. This problem is not limited to Java. The same outcome is common, but less ubiquitous, in workloads written in JavaScript, Python, and C#. Workloads written in native languages like C, C++, Rust and Go tend to initialize quickly, largely because it is a less common pattern to do extensive work at startup in these languages.

Post-initialization snapshots have been proposed as a solution to this initialization time problem [9,11], to the general problem of cloud scaling [8], to fault-tolerance [21], and even for accelerating hypervisor fuzzing [25]. Put simply: once initialization is complete, the virtual machine is snapshotted, and when a cold start is needed the snapshot is cloned and restored. This approach is attractive, because microVM restore can be made very fast. In our own experiments, Firecracker snapshot restore can take as little as 4ms. Catalyzer [11] reports latencies reliably below 10ms for a broad range of application types.

[Figure 2: Time line of the life cycle of a microVM in AWS Lambda with snapshot and clone. Stages: MicroVM boot, Download, Initialize (hidden, 200ms to 60s); Snapshot; then Restore followed by Invoke, once per clone.]

Figure 2 shows the timeline for a microVM with snapshot and clone. The entire boot and cold start portion can be hidden from the customer by performing it once at function creation time. The blocking portion then requires only a snapshot restore, a much faster operation for which the entire latency is under the control of AWS.

1.1 Challenges

Snapshotting and clone-on-restore introduce several system-level challenges. First, by their nature snapshots turn data in RAM into data on storage. This data can include cryptographic secrets and other high-value tokens which the programmer wished to control carefully, irrespective of the security properties of the snapshot storage and distribution layer. Second, cloning duplicates memory state, which is a challenge for applications which want to generate unique data such as request IDs, UUIDs, and cryptographic nonces. Third, cloning is not compatible with existing session-layer network protocols such as TCP and TLS, which assume that the client is a single unique entity with a single identity. Re-establishing these connections can add significantly to restore latency [17,18], and can even cause correctness issues in some protocols (a type of replay attack). This paper focuses on the second challenge, uniqueness, but some of the solutions we analyze also help address the challenges posed by high-value secrets and network protocols.

2 The Value of Uniqueness

Most modern distributed systems depend on the ability of nodes to generate unique or random values. RFC4122 [20] version 4 UUIDs are widely used as unique database keys, and as request IDs used for log correlation and distributed tracing. In this context, duplicate IDs will lead to correctness issues, logical data corruption, incorrect traces, and confusing or misleading logs. UUIDs are also used as tokens for API idempotency, where multiple calls with the same token behave as a single call. Here, duplicate IDs can lead to availability or correctness issues. Many common distributed systems protocols, including consensus protocols like Paxos [19] and ordering protocols like vector clocks, rely on the fact that participants can uniquely identify themselves. Cloning without changing identity could be seen as an unintentional Sybil [10] attack, which most distributed systems protocols don't tolerate.

Jitter, the intentional addition of pseudorandomness to values like retry intervals, sampling intervals, timeouts and resource limits, is widely used to improve the resilience of distributed systems. For example, adding jitter to the retry intervals used by clients of an overloaded system reduces the clustering of retries in time, helping systems recover faster from transient overload. Adding jitter to periodic jobs, like cron, helps avoid seasonalities like start-of-day and top-of-hour spikes. Systems that aren't unique will tend to cluster, and clustering leads to correlated load, which leads to lower system utilization.

Cryptography is the most critical application of unique data. Any predictability in the data used to generate cryptographic keys, whether long-lived keys for applications like storage encryption or ephemeral keys for protocols like TLS, fundamentally compromises the confidentiality and authentication properties offered by cryptography. A potentially less obvious vulnerability relates to initialization vectors (IVs) and nonces used in common block cipher modes like counter mode (CTR) and Galois counter mode (GCM). Using the same IV with the same key multiple times breaks the security of these modes. In many protocols, the IV is transmitted in the clear, making it easy for an attacker to tell if an IV has been re-used. Dworkin in NIST SP800-38D [12] says:

    The probability that the authenticated encryption function ever will be invoked with the same IV and the same key on two (or more) distinct sets of input data shall be no greater than 2^-32.

3 Userspace Interfaces for Uniqueness

Solving the uniqueness problem strongly enough for cryptographic purposes requires a mechanism which can deterministically reseed userspace PRNGs with new entropy at restore time. This mechanism must also support the high-throughput and low-latency use-cases that led programmers to pick a userspace PRNG in the first place (so simply reverting to getrandom is not acceptable); be usable by both application code and libraries; allow transparent retrofitting behind existing popular PRNG interfaces without changing application code; be efficient, especially on restore; and be simple enough for wide adoption.
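
To make the uniqueness problem concrete, the following Python sketch simulates cloning a process that holds userspace PRNG state, and then reseeding on restore. It is an illustration under stated assumptions, not the interface proposed in this paper: the class GenerationAwarePRNG, the read_generation callable, and simulate_clone_and_restore are hypothetical names, the generation counter only stands in for a signal that changes when a snapshot is restored, and copy.deepcopy stands in for cloning memory. Python's random.Random is not a cryptographic generator and is used here only because its state can be copied visibly. The sketch also assumes that os.urandom returns fresh entropy after a restore, which the platform itself has to ensure.

```python
import copy
import os
import random
import uuid


class GenerationAwarePRNG:
    """Userspace PRNG that reseeds when a generation counter changes.

    `read_generation` is a hypothetical callable standing in for a
    platform signal that changes whenever a snapshot is restored.
    """

    def __init__(self, read_generation):
        self._read_generation = read_generation
        self._seen_generation = read_generation()
        # Seed once from the OS; later draws never leave userspace.
        self._rng = random.Random(os.urandom(32))

    def _reseed_if_restored(self):
        current = self._read_generation()
        if current != self._seen_generation:
            # The state may be shared with other clones: replace it.
            self._rng.seed(os.urandom(32))
            self._seen_generation = current

    def uuid4(self):
        self._reseed_if_restored()
        return uuid.UUID(int=self._rng.getrandbits(128), version=4)


def simulate_clone_and_restore():
    """Return (collided_without_signal, collided_with_signal)."""
    generation = {"value": 0}
    original = GenerationAwarePRNG(lambda: generation["value"])

    # "Clone" the process: both copies now hold identical PRNG state.
    clone = copy.deepcopy(original)
    collided_without_signal = original.uuid4() == clone.uuid4()

    # Simulate the restore signal; both copies reseed independently.
    generation["value"] += 1
    collided_with_signal = original.uuid4() == clone.uuid4()
    return collided_without_signal, collided_with_signal


if __name__ == "__main__":
    print(simulate_clone_and_restore())
```

Without the signal, the two copies produce the same "unique" UUID; once the counter changes, each reseeds and the values diverge except with negligible probability. The sketch checks the counter on every draw, which costs one extra comparison per call, and the check followed by the draw is not atomic with respect to a snapshot taken between them, so it should be read as a demonstration of the requirement rather than a complete solution.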
