A File System for Serverless Computing

Johann Schleier-Smith and Joseph M. Hellerstein
Presented at HPTS, November 5, 2019

Outline

• Serverless computing background
• Limitations
• The Cloud Function File System (CFFS)
• Evaluation and learnings

©2019 RISELab

Serverless computing in 2014

• AWS announced Lambda, Function as a Service (FaaS), in 2014; other clouds followed quickly
• Write code, upload it to the cloud, run at any scale
• Successful for web APIs and event processing, but limited in stateful applications

[Figure: thumbnail-generation example]
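The canonical FaaS workload is the thumbnail generator: an object-storage upload triggers a function that resizes the image. The sketch below shows only the handler shape, following the common Lambda-style handler(event, context) convention; the event field name and the resizing step are illustrative stand-ins, and no cloud SDK is used.

```python
import json

def handler(event, context=None):
    # The platform invokes this once per event, e.g. an object-storage
    # upload notification. The "object_key" field is illustrative only.
    key = event["object_key"]
    # Real code would fetch the object, resize it with an imaging
    # library, and write the thumbnail back to object storage.
    thumb_key = f"thumb/{key}"
    return {"statusCode": 200, "body": json.dumps({"thumbnail": thumb_key})}

# Local simulation of one invocation:
print(handler({"object_key": "photos/cat.jpg"}))
```

The same shape applies to any event-driven workload: the platform handles scaling and dispatch, the function body handles one event.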

3 ©2019 RISELab Serverless computing in 2014

AWS announced Lambda Function as a Service (FaaS) in 2014, others clouds followed quickly Write code, upload it to the cloud, run at any scale Successful for web APIs and event processing, limited in stateful applications

Thumbnail

Defining characteristics of serverless abstractions

Hiding the servers and the complexity of programming them

Consumption-based costs and no charge for idle resources

Excellent autoscaling so resources match demand closely

Understanding serverless computing’s impact

[Figure: resources allocated and charged over time; on-premise and serverful clouds allocate ahead of demand, while serverless charges track actual usage]

Job                            Serverful Cloud    Serverless Cloud
Infrastructure administration  Outsourced job     Outsourced job
System administration          Simplified job     Outsourced job
Software development           Little change      Simplified job

Serverless as next phase of cloud computing

Serverless abstractions offer:
• Simplified programming
• Outsourced operations
• Improved resource utilization

Serverless is much more than FaaS

• Function as a Service: AWS Lambda, Google Cloud Functions, Google Cloud Run, Azure Functions
• Object Storage: AWS S3, Azure Blobs, Google Cloud Storage
• Big Data Processing: Google Cloud Dataflow, AWS Glue, AWS Athena, AWS Redshift
• Queue Service: AWS SQS, Google Cloud Pub/Sub
• Key-Value Store: AWS DynamoDB, Azure CosmosDB, Google Cloud Datastore
• Mobile back end: Google Firebase, AWS AppSync
• Other: AWS Serverless Aurora, Google App Engine

Fertile ground for research

[Chart: number of serverless publications per year, 2012–2019, actual and extrapolated, rising from near zero toward 120. Source: dblp computer science bibliography]

Outline

• Serverless computing background
• Limitations
• The Cloud Function File System (CFFS)
• Evaluation and learnings

Limitations of FaaS

• No inbound network connections
• Limited runtime
• No specialized hardware, e.g., GPU
• Ephemeral state

Plenty of complementary stateful serverless services:
• Object Storage: AWS S3, Azure Blobs, Google Cloud Storage
• Key-Value Store: AWS DynamoDB, Azure CosmosDB, Google Cloud Datastore, Anna KVS (Berkeley)
• Others: AWS Aurora Serverless, Google Firebase

Combine with FaaS to build applications

• Compute: λ functions hold isolated & ephemeral state
• Storage: Object Storage, Key-Value Store, etc. hold shared & durable state
• Allows independent scaling of compute and storage

[Diagram: many λ instances over shared Object Storage, Key-Value Store, etc.]

How happy are we?

Two main problems

• Latency
• API

Can I please have something like local disk, but for the cloud?

File systems let us run so much software

• Data analysis with Pandas
• Machine learning with TensorFlow
• Software builds with Make
• Search indexes with Sphinx
• Image rendering with Radiance
• Databases with SQLite
• Web serving with Nginx
• Email with Postfix
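SQLite is a good illustration: it needs nothing but a POSIX file underneath, so a shared POSIX layer would let it run unmodified inside cloud functions. A minimal local sketch, with an ordinary temp directory standing in for the shared file system:

```python
import os
import sqlite3
import tempfile

# SQLite stores the whole database in one file and talks to it through
# ordinary POSIX calls (open, read, write, locking, ...).
path = os.path.join(tempfile.mkdtemp(), "app.db")

con = sqlite3.connect(path)
con.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
con.execute("INSERT INTO kv VALUES ('greeting', 'hello')")
con.commit()
con.close()

# A second "invocation" reopens the same file and sees the data.
con2 = sqlite3.connect(path)
print(con2.execute("SELECT v FROM kv WHERE k = 'greeting'").fetchone()[0])
```

None of the other tools on the list above require anything more exotic than this from the storage layer.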

Objections

It won’t scale
• Need a simpler data model (key-value store, immutable objects)
• Need a weaker consistency model (eventual consistency)
• Performance and reliability will suffer otherwise

You don’t need it
• You should be rewriting your software for the cloud anyhow
• Why not just use a key-value store?

How does this make me feel?

Outline

• Serverless computing background
• Limitations
• The Cloud Function File System (CFFS)
• Evaluation and learnings

Introducing the Cloud Function File System (CFFS)

• POSIX semantics, including strong consistency
• Local caches for local-disk performance
• Works under autoscaling, extreme elasticity, and FaaS limitations
• Implemented as a transaction system

What is special about the FaaS environment?

• Function invocations have a well-defined beginning and end
• At-least-once execution: expects idempotent code
• Constrained execution model:
  • Clients frozen in between invocations
  • No inbound network connections
  • No root access
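Because execution is at-least-once, the same event may be delivered twice, and an idempotent handler makes the retry harmless. A minimal sketch; the in-memory set is a stand-in for a durable record of processed event ids:

```python
# Deduplicate by event id so a retried delivery is a no-op.
# In a real system this set would live in durable shared storage.
processed_ids = set()

def handler(event):
    event_id = event["id"]
    if event_id in processed_ids:
        return "skipped"          # duplicate delivery: nothing to redo
    processed_ids.add(event_id)
    # ... side effects go here, guarded by the check above ...
    return "processed"

print(handler({"id": "evt-1"}))   # processed
print(handler({"id": "evt-1"}))   # skipped (at-least-once retry)
```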

CFFS Architecture

[Diagram: many CFFSλ function instances connected to a shared CFFS back-end transactional system]

FaaS instance caching

• “Function as a Service” suggests statelessness, but most implementations reuse instances and preserve their state
• Setup of a sandboxed environment takes time:
  • Loads the selected runtime (e.g., JavaScript, Python, C#)
  • Configures network endpoint and IAM privileges
  • Loads user code
  • Runs user initialization
• Caching is key to amortizing instance setup

Cloud Function File System (CFFS)

[Diagram: the application calls the POSIX API (our emphasis); a client-side transaction buffer and cache sit above logs and transaction commit in the back-end transactional system]

Core API implemented in CFFS

open            New descriptor (handle) for file or directory
close           Close descriptor
write / pwrite  Write / positioned write
read / pread    Read / positioned read
stat            Get size, ownership, access permissions, last modified
seek            Set descriptor position
dup / dup2      Copy descriptor
truncate        Set file size
flock           Byte-range lock and unlock
mkdir           Create directory
rename          Rename file / directory
unlink          Delete file / directory
chmod           Set access permissions
chown           Set ownership
utimes          Update modified / accessed timestamps
clock_gettime   Get current time
chdir           Set working directory
getcwd          Get working directory
begin           Start transaction
commit / abort  End transaction
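Function code would use these calls just as it uses the local POSIX API today, with begin/commit delimiting a transaction. The sketch below runs against a local directory and stubs begin()/commit() as no-ops; in CFFS those two calls would start and commit a serializable transaction on the shared file system:

```python
import os
import tempfile

def begin():
    pass   # hypothetical stub: CFFS would start a transaction here

def commit():
    pass   # hypothetical stub: CFFS would commit the transaction here

def append_line(path, line):
    # One transactional unit: open, append, close, commit.
    begin()
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_APPEND, 0o644)
    try:
        os.write(fd, (line + "\n").encode())
    finally:
        os.close(fd)
    commit()

d = tempfile.mkdtemp()
p = os.path.join(d, "log.txt")
append_line(p, "hello")
append_line(p, "world")
print(open(p).read(), end="")   # prints two lines: hello, world
```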

POSIX guarantees - language from the spec

• “Writes can be serialized with respect to other reads and writes.”
• “If a read() of file data can be proven (by any means) to occur after a write() of the data, it must reflect that write()…”
• “A similar requirement applies to multiple write operations to the same file position…”
• “This requirement is particularly significant for networked file systems, where some caching schemes violate these semantics.”

Source: https://pubs.opengroup.org/onlinepubs/9699919799/

POSIX guarantees in database terms

Atomic operations
• Each operation (at the API level) is observed entirely or not at all
• Some violations in practice

Consistency model
• The spec references time, so it technically requires strict consistency (shared global clock)
• Actually implemented as sequential consistency (a global order exists, consistent with the order at each processor)
• We use serializability to provide isolation and atomicity at function granularity
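The read-after-write requirement quoted from the spec can be exercised directly against a local POSIX file system: once write() returns, a read() that provably occurs later must see the data, even through a different descriptor.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "f")
w = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
r = os.open(path, os.O_RDONLY)

os.write(w, b"abc")
# A read that occurs after the write must reflect it,
# even through a second, independent descriptor.
assert os.pread(r, 3, 0) == b"abc"

os.close(w)
os.close(r)
```

Networked file systems with weakly consistent caching can fail this check across clients, which is exactly the gap CFFS aims to close while keeping caching.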

Open question: what guarantees do applications actually rely on?

Implementation highlights

• Choice of transaction mechanism is not fundamental
• Implemented timestamp-ordered serializable concurrency control
• Optimistic protocols can be a good fit, since FaaS side effects must be idempotent anyway
• State-of-the-art protocols promise lower abort rates and more effective local caches (e.g., Yu et al., VLDB 2018)
• Cache updates through on-demand filtered log shipping:
  • Check for updates when a function starts execution
  • Eviction messages help the back end track client cache contents
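As a toy illustration of timestamp ordering (not the CFFS implementation), each object can track the largest timestamps that have read and written it, aborting any operation that arrives too late for the serial order its timestamp implies:

```python
class TimestampOrderedObject:
    """Basic timestamp-ordering concurrency control for one object."""

    def __init__(self):
        self.read_ts = 0    # largest timestamp that has read this object
        self.write_ts = 0   # largest timestamp that has written it
        self.value = None

    def read(self, ts):
        if ts < self.write_ts:         # a younger transaction already wrote
            raise RuntimeError("abort: read arrived too late")
        self.read_ts = max(self.read_ts, ts)
        return self.value

    def write(self, ts, value):
        if ts < self.read_ts or ts < self.write_ts:
            raise RuntimeError("abort: write arrived too late")
        self.write_ts = ts
        self.value = value

obj = TimestampOrderedObject()
obj.write(ts=1, value="a")
assert obj.read(ts=2) == "a"
# A write with an older timestamp than an existing read must abort:
try:
    obj.write(ts=1, value="b")
except RuntimeError as err:
    print(err)   # abort: write arrived too late
```

With at-least-once FaaS execution, an aborted-and-retried function is just another delivery of the same event, which is why this style of protocol composes well with the platform.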

CFFS in context: Transactional file systems

• QuickSilver distributed system (IBM, 1991): very close in spirit; no FaaS, no caching
• Inversion File System (Berkeley, 1993): built on top of PostgreSQL; access through a custom library
• Transactional NTFS (TxF) (Microsoft, 2006): shipped in Windows; deprecated on account of complexity

There are many non-transactional shared & distributed file systems

CFFS in context: Shared file systems

Must choose between consistency and latency:
• Eventual consistency
• Delegation/lock-based caching
• No caching

HPC: Lustre, GPFS (IBM), GlusterFS (RedHat / IBM), GFS, MooseFS, LizardFS, BeeGFS
Big data: HDFS, GFS (Google), MapR-FS, Alluxio
Client-server: NFS, SMB, mainframe zFS

Outline

• Serverless computing background
• Limitations
• The Cloud Function File System (CFFS)
• Evaluation and learnings

Sample workload call frequencies

[Bar chart: call frequencies, up to roughly 120,000, for two workloads (make on OpenSSH; TPC-C on SQLite) across the calls stat, dup, read, seek, write, chdir, close, open, unlink, chmod, getdent, rename]

Caching benefits (TPC-C / SQLite)

[Bar chart: TPC-C throughput in tpmC on local disk, NFS, and CFFS; the values 8,105, 817, and 142 appear, with local disk at 8,105]

Scaling in AWS Lambda

[Chart: 4k random reads, uncached vs. cached, as AWS Lambda scales from 10 to 100 instances; y-axis up to 180,000]

Returning to objections

It won’t scale: contention risks
• File length must be checked on every read
• stat: mainly used to check the file length or permissions, but also returns modification time and access time
• Challenges here, but optimistic they will be overcome

You don’t need it
• I think it will be pretty useful

How happy are we?

CFFS Summary

Transactions are a natural fit for FaaS:
• BEGIN and END from function context
• At-least-once execution goes well with optimistic transactions
• On-demand filtered log shipping allows asynchronous cache updates

Overcomes limitations of FaaS & traditional shared file systems:
• Allows caching for lower latency, preserving consistency
• Highly scalable, especially with snapshot reads
• POSIX API enables a vast range of tools and libraries
