Facebook's Tectonic Filesystem: Efficiency from Exascale
Satadru Pan1, Theano Stavrinos1,2, Yunqiao Zhang1, Atul Sikaria1, Pavel Zakharov1, Abhinav Sharma1, Shiva Shankar P1, Mike Shuey1, Richard Wareing1, Monika Gangapuram1, Guanglei Cao1, Christian Preseau1, Pratap Singh1, Kestutis Patiejunas1, JR Tipton1, Ethan Katz-Bassett3, and Wyatt Lloyd2
1Facebook, Inc., 2Princeton University, 3Columbia University

In Proceedings of the 19th USENIX Conference on File and Storage Technologies (FAST ’21), February 23–25, 2021. https://www.usenix.org/conference/fast21/presentation/pan

Abstract

Tectonic is Facebook’s exabyte-scale distributed filesystem. Tectonic consolidates large tenants that previously used service-specific systems into general multitenant filesystem instances that achieve performance comparable to the specialized systems. The exabyte-scale consolidated instances enable better resource utilization, simpler services, and less operational complexity than our previous approach. This paper describes Tectonic’s design, explaining how it achieves scalability, supports multitenancy, and allows tenants to specialize operations to optimize for diverse workloads. The paper also presents insights from designing, deploying, and operating Tectonic.

1 Introduction

Tectonic is Facebook’s distributed filesystem. It currently serves around ten tenants, including blob storage and data warehouse, both of which store exabytes of data. Prior to Tectonic, Facebook’s storage infrastructure consisted of a constellation of smaller, specialized storage systems. Blob storage was spread across Haystack [11] and f4 [34]. Data warehouse was spread across many HDFS instances [15].

The constellation approach was operationally complex, requiring many different systems to be developed, optimized, and managed. It was also inefficient, stranding resources in the specialized storage systems that could have been reallocated for other parts of the storage workload.

A Tectonic cluster scales to exabytes such that a single cluster can span an entire datacenter. The multi-exabyte capacity of a Tectonic cluster makes it possible to host several large tenants like blob storage and data warehouse on the same cluster, with each supporting hundreds of applications in turn. As an exabyte-scale multitenant filesystem, Tectonic provides operational simplicity and resource efficiency compared to federation-based storage architectures [8, 17], which assemble smaller petabyte-scale clusters.

Tectonic simplifies operations because it is a single system to develop, optimize, and manage for diverse storage needs. It is resource-efficient because it allows resource sharing among all cluster tenants. For instance, Haystack was the storage system specialized for new blobs; it bottlenecked on hard disk IO per second (IOPS) but had spare disk capacity. f4, which stored older blobs, bottlenecked on disk capacity but had spare IO capacity. Tectonic requires fewer disks to support the same workloads through consolidation and resource sharing.

In building Tectonic, we confronted three high-level challenges: scaling to exabyte scale, providing performance isolation between tenants, and enabling tenant-specific optimizations. Exabyte-scale clusters are important for operational simplicity and resource sharing. Performance isolation and tenant-specific optimizations help Tectonic match the performance of specialized storage systems.

To scale metadata, Tectonic disaggregates the filesystem metadata into independently scalable layers, similar to ADLS [42]. Unlike ADLS, Tectonic hash-partitions each metadata layer rather than using range partitioning. Hash partitioning effectively avoids hotspots in the metadata layer. Combined with Tectonic’s highly scalable chunk storage layer, disaggregated metadata allows Tectonic to scale to exabytes of storage and billions of files.
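To make the hash-versus-range contrast concrete, here is a minimal sketch in plain Python; the shard count, key format, and hash function are hypothetical choices for illustration, not Tectonic’s actual sharding scheme. It partitions the metadata keys of files in one hot directory both ways and counts how many keys land on each shard:

    # Sketch: hash partitioning vs. range partitioning of metadata keys.
    # All constants and key formats are hypothetical.
    import hashlib
    from collections import Counter

    NUM_SHARDS = 8
    # Hypothetical range boundaries over pathnames; shard i holds the keys
    # falling between boundary i-1 and boundary i.
    RANGE_BOUNDARIES = ["/d", "/h", "/m", "/q", "/t", "/w", "/y"]

    def hash_shard(key: str) -> int:
        """Hash partitioning: adjacent keys (e.g., files under one hot
        directory) scatter across shards, spreading the load."""
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS

    def range_shard(key: str) -> int:
        """Range partitioning: contiguous sorted key ranges map to shards,
        so every entry under one hot directory hits the same shard."""
        return sum(1 for boundary in RANGE_BOUNDARIES if key >= boundary)

    keys = [f"/warehouse/ads/part-{i:05d}" for i in range(10_000)]
    print("hash :", dict(Counter(map(hash_shard, keys))))   # roughly 1,250 keys per shard
    print("range:", dict(Counter(map(range_shard, keys))))  # all 10,000 keys on one shard

Under hash partitioning the 10,000 keys spread nearly evenly across the eight shards, while under range partitioning they all map to a single shard, which is exactly the kind of metadata hotspot described above.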
Tectonic simplifies performance isolation by solving the isolation problem for groups of applications in each tenant with similar traffic patterns and latency requirements. Instead of managing resources among hundreds of applications, Tectonic only manages resources among tens of traffic groups.

Tectonic uses tenant-specific optimizations to match the performance of specialized storage systems. These optimizations are enabled by a client-driven microservice architecture that includes a rich set of client-side configurations for controlling how tenants interact with Tectonic. Data warehouse, for instance, uses Reed-Solomon (RS)-encoded writes to improve space, IO, and networking efficiency for its large writes. Blob storage, in contrast, uses a replicated quorum append protocol to minimize latency for its small writes and later RS-encodes them for space efficiency.
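As a rough illustration of how client-side configuration can steer the write path per tenant, the sketch below compares the network cost of the two choices just described. The class, field names, and encoding parameters (a 3-way replicated quorum append versus an RS stripe with nine data and six parity chunks) are hypothetical stand-ins, not Tectonic’s client API:

    # Sketch: per-tenant write-path configuration and its network cost.
    # Names and parameters are hypothetical, for illustration only.
    from dataclasses import dataclass

    @dataclass
    class WriteConfig:
        encoding: str       # "replicated" or "reed_solomon"
        replicas: int = 0   # full copies written by the replicated path
        quorum: int = 0     # acks required before a replicated append returns
        rs_data: int = 0    # Reed-Solomon data chunks per stripe
        rs_parity: int = 0  # Reed-Solomon parity chunks per stripe

    # Data warehouse: large, throughput-oriented writes; RS-encode on the
    # write path to save space, IO, and network bandwidth.
    WAREHOUSE = WriteConfig(encoding="reed_solomon", rs_data=9, rs_parity=6)

    # Blob storage: small, latency-sensitive appends; write full replicas,
    # return after a quorum of acks, and RS-encode later in the background.
    BLOBSTORE = WriteConfig(encoding="replicated", replicas=3, quorum=2)

    def bytes_on_wire(cfg: WriteConfig, logical_bytes: int) -> int:
        """Bytes the client sends to storage nodes for one write."""
        if cfg.encoding == "replicated":
            return logical_bytes * cfg.replicas
        stripe_unit = logical_bytes / cfg.rs_data
        return int(stripe_unit * (cfg.rs_data + cfg.rs_parity))

    print(bytes_on_wire(WAREHOUSE, 100 * 2**20))  # about 1.67x a 100 MiB write
    print(bytes_on_wire(BLOBSTORE, 64 * 2**10))   # 3x a 64 KiB append

The comparison captures the trade-off described above: RS-encoding on the write path inflates a large write by only about 1.67x instead of 3x, while full-replica quorum appends keep small writes on a low-latency path and defer the space savings to a later background re-encode.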
Tectonic has been hosting blob storage and data warehouse in single-tenant clusters for several years, completely replacing Haystack, f4, and HDFS. Multitenant clusters are being methodically rolled out to ensure reliability and avoid performance regressions.

Adopting Tectonic has yielded many operational and efficiency improvements. Moving data warehouse from HDFS onto Tectonic reduced the number of data warehouse clusters by 10×, simplifying operations from managing fewer clusters. Consolidating blob storage and data warehouse into multitenant clusters helped data warehouse handle traffic spikes with spare blob storage IO capacity. Tectonic manages these efficiency improvements while providing comparable or better performance than the previous specialized storage systems.

2 Facebook’s Previous Storage Infrastructure

Before Tectonic, each major storage tenant stored its data in one or more specialized storage systems. We focus here on two large tenants, blob storage and data warehouse. We discuss each tenant’s performance requirements, their prior storage systems, and why they were inefficient.

[Figure 1: Tectonic provides durable, fault-tolerant storage inside a datacenter. Each tenant has one or more separate namespaces. Tenants implement geo-replication.]

2.1 Blob Storage

Blob storage stores and serves binary large objects. These may be media from Facebook apps (photos, videos, or message attachments) or data from internal applications (core dumps, bug reports). Blobs are immutable and opaque. They vary in size from several kilobytes for small photos to several megabytes for high-definition video chunks [34]. Blob storage expects low-latency reads and writes, as blobs are often on path for interactive Facebook applications [29].

Haystack and f4. Before Tectonic, blob storage consisted of two specialized systems, Haystack and f4. Haystack handled “hot” blobs with a high access frequency [11]. It stored data in replicated form for durability and fast reads and writes. As Haystack blobs aged and were accessed less frequently, they were moved to f4, the “warm” blob storage [34]. f4 stored data in RS-encoded form [43], which is more space-efficient but has lower throughput because each blob is directly accessible from two disks instead of three in Haystack. f4’s lower throughput was acceptable because of its lower request rate.

However, separating hot and warm blobs resulted in poor resource utilization, a problem exacerbated by hardware and blob storage usage trends. Haystack’s ideal effective replication factor was 3.6× (i.e., each logical byte is replicated 3×, with an additional 1.2× overhead for RAID-6 storage [19]).

2.2 Data Warehouse

Data warehouse provides storage for data analytics. Data warehouse applications store objects like massive map-reduce tables, snapshots of the social graph, and AI training data and models. Multiple compute engines, including Presto [3], Spark [10], and AI training pipelines [4], access this data, process it, and store derived data. Warehouse data is partitioned into datasets that store related data for different product groups like Search, Newsfeed, and Ads.

Data warehouse storage prioritizes read and write throughput over latency, since data warehouse applications often batch-process data. Data warehouse workloads tend to issue larger reads and writes than blob storage, with reads averaging multiple megabytes and writes averaging tens of megabytes.

HDFS for data warehouse storage. Before Tectonic, data warehouse used the Hadoop Distributed File System (HDFS) [15, 50]. However, HDFS clusters are limited in size because they use a single machine to store and serve metadata. As a result, we needed tens of HDFS clusters per datacenter to store analytics data. This was operationally inefficient; every service had to be aware of data placement and movement among clusters. Single data warehouse datasets are often large