FBOSS: Building Switch Software at Scale

FBOSS: Building Switch Software at Scale Sean Choi* Boris Burkov Alex Eckert Tian Fang Stanford University Facebook, Inc. Facebook, Inc. Facebook, Inc. Saman Kazemkhani Rob Sherwood Ying Zhang Hongyi Zeng Facebook, Inc. Facebook, Inc. Facebook, Inc. Facebook, Inc. ABSTRACT KEYWORDS The conventional software running on network devices, such FBOSS, Facebook, Switch Software Design, Data Center as switches and routers, is typically vendor-supplied, pro- Networks, Network Management, Network Monitoring prietary and closed-source; as a result, it tends to contain ACM Reference Format: extraneous features that a single operator will not most likely Sean Choi, Boris Burkov, Alex Eckert, Tian Fang, Saman fully utilize. Furthermore, cloud-scale data center networks Kazemkhani, Rob Sherwood, Ying Zhang, and Hongyi Zeng. 2018. often times have software and operational requirements that FBOSS: Building Switch Software at Scale. In SIGCOMM ’18: SIG- may not be well addressed by the switch vendors. COMM 2018, August 20–25, 2018, Budapest, Hungary. ACM, New In this paper, we present our ongoing experiences on over- York, NY, USA, 15 pages. https://doi.org/10.1145/3230543.3230546 coming the complexity and scaling issues that we face when designing, developing, deploying and operating an in-house 1 INTRODUCTION software built to manage and support a set of features re- quired for data center switches of a large scale Internet con- The world’s desire to produce, consume, and distribute on- tent provider. We present FBOSS, our own data center switch line content is increasing at an unprecedented rate. Commen- software, that is designed with the basis on our switch-as- surate with this growth are equally unprecedented technical a-server and deploy-early-and-iterate principles. We treat challenges in scaling the underlying networks. Large Internet software running on data center switches as any other soft- content providers are forced to innovate upon all aspects of ware services that run on a commodity server. We also build their technology stack, including hardware, kernel, compiler, and deploy only a minimal number of features and iterate on and various distributed systems building blocks. A driving it. These principles allow us to rapidly iterate, test, deploy factor is that, at scale even a relatively modest efficiency and manage FBOSS at scale. Over the last five years, our improvement can have large effects. For us, our data center experiences show that FBOSS’s design principles allow us networks power a cloud-scale Internet content provider with to quickly build a stable and scalable network. As evidence, billions of users, interconnecting hundreds of thousands of we have successfully grown the number of FBOSS instances servers. Thus, it is natural and necessary to innovate on the 1 running in our data center by over 30x over a two year period. software that runs on switches. Conventional switches typically come with software writ- CCS CONCEPTS ten by vendors. The software includes drivers for managing dedicated packet forwarding hardware (e.g., ASICs, FPGAs, • Networks ! Data center networks; Programming in- or NPUs), routing protocols (e.g., BGP, OSPF, STP, MLAG), terfaces; Routers; monitoring and debugging features (e.g., LLDP, BFD, OAM), configuration interfaces (e.g., conventional CLI, SNMP, Net- *Work done while at Facebook, Inc. Conf, OpenConfig), and a long tail of other features needed to run a modern switch. Implicit in the vendor model is the Permission to make digital or hard copies of all or part of this work for assumption that networking requirements are correlated be- personal or classroom use is granted without fee provided that copies are not tween customers. In other words, vendors are successful be- made or distributed for profit or commercial advantage and that copies bear cause they can create a small number of products and reuse this notice and the full citation on the first page. Copyrights for components them across many customers. However our network size and of this work owned by others than ACM must be honored. Abstracting with the rate of growth of the network (Figure 1) are unlike most credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request other data center networks. Thus, they imply that our require- permissions from [email protected]. ments are quite different from most customers. SIGCOMM ’18, August 20–25, 2018, Budapest, Hungary © 2018 Association for Computing Machinery. 1We use “switch” for general packet switching devices such as switches and ACM ISBN 978-1-4503-5567-4/18/08. $15.00 routers. Our data center networks are fully Layer 3 routed similar to what is https://doi.org/10.1145/3230543.3230546 described in [36]. SIGCOMM ’18, August 20–25, 2018, Budapest, Hungary S. Choi et al. Growth in FBOSS deployments 30x Thanks to this trend, we started an experiment of building our in-house designed switch software five years ago. Our 20x server fleet already runs thousands of different software ser- 15x vices. We wanted to see if we can run switch software in a 10x similar way we run software services. This model is quite different from how conventional networking software is man- 5x FBOSS Deployments → Deployments FBOSS aged. Table 1 summarizes the differences between the two 1x Months → high-level approaches using a popular analogy [17]. 0 3 6 9 12 15 18 21 24 The result is Facebook Open Switching System (FBOSS), which is now powering a significant portion of our data center Figure 1: Growth in the number of switches in our data infrastructure. In this paper, we report on five years of expe- center over two year period as measured by number of riences on building, deploying and managing FBOSS. The total FBOSS deployments. main goals of this paper are: (1) Provide context about the internal workings of the soft- One of the main technical challenges in running large net- ware running on switches, including challenges, design trade- works is managing complexity of excess networking features. offs, and opportunities for improvement, both in the abstract Vendors supply common software intended for their entire for all network switch software and our specific pragmatic customer base, thus their software includes the union of all design decisions. features requested by all customers over the lifetime of the (2) Describe the design, automated tooling for deployment product. However, more features lead to more code and more monitoring, and remediation methods of FBOSS. code interactions, which ultimately lead to increased bugs, (3) Provide experiences and illustrative problems encoun- security holes, operational complexity, and downtime. To tered on managing a cloud-scale data center switch software. mitigate these issues, many data centers are designed for sim- (4) Encourage new research in the more accessible/open plicity and only use a carefully selected subset of networking field of switch software and provide a vehicle, an open source features. For example, Microsoft’s SONiC focuses on build- version of FBOSS [4], for existing network research to be ing a “lean stack” in switches [33]. evaluated on real hardware. Another of our network scaling challenges is enabling a The rest of the paper closely follows the structure of Ta- high-rate of innovation while maintaining network stability. ble 1 and is structured as follows: We first provide a couple of It is important to be able to test and deploy new ideas at scale our design principles that guide FBOSS’s development and in a timely manner. However, inherent in the vendor-supplied deployment (Section 2). Then, we briefly describe major hard- software model is that changes and features are prioritized by ware components that most data center switch software needs how well they correlate across all of their customers. A com- to manage (Section 3) and summarize the specific design mon example we cite is IPv6 forwarding, which was imple- decisions made in our system (Section 4). We then describe mented by one of our vendors very quickly due to widespread the corresponding deployment and management goals and customer demand. However, an important feature to our oper- lessons (Section 5, Section 6). We describe three operational ational workflow was fine-grained monitoring of IPv6, which challenges (Section 7) and then discuss how we have success- we quickly implemented for our own operational needs. Had fully overcome them. We further discuss various topics that we left this feature to the demands of the customer market led to our final design (Section 8) and provide a road mapfor and to be implemented by the vendors, we would not have future work (Section 9). had this feature until over four years later, which was when the feature actually arrived to the market. In recent years, the practice of building network switch components has become more open. First, network vendors 2 DESIGN PRINCIPLES emerged that do not build their own packet forwarding chips. We designed FBOSS with two high-level design princi- Instead they rely on third-party silicon vendors, commonly ples: (1) Deploy and evolve the software on our switches the known as “merchant silicons”. Then, merchant silicon ven- same as we do our servers (Switch-as-a-Server). (2) Use early dors along with box/chassis manufacturers have emerged that deployment and fast iteration to force ourselves to have a create a new, disaggregated ecosystem where networking minimally complex network that only uses features that are hardware can be purchased without any software. As a re- strictly needed (Deploy-Early-and-Iterate). These principles sult, it is now possible for end-customers to build a complete have been echoed in the industry - a few other customized custom switch software stack from scratch. switch software efforts like Microsoft ACS [8]/SONiC [33] is based on similar motivation.

FBOSS: Building Switch Software at Scale

SOLUTIONS Products &

KINTSUGI Identifying & Addressing Challenges in Embedded Binary

Junos OS NAT Configuration Examples for Screenos Users

Juniper Networks Inc

Certification Report Juniper Junos OS 20.2R1 for SRX345, SRX345-DUAL-AC, SRX380 and SRX1500

Enhancing Networks Via Virtualized Network Functions

Junos® OS Intrusion Detection and Prevention User Guide Copyright © 2021 Juniper Networks, Inc

Day One: Junos® Pyez Cookbook

Request System Power Off

Release Notes

Abril De 2021

Sysadministrivia S0E13: "Better Late Than Never"