PaxosStore: High-availability Storage Made Practical in WeChat

Jianjun Zheng† Qian Lin§†∗ Jiatao Xu† Cheng Wei† Chuwei Zeng† Pingan Yang† Yunfan Zhang†
†Tencent Inc. §National University of Singapore
†{rockzheng, sunnyxu, dengoswei, eddyzeng, ypaapyyang, ifanzhang}@tencent.com §[email protected]

∗ Part of this work was done while Qian Lin was an intern at Tencent.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected].
Proceedings of the VLDB Endowment, Vol. 10, No. 12
Copyright 2017 VLDB Endowment 2150-8097/17/08.

ABSTRACT

In this paper, we present PaxosStore, a high-availability storage system developed to support the comprehensive business of WeChat. It employs a combinational design in the storage layer to engage multiple storage engines constructed for different storage models. PaxosStore is characterized by extracting the Paxos-based distributed consensus protocol as a middleware that is universally accessible to the underlying multi-model storage engines. This facilitates tuning, maintaining, scaling and extending the storage engines. In our engineering experience, achieving a practical consistent read/write protocol is far more complex than the theory suggests. To tackle this engineering complexity, we propose a layered design of the Paxos-based storage protocol stack, where PaxosLog, the key data structure used in the protocol, is devised to bridge the programming-oriented consistent read/write to the storage-oriented Paxos procedure. Additionally, we present optimizations based on Paxos that make fault tolerance more efficient. Discussion throughout the paper primarily focuses on pragmatic solutions that could be insightful for building practical distributed storage systems.

1. INTRODUCTION

WeChat is one of the most popular mobile apps, with 700 million active users per day. Services provided by WeChat include instant messaging, social networking, mobile payment, third-party authorization, etc. The comprehensive business of WeChat is supported by its backend, which consists of many functional components developed by different teams. In spite of the diversity of business logic, most of the backend components require reliable storage to support their implementation. Initially, each development team arbitrarily adopted off-the-shelf storage systems to prototype its individual component. However, in production, the wide variety of fragmented storage systems not only costs great effort for system maintenance, but also renders these systems hard to scale. This calls for a universal storage system to serve the variety of WeChat business, and PaxosStore is the second generation¹ of such a storage system in WeChat production.

¹ The first generation of WeChat storage system was based on the quorum protocol (NWR).

The following requirements regarding storage service are common among WeChat business components. First, the well-known three V's of big data (volume, velocity, variety) are a real presence. On average, about 1.5 TB of data is generated per day, carrying assorted types of content including text messages, images, audios, videos, financial transactions, posts of social networking, etc. Application queries are issued at the rate of tens of thousands of queries per second in the daytime. In particular, single-record accesses dominate the system usage. Second, high availability is a first-class requirement of the storage service. Most applications depend on PaxosStore for their implementation (e.g., point-to-point messaging, group messaging, browsing social networking posts). Availability is critical to user experience. Most applications in WeChat require the latency overhead in PaxosStore to be less than 20 ms. This latency requirement needs to be met at the urban scale.

Along with building PaxosStore to provide high-availability storage service, we face the following challenges:

• Effective and efficient consensus guarantee. At the core, PaxosStore uses the Paxos algorithm [19] to handle consensus. Although the original Paxos protocol theoretically offers the consensus guarantee, its implementation complexity (e.g., the sophisticated state machines need to be maintained and traced properly) as well as its runtime overhead (e.g., bandwidth for synchronization) render it inadequate to support the comprehensive WeChat services.

• Elasticity and low latency. PaxosStore is required to support low-latency read/write at the urban scale. Load surges need to be handled properly at runtime.

• Automatic cross-datacenter fault-tolerance. In WeChat production, PaxosStore is deployed over thousands of commodity servers across multiple datacenters all over the world. Hardware failure and network outage are not rare in such a large-scale system. The fault-tolerant scheme should support effective failure detection and recovery without impacting the overall efficiency of the system.

PaxosStore is designed and implemented as the practical solution of high-availability storage in the WeChat backend. It employs a combinational design in the storage layer to engage multiple storage engines constructed for different storage models. PaxosStore is characterized by extracting the Paxos-based distributed consensus protocol as a middleware that is universally accessible to the underlying multi-model storage engines. The Paxos-based storage protocol is extensible to support assorted data structures with respect to the programming model exposed to the applications. Moreover, PaxosStore adopts the leaseless design, which improves system availability along with fault tolerance.

The key contributions of this paper are summarized as follows:

• We present the design of PaxosStore, with emphasis on the construction of the consistent read/write protocol as well as how it functions. By decoupling the consensus protocol from the storage layer, PaxosStore can support multiple storage engines constructed for different storage models with high scalability.

• Based on the design of PaxosStore, we further present the fault-tolerant scheme and the details about data recovery. The described techniques have been fully applied in a real large-scale production environment, enabling PaxosStore to achieve 99.9999% availability² for all the production applications of WeChat.

• PaxosStore has been serving the ever growing WeChat business for more than two years. Based on this practical experience, we discuss design trade-offs as well as present results of experimental evaluation from our deployment.

The remaining content of this paper is organized as follows. We first present the detailed design and architecture of PaxosStore in Section 2, and then describe its fault-tolerant scheme and data recovery techniques in Section 3. We discuss the key points of PaxosStore implementation in Section 4. In Section 5, we conduct a performance study of PaxosStore and collect measurements from the running system. We discuss related research in Section 6, and finally conclude the paper as well as list the lessons learned from building PaxosStore in Section 7.

2. DESIGN

[Figure 1: Overall architecture of PaxosStore. Layers, top to bottom: application clients; programming model (key-value, table, queue, set, ...); consensus layer (Paxos-based storage protocol); storage layer (Bitcask, Main/Delta table, LSM-tree).]

[...] of storage system mainly in that it explicitly extracts the implementation of the consensus protocol as a middleware and universally provides data consensus guarantee as a service to all the underlying storage engines.

Conventional distributed storage systems, more often than not, are built on their individual single storage model and tweak the design and implementation to fulfill the various application demands. However, these tend to introduce complicated trade-offs between different components. Although assembling multiple off-the-shelf storage systems to form a comprehensive system can meet the diversity of storage requirements, it usually makes the entire system difficult to maintain and less scalable. Moreover, each storage system embeds its implementation of the data consensus protocol into the storage model, and the consequential divide-and-conquer approach for consensus could be error-prone. This is because the consensus guarantee of the overall system is determined by the conjunction of consensus achievements provided by the individual subsystems. Furthermore, applications with cross-model data accesses (i.e., across multiple storage subsystems) can hardly leverage any consensus support from the underlying subsystems, but have to separately implement the consensus protocol for such cross-model scenarios on their own.

PaxosStore applies a combinational storage design such that multiple storage models are implemented in the storage layer with the only engineering concern of high availability, leaving the consensus concern to the consensus layer. This lets each storage engine, as well as the entire storage layer, scale on demand. As the implementation of the consensus protocol is decoupled from the storage engines, the universal service of providing data consensus guarantee can be shared among all the supported storage models. This also makes the consensus of cross-model data accesses easy to achieve.

While the design and implementation of the programming