Citation: Gorbenko, A., Karpenko, A. and Tarasyuk, O. (2020) Analysis of Trade-offs in Fault-Tolerant Distributed Computing and Replicated Databases. In: 2020 IEEE 11th International Conference on Dependable Systems, Services and Technologies (DESSERT). IEEE. ISBN 978-1-7281-9958-0, 978-1-7281-9957-3. DOI: https://doi.org/10.1109/dessert50317.2020.9125078

Link to Leeds Beckett Repository record: http://eprints.leedsbeckett.ac.uk/id/eprint/7186/
Document Version: Book Section

© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
The 11th IEEE International Conference on Dependable Systems, Services and Technologies, DESSERT'2020, 14-18 May 2020, Kyiv, Ukraine

Analysis of Trade-offs in Fault-Tolerant Distributed Computing and Replicated Databases

Anatoliy Gorbenko (1,2), Andrii Karpenko (2), Olga Tarasyuk (2)
(1) Leeds Beckett University, Leeds, UK; (2) National Aerospace University, Kharkiv, Ukraine
[email protected], [email protected], [email protected]

Abstract—This paper examines fundamental trade-offs in fault-tolerant distributed systems and replicated databases built over the Internet. We discuss the interplay between consistency, availability and latency, which is in the very nature of globally distributed systems, and also analyse its interconnection with durability and energy efficiency. In this paper we put forward the idea that consistency, availability, latency, durability and other properties need to be viewed as continuous rather than binary, in contrast to the well-known CAP/PACELC theorems. We compare different consistency models and highlight the role of the application timeout, replication factor and other settings that essentially determine the interplay between these properties. Our findings may be of interest to software engineers and system architects who develop Internet-scale distributed computer systems and cloud solutions.

Keywords—distributed system, replication, trade-offs, consistency, availability, latency, durability, energy efficiency

I. INTRODUCTION

Internet-scale distributed computer systems are now extensively used in business-critical applications and critical infrastructures. In such application domains, system failures affect businesses and people's lives. Thus, ensuring the dependability of distributed computer systems is both a must and a challenge. However, by their nature, large-scale distributed systems running over the Internet are subject to component failures, packet loss, network disconnections, congestion and other accidents.

High availability requirements for many Internet applications call for the use of data replication and redundancy. Traditional fault-tolerance mechanisms such as N-modular, cold- and hot-spare redundancy usually rely on synchronous communication between replicated components. This assumes that system replicas are synchronized within a short and well-predicted amount of time [1]. This is a reasonable assumption for embedded applications and for those distributed computer systems whose components are located, for instance, within the same data center or the same local area network. However, it does not apply to globally distributed computer systems, whose replicas are deployed across the Internet and whose updates cannot be propagated immediately. This circumstance makes it difficult to guarantee consistency across replicas.

The Internet and globally distributed computer systems are characterized by a high level of uncertainty of network delays. This makes it almost impossible to guarantee that network messages will always be delivered between system components within a certain time. It has been previously shown that there is significant uncertainty of response time in service-oriented systems invoked over clouds and the Internet [2]. Besides, in our practical experiments and as discussed by other researchers [3, 4, 5], failures occur regularly on the Internet, in clouds and in scale-out data center networks.

When system architects employ replication and other fault-tolerance techniques for Internet- and cloud-based systems, they have to account for additional delays and their uncertainty and also understand energy and other overheads. Besides, maintaining consistency between replicas is another important issue that needs to be addressed in fault-tolerant distributed computing and replicated data storages.

Maintaining several system replicas inevitably increases the overall energy consumed by the system. The bigger the replication factor, the higher the availability that can be ensured, at higher energy costs. Moreover, the need for data synchronization between replicas to guarantee their consistency causes additional energy overheads spent on parallel request processing and on transferring larger amounts of data over the network.

This paper examines fundamental trade-offs between system latency, durability and energy consumption in addition to the Consistency (C), Availability (A) and Partition tolerance (P) properties described by the CAP theorem [6]. In this work we put forward the idea that these properties need to be viewed as continuous rather than binary. Understanding the trade-offs between them will allow system developers to predict system response time depending on the replication factor used and/or the consistency level selected. Besides, it will enable balancing availability, durability and/or consistency against system latency, power consumption and other properties.

The rest of the paper is organized as follows. In Section 2 we discuss the fundamental CAP and PACELC theorems and their qualitative implications. Section 3 discusses different consistency models and examines trade-offs in fault-tolerant distributed computing and replicated databases between CAP properties, latency, durability and energy consumption. Finally, we draw our conclusions in Section 4, where we also discuss the role of the application timeout, replication factor and other settings that essentially determine the interplay between these properties.

II. CAP AND PACELC THEOREMS AND THEIR IMPLICATIONS

The CAP theorem [6], which first appeared in 1998-1999, defines a qualitative trade-off between system Consistency, Availability and Partition tolerance. It declares that only two of the three properties can be preserved at once in distributed replicated systems. Gilbert and Lynch [7] consider the CAP theorem a particular case of a more general trade-off between consistency and availability in unreliable distributed systems that propagate updates eventually over time.

Internet-scale distributed systems cannot avoid partitions, which happen due to network failures and congestion, arbitrary message losses and component crashes, and hence have to choose between strong consistency and high availability. Thus, a distributed system working over an unreliable and unpredictable network environment has to sacrifice one of these properties. Failing to achieve consistency (i.e. to receive responses from all replicas) within the specified timeout causes a partition of the replicated system.

Timeout settings are of great importance in the context of the CAP theorem. If the timeout is less than the typical response time, a system will likely enter a partition mode more often [8]. When a system detects a partition (e.g. when one of its replicas did not respond before the timeout), it has to decide whether to

III. TRADE-OFFS IN FAULT-TOLERANT DISTRIBUTED COMPUTING AND REPLICATED DATABASES

In this paper we put forward the idea that the CAP/PACELC properties need to be viewed as continuous rather than binary. Indeed, availability is usually measured between 0% and 100%. Latency (response time) can theoretically vary between zero and infinity, though in practice it is bounded from above by the application timeout and from below by some minimal response time higher than zero. Consequently, the replica's timeout defines system partitioning. Consistency is also a continuum, ranging from weak consistency at one extreme to strong consistency at the other, with varying points of eventual consistency in between.

In this section we describe different consistency models and discuss trade-offs between core properties of distributed storage
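The continuum described above can be illustrated with a small simulation of a quorum-replicated store, in the style of Cassandra-like tunable consistency: the client-selected consistency level fixes how many replica acknowledgements the coordinator waits for, and the timeout turns a slow replica into a detected partition. This is a minimal sketch, not the authors' method; the helper name, the replica delays and the timeout value are invented for illustration.

```python
def replicated_read(replica_delays, required_acks, timeout):
    """Simulate one read over len(replica_delays) replicas.

    The coordinator returns as soon as `required_acks` replicas reply,
    so the observed latency is the delay of the slowest reply among the
    fastest `required_acks` replicas.  If even those replies do not
    arrive before `timeout`, the read fails: the replica timeout is
    what turns slowness into a detected partition.
    """
    fastest = sorted(replica_delays)[:required_acks]
    latency = fastest[-1]
    return latency if latency <= timeout else None  # None = partition

N = 5
# Hypothetical one-way delays in seconds: three healthy replicas,
# one congested replica and one that is effectively unreachable.
delays = [0.010, 0.012, 0.015, 0.140, 2.500]
TIMEOUT = 1.0

# Weak consistency (one ack): lowest latency, possibly stale data.
print(replicated_read(delays, 1, TIMEOUT))           # -> 0.01
# Majority quorum (strong when R + W > N): modest latency increase.
print(replicated_read(delays, N // 2 + 1, TIMEOUT))  # -> 0.015
# Strong consistency (all acks): dominated by the slowest replica,
# which here misses the timeout, so availability is sacrificed.
print(replicated_read(delays, N, TIMEOUT))           # -> None
```

Raising the replication factor N or tightening the timeout moves the system along the same continuum: stronger consistency levels pay in latency and, past the timeout, in availability.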