Data Partitioning and Placement Mechanisms for Elastic Key-Value Stores

Han Li

A dissertation submitted in fulfilment of the requirements for the degree of Doctor of Philosophy

The University of New South Wales

School of Computer Science and Engineering Faculty of Engineering

28 March 2014

THE UNIVERSITY OF NEW SOUTH WALES: Thesis/Dissertation Sheet

Surname or Family name: LI

First name: HAN Other name/s:

Abbreviation for degree as given in the University calendar: Ph.D.

School: School of Computer Science and Engineering Faculty: Faculty of Engineering

Title: Data partitioning and placement mechanisms for elastic key-value stores

Abstract (350 words maximum):

Key-Value Stores (KVSs) have become a standard component for many web services and applications due to their inherent scalability, availability, and reliability. Many enterprises are now adopting them for use on servers leased from Infrastructure-as-a-Service (IaaS) providers. The defining characteristic of IaaS is resource elasticity. KVSs benefit from elasticity, when they incorporate new resources on-demand as KVS nodes to deal with increasing workload, and decommission excess resources to save on operational costs.

Elasticity of a KVS poses challenges in allowing efficient, dynamic node arrivals and departures. On one hand, the workload needs to be quickly balanced among the KVS nodes. However, current data partitioning and migration schemes give low priority to populating new nodes, which diminishes the benefit of adding resources under an increasing workload. On the other hand, dynamic node changes degrade data durability under multiple node failures caused by hardware faults in the IaaS Cloud, which is built from commodity components that fail as the norm at large scale; yet current replica placement strategies tend to rely on a static mapping of data to nodes for high durability.

This thesis proposes a set of data management schemes to address these issues. Firstly, it presents a decentralised automated partitioning algorithm and a lightweight migration strategy, to improve the efficiency of node changes. Secondly, it presents the design of ElasCass, an elastic KVS that incorporates these schemes, implemented atop Apache Cassandra. Finally, it presents a replica placement algorithm with a proof that shows its correctness, to fill the gap of allowing dynamic node changes while maintaining high data durability at multiple node failures. Contributions of this thesis lie in this set of novel schemes for data partitioning, placement, and migration, which provide efficient elasticity for decentralised, shared-nothing KVSs.

The evaluations of ElasCass, conducted on Amazon EC2, revealed that the proposed schemes reduce node incorporation time and improve load-balancing, thereby increasing scalability and query performance. The other evaluation simulated thousands of KVS nodes and demonstrated that the proposed placement algorithm maintains a near-minimal probability of data loss under different failure scenarios, and exhibits better scalability and elasticity than state-of-the-art placement schemes.

Declaration relating to disposition of project thesis/dissertation

I hereby grant to the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or in part in the University libraries in all forms of media, now or hereafter known, subject to the provisions of the Copyright Act 1968. I retain all property rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.

I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International (this is applicable to doctoral theses only).

Signature: ........................  Witness: ........................  Date: 28 March 2014

The University recognises that there may be exceptional circumstances requiring restrictions on copying or conditions on use. Requests for restriction for a period of up to 2 years must be made in writing. Requests for a longer period of restriction may be considered in exceptional circumstances and require the approval of the Dean of Graduate Research.


ORIGINALITY STATEMENT

‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.’

Signed: Han Li

Date: 28 March 2014

Copyright Statement

I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or hereafter known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International (this is applicable to doctoral theses only). I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation. Signed: Han Li  Date: 28 March 2014

Authenticity Statement

I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format. Signed: Han Li  Date: 28 March 2014


Contents

Contents ix

Abstract xiii

Acknowledgements xv

List of Publications xvii

List of Figures xix

List of Tables xxi

List of Acronyms xxiii

1 Introduction 1

1.1 Introduction to Distributed Data Stores...... 1

1.2 Key-Value Stores on the IaaS Cloud ...... 3

1.3 Research Problem ...... 5

1.4 Research Objective...... 7

1.5 Research Contribution...... 8

1.6 Thesis Outline ...... 9

2 Background 13

2.1 Cloud Computing and Elasticity ...... 13

2.1.1 Cloud Computing ...... 13

2.1.2 Elasticity: A Key Feature of Cloud...... 17

2.2 Distributed Data Stores ...... 19

2.2.1 Distributed Database Systems ...... 19


2.2.2 Evolution of Distributed Data Stores...... 21

2.2.3 Distributed Data Stores on the Cloud ...... 27

2.3 Chapter Summary ...... 30

3 State of the Art 33

3.1 Key-Value Stores (KVSs) ...... 33

3.1.1 System Architecture...... 34

3.1.2 Consistency Model...... 40

3.1.3 Quality Attributes...... 44

3.1.4 Query Performance...... 47

3.1.5 Key-Value Stores in Practice ...... 52

3.2 Approaches to the Elasticity of KVSs ...... 55

3.2.1 Data Partitioning...... 56

3.2.2 Data Placement ...... 59

3.2.3 Data Migration...... 63

3.3 Discussion and Summary ...... 64

4 Data Distribution for Efficient Elasticity of KVSs 67

4.1 The Elasticity Challenge...... 67

4.2 Design of An Elasticity Middleware...... 69

4.2.1 Automated Data Partitioning...... 71

4.2.2 Decentralised Coordination ...... 75

4.2.3 Replica Placement...... 80

4.2.4 Data Migration...... 89

4.3 Chapter Summary ...... 93

5 Implementation and Evaluation of ElasCass 95

5.1 Introduction...... 95

5.2 Background...... 97

5.2.1 Existing Open Source KVSs...... 97

5.2.2 Overview of Apache Cassandra...... 98

5.3 ElasCass Implementation ...... 103

5.3.1 Token Management ...... 105

5.3.2 Data Storage ...... 111

5.3.3 Replica Reallocation...... 118

5.4 Experiments and Results...... 122

5.4.1 Experimental Setup ...... 122

5.4.2 Data partitioning...... 124

5.4.3 Node Bootstrapping ...... 129

5.4.4 Query Performance...... 135

5.4.5 Discussion...... 141

5.5 Chapter Summary ...... 142

6 Replica Placement for High Durability and Elasticity 145

6.1 Introduction...... 145

6.2 Problem Definition...... 150

6.2.1 Parameter Definitions ...... 150

6.2.2 Relationship between Data Loss and Copysets ...... 151

6.3 Design of ElasticCopyset...... 153

6.3.1 The Intuition...... 153

6.3.2 Copyset Generation within a Group ...... 155

6.3.3 A Theorem for the Minimum Group...... 158

6.3.4 Handling Node Changes...... 160

6.3.5 Discussion on Implementing ElasticCopyset ...... 165

6.4 Evaluation...... 170

6.4.1 Evaluation on Static Deployment...... 171

6.4.2 Scaling at a Constant Rate ...... 175

6.4.3 Elastic Scaling Under Workload...... 178

6.4.4 Discussion...... 180

6.5 Chapter Summary ...... 181

7 Conclusions and Future Directions 183

7.1 Thesis Summary ...... 183

7.2 Future Directions...... 185

7.2.1 Supporting Richer Queries...... 185

7.2.2 Extending ElasticCopyset...... 187

References 191

Appendix A Proof of Lemmas 213

A.1 Proof of Lemma 1 ...... 213

A.2 Proof of Lemma 2 ...... 214

A.3 Proof of Lemma 3 ...... 214

A.4 Proof of Lemma 4 ...... 215

A.4.1 Proof for an Even L ...... 215

A.4.2 Proof for an Odd L ...... 218

A.5 Proof of Lemma 5 ...... 221

A.6 Proof of Lemma 6 ...... 221

Abstract

Key-Value Stores (KVSs) have become a standard component for many web services and applications due to their inherent scalability, availability, and reliability. Many enterprises are now adopting them for use on servers leased from Infrastructure-as-a-Service (IaaS) providers. The defining characteristic of IaaS is resource elasticity. KVSs benefit from elasticity, when they incorporate new resources on-demand as KVS nodes to deal with increasing workload, and decommission excess resources to save on operational costs.

Elasticity of a KVS poses challenges in allowing efficient, dynamic node arrivals and departures. On one hand, the workload needs to be quickly balanced among the KVS nodes. However, current data partitioning and migration schemes give low priority to populating new nodes, which diminishes the benefit of adding resources under an increasing workload. On the other hand, dynamic node changes degrade data durability under multiple node failures caused by hardware faults in the IaaS Cloud, which is built from commodity components that fail as the norm at large scale; yet current replica placement strategies tend to rely on a static mapping of data to nodes for high durability.

This thesis proposes a set of data management schemes to address these issues. Firstly, it presents a decentralised automated partitioning algorithm and a lightweight migration strategy, to improve the efficiency of node changes. Secondly, it presents the design of ElasCass, an elastic KVS that incorporates these schemes, implemented atop Apache Cassandra. Finally, it presents a replica placement algorithm with a proof that shows its correctness, to fill the gap of allowing dynamic node changes while maintaining high data durability at multiple node failures. Contributions of this thesis lie in this set of novel schemes for data partitioning, placement, and migration, which provide efficient elasticity for decentralised, shared-nothing KVSs.

The evaluations of ElasCass, conducted on Amazon EC2, revealed that the proposed schemes reduce node incorporation time and improve load-balancing, thereby increasing scalability and query performance. The other evaluation simulated thousands of KVS nodes and demonstrated that the proposed placement algorithm maintains a near-minimal probability of data loss under different failure scenarios, and exhibits better scalability and elasticity than state-of-the-art placement schemes.

Acknowledgements

This thesis is the culmination of a long journey that would not have been possible without the support offered to me by all the helpful people around me. First and foremost, I would like to thank my principal supervisor, Dr. Srikumar Venugopal, for his advice and ceaseless encouragement throughout the last four years. I am grateful for his devotion to keeping my research career on track. Also, I would like to express my gratitude towards Professor Fethi Rabhi, my co-supervisor, for his support and advice that have helped me move forward through several transitions.

I would like to thank Professor Boualem Benatallah, Associate Professor Wei Wang, and Associate Professor Salil Kanhere, for serving on my annual progress review committee and for providing valuable feedback, which inspired me to carry forward my doctoral study. I would also like to thank Dr. Lawrence Yao and Dr. Nawid Jamali, for their useful advice and encouragement during my doctoral candidature.

I would like to thank the University of New South Wales and the School of Computer Science and Engineering for the research grant, the necessary infrastructure, and the opportunity to pursue my doctoral study. I would like to thank Smart Services CRC Pty Ltd for the grant for the Services Aggregation project that made this work possible. In particular, I would like to thank Daniel Austin, Head of Research & Development in Smart Services CRC, for his great effort in providing me with a significant amount of data to support my research on distributed data stores. Moreover, I would like to thank Sirca for their help in motivating the application of this research. Also, I would like to thank Asaf Cidon from Stanford University for his extensive comments on the design of ElasticCopyset, which have been incorporated into this thesis.

Last but not least, I would like to thank my family for their love and support at all times. My mother and father have always supported me in climbing the mountain of knowledge. They have waited a long time for this. My wife, Yiwen, has made great sacrifices to leave China and join me. She has been my rock during this journey. I would also like to thank my uncle, Harvey (Weixiong), for his help with my life in Sydney.

List of Publications

Below is a list of relevant publications in which the author appears, sorted by publication date.

1. Venugopal, S., Li, H. & Ray, P. 2011, ‘Auto-scaling emergency call centres using cloud resources to handle disasters’, in Proceedings of the 19th International Workshop on Quality of Service (IWQoS), IEEE Press, San Jose, CA, USA, p. 34.

2. Li, H. & Venugopal, S. 2011, ‘Using reinforcement learning for controlling an elastic web application hosting platform’, in Proceedings of the 8th ACM International Conference on Autonomic Computing (ICAC), ACM, Karlsruhe, Germany, pp. 205–208.

3. Li, H. & Venugopal, S. 2013, ‘Towards elastic key-value stores on IaaS’, in 29th IEEE International Conference on Data Engineering Workshops (ICDEW), IEEE, Brisbane, Australia, pp. 302–305.

4. Li, H. & Venugopal, S. 2013, ‘Efficient node bootstrapping for decentralised shared-nothing key-value stores’, in ACM/IFIP/USENIX International Middleware Conference 2013, Springer, Beijing, China, pp. 348–367.


List of Figures

1.1 A comparison between RDBMS and structured storage systems...... 2

2.1 Relationships between providers and consumers on the Cloud...... 14

2.2 Various patterns of workload demand ...... 18

2.3 Provisioning elastic capacity for dynamic workload demand...... 19

2.4 A typical model to deploy a distributed data store on the IaaS Cloud . . . . . 28

3.1 Categorise the design choices in Key-Value Stores...... 34

3.2 A classification of Key-Value Stores based on storage model...... 35

3.3 Consistency from the system’s perspective...... 42

3.4 Various internal data models of a key-value pair ...... 48

3.5 Different partitioning approaches to the addition of a KVS node ...... 58

4.1 The decentralised shared-nothing architecture of Key-Value Stores ...... 70

4.2 The automated data partitioning algorithm...... 73

4.3 Rebuild replicas during automated partitioning...... 75

4.4 Automated partitioning with election-based coordination ...... 76

4.5 Failure recovery in the election-based coordination...... 79

4.6 Prepare a preference list of partitions to move out ...... 83

4.7 Manage a token during replica migration for data consistency...... 91

5.1 Data distribution at node changes in Apache Cassandra ...... 100

5.2 The read/write operations in Apache Cassandra ...... 101

5.3 The ongoing view of key space in ElasCass ...... 106

5.4 Real-time token management for replica migration ...... 121

5.5 Statistics of partitioning 100GB data under different thresholds ...... 126

5.6 A fine-grained view of partitioning 100GB data under different thresholds . . . 128


5.7 The volume of data transferred at bootstrap in different Key-Value Stores ...... 130

5.8 The change in number of partitions at bootstrap in ElasCass ...... 131

5.9 A fine-grained view of data distribution after node bootstrapping ...... 132

5.10 A summary of data distribution after node bootstrapping ...... 133

5.11 Performance of node bootstrapping in different Key-Value Stores ...... 134

5.12 The throughput of intensive writes with different distributions ...... 136

5.13 The throughput of intensive reads with different distributions ...... 137

5.14 Update latency in write-intensive workloads under different distributions ...... 138

5.15 A fine-grained view of CPU utilisations under read-intensive workloads ...... 139

5.16 Average CPU utilisations of serving read-intensive workloads at varied scales ...... 140

5.17 Imbalance index of CPU utilisation when serving read-intensive workloads ...... 140

6.1 Probability of data loss versus number of copysets ...... 148

6.2 An illustration of scatter width S versus replication number R ...... 151

6.3 Dividing a cluster of nodes into complete and incomplete groups ...... 154

6.4 Create a minimum group of non-overlapping copysets by shuffling ...... 156

6.5 Adding and removing a node in ElasticCopyset ...... 161

6.6 Create a new incomplete group in ElasticCopyset ...... 163

6.7 Dismiss an existing incomplete group in ElasticCopyset ...... 164

6.8 Populating the complete groups using Paxos-based coordination ...... 166

6.9 Number of copysets generated by different placement strategies ...... 171

6.10 Testing how the varied scatter width S will impact the system of 5000 nodes ...... 172

6.11 Compare the actual scatter width against the predefined S ...... 173

6.12 The change in numbers of copysets and groups in ElasticCopyset ...... 174

6.13 Probability of data loss with varying node failure rate ...... 174

6.14 Scale out the system by adding nodes one-by-one at runtime ...... 176

6.15 Scale in the system by removing nodes one-by-one at runtime ...... 177

6.16 The input for testing the dynamic scalability of different placement schemes ...... 178

6.17 Dynamic scaling using ElasticCopyset vs. Copyset Replication ...... 179

7.1 Varied consistency levels while maintaining scalability of Key-Value Stores ...... 186

7.2 An example of balanced load over copysets with imbalanced load across nodes ...... 188

7.3 An extended shuffle algorithm for a higher replication number ...... 189

List of Tables

2.1 Comparison between various distributed data stores ...... 27

3.1 Summary of various storage models of Key-Value Stores ...... 37

3.2 Comparing centralised, shared-storage vs. decentralised, shared-nothing ...... 39

3.3 Consistency properties from the developer’s perspective ...... 41

4.1 Notational Conventions ...... 71

4.2 The sets of nodes entitled to serve query operations for a partition Pi . . . . . 90

5.1 Comparison between Apache Cassandra and ElasCass ...... 104

5.2 The token-node multimaps in ElasCass ...... 107

5.3 Variables of a gossip message for token updates in ElasCass ...... 109

5.4 The set of eligible nodes for different operations in ElasCass ...... 110

5.5 Gathering load statistics in ElasCass ...... 119

5.6 Computational capacity of a single virtual machine in experiments ...... 122

5.7 Parameters configured in YCSB ...... 124

5.8 The setup for evaluating automated partitioning ...... 125

6.1 Notational Conventions ...... 150


List of Acronyms

A list of acronyms used in this thesis is presented as follows in lexical order.

ACID Atomicity, Consistency, Isolation and Durability

BASE Basically Available, Soft state, Eventual consistency

CAP Consistency, Availability, and network Partition tolerance

DFS Distributed File System

DHT Distributed Hash Table

IaaS Infrastructure as a Service

KVS Key-Value Store

NAS Network-attached Storage

PaaS Platform as a Service

P2P Peer-to-Peer

RDBMS Relational Database Management System

SaaS Software as a Service

YCSB Yahoo! Cloud Serving Benchmark

VM Virtual Machine


Chapter 1

Introduction

Mighty oaks from little acorns grow.

– Alexander Bryan Johnson

This chapter provides an introduction to the research addressed in this thesis. It begins with an overview of distributed data stores, and focuses on the state of the art in Key-Value Stores (KVSs). Then, it introduces the paradigm of Cloud computing, and discusses the motivation for deploying a KVS on the Cloud. Next, it provides an overview of the research problem, and the objectives of this research. Moreover, it outlines the expected research contribution of this thesis. Finally, this chapter concludes with an outline of the organisation of this thesis.

1.1 Introduction to Distributed Data Stores

Data is central to applications and services. Be it personalised search, sharing information in social networking, or recommending products for online shopping, data plays a key role in improving customer satisfaction. Data drives knowledge, which breeds innovation. Therefore, many modern enterprises are collecting data at the most detailed level possible, resulting in ever-growing data repositories that consist of massive data objects. Hence, software providing the storage and retrieval of data objects forms a critical component of these data-centric stacks. Such software is termed a data store in this thesis.

Thanks to the development of commodity computers, data stores have been designed to run on many low-performance, low-cost machines working in parallel, rather than on individual high-performance, high-cost machines. This has changed the way that a data store can grow in the face of increasing workload demands. For data stores running on multiple commodity computers, the system's capacity is improved by adding more commodity computers; this is termed scale-out. In contrast, for data stores deployed on high-end machines, more resources (such as CPU, memory, and network) are added to a single machine to achieve better performance; this is termed scale-up. Therefore, current-state data stores are distributed systems with the capability of scaling out. Distributed data stores can be classified into two categories, relational database management systems (RDBMSs) and structured storage systems, described as follows.

Figure 1.1: A comparison between RDBMS and structured storage systems: (a) the trade-off between availability and consistency, so as to tolerate network partitions; (b) system scalability versus transaction support.

RDBMSs are classical examples of data stores. They provide rich functionality using the relational model introduced by Codd (1970), along with a declarative query language called SQL. Since the 1980s, these systems have been extremely successful in traditional enterprise settings. One of the key features contributing to the widespread use of RDBMSs is transactional access to data, which is guaranteed by the set of ACID properties, namely atomicity, consistency, isolation, and durability. In database transactions, the consistency and integrity of data are treated with the highest priority.

In spite of the success of RDBMSs in classical enterprise infrastructures, their availability and scalability are somewhat limited. On one hand, the issue of availability is codified in Eric Brewer’s “CAP Conjecture” (Gilbert & Lynch, 2002), which states that a distributed data store can simultaneously provide only two out of three of the following properties: consistency (C), availability (A), and tolerance of network partitions (P). Partition tolerance means that the distributed system continues to operate despite arbitrary failure of a part of the system. As shown in Figure 1.1a, RDBMSs always choose consistency. Therefore, they sacrifice availability in the face of network partition. On the other hand, since commodity computers are in use, an increase in workload demand requires the system to scale out (i.e. adding more commodity computers). However, scaling out RDBMSs has a performance cost, due to the need for distributed lock mechanisms to facilitate transactions. Moreover, during scale-out, it is a major engineering challenge to partition the data following the relational model (Figure 1.1b). Therefore, scalability is more expensive in RDBMSs.

Driven by the demands of real-time web applications and massive data repositories, another class of data stores, called structured storage systems (or NoSQL), were designed to provide inherent scalability and finer control over availability. In these systems, data is structured in ways other than the relational model used in RDBMSs, so that scalable data partitioning techniques can be employed (Figure 1.1b). In the context of Brewer's CAP Conjecture, structured storage systems do not provide full ACID transaction support, but adopt a weaker consistency model, so as to maintain high availability while withstanding network partitions (Figure 1.1a). Therefore, the varying degrees of weaker consistency engender finer-grained availability.

To meet different demands of applications, structured storage systems have diverse data models, the representatives of which are key-value-based, document-oriented, graph-based, and object-oriented. Amongst the variety of these systems, Key-Value Stores (KVSs) have evolved as the most widely used data stores for general-purpose distributed data storage. Examples of KVSs are Google's Bigtable (Chang et al., 2006), Amazon's Dynamo (DeCandia et al., 2007), Yahoo!'s PNUTS (Cooper et al., 2008), and many other open source variants derived from the aforementioned systems. The merits of KVSs include: a simple but extendable data model of key-value pairs, inherent scalability in data partitioning, and low-latency, highly available data access due to weaker consistency models. The next section will discuss the motivation of designing a KVS for the paradigm of Cloud computing.
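To make the key-value data model concrete, the following toy sketch (written for this discussion only; it is not the interface of Bigtable, Dynamo, PNUTS, or any other system named above) shows the minimal put/get/delete surface that a client of a KVS typically sees. It is a single-node, in-memory illustration and deliberately omits the partitioning, replication, and consistency mechanisms that the remainder of this thesis is concerned with.

    from typing import Optional

    class ToyKeyValueStore:
        """A toy, in-memory illustration of the key-value data model.
        Real KVSs partition and replicate these pairs across many nodes."""

        def __init__(self) -> None:
            self._data = {}

        def put(self, key: str, value: bytes) -> None:
            # Store (or overwrite) the value associated with the key.
            self._data[key] = value

        def get(self, key: str) -> Optional[bytes]:
            # Return the value for the key, or None if the key is absent.
            return self._data.get(key)

        def delete(self, key: str) -> None:
            # Remove the key if present; ignore it otherwise.
            self._data.pop(key, None)

    store = ToyKeyValueStore()
    store.put("user:42:name", b"Alice")
    assert store.get("user:42:name") == b"Alice"

Keys are opaque strings and values are uninterpreted byte strings, which is what makes this model straightforward to partition by key.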

1.2 Key-Value Stores on the IaaS Cloud

Cloud computing has emerged as an important paradigm for service oriented computing. It delivers the utilities of hardware and software resources as a service over the Internet.

Analysts project that worldwide spending on public IT Cloud services is increasing and is expected to exceed 100 billion dollars in the near future (Gartner, Inc., 2013).

There are three major Cloud abstractions that have gained wide acceptance (Mell & Grance, 2011). First, Infrastructure as a Service (IaaS) is the lowest abstraction, wherein raw computing resources (such as CPU, memory, storage, and network) are provisioned as a service. Next, Platform as a Service (PaaS) is a higher service abstraction where an application platform is provisioned as a service. Application developers can develop and deploy customised software solutions on the platform that provides the underlying hardware and software support. Last, Software as a Service (SaaS) is the highest abstraction layer, which provides the utility of application software and databases as a service.

This thesis focuses on the leverage of the IaaS Cloud, which provisions computing resources, usually in the form of virtual machines (VMs), to host distributed data stores (e.g. KVSs). Compared to traditional data centres, the IaaS Cloud brings in new perspectives about utilising the infrastructure. Quoting Armbrust et al. (2010): First, “the illusion of infinite computing resources available on demand”; Second, “the elimination of an up-front commitment by Cloud users”; Third, “the ability to pay for use of computing resources on a short-term basis as needed, and release them as needed”. Moreover, the transfer of risks allows small application developers to host systems on servers leased from large infrastructure providers.

Due to these new features, it becomes appealing for KVSs to be deployed on the IaaS Cloud, and more importantly, to add and remove VMs on demand. Note that workload demands are usually dynamic, driven by regular human activities, unpredictable events, or other factors. Therefore, when the workload demand increases, a KVS is required to scale out by adding more VMs as the nodes (i.e. members) of the KVS, so as to maintain a consistent service performance while dealing with extra workloads. In contrast, when the workload demand decreases, a KVS should remove idle nodes, whose data is redundantly stored elsewhere, from the system (i.e. scale in), so as to reduce the operational cost of the infrastructure, since the IaaS Cloud follows the “pay-per-use” billing model.

Hence, a KVS deployed on the IaaS Cloud is required to match its capacity to the dynamic workload demands as closely as possible. This capability is termed elasticity. There are two determinants of the degree of elasticity (Herbst et al., 2013). One is precision, which involves an automated controller that determines when and how many VMs should be added or removed based on the change of workload demands. The other is speed, which indicates how quickly the KVS can incorporate and utilise the VMs for serving workloads. This thesis targets the latter. Armbrust et al. (2010) also identified scaling quickly as one of the research obstacles/opportunities in Cloud computing. The next section will unveil the research challenges in improving the efficiency of elasticity.

1.3 Research Problem

Resource elasticity is the key feature of the IaaS Cloud, wherein computing resources (e.g. VMs) can be acquired on-demand to deal with increasing workload, and dismissed later to save on operational costs. KVSs benefit from elasticity, when node addition and removal (in other terms, scale-out and scale-in) are as efficient as possible.

However, scaling a distributed data store, such as a KVS, is challenging in that each node is associated with a significant volume of data, which requires delicate design to distribute across a dynamic number of nodes. A good data distribution design can result in well balanced load at each stable scale, leading to optimised resource use, maximised throughput, and minimised response time, without overloading any part of the system.

The strategies of data distribution largely depend on the storage model of a KVS. There have been two models to organise data for a number of nodes: shared-storage and shared-nothing. Shared-storage KVSs use a decoupled storage layer, such as network-attached storage (NAS) or a distributed file system (DFS), to store a single data copy that is accessed by all the nodes. Thus, adding or removing nodes from a KVS does not affect the data location on the decoupled storage. Additionally, shared-storage KVSs typically maintain directory services using dedicated components, which are also leveraged to facilitate load balancing and synchronisation in a centralised manner.

In contrast, shared-nothing KVSs consist of data nodes, each with its own separate storage that holds one portion of the whole data repository. To avoid data loss due to node failures, each data object has several copies (or replicas) stored by different nodes. Therefore, it is inevitable to reallocate (i.e. move) data replicas across nodes when nodes are added or removed. However, moving a large volume of data introduces a significant amount of work that competes for the resources used to serve queries, thus having a negative impact on query performance. Even worse, shared-nothing KVSs are typically deployed in a decentralised manner to avoid a single point of failure. The lack of dedicated components gives rise to the issue of decentralised coordination.

Hence, this thesis aims to investigate the research problem of how the data can be efficiently distributed for a dynamic number of nodes in shared-nothing KVSs. The goal of efficiency for an elastic KVS is three-fold, and is described as three research problems as follows:

• Lightweight data movement. When the nodes are added or removed from a shared-nothing KVS, data movement across nodes is inevitable for the reallocation of data replicas. The design of an elastic KVS must simultaneously address two main, typically opposing, concerns: i) reducing the time required to populate a new empty node with data for serving workloads; and ii) minimising the negative impact of data movement against online query processing. In many KVSs, data movement is usually executed in a low-priority, background thread that can last for many hours (DeCandia et al., 2007). Moreover, due to the dynamic changes in nodes, data movement is more frequent in an elastic KVS. It poses a challenge to ensuring data consistency, while supporting online query processing during data movement.

• Better load balancing. The goal of elasticity is to match system capacity against dynamic workloads for better performance and resource utilisation. Such goals can only be achieved when the load that each node undertakes, in terms of both workload demand and data volume, is well balanced at each stable scale after node changes. The challenge of load balancing lies in the design of a more flexible mapping between data and nodes. It also requires a scheme of data storage that quantifies the load shifted in each data reallocation.

• Higher data durability. The dynamic nature of elastic KVSs presents a challenge to ensuring high data durability. Maintaining data durability is already difficult in shared-nothing data stores (Borthakur et al., 2011). Note that at a large scale, hardware failure is the norm rather than an exception (Ghemawat et al. 2003, Vishwanath & Nagappan 2010). The failure of hardware components on the IaaS Cloud can cause multiple VMs to fail simultaneously (Guo et al., 2013), which can result in the loss of all the replicas of certain data. Worse still, maintaining data durability is even more difficult in an elastic, shared-nothing KVS, wherein the mapping between data and nodes is less steady. As a result, high-durability placement strategies relying on static deployment (Cidon et al., 2013) are no longer applicable in elastic KVSs.

In short, these research problems are related to the need for lightweight data movement, better load balancing, and higher data durability in shared-nothing KVSs. Solving these problems will make a shared-nothing KVS efficiently elastic to the dynamic workload demands. As can be seen, these problems are all related to the design choices in data management techniques, including partitioning, placement, and movement. However, there is a lack of a set of data management schemes that efficiently deals with node addition and removal in shared-nothing KVSs.

Furthermore, as will be discussed in Chapter 3, most shared-nothing KVSs follow the mantra of “no single point of failure”, as in distributed systems generally. Therefore, the KVSs are typically deployed in a decentralised, symmetric network, where every node plays the same role. This thesis has also adopted this decentralised architecture. That is to say, when it comes to decision-making or distributed synchronisation, there will be no centralised or dedicated components to rely on. Hence, decentralised coordination is another research problem that will be addressed in this thesis.

1.4 Research Objective

The main aim of this thesis is to discover, design, and implement a set of data management schemes that improve the efficiency of elasticity for decentralised, shared-nothing KVSs. The data schemes focus on reducing the time for incorporating new empty VMs as nodes of a KVS, while maintaining online query processing and data consistency during data movement. The data schemes aim to achieve balanced load, in terms of both workload demand and data volume, at each stable scale of the KVS, so as to optimise resource use, maximise throughput, and minimise response time. Moreover, the data schemes address the problem of ensuring high data durability during simultaneous node failures, while allowing dynamic node addition and removal at runtime. Additionally, the data schemes do not assume the existence of any dedicated components; instead, they incorporate the design of decentralised coordination into the execution of distributed tasks.

In order to evaluate the efficiency of the proposed data schemes, it is necessary to implement the proposed schemes in a shared-nothing KVS. The implementation is built on a state-of-the-art open-source KVS project, so as to make use of the advanced techniques that have already been proposed in literature, and to improve the system's performance in general. A widely used benchmark is used to evaluate this implementation on an IaaS Cloud.

1.5 Research Contribution

This thesis makes several contributions towards realising our vision of building elastic KVSs for the IaaS Cloud. The contributions significantly advance the state-of-the-art by supporting efficient and dynamic node additions and removals, while maintaining high query performance and data durability, in decentralised shared-nothing KVSs. These are highlighted as follows.

• This thesis presents a thorough study of state-of-the-art distributed data stores with a focus on KVSs, and analyses the design choices in building different systems for varying applicability and scope. It then discusses the data management techniques in designing KVSs that are adaptive on the IaaS Cloud. The objective of this exercise is to carry forward the lessons learned from the rich literature in distributed data stores, and to identify technologies and algorithms developed in related areas that can be applied to the target research area.

• This thesis presents a set of data distribution schemes, to improve the efficiency of adding and removing data nodes in decentralised, shared-nothing KVSs. It reduces the overheads of data movement by consolidating partition replicas into transferable units via automated partitioning, which is accomplished using a decentralised coordination scheme based on election. Moreover, it achieves balanced workload and data distribution using a set of replica placement algorithms that reallocate the partition replicas at node addition and removal. Additionally, the movement of partition replicas is carried out without affecting query performance and data consistency. This set of data distribution schemes fills in the gap of efficient elasticity in decentralised, shared-nothing KVSs.

• This thesis presents the implementation of ElasCass, an elastic, decentralised KVS that handles large volumes of data across many nodes. ElasCass realises the proposed data distribution schemes to achieve efficient elasticity, and is built atop Apache Cassandra, a widely-used open-source KVS, to provide high performance in terms of scalability and throughput. Compared to current-state KVSs, ElasCass exhibits significantly better resource utilisation in computing and storage due to a well-balanced load, and as a result, delivers a query throughput more than twice that of current-state KVSs. Moreover, ElasCass reduces the time required to incorporate a new node from several hours to within ten minutes, hence fulfilling the goal of efficient node addition.

• This thesis presents ElasticCopyset, a novel data placement scheme that guarantees high data durability while allowing dynamic node changes. Preventing data loss from simultaneous node failures is essential to large-scale systems, wherein hardware failure is the norm rather than an exception (Ghemawat et al. 2003, Vishwanath & Nagappan 2010). In ElasticCopyset, a novel shuffle algorithm is designed, and proved in mathematical terms, to create the minimum sets of nodes for storing the same data, thus achieving a minimised probability of data loss when a certain combination of nodes fails simultaneously. For example, in a cluster of 5000 storage nodes under a failure event where 1% of nodes are shut down, ElasticCopyset managed to reduce the probability of data loss from 99.99% in current-state KVSs to less than 3.8% (a small simulation of this failure model is sketched below). Moreover, by supporting a dynamic node number with the use of grouping, ElasticCopyset also exhibits great scalability and elasticity that are not possessed by high-durability, yet static placement strategies (Cidon et al., 2013).
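To illustrate the failure model behind these figures, the sketch below estimates the probability of data loss for a given replica placement by simulation: data is lost in a failure event if every node of some copyset (a set of nodes that together hold all replicas of the same data) fails at once. The function name and its parameters are illustrative assumptions written for this summary, not the simulator used in Chapter 6; the sketch only conveys why limiting the number of distinct copysets drives the loss probability down.

    import random

    def estimate_loss_probability(copysets, num_nodes, failed_fraction=0.01, trials=10000):
        """Fraction of simulated failure events in which at least one copyset
        is wholly contained in the set of failed nodes (i.e. some data is lost)."""
        num_failed = max(1, int(num_nodes * failed_fraction))
        copyset_list = [frozenset(cs) for cs in copysets]
        losses = 0
        for _ in range(trials):
            failed = set(random.sample(range(num_nodes), num_failed))
            if any(cs <= failed for cs in copyset_list):
                losses += 1
        return losses / trials

    # With random replica placement, almost every R-node combination eventually
    # becomes a copyset, so the loss probability approaches 1 in large clusters;
    # a copyset-limiting scheme keeps the number of distinct copysets small.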

1.6 Thesis Outline

This chapter has provided a general introduction to KVSs and the requirement for elasticity in IaaS Cloud environments. It has also presented an overview of the research question, the research objectives, and the research contributions of this thesis. The remainder of this thesis is organised as follows:

Chapter 2 presents Cloud computing as an important paradigm for delivering the utilities of computing resources over the Internet, and then discusses the motivation for a key feature of the Cloud, i.e. elasticity. Moreover, it reviews a variety of distributed data stores in the literature, and classifies Key-Value Stores (KVSs) as the state-of-the-art systems for general-purpose storage and retrieval of data. Finally, it presents the model for hosting a data store (e.g. a KVS) on the Cloud.

Chapter 3 provides a detailed survey of the design space of current-state KVSs under the paradigm of Cloud computing. Next, it examines the schemes of data management in KVSs in terms of node addition and removal for the sake of elasticity. Finally, it identifies the lack of decentralised schemes of data management for the efficient elasticity of KVSs that follow the shared-nothing architecture.

Chapter 4 presents a set of decentralised data distribution schemes to improve the efficiency of elasticity for shared-nothing KVSs. The proposed schemes consist of an automated partitioning algorithm that splits and merges partitions based on the data volume, an election-based coordination to facilitate the decentralised partitioning, a number of replica placement algorithms for node addition and removal, and a data migration policy that ensures online query processing.

Chapter 5 presents the implementation of the data distribution schemes proposed in Chapter 4 to build ElasCass. The implementation was developed atop Cassandra (Apache, 2009), and consists of three core functionalities: token management, data storage, and replica reallocation. Chapter 5 also presents the experimental evaluation of ElasCass in several aspects, including data partitioning, node addition, and query performance.

Chapter 6 presents ElasticCopyset, a data placement scheme that maintains the minimum sets of nodes for storing the same data, to minimise the probability of data loss in simultaneous node failures. It also provides a proof for the correctness of the shuffle algorithm in ElasticCopyset. Moreover, Chapter 6 evaluates ElasticCopyset in the scenarios of static deployment, linear scaling, and elastic scaling based on dynamic workloads.

Chapter 7 concludes this thesis with an outline of the thesis contributions, and discusses the limitations and future extensions to the research described herein.

Appendix A contains the mathematical proof for each of the six lemmas of ElasticCopyset defined in Chapter 6.

Furthermore, the core chapters are derived from various articles published during the course of the Ph.D. candidature as detailed below:

Chapter 4 and Chapter 5 are partially derived from:

• Li, H. & Venugopal, S. 2013, ‘Efficient node bootstrapping for decentralised shared-nothing key-value stores’, in ACM/IFIP/USENIX International Middleware Conference 2013, Springer, Beijing, China, pp. 348–367.

Chapter 6 is partially derived from:

• Li, H. & Venugopal, S. 2014, ‘ElasticCopyset: An elastic replica placement scheme for high durability’, Tech. Rep. UNSW-CSE-TR-201402, Computer Science and Engineering, UNSW, Sydney, Australia.

Chapter 2

Background

A rising tide lifts all boats.

– John F. Kennedy

This is the first chapter of the literature review, which is divided into two chapters: Chapter 2 and Chapter 3. This chapter reviews the current literature to examine the key concepts in Cloud computing and distributed data stores, two areas that form the contextual background of this thesis. Next, it discusses the model and requirements for deploying a distributed data store on the Cloud, and then concludes.

2.1 Cloud Computing and Elasticity

2.1.1 Cloud Computing

Cloud computing has emerged as an important paradigm for provisioning the utilities of hardware and software resources over the Internet. There have been various definitions for the phrase “Cloud computing”, such as Foster et al. (2008), Buyya et al. (2009), and Armbrust et al. (2010). These definitions describe Cloud computing from the perspectives of the economic model, the system, and the datacenter, respectively. In this thesis, the paradigm of Cloud computing is termed the “Cloud”.

This thesis focuses on the leverage of the Cloud to provision computing resources (such as processing, memory, storage, and networking) for hosting data services, and therefore considers the Cloud as a pool of computing resources. Hence, we follow the definition proposed by Mell & Grance (2011) from NIST (the National Institute of Standards and Technology), quoted as:


Figure 2.1: Relationships between providers and consumers on the Cloud

• “Cloud computing is a model for enabling ubiquitous, convenient, on-demand net- work access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and re- leased with minimal management effort or service provider interaction.”

Figure 2.1 depicts the relationships between providers and consumers on the Cloud. The Cloud providers offer computing resources, such as servers, networks, and storage, to Cloud consumers. These consumers can be either individual end users who require computing resources, or service providers that leverage these provisioned resources as an infrastructure to deploy web applications and services. Therefore, these service providers are both Cloud consumers and providers, who provide applications and services for the end users, i.e., the service consumers.

Characteristics of Cloud Computing

Buyya et al. (2009) characterised Cloud computing as the model to “deliver computing as the 5th utility” (after water, electricity, gas, and telephony). Mell & Grance (2011) have summarised five essential characteristics of Cloud computing, but only three of them are closely related to the notion of utility, listed as follows:

• On-demand provisioning. A Cloud consumer can acquire and dismiss resources on a fine-grained, self-service basis in near real-time, wherein the resources can be either the utility of hardware such as processor, memory, storage and network, or the services offered by software such as system, platform and application. There are well-established APIs that automatically provision resources as needed, without requiring human interaction with Cloud providers (an illustrative provisioning snippet follows this list).

• Resource pooling. The Cloud provider pools the computing resources via virtualisation of different physical infrastructure, and uses the virtualised resources to serve multiple consumers using multi-tenancy. Hence, each individual consumer can assume that the resource pool is sufficiently large, but should also be aware that the competition between multiple tenants may cause a downgrade in services. For example, the problem of performance interference in virtualised environments has been well studied (Koh et al. 2007, Nathuji et al. 2010).

• Metering. The Cloud provider packages the computing resources as a metered service, wherein resource usage is monitored, controlled, and reported by the provider. The consumer is usually billed with the “charge-per-use” or “pay-as-you-go” model. This model converts capital expenditure to operational expenditure, and thus has the advantage of a low or no initial cost to acquire computing resources.
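As an illustration of the self-service APIs mentioned under on-demand provisioning above, the snippet below acquires one virtual machine from Amazon EC2 and later releases it, using the boto3 library. This is an illustrative example only: boto3 is not part of this thesis, and the image identifier shown is a hypothetical placeholder rather than a real machine image.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Acquire one VM on demand (the image id below is a hypothetical placeholder).
    response = ec2.run_instances(
        ImageId="ami-00000000",
        InstanceType="t2.micro",
        MinCount=1,
        MaxCount=1,
    )
    instance_id = response["Instances"][0]["InstanceId"]

    # ... use the instance, e.g. as a KVS node, while demand is high ...

    # Release it when no longer needed; pay-per-use billing then stops.
    ec2.terminate_instances(InstanceIds=[instance_id])

The same pattern applies to any IaaS provider that exposes a programmatic provisioning interface.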

Cloud computing has many other characteristics, such as rapid elasticity, easy maintenance, low startup cost, and broad network access. However, all these characteristics follow from the notion of delivering computing as a utility, so they are advanced features of Cloud computing, rather than its defining characteristics.

Classifications of Cloud Computing

In Cloud computing, the utility of computing is delivered over the Internet to the consumers as a metered service. To categorise the Cloud, the terminology of “X as a Service (XaaS)” has been widely used (Youseff et al. 2008, Zhang et al. 2010, Mell & Grance 2011), wherein the values of X include Infrastructure, Hardware, Platform, Software, and even Database (Curino et al., 2011). This thesis lists three major Cloud service models that have gained wide acceptance (Mell & Grance, 2011):

• Infrastructure as a Service (IaaS) provides the utility of fundamental computing resources, such as processing, storage, and networks, to the Cloud consumers, so that they can run arbitrary software including operating systems and applications. Examples of IaaS providers are Amazon EC2, Rackspace, and GoGrid.

• Platform as a Service (PaaS) provides a platform for application developers with the support of programming languages, libraries, services, and tools, so that developers can develop and deploy customised software solutions that can be hosted by the Cloud platform. Google AppEngine, Microsoft Azure, and Force.com are examples of PaaS providers.

• Software as a Service (SaaS) provides the utility of application software and databases that are running on a Cloud infrastructure. The consumers access the services via various client interfaces (e.g. a web browser), and do not manage or control the underlying infrastructure and software. Examples of SaaS providers are Salesforce.com, Oracle’s on demand CRM software, and Cumulux.

This thesis has focused on the leverage of the IaaS Cloud, which allows full control of the software (e.g. a system that provides data services) running on the raw computing resources. Armbrust et al. (2010) also emphasised the utility of Cloud infrastructure, which is further divided into three categories, namely computation, storage, and networking, elaborated as follows:

• Computation is typically provisioned in the form of a virtual machine (VM), which is a software-based emulation of a computer that executes programs like a physical machine (Popek & Goldberg, 1974). From a Cloud consumer's view, each VM consists of a static but configurable setup of resources, including processor(s), memory, and “local” storage that is durable only within the VM's lifetime. From a Cloud provider's perspective, each VM is a process (i.e. guest OS) running atop a hypervisor, which uses virtualisation to subdivide and isolate the computing resources from one or several physical machines.

• Storage is formulated as a virtualised pool of distributed storage space that is generally hosted in data centres. The storage provider guarantees that the storage is highly fault tolerant and durable, by replicating and distributing the data across many physical storage devices. Similar to computation, Cloud providers virtualise the physical storage and expose the storage capacity based on the consumers' requirements. The data is stored and retrieved over the network, and can be in various forms depending on the type of storage system running in the backend. For example, Amazon S3 (Amazon, 2006a) stores data objects in the form of binary files (a small storage example follows this list), while Amazon EBS (Amazon, 2007) provides raw block devices that can be attached to VMs and accessed via the network.

• Networking resources include network bandwidth and traffic that are consumed for the communication between computing resources and the data transfer between storage. Cloud providers use VLANs and network switches to provide isolated virtual networks. In terms of pricing, Cloud consumers are usually charged by the bytes used, with the rate varying based on the QoS guarantee of the virtual network.
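The small example referred to in the Storage item above is sketched here: it stores and retrieves one binary object through Amazon S3's object interface using the boto3 library. Again, boto3, the bucket name, and the object key are illustrative assumptions rather than artefacts of this thesis.

    import boto3

    s3 = boto3.client("s3")

    # Store a data object as an opaque binary value under a key
    # (the bucket and key names are placeholders).
    s3.put_object(Bucket="example-bucket", Key="objects/item-001", Body=b"raw bytes")

    # Retrieve the same object over the network.
    response = s3.get_object(Bucket="example-bucket", Key="objects/item-001")
    data = response["Body"].read()

Block storage such as Amazon EBS is consumed differently: a volume is attached to a VM and then used like a local disk, so it is not shown here.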

Furthermore, the Cloud can be categorised based on the scope of Cloud consumers (Mell & Grance, 2011). A Cloud is called a public Cloud if it is made available for open use by the general public. Conversely, a private Cloud is provisioned for exclusive use by a single organisation, while a community Cloud can be shared by a specific community with common concerns. Moreover, there are Cloud consumers who set up a composition of two or more Clouds (private, community, or public) to form a hybrid Cloud. Note that this categorisation is made regardless of whether the hardware of the Cloud is hosted internally (i.e. on-premise) or externally (i.e. off-premise, by a third-party organisation).

2.1.2 Elasticity: A Key Feature of Cloud

Cloud computing introduces the possibility of on-demand provisioning and de-provisioning of computing resources. It brings in new perspectives about utilising the hardware (i.e. computation, storage, and networking). Armbrust et al. (2010) have listed three aspects that are important to the technical and economic changes. First, “the illusion of infinite computing resources available on demand”. Second, “the elimination of an up-front commitment by Cloud users”. Third, “the ability to pay for use of computing resources on a short-term basis as needed, and release them as needed”.

According to these aspects, it becomes possible for systems and applications on the Cloud to add and remove resources commensurately with demand, at a fine grain and in a timely manner. Mell & Grance (2011) defined this feature as rapid elasticity, listed as one of the essential characteristics of Cloud computing. Herbst et al. (2013) proposed a definition of elasticity, quoted as:

• “Elasticity is the degree to which a system is able to adapt to workload changes by provisioning and de-provisioning resources in an autonomic manner, such that at each point in time the available resources match the current demand as closely as possible.”

Figure 2.2: Various patterns of workload demand: (a) a diurnal pattern, (b) a seasonal pattern, (c) unexpected bursts, and (d) an increasing trend.

The need for elasticity is motivated by variations in workload. There are many factors that cause the demand to change over time. Figure 2.2 depicts a few demand patterns. Firstly, due to orderly human activities, the workloads of web services usually exhibit diurnal and weekly patterns (Atikoglu et al., 2012), as shown in Figure 2.2a. Secondly, annual holidays lead to seasonal or other periodic demand variation in web services. For example, train ticket booking sites peak before holidays, while photo sharing sites peak after holidays; and e-commerce peaks in December (Figure 2.2b). Thirdly, unpredictable events (e.g. news events) cause some unexpected demand bursts (Figure 2.2c). Lastly, the success of a web service results in an increase in its workload and thus the demand (Figure 2.2d). Figure 2.2 also demonstrates that, when the computing capacity is static, the service providers, i.e. computing-resource consumers, have to choose between over-provisioning, which causes waste of resources (marked as yellow), and under-provisioning, which hurts the service performance (marked as red).

Figure 2.3 depicts the ideal scenario, where an elastic system adapts its capacity to the changes in workload demand. The benefit of having an elastic system is two-fold. On one hand, when the workload demand increases, the system is able to improve its capacity to deal with the extra demand by adding more computing resources (i.e. scale out). This ability results in satisfactory service performance. On the other hand, when the demand declines, the system is able to remove the redundant computing resources that are “charged-per-use” (i.e. scale in). The reward is a reduction in wasted resources and in the economic cost of using Cloud resources.

Figure 2.3: Provisioning elastic capacity for dynamic workload demand

Up to this point, this section has introduced the paradigm of Cloud computing and discussed in general terms the motivation for elasticity, a key feature of the Cloud. The next section will investigate the domain of distributed data stores and their applicability to the Cloud.

2.2 Distributed Data Stores

A distributed data store is a computer network where information is stored, usually in a replicated fashion, on more than one node. This section reviews a variety of distributed data stores in literature.

2.2.1 Distributed Database Systems

Although the term data store can refer to any data repository, this thesis reserves the term databases for more traditional relational database systems (RDBMS), such as those described by Ceri & Pelagatti (1984) and Sheth & Larson (1990). Tamer Özsu & Valduriez (2011) provide a thorough survey of the design space, principles, and properties of such databases. These systems are robust in distributing data and query processing over a set of database servers while providing the semantics of centralised systems.

Databases guarantee the reliability of distributed transactions by meeting the ACID semantics (Gray et al., 1981). To elaborate, ACID consists of four properties: Atomicity requires that each transaction is either completed or totally aborted (i.e. all or nothing); Consistency ensures that any transaction will bring the database from one valid state to another; Isolation ensures that concurrent transactions result in a state that would be achieved if the transactions were executed one after another (i.e. in a serialised manner); Durability means that a transaction that has been committed will remain so, even in the event of software or hardware failures. ACID semantics are well suited for commercial transactions in the classical enterprise infrastructure. However, ACID makes no guarantees regarding availability. This issue is codified by Eric Brewer’s “CAP Conjecture” (Gilbert & Lynch, 2002), which states that a distributed data store can simultaneously provide only two out of three of the following properties:

• Consistency (C) means that all nodes see identical replicas of the data at the same time;

• Availability (A) guarantees that every request receives a response, regardless of whether the operation succeeded or failed;

• Partition tolerance (P) means that the distributed system continues to operate despite arbitrary failure of part of the system.

Note that in the face of a network partition, a distributed system can retry communication to achieve consistency, using coordination techniques such as Paxos (Lamport, 2001) or the two-phase commit protocol (2PC) (Bernstein et al., 1987). These techniques rely heavily on retrying communication and blocking other transactions, which introduces a delay before the data store makes the fundamental decision: either cancel the operation and thus decrease user satisfaction, or proceed with the operation and thus risk data inconsistency. However, availability decreases as the period of such delay is prolonged. The CAP conjecture points out that retrying communication indefinitely is, in essence, choosing consistency over availability. Brewer notes that, to withstand the event of network partition, distributed databases always choose consistency and sacrifice availability. That is, database systems typically use distributed lock mechanisms to guarantee data consistency, which increases the response time of queries. Therefore, availability is somewhat limited in distributed databases. Moreover, distributed databases are less flexible in terms of scalability. These systems store tables of data objects that are formally described and organised according to the relational model, introduced by Codd (1970). The relational model provides rich functionality for querying data objects that have links (i.e. relationships) between tables, wherein the four basic relational operators are based on the traditional mathematical set operations, including union, intersection, difference (i.e. except or minus), and cartesian product (i.e. cross join). As can be seen, it is not efficient to execute these relational operations in a distributed manner, because doing so requires a large number of comparisons and merges of data objects across the nodes. Worse still, when the nodes change (i.e. adding or removing nodes), it is a major engineering challenge to reorganise the data tables (i.e. via database normalisation) while minimising the need for such set operations. Therefore, in distributed databases, scalability is achieved with considerable administration costs.

Hence, the success of databases is limited in Internet-scale infrastructure, due to their limitations in availability and scalability. In particular, databases are often considered to be less “friendly” to the Cloud environment (Agrawal et al., 2011), because databases are typically intended for an enterprise infrastructure that is statically provisioned, and there is a lack of adequate tools and guidance for databases to scale in and scale out in response to fluctuations in workloads.

Furthermore, web applications and services began to serve real-time operations and to consume large amounts of data. The primary value of a data store to such users is not necessarily strong consistency or rich functionality, but rather high availability of data and high query performance. As a result, a body of research arose that tackled the problem of providing distributed data stores with high availability and scalability over Internet-scale infrastructure. The next subsection describes the evolution of these distributed data stores.

2.2.2 Evolution of Distributed Data Stores

Three representative distributed data stores are reviewed below. These systems are chosen because of their wide acceptance in the Internet-scale distributed environments.

Peer-to-Peer Network

Peer-to-peer (P2P) network (Oram, 2001) is a distributed network architecture, in which interconnected nodes (called peers) share resources amongst each other without the use of a centralised administrative component. It builds on the notion that equal peer nodes can function simultaneously as “clients” and “servers” to the other nodes on the network. The benefits of this architecture are: to improve scalability and reliability by removing the centralised authority, and to ensure redundancy and anonymity so as to tolerate network partitions.

P2P networks have been used for applications in various domains. The most commonly known is content distribution (Ding et al., 2005), covering projects such as Napster (1999), Gnutella (2000), FastTrack (2001), Kazaa (2001), and Freenet (Clarke et al., 2001). Other P2P applications include instant messaging applications (e.g., ICQ, Yahoo, and MSN) and computing resource sharing, such as Seti@Home (Sullivan III et al. 1997, Anderson et al. 2002) and Computer Power Market (Buyya & Vazhkudai, 2001). More recently, there are also P2P-based digital currencies such as Bitcoin (Nakamoto, 2008).

Detailed taxonomies and surveys on P2P networks have been published previously (Milojicic et al. 2002, Androutsellis-Theotokis & Spinellis 2004). These focus mostly on content and file sharing networks that popularised the P2P technology. The focus of such networks is on designing efficient strategies to locate particular content amongst the peers, and to transfer content reliably even in the face of high volatility. Early-stage P2P systems are mostly unstructured, in which the placement of content (e.g. files) is completely unrelated to the overlay topology, that is, the network topology of how the nodes are overlaid on the Internet. As a result, these systems suffer severe scalability and performance problems, because there is no mechanism to balance the workload and data distribution as nodes join or leave. This issue motivated the development of structured P2P networks, described in the following subsection.

Distributed Hash Tables

Distributed Hash Tables (DHT) are structured P2P networks (Lua et al., 2005), wherein the overlay topology of nodes is tightly controlled and data objects (or pointers to them) are placed at precisely specified locations, which makes query routing more efficient.

In DHT-based systems, data objects (i.e., values) are assigned unique identifiers called keys, which are mapped by the overlay network protocol to a unique peer in the network. Each peer has a unique NodeID, and maintains a small routing table consisting of its neighbouring peers’ NodeIDs and IP addresses. In this way, the responsibility for maintaining the mapping from keys to data objects is distributed among the nodes, which ensures a minimal amount of disruption when node changes occur. Hence, DHT-based systems are capable of scaling to extremely large numbers of nodes and dealing with arbitrary node arrivals and departures.

There are various DHTs that use different schemes for data objects, key space, and routing strategies. The four best-known DHT systems are: Content Addressable Network (CAN) (Ratnasamy et al., 2001), which is architected as a virtual multi-dimensional Cartesian coordinate space on a multi-torus; Chord (Stoica et al., 2001), which leverages consistent hashing (Karger et al., 1997) to form a uni-directional and circular key space; and Tapestry (Zhao et al., 2001, 2004) and Pastry (Rowstron & Druschel, 2001), which both use variants of the Plaxton mesh network (Plaxton et al., 1999) for load distribution and routing locality. Other popular DHTs include Kademlia (Maymounkov & Mazieres, 2002), Viceroy (Malkhi et al., 2002), Koorde (Kaashoek & Karger, 2003) and BitTorrent (Cohen, 2003).

In theory, given N peers in the system, all of these DHTs guarantee that any data object can be located in O(log N) overlay hops on average. Moreover, these systems minimise the disruption caused by node joining or leaving, by confining each node to coordinate with only O(log N) other nodes.
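To make the key-to-node mapping concrete, the following is a minimal Python sketch of a Chord-style consistent-hashing ring: each node and each key is hashed onto the same circular space, and a key is owned by the first node clockwise from its position. The node names and hash function are illustrative assumptions, and the sketch resolves lookups with a full view of the ring rather than the O(log N) hop-by-hop routing described above.

```python
import hashlib
from bisect import bisect_right

def on_ring(label: str) -> int:
    """Hash an arbitrary string onto a circular 128-bit key space."""
    return int(hashlib.md5(label.encode()).hexdigest(), 16)

class HashRing:
    """Chord-style mapping of keys to nodes via consistent hashing."""

    def __init__(self, node_ids):
        # Each node is placed on the ring at the hash of its NodeID.
        self.ring = sorted((on_ring(n), n) for n in node_ids)

    def lookup(self, key: str) -> str:
        # The owner is the first node clockwise from the key's position,
        # wrapping around at the end of the ring.
        positions = [pos for pos, _ in self.ring]
        idx = bisect_right(positions, on_ring(key)) % len(self.ring)
        return self.ring[idx][1]

if __name__ == "__main__":
    before = HashRing(["node-a", "node-b", "node-c"])
    after = HashRing(["node-a", "node-b", "node-c", "node-d"])
    keys = [f"user:{i}" for i in range(1000)]
    moved = sum(before.lookup(k) != after.lookup(k) for k in keys)
    # Only the keys falling between node-d and its predecessor change owner.
    print(f"{moved} of {len(keys)} keys move after node-d joins")
```

Under this mapping, only a fraction of the keys (roughly 1/N on average) change owner when a node joins or leaves, which is the property that keeps node arrivals and departures cheap.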

One major limitation of DHTs is that they only support exact-match lookups: one needs to know the exact key of a data object to locate the nodes that store it. This has led to research on building a distributed query engine on top of a DHT. SkipNet (Harvey et al., 2003) supports logarithmic-time range-based lookups and guarantees path locality, by organising peers and data objects according to their lexicographic addresses in the form of a probabilistic skip list. PIER (Huebsch et al., 2003) runs relational queries across thousands of machines, with a relaxed consistency called “dilated-reachable snapshot”. Gupta et al. (2003a) used locality-sensitive hashing to find data ranges for approximate answers to complex queries. Mercury (Bharambe et al., 2004) is a scalable protocol that supports multi-attribute queries and explicit load balancing.

Nonetheless, the focus of P2P systems (including DHTs) is on providing a lookup service that efficiently retrieves the value associated with a given key. They do not explicitly deal with data consistency, and therefore do not settle the trade-offs between consistency, availability, and network partition tolerance discussed in Brewer’s CAP Conjecture (Gilbert & Lynch, 2002). This leads to the introduction of structured storage.

Structured Storage

With the need for scaling real-time web applications and the increase in the amount of available data, there arose a demand for data stores that are intended primarily for simple retrieval and appending operations, while possessing the properties of high horizontal scaling, high availability, and high performance in serving queries. Systems with such characteristics are termed structured storage; they provide various mechanisms for storage and retrieval of data that is structured in means other than the relational model (Codd, 1970) used in databases. Structured storage typically processes queries under the BASE semantics (Fox et al., 1997), which are weaker than the ACID semantics (Gray et al., 1981) of databases. In the parlance of Eric Brewer’s CAP Conjecture (Gilbert & Lynch, 2002), systems following ACID semantics always choose consistency over availability. In contrast, the BASE semantics trade consistency for availability and rely on soft state for robustness in failure management. The three properties of BASE are described as follows:

• Basically Available (BA) means that the service as a whole must be available during runtime, despite transient partial hardware or software failures;

• Soft state (S) means that stale data can be temporarily tolerated, and can be regenerated at the expense of additional computation or disk I/O;

• Eventual consistency (E) guarantees that if no new updates are made, all copies of the data will eventually reach consistency after a short time.

The essence of BASE is to allow soft state, i.e., the temporary existence of stale data. Compared to ACID, there are two advantages of BASE. First, it improves performance by avoiding transaction commits that require blocking, which affects system availability. Second, it increases robustness in the event of network partitions, because it allows communication and disk I/O to be postponed until a more convenient time, rather than requiring durable and consistent state across partial failures as in the ACID semantics. Moreover, eventual consistency “guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value” (Vogels, 2009). This weak consistency model makes distributed data stores tolerant of network partitions, in the Internet-scale environment where “component failures are the norm rather than the exception” (Ghemawat et al., 2003). Moreover, structured storage allows SQL-like query languages to be used, and therefore is also referred to as “NoSQL”, which is currently read as “Not only SQL”.

Structured storage systems are diverse in data model and system architecture, and are designed to meet the demands of different applications. There have been various approaches to categorising such data stores. This subsection presents a classification of structured storage based on the modelling of data. The most popular data models (DB-Engines, 2013) are key-value based, document-oriented, and graph-like, described as follows:

• Key-Value Store (KVS) enables applications to store schema-less data in the form of a key-value pair, usually consisting of a string representing the key, and the actual data considered to be the value. This key-value mapping can be recursive, such that the value of a data object can itself be a collection of key-value pairs. As this data model closely resembles DHTs’ key-value mapping, KVSs can inherit the network architectures developed in DHT systems. Hence, KVSs also possess high performance and scalability in storing and retrieving data indexed by unique identifiers. Examples of KVSs include Memcached (2004), Google’s Bigtable (2006), Amazon’s Dynamo (2007), Apache Cassandra (2009), and Project Voldemort (2009).

• Document Store is also known as a document-oriented database. Document stores are designed for storing, retrieving, and managing semi-structured data (Abiteboul et al., 2000) in the form of a Document, which encapsulates and encodes data in standard formats such as XML, YAML, JSON, and BSON. Documents are addressed via a unique key that is often a simple string, a URI, or a path. The advantage of document stores is that they offer APIs or query languages that facilitate the retrieval and integration of documents based on their content. Current-state document stores are Apache CouchDB (2008a), MongoDB (2010), and Couchbase (2012).

• Graph Database builds on graph theory (Bondy & Murty, 1976) to store and provide efficient queries on data sets in the associative model (Williams, 2000), in which everything is modelled as an entity and associations between the entities. Graph databases employ graph structures with nodes, properties, and edges to represent data. Entities are represented as nodes, while the pertinent information related to nodes is stored as properties. The associations from nodes to nodes or from nodes to properties are represented as edges. Hence, graph databases are a powerful tool for graph-like queries. Popular graph databases are Neo4j (Neo Technology, 2010) and InfiniteGraph (Objectivity Inc., 2010).

There are other structured storage systems. Triple stores (a.k.a. RDF stores) (Wilkinson et al., 2003) manage triples in the domain of the semantic web (Berners-Lee et al., 2001), wherein a triple is a composition of subject-predicate-object and is usually expressed using the Resource Description Framework (RDF) (Klyne et al., 2004). Object databases, such as Db4o (Versant corporation, 2008) and ObjectStore (Lamb et al., 1991), represent information in the form of objects as used in object-oriented programming.

Among the variety of structured storage, Key-Value Stores (KVSs) have evolved as the most influential data store for general-purpose distributed storage, because of their simplicity in terms of data model and operation, optimised performance in query processing, and high availability and scalability. Current-state KVSs will be reviewed in the next chapter.

Comparisons of Various Data Stores

This section has introduced four categories of distributed data stores. RDBMS have been a predominant choice for the storage of relational data and the support of transactional queries. P2P networks were developed to allow individual Internet end users to share files amongst each other. Distributed Hash Tables (DHTs), as structured P2P networks, support more complex services with better routing performance than traditional P2P networks. Structured storage systems have become a standard reference for serving data for web applications that require high performance data retrieval.

Table 2.1 summarises these data stores. Aside from the purpose and examples, there are five properties chosen to differentiate these data stores. From the perspective of infrastructure, P2P and DHT systems focus on key lookups based on the overlay of nodes in a network, while RDBMS and structured storage are typically deployed in a cluster of interconnected commodity (or virtual) machines to serve more sophisticated queries. In terms of architecture, DHTs follow a structured overlay of nodes to provide key-based routing that is more efficient than that of P2P networks, which use unstructured overlay.

Moreover, RDBMS and structured storage differ in their data model and consistency model. That is, RDBMS store data in the relational model (Codd, 1970), and support transactional queries by meeting the ACID semantics (Gray et al., 1981). In contrast, structured storage avoids the use of the relational model to simplify the partitioning of data, and serves simple retrieval and appending operations by following the BASE semantics (Fox et al., 1997).

Property | RDBMS | Peer-to-Peer Network | Distributed Hash Tables | Structured Storage
Purpose | Store relational data, support transactional queries | Share files across the Internet | Support more complex services with structured networks | Serve big data and real-time web services and apps
Infrastructure | Interconnected machines on traditional enterprise infrastructures | Enormous numbers of end devices across the Internet | Interconnected end devices with identifiers | Interconnected machines across a wide area network
Architecture | Multiple masters or master-slave | Unstructured P2P network | Structured P2P overlay network | Either centralised or decentralised
Data Model | Relational model | Arbitrary files | Key-value pairs | Key-value, document, graph, object, etc.
Consistency Model | ACID semantics | Weak | Weak | BASE semantics
Data Discovery | B-tree indexing | Flooded requests or document routing | Key lookup within O(log n) hops | Key mapping in 0-hop routing
Examples | Oracle DB, MySQL, IBM DB2, and MS SQL Server | Napster, Gnutella, Kazaa, and Freenet | Chord, CAN, Tapestry, and Pastry | Bigtable, Dynamo, MongoDB, and Neo4j

Table 2.1: Comparison between various distributed data stores

In addition, these four types of data stores also vary in the way they discover data. RDBMS use many variants of the binary tree to build indexes for the keys of the data, while DHTs and structured storage rely on hash functions to maintain the mapping between data and nodes. By comparison, unstructured P2P networks adopt a flooding query model, which is the least efficient of the four.

2.2.3 Distributed Data Stores on the Cloud

Subsection 2.1.1 presents a number of classifications of the Cloud (page 15). Based on these classifications, Figure 2.4 presents a typical model for leveraging the IaaS Cloud to host a distributed data store (e.g. a KVS).

Figure 2.4: A typical model to deploy a distributed data store on the IaaS Cloud (the IaaS provider provisions computation as VMs, each hosting one KVS node; storage as block devices attached to the VMs; and the networking that connects them)

As can be seen, the provider of the IaaS Cloud, either public or private, provisions the computational capacity in the form of multiple VMs, each hosting one node (or one instance) of the distributed data store. Since VMs normally do not provide durable disks, the persistence of data is guaranteed by the storage provisioned from the Cloud. Although the storage space can take various forms, the widely accepted form of storage is the raw block device (e.g. Amazon EBS), because it closely resembles the hard disk of a commodity machine, and can be attached to the VMs at runtime, serving like a local disk. In addition, networking is usually provisioned along with the VMs, and serves as the means of data delivery across the VMs.

This model provides data services using a stateless computation tier along with a stateful storage tier. It is widely accepted in the industry (Cockroft, 2011), and is also adopted in this thesis due to two considerations. First, each VM, when attached to a storage device, closely resembles a commodity machine, so that systems running in traditional data centres can be seamlessly migrated to the Cloud. Second, decoupling the resources of computation, storage, and networking allows us to identify new challenges that are not present in computing environments composed of commodity machines.

Requirements for Elastic Data Stores

In the early stage of Cloud computing, the elasticity feature was exploited by stateless applications and services, and is termed “auto-scaling”. For example, AppEngine (Google, 2008) provides an automatic scaling feature for web applications, by enforcing an application structure with a clean separation between a stateless computation tier and a stateful storage tier. Stateless applications and services are the ideal model for scaling, because adding and removing computation nodes (which are decoupled from storage space) is sufficient to track workload variation.

In contrast, it is more challenging to achieve elasticity in a data store. Even though computing resources can be provisioned on demand from the Cloud, one should be aware that adding or removing a certain amount of computing resources does not necessarily increase or decrease the capacity of a data store. The reason is that each storage node has to acquire control of its own data before it can serve queries, but handing over a large amount of data to a new, empty node requires a significant amount of work, and more importantly, carefully designed schemes to redistribute the data across the nodes.

To enable elasticity, there are two requirements for data stores (and many other systems) to convert computing resources into system capacity (Herbst et al., 2013):

• Precision is the deviation between the system capacity and the current workload demand (illustrated in the sketch after this list). Since computing resources can be provisioned (and charged) at a fine-grained level, the challenge of precision lies in resource management strategies that control the provisioning of resources based on the change of workloads.

• Speed is the time taken from the point where resource provisioning is needed to the point where the system’s capacity has matched the demand. Cloud providers can allocate new VMs quickly, within several minutes (Li et al., 2010), so the challenge of speed for a data store is how quickly it can incorporate and utilise the VMs for serving workload demand. The speed of scaling a data store is determined by the data management schemes that partition, replicate (or place), and migrate the data over the cluster of storage nodes.
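As a rough illustration of the precision requirement, the following Python sketch measures how far a capacity time series deviates from a demand time series, split into over-provisioning (unused capacity) and under-provisioning (unmet demand). It is a minimal sketch in the spirit of the elasticity definition above, not the exact metric of Herbst et al. (2013); the capacity and demand series are made-up examples.

```python
def provisioning_deviation(capacity, demand):
    """Average over- and under-provisioning per sampling interval,
    given capacity and demand sampled at the same instants."""
    assert len(capacity) == len(demand)
    over = sum(max(c - d, 0) for c, d in zip(capacity, demand))
    under = sum(max(d - c, 0) for c, d in zip(capacity, demand))
    n = len(capacity)
    return over / n, under / n

# A diurnal demand curve served by a static capacity of 8 units, versus an
# elastic capacity that tracks the demand.
demand  = [3, 5, 9, 12, 10, 6, 4, 3]
static  = [8] * len(demand)
elastic = [4, 6, 10, 12, 11, 7, 5, 4]

print(provisioning_deviation(static, demand))   # wasted capacity and unmet demand
print(provisioning_deviation(elastic, demand))  # both deviations are much smaller
```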

Accordingly, the requirements of precision and speed in terms of scaling a data store, lead to two research problems: resource management, and data management.

The focus of resource management for elastic data stores is on designing an automated controller, which continuously monitors the system and automatically determines when and how much to provision (or de-provision) resources. There has been prior research on resource management. Lim et al. (2010) identified the constraint of actuator delays in the controller, which is caused by data rebalancing (i.e. the redistribution of data across nodes), and proposed a cost-optimisation-based approach to determine the amount of bandwidth to use for rebalancing data, trading the impact on the guest service against the time to complete data rebalancing. The SCADS Director (Trushkowsky et al., 2011) is a control framework for Key-Value Stores (KVSs). It uses a machine learning model (i.e. Model-Predictive Control) to predict resource requirements based on workload statistics, and to move and replicate data as needed. SCADS also organises data as small units for a finer granularity of load-balancing and workload statistics. EStoreSim (Moulavi et al., 2012) is a simulation framework that allows developers to simulate an elastic KVS on the Cloud, and to experiment with different controllers and workloads.

These publications investigated meeting service level objectives (SLOs) and reducing costs by adding or removing computing resources. However, none of them discuss how the data can be distributed when the number of nodes (that store the data) changes, because the schemes of data management vary drastically between different categories of data stores. The next chapter will focus on KVSs, the most influential data stores, and discuss the approaches towards elasticity in current-state KVSs.

2.3 Chapter Summary

This chapter has revealed Cloud computing as an important paradigm for delivering the utilities of computing resources over the Internet. It has reviewed the three Cloud service models that gained wide acceptance (Mell & Grance, 2011), namely Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS), and identified IaaS as the primary model for hosting data stores. Next, it presented the definition and motivation of a key feature of the Cloud, i.e. elasticity, which requires a system to adapt its capacity to workload changes for better performance and resource utilisation.

Moreover, this chapter has reviewed four categories of distributed data stores. It started by discussing traditional databases (i.e. RDBMS) and their limitations in scalability and availability due to the constraints of the ACID semantics (Gray et al., 1981) and the relational model (Codd, 1970). In light of Eric Brewer’s “CAP Conjecture” (Gilbert & Lynch, 2002), and with the need for scaling real-time web applications and serving an increasing amount of data, there arose many distributed data stores that loosen the ACID constraint and focus on scaling over the Internet. Three categories of distributed data stores have evolved since then, and structured storage is the current-state data store that focuses on high horizontal scaling, high availability, and high performance in serving queries. Amongst the variety of structured storage, Key-Value Stores (KVSs) have risen to be the state-of-the-art systems for storage and retrieval of data for general purposes. Given the background of Cloud computing, elasticity, and distributed data stores, this chapter presented the model for hosting a data store (e.g. a KVS) on the IaaS Cloud. It also discussed the requirements for achieving elasticity in a data store, which lead to the challenges in data management for dynamic node addition and removal. The literature review continues in the next chapter, which will examine this question in the context of KVSs.

Chapter 3

State of the Art

If I have seen further it is by standing on the shoulders of giants.

– Isaac Newton

This chapter is the second and final chapter of the literature review. The previous chapter has concluded that Key-Value Stores (KVSs) are the state-of-the-art data stores for general purposes, and has also discussed the model and motivation for designing an elastic data store using the IaaS Cloud. The purpose of this chapter is to review existing approaches to building an elastic KVS that is able to adapt its capacity to the workload changes by efficiently adding or removing nodes. This chapter begins with a thorough survey of the design choices in KVSs. Then, it discusses the existing schemes of data management for the sake of building an elastic KVS. Finally, this chapter identifies the research gaps and then concludes.

3.1 Key-Value Stores (KVSs)

KVSs have become “a high-performance alternative to relational database systems with respect to storing and accessing data” (Seeger, 2009). Ever since the advent of the first generation KVSs, such as Memcached (Fitzpatrick, 2004), Bigtable (Chang et al., 2006), Dynamo (DeCandia et al., 2007), and PNUTS (Cooper et al., 2008), KVSs have been evolving in both quality and quantity.

As categorised in Figure 3.1, these KVSs differ in several fundamental aspects of their designs, including system architecture, data consistency model, system quality attributes, and query performance. This section presents a thorough survey of current-state KVSs, by discussing the choices made in the design of KVSs. It also introduces a number of state-of-the-art KVSs that have had significant influence on the development of KVSs.

Figure 3.1: Categorisation of the design choices in Key-Value Stores — system architecture (storage model, coordination model), consistency model (full ACID support, eventual consistency, conditional ACID), quality attributes (scalability, availability, data durability), and query performance (data model, data access, query type, benchmarking)

3.1.1 System Architecture

The design choices in system architecture separate KVSs into different categories. The storage model determines the performance of, and the approaches to, data management strategies, while the coordination model establishes how the nodes of a KVS collaborate with each other. KVSs exhibit various characteristics due to the choice of different storage and coordination models.

Storage Model

Figure 3.2 provides a classification of storage models based on the storage device on which the data is mainly hosted. The storage device is used as the classification criterion, because the performance of data access differs drastically between devices.

Figure 3.2: A classification of Key-Value Stores based on storage model (on-disk storage is further divided into shared-storage and shared-nothing; the other classes are in-memory and on-flash)

The most traditional device is the hard disk, on which there are two storage models:

• Shared-storage treats the whole data set as one single large data image, which is stored in a separate location, such as network-attached storage (NAS) (Gibson & Van Meter, 2000) or a distributed file system (DFS) (Ghemawat et al. 2003, Shvachko et al. 2010), that can be accessed by all the nodes of the KVS. Hence, when a node is added to (or removed from) the KVS, only the metadata (e.g. indexes and file pointers) needs to be migrated across nodes, while the actual data remains unchanged in the common storage space. Examples of shared-storage KVSs are Bigtable (Chang et al., 2006), HBase (Apache, 2010), and Hypertable (Hypertable, 2009).

• Shared-nothing requires the partitioning of data into independent sets that are physically located on different nodes. Each node is self-sufficient in maintaining a portion of the key space and the related data on its own storage, without sharing memory or disk storage with any other peer nodes. Hence, data migration is inevitable when a node is added or removed. Dynamo (DeCandia et al., 2007), PNUTS (Cooper et al., 2008), Cassandra (Apache, 2009), and Project Voldemort (2009) are examples of shared-nothing KVSs.

Aside from hard disks, many KVSs use RAM or flash memory as the storage medium; these are termed in-memory and on-flash KVSs, respectively.

In-memory KVSs keep the entire data set exclusively in RAM devices in a distributed fashion. The in-memory nature allows these systems to deliver extremely high performance, beyond the limitation of slow I/O operations on hard disks. Since RAM devices are volatile, there are two classes of in-memory KVSs designed to meet different functionality requirements. One is used as a memory caching system in front of a database, to improve performance and offload the backend system; an example is Memcached (Danga Interactive, 2004). The other class is a stand-alone KVS that requires external storage to provide data durability, like Redis (2009). However, due to the high cost per bit of RAM, in-memory cache and storage are still an expensive approach to serving billions of key-value objects. In-memory KVSs are either constrained by the infrastructure to store a limited amount of data, or require a costly expenditure on RAM devices.

With the increasing demand for high-performance data-intensive applications, recent trends have been to design KVSs using a combination of RAM and flash memory, a persistent storage medium in the form of solid state disks (SSDs). Flash memory stands in the middle between RAM and hard disk in terms of cost and performance. It is 10x cheaper than RAM, while 20x more expensive than disk. According to Leventhal (2008), flash provides access times that are 100-1000 times lower than disk while 100 times higher than DRAM. There are two types of flash devices, namely NOR and NAND flash. The latter allows a denser layout and greater storage capacity per chip, and is therefore widely used for general storage.

On-flash KVSs store key-value pairs fully on flash memory, while maintaining a small amount of metadata per key in RAM to support faster data lookups. These KVSs focus on reducing the size of the metadata per key stored in memory, as well as the number of accesses to flash memory required per key-value pair. For example, SkimpyStash (Debnath et al., 2011) reduces the memory footprint to about 1 byte per key, compared to 6 bytes per key in FAWN (Andersen et al., 2009) and ChunkStash (Debnath et al., 2010a). Moreover, SILT (Lim et al., 2011) achieves 1.01 flash reads per key-value lookup with an even smaller memory footprint, in contrast to SkimpyStash, which requires five flash reads per lookup. Although there are KVSs that use flash memory as a secondary cache between RAM and disk, e.g. FlashStore (Debnath et al., 2010b), the overall trend is to replace the hard disk with flash memory as a general-purpose storage medium in the future. In this sense, the storage models for hard disk (i.e. shared-storage and shared-nothing) can be inherited by on-flash KVSs.

Table 3.1 summarises the storage models described. On-disk KVSs are the most widely used because of the low cost of hard disks, but they also exhibit the slowest data access of the three classes. In-memory KVSs go to the extreme of hosting the entire data set in memory for outstanding performance, but pay the price of high storage cost and the need for extra support for data durability. By comparison, on-flash KVSs stand between on-disk and in-memory. The decreasing price of flash memory makes it a good replacement for the hard disk drive in the near future.

Property | On-disk | In-memory | On-flash
Storage Device | Hard disk drive | DRAM / RAM | SSD, typically NAND-based flash memory
Cost | Lowest cost | Highest cost | Medium cost
Performance | Slow in both read and write | Best in both read and write | Better in read than write, bad at small writes
Strength | Inexpensive, vast storage space | Extremely high query performance | Good balance between cost and performance
Weakness | Disk I/O is expensive | High cost per bit, extra concern on durability | Cost is higher than disk, but decreasing
Examples | Bigtable, Dynamo, Cassandra | Memcached, Redis | SkimpyStash, SILT

Table 3.1: Summary of various storage models of Key-Value Stores

Coordination Model

KVSs are distributed systems, consisting of a large number of interconnected nodes. The relationships between nodes and the roles each node plays are diverse in different models. There are two common coordination models: master-slave (i.e. centralised), and peer-to-peer (i.e. decentralised).

In the master-slave model, a master is elected from a group of eligible nodes to facilitate metadata management and decision making, while the other nodes act in the role of slaves. For example, Bigtable (Chang et al., 2006) and PNUTS (Cooper et al., 2008) use one master (or controller) per region to support geographically distributed data stores. The use of a centralised master in each region makes it easier to implement sophisticated data placement and replication policies, as the master possesses and controls most of the relevant information. The key design issue in this model is the election of a master, which is usually based on distributed consensus protocols (Lamport, 1998). For example, Bigtable relies on a distributed lock service called Chubby (Burrows, 2006), in which the master must obtain votes from a majority of the replicas (i.e. slaves), each promising that they will not elect a different master for a time interval. This promise is called the master lease, which is periodically renewed by the replicas.

The peer-to-peer model is typically based on decentralised DHT networks. Examples are Dynamo (DeCandia et al., 2007), Cassandra (Apache, 2009) and Project Voldemort (2009). These systems are able to scale to extremely large numbers of nodes, and more importantly, are capable of handling continual node arrivals, departures, and failures without any dedicated component. The key design issue in decentralised KVSs is message exchanging for routing information and membership. The typical choice is to use gossip-based protocols (Lin & Marzullo, 1999), where each node, on receiving a message, forwards the message to a small subset of nodes that are randomly selected. This gossip-based messaging builds on probabilistic broadcast algorithms that trade reliability guarantees against scalability.

A Comparison of System Architectures

This thesis follows the use of the most traditional storage device, i.e., hard disks, to store the data of a KVS, because our aim is to provide data services for general purposes. Hence, in the remainder of this chapter, the focus is on on-disk KVSs. As discussed, the on-disk model is classified into shared-storage and shared-nothing. Table 3.2 presents a detailed comparison between these two storage models, with extra considerations on the coordination model.

Shared-storage KVSs typically use dedicated directory services to improve flexibility in the mapping between data objects and nodes. Hence, they commonly leverage the same dedicated component to act as a master node. The advantage of this centralised, shared-storage architecture is three-fold. First, the stateless computation is separated from the stateful storage. That is, the reliability and scalability of data are maintained by the underlying NAS or DFS, so that KVS nodes, which are responsible for query processing, can be added or removed without moving the actual data across nodes. Second, due to the existence of dedicated directories, it allows dynamic reallocation of data to achieve better load balancing. Third, the master node can simplify the coordination between nodes.

However, the dependence on dedicated components also gives rise to several disadvantages: i) it makes the system more vulnerable to single-node failure; ii) the availability of the system depends on the functioning status of the dedicated nodes; and iii) the total number of nodes in the system is constrained by the capability of the dedicated nodes, hence the scalability is somewhat limited.

In contrast, shared-nothing KVSs typically partition the key space into a structured P2P overlay topology to essentially form a DHT. Hence, this decentralised, shared-nothing architecture inherits several merits from DHTs. For one thing, DHT-based KVSs exhibit inherent scalability, which allows the KVS to scale to an extremely large number of nodes and to handle continual node arrivals, departures, and failures. For another, it is more tolerant of single-node failure, due to the decentralised manner of data distribution, and therefore provides higher availability than the shared-storage model using dedicated components. Additionally, unlike previous DHTs that provide O(log N) lookups, most shared-nothing KVSs store the complete routing information in each peer locally to support O(1) lookups (Ramasubramanian & Sirer, 2004), which is as efficient as using a dedicated directory as in shared-storage KVSs.

Property | Centralised, shared-storage | Decentralised, shared-nothing
Data storage | Multiple nodes access the shared data set | Each node accesses its private memory and disk(s)
Maintenance overhead | Dedicated directories required | Ongoing re-partitioning required
Data access speed | NAS or DFS access latency | Local disk access latency
Data shipping at each node change | Move metadata only | Move actual data
Load balancing | Dynamic load balancing via dedicated directories | Fixed load balancing based on partitioning
Fault tolerance | Vulnerable to single-node failure | Robust to single-node failure
Availability | Depends on availability of dedicated components | Higher availability due to no single-node failure
Scalability | Dedicated components limit total number of nodes | Inherent scalability, but data shipping may downgrade scalability
Consistency | All nodes see one copy of data, and conflicts are resolved by allowing multiple versions | Writes are propagated across nodes, and replicas converge within a time period
Examples | Bigtable, HBase, PNUTS, and Hypertable | Dynamo, Cassandra, and Voldemort

Table 3.2: Comparing centralised, shared-storage vs. decentralised, shared-nothing

Nevertheless, decentralised, shared-nothing KVSs also have downsides. First, since the data is stored on each individual node, the requirement of data migration at node changes may downgrade the efficiency of system scaling. Second, the use of distributed hash functions limits the flexibility of employing more dynamic load-balancing strategies. Last, the decentralised model requires delicate design for node coordination when certain decisions need to be made.

3.1.2 Consistency Model

Data consistency is an important property for systems that replicate and distribute data over many nodes. Transactional consistency, which guarantees that the database is in a consistent state once a transaction is finished, has been the cornerstone of relational databases.

Transactional consistency is a strong consistency model, usually assessed by two correctness criteria. First, serialisability (Papadimitriou, 1979) requires that a history of committed transactions issues the same operations and receives the same responses as a sequential history of transactions without concurrency. Second, linearisability (Herlihy & Wing, 1990) requires that all requests are ordered chronologically by their arrival time in the system and that all requests always see the effects of the preceding requests. In other words, all requests can be visualised as happening instantaneously at a single point in time, not during an interval of time.

However, the ACID requirements severely limit the scalability and availability of distributed systems. Eric Brewer neatly codifies this issue in his “CAP Conjecture” (Gilbert & Lynch, 2002), which states that a partition-tolerant distributed system can guarantee either consistency or availability, but not both. Relational databases often sacrifice availability to maintain strong consistency. In contrast, KVSs usually aim for higher availability and lower latency, and compromise on strong consistency.

An Overview of Consistency Properties

In a distributed storage system, there are two perspectives on consistency (Tanenbaum & Van Steen, 2007). From a developer’s point of view (i.e. the client side), consistency is focused on how data updates are observed by different processes that need to share information. Table 3.3 shows a number of consistency properties (Tanenbaum & Van Steen 2007, Vogels 2009) that are important to consider from the client side.

However, in the literature of KVSs, these client-side properties are not explicitly guaranteed. Instead, the standpoint is on the server side, wherein consistency is related to how data updates flow through the nodes that serve the corresponding replicas.

Consistency Property | Guarantee | Usage
Monotonic read | A following access sees data that is at least as fresh as what was seen before. | It helps as data versions become visible in chronological order.
Monotonic write | Two writes by the same process are serialised in the order in which they arrive at the server. | It prevents the last write of a datum from being overwritten.
Read-your-writes | The process that commits a recent write sees data that is at least as fresh as what it just wrote. | It prevents a process from re-issuing the same write several times because it sees a stale datum that it just wrote and (falsely) concludes that the previous write failed.
Causal | Writes that potentially have a causal relationship are seen by every process of the system in the same order. | It preserves all potential causality, which is evaluated via dependency trees or vector clocks.

Table 3.3: Consistency properties from the developer’s perspective

Consistency is maintained with quorum-based protocols, where three parameters need to be defined:

• N is the number of nodes that serve the replicas of some data;

• R is the number of replicas that should be accessed in a valid read operation;

• W is the number of replicas that are required to acknowledge the write before a write completes.

When W + R > N, that is, when the read and write sets always overlap in at least one node, strong consistency is guaranteed. Conversely, when W + R ≤ N, the read and write sets may not overlap, leading to weaker consistency. There are two situations where W + R > N cannot be satisfied, described below. One situation is maintaining high availability during a network partition, wherein some nodes in the system cannot reach other nodes, but both sides remain reachable by clients. The side of the partition that has fewer than R (or W) replicas becomes unavailable for read (or write) operations. The other situation is providing high performance. As the latency of a read (or write) is dictated by the slowest of the R (or W) replicas, R and W are usually configured to be less than N in order to provide lower latency. For example, R = 1 is the optimal read case, while W = 1 is the ideal write case. However, given the constraint that W + R > N, a smaller R results in a greater W, and vice versa, meaning that read and write performance cannot be simultaneously optimised. Hence, for systems that aim to serve heavy read (or write) loads, the latency of writes (or reads) can end up being very high.

Figure 3.3: Consistency from the system’s perspective — server-side consistency is either strong (i.e. W + R > N) or weaker (i.e. W + R ≤ N), the latter ranging from eventual consistency to limited ACID at the row or partition level

The fallback in both situations is to allow weaker consistency, where W + R ≤ N. Since the write set is smaller than the replica set (i.e. W < N), a common approach is to propagate a write lazily to the remaining replicas that did not acknowledge it in the first place. The time period until all replicas have applied the write is termed the inconsistency window (Vogels, 2009). Hence, during this time window, it is possible that the system will return a stale value, resulting in weak consistency. As shown in Figure 3.3, current-state KVSs offer various weaker consistency properties, from eventual consistency to limited ACID.
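The following Python sketch illustrates the quorum arithmetic described above: read and write quorums overlap, and thus every read sees the latest acknowledged write, exactly when W + R > N; with W + R ≤ N a read may land entirely on replicas that are still inside the inconsistency window. The replica representation and version numbers are illustrative assumptions, not the protocol of any particular KVS.

```python
def overlapping_quorums(n: int, w: int, r: int) -> bool:
    """Read and write sets are guaranteed to share a replica iff W + R > N."""
    return w + r > n

def quorum_read(replicas, r: int):
    """Return the freshest value among the first R replicas that respond.
    Each replica holds a (version, value) pair; replicas that have not yet
    received the latest write are still serving the stale version."""
    responses = replicas[:r]              # pretend these R replied first
    return max(responses, key=lambda pair: pair[0])[1]

# N = 3 replicas; a write with version 2 has so far reached only one of them,
# and the stale replicas happen to answer first.
replicas = [(1, "old"), (1, "old"), (2, "new")]

print(overlapping_quorums(3, w=1, r=1), quorum_read(replicas, r=1))  # False old
print(overlapping_quorums(3, w=1, r=3), quorum_read(replicas, r=3))  # True  new
```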

Eventual Consistency

Eventual consistency (Vogels, 2009) is a specific form of weak consistency, and guarantees that if no new updates are made to a data object, subsequent accesses will return the last updated value “eventually”, that is, after the inconsistency window just defined. Eventual consistency has two benefits. First, it allows temporarily disconnected replicas to remain fully available to clients, since it guarantees that updates are eventually delivered to all replicas. Second, it does not require updates to be immediately performed on all replicas, thus improving scalability.

Eventual consistency was first demonstrated in Amazon’s Dynamo (DeCandia et al., 2007) as an appropriate trade-off for building a highly available KVS. Dynamo allows each data object to have multiple versions, differentiated using vector clocks (Lamport, 1978). Conflicting versions are reconciled using “last write wins” (Thomas, 1979). Cassandra (Apache, 2009) extends eventual consistency to provide tuneable consistency. Given W and R as defined in the previous subsection, Cassandra allows the clients to tune each value separately for any given write or read, which trades consistency against latency. As discussed, a smaller W (or R) gives lower write (or read) latency. But if both W and R are small, such that W + R < N, then the consistency guarantee becomes weak. Hence, Cassandra leaves application developers the flexibility and responsibility to control the consistency level for each read and write.
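As a concrete illustration of the versioning mentioned above, the following Python sketch compares two vector clocks to decide whether one version causally descends from the other or whether the versions are concurrent and therefore conflicting; a reconciliation policy such as “last write wins” is then applied only to concurrent versions. This is a minimal sketch of the general technique, not Dynamo’s or Cassandra’s actual implementation.

```python
def compare_vector_clocks(vc_a: dict, vc_b: dict) -> str:
    """Compare two vector clocks (maps from replica id to update counter).
    Returns 'a-before-b', 'b-before-a', 'equal', or 'concurrent'."""
    replicas = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(r, 0) <= vc_b.get(r, 0) for r in replicas)
    b_le_a = all(vc_b.get(r, 0) <= vc_a.get(r, 0) for r in replicas)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a-before-b"      # version a is an ancestor of b and can be dropped
    if b_le_a:
        return "b-before-a"
    return "concurrent"          # neither descends from the other: a conflict

# Two replicas accepted writes for the same key while partitioned.
v1 = {"node-a": 2, "node-b": 0}
v2 = {"node-a": 1, "node-b": 1}
print(compare_vector_clocks(v1, v2))   # 'concurrent' -> reconcile, e.g. last write wins
```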

Although eventual consistency is widely adopted in KVSs (DeCandia et al. 2007, Voldemort 2009), it has been criticised as “very vague in terms of concrete guarantees”, and as a model that “often fulfils these guarantees (of the client-side consistency properties) for a majority of requests but does not guarantee to do so” (Bermbach & Kuhlenkamp, 2013). There are examples to back up this statement. Wada et al. (2011) observed SimpleDB (Amazon, 2008), and revealed that, when serving successive eventually consistent reads, SimpleDB provided the fresh value with a probability of only about 33% in the first read, and about 67% in the second read. SimpleDB also served stale data that violates the guarantees of read-your-writes and monotonic read consistency. Bermbach & Tai (2011) evaluated Amazon S3 (Amazon, 2006a), and also observed violations of monotonic read consistency in about 12% of all requests. Hence, an eventually consistent system does not provide safety guarantees, and can return any value within the inconsistency window.

Conditional ACID

Due to the lack of safety guarantees of eventual consistency, there are KVSs seeking conditional ACID properties over a single entity, such as a data object or a partition. These systems typically use the concurrency control methods of relational databases (Bernstein & Goodman, 1981), as well as distributed consensus algorithms (Lamport, 1998, 2001).

Transactional consistency has been guaranteed at the level of individual data objects (e.g., a row or record). Bigtable (Chang et al., 2006) provides row-level consistency with the technique of copy-on-write, in which a copy is made of the columns that are to be updated within a row. Bigtable uses one timestamp per column to resolve conflicts between different versions. It also allows the existence of multiple versions of the same data, which are indexed by timestamp. Yahoo!’s PNUTS (Cooper et al., 2008) provides per-record timeline consistency, which guarantees that all updates to a given record (i.e. row) are applied by all the replicas in the same order. This consistency model gives the same ACID guarantees as a transaction with a single write operation in it. PNUTS also implemented a whole range of API calls, which allow single-row transactions on a given version without any locks.

Recent research works focus on partition-level transactional consistency. ElasTraS (Das et al. 2009, 2013) uses a statically partitioned setup to limit the transactions of an application to single partitions. ElasTraS then leverages the Chubby locking service (Burrows, 2006), which uses the Paxos consensus algorithm (Lamport, 1998), to achieve consistency at the level of a partition replica (Chandra et al., 2007). Google’s Megastore (Baker et al., 2011) replicates each partition of the data store separately, so as to provide fully serialisable ACID semantics within partitions. This partitioning allows Megastore to synchronously replicate each write across a wide area network within reasonable latency, which supports interactive applications. Still, Megastore provides only limited consistency guarantees across partitions. Additionally, Google’s Spanner (Corbett et al., 2012), the successor to Bigtable, supports externally-consistent (Gifford, 1981) reads and writes at global scale, by sharding data across many sets of Paxos (Lamport, 1998) state machines for two-phase commits.

3.1.3 Quality Attributes

Quality attributes are non-functional requirements that specify criteria for judging the operation of a system. This thesis studies KVSs in terms of elasticity, which requires the KVS to scale in response to workload changes. Hence, the first and foremost quality attribute to investigate is scalability. Moreover, KVSs are distributed systems designed to run on hundreds or thousands of inexpensive commodity machines. At such scale, component failures are “the norm rather than the exception” (Ghemawat et al., 2003), and can result in an unavailable system and, worse, corrupted data. Hence, the design of KVSs also focuses on the availability of the system and the durability of data.

System Scalability

Scalability is the ability of a system to maintain consistent performance under an increasing workload. In other words, a system whose performance improves after adding computing resources, proportionally to the capacity added, is said to be a scalable system.

The scaling of a system can be horizontal, where computing resources are added to or removed from the system in the form of nodes (e.g. commodity machines). The process of horizontal scaling is also termed “scale out” and “scale in”. In contrast, scaling can also be vertical, where more computing resources (e.g., processors and RAM) are added to a single machine. Vertical scaling is also known as “scale up” and “scale down”. Nowadays, scaling is typically horizontal for distributed data stores, because low-cost commodity machines, rather than high-end machines, have been widely used.

The way that a KVS is scaled depends on its storage model, described in Subsection 3.1.1. In shared-nothing KVSs (DeCandia et al., 2007, Lakshman & Malik, 2010), where the data is stored on separate disks, scalability is typically achieved by data partitioning, which splits large tables into many small partitions that can be scattered across nodes. Scalability can be further improved if the number of partitions is sufficiently larger than the number of nodes, as the overhead of re-partitioning can be avoided when the system scales within an arbitrary number of nodes. Cassandra (Apache, 2009), from version 1.1 onwards, has started using a large number of “virtual nodes”, a concept proposed by Stoica et al. (2001), to build consolidated partitions that are transferable between nodes.
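To illustrate why keeping many more partitions than nodes helps, the following Python sketch rebalances a fixed set of partitions when a node joins: only whole partitions are reassigned, and no partition ever needs to be re-split. This is a minimal, made-up rebalancing policy for illustration, not Cassandra’s virtual-node implementation.

```python
from collections import defaultdict

def rebalance(assignment: dict, new_node: str) -> dict:
    """Move just enough whole partitions onto `new_node` so that every node
    ends up with roughly the same number of partitions."""
    nodes = sorted(set(assignment.values())) + [new_node]
    target = len(assignment) // len(nodes)
    by_node = defaultdict(list)
    for part, node in assignment.items():
        by_node[node].append(part)
    result = dict(assignment)
    for node in sorted(by_node):
        for part in sorted(by_node[node])[target:]:     # shed only the excess
            if sum(1 for n in result.values() if n == new_node) >= target:
                break
            result[part] = new_node
    return result

# 12 partitions spread over 3 nodes, 4 partitions each.
assignment = {p: f"node-{p % 3 + 1}" for p in range(12)}
after = rebalance(assignment, "node-4")
moved = sum(assignment[p] != after[p] for p in assignment)
print(f"{moved} of 12 partitions migrate to the new node; the rest stay put")
```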

In contrast, in shared-storage KVSs (Chang et al., 2006, Apache, 2010) where all the KVS nodes access the same data, the scalability of a system relies on the throughput of the underlying storage. When a DFS is in use (Ghemawat et al., 2003, Shvachko et al., 2010), the data files are usually segmented into a large number of data chunks, each being replicated and distributed across a number of nodes in the DFS. That is, shared-storage KVSs rely on the underlying storage layer to scale.

System Availability

The availability of a system is measured as the proportion of time the system is in a functioning condition. To build a highly available system, one should address the failure scenarios that may cause a downtime of the services provided by the system. To ensure high availability in distributed systems, two design principles have long been followed.

The first principle is to follow the mantra of “no single point of failure”. It has been adopted by P2P systems, such as Chord (Stoica et al., 2001) and Pastry (Rowstron & Druschel, 2001), where the network is symmetric (i.e., each node plays the same role).

The influence of the failure of any arbitrary node is equivalent, and thus can be minimised by supplying redundant resources. Conversely, in a centralised system where certain dedicated components play the master role, the failure of such dedicated components can cause a downtime for many services in the system. There are several remedies. Bigtable (Chang et al., 2006), which uses a single master, leverages a distributed lock service called Chubby (Burrows, 2006) to elect and ensure one active master at any time. This requires a very delicate implementation of the Paxos algorithm (Lamport, 1998, Chandra et al., 2007). In contrast, HBase (Apache, 2010), which derives from Bigtable and uses ZooKeeper (Hunt et al., 2010) as its distributed coordination service, has added support for multiple masters, since it is challenging to guarantee that a single master will always be available.

The second principle is to loosen ACID guarantees. It has been widely acknowledged by both industry (DeCandia et al., 2007) and academia (Fox et al., 1997) that data stores providing ACID guarantees tend to have poor availability. As discussed in Subsection 3.1.2, when a network partition occurs, certain partitions may not possess enough replicas to meet the quorum of strong consistency, resulting in unavailable writes or reads. Therefore, in order to tolerate network partitioning, the system must give up strong consistency in exchange for high availability. The approach is to allow a smaller W and R for writes and reads, where W + R < N (defined in Subsection 3.1.2). For example, Dynamo (DeCandia et al., 2007) allows other live nodes to store updates for nodes that are temporarily unavailable, with the help of a “sloppy quorum” and hinted handoff. Cassandra (Apache, 2009) and Riak (Basho Tech., 2012), which are derived from Dynamo, allow the clients to tune the values of W and R for each single operation.
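
To make the quorum mechanism concrete, the following minimal sketch (written in Python purely for illustration; the class and function names are assumptions, not Dynamo’s or Cassandra’s API) shows a write acknowledged by W replicas and a read served by R replicas, with versions reconciled by timestamp in a “last-write-wins” fashion.

```python
# Minimal sketch of per-operation tunable quorums over N replicas.
# Names (Replica, quorum_write, quorum_read) are illustrative only.
import time

class Replica:
    def __init__(self):
        self.store = {}            # key -> (timestamp, value)
        self.alive = True

    def put(self, key, ts, value):
        if not self.alive:
            raise ConnectionError("replica unavailable")
        current = self.store.get(key)
        if current is None or ts > current[0]:
            self.store[key] = (ts, value)

    def get(self, key):
        if not self.alive:
            raise ConnectionError("replica unavailable")
        return self.store.get(key)

def quorum_write(replicas, key, value, w):
    ts, acks = time.time(), 0
    for replica in replicas:
        try:
            replica.put(key, ts, value)
            acks += 1
        except ConnectionError:
            pass
    if acks < w:
        raise RuntimeError("write quorum of %d not reached" % w)

def quorum_read(replicas, key, r):
    versions = []
    for replica in replicas:
        try:
            v = replica.get(key)
        except ConnectionError:
            continue
        if v is not None:
            versions.append(v)
        if len(versions) >= r:
            break
    return max(versions)[1] if versions else None   # last-write-wins on timestamp

replicas = [Replica() for _ in range(3)]            # N = 3
replicas[2].alive = False                           # one replica is down
quorum_write(replicas, "user:42", "alice", w=2)     # W = 2 still succeeds
print(quorum_read(replicas, "user:42", r=1))        # small R favours availability
```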

Data Durability

Durability means that once an update to the data set has been committed, it will survive permanently, even in the event of power loss, crashes, or errors. However, when commodity machines fail, it is not uncommon that some machines never come back after the failure. In such a scenario, node failures can also cause data corruption or data loss. There are several techniques to support the durability of data when node failures occur.

One common technique is replication. Distributed file systems (DFSs) (Ghemawat et al., 2003, Shvachko et al., 2010), on which a shared-storage KVS is built, replicate data chunks across many nodes. Each chunk is usually associated with a version number. As an example, GFS (Ghemawat et al., 2003) uses checksumming to detect data corruption in each chunk, and relies on a master to track and discover any stale replica with an outdated version number. Shared-nothing KVSs (DeCandia et al., 2007, Lakshman & Malik, 2010) make multiple copies for each write via propagation. Each operation is associated with a timestamp, and version conflicts are reconciled typically based on “last-write-wins”.

Another technique is to use journals, which record write operations in chronological order before an operation is applied. If a data node fails before the write is persistently stored, the last write can be recovered at system restart by re-applying the operations recorded in the journal. Hence, the journal is also called the “redo log” in database systems.
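
As a rough illustration of this technique, the sketch below (with an assumed file name and record format; it mirrors the general idea rather than any particular KVS’s log) appends every write to an on-disk journal before applying it in memory, and re-applies the journal at restart.

```python
# Minimal sketch of a write-ahead journal ("redo log"): every write is appended
# to an on-disk log before it is applied in memory, so un-flushed writes can be
# re-applied on restart. File name and record format are illustrative only.
import json, os

class JournalledStore:
    def __init__(self, path="journal.log"):
        self.path = path
        self.memtable = {}
        self._replay()                      # recover the last writes, if any

    def _replay(self):
        if not os.path.exists(self.path):
            return
        with open(self.path) as f:
            for line in f:
                op = json.loads(line)
                self.memtable[op["key"]] = op["value"]

    def put(self, key, value):
        with open(self.path, "a") as f:     # 1) record the operation durably
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.memtable[key] = value          # 2) then apply it in memory

store = JournalledStore()
store.put("user:42", "alice")
print(store.memtable)
```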

In addition, KVSs can also periodically create a system image for the entire data set, and back it up to a more persistent storage. The most common system image is a database dump, usually written in the form of a list of SQL (or SQL-like) statements. Another form of system image is a snapshot, which is a complete copy of the data files that are set to “read-only” at a point in time. Compared to traditional database systems, KVSs are suitable for creating snapshots, as data files are mostly immutable, that is, unalterable once the file has been created (Chang et al., 2006, Apache, 2009).

3.1.4 Query Performance

The performance of query processing (i.e. storage and retrieval of data) is affected by several factors. First, the data model (i.e. the internal data structure of key-value pairs) decides the types of data sets the KVS can support. Next, the types of supported queries determine the applicability domain. In addition, the techniques that improve data storage and retrieval are also a key performance factor. The performance of a KVS can be assessed along several dimensions and, therefore, there are multiple approaches to benchmarking, which are reviewed later in this subsection.

Data Model

The abstract view of the data model in KVSs is a mapping from a unique key to a value that contains the information of a data object. The key is typically a primitive data type such as a number, string or byte array, while the value can use a more sophisticated data structure, depending on the type of data sets the KVS intends to support.

Figure 3.4: Various internal data models of a key-value pair (single entities such as blobs and files; column families with standard and super columns; and high-level abstract types such as lists, sets and hash maps)

The simplest structure of a value is a single entity that does not have or require interpretation of its internal structure. Examples include binary large objects (blobs), as in Dynamo (DeCandia et al., 2007) and PNUTS (Cooper et al., 2008), and files indexed by a string, as in filesystems and traditional P2P networks.

Column family is a more flexible structure, proposed in Bigtable (Chang et al., 2006). The value is a set of columns, each having a name, a value and a timestamp. In addition, a column is called a super column if it contains another set of columns. A column family sorts the data by keys, so essentially it is a sorted, hierarchical map, which supports both structured and semi-structured data. For the sake of storage, the value is serialised and treated as an uninterpreted string. KVSs that implement column family include Cassandra (Apache, 2009), HBase (Apache, 2010), and Accumulo (Apache, 2011).
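
A minimal sketch of this structure is given below; the ColumnFamily class and its methods are illustrative names, and the sketch captures only the row-key → column-name → (timestamp, value) mapping, not Bigtable’s or Cassandra’s storage format.

```python
# Toy column-family structure: each row key maps to a set of columns, each column
# carrying a name, a value and a timestamp; newer timestamps win on conflict.
import time
from collections import defaultdict

class ColumnFamily:
    def __init__(self, name):
        self.name = name
        self.rows = defaultdict(dict)   # row key -> {column name -> (timestamp, value)}

    def insert(self, row_key, column, value, ts=None):
        ts = ts if ts is not None else time.time()
        current = self.rows[row_key].get(column)
        if current is None or ts > current[0]:
            self.rows[row_key][column] = (ts, value)

    def get(self, row_key, column):
        entry = self.rows[row_key].get(column)
        return entry[1] if entry else None

users = ColumnFamily("users")
users.insert("user:42", "name", "alice")
users.insert("user:42", "email", "alice@example.com")
print(users.get("user:42", "name"))
```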

The richest data types are supported by KVSs such as Redis (2009) that host the entire data set in memory. The value of a data object can be a high-level abstract collection, such as a list, a sorted or unsorted set, or a hash map. Redis also provides atomic support for high-level operations such as intersection, union, and difference between the collections of data objects.
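
The short example below hints at what such collection values enable; plain Python sets stand in for the server-side data structures, so the operations shown are not actual Redis commands.

```python
# Collection-valued data objects: the store maps keys to sets, and queries can
# combine collections directly (intersection, union, difference).
store = {
    "followers:alice": {"bob", "carol", "dave"},
    "followers:bob":   {"carol", "erin"},
}

print(store["followers:alice"] & store["followers:bob"])   # common followers
print(store["followers:alice"] | store["followers:bob"])   # union
print(store["followers:alice"] - store["followers:bob"])   # difference
```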

Query Possibilities

KVSs are designed to support the storage and retrieval of data organised as key-value pairs. Queries in KVSs have been focused on relatively simple operations such as insertion, retrieval and scanning. Based on the granularity of data that is returned, the query mechanisms can be broadly classified into three categories, namely CRUD, multivalued access, and MapReduce support, described below.

First of all, KVSs support four standard “CRUD” operations: Create, Read, Update, and Delete. These operations are mostly applied to a single data object, indexed by its key. Some details are worth mentioning. Create inserts a data object, which is usually grouped with other data objects in memory before they are written into a sorted, immutable file. Read retrieves a data object given its key, or scans a list of contiguous data objects given a key range. Update modifies the value of a data object, and is treated as inserting a new data object with a newer version. Delete removes a data object given its key. In most on-disk KVSs, the data object is only marked as deleted, and the persistent data remains on disk until the immutable data file is rewritten during garbage collection.
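
The following sketch illustrates the deletion behaviour just described, under assumed names (Store, TOMBSTONE, compact): a delete only writes a marker, and the key physically disappears when the immutable files are rewritten.

```python
# Delete as a tombstone: the marker masks the key at read time, and the persistent
# data is only removed when files are rewritten during garbage collection.
TOMBSTONE = object()

class Store:
    def __init__(self):
        self.files = [{"a": 1, "b": 2}]     # immutable on-disk files (simplified)
        self.memtable = {}

    def delete(self, key):
        self.memtable[key] = TOMBSTONE      # mark as deleted; old data stays on disk

    def get(self, key):
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is TOMBSTONE else value
        for f in reversed(self.files):
            if key in f:
                return f[key]
        return None

    def compact(self):                      # rewrite files, dropping tombstoned keys
        merged = {}
        for f in self.files:
            merged.update(f)
        merged.update(self.memtable)
        self.files = [{k: v for k, v in merged.items() if v is not TOMBSTONE}]
        self.memtable = {}

s = Store()
s.delete("a")
print(s.get("a"))   # None: masked by the tombstone
s.compact()
print(s.files)      # the key is physically gone only after compaction
```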

The CRUD operations are only suitable for exact key matches. The need for supporting more complex queries gives rise to research efforts on providing multivalued access. G-Store (Das et al., 2010a) proposed a novel Key Group abstraction with the Key Grouping protocol, which provides a granule of transactional access over a dynamically selected set of keys. That is, G-Store uses the single-key access guarantees supported by KVSs to support transactional multi-key accesses. HyperDex (Escriva et al., 2012) provides a unique search primitive that enables queries on secondary attributes. It leverages the concept of hyperspace hashing, in which multi-attribute objects are deterministically mapped to coordinates in a low-dimensional Euclidean space. This mapping leads to efficient implementations for partially-specified secondary attribute searches and range queries. HyperDex demonstrated that partially-specified objects are retrieved 12-13 times faster than in Cassandra and MongoDB.

Aside from multivalued support, there is demand for running online analytical processing (OLAP) tasks over the big data stored in KVSs. Since MapReduce (Dean & Ghemawat, 2008) has evolved into the most widely used framework for processing large data sets, many KVSs also provide MapReduce-compatible interfaces. Hive (Thusoo et al., 2009) is a data warehouse infrastructure built on top of Hadoop. It provides SQL-like queries called HiveQL, which generate MapReduce jobs to analyse data sets stored in varied storage systems including HDFS (Shvachko et al., 2010) and HBase (Apache, 2010). DataStax Enterprise (DSE) (DataStax, 2013a) is an integration of Hadoop (Apache, 2008b) and Cassandra (Apache, 2009). In DSE, Cassandra serves data files for the input and output of Apache Pig (Olston et al., 2008) and Hive (Thusoo et al., 2009). Other KVSs that support MapReduce include PNUTS (Cooper et al., 2008), Accumulo (Apache, 2011), Riak (Basho Tech., 2012), and DynamoDB (Amazon, 2012).

In addition, KVSs have provided varied APIs to facilitate more flexible queries beyond simple retrieval and appending operations. PNUTS (Cooper et al., 2008) supports a range of API calls with varying levels of consistency guarantees. Cassandra provides a SQL-like interface called CQL (Cassandra Query Language) (DataStax, 2013b), which supports tuneable and linearisable consistency guarantees. Tables in HBase are also accessible through REST, Avro (Apache, 2012a) and Thrift (Apache, 2012b). Hence, KVSs are also termed NoSQL, that is, “Not only SQL”.

Data Access

KVSs are designed to provide high performance for the insertion and retrieval of data. The techniques for improving the performance of data access are associated with the storage models described in Subsection 3.1.1.

On-disk KVSs, such as Bigtable (Chang et al., 2006) and Cassandra (Apache, 2009), improve write performance by grouping the data in an in-memory data structure called a memtable, which is flushed to disk in one batch when the memory becomes full. This strategy is suitable for write-dominated workloads, as each write is committed by an append-only journal (i.e. write-ahead log), with the value buffered in memory. The data files on disk are immutable, allowing multiple versions of certain data to be stored simultaneously on disk. In order to reduce the number of I/Os when reading a data object, each data file is built with a Bloom filter (Bloom, 1970), which is a space-efficient data structure that can be loaded into memory. In KVSs, Bloom filters are used to test the existence of a given key in the related data file. Bloom filters allow false-positive (but never false-negative) test results, so all the data files that may contain the given key are read. As a result, multiple versions of certain data are reconciled at read time.
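
The sketch below puts these pieces together in simplified form: writes accumulate in a memtable that is flushed to an immutable “file” guarded by a Bloom filter, and reads probe the filters to skip files that cannot contain the key. The BloomFilter and OnDiskKVS classes, hash choices and flush threshold are all illustrative assumptions, not the implementation of any particular KVS.

```python
# Simplified on-disk KVS read/write path: memtable -> immutable file + Bloom filter.
import hashlib

class BloomFilter:
    def __init__(self, nbits=1024, nhashes=3):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.nhashes):
            digest = hashlib.md5((str(i) + key).encode()).hexdigest()
            yield int(digest, 16) % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):           # false positives possible, no false negatives
        return all((self.bits >> p) & 1 for p in self._positions(key))

class OnDiskKVS:
    def __init__(self, memtable_limit=2):
        self.memtable, self.files = {}, []   # files: list of (bloom filter, data) pairs
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:   # flush when "memory is full"
            bloom = BloomFilter()
            for k in self.memtable:
                bloom.add(k)
            self.files.append((bloom, dict(self.memtable)))
            self.memtable.clear()

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for bloom, data in reversed(self.files):        # newest immutable file first
            if bloom.might_contain(key) and key in data:
                return data[key]
        return None

kvs = OnDiskKVS()
kvs.put("a", 1); kvs.put("b", 2); kvs.put("c", 3)
print(kvs.get("a"), kvs.get("z"))
```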

On-flash KVSs also maintain a small amount of metadata in memory to improve performance. The difference is that on-flash KVSs usually build an in-memory index for locating data, rather than Bloom filters, which are more suitable for membership queries. For example, SkimpyStash (Debnath et al., 2011) and SILT (Lim et al., 2011) have achieved a small memory footprint (i.e. less than one byte per key). By comparison, in-memory KVSs, such as Memcached (Danga Interactive, 2004) and Redis (2009), have already achieved extremely high performance by hosting the entire data set in RAM. However, this type of KVS requires much higher investment in infrastructure, since RAM devices are much more expensive than hard disks and flash.

Benchmarking

Benchmarking is an effective approach to evaluating and comparing the relative performance of a system in various aspects. There have been a variety of benchmarks designed for different data storage systems, such as the SPC Benchmarks (SPC, 2013) for file systems, the TPC Benchmarks (Gray, 1992, TPC, 2001) for transactional evaluation, and XMark (Schmidt et al., 2002) for XML systems. Pavlo et al. (2009) focus on benchmarking batch or analytical systems, such as MapReduce in Hadoop (Dean & Ghemawat, 2008, Apache, 2008b) and relational OLAP systems.

In the context of KVSs, the benchmarking focus is on scalability, availability and query performance (e.g., throughput, response time, etc.). YCSB, i.e. the Yahoo! Cloud Serving Benchmark (Cooper et al., 2010), is an influential benchmark for KVSs. YCSB defines five core workloads that are usually seen in web sites, including read heavy, write heavy, read only, read latest, and short range query. It also provides choices to generate the load based on various distributions. YCSB is available as open source, with extensible interfaces that allow developers to implement new workload packages and to assess new systems. This extensibility has made YCSB popular in evaluating KVSs. YCSB++ (Patil et al., 2011) is an extension of YCSB to support multi-tester coordination, multi-phase workloads and abstract APIs. BG (Barahmand & Ghandeharizadeh, 2013) extends both YCSB benchmarks to rate data stores using a pre-specified service level agreement (SLA), which is the essence of the TPC-C benchmark (Gray, 1992). In addition, BG focuses on the processing of interactive social networking actions.

There are also research efforts on analysing the workload characteristics in large-scale data stores. Atikoglu et al. (2012) analysed five workloads from Facebook’s Memcached deployment, and revealed a number of workload patterns. For example, an examination of query performance over time clearly shows diurnal and weekly patterns, consistent with the statistics of I/O requests for Facebook’s photo storage (Beaver et al., 2010). It is also observed that applications can have an extreme read/write ratio (i.e. 30:1), higher than assumed in the literature. Other evaluations include: Netflix’s experience with Cassandra (Cockroft, 2011), Facebook’s Hadoop deployment (Borthakur et al., 2011), and a characterisation of workload traces in Windows Server 2008 (Kavalanekar et al., 2008).

3.1.5 Key-Value Stores in Practice

Google’s Bigtable

Bigtable (Chang et al. 2006, 2008) is the first-generation KVS that follows the centralised, shared-storage architecture. It resembles relational parallel databases (DeWitt & Gray, 1992) in many implementation strategies. For example, Bigtable stores data in a distributed file system called GFS (Ghemawat et al., 2003), and uses a distributed lock service called Chubby (Burrows, 2006) for consensus, which is similar to the Oracle RAC database (Oracle, 2001). In order to improve disk writes, the data is sorted and buffered in an in-memory structure called a memtable before being written to disk, like the updates in the LSM-Tree (O’Neil et al., 1996). To improve reads, rows (i.e., data objects) are grouped and then compressed based on a client-defined locality group, similar to column-based databases (Abadi et al., 2006). Bigtable also creates Bloom filters (Bloom, 1970) for each locality group of files (called SSTables) to reduce disk accesses.

As a KVS, Bigtable differs from relational databases in its data model and consistency model. For one thing, in order to support structured and semi-structured data for a variety of applications, Bigtable uses a flexible data model called column family, which is analogous to a table in relational databases. The internal structure of column family has been discussed in Subsection 3.1.4 (page 48). For another, in order to provide highly available services, Bigtable loosens the ACID requirements, and allows the co-existence of multiple versions of the same data object (i.e. data row). Bigtable maintains row-level consistency by using a technique called copy-on-write (COW), in which a new copy of the row is created when rewriting the row and the old copy persists as part of a previous snapshot.

There are many shared-storage KVSs that are modelled after Bigtable. HBase (Apache, 2010, George, 2011) is the open-source derivative, running on top of the Hadoop Distributed Filesystem (HDFS) (Shvachko et al., 2010), which resembles Google’s GFS (Ghemawat et al., 2003). Tables in HBase can serve as the input and output for Hadoop’s implementation of MapReduce (Dean & Ghemawat, 2008), which is the standard parallel, distributed framework for processing data on a large cluster. Hypertable (2009) is another KVS that follows Bigtable’s architecture. It improves the performance of random reads over HBase, using an in-memory CellCache with adaptive memory allocation. Hypertable has been deployed by Baidu (Dong, 2009), the leading Chinese language search engine.

Amazon’s Dynamo

Dynamo (DeCandia et al., 2007) is the first-generation KVS that follows the decentralised, shared-nothing architecture. It focuses on providing highly available data services “at massive scale”, and uses a synthesis of well known techniques to achieve scalability and availability.

To achieve high scalability, Dynamo is architected as a completely decentralised system, with the data partitioned and replicated using consistent hashing (Karger et al., 1997). This architecture closely resembles the overlay topology of Chord (Stoica et al., 2001). A gossip-based membership protocol is employed for failure detection in a totally decentralised and symmetric manner. To achieve high availability, Dynamo sacrifices the strong consistency required in databases, and provides eventual consistency using a synthesis of techniques. It leverages object versioning (Lamport, 1978), as in OceanStore (Kubiatowicz et al., 2000), to resolve conflicts between replicas. Moreover, to deal with temporary node failures, it also uses quorum-like synchronisation and hinted handoff. Dynamo has successfully demonstrated that eventual consistency can be used to design a highly available data store.

Dynamo was then followed by many shared-nothing KVSs, including Cassandra (Apache, 2009), Voldemort (2009), and Dynamo’s open-source derivative called Riak (Basho Tech., 2012). Cassandra (Apache 2009, Lakshman & Malik 2010) is a hybrid of Dynamo and Bigtable. It partitions and replicates the data using Dynamo’s Chord-like architecture to provide high scalability and availability, while modelling the data using the column family adopted from Bigtable (Chang et al., 2006) to offer a flexible, dynamic schema. Because of its great performance and flexibility, Cassandra is currently the most popular KVS, according to DB-Engines (2014a).

Memcached and Redis

Memcached (Fitzpatrick 2004, Danga Interactive 2004) is the representative of in-memory KVSs for caching. It uses the client-server architecture, wherein the servers cache the data as a unified distributed hash table based on consistent hashing (Karger et al., 1997), while the clients populate and query the table. It should be noted that Memcached is a transitory cache. That is, when the in-memory table is full, the oldest values are purged in least recently used (LRU) order. Memcached is widely used by many data-driven websites, including YouTube, Facebook, and Wikipedia. Other distributed caching KVSs are Ehcache (2009) and (2013). In addition, MemcacheDB (2008) is built on the Memcached protocol, although it uses Berkeley DB (Seltzer & Bostic, 1986) as its storage backend for data persistence.

In contrast, Redis (2009) is the most popular in-memory KVS that supports persistent data (DB-Engines, 2014b). It achieves data persistence in two different ways. The older version (up to 1.0) uses snapshotting, which periodically writes data from memory to disk in the format of a relational database dump. The current version uses an append-only, on-disk journal to record the operations processed in memory. Other persistence-enabled in-memory KVSs, such as (2009), implement write-ahead logging and snapshotting for data persistence. Furthermore, Redis also differs from other KVSs in its data types. The value of a data object in Redis can be not only a string, but also an abstract collection, such as a list, a (sorted) set, or a mapping as in object-oriented programming languages.

SkimpyStash and SILT

SkimpyStash (Debnath et al., 2011) is an on-flash KVS that is designed for high-throughput server applications. It uses a hash table directory in RAM to index key-value pairs stored in a log-structure on flash. The distinguishing feature of SkimpyStash is its extremely low RAM footprint at about 1 byte per key-value pair, which is more aggressive than other on-flash KVSs including FAWN (Andersen et al., 2009), BufferHash (Anand et al., 2010), ChunkStash (Debnath et al., 2010a), and FlashStore (Debnath et al., 2010b). SkimpyStash achieves its low RAM footprint by moving most of the pointers (for key lookup) from RAM to flash. To improve performance (i.e. reduce flash reads per lookup), SkimpyStash uses linear chaining (Litwin, 1980) to resolve in-memory hash table collisions, and employs Bloom filters (Bloom, 1970) to avoid unnecessary lookups. Hence, the use of flash and the low RAM footprint per key allow SkimpyStash to provide 100,000 get-set operations/sec.

SILT (Lim et al., 2011), i.e. Small Index Large Table, is another state-of-the-art on-flash KVS. SILT is a synthesis of three basic KVSs, each with a different emphasis, i.e. on either memory-efficiency or write-friendliness. LogStore, which is write-friendly, appends individual updates (i.e. PUTs and DELETEs) to a log file on flash, with an in-memory index built upon cuckoo hashing (Pagh & Rodler, 2004). When a LogStore becomes full, it is converted to an immutable, on-flash HashStore that does not require an in-memory index. Multiple HashStores are in use at a time, until they are merged into a SortedStore, which stores key-value data in sorted key order on flash. SortedStore implements entropy-coded tries to provide an extremely compact index at 0.4 bytes per key. Overall, SILT requires only 0.7 bytes of DRAM per key and retrieves a key-value pair using 1.01 flash reads on average.

Up to this point, four categories of KVSs have been reviewed. Bigtable (Chang et al., 2006) is a centralised, shared-storage KVS that employs many techniques from database systems, but still achieves high scalability and availability by replacing the relational model with column family, and by allowing multiple versions per data object. In contrast, Dynamo (DeCandia et al., 2007) is a decentralised, shared-nothing KVS that builds on Chord (Stoica et al., 2001), and also leverages many techniques to provide highly available data services. Both KVSs have been followed by many open-source projects including Cassandra (Apache, 2009), HBase (Apache, 2010), and Voldemort (2009).

Furthermore, Memcached and Redis serve as examples of in-memory KVSs that offer extremely high performance at the price of a high cost per bit of storage, while SkimpyStash and SILT are on-flash KVSs with a focus on reducing the size of the per-key metadata held in memory. Although this thesis focuses on on-disk KVSs, there are lessons to be learned from these two categories of KVSs, respectively: first, how to efficiently maintain a persistent backend storage while serving data from memory; and second, how to maintain a minimised amount of indexing for the large amount of data stored on flash (or on disk).

3.2 Approaches to the Elasticity of KVSs

As discussed in Subsection 2.2.3, distributed data stores (e.g. KVSs) that are efficient in elasticity should be able to add and remove nodes at runtime, which means that query processing should not be interrupted by data movement during node changes.

This requires delicate management of data in a KVS for the sake of node addition and removal. This section describes the properties and techniques regarding the algorithms for splitting the data set into partitions, the strategies for replicating the partitions among the nodes, and the approaches to moving data across the nodes.

3.2.1 Data Partitioning

Data partitioning is the division of a logical data set into distinct, independent parts. Having multiple partitions facilitates the distribution of data across nodes, and provides locality for grouped queries.

Dimensions of Data Partitioning

In the context of KVSs, partitioning is focused on the division of one table, either horizontally, or vertically, or both.

Horizontal partitioning, also known as “sharding”, splits a table according to the key into multiple blocks for the sake of storage. That is, some tables can be very large in size, and are split and stored by multiple nodes. There are several ways to partition a table horizontally. DHT-based systems use hash functions to segment the key space into a list of buckets, termed virtual nodes (Stoica et al., 2001). The mappings from virtual nodes to storage nodes can be one-to-one, as in Cassandra (Apache, 2009) (up to version 1.0), which partitions the data based on the number of nodes available. Alternatively, the mappings can be many-to-many, as in Dynamo (DeCandia et al., 2007) and Voldemort (2009), where each node serves multiple virtual nodes (i.e. data partitions) and each virtual node is replicated to multiple nodes. Additionally, KVSs with centralised components usually group the data objects with consecutive keys into partitions, called tablets (Chang et al., 2006, Cooper et al., 2008) or directories (Corbett et al., 2012). The centralised components also shard the partitions to store a bounded volume of data, so that each partition can serve as the unit of data distribution and load balancing.
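
As a concrete (and deliberately simplified) illustration of the virtual-node idea, the sketch below hashes keys into Q buckets and maps the buckets many-to-one onto physical nodes; the hash function and the round-robin assignment are assumptions made only for this example.

```python
# Hash-based horizontal partitioning: the key space is split into Q virtual nodes,
# and each physical node serves many virtual nodes (a many-to-one mapping).
import hashlib

Q = 16                                   # number of virtual nodes (partitions)
nodes = ["node-1", "node-2", "node-3", "node-4"]

# many-to-one mapping: virtual node id -> physical node (round-robin for simplicity)
assignment = {vnode: nodes[vnode % len(nodes)] for vnode in range(Q)}

def vnode_for(key):
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % Q

def node_for(key):
    return assignment[vnode_for(key)]

print(node_for("user:42"), node_for("user:43"))
```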

In contrast, vertical partitioning (a.k.a. “row splitting”) divides a table into multiple tables with fewer columns, with each resulting table having an independent locality. An example is the locality group in Bigtable (Chang et al., 2006), which segregates column families that are not typically accessed together into separate locality groups. In this way, each locality group provides all the columns that are required for the majority of queries, and therefore avoids the operations of filtering columns, so as to improve the efficiency of reads. The effectiveness of this approach has already been demonstrated in database systems (Stonebraker et al., 2005).

A table can be partitioned horizontally and vertically at the same time. For example, KVSs following the column family model can split a table into several column families (i.e. vertical partitioning). Then, each column family can be divided into multiple blocks, each storing a number of data objects (i.e. horizontal partitioning).

Data Partitioning at Node Changes

Elasticity of a KVS requires re-distribution of the data when a node is added or removed. We focus on the scheme of horizontal partitioning, because it determines the segment in which the data is located. There are various partitioning schemes in the KVSs, depending on the storage models discussed in Subsection 3.1.1.

Shared-storage KVSs, such as Bigtable (Chang et al., 2008), Spanner (Corbett et al., 2012), and HBase (Apache, 2010), do not require re-partitioning of the data when a node is added or removed. The persistent data is stored as one single image in the underlying distributed file system, such as GFS (Ghemawat et al., 2003) or HDFS (Shvachko et al., 2010). Das et al. (2011) proposed that data can be migrated by exchanging only the metadata of data blocks between the nodes of a database system (or a KVS), while the persistent data remains unmoved in the shared storage. A centralised controller is also used for metadata management in these schemes.

In contrast, in shared-nothing KVSs, the data has to be partitioned, because each node uses its individual storage to store a portion of the data. As discussed, the mappings between the resulting partitions (i.e. virtual nodes) and the nodes can be either one-to-one or many-to-one. Accordingly, there are two approaches to data partitioning for node addition (or removal), termed split-move and virtual-node-based, respectively. They are elaborated as follows.

Split-Move Approach

The split-move approach is commonly used in distributed hash tables (DHTs), and was adopted by Cassandra (Lakshman & Malik, 2010). Typically, consistent hashing (Karger et al., 1997) is used, as it introduces minimal disruption when a hash table (e.g. a key range or a partition) is resized during node changes.

Figure 3.5: Different partitioning approaches to the addition of a KVS node (left panel: the split-move approach; right panel: the virtual-node approach)

As shown in Figure 3.5, the key space is split into a list of consecutive key ranges, each assigned to one node. Therefore, each node maintains one master replica for its own range, and also stores the slave replicas of several other key ranges for high availability. When a new node is added to the KVS, the key space of the data set is re-partitioned. One or several existing partitions are split into two sets of data (e.g. B1 and B2 as in Figure 3.5). One is retained in the existing nodes. The other set of data is moved out, in the form of individual key-value pairs, to the new node.

There are multiple drawbacks to this approach. One is the overhead of moving individual key-value pairs. When a partition is split, the node contributing the subset has to scan its entire dataset to prepare a list of key-value pairs for the new node, which, on receiving the data, has to reassemble the key-value pairs into files. Both scanning and reassembling are heavyweight operations.

The other drawback is that consistent hashing aims to remap a minimal number of keys when the number of nodes changes. As a result, only a limited number of nodes can participate in populating the new node, each undertaking a relatively heavy workload. This is also true in the case of node removal. According to Amazon’s experience (DeCandia et al., 2007), this approach to node addition is highly resource-intensive, and is only suitable to run at a lower priority. However, low priority results in significantly slow addition of nodes, making the KVS adapt less quickly to dynamic load.

Virtual-Node-based Approach

A virtual node is a consolidated data partition that is transferable as a single unit. The idea of the “virtual node” was introduced in Chord (Stoica et al., 2001) and other consistent hashing systems, upon which KVSs such as Dynamo (DeCandia et al., 2007), Voldemort (2009), and Cassandra (Lakshman & Malik, 2010) are based. Other KVSs, such as Bigtable (Chang et al., 2008) and PNUTS (Cooper et al., 2008), use the term “tablet” instead.

This approach avoids the overhead of scanning and reassembling individual key-value pairs as in the split-move approach. In practice, the key space is over-partitioned, such that the number of virtual nodes is made much greater than the number of data nodes. Each data node is assigned many virtual nodes. Hence, a new node can be bootstrapped (i.e. populated with data) by multiple existing nodes, each offering one or several virtual nodes, so that each participating node shares a relatively small amount of the bootstrapping workload.
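
The sketch below illustrates why over-partitioning helps bootstrapping: because each node holds many virtual nodes, a new node can take whole virtual nodes from several donors at once. The even-share balancing rule used here is an assumption for illustration, not the algorithm of any cited system.

```python
# Bootstrapping under the virtual-node approach: several existing nodes each hand
# over one or a few whole virtual nodes to the newcomer, instead of one node
# splitting and streaming individual key-value pairs.
from collections import defaultdict

def bootstrap(assignment, new_node):
    """assignment: vnode id -> node name; returns the vnodes moved to new_node."""
    per_node = defaultdict(list)
    for vnode, node in assignment.items():
        per_node[node].append(vnode)
    target = len(assignment) // (len(per_node) + 1)     # even share per node
    moved = []
    # take vnodes from the most-loaded donors until the new node reaches its share
    for donor in sorted(per_node, key=lambda n: -len(per_node[n])):
        while len(moved) < target and len(per_node[donor]) > target:
            vnode = per_node[donor].pop()
            assignment[vnode] = new_node
            moved.append(vnode)
    return moved

assignment = {v: "node-%d" % (v % 4 + 1) for v in range(16)}
print(bootstrap(assignment, "node-5"))   # vnodes taken from multiple donors
```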

However, there is a lack of efficient data partitioning schemes for completely decentralised KVSs. The current-state research efforts (DeCandia et al. 2007, Voldemort 2009) use a simplified partitioning strategy, wherein the key space is split into static key ranges of equal length, or hashed into buckets with equal capacity. Although this strategy avoids complex coordination amongst the peer nodes, it leads to data skew for biased key distributions. That is, a majority of keys are allocated to a minority of partitions (or buckets). Data skew results in some “giant” partitions that are difficult to migrate because of the large volume of data (Cooper et al., 2008).

One refinement is to re-hash the inserted keys using uniform hash functions, most of which, however, are not order-preserving, making the support of range queries more difficult. For those uniform order-preserving hash functions, Aberer (2011) pointed out their fundamental limitation: the key space is discrete and cannot adapt to arbitrary application key distributions. Alternatively, PNUTS (Cooper et al., 2008) proposed to shard the tablets (i.e. partitions) into bounded sizes. However, it relies on a centralised component called the tablet controller, which is not applicable to KVSs following the decentralised architecture.

3.2.2 Data Placement

In a distributed data store, the partitions of a table (or a key space) are usually replicated to multiple nodes. The problem of data placement has been extensively studied in the literature, to meet requirements including load-balancing and data durability.

Random Placement of Replicas

In current-state KVSs (DeCandia et al. 2007, Lakshman & Malik 2010), the common approach to replica placement is to manage the data through coarse-grained structures such as buckets or virtual nodes, rather than identifying an optimal placement strategy at the granularity of a single data object, as proposed by Krishnan et al. (2000).

Consistent-hashing-based KVSs, such as Dynamo (DeCandia et al., 2007) and Cassandra (Lakshman & Malik, 2010), typically adopt a random replication strategy, wherein a hash function is used to randomly assign the replicas of groups of data objects (i.e. buckets or virtual nodes) to nodes. This strategy allows key lookups to be performed locally, in a very efficient manner (DeCandia et al., 2007). Other KVSs, such as Bigtable (Chang et al., 2006), PNUTS (Cooper et al., 2008) and Spanner (Corbett et al., 2012), rely on dedicated directory services that provide flexible mapping from virtual nodes to storage nodes. Essentially, this approach also uses a random replication strategy, because both virtual nodes and storage nodes are randomly chosen, without explicit constraints on the location of data.

Although random replication has the merits of simplicity and efficient key lookup (DeCandia et al., 2007), it also has two disadvantages. First, because it ignores the frequencies with which nodes access the data, the placement may result in highly sub-optimal performance in terms of load balancing. Second, random placement of data replicas leads to a higher probability of data loss when multiple nodes fail simultaneously. Cidon et al. (2012, 2013) have demonstrated that random replication is nearly guaranteed to cause a data loss event once the system scales beyond hundreds of nodes.

Data Placement For Load Balancing

There has been research on the problem of optimising data placement based on the workload for better load balancing and query performance.

Ceph (Weil et al., 2006) performs dynamic placement of individual data objects, with the aim of evenly distributing load over all nodes. However, Anderson et al. (2005) pointed out that the problem of finding load balancing schemes with minimal re-configuration costs is just a variant of bin-packing or knapsack, which is inherently expensive and only solvable offline. Instead, Everest (Narayanan et al., 2008) focused on dealing with peak loads, by off-loading the workload from overloaded storage nodes to under-utilised nodes.

Moreover, Ursa (You et al., 2011) formulates the problem of workload-aware replica placement as an integer linear programming (ILP) problem, with the goals of eliminating hot-spots while minimising the data movement cost. Yet, Ursa (You et al., 2011) relies on centralised components to compute the placement and to maintain a location map, which is not applicable to decentralised KVSs.

In contrast, AUTOPLACER (Paiva et al., 2013) uses a decentralised algorithm that optimises the placement of the top-k most frequent data objects in a self-tuning manner. It combines the usage of consistent hashing with a probabilistic mapping strategy that operates at the granularity of a single data object. AUTOPLACER relies on Bloom filters (Bloom, 1970) and decision-tree classifiers (Domingos & Hulten, 2000) to achieve a highly efficient re-allocation of a very large number of hotspot data objects.

Additionally, other research efforts (Laoutaris et al., 2006, Zaman & Grosu, 2011) have also proposed distributed algorithms for the problem of replica placement. However, these efforts only consider the placement of read-only replicas.

Data Placement for Data Durability

Replica placement and its impact on data durability have been studied extensively in the past. In order to achieve high data durability, peer-to-peer system evaluations (Weatherspoon & Kubiatowicz 2002, Rodrigues & Liskov 2005) have considered replication (Lee & Thekkath 1996, Dabek et al. 2001, Ghemawat et al. 2003) versus coding techniques (Patterson et al., 1988), and concluded that replication provides better robustness to survive the high rate of failures in distributed infrastructures.

In the context of DHT systems, Chun et al. (2006) have identified that randomly replicating data across a large set of nodes increases the data loss probability under simultaneous failures. They investigated the effect of different sizes of node sets for replication using their DHT system called Carbonite, which creates and keeps track of additional replicas to handle small-scale failure events at a low cost. Yu et al. (2006) also analysed the replication strategies used in DHTs (Stoica et al. 2001, Rowstron & Druschel 2001, Ratnasamy et al. 2001), and proposed to improve availability by constraining the placement of replicas within DHT-structured sets called “Groups”. Similarly, Glacier (Haeberlen et al., 2005) constrains each replica (of a partition) to be placed at equal distances in the keys’ hash space. Glacier also trades efficiency in storage utilisation for durability, by maintaining massive redundancy of data to provide highly durable data. However, this approach is expensive in terms of maintaining data consistency over a large number of replicas.

Cidon et al. (2013) also proposed to confine the placement of replicas to certain predefined sets of nodes, called “copysets”. Hence, data is lost only when all the nodes of some copyset fail together; the failure of a copyset of nodes is equivalent to the loss of the data confined to it. The copyset-based replication schemes focus on minimising the number of copysets in the system to provide high data durability. They compared this copyset-based strategy against random replication, and proved that the copyset-based strategy could significantly reduce the probability of data loss when a non-negligible percentage of nodes (e.g. 1%) fail simultaneously.
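
The following small simulation conveys the intuition behind copysets (the parameters and the disjoint grouping of nodes are illustrative choices, not the construction used by Cidon et al.): under random replication nearly every R-subset of nodes ends up holding some chunk’s replicas, whereas confining replicas to a few predefined copysets means a simultaneous failure of R nodes loses data only if those nodes form one of the copysets.

```python
# Contrast between random replication and copyset-based placement.
import random

NODES, R, CHUNKS = 12, 3, 1000
nodes = list(range(NODES))

# random replication: every chunk chooses its own R nodes
random_sets = {tuple(sorted(random.sample(nodes, R))) for _ in range(CHUNKS)}

# copyset replication: nodes are grouped into fixed, disjoint copysets up front,
# and each chunk places all R replicas inside one copyset
copysets = [tuple(nodes[i:i + R]) for i in range(0, NODES, R)]
copyset_sets = {random.choice(copysets) for _ in range(CHUNKS)}

# if R nodes fail simultaneously, data is lost iff the failed set is a used copyset
failed = tuple(sorted(random.sample(nodes, R)))
print("loss under random replication:", failed in random_sets)
print("loss under copyset replication:", failed in copyset_sets)
print("distinct copysets:", len(random_sets), "vs", len(copyset_sets))
```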

However, existing implementations of copyset-based replication (Cidon et al. 2013, 2012) rely on random permutation to create copysets. For each node addition, new copysets are formed without altering any existing copyset, while no affected copysets are dismissed at each node removal. Instead, an existing node is randomly selected to replace the removed node. This approach ends up with an increasing number of copysets if nodes join or leave dynamically, and is therefore not applicable to KVSs that require elasticity in response to workload changes.

There are other research efforts on data durability. Large companies, as in the case of Facebook (Hamilton, 2008) and Google (Ford et al., 2010), consider geo-replication as an effective technique to prevent data loss under large scale concurrent node failures. However, this approach is not feasible for storage providers that operate within one region. Disaster recovery systems (Chang et al. 2002, Patterson et al. 2002) use replication and mirroring to increase durability. These systems focus on the cost of propagating updates on data files that are alterable, which is not applicable to KVSs that store data in immutable files (DeCandia et al. 2007, Lakshman & Malik 2010).

In addition, the relation between data durability and the number of possible replicas has also been discussed previously (Saito et al. 2004, Van Renesse & Schneider 2004). The trade-off here is between reducing the probability of losing a data object during a simultaneous failure (by limiting the number of replicas) and improving the robustness to tolerate a higher average failure rate (by increasing the number of replicas). However, none of these have discussed the trade-offs between data durability and scalability.

3.2.3 Data Migration

The elasticity of KVSs requires nodes to be added or removed at runtime. To deal efficiently with node addition and removal, there arises the need for online data migration, the focus of which is on maintaining the ability to execute queries while data movement is in progress, which is termed live migration (Das et al., 2010b).

Shared-storage KVSs, such as Bigtable (Chang et al., 2006), which runs atop GFS (Ghemawat et al., 2003), do not require actual data migration. Instead, each new KVS node only needs to acquire small amounts of metadata or in-memory cache for a warm startup. For example, Albatross (Das et al., 2011), which targets the shared-storage architecture, provides a lightweight migration solution that simply hands over the identifiers (i.e. metadata) of data blocks across nodes. Albatross also uses a technique called Iterative Copy (Das et al., 2010b) to replicate the caches and metadata (i.e. file pointers) from the source node to the destination, with no aborted transactions, such that the destination node starts with a hot cache.

In contrast, data migration is inevitable in shared-nothing KVSs when node changes occur. There has been prior research on maintaining data availability during node addition or removal. Relational Cloud (Curino et al., 2010a) proposed various migration strategies, including splitting a partition into smaller partitions for incremental migration, and prefetching data to prepare warm stand-by copies. Zephyr (Elmore et al., 2011) minimises service interruption with a synchronised dual mode, where the source and destination nodes are both enabled to execute transactions. However, Zephyr is intended for migrating small amounts of data. In practice, the process of data migration can be extremely slow. As reported by Amazon, Dynamo “has taken almost a day to complete” data migration for adding a new node (DeCandia et al., 2007), because the throughput of data transfer is throttled to avoid imposing a large amount of additional workload on the source nodes. There is a lack of strategies that improve migration performance while maintaining service availability.

Slacker (Barker et al., 2012) is the state-of-the-art live migration solution, and is designed for CloudDB (Hacigumus et al., 2010). Slacker makes use of the “slack” (i.e. idle) resources for migration without seriously impacting workloads that are already present on the nodes. It is performed in three steps: i) a snapshot is prepared and transferred from the source node, which continues to serve queries during this step; ii) the delta is calculated by examining the query log of the source node; iii) a freeze-and-handover is performed to copy the delta to the destination node. This snapshot-delta-handover approach is similar to the VM migration technique employed by Clark et al. (2005), who also discussed some extreme cases such as workloads with very high write turnover.
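
A schematic of this three-step pattern is sketched below; the SourceNode class and migrate function are illustrative stand-ins that only mimic the snapshot, delta-replay and freeze-and-handover stages, not Slacker's implementation.

```python
# Snapshot-delta-handover pattern: bulk copy while live, replay the delta from the
# query log, then briefly freeze writes to hand over any final changes.
class SourceNode:
    def __init__(self):
        self.data, self.log, self.frozen = {}, [], False

    def put(self, key, value):
        if self.frozen:
            raise RuntimeError("writes rejected during freeze-and-handover")
        self.data[key] = value
        self.log.append((key, value))           # query log used to compute deltas

def migrate(source):
    destination = dict(source.data)             # step 1: snapshot copy, source stays live
    mark = len(source.log)
    source.put("k2", "v2-updated")              # simulate a write arriving during the copy
    destination.update(dict(source.log[mark:])) # step 2: replay the accumulated delta
    source.frozen = True                        # step 3: freeze-and-handover
    destination.update(dict(source.log[mark:])) # copy any remaining delta (idempotent)
    return destination

src = SourceNode()
src.put("k1", "v1"); src.put("k2", "v2")
print(migrate(src))                             # {'k1': 'v1', 'k2': 'v2-updated'}
```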

Additionally, migration can also be performed implicitly by replication. Schism (Curino et al., 2010b) uses a workload-driven, graph-based algorithm to decide the number of partitions to which each tuple should be replicated. The decision is made by trading off the costs of distributed updates against distributed transactions. EcStore (Vo et al., 2010) extends the BATON tree (Jagadish et al., 2005) to provide a two-tier replication mechanism, where ecStore creates secondary and slave replicas in addition to the primary copy. The secondary replicas are maintained to provide data availability, while the slave replicas are created for popular data objects to improve load balancing. Other research (Savinov & Daudjee, 2010) attempts to provision backup replicas when the master server is heavily loaded or even fails, wherein the replicas are created via a database dump, which contains a definition of the table structure and the data from a database, usually written in the form of a list of SQL (or SQL-like) statements.

3.3 Discussion and Summary

This chapter has provided a thorough survey of the design choices in KVSs. This survey is essential for carrying forward the lessons learned from the rich literature, presented along four aspects. First, the design choices in system architecture, such as the storage model and the coordination model, separate KVSs into categories such as centralised, shared-storage KVSs and decentralised, shared-nothing KVSs. Second, the consistency model is a dominant factor in system availability in light of the “CAP Conjecture” (Gilbert & Lynch, 2002), i.e., a weaker consistency model results in better availability. Moreover, quality attributes, such as scalability, availability, and data durability, specify criteria to assess the quality of system operations. Last, the variety and performance of queries directly affect user satisfaction. Multiple approaches to benchmarking were also reviewed for assessing a KVS’s performance in various aspects. Given this survey, this chapter has also studied a number of state-of-the-art KVSs that have had significant influence on the development of KVSs.

Based on the background of KVSs introduced, this chapter continued to examine the schemes of data management in KVSs in terms of efficient node addition and removal for the sake of elasticity. To begin with, this chapter has identified the need for data partitioning in shared-nothing KVSs, and then compared two common data partitioning strategies, termed split-move and virtual-node-based. It has concluded that split-move is inefficient in data migration, while virtual-node-based approaches rely on either hash functions for static partitioning, which leads to the issue of data skew, or centralised components for distributed coordination, which are not applicable to decentralised KVSs. Hence, there is a lack of decentralised data partitioning schemes that allow shared-nothing KVSs to efficiently add or remove nodes.

Moreover, this chapter has reviewed the strategies of data placement in current-state KVSs, most of which use random replication that randomly assigns replicas of data to nodes. As discussed, this strategy leads to two issues: i) sub-optimal load balancing, because it ignores the frequencies with which nodes access the data; and ii) a higher probability of data loss when multiple nodes fail simultaneously. Consequently, research efforts such as Ursa (You et al., 2011) and AUTOPLACER (Paiva et al., 2013) have addressed the problem of workload-aware data placement. The problem of data durability under multiple node failures was tackled by constraining the placement of data replicas to particular nodes (Haeberlen et al. 2005, Chun et al. 2006, Yu et al. 2006). Cidon et al. (2013) proposed the concept of the “copyset” to confine the placement of replicas to certain predefined sets of nodes. This copyset-based placement strategy minimises the probability of data loss at node failures; however, the formation of copysets is calculated offline, and is not adaptive to dynamic node changes. Hence, there is a lack of data placement strategies that minimise data loss under multiple node failures while allowing dynamic node addition and removal at runtime.

Last but not least, this chapter has also reviewed the approaches to moving data across nodes without affecting the execution of queries, termed live migration (Das et al., 2010b). The current-state solution is Slacker (Barker et al., 2012), which effectively makes use of the “slack” (i.e. idle) resources for migration without seriously impacting workloads that are already present on the nodes. Now that the research gaps in data partitioning and placement have been identified, the rest of this thesis will elaborate a set of data management schemes for the efficient elasticity of KVSs.

Chapter 4

Data Distribution for Efficient Elasticity of KVSs

Our greatest weakness lies in giving up. The most certain way to succeed is always to try just one more time.

– Thomas A. Edison

This chapter presents the design of a data distribution middleware that improves the efficiency of elasticity for decentralised shared-nothing KVSs, while the implementation and evaluation of this middleware will be presented in the next chapter. This chapter starts with an analysis of the challenges of efficient elasticity in the decentralised, shared- nothing architecture. Then, it presents the detailed design of a set of data distribution schemes. Finally, this chapter concludes with a general discussion on the contributions of this middleware.

4.1 The Elasticity Challenge

Distributed KVSs (Chang et al., 2006, DeCandia et al., 2007, Cooper et al., 2008) have become a standard component for many web services and applications due to their inherent scalability, reliability and data availability, even in the face of hardware failures. While KVSs have been mostly used in data centres, many enterprises are now adopting them for use on servers leased from the Infrastructure-as-a-Service (IaaS) Cloud.

As discussed in Section 2.1, computing resources from the IaaS Cloud are typically in the form of virtual machines (VMs), and can be provisioned or de-provisioned on-demand at any time. To deal with increasing workload, new VMs are acquired to improve the system’s capacity. Since IaaS providers normally follow the “pay-as-you-go” pricing model, redundant VMs can be shut down in the face of declining demand to save on economic costs. In the remainder of this thesis, the process of incorporating a new, empty VM as a member of the KVS is termed node bootstrapping. In contrast, the process of removing an existing member, whose data is redundant, from the KVS is called node decommissioning. In this regard, efficient elasticity requires a KVS to bootstrap or decommission a node quickly and frictionlessly, i.e. without affecting online query processing.

The storage model of a KVS determines the performance of data movement during node bootstrapping and decommissioning. In shared-storage KVSs, the persistent data is stored in the underlying storage, such as network-attached storage (NAS) or a distributed file system (DFS), which is accessed by all the nodes. The data can be migrated between nodes without actual data transfer. An example is Albatross (Das et al., 2011), which exchanges the metadata (e.g., identifiers or ownership) of data blocks located in the shared storage to achieve lightweight data migration.

In contrast, shared-nothing KVSs consist of distributed nodes, each with its own separate storage, coordinated as a distributed hash table (DHT). When a new node joins (or leaves) the KVS, it has to obtain data from (or hand data over to) its peer nodes. This process is usually throttled, or run in a low-priority thread, so as to avoid affecting online query processing by consuming network bandwidth and computational processing capacity. However, as revealed in Amazon’s experience (DeCandia et al., 2007), this strategy significantly slows the bootstrapping process, especially when the query workload is high.

It is non-trivial to efficiently bootstrap or decommission a node for shared-nothing KVSs. The goal of efficiency is three-fold. First, the negative impact of data movement against online query processing should be minimised. Second, data consistency and avail- ability should be maintained during bootstrapping and decommissioning. Third, after node changes, the load, in terms of both data volume and query workload that each node undertakes, should be re-balanced as quickly as possible. KVSs that handle node changes with such requirements are efficiently elastic.

The challenge of efficient elasticity in shared-nothing KVSs lies in the data distribution strategies when a node is bootstrapped or decommissioned. Specifically, it requires a mechanism that partitions the key space of a data set and then reallocates the partition replicas (i.e., via data migration) during node arrival or departure. Moreover, most shared-nothing KVSs (DeCandia et al., 2007, Lakshman & Malik, 2010) are essentially DHTs, deployed in a completely decentralised architecture (i.e., a P2P network). There is a need for decentralised coordination between the nodes to execute data distribution. However, as discussed in Subsection 3.2.1, the current-state data partitioning strategies, i.e. virtual-node-based, rely on either hash functions for static partitioning, which leads to the issue of data skew, or centralised components for distributed coordination, which is not applicable to decentralised KVSs.

This chapter describes the design of a middleware layer that provides a decentralised scheme of data partitioning and reallocation to improve the efficiency of node changes in shared-nothing KVSs. The main contribution of this work is a decentralised automated partitioning scheme that consolidates each partition of data into single transferable replicas. This scheme is extended from the virtual-node-based partitioning strategies. It eliminates the overhead of migrating individual key-value pairs, and overcomes the issue of data skew introduced by static partitioning in the virtual-node-based approach. Through automated partitioning, the data volume of each partition replica is confined to a bounded range.

This chapter also discusses the related placement and migration schemes for the consolidated partition replicas. The placement algorithm evenly reallocates the replicas when a node is bootstrapped or decommissioned, with the objectives of: i) rebalancing the data volume and workload each node undertakes; ii) maintaining high data availability; and iii) minimising data movement at startup for quick bootstrapping. The data migration leverages a token ownership mechanism to allow online query processing with eventual consistency guarantees, while a partition replica is being migrated from the source node to the destination.

4.2 Design of An Elasticity Middleware

The targeted system follows the typical decentralised shared-nothing architecture. As depicted in Figure 4.1, each node (denoted as ni) of a KVS runs in one single virtual machine (VM), provisioned from IaaS providers such as Amazon EC2 (Amazon, 2006b). Each VM is attached to an individual persistent storage on which the data assigned to the node is stored. The key space of a data set is split, or hashed, into multiple partitions.

Figure 4.1: The decentralised shared-nothing architecture of Key-Value Stores

Each partition, denoted as Pi, is replicated to multiple nodes for high availability, and each node serves many partitions for load balancing purposes.

The KVS processes queries in a similar way to DHTs such as Chord (Stoica et al., 2001) and Pastry (Rowstron & Druschel, 2001). Clients can connect to any node and execute CRUD (create, read, update and delete) operations given the key(s). We term the node to which a client is connected the master node for the operation (i.e., ni in Figure 4.1). The master node consults the routing information to prepare a list of nodes that serve the key. The query is applied to those nodes via remote procedure call (RPC) and the result is returned to the client via the master node. Unlike DHTs that use multiple-hop routing, each node of the KVS maintains enough routing information locally to route a request to the appropriate node directly, i.e., in 0 hops (Gupta et al., 2003b).

This section discusses the design of a middleware layer that sits between the key space of a data set and the storage of nodes. The design assumes the presence of a number of techniques that are already implemented in the KVS. First, a membership and failure detection protocol, such as gossip-based messaging (Van Renesse et al., 1998). Second, a sloppy quorum approach, like hinted handoff (DeCandia et al., 2007), to tolerate temporary node or network failure. Last but not least, data objects are assigned with timestamps, and a timestamp-based reconciliation is used to ensure eventual consistency.

The remainder of this section describes the synthesis of an automated partitioning algorithm, a decentralised coordination scheme, a replica placement strategy for dynamic node changes, and a data migration approach that maintains online query processing and data consistency, which together improve the efficiency of elasticity for decentralised, shared-nothing KVSs. The notational conventions used throughout the chapter are summarised in Table 4.1.

Table 4.1: Notational Conventions

Notation | Description
N | The number of nodes in a KVS
Q | The number of partitions of a key space
ni | The i-th node in a KVS
Pi | The i-th partition of a key space
Ti | The i-th token, which defines the boundary of Pi and Pi+1
νi | The total number of replicas of Pi
Θmax | The upper-bound size of a partition
Θmin | The lower-bound size of a partition
K | The replication number
L | The consistency level
hi,t | The local hit count of Pi between time t − 1 and t
Hi,t | The moving average of the local hit count of Pi at time t
Wi | The workload of the node ni in a certain period
Ri | The total number of replicas that the node ni has
i, j, k | Non-negative integers
{xi}i=1..n | The set of n elements, wherein the argument x can be any variable
σx | The standard deviation of {xi}i=1..n
x̄ | The average value of {xi}i=1..n

4.2.1 Automated Data Partitioning

The aim of automated partitioning is to confine the actual volume of data in each partition into a bounded range. The benefit is three-fold. First, partitioning adapts to the change of data volume, thus eliminating the concern of data skew in the virtual-node-based approach. Second, each partition replica can be consolidated as a single storage unit, which can be easily moved across nodes without extra preparation of data, such as scanning or reassembling. Third, the task of load-balancing becomes easier, because every partition replica contains a similar volume of data, which means that balancing the number of replicas in each node can result in a balanced distribution of data.

This subsection describes the partitioning strategy from the perspective of the key space, and the next subsection presents the coordination over a cluster of decentralised nodes to execute an automated partitioning task in a decentralised manner.

A Variant of Consistent Hashing

The partitioning algorithm builds on consistent hashing (Karger et al., 1997), in which the largest hash value is wrapped around to the smallest to form a fixed circular space or "ring". As shown in Figure 4.2a, when a data table is created (or declared), a sorted list of tokens, {Ti : 0 < i ≤ Q}, is generated to segment the key space of the data into Q consecutive equal-size key ranges. Note that in consistent hashing, T0 = TQ, and Q is configurable by the KVS administrators. Each key range defines one partition of data.

Therefore, each partition Pi can be associated with the token Ti, which defines the upper bound of Pi. The lower bound is determined by the predecessor Ti−1. To look up a key, simply walk the ring clockwise to find the first token that is larger than the key; the key falls within the partition represented by the found token.
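To make the lookup concrete, the following is a minimal sketch, not taken from the ElasCass implementation, of locating a key on the token ring via binary search over the sorted token list; the 128-bit MD5 hash and the toy token values are assumptions made purely for illustration.

import bisect
from hashlib import md5

def hash_key(key: str) -> int:
    # Map a key onto the ring; a 128-bit MD5 digest is assumed here for illustration only.
    return int(md5(key.encode("utf-8")).hexdigest(), 16)

def lookup_partition(tokens: list, key: str) -> int:
    """Return the index of the partition whose upper-bound token is the first token
    reached when walking the ring clockwise from the key's position.
    `tokens` is the sorted list {T1, ..., TQ}; keys in (T(i-1), Ti] map to Pi."""
    position = hash_key(key) % (2 ** 128)
    i = bisect.bisect_left(tokens, position)
    return i % len(tokens)          # wrap the largest hash value around to the smallest

# A toy ring with Q = 4 equal-size key ranges.
Q, ring = 4, 2 ** 128
tokens = [(j + 1) * ring // Q for j in range(Q)]
print(lookup_partition(tokens, "user:42"))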

This variant of consistent hashing was adopted by Dynamo (DeCandia et al., 2007), and has been widely followed by Project Voldemort (2009), Riak (Basho Tech., 2012), and Cassandra (Apache, 2009, version 1.2 onwards). As discussed, it is a static partitioning strategy, which is prone to data skew issues when the key distribution of the data set is biased. To overcome this limitation, this thesis proposes an automated partitioning algorithm extended from this variant.

Definitions of Split and Merge

Let Size(Pi) be the data volume of the partition Pi, termed the partition size. The maximum size Θmax and the minimum size Θmin are constants, defined as in Equation 4.1 before the KVS starts up. A partition Pi is split when Size(Pi) exceeds Θmax as a result of data insertion; at least one of the resulting partitions then has a size larger than Θmax/2. Conversely, two adjacent partitions (e.g., Pi−1 and Pi) are merged when their total data volume falls below Θmin due to data deletion. Therefore, the size of the merged partition is less than Θmin.

∀i ∈ [1, Q]:  Size(Pi) ≤ Θmax  and  Size(Pi−1) + Size(Pi) ≥ Θmin    (4.1)

To avoid oscillation between split and merge, it is required that Θmax > Θmin. In practice, we set Θmax ≥ 2Θmin. Therefore, when a partition is equally split (i.e., its original size exceeds Θmax ≥ 2Θmin), each sub-partition size is no less than Θmin, which does not trigger a merge even if the size of the adjacent partition is zero. Also, when two partitions are merged, the resulting partition size is below Θmin, i.e., at most about Θmax/2, which is only halfway towards triggering a split.

Figure 4.2: The automated data partitioning algorithm. (a) Consistent hashing; (b) split or merge partitions based on data volume.

Given Θmax and Θmin, we can also estimate the average data volume of a partition. It is assumed that data objects (i.e., key-value pairs) are randomly inserted and deleted across the key space over a long run. Consider the total size of any two adjacent partitions. According to Equation 4.1, the maximum total size (of the two partitions) can be 2Θmax, while the minimum total size can be Θmin. Since the data is randomly scattered, the average total size of two adjacent partitions is close to the midpoint of the maximum and the minimum, which is (2Θmax + Θmin)/2. Hence, the average data volume of a partition, denoted as Size(P), can be estimated by Equation 4.2.

Size(P) = Θmax/2 + Θmin/4    (4.2)

This estimation helps the KVS administrators to decide the values of Θmax and Θmin. For example, let Θmax = 2Θmin. Then, the average partition size is 0.625Θmax, that is, 62.5% of the upper bound.
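As a back-of-the-envelope illustration of Equation 4.2, the helper below estimates how many partitions a data set of a given size is expected to produce; the input sizes are assumptions, not values used in the thesis.

def average_partition_size(theta_max: float, theta_min: float) -> float:
    # Equation 4.2: the long-run average volume of a partition.
    return theta_max / 2 + theta_min / 4

def expected_partition_count(dataset_bytes: float, theta_max: float, theta_min: float) -> int:
    return round(dataset_bytes / average_partition_size(theta_max, theta_min))

# With THETA_MAX = 2 * THETA_MIN = 256 MB, the average partition holds 160 MB (62.5% of THETA_MAX),
# so a 100 GB data set would be expected to span roughly 640 partitions.
print(expected_partition_count(100 * 2 ** 30, 256 * 2 ** 20, 128 * 2 ** 20))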

Split and Merge Operations

Figure 4.2b illustrates the management of key ranges during a partition split or merge. When the partition Pi is split, a new token Tnew is chosen within the key range between Ti−1 and Ti, such that the resulting sub-ranges (Ti−1, Tnew] and (Tnew, Ti] contain roughly equal volumes of data. Since the tokens form a sorted list, when Tnew is inserted into the ring, Tnew becomes the new Ti. The original Ti becomes Ti+1, and Ti+1 becomes Ti+2, and so forth. Although the insertion of Tnew increases the ordinals of the tokens that follow it, it only changes the key range of Pi, which is split into the new Pi and Pi+1.

To merge two adjacent partitions Pi and Pi+1, the token Ti that sets the boundary of these two partitions is removed. Thus, the merged key range is (Ti−1, Ti+1]. Similarly, when Ti is removed, the original Ti+1 becomes the new Ti, and the ordinals of the following tokens decrease by one. Thus, Pi and Pi+1 are merged as the new Pi.
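The token bookkeeping for split and merge reduces to an insertion into, or a removal from, the sorted token list. The sketch below is a simplified, list-based illustration with hypothetical names; in the real system the token list is replicated and updated through the coordination described in the next subsection.

import bisect

def split_partition(tokens: list, i: int, t_new) -> None:
    """Split the partition bounded above by tokens[i] by inserting t_new into the
    sorted token list; t_new becomes the new upper bound of the left sub-range."""
    assert (i == 0 or tokens[i - 1] < t_new) and t_new < tokens[i]
    bisect.insort(tokens, t_new)

def merge_partitions(tokens: list, i: int) -> None:
    """Merge the partitions on either side of tokens[i] by removing that boundary
    token; the merged key range becomes (tokens[i-1], tokens[i+1]]."""
    del tokens[i]

tokens = [25, 50, 75, 100]
split_partition(tokens, 1, 40)   # split (25, 50] into (25, 40] and (40, 50]
merge_partitions(tokens, 2)      # merge (40, 50] and (50, 75] back into (40, 75]
print(tokens)                    # [25, 40, 75, 100]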

By comparison, the operation of split is straightforward, since it is associated with only one partition. However, the operation of merge involves two partitions, and thus requires additional considerations. On one hand, since a partition with a small volume of data is easy to migrate between nodes, there is little harm in retaining "sparse" partitions that contain small amounts of data, rather than merging them aggressively, which introduces extra workload. On the other hand, keeping the sparse partitions downgrades load-balancing, as these partitions are likely to attract less workload. In addition, maintaining a large number of partitions also increases the overheads of metadata management.

Hence, we attempt to merge partitions whenever applicable, except for two scenarios. First, two adjacent partitions will not be merged, if they are not stored on the same set of nodes. Note that each partition is replicated to a set of nodes. A merge will not be executed if there exists at least one node that stores only one of the two adjacent partitions. In most cases, it is not necessary to move the two partitions onto the same set of nodes for the sake of merging partitions. However, it does no harm to move the partitions if both the source and destination nodes have sufficient idle resources. This idea is similar to Slacker (Barker et al., 2012), which proposes “slack migration”. Second, we try to maintain a minimum number of partitions in each key space. If the actual number of partitions is no greater than the predefined value Q, then the merge operation will not be triggered. This is to restrict excessive merging of partitions, especially when a table is recently created while the data is not yet populated.

Rebuilding Replicas for Automated Partitioning

The challenge of automated partitioning, either split or merge, lies in the consolidation of each partition replica as a single transferable unit. It requires that data objects (i.e., key-value pairs) belonging to different partitions be stored in separate data files. A consolidated replica can be transferred in the form of a list of data files, rather than key-value pairs. Compared to KVSs that do not specifically use separate files for different partitions, this consolidation reduces the overheads of data migration in two ways. First, it saves the computation of serialising (or marshalling) key-value pairs for transmission. Second, it eliminates the scan for every desired key-value pair to be reallocated, and the reassembly into data files at the destination.

Figure 4.3: Rebuild replicas during automated partitioning

Figure 4.3 illustrates how to rebuild a consolidated replica when a partition is split or merged. When splitting a partition Pi with a token Tnew, the data files of a Pi replica are rewritten into two groups of files: one stores the key-value pairs with keys ranging between Ti−1 and Tnew, while the other stores the data between Tnew and Ti+1 (the original Ti). In contrast, when merging Pi and Pi+1, the data files of these two partitions are rewritten into one group of files, which stores the key-value pairs ranging between Ti−1 and Ti+1.

Rebuilding replicas of the affected partitions is the key operation in automated partitioning. It removes the overheads of data migration during node bootstrapping and decommissioning, by preparing the partition replicas as transferable units in advance. However, since each partition has multiple replicas on different nodes, a decentralised coordination scheme is required to accomplish the task of rebuilding replicas. This coordination scheme is presented in the following subsection.
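A minimal sketch of the rebuild step is given below, assuming, hypothetically, that each replica is persisted as sorted lists of key-value pairs; a split rewrites one file group into two by key range, and a merge concatenates two groups into one. The real implementation operates on Cassandra's storage files, as Chapter 5 describes.

def rebuild_for_split(replica_rows, t_prev, t_new, t_upper):
    """Rewrite the rows of a Pi replica into two file groups:
    keys in (t_prev, t_new] for the new Pi and keys in (t_new, t_upper] for the new P(i+1)."""
    left = [(k, v) for k, v in replica_rows if t_prev < k <= t_new]
    right = [(k, v) for k, v in replica_rows if t_new < k <= t_upper]
    return left, right

def rebuild_for_merge(rows_i, rows_i_plus_1):
    """Rewrite the rows of Pi and P(i+1) into a single sorted file group."""
    return sorted(rows_i + rows_i_plus_1)

rows = [(10, "a"), (30, "b"), (45, "c")]
print(rebuild_for_split(rows, 0, 30, 50))            # ([(10, 'a'), (30, 'b')], [(45, 'c')])
print(rebuild_for_merge([(10, "a")], [(45, "c")]))   # [(10, 'a'), (45, 'c')]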

4.2.2 Decentralised Coordination

Coordination of Data Partitioning

Since each partition is replicated to multiple nodes, the operation of rebuilding each local replica is executed by different nodes asynchronously. Thus, coordination is required to ensure the consistency of the key space and persistent data across the nodes that participate in partitioning. As shown in Figure 4.4, data partitioning in a distributed KVS is coordinated in four steps. In the following discussion, the partitions that are to be split or merged are termed the targeted partitions, while the nodes that host the targeted partitions are termed the participating nodes.

Figure 4.4: Automated partitioning with election-based coordination

Step 1: Election.

When the data volume of certain partition replicas reaches the boundary (Θmax or Θmin), the node that serves the targeted partitions initiates a partitioning operation. This operation can be initiated by multiple participating nodes simultaneously. Hence, a coordinator node should be nominated to supervise the whole operation.

The election of a coordinator relies on the Chubby implementation (Burrows, 2006), which is extensively used by Bigtable (Chang et al., 2006) for a variety of distributed tasks. Following Chubby, the coordinator must obtain votes from a majority of the participating nodes. Each vote is accompanied by a promise that the voting node will not elect a different coordinator for a time interval known as the master lease, which is periodically renewed. In our design, the node that initiates the partitioning retrieves the complete list of nodes that serve the partition. The list is sorted by certain criteria, and the node on top is voted as the coordinator (for this single operation only). The other participating nodes also vote for the node on top of the list. Thus, the node that receives a majority of votes becomes the coordinator. Note that the coordinator is also a participating node that rebuilds its replica.

There are two challenges with this election mechanism. First, some participating nodes may use an outdated partition-node mapping, and vote for different nodes. The remedy is to force the nodes to "gossip" their lists of nodes serving the targeted partitions before the election. Each node updates the gossip message by answering whether it owns the targeted partitions or not. Thus, the information of the partition-node mapping is updated via gossiping. Second, in some KVSs, each partition is replicated to only two or three nodes, so a consensus cannot be reached if one or two of the participating nodes fail to respond. In that case, a public vote is initiated instead, in which all the living nodes of the KVS are invited to vote in the election.
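The election in Step 1 can be sketched as follows, with hypothetical helper names: every participating node applies the same deterministic ordering to the gossiped list of nodes serving the targeted partitions and votes for the head of the list; a candidate that gathers a majority of the votes acts as coordinator for this single operation. The lexicographic sort criterion is an assumption made for illustration.

from collections import Counter

def cast_vote(participants):
    # Every node applies the same deterministic ordering (here simply the lexicographic
    # order of node identifiers, an assumed criterion) and votes for the head of the list.
    return sorted(participants)[0]

def elect_coordinator(votes, participants):
    """Return the coordinator if some candidate gathers a majority of the votes,
    otherwise None, in which case a public vote over all living nodes would follow."""
    candidate, count = Counter(votes).most_common(1)[0]
    return candidate if count > len(participants) // 2 else None

participants = ["node-7", "node-2", "node-9"]
votes = [cast_vote(participants) for _ in participants]   # every node votes for "node-2"
print(elect_coordinator(votes, participants))             # node-2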

Step 2: Notification.

The elected coordinator makes all the decisions for this single partitioning operation. In the case of a split, the coordinator calculates the splitting token Tnew based on its own local replica of the targeted partition. This token Tnew will be used by all the participating nodes. In the case of a merge, the coordinator examines whether the extra prerequisites for merging are satisfied. For example, the two targeted partitions should be located on the same list of nodes; if they are not, the coordinator decides whether a reallocation of the targeted partitions can be performed between any idle nodes available. In our design, the coordinator identifies an idle node based on a threshold of CPU utilisation, which is pre-defined by the system administrators (e.g., CPU usage below 30%).

Once the prerequisites are met, the coordinator announces that a split or a merge should be launched. All the participating nodes then start to rebuild their own replicas after they receive the notification. On the other hand, if the prerequisites (of a merge operation) are not met, the coordinator will cancel the operation. To prevent the participating nodes from repeatedly initiating the same merge operation that is to be cancelled, the coordinator announces that this operation should not be resubmitted within a long time period.

Step 3: Synchronisation.

The operation of rebuilding replicas is executed within the participating nodes. Depending on the size of the replicas and the workload each node undertakes, the time taken to rebuild the replicas can vary. This operation is completed asynchronously by different nodes. When a node finishes, it notifies the coordinator and then waits for further announcement. The coordinator synchronises this operation until all the participating nodes have finished.

During this operation, each participating node creates the internal replicas for the new partitions produced by the split or merge, but still uses the original partitions for query processing. From the perspective of the non-participating nodes, the key space remains unchanged until the coordinator makes the relevant announcement.

Step 4: Announcement.

Once the coordinator has received the acknowledgement of Finish from all the participating nodes, it notifies the participating nodes to replace the old replicas with the newly-built replicas. The participating nodes update their internal file pointers and acknowledge the replacement immediately. Finally, the coordinator announces globally that the key range of the affected partition should be updated to the new range. On receiving this final announcement, every node in the KVS updates its query routing information. At this point, the partitioning operation is completed.
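Putting the four steps together, a coordinator-side driver might look like the sketch below; the names are placeholders, the messaging layer is abstracted into direct method calls, and the code mirrors Figure 4.4 rather than reproducing the actual ElasCass implementation.

class ParticipatingNode:
    """A toy participating node; rebuilding and replacement are simulated as instantaneous."""
    def __init__(self, name):
        self.name = name
    def rebuild(self, partitions, op):
        return "finish"                      # Step 3: acknowledge completion to the coordinator
    def replace_replicas(self, partitions):
        return "replaced"                    # Step 4: swap file pointers to the new replicas

def run_partitioning(participants, targeted_partitions, op, prerequisites_met=True):
    """Coordinator-side driver mirroring the four steps of Figure 4.4."""
    # Step 1 (election) is assumed to have completed; this node acts as the coordinator.
    if op == "merge" and not prerequisites_met:
        return None                          # Step 2: cancel and suppress resubmission
    acks = [n.rebuild(targeted_partitions, op) for n in participants]   # Step 2: notify
    assert all(a == "finish" for a in acks)                             # Step 3: synchronise
    for n in participants:
        n.replace_replicas(targeted_partitions)                         # Step 4: replace replicas
    return ("update_key_range", targeted_partitions)                    # Step 4: global announcement

nodes = [ParticipatingNode(f"n{i}") for i in range(3)]
print(run_partitioning(nodes, ["P4"], "split"))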

Failover During Data Partitioning

Based on the four-step coordination described above, we now discuss how to handle node failures during a partitioning operation. This failover scheme aims to achieve two goals. First, if only one participating node fails during the process, the partitioning can proceed and succeed regardless of whether the failed node resurrects or not. Second, if more than one participating node fails, the partitioning operation can be aborted and rolled back without data loss.

The failure detection in our system is gossip-based (Van Renesse et al., 1998). The first challenge is the failure of failure detection. We assume detection error exists, since a failure detector is not always completely accurate. There are two types of detection errors. A false-negative detection error, due to the delay in detection, considers a node as still alive, while the node is actually dead. In our design, the communication between a coordinator and a non-coordinator follows the typical three-way handshaking policy as in a TCP connection. Hence, this type of error can be easily detected, because each message requires an acknowledgment from the receiver. In contrast, a false-positive detection error is usually caused by message loss. It detects a node as dead, while the node is still alive. We focus on addressing this type of detection error in the failover scheme.

Figure 4.5, extended from Figure 4.4, depicts the failure recovery scheme. As shown, the discussion of the failover falls into three parts, depending on whether a participating node is detected as failed (by gossip) before, during, or after the execution of rebuilding the replicas of the partition Pi. In the following, Tpause is defined as a time period that is longer than twice the end-to-end gossip broadcast delay. This time period is effective in allowing a failed node to resurrect, because most temporary node failures are due to false-positive detection errors caused by message loss.

Figure 4.5: Failure recovery in the election-based coordination

Before the execution.

This phase starts when a partitioning operation is initiated, and ends before a notification for partitioning is sent out by the coordinator. In this period, if a participating node is detected as failed, the partitioning procedure is paused for Tpause. There are three scenarios following the pause. First, if more than one participating node is detected as failed, then the partitioning operation is aborted. Second, if only one participating node fails, and it is the coordinator, then a different coordinator should be elected, even if the previous coordinator resurrects. Third, if it is one non-coordinator that fails, then the partitioning operation should continue after Tpause.

In addition, the node that resurrects within Tpause can continue to participate in the operation, while the node that resurrects after Tpause is not allowed to participate. Yet, this node continues to serve queries for the targeted partitions, until the coordinator announces an update to the key range (at the end of Step 4). At this point, this node invalidates its own replicas, and it will no longer serve the targeted partitions.

During the execution.

This phase starts when the notification for partitioning is sent, and ends when all the living nodes have finished the execution of rebuilding replicas. During this period, if there is only one node that fails (even if it is the coordinator), the other participating nodes continue the operation of rebuilding the replicas regardless. However, if more than one participating node is detected as failed, and they remain dead for a period of Tpause, then the operation is aborted.

The challenge here is to determine the number of failed nodes if the coordinator itself fails. In the design, every participating node maintains the complete list of participating nodes, and is responsible for detecting node failures. Thus, any participating node that confirms the event of multiple node failures (i.e., after waiting for Tpause), can announce the abortion of partitioning via broadcast.
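The abort decision can be sketched as below: a participating node tracks which peers the failure detector reports as dead, waits Tpause so that false-positive detections can clear, and broadcasts an abort only if more than one peer is still considered dead. The names, the timing value, and the callback-based liveness check are illustrative assumptions.

import time

T_PAUSE = 5.0   # assumed to exceed twice the end-to-end gossip broadcast delay

def should_abort(suspected_dead, is_alive, pause=T_PAUSE):
    """Abort the partitioning only if, after waiting Tpause for false-positive
    detections to clear, more than one participating node is still dead."""
    if len(suspected_dead) <= 1:
        return False                  # a single failure never stops the operation
    time.sleep(pause)                 # give temporarily unreachable nodes time to resurrect
    still_dead = [n for n in suspected_dead if not is_alive(n)]
    return len(still_dead) > 1

# One of the two suspected nodes resurrects within the pause, so the operation proceeds.
print(should_abort({"n3", "n5"}, is_alive=lambda n: n == "n3", pause=0.01))   # False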

After the execution.

The execution of rebuilding replicas is considered complete when all the living nodes have finished the execution, with no more than one node having failed during the execution. At the end of the execution, if there is one failed node, then the partitioning procedure is paused for another Tpause to await the node's resurrection. If the node resurrects within Tpause, the other nodes wait until the resurrected node finishes rebuilding its replicas. Otherwise, the failed node is excluded from the participation list.

Now that all the living nodes have been settled, the targeted partitions are invalidated on any dead nodes. In addition, if the dead node is the coordinator, then a new coordinator is elected amongst the living nodes, following the same algorithm as in Step 1. Once the coordinator is in place, it announces that the partitioning is successful, as in Step 4.

In any other scenario not discussed above, the partitioning operation can be safely aborted. Such abortion does not incur any data loss, because the original partition replicas are still in use before the final announcement from the coordinator. The aborted partitioning operation will be reinitiated after a long pause. In addition, during the failover, some failed nodes may have invalidated their own replicas. If they resurrect, these nodes have the priority to regain the ownership of the targeted partitions after the partitioning operation is completed or aborted.

Hence, this automated partitioning algorithm consolidates partition replicas into units suitable for efficient data migration. Based on such consolidated replicas, the next subsection discusses the replica placement strategy for node bootstrapping and decommissioning.

4.2.3 Replica Placement

This replica placement scheme is focused on the assignment of partition replicas to the nodes. It differs from, but does not contradict, the ElasticCopyset placement scheme described in Chapter 6, which determines how to group the nodes for higher data durability at multiple node failures. Hence, both placement schemes can co-exist in a KVS. The rules that this placement scheme follows are listed below. The notational conventions used can be found in Table 4.1.

• Rule 1: Complexity Reduction. Partition reallocation and repartitioning are mutually exclusive. That is, partitions that are being split or merged will not be selected for reallocation, and partitions that are being reallocated will not be split or merged. This is to avoid the complex coordination between data placement and partitioning.

• Rule 2: High Availability. Each partition Pi has νi replicas allocated to νi different nodes. The replication number, denoted as K, is configurable by the system administrators. It is required that ∀i ∈ [1, Q], νi ≥ K, wherein Q is the number of partitions. If a partition has fewer than K replicas (e.g., due to node failure), a replica should be duplicated to a separate node.

• Rule 3: Load Balancing. The load on a node is reflected in its CPU usage, as the CPU is a load-dependent resource. The workload of node ni is denoted as Wi. The nodes with higher workloads have a higher priority to offer replicas. Hence, heavily loaded nodes have the priority to move out more replicas (thus shifting the workload) to the new node.

• Rule 4: Data Balancing. Since each partition replica is confined into bounded sizes by automated partitioning, balancing the number of partition replicas can result in balancing the volume of data stored in each node.

The rest of this subsection elaborates a two-phase data migration strategy to achieve quick node bootstrapping and load balancing, with Algorithm 1 and Algorithm 2 presented for populating a new node. Next, Algorithm 3 is presented as a strategy for redistributing replicas when a node is decommissioned. The discussion finishes with Algorithm 4 for dealing with node failures.

Node Pre-bootstrapping

In the pre-bootstrapping phase, the new node aims at maintaining high availability (referring to Rule 2) and alleviating nodes under heavy workloads (Rule 3).

Algorithm 1 Generate replica operations for the pre-bootstrapping phase
 1: ops ← newEmptyList()
 2: for i = 1 → Q do
 3:     if IsPartitioning(Pi) = true then    ▷ Referring to Rule 1
 4:         continue
 5:     else if νi < K then    ▷ Referring to Rule 2
 6:         op ← replicate(Pi)
 7:         ops.add(op)
 8:     end if
 9: end for
10: {Wi}i=1..N ← getWorkloadByNodes()
11: for i = 1 → N do
12:     if IsHeavyLoad(Wi) = true then    ▷ Referring to Rule 3
13:         {Pj}j=1..k ← getPartitionsFrom(ni)
14:         total ← k ∗ proportion
15:         count ← 0
16:         for j = 1 → k do
17:             if count ≥ total then
18:                 break
19:             else if containsPartition(ops, Pj) = false then
20:                 op ← migrate(Pj, ni)
21:                 ops.add(op)
22:                 count ← count + 1
23:             end if
24:         end for
25:     end if
26: end for
27: return ops

As shown in Algorithm 1, the new node skips all the partitions that are being split or merged (Rule 1). Next, for those partitions that do not meet the replication number K, it makes an extra replica (i.e., replicate(Pi) in Line 6). After that, the new node pulls in a small proportion of the replicas from each heavily loaded node, defined by IsHeavyLoad(Wi) in Line 12. Once the new node receives these replicas, it completes bootstrapping and starts serving queries immediately as a member of the KVS.

The CPU usage is collected to estimate the workload each node undertakes. Wi represents the moving average of the CPU usage of ni. The CPU usage (i.e., Wi) is piggybacked on the heartbeat gossip message, sent by each living node periodically and cached by every other node. Therefore, a new node can download complete workload information from any existing node. This operation is referred to as getWorkloadByNodes() in Algorithm 1. A node is marked (by the new node) as heavily loaded if its CPU usage exceeds a certain threshold, which is left for the KVS administrators to decide. One example of such a threshold is that the CPU usage is over 50% and reasonably (e.g., 20%) greater than the average of all the nodes.

Figure 4.6: Prepare a preference list of partitions to move out

There are also considerations on how to select partition replicas when an existing node is requested to offer data. Each node maintains an exponential moving average (EMA) of the local hit count for each replica, which is updated periodically as in Equation 4.3. For partition Pi at time t, the moving average of the local hit count is denoted as Hi,t, while the actual local hit count between time t − 1 and t is denoted as hi,t. The coefficient α represents the degree of weighting decrease, and is tuneable by the system administrators. A higher value of α means a greater weight contributed by hi,t.

Hi,t = αhi,t + (1 − α)Hi,t−1 (4.3)
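Equation 4.3 is a standard exponential moving average; the small sketch below, with an assumed value of α, shows how a node would update the per-replica hit-count statistic each period.

ALPHA = 0.3   # assumed weighting coefficient, tuneable by the administrators

def update_hit_count_ema(previous_ema: float, hits_last_period: int, alpha: float = ALPHA) -> float:
    # Equation 4.3: H(i,t) = alpha * h(i,t) + (1 - alpha) * H(i,t-1)
    return alpha * hits_last_period + (1 - alpha) * previous_ema

ema = 0.0
for hits in (120, 80, 200, 150):      # local hit counts of one partition over four periods
    ema = update_hit_count_ema(ema, hits)
print(round(ema, 1))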

Figure 4.6 describes how to prioritise the order in which partition replicas are moved out. Firstly, the node sorts its own replicas by the EMA of hit count. Then, a pointer traverses the sorted list starting from the middle, and hops towards the left and right ends alternately. Thus, the hottest and coldest partitions are placed at the end of the preference list. This strategy is designed to avoid the greedy heuristic, which has two limitations. First, moving the hottest replicas will hurt the query performance during the migration, and may overwhelm the destination node. Second, moving the coldest replicas does not help shift the workload over the nodes, but only feeds a new node with large amounts of unpopular data. When the new node receives such a preference list of partition replicas (i.e., via getPartitionsFrom(ni) in Line 13), it chooses a small proportion of partitions from the top of the list. The value of proportion (Algorithm 1, Line 16) is given by proportion = minimum(1/N, 10%), where N is the number of nodes. Thus, even if all the N nodes are heavily loaded, the new node takes over only an average number of replicas during the pre-bootstrapping phase. The value 10% is used to handle a small cluster with fewer than 10 nodes, in which case each node offers 10% of its replicas. In addition, when the new node traverses the list, it chooses those partitions that are not already selected for migration (i.e., not found in the ops list). Note that migrate(Pj, ni) moves partition Pj from node ni to the node that executes this operation, while replicate(Pi) duplicates partition Pi from any available replica.

Algorithm 2 Generate replica operations for the post-bootstrapping phase
Require: ops    ▷ Requires operation records from Algorithm 1
 1: R ← (Σi=1..Q νi) / (N + 1)    ▷ Average #replicas each node will have
 2: {Wi}i=1..N ← getWorkloadByNodes()
 3: nodeList ← getSortedList({Wi}i=1..N, descend)    ▷ Referring to Rule 3
 4: for i = 1 → N do
 5:     node ← nodeList.get(i)
 6:     {Pj}j=1..k ← getPartitionsFrom(node)
 7:     total ← maximum(k − R, 0)    ▷ Referring to Rule 4
 8:     count ← 0
 9:     for j = 1 → k do
10:         if count ≥ total then
11:             break
12:         else if IsPartitioning(Pj) = true then    ▷ Referring to Rule 1
13:             continue
14:         else if containsPartition(ops, Pj) = false then
15:             op ← migrate(Pj, node)
16:             ops.add(op)
17:             count ← count + 1
18:             if ops.size() ≥ R then    ▷ Referring to Rule 4
19:                 return ops
20:             end if
21:         end if
22:     end for
23: end for
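The middle-out ordering of Figure 4.6 can be expressed compactly: sort the local replicas by EMA hit count, then emit them starting from the middle and alternating towards both ends, so that the hottest and coldest replicas are offered last. The sketch below is illustrative only and is not the ElasCass code.

def preference_list(replicas_sorted_by_ema):
    """Given the local replicas sorted by EMA hit count, return the order in which
    they are offered: middle first, alternating towards both ends, so the very
    hottest and very coldest replicas end up at the tail of the list."""
    mid = len(replicas_sorted_by_ema) // 2
    order = [replicas_sorted_by_ema[mid]]
    left, right = mid - 1, mid + 1
    while left >= 0 or right < len(replicas_sorted_by_ema):
        if left >= 0:
            order.append(replicas_sorted_by_ema[left])
            left -= 1
        if right < len(replicas_sorted_by_ema):
            order.append(replicas_sorted_by_ema[right])
            right += 1
    return order

# Replicas P1 (coldest) to P6 (hottest): the extremes end up at the tail of the list.
print(preference_list(["P1", "P2", "P3", "P4", "P5", "P6"]))
# ['P4', 'P3', 'P5', 'P2', 'P6', 'P1']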

Node Post-bootstrapping

In the post-bootstrapping phase, the new node has already been bootstrapped, and aims to achieve load-balancing in terms of both workload and data volume. The new node continues to pull in more replicas from the other existing nodes, using the operation list generated by Algorithm 2.

To achieve a balanced data volume, let R be the average number of replicas within each node. The new node recalculates R as in Line 1 of Algorithm 2. The new node continues to pull in data as long as it has fewer than R replicas (Line 18), and an existing node keeps offering (i.e., moving out) replicas while it has more than R replicas (Line 7). These operations follow Rule 4, which balances the number of replicas that each node serves. Since a data set is split into many partitions of a bounded size, assigning an equal number of replicas is very likely to yield a similar volume of data in each node.

Moreover, in order to balance the workload, the new node prioritises the existing nodes by their workloads. As shown in Line 3 of Algorithm 2, the nodes with a higher workload have a higher priority to offer replicas. As in the pre-bootstrapping phase, each node offers a preference list of partition replicas, sorted using the algorithm depicted in Figure 4.6. Additionally, the process of post-bootstrapping is run in a background thread, with the data transfer rate throttled, so that the side-effects on online query processing are minimised.

Overall, in this two-phase procedure, the new node receives the majority of its replicas in the post-bootstrapping phase. The reason is that, in the pre-bootstrapping phase, each heavily loaded node offers only a small proportion of replicas, and in most cases not every node is heavily loaded. By comparison, in the post-bootstrapping phase, the new node can pull in data from any node (whether heavily loaded or not) until it has acquired R replicas. Therefore, the volume of data transferred in pre-bootstrapping is reduced to a large extent, and the first phase of bootstrapping is completed in a timely manner. Hence, the new node is able to start serving queries within a short time after it is initiated, thereby achieving quick bootstrapping.

Node Decommissioning

There are circumstances in which node decommissioning is necessary. First, provisioned computing resources may become redundant due to a decrease in workload demand; for example, none of the living nodes is heavily loaded and there exist nodes that receive fewer queries than expected. Second, a living node is misbehaving, e.g., it is failing more often than it should or its performance is noticeably slow. Third, the system administrators may reduce the system scale due to other strategic and operational considerations, such as shutdown for scheduled maintenance of infrastructure. In any case, the decision to decommission a node is made by the KVS administrators. This discussion is focused on how to reallocate the replicas when node decommissioning is requested.

Algorithm 3 presents the replica operations for node decommissioning.

Algorithm 3 Generate replica operations for decommissioning a node
 1: ops ← newEmptyList()
 2: {Hi,t}i=1..k ← getHitCounts()    ▷ Sort replicas in descending order of hit count
 3: replicaList ← getSortedList({Hi,t}i=1..k, descend)
 4: {Wi}i=1..N−1 ← getWorkloadByNodes()    ▷ Sort nodes in ascending order of workload
 5: nodeList ← getSortedList({Wi}i=1..N−1, ascend)
 6: R ← (Σi=1..Q νi) / (N − 1)
 7: for i = 1 → N − 1 do
 8:     proportion ← getProportion(i, N − 1)
 9:     maxR ← R ∗ (1 + proportion)    ▷ Allow extra replicas for idle nodes
10:     if replicaList.size() ≤ 0 then
11:         return ops    ▷ Move out until no replica left
12:     end if
13:     node ← nodeList.get(i)
14:     count ← getReplicaCount(node)
15:     for j = 1 → replicaList.size() do
16:         replica ← replicaList.get(j)
17:         if count ≥ maxR then    ▷ Accept replicas until maximum
18:             break
19:         else if containsReplica(node, replica) = false then
20:             op ← moveTo(replica, node)
21:             ops.add(op)
22:             count ← count + 1
23:             replicaList.remove(replica)
24:         end if
25:     end for
26: end for

The node to be decommissioned moves out its replicas one by one, sorted in descending order of the average hit count Hi,t (Equation 4.3). The destination nodes that accept the replicas are prioritised by their workloads in ascending order, and the nodes with a lower workload are allowed to accept a larger proportion of replicas than the average number of replicas that each node should have.

It is worth mentioning that Algorithm 3 assigns different proportions of replicas to nodes with different priorities. As shown in Line 8, the value of proportion is determined by the position of the node in the list. In the design, the return value of getProportion(i, N − 1) is given by (1 − i/(N − 1)) ∗ maximum(1/(N − 1), 10%), which equals maximum(1/(N − 1), 10%) when i = 0, and decreases to 0 when i = N − 1. Similar to the proportion value in node bootstrapping, the function maximum(1/(N − 1), 10%) is used to deal with a small cluster of fewer than 10 nodes. In this way, the nodes with lower workloads receive approximately 10% more replicas than they should have, while the nodes with higher workloads may not receive any replica, because the decommissioning node may have moved out all its replicas (Line 11).

This prioritisation prefers to balance the query workload rather than the data volume, because storage is much cheaper than computation on the Cloud. Overall, the replicas from the decommissioning node are scattered over many living nodes, and the nodes with a lower workload receive more of the replicas that are popular in terms of hit count. In the end, the node can safely leave the KVS when there are no more replicas under its ownership.
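The decreasing proportion used by Algorithm 3 can be sketched as follows; the example values are assumptions. Idle nodes at the front of the ascending-workload list receive the largest headroom above the average R, and the most loaded nodes receive little or none.

def get_proportion(position: int, n: int) -> float:
    """Extra share of replicas granted to the destination node at the given position
    in the ascending-workload list of n remaining nodes, following the formula
    (1 - position/n) * maximum(1/n, 10%) used by Algorithm 3."""
    return (1.0 - position / n) * max(1.0 / n, 0.10)

R = 40                                  # average number of replicas per remaining node
for position in range(4):               # a small cluster with n = 4 remaining nodes
    print(position, round(R * (1 + get_proportion(position, 4))))
# The least loaded node may take up to roughly 50 replicas; the busiest stays close to R.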

Recovery from Node Failure

Node failure is common in distributed systems, usually caused by hardware malfunction, network failure or even human misbehaviour. Although in certain circumstances multiple nodes may fail simultaneously, this discussion is focused on dealing with single node failures that occur unexpectedly.

To continually serve queries during node failure, hinted handoff (DeCandia et al., 2007) is used. To illustrate, when a node ni is unreachable for a write, a copy of the write operation is sent to another node nj that does not currently serve the partition for the write. This is to maintain the desired availability guarantees (i.e., at least K replicas). This destination node nj is termed the surrogate node, and writes a hint in its metadata indicating that the write was intended for ni. Once nj detects that ni has resurrected, it attempts to deliver the saved copies back to ni. Finally, the surrogate node nj can safely delete the copies once the transfer to ni succeeds.
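The hinted-handoff behaviour can be sketched as below (simplified, in-memory, with hypothetical names): the surrogate stores each write together with a hint naming the intended node, and replays the stored writes once that node is reachable again.

class SurrogateNode:
    """Accepts writes on behalf of an unreachable node and replays them later."""
    def __init__(self):
        self.hints = []                       # (intended_node, key, value) tuples

    def store_hinted_write(self, intended_node, key, value):
        # The write is stored locally with a hint naming the node it was intended for.
        self.hints.append((intended_node, key, value))

    def deliver_hints(self, node_id, send):
        # Replay every hint destined for node_id via `send`, then discard the copies.
        remaining = []
        for intended, key, value in self.hints:
            if intended == node_id:
                send(key, value)
            else:
                remaining.append((intended, key, value))
        self.hints = remaining

surrogate = SurrogateNode()
surrogate.store_hinted_write("n4", "user:42", "v1")                      # n4 was unreachable
surrogate.deliver_hints("n4", send=lambda k, v: print("replay", k, v))   # n4 has resurrected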

However, hinted handoff is intended for transient node failures. If the failure becomes permanent, additional replicas have to be created for all the partitions that were hosted by the dead node, so that the availability requirements (i.e., Rule 2) are met. There are two approaches to provide recovery from permanent failure. One strategy is to bootstrap a new empty node. In the pre-bootstrapping phase, this new node will make an extra replica for those partitions that do not meet the availability requirements (shown in Algorithm 1, Line 6). Alternatively, the existing nodes have to create the replicas among themselves, which is termed here as spontaneous recovery.

The procedure of spontaneous recovery follows the same coordination described in Subsection 4.2.2. It is initiated by a surrogate node of hinted handoff (i.e., nj in the previous discussion). There can be multiple surrogate nodes in presence, since the dead node was serving multiple partitions. When one of the surrogate nodes decides that a node is permanently dead (e.g., after waiting for a long time), it initiates a public vote that requires all the living nodes to elect a coordinator.

Algorithm 4 Generate replica operations for spontaneous recovery
 1: ops ← newEmptyList()
 2: partitionList ← newEmptyList()
 3: for i = 1 → Q do
 4:     if IsPartitioning(Pi) = true then    ▷ Referring to Rule 1
 5:         continue
 6:     else if νi < K then    ▷ Referring to Rule 2
 7:         partitionList.add(Pi)
 8:     end if
 9: end for
10: {Wi}i=1..N ← getWorkloadByNodes()    ▷ Sort nodes in ascending order of workload
11: nodeList ← getSortedList({Wi}i=1..N, ascend)
12: R ← (partitionList.size() + Σi=1..Q νi) / N
13: for i = 1 → N do
14:     proportion ← getProportion(i, N)
15:     maxR ← R ∗ (1 + proportion)    ▷ Allow extra replicas for idle nodes
16:     if partitionList.size() ≤ 0 then
17:         return ops
18:     end if
19:     node ← nodeList.get(i)
20:     count ← getReplicaCount(node)
21:     for j = 1 → partitionList.size() do
22:         partition ← partitionList.get(j)
23:         if count ≥ maxR then    ▷ Accept replicas until maximum
24:             break
25:         else if containsPartition(node, partition) = false then
26:             op ← replicateTo(partition, node)
27:             ops.add(op)
28:             count ← count + 1
29:             partitionList.remove(partition)
30:         end if
31:     end for
32: end for

The coordinator is responsible for generating a list of replication tasks based on Algorithm 4. This algorithm shares a very similar logic to the distribution of replicas for node decommissioning (Algorithm 3), except that the coordinator does not prioritise the partitions for replication. Another difference is that, in node decommissioning, the replicas are moved out to the other living nodes (Algorithm 3, Line 20), while in spontaneous recovery, the coordinator requests a less loaded node to replicate the partition from other living nodes (Algorithm 4, Line 26).

As can be seen, each replication task consists of a partition to be replicated and a destination node to host the replica. The coordinator assigns the tasks to each related node, and supervises the execution of these tasks until they are completed. The node that executes a replication task is free to choose the source node from which the partition is replicated. It finishes by sending an acknowledgement back to the coordinator. Once the coordinator has collected the acknowledgement of completion for all the tasks, the procedure of spontaneous recovery is successfully completed.

As in any distributed process, there is a probability of failure during spontaneous recovery. The coordinator may fail during the process, in which case it is replaced by re-election. A new coordinator can decide whether to regenerate the list of replication tasks or continue to supervise the existing tasks. Besides the coordinator, a node that executes the replication tasks may also fail. In this case, the coordinator will submit another list of replication tasks after all the living nodes have acknowledged their completion.

4.2.4 Data Migration

In the previous subsection, the replica placement scheme generates a list of replica operations, each consisting of a partition Pi to operate on, a source node ns to offer data, a destination node nd to receive data, and the operator, which can be either migrate or replicate. The challenge in executing these operations is to maintain data consistency guarantees during data migration. This subsection first introduces the consistency model, followed by a token ownership policy that is tailored to the consistency model. Last but not least, it presents a proof demonstrating that data consistency is maintained.

Consistency Model

The consistency model here follows that of Cassandra (Apache, 2009), wherein KVS users define a consistency level L for each individual query operation. Different consistency levels require different numbers of nodes to acknowledge before a query operation is considered valid. A higher consistency level needs more nodes to respond, thus increasing the request latency, while a lower consistency level requires fewer nodes to reply, but is prone to reading stale values and losing recent writes. Therefore, the users can trade consistency guarantees against request latencies.

It is worth mentioning that the consistency level L is independent from the replication number K (Subsection 4.2.3, Rule 2). The replication number requires each data object to be eventually copied into K replicas for the concerns of availability and durability. In contrast, the consistency level requires L nodes to respond before a query operation is committed. Hence, L ≤ K. Regarding a write operation, the first L nodes that have acknowledged the write will propagate the value to all the K nodes eventually, i.e., to provide eventual consistency.

Table 4.2: The sets of nodes entitled to serve query operations for a partition Pi

Type of node set | Number of nodes | Responsibility | Time to respond | Query type
L set | ≥ the maximum consistency level | Serve online query for Pi | Immediately | Both read and write
K set | ≥ the replication number | Maintain replicas of Pi for availability | Eventually | Delayed write, no read

From the perspective of query processing, the group of nodes that are entitled to achieve the consistency level L for certain partition, is termed as the L set, while the group of nodes entitled to maintain the K replicas, is termed as the K set. There are implications regarding these two sets, summarised in Table 4.2.

Being in an L set means that a node is capable of serving both read and write operations for the related partition at any time. Hence, it requires that the latest writes be propagated to this node, which means that a node in the L set also belongs to the K set of the related partition. Conversely, being in a K set does not make the node responsive to query operations. It only means the writes of the related partition should be propagated to this node eventually, for the sake of availability and durability. In other words, the K set of nodes should store the data of the related partition, but do not necessarily serve online queries. In almost all KVSs, the L set and the K set are the same set of nodes that serve the partition, except in our proposed scheme of data migration, where the L set is a subset of the K set, i.e., L set ⊆ K set.
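To make the two sets concrete, the sketch below (hypothetical, synchronous, single-partition) shows a write being acknowledged by L nodes of the L set and then left for asynchronous propagation to the remaining members of the K set; in the actual system that propagation is the background eventual-consistency mechanism.

class Replica:
    def __init__(self, name):
        self.name = name
        self.store = {}
    def apply(self, key, value):
        self.store[key] = value

def coordinate_write(key, value, l_set, k_set, consistency_level):
    """Commit a write once `consistency_level` nodes of the L set have applied it,
    and report the remaining K-set members for eventual background propagation."""
    assert consistency_level <= len(l_set) <= len(k_set)   # L <= K, and L set is a subset of K set
    acked = []
    for node in l_set:                       # nodes entitled to serve online queries for Pi
        node.apply(key, value)
        acked.append(node)
        if len(acked) >= consistency_level:
            break
    pending = [n for n in k_set if n not in acked]
    return acked, pending                    # `pending` nodes receive the write eventually

l_set = [Replica("n1"), Replica("n2")]
k_set = l_set + [Replica("n3")]
acked, pending = coordinate_write("user:42", "v2", l_set, k_set, consistency_level=2)
print([n.name for n in acked], [n.name for n in pending])   # ['n1', 'n2'] ['n3']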

Token Management for Data Migration

A token ownership policy is proposed to ensure data consistency for partition replicas that are being migrated or replicated. As discussed in Subsection 4.2.1, each partition Pi is associated with one token Ti. For query execution purposes, only the nodes that are entitled to host Pi can own the token Ti. Moreover, each token is a boolean value, wherein a positive token indicates that the node is in both the L and K sets, while a negative token allows the node to be in the K set only.

Figure 4.7: Manage a token during replica migration for data consistency

Figure 4.7 depicts when and how to assign and switch the value of Ti, when a replica of the partition Pi is being migrated from ns to nd. As shown, a request for migrating Pi is submitted at time t0. The source node ns makes a snapshot of Pi at t2, immediately transfers the snapshot (i.e. replica) to the destination node nd, and finishes the transfer at t3. Each time interval between ti and ti+1 is longer than the end-to-end gossip broadcast delay, so that a message for an update of token value is well propagated.

In terms of token management, the source node ns owns the positive Ti before the data transfer is completed at t3, because it owns Pi and is entitled to serve online queries for this partition. In contrast, the destination node nd does not own Pi (or Ti) before t0. At time t0, the migration task of Pi is initiated, and nd is assigned the negative Ti. Having a negative Ti means that the ongoing writes destined to Pi will be propagated to nd in the background (by other nodes with the positive Ti). It also means that nd does not serve any query, because this node does not yet have the complete replica of Pi. The gossip message about nd having the negative Ti is spread to the whole network by the time t1. From t1 onwards, nd is known by the other nodes to be entitled to store the writes destined to Pi.

The source node ns waits until t2 to start the data transfer, so as to leave a sufficient time window for nd to receive all recent writes dating from t1. The transfer finishes at t3. On receiving the complete replica, nd takes a short interval to merge the writes that were propagated to it between t1 and t3 into the replica. The destination node nd finishes the merging of writes at time t4, so it possesses an updated replica of Pi just like the other nodes that own the positive Ti. Hence, nd switches Ti from negative to positive, and starts to serve online queries for Pi. The message for this token switch is gossiped throughout the KVS network. At time t5, the source node ns hears that nd has taken over Pi, so it releases its own token Ti and discards its local replica safely.

In comparison, the operation of replicating a replica (e.g., from ns to nd) is very similar to migrating a replica. The only difference is that, at the end of the data transfer, the source node ns does not release Ti (or discard the replica), because the intention of replication is to create an extra replica of Pi.
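The token handover of Figure 4.7 can be summarised as a small state machine on the source and destination nodes. The sketch below is illustrative, uses hypothetical names, and compresses the gossip delays between t0 and t5 into explicit comments; it is not the ElasCass implementation.

NOT_ASSIGNED, NEGATIVE, POSITIVE = "not assigned", "negative", "positive"

class TokenNode:
    def __init__(self):
        self.token, self.data, self.buffer = NOT_ASSIGNED, {}, {}
    def snapshot(self):
        return dict(self.data)                 # the replica is transferred as a snapshot
    def receive(self, snapshot):
        self.data.update(snapshot)
    def merge_buffered_writes(self):
        self.data.update(self.buffer)          # last-write-wins reconciliation is assumed
        self.buffer = {}
    def discard_replica(self):
        self.data = {}

def migrate_replica(source, dest, gossip):
    """Token transitions while a replica of Pi moves from `source` to `dest` (Figure 4.7)."""
    dest.token = NEGATIVE                      # t0: dest stores incoming writes, serves no query
    gossip("dest holds the negative token")    # by t1, all nodes propagate writes of Pi to dest
    snapshot = source.snapshot()               # t2: source snapshots its replica ...
    dest.receive(snapshot)                     # ... and transfers it, finishing at t3
    dest.merge_buffered_writes()               # t3 to t4: fold in the writes received since t1
    dest.token = POSITIVE                      # t4: dest now serves online queries for Pi
    gossip("dest holds the positive token")    # by t5, the source learns of the takeover ...
    source.token = NOT_ASSIGNED                # ... releases its token and discards its replica
    source.discard_replica()
    # For replication rather than migration, the source would keep its token and replica.

src, dst = TokenNode(), TokenNode()
src.token, src.data = POSITIVE, {"k1": "v1"}
migrate_replica(src, dst, gossip=print)
print(dst.token, dst.data)                     # positive {'k1': 'v1'}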

Consistency during Data Migration

A proof by contradiction is presented to demonstrate that the replica on the destination nd is consistent with the other existing replicas of Pi. To begin with, we presume that the data is inconsistent, which means there exists a recent write that is not saved to nd. There are four possible scenarios for a committed write, depicted as write1, write2, write3, and write4 in Figure 4.7.

• The write is committed before t0, i.e., write1. Then ns has enough time (i.e., t2 − t0) to save the write into its replica, which is transferred to nd at t2. Thus, nd sees write1 at t3, that is, before nd starts to serve queries at t4.

• The write is committed between t0 and t1, i.e., write2. Then this write may or may not be propagated to nd, depending on when the negative Ti is seen. But ns has enough time (i.e., t2 − t1) to save the write into its replica before the transfer. Thus, nd still sees write2 at t3.

• The write is committed between t1 and t2, i.e., write3. Then, ns may not have enough time to save this write into its replica before the transfer. However, write3 is sure to be propagated to nd, because the other living nodes already know that nd is entitled to store a replica of Pi. Hence, nd sees the latest version of write3 when the write is propagated to it. At t3, nd may also see another version of write3 in the replica from ns. The conflicts between different versions are reconciled based on the timestamp stored in the data object, and "the last write wins" (Thomas, 1979).

• The write is committed after t2, i.e., write4. Then, ns does not save this write into the replica (i.e., a snapshot of Pi) that is transferred, but write4 is sure to be propagated to nd. After nd receives the replica at t3, it sees its own version of write4, and it reconciles any version conflicts based on the timestamp.

Hence, all the possible scenarios of writes are guaranteed to be saved to the destination node nd. It contradicts the previous presumption, and thereby proves that the replica migrated to nd is consistent with the other replicas of Pi in the KVS. Overall, the process of online query is not interrupted by the migration of replicas.

The source node ns is responsible for serving Pi, along with the other nodes holding the positive Ti, during the whole process of data transfer. This avoids interruptions to query processing, with two time windows to hand over the ownership of the partition. One time window is before the transfer, i.e., from t0 to t2, during which the destination node nd becomes eligible to accept writes destined to Pi. The other time window is after the transfer, i.e., from t3 to t5, when nd possesses the complete replica and all nodes are made aware that it now serves online queries for Pi. These two time windows guarantee a consistent view of the partition-node mappings for all the KVS nodes.

4.3 Chapter Summary

Efficient elasticity is an important feature for distributed KVSs running on virtual machines (VMs) leased from the IaaS Cloud. In order to improve the efficiency of elasticity for shared-nothing KVSs, this chapter has presented a set of decentralised data management schemes. It started with an automated partitioning algorithm, which splits and merges partitions based on the actual data volume in each partition. With automated partitioning, each partition replica is confined into a bounded size, and is consolidated into a transferable unit, so that the overheads of data migration at node changes are reduced, and load-balancing also becomes simpler.

Moreover, this chapter has presented an election-based, fault-tolerant coordination scheme to facilitate the execution of automated partitioning in a decentralised manner. This is followed by the description of four replica placement algorithms that determine the assignment of replicas to the nodes for the scenarios of two-phase bootstrapping, decommissioning, and unexpected failure, respectively. This placement scheme achieves fast bootstrapping, and maintains a well-balanced workload and distribution of data. Finally, the chapter has depicted the execution of replica placement decisions, i.e., data migration, which leverages a token ownership policy to maintain online query processing during data migration, and to provide eventual consistency. This set of decentralised data management schemes forms a middleware layer between the key space of a data set and the storage of the KVS nodes. The major innovations in this middleware are to automatically partition the data into transferable units for decentralised KVSs, and to achieve efficient node bootstrapping and decommissioning with the help of the replica placement and migration schemes that build on the transferable replicas. The next chapter will discuss the implementation of this middleware layer, to build an elastic KVS called ElasCass on top of Apache Cassandra. The evaluations of ElasCass on a public IaaS Cloud will also be presented.

Chapter 5

Implementation and Evaluation of ElasCass

Idle hands are the devil’s workshop.

– St. Jerome

The previous chapter proposed the design of a set of data distribution schemes for the efficient elasticity of decentralised shared-nothing KVSs. To realise the proposed schemes, this chapter presents the implementation and evaluation of ElasCass (Elastic Cassandra), which is built on top of Apache Cassandra. It starts with an introduction to the challenges of implementing the proposed data distribution schemes, and then compares a number of open source KVSs and describes the background of Apache Cassandra. Next, it presents the three core functionalities implemented in ElasCass. Based on the implementation, this chapter presents the experimental evaluations of ElasCass against Apache Cassandra, and then concludes.

5.1 Introduction

Distributed KVSs have become a reference architecture for managing large volumes of data on servers leased from the IaaS Cloud. The key feature of the Cloud is resource elasticity, which requires a KVS to bootstrap or decommission a node quickly and frictionlessly. The previous chapter has targeted the decentralised shared-nothing KVSs, and proposed the design of a set of data management schemes in this regard. This chapter focuses on the implementation and evaluation of the proposed design.


There have been a number of open source KVSs that follow the decentralised shared-nothing architecture introduced in Amazon’s Dynamo (DeCandia et al., 2007), such as Cassandra (Apache, 2009), Riak (Basho Tech., 2012), and Project Voldemort (2009). Therefore, leveraging an established KVS is more feasible for implementing the proposed data management schemes than developing a new KVS from scratch. We have identified Cassandra (Apache, 2009) as the basis of our implementation for several reasons. First and foremost, it follows the decentralised shared-nothing architecture, which matches our system prototype described in Figure 4.1 (page 70). Second, it derives many design choices from Google’s Bigtable (Chang et al., 2006) and Amazon’s Dynamo (DeCandia et al., 2007), both of which are influential KVSs that have demonstrated their success in serving large volumes of data for interactive web applications and services. Third, it is open source software developed under the Apache Software Foundation with an active community around it.

However, there are challenges to realise our proposed design on top of Apache Cassandra. First, our design requires more sophisticated bindings between the partitions and nodes, while Cassandra follows the basic consistent hashing (Karger et al., 1997) to maintain a simple one-to-one mapping. Second, our design demands the consolidation of each partition replica into one standalone transferable unit, whilst in Cassandra there does not exist a mapping between data files and partitions. Third, our design proposes a set of replica placement algorithms based on workload statistics, while Cassandra uses static assignment. Last, our design relies on election-based coordination to perform a list of distributed tasks, which is not implemented in Cassandra.

This chapter presents the implementation of ElasCass, which addresses the challenges described. The implementation consists of a token management component that deals with dynamic partition-node mappings during automated partitioning and replica reallocation, a data storage component that efficiently confines the files into standalone partition replicas with a background process, and a replica reallocation component that gathers workload statistics and coordinates nodes for ensuring consistency during data movement. In addition, the election-based coordination is integrated with the three components described.

This chapter also presents a set of experimental evaluations, carried out using a public IaaS Cloud, that demonstrates that the proposed schemes of data partitioning and reallocation: i) distribute data and workload more evenly among the nodes, ii) reduce the time to bootstrap nodes and improve data movement speed, and thereby, iii) improve the scalability and throughput of the KVS.

5.2 Background

5.2.1 Existing Open Source KVSs

The aim of this implementation is to build a KVS on top of an existing open source project, so as to realise the data distribution schemes proposed in Chapter 4. Compared to developing a brand new KVS from scratch, leveraging an established KVS helps us concentrate on the data management techniques, and also facilitates benchmarking against other state-of-the-art KVSs.

KVSs that have gained wide attention include Memcached (Danga Interactive, 2004), Redis (2009), Cassandra (Apache, 2009), Project Voldemort (2009), and HBase (Apache, 2010). Among the variety of choices, we intend to focus on those KVSs following the decentralised, shared-nothing architecture that is depicted in Figure 4.1 (page 70). As discussed, this architecture was adopted by Amazon’s Dynamo (DeCandia et al., 2007), which is proprietary. However, it inspired a number of open source KVSs, including Riak (Basho Tech., 2012), Project Voldemort (2009), and Cassandra (Apache, 2009). A comparison among these three KVSs is presented as follows.

Riak (Basho Tech., 2012) follows many design choices in Dynamo. It inherits the virtual-node approach (Section 3.2.1, page 58) to split the key space into partitions of equal-length key ranges. It retains the simple data model of key-value pairs, wherein the values are stored on disk as binaries. Riak also has a pluggable backend storage, with the default being Bitcask (while Dynamo uses Berkeley DB and MySQL as the backend storage). Bitcask is a log-structured hash table for write-once, append-only queries. Riak has two drawbacks. First, the data model of simple key-value mapping is too limiting to support structured and semi-structured data. Second, the log-structured Bitcask is not efficient for reads; this is compensated for by retaining the whole key space in memory, which means Bitcask is not memory-conservative.

Project Voldemort (2009) is very similar to Riak. It also uses the virtual-node approach for data partitioning and replication, and inherits the simple key-value data model. As in Dynamo, the storage engines are pluggable, with the defaults being Berkeley DB and MySQL. One notable feature of Voldemort is the in-memory caching coupled with the storage system, so that a separate caching tier is not required. Another feature is that several components are made pluggable, such as serialisation, data placement, and the storage engine. However, the query API, which supports only simple CRUD operations on key-value pairs, is not pluggable, and this limits its range of applications.

In contrast to Riak and Voldemort, Cassandra (Apache, 2009) adopts the model of column family from Google’s Bigtable (Chang et al., 2006). Column family is richer than the simple key-value model, and is able to support structured and semi-structured data. Yet it is still simple enough to be efficiently stored in flat-file representation. Cassandra experienced a major change in its data partitioning strategy between versions. Prior to Version 1.2 (released on 18 January 2013), it followed the traditional consistent hashing to partition the key space based on the number of data nodes, that is, the split-move approach described in Section 3.2.1 (page 57). This approach hurts load balancing. From Version 1.2 onwards, it has adopted the virtual-node approach similar to Dynamo, Riak and Voldemort. However, our implementation started in 2011, when Version 1.0 was the latest release and the split-move approach was still in use.

Cassandra is a mixture of Google’s Bigtable and Amazon’s Dynamo. Its column family model supports a wide range of applications, while its decentralised, shared-nothing architecture offers inherent scalability and the potential to be elastic. According to a recent survey by DB-Engines (2014c), Cassandra is ranked first among KVSs of all kinds. Hence, out of the three KVSs discussed, we have chosen Cassandra as the basis of our implementation.

5.2.2 Overview of Apache Cassandra

Apache Cassandra is a decentralised, shared-nothing KVS that handles large amounts of data across many commodity machines. Cassandra combines Dynamo’s DHT-like architecture (DeCandia et al., 2007) with Bigtable’s data storage schemes (Chang et al., 2006). Since Cassandra is an ever-developing project, an implementation choice in one release may be deprecated in a later release. To avoid inter-version confusion, this subsection mainly focuses on Cassandra Version 1.0, which served as the stable release between October 2011 and July 2012, during which ElasCass was implemented.

Architecture

Apache Cassandra follows Amazon Dynamo’s DHT-like architecture (DeCandia et al., 2007). A Cassandra cluster contains multiple nodes, which form a P2P, symmetric network without the presence of any centralised component. Each node runs on a separate commodity machine, or on a VM with its own persistent storage device attached. The nodes do not share memory or storage space with each other, and hence are deployed in a shared-nothing architecture.

Cassandra (version 1.0) leverages the traditional consistent hashing to distribute the data across multiple storage nodes. In consistent hashing (Karger et al., 1997), the largest key wraps around to the smallest key to form a fixed circular space or “ring”. In a Cassandra cluster, each node is assigned a random value (i.e. token) within this space, which represents its “position” on the ring. Thus, each node becomes responsible for the region of the ring between itself and its predecessor node. The node associated with the token is also termed the coordinator of the related partition. The responsibility of the coordinator is to maintain a master replica and to decide the surrogate node when a node storing a slave replica temporarily fails.

Data Distribution

Figure 5.1 depicts how Cassandra distributes its data when a node is added or removed from the cluster. Scenario 1 shows that, each node serves as the coordinator (i.e. storing the master replica) for only one partition, the slave replicas of which are stored on the successor nodes (on the ring) following the coordinator, so as to improve data availability. For example, Node a is the coordinator of Partition A, which has two slave replicas on Node b and Node c.

Scenario 2 illustrates how a node arrival is handled in Cassandra. A new node is assigned to a new position on the ring, for example c1, which lies between b and c and splits Partition C into C1 and C2. The new node, associated with c1, becomes the coordinator for Partition C1. Since the new node is the successor of Nodes a and b, it also serves a slave replica for Partitions A and B, the data of which is replicated from the corresponding coordinators. In contrast, Node c retains the coordinator ownership of Partition C2, the other half of Partition C. Moreover, since the new node now occupies Position c1 on the ring, its successors, namely Nodes c, d and e, are no longer responsible for Partitions A, B and C1, respectively.

Figure 5.1: Data distribution at node changes in Apache Cassandra (Scenario 1: the initial ring; Scenario 2: a node arrives; Scenario 3: a node departs)

Conversely, Scenario 3 shows how a node departure from Scenario 1 is handled. An existing node, e.g. Node c, fails or is decommissioned. Since it is associated with Token c on the ring, this token is no longer valid after the departure of Node c. As a result, Partitions C and D are merged into one partition, i.e. Partition CD. The direct successor of Node c, i.e. Node d, becomes the coordinator for this merged partition. Moreover, since the replicas of Partitions A, B and C are removed along with Node c, the affected partitions are replicated (from the corresponding coordinators) to Nodes d, e and a, respectively.

As can be seen, each node change affects its direct successor nodes, the number of which is equal to the number of replicas each partition has. Although this partitioning approach leverages consistent hashing (Karger et al., 1997) to affect a minimum number of nodes, it is not efficient in data migration, as discussed in Section 3.2.1 (page 57).

Data Storage and Access

Apache Cassandra has adopted many design choices from Google Bigtable (Chang et al., 2006) in terms of data storage and retrieval, including the data model of column family, described in Subsection 3.1.1 (page 47), and the processing of read/write operations (Figure 5.2).

A column family resembles a row-based table in relational databases. Each data object is a key-value mapping, wherein the key is the output of a hash function (either MD5/MurmurHash or lexical) and is serialised into an array of raw bytes.

Figure 5.2: The read/write operations in Apache Cassandra (clients issue writes to the commit log and memtable, and reads to the row cache, key cache, memtable, Bloom filters, and on-disk SSTables)

The value of a data object is a collection of standard columns, each consisting of a column name, a column value and a timestamp. There is also the super column, which is composed of a column name and a collection of standard columns. In Cassandra, multiple column families share the same partitioning of the keyspace, which resembles a database in database systems.

Figure 5.2 depicts how the read/write operations are handled in a column family. Each write (including insert, update, and delete) is considered committed as soon as the operation is appended to a write-ahead commit log and the value is buffered in the memtable, an in-memory bucket. The memtable is associated with one column family. When the memtable size reaches a threshold, the memtable is frozen and replaced by a new empty memtable in an atomic operation. The frozen memtable is converted to an SSTable, which is written onto the disk. The SSTable is an immutable, sorted string data file. Multiple SSTables (of the same column family) are regularly compacted in the background into one larger SSTable, so as to reduce metadata and file handles, and more importantly, to remove obsolete data objects, including the ones marked as deleted and those overwritten by a newer version.
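To make this write path concrete, the following Java sketch shows the append-then-buffer behaviour described above: a write is acknowledged once it is in the commit log and the memtable, and the memtable is flushed into an immutable SSTable when it exceeds a threshold. The class and method names (CommitLog, SSTableWriter, flushThresholdBytes) are illustrative placeholders, not Cassandra's actual API.

import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

/** Minimal sketch of the write path: append to a commit log, buffer in a
 *  memtable, and flush the memtable to an immutable SSTable when it grows
 *  past a threshold. All names are illustrative, not Cassandra's real API. */
class WritePathSketch {
    interface CommitLog { void append(String key, byte[] value); }       // write-ahead log
    interface SSTableWriter { void write(Map<String, byte[]> sorted); }   // immutable on-disk file

    private final CommitLog log;
    private final SSTableWriter sstables;
    private final long flushThresholdBytes;
    // Memtable: an in-memory, key-sorted buffer of the latest writes.
    private ConcurrentSkipListMap<String, byte[]> memtable = new ConcurrentSkipListMap<>();
    private long memtableBytes = 0;

    WritePathSketch(CommitLog log, SSTableWriter sstables, long flushThresholdBytes) {
        this.log = log;
        this.sstables = sstables;
        this.flushThresholdBytes = flushThresholdBytes;
    }

    /** A write is acknowledged once it is in the commit log and the memtable. */
    synchronized void put(String key, byte[] value) {
        log.append(key, value);                 // durable, append-only
        memtable.put(key, value);               // buffered in memory, sorted by key
        memtableBytes += key.length() + value.length;
        if (memtableBytes >= flushThresholdBytes) {
            // Freeze the current memtable and swap in an empty one atomically,
            // then convert the frozen memtable into a new immutable SSTable.
            ConcurrentSkipListMap<String, byte[]> frozen = memtable;
            memtable = new ConcurrentSkipListMap<>();
            memtableBytes = 0;
            sstables.write(frozen);
        }
    }
}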

In contrast, each read operation attempts to retrieve the value from multiple data sources. First, it looks up the in-memory caches, including a KeyCache that contains the location of the related data object on disk, and an optional RowCache that caches query results, which is memory-consuming. Second, it checks the memtable to see if the related data object has recently been updated. Last but most important, it retrieves the data object from a list of on-disk SSTables. Note that the SSTables are immutable, and thus can store obsolete data objects. This requires that the node visit every SSTable of the column family so as to retrieve all the possible versions of a certain data object. In order to reduce the number of I/O accesses, each SSTable has a Bloom filter (Bloom, 1970), which is loaded into memory and is used to test the presence of a certain data object given the key.

Overall, Cassandra is optimised for write operations with the use of the append-only commit log and memtable, while it is less efficient in read operations since it allows obsolete data objects to scatter across multiple SSTables, making the retrieval of a data object less straightforward.

Query Processing

Since Cassandra leverages consistent hashing for data partitioning, each node is associated with a token that represents a position on the ring of the key space. To locate a data object given a key, a node simply walks the ring clockwise to find the first token larger than the key. As discussed, this token and its predecessor define the key range of the partition to which the given key belongs. Since each partition has multiple replicas for high availability, for the sake of query processing, each token is therefore mapped to the nodes that store a replica of the related partition. In addition, every node caches the mapping between the partitions and the nodes, so that a key lookup can be performed by any living node. The update of the mapping is propagated based on gossiping protocols (Lin & Marzullo, 1999).
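A key lookup on the ring can be sketched with a sorted map of tokens; ceilingEntry returns the first token not smaller than the key's hash, wrapping around to the smallest token when none is larger. The types and the hash function below are simplified placeholders rather than Cassandra's real classes.

import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch of a consistent-hashing lookup: walk the ring clockwise to the
 *  first token >= hash(key); that token's nodes serve the key's partition. */
class RingLookupSketch {
    // token (position on the ring) -> nodes holding a replica of that partition
    private final TreeMap<Long, List<String>> ring = new TreeMap<>();

    void addToken(long token, List<String> replicaNodes) {
        ring.put(token, replicaNodes);
    }

    List<String> replicasFor(String key) {
        long h = hash(key);
        Map.Entry<Long, List<String>> e = ring.ceilingEntry(h);
        if (e == null) {
            e = ring.firstEntry();   // wrap around: the largest keys map to the smallest token
        }
        return e.getValue();
    }

    private long hash(String key) {
        return key.hashCode() & 0x7fffffffL;   // stand-in for MD5/MurmurHash
    }
}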

Cassandra allows tunable consistency for each query. Different consistency levels demand different numbers of nodes to acknowledge a query before it is considered successful. Since each token is mapped to all the nodes serving the associated partition, the node connected to the client randomly selects K healthy nodes from the mapping, wherein K is the required consistency level of the query.

Cassandra also uses hinted handoff (DeCandia et al., 2007) to provide extreme write availability when a node temporarily fails. When the number of living (or healthy) nodes is less than the required consistency level (due to node failure), a hint is written to the coordinator (which serves the master replica), indicating that the write needs to be replayed to the unavailable node.

Current State

At the time of writing, the latest release of Apache Cassandra is Version 2.0. Two major changes have been made in the design since Version 1.0.

First and foremost, Cassandra has also adopted the concept of “virtual node” in Version 1.2, which was released in January 2013. It allows the developers to define the number of tokens before the key space is initialised, so that the key space is statically partitioned into the desired number of vnodes. This approach is no different from that of Riak and Project Voldemort, as discussed. By comparison, ElasCass not only uses the “virtual node” approach that Cassandra Version 1.2 has adopted, but also automatically splits and merges the partitions for a more balanced distribution of data.

Second, Cassandra has provided row-level (i.e. data-object-level) transaction support in Version 2.0, released in September 2013. It extends the Paxos consensus protocol to provide linearisable consistency, and exposes this functionality as a compare-and-set operation. However, this linearisable consistency is provided at the cost of four round trips between a leader node and the replicas. Hence, it is only applicable for a small minority of operations in the system.

5.3 ElasCass Implementation

Table 5.1 summarises the differences between Apache Cassandra and ElasCass in terms of data management strategies and components. As shown, the strategies regarding data partitioning and placement differ between the two KVSs, and therefore a significant amount of reprogramming is required to implement the functionalities related to the components of key space, token metadata, and SSTable in ElasCass. For clarity, several functionalities are leveraged directly from Apache Cassandra, including: i) the propagation of gossip messages; ii) hinted handoff for handling temporary node failures; and iii) the storage format of SSTable, that is, each SSTable has a Bloom filter, an index, and the compression information.

Table 5.1 also unveils the challenges in implementing ElasCass atop Cassandra, from the internal data storage and token management, to the inter-node data reallocation and coordination. A discussion is presented as follows.

Strategies

  Data Partitioning
    Apache Cassandra: Each partition is associated with one node, and is split or merged at each node arrival or departure.
    ElasCass: The key space is initially segmented into Q partitions, which are further split or merged depending on the volume of data.

  Data Placement
    Apache Cassandra: Based on consistent hashing, the coordinator serves the master replica, while the successor nodes serve the slave replicas.
    ElasCass: Each replica is assigned independently based on a set of placement algorithms.

  Data Movement
    Apache Cassandra: Scan to prepare a list of data objects from multiple SSTables at the source node; rebuild SSTables at the destination node.
    ElasCass: Simply transfer the SSTables associated with a certain partition.

  Coordination
    Apache Cassandra: A node is the coordinator for a partition if it owns the associated token.
    ElasCass: A coordinator is nominated via an election when required.

Components

  Key Space
    Apache Cassandra: Multiple column families share a key space.
    ElasCass: Each column family has a separate key space for data partitioning.

  Token Metadata
    Apache Cassandra: An array of raw bytes representing a position on the key ring; the token-node mapping is one-to-one.
    ElasCass: Raw bytes for positioning, plus a boolean value indicating the node set for query processing (Table 4.2); the token-node mapping is many-to-many.

  SSTable
    Apache Cassandra: An SSTable stores any data objects in the same column family.
    ElasCass: An SSTable stores only the data objects belonging to one particular partition.

  Memtable
    Apache Cassandra: Each memtable is flushed to form one new SSTable.
    ElasCass: Each memtable is flushed to form multiple new SSTables, each associated with a different partition.

Table 5.1: Comparison between Apache Cassandra and ElasCass

• Token management. Data partitioning in Cassandra is based on the number of nodes, while ElasCass partitions the key space based on the data volume in each partition. Hence, unlike Cassandra, which uses a unified key space for all the column families, ElasCass requires column-family-specific key spaces, since the data distributions in different column families are usually diverse. More importantly, ElasCass maintains flexible, many-to-many partition-node mappings, which requires a dynamic token management scheme that is far more complicated than the one implemented in Cassandra.

• Data storage. Cassandra stores data objects into any SSTable (i.e. a data file) of the same column family, while ElasCass requires that each SSTable be tied to one specific partition, so that data movement can be executed simply by moving the integrated SSTables across the nodes. This gives rise to two issues. First, there is a need to maintain a map between a set of SSTables and a partition. Maintaining such a mapping is non-trivial, especially when the partitions are split or merged dynamically at runtime. Second, for the sake of efficiency, the SSTables within each partition should be compacted periodically, which is challenging in two ways: compactions should not affect online query processing, and they should be efficiently rolled back for failure recovery.

• Data reallocation. In Cassandra, the locations of data objects are relatively static, and are determined by the node positions on the key ring. By comparison, ElasCass treats each partition replica as a transferable unit that can be dynamically reallocated to any node as required. This requires the implementation of a set of replica placement algorithms, and a data migration scheme that transfers the data in the form of SSTables, without affecting the online query performance or impinging on data consistency.

• Coordination. In Cassandra, each node is associated with one token that defines one partition. Thus, each node serves as the coordinator for the partition with which it is associated. In contrast, ElasCass requires the Chubby service (Burrows, 2006) for the election of a coordinator within a set of nodes (i.e. not necessarily the whole cluster of nodes). Since the mechanism of the coordination has been thoroughly discussed in Subsection 4.2.2 (page 75), the focus will be on message exchange, which is integrated with the implementation of the previous three components.

The remainder of this section comprises three subsections that describe the implementation of the three core functionalities in ElasCass, namely token management, data storage, and data reallocation.

5.3.1 Token Management

In order to realise the designs of automated partitioning and the set of replica placement algorithms, ElasCass has to deal with a much more dynamic partition-node mapping than Apache Cassandra.

Figure 5.3: The ongoing view of key space in ElasCass (the original, ongoing, and future views, built from stable, joining, and leaving tokens)

This subsection presents the implementation of the token management scheme for maintaining the partition-node mapping.

An Ongoing View of Key Space

To implement automated partitioning as described in Subsection 4.2.1 (pages 71–75), there is a challenge in associating the data files with a partition. That is, when certain partitions are being split or merged, different nodes have different views of the key space. To illustrate, the nodes that do not participate in the split/merge operation simply overlook the operation, while the participating nodes simultaneously deal with two versions of key ranges for each targeted partition: one is the original key range, and the other is the future key range resulting from the split/merge operation.

To avoid confusion between multiple versions of key ranges for the sake of automated partitioning, each node has an ongoing view of the key space. As shown in Figure 5.3, the ongoing view takes into account the ongoing partitioning operations. To illustrate, during a split operation, the new token that divides the targeted partition is marked as a joining token; it is then marked as stable when the split operation is completed. Similarly, when a merge operation is in progress, the token to be removed is marked as a leaving token, and is actually removed when the merge operation succeeds. Hence, the original view of the key space ignores all joining tokens but retains all leaving tokens. In contrast, the future view adds all joining tokens but removes all leaving tokens from the key space.

In terms of query processing, every node in the system uses the original view to serve key lookups, because the key space should not be altered until the split/merge operation finally succeeds.

Token-To-Node
  Roles in queries: negative bindings are write-only; positive bindings are writable and readable.
  Purpose: maintains the set of related nodes for each token in the original view of the key space, making key lookups easier.

Node-To-Token
  Roles in queries: negative bindings are write-only; positive bindings are writable and readable.
  Purpose: maintains the set of tokens in the original view for each living node, making node-oriented statistics gathering easier.

Joining-Token
  Roles in queries: always writable and readable.
  Purpose: maintains the set of nodes for each joining token, and helps keep track of a partition-splitting operation.

Leaving-Token
  Roles in queries: always writable and readable.
  Purpose: maintains the set of nodes for each leaving token, and helps keep track of a partition-merging operation.

Table 5.2: The token-node multimaps in ElasCass

However, from the perspective of the participating nodes, all the data files (i.e. SSTables) should be created based on the future view of the key space, which is the process of rebuilding replicas described in Subsection 4.2.1 (i.e. Figure 4.3 on page 75). Moreover, when a token is mapped to a certain node, the token is assigned a boolean value (i.e. either positive or negative) to indicate the responsibility of the node in serving queries, which has been discussed in Subsection 4.2.4 (Table 4.2 and Figure 4.7, pages 89–93).

Storage of Token-Node Mappings

Figure 5.3 depicts three types of tokens, namely Stable, Joining, and Leaving, which yield different views of the key space for different purposes. Moreover, the mappings between tokens and nodes are many-to-many, and each token-node mapping requires a boolean value indicating the role of the node towards the related partition (Figure 4.7 on page 91). In addition, a mechanism to reconcile the conflicting mappings stored in different nodes is required, since the mapping information is shared without the presence of a centralised component.

ElasCass deals with this many-to-many relationship with a multimap, which maps a key to a set of values. It uses four multimaps for each key space to store the different types of tokens. First, a token-to-node multimap maps a token in the original view to the nodes that own the token, wherein each node is bound with a boolean value indicating the node’s responsibility. Second, a node-to-token multimap is the reverse map of the token-to-node multimap, with the boolean value bound to the token. Last, a joining-token multimap and a leaving-token multimap store the mappings from a token to the related nodes for the joining and leaving tokens, respectively. The purposes and the roles of these mappings in query processing are summarised in Table 5.2.

For the sake of conflict reconciliation, ElasCass uses a multikey map (i.e. from multiple keys to a single value) to record the type with a timestamp of update for each token-node entry, that is, ⟨token, node⟩ → (type, timestamp), wherein the type can be one of Stable, Joining, Leaving, or nonexistent. This timestamp is created by the node that initiates the update of the binding type, and cannot be modified by other nodes that store it, unless the type is altered in a later event (with a later timestamp). Furthermore, it is worth mentioning that these mappings are also stored in the system metadata table for durability, so that a temporary node failure can be self-recovered without resorting to message exchange.
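The reconciliation rule can be sketched as a last-writer-wins map keyed by the ⟨token, node⟩ pair; an update is applied only if it carries a later timestamp. This is a minimal illustration with hypothetical names, not the actual ElasCass data structure.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of last-writer-wins reconciliation of <token, node> bindings.
 *  Each binding records its type (STABLE, JOINING, LEAVING, NONEXISTENT)
 *  together with the timestamp of the update that produced it. */
class TokenBindingSketch {
    enum Type { STABLE, JOINING, LEAVING, NONEXISTENT }

    static final class Binding {
        final Type type;
        final long timestamp;
        Binding(Type type, long timestamp) { this.type = type; this.timestamp = timestamp; }
    }

    // Composite key "token|node" stands in for the multikey map <token, node> -> (type, timestamp).
    private final Map<String, Binding> bindings = new ConcurrentHashMap<>();

    /** Apply an update only if it is newer than the stored binding. Returns true if applied. */
    boolean apply(String token, String node, Type type, long timestamp) {
        String key = token + "|" + node;
        Binding update = new Binding(type, timestamp);
        Binding merged = bindings.merge(key, update,
                (oldB, newB) -> newB.timestamp > oldB.timestamp ? newB : oldB);
        return merged == update;   // true only when the new update won
    }
}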

Gossip Message for Token Update

ElasCass follows Cassandra to leverage the gossiping protocols for propagating the updates of token ownership. Cassandra only broadcasts the one-to-one mapping between the token and its coordinator node, so the content of a gossip message is relatively simple. In contrast, ElasCass maintains an ongoing view of key space per column family with a many-to-many mapping between tokens and nodes. Hence, a much richer gossip message format is required in ElasCass.

As shown in Table 5.3, a gossip message consists of two IDs (i.e. column family and node), one operation with its execution status, one boolean value for the token, and two timestamps. To illustrate, ColumnFamily ID indicates the related key space to update, while Node ID specifies the node to which the operation is applied. Next, the add and remove operations are associated with replica migration, while the split and merge operations deal with automated partitioning. Moreover, reporting the execution state of an operation helps each node to maintain the ongoing view of key space (Figure 5.3). In addition, each token is assigned with a positive or negative value to indicate different responsibilities of the node (Figure 4.7 on page 91).

Furthermore, there are two timestamps. One is the update timestamp, which is used to reconcile multiple conflicting token updates. Note that each token-node mapping is assigned a timestamp; thus, only the update with a later timestamp will be applied.

Column Family: a ColumnFamily ID; each column family has a separate key space.
Node: a Node ID; the node to which the operation is applied.
Operation:
  Add: assign a partition to a node.
  Remove: remove a partition from a node.
  Split: split a partition into two sub-partitions.
  Merge: merge two partitions into one.
Execution state:
  In progress: e.g. a partition is being migrated to a node.
  Complete: e.g. a partition has been added to a node.
Token:
  Positive: the node is ready for query processing.
  Negative: the node is receiving a partition replica, but is not yet ready for query processing.
Update time: a timestamp indicating when the update was performed; used for versioning.
TTL: a time-to-live timestamp indicating when the message expires.

Table 5.3: Variables of a gossip message for token updates in ElasCass

The other is the TTL (time-to-live) timestamp, which sets an expiry time for a gossip message to spread. Each gossip message, when created, is assigned a TTL timestamp, which equals the current time plus a TTL time period (the default is 30 seconds). Thus, if a node receives a gossip message at a time point later than its TTL timestamp, it discards the message; otherwise, the message is forwarded to a set of randomly selected nodes. The maximum TTL (denoted as MTTL) represents the end-to-end gossip broadcast delay, and is important for controlling the timing of exchanging messages for replica migration between the source and destination nodes.
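The TTL rule amounts to a simple check on receipt: discard the message if its TTL timestamp has passed, otherwise forward it to a few randomly chosen live peers. The sketch below assumes the 30-second default TTL period mentioned above and an illustrative fan-out of three; it does not reproduce Cassandra's gossip implementation.

import java.util.Collections;
import java.util.List;

/** Sketch of TTL-bounded gossip forwarding for token-update messages. */
class GossipTtlSketch {
    static final long TTL_MILLIS = 30_000;      // default TTL period from the text
    static final int FANOUT = 3;                // number of peers to forward to (illustrative)

    static final class Message {
        final String payload;
        final long ttlTimestamp;                // creation time + TTL period
        Message(String payload, long createdAt) {
            this.payload = payload;
            this.ttlTimestamp = createdAt + TTL_MILLIS;
        }
    }

    /** Returns the peers the message should be forwarded to (empty if expired). */
    List<String> onReceive(Message m, List<String> livePeers, long now) {
        if (now > m.ttlTimestamp) {
            return Collections.emptyList();     // expired: discard rather than forward
        }
        Collections.shuffle(livePeers);         // pick FANOUT random live peers
        return livePeers.subList(0, Math.min(FANOUT, livePeers.size()));
    }
}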

It is also worth mentioning how a node synchronises the token-node mappings when it first joins the cluster or recovers from a failure. A new, empty node downloads the complete mapping information from a healthy node that has been responsive and alive for a relatively long time. This new node also stores the mappings in its system metadata table, and updates the entries whenever it receives new gossip messages.

On the other hand, if it is an existing node that has recovered from a failure, then this node will reuse the mappings read from the system metadata table. In this case, it sends out gossip messages for each unsettled token-node mapping along with the update timestamp read from the table, wherein unsettled mappings include a joining or leaving token, or a token-node mapping bound with a negative value.

L set
  Purpose: serve online reads and writes.
  How to generate: retrieve the node set from Token-To-Node given the token, and exclude all negative nodes.

K set
  Purpose: maintain high availability.
  How to generate: retrieve all nodes from Token-To-Node given the token.

Partitioning participants
  Purpose: execute a partitioning operation.
  How to generate: given the token, retrieve all nodes from Token-To-Node and Joining-Token, then exclude all nodes retrieved from Leaving-Token with the same token.

Table 5.4: The set of eligible nodes for different operations in ElasCass

On receiving such gossip messages, other living nodes will compare the update timestamp with their own records. A new gossip message containing a newer update will be sent out if a conflict is detected, so that the resurrected node can receive this newer update.

Choose Eligible Nodes during Token Update

Now that the data structure for storage and the message format (i.e. Table 5.3) for information exchange have been introduced, we discuss how to handle a token update event within each node. The challenge lies in efficiently providing the right set of nodes for different data storage and retrieval scenarios, because key lookups stand in the critical path of query processing.

Given the multimaps described in Table 5.2, each ElasCass node prepares different sets of nodes for different operations. First, to serve reads of a certain partition, the node set is retrieved from the token-to-node multimap using the token that is associated with the partition. Only the nodes bound with a positive value are eligible for serving reads. Second, serving writes of a certain partition requires two sets of nodes. One is identical to the node set for reads, and is used to serve online writes (i.e. the L set in Table 4.2 on page 90). The other is the K set, wherein all the nodes mapped by the token (associated with the partition) are eligible, regardless of whether the token value is positive or negative. A write is propagated to every node in the K set in the background for high availability. Both reads and writes use the original view of the key space for key lookups. In contrast, to create new data files (i.e. SSTables) for a partitioning operation (i.e. split/merge), each participating node uses the future view instead, as depicted in Figure 4.3 (page 75).

A summary of generating these node sets is presented in Table 5.4.
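The distinction between the L set and the K set can be illustrated with a short sketch; the map of node identifiers to boolean token values below is a simplification of the token-to-node multimap, not the actual ElasCass structure.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch of deriving the node sets used for query processing:
 *  L set = nodes bound with a positive token value (serve online reads and writes),
 *  K set = all nodes mapped by the token (writes are propagated to all of them). */
class NodeSetSketch {
    /** tokenToNode maps a node id to its boolean token value (true = positive). */
    static List<String> lSet(Map<String, Boolean> tokenToNode) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : tokenToNode.entrySet()) {
            if (e.getValue()) {                 // exclude nodes with a negative token
                result.add(e.getKey());
            }
        }
        return result;
    }

    static List<String> kSet(Map<String, Boolean> tokenToNode) {
        return new ArrayList<>(tokenToNode.keySet());   // every replica holder, positive or negative
    }
}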

Note that an update of the token-node mapping can lead to a change of the node sets in Table 5.4. Such an update is triggered by the event of replica reallocation or repartitioning, and is propagated in the form of gossip messages. On receiving a gossip message regarding a token update, a node validates the update timestamp and then applies it to the multimaps as described as follows.

ElasCass uses an atomic ReadWriteLock to maintain a consistent view of the node sets during update. For the sake of efficiency, the node sets are generated once, and then cached in the query engine. This is done to avoid computing the node sets at each key lookup. When a token update is required, the node acquires a write lock for all the multimaps (Table 5.2), and then adds or removes the token-node mapping to the multimaps based on the type of the event. After the update, it invalidates the node sets cached in the query engine, recalculates the eligible node sets, and finally releases the write lock. During the time period when the write lock is held, no other process in the same node can obtain a write or read lock to access the token-node multimaps. Hence, a consistent view of the node sets is guaranteed.
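A minimal sketch of this locking discipline is given below, assuming a ReentrantReadWriteLock and a single cached node set; the real implementation guards several multimaps and caches, so the names here are only illustrative.

import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

/** Sketch of guarding token-node mapping updates with a read/write lock.
 *  Lookups take the read lock; a token update takes the write lock, mutates
 *  the multimaps, and recomputes the cached node sets before releasing it. */
class TokenUpdateLockSketch {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private volatile List<String> cachedReadSet;        // node set cached for key lookups

    List<String> nodesForRead() {
        lock.readLock().lock();
        try {
            return cachedReadSet;                        // consistent snapshot of the node set
        } finally {
            lock.readLock().unlock();
        }
    }

    void applyTokenUpdate(Runnable mutateMultimaps, Supplier<List<String>> recompute) {
        lock.writeLock().lock();
        try {
            mutateMultimaps.run();                       // add or remove the token-node mapping
            cachedReadSet = recompute.get();             // invalidate and recompute the cached set
        } finally {
            lock.writeLock().unlock();
        }
    }
}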

5.3.2 Data Storage

This subsection presents the implementation of storing data objects into separate files based on the change in token-node mappings. First, it describes the generation of partition-specific SSTables to store incoming data objects. Second, it proposes an SSTable compaction algorithm that incorporates the process of rebuilding partition replicas into the routine of system garbage collection. Then, it presents the failover of an interrupted compaction, followed by a discussion of selecting the right SSTables for serving reads while a compaction is in progress. Last but not least, it describes the fulfilment of the prerequisites for executing a replica-rebuilding task for the sake of automated partitioning.

Partition-Specific SSTables

As discussed, data migration is more efficient when the data objects are stored in separate files based on the partition to which they belong. ElasCass follows Cassandra in using the file structure called SSTable. As shown in Figure 5.2 (page 101), the SSTables are created as a result of flushing the memtable onto the disk. In Cassandra, each memtable is converted into one SSTable, regardless of the partition to which each data object belongs.

In contrast, ElasCass needs partition-specific SSTables, wherein each SSTable only stores the data objects belonging to the partition with which it is associated. As in the discussion of Figure 5.3 (page 106), each ElasCass node uses its future view of the key space to determine the key range of the partition for each resulting SSTable.

ElasCass also partitions the key space using consistent hashing, wherein every two adjacent tokens define a partition. To convert the memtable into multiple SSTables, the data objects in the memtable are sorted by their keys, and the sorted list is segmented into multiple sublists by the tokens of the future view. Each sublist of data objects is then flushed onto the disk as one new SSTable. The key range of the partition with which this SSTable is associated is defined by the two adjacent tokens that segment the sublist. In addition, these two tokens are written into the metadata of the SSTable, so that any node that loads this SSTable can perceive the associated key range.
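The segmentation step can be sketched by cutting a key-sorted memtable at every token of the future view; each sub-range becomes one SSTable tagged with its two bounding tokens. The sketch ignores ring wrap-around and uses a simplified long-valued key, so it is an illustration of the idea rather than the actual flushing code.

import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.NavigableSet;
import java.util.SortedMap;

/** Sketch of flushing one memtable into several partition-specific SSTables:
 *  the key-sorted memtable is cut at every token of the future view, and each
 *  sub-range becomes one SSTable with its bounding tokens in the metadata.
 *  Ring wrap-around is ignored here to keep the sketch short. */
class MemtableSegmentationSketch {
    static List<SortedMap<Long, byte[]>> segment(NavigableMap<Long, byte[]> memtable,
                                                 NavigableSet<Long> futureTokens) {
        List<SortedMap<Long, byte[]>> sstables = new ArrayList<>();
        Long lower = null;
        for (Long token : futureTokens) {
            // Keys in (lower, token] belong to the partition ending at this token.
            SortedMap<Long, byte[]> slice = (lower == null)
                    ? memtable.headMap(token, true)
                    : memtable.subMap(lower, false, token, true);
            if (!slice.isEmpty()) {
                sstables.add(slice);            // flushed as one SSTable for the range (lower, token]
            }
            lower = token;
        }
        return sstables;
    }
}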

Compared to Cassandra, given the same threshold for flushing the memtable, the SSTables generated in ElasCass are much smaller, since each memtable is converted into multiple SSTables rather than one. This results in higher overheads in data retrieval, since there are more SSTables to visit. There are two remedies. First, ElasCass uses a higher threshold for memtable flushing, at the cost of higher memory consumption. Second, ElasCass adopts a levelled compaction strategy that prefers to merge small SSTables within the same partition. The complete SSTable compaction strategy is presented as follows.

SSTable Compaction Strategy

As in Bigtable (Chang et al., 2006), the immutable nature of SSTable requires a garbage collection (GC) process to regularly merge multiple SSTables into a large SSTable. This process is called the compaction, and has two goals. First, it reduces the total number of SSTables, which is especially necessary in ElasCass, wherein each memtable is converted into multiple small-size SSTables. Second, it allows the system to reclaim resources used by obsolete data and to ensure that deleted data disappears in a timely fashion.

In ElasCass, this GC process is applied not only to the merging of SSTables within the same partition, but also to the execution of rebuilding replicas when a repartitioning is required.

Algorithm 5 Generate SSTable compaction tasks during garbage collection
 1: {Pi}i=1..q ← getLocalPartitions()            ▷ the partition list is already sorted
 2: for i = 1 → q do
 3:   {fj}j=1..m ← getSSTables(Pi)
 4:   size ← bytesOnDisk({fj}j=1..m)
 5:   if size ≥ Θmax then
 6:     submitSplitCompaction(Pi)                ▷ to split the partition Pi
 7:   else if size < Θmin then
 8:     Pnext ← getSuccessor(Pi)
 9:     if isLocalPartition(Pnext) then          ▷ try merge if both partitions are local
10:       {fk}k=1..n ← getSSTables(Pnext)
11:       nextSize ← bytesOnDisk({fk}k=1..n)
12:       if size + nextSize < Θmin then
13:         submitMergeCompaction(Pi, Pnext)     ▷ to merge Pi and Pnext
14:       end if
15:     end if
16:   else
17:     upperBound ← 0.618 * size                ▷ 0.618 and 0.191 can be any other values
18:     lowerBound ← minimum(0.191 * size, ΘminSize)
19:     ΘminCmpt ← 4                             ▷ minimum number of SSTables for compaction
20:     toCompact ← newEmptyList()
21:     for j = 1 → m do
22:       if fj.size ≤ lowerBound then
23:         toCompact.add(fj)                    ▷ always compact small SSTables
24:       else if fj.size > upperBound then
25:         markLargeFile(Pi, fj)                ▷ always avoid a large SSTable
26:       end if
27:     end for
28:     if toCompact.size ≥ ΘminCmpt then
29:       submitGCCompaction(toCompact)
30:     else
31:       toCompact.addAll({fj}j=1..m)           ▷ compact all SSTables
32:       toCompact.remove(getLargeFiles(Pi))    ▷ except the large one
33:       if toCompact.size ≥ ΘminCmpt then
34:         submitGCCompaction(toCompact)
35:       end if
36:     end if
37:   end if
38: end for

Hence, there are three types of compactions. The process that merges SSTables from the same partition is called the GC compaction, while the processes that split or merge partition replicas for the sake of automated partitioning are called the split compaction and the merge compaction, respectively. In ElasCass, each column family runs an independent GC process, which regularly inspects the metadata of all the SSTables in the column family. A list of compaction tasks is then generated according to Algorithm 5.

As shown, the GC process groups the SSTables by partition (which is written in the metadata of each SSTable), and calculates the total size of the SSTables in each partition. According to Equation 4.1 (page 72), it submits a split compaction (Line 6) if the size exceeds Θmax. Alternatively, a merge compaction is submitted (Line 13) if the total size of the examined partition and its immediate successor partition is below Θmin.

When neither of the partitioning operations is triggered, the GC process will attempt to compact the SSTables within each partition. Note that the cost of an SSTable compaction depends on the total size of the SSTables to be compacted. Therefore, Algorithm 5 classifies the SSTables by size into three categories: i) large, if the size is above the upperBound (Line 17); ii) small, if the size is below the lowerBound (Line 18), wherein ΘminSize statically defines the size of a small SSTable when the total size of a partition is relatively large; and iii) normal, if the SSTable is neither large nor small. Note that the predefined values in Line 17 and Line 18, i.e. 0.618 and 0.191, are derived from the golden ratio (related to the Fibonacci sequence) and can be replaced by any other values.

The GC compaction always attempts to merge small SSTables, since this effectively reduces the number of SSTables. There are also two refinements to avoid excessive compaction. First, a large SSTable will not be compacted unless it is explicitly required, because it already contains the majority of the data of one partition. Second, a GC compaction will not be submitted if there are not enough SSTables to be merged (ΘminCmpt, Lines 28 and 33). Also, when the number of small SSTables is less than ΘminCmpt, the normal SSTables are also included for compaction (Line 31), so that all the SSTables, except the large SSTable if there is one, can be merged into one SSTable.

Compaction Execution and Roll Back

A compaction task is submitted as in Algorithm 5. A coordinator must be elected in the case of a split/merge compaction. The coordinator fulfils the prerequisites of partitioning, and then schedules a simultaneous compaction on all the participating nodes. The discussion regarding the prerequisites is deferred to the end of this subsection (page 117).

In any case, once a compaction is set in motion, the memtable of the related column family is forced to flush based on the future view of the key space, so that the latest writes are also compacted. Then, the targeted SSTables are marked as compacting in an atomic operation, such that SSTables being compacted will not be submitted to multiple concurrent compaction tasks. This also means that, if one of the targeted SSTables is already marked as compacting by another thread, the atomic marking operation fails and the compaction task is cancelled. Otherwise, if the targeted SSTables are successfully marked, the node scans all the targeted SSTables simultaneously, and rewrites the data objects in ascending order of their keys into one or two new SSTables, depending on the type of compaction (i.e. merge or split). At the end of the compaction, the new SSTables are loaded into the system (e.g., building indices, caching Bloom filters, etc.), while the targeted SSTables are deleted from the disk.
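The atomic marking step can be sketched with a per-SSTable flag and compare-and-set: either every targeted SSTable is claimed, or any marks already taken are released and the task is cancelled. The names below are hypothetical.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Sketch of atomically claiming a set of SSTables for one compaction task.
 *  Either all files are marked as compacting, or none are. */
class CompactionMarkingSketch {
    static final class SSTableRef {
        final String path;
        final AtomicBoolean compacting = new AtomicBoolean(false);
        SSTableRef(String path) { this.path = path; }
    }

    /** Returns true if all SSTables were claimed; on conflict, releases any marks already taken. */
    static boolean markForCompaction(List<SSTableRef> targets) {
        List<SSTableRef> claimed = new ArrayList<>();
        for (SSTableRef t : targets) {
            if (!t.compacting.compareAndSet(false, true)) {
                for (SSTableRef c : claimed) {
                    c.compacting.set(false);      // roll back: another task owns one of the files
                }
                return false;                     // the compaction task is cancelled
            }
            claimed.add(t);
        }
        return true;
    }
}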

A compaction task can be called off and rolled back at any time during its execution. This is necessary, especially when rebuilding replicas for a distributed split/merge operation. As depicted in Figure 4.5 (page 79), the partitioning operation should be aborted whenever more than one node fails. There are two scenarios for rolling back from failure: one is for a well-functioning node that is called off by an abortion notification, and the other is the case of a failed node. The challenge of rolling back a partitioning is to reconcile the conflicts between the original view and the future view of the key space (Figure 5.3 on page 106). The following discussion focuses on the split/merge compactions, but a GC compaction can also be rolled back similarly.

The first scenario is when a well-functioning node is required to abort the related compaction operation. When the node receives an abortion notification, it immediately terminates the compaction process and unmarks the related token from joining/leaving. Note that right before the compaction starts, the latest writes in the memtable have been converted into several new SSTables using the future view of the key space. The node has recorded these new SSTables for the sake of roll-back. Hence, these recorded SSTables are re-compacted if their associated partitions are inconsistent with the original view of the key space.

On the other hand, a node that has failed during a compaction can be re-bootstrapped by loading the data from its own disk. Note that each node records the partition-node mappings in a system table on disk for faster recovery. When the node comes back to life, it recalls the previous view of the key space, and it also attempts to synchronise the current view of the key space as discussed on page 109. Next, the node examines the current status of the key space. If the previous split/merge operation has been cancelled, then the resurrected node reuses those SSTables that are marked as compacting, and also re-compacts those SSTables already built for the future partition. Otherwise, if the operation has succeeded without the presence of the resurrected node, then the node has to give up the ownership of the original partition and removes the related SSTables. In either case, the latest writes will be propagated to this node with the help of hinted handoff. Finally, the node is resurrected with consistent ring information and data objects.

In both scenarios, the node will eventually delete those SSTables that were half-built when the failure interrupted the compaction, since they are unreadable. This roll-back process is efficient, because it only involves re-compacting several new SSTables converted from the memtable, the total size of which is small (i.e. between a few kilobytes and tens of megabytes). Such re-compaction (and hence the roll-back) can be completed within one minute.

Data Retrieval during Compaction

For KVSs that support only GC compaction, data retrieval is straightforward. Since the SSTables marked as compacting and the resulting SSTable are associated with the same key range, the data can always be retrieved from all the SSTables available on disk. In contrast, ElasCass also supports split and merge compactions, in which SSTables of the same partition can be associated with two different key ranges. That is, one set of SSTables falls into the original key range, i.e. before the split/merge, while the other set stores data for the future key range, i.e. after the split/merge. It is non-trivial to choose the right set of SSTables when a split/merge compaction is in progress.

As discussed previously, for the sake of rolling back a compaction, a node will mark the SSTables that are created from the memtable after the compaction has started. These SSTables fall into the future key range, but they also store data objects that belong to the original key range. Hence, to serve read operations, the key lookups will be based on the original view of the key space. But within each node, the data will be retrieved from two sets of SSTables: one set containing the SSTables belonging to the original key range, and the other set containing the SSTables that belong to the future key range and are marked for roll-back. Note that the SSTables created from the compaction will not be visited before the whole compaction task is completed. This is because the compacted SSTables are just extra copies of the data that is already stored in the two sets of SSTables.

In addition, at the end of a split/merge compaction, after the partitioning operation is announced successful by the coordinator (Figure 4.4 on page 76), the token-node mapping is updated via gossip messages. Thus, the future key range becomes the original key range. Hence, the compacted SSTables are put into use, while the SSTables marked as compacting can be safely deleted in the end, because all the data objects in the compacting SSTables have been written into the compacted SSTables.

Algorithm 6 Calculate a splitting token for a list of SSTables in one partition
Require: {fi}i=1..k                              ▷ a list of SSTables; each fi provides bytes, minKey, and maxKey
1: minToken ← getMinimumKey({fi}i=1..k)          ▷ the minimum key of all the SSTables
2: distanceList ← newEmptyList()
3: for i = 1 → k do
4:   distance ← getMiddleToken(fi.minKey, fi.maxKey) − minToken
5:   distanceList.add(distance, i)
6: end for
7: avgDistance ← (Σi=1..k distanceList.get(i) * fi.bytes) / (Σi=1..k fi.bytes)
8: middleToken ← avgDistance + minToken
9: return middleToken

Fulfilling Prerequisites for Split/Merge Compaction

This subsection has discussed how ElasCass leverages the process of garbage collection (GC) to trigger a partitioning operation based on Equation 4.1 (page 72), and to execute the operation of rebuilding replicas as in Figure 4.3 (page 75). However, additional implementation work is required to fulfil the prerequisites of automated partitioning.

As depicted in Figure 4.4 (page 76), the execution of a split/merge operation relies on the election-based coordination. As discussed, the participating nodes are required to gossip their ownership of the targeted partitions before the election, so that an updated partition-node mapping can be seen by the “voters”. The Chubby-based election and the fault-tolerant coordination have been discussed in Subsection 4.2.2 (page 75–page 80). The focus here is on how the coordinator fulfils the prerequisites before the execution starts, described at Step 2 of the coordination (page 77).

In the case of a split, the challenge is to efficiently determine the splitting token that segments a partition into two equal-size sub-partitions. The problem is that each partition can contain many SSTables, and it is not efficient to scan over all the SSTable data files to pinpoint the data object whose key splits the partition into two equal halves. To address this problem, an algorithm is proposed to calculate the midpoint for a list of SSTables, based on only three facts that the metadata of each SSTable already provides: the number of bytes on disk, the minimum key, and the maximum key of all the data objects in the SSTable. As shown in Algorithm 6, the splitting token is calculated as the weighted average distance of the middle token of every SSTable (denoted as fi) from the minimum key in the partition, wherein the weight is the data volume of each SSTable (i.e., fi.bytes). The idea of this algorithm is similar to calculating the centre of mass of multiple objects in physics.

In the case of a merge, the challenge is to deal with the scenario in which the two targeted partitions are not stored on the same list of nodes. ElasCass allows the coordinator to reallocate replicas for the sake of merging partitions, under one condition: all the source and destination nodes must be idle. In the implementation, an idle node is defined as having CPU usage below 30%, although any other threshold could be used. To reallocate the replicas of two partitions (e.g. Pi and Pi+1) onto the same node list, the coordinator always chooses the nodes that serve the latter partition (i.e. Pi+1) as the destination. The reason is that, when Pi and Pi+1 are merged, the resulting partition is associated with Ti+1, while Ti is removed. On the other hand, if there exists at least one participating node that is not idle, the coordinator will cancel this merge operation, with a notification suggesting that the participating nodes should not submit a merge for the same pair of partitions within a long period (e.g. 30 minutes).

5.3.3 Replica Reallocation

Now that the replicas have been confined into transferable units, this subsection presents the implementation of replica reallocation. One aspect of replica reallocation is a set of algorithms that determine the placement of replicas when a node is bootstrapped, decommissioned, or fails unexpectedly, as discussed in Subsection 4.2.3 (pages 80–89). One focus of these algorithms is to achieve load balancing in terms of both workload and distribution of data, which relies on the gathering of load statistics. The other aspect of replica reallocation is data migration, discussed in Subsection 4.2.4, wherein the focus is on ensuring online query processing based on token management and transfer throttling.

Workload of nodes
  Metric: CPU utilisation.
  Means to gather statistics: the Linux sar command.

Data volume of nodes
  Metric: number of replicas.
  Means to gather statistics: gossip and cache the complete partition-node mapping.

Popularity of partition replicas
  Metric: hit count.
  Means to gather statistics: hit count = number of SSTable reads (by queries only) + number of data objects written to create SSTables; both numbers are collected via the data tracker.

Table 5.5: Gathering load statistics in ElasCass

Gathering Load Statistics

The replica placement algorithms rely heavily on three facts: the workload that each node undertakes, the data volume each node stores, and the popularity of each partition replica. However, these load statistics are not provided in Apache Cassandra. The implementation choices in gathering these statistics are summarised in Table 5.5, while the discussion is presented as follows.

As mentioned in Subsection 4.2.3 (Rule 3 on page 81), the CPU utilisation of a node is used as the metric for its workload. This choice is made for two reasons. First, the CPU utilisation of a node is strongly correlated with the overall response time of the queries it serves (Lim et al., 2010). Second, the CPU utilisation can be obtained from the operating system without instrumenting application code. In the implementation, ElasCass uses the Linux sar command to collect CPU statistics.

In regard to the data volume each node stores, ElasCass uses the number of partition replicas as an indication, because each replica has been confined to a bounded size through automated partitioning. Moreover, since every node gossips and caches the complete partition-node mapping for the sake of query routing, the number of replicas on a certain node is known to every bootstrapped node.

In contrast, measuring the popularity of a partition replica is less straightforward. As described in Equation 4.3 (page 83), the exponential moving average of the hit count is used as the popularity metric. However, there are a few issues to consider when measuring an accurate hit count for a partition replica. First, write operations are stored in the memtable, which does not incur I/Os until it is flushed. Second, a read operation (i.e. retrieving a certain data object) may result in accessing multiple SSTables due to false-positive matches of the Bloom filter (Bloom, 1970). Third, a data object may be cached in memory, which does not require I/O for retrieval. Last but not least, SSTable compaction also introduces a large number of I/Os.

To deal with these issues, we define the hit count of a partition replica as the sum of: i) the total number of visits to the SSTables that are introduced by read operations; and ii) the total number of data objects in newly created SSTables. Hence, a read that involves multiple SSTable accesses is counted as multiple hits, while a read served from the cache does not add to the hit count. In terms of writes, since SSTables are immutable (i.e. write once, read many), the hit count of creating a new SSTable is estimated as the number of data objects written. In addition, according to this definition of hit count, the workload of SSTable compaction is counted as the number of data objects stored in the compacted SSTables, while the number of reads introduced by compaction is disregarded. This is to reduce the impact of compaction on the popularity of the partition replica.

The hit count measurement has been implemented in a component called the data tracker, which is responsible for providing an atomic reference view of SSTables for query processing. Hence, this component is able to keep track of every SSTable visit and every new SSTable created. In the implementation, each column family has one data tracker, which maintains a hash map that records and periodically updates the hit count of each partition, based on Equation 4.3.
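As a concrete illustration of this bookkeeping, the sketch below shows one plausible shape for such a tracker. It is not the ElasCass data tracker itself; the class name, the alpha smoothing factor, and the integer partition identifiers are assumptions made for the example. It only encodes the two counting rules defined above (one hit per SSTable visited by a read, and one hit per data object written into a newly created SSTable) together with a periodic exponential-moving-average update in the spirit of Equation 4.3.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.LongAdder;

    // Sketch of per-partition hit counting with an exponential moving average (EMA).
    public final class PartitionPopularityTracker {

        private final double alpha;   // EMA smoothing factor, assumed configurable (e.g. 0.5)
        private final Map<Integer, LongAdder> hitsInWindow = new ConcurrentHashMap<>();
        private final Map<Integer, Double> popularity = new ConcurrentHashMap<>();

        public PartitionPopularityTracker(double alpha) { this.alpha = alpha; }

        /** A read that touches one SSTable of the partition adds one hit. */
        public void onSSTableRead(int partitionId) {
            hitsInWindow.computeIfAbsent(partitionId, id -> new LongAdder()).increment();
        }

        /** A newly flushed SSTable adds one hit per data object it contains. */
        public void onSSTableCreated(int partitionId, long objectsWritten) {
            hitsInWindow.computeIfAbsent(partitionId, id -> new LongAdder()).add(objectsWritten);
        }

        /** Called periodically: folds the window's hit count into the EMA popularity. */
        public void updatePopularity() {
            hitsInWindow.forEach((partitionId, counter) -> {
                long hits = counter.sumThenReset();
                popularity.merge(partitionId, (double) hits,
                        (old, fresh) -> alpha * fresh + (1 - alpha) * old);
            });
        }

        public double popularityOf(int partitionId) {
            return popularity.getOrDefault(partitionId, 0.0);
        }
    }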

Exchanging Messages

The token management scheme for ensuring data consistency during replica movement has been depicted in Figure 4.7 (page 91). In terms of implementation, the focus is on the propagation of token updates and the coordination between the source and destination nodes.

Furthermore, the TTL (time-to-live) timestamp sets an expiry time for a gossip message to spread. Each gossip message is assigned a TTL timestamp when it is created, where the TTL timestamp equals the current time plus the maximum TTL time period, which is 30 seconds by default. Hence, if a node receives a gossip message at a time point later than the TTL timestamp, the message is discarded rather than forwarded. The maximum TTL time period (denoted as MTTL) represents the end-to-end gossip broadcast delay, and is important for controlling the timing of exchanging messages for replica migration between the source and destination nodes.

Figure 5.4: Real-time token management for replica migration

Figure 5.4, extended from Figure 4.7, presents a closer look at the timing of migrating a replica from the source ns to the destination nd. As shown, when ns initiates the migration at t0, it immediately notifies nd, which receives the notification at rt0 (rt denotes real time). Then, nd assigns a negative token for the related replica to itself, and sends out a gossip message regarding this update. As discussed in the proof of data consistency on page 92, it is critical for nd to wait for two MTTL periods before it accepts the request for data transfer. Hence, t2 − rt0 ≥ 2 × MTTL. Similarly, when nd switches the token value to positive for serving read operations at t4, it is also important for ns to wait for one MTTL period before it abandons its ownership of the replica. Hence, t5 − rt4 ≥ MTTL, where rt4 is the time point at which ns receives the notification of the token update sent by nd.
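The two waiting rules can be captured in a few lines. The sketch below is illustrative only, assuming the default MTTL of 30 seconds mentioned above; it is not the ElasCass implementation, and the method names are invented for the example.

    import java.time.Duration;
    import java.time.Instant;

    // Sketch of the MTTL-based waiting rules for migrating one replica.
    public final class MigrationTimer {

        /** Maximum TTL period of a gossip message (end-to-end broadcast delay). */
        static final Duration MTTL = Duration.ofSeconds(30);

        /**
         * Destination side: data transfer may start only after the negative-token
         * gossip (sent at negativeTokenGossipedAt) has had two MTTL periods to spread.
         */
        static boolean mayAcceptDataTransfer(Instant negativeTokenGossipedAt, Instant now) {
            return !now.isBefore(negativeTokenGossipedAt.plus(MTTL.multipliedBy(2)));
        }

        /**
         * Source side: the local replica may be abandoned only one MTTL after the
         * notification of the positive token (received at positiveTokenReceivedAt).
         */
        static boolean mayAbandonReplica(Instant positiveTokenReceivedAt, Instant now) {
            return !now.isBefore(positiveTokenReceivedAt.plus(MTTL));
        }

        public static void main(String[] args) {
            Instant t = Instant.now();
            // Immediately after gossiping the negative token, transfer must wait.
            System.out.println(mayAcceptDataTransfer(t, t));                           // false
            System.out.println(mayAcceptDataTransfer(t, t.plus(MTTL.multipliedBy(2)))); // true
        }
    }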

Transferring Data

In ElasCass, the data is transferred in the form of data files (i.e. SSTables). The concern with data transfer is that it should not affect the performance of online query processing. There are two approaches to throttling the transfer bandwidth.

One option is to set up a TCP connection between two ElasCass nodes. The source node uses a daemon thread to determine the number of bytes sent in each time interval, such that the transfer speed in any interval is no greater than a predefined maximum throughput for outbound transmission. Alternatively, ElasCass leverages the Linux scp command to perform file transfer in a more primitive fashion, which introduces lower overhead than application-level data transfer. The scp command provides the [-l limit] parameter to limit the bandwidth used.

Table 5.6: Computational capacity of a single virtual machine in experiments

Property | Value
OS | Ubuntu 12.04, 3.2.0-29-virtual, x86_64
File system | Ext3
Instance type | m1.large
Memory | 7.5 GB
CPU | 2 virtual cores with 2 EC2 Compute Units each
Storage | 2 ephemeral storage volumes with 420GB each
Disk I/O | High: 100,000 random read IOPS, 80,000 random write IOPS

The former approach has been used in Cassandra to transfer data at the granularity of data objects. In comparison, the latter is more suitable for transferring data in the form of files, but it is only applicable to Unix/Linux-based systems. Hence, ElasCass implements both approaches, and the KVS administrator can designate the preferred approach in the configuration file.
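As a rough illustration of the two options, the sketch below shows an application-level chunked sender that caps the outbound rate per one-second interval, and a wrapper around scp -l. The class and method names are hypothetical, and the 64KB chunk size and the example bandwidth figures are arbitrary choices made for the sketch.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Sketch of the two throttling options described above (illustrative names).
    public final class ThrottledTransfer {

        /**
         * Option 1: application-level throttling over an already-open TCP stream.
         * The file is sent in fixed-size chunks; once the per-second byte budget is
         * reached, the sender sleeps until the one-second interval has elapsed.
         */
        static void sendThrottled(Path sstable, OutputStream out, long maxBytesPerSecond)
                throws IOException, InterruptedException {
            byte[] chunk = new byte[64 * 1024];
            long windowStart = System.nanoTime();
            long sentInWindow = 0;
            try (InputStream in = Files.newInputStream(sstable)) {
                int n;
                while ((n = in.read(chunk)) != -1) {
                    out.write(chunk, 0, n);
                    sentInWindow += n;
                    if (sentInWindow >= maxBytesPerSecond) {
                        long elapsedMs = (System.nanoTime() - windowStart) / 1_000_000;
                        if (elapsedMs < 1000) {
                            Thread.sleep(1000 - elapsedMs);  // wait out the interval
                        }
                        windowStart = System.nanoTime();
                        sentInWindow = 0;
                    }
                }
            }
        }

        /**
         * Option 2: delegate to the Linux scp command; its -l flag limits the
         * bandwidth in Kbit/s (e.g. 81920 Kbit/s is roughly 10 MB/s).
         */
        static Process scpWithLimit(Path sstable, String destination, int limitKbitPerSec)
                throws IOException {
            return new ProcessBuilder("scp", "-l", Integer.toString(limitKbitPerSec),
                    sstable.toString(), destination).inheritIO().start();
        }
    }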

5.4 Experiments and Results

5.4.1 Experimental Setup

This subsection presents the general setup for all the experiments in the following subsections. It describes the capacity of the experimental testbed, the configurations of the KVSs and the YCSB client, and the definitions of several performance metrics. The detailed setup for each specific experiment is described in the respective subsection.

The experiments were conducted on Amazon EC2. Each VM runs as one node of the KVS. All of the VM instances are based on a common Linux image. The computational capacity of a single VM is shown in Table 5.6. For performance reasons, the persistent data of the KVS was stored on the 420GB ephemeral storage that comes with each VM and serves as a local disk, rather than on an Elastic Block Storage (EBS) volume (Amazon, 2007) that is accessed via the network. This is consistent with known production deployments of Cassandra on EC2 (Cockroft, 2011). Expósito et al. (2013) also revealed that the use of ephemeral storage can provide better performance than EBS volumes, because the performance of EBS is deeply influenced by network overhead and variability. The I/O performance of the ephemeral storage used in the experiments is categorised as “High”. According to Amazon, High I/O instances can deliver in excess of 100,000 random read I/O operations per second (IOPS) and as many as 80,000 random write IOPS.

The KVSs used in the experiments are ElasCass and Apache Cassandra (version 1.0.5), which uses the split-move approach discussed in Subsection 3.2.1 (page 57) for data partitioning and placement. An illustration of this approach is also given in Figure 5.1 on page 100. Hence, the experimental results of Cassandra are labeled as split-move. We did not compare ElasCass with other decentralised shared-nothing KVSs, such as Riak (Basho Tech., 2012) and Project Voldemort (2009), because they all use a similar split-move approach. Also, these KVSs use Berkeley DB or MySQL as the backend storage, which makes them less comparable with ElasCass or Cassandra.

In terms of configuration, both ElasCass and Cassandra follow the default settings in the Apache Cassandra release, while the replication number is set to K = 2 for both systems, meaning that each data object is copied onto two different nodes. In addition, ElasCass requires settings for the maximum and minimum sizes of a partition, as defined in Equation 4.1 on page 72. Although these thresholds vary across experiments, the minimum size is always set to one half of the maximum, i.e. Θmin = Θmax/2. According to Equation 4.2 (page 73), the estimated average data volume per partition is equal to 0.625Θmax.

The benchmark tools have been reviewed in Subsection 3.1.4 (page 51). The experiments used YCSB (version 0.1.4) (Cooper et al. 2010, GitHub.com 2010), because it has become an influential benchmark for KVSs (Cattell, 2011). The parameters configured are shown in Table 5.7. The data set is generated by the YCSB client in the loading phase; its total size is approximately 100GB. The inserted keys are hashed with the 64-bit FNV function¹, so that the hotspot data is scattered over many partitions. Both write-intensive and read-intensive workloads were generated using YCSB. Each workload was generated with two different request distributions, i.e. zipfian and hotspot. The consistency level (DataStax, 2012) is set to ALL for write operations and ONE for read operations. This parameter specifies how many replicas must respond before a result is returned to the client. It trades response time against data accuracy, but does not affect the eventual consistency of Key-Value Stores.

In order to evaluate load balancing, an imbalance index IL is defined to indicate the imbalance of the loads {Li}, i = 1, …, n, across a group of n nodes. Let IL = σL/L, where L is the average value of the loads {Li} and σL is their standard deviation. This index shows the proportion of the variation (or dispersion) relative to the average, so a smaller value of IL indicates better load balancing.

¹Fowler-Noll-Vo is a non-cryptographic hash function created by Glenn Fowler, Landon Curt Noll, and Phong Vo.

Table 5.7: Parameters configured in YCSB

Property | Value
Records | size = 1KB, count = 100 million
Insert order | hashed with 64-bit FNV
Read/update ratio | 50/50 for write-intensive, 95/5 for read-intensive
Request distribution | zipfian (constant = 0.99); hotspot (80% of requests targeting 20% of data)
Operation count | varied
Thread count | varied
Consistency level | write: ALL; read: ONE

We have evaluated the balance of both the data volume and the query workload. Moreover, the average CPU utilisation is used to quantify the workload each node undertakes. The CPU usage is monitored periodically using the Linux command “sar -u 5 2”, which reports the average CPU usage every 10 seconds. Thus, we calculate the average CPU usage per node to indicate resource utilisation, and the imbalance index of the CPU usage across all the nodes to evaluate load balancing in the system.
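A minimal sketch of this calculation is given below; it simply computes IL = σL/L over a set of per-node CPU readings, with the sample values chosen arbitrarily for illustration.

    // Sketch: computing the imbalance index I_L = sigma_L / mean_L over per-node loads.
    public final class ImbalanceIndex {

        static double imbalanceIndex(double[] loads) {
            double mean = 0;
            for (double l : loads) mean += l;
            mean /= loads.length;

            double variance = 0;
            for (double l : loads) variance += (l - mean) * (l - mean);
            double stdDev = Math.sqrt(variance / loads.length);

            return stdDev / mean;   // lower values indicate better load balancing
        }

        public static void main(String[] args) {
            double[] balanced = {72, 70, 68, 71, 69};     // e.g. CPU usage per node (%)
            double[] skewed   = {95, 90, 15, 10, 12};
            System.out.printf("balanced: %.2f%n", imbalanceIndex(balanced));  // ~0.02
            System.out.printf("skewed:   %.2f%n", imbalanceIndex(skewed));    // ~0.89
        }
    }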

The experiments were conducted in the following sequence. First, we evaluated the automated partitioning scheme, by partitioning the 100GB data using different values of

Θmax and Θmin in ElasCass. Based on the experimental results, an appropriate setting of the thresholds was chosen for the next two experiments. This was followed by the evaluation of node bootstrapping in both ElasCass and Apache Cassandra. At each system scale, a set of query workloads was launched against both KVSs, so as to evaluate their query performance in different scenarios. The results of these three experiments are presented in the following subsections respectively.

5.4.2 Data Partitioning

This experiment demonstrates how the maximum size Θmax and the minimum size Θmin (defined in Equation 4.1 on page 72) can affect the partitioning results with the automated partitioning strategy proposed in Subsection 4.2.1.

Test | Threshold | Est. avg. volume | Est. partitions | Initial Q | Initial replicas | Nodes
1 | Θmax = 1GB, Θmin = 0.5GB | 0.625 GB (= 1GB × 0.625) | 160 (= 100/0.625) | 32 (= 160/5) | 64 (= 32 × 2) | 10
2 | Θmax = 2GB, Θmin = 1GB | 1.25 GB | 80 | 16 | 32 | 10
3 | Θmax = 4GB, Θmin = 2GB | 2.5 GB | 40 | 8 | 16 | 10
4 | Θmax = 8GB, Θmin = 4GB | 5.0 GB | 20 | 4 | 8 | 2
5 | Θmax = 16GB, Θmin = 8GB | 10 GB | 10 | 2 | 4 | 2
6 | Θmax = 32GB, Θmin = 16GB | 20 GB | 5 | 1 | 2 | 2

Table 5.8: The setup for evaluating automated partitioning

Experimental Setup

The YCSB client was configured with the parameters shown in Table 5.7. It generated the 100GB data set independently for each of the six tests in this experiment. In each test, the data was loaded into a cluster of empty nodes with different setups outlined in

Table 5.8. As shown, the maximum size Θmax increases exponentially from 1GB to 32GB in powers of 2, while the minimum size is set as Θmin = Θmax/2.

Note that ElasCass requires a setting for the initial number of partitions (i.e. Q), which segments the key space into Q equal-length key ranges. Given a total of 100GB of data, the estimated number of resulting partitions is calculated from the estimated average volume. Then, Q is set to 1/5 of the estimated partition number, so that the execution of automated partitioning can be observed. Hence, Q decreases by powers of 2, from 32 when Θmax = 1GB down to 1 when Θmax = 32GB.
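The following sketch illustrates this setup arithmetic; it assumes, purely for illustration, a 64-bit token space that is cut into Q equal ranges, and reproduces the Q = 16 value of Test 2 in Table 5.8.

    import java.math.BigInteger;

    // Sketch: deriving the initial partition count Q and the equal-length key ranges.
    public final class InitialPartitioning {

        /** Estimated partitions = total data / (0.625 * thetaMax); Q = one fifth of that. */
        static int initialQ(double totalDataGB, double thetaMaxGB) {
            double estimatedPartitions = totalDataGB / (0.625 * thetaMaxGB);
            return Math.max(1, (int) Math.round(estimatedPartitions / 5));
        }

        /** Splits a 64-bit token space (assumed here for illustration) into q equal ranges. */
        static BigInteger[] rangeBoundaries(int q) {
            BigInteger tokens = BigInteger.TWO.pow(64);
            BigInteger[] boundaries = new BigInteger[q + 1];
            for (int i = 0; i <= q; i++) {
                boundaries[i] = tokens.multiply(BigInteger.valueOf(i))
                                      .divide(BigInteger.valueOf(q));
            }
            return boundaries;   // partition i covers [boundaries[i], boundaries[i+1])
        }

        public static void main(String[] args) {
            // Test 2 in Table 5.8: 100 GB of data and thetaMax = 2 GB give Q = 16.
            int q = initialQ(100, 2);
            System.out.println("Q = " + q + ", ranges = " + (rangeBoundaries(q).length - 1));
        }
    }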

Finally, the number of nodes employed in each test is determined based on the initial number of replicas in the cluster. Note that the replication number is K = 2 for all the tests. From Test 1 to Test 3, ten empty nodes were employed, since there are at least 16 initial replicas. In contrast, from Test 4 to Test 6, there are fewer than ten initial replicas, so two empty nodes were used instead. The replicas were evenly and randomly distributed across the nodes.

Figure 5.5: Statistics of partitioning 100GB data under different thresholds. (a) Number of partitions; (b) average data volume per partition; (c) fullness of partitions. Each is plotted against the maximum size Θmax of partition (GB).

Statistics of Automated Partitioning

Six tests were run based on the setup in Table 5.8, and Figure 5.5 presents a summary of the partitioning results. The total number of partitions generated is shown in Figure 5.5a, which also depicts the estimated number of partitions for comparison. Note that the estimated number is calculated as 100GB (i.e. the total volume) divided by the estimated average volume per partition (Table 5.8). As shown, as the value of Θmax increases, the actual number of partitions decreases inversely. It also closely matches the estimated number under all thresholds, although approximately 10% more partitions than estimated are generated when Θmax = 1GB.

In contrast, Figure 5.5b shows the average volume of data stored in the partitions, which exhibits an increasing trend as the value of Θmax grows. Moreover, the data volume of the resulting partitions agrees with the estimated volume when Θmax ≤ 8GB, but is about one quarter higher when Θmax = 16GB or Θmax = 32GB. The standard deviation of the data volume also grows linearly with Θmax. The trends of the average and the deviation are shown more clearly in Figure 5.5c.

In Figure 5.5c, we use the term “fullness” to compare the data volumes shown in Figure 5.5b. The fullness is calculated as the data volume of a partition divided by Θmax. In other words, the fullness indicates how close the partition is to reaching its maximum capacity. As can be seen, when Θmax ≤ 8GB, the average fullness of the partitions ranges between 60% and 70% once the 100GB of data is inserted. This result is coherent with our estimation using Equation 4.2 (page 73) that the average volume is 62.5% of Θmax. The fullness increases to 80% when Θmax ≥ 16GB. Moreover, the standard deviation of the fullness fluctuates between 10% and 20% for different values of Θmax. The results indicate that the data set is effectively segmented into a list of partitions of roughly equal sizes.

In addition, there is a trend that larger values of Θmax tend to result in greater fullness. This is because smaller upper bounds increase the number of partitions, which in turn increases the frequency of partition splitting and the difficulty of partition merging. The reasons are as follows. First, when a partition is split, the data volume of each resulting partition is only half of Θmax, i.e. the fullness is 50%. Second, one prerequisite for merging two partitions is that they must be located on the same set of nodes, and this prerequisite becomes more difficult to meet as the number of partitions increases, in which case the remaining sparse partitions pull down the average fullness.

A Fine-grained View of Partitioning

Figure 5.6 presents the data volume of every resulting partition in the six tests. Each subfigure represents one test, wherein the partitions are displayed in order of their key ranges. As can be seen, no partition size exceeds its corresponding Θmax, which means the split operation was successfully triggered and executed whenever a partition grew over Θmax. Moreover, the total size of any two adjacent partitions is above the corresponding Θmin, meaning that the merge operation was able to eliminate consecutive sparse partitions for a more balanced distribution of data over the partitions.

However, Test 1 exhibits one sparse partition that was not merged as expected. As shown in Figure 5.6a, the 40th partition is extremely small, and the sizes of both the 39th and the 41st partitions are well below Θmin. According to Equation 4.1 (on page 72), the 40th partition should have been merged with an adjacent partition.

Figure 5.6: A fine-grained view of partitioning 100GB data under different thresholds. The two horizontal lines in each subfigure represent the values of Θmax and Θmin. Panels: (a) Θmax = 1GB, Θmin = 0.5GB; (b) Θmax = 2GB, Θmin = 1GB; (c) Θmax = 4GB, Θmin = 2GB; (d) Θmax = 8GB, Θmin = 4GB; (e) Θmax = 16GB, Θmin = 8GB; (f) Θmax = 32GB, Θmin = 16GB. Each panel plots the data volume (GB) of each partition against the sequence number of the partitions.

We analysed the system log and realised that the nodes serving the 40th partition hosted neither the 39th nor the 41st partition. A replica reallocation task (for the sake of a merge) could not be executed either, because all the nodes were heavily loaded while consuming the 100GB of data in bulk load. As discussed, allowing a small number of sparse partitions is more efficient than merging them aggressively. Nevertheless, this is the only exception among the 177 partitions in Test 1, so the result still demonstrates the effectiveness of the merge operation.

In the remaining experiments, we adopted the setting of Test 2, i.e. Θmax = 2GB and Θmin = 1GB, for two reasons. First, Figure 5.5c shows that Test 2 yields the smallest percentage deviation in data volume, which can also be seen in Figure 5.6b, where there are very few spikes. Second, it generates approximately 80 partitions, which is a moderate number for a cluster of ten nodes.

5.4.3 Node Bootstrapping

This experiment demonstrates the effects of bootstrapping nodes one after another, within a relatively short time, in both ElasCass and Cassandra using split-move. It thus evaluates the efficiency of the proposed approach against the split-move approach for node bootstrapping.

Experimental Setup

In this experiment, both ElasCass and Apache Cassandra were initialised with one node. The 100GB of data generated by the YCSB client was loaded onto the first node, with

Θmax = 2GB and Θmin = 1GB in ElasCass, wherein the partitioning result is very close to that in Figure 5.6b (page 128). Since the replication number is configured as K = 2, when the second node was added, the 100GB of data was automatically and completely replicated to the second node. From two nodes onwards, one empty node was added at a time. The data was reallocated according to the different strategies of the two systems.

During the whole process of node bootstrapping, both KVSs were subject to a read-intensive background workload that followed the hotspot distribution (Table 5.7). We controlled the background workload by tuning the number of threads in the YCSB client, such that the CPU usage of the most loaded node was below 80%, while the average CPU usage of all the existing nodes fluctuated around 50%. Therefore, both systems were moderately loaded before a new node was added. Moreover, each time before a new node was initiated, we made sure that every existing node had been serving queries for at least 15 minutes as a normal member of the KVS, so that the data reallocation for the previously added node had settled. This workload also warmed up the popular replicas, so that it was possible for the KVSs to select the appropriate replicas according to their popularity.

Several metrics are measured in this experiment. First, the bootstrap time, which is defined as the time between the start of the KVS process and the point when the node is ready to serve queries. Second, the bootstrap volume, which is the volume of data acquired by a node at bootstrap. Ideally, the ith node should share 1/i of the total volume of data in the system (BalanceVolume). However, this is affected by the partitioning and placement strategies employed. Therefore, the imbalance index of data distribution across the nodes is also measured.

Figure 5.7: The volume of data transferred at bootstrap in different Key-Value Stores. (a) Bootstrap volume comparison; (b) data volume acquired by the new node.

Data Movement at Bootstrap

Based on the setup, both KVSs scaled out from two nodes to ten nodes. Figure 5.7a depicts the bootstrap volume, while Figure 5.7b shows the total volume of data acquired by the new node. In both figures, the x-axis represents the number of nodes that are already in the system before each bootstrap.

As shown in Figure 5.7a, from two nodes onwards, the bootstrap volume in ElasCass fluctuates between 5GB and 10GB at all scales. In contrast, the volume of data transferred with the split-move approach drops exponentially as the system scales out. Note that ElasCass uses a two-phase bootstrapping strategy, and it deliberately limits the data volume transferred in the pre-bootstrapping phase (Algorithm 1, page 82), so as to achieve fast bootstrapping. Cassandra, however, bootstraps a node in one step, so this result indicates that Cassandra does not balance the distribution of data; a fine-grained view of this will be presented in Figure 5.9a.

Figure 5.7b presents the total volume of data transferred at each bootstrap. The data volume transferred with the split-move approach is exactly the same as in Figure 5.7a, because a Cassandra node stops pulling data from its peers when it starts to serve queries. By comparison, a new node in ElasCass continues to take over more replicas during the post-bootstrapping phase, that is, after it begins to serve queries. Each new node uses Algorithm 2 (page 84) to determine the list of replicas to acquire. As can be seen, the total data volume transferred in this two-phase bootstrapping is roughly equal to BalanceVolume at each scale, which means that each new node in ElasCass obtained an equal share of the data.

Figure 5.8: The change in the number of partitions at bootstrap in ElasCass. (a) Number of partitions selected at bootstrap; (b) partition distribution.

Note that ElasCass maintains a balanced distribution of data by balancing the number of partition replicas on each node. Figure 5.8a depicts the number of partitions acquired by the bootstrapping node, and shows that the total number of replicas assigned to a new node matches the average number of replicas each node owns. Moreover, the change in the number of partitions for each node is also presented. As shown in Figure 5.8b, there are ten groups of columns, wherein the ith group consists of i columns, each representing the number of partitions on a node at the scale of i nodes. It can be seen that each new node (represented by the rightmost column in each group) not only acquired an average number of partitions for itself, but also attempted to bring the partition count of every other node down to the average, by taking over partition replicas from them. Overall, ElasCass balances the number of partitions across the nodes at all scales.
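To make the balancing goal concrete, here is a simplified, hypothetical sketch of how a joining node could plan its replica acquisitions. It only equalises replica counts, whereas the actual algorithm (Algorithm 2) also takes replica popularity into account, and all names in the sketch are invented for the example.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch: a new node pulls replicas until every node is near the average count.
    public final class BootstrapBalancer {

        /** Returns how many replicas the new node should take from each existing node. */
        static Map<String, Integer> plan(Map<String, Integer> replicasPerNode) {
            int nodesAfterJoin = replicasPerNode.size() + 1;
            int total = replicasPerNode.values().stream().mapToInt(Integer::intValue).sum();
            int average = (int) Math.ceil((double) total / nodesAfterJoin);

            Map<String, Integer> toMove = new HashMap<>();
            for (Map.Entry<String, Integer> e : replicasPerNode.entrySet()) {
                int surplus = e.getValue() - average;
                if (surplus > 0) {
                    toMove.put(e.getKey(), surplus);   // bring this node down to the average
                }
            }
            return toMove;
        }

        public static void main(String[] args) {
            Map<String, Integer> cluster = Map.of("n1", 40, "n2", 40, "n3", 40, "n4", 40);
            // With a fifth node joining, the average is 32, so 8 replicas move from each node.
            System.out.println(plan(cluster));
        }
    }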

Data Distribution across Nodes

In order to analyse the bootstrap volume in both KVSs (Figure 5.7), a finer-grained view of data distribution in each node is presented in Figure 5.9. Both of its sub-figures follow the display of Figure 5.8b. That is, each system scale is represented as a group of columns, each showing the data volume in each node. The columns are also displayed in the order of node joining time. Thus, the ith added node can be found at the ith column in each group.

Figure 5.9a depicts the data distribution after each node bootstrapping with the split-move approach. As shown, after a new node was added, the volume of data stored on the existing nodes did not change. For example, the second added node, represented by the second column in each group, has stored over 100GB of data on disk ever since it was bootstrapped. This means that, during the experiment, the data moved to the bootstrapping node was not deleted at the source node.

Figure 5.9: A fine-grained view of data distribution after node bootstrapping. (a) Data distribution under different system scales with the split-move approach; (b) data distribution under different system scales in ElasCass.

This design avoids performing SSTable compaction (Subsection 5.3.2 on page 112) when the system is subject to heavy workload. Note that Cassandra relies on compaction to remove obsolete data from the immutable SSTables.

However, the obsolete data retained on disk misled the subsequent new nodes. It is noticeable in Figure 5.9a that, from the 4th node onwards, the data volume acquired by each new node (shown as the rightmost column in each group) drops by half at each scale, from 80GB at the 4th node down to merely 1.25GB when the 10th node was bootstrapped. The reason is that, when a key range of data is moved to the new node, the persistent data is retained on (but no longer served by) the source node until the SSTables are re-compacted. However, since compaction is a heavyweight operation, it is rarely triggered when serving read-intensive workloads.

Figure 5.10: A summary of data distribution after node bootstrapping. (a) Imbalance index of data distribution; (b) data volume stored on disk at each scale.

As a result, during the experiment, all the new nodes chose the same node (i.e. the node with the most data on disk) as the data source. Even worse, each time the chosen node had to offer half of its remaining key range, which was itself halved each time a new node was added. This means that the load on the first three nodes was not reduced by the introduction of the five new nodes, thereby defeating the purpose of elastically scaling the KVS. This is backed by the CPU utilisation presented in Figure 5.15a on page 139.

By comparison, Figure 5.9b demonstrates that ElasCass achieved a much more balanced distribution of data through node bootstrapping. The data distribution in ElasCass follows exactly the same pattern as the distribution of partitions shown in Figure 5.8b. This is because the data volume in each partition of ElasCass has been confined into a bounded range, as shown in Figure 5.6b (page 128). In addition, unlike Cassandra, which had to retain the deleted data on disk, ElasCass was able to remove the unwanted replicas effortlessly. The reason is that each partition replica is stored in separate SSTables, which can be evicted from an ElasCass node as one unit, without any scanning or re-compaction.

Furthermore, Figure 5.10 summarises the data distribution. Figure 5.10a compares the imbalance indices of data distribution for ElasCass and the split-move approach. As described in Subsection 5.4.1, the imbalance index indicates how imbalanced the targeted metric is over a set of nodes; a lower index means better balance. The imbalance index of data distribution is calculated as the standard deviation divided by the average data volume, both of which are shown in Figures 5.9a and 5.9b.

As shown in Figure 5.10a, the imbalance index of ElasCass remains at a low level at all scales, which indicates that the data was evenly distributed in ElasCass.

Figure 5.11: Performance of node bootstrapping in different Key-Value Stores. (a) Bootstrap time; (b) data transfer bandwidth.

In contrast, with the split-move approach, the imbalance index soars to 1.0 within ten nodes, meaning that it produced a very uneven distribution of data. This is also backed up by Figure 5.9a: at the scale of ten nodes, there are three nodes each hosting 80GB of data or more, while the other five nodes serve less than 10GB of data on average.

In addition, Figure 5.10b shows the average volume of data stored on disk across all the nodes. The split-move approach occupies more storage because the source node that offers data at bootstrap tends to retain the invalid data on disk. By comparison, ElasCass is more efficient in storage space, since the partition replicas can be reallocated across nodes as a single transferable unit.

Bootstrap Time and Speed

Now that the reallocation of data at bootstrap has been discussed, Figure 5.11 presents the performance of node bootstrapping in terms of bootstrap time and data transfer bandwidth, where the bandwidth is calculated as the data volume transferred (i.e. Figure 5.7a, page 130) divided by the bootstrap time. As mentioned, from two nodes onwards, both ElasCass and Apache Cassandra were subject to a moderate workload when bootstrapping a new node. Hence, the data transfer bandwidth was throttled by each KVS.

Figure 5.11a shows the bootstrap times, on a logarithmic scale, for both KVSs with an increasing number of nodes. ElasCass experienced a slow start at the scale of two nodes, but thereafter the bootstrap time fluctuates between five and ten minutes under different scales. By comparison, Cassandra, following the split-move approach, spent more than two hours bootstrapping a node when there were four nodes or fewer in the system. From the scale of four nodes onwards, the bootstrap time of Cassandra drops exponentially as the number of nodes grows, from 130 minutes down to merely two minutes at the scale of nine nodes.

The result in Figure 5.11a can be explained by the bootstrap volume shown in Figure 5.7a. At all scales from two nodes onwards, ElasCass consistently transferred less than 10GB of data at bootstrap, as the remaining replicas were migrated in the post-bootstrapping phase. In contrast, with the split-move approach, the volume of data migrated decreases from over 80GB to merely 1GB. Specifically, from the scale of seven nodes onwards, the split-move approach did not migrate as much data as ElasCass did, thus requiring less time for bootstrapping. The penalty is that split-move suffered from a load imbalance issue when serving queries, as revealed in Subsection 5.4.4.

Moreover, Figure 5.11b depicts the data transfer bandwidth during node bootstrapping. Except for a slow start at two nodes, ElasCass managed to transfer the data at a speed fluctuating between 10MB/sec and 20MB/sec from three nodes onwards, while with the split-move approach the data transfer bandwidth is lower, varying between 5MB/sec and 15MB/sec. The reason is that, in ElasCass, the data objects can be transferred in the form of files using the Linux scp command. Conversely, the split-move approach requires the source node to scan its data set to prepare the list of data objects, which are then reassembled into data files on the destination node. This process is more heavyweight than the simple file transfer used in ElasCass.

5.4.4 Query Performance

This experiment compares the performance of query processing in ElasCass and Apache Cassandra using the split-move approach. As the two KVSs bootstrapped new nodes one after another (as described in Subsection 5.4.3), a set of query workloads was launched against the KVSs at each system scale. This subsection evaluates the query performance and load balancing of both KVSs. Note that the experimental results of Cassandra are labeled as split-move.

Experimental Setup

This experiment focuses on the improvement of workload throughput as the system scales out. In order to measure the throughput at a steady state, we set an upper bound of 100 milliseconds (ms) on the average read latency, which can be considered a service level objective (SLO) for certain web services.

Figure 5.12: The throughput of intensive writes with different distributions. (a) Throughput under zipfian distribution; (b) throughput under hotspot distribution.

Before each test, we tuned the number of threads in the YCSB client such that the average read latency was one step below this bound. Based on this latency, we tuned the operation count (i.e. the number of requests) so that each test lasted long enough (at least 1000 seconds). Therefore, in all the tests presented, the average read latency is slightly below 100 ms, and each run lasts at least 1000 seconds.

As mentioned in Subsection 5.4.1, two types of workloads, namely write-intensive and read-intensive, were generated using YCSB. In a write-intensive workload, the read/update ratio is 50/50, while in a read-intensive workload, 95% of the requests are reads. Moreover, each workload followed two different request distributions. In the zipfian distribution, the zipfian constant is set to 0.99, whilst in the hotspot distribution, 80% of the requests target 20% of the dataset.

Throughput at Different System Scales

Figure 5.12 depicts the throughputs of write-intensive queries in Cassandra (using split-move) and ElasCass against an increasing number of nodes. As can be seen in Figure 5.12a, when the system is subject to a write-intensive workload following the zipfian distribution, the throughput of Cassandra using split-move stops improving after the 5th node is added, while in ElasCass the throughput increases linearly with the number of nodes. When the workload following the hotspot distribution was applied, a very similar trend is shown in Figure 5.12b.

Moreover, Figure 5.13 shows the throughputs of read-intensive workloads for both KVSs. ElasCass continues to demonstrate better scalability than split-move when subject to read-intensive workloads under both the zipfian and hotspot distributions.

Figure 5.13: The throughput of intensive reads with different distributions. (a) Throughput under zipfian distribution; (b) throughput under hotspot distribution.

In addition, comparing Figure 5.13 with Figure 5.12, it can be seen that both systems achieve higher throughputs under write-intensive workloads than under read-intensive ones. This is because write operations are buffered in memory and written in batches, while read operations require random disk I/Os, and are thus bounded by the I/O performance.

It is clear that ElasCass outperformed Cassandra using split-move in terms of scalability by a large margin. The reason is the imbalanced distribution of data under the split-move approach (Figure 5.9a on page 132). Due to the lack of data moved to the new nodes, the split-move approach was not able to scale properly.

In practice, a compaction can be launched manually by system administrators before bootstrapping a node. Compaction will update the information about the data volume on a node, so that a new node can choose source nodes more appropriately. However, this evaluation was designed to demonstrate how the KVS behaves without human intervention. Cassandra using split-move was not able to complete a compaction during the evaluation, which makes it poorly suited to the scenario where new nodes are added one after another within a relatively short time. In contrast, ElasCass is able to reallocate any partition replica as one transferable unit, since the data files are built in a partition-specific manner. Hence, the data volume on each ElasCass node is updated automatically.

Furthermore, Figure 5.14 shows the update latency for both ElasCass and Apache Cassandra under write-intensive workloads. Note that the average read latency was tuned to be 100 ms to provide a steady state for the experiment, so it is not presented. As can be seen in Figure 5.14, the update latency fluctuates below 1.5 ms, which is two orders of magnitude smaller than the read latency. This result suggests that both ElasCass and Apache Cassandra are optimised for writes.

Figure 5.14: Update latency in write-intensive workloads under different distributions. (a) Update latency under zipfian distribution; (b) update latency under hotspot distribution.

Comparison on Load Balancing

Figure 5.15 presents a finer-grained view of the CPU utilisation of each node during query processing. The layout of the two subfigures is consistent with Figure 5.9 (page 132). The workloads used in this figure are read-intensive queries following the zipfian distribution. The CPU utilisations when serving the other workloads show a similar pattern, so they are not displayed, to avoid repetition.

Figure 5.15a shows the CPU utilisation of each node in Cassandra using split-move. As can be seen, at each scale there are at most three nodes with a CPU utilisation over 50%, while the other nodes are under-utilised, with a CPU usage below 20%. This unveils the reason why the throughput with the split-move approach stops improving in Figures 5.12 and 5.13: the majority of the workload was shared by only three nodes, while the other nodes remained idle most of the time.

It is also worth mentioning that the CPU utilisations of the first two nodes (represented by the first two columns) drop drastically, from over 70% at the scale of three nodes to below 10% from six nodes onwards. Apparently, the workloads were taken over by the 3rd, 4th and 5th nodes, which consistently exhibit a high CPU utilisation at different scales. It can be inferred that the first two nodes had offered their data to all the other nodes. This is backed by a comparison between Figure 5.15a and Figure 5.9a, which shows that these three nodes hosted a total data volume of over 170GB, i.e. 85% of the 100GB of data with the replication number K = 2. Conversely, while the first two nodes stored over 100GB each, their CPU utilisations were very low during query processing.

In contrast, Figure 5.15b depicts the CPU utilisation of each ElasCass node at different scales. As shown, every node in ElasCass has a CPU utilisation over 50%, with the average CPU usage being above 70%.

Figure 5.15: A fine-grained view of CPU utilisations under read-intensive workloads. (a) CPU utilisation of each node with the split-move approach; (b) CPU utilisation of each node in ElasCass.

This result indicates that all the nodes are fully utilised in ElasCass. It also explains why ElasCass exhibited higher scalability than Cassandra in Figures 5.12 and 5.13, since every node can devote the majority of its computing capacity to query processing. Moreover, the standard deviation of the CPU utilisations fluctuates constantly at around 10%, meaning that the workloads are well balanced in ElasCass at different scales.

Furthermore, a summary of the CPU utilisations is presented to compare the overall load-balancing performance of both KVSs. Figures 5.16a and 5.16b show the average CPU usage of all the nodes during the tests under the read-intensive workloads, with the zipfian and hotspot distributions respectively. The results for the write-intensive workloads show a similar trend, so they are not presented, to avoid repetition. As seen, the average CPU usage of ElasCass remains above 70% under both workload distributions. However, with the split-move approach, the CPU usage declines gradually as the system scales out. These results indicate that ElasCass is able to fully utilise the provisioned computing resources at different scales for serving queries, while in Cassandra the newly added nodes were not efficiently incorporated into query processing.

Figure 5.16: Average CPU utilisations when serving read-intensive workloads at varied scales. (a) Workloads under zipfian distribution; (b) workloads under hotspot distribution.

Figure 5.17: Imbalance index of CPU utilisation when serving read-intensive workloads. (a) Workloads under zipfian distribution; (b) workloads under hotspot distribution.

Figure 5.17 presents the imbalance index of the CPU usage. With the split-move approach, the imbalance index climbs as the system scales. In both subfigures, from the scale of eight nodes onwards, the index even goes beyond 1.0, which means that the standard deviation of the CPU usage is greater than the average usage. The results indicate that some nodes are heavily loaded while the others remain idle; the workload was not balanced with split-move. In contrast, this index in ElasCass remains below 0.2 in all the tests, indicating that the workload is well balanced in ElasCass.

5.4.5 Discussion

Three experiments have been presented in this section. Subsection 5.4.2 demonstrates that the data is evenly distributed across the partitions under the proposed automated partitioning. It can also be inferred that different settings of Θmax (and Θmin) affect the system’s performance. If Θmax is too small, partitions are split very frequently, which increases the overhead of building replicas. In addition, a small Θmax results in a large number of partitions, which increases the complexity of partition reallocation. On the other hand, if Θmax is too large, the resulting partitions will contain a large volume of data. Moving a large partition replica may end up overwhelming the node that takes it over. Moreover, it takes a substantially long time to reallocate a large replica, which is inefficient.

Next, Subsection 5.4.3 presents the evaluation of bootstrapping nodes one after another. The experimental results demonstrate that, as the system scales, ElasCass outperforms the split-move approach in node bootstrapping in several ways. First, it manages to bootstrap an empty node within a relatively short time (i.e. ten minutes), by limiting the number of partition replicas transferred during pre-bootstrapping. Second, it distributes the data more evenly amongst the nodes in the post-bootstrapping phase, as it partitions the key space at a finer grain. Third, it occupies less storage, since it can evict any partition replica as one unit. Last but not least, it exhibits a faster data transfer speed at bootstrap, because the data objects can be reallocated across nodes in the form of integrated files.

Furthermore, Subsection 5.4.4 compares the query performance of ElasCass and Cassandra using split-move. The experimental evaluation demonstrates that, owing to the balanced distribution of data achieved via node bootstrapping, ElasCass exhibits a well-balanced load across the nodes, full utilisation of the computing resources, and good scalability as the system scales. In contrast, Cassandra using split-move suffered from a polarised distribution of data, and ended up serving queries with a small group of nodes, resulting in a bounded throughput and low resource utilisation.

Finally, we also discuss the limitation of this set of experiments: the replica placement algorithms for node departure (i.e. decommissioning or failure) were not evaluated. This is because node decommissioning is treated as the reverse of node bootstrapping, and is executed only when there are redundant resources. Compared to node bootstrapping under substantial workloads, data reallocation with idle resources is relatively trivial in node decommissioning. Moreover, nodes that fail unexpectedly are usually re-bootstrapped or replaced by new empty nodes provisioned from the IaaS Cloud, which is essentially node bootstrapping as evaluated in Subsection 5.4.3.

5.5 Chapter Summary

This chapter follows the design of a set of data distribution schemes for improving the efficiency of elasticity in decentralised shared-nothing KVSs. It falls into two parts. Firstly, it has discussed the implementation of the proposed schemes to present ElasCass, i.e. Elastic Cassandra, which is built on top of Cassandra (Apache, 2009), a system that follows the split-move strategy. To realise the proposed schemes, three core functionalities have been implemented. To begin with, a token management component is devised to maintain a dynamic partition-node binding for automated partitioning and flexible replica reallocation. Then, it is followed by a data storage component, which consolidates the data files into standalone transferable partition replicas. Finally, a replica reallocation component is implemented to execute the set of replica placement algorithms proposed for a better balanced workload and distribution of data.

Secondly, this chapter has presented the experimental evaluations of ElasCass against Cassandra using the split-move approach. The evaluations were conducted on the Amazon EC2 public IaaS, using YCSB as the benchmark tool. There were three sets of experiments, focusing on data partitioning, node bootstrapping, and query performance, respectively. These experiments have demonstrated that ElasCass was capable of incorporating new empty nodes consecutively within a relatively short time, with an ideally balanced data distribution and a much better balanced workload than Cassandra. As a result, ElasCass exhibited better resource utilisation in computing and storage, and outperformed Cassandra in scalability by a large margin under the biased workloads. We also demonstrated the capability of our automated partitioning scheme to confine each partition into a bounded size, without data skew or sparse partitions.

The positive results from these experiments encourage further exploration into enabling elasticity in KVSs. The next chapter investigates how the requirement of elasticity, i.e. allowing dynamic node addition and removal at runtime, impacts data durability in KVSs. As discussed in Subsection 3.2.2, data durability is affected by the strategy of data placement. The placement algorithms adopted by ElasCass essentially fall into the category of random replication, which may result in a higher probability of data loss in the face of simultaneous failures of multiple nodes (Cidon et al., 2013). Hence, the next chapter carries forward the automated partitioning and lightweight migration schemes of ElasCass, and presents a novel replica placement scheme called ElasticCopyset, to improve data durability over the random replication strategy, while still maintaining the efficiency of elasticity.

Chapter 6

Replica Placement for High Durability and Elasticity

The ninety miles is only half of a hundred-mile journey – the going is toughest towards the end of a journey.

– Zhan Guo Ce (戰國策)

The previous two chapters presented the design of a set of data distribution schemes that have been implemented as ElasCass, an elastic KVS that efficiently deals with node addition and removal on demand. This chapter presents a novel replica placement scheme called ElasticCopyset, which aims to minimise the probability of data loss in the face of simultaneous failures of multiple nodes, while still allowing the dynamic node changes required for the elasticity already established in ElasCass.

This chapter begins with the motivation for data durability in the face of multiple node failures, and then revisits the concept of the copyset and its use in reducing the probability of data loss. Next, it provides mathematical definitions for the problem of data durability. Based on the problem definition, it presents the design of ElasticCopyset and its experimental evaluation. Finally, the chapter concludes.

6.1 Introduction

Distributed Key-Value Stores (KVSs) (Chang et al. 2006, DeCandia et al. 2007, Cooper et al. 2008) have become a standard component for data management for applications

deployed on the IaaS Cloud. This thesis focuses on the elasticity of KVSs, which requires the nodes of a KVS to be efficiently added or removed at runtime.

Data in KVSs is organised into storage units, variously called chunks, tablets, partitions, or virtual nodes, which are replicated to improve fault tolerance and performance. A fundamental design choice in KVSs is the placement scheme for distributing the replicas of data storage units across the nodes. The previous two chapters presented a set of replica placement algorithms that achieve a well-balanced workload and data distribution under node addition and removal.

One characteristic of these placement algorithms is that the replicas of a data storage unit (i.e. a partition in ElasCass) can be assigned to any node in the KVS. That is to say, the mappings between replicas and nodes are randomised. As introduced in Subsection 3.2.2, such a placement strategy is termed random replication. This strategy provides sufficient flexibility for a KVS to distribute replicas across nodes, and thus benefits the dynamic addition and removal of nodes. Therefore, random replication has been adopted by many KVSs, including Bigtable (Chang et al., 2006), Dynamo (DeCandia et al., 2007), and Cassandra (Lakshman & Malik, 2010).

However, Cidon et al. (2013) demonstrated that, once a system scales beyond hundreds of nodes, random replication is nearly guaranteed to cause data loss when multiple nodes fail simultaneously. The remainder of this section discusses the motivation for protecting a KVS from simultaneous node failures, and then reviews the current replica placement strategies that reduce the probability of data loss under simultaneous node failures.

Simultaneous Node Failures on the Cloud

The IaaS Cloud typically provisions computing resources in the form of virtual machines (VMs), which are simulated by a hypervisor (Popek & Goldberg, 1974), such as Xen (Dragovic et al., 2003) or KVM (Kivity et al., 2007), running on a real host machine. In this way, each VM runs as a guest process within the host machine, and each host machine supports multiple VM processes.

With the growing adoption of Cloud computing, IaaS Cloud platforms are usually built from hundreds of thousands of inexpensive, low-end commodity components. At such a large scale, the failure of a single hardware component at any time becomes the norm rather than the exception (Ghemawat et al., 2003). The failure of a single hardware component can lead to the malfunctioning or even failure of multiple processes (e.g. VMs) running on the corresponding physical machine (Vishwanath & Nagappan, 2010).

Therefore, hardware failure in the IaaS Cloud can cause the simultaneous failure of multiple VMs. For example, Raphael (2013) presented a list of outage events at public Cloud providers, including Amazon, Microsoft, and Google, over a period of merely six months. Worse still, as documented in practice by Yahoo! (Shvachko et al., 2010) and Facebook (Borthakur et al., 2011), in the event of multiple node failures, a non-negligible percentage of machines (0.5%–1%) cannot be restored even after the failure is recovered (Cidon et al., 2013).

This poses a threat to the data durability of KVSs deployed on the IaaS Cloud. When a non-negligible percentage of KVS nodes (each running on one VM) fail simultaneously, and all the replicas of a certain storage unit are located on nodes that have failed in the event, that storage unit of data becomes unavailable. Hence, there is a need for a replica placement scheme that focuses on reducing the probability of data loss. A further discussion is presented below.

Replica Placement Strategies for Data Durability

As discussed in Subsection 3.2.2 (page 61), previous research efforts (Chun et al. 2006, Cidon et al. 2013) have concluded that the random replication strategy, which randomly distributes the replicas of data across all the nodes in a KVS, results in a high probability of data loss under multiple node failures. This will be explained in Subsection 6.2.2 using formal definitions in mathematical terms.

Instead of using random replication, Cidon et al. (2013) proposed to confine the placement of replicas to certain predefined sets of nodes, called “copysets”. That is, each storage unit of data is assigned to a copyset of nodes, each node storing one replica of this storage unit. In this way, the data stays available unless all the nodes in its copyset have failed. This copyset-based placement scheme builds on the insight that minimising the number of copysets minimises the probability of data loss. As demonstrated in Figure 6.1, given the same percentage of nodes that fail, the number of copysets dominates the probability of data loss. This conclusion is also backed by Subsection 6.2.2 with a proof in mathematical terms.

[Figure 6.1 plot: probability of data loss versus number of copysets (0 to 50 million), for 5000 nodes, 3 replicas per storage unit, and 1% of nodes failing.]

Figure 6.1: Probability of data loss versus number of copysets. Given the number of copysets required at each depicted point, distinct copysets were generated, based on the permutation described in Copyset Replication (Cidon et al., 2013). The probability of data loss was calculated from simulating 10000 occurrences of simultaneous node failures, as depicted in Section 6.4.

This copyset-based placement scheme embodies two trade-offs regarding the number of copysets. First, when one of the nodes in a copyset fails, the replacement node can only recover data from the remaining nodes in the same copyset. Thus, the speed of recovery is determined by the total number of nodes that share copysets with the failed node. Hence, a trade-off needs to be made between a lower probability of data loss (by using fewer copysets) and faster recovery (by allocating each node to more copysets). Second, there is a trade-off involving the frequency and magnitude of data loss. In copyset-based schemes, data loss occurs only when all the nodes from the same copyset fail simultaneously (i.e. a copyset failure). The average amount of data lost in a single copyset failure is equal to the total amount of data divided by the number of copysets. That is, a smaller number of copysets results in a larger amount of data lost in each failure event.

To deal with these two trade-offs, Cidon et al. (2013) introduced the concept of scatter width, which is defined as the number of nodes that store copies of each node’s data. Using a high scatter width means that each node shares data with more nodes, which requires more copysets to be created. Thus, it speeds up incorporating or recovering a node, but creates more opportunities for data loss under simultaneous node failures. In contrast, using a low scatter width limits the number of nodes that each node can share data with, thus requiring fewer copysets to be created. It reduces the probability that all the replicas of some data are lost when multiple nodes fail, at the cost of slower recovery from independent node failures.

System administrators can make these trade-offs by setting the preferred value of scatter width. Cidon et al. (2013) proposed a permutation algorithm that forms a minimised number of copysets based on a given value of scatter width. However, current-state copyset implementations (Cidon et al., 2012, 2013) do not consider the requirement of dynamic node addition and removal at runtime. This issue is discussed as follows.

Elasticity Requirement

The aim of this chapter is to carry forward the benefit of elasticity that has been established in the previous two chapters. However, the random replication strategy adopted there does not provide high data durability. In contrast, the current-state copyset-based placement scheme is able to minimise the probability of data loss at simultaneous node failures, but is not adaptive to dynamic node changes, as discussed below.

In the current implementations of copyset-based schemes (Cidon et al., 2012, 2013), when a new node is added to the system, it is not added to any existing copyset. Instead, a number of new copysets are created for this new node. The new copysets are populated with randomly selected existing nodes. Moreover, when an existing node is removed (or fails), its vacancies in the corresponding copysets are filled by an existing node that is also randomly selected. Therefore, node addition increases the total number of copysets, while node removal does not reduce the copyset number. As will be demonstrated in Figure 6.17 in Section 6.4, when the system is required to scale dynamically based on workload changes, the total number of copysets accumulates at each node addition, and eventually becomes large enough that data durability degrades severely.

Hence, there is a lack of a placement scheme that simultaneously maintains a minimised probability of data loss at multiple node failures, while allowing nodes to be dynamically added or removed at runtime. This chapter presents ElasticCopyset, a novel placement scheme that fills this gap. The next section will provide definitions to formulate the problem of data loss at multiple node failures, while the design of ElasticCopyset will be elaborated in Section 6.3.

Table 6.1: Notational Conventions

Notation      Description
R             The replication number, i.e., the number of replicas of each storage unit
S             The scatter width
Pran          The probability of data loss using random replication
Pcs           The probability of data loss using copyset replication
N             The number of nodes in the system
F             The number of nodes failed in a failure event
NG            The number of nodes in each group
NE            The number of extra nodes that cannot form a complete group
Nsu           The number of storage units in the database
Ncs           The number of copysets in the system
C             The number of columns in the shuffle matrix
L             The number of rows in the shuffle matrix
a, b, c       The buckets of nodes to shuffle
ma, mb, mc    The matrices of nodes after the shuffle
x, y          The x, y coordinates in the shuffled matrix
i, j, k, n    Non-negative integers
r, s, t       Integers

6.2 Problem Definition

6.2.1 Parameter Definitions

We focus on the issue of data placement in distributed KVSs across a cluster of N data nodes. Typically, the data is horizontally partitioned, and stored as consolidated replicas. In this chapter, we define the storage unit as the basic unit of replication, which can refer to a key-value pair, a tablet or a data chunk in different data stores. Each storage unit is stored on a set of distinct data nodes, which is called the copyset for this storage unit. The notational conventions used in this chapter are summarised in Table 6.1.

The number of distinct nodes in the copyset is defined by the replication number R, while the number of nodes that store copies of each node’s data is defined as the scatter width S (Cidon et al., 2013). Figure 6.2 presents an example. As shown, the storage unit E is replicated to the nodes {0, 3, 6}, so the copyset for E is {0, 3, 6}, wherein the replication number is R = 3, since there are three nodes in total. In contrast, Node 0 shares its data with six other nodes, namely the nodes {1, 2, 3, 4, 5, 6}, hence the scatter width for Node 0 is S = 6.
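To make the definition concrete, the short sketch below computes each node's scatter width from a set of copysets; the three copysets listed are hypothetical and are chosen only to be consistent with the figure (the text above specifies only the copyset {0, 3, 6}).

from collections import defaultdict

def scatter_widths(copysets):
    # The scatter width of a node is the number of distinct other nodes
    # that appear with it in at least one copyset.
    sharing = defaultdict(set)
    for cs in copysets:
        for node in cs:
            sharing[node] |= cs - {node}
    return {node: len(peers) for node, peers in sharing.items()}

# Hypothetical copysets of Node 0, consistent with R = 3 and S = 6 in Figure 6.2.
example = [{0, 1, 2}, {0, 3, 6}, {0, 4, 5}]
print(scatter_widths(example)[0])   # 6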

To reduce complexity, we use a unified replication number and scatter width in the system. That is, every storage unit is stored by R distinct nodes, which form one copyset, and every node shares its data with S other nodes. Therefore, it can be inferred that each node shares copysets with at least S other distinct nodes. Since each node appears with (R − 1) other nodes in each copyset, each node should be assigned to at least S/(R − 1) copysets. Thus, S ≥ (R − 1). The scatter width is determined by the system administrators, and is usually a multiple of (R − 1).

[Figure 6.2 diagram: storage units A–H replicated across Nodes 0–6.]

Figure 6.2: An illustration of scatter width S versus replication number R. In this example, each storage unit is replicated to R = 3 different nodes, while each node shares its data with S = 6 other nodes in the system.

6.2.2 Relationship between Data Loss and Copysets

Probability of Data Loss

In a failure event, F nodes fail simultaneously. When F > R, there is a chance that all the R replicas of some storage unit will be lost in this failure event. The way that the copysets are formed can affect the probability of data loss substantially, as discussed below.

In random replication, for each storage unit, R nodes are randomly chosen from all the nodes in a KVS to form a copyset for this storage unit. As discussed by Cidon et al. (2012), the probability of losing one single storage unit is \binom{F}{R} / \binom{N}{R}, wherein \binom{N}{R} and \binom{F}{R} denote the number of ways of picking R unordered nodes out of N and F nodes, respectively. Therefore, the probability of losing at least one storage unit in the scenario of F simultaneous failures is given by Equation 6.1, wherein Nsu is the number of storage units.

P_{ran} = 1 - \left(1 - \frac{\binom{F}{R}}{\binom{N}{R}}\right)^{N_{su}} \qquad (6.1)

In contrast, in a copyset-based scheme, the copysets are generated independently of the data, and each copyset of nodes hosts multiple storage units.

Therefore, the number of copysets, denoted as Ncs, is less than the number of storage units, i.e., Ncs < Nsu. Similarly, given Ncs copysets, the probability of incurring any data loss (i.e., losing all the nodes in a copyset) is given by Equation 6.2.

P_{cs} = 1 - \left(1 - \frac{\binom{F}{R}}{\binom{N}{R}}\right)^{N_{cs}} \qquad (6.2)

Two conclusions can be deduced from Equation 6.2. Firstly, the copyset-based scheme gives a smaller probability of data loss than random replication, i.e., Pcs < Pran, which is due to Ncs < Nsu. Secondly, given the same R, N and F, Pcs is an increasing function of Ncs. It means that reducing the number of copysets leads to a decrease in the probability of data loss. This conclusion is consistent with Figure 6.1.
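As an illustration of Equations 6.1 and 6.2, the sketch below evaluates both expressions with Python's math.comb; the storage-unit and copyset counts are illustrative assumptions, chosen only to be of the same order as the Figure 6.1 settings.

from math import comb

def loss_probability(N, R, F, num_copysets):
    # Probability that at least one of num_copysets copysets has all R of its
    # nodes among the F failed nodes (Equations 6.1 and 6.2).
    p_single = comb(F, R) / comb(N, R)
    return 1 - (1 - p_single) ** num_copysets

# 5000 nodes, R = 3, and 1% of the nodes (F = 50) failing simultaneously.
print(loss_probability(5000, 3, 50, num_copysets=10_000_000))  # ~1.0 (random replication)
print(loss_probability(5000, 3, 50, num_copysets=10_000))      # ~0.009 (copyset-based)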

Minimum Number of Copysets

In a copyset-based scheme, generating a minimum number of copysets can lead to a minimised probability of data loss. Note that each node should be assigned to at least S/(R − 1) copysets. Theoretically, for a cluster of N nodes, the minimum number of copysets is given by Equation 6.3, wherein \frac{S}{R-1} N is divided by R because each copyset is counted R times repeatedly. Therefore, adding (or removing) a node requires at least \frac{S}{(R-1)R} copysets to be created (or dismissed).

MinCopysets(N) = \frac{S}{R-1} \cdot \frac{N}{R} \qquad (6.3)
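As a worked example under the settings used later in the evaluation (N = 5000 nodes, S = 10, R = 3), Equation 6.3 gives:

MinCopysets(5000) = \frac{10}{3-1} \cdot \frac{5000}{3} \approx 8333, \qquad \frac{S}{(R-1)R} = \frac{10}{6} \approx 1.7

that is, roughly 8333 copysets in total, and fewer than two copysets created or dismissed per node change.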

The challenge of achieving the minimal copysets lies in ensuring that any two copysets overlap by at most one node. As defined in Equation 6.3, each node (e.g. ni) is assigned to exactly S/(R − 1) copysets, which contain at most S distinct nodes other than ni. Amongst these S nodes, if there exists a duplicate node, then ni has only (S − 1) other distinct nodes to share its data with. In this case, one more copyset is created for ni to fulfil the requirement that it should share with S other distinct nodes. Then, the number of copysets will exceed the minimum number.

Moreover, while using a smaller scatter width S always leads to a lower probability of data loss, there are trade-offs to consider (Cidon et al., 2013). First, using a higher S reduces the average amount of data loss in a single failure event. That is, it trades off the frequency against the magnitude of data loss. Second, a higher S also means that a failed node can be recovered by more nodes. That is, this trade-off is between the loss frequency and the speed of recovery.

In this chapter, we focus on designing a copyset-based scheme that minimises the number of copysets. Moreover, the elasticity characteristic of IaaS Cloud means that nodes are dynamically added or removed from the system. The minimum number of copysets should be maintained under dynamic node changes.

6.3 Design of ElasticCopyset

This section describes the design of ElasticCopyset, a replication scheme that efficiently creates and dismisses copysets as nodes are added or removed dynamically, such that the number of copysets remains close to the minimum (Equation 6.3), for high data durability against multiple node failures.

6.3.1 The Intuition

There are three properties that form the basis for an elastic, minimal copyset scheme. First, given R and S defined in Subsection 6.2.1, each copyset contains R nodes, and each node belongs to at least S/(R − 1) copysets, in which there are at least S other distinct nodes. Second, the addition or removal of a node affects only the S other nodes in the S/(R − 1) copysets it belongs to. Last but most important, any two copysets overlap by at most one node, as the copyset-based scheme prefers more distinct nodes in fewer copysets.

To achieve these properties, we propose that the nodes be split into groups, and that each node form copysets only with nodes in the same group. There are advantages to isolating nodes into groups. First, as the number of nodes changes, the number of copysets can be changed by forming or dismissing a group of nodes accordingly. Second, the effect of adding or removing a node is restricted to one group, rather than the entire system.

As shown in Figure 6.3, a cluster of N nodes is divided into a list of groups, each having the same number of nodes, denoted as NG. Thus, every NG nodes form one complete group, wherein the minimum number of non-overlapping copysets is generated using the shuffle algorithm presented in Subsection 6.3.2.

[Figure 6.3 diagram: a cluster of N nodes is split into complete groups of NG nodes each, plus an incomplete group containing the remaining NE nodes; each group is shuffled to form a number of copysets.]

Figure 6.3: Dividing a cluster of nodes into complete and incomplete groups

However, the number of nodes N can be dynamic, and is usually not a multiple of NG. There are extra nodes, NE = (N mod NG), wherein 0 ≤ NE < NG. The NE nodes are insufficient to form a new complete group. When NE ≠ 0, ElasticCopyset randomly selects (NG − NE) other nodes that are already in a complete group, to compensate the NE nodes, so that one extra group of NG nodes can be formed. This group is called the incomplete group, since it contains nodes selected from other complete copyset groups. ElasticCopyset uses the same shuffle algorithm to generate copysets within the incomplete group.

Hence, if we group every NG nodes for a cluster of N nodes, there are ⌊N/NG⌋ complete groups. One incomplete group is created if there are extra nodes that cannot form a complete group, i.e. NE = (N mod NG) > 0. Hence, there are ⌈N/NG⌉ groups in total. Note that ⌊N/NG⌋ and ⌈N/NG⌉ represent the floor and ceiling values of N/NG, respectively.

In our shuffle algorithm, the value of NG is defined using a theorem described in Subsection 6.3.3. Moreover, a replacement algorithm that handles node addition and removal in these groups is presented in Subsection 6.3.4.

6.3.2 Copyset Generation within a Group

ElasticCopyset generates copysets with a shuffle algorithm that uses three distinct shuffle orders. It requires that the replication number R = 3, which is the default value for many distributed data stores (Chang et al. 2008, Lakshman & Malik 2010, Ousterhout et al. 2010, Shvachko et al. 2010). Cidon et al. (2013) evaluated the effect of varying the replication number on data durability and system performance, and reported that “(i.e. in random placement) increasing R = 3 to 4 does not provide sufficient durability, while using R = 5 or more significantly hurts the system’s performance and almost doubles the cost of storage”. Hence, we use R = 3.

Figure 6.4 illustrates the shuffle algorithm for generating copysets for one group of nodes. It consists of four steps: divide, replicate, shuffle and merge. First of all, the nodes are equally divided into R = 3 buckets, denoted as Buckets a, b, and c, respectively.

Therefore, it requires that NG be a multiple of R. Second, each bucket is transformed into a matrix, wherein each node is replicated S/(R − 1) times, because each node should appear in at least S/(R − 1) copysets. Third, the resulting matrices are shuffled in three different orders, namely Order 1, 2 and 3, respectively. Finally, given the location of a node in the matrix (i.e. the x-th row, the y-th column), each shuffled matrix offers one node located at the same position (x, y). These three nodes are merged to form one copyset.

Now that the intuition of the shuffle algorithm has been presented, we provide the formal definitions for the parameters used, so that we can discuss how to achieve a minimum number of copysets. To begin with, the number of nodes in each bucket is denoted as C. Since each group is equally divided into R = 3 buckets, C = NG/R. The i-th node in each of the three buckets is denoted as a[i], b[i], and c[i], respectively, where 0 ≤ i < C.

Next, the number of rows in the matrix is denoted as L, while the number of columns is equal to the number of nodes in a bucket, i.e. C. As defined in Equation 6.4, L = S/(R − 1), because each node is replicated S/(R − 1) times in the shuffle algorithm. In addition, the three matrices after the shuffle are denoted as ma, mb, and mc, respectively. The node located at the x-th row, y-th column is denoted as m[x][y], wherein 0 ≤ x < L and 0 ≤ y < C.

C = \frac{N_G}{R}, \qquad L = \frac{S}{R-1} \qquad (6.4)

[Figure 6.4 diagram: one group of 15 nodes is divided into 3 buckets; each bucket is replicated S/(R − 1) = 4 times into a 4 × 5 matrix; the three matrices are shuffled in Orders 1, 2 and 3; and every three nodes at the same position are merged into a copyset.]

Figure 6.4: Create a minimum group of non-overlapping copysets by shuffling. In this example, there are NG = 15 nodes, divided into R = 3 buckets: {a1, a2, a3, a4, a5}, {b1, b2, b3, b4, b5} and {c1, c2, c3, c4, c5}. The scatter width S = 8, thus S/(R − 1) = 4. Hence, the three buckets are replicated into three 4 × 5 matrices, which are shuffled to form non-overlapping copysets.

Moreover, Figure 6.4 shows that the three matrices are shuffled in three different orders. In the following, we describe how to place the nodes from each bucket into the corresponding shuffled matrix. For each node, k is an integer that increases from 0 to L − 1, inclusive.

• Order 1. Left-to-right, then up-to-down. That is, the nodes are placed in order horizontally, in each and every row of the matrix. In mathematical terms, each a[i] is placed at all positions ma[x][y], where x = k and y = i.

• Order 2. Up-to-down, then left-to-right. That is, the nodes are placed in order vertically, from the leftmost column to the rightmost column. In mathematical terms, each b[i] is placed at all positions mb[x][y], where x = (i + k · C) mod L and y = ⌊(i + k · C)/L⌋.

• Order 3. Up-to-down, then right-to-left. That is, the nodes are placed in order vertically, from the rightmost column to the leftmost column. In mathematical terms, each c[i] is placed at all positions mc[x][y], where x = (i + k · C) mod L and y = C − 1 − ⌊(i + k · C)/L⌋.

Finally, each copyset is formed given a pair (x, y), wherein 0 ≤ x < L and 0 ≤ y < C. That is, every three nodes {ma[x][y], mb[x][y], mc[x][y]} form a copyset. For example, as shown in Figure 6.4, given (x = 2, y = 2), the nodes {a3, b1, c1} form a copyset, whilst given (x = 0, y = 3), another copyset {a4, b3, c5} is generated. Note that the three matrices are of the same size. Hence, the number of copysets generated in one group is equal to the number of positions in the matrix, i.e., L · C. The copyset number in a group, denoted as CopysetsInGroup(NG), is given by Equation 6.5 as a function of NG.

CopysetsInGroup(N_G) = L \cdot C = \frac{S}{R-1} \cdot \frac{N_G}{R} \qquad (6.5)
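The following is a minimal Python sketch of the divide–replicate–shuffle–merge procedure described above (shuffle_copysets is a hypothetical name; the sketch assumes R = 3, that NG is a multiple of R, and that S is a multiple of R − 1).

def shuffle_copysets(group, S, R=3):
    # group: a list of NG node identifiers forming one group.
    NG = len(group)
    assert NG % R == 0 and S % (R - 1) == 0
    C = NG // R               # nodes per bucket = columns of each matrix
    L = S // (R - 1)          # replicas of each node = rows of each matrix
    a, b, c = group[:C], group[C:2 * C], group[2 * C:]
    ma = [[None] * C for _ in range(L)]
    mb = [[None] * C for _ in range(L)]
    mc = [[None] * C for _ in range(L)]
    for i in range(C):
        for k in range(L):
            ma[k][i] = a[i]                              # Order 1: row by row, left to right
            x, y = (i + k * C) % L, (i + k * C) // L
            mb[x][y] = b[i]                              # Order 2: column by column, left to right
            mc[x][C - 1 - y] = c[i]                      # Order 3: column by column, right to left
    # Merge: the three nodes at the same (x, y) position form one copyset.
    return [{ma[x][y], mb[x][y], mc[x][y]} for x in range(L) for y in range(C)]

For the group of Figure 6.4 (15 nodes, S = 8), this sketch yields the 20 copysets shown in the figure, e.g. {a3, b1, c1} at position (2, 2).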

By comparing Equation 6.5 with Equation 6.3, it can be deduced that this shuffle algorithm generates the minimum number of copysets for a group of NG nodes. As discussed, for a cluster of N nodes, there are ⌈N/NG⌉ groups in total. Therefore, the total number of copysets generated by ElasticCopyset, TotalCopysets(N, NG), is given by Equation 6.6 as a function of N and NG.

TotalCopysets(N, N_G) = \frac{S}{R-1} \cdot \frac{N_G}{R} \cdot \left\lceil \frac{N}{N_G} \right\rceil \qquad (6.6)

Compared to Equation 6.3, ElasticCopyset generates more copysets than the minimum. The difference (i.e. the number of extra copysets) is given by Equation 6.7, which is linear with S and NG.

ExtraCopysets(N, N_G) = \frac{S}{R-1} \cdot \frac{N_G - N_E}{R} \qquad (6.7)

6.3.3 A Theorem for the Minimum Group

The key design decision in ElasticCopyset is to determine the minimum number of nodes required to form a group. As shown in Equation 6.7, using a smaller NG reduces the number of extra copysets, thus producing close to the minimum number of copysets required, which reduces the probability of incurring data loss under simultaneous failures. However, it is also constrained by the requirement that each node share copysets with S other distinct nodes. ElasticCopyset divides the nodes into a list of groups. In each group, each node is shuffled into S/(R − 1) copysets, in which there are exactly S nodes other than this node. Yet, it is uncertain whether these S nodes are mutually distinct. Hence, our task is to find the minimum value of NG such that there are exactly S other distinct nodes in the S/(R − 1) copysets to which each node belongs. Such an NG is given by Theorem 1.

Theorem 1. Given the scatter width S, the replication number R, and a group of NG = R · C nodes, let L = S/(R − 1). When C is the smallest odd number that is greater than L, the shuffle algorithm creates S/(R − 1) copysets for each given node, in which it shares data with exactly S other distinct nodes.

In the following, we prove Theorem 1 by deduction. The discussion focuses on choosing the appropriate value of C, such that any two nodes from each bucket (i.e. a[i], b[j], or c[k], where 0 ≤ i, j, k < C), after being shuffled in each corresponding order described on page 156, do not share more than one copyset. Note that every three nodes {ma[x][y], mb[x][y], mc[x][y]} form a copyset, given the same pair (x, y), wherein 0 ≤ x < L and 0 ≤ y < C.

Possible Values of C for Non-overlapping Copysets

If we consider the shuffle algorithm column-wise, each node of Bucket a, after being shuffled in Order 1, always and only appears in the i-th column of the resulting matrix ma. In contrast, the column localities of the nodes from Bucket b or Bucket c depend on the value of C, which is constrained by Lemma 1. The formal proof is presented in Appendix A.1.

Lemma 1. When C ≥ L, given 0 ≤ i < C, any b[i], after being shuffled L times in Order 2 (or c[i] shuffled in Order 3), will appear at most once in any column of the shuffled matrix.

Lemma 1 points out that, when C ≥ L, any two replicas of b[i] (or c[i]) will not be allocated in the same column of the corresponding shuffled matrix (i.e. mb or mc). However, all the replicas of a[i] always appear in the same column. Since any two nodes from different columns will not be merged into the same copyset, Lemma 2 can be deduced from Lemma 1, and has been proved in Appendix A.2.

Lemma 2. When C ≥ L, given 0 ≤ i, j, k < C, any combination of (a[i], b[j]) or (a[i], c[k]) appears in at most one copyset.

Moreover, if we consider the shuffle algorithm row-wise, each a[i] appears in each and every row of ma, while the row localities of b[i] and c[i] still depend on C. More specifically, the row position is x = (i + k · C) mod L. Similarly, if each b[i] (or c[i]) is also placed in each and every row of the shuffled matrix mb (or mc), it will simplify the task of restricting any two nodes from being allocated together more than once. According to Lemma 3, C and L should be co-prime (i.e. mutually prime). The proof is given in Appendix A.3.

Lemma 3. When C ≥ L, if C and L are co-prime, each node (either a[i], b[i], or c[i]), after being shuffled L times (in Order 1, 2 or 3, respectively), will appear once in each and every row of the shuffled matrix.

Lemma 3 is an important step towards restricting any combination of (b[i], c[j]) (where 0 ≤ i, j < C) from being allocated into multiple copysets. The reason is that, when each replica of b[i] or c[j] is restricted to one specific row, the problem can be simplified to preventing (b[i], c[j]) from appearing in the same column more than once, which is discussed as follows.

The Minimum C for Non-overlapping Copysets

According to Lemma 3, it is required that C ≥ L, and that C is prime to L. Thus, the minimum value of C is the smallest odd number that is greater than L. We have proved that, given such a value of C, any combination of (b[i], c[j]) appears in at most one copyset, as depicted in Lemma 4. Note that the value of L is determined by S and R, i.e. L = S/(R − 1). The proof presented in Appendix A.4 consists of two parts based on the two possible values of L: i) an even integer; ii) an odd integer. Under both circumstances, there does not exist a combination of (b[i], c[j]) that appears in more than one copyset.

Lemma 4. When C is the smallest odd number greater than L, given 0 ≤ i, j < C, any combination of (b[i], c[j]) appears in at most one copyset.

Hence, based on Lemmas 2 and 4, given 0 ≤ i, j, k < C, when the value of C is the smallest odd number that is greater than L, then any combination of (a[i], b[j]), or (a[i], c[k]), or (b[j], c[k]), appears in at most one copyset, as depicted in Lemma 5, the proof of which is presented in Appendix A.5. Given such a C, each node shares copysets with exactly S other distinct nodes in each group, as depicted in Lemma 6, proved in Appendix A.6.

Lemma 5. When C is the smallest odd number greater than L, any two copysets share at most one common node.

Lemma 6. When C is the smallest odd number greater than L, every node belongs to exactly L copysets, in which there are exactly S other distinct nodes.

To this end, we have proved Theorem 1 in mathematical terms. It demonstrates that a collection of minimal copysets can be created for a relatively small group of nodes. Hence, according to Theorem 1, the minimum number of nodes in each group is NG = R · C, wherein C = S/(R − 1) + 1 if S/(R − 1) is even; otherwise C = S/(R − 1) + 2. Using a minimised group ensures a close to minimised probability of data loss.
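A small sketch of this group-size rule (minimum_group_size is a hypothetical helper name); the two printed values correspond to the Figure 6.4 example (S = 8) and the S = 10 setting used in the evaluation.

def minimum_group_size(S, R=3):
    # NG = R * C, where C is the smallest odd number greater than L = S / (R - 1).
    L = S // (R - 1)
    C = L + 1 if L % 2 == 0 else L + 2
    return R * C

print(minimum_group_size(8))    # 15: L = 4, C = 5 (the group of Figure 6.4)
print(minimum_group_size(10))   # 21: L = 5, C = 7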

6.3.4 Handling Node Changes

In a distributed storage system running on an IaaS Cloud, nodes can spontaneously join or leave the system. ElasticCopyset aims at minimal disruption of existing copysets when handling node addition and removal.

As depicted in Figure 6.3 (page 154), the incomplete group consists of the remaining NE nodes, plus (NG − NE) nodes randomly selected from the complete groups, together making a total of NG distinct nodes. Hence, according to the type of group to which a node belongs, we have defined three node roles, described as follows.

• Extra: the node is in the incomplete group, and not in any complete group.

• Elementary: the node is in a complete group, and not in the incomplete group.

• Supplementary: the node is in both complete and incomplete groups.

According to this definition, an incomplete group comprises both Extra and Supplementary nodes. In contrast, a complete group consists of Elementary nodes, but it may or may not have Supplementary nodes, which are randomly chosen by the incomplete group. Moreover, since each Supplementary node is placed into two groups, it is allocated to 2S/(R − 1) copysets, rather than S/(R − 1). The total number of extra copysets introduced by these Supplementary nodes has been given by Equation 6.7 on page 157. This number is minimised by choosing a minimum group NG defined by Theorem 1.

[Figure 6.5 diagram: Scenario 0 adds a new node; Scenarios 1, 2 and 3 remove an Extra, an Elementary and a Supplementary node, respectively; the affected nodes change roles or are replaced accordingly.]

Figure 6.5: Adding and removing a node in ElasticCopyset

Node Addition and Removal in Groups

Figure 6.5 depicts the operations to deal with node changes. The change of nodes is handled according to the role of the node. Since there are three node roles, node removal (whether deliberate or unexpected) is dealt with under three scenarios, labelled 1, 2, and 3, respectively. In contrast, there is only one scenario for node addition, labelled 0.

• Scenario 0 illustrates the addition of a new node. As shown, the new node always joins the incomplete group. Since there are already NG nodes, one of the Supplementary nodes is chosen and replaced by the new node. According to the definition, the chosen Supplementary node is also located in a complete group. Hence, this node changes its role to Elementary in the complete group. By comparison, the new node is marked as Extra in the incomplete group.

• Scenario 1 illustrates the removal of an Extra node from the incomplete group. In this case, an Elementary node is randomly selected from one complete group to replace the removed node. The selected node is marked as Supplementary in both the complete and incomplete groups.

• Scenario 2 illustrates the removal of an Elementary node from a complete group. In this case, the incomplete group offers one of its Extra nodes. This node replaces the removed node in the affected complete group. Now that the node is assigned to both the complete and incomplete groups, it is marked as Supplementary in both groups.

• Scenario 3 illustrates the removal of a Supplementary node, which is located in the incomplete group and one complete group. There are two solutions. The first solution consists of two steps: i) select an Elementary node to replace the removed node in the incomplete group, as in Scenario 1; ii) select an Extra node to replace the same removed node in the complete group, as in Scenario 2. This approach involves two node replacement operations. An alternative is to provision a new node from the IaaS Cloud, and use it to replace the removed node in both groups, which is depicted in Figure 6.5. (The role transitions in the four scenarios are sketched in code after this list.)
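Below is a much-simplified sketch of the role bookkeeping in these four scenarios, assuming groups are represented as plain Python sets; copyset regeneration, data migration and the extreme cases discussed next are omitted, and the class and method names are hypothetical.

EXTRA, ELEMENTARY, SUPPLEMENTARY = "extra", "elementary", "supplementary"

class GroupMembership:
    def __init__(self, complete_groups, incomplete_group):
        self.complete = complete_groups        # list of sets of node ids
        self.incomplete = incomplete_group     # set of node ids
        self.roles = {}
        for g in self.complete:
            for n in g:
                self.roles[n] = SUPPLEMENTARY if n in self.incomplete else ELEMENTARY
        for n in self.incomplete:
            self.roles.setdefault(n, EXTRA)

    def add_node(self, new_node):
        # Scenario 0: a Supplementary node (here simply the first found) is displaced
        # from the incomplete group and becomes Elementary in its complete group;
        # the new node joins the incomplete group as Extra.
        victim = next(n for n in self.incomplete if self.roles[n] == SUPPLEMENTARY)
        self.incomplete.remove(victim)
        self.roles[victim] = ELEMENTARY
        self.incomplete.add(new_node)
        self.roles[new_node] = EXTRA

    def remove_node(self, node):
        role = self.roles.pop(node)
        if role == EXTRA:
            # Scenario 1: an Elementary node (first found here; the scheme picks one
            # at random) fills the vacancy and becomes Supplementary in both groups.
            self.incomplete.remove(node)
            donor = next(n for n, r in self.roles.items() if r == ELEMENTARY)
            self.incomplete.add(donor)
            self.roles[donor] = SUPPLEMENTARY
        elif role == ELEMENTARY:
            # Scenario 2: an Extra node from the incomplete group fills the vacancy
            # in the complete group and becomes Supplementary in both groups.
            group = next(g for g in self.complete if node in g)
            group.remove(node)
            extra = next(n for n in self.incomplete if self.roles[n] == EXTRA)
            group.add(extra)
            self.roles[extra] = SUPPLEMENTARY
        else:
            # Scenario 3: a fresh node (hypothetically provisioned from the IaaS Cloud)
            # replaces the removed Supplementary node in both of its groups.
            fresh = ("new", node)
            group = next(g for g in self.complete if node in g)
            group.remove(node)
            group.add(fresh)
            self.incomplete.remove(node)
            self.incomplete.add(fresh)
            self.roles[fresh] = SUPPLEMENTARY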

Moreover, there are extreme cases that involve the creation and dismissal of the incomplete group. As depicted in Figure 6.6, due to consecutive node additions, there may be no Supplementary node left for replacement. That is, every node in the incomplete group is Extra. In this case, the existing incomplete group is marked as complete, with all the nodes in it marked as Elementary. Meanwhile, a new incomplete group is created, and the new node joins it as an Extra node. To populate the new incomplete group, (NG − 1) Elementary nodes are randomly selected from all the complete groups, such that there are NG nodes in total. These selected nodes are marked as Supplementary.

Conversely, during node removal as in Scenario 2, if the incomplete group has no more Extra nodes to offer, then it is dismissed. As depicted in Figure 6.7, the complete group in which the removed node is located becomes the new incomplete group, wherein all the existing nodes become Extra nodes. Therefore, there are (NG − 1) Extra nodes.

[Figure 6.6 diagram: when the incomplete group contains no Supplementary node, it is converted into a complete group, and a new incomplete group is created for the joining node together with randomly selected Elementary nodes.]

Figure 6.6: Create a new incomplete group in ElasticCopyset

One more Elementary node is randomly selected from a complete group, and is used to replace the removed node. Hence, this selected node is marked as Supplementary.

Node Changes in Copysets

In the previous discussion, there are three operations that require changes to copysets: i) replacing a node in one group; ii) creating a new incomplete group; and iii) dismissing an existing incomplete group. Note that changing the role of a node does not affect any copyset. In the following, we discuss the side-effects of these operations on the stability of copysets.

First of all, node replacement is inevitable in every node addition and removal described in Figure 6.5. In most cases, ElasticCopyset requires only one node replacement, as in Scenarios 0, 1, and 2. Even in Scenario 3, where a Supplementary node is removed, it requires only two node replacements. Note that replacing a node in one group affects the S/(R − 1) copysets that the node is associated with. To execute a node replacement, the data belonging to the replaced node is transferred to the newly joined node from the existing S nodes in the affected S/(R − 1) copysets.

[Figure 6.7 diagram: when the incomplete group has no Extra node left, it is dismissed, and the complete group of the removed node is converted into the new incomplete group.]

Figure 6.7: Dismiss an existing incomplete group in ElasticCopyset

Moreover, creating a new incomplete group (as in Figure 6.6) requires the generation of S/(R − 1) · NG/R copysets (Equation 6.5, page 157). In this case, although there are S/(R − 1) · NG/R copysets involved, data movement is only required in the S/(R − 1) copysets that contain the new node. That is, it affects only S existing nodes.

Conversely, unlike creating a group, dismissing an existing incomplete group (as in Figure 6.7) does involve data movement for all its S/(R − 1) · NG/R copysets. Nevertheless, the incomplete group is dismissed because all its nodes have become Supplementary. In order to reduce the amount of data moved during this process, ElasticCopyset requires that no data be assigned to copysets that are in the incomplete group whilst having no Extra node. Hence, the data is moved out gradually as more Extra nodes become Supplementary. By the time the incomplete group is to be dismissed, there is no data left.

To sum up, in ElasticCopyset, each node addition or removal requires data movement in S/(R − 1) copysets, which involves only one scatter width of nodes. In the extreme cases when both the incomplete and complete groups are involved, it affects 2S/(R − 1) copysets that contain 2S nodes other than the added/removed node.

6.3.5 Discussion on Implementing ElasticCopyset

ElasticCopyset is a general-purpose data placement scheme that can be implemented on a wide range of distributed storage systems that distribute data across a cluster of nodes. This subsection discusses its implementation on decentralised shared-nothing KVSs as described in this thesis.

First and foremost, ElasticCopyset is designed for large-scale distributed systems with dynamic node arrivals and departures. It requires at least one complete group to create minimal copysets, with one incomplete group to deal with node changes. Therefore, the scatter width should be properly set, so that the existing nodes in the system are sufficient to form at least one complete group of nodes, i.e., N ≥ NG.

According to Theorem 1 (page 158), the number of nodes in one group is NG = R · C, wherein C equals (S/(R − 1) + 1) or (S/(R − 1) + 2), such that C is an odd number. Hence, given R = 3, the size of NG is determined by S. If the scatter width S is set too high, such that the total number of nodes N < NG ≤ R(S/(R − 1) + 2), then even one minimum group cannot be formed. To avoid this situation, ElasticCopyset requires that S ≤ Min(N)/2, wherein Min(N) is the minimum number of nodes that will be maintained in the system. Such a requirement does not hurt the system's performance, because S is usually set to a lower value to reduce the probability of data loss.

Initialisation of Copysets

The group size NG = R · C is calculated based on the value of S that is set. As depicted in Figure 6.3 (page 154), every NG nodes form one complete group, while the nodes that remain in the end are assigned to the incomplete group. However, there is one implementation challenge, namely the lack of a centralised component to divide the nodes into groups at system startup.

To address this challenge, ElasticCopyset relies on the system administrators to designate at least one group coordinator at system initialisation. Each coordinator is responsible for populating one complete group of nodes, and therefore, the number of coordinators designated can be any positive number no greater than ⌊N/NG⌋, i.e., the number of complete groups.

[Figure 6.8 diagram: Phase 1 (prepare/promise) and Phase 2 (propose/accept) message exchanges between each coordinator (C) and the non-coordinator nodes (N).]

Figure 6.8: Populating the complete groups using Paxos-based coordination

To populate a complete group, each coordinator uses the typical two-phase coordination of Paxos (Lamport, 2001): i) prepare/promise, and ii) propose/accept, as depicted in Figure 6.8. In Phase 1, each coordinator broadcasts a message to all the non-coordinator nodes, requesting them to prepare for joining its group. Each non-coordinator node can receive multiple prepare requests from different coordinators, but it can promise to only one coordinator that it will not accept any proposal from other nodes within a time period (i.e. until the promise is released). Thus, at the end of Phase 1, each coordinator has collected a number of promises from the non-coordinators. Next, in Phase 2, each coordinator chooses at most (NG − 1) nodes from the promises, and proposes to these chosen nodes to join its group. On receiving the message, a non-coordinator node will accept the proposal if its previous promise is still valid.

In this way, each coordinator enrols a number of nodes into its group. However, in Phase 1, some coordinators may have collected fewer promises than (NG − 1), and as a result, cannot gather enough nodes to form a complete group. To resolve this issue, if a coordinator has already enrolled (NG − 1) non-coordinator nodes, it sends out another message to decline the promises that it has not chosen, so that those non-coordinator nodes become available to other coordinators that need more nodes. Therefore, a coordinator can repeatedly run the process of enrolling nodes, until it has acquired (NG − 1) nodes.
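The following is a much-simplified, synchronous sketch of this prepare/promise and propose/accept enrolment; it is not a full Paxos implementation, the class and function names are hypothetical, and timeouts, message passing and node failures are omitted.

class MemberNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.promised_to = None    # coordinator currently holding this node's promise
        self.group = None          # coordinator whose proposal has been accepted

    def prepare(self, coordinator_id):
        # Promise to the requesting coordinator only if no promise is outstanding.
        if self.promised_to is None and self.group is None:
            self.promised_to = coordinator_id
            return True
        return False

    def propose(self, coordinator_id):
        # Accept only if the earlier promise to this coordinator is still valid.
        if self.promised_to == coordinator_id and self.group is None:
            self.group = coordinator_id
            return True
        return False

    def release(self, coordinator_id):
        # The coordinator declines the promise, freeing the node for others.
        if self.promised_to == coordinator_id and self.group is None:
            self.promised_to = None

def populate_group(coordinator_id, nodes, group_size):
    # One coordinator repeatedly runs prepare/promise and propose/accept
    # until it has enrolled (group_size - 1) non-coordinator nodes.
    members = []
    while len(members) < group_size - 1:
        promised = [n for n in nodes if n.prepare(coordinator_id)]   # Phase 1
        if not promised:
            break                                                    # no free nodes left
        needed = group_size - 1 - len(members)
        chosen, surplus = promised[:needed], promised[needed:]
        members += [n for n in chosen if n.propose(coordinator_id)]  # Phase 2
        for n in surplus:
            n.release(coordinator_id)
    return members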

Note that when the number of coordinators designated is less than ⌊N/NG⌋, there must be at least NG non-coordinator nodes that are not assigned to any group. To create more complete groups, the existing coordinators will elect one more coordinator amongst the remaining non-coordinators, wherein the election can be the same as the one proposed in Subsection 4.2.2. Thus, the elected coordinator follows the two-phase coordination (Figure 6.8), and forms one more complete group of NG nodes. At this moment, if the number of remaining non-coordinator nodes is still no less than NG, then another new coordinator is elected by the existing coordinators. This process can be repeatedly executed, until there are fewer than NG nodes remaining. Finally, with the help of the ⌊N/NG⌋ coordinators that are designated or elected, ⌊N/NG⌋ complete groups are formed.

Note that there are still NE = (N mod NG) nodes that are not chosen. Once all the coordinators have finished populating their own complete groups, they also coordinate to form an incomplete group for the remaining nodes. Similarly, one new coordinator is elected amongst the remaining nodes. The elected coordinator enrols the remaining nodes automatically, and it also randomly selects more nodes from the complete groups to join the incomplete group as Supplementary nodes. In the end, one incomplete group of NG nodes is also formed.

Once the nodes have been divided into groups, each group coordinator sorts its group members based on certain criteria. Then, each group member is assigned a ranking position in the group, and uses this position p to calculate the copysets it belongs to. For example, as in Figure 6.4 (page 156), if a node is ranked p = 7 out of the 15 nodes, it immediately knows that it is placed at the second position of the second bucket, i.e. b2 in Figure 6.4. Next, it runs the shuffle algorithm, which shows that it belongs to four copysets: {a1, b2, c3}, {a2, b2, c5}, {a3, b2, c2}, and {a5, b2, c1}.
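Reusing the hypothetical shuffle_copysets sketch from Subsection 6.3.2, this example can be checked as follows (node identifiers 1 to 15 stand in for the ranking positions).

group = list(range(1, 16))                 # 15 nodes, ranking positions 1..15
copysets = shuffle_copysets(group, S=8)    # sketch from Subsection 6.3.2
node = 7                                   # ranking position p = 7, i.e. b2 in Figure 6.4
print([cs for cs in copysets if node in cs])
# Four copysets containing node 7, corresponding to {a1, b2, c3}, {a2, b2, c5},
# {a3, b2, c2} and {a5, b2, c1}.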

In ElasticCopyset, each node is mapped to a group ID and a ranking position once it joins a group. This position remains unchanged until the node is removed from the group. Thus, given the information of node positions, every node can calculate the members of each copyset locally, without the presence of a centralised component. In addition, it is worth mentioning that, when a node replaces an existing node in a group, it simply takes over the ranking position of the node to be replaced, since the shuffle algorithm relies on the ranking positions to determine the copysets.

Replica Placement

In Subsection 4.2.3, a set of replica placement algorithms have been proposed to assign partition replicas to each individual node. In contrast, ElasticCopyset requires each storage unit (i.e. a partition replica) to be associated with one copyset of nodes. Therefore, these algorithms cannot be leveraged by ElasticCopyset directly, but the proposed rules (page 81) can be reused.

Note that each bootstrapping (or decommissioning) node is responsible for deciding the replicas that it will serve. In ElasticCopyset, each new node is added to the incomplete group, wherein it takes over the position of a Supplementary node as illustrated in Figure 6.5. Therefore, a bootstrapping node in ElasticCopyset will serve all the replicas that are already assigned to the copysets that it belongs to. Meanwhile, it also attempts to balance the workload and distribution of data across the copysets. Similar to Rules 3 and 4 described on page 81, the bootstrapping node calculates the average CPU utilisation and data volume in each copyset, and then initiates the reallocation of replicas to/from its own copysets.

Apart from node bootstrapping and decommissioning, ElasticCopyset also deals with the creation and dismissal of the incomplete group. As discussed on page 164, although a new incomplete group creates S/(R − 1) · NG/R copysets, only the S/(R − 1) copysets that contain the new node require data movement. Hence, the overhead of group creation is the same as node bootstrapping.

In contrast, to dismiss an incomplete group, all the S/(R − 1) · NG/R copysets in it are destroyed. To reduce the data volume transferred at group dismissal, we propose that, in the incomplete group, the copysets that have no Extra node should not be assigned any data. To fulfil this requirement, once the only Extra node in a copyset is removed, the replicas assigned to this copyset should be moved out gradually (i.e. without affecting online queries) by the remaining Supplementary nodes in this copyset.

Recovery from Multiple Node Failures

ElasticCopyset is designed to maintain high data durability under simultaneous node failures. However, when F nodes have failed, even if there is no data loss, the failure event can affect at most S/(R − 1) · F copysets.

One solution is to provision F new empty nodes from the IaaS Cloud. These new nodes

To reorganise the group location of the nodes when F nodes have failed, ElasticCopyset always uses the living Extra nodes in the incomplete group to replace the failed nodes in the complete groups. To begin with, ElasticCopyset requires a coordinator elected amongst the nodes in the incomplete group. This coordinator calculates the number of Extra nodes that are still alive, denoted as NL, and the number of failed nodes located in the complete groups, denoted as FC. There are two scenarios depending on the values of these two numbers.

On one hand, if NL > FC, then there are enough living Extra nodes to fill the vacancies in the complete groups. To do so, the coordinator reallocates the Extra nodes one by one. Each Extra node is assigned to one complete group to take over the position of one failed node. Note that after an Extra node is settled in a complete group, it is still present in the incomplete group, but is marked as Supplementary in both groups instead. In addition, it is worth mentioning that this process does not require re-assignment between partition replicas and copysets, but does require the added node to take over the data belonging to the failed node. When all the FC nodes are replaced, all the complete groups are intact again. In addition, if the number of living nodes in the incomplete group is less than NG, then the coordinator will randomly select Elementary nodes one by one from the complete groups, to fill the vacancies in the incomplete group.

On the other hand, if NL ≤ FC, it means that the whole incomplete group is to be dismissed. To do so, the coordinator moves out the Extra nodes one by one. From each of the S/(R − 1) original copysets, the moving node carries 1/R of the replicas to each of its S/(R − 1) destination copysets located in the complete group. Note that this node is then marked as Supplementary in both the incomplete and complete groups. In this way, the data in the incomplete group is gradually moved out by each Extra node. Finally, all the Extra nodes have been moved out, and all the nodes in the incomplete group are Supplementary. Thus, this group is dismissed after the data is reallocated by the nodes that remain.

Next, if FC − NL > 0, it means that there is at least one complete group with vacancies caused by failed nodes. The previous coordinator (from the incomplete group) chooses the complete group with the most vacancies, and converts it into an incomplete group. This process is the same as shown in Figure 6.7 on page 164. A new coordinator is elected from this new incomplete group. Therefore, this new coordinator follows the same procedure as described, and uses its Extra nodes to compensate other complete groups. This procedure can go on iteratively, until all the vacancies in the groups are filled.

To sum up, in ElasticCopyset, each failed node is replaced by one Extra (or new empty) node, and the data volume transferred in each node replacement is equal to the data volume in the failed node. Moreover, each incomplete group that is dismissed requires at most S/(R − 1) · NG/R re-mappings between partition replicas and copysets.

6.4 Evaluation

In this section, we provide a set of experimental results to evaluate ElasticCopyset against Copyset Replication (Cidon et al., 2013) and Random Replication, in terms of data durability under simultaneous node failures. We have used simulations to evaluate the algorithms, as we do not have access to the thousands of compute nodes needed for real-world experiments.

For all the experiments, the replication number is set to R = 3, as explained in Subsection 6.3.2. The number of nodes N varies between 1000 and 10000, and there are 10000 storage units (e.g., data chunks, partition replicas) assigned to each node. This is the typical cluster size at Facebook (Borthakur et al., 2011), LinkedIn (Chansler, 2012) and RAMCloud (Ongaro et al., 2011), and is also consistent with the experiments conducted in Copyset Replication (Cidon et al., 2013).

We have evaluated data durability under three scenarios: i) static deployment with varied settings; ii) linear scaling at a constant rate; and iii) elastic scaling based on workload demands. We have used the probability of data loss to quantify data durability. Each loss probability is calculated as follows. We simulate 10000 simultaneous node failure events, given the same percentage of node failures. Each time, the given percentage of nodes is randomly chosen to fail. A failure event is considered to have caused data loss if there exists at least one copyset in which all nodes have failed. The probability of data loss is equal to the number of failure events that have caused data loss, divided by 10000.
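A sketch of this failure-injection procedure is given below; the function name and its parameters are assumptions, and the copysets argument is expected to be a collection of sets of node identifiers (for instance, the output of the shuffle_copysets sketch in Section 6.3).

import random

def data_loss_probability(copysets, num_nodes, failure_fraction, trials=10000):
    # Fraction of simulated failure events in which at least one copyset
    # loses all of its nodes.
    nodes = range(num_nodes)
    failures = int(num_nodes * failure_fraction)
    loss_events = 0
    for _ in range(trials):
        failed = set(random.sample(nodes, failures))
        if any(cs <= failed for cs in copysets):   # some copyset entirely failed
            loss_events += 1
    return loss_events / trials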

[Figure 6.9 plots: (a) number of copysets (log scale) and (b) percentage of copysets exceeding the minimum (with a 0.5% reference line), both versus the number of nodes from 100 to 51200, for ElasticCopyset and Copyset Replication.]

Figure 6.9: Number of copysets generated by different placement strategies

6.4.1 Evaluation on Static Deployment

This set of experiments compares ElasticCopyset with the other placement schemes, using static deployment with varied settings of the number of nodes, the scatter width and the percentage of nodes that fail.

Varied Numbers of Nodes

Figure 6.1 on page 148 has demonstrated that the probability of data loss is dominated by the number of copysets in the system. In this experiment, we used different numbers of nodes to study the number of copysets generated using different placement schemes. The number of nodes is doubled at every step, starting from 100 up to 51200. The scatter width is set to S = 10, which is consistent with that reported for the Facebook deployment (Borthakur et al., 2011) and in the evaluation of Copyset Replication (Cidon et al., 2013).

Figure 6.9 compares the number of copysets generated by Copyset Replication and ElasticCopyset. Figure 6.9a shows that the numbers of copysets generated in both schemes increase linearly with the number of nodes. This result is consistent with Equations 6.3 and 6.6, both of which show that the number of copysets is a linear function of the node number.

In order to clearly differentiate the two numbers, we present Figure 6.9b, which compares the resulting number of copysets against the minimum number of copysets defined in Equation 6.3. As can be seen, when the number of nodes N ≥ 100, the percentage of copysets exceeding the minimum number is less than 0.5% using Copyset Replication. In contrast, ElasticCopyset generates slightly more copysets. When 100 ≤ N ≤ 400, the number of copysets generated exceeds the minimum by 5%. This percentage of excess falls below 0.5% when N ≥ 3200. Overall, both schemes generate a close to minimum number of copysets.

[Figure 6.10 plots: (a) probability of data loss with 1% of nodes failed, and (b) number of copysets (×1000), both versus scatter width S from 0 to 500, for Random Replication, ElasticCopyset and Copyset Replication.]

Figure 6.10: Testing how the varied scatter width S will impact the system of 5000 nodes

Varied Scatter Widths

In a data placement scheme, the value of scatter width is determined by the system administrators. Figure 6.10 depicts how a system of 5000 nodes will be affected when the scatter width S increases from 0 up to 500.

Figure 6.10a depicts the probability of data loss when 1% of the nodes, i.e. 50 nodes, fail simultaneously. As can be seen, the probability of data loss using Copyset Replication and ElasticCopyset is below 40% even when S = 500, while with Random Replication, the data is almost guaranteed to be lost when S > 50. Figure 6.10b shows that the number of copysets generated by Copyset Replication and ElasticCopyset grows steadily and linearly as S increases. In contrast, the number of copysets generated by Random Replication goes beyond one million even when S is small, because Random Replication forms one copyset for each storage unit, and there are 10000 storage units per node. Comparing Figure 6.10a with Figure 6.10b shows that the probability of data loss is determined by the number of copysets in the system. This is also consistent with the conclusion of Equation 6.2 on page 152.

Moreover, Figure 6.11 compares the actual scatter width, i.e. the number of distinct nodes that each node shares copysets with, against the value of S predefined by the system administrators. Figure 6.11a shows the average actual scatter width of all the nodes as a percentage of the defined S. As S increases, ElasticCopyset presents a wave-like ascent in the average scatter width, while Copyset Replication shows a steady decline in the actual scatter width on average. When S = 500, the average scatter width in ElasticCopyset is 105% of the defined S, compared to 95% of S in Copyset Replication. Moreover, Figure 6.11b presents the minimum scatter width of all the nodes. ElasticCopyset presents a constant minimum scatter width equal to the given S, while Copyset Replication again shows a steady decline in the minimum scatter width. The latter is less than 94% of S when S > 50.

[Figure 6.11 plots: (a) average scatter width and (b) minimum scatter width, both as a percentage of the given S, versus scatter width S from 0 to 500, for ElasticCopyset and Copyset Replication.]

Figure 6.11: Compare the actual scatter width against the predefined S

Overall, ElasticCopyset exhibits a higher scatter width than Copyset Replication, but the deviations from the given S are less than 10% in both schemes. The result of Figure 6.11 can be explained as follows. ElasticCopyset guarantees that each node is put into copysets with S distinct nodes, so the minimum scatter width is always equal to S. Moreover, the incomplete group generates extra copysets, the number of which is determined by both S and NE (i.e. Equation 6.7), causing a non-linear increase in the average scatter width of ElasticCopyset. In contrast, Copyset Replication uses random permutation to form S/(R − 1) copysets for each node, without a guarantee that there are S distinct nodes. A greater S results in more duplicate nodes in the copysets that a node belongs to. Hence, the average and minimum scatter widths both decrease as the given S grows.

In addition, Figure 6.12 studies the changes in the numbers of copysets and groups in ElasticCopyset. In this experiment, the scatter width is increased from 0 up to 1000. There are four tests, with the number of nodes N set to 2500, 5000, 10000, and 20000, respectively. As shown, a larger system scale results in more copysets and groups. Moreover, Figure 6.12a shows that the number of copysets rises in a sawtooth pattern as the scatter width S grows. By comparison, Figure 6.12b shows the number of groups for S ≤ 100. As S increases from 0 to 100, the total number of groups drops quickly like a power-law curve. We also observed that the group number declines slowly as a long tail from S ≥ 100 onwards. In addition, it is worth mentioning that the sawtooth pattern in both subfigures is due to the existence of an incomplete group, which creates extra copysets for the NE nodes that remain (Equation 6.7).

Figure 6.12: The change in the numbers of copysets and groups, given varying scatter widths and numbers of nodes in ElasticCopyset. (a) Number of copysets. (b) Number of groups.

Figure 6.13: Probability of data loss with a varying percentage of the nodes failed simultaneously, using different replication strategies, given S = 10.

Varied Percentages of Node Failure

We have demonstrated that a higher scatter width leads to a higher probability of data loss. This experiment studies data durability under varied percentages of nodes that fail in a failure event. There are five tests, wherein the number of nodes grows from 500 to 10000. In each test, the percentage of nodes that are selected to fail increases by 0.5% from 0 up to 5%. The scatter width is S = 10.

Figure 6.13 depicts the probability of data loss using ElasticCopyset. For a system of 500 nodes, the loss probability rises, but remains below 10% even when 5% of the nodes fail simultaneously. As the number of nodes increases, there is a clear trend that the probability of data loss grows more quickly. For example, when 5% of 10000 nodes fail concurrently, the probability of data loss is almost 90%. The reason is that, given the same percentage of node failure, more nodes fail in a larger system, which naturally leads to a higher data loss probability.

A horizontal line at 33.3% is added in Figure 6.13. If a system can tolerate one data loss event in every three failure events, ElasticCopyset can withstand a 5% node failure rate for up to 2000 nodes. Alternatively, it can support a system of 10000 nodes if no more than 3% of the nodes (i.e. 300 nodes) fail simultaneously. We have also conducted this experiment using Copyset Replication. The results are almost identical to Figure 6.13 and are not shown, to avoid repetition; a similar figure can be found as Figure 8 in Cidon et al. (2013). Copyset Replication and ElasticCopyset exhibit a similar probability of data loss because both schemes generate a close to minimum number of copysets, as depicted in Figure 6.9.

6.4.2 Scaling at a Constant Rate

We have studied the data durability with static deployment, in which the copysets were statically generated based on the settings of the system. In this subsection, we study the scalability of the placement schemes with two online evaluations, wherein the number of nodes increases or decreases at a constant rate during runtime.

The first experiment is the evaluation on system scale-out. The placement schemes are required to handle node additions. The initial number of nodes is N = 5000. The new nodes are added one after another during the runtime, until the system scale is doubled (i.e., N = 10000). There are three tests, with the scatter width S set as 10, 50, and 250, respectively.

Figure 6.14a compares the number of copysets generated by Copyset Replication and ElasticCopyset during scale-out. In all the tests, as the number of nodes grows, the number of copysets increases more quickly when using Copyset Replication than when using ElasticCopyset. Consequently, as shown in Figure 6.14b, given 1% of nodes that fail simultaneously, ElasticCopyset consistently presents a smaller probability of data loss than Copyset Replication. When the system is scaled out by 100%, the probability of data loss in Copyset Replication is almost twice that in ElasticCopyset.

Figure 6.14: Scaling the system out by adding nodes one by one at runtime, with an initial number of nodes N = 5000 rising to 10000. The figures show results at varying percentages of scale-out, under varying settings of scatter width. (a) Number of copysets. (b) Probability of data loss, with 1% of nodes selected to fail.

The second experiment studies system scale-in, which is exactly the reverse of the scale-out experiment. The initial number of nodes is N = 10000. Nodes are removed from the system one by one during runtime, until the system scale is reduced to half (i.e., N = 5000). Similarly, there are three tests, with S = 10, 50 and 250, and the percentage of nodes that fail is again 1%.

Figure 6.15: Scaling the system in by removing nodes one by one at runtime, with an initial number of nodes N = 10000 dropping to 5000. The figures show results at varying percentages of scale-in, under varying settings of scatter width. (a) Number of copysets. (b) Probability of data loss, with 1% of nodes selected to fail.

As depicted in Figure 6.15a, as the system scales in from 10000 to 5000 nodes, the number of copysets remains constant when using Copyset Replication. In contrast, ElasticCopyset steadily reduces the number of copysets during the scale-in. As a result, ElasticCopyset exhibits a smaller probability of data loss than Copyset Replication in Figure 6.15b. When the system is scaled in by 50%, the probability of data loss in ElasticCopyset is 50% smaller than that in Copyset Replication.

These experiments have demonstrated that ElasticCopyset handles online node addition and removal better than Copyset Replication, and maintains a smaller probability of data loss during both scale-out and scale-in.

Figure 6.16: The input for testing the scalability of Copyset Replication and ElasticCopyset under dynamic workloads. (a) Workload. (b) Number of nodes.

6.4.3 Elastic Scaling Under Workload

In this experiment, we study the scalability of the placement schemes when the system is subject to a dynamic workload. The test scenario is based on the workload analysis of Facebook's Memcached deployment (Atikoglu et al., 2012). The workload (e.g. the number of requests per second) follows a diurnal pattern, with peaks during the day and troughs at night. The system is, either automatically or manually, scaled out and scaled in incrementally according to the changes in workload. During this process of dynamic scaling, we compare ElasticCopyset against Copyset Replication in terms of data durability.

The input of this experiment is shown in Figure 6.16. We simulated the workload to match the temporal patterns described in the Facebook deployment (Atikoglu et al., 2012). As depicted in Figure 6.16a, each wave has a period of 24 hours, and there are 200 hours in total. The bottom workload is set to 50,000 requests per second. Depending on the peak workload, there are three patterns, namely low, medium, and high. The peak workloads are 75,000, 100,000, and 150,000 reqs/sec, respectively, which are 50%, 100% and 200% greater than the bottom workload. Hence, in the remaining figures of this subsection, the results are labelled Low, Medium, and High, respectively.
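As an illustration of how such an input trace could be produced, the sketch below generates a simple sinusoidal day/night wave between the bottom and peak rates; the cosine shape and the function name are assumptions made for illustration, not the exact generator used in the evaluation:

    import math

    def diurnal_workload(hour, bottom=50_000, peak=150_000, period=24):
        # Requests per second at a given hour, with the trough at hour 0
        # and the peak at hour 12 of every 24-hour cycle.
        phase = 2 * math.pi * (hour % period) / period
        return bottom + (peak - bottom) * (1 - math.cos(phase)) / 2

    # A 200-hour trace for the "High" pattern (peak of 150,000 reqs/sec).
    trace = [diurnal_workload(h) for h in range(200)]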

Based on the three types of workload, we then simulated the changes in the number of nodes. As shown in Figure 6.16b, when the workload is at its trough, there are 5000 nodes. The number of nodes is increased as the workload rises. The system scale peaks at approximately 6000, 7000, and 8500 nodes for the Low, Medium, and High workloads, respectively. Hence, given N = 5000 at the trough, the percentages of scale-out are respectively 20%, 40% and 80% under the three workloads.

Figure 6.17: Comparison of dynamic scaling between Copyset Replication and ElasticCopyset. The system is scaled based on the different types of workload; the initial number of nodes is N = 5000. (a) Numbers of copysets generated by ElasticCopyset and Copyset Replication. (b) Probability of data loss, given 1% of nodes selected to fail.

The number of nodes in Figure 6.16b serves as the direct input for evaluating data durability under elastic scaling, the experimental results of which are presented in Figure 6.17. The numbers of copysets generated are shown in Figure 6.17a. Initially, about 8500 copysets were generated in both schemes. However, as time elapses, the two schemes exhibit two disparate trends. In Copyset Replication, the number of copysets accumulates step by step every 24 hours, growing steadily each day. In ElasticCopyset, by contrast, the number of copysets rises and declines with the number of nodes, and remains at a roughly constant level in the long run.

Consequently, given 1% of nodes that fail, the probability of data loss, which is determined by the number of copysets, also exhibits two disparate trends. As shown in Figure 6.17b, the loss probability in Copyset Replication increases steadily day after day. During the 200 hours, the probability grows from 1% to 12% under the high workload; at this rate, the probability of data loss would exceed 50% within 1,000 hours (roughly 42 days). In contrast, ElasticCopyset manages to maintain the data loss probability below 1.5% under all three workloads.

Figure 6.17 demonstrates that, when the system is dynamically scaled out and scaled in within a range over a long period of time, ElasticCopyset is able to maintain the data loss probability at a close to minimised level, whereas Copyset Replication is not.

6.4.4 Discussion

We have presented the experimental results. The evaluations on static deployment have demonstrated that ElasticCopyset exhibits a close to minimised probability of data loss under varied system setups and failure rates. Yet, it is noticeable in Figure 6.10b that ElasticCopyset generates more copysets than Copyset Replication, resulting in the slightly higher probability of data loss shown in Figure 6.10a. This is due to the extra copysets generated in the incomplete group (Equation 6.7). Nevertheless, as shown in Figure 6.9b, the percentage of extra copysets is less than 6%.

The evaluations on scalability have shown that ElasticCopyset greatly outperforms Copyset Replication by maintaining the minimum number of copysets during online node addition and removal. The results can be explained as follows. During scale-out, Copyset Replication requires S/(R−1) new copysets to be generated for each node addition, while ElasticCopyset generates new copysets only when a new incomplete group is required (Figure 6.6). To illustrate, we assume n new nodes are added. Copyset Replication creates (S/(R−1))·n new copysets, while ElasticCopyset creates (S/(R−1))·(N_G/R)·⌈n/N_G⌉ new copysets (Equation 6.6). When n ≫ N_G, then (S/(R−1))·(N_G/R)·⌈n/N_G⌉ ≈ (S/(R−1))·(n/R). Since (S/(R−1))·n is R times as large as (S/(R−1))·(n/R), given R = 3, the number of copysets generated for new nodes in Copyset Replication is almost triple that in ElasticCopyset.

During scale-in, when a node is removed, Copyset Replication uses an existing node to replace it, without destroying any existing copyset. In ElasticCopyset, by contrast, the whole incomplete group is dismissed when there is no Extra node left (shown in Figure 6.7). Hence, when the system is scaled in, the number of copysets in ElasticCopyset is reduced accordingly, while the number in Copyset Replication remains unchanged.

Overall, ElasticCopyset leverages the incomplete group to create and dismiss copysets for dynamic node addition and removal, and also maintains a close to minimised probability of data loss in both static deployment and dynamic scaling.
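To put the scale-out arithmetic above in concrete terms, the following small calculation uses illustrative values of S, n and the group size N_G that are chosen purely for this example and are not taken from the evaluation:

    from math import ceil

    S, R, n, N_G = 50, 3, 1200, 300   # illustrative values only
    copyset_replication_new = S // (R - 1) * n                       # 25 * 1200 = 30000
    elastic_copyset_new = S // (R - 1) * (N_G // R) * ceil(n / N_G)  # 25 * 100 * 4 = 10000
    print(copyset_replication_new, elastic_copyset_new)              # ratio of roughly R = 3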

6.5 Chapter Summary

This chapter addressed the issue of data durability under simultaneous node failures, for distributed data stores that are required to efficiently scale out and scale in, in direct response to workload demands. We have proposed ElasticCopyset, a novel, general-purpose data placement scheme that builds on the concept of “copysets” (Cidon et al., 2013). Given the scatter width of the data placement, ElasticCopyset defines the minimum number of nodes that can form a group, and splits the nodes in the system into a list of complete groups and one incomplete group. ElasticCopyset uses a novel shuffle algorithm to generate a minimum number of non-overlapping copysets within each group, so as to minimise the probability of data loss. ElasticCopyset also leverages the incomplete group to efficiently handle node addition and removal, such that each node operation affects only one scatter width of nodes.

We have evaluated ElasticCopyset against current data placement schemes, including Random Replication and Copyset Replication. The evaluation has demonstrated that ElasticCopyset maintains a close to minimum probability of data loss across varying values of scatter width, number of nodes, and percentage of node failure. In contrast to the current-state Copyset Replication, ElasticCopyset has also exhibited much better scalability and elasticity in the scenario where the distributed system is required to dynamically scale in and scale out under a diurnal workload pattern.

Chapter 7

Conclusions and Future Directions

Reasoning draws a conclusion, but does not make the conclusion certain, unless the mind discovers it by the path of experience.

– Roger Bacon

The main contribution of this thesis lies in the set of data management schemes, which have now been proposed, elaborated and evaluated. This concluding chapter summarises the main ideas and contributions of this thesis, and then discusses the potential future research directions.

7.1 Thesis Summary

Over the past few years, Cloud computing has emerged as a hundred billion dollar industry and as a successful paradigm for delivering computing utility as a service. Irrespective of Cloud service models or Cloud abstractions, distributed data stores form a critical component in the stack of data-centric applications and services on the Cloud.

This thesis began by studying, characterising and categorising several aspects of distributed data stores. It then identified Key-Value Stores (KVSs) as the state-of-the-art data stores that are most applicable to the IaaS model of Cloud computing. KVSs have the advantages of high availability and inherent scalability, features that are lacking in RDBMSs focused on ACID transactions. Hence, compared to RDBMSs, KVSs are considered to be better suited to the IaaS Cloud. Nevertheless, resource elasticity, a key feature of the Cloud, requires KVSs to efficiently incorporate and dismiss the provisioned resources as KVS nodes according to workload changes. This thesis has found that current-state KVSs are not capable of doing so, mostly due to the lack of a set of data management schemes that address the issues of dynamic node additions and removals.

Hence, this thesis has proposed the design of a data distribution middleware, which aims to help shared-nothing KVSs improve the efficiency of elasticity in several aspects. First, in order to reduce the overhead of data movement across nodes, it relies on an automated partitioning algorithm that divides the data set into multiple partitions, and then consolidates every partition replica into a standalone transferable unit of bounded size. Second, to achieve a balanced load at each stable scale after node additions and removals, it uses a set of data placement algorithms to distribute the transferable partition replicas amongst the nodes, with the aim of balancing both workload demand and data volume. Third, to maintain data consistency during data movement, it proposes a token management policy that controls whether the source and destination nodes are qualified to serve reads and writes. Last but not least, to address the issue of distributed synchronisation and decision making without a dedicated component, this thesis proposes an election-based coordination mechanism to facilitate the execution of distributed tasks in a decentralised manner.

Moreover, in order to evaluate the efficiency of elasticity, this thesis has presented ElasCass, an implementation of the proposed data distribution middleware. ElasCass was built on an open-source KVS called Cassandra (Apache, 2009), which follows a decentralised, shared-nothing architecture. ElasCass differs from Cassandra in several ways. First, to facilitate dynamic node additions and removals with finer-grained load balancing, ElasCass maintains an ongoing view of a many-to-many mapping between partition replicas and nodes. This ongoing view also helps to ensure data consistency during data movement. Second, ElasCass supports automated partitioning in a background process. The resulting partition replicas are standalone transferable units that can be placed on any node at any time, which simplifies data movement to a large extent. Additionally, ElasCass gathers detailed load statistics for the reallocation of partition replicas.

Given the implementation of ElasCass, evaluations were conducted on the Amazon EC2 public IaaS, using YCSB as the benchmark tool. The experimental results have demonstrated that ElasCass is capable of incorporating new, empty nodes consecutively in a relatively short time, with ideally balanced data distribution and a well-balanced workload. Compared to Cassandra, ElasCass exhibits better scalability and higher resource utilisation in computing and storage, and delivers a query throughput more than twice that of Cassandra. Moreover, ElasCass reduces the time required to incorporate a new node from several hours to within ten minutes, hence fulfilling the goal of efficient node addition.

Furthermore, hardware failure on the IaaS Cloud usually causes multiple VMs to fail simultaneously. This thesis has proposed ElasticCopyset, to ensure high data durability whilst allowing dynamic node additions and removals. ElasticCopyset divides the nodes into groups, wherein the nodes are shuffled to form a minimum number of non-overlapping sets of nodes (i.e. copysets). ElasticCopyset extends the concept of copysets (Cidon et al., 2013) to produce a brand-new set of algorithms for replication under elasticity. This thesis has also presented a mathematical proof of the correctness of these algorithms. The evaluation has demonstrated that ElasticCopyset maintains a close to minimum probability of data loss across varying values of scatter width, number of nodes, and percentage of node failure. For example, given a cluster of 5000 nodes and a 1% node failure rate, ElasticCopyset managed to reduce the data loss probability from 99.99% to less than 3.8%. In contrast to the current-state copyset-based replication, ElasticCopyset has also exhibited much better scalability and elasticity in the scenario where the distributed system is required to dynamically scale in and scale out under a diurnal workload pattern.

Overall, this thesis has presented the design and implementation of a set of data management schemes to achieve lightweight data movement, better load balancing, and higher data durability, while allowing dynamic node additions and removals, which together improve the efficiency of elasticity for decentralised, shared-nothing KVSs that are deployed on the IaaS Cloud.

7.2 Future Directions

7.2.1 Supporting Richer Queries

The applicability of KVSs depends on the variety of query commands they provide. Current-state KVSs should not be constrained to simple CRUD (i.e. create, read, update, delete) operations; instead, they need to provide a richer set of commands for query processing. There are two major directions, discussed as follows.

Figure 7.1: Varied consistency levels while maintaining the scalability of Key-Value Stores.

One direction is to support efficient range queries for online analytical processing (OLAP), catering to business intelligence applications. Range queries are important for ad-hoc analytical processing. The challenge of range queries lies in maintaining cluster-wide load balancing while allocating consecutive key ranges to a small number of nodes. Note that the performance of a range query depends on how quickly the query results can be returned in sorted order, and therefore, given the same key ranges, a smaller number of nodes requires less distributed synchronisation.

There have been research efforts on range queries. Vo et al. (2010) proposed to employ BATON, a balanced tree structure for overlay networks (Jagadish et al., 2005), to effectively support load-adaptive data placement. It not only differentiates the master replica from the slave replicas of each data partition, but also creates extra “secondary replicas” for load balancing. This approach is not applicable to the design of this thesis, which proposes to use copyset-based placement, wherein each data partition has the same number of indistinguishable replicas. In contrast, ES2 (Cao et al., 2011) is a distributed data store that implements an abstract overlay based on the Cayley graph model, so as to support distributed indexing for efficient data retrieval in OLAP workloads. According to Lupu et al. (2008), the Cayley graph serves as the algebraic and combinatorial basis for many P2P networks, and provides a way to traverse all the nodes in an orderly fashion, so as to answer range queries efficiently. Hence, one piece of future work is to employ the Cayley graph structure to support range queries in copyset-based placement.

The other future direction is to enable ACID transactions for KVSs. This is motivated by the lack of transactional support in existing commercial and open-source KVSs (DeCandia et al. 2007, Apache 2009), and by the demands of traditional database applications, e.g., credit card transactions in online shopping. As shown in Figure 7.1, current-state KVSs aim to provide stronger data consistency for ACID transactions, without sacrificing their inherent scalability.

In the domain of elastic KVSs, this is achieved by supporting localised transactions. Bigtable (Chang et al., 2006) supports row-level transactions only, while Megastore (Baker et al., 2011) supports transactions over multiple records. Yet, Megastore requires data objects to be partitioned into a collection of entity groups, wherein each entity must live within a single scope of serialisability, i.e. one machine or cluster (Helland, 2007). ElasTraS (Das et al., 2009) also supports restricted transaction semantics, termed minitransactions (Aguilera et al., 2007), which are executed within one data partition. However, it requires the data to be statically partitioned, and is not applicable to KVSs that use automated partitioning as proposed in this thesis. Hence, there is a gap in supporting transactions over multiple data objects without restricting the data partitioning scheme.

7.2.2 Extending ElasticCopyset

Chapter 6 presented a placement algorithm called ElasticCopyset. It has been shown that the copyset-based replication scheme is robust in terms of data durability in the event of simultaneous node failures, and that the organisation of nodes into groups and copysets is efficient in supporting dynamic node arrivals and departures. Yet, there are two directions in which this research can be extended in the near future.

One direction is to realise ElasticCopyset on a large-scale KVS. As discussed in Subsection 6.3.5 (page 165), the implementation of ElasticCopyset requires the realisation of several components:

• Cluster initialisation. The problem is how a cluster of nodes can spontaneously form groups and copysets without a dedicated component and with minimal human administration. In Subsection 6.3.5, we have proposed to designate one or a few coordinators before system startup. These coordinators recruit nodes as group members using Paxos-based coordination, and they continue to elect new coordinators to form more groups, until there are no nodes remaining. Within each group, each node is assigned a group ID and a position in the group, which is propagated to every other node, so that every node can calculate the copysets locally. Hence, copysets are formed without dedicated components.

Figure 7.2: An example of balanced load over copysets with imbalanced load across nodes. Nodes 1–9 carry loads of 10, 14, 12, 8, 10, 6, 8, 12 and 10, respectively; the six copysets {1,4,8}, {2,6,9}, {3,5,7}, {1,5,9}, {2,4,7} and {3,6,8} each carry a total load of 30.

• Replica placement. Subsection 4.2.3 (page 80) has proposed a set of algorithms to assign the replicas of certain data to individual nodes. However, ElasticCopyset is a copyset-based scheme, wherein all the replicas of a given data item should be assigned to the same copyset of nodes. This gives rise to the challenge of load balancing in ElasticCopyset, as the load on the nodes within the same copyset can vary greatly, even though the load is balanced over the copysets. Figure 7.2 presents such an example (a short check of this example appears after this list). There are nine nodes that form six copysets. Although the total load (either workload demand or data volume) of every copyset is the same (e.g. equal to 30), the load on each individual node varies drastically, from 6 to 14. Hence, there is a lack of replica placement strategies that achieve load balancing across nodes while assigning replicas on a per-copyset basis.

• Recovery from multiple node failures. In Subsection 6.3.5, we have proposed to use the surviving Extra nodes in the incomplete group to replace failed nodes in the complete groups. However, when dealing with node replacement in the groups, there is still the challenge of balancing the load across nodes, which is determined by the replica placement strategies discussed above. Hence, the replica placement problem for copyset-based schemes shall be addressed in the future.
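The following minimal check reproduces the numbers of the Figure 7.2 example, confirming that every copyset carries the same total load while the per-node load still ranges from 6 to 14; it is a small illustrative script, not part of the proposed middleware:

    # Per-node loads and copysets taken from the Figure 7.2 example.
    load = {1: 10, 2: 14, 3: 12, 4: 8, 5: 10, 6: 6, 7: 8, 8: 12, 9: 10}
    copysets = [(1, 4, 8), (2, 6, 9), (3, 5, 7), (1, 5, 9), (2, 4, 7), (3, 6, 8)]

    totals = [sum(load[n] for n in cs) for cs in copysets]
    print(totals)                                  # [30, 30, 30, 30, 30, 30]
    print(min(load.values()), max(load.values()))  # 6 14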

The other direction is to extend ElasticCopyset to support other replication numbers. In the current design, ElasticCopyset requires that the replication number is R = 3, that is, each data object (or partition) is replicated to three distinct nodes. This design choice follows many distributed data stores (Chang et al. 2008, Lakshman & Malik 2010, Ousterhout et al. 2010, Shvachko et al. 2010). Still, we intend to investigate the possibility of supporting a wider range of replication numbers.

Figure 7.3: An extended shuffle algorithm for a higher replication number. In this example, R = 5 and S = 8; hence, there are S/(R−1) = 2 rows and R = 5 columns. Nodes 1–15 are divided into five buckets (a–e) of three nodes each; two ranks of sub-copysets are generated (rank 1 over buckets a–c, rank 2 over buckets d–e) and then merged into copysets of five nodes.

The shuffle algorithm in ElasticCopyset uses three distinct orders to place the nodes (Figure 6.4 on page 156), so that any two resulting copysets overlap by at most one node. However, this becomes increasingly difficult with a higher replication number, since there are only a limited number of distinct orders in which to shuffle nodes in a matrix. Still, there are ways to generate copysets for a higher replication number with a shuffle algorithm.

Figure 7.3 illustrates the intuition behind such a shuffle algorithm. The basic idea is to split a higher replication number R into several smaller R_i ≤ 3. For example, when R = 5, we split it into R_1 = 3 and R_2 = 2. To generate copysets, we still divide the group of nodes into R = 5 buckets. But we use the shuffle algorithm to generate one rank of sub-copysets consisting of R_1 = 3 nodes from the first three buckets, and another rank of sub-copysets consisting of R_2 = 2 nodes from the remaining buckets. Next, we merge each sub-copyset of R_1 = 3 nodes with one sub-copyset of R_2 = 2 nodes. In this way, each resulting copyset contains R = 5 nodes. This method can be extended to much higher replication numbers. For example, when R = 10, it can be split as 10 = 3 + 3 + 3 + 1. That is, four ranks of sub-copysets are generated separately, and then merged into one rank of copysets, each consisting of ten nodes.
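A minimal sketch of this split-and-merge idea is given below for R = 5 = 3 + 2. It uses a single random shuffle per bucket as a stand-in for the fixed shuffle orders of the actual algorithm, so it illustrates only the merging step and does not enforce the overlap guarantee discussed next:

    import random

    nodes = list(range(15))
    buckets = [nodes[i * 3:(i + 1) * 3] for i in range(5)]   # five buckets of three nodes

    def one_rank(bucket_group):
        # Form sub-copysets by taking the i-th element of each shuffled bucket.
        cols = [random.sample(b, len(b)) for b in bucket_group]
        return [list(t) for t in zip(*cols)]

    rank1 = one_rank(buckets[:3])                     # sub-copysets of size R1 = 3
    rank2 = one_rank(buckets[3:])                     # sub-copysets of size R2 = 2
    copysets = [a + b for a, b in zip(rank1, rank2)]  # merged copysets of size R = 5
    print(copysets)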

However, this proposal leads to one challenge that needs to be addressed in the future. Merging two sub-copysets from different ranks requires another shuffle order that is distinct from the three default orders defined. In the absence of such an order, there may exist two copysets that share more than one common node. For example, in Figure 7.3, c2 and e2 appear together in two copysets. For certain nodes, the actual scatter width will therefore be slightly smaller than the given S. Nevertheless, as discussed in Section 6.2, the scatter width is typically set as a multiple of (R − 1), and therefore a greater R results in a greater S. Hence, the actual scatter width is still sufficiently large.

To sum up, we aim to extend ElasCass in the long term to build an elastic KVS that is adaptive to the IaaS Cloud. This KVS shall be able to adapt its capacity according to changes in workload demand, to efficiently support both OLAP queries and ACID transactions, and to maintain data durability in the event of multiple node failures.

References

Abadi, D., Madden, S. & Ferreira, M. 2006, ‘Integrating compression and execution in column-oriented database systems’, in Proceedings of the 2006 ACM SIGMOD Interna- tional Conference on Management of Data, ACM, pp. 671–682.

Aberer, K. 2011, ‘Peer-to-peer data management’, Synthesis Lectures on Data Manage- ment, vol. 3, no. 2, pp. 1–150.

Abiteboul, S., Buneman, P. & Suciu, D. 2000, Data on the Web: From relations to semi- structured data and XML, Morgan Kaufmann, Massachusetts, USA.

Agrawal, D., El Abbadi, A., Das, S. & Elmore, A. J. 2011, ‘Database scalability, elasticity, and autonomy in the cloud’, in Database Systems for Advanced Applications, Springer, pp. 2–15.

Aguilera, M. K., Merchant, A., Shah, M., Veitch, A. & Karamanolis, C. 2007, ‘Sinfonia: A new paradigm for building scalable distributed systems’, in ACM SIGOPS Operating Systems Review, 6, ACM, pp. 159–174.

Amazon 2006a, ‘Amazon simple storage service (Amazon S3)’, Accessed 1 June 2007.

Amazon 2006b, ‘Amazon Elastic Compute Cloud (EC2): Scalable Cloud Servers’, Accessed 1 December 2012.

Amazon 2007, ‘Amazon Elastic Block Store (EBS)’, Accessed 1 December 2012.

Amazon 2008, ‘Amazon SimpleDB: Simple database service’, Accessed 09 September 2013.

Amazon 2011, ‘Amazon elastic cloud computing instance types’, Accessed 1 December 2012.

Amazon 2012, ‘Amazon DynamoDB: NoSQL database service’, Accessed 13 September 2013.

Anand, A., Muthukrishnan, C., Kappes, S., Akella, A. & Nath, S. 2010, ‘Cheap and Large CAMs for High Performance Data-Intensive Networked Systems.’, in Proceedings of the 7th USENIX Symposium on Networked Systems Design and Implementation (NSDI), vol. 10, USENIX Association, San Jose, CA, USA, pp. 29–29.

Andersen, D. G., Franklin, J., Kaminsky, M., Phanishayee, A., Tan, L. & Vasudevan, V. 2009, ‘FAWN: A fast array of wimpy nodes’, in Proceedings of the 22nd ACM SIGOPS Symposium on Operating Systems Principles (SOSP), ACM, Big Sky, Montana, USA, pp. 1–14.

Anderson, D. P., Cobb, J., Korpela, E., Lebofsky, M. & Werthimer, D. 2002, ‘SETI@Home: An experiment in public-resource computing’, Communications of the ACM , vol. 45, no. 11, pp. 56–61.

Anderson, E., Spence, S., Swaminathan, R., Kallahalla, M. & Wang, Q. 2005, ‘Quickly finding near-optimal storage designs’, ACM Transactions on Computer Sys- tems (TOCS), vol. 23, no. 4, pp. 337–374.

Androutsellis-Theotokis, S. & Spinellis, D. 2004, ‘A survey of peer-to-peer content distri- bution technologies’, ACM Computing Surveys (CSUR), vol. 36, no. 4, pp. 335–371.

Apache 2008a, ‘Apache CouchDB’, Accessed 9 July 2008.

Apache 2008b, ‘The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing’, Accessed 19 July 2010.

Apache 2009, ‘The Apache Cassandra Project’, Accessed 17 February 2010.

Apache 2010, ‘Apache HBase is the Hadoop database, a distributed, scalable, big data store’, Accessed 31 July 2010.

Apache 2011, ‘The Apache Accumulo sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system’, Accessed 12 July 2013.

Apache 2012a, ‘Apache Avro is a data serialization system’, Accessed 29 July 2013.

Apache 2012b, ‘The Apache Thrift software framework’, Accessed 03 August 2013.

Armbrust, M., Fox, A., Griffith, R., Joseph, A. D., Katz, R., Konwinski, A., Lee, G., Patterson, D., Rabkin, A., Stoica, I. et al. 2010, ‘A view of cloud computing’, Commu- nications of the ACM , vol. 53, no. 4, pp. 50–58.

Atikoglu, B., Xu, Y., Frachtenberg, E., Jiang, S. & Paleczny, M. 2012, ‘Workload analysis of a large-scale key-value store’, ACM SIGMETRICS Performance Evaluation Review, vol. 40, pp. 53–64.

Baker, J., Bond, C., Corbett, J., Furman, J., Khorlin, A., Larson, J., Léon, J.-M., Li, Y., Lloyd, A. & Yushprakh, V. 2011, ‘Megastore: Providing scalable, highly available storage for interactive services’, in 5th Biennial Conference on Innovative Data Systems Research (CIDR), vol. 11, Asilomar, California, USA, pp. 223–234.

Barahmand, S. & Ghandeharizadeh, S. 2013, ‘BG: A Benchmark to Evaluate Interactive Social Networking Actions.’, in 6th Biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, California, USA, pp. 118–130.

Barker, S., Chi, Y., Moon, H. J., Hacig¨um¨u¸s,H. & Shenoy, P. 2012, ‘Cut me some slack: Latency-aware live migration for databases’, in Proceedings of the 15th International Conference on Extending Database Technology (EDBT), ACM, Berlin, Germany, pp. 432–443.

Basho Tech. 2012, ‘Riak is an open source, distributed database’, Accessed 03 May 2013.

Beaver, D., Kumar, S., Li, H. C., Sobel, J., Vajgel, P. et al. 2010, ‘Finding a needle in haystack: Facebook’s photo storage’, in Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI), vol. 10, USENIX Association, Vancouver, BC, Canada, pp. 1–8.

Bermbach, D. & Kuhlenkamp, J. 2013, ‘Consistency in distributed storage systems: An overview of models, metrics and measurement approaches’, in The International Con- ference on NETworked sYStems (NETYS), Springer, Marrakesh, Marocco, pp. 175–189.

Bermbach, D. & Tai, S. 2011, ‘Eventual consistency: How soon is eventual? An evaluation of Amazon S3’s consistency behavior’, in Proceedings of the 6th Workshop on Middleware for Service Oriented Computing, ACM, p. 1.

Berners-Lee, T., Hendler, J., Lassila, O. et al. 2001, ‘The semantic web’, Scientific amer- ican, vol. 284, no. 5, pp. 28–37.

Bernstein, P. A. & Goodman, N. 1981, ‘Concurrency control in distributed database sys- tems’, ACM Computing Surveys (CSUR), vol. 13, no. 2, pp. 185–221.

Bernstein, P. A., Hadzilacos, V. & Goodman, N. 1987, Concurrency control and recovery in database systems, vol. 370, Addison-Wesley New York.

Bharambe, A. R., Agrawal, M. & Seshan, S. 2004, ‘Mercury: Supporting scalable multi- attribute range queries’, ACM SIGCOMM Computer Communication Review, vol. 34, no. 4, pp. 353–366.

Bloom, B. H. 1970, ‘Space/time trade-offs in hash coding with allowable errors’, Commu- nications of the ACM , vol. 13, no. 7, pp. 422–426.

Bondy, J. A. & Murty, U. S. R. 1976, Graph theory with applications, vol. 290, Macmillan Publishers Ltd., London, UK.

Borthakur, D., Gray, J., Sarma, J. S., Muthukkaruppan, K., Spiegelberg, N., Kuang, H., Ranganathan, K., Molkov, D., Menon, A. & Rash, S. 2011, ‘Apache Hadoop goes real- time at Facebook’, in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, ACM, Athens, Greece, pp. 1071–1080.

Burrows, M. 2006, ‘The Chubby lock service for loosely-coupled distributed systems’, in Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), USENIX Association, Seattle, WA, USA, pp. 335–350.

Buyya, R. & Vazhkudai, S. 2001, ‘Compute power market: Towards a market-oriented grid’, in Proceedings of 1st IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid), Brisbane, Australia, pp. 574–581.

Buyya, R., Yeo, C. S., Venugopal, S., Broberg, J. & Brandic, I. 2009, ‘Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility’, Future Generation computer systems, vol. 25, no. 6, pp. 599–616.

Cao, Y., Chen, C., Guo, F., Jiang, D., Lin, Y., Ooi, B. C., Vo, H. T., Wu, S. & Xu, Q. 2011, ‘ES2: A cloud data storage system for supporting both OLTP and OLAP’, in Proceedings of the 27th International Conference on Data Engineering 2011 (ICDE), IEEE, Hannover, Germany, pp. 291–302.

Cattell, R. 2011, ‘Scalable SQL and NoSQL data stores’, ACM SIGMOD Record, vol. 39, no. 4, pp. 12–27.

Ceri, S. & Pelagatti, G. 1984, Distributed databases principles and systems, McGraw-Hill Inc., New York, NY, USA.

Chandra, T., Griesemer, R. & Redstone, J. 2007, ‘Paxos made live: An engineering per- spective’, in Proceedings of the 26th ACM Symposium on Principles of Distributed Com- puting (PODC), vol. 7, Portland, Oregon, USA, pp. 398–407.

Chang, F., Dean, J., Ghemawat, S., Hsieh, W., Wallach, D., Burrows, M., Chandra, T., Fikes, A. & Gruber, R. 2006, ‘Bigtable: A distributed structured data storage system’, in Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), USENIX Association, Seattle, WA, USA, pp. 305–314.

Chang, F., Dean, J., Ghemawat, S., Hsieh, W., Wallach, D., Burrows, M., Chandra, T., Fikes, A. & Gruber, R. 2008, ‘Bigtable: A distributed storage system for structured data’, ACM Transactions on Computer Systems (TOCS), vol. 26, no. 2, pp. 1–26.

Chang, F. W., Ji, M., Leung, S.-T., MacCormick, J., Perl, S. E. & Zhang, L. 2002, ‘Myriad: Cost-effective disaster tolerance’, in Proceedings of the 1st USENIX Conference on File and Storage Technologies (FAST), vol. 2, USENIX Association, Monterey, CA, USA, p. 8.

Chansler, R. J. 2012, ‘Data availability and durability with the Hadoop distributed file system’, The USENIX Magazine, FILESYSTEMS, vol. 37, no. 1, pp. 16–22.

Chodorow, K. & Dirolf, M. 2010, MongoDB: The definitive guide, 1st ed., O’Reilly Media Inc., CA, USA.

Chun, B.-G., Dabek, F., Haeberlen, A., Sit, E., Weatherspoon, H., Kaashoek, M. F., Kubiatowicz, J. & Morris, R. 2006, ‘Efficient replica maintenance for distributed storage systems’, in Proceedings of the 3rd USENIX Symposium on Networked Systems Design and Implementation (NSDI), vol. 6, USENIX Association, San Jose, CA, USA, pp. 4–4.

Cidon, A., Nagaraj, K., Katti, S. & Viswanath, P. 2012, ‘MinCopysets: Derandomizing replication in cloud storage’, Tech. rep., Stanford University, CA, USA. Accessed 30 June 2013.

Cidon, A., Rumble, S., Stutsman, R., Katti, S., Ousterhout, J. & Rosenblum, M. 2013, ‘Copysets: Reducing the frequency of data loss in cloud storage’, in Proceedings of the 2013 USENIX Annual Technical Conference (ATC’13), USENIX Association, San Jose, CA, USA, pp. 37–48.

Clark, C., Fraser, K., Hand, S., Hansen, J. G., Jul, E., Limpach, C., Pratt, I. & Warfield, A. 2005, ‘Live migration of virtual machines’, in Proceedings of the 2nd USENIX Sym- posium on Networked Systems Design and Implementation (NSDI), USENIX Associ- ation, Boston, MA, USA, pp. 273–286.

Clarke, I., Sandberg, O., Wiley, B. & Hong, T. W. 2001, ‘Freenet: A distributed an- onymous information storage and retrieval system’, in Designing Privacy Enhancing Technologies, Proceedings of International Workshop on Design Issues in Anonymity and Unobservability, Springer, Berkeley, CA, USA, pp. 46–66.

Cockroft, A. 2011, ‘Netflix goes global’, in 14th International Workshop on High Perform- ance Transaction Systems (HPTS), USENIX Association, Pacific Grove, CA, USA, pp. 80–80.

Codd, E. F. 1970, ‘A relational model of data for large shared data banks’, Communications of the ACM, vol. 13, no. 6, pp. 377–387.

Cohen, B. 2003, ‘Incentives build robustness in BitTorrent’, in 1st Workshop on Economics of Peer-to-Peer Systems, vol. 6, Berkeley, CA, USA, pp. 68–72.

Cooper, B., Ramakrishnan, R., Srivastava, U., Silberstein, A., Bohannon, P., Jacobsen, H., Puz, N., Weaver, D. & Yerneni, R. 2008, ‘PNUTS: Yahoo!’s hosted data serving platform’, Proceedings of the VLDB Endowment (PVLDB), vol. 1, no. 2, pp. 1277– 1288.

Cooper, B., Silberstein, A., Tam, E., Ramakrishnan, R. & Sears, R. 2010, ‘Benchmarking cloud serving systems with YCSB’, in Proceedings of the 1st ACM Symposium on Cloud Computing (SOCC), ACM, Indianapolis, IN, USA, pp. 143–154.

Corbett, J. C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J., Ghemawat, S., Gubarev, A., Heiser, C. & Hochschild, P. 2012, ‘Spanner: Google’s globally-distributed database’, in Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI), vol. 1, USENIX Association, Hollywood, CA, USA, pp. 251–264.

Couchbase 2012, ‘Couchbase: Document-oriented NoSQL database’, Accessed 12 December 2012.

Curino, C., Jones, E., Zhang, Y., Wu, E. & Madden, S. 2010a, ‘Relational cloud: The case for a database service’, Tech. Rep. MIT-CSAIL-TR-2010-014, Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA, USA.

Curino, C., Jones, E., Zhang, Y. & Madden, S. 2010b, ‘Schism: A workload-driven ap- proach to database replication and partitioning’, Proceedings of the VLDB Endowment (PVLDB), vol. 3, no. 1-2, pp. 48–57.

Curino, C., Jones, E., Popa, R., Malviya, N., Wu, E., Madden, S., Balakrishnan, H., Zeldovich, N. et al. 2011, ‘Relational cloud: A database-as-a-service for the cloud’, in 5th Biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, California, USA, pp. 235–240.

Dabek, F., Kaashoek, M. F., Karger, D., Morris, R. & Stoica, I. 2001, ‘Wide-area cooperative storage with CFS’, ACM SIGOPS Operating Systems Review, vol. 35, no. 5, pp. 202–215.

Danga Interactive 2004, ‘Memcached: A distributed memory object caching system’, Accessed 03 September 2010.

Das, S., Agrawal, D. & El Abbadi, A. 2009, ‘ElasTraS: An elastic transactional data store in the cloud’, in Workshop on Hot Topics in Cloud Computing (HotCloud’09), 7, USENIX Association, San Diego, CA, USA, pp. 35–40.

Das, S., Agrawal, D. & El Abbadi, A. 2010a, ‘G-Store: A scalable data store for trans- actional multi key access in the cloud’, in Proceedings of the 1st ACM Symposium on Cloud Computing (SOCC), ACM, Indianapolis, IN, USA, pp. 163–174.

Das, S., Nishimura, S., Agrawal, D. & El Abbadi, A. 2010b, ‘Live database migration for elasticity in a multitenant database for cloud platforms’, Tech. Rep. 2010-09, CS, UCSB, Santa Barbara, CA, USA.

Das, S., Nishimura, S., Agrawal, D. & El Abbadi, A. 2011, ‘Albatross: lightweight elasticity in shared storage databases for the cloud using live data migration’, Proceedings of the VLDB Endowment (PVLDB), vol. 4, no. 8, pp. 494–505.

Das, S., Agrawal, D. & El Abbadi, A. 2013, ‘ElasTraS: An elastic, scalable, and self- managing transactional database for the cloud’, ACM Transactions on Database Systems (TODS), vol. 38, no. 1, p. 5.

DataStax 2012, ‘The data consistency model in Apache Cassandra 1.0’, Accessed 12 December 2012.

DataStax 2013a, ‘DataStax Enterprise: The enterprise NoSQL database platform for today’s online applications’, Accessed 23 August 2013.

DataStax 2013b, ‘CQL for Cassandra 2.0’, Accessed 04 December 2013.

DB-Engines 2013, ‘DB-Engines ranking per database model category’, Accessed 31 December 2013.

DB-Engines 2014a, ‘Cassandra System Properties’, Accessed 08 January 2014.

DB-Engines 2014b, ‘Redis System Properties’, Accessed 08 January 2014.

DB-Engines 2014c, ‘DB-Engines ranking per database model category’, Accessed 1 February 2014.

Dean, J. & Ghemawat, S. 2008, ‘MapReduce: Simplified data processing on large clusters’, Communications of the ACM , vol. 51, no. 1, pp. 107–113.

Debnath, B., Sengupta, S. & Li, J. 2010a, ‘ChunkStash: Speeding up inline storage de- duplication using flash memory’, in Proceedings of the 2010 USENIX Annual Technical Conference (ATC’10), USENIX Association, Boston, MA, USA, pp. 16–16.

Debnath, B., Sengupta, S. & Li, J. 2010b, ‘FlashStore: high throughput persistent key- value store’, Proceedings of the VLDB Endowment (PVLDB), vol. 3, no. 1-2, pp. 1414– 1425.

Debnath, B., Sengupta, S. & Li, J. 2011, ‘SkimpyStash: RAM space skimpy key-value store on flash-based storage’, in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, ACM, Athens, Greece, pp. 25–36.

DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P. & Vogels, W. 2007, ‘Dynamo: Amazon’s highly avail- able key-value store’, in Proceedings of the 21st ACM SIGOPS Symposium on Operating Systems Principles (SOSP), vol. 7, ACM, Stevenson, WA, USA, pp. 205–220.

DeWitt, D. & Gray, J. 1992, ‘Parallel database systems: the future of high performance database systems’, Communications of the ACM , vol. 35, no. 6, pp. 85–98.

Ding, C. H., Nutanong, S. & Buyya, R. 2005, Peer-to-peer computing: The Evolution of a Disruptive Technology, chap. Peer-to-Peer Networks for Content Sharing, Igi Global, pp. 28–65.

Domingos, P. & Hulten, G. 2000, ‘Mining high-speed data streams’, in Proceedings of the 6th ACM International Conference on Knowledge Discovery and Data mining (SIGKDD), ACM, pp. 71–80.

Dong, Y. 2009, ‘Hypertable goes realtime at Baidu’, Accessed 19 December 2013.

Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Pratt, I., Warfield, A., Barham, P. & Neugebauer, R. 2003, ‘Xen and the art of virtualization’, in Proceedings of the 19th ACM SIGOPS Symposium on Operating Systems Principles (SOSP), ACM, New York, NY, USA, pp. 164–177.

Elmore, A. J., Das, S., Agrawal, D. & El Abbadi, A. 2011, ‘Zephyr: Live migration in shared nothing databases for elastic cloud platforms’, in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, ACM, pp. 301–312.

Escriva, R., Wong, B. & Sirer, E. G. 2012, ‘HyperDex: A distributed, searchable key-value store’, in Proceedings of the ACM SIGCOMM 2012 Conference on Applications, Tech- nologies, Architectures, and Protocols for Computer Communication, Helsinki, Finland, pp. 25–36.

Exp´osito,R. R., Taboada, G. L., Ramos, S., Gonz´alez-Dom´ınguez, J., Touri˜no,J. & Doallo, R. 2013, ‘Analysis of I/O performance on an amazon EC2 cluster compute and high I/O platform’, Journal of grid computing, vol. 11, no. 4, pp. 613–631.

FastTrack 2001, ‘FastTrack Peer-to-Peer Technology Company’, Accessed 20 March 2001.

Fitzpatrick, B. 2004, ‘Distributed caching with memcached’, Linux journal, vol. 2004, no. 124, p. 5.

Ford, D., Labelle, F., Popovici, F. I., Stokely, M., Truong, V.-A., Barroso, L., Grimes, C. & Quinlan, S. 2010, ‘Availability in globally distributed storage systems’, in Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI), USENIX Association, Vancouver, BC, Canada, pp. 61–74.

Foster, I., Zhao, Y., Raicu, I. & Lu, S. 2008, ‘Cloud computing and grid computing 360- degree compared’, in Grid Computing Environments Workshop (GCE’08), IEEE, pp. 1–10.

Fox, A., Gribble, S. D., Chawathe, Y., Brewer, E. A. & Gauthier, P. 1997, ‘Cluster-based scalable network services’, in Proceedings of the 16th ACM SIGOPS Symposium on Operating Systems Principles (SOSP), ACM, New York, NY, USA, pp. 78–91.

Gartner, Inc. 2013, ‘Gartner says Cloud computing will become the bulk of new IT spend’, Accessed 12 December 2013.

George, L. 2011, HBase: The definitive guide, O’Reilly Media Inc., CA, USA.

Ghemawat, S., Gobioff, H. & Leung, S. 2003, ‘The Google file system’, in Proceedings of the 19th ACM SIGOPS Symposium on Operating Systems Principles (SOSP), ACM, New York, NY, USA, pp. 29–43.

Gibson, G. A. & Van Meter, R. 2000, ‘Network attached storage architecture’, Commu- nications of the ACM , vol. 43, no. 11, pp. 37–45.

Gifford, D. K. 1981, ‘Information storage in a decentralized computer system’, Tech. Rep. CSL-81-8, Xerox PARC.

Gilbert, S. & Lynch, N. 2002, ‘Brewer’s conjecture and the feasibility of consistent, avail- able, partition-tolerant web services’, ACM SIGACT News, vol. 33, no. 2, pp. 51–59.

GitHub.com 2010, ‘BrianFrankCooper/YCSB at GitHub’, Accessed 04 May 2013.

Gnutella 2000, ‘The Gnutella Web caching system’, Accessed 14 March 2000.

Google 2008, ‘Google App Engine: Platform as a Service’, Accessed 3 December 2011.

Gray, J. 1992, Benchmark handbook: for database and transaction processing systems, Morgan Kaufmann Publishers Inc., Burlington, Massachusetts, USA.

Gray, J. et al. 1981, ‘The transaction concept: Virtues and limitations’, in VLDB, vol. 81, pp. 144–154.

Guo, Z., McDirmid, S., Yang, M., Zhuang, L., Zhang, P., Luo, Y., Bergan, T., Bodik, P., Musuvathi, M. & Zhang, Z. 2013, ‘Failure recovery: When the cure is worse than the disease’, in Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems (HotOS’13), USENIX Association, Santa Ana Pueblo, New Mexico, USA, pp. 8–8.

Gupta, A., Agrawal, D. & El Abbadi, A. 2003a, ‘Approximate Range Selection Queries in Peer-to-Peer Systems.’, in CIDR, vol. 3, pp. 141–151.

Gupta, A., Liskov, B. & Rodrigues, R. 2003b, ‘One hop lookups for peer-to-peer over- lays’, in Proceedings of the 9th USENIX conference on Hot Topics in Operating Systems (HotOS’03), USENIX Association, Lihue (Kauai), Hawaii, USA, pp. 7–12.

Hacigumus, H., Tatemura, J., Hsiung, W.-P., Moon, H. J., Po, O., Sawires, A., Chi, Y. & Jafarpour, H. 2010, ‘CloudDB: One size fits all revived’, in 6th World Congress on Services (SERVICES), IEEE, pp. 148–149.

Haeberlen, A., Mislove, A. & Druschel, P. 2005, ‘Glacier: Highly durable, decentralized storage despite massive correlated failures’, in Proceedings of the 2nd USENIX Sym- posium on Networked Systems Design and Implementation (NSDI), vol. 2, USENIX Association, Boston, MA, USA, pp. 143–158.

Hamilton, J. 2008, ‘Geo-replication at facebook’, Accessed 1 June 2011.

Harvey, N. J., Jones, M. B., Saroiu, S., Theimer, M. & Wolman, A. 2003, ‘SkipNet: A scal- able overlay network with practical locality properties’, in 4th USENIX Symposium on Internet Technologies and Systems (USITS’03), vol. 274, USENIX Association, Seattle, WA, USA, pp. 113–126.

Hazelcast 2013, ‘Hazelcast: The leading open source in-memory data grid’, Accessed 20 December 2013.

Helland, P. 2007, ‘Life beyond distributed transactions: An apostate’s opinion’, in 3rd Bi- ennial Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, USA, pp. 132–141.

Herbst, N. R., Kounev, S. & Reussner, R. 2013, ‘Elasticity in Cloud Computing: What It Is, and What It Is Not’, in Proceedings of the 10th International Conference on Autonomic Computing (ICAC), USENIX, San Jose, CA, USA, pp. 23–27.

Herlihy, M. P. & Wing, J. M. 1990, ‘Linearizability: A correctness condition for concurrent objects’, ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 12, no. 3, pp. 463–492.

Huebsch, R., Hellerstein, J. M., Lanham, N., Loo, B. T., Shenker, S. & Stoica, I. 2003, ‘Querying the Internet with PIER’, in Proceedings of the 29th international conference on Very large data bases (VLDB’03), vol. 29, VLDB Endowment, Berlin, Germany, pp. 321–332.

Hunt, P., Konar, M., Junqueira, F. P. & Reed, B. 2010, ‘ZooKeeper: Wait-free coordin- ation for internet-scale systems’, in Proceedings of the 2010 USENIX Annual Technical Conference (ATC’10), vol. 8, USENIX Association, Boston, MA, USA, pp. 11–11.

Hypertable 2009, ‘Hypertable: Big data, big performance’, Accessed 09 May 2013.

Jagadish, H. V., Ooi, B. C. & Vu, Q. H. 2005, ‘Baton: A balanced tree structure for peer- to-peer networks’, in Proceedings of the 31st International Conference on Very Large Data Bases (VLDB), VLDB Endowment, Trondheim, Norway, pp. 661–672.

Kaashoek, M. F. & Karger, D. R. 2003, ‘Koorde: A simple degree-optimal distributed hash table’, in Peer-to-Peer Systems II , Springer, pp. 98–107.

Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M. & Lewin, D. 1997, ‘Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web’, in Proceedings of the 29th Annual Symposium on Theory of Computing (STOC’97), ACM, El Paso, TX, USA, pp. 654–663.

Kavalanekar, S., Worthington, B., Zhang, Q. & Sharda, V. 2008, ‘Characterization of stor- age workload traces from production windows servers’, in Proceedings of IEEE Interna- tional Symposium on Workload Characterization (IISWC), IEEE, Seattle, Washington, USA, pp. 119–128.

Kazaa 2001, ‘Kazaa Media Desktop’, Accessed 31 March 2001.

Kivity, A., Kamay, Y., Laor, D., Lublin, U. & Liguori, A. 2007, ‘KVM: the Linux virtual machine monitor’, in Proceedings of the Linux Symposium, vol. 1, pp. 225–230.

Klyne, G., Carroll, J. J. & McBride, B. 2004, ‘Resource description framework (RDF): Concepts and abstract syntax’, W3C recommendation, vol. 10.

Koh, Y., Knauerhase, R., Brett, P., Bowman, M., Wen, Z. & Pu, C. 2007, ‘An analysis of performance interference effects in virtual environments’, in IEEE International Sym- posium on Performance Analysis of Systems & Software (ISPASS 2007), IEEE, pp. 200–209.

Krishnan, P., Raz, D. & Shavitt, Y. 2000, ‘The cache location problem’, IEEE/ACM Transactions on Networking (TON), vol. 8, no. 5, pp. 568–582.

Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Weimer, W. et al. 2000, ‘Oceanstore: An architecture for global-scale persistent storage’, ACM Sigplan Notices, vol. 35, no. 11, pp. 190–201.

Lakshman, A. & Malik, P. 2010, ‘Cassandra: A decentralized structured storage system’, ACM SIGOPS Operating Systems Review, vol. 44, no. 2, pp. 35–40.

Lamb, C., Landis, G., Orenstein, J. & Weinreb, D. 1991, ‘The ObjectStore database system’, Communications of the ACM , vol. 34, no. 10, pp. 50–63.

Lamport, L. 1978, ‘Time, clocks, and the ordering of events in a distributed system’, Communications of the ACM , vol. 21, no. 7, pp. 558–565.

Lamport, L. 1998, ‘The part-time parliament’, ACM Transactions on Computer Systems (TOCS), vol. 16, no. 2, pp. 133–169.

Lamport, L. 2001, ‘Paxos made simple’, ACM SIGACT News, vol. 32, no. 4, pp. 18–25.

Laoutaris, N., Telelis, O., Zissimopoulos, V. & Stavrakakis, I. 2006, ‘Distributed selfish replication’, IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 17, no. 12, pp. 1401–1413.

Lee, E. K. & Thekkath, C. A. 1996, ‘Petal: Distributed virtual disks’, ACM SIGOPS Operating Systems Review, vol. 30, pp. 84–92.

Leventhal, A. 2008, ‘Flash storage memory’, Communications of the ACM , vol. 51, no. 7.

Li, A., Yang, X., Kandula, S. & Zhang, M. 2010, ‘CloudCmp: Comparing public cloud providers’, in Proceedings of the 10th ACM SIGCOMM conference on Internet meas- urement, ACM, pp. 1–14. REFERENCES 205

Lim, H., Fan, B., Andersen, D. G. & Kaminsky, M. 2011, ‘SILT: A memory-efficient, high- performance key-value store’, in Proceedings of the 23rd ACM SIGOPS Symposium on Operating Systems Principles (SOSP), ACM, Cascais, Portugal, pp. 1–13.

Lim, H. C., Babu, S. & Chase, J. S. 2010, ‘Automated control for elastic storage’, in Pro- ceedings of the 7th International Conference on Autonomic Computing (ICAC), IEEE, Washington, DC, USA, pp. 1–10.

Lin, M. J. & Marzullo, K. 1999, ‘Directional gossip: Gossip in a wide area network’, in Proceedings of the 3rd European Dependable Computing Conference (EDCC), Springer, Prague, Czech Republic, pp. 364–379.

Litwin, W. 1980, ‘Linear hashing: A new tool for file and table addressing’, in Proceedings of the 6th International Conference on Very Large Data Bases (VLDB), vol. 80, IEEE Computer Society, Montreal, Canada, pp. 1–3.

Lua, E. K., Crowcroft, J., Pias, M., Sharma, R. & Lim, S. 2005, ‘A survey and com- parison of peer-to-peer overlay network schemes.’, IEEE Communications Surveys and Tutorials, vol. 7, no. 1-4, pp. 72–93.

Lupu, M., Ooi, B. C. & Tay, Y. 2008, ‘Paths to stardom: calibrating the potential of a peer-based data management system’, in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, ACM, Vancouver, BC, Canada, pp. 265–278.

Malkhi, D., Naor, M. & Ratajczak, D. 2002, ‘Viceroy: A scalable and dynamic emulation of the butterfly’, in Proceedings of the 21st Annual ACM Symposium on Principles of Distributed Computing (PODC), ACM, Monterey, CA, USA, pp. 183–192.

Maymounkov, P. & Mazieres, D. 2002, ‘Kademlia: A peer-to-peer information system based on the xor metric’, in Peer-to-Peer Systems, Springer, pp. 53–65.

Mell, P. & Grance, T. 2011, ‘The NIST definition of cloud computing’, NIST special publication, vol. 800, no. 145, p. 7.

MemcacheDB 2008, ‘MemcacheDB: A distributed key-value storage system designed for

persistent’, Accessed 20 January 2009, . 206 REFERENCES

Milojicic, D. S., Kalogeraki, V., Lukose, R., Nagaraja, K., Pruyne, J., Richard, B., Rollins, S. & Xu, Z. 2002, ‘Peer-to-peer computing’, Tech. Rep. HPL-2002-57, HP Lab, Palo Alto, CA, USA.

Moulavi, M. A., Al-Shishtawy, A. & Vlassov, V. 2012, ‘State-space feedback control for elastic distributed storage in a cloud environment’, in The 8th International Conference on Autonomic and Autonomous Systems (ICAS), St. Maarten, Netherlands Antilles, pp. 18–27.

Nakamoto, S. 2008, ‘Bitcoin: A peer-to-peer electronic cash system’, Accessed 1 March

2012, .

Napster 1999, ‘Napster Online Music Store’, Accessed 30 June 1999, .

Narayanan, D., Donnelly, A., Thereska, E., Elnikety, S. & Rowstron, A. I. 2008, ‘Everest: Scaling down peak loads through I/O off-loading’, in Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI), vol. 8, USENIX Association, San Diego, CA, USA, pp. 15–28.

Nathuji, R., Kansal, A. & Ghaffarkhah, A. 2010, ‘Q-clouds: managing performance inter- ference effects for QoS-aware clouds’, in Proceedings of the 5th European conference on Computer systems, ACM, pp. 237–250.

Neo Technology 2010, ‘Neo4j: The world’s leading graph database’, Accessed 28 March

2010, .

Objectivity Inc. 2010, ‘InfiniteGraph’, Accessed 21 December 2011, .

Olston, C., Reed, B., Srivastava, U., Kumar, R. & Tomkins, A. 2008, ‘Pig latin: A not- so-foreign language for data processing’, in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, ACM, Vancouver, BC, Canada, pp. 1099–1110.

O’Neil, P., Cheng, E., Gawlick, D. & O’Neil, E. 1996, ‘The log-structured merge-tree (LSM-tree)’, Acta Informatica, vol. 33, no. 4, pp. 351–385. REFERENCES 207

Ongaro, D., Rumble, S. M., Stutsman, R., Ousterhout, J. & Rosenblum, M. 2011, ‘Fast crash recovery in RAMCloud’, in Proceedings of the 23rd ACM SIGOPS Symposium on Operating Systems Principles (SOSP), ACM, Cascais, Portugal, pp. 29–41.

Oracle 2001, ‘Oracle Real Application Clusters (RAC) database’, Accessed 03

September 2011, .

Oram, A. 2001, Peer-to-peer: Harnessing the Power of a Disruptive Technologies, O’Reilly Media Inc., CA, USA.

Ousterhout, J., Agrawal, P., Erickson, D., Kozyrakis, C., Leverich, J., Mazi`eres,D., Mitra, S., Narayanan, A., Parulkar, G., Rosenblum, M. et al. 2010, ‘The case for RAMClouds: Scalable high-performance storage entirely in DRAM’, ACM SIGOPS Operating Sys- tems Review, vol. 43, no. 4, pp. 92–105.

Pagh, R. & Rodler, F. F. 2004, ‘Cuckoo hashing’, Journal of Algorithms, vol. 51, no. 2, pp. 122–144.

Paiva, J., Ruivo, P., Romano, P. & Rodrigues, L. 2013, ‘AUTOPLACER: Scalable self- tuning data placement in distributed key-value stores’, in 10th International Conference on Autonomic Computing (ICAC), USENIX Association, San Jose, CA, USA, pp. 119– 131.

Papadimitriou, C. H. 1979, ‘The serializability of concurrent database updates’, Journal of the ACM (JACM), vol. 26, no. 4, pp. 631–653.

Patil, S., Polte, M., Ren, K., Tantisiriroj, W., Xiao, L., L´opez, J., Gibson, G., Fuchs, A. & Rinaldi, B. 2011, ‘YCSB++: Benchmarking and performance debugging advanced features in scalable table stores’, in Proceedings of the 2nd ACM Symposium on Cloud Computing (SOCC), ACM, Cascais, Portugal, p. 9.

Patterson, D., Gibson, G. & Katz, R. 1988, ‘A case for redundant arrays of inexpensive disks (RAID)’, in Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, ACM, Chicago, Illinois, pp. 109–116.

Patterson, H., Manley, S., Federwisch, M., Hitz, D., Kleiman, S. & Owara, S. 2002,

‘SnapMirror R : File system based asynchronous mirroring for disaster recovery’, in 208 REFERENCES

Proceedings of the 1st USENIX Conference on File and Storage Technologies (FAST), USENIX Association, Monterey, CA, USA, pp. 9–9.

Pavlo, A., Paulson, E., Rasin, A., Abadi, D. J., DeWitt, D. J., Madden, S. & Stonebraker, M. 2009, ‘A comparison of approaches to large-scale data analysis’, in Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, ACM, pp. 165–178.

Plaxton, C. G., Rajaraman, R. & Richa, A. W. 1999, ‘Accessing nearby copies of replicated objects in a distributed environment’, Theory of Computing Systems, vol. 32, no. 3, pp. 241–280.

Popek, G. J. & Goldberg, R. P. 1974, ‘Formal requirements for virtualizable third gener- ation architectures’, Communications of the ACM , vol. 17, no. 7, pp. 412–421.

Ramasubramanian, V. & Sirer, E. G. 2004, ‘Beehive: O (1) Lookup Performance for Power- Law Query Distributions in Peer-to-Peer Overlays’, in Proceedings of the 1st USENIX Symposium on Networked Systems Design and Implementation (NSDI), vol. 4, USENIX Association, San Francisco, CA, USA, pp. 8–8.

Raphael, J. 2013, ‘In Pictures: The worst Cloud outages of 2013’, Accessed

20 March 2014, .

Ratnasamy, S., Francis, P., Handley, M., Karp, R. & Shenker, S. 2001, ‘A scalable content- addressable network’, ACM SIGCOMM Computer Communication Review, vol. 31, no. 4, pp. 161–172.

Redis 2009, ‘Redis is an open source, BSD licensed, advanced key-value store’, Accessed

18 January 2009, .

Rodrigues, R. & Liskov, B. 2005, ‘High availability in DHTs: Erasure coding vs. replica- tion’, in Peer-to-Peer Systems IV , Springer, pp. 226–239.

Rowstron, A. & Druschel, P. 2001, ‘Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems’, in ACM/IFIP/USENIX International Mid- dleware Conference 2001 , Springer, Heidelberg, Germany, pp. 329–350. REFERENCES 209

Saito, Y., Frølund, S., Veitch, A., Merchant, A. & Spence, S. 2004, ‘FAB: Building dis- tributed enterprise disk arrays from commodity components’, ACM SIGOPS Operating Systems Review, vol. 38, no. 5, pp. 48–58.

Savinov, S. & Daudjee, K. 2010, ‘Dynamic database replica provisioning through virtual- ization’, in Proceedings of the 2nd International workshop on Cloud data management, ACM, pp. 41–46.

Schmidt, A., Waas, F., Kersten, M., Carey, M. J., Manolescu, I. & Busse, R. 2002, ‘XMark: A benchmark for XML data management’, in Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), VLDB Endowment, pp. 974–985.

Seeger, M. 2009, ‘Key-value stores: A practical overview’, Computer Science and Media.

Accessed 07 December 2010, .

Seltzer, M. & Bostic, K. 1986, ‘Oracle Berkeley DB products’, Accessed 03

September 2011, .

Sheth, A. P. & Larson, J. A. 1990, ‘Federated database systems for managing distributed, heterogeneous, and autonomous databases’, ACM Computing Surveys (CSUR), vol. 22, no. 3, pp. 183–236.

Shvachko, K., Kuang, H., Radia, S. & Chansler, R. 2010, ‘The Hadoop distributed file system’, in 26th IEEE Symposium on Mass Storage Systems and Technologies (MSST), IEEE, Incline Village, Nevada, USA, pp. 1–10.

SPC 2013, ‘Storage Performance Council: Defining, administrating, and promoting industry-standard, vendor-neutral benchmarks to characterize the performance of stor-

age products’, Accessed 22 June 2013, .

Stoica, I., Morris, R., Karger, D., Kaashoek, M. F. & Balakrishnan, H. 2001, ‘Chord: A scalable peer-to-peer lookup service for internet applications’, ACM SIGCOMM Com- puter Communication Review, vol. 31, pp. 149–160. 210 REFERENCES

Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O’Neil, E. et al. 2005, ‘C-store: A column-oriented DBMS’, in Proceedings of the 31st International Conference on Very Large Data Bases (VLDB), VLDB Endowment, pp. 553–564.

Sullivan III, W. T., Werthimer, D., Bowyer, S., Cobb, J., Gedye, D. & Anderson, D. 1997, ‘A new major SETI project based on Project Serendip data and 100,000 per- sonal computers’, Astronomical and Biochemical Origins and the Search for Life in the Universe: Proceedings of the 5th IAU Colloquium International Conference on Bioastro- nomy, vol. 1, no. 161, p. 729.

Tamer Ozsu,¨ M. & Valduriez, P. 2011, Principles of Distributed Database Systems, 3rd ed., Springer.

Tanenbaum, A. S. & Van Steen, M. 2007, Distributed systems: Principles and paradigms, 2nd ed., Prentice Hall, Upper Saddle River, NJ.

Tarantool 2009, ‘Tarantool: An in-memory NoSQL database’, Accessed 12 December 2013,

.

Terracotta Inc. 2009, ‘Ehcache: Performance at any scale’, Accessed 04 March 2013,

.

Thomas, R. H. 1979, ‘A majority consensus approach to concurrency control for multiple copy databases’, ACM Transactions on Database Systems (TODS), vol. 4, no. 2, pp. 180–209.

Thusoo, A., Sarma, J., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P. & Murthy, R. 2009, ‘Hive: A warehousing solution over a map-reduce framework’, Proceedings of the VLDB Endowment (PVLDB), vol. 2, no. 2, pp. 1626–1629.

TPC 2001, ‘TPC Benchmarks’, Accessed 02 May 2011, .

Trushkowsky, B., Bod´ık,P., Fox, A., Franklin, M. J., Jordan, M. I. & Patterson, D. A. 2011, ‘The SCADS Director: Scaling a distributed storage system under stringent per- formance requirements’, in Proceedings of the 9th USENIX Conference on File and Storage Technologies (FAST), USENIX Association, San Jose, CA, USA, pp. 163–176. REFERENCES 211

Van Renesse, R. & Schneider, F. B. 2004, ‘Chain replication for supporting high through- put and availability’, in Proceedings of the 6th USENIX Symposium on Operating Sys- tems Design and Implementation (OSDI), vol. 4, USENIX Association, San Francisco, CA, USA, pp. 91–104.

Van Renesse, R., Minsky, Y. & Hayden, M. 1998, ‘A gossip-style failure detection service’, in ACM/IFIP/USENIX International Middleware Conference 1998 , Springer, The Lake District, England, pp. 55–70.

Versant corporation 2008, ‘Db4o: Java and .NET Object Database’, Accessed 03 Septem-

ber 2011, .

Vishwanath, K. V. & Nagappan, N. 2010, ‘Characterizing cloud computing hardware reliability’, in Proceedings of the 1st ACM Symposium on Cloud Computing (SOCC), ACM, Indianapolis, IN, USA, pp. 193–204.

Vo, H. T., Chen, C. & Ooi, B. C. 2010, ‘Towards elastic transactional cloud storage with range query support’, Proceedings of the VLDB Endowment (PVLDB), vol. 3, no. 1-2, pp. 506–514.

Vogels, W. 2009, ‘Eventually consistent’, Communications of the ACM , vol. 52, no. 1, pp. 40–44.

Voldemort 2009, ‘Project Voldemort - A distributed database’, Accessed 5 April 2011,

.

Wada, H., Fekete, A., Zhao, L., Lee, K. & Liu, A. 2011, ‘Data consistency properties and the trade-offs in commercial cloud storage: The consumers’ perspective’, in 5th Biennial Conference on Innovative Data Systems Research (CIDR), vol. 11, Asilomar, California, USA, pp. 134–143.

Weatherspoon, H. & Kubiatowicz, J. D. 2002, ‘Erasure coding vs. replication: A quantit- ative comparison’, in Peer-to-Peer Systems, Springer, pp. 328–337.

Weil, S. A., Brandt, S. A., Miller, E. L., Long, D. D. & Maltzahn, C. 2006, ‘Ceph: A scalable, high-performance distributed file system’, in Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), USENIX Asso- ciation, Seattle, WA, USA, pp. 307–320. 212 REFERENCES

Wilkinson, K., Sayers, C., Kuno, H. A., Reynolds, D. et al. 2003, ‘Efficient RDF Storage and Retrieval in Jena2.’, in Proceedings of the 1st International Workshop on Semantic Web and Databases (SWDB), vol. 3, Berlin, Germany, pp. 131–150.

Williams, S. 2000, The associative model of data, Lazy Software Ltd., United Kingdom.

You, G. w., Hwang, S. w. & Jain, N. 2011, ‘Scalable load balancing in cluster storage systems’, in ACM/IFIP/USENIX International Middleware Conference 2011 , Springer, Lisbon, Portugal, pp. 101–122.

Youseff, L., Butrico, M. & Da Silva, D. 2008, ‘Toward a unified ontology of cloud comput- ing’, in Grid Computing Environments Workshop (GCE’08), IEEE, pp. 1–10.

Yu, H., Gibbons, P. B. & Nath, S. 2006, ‘Availability of multi-object operations’, in Proceedings of the 3rd USENIX Symposium on Networked Systems Design and Imple- mentation (NSDI), USENIX Association, San Jose, CA, USA, pp. 211–224.

Zaman, S. & Grosu, D. 2011, ‘A distributed algorithm for the replica placement problem’, IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 22, no. 9, pp. 1455–1468.

Zhang, Q., Cheng, L. & Boutaba, R. 2010, ‘Cloud computing: State-of-the-art and research challenges’, Journal of Internet Services and Applications, vol. 1, no. 1, pp. 7–18.

Zhao, B. Y., Kubiatowicz, J. & Joseph, A. D. 2001, ‘Tapestry: An infrastructure for fault- tolerant wide-area location and routing’, Tech. rep., University of California at Berkeley, CA, USA.

Zhao, B. Y., Huang, L., Stribling, J., Rhea, S. C., Joseph, A. D. & Kubiatowicz, J. D. 2004, ‘Tapestry: A resilient global-scale overlay for service deployment’, IEEE Journal on Selected Areas in Communications, vol. 22, pp. 41–53. Appendix A

Proof of Lemmas

A.1 Proof of Lemma 1

Proof. For all integers $i \in [0, C)$ and $k \in [0, L)$, each $b[i]$ appears at $m_b[x_b][y_b]$, where
\[
x_b = (i + kC) \bmod L, \qquad y_b = \left\lfloor \frac{i + kC}{L} \right\rfloor .
\]

When $C < L$, i.e. $\frac{C}{L} < 1$, there exists at least one violation. For example, let $i = 0$. For $k_1 = 0$ and $k_2 = 1$, we get $y_b = \lfloor k_1 \frac{C}{L} \rfloor = \lfloor k_2 \frac{C}{L} \rfloor = 0$. That is, $b[0]$ appears twice, at $m_b[0][0]$ and $m_b[C][0]$, which are in the same column.

When $C \geq L$, let $C = L + t$, where $t \geq 0$. Then for any $k_1 < k_2$, i.e. $k_1 + 1 \leq k_2$, we have $y_b(k_2) > y_b(k_1)$, because
\[
y_b(k_2) = \left\lfloor \frac{i + k_2 C}{L} \right\rfloor
\geq \left\lfloor \frac{i + (k_1 + 1)(L + t)}{L} \right\rfloor
= \left\lfloor k_1 + 1 + \frac{i + k_1 t + t}{L} \right\rfloor
> \left\lfloor k_1 + \frac{i + k_1 t + t}{L} \right\rfloor
\geq \left\lfloor k_1 + \frac{i + k_1 t}{L} \right\rfloor
= \left\lfloor \frac{i + k_1 (L + t)}{L} \right\rfloor
= y_b(k_1).
\]
That is, $y_b(k_1) \neq y_b(k_2)$. Hence, no two replicas of $b[i]$ appear in the same column.

Similarly, for each $c[i]$, $y_c = C - 1 - \lfloor \frac{i + kC}{L} \rfloor = C - 1 - y_b$. For any $k_1 \neq k_2$, since $y_b(k_1) \neq y_b(k_2)$, we have $y_c(k_1) \neq y_c(k_2)$. Hence, no two replicas of $c[i]$ appear in the same column either.
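To make Order 2 concrete, the following minimal Python sketch (illustrative only; it is not part of the thesis implementation, and the helper names b_positions and column_collision are assumed) brute-forces the column of every replica of b[i] and confirms both the collision for a sample C < L and the absence of collisions for a sample C >= L.

\begin{verbatim}
# Illustrative brute-force check of Lemma 1 (a sketch, not the thesis code).
# Under Order 2, replica k of b[i] occupies row (i + k*C) mod L and
# column floor((i + k*C) / L).

def b_positions(i, C, L):
    """(row, column) cells occupied by the L replicas of b[i]."""
    return [((i + k * C) % L, (i + k * C) // L) for k in range(L)]

def column_collision(C, L):
    """True if some b[i] has two replicas in the same column."""
    for i in range(C):
        columns = [y for (_, y) in b_positions(i, C, L)]
        if len(columns) != len(set(columns)):
            return True
    return False

assert column_collision(C=4, L=7)       # C < L: b[0] repeats in column 0
assert not column_collision(C=7, L=4)   # C >= L: each column holds b[i] at most once
\end{verbatim}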

A.2 Proof of Lemma 2

Proof. Consider any two copysets $\{a[i_1], b[j_1], c[k_1]\}$ and $\{a[i_2], b[j_2], c[k_2]\}$. We use proof by contradiction. Assume they share more than one node, i.e. either $i_1 = i_2$ and $j_1 = j_2$, or $i_1 = i_2$ and $k_1 = k_2$.

According to the definition of Order 1, $a[i]$ is placed in the $i$-th column. Thus, if $i_1 = i_2$, then $a[i_1]$ and $a[i_2]$ are located in the same column $t = i_1 = i_2$, where $t$ is a constant. Since $a[i_1]$ and $b[j_1]$ are in the same copyset, $b[j_1]$ is in column $t_1 = t$; likewise, $b[j_2]$ appears in column $t_2 = t$, as $a[i_2]$ does.

According to Lemma 1, if $j_1 = j_2$, then $b[j_1]$ and $b[j_2]$ appear in two distinct columns, i.e. $t_1 \neq t_2$, which contradicts $t_1 = t_2 = t$. That is to say, $i_1 = i_2$ and $j_1 = j_2$ cannot both hold. Hence, no two copysets share the same $a[i]$ and $b[j]$.

Similarly, if $k_1 = k_2$, then $c[k_1]$ and $c[k_2]$ are located in two distinct columns $t_3 \neq t_4$, which contradicts $t_3 = t$ and $t_4 = t$. Hence, no two copysets share the same $a[i]$ and $c[k]$ either.
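Since Order 1 pins a[y] to every row of column y, Lemma 2 is equivalent to every column holding distinct b-entries and distinct c-entries. The sketch below (illustrative only; the function name is an assumption, not from the thesis) checks this directly for one configuration with C >= L and C, L co-prime.

\begin{verbatim}
# Illustrative check of Lemma 2 (a sketch, not the thesis code): no two
# copysets share both their a-node and their b-node (or c-node), i.e. every
# column of the placement matrix holds distinct b-entries and c-entries.

def columns_hold_distinct_entries(C, L):
    cols_b = [[] for _ in range(C)]
    cols_c = [[] for _ in range(C)]
    for i in range(C):
        for k in range(L):
            n = i + k * C
            cols_b[n // L].append(i)           # Order 2: column of b[i]'s k-th replica
            cols_c[C - 1 - n // L].append(i)   # Order 3: column of c[i]'s k-th replica
    return all(len(col) == len(set(col)) for col in cols_b + cols_c)

assert columns_hold_distinct_entries(C=7, L=6)   # C >= L and co-prime
\end{verbatim}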

A.3 Proof of Lemma 3

Proof. Given two integers $i \in [0, C)$ and $k \in [0, L)$.

According to the definition of Order 1, $a[i]$ is placed in the $i$-th column, $k$-th row. Therefore, $a[i]$ appears in each and every row.

According to Order 2, each $b[i]$ appears at $m_b[x_b][y_b]$, where $x_b = (i + kC) \bmod L$.

We use proof by contradiction. Presume there exist two replicas of $b[i]$, i.e. $k_1 \neq k_2$, such that $x_b = x_b(k_1) = x_b(k_2)$. Then there exist two integers $t_1$ and $t_2$ such that
\[
i + k_1 C = t_1 L + x_b, \qquad i + k_2 C = t_2 L + x_b. \tag{A.1}
\]

From Equation A.1, $(k_2 - k_1)C = (t_2 - t_1)L$, so $L$ divides $(k_2 - k_1)C$. Since $C$ and $L$ are co-prime, $L$ must divide $(k_2 - k_1)$. However, because every $k \in [0, L)$, letting $k_1 < k_2$ gives $0 < k_2 - k_1 < L$, so $L$ cannot divide $(k_2 - k_1)$; a contradiction. Hence, the presumption that there exists $k_1 \neq k_2$ such that $x_b(k_1) = x_b(k_2)$ is invalid. That is to say, no $b[i]$ appears more than once in any row of the matrix.

Moreover, since each $b[i]$ appears exactly $L$ times in $m_b$, which has exactly $L$ rows, and no row contains more than one replica of $b[i]$, the only possible layout is that each $b[i]$ appears once in each and every row of the matrix.

Similarly, according to Order 3, for each $c[j]$ we have $x_c = (j + kC) \bmod L$, which has the same form as $x_b$. The conclusion for $b[i]$ is therefore also valid for $c[j]$.
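The row-coverage property can likewise be checked by enumeration. The following sketch (illustrative only; the helper name is assumed, not from the thesis) fills the matrix m_b under Order 2 and asserts that every row contains each b[i] exactly once for a few co-prime pairs (C, L).

\begin{verbatim}
# Illustrative check of Lemma 3 (a sketch, not the thesis code): when C and L
# are co-prime, each b[i] appears exactly once in every row under Order 2
# (by the symmetry of x_c, the same holds for each c[j] under Order 3).
from math import gcd

def rows_cover_all_b_nodes(C, L):
    mb = [[None] * C for _ in range(L)]
    for i in range(C):
        for k in range(L):
            n = i + k * C
            mb[n % L][n // L] = i              # Order 2
    return all(sorted(row) == list(range(C)) for row in mb)

for C, L in [(7, 4), (9, 4), (11, 10)]:
    assert gcd(C, L) == 1 and rows_cover_all_b_nodes(C, L)
\end{verbatim}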

A.4 Proof of Lemma 4

This proof consists of two parts: i) $L$ is even; and ii) $L$ is odd. In both cases, we prove that, given $0 \leq i, j < C$, when $C$ is the smallest odd number greater than $L$, any combination of $(b[i], c[j])$ appears together in at most one copyset.

A.4.1 Proof for an Even L

Proof. When $L$ is an even number, let $L = 2t$ and $C = 2t + r$, where $t \geq 1$ and $r \geq 0$ are both integers.

According to Order 2, for all integers $k_b \in [0, L)$, each $b[i]$ appears at $m_b[x_b][y_b]$, where
\[
y_b = \left\lfloor \frac{i + k_b C}{L} \right\rfloor = \left\lfloor \frac{i + k_b(2t + r)}{2t} \right\rfloor = k_b + \left\lfloor \frac{i + k_b r}{2t} \right\rfloor
\]
and, since $\lfloor \frac{i + k_b C}{L} \rfloor L + x_b = i + k_b C$,
\[
x_b = (i + k_b C) \bmod L = i + k_b r - 2t \left\lfloor \frac{i + k_b r}{2t} \right\rfloor .
\]

Similarly, according to Order 3, for all integers $k_c \in [0, L)$, each $c[j]$ appears at $m_c[x_c][y_c]$, where
\[
y_c = C - 1 - \left\lfloor \frac{j + k_c C}{L} \right\rfloor = C - 1 - k_c - \left\lfloor \frac{j + k_c r}{2t} \right\rfloor, \qquad
x_c = (j + k_c C) \bmod L = j + k_c r - 2t \left\lfloor \frac{j + k_c r}{2t} \right\rfloor .
\]

If $(b[i], c[j])$ appear in the same copyset, then $x_b = x_c$ and $y_b = y_c$. Therefore:
\[
\begin{cases}
\left\lfloor \frac{i + k_b r}{2t} \right\rfloor - \left\lfloor \frac{j + k_c r}{2t} \right\rfloor = \frac{i + k_b r}{2t} - \frac{j + k_c r}{2t} \\[4pt]
\left\lfloor \frac{i + k_b r}{2t} \right\rfloor + \left\lfloor \frac{j + k_c r}{2t} \right\rfloor = 2t + r - 1 - (k_b + k_c)
\end{cases} \tag{A.2}
\]

From the first equation of A.2, $(i + k_b r)$ and $(j + k_c r)$ are congruent modulo $2t$. That is,
\[
i + k_b r \equiv j + k_c r \pmod{2t}. \tag{A.3}
\]

When $r = 0$, the simultaneous equations A.2 for $(k_b, k_c)$ become
\[
\begin{cases}
\left\lfloor \frac{i}{2t} \right\rfloor - \left\lfloor \frac{j}{2t} \right\rfloor = \frac{i}{2t} - \frac{j}{2t} \\[4pt]
\left\lfloor \frac{i}{2t} \right\rfloor + \left\lfloor \frac{j}{2t} \right\rfloor = 2t - 1 - (k_b + k_c)
\end{cases}
\]
which contains only one equation that actually constrains $(k_b, k_c)$. As a result, for certain $b[i]$ and $c[j]$, there are multiple valid solutions for $(k_b, k_c)$. For example, let $i = j = 0$; then there are $2t$ possible solutions:
\[
(k_b, k_c) \in \{(0,\, 2t - 1),\ (1,\, 2t - 2),\ \ldots,\ (2t - 1,\, 0)\}. \tag{A.4}
\]
Thus, $r \neq 0$.

When $r = 1$, i.e. $C = L + 1$, the simultaneous equations A.2 for $(k_b, k_c)$ become
\[
\begin{cases}
\left\lfloor \frac{i + k_b}{2t} \right\rfloor - \left\lfloor \frac{j + k_c}{2t} \right\rfloor = \frac{i + k_b}{2t} - \frac{j + k_c}{2t} \\[4pt]
\left\lfloor \frac{i + k_b}{2t} \right\rfloor + \left\lfloor \frac{j + k_c}{2t} \right\rfloor = 2t - (k_b + k_c)
\end{cases} \tag{A.5}
\]

We use proof by contradiction to show that the simultaneous equations A.5 have at most one solution for $(k_b, k_c)$, given any valid values of $i$, $j$ and $t$. Assume there are two solutions, $(k_{b1}, k_{c1})$ and $(k_{b2}, k_{c2})$, such that $k_{b1} \neq k_{b2}$ and $k_{c1} \neq k_{c2}$. Subtracting the instances of A.5 for the two solutions yields
\[
\left( \left\lfloor \tfrac{i + k_{b2}}{2t} \right\rfloor - \left\lfloor \tfrac{i + k_{b1}}{2t} \right\rfloor \right) - \left( \left\lfloor \tfrac{j + k_{c2}}{2t} \right\rfloor - \left\lfloor \tfrac{j + k_{c1}}{2t} \right\rfloor \right) = \frac{k_{b2} - k_{b1}}{2t} - \frac{k_{c2} - k_{c1}}{2t} \tag{A.6}
\]
\[
\left( \left\lfloor \tfrac{i + k_{b2}}{2t} \right\rfloor - \left\lfloor \tfrac{i + k_{b1}}{2t} \right\rfloor \right) + \left( \left\lfloor \tfrac{j + k_{c2}}{2t} \right\rfloor - \left\lfloor \tfrac{j + k_{c1}}{2t} \right\rfloor \right) = -(k_{b2} - k_{b1}) - (k_{c2} - k_{c1}) \tag{A.7}
\]

From Equation A.3 with $r = 1$, we have $i + k_{b1} \equiv j + k_{c1} \pmod{2t}$ and $i + k_{b2} \equiv j + k_{c2} \pmod{2t}$, hence $k_{b2} - k_{b1} \equiv k_{c2} - k_{c1} \pmod{2t}$, i.e. $(k_{b2} - k_{b1}) = (k_{c2} - k_{c1}) \pm n \cdot 2t$ for some integer $n \geq 0$.

Let $k_{b1} < k_{b2}$. Since every $k \in [0, L) = [0, 2t)$, we have $0 < k_{b2} - k_{b1} < 2t$ and $-2t < k_{c2} - k_{c1} < 2t$. Thus
\[
(k_{b2} - k_{b1}) = (k_{c2} - k_{c1}) + n \cdot 2t, \qquad n \in \{0, 1\}. \tag{A.8}
\]

When $n = 0$, i.e. $k_{b2} - k_{b1} = k_{c2} - k_{c1} > 0$, we have $k_{c1} < k_{c2}$. Then
\[
\left( \left\lfloor \tfrac{i + k_{b2}}{2t} \right\rfloor - \left\lfloor \tfrac{i + k_{b1}}{2t} \right\rfloor \right) + \left( \left\lfloor \tfrac{j + k_{c2}}{2t} \right\rfloor - \left\lfloor \tfrac{j + k_{c1}}{2t} \right\rfloor \right) \geq 0
\quad \text{while} \quad -(k_{b2} - k_{b1}) - (k_{c2} - k_{c1}) < 0,
\]
which contradicts Equation A.7. Thus, $n \neq 0$.

When $n = 1$ (and $k_{b1} < k_{b2}$), we have $-2t < k_{c2} - k_{c1} < 0$ and $(k_{b2} - k_{b1}) = (k_{c2} - k_{c1}) + 2t$, so $\frac{k_{b2} - k_{b1}}{2t} - \frac{k_{c2} - k_{c1}}{2t} = 1$. From Equation A.6,
\[
\left( \left\lfloor \tfrac{i + k_{b2}}{2t} \right\rfloor - \left\lfloor \tfrac{i + k_{b1}}{2t} \right\rfloor \right) - \left( \left\lfloor \tfrac{j + k_{c2}}{2t} \right\rfloor - \left\lfloor \tfrac{j + k_{c1}}{2t} \right\rfloor \right) = 1.
\]

Moreover, since $k_b < L$, let $k_b = L - d = 2t - d$ with $d > 0$. Since $i \leq C - 1 = 2t$, we have $\lfloor \frac{i + k_b}{2t} \rfloor \leq \lfloor \frac{2t + 2t - d}{2t} \rfloor = 2 + \lfloor \frac{-d}{2t} \rfloor < 2$. Thus $\lfloor \frac{i + k_b}{2t} \rfloor \in \{0, 1\}$, and similarly $\lfloor \frac{j + k_c}{2t} \rfloor \in \{0, 1\}$. Therefore each of the two bracketed differences above lies in $\{-1, 0, 1\}$, and the only ways their difference can equal $1$ are
\[
\left\lfloor \tfrac{i + k_{b2}}{2t} \right\rfloor - \left\lfloor \tfrac{i + k_{b1}}{2t} \right\rfloor = 0, \quad \left\lfloor \tfrac{j + k_{c2}}{2t} \right\rfloor - \left\lfloor \tfrac{j + k_{c1}}{2t} \right\rfloor = -1,
\qquad \text{or} \qquad
\left\lfloor \tfrac{i + k_{b2}}{2t} \right\rfloor - \left\lfloor \tfrac{i + k_{b1}}{2t} \right\rfloor = 1, \quad \left\lfloor \tfrac{j + k_{c2}}{2t} \right\rfloor - \left\lfloor \tfrac{j + k_{c1}}{2t} \right\rfloor = 0.
\]
That is, their sum equals $\pm 1$. The simultaneous equations A.6 and A.7 then become
\[
\begin{cases}
(k_{b2} - k_{b1}) - (k_{c2} - k_{c1}) = 2t \\
-(k_{b2} - k_{b1}) - (k_{c2} - k_{c1}) = \pm 1
\end{cases}
\implies
\begin{cases}
k_{b2} - k_{b1} = t \mp \tfrac{1}{2} \\
k_{c2} - k_{c1} = -t \mp \tfrac{1}{2}
\end{cases}
\]
However, since all $k$ are integers, $(k_{b2} - k_{b1})$ and $(k_{c2} - k_{c1})$ cannot be non-integers. That is to say, when $r = 1$, there do not exist two distinct solutions $(k_{b1}, k_{c1})$ and $(k_{b2}, k_{c2})$ for any two nodes $b[i]$ and $c[j]$. Hence, when $L$ is an even number and $C = L + 1$, any combination of $(b[i], c[j])$ appears together in at most one copyset.
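The two cases above (r = 0 versus r = 1) can also be observed by enumeration. The sketch below (illustrative only; max_pair_cooccurrence is an assumed helper name, not from the thesis) counts, for every pair (b[i], c[j]), how many cells the pair shares: at most one when C = L + 1, but several when C = L.

\begin{verbatim}
# Illustrative check of Lemma 4 for an even L (a sketch, not the thesis code):
# with L even and C = L + 1, every pair (b[i], c[j]) shares at most one
# copyset, whereas C = L (the r = 0 case above) lets some pairs share many.
from collections import Counter

def max_pair_cooccurrence(C, L):
    cells = {}
    for i in range(C):
        for k in range(L):
            n = i + k * C
            cells.setdefault((n % L, n // L), [None, None])[0] = i          # b, Order 2
            cells.setdefault((n % L, C - 1 - n // L), [None, None])[1] = i  # c, Order 3
    return max(Counter((b, c) for b, c in cells.values()).values())

L = 8                                              # an even row count
assert max_pair_cooccurrence(C=L + 1, L=L) == 1    # r = 1: at most one shared copyset
assert max_pair_cooccurrence(C=L, L=L) > 1         # r = 0: e.g. (b[0], c[0]) share many
\end{verbatim}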

A.4.2 Proof for an Odd L

Proof. When $L$ is an odd number, let $L = 2t - 1$ and $C = 2t + r$, where $t \geq 1$ and $r \geq -1$ are both integers.

Similar to the proof in Appendix A.4.1, for all integers $k_b \in [0, L)$, each $b[i]$ appears at $m_b[x_b][y_b]$, where
\[
y_b = \left\lfloor \frac{i + k_b C}{L} \right\rfloor = \left\lfloor \frac{i + k_b(2t + r)}{2t - 1} \right\rfloor = k_b + \left\lfloor \frac{i + k_b(r + 1)}{2t - 1} \right\rfloor
\]
and, since $\lfloor \frac{i + k_b C}{L} \rfloor L + x_b = i + k_b C$,
\[
x_b = (i + k_b C) \bmod L = i + k_b(r + 1) - (2t - 1) \left\lfloor \frac{i + k_b(r + 1)}{2t - 1} \right\rfloor .
\]

Similarly, for all integers $k_c \in [0, L)$, each $c[j]$ appears at $m_c[x_c][y_c]$, where
\[
y_c = C - 1 - \left\lfloor \frac{j + k_c C}{L} \right\rfloor = C - 1 - k_c - \left\lfloor \frac{j + k_c(r + 1)}{2t - 1} \right\rfloor, \qquad
x_c = j + k_c(r + 1) - (2t - 1) \left\lfloor \frac{j + k_c(r + 1)}{2t - 1} \right\rfloor .
\]

If $(b[i], c[j])$ appear in the same copyset, then $x_b = x_c$ and $y_b = y_c$. Hence:
\[
\begin{cases}
\left\lfloor \frac{i + k_b(r+1)}{2t-1} \right\rfloor - \left\lfloor \frac{j + k_c(r+1)}{2t-1} \right\rfloor = \frac{i + k_b(r+1)}{2t-1} - \frac{j + k_c(r+1)}{2t-1} \\[4pt]
\left\lfloor \frac{i + k_b(r+1)}{2t-1} \right\rfloor + \left\lfloor \frac{j + k_c(r+1)}{2t-1} \right\rfloor = 2t + r - 1 - (k_b + k_c)
\end{cases} \tag{A.9}
\]

When $r = -1$, i.e. $C = L$, the simultaneous equations A.9 for $(k_b, k_c)$ become
\[
\begin{cases}
\left\lfloor \frac{i}{2t-1} \right\rfloor - \left\lfloor \frac{j}{2t-1} \right\rfloor = \frac{i}{2t-1} - \frac{j}{2t-1} \\[4pt]
\left\lfloor \frac{i}{2t-1} \right\rfloor + \left\lfloor \frac{j}{2t-1} \right\rfloor = 2t - 2 - (k_b + k_c)
\end{cases}
\]
which contains only one equation that actually constrains $(k_b, k_c)$. As a result, for certain $b[i]$ and $c[j]$, there are multiple valid solutions for $(k_b, k_c)$. For example, let $i = j = 0$; then there are $2t - 1$ possible solutions:
\[
(k_b, k_c) \in \{(0,\, 2t - 2),\ (1,\, 2t - 3),\ \ldots,\ (2t - 2,\, 0)\}. \tag{A.10}
\]
Thus, $r \neq -1$.

When $r = 0$, i.e. $C = L + 1$, the simultaneous equations A.9 for $(k_b, k_c)$ become
\[
\begin{cases}
\left\lfloor \frac{i + k_b}{2t-1} \right\rfloor - \left\lfloor \frac{j + k_c}{2t-1} \right\rfloor = \frac{i + k_b}{2t-1} - \frac{j + k_c}{2t-1} \\[4pt]
\left\lfloor \frac{i + k_b}{2t-1} \right\rfloor + \left\lfloor \frac{j + k_c}{2t-1} \right\rfloor = 2t - 1 - (k_b + k_c)
\end{cases} \tag{A.11}
\]
There exist at least two distinct solutions of $(k_b, k_c)$ for certain combinations of $(i, j)$. For example, when $i = 0$ and $j = 1$, Equation A.11 has the two distinct solutions
\[
(k_b, k_c) = (0,\, 2t - 2) \quad \text{and} \quad (k_b, k_c) = (t,\, t - 1). \tag{A.12}
\]
Thus, $r \neq 0$.

When $r = 1$, i.e. $C = L + 2$, the simultaneous equations A.9 for $(k_b, k_c)$ become
\[
\begin{cases}
\left\lfloor \frac{i + 2k_b}{2t-1} \right\rfloor - \left\lfloor \frac{j + 2k_c}{2t-1} \right\rfloor = \frac{i + 2k_b}{2t-1} - \frac{j + 2k_c}{2t-1} \\[4pt]
\left\lfloor \frac{i + 2k_b}{2t-1} \right\rfloor + \left\lfloor \frac{j + 2k_c}{2t-1} \right\rfloor = 2t - (k_b + k_c)
\end{cases} \tag{A.13}
\]

As before, we use proof by contradiction to show that the simultaneous equations A.13 have at most one solution for $(k_b, k_c)$, given any valid values of $i$, $j$ and $t$. Assume there are two solutions, $(k_{b1}, k_{c1})$ and $(k_{b2}, k_{c2})$, such that $k_{b1} \neq k_{b2}$ and $k_{c1} \neq k_{c2}$. Subtracting the instances of A.13 for the two solutions yields
\[
\left( \left\lfloor \tfrac{i + 2k_{b2}}{2t-1} \right\rfloor - \left\lfloor \tfrac{i + 2k_{b1}}{2t-1} \right\rfloor \right) - \left( \left\lfloor \tfrac{j + 2k_{c2}}{2t-1} \right\rfloor - \left\lfloor \tfrac{j + 2k_{c1}}{2t-1} \right\rfloor \right) = \frac{2(k_{b2} - k_{b1})}{2t-1} - \frac{2(k_{c2} - k_{c1})}{2t-1} \tag{A.14}
\]
\[
\left( \left\lfloor \tfrac{i + 2k_{b2}}{2t-1} \right\rfloor - \left\lfloor \tfrac{i + 2k_{b1}}{2t-1} \right\rfloor \right) + \left( \left\lfloor \tfrac{j + 2k_{c2}}{2t-1} \right\rfloor - \left\lfloor \tfrac{j + 2k_{c1}}{2t-1} \right\rfloor \right) = -(k_{b2} - k_{b1}) - (k_{c2} - k_{c1}) \tag{A.15}
\]

Similarly, when $r = 1$, we have $i + 2k_{b1} \equiv j + 2k_{c1} \pmod{2t-1}$ and $i + 2k_{b2} \equiv j + 2k_{c2} \pmod{2t-1}$, hence $2(k_{b2} - k_{b1}) \equiv 2(k_{c2} - k_{c1}) \pmod{2t-1}$, i.e. $2(k_{b2} - k_{b1}) = 2(k_{c2} - k_{c1}) \pm n(2t - 1)$ for some integer $n \geq 0$.

Let $k_{b1} < k_{b2}$. Since every $k \in [0, 2t - 1)$, we have $0 < 2(k_{b2} - k_{b1}) < 2(2t - 1)$ and $-2(2t - 1) < 2(k_{c2} - k_{c1}) < 2(2t - 1)$. Moreover, since both $2(k_{b2} - k_{b1})$ and $2(k_{c2} - k_{c1})$ are even while $(2t - 1)$ is odd, $n$ must be even. Therefore
\[
2(k_{b2} - k_{b1}) = 2(k_{c2} - k_{c1}) + n(2t - 1), \qquad n \in \{0, 2\}. \tag{A.16}
\]

When $n = 0$, i.e. $k_{b2} - k_{b1} = k_{c2} - k_{c1} > 0$, we have $k_{c1} < k_{c2}$. Then
\[
\left( \left\lfloor \tfrac{i + 2k_{b2}}{2t-1} \right\rfloor - \left\lfloor \tfrac{i + 2k_{b1}}{2t-1} \right\rfloor \right) + \left( \left\lfloor \tfrac{j + 2k_{c2}}{2t-1} \right\rfloor - \left\lfloor \tfrac{j + 2k_{c1}}{2t-1} \right\rfloor \right) \geq 0
\quad \text{while} \quad -(k_{b2} - k_{b1}) - (k_{c2} - k_{c1}) < 0,
\]
which contradicts Equation A.15. Thus, $n \neq 0$.

When $n = 2$, we have $-(2t - 1) < k_{c2} - k_{c1} < 0$ and $(k_{b2} - k_{b1}) = (k_{c2} - k_{c1}) + (2t - 1)$, so $\frac{k_{b2} - k_{b1}}{2t-1} - \frac{k_{c2} - k_{c1}}{2t-1} = 1$. From Equation A.14,
\[
\left( \left\lfloor \tfrac{i + 2k_{b2}}{2t-1} \right\rfloor - \left\lfloor \tfrac{i + 2k_{b1}}{2t-1} \right\rfloor \right) - \left( \left\lfloor \tfrac{j + 2k_{c2}}{2t-1} \right\rfloor - \left\lfloor \tfrac{j + 2k_{c1}}{2t-1} \right\rfloor \right) = 2 \left( \frac{k_{b2} - k_{b1}}{2t-1} - \frac{k_{c2} - k_{c1}}{2t-1} \right) = 2.
\]

Moreover, since $k_b < L$, let $k_b = L - s = 2t - 1 - s$ with $s \geq 1$. Since $i \leq C - 1 = 2t$, we have $\lfloor \frac{i + 2k_b}{2t-1} \rfloor \leq \lfloor \frac{2t + 4t - 2 - 2s}{2t-1} \rfloor = 3 + \lfloor \frac{-(2s-1)}{2t-1} \rfloor < 3$. Thus $\lfloor \frac{i + 2k_b}{2t-1} \rfloor \in \{0, 1, 2\}$, and similarly $\lfloor \frac{j + 2k_c}{2t-1} \rfloor \in \{0, 1, 2\}$. Therefore each of the two bracketed differences above lies in $\{-2, -1, 0, 1, 2\}$, and (recalling $k_{b2} > k_{b1}$ and $k_{c2} < k_{c1}$) the only ways their difference can equal $2$ are
\[
(0, -2), \quad (1, -1), \quad \text{or} \quad (2, 0),
\]
where each pair denotes $\left( \left\lfloor \tfrac{i + 2k_{b2}}{2t-1} \right\rfloor - \left\lfloor \tfrac{i + 2k_{b1}}{2t-1} \right\rfloor,\ \left\lfloor \tfrac{j + 2k_{c2}}{2t-1} \right\rfloor - \left\lfloor \tfrac{j + 2k_{c1}}{2t-1} \right\rfloor \right)$. That is, their sum lies in $\{-2, 0, 2\}$; write it as $2n'$ with $n' \in \{-1, 0, 1\}$. The simultaneous equations A.14 and A.15 then become
\[
\begin{cases}
(k_{b2} - k_{b1}) - (k_{c2} - k_{c1}) = 2t - 1 \\
-(k_{b2} - k_{b1}) - (k_{c2} - k_{c1}) = 2n'
\end{cases}
\implies
\begin{cases}
k_{b2} - k_{b1} = t - n' - \tfrac{1}{2} \\
k_{c2} - k_{c1} = -t - n' + \tfrac{1}{2}
\end{cases}
\]
However, since all $k$ are integers, $(k_{b2} - k_{b1})$ and $(k_{c2} - k_{c1})$ cannot be non-integers. That is to say, when $r = 1$, there do not exist two distinct solutions $(k_{b1}, k_{c1})$ and $(k_{b2}, k_{c2})$ for any two nodes $b[i]$ and $c[j]$. Hence, when $L$ is an odd number and $C = L + 2$, any combination of $(b[i], c[j])$ appears together in at most one copyset.
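The following sketch (illustrative only; the helpers cell_of_b and cell_of_c are assumed names, not from the thesis) reproduces the r = 0 counterexample of Equation A.12 for t = 5, and then confirms by enumeration that C = L + 2 leaves every pair (b[i], c[j]) with at most one shared copyset.

\begin{verbatim}
# Illustrative check of Lemma 4 for an odd L (a sketch, not the thesis code).
# With L = 2t - 1 and C = L + 1 (r = 0), the pair (b[0], c[1]) shares two
# copysets, matching Equation A.12; with C = L + 2 (r = 1), a brute-force
# scan finds no pair sharing more than one copyset.
from collections import Counter

def cell_of_b(i, kb, C, L):
    n = i + kb * C
    return (n % L, n // L)                 # Order 2

def cell_of_c(j, kc, C, L):
    n = j + kc * C
    return (n % L, C - 1 - n // L)         # Order 3

t = 5
L = 2 * t - 1                              # an odd row count

C = 2 * t                                  # C = L + 1: the two solutions of A.12
assert cell_of_b(0, 0, C, L) == cell_of_c(1, 2 * t - 2, C, L)
assert cell_of_b(0, t, C, L) == cell_of_c(1, t - 1, C, L)

C = 2 * t + 1                              # C = L + 2: at most one shared copyset
cells = {}
for i in range(C):
    for k in range(L):
        cells.setdefault(cell_of_b(i, k, C, L), [None, None])[0] = i
        cells.setdefault(cell_of_c(i, k, C, L), [None, None])[1] = i
assert max(Counter((b, c) for b, c in cells.values()).values()) == 1
\end{verbatim}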

A.5 Proof of Lemma 5

Proof. Consider any two copysets $(a[i_1], b[j_1], c[k_1])$ and $(a[i_2], b[j_2], c[k_2])$. According to Lemma 2, the possible cases are:
\[
\{i_1 \neq i_2,\ j_1 \neq j_2,\ k_1 \neq k_2\}, \quad
\{i_1 = i_2,\ j_1 \neq j_2,\ k_1 \neq k_2\}, \quad
\{i_1 \neq i_2,\ j_1 = j_2\}, \quad \text{or} \quad
\{i_1 \neq i_2,\ k_1 = k_2\}.
\]
Moreover, according to Lemma 4, if $j_1 = j_2$ then $k_1 \neq k_2$; conversely, if $k_1 = k_2$ then $j_1 \neq j_2$. Therefore, the cases above become:
\[
\{i_1 \neq i_2,\ j_1 \neq j_2,\ k_1 \neq k_2\}, \quad
\{i_1 = i_2,\ j_1 \neq j_2,\ k_1 \neq k_2\}, \quad
\{i_1 \neq i_2,\ j_1 = j_2,\ k_1 \neq k_2\}, \quad \text{or} \quad
\{i_1 \neq i_2,\ j_1 \neq j_2,\ k_1 = k_2\}.
\]
As can be seen, in each case the two copysets have at most one node in common. That is to say, any two copysets share at most one common node.
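Lemma 5 can also be verified exhaustively for a small configuration: the sketch below (illustrative only; all_copysets is an assumed helper name, not from the thesis) forms the copyset of every cell and asserts that any two of them intersect in at most one node.

\begin{verbatim}
# Illustrative check of Lemma 5 (a sketch, not the thesis code): building the
# copyset of every cell (x, y) as {a-node, b-node, c-node}, any two copysets
# intersect in at most one node.

def all_copysets(C, L):
    mb = [[None] * C for _ in range(L)]
    mc = [[None] * C for _ in range(L)]
    for i in range(C):
        for k in range(L):
            n = i + k * C
            mb[n % L][n // L] = i              # Order 2
            mc[n % L][C - 1 - n // L] = i      # Order 3
    return [{('a', y), ('b', mb[x][y]), ('c', mc[x][y])}
            for x in range(L) for y in range(C)]

C, L = 9, 8                # C is the smallest odd number greater than L
sets = all_copysets(C, L)
for p in range(len(sets)):
    for q in range(p + 1, len(sets)):
        assert len(sets[p] & sets[q]) <= 1
\end{verbatim}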

A.6 Proof of Lemma 6

Proof. Since $C$ is the smallest odd number greater than $L$, for some integer $t > 0$ we have either $L = 2t - 1$ and $C = 2t + 1$, or $L = 2t$ and $C = 2t + 1$. In either case, $L$ and $C$ are co-prime. According to Lemma 3, every node appears once in every row of the matrix, which has $L$ rows. Since nodes from different rows cannot form a copyset, each node belongs to exactly $L$ distinct copysets.

For each copyset that node $n_i$ belongs to, there are $R - 1$ other distinct nodes. Moreover, according to Lemma 5, any two of these copysets share at most one common node, and that shared node can only be $n_i$ itself. Therefore, these copysets contain $L(R - 1)$ distinct nodes other than $n_i$. Since $S = L(R - 1)$, each node shares copysets with $S$ other distinct nodes.
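Putting the lemmas together, the following sketch (illustrative only; copyset_membership is an assumed helper name, not from the thesis) builds all copysets for several values of L and asserts that every node belongs to exactly L copysets and shares a copyset with exactly S = L(R - 1) other nodes, as stated by Lemma 6.

\begin{verbatim}
# Illustrative check of Lemma 6 (a sketch, not the thesis code): with R = 3
# and C the smallest odd number greater than L, every node belongs to exactly
# L copysets and shares a copyset with exactly S = L * (R - 1) other nodes.
from collections import defaultdict

def copyset_membership(C, L):
    mb = [[None] * C for _ in range(L)]
    mc = [[None] * C for _ in range(L)]
    for i in range(C):
        for k in range(L):
            n = i + k * C
            mb[n % L][n // L] = i              # Order 2
            mc[n % L][C - 1 - n // L] = i      # Order 3
    membership = defaultdict(list)             # node -> list of its copysets
    for x in range(L):
        for y in range(C):
            copyset = {('a', y), ('b', mb[x][y]), ('c', mc[x][y])}
            for node in copyset:
                membership[node].append(copyset)
    return membership

R = 3
for L in (6, 7, 8, 9):
    C = L + 1 if L % 2 == 0 else L + 2         # smallest odd number greater than L
    for node, copysets in copyset_membership(C, L).items():
        assert len(copysets) == L                          # L copysets per node
        peers = set().union(*copysets) - {node}
        assert len(peers) == L * (R - 1)                   # scatter width S
\end{verbatim}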