

The Definitive Guide to Backup and Recovery for Cassandra

Table of Contents

Executive Summary
1. Cassandra Technology Overview
2. The Need for Backup & Recovery for Cassandra
3. Backup and Recovery Requirements
4. Existing Practices for Backup and Recovery
5. RecoverX Overview
6. Compare and Contrast Cassandra Backup & Recovery Solutions
7. Fortune 100 Retailer Customer Success Story

Executive Summary

In today’s era of big data, enterprise applications create a large volume of data that may be structured, semi-structured, or unstructured in nature. Additionally, application development cycles are much shorter and application availability is a critical requirement. Given these requirements, enterprises are forced to look beyond traditional relational databases to onboard next-generation applications (on IaaS or cloud-based PaaS). NoSQL databases such as Apache Cassandra are now being adopted and evaluated by enterprises for these applications, including eCommerce, content management, and more.

Part I: Cassandra Technology Overview

Apache Cassandra™ is an open source, distributed, and decentralized storage system for managing large amounts of structured data across many commodity servers, while providing highly available service with no single point of failure. Apache Cassandra offers capabilities including continuous availability, linear-scale performance, operational simplicity, and easy data distribution across multiple data centers and cloud availability zones.

Key features of Cassandra include the following:

• Elastic scalability: Cassandra is highly scalable; more hardware can be added to accommodate more customers and additional data.

• Always on architecture: Cassandra has no single point of failure and is continuously available for business-critical applications.

• Fast linear-scale performance: Cassandra is linearly scalable, i.e., throughput increases as nodes are added to the cluster, so it maintains quick response times as load grows.

• Flexible data storage: Cassandra accommodates all possible data formats including structured, semi-structured, and unstructured. It can dynamically accommodate changes to data structures.

• Easy data distribution: Cassandra provides the flexibility to distribute data by replicating data across multiple data centers.

• Transaction support: Cassandra provides atomicity, isolation, and durability for row-level writes, along with tunable consistency and lightweight transactions, rather than full multi-row ACID transactions.

• Fast writes: Cassandra was designed to run on cheap commodity hardware. It performs fast writes and can store hundreds of terabytes of data, without sacrificing read efficiency.

Apache Cassandra’s architecture is responsible for its ability to scale, perform, and offer continuous uptime. All nodes play an identical role and communicate with each other as peers. Apache Cassandra’s built-for-scale architecture is capable of handling large amounts of data and thousands of concurrent users or operations per second — even across multiple data centers — as easily as it can manage much smaller amounts of data and user traffic. Apache Cassandra’s architecture has no single point of failure and is therefore capable of offering true continuous availability and uptime — simply add new nodes to an existing cluster without having to take it down.
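To make the multi-data-center distribution described above concrete, here is a minimal sketch that creates a keyspace replicated across two data centers using the DataStax Python driver. The contact point, keyspace name, and data center names (dc_east, dc_west) are illustrative assumptions, not values from this guide.

```python
from cassandra.cluster import Cluster

# Hypothetical contact point; replace with a node in your cluster.
cluster = Cluster(["10.0.0.1"])
session = cluster.connect()

# Replicate the keyspace three ways in each of two data centers so that
# either data center can serve reads and writes on its own.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS product_catalog
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'dc_east': 3,
        'dc_west': 3
    }
""")
```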

In Cassandra, one or more nodes in a cluster act as replicas for a given piece of data. When a read contacts multiple replicas and detects that some of them hold an out-of-date value, Cassandra returns the most recent value to the client. It then performs a read repair in the background to update the stale replicas.
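As an illustration of this behavior, the sketch below issues a read at QUORUM consistency with the DataStax Python driver; a quorum read is one way to ensure multiple replicas are compared so that stale copies get repaired. The contact point, keyspace, table, and column names are assumptions for the example only.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Hypothetical contact point and keyspace.
cluster = Cluster(["10.0.0.1"])
session = cluster.connect("product_catalog")

# QUORUM requires a majority of replicas to answer. If the replicas disagree,
# the coordinator returns the newest value and repairs the stale replicas.
query = SimpleStatement(
    "SELECT price FROM items WHERE item_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
row = session.execute(query, ("sku-123",)).one()
print(row.price if row else "item not found")
```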

Part II: The Need for Backup & Recovery for Cassandra

As organizations increasingly rely upon NoSQL databases for their business-critical applications, and as these same organizations adopt a cloud-first strategy, the need for a comprehensive backup and recovery strategy has never been more critical. In this section, we will explore the need for a complete data protection strategy.

One of the advantages of Cassandra is that it provides scalability and high availability without compromising performance. Put simply, replication creates an alternative copy of the original data that changes with the original in real time. So while native replication is good for protecting against media and network failures by replicating data across nodes, it can be a serious detriment in the case of data corruption or loss, whereby the corrupted data set gets replicated, exacerbating an already bad situation.

Furthermore, the most common cause of data corruption and loss today is human error, including operator error, fat-finger mistakes, and accidental deletions. More recently, malicious cyber threats such as malware and ransomware have proliferated. To protect against these threats, a comprehensive backup and recovery strategy is required. Unlike a replica, a backup is a copy of data taken at a certain point in time that does not change afterward. A backup copy enables an administrator to restore data to a point in the past, typically a time before the data loss or ransomware attack occurred.

In summary, backup plays a critical role in an organization’s overall data protection strategy. In the next section, we’ll discuss in more detail the technical requirements for backup and recovery of Cassandra databases.

Part III: Backup and Recovery Requirements for Cassandra

This section introduces the key requirements for protecting data that resides in Apache Cassandra, whether deployed on-premises, on a private cloud with an as-a-service model, or in a public cloud such as Amazon AWS or Google Cloud Platform.

Requirement #1: Online Cluster-Consistent Backups

One of the key requirements of next-generation applications deployed on Apache Cassandra is their always-on nature. This means that quiescing the database to take backups is not feasible; moreover, the backup operation should not impact the performance of the application. As the application scales, the underlying Apache Cassandra cluster also needs to scale out across additional nodes. A backup solution must therefore provide a consistent backup copy across all nodes without disrupting database or application performance during backup operations.

Requirement #2: Flexible Backup Options

Depending on the application, data may have different change rates and patterns. For example, in a product catalog, certain items may be refreshed every day (fast-selling goods), while others may have a longer shelf life (premium items). Based on the application requirements, some collections may need to be backed up every hour, while others may be backed up daily. Providing the flexibility to schedule backups at any interval and at collection-level granularity is another requirement that we have heard from customers who are using Apache Cassandra. More importantly, these backups should always be stored on secondary storage in native formats to avoid vendor lock-in.
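As a minimal sketch of what such flexible scheduling could look like, the example below drives per-table backup intervals from a simple policy map. The table names, intervals, and the backup_table placeholder are hypothetical; a real implementation would call into whatever backup mechanism is actually in use.

```python
import sched
import time

# Hypothetical policy: table name -> backup interval in seconds.
# Fast-changing tables are protected hourly; slower ones daily.
BACKUP_POLICY = {
    "catalog.fast_selling_items": 60 * 60,    # hourly
    "catalog.premium_items": 24 * 60 * 60,    # daily
}

scheduler = sched.scheduler(time.time, time.sleep)


def backup_table(table, interval):
    # Placeholder for the real backup step (e.g., snapshot the table and
    # copy the files to secondary storage in native SSTable format).
    print(f"backing up {table}")
    # Re-arm the timer so the table keeps getting backed up at its interval.
    scheduler.enter(interval, 1, backup_table, (table, interval))


for table, interval in BACKUP_POLICY.items():
    scheduler.enter(interval, 1, backup_table, (table, interval))

scheduler.run()
```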

Requirement #3: Handling Failure

Failures are the norm in distributed systems. The backup solution should be resilient to database process failures, node failures, network failures, and even logical corruption of data during backup and recovery operations. Finally, the backup solution should be able to handle the loss of cluster metadata, such as the schema and topology information Cassandra keeps in its system keyspaces.

Part IV: Existing Solutions in the Marketplace

1. Traditional Backup and Recovery Solutions.

Traditional solutions were architected for IT applications deployed on traditional scale-up, static resources of compute and storage. They were designed for scale-up application environments and built to operate within the walls of on-premises infrastructure.

Media-server-based legacy backup architectures have no place in the cloud, and likewise have no place protecting next-generation scale-out data.

Legacy backup solutions from long-established backup vendors (e.g., Veritas, EMC/Legato, CommVault) are all based on the same legacy architecture: single-vendor, end-to-end architectures that centralize both control of the backup process (the “control plane”) and the movement and storage of the backed-up data (the “data plane”). In these legacy backup solutions, “media servers” act as the consolidated control plane and data plane.

Each backup vendor’s unique client “agent” software, installed on each host, used database-specific interfaces to invoke and process backups of structured data (e.g., Oracle RMAN). Likewise, client agents installed on each file server handled backing up unstructured files, or used APIs on each NAS (an appliance-based, dedicated file server).

With an average architecture age of 20 years, legacy backup and recovery products from vendors including Veritas, Dell EMC, and CommVault were all designed before the advent of the cloud and modern applications, and are not architected to support the new, modern IT stack.

2. Manual Scripted Solutions

Manual solutions leverage the native Apache Cassandra snapshot utility and scripts to transfer data to secondary storage. The scripts are customized for each Apache Cassandra cluster and require significant operational effort to scale or adapt to any topology changes. Further, these scripts are not resilient to failure scenarios, e.g., the failure of a node (on the source cluster or the secondary storage) or intermittent network issues. Finally, recovery is a manual process; it is time consuming, results in very high application downtime, and carries data loss risk due to any bugs in the scripts. Overall, these solutions work when the Apache Cassandra environment is small and some data loss may be permitted by the application.
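For context, a minimal sketch of such a manual script is shown below: it takes a per-node snapshot with nodetool and copies the resulting SSTable directories to object storage. The keyspace name, data directory, and bucket are assumptions for illustration, and the sketch deliberately omits the failure handling, topology awareness, and recovery orchestration discussed above.

```python
import glob
import subprocess
from datetime import datetime, timezone

KEYSPACE = "product_catalog"           # hypothetical keyspace to protect
DATA_DIR = "/var/lib/cassandra/data"   # default data directory; varies per install
BUCKET = "gs://example-backup-bucket"  # hypothetical secondary storage location


def snapshot_and_upload():
    tag = datetime.now(timezone.utc).strftime("backup-%Y%m%dT%H%M%S")

    # Take a point-in-time snapshot on this node (hard links to the SSTables).
    subprocess.run(["nodetool", "snapshot", "-t", tag, KEYSPACE], check=True)

    # Each table gets its own snapshots/<tag> directory; copy them to
    # secondary storage in their native SSTable format.
    for snap_dir in glob.glob(f"{DATA_DIR}/{KEYSPACE}/*/snapshots/{tag}"):
        table_dir = snap_dir.split("/")[-3]
        subprocess.run(
            ["gsutil", "-m", "rsync", "-r", snap_dir, f"{BUCKET}/{tag}/{table_dir}"],
            check=True,
        )

    # Remove the local snapshot once the copy completes.
    subprocess.run(["nodetool", "clearsnapshot", "-t", tag, KEYSPACE], check=True)


if __name__ == "__main__":
    snapshot_and_upload()
```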

3. Backup and Recovery Via Replication

One of the longest running discussion threads in the backup world is explaining why “replication is not backup.”

In the early days of backup, applications were built predominantly on relational databases and hosted on-premises in customers’ data centers. Users thought they were protected by creating copies of production data on disk, replicated by disk arrays from a primary data site to a D/R site. If the primary site failed or went offline for whatever reason, production could resiliently continue at the D/R site. But replication also quickly propagates data loss: whatever the cause, if data becomes corrupted it gets replicated across multiple sites.

Whether it was before or after they suffered an unrecoverable data loss, customers eventually understood the reality that array-based replication enabled operational resiliency but was not “backup.” Backup solutions were required to deliver point-in-time, application-consistent versions of data to enable rapid recovery in the event of data loss or corruption.

Next-generation scale-out databases do employ multi-node replication to guarantee operational resiliency. Even with the failure of one or more nodes in an Apache Cassandra cluster, transaction processing and application data access continue uninterrupted. But next-generation databases suffer the same inherent limitations of replication — accidental deletion of one or more data tables, fat-finger errors, and ransomware attacks can all result in unrecoverable data loss.

In fact, the clustered replication that makes next generation databases so resilient also makes them even more reliant upon backup to protect against data loss, corruption, and ransomware attack. But, traditional backup solutions that depend on silos of media servers and backup appliances, and store backups in a proprietary format on dedicated storage don’t address the application consistency and scalability requirements of next generation databases. Replication is NOT backup.

Part V: RecoverX Overview

Rubrik Datos IO RecoverX: The Leading Solution for Scalable and Reliable Apache Cassandra Backup and Recovery

Rubrik Datos IO RecoverX is an application-centric data management solution providing backup and recovery for NoSQL, non-relational enterprise database systems. RecoverX delivers scalable versioning, single-click recovery, and industry-first semantic de-duplication, reducing redundant data and lowering storage costs. RecoverX includes any-point-in-time backup and orchestrated recovery to safeguard against logical and human errors, malicious data corruption, application schema corruption, and other soft errors. Organizations can protect their data at any granularity and any point in time (flexible RPOs) to reduce application downtime with recovery in minutes (low RTOs), save up to 80% on secondary storage costs, and increase the productivity of application and DevOps teams. RecoverX enables continuous integration and development by fully automating the refresh of test and development environments using production data.

Advanced Features

• Advanced Recovery: The amount of data organizations need to protect is rapidly increasing, yet customers are constantly demanding faster RTOs to meet their business agility requirements and to address mandatory IT requirements such as those imposed by GDPR. These same organizations are typically building customer-centric applications that span hundreds of tables, with hundreds of TBs of data (aka “Big Data”) and change rates approaching 300%, yet they expect recovery times of just a few hours. To address this new class of recovery requirements, RecoverX 2.5 now delivers advanced features including queryable recovery, enabling fast, granular recovery, as well as incremental and streaming recovery, both of which reduce recovery times to minutes and minimize storage requirements for large-scale restore operations.

• Backup Anywhere, Recover Anywhere: Cyber-attacks like ransomware present a real threat, and organizations need to protect their data across data sources, applications, and any geographical boundary. Increasingly, customers have database tables spanning multiple data centers and multiple continents, but they want their data to be backed up and restored at the granularity of a single data center location. RecoverX 2.5 now delivers on this requirement with support for local backup of geo-distributed applications, enabling local data center recovery for faster and more flexible RTO.

• Enterprise-Grade Security: Rubrik Datos IO is experiencing massive adoption among large enterprise customers across retail, eCommerce, financial services, and healthcare, all of which have significant security requirements. RecoverX 2.5 introduces support for TLS/SSL encryption, X.509 certificates, LDAP authorization, and Kerberos authentication to deliver optimized security between RecoverX and data sources, as well as integration with enterprise security management tools.

Benefits

RecoverX provides organizations with the following benefits:

• Minimize application downtime through application-consistent, point-in-time backups that remove the need for database repairs after recovery operations
• Reduce secondary storage costs by ~80% using industry-first global semantic deduplication
• Failure resiliency and elastic performance via a highly available, software-only product
• Operational efficiency through fully orchestrated recovery
• Lightweight deployment in the public cloud or a private data center

Part VI: Compare and Contrast: How RecoverX Stacks Up for Cassandra Backup & Recovery

How Rubrik Datos IO Stacks up for Cassandra Backup and Recovery

Cluster Consistent Versioning/Restore
• Script Based Solution: Not consistent; database repair required after restore
• Native Managed Backup Solution: Not consistent; database repair required after restore
• Rubrik Datos IO RecoverX: Enterprise-grade backup and recovery versioning at any interval and granularity; no database repair after restore
• Customer Value & Benefits: Reduces recovery time 2-3x; lowest RTO and reduced restore time

Backup Granularity
• Script Based Solution: Key space
• Native Managed Backup Solution: Key space
• Rubrik Datos IO RecoverX: Column family
• Customer Value & Benefits: Granular backup policy versioning

Data From Cassandra Nodes to Generate Subsequent Backup Versions
• Script Based Solution: Full
• Native Managed Backup Solution: Full
• Rubrik Datos IO RecoverX: Incremental
• Customer Value & Benefits: Reduced time to complete backup to protection storage

Parallel Backup of All Cassandra Nodes to Protection Storage
• Script Based Solution: No
• Native Managed Backup Solution: No
• Rubrik Datos IO RecoverX: Yes
• Customer Value & Benefits: Reduced time to complete backup to protection storage

Failure Handling
• Script Based Solution: None
• Native Managed Backup Solution: None
• Rubrik Datos IO RecoverX: Yes; source node failures handled without data loss
• Customer Value & Benefits: Failure resiliency and reduced enterprise risk

Backup Storage Capacity Reduction
• Script Based Solution: None
• Native Managed Backup Solution: None
• Rubrik Datos IO RecoverX: Yes; semantic de-duplication across all backup versions and storage choices
• Customer Value & Benefits: Protected data is always de-duplicated and stored on cost-effective secondary storage; up to 90% storage savings

Support for Multiple Databases and Platforms
• Script Based Solution: None
• Native Managed Backup Solution: None
• Rubrik Datos IO RecoverX: Yes; Apache Cassandra, DataStax Enterprise, and MongoDB
• Customer Value & Benefits: Multi-platform, enterprise-grade data protection software

Part VII: Customer Case Study (Fortune 100 Home Improvement Retailer)

“The Rubrik Datos IO team listens. Over the last 9 months, we have given them a number of product features, and the team delivers at a record pace.”

Our Customer Achieved: A 1-Hour SLA

Using RecoverX, our customer was able to back up multiple Cassandra databases and recover from data loss with an SLA of 1 hour. In addition, our customer increased their compute and memory resources during peak seasons and was able to get the required backup performance.

The Customer

Our customer is a Fortune 100 home improvement retailer with thousands of brick-and-mortar stores in the U.S., Canada, and Mexico. In addition, they have an eCommerce application that is central to their digital transformation journey. Their hyper-scale application is built with a cloud-first methodology and is deployed natively on Google Cloud Platform (GCP).

The Need

Our customer’s challenges can be summarized in three broad categories.

The first is Backup and Recovery for Cassandra in Google Cloud. The company is in the midst of their digital transformation journey, and as their online business grew, they experienced challenges in scaling their existing relational databases. They re-architected their online business application using a microservices-based, cloud-native application architecture. This new application is deployed in

Google Cloud Platform, and uses underlying Cassandra (DataStax) databases to populate information for different customer-facing platforms. Given that their core online business is based on this webscale eCommerce application, any data loss is detrimental to their business. Our customer requires the ability to back up multiple Cassandra databases and recover from any data loss with an SLA of 1 hour.

The second category is Backup Software Elasticity. Being in the eCommerce industry, our customer experiences huge spikes in data volume during the holidays, especially Thanksgiving and Christmas. They could not rely on a static infrastructure footprint; instead, they wanted to scale up resources only during peak seasons and scale them back down after the season ended. This would ultimately allow them to optimize the cost of operating a data protection solution without sacrificing high availability.

The third category is Test Cluster Refresh Using Production Data. Our customer has multiple production and test clusters, and they implement continuous integration and continuous development (CI/CD) methodologies. Initially, they had to spend a considerable amount of time refreshing their test clusters because the clusters have different topologies.

They needed a data management solution that would automate the refresh of test clusters using production data at regular intervals.

Key Challenges
• Cloud-native backups of Cassandra (DataStax Enterprise) in Google Cloud
• Data protection SLA of 1 hour and minimized application downtime
• Automated refresh of test clusters using production data at regular intervals

Why Rubrik Datos IO
• The only cloud-native data protection solution for Apache Cassandra and DataStax Enterprise
• Cluster-consistent backups that don’t require repairs on recovery, leading to low RTO
• 80% reduction in secondary storage costs
• High performance to guarantee the 1-hour SLA using elastic compute/storage resources

The Solution

Initially, our customer’s DevOps team used database-native tools but could not achieve the enterprise-level data protection features they needed. To quantify the impact, any outage is estimated to cost hundreds of thousands of dollars in lost business.

Rubrik Datos IO RecoverX was deployed natively in Google Cloud. RecoverX software was deployed in a clustered configuration to achieve high availability. A single RecoverX cluster was deployed to protect as many as 6 Cassandra clusters with a 1-hour backup interval. The data was stored on Google Cloud Storage for cost effectiveness.

The Results

Using RecoverX’s industry-first features, the customer was able to achieve operational and capital cost savings. At the same time, they were able to meet the organizational goal of a 1-hour SLA.

Semantic deduplication resulted in 82% storage cost savings. Further, the customer optimized their infrastructure costs by allocating appropriate compute resources depending on their application requirements.

About Rubrik Datos IO

Rubrik Datos IO is the application-centric data management company for the multi-cloud world. Our flagship product, Rubrik Datos IO RecoverX, delivers a radically novel approach to data management, helping organizations embrace the cloud with confidence by delivering solutions that protect, mobilize, and monetize their data — at scale. Rubrik Datos IO was recently awarded Product of the Year by Storage Magazine and was recognized by Gartner in the 2016 and 2017 Hype Cycle for Storage Technologies. Rubrik Datos IO is headquartered in Palo Alto, California.

www.datos.io | 408.708.4136 | [email protected] | 1001 Page Mill Rd, Bldg 2, Palo Alto, CA 94304