The Definitive Guide to Backup and Recovery for Cassandra EBOOK

www.datos.io

Table of Contents

Executive Summary
1. Cassandra Technology Overview
2. The Need for Backup & Recovery for Cassandra
3. Backup and Recovery Requirements
4. Existing Practices for Backup and Recovery
5. RecoverX Overview
6. Compare and Contrast Cassandra Backup & Recovery Solutions
7. Fortune 100 Retailer Customer Success Story

Executive Summary

In today’s era of big data, enterprise applications create a large volume of data that may be structured, semi-structured or unstructured in nature. Additionally, application development cycles are much shorter and application availability is a critical requirement. Given these requirements, enterprises are forced to look beyond traditional relational databases to onboard next-generation applications (on IaaS or cloud-based PaaS). NoSQL databases such as Apache Cassandra are now being evaluated and adopted by enterprises for these applications, including eCommerce and content management.

Part I: Cassandra Technology Overview

Apache Cassandra™ is an open source, distributed and decentralized storage system (database) for managing large amounts of structured data across many commodity servers, while providing a highly available service with no single point of failure. Apache Cassandra offers capabilities including continuous availability, linear scale performance, operational simplicity and easy data distribution across multiple data centers and cloud availability zones.

Key features of Cassandra include the following:

• Elastic scalability: Cassandra is highly scalable; more hardware can be added to accommodate more customers and additional data.
• Always-on architecture: Cassandra has no single point of failure and is continuously available for business-critical applications.
• Fast linear-scale performance: Cassandra is linearly scalable, i.e., throughput increases as nodes are added to the cluster, which helps it maintain a quick response time as load grows.
• Flexible data storage: Cassandra accommodates structured, semi-structured, and unstructured data. It can dynamically accommodate changes to data structures.
• Easy data distribution: Cassandra provides the flexibility to distribute data by replicating it across multiple data centers.
• Transaction support: Cassandra does not offer full relational-style ACID transactions. Writes are atomic, isolated, and durable at the level of a single partition, and consistency is tunable per operation.
• Fast writes: Cassandra was designed to run on cheap commodity hardware. It performs fast writes and can store hundreds of terabytes of data without sacrificing read efficiency.

Apache Cassandra’s architecture is responsible for its ability to scale, perform, and offer continuous uptime. All nodes play an identical role, and every node communicates with the others as a peer. This built-for-scale architecture is capable of handling large amounts of data and thousands of concurrent users or operations per second, even across multiple data centers, as easily as it can manage much smaller amounts of data and user traffic. Because there is no single point of failure, Cassandra can offer continuous availability and uptime: new nodes can be added to an existing cluster without taking it down.

In Cassandra, one or more of the nodes in a cluster act as replicas for a given piece of data. If some of the replicas respond to a read with an out-of-date value, Cassandra returns the most recent value to the client. After returning the most recent value, Cassandra performs a read repair in the background to update the stale values.
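The read-repair behavior described above can be sketched with a toy model. This is a simplified illustration, not Cassandra's implementation: the `Replica` class, the node names, and the integer timestamps are hypothetical, and the repair runs inline here, whereas Cassandra performs it in the background after answering the client.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One replica's copy of a single value (hypothetical toy model)."""
    name: str
    value: str
    write_timestamp: int  # larger means written more recently


def read_with_repair(replicas):
    """Return the newest value and bring stale replicas up to date.

    Returns a tuple: (newest value, names of the replicas that were repaired).
    """
    newest = max(replicas, key=lambda r: r.write_timestamp)
    repaired = []
    for replica in replicas:
        if replica.write_timestamp < newest.write_timestamp:
            # Overwrite the stale copy with the newest value and timestamp.
            replica.value = newest.value
            replica.write_timestamp = newest.write_timestamp
            repaired.append(replica.name)
    return newest.value, repaired


replicas = [
    Replica("node-a", "cart=3 items", write_timestamp=200),
    Replica("node-b", "cart=2 items", write_timestamp=100),  # stale copy
    Replica("node-c", "cart=3 items", write_timestamp=200),
]
value, repaired = read_with_repair(replicas)
print(value, repaired)
```

The client sees only the newest value; the stale replica is corrected as a side effect of the read.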
Part II: The Need for Backup & Recovery for Cassandra

As organizations increasingly rely upon NoSQL databases for their business-critical applications, and as these same organizations adopt a cloud-first strategy, a comprehensive backup and recovery strategy has never been more critical. In this section, we will explore the need for a complete data protection strategy.

One of the advantages of Cassandra is that it provides scalability and high availability without compromising performance, and native replication is central to that availability. Put simply, replication creates an alternative copy of an original that changes with the original in real time. So while native replication is good for protecting against media and network failures by replicating data across nodes, it can be a serious detriment in the case of data corruption or loss: the corrupted data set gets replicated too, exacerbating an already bad situation.

Furthermore, the most common cause of data corruption and loss today is human error, including operator error, fat-finger mistakes, and accidental deletions. More recently, malicious cyber threats such as malware and ransomware have proliferated. To protect against these threats, a comprehensive backup and recovery strategy is required.

Unlike a replica, a backup is a copy of data taken at a certain point in time that does not change afterwards. A backup copy enables an administrator to restore data as of a certain time in the past, typically a time before the data loss or ransomware attack occurred.

In summary, backup plays a critical role in an organization’s overall data protection strategy. In the next section, we’ll discuss in more detail the technical requirements for backup and recovery of Cassandra databases.
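The difference between replication and backup argued in this section can be made concrete with a small sketch. It uses plain Python dictionaries as stand-ins for two replicas; the names `node_a`, `node_b`, and the order keys are hypothetical, and nothing here talks to a real Cassandra cluster.

```python
import copy


def apply_delete(replicas, key):
    """Delete a key on every replica, the way replication propagates a change."""
    for data in replicas:
        data.pop(key, None)


def take_backup(data):
    """Return an independent point-in-time copy of the data."""
    return copy.deepcopy(data)


node_a = {"order-1": "shipped", "order-2": "pending"}
node_b = dict(node_a)          # a replica that mirrors node_a

backup = take_backup(node_a)   # taken before the mistake happens

apply_delete([node_a, node_b], "order-2")  # accidental deletion, replicated

print("order-2" in node_a, "order-2" in node_b)  # False False
print("order-2" in backup)                       # True

restored = take_backup(backup)  # recover the state from before the deletion
```

Both replicas faithfully apply the mistaken delete, so neither can be used to recover the row; only the copy frozen earlier still holds it.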
Part III: Requirements of Protection for Cassandra

This section introduces the key requirements for protecting data that resides on Apache Cassandra, whether deployed on-premises, in a private cloud with an as-a-service model, or in a public cloud such as Amazon Web Services (AWS) or Google Cloud Platform.

Requirement #1: Online Cluster-Consistent Backups

One of the key requirements of next-generation applications that are deployed on Apache Cassandra is their always-on nature. This means that quiescing the database to take backups is not feasible; moreover, the backup operation should not impact the performance of the application. As the application scales, the underlying Apache Cassandra cluster also needs to scale out to multiple nodes. In this case, a backup solution must provide a backup copy that is consistent across nodes without disrupting database and application performance during backup operations.

Requirement #2: Flexible Backup Options

Depending on the application, data may have different change rates and patterns. For example, in a product catalog, certain items may be refreshed every day (fast-selling goods), while others may have a longer shelf life (premium items). Based on the application requirements, some tables may need to be backed up every hour, while others may be backed up daily. The flexibility to schedule backups at any interval and at table-level granularity is another requirement that we have heard from customers who are using Apache Cassandra. More importantly, these backups should always be stored on secondary storage in native formats to avoid vendor lock-in.

Requirement #3: Handling Failure

Failures are the norm in the distributed database world. The backup solution should therefore be resilient to database process failures, node failures, network failures and even logical corruption of data during backup and recovery operations.
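One small part of the resilience called for under Requirement #3 can be sketched as retrying a failed step with exponential backoff. This is a generic technique, not a description of any particular product; `run_with_retries` and its parameters are hypothetical names, and the sketch covers only transient errors such as a dropped connection, not logical corruption.

```python
import time


def run_with_retries(operation, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    Waits base_delay seconds after the first failure and doubles the wait
    after each further failure. If every attempt fails, the last error is
    re-raised to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except OSError as error:  # includes ConnectionError and TimeoutError
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({error}); retrying in {delay}s")
            sleep(delay)
```

A backup job might wrap each per-node transfer in such a call, so that a brief network failure delays the backup rather than aborting it.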
Finally, the backup solution should be able to handle the failure of nodes that store cluster metadata.

Part IV: Existing Solutions in the Marketplace

1. Traditional Backup and Recovery Solutions

Traditional solutions were architected for IT applications deployed on scale-up, static compute and storage resources. They were designed for scale-up application environments with uncompressed workloads, built to be operated within the walls of on-premises infrastructure. Legacy backup architectures based on media servers have no place in the cloud, and likewise no place protecting next-generation scale-out data.

Legacy backup solutions from long-established backup vendors (e.g., Veritas, EMC/Legato, CommVault) are all based on the same legacy architecture: a single-vendor, end-to-end design that centralizes both the control of the backup process (the “control plane”) and the moving and storing of the backed-up data (the “data plane”). In these legacy backup solutions, “media servers” act as the consolidated “control plane” and “data plane”. Each backup vendor’s own client “agent” software, installed on each host, uses database-specific APIs to invoke and process backups of structured data (e.g., Oracle RMAN). Likewise, client agents installed on each file server back up the unstructured files on that server, or use APIs on each NAS (an appliance-based dedicated file server).

With an average architecture age of 20 years, legacy backup and recovery products from vendors including Veritas, Dell EMC, and CommVault were all designed before the advent of the cloud and modern applications, and are not architected to support the new, modern IT stack.

2. Manual Scripted Solutions

Manual solutions leverage the native Apache Cassandra snapshot utility and scripts to transfer data to secondary storage.
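A minimal sketch of such a script is shown here, assuming the `nodetool` command is on the PATH of each node and that data files live under `/var/lib/cassandra/data` (a common default for package installs, but an assumption here). The keyspace name and tag passed by the caller, and the helper function names, are hypothetical. The step that copies the snapshot directories to secondary storage is left out, and the script would have to run on every node in the cluster.

```python
import subprocess
from pathlib import Path


def build_snapshot_command(keyspace, tag, table=None):
    """Build the nodetool command that snapshots a keyspace (or one table)."""
    command = ["nodetool", "snapshot", "-t", tag]
    if table is not None:
        command += ["--table", table]
    command.append(keyspace)
    return command


def find_snapshot_dirs(data_dir, keyspace, tag):
    """List the per-table snapshot directories created for a tag.

    Snapshots are stored as <data_dir>/<keyspace>/<table dir>/snapshots/<tag>.
    """
    pattern = f"{keyspace}/*/snapshots/{tag}"
    return sorted(Path(data_dir).glob(pattern))


def snapshot_node(keyspace, tag, data_dir="/var/lib/cassandra/data"):
    """Take a snapshot on the local node and return its directories."""
    subprocess.run(build_snapshot_command(keyspace, tag), check=True)
    return find_snapshot_dirs(data_dir, keyspace, tag)
```

Even this small example hard-codes a data directory and runs on one node at a time, which hints at the per-cluster customization such scripts tend to need.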
The scripts are customized for each Apache Cassandra cluster and require significant operational effort to scale or adapt to any topology changes. Further, these scripts are not resilient to failure scenarios, e.g., failure of a node (primary or secondary)