DOES BIG DATA MEAN BIG STORAGE?

Mikhail Gloukhovtsev
Sr. Cloud Solutions Architect
Orange Business Services

Table of Contents

1. Introduction
2. Types of Storage Architecture for Big Data
   2.1 Storage Requirements for Big Data: Batch and Real-Time Processing
   2.2 Integration of Big Data Ecosystem with Traditional Enterprise Data Warehouse
   2.3 Data Lake
   2.4 SMAQ Stack
   2.5 Big Data Storage Access Patterns
   2.6 Taxonomy of Storage Architectures for Big Data
   2.7 Selection of Storage Solutions for Big Data
3. Hadoop Framework
   3.1 Hadoop Architecture and Storage Options
   3.2 Enterprise-class Hadoop Distributions
   3.3 Big Data Storage and Security
   3.4 EMC Isilon Storage for Big Data
   3.5 EMC Greenplum Distributed Computing Appliance (DCA)
   3.6 NetApp Storage for Hadoop
   3.7 Object-Based Storage for Big Data
      3.7.1 Why Is Object-based Storage for Big Data Gaining Popularity?
      3.7.2 EMC Atmos
   3.8 Fabric Storage for Big Data: SAN Functionality at DAS Pricing
   3.9 Virtualization of Hadoop
4. Cloud Computing and Big Data
5. Big Data Backups
   5.1 Challenges of Big Data Backups and How They Can Be Addressed
   5.2 EMC Data Domain as a Solution for Big Data Backups
6. Big Data Retention
   6.1 General Considerations for Big Data Archiving
      6.1.1 Backup vs. Archive
      6.1.2 Why Is Archiving Needed for Big Data?
      6.1.3 Pre-requisites for Implementing Big Data Archiving
      6.1.4 Specifics of Big Data Archiving
      6.1.5 Archiving Solution Components
      6.1.6 Checklist for Selecting Big Data Archiving Solution
   6.2 Big Data Archiving with EMC Isilon
   6.3 RainStor and Dell Archive Solution for Big Data
7. Conclusions
8. References

Disclaimer: The views, processes or methodologies published in this article are those of the author. They do not necessarily reflect the views, processes or methodologies of EMC Corporation or Orange Business Services (my employer).


1. Introduction

Big Data has become a buzzword, and we hear about it from early morning – when the newspaper tells us "How Big Data Is Changing the Whole Equation for Business"1 – through the rest of the day. A search for "big data" on Google returned about 2,030,000,000 results in December 2013. So what is Big Data?

According to Krish Krishnan,2 the so-called three V's definition of Big Data that became popular in the industry was first suggested by Doug Laney in a research report published by META Group (now Gartner) in 2001. In a more recent report,3 Doug Laney and Mark Beyer define Big Data as follows: "Big Data is high-volume, -velocity, and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."

Let us review these characteristics of Big Data in more detail.

1. Volume of data is huge (for instance, billions of rows and millions of columns). People create digital data every day by using mobile devices and social media. Data defined as Big Data includes machine-generated data from sensor networks, nuclear plants, X-ray and scanning devices, and consumer-driven data from social media. According to IBM, as of 2012, 2.5 exabytes of data were created every day, and 90% of the data in the world today was created in the last two years alone.4 This data growth is being accelerated by the Internet of Things (IoT), defined as the network of physical objects that contain embedded technology to communicate and interact with their internal states or the external environment (IoT excludes PCs, tablets, and smartphones). IoT will grow to 26 billion installed units by 2020, an almost 30-fold increase from 0.9 billion in 2009, according to Gartner.5

2. Velocity of new data creation and processing. Velocity means both how fast data is being produced and how fast the data must be processed to meet demand. In the case of Big Data, the data streams in continuously, and time-to-value can be achieved only when data capture, data preparation, and processing are fast. This requirement is all the more challenging because the data generation speed changes and the data size varies.

3. Variety of data. In addition to traditional structured data, the data types include semi-structured data (for example, XML files), quasi-structured data (for example, clickstream strings), and unstructured data.

The misconception that large volume is the key characteristic defining Big Data can result in the failure of a Big Data-related project unless the project also focuses on the variety, velocity, and complexity of the data, which are becoming the leading features of Big Data. What is seen as a large data volume today can become the normal data size in a year or two.

A fourth V – Veracity – is frequently added to this definition of Big Data. Data veracity deals with uncertain or imprecise data. How accurate is the data in predicting business value? Does Big Data analytics give meaningful results that are valuable for the business? Data accuracy must be verifiable.

Just retaining more and more data of various types does not create any business advantage unless the company has developed a Big Data strategy to extract business information from Big Data sets. Business benefits are frequently higher when addressing the variety of the data rather than just the data volume. Business value can also be created by combining the new Big Data types with existing information assets, which results in even larger data type diversity. According to research done by MIT and the IBM Institute for Business Value,6 organizations applying analytics to create a competitive advantage within their markets or industries are more than twice as likely to substantially outperform their peers.

The requirement for time-to-value calls for innovations in data processing, and those innovations are challenged by Big Data complexity. In addition to the great variety of Big Data types, the combination of different data types – each presenting different challenges and requiring different analytical methods to generate business value – makes data management more complex. With unstructured data accounting for 80%–90% of the data in existence, complexity also means that different standards, data processing methods, and storage formats can exist for each asset type and structure.

The level of complexity and/or data size of Big Data has resulted in another definition: data that cannot be efficiently managed using only traditional data-capture technology and processes or methods. Therefore, new applications and new infrastructure, as well as new processes and procedures, are required to use Big Data. The storage infrastructure for Big Data applications should be capable of managing large data sets and providing the required performance. Development of new storage solutions should address the 3V+V characteristics of Big Data.

Big Data creates great potential for business development, but at the same time it can mean "Big Mistakes" if a lot of money and time are spent on poorly defined business goals or opportunities. The goal of this article is to help readers anticipate how Big Data will affect storage infrastructure design and data lifecycle management so that they can work with storage vendors to develop Big Data road maps for their companies and spend their Big Data storage budgets wisely.

While this article considers the storage technologies for Big Data, I want readers to keep in mind that Big Data is about more than just technology. To gain the business advantages of Big Data, companies have to change the way they do business and develop enterprise information management strategies that address the Big Data lifecycle, including hardware, software, services, and policies for capturing, storing, and analyzing Big Data. For more detail, refer to the excellent EMC course, Data Science and Big Data Analytics.8

2. Types of Storage Architecture for Big Data

2.1 Storage Requirements for Big Data: Batch and Real-Time Processing

Big Data architecture is based on two different technology types: one for real-time, interactive workloads and one for batch processing requirements. These classes of technology are complementary and frequently deployed together – for example, in the Pivotal One platform, which includes Pivotal Data Fabric9,10 (Pivotal is partly owned by EMC, VMware, and General Electric).

Big Data frameworks such as Hadoop are batch process-oriented. They address the problems of the cost and speed of Big Data processing by using open source software and massively parallel processing. Server and storage costs are reduced by implementing scale-out solutions based on commodity hardware.

NoSQL databases that are highly optimized key-value data stores, such as HBase, are used for high-performance, index-based retrieval in real-time processing. NoSQL databases can process large amounts of data from various sources in a flexible data structure with low latency. They can also provide real-time data integration with a Complex Event Processing (CEP) engine to enable actionable real-time Big Data analytics. High-speed processing of Big Data in flight – so-called Fast Big Data – is typically done using in-memory computing (IMC). IMC relies on in-memory data management software to deliver high-speed, low-latency access to terabytes of data across a distributed application. Some Fast Big Data solutions, such as Terracotta BigMemory,11 keep all the data in memory with the motto "Ditch the Disk" and use disk-based storage only for data copies and redo logs for database startup and fault recovery, as SAP HANA, an in-memory database, does.12 Other Fast Big Data solutions, such as the Oracle Exalytics In-Memory Machine,13 the Teradata Active EDW platform,14 and DataDirect Networks' SFA12KX series appliances,15 use a hybrid storage architecture (see Section 2.6).
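To make the key-value access pattern concrete, the sketch below writes and reads a single row through the HBase Java client API (the 0.9x-era HTable interface). The table name, column family, and row-key scheme are illustrative assumptions, not part of any product discussed here.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class SensorLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        HTable table = new HTable(conf, "sensor_events");    // assumes the table already exists

        // Write one reading; the row key combines sensor ID and timestamp so range scans stay cheap.
        Put put = new Put(Bytes.toBytes("sensor42#20140101T120000"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("temp"), Bytes.toBytes("21.5"));
        table.put(put);

        // Index-based point read: latency is bounded by a single region lookup, not a batch job.
        Get get = new Get(Bytes.toBytes("sensor42#20140101T120000"));
        Result result = table.get(get);
        System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("temp"))));

        table.close();
    }
}
```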

As data of various types is captured, it is stored and processed in traditional DBMSs (for structured data), simple files, or distributed clustered systems such as NoSQL data stores and the Hadoop Distributed File System (HDFS). Due to the size of the Big Data sets, the raw data is not moved directly to a data warehouse. The raw data undergoes transformation using MapReduce processing, and the resulting reduced data sets are loaded into the data warehouse environment, where they are used for further analysis – conventional BI reporting/dashboards, statistical, semantic, and correlation capabilities, and advanced data visualization.


Figure 1: Enterprise Big Data Architecture (Ref. 10)

Storage requirements for batch processing and for real-time analytics are very different. While the capability to store and manage hundreds of terabytes in a cost-effective way is the key requirement for batch processing of Big Data, low I/O latency is the key for large-capacity, performance-intensive Fast Big Data analytics applications. Storage architectures for such I/O-performance-driven applications, which include all-flash storage appliances and/or in-memory databases, are not in the scope of this article.

2.2 Integration of Big Data Ecosystem with Traditional Enterprise Data Warehouse

The integration of a Big Data platform with a traditional BI enterprise data warehouse (EDW) architecture is important, and this requirement, which many enterprises have, results in the development of so-called consolidated storage systems. Instead of a "rip & replace" of the existing storage ecosystems, organizations can leverage the existing storage systems and adapt their data integration strategy, using Hadoop as a form of preprocessor for Big Data integration in the data warehouse. Consolidated storage includes storage tiering and is used for very different data management processes: primary workloads, real-time online analytics queries, and offline batch-processing analytics. These different data processing types result in the heterogeneous or hybrid storage environments discussed later. Readers can find more information about consolidated storage in Ref. 16.

The new integrated EDW should have three main capabilities:

1. Hadoop-based analytics to process and analyze any data type across commodity server clusters
2. Real-time stream processing with a Complex Event Processing (CEP) engine with sub-millisecond response times
3. Data warehousing providing insight with advanced in-database analytics

The integration of Big Data and EDW environments is reflected in the definition of “Data Lake”.

2.3 Data Lake

Booz Allen Hamilton introduced the concept of the Data Lake.17 Instead of storing information in discrete data structures, the Data Lake consolidates an organization's complete repository of data in a single, large table. The Data Lake includes data from all data sources – unstructured, semi-structured, streaming, and batch data. To store and process terabytes to petabytes of data, a Data Lake should scale in both storage and processing capacity in an affordable manner. An enterprise Data Lake should provide highly available, protected storage; support existing data management processes and tools as well as real-time data ingestion and extraction; and be capable of data archiving.

Support for Data Lake architecture is included in Pivotal's Hadoop products that are designed to work within existing SQL environments and can co-exist alongside in-memory databases for simultaneous batch and real-time analytic queries.18 Customers implementing Data Lakes can use the Pivotal HD and HAWQ platform for storing and analyzing all types of data – structured and unstructured.

Pentaho has created an optimized system for organizing the data that is stored in the Data Lake, allowing customers to use Hadoop to sift through the data and extract the chunks that answer the questions at hand.19

2.4 SMAQ Stack

The term SMAQ stack, coined by Ed Dumbill in a blog post at O'Reilly Radar,20 refers to a processing stack for Big Data that consists of layers of Storage, MapReduce technologies, and Query technologies. SMAQ systems are typically open source and distributed and run on commodity hardware. Similar to the commodity LAMP stack of Linux, Apache, MySQL, and PHP, which has played a critical role in the development of Web 2.0, SMAQ systems are expected to become a framework for the development of Big Data-driven products and services. While Hadoop-based architectures dominate in SMAQ, SMAQ systems also include a variety of NoSQL databases.

Figure 2: The SMAQ Stack for Big Data (Ref. 8)

As expected, storage is the foundational layer of the SMAQ stack and is characterized by distributed and unstructured content. At the intermediate layer, MapReduce technologies enable the distribution of computation across many servers and support a batch-oriented processing model of data retrieval and computation. Finally, at the top of the stack are the query functions, whose characteristic capability is finding efficient ways of defining computation and providing a platform for "user-friendly" analytics.

2.5 Big Data Storage Access Patterns

Typical Big Data storage access patterns are write-once, read-many workloads with metadata lookups and large block-sized reads (64 MB to 128 MB, e.g. Hadoop HDFS), as well as small-sized accesses for HBase. Therefore, the Big Data processing design should provide efficient data reads.

Both scale-out file systems and object-based systems can meet Big Data storage requirements. Scale-out file systems provide a global namespace file system, whereas the use of metadata in object storage (see Section 3.7 below) allows high scalability for large data sets.

2.6 Taxonomy of Storage Architectures for Big Data

Big Data storage architectures can be categorized into shared-nothing, shared primary, or shared secondary storage. Implementation of Hadoop using Direct Attached Storage (DAS) is common, as many Big Data architects see shared storage architectures as relatively slow, complex, and, above all, expensive. However, DAS has its own limitations (first of all, inefficiency in storage use) and is one extreme in the broad spectrum of storage architectures. As a result, in addition to DAS-based HDFS systems, enterprise-class storage solutions using shared storage (scale-out NAS, e.g. EMC Isilon®, or SAN), alternative distributed file systems, cloud object-based storage for Hadoop (using REST APIs such as CDMI, S3, or Swift; see Ref. 7 for details), and decoupled storage and compute nodes, such as solutions using vSphere BDE (see Section 3.9), are gaining popularity (see Figure 3).

For Hadoop workloads, the storage-to-compute resource ratios vary by application, and it is often difficult to determine them in advance. This challenge makes it imperative to design a Hadoop cluster for flexibility, scaling storage and compute independently. Decoupling storage and compute resources is a way to scale storage independently of compute. Examples of such architectures are SeaMicro Fabric Storage and Hadoop virtualization using VMware BDE, which are discussed in Sections 3.8 and 3.9, respectively.


Figure 3: Technologies Evaluated or Being Deployed to Meet Big Data Requirements (Ref. 21)

Figure 3 presents technologies in the priority order in which companies are evaluating them or have deployed them to meet Big Data requirements.21 EMC Isilon is an example of shared storage (scaled-out NAS) as primary storage for Big Data Analytics. A Big Data “Stack” like the EMC Big Data Stack presented below needs to be able to operate on a multi-petabyte scale to handle structured and unstructured data.


Technology Layer                  EMC Product
Collaborative – Act               Documentum® xCP, Greenplum® Chorus
Real Time – Analyze               Greenplum + Hadoop, Pivotal
Structured/unstructured data      Pivotal HD, Isilon
Storage, Petabyte Scale           Isilon, Atmos®

Table 1: EMC Big Data Stack

A challenge for IMC-based Big Data analytics is that the volume of data that companies want to analyze grows faster than memory becomes affordable. The 80/20 rule applies to many enterprise analytic environments – only 20% of the data generates 80% of the I/O in a given period of time. As the data ages, it is more rational to implement dynamic storage tiering in a hybrid storage architecture than to place all the data in memory. The goal of a hybrid storage architecture is to address the great variety of storage performance requirements that various types of Big Data have by implementing dynamic storage tiering, which moves data chunks between storage pools of SSD, SAS, and SATA drives. The hybrid storage in the Teradata Active EDW platform14 and DataDirect Networks' Storage Fusion Architecture15 exemplifies this storage architecture type.

DataDirect Networks' SFA12KX series appliances are built on the company's integrated Big Data-oriented Storage Fusion Architecture. A single SFA12KX appliance that can accommodate a mix of up to 1,680 SSD, SATA, or SAS drives delivers up to 1.4 million input/output operations per second (IOPS) and pushes data through at a rate of 48 GB per second.15
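As a minimal illustration of the tiering decision described above, the toy sketch below places a data chunk on an SSD, SAS, or SATA pool based on its age and recent read count. The thresholds and per-chunk statistics are invented for demonstration; real arrays apply far more sophisticated, policy-driven movement at sub-LUN granularity.

```java
import java.time.Duration;
import java.time.Instant;

public class TieringPolicySketch {
    enum Tier { SSD, SAS, SATA }

    // Toy placement rule: hot chunks (the ~20% serving ~80% of I/O) stay on flash,
    // warm chunks go to SAS, and aged, rarely read chunks drop to capacity-optimized SATA.
    static Tier placeChunk(Instant lastAccess, long readsLastWeek) {
        long ageDays = Duration.between(lastAccess, Instant.now()).toDays();
        if (readsLastWeek > 10_000 || ageDays < 1) {
            return Tier.SSD;
        } else if (ageDays < 30) {
            return Tier.SAS;
        }
        return Tier.SATA;
    }

    public static void main(String[] args) {
        System.out.println(placeChunk(Instant.now().minus(Duration.ofDays(90)), 12));     // SATA
        System.out.println(placeChunk(Instant.now().minus(Duration.ofHours(2)), 50_000)); // SSD
    }
}
```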

Nutanix offers a Converged Storage Architecture. The Nutanix Complete Cluster, using the Nutanix Distributed File System (NDFS), enables MapReduce to be run without HDFS and its NameNode. The Nutanix Complete Cluster consolidates DAS with compute resources in four-node Intel-based appliances called "Compute + Storage Together."22 The internal storage, a combination of PCIe SSDs (Fusion-io) and SATA hard disks from all nodes, is virtualized into a unified pool by Nutanix Scale-out Converged Storage and can be dynamically allocated to any virtual machine or guest operating system. A Nutanix Controller Virtual Machine (VM) on each host manages storage for the VMs on that host. The Controller VMs work together to manage storage across the cluster as a pool using NDFS.

2.7 Selection of Storage Solutions for Big Data

Many Big Data solutions emphasize low cost; however, there are also higher-cost solutions, such as those using enterprise-class storage, that remain cost-effective because they yield significant benefits. Choosing a storage architecture for Big Data is a mix of science and art: finding the right balance between TCO and value for the business.

3. Hadoop Framework

3.1 Hadoop Architecture and Storage Options

The Apache Hadoop platform,23 an open-source software framework supporting data-intensive distributed applications, has two core components: the Hadoop Distributed File System (HDFS), which manages massive unstructured data storage on commodity hardware, and MapReduce, which provides various functions to access the data on HDFS. The HDFS architecture evolved from the Google File System architecture. MapReduce consists of a Java API as well as software to implement the services that Hadoop needs to function. Hadoop integrates the storage and analytics in a framework that provides reliability, scalability, and management of the data.
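To show what the MapReduce API looks like in practice, here is the canonical word-count job written against the org.apache.hadoop.mapreduce API – a minimal sketch of a batch job that reads files from HDFS, shuffles map output by key, and writes reduced results back to HDFS. The input and output paths are supplied on the command line and are assumptions of this example.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // map output is shuffled and grouped by key
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(key, new IntWritable(sum));   // reduced data set written back to HDFS
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The job is packaged into a JAR and submitted with the hadoop jar command; the JobTracker (or YARN in Hadoop 2.x) then schedules map tasks close to the HDFS blocks they read.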

The three principal goals of the HDFS architecture are:

1. Process extremely large data volumes (large numbers of files) ranging from gigabytes to petabytes
2. Streaming data processing to read data at high throughput rates and process data on read
3. Capability to execute on commodity hardware with no special hardware requirements

Hadoop supports several different node types:

• Multiple DataNodes
• The NameNode, which manages the HDFS namespace by determining which DataNode contains the data requested by the client and redirecting the client to that particular DataNode
• The Checkpoint node, a secondary NameNode that manages the on-disk representation of the NameNode metadata
• The JobTracker node, which manages all jobs submitted to the Hadoop cluster and facilitates job and task scheduling

Subordinate nodes provide both TaskTracker and DataNode functionality. These nodes perform all of the real work done by the cluster. DataNodes store the data and serve I/O requests under the control of the NameNode. The NameNode houses and manages the metadata: when a TaskTracker gets a read or write request for an HDFS block, the NameNode tells the TaskTracker where the block exists or where it should be written. TaskTrackers execute the job tasks assigned to them by the JobTracker.

Hadoop is "rack aware" – the NameNode uses a data structure that determines which DataNode is preferred based on the "network distance" between them. Nodes that are "closer" are preferred (same rack, different rack, same data center). HDFS uses this information when replicating data, trying to keep different copies of the data on different racks. The goal is to reduce the impact of a rack power outage or switch failure so that even if these events occur, the data may still be readable.

HDFS uses a "shared nothing" architecture for primary storage – all the nodes have direct-attached SAS or SATA disks. Direct Attached Storage (DAS) means that a server attaches directly to storage system ports without a switch; internal drives in the server enclosure fall into this category. Since DAS uses a point-to-point connection, it provides high bandwidth between the server and the storage system. No DAS-type storage is shared: the disks are locally attached, and no disk is attached to two or more nodes. The default way to store data for Hadoop is HDFS on local direct-attached disks. However, it can be seen as HDFS on HA-DAS, because the data is replicated across nodes for HA purposes. Compute nodes are distributed file system clients if scale-out NAS servers are used.
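The write-once, read-many pattern shows up directly in the HDFS client API: files are created, streamed in once, and then read sequentially in large, block-sized chunks. The sketch below uses the standard org.apache.hadoop.fs.FileSystem interface; the path is a hypothetical example, and the cluster address is taken from the core-site.xml/hdfs-site.xml files on the classpath.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());   // resolves fs.defaultFS from the site files
        Path path = new Path("/data/ingest/events.log");        // hypothetical landing path

        // Write once: HDFS files are append-only; blocks are replicated as they are written.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("sensor42,2014-01-01T12:00:00,21.5\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read many: large, mostly sequential reads over 64-128 MB blocks.
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }

        System.out.println("Block size: " + fs.getFileStatus(path).getBlockSize() + " bytes");
    }
}
```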

The import stage (putting data into HDFS for processing) and the export stage (extracting data from the system after processing) can be significantly accelerated by replacing conventional hard disk drives (HDDs) with solid state disks (SSDs). Random read times especially benefit from using SSDs. For example, Intel has shown that replacing conventional HDDs with the Intel SSD 520 Series reduced the time to complete the workload by approximately 80 percent – from about 125 minutes to about 23 minutes.24 Even though the cost of SSDs continues to plummet, it is still prohibitively expensive to use all-SSD storage for Hadoop clusters, except in some use cases where time-to-value justifies the cost of SSD-based storage. Therefore, a tiered storage model combining conventional HDDs and SSDs in the same server can provide the right balance between performance and storage cost.

Pros and cons of various storage options25,26 for HDFS are presented in the table below. As seen from the table, HDFS has a few issues the Apache Hadoop community is working to address. One of the top issues for Hadoop v1.0 is that the NameNode represents a single point of failure (SPOF). When it goes offline, the cluster shuts down, and the process that was running at the time of the failure has to be restarted from the beginning. Version 2.0 of Hadoop (2.2.0 is the first stable release in the 2.x line; the GA date was October 16, 2013) introduces both manual and automated failover to a standby NameNode without needing to restart the cluster.27 Automatic failover adds two new components to an HDFS deployment: a ZooKeeper quorum and the ZKFailoverController process.
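From the client side, an HA pair is addressed through a logical nameservice rather than a single NameNode host. The sketch below shows the relevant HDFS HA client settings expressed in code; in practice they live in hdfs-site.xml, and the nameservice name and host names used here are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HaClientConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://mycluster");                        // logical nameservice, not a host
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // The client proxy finds whichever NameNode is currently active and retries transparently on failover.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to " + fs.getUri());
    }
}
```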

Vendors are also coming to market with fixes such as a NameNode failover mode in HDFS, as well as file system alternatives that do not use a NameNode function (that means no NameNode to fail).

There are some disadvantages of triple-mirror server-based replication used by HDFS for ingestion and redistribution of data:

• Inefficient storage utilization; for example, three terabytes of raw capacity are required to store just one usable terabyte (TB) of data.
• Server-based triple replication creates a significant load on the servers themselves.
• Server-based triple replication creates a significant load on the network.

Vendors’ solutions (EMC Isilon, NetApp E-series, and FAS) that address this are described later.

DAS
  Pros: Writes are highly parallel and tuned for Hadoop jobs; the JobTracker tries to make local reads.
  Cons: High replication cost compared with shared storage. The NameNode keeping track of data location is still a SPOF (failover, introduced in Hadoop 2.0, helps; this can also be addressed by dispersed storage solutions such as Cleversafe).

SAN
  Pros: Array capabilities (redundancy, replication, dynamic tiering, virtual provisioning) can be leveraged. As the storage is shared, a new node can easily be assigned a failed node's data. Centralized management. Using shared storage eliminates or reduces the need for three-way data replication between data nodes.
  Cons: Cost; limited scalability of scale-up storage arrays.

Distributed File System / Scale-out NAS
  Pros: Shared data access; POSIX-compatible, so it works for non-Hadoop applications just as a local file system does; centralized management and administration.
  Cons: While HDFS is highly optimized for Hadoop, a general Distributed File System (DFS) is not likely to get the same level of optimization. Strict POSIX compliance leads to unnecessary serialization. Scaling limitations, as some DFSs are not designed for thousands of nodes.

Table 2: Hadoop Storage Options

A tightly coupled Distributed File System (DFS) for Hadoop is a general-purpose shared file system implemented in the kernel with a single namespace.26 Locality awareness is part of the DFS, so there is no need for a NameNode. Compute nodes may or may not have local storage. Remote storage is accessed using a file-system-specific internode protocol. If the DFS uses local disks, compute nodes are part of the DFS, with data spread across nodes.

3.2 Enterprise-class Hadoop Distributions

Several vendors now offer Hadoop distributions as enterprise-class products packaged with maintenance and technical support options. The goal of commercial Hadoop distributions is to address Hadoop challenges such as inefficient data staging and loading processes and the lack of multi-tenancy, backup, and DR capabilities. Vendors providing Hadoop distributions are evaluated in a recent Forrester review.28 Some of the vendors are EDW vendors, such as EMC Greenplum, IBM, Microsoft, and Oracle, which are modifying their products to support Hadoop.

For example, the Hadoop distribution called Pivotal HD is based on Hadoop 2.0 and integrates the Greenplum database with Apache Hadoop.29 This integration reduces the need for data movement for processing and analysis. Pivotal value-add components include advanced database services – HAWQ, a high-performance, "True SQL" query interface running within the Hadoop cluster, and an Extensions Framework providing support for HAWQ interfaces on external data providers (HBase, Avro, etc.) – and advanced analytics functions (MADlib). Pivotal HD is available as a software-only or appliance-based solution.

Pivotal HD provides Unified Storage Service (USS), enabling user access to data residing on multiple platforms without data copying. USS is a "pseudo" Hadoop File System (HDFS) that delegates file system operations directed at it to other file systems in an "HDFS-like" way. Using USS, users do not need to copy data from the underlying storage system to HDFS to process the data using Hadoop framework, significantly reducing time and operational costs. Large organizations typically have multiple data sets residing on various storage systems. As moving this data to a central Data Lake environment would be time consuming and costly, USS can be used to provide a unified view of underlying storage systems for Big Data analytics.


Figure 4: Pivotal HD Architecture (Ref. 29)

A growing list of vendors (both systems and storage vendors) is incorporating Hadoop into preconfigured products and offering them as appliances – EMC Greenplum HD (a bundle that combines MapR's version of Hadoop, the Greenplum database, and a standard x86-based server), Pivotal HD (discussed above), the Dell/Cloudera solution (which combines Dell PowerEdge C2100 servers and PowerConnect switches with Cloudera's Hadoop distribution and its Cloudera Enterprise management tools), and Pentaho Data Integration 4.2.

EMC Greenplum is the first EDW vendor to provide a full-featured, enterprise-grade Hadoop appliance and to offer an appliance platform that integrates its Hadoop, EDW, and data integration offerings in a single rack. These solutions provide an easier way for users to benefit from Hadoop-based analytics without the in-house integration development that early Hadoop implementations required. For example, Cisco offers a comprehensive solution stack: the Cisco UCS Common Platform Architecture (CPA) for Big Data includes compute, storage, connectivity, and unified management.30

3.3 Big Data Storage and Security

Regulation and compliance are also important considerations for Big Data. The first versions of Hadoop offered limited (if any) ways to respond to corporate security and data governance policies. The Hadoop security model has been improving through the development of Apache security projects and the release of "security-enhanced" Hadoop distributions by vendors (for example, Cloudera Sentry and Intel's secure Hadoop distribution, which uses Intel Expressway API Manager as a security gateway enforcement point for all Hadoop REST APIs). The release of the 2.x distributions of Hadoop addresses many security issues, including security enhancements for HDFS (enforcement of HDFS file permissions). This article considers the storage-related aspects of the Hadoop security model, namely encryption of data at rest. Readers interested in other security aspects of Hadoop can find many reviews, for example Refs. 31 and 32.

Hadoop 2.0 does not include encryption for data at rest on HDFS. If encryption of data on Hadoop clusters is required, there are two options: third-party tools for implementing HDFS disk-level encryption, or security-enhanced Hadoop distributions such as the Intel distribution. The Intel Distribution for Apache Hadoop software33 is optimized for Intel Advanced Encryption Standard New Instructions (Intel AES-NI), a technology built into Intel Xeon processors. Encryption can be applied transparently to users at file-level granularity and integrated with external, standards-based key management applications.

Following best practices for data security, sensitive files must be encrypted by external security applications before they arrive at the Apache Hadoop cluster and are loaded into HDFS. Each file must arrive with the corresponding encryption key. If files were encrypted only after arrival, they would reside on the cluster in their unencrypted form, which would create vulnerabilities. When an encrypted file enters the Apache Hadoop environment, it remains encrypted in HDFS. It is then decrypted as needed for processing and re-encrypted before it is moved back into storage. The results of the analysis are also encrypted, including intermediate results. Data and analysis results are neither stored nor transmitted in unencrypted form.33
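As a generic illustration of the encrypt-before-load pattern described above – not the mechanism used by any particular distribution – the sketch below encrypts a local file with javax.crypto while streaming it into HDFS, so plaintext never lands on the cluster. The target path is hypothetical, and the locally generated key stands in for what would really come from an external key management system.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.CipherOutputStream;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EncryptThenIngest {
    public static void main(String[] args) throws Exception {
        // In production the key comes from an external key manager; it is generated here only for illustration.
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);
        SecretKey key = keyGen.generateKey();

        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));

        FileSystem fs = FileSystem.get(new Configuration());
        Path target = new Path("/secure/ingest/records.enc");   // hypothetical landing path in HDFS

        try (InputStream in = new FileInputStream(args[0])) {
            FSDataOutputStream raw = fs.create(target, true);
            raw.write(iv);                                       // store the IV in the clear ahead of the ciphertext
            try (OutputStream out = new CipherOutputStream(raw, cipher)) {
                byte[] buf = new byte[64 * 1024];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);                        // bytes are encrypted as they stream to the cluster
                }
            }                                                    // closing the cipher stream also closes the HDFS stream
        }
    }
}
```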

In 2013, Intel launched an open source effort called Project Rhino to improve the security capabilities of Hadoop and the Hadoop ecosystem, and contributed code to Apache.34 The objective of Project Rhino is to take a holistic hardening approach to the Hadoop framework, with consistent concepts and security capabilities across projects. To achieve this, a splittable AES codec implementation is introduced to Hadoop, allowing distributed data to be encrypted and decrypted from disk. The key distribution and management framework will make it possible for MapReduce jobs to perform encryption and decryption.

3.4 EMC Isilon Storage for Big Data

Isilon is an enterprise storage system that can natively integrate with HDFS.35 The solution uses shared scale-out NAS storage as a large repository of Hadoop data for data protection, archive, security, and data governance purposes. EMC Isilon storage is managed by intelligent software that scales data across vast quantities of commodity hardware, enabling explosive growth in performance and capacity. While HDFS creates three replicas for redundancy, Isilon OneFS® dramatically reduces the need for a three-way copy. Every node in the cluster is connected to the internal InfiniBand network. Clients connect using standard protocols such as NFS, CIFS, FTP, and HTTP over the front-end network, which is either 1 or 10 Gb/s Ethernet. OneFS uses the internal InfiniBand network to allocate and stripe data across all nodes in the cluster automatically. Because OneFS distributes the Hadoop NameNode function to provide high availability and load balancing, it eliminates the single point of failure (see Table 3). Isilon storage provides a single file system/single volume scalable up to 15 PB.35 Data can also be staged from other protocols to HDFS by using OneFS as a staging gateway. Integration with EMC ViPR® Software-Defined Storage, offering access to object storage APIs from Amazon S3, EMC Atmos, and others, enables leveraging cloud-based applications and workflows.

The benefits of using Isilon for Big Data are presented in the table below.

Hadoop/DAS challenge: Dedicated storage infrastructure for Hadoop only.
Hadoop with Isilon: Scale-out storage platform; multiple applications and workflows.

Hadoop/DAS challenge: Single point of failure (NameNode failover added in Hadoop 2.0).
Hadoop with Isilon: No single point of failure – distributed namespace.

Hadoop/DAS challenge: Lack of enterprise-class data protection – no snapshots, backup, or replication.
Hadoop with Isilon: End-to-end data protection – SnapshotIQ, SyncIQ, NDMP backup.

Hadoop/DAS challenge: Poor storage efficiency – three-way mirroring.
Hadoop with Isilon: Storage efficiency, >80% storage utilization.

Hadoop/DAS challenge: Manual import/export.
Hadoop with Isilon: Multi-protocol support; industry-standard protocols: NFS, CIFS, FTP, HTTP, HDFS.

Hadoop/DAS challenge: Fixed scalability – rigid compute-to-storage ratio.
Hadoop with Isilon: Independent scalability – decoupled compute and storage; add compute and storage independently.

Table 3: Benefits of Using Isilon for Hadoop

Isilon provides multi-tenancy in the Hadoop environment.

• One directory within OneFS per tenant, one subdirectory per data scientist
• Access controlled by group and user rights
• Leveraging SmartQuotas to set resource limits and report usage

3.5 EMC Greenplum Distributed Computing Appliance (DCA)

Combining Isilon and Greenplum HD provides the best of both worlds for Big Data analytics. The Greenplum Database (GPDB) with Hadoop delivers a solution for the analytics of structured, semi-structured, and unstructured data.36 The Greenplum DCA is a massively parallel architecture, and the GPDB is a scalable analytic database. It features a "shared nothing" architecture, in contrast to Oracle and DB2. Operations are extremely simple – once data is loaded, Greenplum's automated parallelization and tuning provide the rest; no partitioning is required. To scale, simply add nodes (the Greenplum DCA fully leverages the industry-standard x86 platform); storage, performance, and load bandwidth are managed entirely in software.

Users can perform complex, high-speed, interactive analytics using GPDB, as well as stream data directly from Hadoop into GPDB to incorporate unstructured or semi-structured data in the above analyses within GPDB. Hadoop can also be used to transform unstructured and semi-structured data into a structured format that can then be fed into GPDB for high-speed, interactive querying.31

3.6 NetApp Storage for Hadoop

The NetApp Open Solution for Hadoop preserves the shared-nothing architectural model. It provides DAS storage in the form of a NetApp E-Series array (for example, the E2660) to each DataNode within the Hadoop cluster.37 Compute and storage resources are decoupled with SAS-attached NetApp E2660 arrays, and the recoverability of a failed Hadoop NameNode is improved with an NFS-attached FAS2040. The E2660 array is configured as four volumes of DAS so that each DataNode has its own non-shared set of disks and "sees" only its share of disk. The FAS2040 is used as storage for the NameNode, mitigating loss of cluster metadata due to NameNode failure. It functions as a single, unified repository of cluster metadata that supports faster recovery from disk failure. It also serves as a repository for other cluster software, including scripts. Instead of three-way data mirroring consuming storage capacity and network bandwidth, data is mirrored to a direct-attached NetApp E2660 array via 6 Gb/s SAS connections.

3.7 Object-Based Storage for Big Data

3.7.1 Why Is Object-based Storage for Big Data Gaining Popularity?

The challenge of managing traditional block-based storage for Big Data has led many organizations to take an interest in object storage, which can use the same types of hardware systems as the traditional approach but stores data as objects – self-contained groups of logically related data.

While block-based storage stores data in groups of blocks with a minimal amount of metadata kept with the content, object-based storage stores data as objects, each with a unique global identifier (a 128-bit Universally Unique ID [UUID]) that is used for data access and retrieval. The Object-based Storage Device (OSD) is a new disk interface technology being standardized by the ANSI T10 technical committee (Fig. 5). Metadata that includes everything needed to manage the content is attached to the primary data and stored contiguously with the object. The object can be any unstructured data, file, or group of files – for example, audio, documents, email, images, and video files. By combining metadata with content, objects are never locked to a physical location on a disk, enabling the automation and massive scalability required for cloud and Big Data solutions. Incorporating metadata into objects simplifies the use of data management (preservation, retention, and deletion) policies and therefore reduces management overhead. To applications, all of the information appears as one big pool of data. With the flat-address-space design, there is no need to use file systems (file systems have an average overhead of 25%) or to manage LUNs and RAID groups. Access to object-based storage is provided using web protocols such as REST and SOAP. Object-based systems typically secure information via Kerberos, Simple Authentication and Security Layer, or some other Lightweight Directory Access Protocol-based authentication mechanism.
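To make the access model concrete, the sketch below stores and retrieves an object over a plain REST-style interface using only java.net.HttpURLConnection. The endpoint, bucket name, metadata header, and client-generated UUID key are illustrative assumptions – real services such as S3, Swift, or Atmos add their own authentication and header conventions.

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class ObjectStoreSketch {
    // Hypothetical S3/Swift-style endpoint; real deployments would sign requests and set auth headers.
    private static final String ENDPOINT = "https://objects.example.com/analytics-bucket/";

    static String putObject(byte[] content, String contentType) throws Exception {
        String objectId = UUID.randomUUID().toString();                 // flat namespace: an ID, not a file path
        HttpURLConnection conn = (HttpURLConnection) new URL(ENDPOINT + objectId).openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", contentType);
        conn.setRequestProperty("x-meta-retention", "7y");              // user metadata travels with the object
        try (OutputStream out = conn.getOutputStream()) {
            out.write(content);
        }
        if (conn.getResponseCode() / 100 != 2) {
            throw new IllegalStateException("PUT failed: " + conn.getResponseCode());
        }
        return objectId;
    }

    static byte[] getObject(String objectId) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(ENDPOINT + objectId).openConnection();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream in = conn.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                buffer.write(buf, 0, n);
            }
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String id = putObject("hello object world".getBytes(StandardCharsets.UTF_8), "text/plain");
        System.out.println(new String(getObject(id), StandardCharsets.UTF_8));
    }
}
```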

Object-based storage was brought to the market first as content-addressed storage (CAS) systems such as EMC Centera®. The main goal has been to provide regulatory compliance for archived data. Cloud-oriented object-based systems appeared in 2009 and have become the next generation of object-based storage. These cloud-oriented systems have to support data transfer across wide geographical areas, such as global content distribution of primary storage, and function as low-cost storage for backup and archiving in the cloud.

The advantages of using object-based storage in cloud infrastructures are exemplified in solutions offered by the biggest cloud service providers such as Amazon and Google since object-based storage can simplify many cloud operations.


Figure 5: Block-Based vs. Object-Based Storage Models (from Ref. 38)

Cleversafe announced plans to build its Dispersed Compute Storage solution by combining the power of Hadoop MapReduce with Cleversafe's Dispersed Storage System.39 The object-based Dispersed Storage systems will be able to capture data at 1 TB per second at exabyte capacity. Combining MapReduce with the Dispersed Storage Network system on the same platform, and replacing HDFS, which relies on three copies to protect data, will significantly improve reliability and enable analytics at a scale previously unattainable through traditional HDFS configurations.

The DataDirect Networks (DDN) object storage solution is built on a software and hardware stack that has been tuned for on-premise cloud/Big Data storage, self-service data access, and reduced time to insight. DDN Web Object Scaler (WOS) software (WOS 3.0) and the WOS7000 storage appliance can federate a group of WOS clusters to achieve management of up to 983 petabytes and 32 trillion unique objects.40 Such an implementation requires 32 federated WOS clusters, each of which supports up to 1 trillion objects and is made up of 256 WOS object storage servers. According to DDN,40 the platform can retrieve 256 million objects per second with sub-50-millisecond latency and achieve throughput of 10 TB per second. A fully populated rack of preconfigured WOS 3.0-powered WOS7000 appliances offers 2.5 PB of storage capacity.

3.7.2 EMC Atmos

Atmos®, EMC's software solution for object-based storage, uses the Infiniflex hardware platform to deliver cloud and Big Data storage services.41 It is the first multi-petabyte information management solution designed to automatically manage and optimize the delivery of rich, unstructured information across large-scale, global cloud storage environments. For example, eBay used EMC Atmos to manage over 500 million objects per day.42 New applications can be introduced to the Atmos cloud without having to tie them to specific storage systems or locations.

At EMC World 2012, EMC announced a suite of enhancements to the EMC Atmos Cloud platform that transforms how service providers and enterprises manage Big Data in large, globally distributed cloud storage environments. EMC also announced new Atmos Cloud Accelerators that make it even easier and faster to move data in and out of Atmos-powered clouds.41 As a platform, Atmos version 2 offers several other features that tie cloud-based storage to Big Data and application use. The Big Data paradigm is supported by the scalability of the Atmos platform, on which sites can be added to increase storage relatively simply. Atmos GeoDrive further enables support for Big Data and data resilience by providing access to a storage cloud instantly from any Microsoft Windows desktop or server anywhere, without writing a single line of code. Atmos 2.1.4, announced in September 2013, further extends features such as GeoParity, age-based Policy Management, S3 API support, and a host of cloud delivered services with new capabilities.

Other vendors of object-based storage products include Amazon (S3), NetApp (StorageGRID), Dell (DX Object Storage Platform), HDS (HCP), Caringo (CAStor), Cleversafe (Dispersed Storage), DataDirect Networks (Web Object Scaler) (both Cleversafe Dispersed Storage and DataDirect Networks WOS are reviewed in Section 3.7.1), NEC (Hydrastor), and Amplidata (AmpliStor).

3.8 Fabric Storage for Big Data: SAN Functionality at DAS Pricing

Scale-out fabric storage offered by AMD SeaMicro as a Big Data storage solution provides massive scale-out capacity with commodity drives.43 Decoupling storage from compute and network so that storage can grow independently enables moving from DAS, with its rigid storage-to-compute ratio, to flexible scale-out fabric storage of up to 5 PB, creating pools of compute, storage, and network I/O that can be sized appropriately for an Apache Hadoop deployment.

According to AMD SeaMicro,43 the SM15000 server platform optimized for Big Data and cloud can reduce power dissipation by half and supply SAN functionality at DAS pricing by coupling data storage through a "Freedom Fabric" switch that removes the constraints of traditional servers. Unlike the industry-standard model, where disk storage is located remotely from processing nodes, SeaMicro has worked out a networking switched fabric that connects servers to the "in rack" disk drives and is extensible beyond the SM15000 rack frame, allowing construction of cumulatively very large systems.

3.9 Virtualization of Hadoop

VMware vSphere Big Data Extensions (BDE 1.0), announced in September 2013, allows vSphere 5.1 or later to manage Hadoop clusters.44 BDE development has been enabled by Serengeti, an open source project initiated by VMware to automate deployment and management of Apache Hadoop clusters in virtualized environments such as vSphere. BDE is a downloadable virtual appliance with a plug-in for vCenter Server, and its deployment is simple: download the OVA and import it into the existing vSphere environment.

The Serengeti virtual appliance includes two virtual machines: the Serengeti Management Server and the Hadoop Template Server. The creation of a Hadoop cluster, including creation and configuration of the virtual machines, is managed by the Serengeti Management Server. The Hadoop Template virtual machine installs the Hadoop distribution software and configures the Hadoop parameters based on the specified cluster configuration settings. Once the Hadoop cluster creation is complete, the Serengeti Management Server starts the Hadoop service. BDE is controlled and monitored through the vCenter server. By default, the basic Apache Foundation distribution of Hadoop is included, but VMware BDE also supports other major Hadoop distributions including Cloudera, Pivotal HD, Hortonworks, and MapR.

Hadoop virtualization dramatically accelerates Big Data analytics implementations by making them affordable for companies of various sizes. Virtualized Hadoop clusters can be provisioned on demand and elastically expanded or shrunk using the service catalog. Benchmark results show that virtualized Hadoop performs comparably to a physical configuration.45

By decoupling Hadoop nodes from the underlying physical infrastructure, VMware can bring the benefits of cloud infrastructure7 – rapid deployment, high availability, optimal resource utilization, elasticity, and secure multi-tenancy – to Hadoop. VMware defines the Big Data Extensions core value propositions44 as:

1. Operational Simplicity with Performance: BDE automates configuration and deployment of Hadoop clusters, and IT departments can provide self-service tools. As Hadoop deployment requires configuration of multiple cluster nodes, vSphere tools and capabilities such as cloning, templates, and resource allocation significantly simplify and accelerate Hadoop deployment. The integration with vCloud Automation Center can be used to create Hadoop-as-a-Service, enabling users to select pre-configured templates and customize them to their requirements.

2. Maximization of Resource Utilization on New or Existing Hardware: Big Data Extensions enables IT departments to lower the total cost of ownership (TCO) by maximizing resource utilization on new or existing infrastructure. Virtualizing Hadoop can improve data center efficiency by increasing the types of mixed workloads that can be run on a virtualized infrastructure. This includes running different versions of Hadoop itself on the same cluster or running Hadoop alongside other customer applications, forming an elastic environment. Shared resources lead to higher consolidation ratios, which result in cost savings, as less hardware, software, and infrastructure are required to run a given set of business applications. To facilitate elasticity, BDE can automatically scale the number of compute virtual machines in a Hadoop cluster based on contention from other workloads running on the same shared physical infrastructure. Compute virtual machines are added to or removed from the Hadoop cluster as needed to give the best performance to Hadoop when it needs it and to make resources available for other applications or Hadoop clusters at other times. Isolating different tenants running Hadoop in separate VMs provides stronger resource and security isolation for multi-tenancy. Multi-tenancy can be provided by deploying separate compute clusters for different tenants sharing HDFS. Users can run mixed workloads simultaneously on a single physical host. Additional efficiency can be achieved by running Hadoop and non-Hadoop applications on the same physical cluster.

3. Architect Scalable and Flexible Big Data Platforms: Big Data Extensions is designed to support multiple Hadoop distributions and hardware architectures.

To provide data locality or "rack awareness" (see Section 3.1) to virtualized Hadoop clusters, VMware has contributed the Hadoop Virtual Extensions (HVE) to Apache Hadoop 1.2. HVE helps Hadoop nodes become "data locality"-aware in a virtual environment. Data locality knowledge is important for keeping compute tasks close to the required data. Native Hadoop knows about data locality to the node and rack level, but with the extensions, Hadoop becomes more "virtualization aware" through a concept of "node groups" that basically correspond to the set of Hadoop virtual nodes running in each physical hypervisor server. Pivotal HD is the first Hadoop distribution to include HVE plug-ins, enabling easy deployment of virtualized Hadoop.

Figure 6: Scenarios of Virtual Hadoop Deployment (Ref. 46).

An interesting option is to run compute nodes and data storage nodes as separate VMs to support orthogonal scaling and optimal usage of each resource (Fig. 6). Another option is to leverage SAN storage. Extending the concept of data-compute separation, multiple tenants can be accommodated on the virtualized Hadoop cluster by running multiple Hadoop compute clusters against the same data service.46

4. Cloud Computing and Big Data

Large data volumes, along with the variety of data types and complexity, are features Big Data has in common with cloud storage.7 Therefore, storage architectures are the place where Big Data meets cloud data.

The use of cloud infrastructure for Big Data applications faces many challenges, with performance and data transport foremost among them. Indeed, adoption of cloud-based solutions for Big Data demands technologies for moving data into and out of the cloud. How much data needs to be moved, and at what cost? Moving large volumes of data to and from the cloud may be cost-prohibitive. Real-time data requires enormous resources to manage, and data that streams nonstop may be better processed locally.

Cloud providers for Big Data services, such as Amazon and Rackspace, suggest that customers ship data on portable storage devices for the base data transfer, followed by data synchronization via storage gateways. This approach has its own challenges, as shipments can be delayed and storage devices can be damaged or lost in transit.

These challenges have led to the development of new technologies for Big Data transport. For example, Aspera has developed its fasp™ transfer technology to offer a suite of On Demand Transfer products that solves both the technical problems of the WAN and the cloud I/O bottleneck and delivers efficient transfer of large files, or large collections of files, into and out of the cloud.47 According to Aspera, file transfer times can be guaranteed regardless of network distance and conditions, including transfers over satellite, wireless, and unreliable long-distance international links. Security is built in, including secure endpoint authentication, on-the-fly data encryption, and integrity verification.

5. Big Data Backups

5.1 Challenges of Big Data Backups and How They Can Be Addressed

The challenges of providing storage solutions for growing volumes of Big Data may overshadow the challenges related to Big Data protection and recovery. However, data protection should be a key component of the enterprise strategy for Big Data lifecycle management.48

The needs for Big Data backups can be categorized based on the Big Data definition:

• Velocity: the data needs to be protected quickly
• Volume: the data requires deduplication to be protected efficiently
• Variety: both structured and unstructured files must be protected

The huge data volumes typical of Big Data are the first challenge for Big Data backup solutions. Petabyte-size datastores do not allow backups to complete within accepted backup windows. Furthermore, traditional backup is not designed to handle millions of small files, which are common in Big Data environments. The challenge becomes manageable once we understand that not all Big Data information may need to be backed up. We have to review which part of these data volumes really needs to be backed up. If data can be easily regenerated from another system that is already being backed up, there is no need to back up these data sets at all. When we compare the cost of protecting our data with the cost of regenerating it, we may find that, in many instances, the source data needs to be protected while post-processed data is less expensive to reproduce by rerunning the process than to protect.

Therefore, the real problem is backup of the unique data that cannot be recreated. This is often machine-generated data coming from devices or sensors (for example, the Internet of Things discussed earlier). It is essentially point-in-time data that cannot be regenerated. As data is often copied within the Big Data environment so that it can be safely analyzed, some redundancy results. Thus, data deduplication becomes critical to eliminate redundancy and compress much of the data to optimize backup capacity. Since the Hadoop file system is based on appending data rather than updating or deleting it, large storage savings are achieved when deduplication is applied.
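The sketch below is a toy illustration of content-based deduplication – identical chunks are fingerprinted and stored only once – and is not how Data Domain or any other product implements it; the fixed chunk size and in-memory store are simplifications for demonstration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

public class DedupSketch {
    private static final int CHUNK_SIZE = 8;                  // unrealistically small, for demonstration only
    private final Map<String, byte[]> chunkStore = new HashMap<>();

    // Splits the input into fixed-size chunks and stores only chunks not seen before.
    public int ingest(byte[] data) throws Exception {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        int newChunks = 0;
        for (int offset = 0; offset < data.length; offset += CHUNK_SIZE) {
            int len = Math.min(CHUNK_SIZE, data.length - offset);
            byte[] chunk = new byte[len];
            System.arraycopy(data, offset, chunk, 0, len);
            String fingerprint = Base64.getEncoder().encodeToString(sha.digest(chunk));
            if (chunkStore.putIfAbsent(fingerprint, chunk) == null) {
                newChunks++;                                   // only unique chunks consume backup capacity
            }
        }
        return newChunks;
    }

    public static void main(String[] args) throws Exception {
        DedupSketch dedup = new DedupSketch();
        byte[] backup = "AAAAAAAABBBBBBBBAAAAAAAA".getBytes(StandardCharsets.UTF_8);
        System.out.println("Chunks seen: 3, unique chunks stored: " + dedup.ingest(backup));   // prints 2
    }
}
```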

As mentioned, many Big Data environments consist of millions or even billions of small files. By using backup software products such as Symantec NetBackup Accelerator,49 a very large file system with millions or billions of files can be fully backed up within the amount of time required for an incremental backup. NetBackup Accelerator uses change tracking to reduce the file system overhead associated with traversing a large file system, identifying and accessing only changed data. An optimized synthetic full backup is created and catalogued inline, providing full restore capabilities and shortened Recovery Time Objective (RTO).
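A synthetic full backup, as described above, is assembled on the backup server by merging the catalog of the last full backup with the change-tracked increments, so only changed files are ever read from the client. The toy sketch below shows that merge for a path-to-version catalog; it ignores deletions and everything else a real product handles, and is not Symantec's implementation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SyntheticFullSketch {
    // Each catalog maps a file path to the version identifier of its stored copy.
    static Map<String, String> synthesize(Map<String, String> lastFull,
                                          List<Map<String, String>> incrementals) {
        Map<String, String> syntheticFull = new HashMap<>(lastFull);
        for (Map<String, String> delta : incrementals) {
            syntheticFull.putAll(delta);     // newer copies supersede the ones in the last full backup
        }
        return syntheticFull;                // restorable as a full image without rescanning the file system
    }

    public static void main(String[] args) {
        Map<String, String> full = new HashMap<>();
        full.put("/a", "v1");
        full.put("/b", "v1");
        Map<String, String> delta = new HashMap<>();
        delta.put("/b", "v2");
        delta.put("/c", "v1");
        // e.g. {/a=v1, /b=v2, /c=v1} (map iteration order may vary)
        System.out.println(synthesize(full, Collections.singletonList(delta)));
    }
}
```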

One more issue Big Data backup systems face is scanning the file system each time a backup job starts. For the file systems in Big Data environments, this scan can be time-consuming. One solution to the scanning issue is the OnePass feature developed by CommVault for its Simpana data protection software.50 OnePass is a converged process for backup, archive, and reporting from a single data collection. The resulting single catalog is then shared, accessed, and used by each of Simpana software's archiving, backup, and reporting modules.

Which storage media should be used for Big Data backups: disks or tapes? In most cases, the answer is both. Deduplicated data can be stored on low-cost, high-capacity disks (for example, on Data Domain® appliances, discussed later) for near-term data sets that are not being analyzed at the moment, or on tape for long-term storage of less frequently accessed data (to write deduplicated data to tape, the data must first be "rehydrated" and then written in its original form). Many Big Data projects may not be cost effective if tape is not integrated into the solution. Tapes last longer than disks: physical lifetimes for digital magnetic tape are at least 10 to 20 years,51 whereas the median lifespan of a disk is six years.52 Self-contained Information Retention Format (SIRF), discussed later, is a way to keep data on tape retrievable while transitioning to future technologies. Tape can be leveraged as part of the access tier through the use of an Active Archive. Active Archiving combines a high-performance primary disk tier with a secondary disk tier and then tape to create a single, fully integrated access point (see Big Data Archive in the next section). The Active Archive software automatically moves data between the tiers based on access, or the movement can be pre-programmed into the application. In the Active Archive process, new data can be copied to disk and tape simultaneously, meaning that backups happen as data is received. Active Archive also helps with a major restore: instead of restoring the entire data set, only the data that is currently needed must be recovered. Tape libraries such as those from Spectra Logic can leverage Active Archive technology, and Linear Tape File System (LTFS) for data transfer can become a major part of the Big Data infrastructure.53
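A tiering policy of the kind an Active Archive layer applies can be sketched as follows (the tier names and age thresholds are assumptions for illustration only):

```python
# Illustrative age/access-based tiering: hot data stays on primary disk,
# colder data moves to a secondary disk tier, the coldest data goes to tape.

import time

# tier rules ordered from hottest to coldest; thresholds are illustrative
TIER_RULES = [
    (30 * 86400,  "primary-disk"),     # accessed within the last 30 days
    (180 * 86400, "secondary-disk"),   # accessed within the last 180 days
]

def target_tier(last_access_epoch, now=None):
    """Return the storage tier a file should live on, based on access age."""
    age = (now if now is not None else time.time()) - last_access_epoch
    for max_age, tier in TIER_RULES:
        if age <= max_age:
            return tier
    return "tape-ltfs"                  # everything colder goes to tape

# Example: a file last accessed ~400 days ago would be placed on tape
print(target_tier(time.time() - 400 * 86400))   # tape-ltfs
```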

5.2 EMC Data Domain as a Solution for Big Data Backups

EMC Data Domain deduplication storage systems are well suited for Big Data backups, as they address the Big Data backup challenges48 (Table 4). Data Domain systems are purpose-built backup appliances used in conjunction with EMC or any third-party backup application or native backup utility (Fig. 7). Regardless of the Big Data system (EMC Greenplum, EMC Isilon, Teradata, Oracle Exadata, etc.), Data Domain systems offer advanced integration with effective backup tools for that environment to provide fast backup and recovery.

Backup Challenge                          Data Domain Solution
Data volume                               High-speed inline deduplication
Performance                               Up to 248 TB backed up in less than 8 hours (31 TB/hr)
Scale                                     Protects up to 65 PB of logical capacity in a single system
Data islands                              Simultaneously supports NFS, CIFS, VTL, DD Boost, and NDMP
Integration with major backup software    Qualified with leading backup and archive applications

Table 4: Data Domain Solutions for Big Data Backup Challenges

Data Domain provides high-speed inline deduplication, typically yielding a 10 to 30x reduction in the backup storage required, which enables Big Data backups to complete within backup windows. For example, the DD990 offers up to 31 TB/hour ingest and can back up 248 TB in less than 8 hours. In addition, Data Domain systems can protect up to 65 PB of logical capacity in a single system. Data Domain systems eliminate data islands by enabling backup of the entire environment (using NFS, CIFS, VTL, DD Boost, and/or NDMP) to a single Data Domain system.
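A quick sanity check of the quoted figures, and an illustration of how the deduplication ratio translates into physical capacity (the 10 to 30x range is the vendor's figure; actual ratios depend on the workload):

\[
\frac{248\ \text{TB}}{31\ \text{TB/hr}} = 8\ \text{hours}, \qquad
\frac{65\ \text{PB (logical)}}{10\text{--}30\times \text{ deduplication}} \approx 2.2\text{--}6.5\ \text{PB (physical)}
\]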

Figure 7: Big Data Backup Using Data Domain (Ref. 54)

6. Big Data Retention

6.1 General Considerations for Big Data Archiving

6.1.1 Backup vs. Archive

Archiving and backup are sometimes considered very similar data retention solutions, on the assumption that a backup can substitute for an archive. However, there are significant differences between these two categories of data management. Backup is a data protection solution for operational purposes, whereas the objectives of data archiving are information retrieval, regulatory compliance, and data footprint reduction.

Backups are a secondary copy of data, used for operational recovery to restore data that has been lost, corrupted, or destroyed. Backup retention periods are usually relatively short: days, weeks, or months. Conversely, a data archive is a primary copy of information; archived data is typically retained long term (years, decades, or forever) and maintained as a managed repository for analysis or compliance. Archiving provides data footprint reduction by deleting fixed content and duplicate data from primary storage. An archive is used to meet regulatory requirements by enforcing retention policies. Financial, healthcare, and other industries can have archive retention periods of 10-15 years or even up to 100 years.

6.1.2 Why Is Archiving Needed for Big Data?

The reasons for Big Data archiving are similar to those for traditional data archiving and include regulatory compliance (for example, X-rays are stored for periods of 75 years). Retention of 20 years or more is required by 70% of repositories.55

Many archiving technologies have advanced file deduplication and compression techniques, which significantly reduce the Big Data footprint. For example, a database archiving solution supports moving and converting inactive data from production databases to an optimized, highly compressed file archive. Archiving moves inactive data to a lower-cost infrastructure, providing cost reduction through storage tiering. As a result, archiving solutions make it possible to keep data easily accessible without the need to locate it on tape and restore it.

6.1.3 Pre-requisites for Implementing Big Data Archiving

1. Data classification. Data should be classified according to its business value (determined by the data's current position in the Big Data lifecycle) and its security requirements.
2. Review to ensure that regulatory requirements do not prevent the use of data deduplication techniques for streamlining both data and data access.
3. Storage tiering policies, which determine data access latency by placing data on different disk types and/or different storage systems, are developed based on the data classification. Because it reduces the data footprint, storage tiering also reduces energy cost and data center raised-floor use. A minimal sketch of how a classification outcome can map to storage tiers follows this list.
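The following sketch (the class names and tier names are assumed for illustration, not taken from any product configuration) shows how the classification outcome from step 1 can drive the tiering policy in step 3:

```python
# Illustrative mapping from a data-classification outcome to a storage tier.

CLASSIFICATION_TO_TIER = {
    # (business value, security requirement) -> storage tier
    ("hot",  "regulated"): "tier1-flash-encrypted",
    ("hot",  "standard"):  "tier1-flash",
    ("warm", "regulated"): "tier2-sas-encrypted",
    ("warm", "standard"):  "tier2-sas",
    ("cold", "regulated"): "archive-worm",
    ("cold", "standard"):  "archive-nl-sas-or-tape",
}

def placement(business_value: str, security: str) -> str:
    """Map a classified data set to the tier its archiving policy should use."""
    return CLASSIFICATION_TO_TIER[(business_value, security)]

print(placement("cold", "regulated"))   # archive-worm
```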

6.1.4 Specifics of Big Data Archiving

The differentiating feature of Big Data retention is the need to continually reanalyze the same machine-generated data sets. Data scientists need to identify patterns over timeframes of hours, days, months, and years. Companies are not just retaining Big Data; they reuse it.

While massively parallel processing (MPP) systems for Big Data analytics are designed to run complex large-scale analytics where performance is the prime objective, these systems are not suitable targets for long-term retention of Big Data content. Big Data archiving solutions should be cost effective: if it is cost prohibitive to retain the needed historical data, or too difficult to organize the data for timely ad hoc retrieval, companies will not be able to extract value from their collected information. The key question is whether the current storage environment can handle this new data explosion and the Big Data retention challenges that result from such growth.

The primary data management challenge associated with Big Data is to ensure that the data is retained (satisfying compliance needs at the lowest possible cost) while also keeping up with the unique and fast-evolving scaling requirements associated with new business analytics efforts. Companies that achieve this balance will increase efficiency, reduce data storage cost, and be in a far better position to capitalize on Big Data analytics.

Security is a challenge for Big Data archiving, as traditional database management systems support security policies that are quite granular, whereas Big Data applications generally have no such security controls. Companies including any sensitive data in Big Data operations must ensure that the data itself is secure and that the same data security policies that apply to the data when it exists in databases or files are also enforced in the Big Data archives (see also Section 3.3).

6.1.5 Archiving Solution Components

An archiving solution (for example, Symantec Enterprise Vault) typically includes:

Archiving software that automates the movement of data from primary storage to archival storage based on policies established in the data classification and rationalization process. Archive software can delete files at the end of their retention period.
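The policy-driven behavior described above can be sketched as follows (a hypothetical example; the paths, thresholds, and flat archive layout are illustrative, and real archiving products apply such policies through their own engines and catalogs):

```python
# Illustrative sweep: move files that match an archive policy to archival
# storage, and delete archived files whose retention period has expired.

import os, shutil, time

ARCHIVE_AFTER_DAYS = 365        # archive when unmodified for a year
RETENTION_DAYS = 7 * 365        # delete from the archive after seven years

def sweep(primary_dir: str, archive_dir: str, now=None):
    now = now if now is not None else time.time()
    # 1. Move aged files from primary storage into the archive (flat layout for simplicity)
    for root, _, files in os.walk(primary_dir):
        for name in files:
            path = os.path.join(root, name)
            if now - os.path.getmtime(path) > ARCHIVE_AFTER_DAYS * 86400:
                shutil.move(path, os.path.join(archive_dir, name))
    # 2. Expire archived files whose retention period has elapsed
    for name in os.listdir(archive_dir):
        path = os.path.join(archive_dir, name)
        if now - os.path.getmtime(path) > RETENTION_DAYS * 86400:
            os.remove(path)
```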

E-discovery software that uses archiving software as a base to provide advanced search features that enable users and administrators to quickly search all files, emails, texts, and other data related to a specific topic for use in data mining services or in response to legal inquiries. Some applications combine e-discovery and archive into purpose-built platforms such as an email-archive solution or document and records management solutions.

Physical media for data archiving are hard disks and tapes. As the IDC survey56 shows, companies are increasing their use of disk-based storage for long-term data retention. However, the growth of disk-based archives does not mean that tape-based archives are becoming relics. There are still financial and practical reasons to choose tape storage for Big Data archives:

• Longer media life expectancy
• Low cost per TB over time

As data storage technologies change over time, how can we be sure that archived data can still be retrieved 20-40 years from now? To address this challenge, the SNIA Long Term Retention Workgroup has developed the Self-contained Information Retention Format (SIRF).57 SIRF is a logical data format for a storage container that is self-describing (it can be interpreted by different systems), self-contained (all data needed for interpretation is in the container), and extensible, so it can meet future needs. SIRF therefore provides a way to collect all the information that will be needed to transition to new technologies in the future. Development of SIRF serialization for the Linear Tape File System (LTFS) makes it possible to provide economically scalable containers for long-term retention of Big Data.
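In the spirit of SIRF's design goals (self-describing, self-contained, extensible), a toy manifest might look like the following sketch; the field names are assumptions for illustration and do not follow the actual SIRF specification:

```python
# Illustrative "self-describing" container manifest: everything a future
# system needs to verify and interpret the stored objects travels with them.

import json, hashlib, time

def build_manifest(objects: dict) -> str:
    """objects: {logical_name: raw bytes} stored alongside this manifest."""
    catalog = {
        "container_format": "example-retention-container/1.0",
        "created_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "objects": [
            {
                "name": name,
                "size_bytes": len(data),
                "sha256": hashlib.sha256(data).hexdigest(),  # fixity for future verification
                "encoding": "application/octet-stream",
            }
            for name, data in objects.items()
        ],
        "extensions": {},   # room for future metadata without breaking old readers
    }
    return json.dumps(catalog, indent=2)

print(build_manifest({"xray-2014-001.dcm": b"...image bytes..."}))
```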

6.1.6 Checklist for Selecting Big Data Archiving Solution

Big Data active archival storage solutions should address the following major requirements:

Ability to rapidly and intelligently move data from primary storage into the active archive system. This ability ensures that the source application continues to run at maximum efficiency in terms of performance and reliability.

Flexibility in data ingestion capability. The amount of data to be archived according to the archiving policy can vary significantly from time to time, depending on the amount of activity. For example, financial trade monitoring systems can experience very high levels of activity due to a market event that in turn could trigger a sudden surge in the number of trades. The active archive target should be able to manage such workload variations and be able to ingest data at different rates as required.

Rapid, non-disruptive scalability of archival storage capacity and I/O performance. The solution should be capable of scaling out by non-disruptively adding more units. The environment may outgrow an individual archive module, but it should not outgrow the capacity and I/O performance of the archival platform as a whole. When the data size is in the hundreds of terabytes to multiple petabytes, migrating to a new platform should be a last-resort option.

Data portability for future technology changes. The selected solution should provide capability to transition to new technologies developed in the future – see the above discussion of SIRF.

6.2 Big Data Archiving with EMC Isilon

EMC offers a Big Data archive solution that is based on the Isilon scale-out NAS platform and meets the criteria for selecting archive solutions reviewed above. The solution can meet the large-scale data retention needs of enterprises, reduce costs, and help customers comply with governance or regulatory requirements.58 Isilon scale-out NAS delivers efficient utilization of capacity, reducing the overall storage footprint and delivering significant savings in capital expenditures (CapEx) and operating expenditures (OpEx) (see also Section 3.4). It provides more than 15 petabytes of capacity per cluster. The ability of Isilon NAS to scale quickly, easily, and without disrupting users makes it an attractive platform for large-scale data archives.

The following features make Isilon an excellent archive solution for Big Data:

• Performance: Isilon clusters provide instant access to large data archives with scalable performance of over 100 GB/s of throughput. Automatic load balancing of the archive servers' access across the cluster using SmartConnect maximizes performance and utilization of cluster resources (CPU, memory, and network). SmartConnect automatically balances incoming client connections across all available interfaces on the Isilon storage cluster.
• Automanagement and self-healing: Isilon clusters utilize an automatic management and provisioning capability that monitors system health and automatically corrects failures.
• Cost efficiency based on storage tiering: Isilon clusters provide automigration between working sets and archives using a policy-based approach available with Isilon SmartPools software. SmartPools is tightly integrated with Isilon OneFS, so all data, regardless of physical location, is in the same single file system. This means that SmartPools data movements are completely transparent to the end-user application, removing management, backup, and other issues related to stub-based tiering architectures such as those present in hierarchical storage management (HSM) implementations.
• Scalability: Data can be moved seamlessly and automatically as new nodes are introduced or as capacity is added. This enables very long-term archiving without the problems inherent in migrating to new systems.
• Flexibility: A single Isilon cluster supports the concurrent use of write-protected archival data alongside online, active content. WORM and non-WORM data can be mixed in one general-purpose system. Retention defaults can be set at the directory and file level. SmartLock® software adds a layer of "write once, read many" (WORM) data protection and security, which protects archived data against accidental, premature, or malicious alteration or deletion.
• Remote replication: Isilon SnapshotIQ™ and Isilon SyncIQ® can be leveraged to efficiently replicate the archive among multiple remote sites for business continuity and disaster recovery.

6.3 RainStor and Dell Archive Solution for Big Data

Dell's Big Data retention solution combines the Dell DX Object Storage Platform with RainStor's specialized database and compression technology to significantly reduce the cost of retaining Big Data through extreme data reduction.59,60 RainStor provides a Big Data database that runs natively on Hadoop. As data volumes grow in Hadoop, the effective capacity of each node can be increased thanks to RainStor's compression and deduplication capabilities, allowing for significant reductions in storage footprint – as much as 20-40 times. RainStor's built-in data deduplication and compression speed up query and analysis by as much as 10-100 times and provide high-speed data loads at rates of up to multiple billions of records per day.
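To put the quoted reduction ratio in perspective, an illustrative calculation assuming the vendor's 20-40x figure holds for a given workload:

\[
1\ \text{PB of raw data} \;\Rightarrow\; \frac{1000\ \text{TB}}{20\text{--}40} \approx 25\text{--}50\ \text{TB stored}
\]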

The RainStor database, which provides online data retention at massive scale, can be deployed on any combination of Dell servers and storage, on premises or in a cloud configuration. When paired with the Dell DX Object Storage Platform, the solution provides a single system for retaining structured, semi-structured, and unstructured data across various data sources, formats, and types, providing cost savings.61

7. Conclusions

The 3V+V characteristics of Big Data have required the development of new types of storage architecture. Big Data storage architectures span a broad spectrum, from shared-nothing DAS to shared storage (scale-out NAS, object-based storage) to hybrid and converged storage. As Big Data processing is part of the new integrated enterprise data warehouse (EDW) environment, an ideal storage solution should support all the storage functionality needed for this integrated environment, including data discovery, capture, processing, load, analysis, data protection, and retention. Such an integrated EDW environment assumes co-existence and symbiosis of traditional EDW storage architectures with new, evolving storage technologies that use distributed massively parallel processing architectures to parse large data sets. As the business value of Big Data changes over time, it is important to implement Big Data lifecycle management. This controls storage cost by stemming data growth with data retention and archiving policies. Along with Big Data protection (backup) solutions, it also addresses the security and regulatory requirements for Big Data.

The scope of the Big Data project determines the type of storage solution. It may be implemented as a Big Data appliance integrating server, networking, and storage resources into a single enclosure and running analytics software, or as a large multi-system environment storing and processing hundreds of terabytes or tens of petabytes of data. In all these cases, incorporation of Big Data architectures into the existing EDW environment should comply with the company's policies for data management and data governance.

As discussed, cost is not the primary factor in Big Data solution selection, as a storage solution can turn Big Data into actionable, time-to-value business knowledge providing an impressive ROI. However, Big Data does not necessarily mean big budget. To be able to choose the best solutions, we as users have to review with vendors – both established EDW vendors and Big Data startups offering emerging technologies – their development roadmaps for Big Data and Big Data Analytics so that we can design our own cost-effective Big Data service strategy.

8. References

1. S. Rosenbush and M. Totty. How Big Data Is Changing the Whole Equation for Business. Wall Street Journal (http://online.wsj.com), March 10, 2013.
2. K. Krishnan. Data Warehousing in the Age of Big Data. Morgan Kaufmann Publishers, 2013.
3. M. A. Beyer and D. Laney. The Importance of 'Big Data': A Definition. Gartner, 2012.
4. Bringing Smarter Computing to Big Data. IBM, 2011.
5. Gartner Inc. Gartner Says the Internet of Things Installed Base Will Grow to 26 Billion Units By 2020. Press Release, December 12, 2013.
6. New Intelligent Enterprise. The MIT Sloan Management Review and the IBM Institute for Business Value, 2011.
7. M. Gloukhovtsev. Does the Advent of Cloud Storage Mean “Creation by Destruction” of Traditional Storage? EMC Proven Professional Knowledge Sharing, 2013.
8. Data Science and Big Data Analytics. EMC Educational Course. EMC, 2012.
9. http://www.gopivotal.com/press-center/11122013-pivotal-one
10. M. Crutcher. Big and Fast Data: The Path To New Business Value. EMC World, 2013.
11. Ditch the Disk: Designing a High-Performance In-Memory Architecture. Terracotta, 2013.
12. SAP HANA Storage Requirements. White Paper. SAP, 2013.
13. Oracle Exalytics In-Memory Machine. Oracle White Paper. Oracle, 2013.
14. Hybrid Storage. White Paper EB-6743. Teradata, 2013.
15. SFA12K Product Family. Datasheet. DataDirect Networks, 2012.
16. M. Murugan. Big Data: A Storage Systems Perspective. SNIA Analytics and Big Data Summit, 2013.
17. The Data Lake: Turning Big Data into Opportunity. Booz Allen Hamilton, 2012.
18. http://www.gopivotal.com/products/pivotal-hd
19. http://www.pentahobigdata.com/ecosystem/platforms/hadoop
20. http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html
21. S. Childs and M. Adrian. Big Data Challenges for the IT Infrastructure Team. Gartner, 2012.
22. http://www.nutanix.com/products.html; Hadoop on Nutanix. Reference Architecture. Nutanix, 2012.
23. http://wiki.apache.org/hadoop/ProjectDescription
24. Big Data Technologies for Near-Real-Time Results. White Paper. Intel, 2013.
25. J. Webster. Storage for Hadoop: A Four-Stage Model. SNW, October 2012.

26. S. Fineberg. Big Data Storage Options for Hadoop. SNW, October 2012.
27. https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces48
28. J. G. Kobielus. The Forrester Wave: Enterprise Hadoop Solutions. Forrester, 2012.
29. S. K. Krishnamurthy. Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Insights. EMC World, 2013.
30. Cisco Common Platform Architecture for Big Data. Solution Brief. Cisco, 2013.
31. B. Yellin. Leveraging Big Data to Battle Cyber Threats – A New Security Paradigm? EMC Proven Professional Knowledge Sharing, 2013.
32. J. Shih. Hadoop Security Overview. Hadoop & Big Data Technology Conference, 2012.
33. Fast, Low-Overhead Encryption for Apache Hadoop. Solution Brief. Intel, 2013.
34. https://github.com/intel-hadoop/project-rhino
35. EMC Isilon Scale-out NAS for Enterprise Big Data and Hadoop. EMC Forum, 2013.
36. EMC Big Data Storage and Analytics Solution. Solution Overview. EMC, 2012.
37. http://www.netapp.com/us/solutions/big-data/hadoop.aspx
38. Virtualized Data Center and Cloud Infrastructure. EMC Educational Course for Cloud Architects. EMC, 2011.
39. http://www.cleversafe.com/overview/how-cleversafe-works
40. https://www.ddn.com/products
41. Atmos Cloud Storage Platform for Big Data in Cloud. EMC World, 2012.
42. D. Robb. EMC World Continues Focus on Big Data, Cloud and Flash. Infostor, May 2011.
43. S. Nanniyur. Fabric Architecture: A Big Idea for the Big Data Infrastructure. SNW, April 2012.
44. http://www.vmware.com/products/big-data-extensions
45. Virtualized Hadoop Performance with VMware vSphere 5.1. Technical White Paper. VMware, 2013.
46. J. Yang and D. Baskett. Virtualize Big Data to Make the Elephant Dance. EMC World, 2013.
47. http://cloud.asperasoft.com/big-data-cloud/
48. S. Manjrekar and G. Maxwell. Big Data Backup Strategies with Data Domain for EMC Greenplum, EMC Isilon, Teradata & Oracle Exadata. EMC World, 2013.
49. Better Backup for Big Data. Solution Overview. Symantec, 2012.
50. CommVault Simpana OnePass™ Feature. Datasheet. CommVault, 2012.

51. The lifespan of data stored on LTO tape is usually quoted as 30 years. However, tape is extremely sensitive to storage conditions, and the life expectancy numbers cited by tape manufacturers assume ideal storage conditions.
52. B. Beach. How long do disk drives last? http://blog.backblaze.com/2013/11/12/how-long-do-disk-drives-last
53. http://www.spectralogic.com/
54. EMC Backup Meets Big Data. EMC World, 2012.
55. SNIA – 100 Year Archive Requirement Survey. 2007.
56. Adoption Patterns of Disk-Based Backup. IDC Survey. IDC, 2010.
57. D. Pease. Long Term Retention of Big Data. SNIA Analytics and Big Data Summit, 2012.
58. Archive Solutions for the Enterprise with EMC Isilon Scale-out NAS. White Paper. EMC, 2012.
59. RainStor for Hadoop. Solution Brief. RainStor, 2013.
60. M. Cusack. Making the Most of Hadoop with Optimized Data Compression. SNW, 2012.
61. R. L. Villars and M. Amaldas. Rethinking Your Data Retention Strategy to Better Exploit the Big Data Explosion. IDC, 2011.

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.
