DOES BIG DATA MEAN BIG STORAGE?

Mikhail Gloukhovtsev
Sr. Cloud Solutions Architect
Orange Business Services

Table of Contents

1. Introduction
2. Types of Storage Architecture for Big Data
   2.1 Storage Requirements for Big Data: Batch and Real-Time Processing
   2.2 Integration of Big Data Ecosystem with Traditional Enterprise Data Warehouse
   2.3 Data Lake
   2.4 SMAQ Stack
   2.5 Big Data Storage Access Patterns
   2.6 Taxonomy of Storage Architectures for Big Data
   2.7 Selection of Storage Solutions for Big Data
3. Hadoop Framework
   3.1 Hadoop Architecture and Storage Options
   3.2 Enterprise-class Hadoop Distributions
   3.3 Big Data Storage and Security
   3.4 EMC Isilon Storage for Big Data
   3.5 EMC Greenplum Distributed Computing Appliance (DCA)
   3.6 NetApp Storage for Hadoop
   3.7 Object-Based Storage for Big Data
      3.7.1 Why Is Object-based Storage for Big Data Gaining Popularity?
      3.7.2 EMC Atmos
   3.8 Fabric Storage for Big Data: SAN Functionality at DAS Pricing
   3.9 Virtualization of Hadoop
4. Cloud Computing and Big Data
5. Big Data Backups
   5.1 Challenges of Big Data Backups and How They Can Be Addressed
   5.2 EMC Data Domain as a Solution for Big Data Backups
6. Big Data Retention
   6.1 General Considerations for Big Data Archiving
      6.1.1 Backup vs. Archive
      6.1.2 Why Is Archiving Needed for Big Data?
      6.1.3 Pre-requisites for Implementing Big Data Archiving
      6.1.4 Specifics of Big Data Archiving
      6.1.5 Archiving Solution Components
      6.1.6 Checklist for Selecting Big Data Archiving Solution
   6.2 Big Data Archiving with EMC Isilon
   6.3 RainStor and Dell Archive Solution for Big Data
7. Conclusions
8. References

Disclaimer: The views, processes or methodologies published in this article are those of the author. They do not necessarily reflect the views, processes or methodologies of EMC Corporation or Orange Business Services (my employer).


1. Introduction

Big Data has become a buzzword, and we hear about it from early morning – when the newspaper tells us "How Big Data Is Changing the Whole Equation for Business"1 – through the rest of the day. A search for "big data" on Google returned about 2,030,000,000 results in December 2013. So what is Big Data?

According to Krish Krishnan,2 the so-called three V's definition of Big Data that became popular in the industry was first suggested by Doug Laney in a research report published by META Group (now Gartner) in 2001. In a more recent report,3 Doug Laney and Mark Beyer define Big Data as follows: "Big Data is high-volume, -velocity, and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."

Let us review these characteristics of Big Data in more detail.

1. Volume of data is huge (for instance, billions of rows and millions of columns). People create digital data every day by using mobile devices and social media. Data defined as Big Data includes machine-generated data from sensor networks, nuclear plants, X-ray and scanning devices, and consumer-driven data from social media. According to IBM, as of 2012, 2.5 exabytes of data were created every day, and 90% of the data in the world today was created in the last two years alone.4 This data growth is being accelerated by the Internet of Things (IoT), defined as the network of physical objects that contain embedded technology to communicate and interact with their internal states or the external environment (IoT excludes PCs, tablets, and smartphones). IoT will grow to 26 billion installed units by 2020, an almost 30-fold increase from 0.9 billion in 2009, according to Gartner.5

2. Velocity of new data creation and processing. Velocity means both how fast data is being produced and how fast the data must be processed to meet demand. In the case of Big Data, the data streams in continuously, and time-to-value can be achieved only when data capture, data preparation, and processing are fast. This requirement is all the more challenging because the data generation speed changes and the data size varies.

3. Variety of data. In addition to traditional structured data, the data types include semi-structured data (for example, XML files), quasi-structured data (for example, clickstream strings), and unstructured data.

The misconception that large volume is the key characteristic defining Big Data can result in the failure of a Big Data-related project unless the project also focuses on the variety, velocity, and complexity of the data, which are becoming the leading features of Big Data. What is seen as a large data volume today can become the normal data size in a year or two.

A fourth V – Veracity – is frequently added to this definition of Big Data. Data veracity deals with uncertain or imprecise data. How accurate is the data in predicting business value? Does Big Data analytics give meaningful results that are valuable for the business? Data accuracy must be verifiable.

Just retaining more and more data of various types does not create any business advantage unless the company has developed a Big Data strategy to extract business information from Big Data sets. Business benefits are frequently higher when addressing the variety of the data rather than just the data volume. Business value can also be created by combining the new Big Data types with existing information assets, which results in even larger data type diversity. According to research done by MIT and the IBM Institute for Business Value,6 organizations applying analytics to create a competitive advantage within their markets or industries are more than twice as likely to substantially outperform their peers.

The requirement for time-to-value calls for innovations in data processing, and those innovations are challenged by Big Data complexity. In addition to the great variety of Big Data types, the combination of different data types – each presenting different challenges and requiring different analytical methods to generate business value – makes data management more complex. With unstructured data accounting for 80%–90% of the data in existence, complexity also means that different standards, data processing methods, and storage formats can exist for each asset type and structure.

The level of complexity and/or data size of Big Data has resulted in another definition: data that cannot be efficiently managed using only traditional data-capture technology and processes or methods. Therefore, new applications and new infrastructure, as well as new processes and procedures, are required to use Big Data. The storage infrastructure for Big Data applications should be capable of managing large data sets and providing the required performance. Development of new storage solutions should address the 3V+V characteristics of Big Data.

Big Data creates great potential for business development, but at the same time it can mean "Big Mistakes" if a lot of money and time are spent on poorly defined business goals or opportunities. The goal of this article is to help readers anticipate how Big Data will affect storage infrastructure design and data lifecycle management so that they can work with storage vendors to develop Big Data road maps for their companies and spend their Big Data storage budgets wisely.

While this article considers the storage technologies for Big Data, I want readers to keep in mind that Big Data is about more than just technology. To gain the business advantages of Big Data, companies have to change the way they do business and develop enterprise information management strategies that address the Big Data lifecycle, including hardware, software, services, and policies for capturing, storing, and analyzing Big Data. For more detail, refer to the excellent EMC course, Data Science and Big Data Analytics.8

2. Types of Storage Architecture for Big Data

2.1 Storage Requirements for Big Data: Batch and Real-Time Processing

Big Data architecture is based on two different technology types: one for real-time, interactive workloads and one for batch processing requirements. These classes of technology are complementary and frequently deployed together – for example, in the Pivotal One platform, which includes Pivotal Data Fabric9,10 (Pivotal is partly owned by EMC, VMware, and General Electric).

Big Data frameworks such as Hadoop are batch process-oriented. They address the problems of the cost and speed of Big Data processing by using open source software and massively parallel processing. Server and storage costs are reduced by implementing scale-out solutions based on commodity hardware.

NoSQL databases that are highly optimized key-value data stores, such as HBase, are used for high-performance, index-based retrieval in real-time processing. NoSQL databases can process large amounts of data from various sources in a flexible data structure with low latency. They can also provide real-time data integration with a Complex Event Processing (CEP) engine to enable actionable real-time Big Data analytics. High-speed processing of Big Data in flight – so-called Fast Big Data – is typically done using in-memory computing (IMC). IMC relies on in-memory data management software to deliver high-speed, low-latency access to terabytes of data across a distributed application. Some Fast Big Data solutions, such as Terracotta BigMemory,11 keep all the data in memory with the motto "Ditch the Disk" and use disk-based storage only for data copies and redo logs for database startup and fault recovery, as SAP HANA, an in-memory database, does.12 Other Fast Big Data solutions, such as the Oracle Exalytics In-Memory Machine,13 the Teradata Active EDW platform,14 and DataDirect Networks' SFA12KX series appliances,15 use a hybrid storage architecture (see Section 2.6).
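To make the key-value access pattern concrete, the sketch below writes and reads a single row through the HBase Java client API (the 0.9x-era HTable interface). The table name, column family, and row-key scheme are illustrative assumptions, not part of any product discussed here.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class SensorLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        HTable table = new HTable(conf, "sensor_events");    // assumes the table already exists

        // Write one reading; the row key combines sensor ID and timestamp so range scans stay cheap.
        Put put = new Put(Bytes.toBytes("sensor42#20140101T120000"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("temp"), Bytes.toBytes("21.5"));
        table.put(put);

        // Index-based point read: latency is bounded by a single region lookup, not a batch job.
        Get get = new Get(Bytes.toBytes("sensor42#20140101T120000"));
        Result result = table.get(get);
        System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("temp"))));

        table.close();
    }
}
```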

As data of various types is captured, it is stored and processed in traditional DBMSs (for structured data), simple files, or distributed clustered systems such as NoSQL data stores and the Hadoop Distributed File System (HDFS). Due to the size of the Big Data sets, the raw data is not moved directly to a data warehouse. The raw data undergoes transformation using MapReduce processing, and the resulting reduced data sets are loaded into the data warehouse environment, where they are used for further analysis – conventional BI reporting/dashboards, statistical, semantic, and correlation capabilities, and advanced data visualization.


Figure 1: Enterprise Big Data Architecture (Ref. 10)

Storage requirements for batch processing and for real-time analytics are very different. While the capability to store and manage hundreds of terabytes in a cost-effective way is the key requirement for batch processing of Big Data, low I/O latency is the key for large-capacity, performance-intensive Fast Big Data analytics applications. Storage architectures for such I/O-performance-driven applications, which include all-flash storage appliances and/or in-memory databases, are not in the scope of this article.

2.2 Integration of Big Data Ecosystem with Traditional Enterprise Data Warehouse

The integration of a Big Data platform with a traditional BI enterprise data warehouse (EDW) architecture is important, and this requirement, which many enterprises have, results in the development of so-called consolidated storage systems. Instead of a "rip & replace" of the existing storage ecosystems, organizations can leverage the existing storage systems and adapt their data integration strategy, using Hadoop as a form of preprocessor for Big Data integration in the data warehouse. Consolidated storage includes storage tiering and is used for very different data management processes: primary workloads, real-time online analytics queries, and offline batch-processing analytics. These different data processing types result in the heterogeneous or hybrid storage environments discussed later. Readers can find more information about consolidated storage in Ref. 16.

The new integrated EDW should have three main capabilities:

1. Hadoop-based analytics to process and analyze any data type across commodity server clusters
2. Real-time stream processing with a Complex Event Processing (CEP) engine with sub-millisecond response times
3. Data warehousing providing insight with advanced in-database analytics

The integration of Big Data and EDW environments is reflected in the definition of “Data Lake”.

2.3 Data Lake

Booz Allen Hamilton introduced the concept of the Data Lake.17 Instead of storing information in discrete data structures, the Data Lake consolidates an organization's complete repository of data in a single, large table. The Data Lake includes data from all data sources – unstructured, semi-structured, streaming, and batch data. To store and process terabytes to petabytes of data, a Data Lake should scale in both storage and processing capacity in an affordable manner. An enterprise Data Lake should provide highly available, protected storage; support existing data management processes and tools as well as real-time data ingestion and extraction; and be capable of data archiving.

Support for Data Lake architecture is included in Pivotal's Hadoop products that are designed to work within existing SQL environments and can co-exist alongside in-memory databases for simultaneous batch and real-time analytic queries.18 Customers implementing Data Lakes can use the Pivotal HD and HAWQ platform for storing and analyzing all types of data – structured and unstructured.

Pentaho has created an optimized system for organizing the data that is stored in the Data Lake, allowing customers to use Hadoop to sift through the data and extract the chunks that answer the questions at hand.19

2.4 SMAQ Stack

The term SMAQ stack, coined by Ed Dumbill in a blog post at O'Reilly Radar,20 refers to a processing stack for Big Data that consists of layers of Storage, MapReduce technologies, and Query technologies. SMAQ systems are typically open source and distributed and run on commodity hardware. Similar to the commodity LAMP stack of Linux, Apache, MySQL, and PHP, which has played a critical role in the development of Web 2.0, SMAQ systems are expected to become a framework for the development of Big Data-driven products and services. While Hadoop-based architectures dominate in SMAQ, SMAQ systems also include a variety of NoSQL databases.

Figure 2: The SMAQ Stack for Big Data (Ref. 8)

As expected, storage is the foundational layer of the SMAQ stack and is characterized by distributed and unstructured content. At the intermediate layer, MapReduce technologies enable the distribution of computation across many servers and support a batch-oriented processing model of data retrieval and computation. Finally, at the top of the stack are the query functions, whose characteristic capability is finding efficient ways of defining computation and providing a platform for "user-friendly" analytics.

2.5 Big Data Storage Access Patterns

Typical Big Data storage access patterns are write-once, read-many workloads with metadata lookups and large block-sized reads (64 MB to 128 MB, e.g. Hadoop HDFS), as well as small-sized accesses for HBase. Therefore, the Big Data processing design should provide efficient data reads.

Both scale-out file systems and object-based systems can meet Big Data storage requirements. Scale-out file systems provide a global namespace file system, whereas the use of metadata in object storage (see Section 3.7 below) allows high scalability for large data sets.

2.6 Taxonomy of Storage Architectures for Big Data

Big Data storage architectures can be categorized into shared-nothing, shared primary, or shared secondary storage. Implementation of Hadoop using Direct Attached Storage (DAS) is common, as many Big Data architects see shared storage architectures as relatively slow, complex, and, above all, expensive. However, DAS has its own limitations (first of all, inefficiency in storage use) and is one extreme in the broad spectrum of storage architectures. As a result, in addition to DAS-based HDFS systems, enterprise-class storage solutions using shared storage (scale-out NAS, e.g. EMC Isilon®, or SAN), alternative distributed file systems, cloud object-based storage for Hadoop (using REST APIs such as CDMI, S3, or Swift; see Ref. 7 for details), and decoupled storage and compute nodes, such as solutions using vSphere BDE (see Section 3.9), are gaining popularity (see Figure 3).

For Hadoop workloads, the storage-to-compute resource ratios vary by application, and it is often difficult to determine them in advance. This challenge makes it imperative to design a Hadoop cluster for flexibility, scaling storage and compute independently. Decoupling storage and compute resources is a way to scale storage independently of compute. Examples of such architectures are SeaMicro Fabric Storage and Hadoop virtualization using VMware BDE, which are discussed in Sections 3.8 and 3.9, respectively.


Figure 3: Technologies Evaluated or Being Deployed to Meet Big Data Requirements (Ref. 21)

Figure 3 presents technologies in the priority order in which companies are evaluating them or have deployed them to meet Big Data requirements.21 EMC Isilon is an example of shared storage (scaled-out NAS) as primary storage for Big Data Analytics. A Big Data “Stack” like the EMC Big Data Stack presented below needs to be able to operate on a multi-petabyte scale to handle structured and unstructured data.


Technology Layer                  EMC Product
Collaborative – Act               Documentum® xCP, Greenplum® Chorus
Real Time – Analyze               Greenplum + Hadoop, Pivotal
Structured/unstructured data      Pivotal HD, Isilon
Storage, Petabyte Scale           Isilon, Atmos®

Table 1: EMC Big Data Stack

A challenge for IMC-based Big Data analytics is that the volume of data that companies want to analyze grows faster than memory becomes affordable. The 80/20 rule applies to many enterprise analytic environments – only 20% of the data generates 80% of the I/O in a given period of time. As the data ages, it is more rational to implement dynamic storage tiering in a hybrid storage architecture than to place all the data in memory. The goal of a hybrid storage architecture is to address the great variety of storage performance requirements that various types of Big Data have by implementing dynamic storage tiering, which moves data chunks between storage pools of SSD, SAS, and SATA drives. The hybrid storage in the Teradata Active EDW platform14 and DataDirect Networks' Storage Fusion Architecture15 exemplifies this storage architecture type.

DataDirect Networks' SFA12KX series appliances are built on the company's integrated Big Data-oriented Storage Fusion Architecture. A single SFA12KX appliance that can accommodate a mix of up to 1,680 SSD, SATA, or SAS drives delivers up to 1.4 million input/output operations per second (IOPS) and pushes data through at a rate of 48 GB per second.15
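As a minimal illustration of the tiering decision described above, the toy sketch below places a data chunk on an SSD, SAS, or SATA pool based on its age and recent read count. The thresholds and per-chunk statistics are invented for demonstration; real arrays apply far more sophisticated, policy-driven movement at sub-LUN granularity.

```java
import java.time.Duration;
import java.time.Instant;

public class TieringPolicySketch {
    enum Tier { SSD, SAS, SATA }

    // Toy placement rule: hot chunks (the ~20% serving ~80% of I/O) stay on flash,
    // warm chunks go to SAS, and aged, rarely read chunks drop to capacity-optimized SATA.
    static Tier placeChunk(Instant lastAccess, long readsLastWeek) {
        long ageDays = Duration.between(lastAccess, Instant.now()).toDays();
        if (readsLastWeek > 10_000 || ageDays < 1) {
            return Tier.SSD;
        } else if (ageDays < 30) {
            return Tier.SAS;
        }
        return Tier.SATA;
    }

    public static void main(String[] args) {
        System.out.println(placeChunk(Instant.now().minus(Duration.ofDays(90)), 12));     // SATA
        System.out.println(placeChunk(Instant.now().minus(Duration.ofHours(2)), 50_000)); // SSD
    }
}
```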

Nutanix offers a Converged Storage Architecture. The Nutanix Complete Cluster, using the Nutanix Distributed File System (NDFS), enables MapReduce to be run without HDFS and its NameNode. The Nutanix Complete Cluster consolidates DAS with compute resources in four-node Intel-based appliances called "Compute + Storage Together."22 The internal storage, a combination of PCIe SSDs (Fusion-io) and SATA hard disks from all nodes, is virtualized into a unified pool by Nutanix Scale-out Converged Storage and can be dynamically allocated to any virtual machine or guest operating system. A Nutanix Controller Virtual Machine (VM) on each host manages storage for the VMs on that host. The Controller VMs work together to manage storage across the cluster as a pool using NDFS.

2.7 Selection of Storage Solutions for Big Data

Many Big Data solutions emphasize low cost; however, there are also higher-cost solutions, such as those using enterprise-class storage, that remain cost-effective because they yield significant benefits. Choosing a storage architecture for Big Data is a mix of science and art: finding the right balance between TCO and value for the business.

3. Hadoop Framework

3.1 Hadoop Architecture and Storage Options

The Apache Hadoop platform,23 an open-source software framework supporting data-intensive distributed applications, has two core components: the Hadoop Distributed File System (HDFS), which manages massive unstructured data storage on commodity hardware, and MapReduce, which provides various functions to access the data on HDFS. The HDFS architecture evolved from the Google File System architecture. MapReduce consists of a Java API as well as software to implement the services that Hadoop needs to function. Hadoop integrates the storage and analytics in a framework that provides reliability, scalability, and management of the data.
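To show what the MapReduce API looks like in practice, here is the canonical word-count job written against the org.apache.hadoop.mapreduce API – a minimal sketch of a batch job that reads files from HDFS, shuffles map output by key, and writes reduced results back to HDFS. The input and output paths are supplied on the command line and are assumptions of this example.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // map output is shuffled and grouped by key
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(key, new IntWritable(sum));   // reduced data set written back to HDFS
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The job is packaged into a JAR and submitted with the hadoop jar command; the JobTracker (or YARN in Hadoop 2.x) then schedules map tasks close to the HDFS blocks they read.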

The three principal goals of the HDFS architecture are:

1. Process extremely large data volumes (large numbers of files) ranging from gigabytes to petabytes
2. Streaming data processing to read data at high throughput rates and process data on read
3. Capability to execute on commodity hardware with no special hardware requirements

Hadoop supports several different node types:

• Multiple DataNodes
• The NameNode, which manages the HDFS namespace by determining which DataNode contains the data requested by the client and redirecting the client to that particular DataNode
• The Checkpoint node, a secondary NameNode that manages the on-disk representation of the NameNode metadata
• The JobTracker node, which manages all jobs submitted to the Hadoop cluster and facilitates job and task scheduling

Subordinate nodes provide both TaskTracker and DataNode functionality. These nodes perform all of the real work done by the cluster. DataNodes store the data and serve I/O requests under the control of the NameNode. The NameNode houses and manages the metadata: when a TaskTracker gets a read or write request for an HDFS block, the NameNode tells the TaskTracker where the block exists or where it should be written. TaskTrackers execute the job tasks assigned to them by the JobTracker.

Hadoop is "rack aware" – the NameNode uses a data structure that determines which DataNode is preferred based on the "network distance" between them. Nodes that are "closer" are preferred (same rack, different rack, same data center). HDFS uses this information when replicating data, trying to keep different copies of the data on different racks. The goal is to reduce the impact of a rack power outage or switch failure so that even if these events occur, the data may still be readable.

HDFS uses a "shared nothing" architecture for primary storage – all the nodes have direct-attached SAS or SATA disks. Direct Attached Storage (DAS) means that a server attaches directly to storage system ports without a switch; internal drives in the server enclosure fall into this category. Since DAS uses a point-to-point connection, it provides high bandwidth between the server and the storage system. No DAS-type storage is shared: the disks are locally attached, and no disk is attached to two or more nodes. The default way to store data for Hadoop is HDFS on local direct-attached disks. However, it can be seen as HDFS on HA-DAS, because the data is replicated across nodes for HA purposes. Compute nodes are distributed file system clients if scale-out NAS servers are used.
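The write-once, read-many pattern shows up directly in the HDFS client API: files are created, streamed in once, and then read sequentially in large, block-sized chunks. The sketch below uses the standard org.apache.hadoop.fs.FileSystem interface; the path is a hypothetical example, and the cluster address is taken from the core-site.xml/hdfs-site.xml files on the classpath.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());   // resolves fs.defaultFS from the site files
        Path path = new Path("/data/ingest/events.log");        // hypothetical landing path

        // Write once: HDFS files are append-only; blocks are replicated as they are written.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("sensor42,2014-01-01T12:00:00,21.5\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read many: large, mostly sequential reads over 64-128 MB blocks.
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }

        System.out.println("Block size: " + fs.getFileStatus(path).getBlockSize() + " bytes");
    }
}
```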

The import stage (putting data into HDFS for processing) and the export stage (extracting data from the system after processing) can be significantly accelerated by replacing conventional hard disk drives (HDDs) with solid state disks (SSDs). Random read times especially benefit from using SSDs. For example, Intel has shown that replacing conventional HDDs with the Intel SSD 520 Series reduced the time to complete the workload by approximately 80 percent – from about 125 minutes to about 23 minutes.24 Even though the cost of SSDs continues to plummet, it is still prohibitively expensive to use all-SSD storage for Hadoop clusters, except in some use cases where time-to-value justifies the cost of SSD-based storage. Therefore, a tiered storage model combining conventional HDDs and SSDs in the same server can provide the right balance between performance and storage cost.

Pros and cons of various storage options25,26 for HDFS are presented in the table below. As seen from the table, HDFS has a few issues the Apache Hadoop community is working to address. One of the top issues for Hadoop v1.0 is that the NameNode represents a single point of failure (SPOF). When it goes offline, the cluster shuts down, and the process that was running at the time of the failure has to be restarted from the beginning. Version 2.0 of Hadoop (2.2.0 is the first stable release in the 2.x line; the GA date was October 16, 2013) introduces both manual and automated failover to a standby NameNode without needing to restart the cluster.27 Automatic failover adds two new components to an HDFS deployment: a ZooKeeper quorum and the ZKFailoverController process.
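From the client side, an HA pair is addressed through a logical nameservice rather than a single NameNode host. The sketch below shows the relevant HDFS HA client settings expressed in code; in practice they live in hdfs-site.xml, and the nameservice name and host names used here are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HaClientConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://mycluster");                        // logical nameservice, not a host
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // The client proxy finds whichever NameNode is currently active and retries transparently on failover.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to " + fs.getUri());
    }
}
```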

Vendors are also coming to market with fixes such as a NameNode failover mode in HDFS, as well as file system alternatives that do not use a NameNode function (that means no NameNode to fail).

There are some disadvantages of triple-mirror server-based replication used by HDFS for ingestion and redistribution of data:

• Inefficient storage utilization; for example, three terabytes of raw capacity are required to store just one usable terabyte (TB) of data.
• Server-based triple replication creates a significant load on the servers themselves.
• Server-based triple replication creates a significant load on the network.

Vendors’ solutions (EMC Isilon, NetApp E-series, and FAS) that address this are described later.

DAS
  Pros: Writes are highly parallel and tuned for Hadoop jobs; the JobTracker tries to make local reads.
  Cons: High replication cost compared with shared storage. The NameNode keeping track of data location is still a SPOF (failover, introduced in Hadoop 2.0, helps; this can also be addressed by dispersed storage solutions such as Cleversafe).

SAN
  Pros: Array capabilities (redundancy, replication, dynamic tiering, virtual provisioning) can be leveraged. As the storage is shared, a new node can easily be assigned a failed node's data. Centralized management. Using shared storage eliminates or reduces the need for three-way data replication between data nodes.
  Cons: Cost; limited scalability of scale-up storage arrays.

Distributed File System / Scale-out NAS
  Pros: Shared data access; POSIX-compatible, so it works for non-Hadoop applications just as a local file system does; centralized management and administration.
  Cons: While HDFS is highly optimized for Hadoop, a general Distributed File System (DFS) is not likely to get the same level of optimization. Strict POSIX compliance leads to unnecessary serialization. Scaling limitations, as some DFSs are not designed for thousands of nodes.

Table 2: Hadoop Storage Options

A tightly coupled Distributed File System (DFS) for Hadoop is a general-purpose shared file system implemented in the kernel with a single namespace.26 Locality awareness is part of the DFS, so there is no need for a NameNode. Compute nodes may or may not have local storage. Remote storage is accessed using a file-system-specific internode protocol. If the DFS uses local disks, compute nodes are part of the DFS, with data spread across nodes.

3.2 Enterprise-class Hadoop Distributions

Several vendors now offer Hadoop distributions as enterprise-class products packaged with maintenance and technical support options. The goal of commercial Hadoop distributions is to address Hadoop challenges such as inefficient data staging and loading processes and the lack of multi-tenancy, backup, and DR capabilities. Vendors providing Hadoop distributions are evaluated in a recent Forrester review.28 Some of the vendors are EDW vendors, such as EMC Greenplum, IBM, Microsoft, and Oracle, which are modifying their products to support Hadoop.

For example, the Hadoop distribution called Pivotal HD is based on Hadoop 2.0 and integrates the Greenplum database with Apache Hadoop.29 This integration reduces the need for data movement for processing and analysis. Pivotal value-add components include advanced database services – HAWQ, a high-performance, "True SQL" query interface running within the Hadoop cluster, and an Extensions Framework providing support for HAWQ interfaces on external data providers (HBase, Avro, etc.) – and advanced analytics functions (MADlib). Pivotal HD is available as a software-only or appliance-based solution.

Pivotal HD provides Unified Storage Service (USS), enabling user access to data residing on multiple platforms without data copying. USS is a "pseudo" Hadoop File System (HDFS) that delegates file system operations directed at it to other file systems in an "HDFS-like" way. Using USS, users do not need to copy data from the underlying storage system to HDFS to process the data using Hadoop framework, significantly reducing time and operational costs. Large organizations typically have multiple data sets residing on various storage systems. As moving this data to a central Data Lake environment would be time consuming and costly, USS can be used to provide a unified view of underlying storage systems for Big Data analytics.


Figure 4: Pivotal HD Architecture (Ref. 29)

A growing list of vendors (both systems and storage vendors) is incorporating Hadoop into preconfigured products and offering them as appliances – EMC Greenplum HD (a bundle that combines MapR's version of Hadoop, the Greenplum database, and a standard x86-based server), Pivotal HD (discussed above), the Dell/Cloudera solution (which combines Dell PowerEdge C2100 servers and PowerConnect switches with Cloudera's Hadoop distribution and its Cloudera Enterprise management tools), and Pentaho Data Integration 4.2.

EMC Greenplum is the first EDW vendor to provide a full-featured, enterprise-grade Hadoop appliance and to offer an appliance platform that integrates its Hadoop, EDW, and data integration offerings in a single rack. These solutions provide an easier way for users to benefit from Hadoop-based analytics without the in-house integration development that early Hadoop implementations required. For example, Cisco offers a comprehensive solution stack: the Cisco UCS Common Platform Architecture (CPA) for Big Data includes compute, storage, connectivity, and unified management.30

3.3 Big Data Storage and Security

Regulation and compliance are also important considerations for Big Data. The first versions of Hadoop offered limited (if any) ways to respond to corporate security and data governance policies. The Hadoop security model has been improving through the development of Apache security projects and the release of "security-enhanced" Hadoop distributions by vendors (for example, Cloudera Sentry and Intel's secure Hadoop distribution, which uses Intel Expressway API Manager as a security gateway enforcement point for all Hadoop REST APIs). The release of the 2.x distributions of Hadoop addresses many security issues, including security enhancements for HDFS (enforcement of HDFS file permissions). This article considers the storage-related aspects of the Hadoop security model, namely encryption of data at rest. Readers interested in other security aspects of Hadoop can find many reviews, for example Refs. 31 and 32.

Hadoop 2.0 does not include encryption for data at rest on HDFS. If encryption of data on Hadoop clusters is required, there are two options: third-party tools for implementing HDFS disk-level encryption, or security-enhanced Hadoop distributions such as the Intel distribution. The Intel Distribution for Apache Hadoop software33 is optimized for Intel Advanced Encryption Standard New Instructions (Intel AES-NI), a technology built into Intel Xeon processors. Encryption can be applied transparently to users at file-level granularity and integrated with external, standards-based key management applications.

Following best practices for data security, sensitive files must be encrypted by external security applications before they arrive at the Apache Hadoop cluster and are loaded into HDFS. Each file must arrive with the corresponding encryption key. If files were encrypted only after arrival, they would reside on the cluster in their unencrypted form, which would create vulnerabilities. When an encrypted file enters the Apache Hadoop environment, it remains encrypted in HDFS. It is then decrypted as needed for processing and re-encrypted before it is moved back into storage. The results of the analysis are also encrypted, including intermediate results. Data and analysis results are neither stored nor transmitted in unencrypted form.33
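As a generic illustration of the encrypt-before-load pattern described above – not the mechanism used by any particular distribution – the sketch below encrypts a local file with javax.crypto while streaming it into HDFS, so plaintext never lands on the cluster. The target path is hypothetical, and the locally generated key stands in for what would really come from an external key management system.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.CipherOutputStream;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EncryptThenIngest {
    public static void main(String[] args) throws Exception {
        // In production the key comes from an external key manager; it is generated here only for illustration.
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);
        SecretKey key = keyGen.generateKey();

        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));

        FileSystem fs = FileSystem.get(new Configuration());
        Path target = new Path("/secure/ingest/records.enc");   // hypothetical landing path in HDFS

        try (InputStream in = new FileInputStream(args[0])) {
            FSDataOutputStream raw = fs.create(target, true);
            raw.write(iv);                                       // store the IV in the clear ahead of the ciphertext
            try (OutputStream out = new CipherOutputStream(raw, cipher)) {
                byte[] buf = new byte[64 * 1024];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);                        // bytes are encrypted as they stream to the cluster
                }
            }                                                    // closing the cipher stream also closes the HDFS stream
        }
    }
}
```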

In 2013, Intel launched an open source effort called Project Rhino to improve the security capabilities of Hadoop and the Hadoop ecosystem, and contributed code to Apache.34 The objective of Project Rhino is to take a holistic hardening approach to the Hadoop framework, with consistent concepts and security capabilities across projects. To achieve this, a splittable AES codec implementation is introduced to Hadoop, allowing distributed data to be encrypted and decrypted from disk. The key distribution and management framework will make it possible for MapReduce jobs to perform encryption and decryption.

3.4 EMC Isilon Storage for Big Data

Isilon is an enterprise storage system that can natively integrate with HDFS.35 The solution uses shared scale-out NAS storage as a large repository of Hadoop data for data protection, archive, security, and data governance purposes. EMC Isilon storage is managed by intelligent software that scales data across vast quantities of commodity hardware, enabling explosive growth in performance and capacity. While HDFS creates three replicas for redundancy, Isilon OneFS® dramatically reduces the need for a three-way copy. Every node in the cluster is connected to the internal InfiniBand network. Clients connect using standard protocols such as NFS, CIFS, FTP, and HTTP over the front-end network, which is either 1 or 10 Gb/s Ethernet. OneFS uses the internal InfiniBand network to allocate and stripe data across all nodes in the cluster automatically. Because OneFS distributes the Hadoop NameNode function to provide high availability and load balancing, it eliminates the single point of failure (see Table 3). Isilon storage provides a single file system/single volume scalable up to 15 PB.35 Data can also be staged from other protocols to HDFS by using OneFS as a staging gateway. Integration with EMC ViPR® Software-Defined Storage, offering access to object storage APIs from Amazon S3, EMC Atmos, and others, enables leveraging cloud-based applications and workflows.

The benefits of using Isilon for Big Data are presented in the table below.

Hadoop/DAS challenge: Dedicated storage infrastructure for Hadoop only.
Hadoop with Isilon: Scale-out storage platform; multiple applications and workflows.

Hadoop/DAS challenge: Single point of failure (NameNode failover added in Hadoop 2.0).
Hadoop with Isilon: No single point of failure – distributed namespace.

Hadoop/DAS challenge: Lack of enterprise-class data protection – no snapshots, backup, or replication.
Hadoop with Isilon: End-to-end data protection – SnapshotIQ, SyncIQ, NDMP backup.

Hadoop/DAS challenge: Poor storage efficiency – three-way mirroring.
Hadoop with Isilon: Storage efficiency, >80% storage utilization.

Hadoop/DAS challenge: Manual import/export.
Hadoop with Isilon: Multi-protocol support; industry-standard protocols: NFS, CIFS, FTP, HTTP, HDFS.

Hadoop/DAS challenge: Fixed scalability – rigid compute-to-storage ratio.
Hadoop with Isilon: Independent scalability – decoupled compute and storage; add compute and storage independently.

Table 3: Benefits of Using Isilon for Hadoop

Isilon provides multi-tenancy in the Hadoop environment.

• One directory within OneFS per tenant, one subdirectory per data scientist
• Access controlled by group and user rights
• Leveraging SmartQuotas to set resource limits and report usage

3.5 EMC Greenplum Distributed Computing Appliance (DCA)

Combining Isilon and Greenplum HD provides the best of both worlds for Big Data analytics. The Greenplum Database (GPDB) with Hadoop delivers a solution for the analytics of structured, semi-structured, and unstructured data.36 The Greenplum DCA is a massively parallel architecture, and the GPDB is a scalable analytic database. It features a "shared nothing" architecture, in contrast to Oracle and DB2. Operations are extremely simple – once data is loaded, Greenplum's automated parallelization and tuning provide the rest; no partitioning is required. To scale, simply add nodes (the Greenplum DCA fully leverages the industry-standard x86 platform); storage, performance, and load bandwidth are managed entirely in software.

Users can perform complex, high-speed, interactive analytics using GPDB, as well as stream data directly from Hadoop into GPDB to incorporate unstructured or semi-structured data in the above analyses within GPDB. Hadoop can also be used to transform unstructured and semi-structured data into a structured format that can then be fed into GPDB for high-speed, interactive querying.31

3.6 NetApp Storage for Hadoop

The NetApp Open Solution for Hadoop preserves the shared-nothing architectural model. It provides DAS storage in the form of a NetApp E-Series array (for example, the E2660) to each DataNode within the Hadoop cluster.37 Compute and storage resources are decoupled with SAS-attached NetApp E2660 arrays, and the recoverability of a failed Hadoop NameNode is improved with an NFS-attached FAS2040. The E2660 array is configured as four volumes of DAS so that each DataNode has its own non-shared set of disks and "sees" only its share of disk. The FAS2040 is used as storage for the NameNode, mitigating loss of cluster metadata due to NameNode failure. It functions as a single, unified repository of cluster metadata that supports faster recovery from disk failure. It also serves as a repository for other cluster software, including scripts. Instead of three-way data mirroring consuming storage capacity and network bandwidth, data is mirrored to a direct-attached NetApp E2660 array via 6 Gb/s SAS connections.

3.7 Object-Based Storage for Big Data

3.7.1 Why Is Object-based Storage for Big Data Gaining Popularity?

The challenge of managing traditional block-based storage for Big Data has led many organizations to take an interest in object storage, which can use the same types of hardware systems as the traditional approach but stores data as objects – self-contained groups of logically related data.

While block-based storage stores data in groups of blocks with a minimal amount of metadata kept with the content, object-based storage stores data as objects, each with a unique global identifier (a 128-bit Universally Unique ID [UUID]) that is used for data access and retrieval. The Object-based Storage Device (OSD) is a new disk interface technology being standardized by the ANSI T10 technical committee (Fig. 5). Metadata that includes everything needed to manage the content is attached to the primary data and stored contiguously with the object. The object can be any unstructured data, file, or group of files – for example, audio, documents, email, images, and video files. By combining metadata with content, objects are never locked to a physical location on a disk, enabling the automation and massive scalability required for cloud and Big Data solutions. Incorporating metadata into objects simplifies the use of data management (preservation, retention, and deletion) policies and therefore reduces management overhead. To applications, all of the information appears as one big pool of data. With the flat-address-space design, there is no need to use file systems (file systems have an average overhead of 25%) or to manage LUNs and RAID groups. Access to object-based storage is provided using web protocols such as REST and SOAP. Object-based systems typically secure information via Kerberos, Simple Authentication and Security Layer, or some other Lightweight Directory Access Protocol-based authentication mechanism.
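To make the access model concrete, the sketch below stores and retrieves an object over a plain REST-style interface using only java.net.HttpURLConnection. The endpoint, bucket name, metadata header, and client-generated UUID key are illustrative assumptions – real services such as S3, Swift, or Atmos add their own authentication and header conventions.

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class ObjectStoreSketch {
    // Hypothetical S3/Swift-style endpoint; real deployments would sign requests and set auth headers.
    private static final String ENDPOINT = "https://objects.example.com/analytics-bucket/";

    static String putObject(byte[] content, String contentType) throws Exception {
        String objectId = UUID.randomUUID().toString();                 // flat namespace: an ID, not a file path
        HttpURLConnection conn = (HttpURLConnection) new URL(ENDPOINT + objectId).openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", contentType);
        conn.setRequestProperty("x-meta-retention", "7y");              // user metadata travels with the object
        try (OutputStream out = conn.getOutputStream()) {
            out.write(content);
        }
        if (conn.getResponseCode() / 100 != 2) {
            throw new IllegalStateException("PUT failed: " + conn.getResponseCode());
        }
        return objectId;
    }

    static byte[] getObject(String objectId) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(ENDPOINT + objectId).openConnection();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream in = conn.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                buffer.write(buf, 0, n);
            }
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String id = putObject("hello object world".getBytes(StandardCharsets.UTF_8), "text/plain");
        System.out.println(new String(getObject(id), StandardCharsets.UTF_8));
    }
}
```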

Object-based storage was brought to the market first as content-addressed storage (CAS) systems such as EMC Centera®. The main goal has been to provide regulatory compliance for archived data. Cloud-oriented object-based systems appeared in 2009 and have become the next generation of object-based storage. These cloud-oriented systems have to support data transfer across wide geographical areas, such as global content distribution of primary storage, and function as low-cost storage for backup and archiving in the cloud.

The advantages of using object-based storage in cloud infrastructures are exemplified in solutions offered by the biggest cloud service providers such as Amazon and Google since object-based storage can simplify many cloud operations.


Figure 5: Block-Based vs. Object-Based Storage Models (from Ref. 38)

Cleversafe announced plans to build its Dispersed Compute Storage solution by combining the power of Hadoop MapReduce with Cleversafe's Dispersed Storage System.39 The object-based Dispersed Storage systems will be able to capture data at 1 TB per second at exabyte capacity. Combining MapReduce with the Dispersed Storage Network system on the same platform, and replacing HDFS, which relies on three copies to protect data, will significantly improve reliability and enable analytics at a scale previously unattainable through traditional HDFS configurations.

The DataDirect Networks (DDN) object storage solution is built on a software and hardware stack that has been tuned for on-premise cloud/Big Data storage, self-service data access, and reduced time to insight. DDN Web Object Scaler (WOS) software (WOS 3.0) and the WOS7000 storage appliance can federate a group of WOS clusters to achieve management of up to 983 petabytes and 32 trillion unique objects.40 Such an implementation requires 32 federated WOS clusters, each of which supports up to 1 trillion objects and is made up of 256 WOS object storage servers. According to DDN,40 the platform can retrieve 256 million objects per second with sub-50-millisecond latency and achieve throughput of 10 TB per second. A fully populated rack of preconfigured WOS 3.0-powered WOS7000 appliances offers 2.5 PB of storage capacity.

3.7.2 EMC Atmos

Atmos®, EMC's software solution for object-based storage, uses the Infiniflex hardware platform to deliver cloud and Big Data storage services.41 It is the first multi-petabyte information management solution designed to automatically manage and optimize the delivery of rich, unstructured information across large-scale, global cloud storage environments. For example, eBay used EMC Atmos to manage over 500 million objects per day.42 New applications can be introduced to the Atmos cloud without having to tie them to specific storage systems or locations.

At EMC World 2012, EMC announced a suite of enhancements to the EMC Atmos Cloud platform that transforms how service providers and enterprises manage Big Data in large, globally distributed cloud storage environments. EMC also announced new Atmos Cloud Accelerators that make it even easier and faster to move data in and out of Atmos-powered clouds.41 As a platform, Atmos version 2 offers several other features that tie cloud-based storage to Big Data and application use. The Big Data paradigm is supported by the scalability of the Atmos platform, on which sites can be added to increase storage relatively simply. Atmos GeoDrive further enables support for Big Data and data resilience by providing access to a storage cloud instantly from any Microsoft Windows desktop or server anywhere, without writing a single line of code. Atmos 2.1.4, announced in September 2013, further extends features such as GeoParity, age-based Policy Management, S3 API support, and a host of cloud delivered services with new capabilities.

Other vendors of object-based storage products include Amazon (S3), NetApp (StorageGRID), Dell (DX Object Storage Platform), HDS (HCP), Caringo (CAStor), Cleversafe (Dispersed Storage), DataDirect Networks (Web Object Scaler) (both Cleversafe Dispersed Storage and DataDirect Networks WOS are reviewed in Section 3.7.1), NEC (Hydrastor), and Amplidata (AmpliStor).

3.8 Fabric Storage for Big Data: SAN Functionality at DAS Pricing

Scale-out fabric storage offered by AMD SeaMicro as a Big Data storage solution provides massive scale-out capacity with commodity drives.43 Decoupling storage from compute and network so that storage can grow independently enables moving from DAS, with its rigid storage-to-compute ratio, to flexible scale-out fabric storage of up to 5 PB, creating pools of compute, storage, and network I/O that can be sized appropriately for an Apache Hadoop deployment.

According to AMD SeaMicro,43 the SM15000 server platform optimized for Big Data and cloud can reduce power dissipation by half and supply SAN functionality at DAS pricing by coupling data storage through a "Freedom Fabric" switch that removes the constraints of traditional servers. Unlike the industry-standard model, where disk storage is located remotely from processing nodes, SeaMicro has worked out a networking switched fabric that connects servers to the "in rack" disk drives and is extensible beyond the SM15000 rack frame, allowing construction of cumulatively very large systems.

3.9 Virtualization of Hadoop

VMware vSphere Big Data Extensions (BDE 1.0), announced in September 2013, allows vSphere 5.1 or later to manage Hadoop clusters.44 BDE development has been enabled by Serengeti, an open source project initiated by VMware to automate deployment and management of Apache Hadoop clusters in virtualized environments such as vSphere. BDE is a downloadable virtual appliance with a plug-in for vCenter Server, and its deployment is simple: download the OVA and import it into the existing vSphere environment.

The Serengeti virtual appliance includes two virtual machines: the Serengeti Management Server and the Hadoop Template Server. The creation of a Hadoop cluster, including creation and configuration of the virtual machines, is managed by the Serengeti Management Server. The Hadoop Template virtual machine installs the Hadoop distribution software and configures the Hadoop parameters based on the specified cluster configuration settings. Once the Hadoop cluster creation is complete, the Serengeti Management Server starts the Hadoop service. BDE is controlled and monitored through the vCenter server. By default, the basic Apache Foundation distribution of Hadoop is included, but VMware BDE also supports other major Hadoop distributions including Cloudera, Pivotal HD, Hortonworks, and MapR.

Hadoop virtualization dramatically accelerates Big Data analytics implementations by making them affordable for companies of various sizes. Virtualized Hadoop clusters can be provisioned on demand and elastically expanded or shrunk using the service catalog. Benchmark results show that virtualized Hadoop performs comparably to a physical configuration.45

By decoupling Hadoop nodes from the underlying physical infrastructure, VMware can bring the benefits of cloud infrastructure7 – rapid deployment, high availability, optimal resource utilization, elasticity, and secure multi-tenancy – to Hadoop. VMware defines the Big Data Extensions core value propositions44 as:

1. Operational Simplicity with Performance: BDE automates configuration and deployment of Hadoop clusters, and IT departments can provide self-service tools. As Hadoop deployment requires configuration of multiple cluster nodes, vSphere tools and capabilities such as cloning, templates, and resource allocation significantly simplify and accelerate Hadoop deployment. The integration with vCloud Automation Center can be used to create Hadoop-as-a-Service, enabling users to select pre-configured templates and customize them to their requirements.

2. Maximization of Resource Utilization on New or Existing Hardware: Big Data Extensions enables IT departments to lower the total cost of ownership (TCO) by maximizing resource utilization on new or existing infrastructure. Virtualizing Hadoop can improve data center efficiency by increasing the types of mixed workloads that can be run on a virtualized infrastructure. This includes running different versions of Hadoop itself on the same cluster or running Hadoop alongside other customer applications, forming an elastic environment. Shared resources lead to higher consolidation ratios, which result in cost savings, as less hardware, software, and infrastructure are required to run a given set of business applications. To facilitate elasticity, BDE can automatically scale the number of compute virtual machines in a Hadoop cluster based on contention from other workloads running on the same shared physical infrastructure. Compute virtual machines are added to or removed from the Hadoop cluster as needed to give the best performance to Hadoop when it needs it and to make resources available for other applications or Hadoop clusters at other times. Isolating different tenants running Hadoop in separate VMs provides stronger resource and security isolation for multi-tenancy. Multi-tenancy can be provided by deploying separate compute clusters for different tenants sharing HDFS. Users can run mixed workloads simultaneously on a single physical host. Additional efficiency can be achieved by running Hadoop and non-Hadoop applications on the same physical cluster.

3. Architect Scalable and Flexible Big Data Platforms: Big Data Extensions is designed to support multiple Hadoop distributions and hardware architectures.

To provide data locality or "rack awareness" (see Section 3.1) to virtualized Hadoop clusters, VMware has contributed the Hadoop Virtual Extensions (HVE) to Apache Hadoop 1.2. HVE helps Hadoop nodes become "data locality"-aware in a virtual environment. Data locality knowledge is important for keeping compute tasks close to the required data. Native Hadoop knows about data locality to the node and rack level, but with the extensions, Hadoop becomes more "virtualization aware" through a concept of "node groups" that basically correspond to the set of Hadoop virtual nodes running in each physical hypervisor server. Pivotal HD is the first Hadoop distribution to include HVE plug-ins, enabling easy deployment of virtualized Hadoop.

Figure 6: Scenarios of Virtual Hadoop Deployment (Ref. 46).

An interesting option is to run compute nodes and data storage nodes as separate VMs to support orthogonal scaling and optimal usage of each resource (Fig. 6). Another option is to leverage SAN storage. Extending the concept of data-compute separation, multiple tenants can be accommodated on the virtualized Hadoop cluster by running multiple Hadoop compute clusters against the same data service.46

4. Cloud Computing and Big Data

Large data volumes, along with the variety of data types and complexity, are features Big Data has in common with cloud storage.7 Therefore, storage architectures are the place where Big Data meets cloud data.

The use of cloud infrastructure for Big Data applications faces many challenges, with performance and data transport foremost among them. Indeed, adoption of cloud-based solutions for Big Data demands technologies for moving data into and out of the cloud. How much data needs to be moved, and at what cost? Moving large volumes of data to and from the cloud may be cost-prohibitive. Real-time data requires enormous resources to manage, and data that streams nonstop may be better processed locally.

Cloud providers for Big Data services, such as Amazon and Rackspace, suggest that customers ship data on portable storage devices for the base data transfer, followed by data synchronization via storage gateways. This approach has its own challenges, as shipments can be delayed and storage devices can be damaged or lost in transit.

These challenges have led to the development of new technologies for Big Data transport. For example, Aspera has developed its fasp™ transfer technology to offer a suite of On Demand Transfer products that solves both the technical problems of the WAN and the cloud I/O bottleneck and delivers efficient transfer of large files, or large collections of files, into and out of the cloud.47 According to Aspera, file transfer times can be guaranteed regardless of network distance and conditions, including transfers over satellite, wireless, and unreliable long-distance international links. Security is built in, including secure endpoint authentication, on-the-fly data encryption, and integrity verification.

5. Big Data Backups

5.1 Challenges of Big Data Backups and How They Can Be Addressed

The challenges of providing storage solutions for growing volumes of Big Data may overshadow the challenges related to Big Data protection and recovery. However, data protection should be a key component of the enterprise strategy for Big Data lifecycle management.48

The needs for Big Data backups can be categorized based on the Big Data definition:

• Velocity: the data needs to be protected quickly
• Volume: the data requires deduplication to be protected efficiently
• Variety: both structured and unstructured files must be protected

The huge data volumes typical of Big Data are the first challenge for Big Data backup solutions. Petabyte-size datastores do not allow backups to complete within accepted backup windows. Furthermore, traditional backup is not designed to handle millions of small files, which are common in Big Data environments. The challenge becomes manageable once we understand that not all Big Data information may need to be backed up. We have to review which part of these data volumes really needs to be backed up. If data can be easily regenerated from another system that is already being backed up, there is no need to back up these data sets at all. When we compare the cost of protecting our data with the cost of regenerating it, we may find that, in many instances, the source data needs to be protected while post-processed data is less expensive to reproduce by rerunning the process than to protect.

Therefore, the real problem is backup of the unique data that cannot be recreated. This is often machine-generated data coming from devices or sensors (for example, the Internet of Things discussed earlier). It is essentially point-in-time data that cannot be regenerated. As data is often copied within the Big Data environment so that it can be safely analyzed, some redundancy results. Thus, data deduplication becomes critical to eliminate redundancy and compress much of the data to optimize backup capacity. Since the Hadoop file system is based on appending data rather than updating or deleting it, large storage savings are achieved when deduplication is applied.
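The sketch below is a toy illustration of content-based deduplication – identical chunks are fingerprinted and stored only once – and is not how Data Domain or any other product implements it; the fixed chunk size and in-memory store are simplifications for demonstration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

public class DedupSketch {
    private static final int CHUNK_SIZE = 8;                  // unrealistically small, for demonstration only
    private final Map<String, byte[]> chunkStore = new HashMap<>();

    // Splits the input into fixed-size chunks and stores only chunks not seen before.
    public int ingest(byte[] data) throws Exception {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        int newChunks = 0;
        for (int offset = 0; offset < data.length; offset += CHUNK_SIZE) {
            int len = Math.min(CHUNK_SIZE, data.length - offset);
            byte[] chunk = new byte[len];
            System.arraycopy(data, offset, chunk, 0, len);
            String fingerprint = Base64.getEncoder().encodeToString(sha.digest(chunk));
            if (chunkStore.putIfAbsent(fingerprint, chunk) == null) {
                newChunks++;                                   // only unique chunks consume backup capacity
            }
        }
        return newChunks;
    }

    public static void main(String[] args) throws Exception {
        DedupSketch dedup = new DedupSketch();
        byte[] backup = "AAAAAAAABBBBBBBBAAAAAAAA".getBytes(StandardCharsets.UTF_8);
        System.out.println("Chunks seen: 3, unique chunks stored: " + dedup.ingest(backup));   // prints 2
    }
}
```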

As mentioned, many Big Data environments consist of millions or even billions of small files. By using backup software products such as Symantec NetBackup Accelerator,49 a very large file system with millions or billions of files can be fully backed up within the amount of time required for an incremental backup. NetBackup Accelerator uses change tracking to reduce the file system overhead associated with traversing a large file system, identifying and accessing only changed data. An optimized synthetic full backup is created and catalogued inline, providing full restore capabilities and shortened Recovery Time Objective (RTO).
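A synthetic full backup, as described above, is assembled on the backup server by merging the catalog of the last full backup with the change-tracked increments, so only changed files are ever read from the client. The toy sketch below shows that merge for a path-to-version catalog; it ignores deletions and everything else a real product handles, and is not Symantec's implementation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SyntheticFullSketch {
    // Each catalog maps a file path to the version identifier of its stored copy.
    static Map<String, String> synthesize(Map<String, String> lastFull,
                                          List<Map<String, String>> incrementals) {
        Map<String, String> syntheticFull = new HashMap<>(lastFull);
        for (Map<String, String> delta : incrementals) {
            syntheticFull.putAll(delta);     // newer copies supersede the ones in the last full backup
        }
        return syntheticFull;                // restorable as a full image without rescanning the file system
    }

    public static void main(String[] args) {
        Map<String, String> full = new HashMap<>();
        full.put("/a", "v1");
        full.put("/b", "v1");
        Map<String, String> delta = new HashMap<>();
        delta.put("/b", "v2");
        delta.put("/c", "v1");
        // e.g. {/a=v1, /b=v2, /c=v1} (map iteration order may vary)
        System.out.println(synthesize(full, Collections.singletonList(delta)));
    }
}
```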

One more issue Big Data backup systems face is scanning the file system each time a backup job starts. For the file systems in Big Data environments, this scan can be time-consuming. One solution to the scanning issue is the OnePass feature developed by CommVault for its Simpana data protection software.50 OnePass is a converged process for backup, archive, and reporting from a single data collection. The resulting single catalog is then shared, accessed, and used by each of Simpana software's archiving, backup, and reporting modules.

Which storage media should be used for Big Data backups: disks or tapes? In most cases, the answer is both. Deduplicated data can be stored on low-cost, high-capacity disks (for example, on Data Domain® appliances, discussed later) for near-term data sets that are not being analyzed at the moment, or on tape for long-term storage of less frequently accessed data (to write deduplicated data to tape, the data must first be "rehydrated" and then written in its original form). Many Big Data projects may not be cost effective if tape is not integrated into the solution. Tapes last longer than disks: physical lifetimes for digital magnetic tape are at least 10 to 20 years,51 whereas the median lifespan of a disk is six years.52 Self-contained Information Retention Format (SIRF), discussed later, is a way to keep data on tape retrievable while transitioning to future technologies. Tape can be leveraged as part of the access tier through the use of an Active Archive. Active Archiving combines a high-performance primary disk tier with a secondary disk tier and then tape to create a single, fully integrated access point (see Big Data Archive in the next section). The Active Archive software automatically moves data between the tiers based on access, or the movement can be pre-programmed into the application. In the Active Archive process, new data can be copied to disk and tape simultaneously, meaning that backups happen as data is received. Active Archive also helps with a major restore: instead of restoring the entire data set, only the data that is currently needed must be recovered. Tape libraries such as those from Spectra Logic can leverage Active Archive technology, and Linear Tape File System (LTFS) for data transfer can become a major part of the Big Data infrastructure.53
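A tiering policy of the kind an Active Archive layer applies can be sketched as follows (the tier names and age thresholds are assumptions for illustration only):

```python
# Illustrative age/access-based tiering: hot data stays on primary disk,
# colder data moves to a secondary disk tier, the coldest data goes to tape.

import time

# tier rules ordered from hottest to coldest; thresholds are illustrative
TIER_RULES = [
    (30 * 86400,  "primary-disk"),     # accessed within the last 30 days
    (180 * 86400, "secondary-disk"),   # accessed within the last 180 days
]

def target_tier(last_access_epoch, now=None):
    """Return the storage tier a file should live on, based on access age."""
    age = (now if now is not None else time.time()) - last_access_epoch
    for max_age, tier in TIER_RULES:
        if age <= max_age:
            return tier
    return "tape-ltfs"                  # everything colder goes to tape

# Example: a file last accessed ~400 days ago would be placed on tape
print(target_tier(time.time() - 400 * 86400))   # tape-ltfs
```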

5.2 EMC Data Domain as a Solution for Big Data Backups

EMC Data Domain deduplication storage systems are well suited for Big Data backups, as they address the Big Data backup challenges48 (Table 4). Data Domain systems are purpose-built backup appliances used in conjunction with EMC or any third-party backup application or native backup utility (Fig. 7). Regardless of the Big Data system (EMC Greenplum, EMC Isilon, Teradata, Oracle Exadata, etc.), Data Domain systems offer advanced integration with effective backup tools for that environment to provide fast backup and recovery.

Backup Challenge                          Data Domain Solution
Data volume                               High-speed inline deduplication
Performance                               Up to 248 TB backed up in less than 8 hours (31 TB/hr)
Scale                                     Protects up to 65 PB of logical capacity in a single system
Data islands                              Simultaneously supports NFS, CIFS, VTL, DD Boost, and NDMP
Integration with major backup software    Qualified with leading backup and archive applications

Table 4: Data Domain Solutions for Big Data Backup Challenges

Data Domain provides high-speed inline deduplication, typically yielding a 10 to 30x reduction in the backup storage required, which enables Big Data backups to complete within backup windows. For example, the DD990 offers up to 31 TB/hour ingest and can back up 248 TB in less than 8 hours. In addition, Data Domain systems can protect up to 65 PB of logical capacity in a single system. Data Domain systems eliminate data islands by enabling backup of the entire environment (using NFS, CIFS, VTL, DD Boost, and/or NDMP) to a single Data Domain system.
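A quick sanity check of the quoted figures, and an illustration of how the deduplication ratio translates into physical capacity (the 10 to 30x range is the vendor's figure; actual ratios depend on the workload):

\[
\frac{248\ \text{TB}}{31\ \text{TB/hr}} = 8\ \text{hours}, \qquad
\frac{65\ \text{PB (logical)}}{10\text{--}30\times \text{ deduplication}} \approx 2.2\text{--}6.5\ \text{PB (physical)}
\]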

Figure 7: Big Data Backup Using Data Domain (Ref. 54)

6. Big Data Retention

6.1 General Considerations for Big Data Archiving

6.1.1 Backup vs. Archive

Archiving and backup are sometimes considered very similar data retention solutions, on the assumption that a backup can substitute for an archive. However, there are significant differences between these two categories of data management. Backup is a data protection solution for operational purposes, whereas the objectives of data archiving are information retrieval, regulatory compliance, and data footprint reduction.

Backups are a secondary copy of data, used for operational recovery to restore data that has been lost, corrupted, or destroyed. Backup retention periods are usually relatively short: days, weeks, or months. Conversely, a data archive is a primary copy of information; archived data is typically retained long term (years, decades, or forever) and maintained as a managed repository for analysis or compliance. Archiving provides data footprint reduction by deleting fixed content and duplicate data from primary storage. An archive is used to meet regulatory requirements by enforcing retention policies. Financial, healthcare, and other industries can have archive retention periods of 10-15 years or even up to 100 years.

6.1.2 Why Is Archiving Needed for Big Data?

The reasons for Big Data archiving are similar to those for traditional data archiving and include regulatory compliance (for example, X-rays are stored for periods of 75 years). Retention of 20 years or more is required by 70% of repositories.55

Many archiving technologies have advanced file deduplication and compression techniques, which significantly reduce the Big Data footprint. For example, a database archiving solution supports moving and converting inactive data from production databases to an optimized, highly compressed file archive. Archiving moves inactive data to a lower-cost infrastructure, providing cost reduction through storage tiering. As a result, archiving solutions make it possible to keep data easily accessible without the need to locate it on tape and restore it.

6.1.3 Pre-requisites for Implementing Big Data Archiving

1. Data classification. Data should be classified according to its business value (determined by the data's current position in the Big Data lifecycle) and its security requirements.
2. Review to ensure that regulatory requirements do not prevent the use of data deduplication techniques for streamlining both data and data access.
3. Storage tiering policies, which determine data access latency by placing data on different disk types and/or different storage systems, are developed based on the data classification. Because it reduces the data footprint, storage tiering also reduces energy cost and data center raised-floor use. A minimal sketch of how a classification outcome can map to storage tiers follows this list.
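The following sketch (the class names and tier names are assumed for illustration, not taken from any product configuration) shows how the classification outcome from step 1 can drive the tiering policy in step 3:

```python
# Illustrative mapping from a data-classification outcome to a storage tier.

CLASSIFICATION_TO_TIER = {
    # (business value, security requirement) -> storage tier
    ("hot",  "regulated"): "tier1-flash-encrypted",
    ("hot",  "standard"):  "tier1-flash",
    ("warm", "regulated"): "tier2-sas-encrypted",
    ("warm", "standard"):  "tier2-sas",
    ("cold", "regulated"): "archive-worm",
    ("cold", "standard"):  "archive-nl-sas-or-tape",
}

def placement(business_value: str, security: str) -> str:
    """Map a classified data set to the tier its archiving policy should use."""
    return CLASSIFICATION_TO_TIER[(business_value, security)]

print(placement("cold", "regulated"))   # archive-worm
```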

6.1.4 Specifics of Big Data Archiving

The differentiating feature of Big Data retention is the need to continually reanalyze the same machine-generated data sets. Data scientists need to identify patterns over timeframes of hours, days, months, and years. Companies are not just retaining Big Data; they reuse it.

While massively parallel processing (MPP) systems for Big Data analytics are designed to run complex large-scale analytics where performance is the prime objective, these systems are not suitable targets for long-term retention of Big Data content. Big Data archiving solutions should be cost effective: if it is cost prohibitive to retain the needed historical data, or too difficult to organize the data for timely ad hoc retrieval, companies will not be able to extract value from their collected information. The key question is whether the current storage environment can handle this new data explosion and the Big Data retention challenges that result from such growth.

The primary data management challenge associated with Big Data is to ensure that the data is retained (satisfying compliance needs at the lowest possible cost) while also keeping up with the unique and fast-evolving scaling requirements associated with new business analytics efforts. Companies that achieve this balance will increase efficiency, reduce data storage cost, and be in a far better position to capitalize on Big Data analytics.

Security is a challenge for Big Data archiving, as traditional database management systems support security policies that are quite granular, whereas Big Data applications generally have no such security controls. Companies including any sensitive data in Big Data operations must ensure that the data itself is secure and that the same data security policies that apply to the data when it exists in databases or files are also enforced in the Big Data archives (see also Section 3.3).

6.1.5 Archiving Solution Components

An archiving solution (for example, Symantec Enterprise Vault) typically includes:

Archiving software that automates the movement of data from primary storage to archival storage based on policies established in the data classification and rationalization process. Archive software can delete files at the end of their retention period.
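The policy-driven behavior described above can be sketched as follows (a hypothetical example; the paths, thresholds, and flat archive layout are illustrative, and real archiving products apply such policies through their own engines and catalogs):

```python
# Illustrative sweep: move files that match an archive policy to archival
# storage, and delete archived files whose retention period has expired.

import os, shutil, time

ARCHIVE_AFTER_DAYS = 365        # archive when unmodified for a year
RETENTION_DAYS = 7 * 365        # delete from the archive after seven years

def sweep(primary_dir: str, archive_dir: str, now=None):
    now = now if now is not None else time.time()
    # 1. Move aged files from primary storage into the archive (flat layout for simplicity)
    for root, _, files in os.walk(primary_dir):
        for name in files:
            path = os.path.join(root, name)
            if now - os.path.getmtime(path) > ARCHIVE_AFTER_DAYS * 86400:
                shutil.move(path, os.path.join(archive_dir, name))
    # 2. Expire archived files whose retention period has elapsed
    for name in os.listdir(archive_dir):
        path = os.path.join(archive_dir, name)
        if now - os.path.getmtime(path) > RETENTION_DAYS * 86400:
            os.remove(path)
```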

E-discovery software that uses archiving software as a base to provide advanced search features that enable users and administrators to quickly search all files, emails, texts, and other data related to a specific topic for use in data mining services or in response to legal inquiries. Some applications combine e-discovery and archive into purpose-built platforms such as an email-archive solution or document and records management solutions.

Physical media for data archiving are hard disks and tapes. As the IDC survey56 shows, companies are increasing their use of disk-based storage for long-term data retention. However, the growth of disk-based archives does not mean that tape-based archives are becoming relics. There are still financial and practical reasons to choose tape storage for Big Data archives:

• Longer media life expectancy
• Low cost per TB over time

As data storage technologies change over time, how can we be sure that archived data can still be retrieved 20-40 years from now? To address this challenge, the SNIA Long Term Retention Workgroup has developed the Self-contained Information Retention Format (SIRF).57 SIRF is a logical data format for a storage container that is self-describing (it can be interpreted by different systems), self-contained (all data needed for interpretation is in the container), and extensible, so it can meet future needs. SIRF therefore provides a way to collect all the information that will be needed to transition to new technologies in the future. Development of SIRF serialization for the Linear Tape File System (LTFS) makes it possible to provide economically scalable containers for long-term retention of Big Data.
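In the spirit of SIRF's design goals (self-describing, self-contained, extensible), a toy manifest might look like the following sketch; the field names are assumptions for illustration and do not follow the actual SIRF specification:

```python
# Illustrative "self-describing" container manifest: everything a future
# system needs to verify and interpret the stored objects travels with them.

import json, hashlib, time

def build_manifest(objects: dict) -> str:
    """objects: {logical_name: raw bytes} stored alongside this manifest."""
    catalog = {
        "container_format": "example-retention-container/1.0",
        "created_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "objects": [
            {
                "name": name,
                "size_bytes": len(data),
                "sha256": hashlib.sha256(data).hexdigest(),  # fixity for future verification
                "encoding": "application/octet-stream",
            }
            for name, data in objects.items()
        ],
        "extensions": {},   # room for future metadata without breaking old readers
    }
    return json.dumps(catalog, indent=2)

print(build_manifest({"xray-2014-001.dcm": b"...image bytes..."}))
```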

6.1.6 Checklist for Selecting Big Data Archiving Solution

Big Data active archival storage solutions should address the following major requirements:

Ability to rapidly and intelligently move data from primary storage into the active archive system. This ability ensures that the source application continues to run at maximum efficiency in terms of performance and reliability.

Flexibility in data ingestion capability. The amount of data to be archived according to the archiving policy can vary significantly from time to time, depending on the amount of activity. For example, financial trade monitoring systems can experience very high levels of activity due to a market event that in turn could trigger a sudden surge in the number of trades. The active archive target should be able to manage such workload variations and be able to ingest data at different rates as required.

Rapid, non-disruptive scalability of archival storage capacity and I/O performance. The solution should be capable of scaling out by non-disruptively adding more units. The environment may outgrow an individual archive module, but it should not outgrow the capacity and I/O performance of the archival platform as a whole. When the data size is in the hundreds of terabytes to multiple petabytes, migrating to a new platform should be a last-resort option.

Data portability for future technology changes. The selected solution should provide capability to transition to new technologies developed in the future – see the above discussion of SIRF.

6.2 Big Data Archiving with EMC Isilon

EMC offers a Big Data archive solution that is based on the Isilon scale-out NAS platform and meets the criteria for selecting archive solutions reviewed above. The solution can meet the large-scale data retention needs of enterprises, reduce costs, and help customers comply with governance or regulatory requirements.58 Isilon scale-out NAS delivers efficient utilization of capacity, reducing the overall storage footprint and delivering significant savings in capital expenditures (CapEx) and operating expenditures (OpEx) (see also Section 3.4). It provides more than 15 petabytes of capacity per cluster. The ability of Isilon NAS to scale quickly, easily, and without disrupting users makes it an attractive platform for large-scale data archives.

The following features make Isilon an excellent archive solution for Big Data:

• Performance: Isilon clusters provide instant access to large data archives with scalable performance of over 100 GB/s of throughput. Automatic load balancing of the archive servers' access across the cluster using SmartConnect maximizes performance and utilization of cluster resources (CPU, memory, and network). SmartConnect automatically balances incoming client connections across all available interfaces on the Isilon storage cluster.
• Automanagement and self-healing: Isilon clusters utilize an automatic management and provisioning capability that monitors system health and automatically corrects failures.
• Cost efficiency based on storage tiering: Isilon clusters provide automigration between working sets and archives using a policy-based approach available with Isilon SmartPools software. SmartPools is tightly integrated with Isilon OneFS, so all data, regardless of physical location, is in the same single file system. This means that SmartPools data movements are completely transparent to the end-user application, removing management, backup, and other issues related to stub-based tiering architectures such as those present in hierarchical storage management (HSM) implementations.
• Scalability: Data can be moved seamlessly and automatically as new nodes are introduced or as capacity is added. This enables very long-term archiving without the problems inherent in migrating to new systems.
• Flexibility: A single Isilon cluster supports the concurrent use of write-protected archival data alongside online, active content. WORM and non-WORM data can be mixed in one general-purpose system. Retention defaults can be set at the directory and file level. SmartLock® software adds a layer of "write once, read many" (WORM) data protection and security, which protects archived data against accidental, premature, or malicious alteration or deletion.
• Remote replication: Isilon SnapshotIQ™ and Isilon SyncIQ® can be leveraged to efficiently replicate the archive among multiple remote sites for business continuity and disaster recovery.

6.3 RainStor and Dell Archive Solution for Big Data

Dell's Big Data retention solution combines the Dell DX Object Storage Platform with RainStor's specialized database and compression technology to significantly reduce the cost of retaining Big Data through extreme data reduction.59,60 RainStor provides a Big Data database that runs natively on Hadoop. As data volumes grow in Hadoop, the effective capacity of each node can be increased thanks to RainStor's compression and deduplication capabilities, allowing for significant reductions in storage footprint – as much as 20-40 times. RainStor's built-in data deduplication and compression speed up query and analysis by as much as 10-100 times and provide high-speed data loads at rates of up to multiple billions of records per day.
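To put the quoted reduction ratio in perspective, an illustrative calculation assuming the vendor's 20-40x figure holds for a given workload:

\[
1\ \text{PB of raw data} \;\Rightarrow\; \frac{1000\ \text{TB}}{20\text{--}40} \approx 25\text{--}50\ \text{TB stored}
\]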

The RainStor database, which provides online data retention at massive scale, can be deployed on any combination of Dell servers and storage, on premises or in a cloud configuration. When paired with the Dell DX Object Storage Platform, the solution provides a single system for retaining structured, semi-structured, and unstructured data across various data sources, formats, and types, providing cost savings.61

7. Conclusions

The 3V+V characteristics of Big Data have required the development of new types of storage architecture. Big Data storage architectures span a broad spectrum, from shared-nothing DAS to shared storage (scale-out NAS, object-based storage) to hybrid and converged storage. As Big Data processing is part of the new integrated enterprise data warehouse (EDW) environment, an ideal storage solution should support all the storage functionality needed for this integrated environment, including data discovery, capture, processing, load, analysis, data protection, and retention. Such an integrated EDW environment assumes co-existence and symbiosis of traditional EDW storage architectures with new, evolving storage technologies that use distributed massively parallel processing architectures to parse large data sets. As the business value of Big Data changes over time, it is important to implement Big Data lifecycle management. This controls storage cost by stemming data growth with data retention and archiving policies. Along with Big Data protection (backup) solutions, it also addresses the security and regulatory requirements for Big Data.

The scope of the Big Data project determines the type of storage solution. It may be implemented as a Big Data appliance integrating server, networking, and storage resources into a single enclosure and running analytics software, or as a large multi-system environment storing and processing hundreds of terabytes or tens of petabytes of data. In all these cases, incorporation of Big Data architectures into the existing EDW environment should comply with the company's policies for data management and data governance.

As discussed, cost is not the primary factor in Big Data solution selection, as a storage solution can turn Big Data into actionable, time-to-value business knowledge providing an impressive ROI. However, Big Data does not necessarily mean big budget. To be able to choose the best solutions, we as users have to review with vendors – both established EDW vendors and Big Data startups offering emerging technologies – their development roadmaps for Big Data and Big Data Analytics so that we can design our own cost-effective Big Data service strategy.

8. References

1. S. Rosenbush and M. Totty. How Big Data Is Changing the Whole Equation for Business. Wall Street Journal (http://online.wsj.com), March 10, 2013.
2. K. Krishnan. Data Warehousing in the Age of Big Data. Morgan Kaufmann Publishers, 2013.
3. M. A. Beyer and D. Laney. The Importance of 'Big Data': A Definition. Gartner, 2012.
4. Bringing Smarter Computing to Big Data. IBM, 2011.
5. Gartner Inc. Gartner Says the Internet of Things Installed Base Will Grow to 26 Billion Units By 2020. Press Release, December 12, 2013.
6. New Intelligent Enterprise. The MIT Sloan Management Review and the IBM Institute for Business Value, 2011.
7. M. Gloukhovtsev. Does the Advent of Cloud Storage Mean “Creation by Destruction” of Traditional Storage? EMC Proven Professional Knowledge Sharing, 2013.
8. Data Science and Big Data Analytics. EMC Educational Course. EMC, 2012.
9. http://www.gopivotal.com/press-center/11122013-pivotal-one
10. M. Crutcher. Big and Fast Data: The Path To New Business Value. EMC World, 2013.
11. Ditch the Disk: Designing a High-Performance In-Memory Architecture. Terracotta, 2013.
12. SAP HANA Storage Requirements. White Paper. SAP, 2013.
13. Oracle Exalytics In-Memory Machine. Oracle White Paper. Oracle, 2013.
14. Hybrid Storage. White Paper EB-6743. Teradata, 2013.
15. SFA12K Product Family. Datasheet. DataDirect Networks, 2012.
16. M. Murugan. Big Data: A Storage Systems Perspective. SNIA Analytics and Big Data Summit, 2013.
17. The Data Lake: Turning Big Data into Opportunity. Booz Allen Hamilton, 2012.
18. http://www.gopivotal.com/products/pivotal-hd
19. http://www.pentahobigdata.com/ecosystem/platforms/hadoop
20. http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html
21. S. Childs and M. Adrian. Big Data Challenges for the IT Infrastructure Team. Gartner, 2012.
22. http://www.nutanix.com/products.html; Hadoop on Nutanix. Reference Architecture. Nutanix, 2012.
23. http://wiki.apache.org/hadoop/ProjectDescription
24. Big Data Technologies for Near-Real-Time Results. White Paper. Intel, 2013.
25. J. Webster. Storage for Hadoop: A Four-Stage Model. SNW, October 2012.

26. S. Fineberg. Big Data Storage Options for Hadoop. SNW, October 2012.
27. https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces48
28. J. G. Kobielus. The Forrester Wave: Enterprise Hadoop Solutions. Forrester, 2012.
29. S. K. Krishnamurthy. Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Insights. EMC World, 2013.
30. Cisco Common Platform Architecture for Big Data. Solution Brief. Cisco, 2013.
31. B. Yellin. Leveraging Big Data to Battle Cyber Threats – A New Security Paradigm? EMC Proven Professional Knowledge Sharing, 2013.
32. J. Shih. Hadoop Security Overview. Hadoop & Big Data Technology Conference, 2012.
33. Fast, Low-Overhead Encryption for Apache Hadoop. Solution Brief. Intel, 2013.
34. https://github.com/intel-hadoop/project-rhino
35. EMC Isilon Scale-out NAS for Enterprise Big Data and Hadoop. EMC Forum, 2013.
36. EMC Big Data Storage and Analytics Solution. Solution Overview. EMC, 2012.
37. http://www.netapp.com/us/solutions/big-data/hadoop.aspx
38. Virtualized Data Center and Cloud Infrastructure. EMC Educational Course for Cloud Architects. EMC, 2011.
39. http://www.cleversafe.com/overview/how-cleversafe-works
40. https://www.ddn.com/products
41. Atmos Cloud Storage Platform for Big Data in Cloud. EMC World, 2012.
42. D. Robb. EMC World Continues Focus on Big Data, Cloud and Flash. Infostor, May 2011.
43. S. Nanniyur. Fabric Architecture: A Big Idea for the Big Data Infrastructure. SNW, April 2012.
44. http://www.vmware.com/products/big-data-extensions
45. Virtualized Hadoop Performance with VMware vSphere 5.1. Technical White Paper. VMware, 2013.
46. J. Yang and D. Baskett. Virtualize Big Data to Make the Elephant Dance. EMC World, 2013.
47. http://cloud.asperasoft.com/big-data-cloud/
48. S. Manjrekar and G. Maxwell. Big Data Backup Strategies with Data Domain for EMC Greenplum, EMC Isilon, Teradata & Oracle Exadata. EMC World, 2013.
49. Better Backup for Big Data. Solution Overview. Symantec, 2012.
50. CommVault Simpana OnePass™ Feature. Datasheet. CommVault, 2012.

51. The lifespan of data stored on LTO tape is usually quoted as 30 years. However, tape is extremely sensitive to storage conditions, and the life expectancy numbers cited by tape manufacturers assume ideal storage conditions.
52. B. Beach. How long do disk drives last? http://blog.backblaze.com/2013/11/12/how-long-do-disk-drives-last
53. http://www.spectralogic.com/
54. EMC Backup Meets Big Data. EMC World, 2012.
55. SNIA – 100 Year Archive Requirement Survey. 2007.
56. Adoption Patterns of Disk-Based Backup. IDC Survey. IDC, 2010.
57. D. Pease. Long Term Retention of Big Data. SNIA Analytics and Big Data Summit, 2012.
58. Archive Solutions for the Enterprise with EMC Isilon Scale-out NAS. White Paper. EMC, 2012.
59. RainStor for Hadoop. Solution Brief. RainStor, 2013.
60. M. Cusack. Making the Most of Hadoop with Optimized Data Compression. SNW, 2012.
61. R. L. Villars and M. Amaldas. Rethinking Your Data Retention Strategy to Better Exploit the Big Data Explosion. IDC, 2011.

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.
