Does Big Data Mean Big Storage?


Mikhail Gloukhovtsev
Sr. Cloud Solutions Architect
Orange Business Services

Table of Contents

1. Introduction
2. Types of Storage Architecture for Big Data
   2.1 Storage Requirements for Big Data: Batch and Real-Time Processing
   2.2 Integration of Big Data Ecosystem with Traditional Enterprise Data Warehouse
   2.3 Data Lake
   2.4 SMAQ Stack
   2.5 Big Data Storage Access Patterns
   2.6 Taxonomy of Storage Architectures for Big Data
   2.7 Selection of Storage Solutions for Big Data
3. Hadoop Framework
   3.1 Hadoop Architecture and Storage Options
   3.2 Enterprise-class Hadoop Distributions
   3.3 Big Data Storage and Security
   3.4 EMC Isilon Storage for Big Data
   3.5 EMC Greenplum Distributed Computing Appliance (DCA)
   3.6 NetApp Storage for Hadoop
   3.7 Object-Based Storage for Big Data
       3.7.1 Why Is Object-Based Storage for Big Data Gaining Popularity?
       3.7.2 EMC Atmos
   3.8 Fabric Storage for Big Data: SAN Functionality at DAS Pricing
   3.9 Virtualization of Hadoop
4. Cloud Computing and Big Data
5. Big Data Backups
   5.1 Challenges of Big Data Backups and How They Can Be Addressed
   5.2 EMC Data Domain as a Solution for Big Data Backups
6. Big Data Retention
   6.1 General Considerations for Big Data Archiving
       6.1.1 Backup vs. Archive
       6.1.2 Why Is Archiving Needed for Big Data?
       6.1.3 Pre-requisites for Implementing Big Data Archiving
       6.1.4 Specifics of Big Data Archiving
       6.1.5 Archiving Solution Components
       6.1.6 Checklist for Selecting a Big Data Archiving Solution
   6.2 Big Data Archiving with EMC Isilon
   6.3 RainStor and Dell Archive Solution for Big Data
7. Conclusions
8. References

Disclaimer: The views, processes, or methodologies published in this article are those of the author. They do not necessarily reflect the views, processes, or methodologies of EMC Corporation or Orange Business Services (my employer).

1. Introduction

Big Data has become a buzzword today, and we hear about it from early morning – reading a newspaper that tells us "How Big Data Is Changing the Whole Equation for Business"1 – through our entire day. A search for "big data" on Google returned about 2,030,000,000 results in December 2013. So what is Big Data?

According to Krish Krishnan,2 the so-called three V's definition of Big Data that became popular in the industry was first suggested by Doug Laney in a research report published by META Group (now Gartner) in 2001. In a more recent report,3 Doug Laney and Mark Beyer define Big Data as follows: "'Big Data' is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." Let us review these characteristics in more detail.

1. Volume of data is huge (for instance, billions of rows and millions of columns). People create digital data every day by using mobile devices and social media. Data defined as Big Data includes machine-generated data from sensor networks, nuclear plants, and X-ray and scanning devices, as well as consumer-driven data from social media. According to IBM, as of 2012, 2.5 exabytes of data were created every day, and 90% of the data in the world had been created in the preceding two years alone.4 This growth is being accelerated by the Internet of Things (IoT), defined as the network of physical objects that contain embedded technology to communicate and interact with their internal states or the external environment (IoT excludes PCs, tablets, and smartphones). According to Gartner, the IoT will grow to 26 billion installed units by 2020, an almost 30-fold increase from 0.9 billion in 2009.5

2. Velocity of new data creation and processing. Velocity means both how fast data is being produced and how fast the data must be processed to meet demand.
In the case of Big Data, data streams in continuously, and time-to-value is achieved only when data capture, data preparation, and processing are fast. This requirement becomes even more challenging when we take into account that the rate of data generation changes and data sizes vary.

3. Variety of data. In addition to traditional structured data, the data types include semi-structured data (for example, XML files), quasi-structured data (for example, clickstream strings), and unstructured data; a brief illustrative sketch of these categories appears at the end of this section.

A misconception that large volume is the key characteristic defining Big Data can result in the failure of a Big Data project unless the project also addresses the variety, velocity, and complexity of the data, which are becoming its leading features. What is seen as a large data volume today can become the new normal data size in a year or two.

A fourth V – Veracity – is frequently added to this definition of Big Data. Data veracity deals with uncertain or imprecise data. How accurate is the data in predicting business value? Does Big Data analytics give meaningful results that are valuable to the business? Data accuracy must be verifiable. Simply retaining more and more data of various types does not create any business advantage unless the company has developed a Big Data strategy for extracting business information from Big Data sets. Business benefits are frequently greater when the variety of the data is addressed rather than just its volume. Business value can also be created by combining new Big Data types with existing information assets, which results in even greater data-type diversity. According to research done by MIT and the IBM Institute for Business Value,6 organizations applying analytics to create a competitive advantage within their markets or industries are more than twice as likely to substantially outperform their peers.

The time-to-value requirement calls for innovations in data processing, and those innovations are challenged by Big Data complexity. Indeed, in addition to the great variety of Big Data types, the combination of different data types, each presenting different challenges and requiring different analytical methods to generate business value, makes data management more complex. Complexity, combined with the increasing volume of unstructured data (80%–90% of the data in existence is unstructured), means that different standards, data processing methods, and storage formats can exist for each asset type and structure. The level of complexity and/or data size of Big Data has led to yet another definition: data that cannot be efficiently managed using only traditional data-capture technology and processes.
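To make the variety dimension concrete, the short Python sketch below contrasts the four categories of data structure mentioned above. It is an illustrative example only: the field names, sample values, and URL are hypothetical and are not taken from any particular system. A structured CSV row maps directly onto a relational table; a semi-structured XML fragment carries its schema in its tags; a quasi-structured clickstream URL must be parsed before it yields usable fields; and unstructured text exposes no schema at all without further analytics.

import csv
import io
import xml.etree.ElementTree as ET
from urllib.parse import urlparse, parse_qs

# Structured: a fixed schema; every value lands in a predefined column.
structured_row = dict(next(csv.DictReader(io.StringIO(
    "customer_id,order_total\n1001,25.40"
))))

# Semi-structured: the schema travels with the data as XML tags.
order = ET.fromstring(
    "<order><customer_id>1001</customer_id><total>25.40</total></order>"
)
semi_structured = {child.tag: child.text for child in order}

# Quasi-structured: there is some structure (a URL), but it has to be
# teased out by parsing before it becomes queryable fields.
click = "https://shop.example.com/search?q=nas+storage&page=2"
quasi_structured = {k: v[0] for k, v in parse_qs(urlparse(click).query).items()}

# Unstructured: free text; no schema can be extracted without analytics.
unstructured = "Customer called to say the nightly backup window keeps overrunning."

print(structured_row)    # {'customer_id': '1001', 'order_total': '25.40'}
print(semi_structured)   # {'customer_id': '1001', 'total': '25.40'}
print(quasi_structured)  # {'q': 'nas storage', 'page': '2'}

The point of the sketch is only that each category demands a different amount of processing before it becomes analyzable, which is why variety and complexity, not just volume, drive the storage and processing choices discussed in the rest of this paper.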