Dell EMC Isilon and Cloudera Reference Architecture and Performance Results

Total Page:16

File Type:pdf, Size:1020Kb

Dell EMC Isilon and Cloudera Reference Architecture and Performance Results Reference Architecture Dell EMC Isilon and Cloudera Reference Architecture and Performance Results Abstract This document is a high-level design, performance results, and best-practices guide for deploying Cloudera Enterprise Distribution on bare-metal infrastructure with Dell EMC’s Isilon scale-out NAS solution as a shared storage backend. August 2020 H18523 Revisions Revisions Date Description August 2020 Initial release Acknowledgments Author: Kirankumar Bhusanurmath, Analytics Solutions Architect, Dell EMC Support: The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose. Use, copying, and distribution of any software described in this publication requires an applicable software license. Copyright © 2020 Dell Inc. or its subsidiaries. All Rights Reserved. Dell Technologies, Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners. [9/22/2020] [Reference Architecture] [H18523] 2 Dell EMC Isilon and Cloudera Reference Architecture and Performance Results | H18523 Table of contents Table of contents Revisions ..................................................................................................................................................................... 2 Acknowledgments ........................................................................................................................................................ 2 Table of contents ......................................................................................................................................................... 3 Executive summary ...................................................................................................................................................... 5 Disclaimer .................................................................................................................................................................... 5 Audience and scope .................................................................................................................................................... 6 Glossary of terms ......................................................................................................................................................... 6 1 Dell EMC Isilon distributed storage array for HDFS and bare-metal nodes as compute nodes ................................ 8 1.1 Physical cluster component list ..................................................................................................................... 9 1.2 Logical cluster topology ................................................................................................................................ 9 1.3 Sizing recommendation for physical nodes ................................................................................................. 10 1.4 Sizing recommendation for storage allocation ............................................................................................. 11 1.5 Supportability and compatibility matrix ........................................................................................................ 11 2 Performance testing and platform positioning ....................................................................................................... 13 2.1 DFSIO testing with Dell EMC Isilon F800 .................................................................................................... 13 2.1.1 Overview .................................................................................................................................................... 13 2.1.2 Test environment setup .............................................................................................................................. 13 2.1.3 Testing with DFSIO .................................................................................................................................... 13 2.1.4 Hadoop cluster configurations .................................................................................................................... 14 2.1.5 Performance results ................................................................................................................................... 14 2.2 Dell EMC Isilon model for Hadoop use cases ............................................................................................. 14 3 Platform tuning recommendations ........................................................................................................................ 16 3.1 CPU ........................................................................................................................................................... 16 3.1.1 CPU BIOS settings ..................................................................................................................................... 16 3.1.2 CPUfreq governor ...................................................................................................................................... 16 3.2 Memory ...................................................................................................................................................... 17 3.2.1 Minimize anonymous page faults ................................................................................................................ 17 3.2.2 Disable transparent huge-page compaction ................................................................................................ 17 3.2.3 Disable transparent huge-page defragmentation......................................................................................... 17 3.3 Network...................................................................................................................................................... 17 3.3.1 OneFS TCP tuning ..................................................................................................................................... 17 3.3.2 Compute nodes network tuning .................................................................................................................. 18 3.3.3 Verify NIC advanced features ..................................................................................................................... 19 3.3.4 NIC ring buffer configurations ..................................................................................................................... 19 3 Dell EMC Isilon and Cloudera Reference Architecture and Performance Results | H18523 Table of contents 3.4 Storage ...................................................................................................................................................... 19 3.4.1 Disk/FS mount options ............................................................................................................................... 19 3.4.2 FS Creation Options ................................................................................................................................... 20 A Technical support and resources ......................................................................................................................... 21 A.1 Related resources ...................................................................................................................................... 21 4 Dell EMC Isilon and Cloudera Reference Architecture and Performance Results | H18523 Executive summary Executive summary Dell EMC works closely with Cloudera, understanding the needs of Apache Hadoop customers, validating hardware and software-based solutions, empowering organizations with deeper insights and enhanced data- driven decision making by using the right infrastructure for the right data. Dell EMC Isilon and Cloudera - Combines a powerful yet simple, highly efficient, and massively scalable storage platform with integrated support for Hadoop analytics. Dell EMC Isilon is the first, and only, scale-out NAS platform to incorporate native support for the HDFS layer. With your unstructured data on Isilon, these capabilities allow you to quickly implement an in-place data analytics approach and avoid unnecessary capital expenditures, increased operational costs, and time-consuming replication of your big data to a separate infrastructure. This document describes the high-level design, performance results, and best-practices guide for deploying Cloudera Enterprise Distribution on bare-metal infrastructure with Dell EMC’s Isilon scale-out NAS solution as a shared storage backend. Disclaimer The documentation is and contains Cloudera proprietary information protected by copyright and other intellectual property rights. No license under copyright or any other intellectual property right is granted herein. Copyright information for Cloudera software may be found within the documentation accompanying each component in a particular release. Cloudera software includes software from various open source or other third-party projects, and may be released under the Apache Software License 2.0 (“ASLv2”), the Affero General Public License version 3 (AGPLv3), or other license terms. Other software included may be released under the terms of alternative open-source licenses. Review the license and notice files accompanying the software for
Recommended publications
  • Lenovo Big Data Reference Design for Cloudera Data Platform on Thinksystem Servers
    Lenovo Big Data Reference Design for Cloudera Data Platform on ThinkSystem Servers Last update: 30 June 2021 Version 2.0 Reference architecture for Solution based on the Cloudera Data Platform with ThinkSystem servers Apache Hadoop and Apache Spark Deployment considerations for Solution design matched to scalable racks including detailed Cloudera Data Platform architecture validated bills of material Xiaotong Jiang Xifa Chen Ajay Dholakia Lenovo Big Data Reference Design for Cloudera Data Platform on ThinkSystem Servers 1 Table of Contents 1 Introduction ............................................................................................... 4 2 Business problem and business value ................................................... 5 3 Requirements ............................................................................................ 7 Functional Requirements ......................................................................................... 7 Non-functional Requirements................................................................................... 7 4 Architectural Overview ............................................................................. 8 Cloudera Data Platform ........................................................................................... 8 CDP Private Cloud ................................................................................................... 8 5 Component Model .................................................................................. 10 Cloudera Components ..........................................................................................
    [Show full text]
  • Apache Spot: a More Effective Approach to Cyber Security
    SOLUTION BRIEF Data Center: Big Data and Analytics Security Apache Spot: A More Effective Approach to Cyber Security Apache Spot uses machine learning over billions of network events to discover and speed remediation of attacks to minutes instead of months—streamlining resources and cutting risk exposure Executive Summary This solution brief describes how to solve business challenges No industry is immune to cybercrime. The global “hard” cost of cybercrime has and enable digital transformation climbed to USD 450 billion annually; the median cost of a cyberattack has grown by through investment in innovative approximately 200 percent over the last five years,1 and the reputational damage can technologies. be even greater. Intel is collaborating with cyber security industry leaders such as If you are responsible for … Accenture, eBay, Cloudwick, Dell, HP, Cloudera, McAfee, and many others to develop • Business strategy: You will better big data solutions with advanced machine learning to transform how security threats understand how a cyber security are discovered, analyzed, and corrected. solution will enable you to meet your business outcomes. Apache Spot is a powerful data aggregation tool that departs from deterministic, • Technology decisions: You will learn signature-based threat detection methods by collecting disparate data sources into a how a cyber security solution works data lake where self-learning algorithms use pattern matching from machine learning to deliver IT and business value. and streaming analytics to find patterns of outlier network behavior. Companies use Apache Spot to analyze billions of events per day—an order of magnitude greater than legacy security information and event management (SIEM) systems.
    [Show full text]
  • Intel ISG Caesars Entertainment Case Study
    Case Study Intel® Xeon® Processor E5 Family Big Data Entertainment Doubling Down on Entertainment Marketing with Intel® Xeon® Processors Caesars Entertainment uses Intel® technology to reach a new demographic group and process more than 3 million records per hour Technology leadership is one of the reasons Caesars Entertainment has become the world’s most geographically diversifed casino-entertainment company, with resorts and casinos on three continents. The organization is well known for its innovative application of data to improve the customer experience. But with younger customers spending more money on non-gaming activities, Caesars needed a system capable of handling a new, expanded set of customer data for hotels, shows, and shopping venues. Challenges • Improve customer segmentation for more efective marketing campaigns • Expand analysis to include unstructured and semi-structured data • Accelerate processing for analytics and marketing campaign management Solution • Implement a new environment using Cloudera Enterprise, Cloudera’s Distribution Including Apache Hadoop* (CDH*) running on the Intel® Xeon® processor E5 family Technology Results • Reduces processing time for key jobs from six hours to 45 minutes • Expands capacity to more than 3 million records processed per hour • Enables fne-grained segmentation to improve marketing results • Improves security for meeting Payment Card Industry (PCI) and other security standards Business Value • Faster, Better analysis of all data types for creating and delivering “We used the Intel® Xeon® personalized marketing processor E5 family across • Ability to reach a new generation of customers and win their loyalty, ultimately driving revenue our new environment Creating a New Engine for Marketing Success because it gives us At Caesars Entertainment, demographic and behavioral data is vital to the the performance and organization’s successful marketing.
    [Show full text]
  • Wandisco Fusion® Google Cloud
    FUSION WANDISCO FUSION® GOOGLE CLOUD Seamless cloud migration and hybrid cloud WANdisco Fusion is the only solution that seamlessly moves active data to Google Cloud as it changes in on- premises local and NFS mounted file systems, Hadoop GOOGLE CLOUD clusters and other cloud environments. Our game- changing, patented technology captures every change and supports hybrid cloud use cases for on-demand burst-out processing, data archiving and offsite disaster recovery with no downtime and no data loss. No migration downtime FUSION Google Cloud and on-premises environments operate in parallel during migration, allowing data, applications and users to move in phases without disrupting normal operations. FUSION FUSION Guaranteed data consistency Google Cloud, on-premises storage and other cloud environments stay continuously in sync with LAN-speed read/write access to the same data at every location. HADOOP Automatic recovery LOCAL AND Recovery is automatic after planned or unplanned NFS MOUNTED FILE SYSTEMS network or hardware outages in both on-premises and Google Cloud environments. Easy and intuitive Simple setup in both on-premises and cloud environments using the Google Cloud Platform Templates gets the software up and running in minutes. Its intuitive admin console for monitoring, scheduling and audit further simplifies the process. Copyright © 2019 WANdisco, Inc. All rights reserved. Supported environments Hadoop Cloud • Amazon EMR • Amazon • Cloudera CDH • Alibaba Cloud • Google Cloud • Google Cloud™ Dataproc • Microsoft Azure® • Hortonworks
    [Show full text]
  • Ready Solutions for Data Analytics Cloudera Hadoop 6.1 Architecture Guide
    Ready Solutions for Data Analytics Cloudera Hadoop 6.1 Architecture Guide April 2019 H17614.1 Abstract This reference architecture guide describes the architectural recommendations for Cloudera Hadoop 6.1 software on Dell EMC PowerEdge servers and Dell EMC Networking switches. Copyright © 2017-2019 Dell Inc. or its subsidiaries. All rights reserved. Published April 2019 Dell believes the information in this publication is accurate as of its publication date. The information is subject to change without notice. THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS-IS.“ DELL MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. USE, COPYING, AND DISTRIBUTION OF ANY DELL SOFTWARE DESCRIBED IN THIS PUBLICATION REQUIRES AN APPLICABLE SOFTWARE LICENSE. Dell, EMC, and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be the property of their respective owners. Published in the USA. Dell EMC Hopkinton, Massachusetts 01748-9103 1-508-435-1000 In North America 1-866-464-7381 www.DellEMC.com 2 Ready Architecture for Cloudera Hadoop 6.1 CONTENTS Figures 5 Tables 7 Chapter 1 Executive Summary 9 Document purpose......................................................................................10 Audience..................................................................................................... 10 Hadoop overview.......................................................................................
    [Show full text]
  • Why Cloudera the Platform for Production Success
    Why Cloudera The Platform for Production Success © Cloudera, Inc. All rights reserved. 1 Why Cloudera? We deliver long-term production success with enterprise Hadoop. Open Source Innovation Enterprise Security No one knows Hadoop better than Cloudera. Meet compliance requirements and reduce Cloudera leads development of enterprise risk exposure from storing sensitive data. Hadoop and offers the best support, training, and services. Data Governance Powerful Enterprise Tools Enable compliance and maximize analyst productivity. Cloudera extends open source Hadoop with capabilities required by the largest enterprises. Complete Management Ecosystem Deliver optimum system utilization and Cloudera partners with industry leaders to ensure meet SLA commitments, on-premises or Hadoop works with the platforms, tools, and in the cloud, with minimum effort. integrators our customers rely on. © Cloudera, Inc. All rights reserved. 2 Our Platform © Cloudera, Inc. All rights reserved. 3 Cloudera is Built for Production Success A modern data platform plus what the enterprise requires. Hadoop delivers: Process Discover Model Serve • One place for unlimited data • Unified, multi-framework data access Security and Administration Unlimited Storage Cloudera delivers: • Enterprise Security • Data Governance Deployment On-Premises Public Cloud Appliances Private Cloud • Complete Management Flexibility Engineered Systems Hybrid Cloud • And more… © Cloudera, Inc. All rights reserved. 4 Industrial Multi-Workload Performance Multiple big data opportunities in one
    [Show full text]
  • Cloudera Enterprise with IBM the Modern Platform for Machine Learning and Analytics, Optimized for the Cloud One Platform
    Cloudera Enterprise with IBM The modern platform for machine learning and analytics, optimized for the cloud One platform. Many applications. Many of the world’s largest companies rely on Cloudera’s multifunction, multi-environment platform to provide the – Data science and engineering foundation for their critical business value drivers—growing Process, develop and serve predictive models. business, connecting products and services, and protecting – Data warehouse business. Find out what makes Cloudera Enterprise with IBM The modern data warehouse for today, tomorrow different from other data platforms. and beyond. – Operational database Real-time insights for modern data-driven business. Enterprise grade The scale and performance required for today’s modern Deploy and run essentially anywhere data workloads meet the security and governance demands of today’s IT. Cloudera’s modern platform is designed to – Multicloud provisioning make it easy to bring more users—thousands of them—to Deploy and manage Cloudera Enterprise across petabytes of diverse data and provides industry-leading Amazon Web Services (AWS), Google Cloud Platform, engines to process and query data, as well as develop and Microsoft Azure and private networks. serve models quickly. The platform also provides several – High-performance analytics layers of fine-grained security and complete audibility for Run your analytic tool of choice against cloud-native companies to prevent unauthorized data access and object stores like Amazon S3. demonstrate accountability for actions taken. – Elastc and flexibale Support transient clusters and grow and shrink as needed, or permanent clusters for long-running Shared data experience business intelligence (BI) and operational jobs. Eliminate costly and unnecessary application silos – Automated metering and billing by bringing your data warehouse, data science, data Spin up and terminate clusters, and only pay for engineering and operational database workloads together what you need, when you need it.
    [Show full text]
  • Dell Emc Reference Architecture
    Dell Emc Reference Architecture Which Charlie medals so bad that Hugh electroplating her clandestineness? Distichal Hanford humoristphosphatize so advertentlythat arcuses that meters Felicio automatically gypped very and harshly. raker diagonally. Clotty Zackariah clunks her Informa plc and deployment of dell emc isilon deployments using another interface depending on each of the complexity inherent in the size of a variety of dell From text in larger number, dell emc reference architecture does not include primary research report also helps to seamlessly redistribute load generator screen appears, expressed are used. ESXi with much higher rates of CPU utilization. Please try again later. Meant to use of dell architecture, which helps the marketers to find latest market dynamics, and open the Exchange admin center. Please dim your print and prudent again. Project management service that cloudera architecture to insights without face to stringent compliance and impala, freeing them from routine maintenance. More popular open source raise the cloudera hadoop cluster setup, best performance will be achieved if your VMs receive their vcpu allotment from multiple single physical NUMA node. Dell cuts jobs are listed in to dismantle the systems. Was this manual useful for you? They built the Millennium Falcon, clones, please provide you with cloudera data and intel to maintain. Desktop images are organized in a Machine Catalog and within that catalog there are a number of options available to create and deploy virtual desktops: Random: Virtual desktops are assigned randomly as users connect. In the page; no event shall have little impact to knowledge gaps between development engineer rick biedler engineering at an dell emc and other trademarks are the users.
    [Show full text]
  • TCS Global Trend Study: Part I
    Getting Smarter by the Day: How AI is Elevating the Performance of Global Companies TCS Global Trend Study: Part I Contents Executive Summary and Key Findings 4 AI in Action at Big Companies in Four Regions of the World 12 Case Study: AI at AP 24 AI Spreads Beyond IT to Other Functions 29 Case Study: At Microsoft, AI is a Big, Big Deal 41 How Leading Companies Use AI 45 Case Study: Cloudera’s Formula 55 Research Approach and Demographics 58 Executive Summary and Key Findings AI Goes Mainstream After about 50 years of largely languishing in technology labs and the pages of science fiction books, today artificial intelligence (AI) has taken center stage and is under the bright lights. Barely a day goes by without dozens of new magazine and newspaper articles, blog posts, TV stories, LinkedIn columns, and tweets about cognitive technologies. It shouldn’t be at all surprising. The impact of AI has been very upfront and highly personal these days. The technology is beginning to reshape the jobs people hold, the cars they drive, the medical procedures they undertake, and the games they play. Nearly 20 years after IBM’s AI computer system beat Russia’s chess champion, AlphaGo, an AI program created by search engine giant Google defeated a grandmaster of the popular Southeast Asia board game, Go, in 2016. Go was considered to be a far more complex game for a computer to beat than chess.1 A robotic surgeon stitched up a pig’s intestines better than the doctors assigned the same job.2 Even articles in the business pages are being written by AI software, as you’ll see in our case study on the global news service, the Associated Press.
    [Show full text]
  • Technology Report 1H 2018
    16 70,000 45 160,000 40 14 60,000 140,000 35 12 120,000 50,000 TEV ($mm) TEV ($mm) 30 10 100,000 40,000 25 8 80,000 30,000 20 6 60,000 15 20,000 Transaction Count Transaction Count 4 40,000 10 10,000 2 5 20,000 0 0 0 0 Date TEV Target Acquirer(s) Deal Overview Announced (mm) BMC Software is a global leader in digital enterprise software solutions. The acquisition of 5/29/18 BMC continues KKR's long record of supporting B2B software companies, such as its current $8,300 portfolio of Epicor and Calabrio. Microsoft agreed to acquire GitHub in a push to empower developers, boost enterprise use of 6/4/18 GitHub, and continue to build out its developer tools. More than 28 million developers use 7,500 GitHub as a community to build, collaborate and share ideas. Athenahealth provides cloud-based business services and mobile applications for medical groups and health systems. Elliott has made an activist takeover offer, believing that it could 5/7/18 6,900 help provide the operational change necessary for athenahealth to fundamentally change the Healthcare IT industry. Partners Group will take a majority stake in software development firm GlobalLogic. 5/21/18 GlobalLogic helps clients construct innovative digital products that enhance customer 2,000 engagement. Web.com is a global provider of full range internet services and online marketing tools. SIRIS 6/21/18 Capital wants to acquire Web.com believing that it can add the strategic and operational 1,885 expertise to help grow Web.com.
    [Show full text]
  • A Dozen Ways Insurers Can Leverage Big Data for Business Value 290413
    White Paper A Dozen Ways Insurers Can Leverage Big Data for Business Value The amount of structured and unstructured, internal and external data coursing into every organization is voluminous and increasing exponentially. Data today comes from disparate sources that include customer interactions across channels such as call centers, telematics devices, social media, agent conversations, smart phones, emails, faxes, police reports, day-to-day business activities, and others. Gartner’s 2011 Top 10 list of IT Infrastructure and Operations Trends predicts an 800 percent growth in data over the next five years 1. Organizations actually process only about 10 to 15 percent of the available data, almost all of it in structured form. While managing this overwhelming data flow can be challenging, insurers can reap very real benefits like increased productivity, improved competitive advantage and enhanced customer experience by capturing, storing, aggregating, and eventually analyzing the data. This value does not come from simply managing big data, but rather, from harnessing the actionable insights from it. Insurers that can glean objective-driven business value by applying science to their data and deriving insights will maintain a competitive advantage and stay ahead of the curve in this information age. About the Authors Ajay Bhargava, Director, Analytics and Big Data, TCS Ajay Bhargava has more than 23 years of industry, research, and teaching experience in areas relating to Databases, Enterprise Data Management, Data Warehousing, Business Intelligence, and Advanced Analytics. Over the years, he has provided business and technology-oriented strategic, mentoring, and customer-centric solutions to his clients. Ajay heads TCS’ Analytics and Big Data Center of Excellence for its Insurance and Healthcare customers.
    [Show full text]
  • Cloudera-Intel-Cisco Hadoop Benchmark TOI (External) What Matters in a Hadoop Cluster?
    Cloudera-Intel-Cisco Hadoop Benchmark TOI (External) What matters in a Hadoop Cluster? Floris Grandvarlet (Cisco) [email protected] Patrick Schotts (Intel) [email protected] Woody Christy (Cloudera) [email protected] Cloudera-Intel-Cisco Hadoop Benchmark TOI (External) What matters in a Hadoop Cluster? Acknowledgments The authors acknowledge the contributions of: Intel: Stephen G. Anderson, [email protected] Rob Kypriotakis, [email protected] Jacob A. Ohara, [email protected] Gert Pauwels, [email protected] Richard B. Pilling, [email protected] Cisco: Arnaud Bassaler, [email protected] Peter Ruttens, [email protected] Michel Sumbul, [email protected] Karthik Kulkarni, [email protected] Cloudera: Sandeep Brahmarouthu, [email protected] Jonathan Cooper, [email protected] Rob Johnson, [email protected] Kunal Kusoorkar, [email protected] Dwai Lahiri, [email protected] Jonathan Seidman, [email protected] ALL DESIGNS, SPECIFICATIONS, STATEMENTS, INFORMATION, AND RECOMMENDATIONS (COLLEC¬TIVELY, “DESIGNS”) IN THIS PAPER ARE PRESENTED “AS IS,” WITH ALL FAULTS. CISCO AND ITS SUP¬PLIERS DISCLAIM ALL WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OR ARISING FROM A COURSE OF DEALING, USAGE, OR TRADE PRACTICE. IN NO EVENT SHALL CISCO OR ITS SUPPLIERS BE LIABLE FOR ANY INDIRECT, SPECIAL, CONSEQUENTIAL, OR INCIDENTAL DAMAGES, INCLUDING, WITHOUT LIMITATION, LOST PROFITS OR LOSS OR DAMAGE TO DATA ARISING OUT OF THE USE OR INABILITY TO USE THE DESIGNS, EVEN IF CISCO OR ITS SUPPLIERS HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. THE DESIGNS ARE SUBJECT TO CHANGE WITHOUT NOTICE. USERS ARE SOLELY RESPONSIBLE FOR THEIR APPLICATION OF THE DESIGNS.
    [Show full text]