Hortonworks Data Platform: An Open-Architecture Platform to Manage Data in Motion and at Rest

Total Pages: 16

File Type: PDF, Size: 1020 KB

IBM Analytics Data Sheet

Hortonworks Data Platform: An open-architecture platform to manage data in motion and at rest

Every business is now a data business. Data is your organization's future and its most valuable asset. The Hortonworks Data Platform (HDP) is a security-rich, enterprise-ready, open source Apache Hadoop distribution based on a centralized architecture (YARN). HDP addresses the needs of data at rest, powers real-time customer applications, and delivers robust analytics that help accelerate decision making and innovation.

Highlights
• Addresses a range of data-at-rest use cases
• Powers real-time customer applications
• Delivers robust analytics

The Hortonworks difference
HDP helps enterprises transform their businesses by unlocking the full potential of big data with the following benefits:

Open: HDP is composed of numerous Apache Software Foundation (ASF) projects that enable enterprises to deploy, integrate and work with unprecedented volumes of structured and unstructured data. ASF's approach is to deliver enterprise-grade software that fosters innovation and prevents vendor lock-in.

Central: YARN is the architectural center of open-enterprise Hadoop. It allocates resources among diverse applications that process data. YARN coordinates cluster-wide services for operations, data governance and security. YARN also maximizes data ingestion by enabling enterprises to analyze data to support diverse use cases.

Interoperable: Its 100 percent open-source architecture enables HDP to be interoperable with a broad range of data center and business intelligence applications. HDP's interoperability helps minimize the expense and effort required to connect customers' IT infrastructures with HDP's data and processing capabilities. With HDP, customers can preserve their investment in existing IT architecture as they adopt Hadoop.

Enterprise ready: HDP is built for enterprises. Open-enterprise Hadoop provides consistent operations, with centralized management and monitoring of clusters through a single pane of glass. With HDP, security and data governance are built into the platform. This feature helps provide a security-rich environment that's consistently administered across data access engines. This process empowers Hadoop operators to confidently extend their big data assets to the largest possible audience in their organizations.

The Hortonworks Data Platform
HDP offers a security-rich, enterprise-ready open-source Hadoop distribution based on a centralized architecture. HDP addresses a range of data-at-rest use cases, powers real-time customer applications and delivers robust analytics that accelerate decision making and innovation.

Data management
The foundational components of HDP are Apache Hadoop YARN and the Hadoop Distributed File System (HDFS). While HDFS provides the scalable, fault-tolerant, cost-efficient storage for a big data lake, YARN provides the centralized architecture that enables organizations to process multiple workloads simultaneously. YARN also provides the resource management and pluggable architecture for enabling a wide variety of data access methods.

Data access
With YARN at its architectural center, HDP provides a range of processing engines that allow users to simultaneously interact with data in multiple ways. YARN enables a range of access methods to coexist in the same cluster against shared data sets. This feature avoids unnecessary and costly data silos. HDP enables multiple data processing engines that range from interactive structured query language (SQL) and real-time streaming to data science and batch processing to use data stored in a single platform.
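To make the shared-cluster idea concrete, here is a minimal, hypothetical PySpark sketch in which one data set stored in HDFS is touched through two of the access patterns named above — a batch-style transformation and an interactive SQL query — without copying it into separate silos. The HDFS path, file format and column names are illustrative assumptions, not anything prescribed by HDP.

    # Minimal sketch: two access patterns (batch and SQL) over one shared HDFS data set.
    # Assumes a running YARN/HDP cluster and an example CSV at the (hypothetical) path shown.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("shared-data-access-sketch")
             .getOrCreate())

    # One copy of the data in HDFS (placeholder path and schema).
    events = spark.read.csv("hdfs:///data/lake/web_events.csv",
                            header=True, inferSchema=True)

    # Batch-style access: a programmatic transformation.
    daily_counts = events.groupBy("event_date").count()

    # SQL access: the same data exposed to interactive queries.
    events.createOrReplaceTempView("web_events")
    top_pages = spark.sql(
        "SELECT page, COUNT(*) AS hits FROM web_events "
        "GROUP BY page ORDER BY hits DESC LIMIT 10")

    daily_counts.show()
    top_pages.show()
    spark.stop()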
[Figure 1: the HDP component stack — Data Management (HDFS, the Hadoop Distributed File System, with NFS and WebHDFS access, under YARN, the data operating system); Data Access engines for batch, script, SQL, NoSQL, stream, search and in-memory workloads (MapReduce, Pig, Hive, HBase, Accumulo, Storm, Solr, Spark, Phoenix, Tez, Slider and ISV/partner engines); Governance and Integration (data lifecycle and governance with Falcon and Atlas; data workflow with Sqoop, Flume and Kafka); Tools (Zeppelin, Ambari User Views); Security (administration, authentication, authorization, auditing and data protection with Ranger, Knox, Atlas and HDFS encryption); and Operations (provisioning, managing, monitoring and scheduling with Ambari, Cloudbreak, ZooKeeper and Oozie).]

Security and governance
As organizations pursue Hadoop initiatives to capture new opportunities for data-driven insights, data governance and security requirements can pose a key challenge. In response to this challenge, the Data Governance Initiative (DGI), a consortium of cross-industry leaders, was created to address the need for an open-source governance solution to manage data classification, lineage, security and data lifecycle management.

Apache Atlas, created as part of DGI, empowers organizations to apply consistent data classification across the data ecosystem. Apache Ranger provides centralized security administration for Hadoop. By integrating Atlas with Ranger, Hortonworks empowers enterprises to institute dynamic access policies at runtime that proactively help prevent violations from occurring.

This integration enables enterprises to implement dynamic classification-based security policies. Ranger's centralized access platform empowers data administrators to define security policy based on Atlas metadata tags or attributes. They can then apply this policy in real time to the entire hierarchy of data assets, including databases, tables and columns.

Security
A Hadoop-powered data lake can provide a robust foundation for a new generation of analytics and insight. It's important, however, to secure the data before launching or expanding a Hadoop initiative. By ensuring that data protection and governance are built into their big data environments, enterprises can use the full value of advanced analytics without exposing their businesses to new risks.

Governance
As organizations pursue Hadoop initiatives to capture new opportunities for data-driven insight, data governance requirements can pose a key challenge. The management of information to identify its value and enable effective control, security and compliance for customer and enterprise data is a core requirement for both traditional and big data architectures.

Operations
HDP Operations is designed to enable IT organizations to bring Hadoop online quickly by taking the guesswork out of the manual processes and replacing them with automated, preconfigured best practices, guided configurations and full operation control. HDP operations help simplify operation of distributed multiuser, multitenant and multidata access engines and manage HDP clusters at scale through an integrated web user interface or single pane of glass.

HDP uses Apache Ambari, an open-source management platform for provisioning, managing, monitoring and securing Hadoop clusters. Ambari removes the manual and often error-prone tasks associated with operating Hadoop. It also provides the necessary integration points to fit seamlessly into the enterprise.
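As an illustration of the kind of operational integration point Ambari exposes, the hypothetical sketch below polls the Ambari REST API for the state of each service in a cluster. The host, port, credentials and cluster name are placeholders, and the response fields should be checked against the Ambari version in use.

    # Minimal sketch: list HDP service states via the Ambari REST API.
    import requests

    AMBARI = "http://ambari.example.com:8080"   # placeholder host and port (assumption)
    CLUSTER = "my_hdp_cluster"                  # placeholder cluster name (assumption)
    AUTH = ("admin", "admin")                   # placeholder credentials (assumption)

    # Ask Ambari for every service in the cluster and its current state.
    resp = requests.get(
        f"{AMBARI}/api/v1/clusters/{CLUSTER}/services?fields=ServiceInfo/state",
        auth=AUTH, timeout=30)
    resp.raise_for_status()

    for item in resp.json().get("items", []):
        info = item.get("ServiceInfo", {})
        print(f"{info.get('service_name')}: {info.get('state')}")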
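The classification-based policies described under Security and governance above can also be managed programmatically through Ranger's public REST API. The sketch below is a rough outline of one cautious pattern — read an existing tag-based policy to learn the exact JSON shape your Ranger version uses, then post a modified copy. The host, credentials, tag-service name and field handling are assumptions and will vary between Ranger releases.

    # Rough sketch (assumptions throughout): manage a Ranger tag-based policy over REST.
    # Reading an existing policy first avoids guessing field names, which differ by version.
    import requests

    RANGER = "http://ranger.example.com:6080"   # placeholder host and port (assumption)
    AUTH = ("admin", "admin")                   # placeholder credentials (assumption)
    TAG_SERVICE = "cl1_tag"                     # name of the tag-based service (assumption)

    # 1. Read existing tag-based policies to learn the JSON shape in use.
    resp = requests.get(f"{RANGER}/service/public/v2/api/service/{TAG_SERVICE}/policy",
                        auth=AUTH, timeout=30)
    resp.raise_for_status()
    policies = resp.json()
    if not policies:
        raise SystemExit("No existing tag policies to use as a template")
    template = policies[0]
    print("Policy fields in this Ranger version:", sorted(template))

    # 2. Create a new policy by copying that shape and changing only what is needed.
    new_policy = dict(template)
    for key in ("id", "guid", "version"):
        new_policy.pop(key, None)
    new_policy["name"] = "pii-tag-read-only"
    new_policy["description"] = "Read-only access to assets tagged PII (illustrative)"
    created = requests.post(f"{RANGER}/service/public/v2/api/policy",
                            json=new_policy, auth=AUTH, timeout=30)
    print(created.status_code, created.text[:200])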
It’s important, It also provides the necessary integration points to fit however, to secure the data before launching or expanding seamlessly into the enterprise. a Hadoop initiative. By ensuring that data protection and Apache Storm Classification-based Policy PDP ENTITIES RESOURCE ATLAS CACHE IN DATA Prohibition-based policy LAKE Notification Metastore Falcon Framework Pipelines Tags RANGER Assets Topics Time-based Policy HDFS HBase Entities files Tables Atlas Client Subscribes Hive to Topic Tables Gets Metadata Updates Location-based Policy Apache NiFi Figure 2: Next-generation Hadoop security 3 IBM Analytics Data Sheet Deployment options HDP for teams HDP offers a range of infrastructure choices to deploy an Successful deployment of Hadoop in any organization open and flexible data platform. Users have the flexibility depends on using existing skill sets and resources to adopt to combine the infrastructure options that best suit their the big data architecture. HDP provides valuable tools unique use cases. and capabilities for every role on your big data team. On premises The data scientist Several organizations that have invested in data center Apache Spark, part of HDP, plays an important role when it infrastructure and managed services and are now considering comes to data science. Data scientists commonly use machine Hadoop capabilities will find on-premise implementation learning, a set of techniques and algorithms that can learn to be a viable option. HDP is designed to be easily deployed from data. These algorithms are often iterative, and Spark’s on premises to integrate with existing data centers. ability to cache the data in memory greatly accelerates the iterative data processing, making it an ideal processing engine Cloud for implementing such algorithms. HDP can be deployed in the cloud as part of Microsoft Azure HDInsight. Azure HDInsight is a managed service The business analyst offering on the Microsoft Azure cloud, powered by HDP. HDP provides business analysts with fast access to vast This deployment option enables organizations to scale amounts of data through SQL on Hadoop interfaces provided from terabytes to petabytes of data on demand by spinning by Apache Hive, Spark SQL and Apache Phoenix. With up any number of nodes at any time. With HDInsight, these interfaces, business analysts can use their favorite enterprises can also connect their on-premises Hadoop business intelligence and business analytics tools to create clusters to the cloud. reports, visualizations, dashboards and scorecards to make more effective insight-driven decisions. Hybrid cloud and Cloudbreak Cloudbreak is a solution for provisioning Hadoop clusters The developer on a cloud infrastructure.
Recommended publications
  • Amazon Connect Data Lake Best Practices AWS Whitepaper
    Amazon Connect Data Lake Best Practices: AWS Whitepaper. Copyright © Amazon Web Services, Inc. and/or its affiliates. All rights reserved. Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon. Table of contents: Abstract and introduction (Abstract; Are you Well-Architected?; Introduction); Amazon Connect; Data lake design principles.
  • Splitting the Load: How Separating Compute from Storage Can Transform the Flexibility, Scalability and Maintainability of Big Data Analytics Platforms
    IBM Analytics Engine white paper — Splitting the load: How separating compute from storage can transform the flexibility, scalability and maintainability of big data analytics platforms. Contents: Executive summary; Challenges of Hadoop design; Limitations of traditional Hadoop clusters; How cloud has changed the game; Introducing IBM Analytics Engine; Overcoming the limitations; Exploring the IBM Analytics Engine architecture; Use cases for IBM Analytics Engine; Benefits of migrating to IBM Analytics Engine; Conclusion; About the author; For more information. Executive summary: Hadoop is the dominant big data processing system in use today. It is, however, a technology that has been around for 10 years, and the world of big data has changed dramatically over that time. Hadoop started with a specific focus – a bunch of engineers wanted a way to store and analyze copious amounts of web logs. They knew how to write Java and how to set up infrastructure, and they were hands-on with systems programming. All they really needed was a cost-effective file system (HDFS) and an execution paradigm (MapReduce)—the rest, they could code for themselves. Companies like Google, Facebook and Yahoo built many products and business models using just these two pieces of technology. Today, however, we're seeing a big shift in the way big data applications are being programmed and deployed in production. Many different user personas, from data scientists and data engineers to business analysts and app developers need access to data. Each of these personas needs to access the data through a different tool and on a different schedule.
  • Extended Version
    Sina Sheikholeslami — Curriculum Vitae (last updated November 2018). Website: http://sinash.ir, https://www.kth.se/profile/sinash, https://linkedin.com/in/sinasheikholeslami. Present address: EIT Digital Stockholm CLC, Isafjordsgatan 26, 164 40 Kista (Stockholm), Sweden. Email: si [email protected], [email protected], [email protected]. Educational background: M.Sc. student of Data Science, Eindhoven University of Technology & KTH Royal Institute of Technology, under the EIT-Digital Master School, 2017–present; B.Sc. in Computer Software Engineering, Department of Computer Engineering and Information Technology, Amirkabir University of Technology (Tehran Polytechnic), 2011–2016; Mirza Koochak Khan Pre-College in Mathematics and Physics, Rasht, National Organization for Development of Exceptional Talents (NODET), overall GPA 19.61/20, 2010–2011; Mirza Koochak Khan Highschool in Mathematics and Physics, Rasht, NODET, overall GPA 19.17/20, final year's GPA 19.66/20, 2007–2010. Research fields of interest: distributed deep learning, hyperparameter optimization, AutoML, data-intensive computing. Bachelor's thesis: "SDMiner: A Tool for Mining Data Streams on Top of Apache Spark", under supervision of Dr. Amir H. Payberah (defended on June 29th, 2016). Computer skills — programming languages and markups: fluent in Java, Python, Scala, JavaScript, C/C++, Android program development; familiar with R, SAS, SQL, Node.js, AngularJS, HTML, JSP.
  • PowerEdge R640 Apache Hadoop
    A Principled Technologies report: Hands-on testing. Real-world results. The science behind the report: Run compute-intensive Apache Hadoop big data workloads faster with Dell EMC PowerEdge R640 servers. This document describes what we tested, how we tested, and what we found. To learn how these facts translate into real-world benefits, read the report Run compute-intensive Apache Hadoop big data workloads faster with Dell EMC PowerEdge R640 servers. We concluded our hands-on testing on October 27, 2019. During testing, we determined the appropriate hardware and software configurations and applied updates as they became available. The results in this report reflect configurations that we finalized on October 15, 2019 or earlier. Unavoidably, these configurations may not represent the latest versions available when this report appears.
    Our results — throughput each solution delivered when running the HiBench workloads (Dell EMC™ PowerEdge™ R640 solution vs. Dell EMC PowerEdge R630 solution):
    • Latent Dirichlet Allocation: 4.13 vs. 1.94 MB/sec (112% more throughput)
    • Random Forest: 100.66 vs. 94.43 MB/sec (6% more throughput)
    • WordCount: 5.10 vs. 3.45 GB/sec (47% more throughput)
    Minutes each solution needed to complete the HiBench workloads (R640 vs. R630):
    • Latent Dirichlet Allocation: 17.11 vs. 36.25 (52% less time)
    • Random Forest: 5.55 vs. 5.92 (6% less time)
    • WordCount: 4.95 vs. 7.32 (32% less time)
    System configuration information: the table below presents detailed information on the systems we tested.
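    As a quick sanity check on the comparison above, the short Python sketch below recomputes the percentage differences from the rounded figures shown; because the published percentages were presumably derived from unrounded measurements, the recomputed values can differ by about a percentage point.

        # Recompute the percentage gains reported for the R640 vs. R630 HiBench runs.
        throughput = {  # (R640, R630), units as reported above
            "Latent Dirichlet Allocation (MB/sec)": (4.13, 1.94),
            "Random Forest (MB/sec)": (100.66, 94.43),
            "WordCount (GB/sec)": (5.10, 3.45),
        }
        minutes = {  # (R640, R630), minutes to complete
            "Latent Dirichlet Allocation": (17.11, 36.25),
            "Random Forest": (5.55, 5.92),
            "WordCount": (4.95, 7.32),
        }

        for name, (r640, r630) in throughput.items():
            print(f"{name}: {100 * (r640 / r630 - 1):.0f}% more throughput")
        for name, (r640, r630) in minutes.items():
            print(f"{name}: {100 * (1 - r640 / r630):.0f}% less time")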
  • Apache Hadoop Today & Tomorrow
    Apache Hadoop Today & Tomorrow. Eric Baldeschwieler, CEO, Hortonworks, Inc. Twitter: @jeric14 (@hortonworks), www.hortonworks.com. Agenda: brief overview of Apache Hadoop; where Apache Hadoop is used; Apache Hadoop core — Hadoop Distributed File System (HDFS) and Map/Reduce; where Apache Hadoop is going; Q&A. What is Apache Hadoop? A set of open source projects owned by the Apache Foundation that transforms commodity computers and network into a distributed service: HDFS stores petabytes of data reliably, and Map-Reduce allows huge distributed computations. Key attributes: reliable and redundant — doesn't slow down or lose data even as hardware fails; simple and flexible APIs — our rocket scientists use it directly!; very powerful — harnesses huge clusters, supports best-of-breed analytics; batch-processing centric — hence its great simplicity and speed, not a fit for all use cases. What is it used for? Internet-scale data: web logs (years of logs at many TB/day), web search (all the web pages on earth), social data (all message traffic on Facebook); cutting-edge analytics: machine learning, data mining; enterprise apps: network instrumentation, mobile logs, video and audio processing, text mining; and lots more! Apache Hadoop projects: programming languages Pig (data flow) and Hive (SQL); computation with MapReduce (distributed programming framework); table storage with HCatalog (metadata) and HBase (columnar storage); object storage with HDFS (Hadoop Distributed File System); plus ZooKeeper (coordination) and HMS (management) — core Apache Hadoop and related Apache projects. Where Hadoop is used. © Hortonworks, Inc. All rights reserved.
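    To ground the Map/Reduce description in this overview, here is a minimal, hypothetical word-count job written for Hadoop Streaming, which lets any executable act as mapper and reducer over standard input and output. Paths, the streaming jar location and exact flags are placeholders that vary by distribution.

        #!/usr/bin/env python3
        # Minimal Hadoop Streaming word count: run with "mapper" or "reducer" as argv[1].
        # The mapper emits "word<TAB>1"; the reducer sums counts per word, relying on the
        # sorted-by-key input that Hadoop Streaming delivers between map and reduce.
        import sys

        def mapper():
            for line in sys.stdin:
                for word in line.split():
                    print(f"{word}\t1")

        def reducer():
            current, total = None, 0
            for line in sys.stdin:
                word, count = line.rstrip("\n").split("\t", 1)
                if word != current:
                    if current is not None:
                        print(f"{current}\t{total}")
                    current, total = word, 0
                total += int(count)
            if current is not None:
                print(f"{current}\t{total}")

        if __name__ == "__main__":
            mapper() if sys.argv[1] == "mapper" else reducer()

    It could then be submitted with something like hadoop jar hadoop-streaming-*.jar -files wordcount.py -mapper "python3 wordcount.py mapper" -reducer "python3 wordcount.py reducer" -input <input-dir> -output <output-dir>, with the jar path and options adjusted to the installation.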
  • Big Business Value from Big Data and Hadoop
    Big Business Value from Big Data and Hadoop. © Hortonworks Inc. 2012. Topics: the big data explosion — hype or reality; introduction to Apache Hadoop; the business case for big data; Hortonworks overview & product demo. Big data: hype or reality? What is big data? Big data is changing the game for organizations: transactions + interactions + observations = big data, growing from megabytes and gigabytes to terabytes and petabytes and spanning ERP and CRM data (purchase detail, purchase and payment records, segmentation, customer touches, support contacts), web data (web logs, A/B testing, behavioral targeting, dynamic pricing, search marketing, affiliate networks, dynamic funnels, offer details and history, user generated content) and newer sources (mobile web, sentiment, SMS/MMS, user click stream, speech to text, social interactions & feeds, spatial & GPS coordinates, sensors/RFID/devices, business data feeds, external demographics, HD video, audio, images, product/service logs) — increasing data variety and complexity. Next-generation data platform drivers — organizations will need to become more data driven to compete. Business drivers: enable new business models & drive faster growth (20%+); find insights for competitive advantage & optimal returns. Technical drivers: data continues to grow exponentially; data is increasingly everywhere and in many formats; legacy solutions unfit for new requirements growth. Financial drivers: cost of data systems, as % of IT spend, continues to
  • TR-4744: Secure Hadoop Using Apache Ranger with NetApp In-Place Analytics Module
    Technical Report — Secure Hadoop using Apache Ranger with NetApp In-Place Analytics Module: Deployment Guide. Karthikeyan Nagalingam, NetApp. February 2019 | TR-4744. Abstract: This document introduces the NetApp® In-Place Analytics Module for Apache Hadoop and Spark with Ranger. The topics covered in this report include the Ranger configuration, underlying architecture, integration with Hadoop, and benefits of Ranger with NetApp In-Place Analytics Module using Hadoop with NetApp ONTAP® data management software. Table of contents: 1 Introduction (1.1 Overview; 1.2 Deployment Options; 1.3 NetApp In-Place Analytics Module 3.0.1 Features); 2 Ranger (2.1 Components Validated with Ranger); 3 NetApp In-Place Analytics Module Design with Ranger.
  • Top M&A Trends in Infrastructure Software (whitepaper)
    InfraReport: Top M&A Trends in Infrastructure Software. Executive Summary. 1 Evolution of Cloud Infrastructure: 1.1 Size of the Prize; 1.2 The Evolution of the Infrastructure (Public) Cloud Market and Technology (1.2.1 Original 2006 Public Cloud — Hardware as a Service; 1.2.2 2016 - 2010 - Platform as a Service; 1.2.3 2016 - 2019 - Containers as a Service; 1.2.4 Container Orchestration; 1.2.5 Standardization of Container Orchestration; 1.2.6 Hybrid Cloud & Multi-Cloud; 1.2.7 Edge Computing and 5G; 1.2.8 APIs, Cloud Components and AI; 1.2.9 Service Mesh; 1.2.10 Serverless; 1.2.11 Zero Code; 1.2.12 Cloud as a Service). 2 State of the Market: 2.1 Investment Trend Summary — Summary of Funding Activity in Cloud Infrastructure. 3 Market Focus — Trends & Companies: 3.1 Cloud Providers Provide Enhanced Security, Including AI/ML and Zero Trust Security; 3.2 Cloud Management and Cost Containment Becomes a Challenge for Customers; 3.3 The Container Market is Just Starting to Heat Up; 3.4 Kubernetes; 3.5 APIs Have Become the Dominant Information Sharing Paradigm; 3.6 DevOps is the Answer to Increasing Competition From Emerging Digital Disruptors; 3.7 Serverless; 3.8 Zero Code; 3.9 Hybrid, Multi and Edge Clouds. 4 Large Public/Private Acquirers: 4.1 Amazon Web Services — Private Company Profile; 4.2 Cloudera (NYS: CLDR) — Public Company Profile; 4.3 Hortonworks — Private Company Profile.
  • Hortonworks HDP with IBM Spectrum Scale
    Hortonworks HDP with IBM Spectrum Scale. Chih-Feng Ku ([email protected]), Sr Manager, Solution Engineering APAC; Par Hettinga ([email protected]), Program Director, Global SDI Enablement. Challenges with the big data storage models (IBM Storage & SDI): it's not just one type of data — file & object; it's not just one type of analytics; key business processes now depend on the analytics; the cycle of ingest data at various end points, move data to the analytics engine, perform analytics, repeat; more data sources than ever before — not just data you own, but public or rented data; it takes hours or days to move the data; can't just throw away data due to regulations or business requirements. Modernizing and integrating data lake infrastructure — insurance company use case: business units with Big SQL, ETL processes, DB2 for warehouse, and policy and customer data share a global namespace on Spectrum Scale (System x3650 M4 servers with EXP3524 enclosures). Value of IBM Power Systems: performance and scalability; high memory and I/O bandwidth; optimized server for analytics and integration of accelerators. Value of IBM Spectrum Scale: separation of compute and storage; in-place analytics (POSIX conformant); integration of objects and files; HA/DR solutions; storage tiering (incl. tape integration); flexibility of data movement; future: use of common data formats (f.e.
  • Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility
    Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility. July 2017. © 2017, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices: This document is provided for informational purposes only. It represents AWS's current product offerings and practices as of the date of issue of this document, which are subject to change without notice. Customers are responsible for making their own independent assessment of the information in this document and any use of AWS's products or services, each of which is provided "as is" without warranty of any kind, whether express or implied. This document does not create any warranties, representations, contractual commitments, conditions or assurances from AWS, its affiliates, suppliers or licensors. The responsibilities and liabilities of AWS to its customers are controlled by AWS agreements, and this document is not part of, nor does it modify, any agreement between AWS and its customers. Contents: Introduction; Amazon S3 as the Data Lake Storage Platform; Data Ingestion Methods (Amazon Kinesis Firehose; AWS Snowball; AWS Storage Gateway); Data Cataloging (Comprehensive Data Catalog; HCatalog with AWS Glue); Securing, Protecting, and Managing Data (Access Policy Options and AWS IAM; Data Encryption with Amazon S3 and AWS KMS; Protecting Data with Amazon S3; Managing Data with Object Tagging); Monitoring and Optimizing the Data Lake Environment (Data Lake Monitoring; Data Lake Optimization); Transforming Data Assets; In-Place Querying (Amazon Athena; Amazon Redshift Spectrum); The Broader Analytics Portfolio (Amazon EMR; Amazon Machine Learning; Amazon QuickSight; Amazon Rekognition); Future Proofing the Data Lake; Contributors; Document Revisions. Abstract: Organizations are collecting and analyzing increasing amounts of data, making it difficult for traditional on-premises solutions for data storage, data management, and analytics to keep pace.
  • Hortonworks Data Platform
    Hortonworks Data Platform: Apache Ambari Installation for IBM Power Systems (November 15, 2018), docs.cloudera.com. Copyright © 2012-2018 Hortonworks, Inc. Some rights reserved. The Hortonworks Data Platform, powered by Apache Hadoop, is a massively scalable and 100% open source platform for storing, processing and analyzing large volumes of data. It is designed to deal with data from many sources and formats in a very quick, easy and cost-effective manner. The Hortonworks Data Platform consists of the essential set of Apache Hadoop projects including MapReduce, Hadoop Distributed File System (HDFS), HCatalog, Pig, Hive, HBase, ZooKeeper and Ambari. Hortonworks is the major contributor of code and patches to many of these projects. These projects have been integrated and tested as part of the Hortonworks Data Platform release process and installation and configuration tools have also been included. Unlike other providers of platforms built using Apache Hadoop, Hortonworks contributes 100% of our code back to the Apache Software Foundation. The Hortonworks Data Platform is Apache-licensed and completely open source. We sell only expert technical support, training and partner-enablement services. All of our technology is, and will remain free and open source. Please visit the Hortonworks Data Platform page for more information on Hortonworks technology. For more information on Hortonworks services, please visit either the Support or Training page. Feel free to Contact Us directly to discuss your specific needs. Except where otherwise noted, this document is licensed under Creative Commons Attribution ShareAlike 4.0 License.
  • Cost Modeling Data Lakes for Beginners: How to Start Your Journey into Data Analytics
    Cost Modeling Data Lakes for Beginners: How to start your journey into data analytics. November 2020. Notices: Customers are responsible for making their own independent assessment of the information in this document. This document: (a) is for informational purposes only, (b) represents current AWS product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided "as is" without warranties, representations, or conditions of any kind, whether express or implied. The responsibilities and liabilities of AWS to its customers are controlled by AWS agreements, and this document is not part of, nor does it modify, any agreement between AWS and its customers. © 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved. Contents: Introduction; What should the business team focus on?; Defining the approach to cost modeling data lakes; Measuring business value; Establishing an agile delivery process; Building data lakes.