Hortonworks Data Platform: An Open-Architecture Platform to Manage Data in Motion and at Rest

IBM Analytics Data Sheet

Every business is now a data business. Data is your organization’s future and its most valuable asset. The Hortonworks Data Platform (HDP) is a security-rich, enterprise-ready, open source Apache Hadoop distribution based on a centralized architecture (YARN). HDP addresses the needs of data at rest, powers real-time customer applications, and delivers robust analytics that help accelerate decision making and innovation.

Highlights
• Addresses a range of data-at-rest use cases
• Powers real-time customer applications
• Delivers robust analytics

The Hortonworks difference
HDP helps enterprises transform their businesses by unlocking the full potential of big data with the following benefits:

Open
HDP is composed of numerous Apache Software Foundation (ASF) projects that enable enterprises to deploy, integrate and work with unprecedented volumes of structured and unstructured data. ASF’s approach is to deliver enterprise-grade software that fosters innovation and prevents vendor lock-in.

Central
YARN is the architectural center of open-enterprise Hadoop. It allocates resources among diverse applications that process data. YARN coordinates cluster-wide services for operations, data governance and security. YARN also maximizes data ingestion by enabling enterprises to analyze data to support diverse use cases.

Interoperable
Its 100 percent open-source architecture enables HDP to be interoperable with a broad range of data center and business intelligence applications. HDP’s interoperability helps minimize the expense and effort required to connect customers’ IT infrastructures with HDP’s data and processing capabilities. With HDP, customers can preserve their investment in existing IT architecture as they adopt Hadoop.

Enterprise ready
HDP is built for enterprises. Open-enterprise Hadoop provides consistent operations, with centralized management and monitoring of clusters through a single pane of glass. With HDP, security and governance is built into the platform. This feature helps provide a security-rich environment that’s consistently administered across data access engines. This process empowers Hadoop operators to confidently extend their big data assets to the largest possible audience in their organizations.

The Hortonworks Data Platform
HDP offers a security-rich, enterprise-ready open-source Hadoop distribution based on a centralized architecture. HDP addresses a range of data-at-rest use cases, powers real-time customer applications and delivers robust analytics that accelerate decision making and innovation.

Data management
The foundational components of HDP are Apache Hadoop YARN and the Hadoop Distributed File System (HDFS). While HDFS provides the scalable, fault-tolerant, cost-efficient storage for a big data lake, YARN provides the centralized architecture that enables organizations to process multiple workloads simultaneously. YARN also provides the resource management and pluggable architecture for enabling a wide variety of data access methods.

Data access
With YARN at its architectural center, HDP provides a range of processing engines that allow users to simultaneously interact with data in multiple ways. YARN enables a range of access methods to coexist in the same cluster against shared data sets. This feature avoids unnecessary and costly data silos. HDP enables multiple data processing engines that range from interactive structured query language (SQL) and real-time streaming to data science and batch processing to use data stored in a single platform.
Figure 1: Next-generation Hadoop security
(Diagram of the HDP stack. Governance and integration: data lifecycle and governance with Falcon and Atlas; data workflow with Sqoop, Flume, Kafka, NFS and WebHDFS. Tools: Zeppelin and Ambari User Views. Data access: batch (MapReduce), script (Pig), SQL (Hive), NoSQL (HBase, Accumulo, Phoenix), stream (Storm), search (Solr), in-memory (Spark) and others (ISV engines and partners), with Tez and Slider. Security: administration, authentication, authorization, auditing and data protection with Ranger, Knox, Atlas and HDFS encryption. Operations: provisioning, managing and monitoring with Ambari, Cloudbreak and ZooKeeper; scheduling with Oozie. Data management: YARN, the data operating system, and the Hadoop Distributed File System.)

Security and governance
As organizations pursue Hadoop initiatives to capture new opportunities for data-driven insights, data governance and security requirements can pose a key challenge. In response to this challenge, the Data Governance Initiative (DGI), a consortium of cross-industry leaders, was created to address the need for an open-source governance solution to manage data classification, lineage, security and data lifecycle management.

Apache Atlas, created as part of DGI, empowers organizations to apply consistent data classification across the data ecosystem. Apache Ranger provides centralized security administration for Hadoop. By integrating Atlas with Ranger, Hortonworks empowers enterprises to institute dynamic access policies at runtime that proactively help prevent violations from occurring.

This integration enables enterprises to implement dynamic classification-based security policies. Ranger’s centralized platform empowers data administrators to define security policy based on Atlas metadata tags or attributes. They can then apply this policy in real time to the entire hierarchy of data assets, including databases, tables and columns.

Security
A Hadoop-powered data lake can provide a robust foundation for a new generation of analytics and insight. It’s important, however, to secure the data before launching or expanding a Hadoop initiative. By ensuring that data protection and governance are built into their big data environments, enterprises can use the full value of advanced analytics without exposing their businesses to new risks.

Governance
As organizations pursue Hadoop initiatives to capture new opportunities for data-driven insight, data governance requirements can pose a key challenge. The management of information to identify its value and enable effective control, security and compliance for customer and enterprise data is a core requirement for both traditional and big data architectures.

Operations
HDP Operations is designed to enable IT organizations to bring Hadoop online quickly by taking the guesswork out of the manual processes and replacing them with automated, preconfigured best practices, guided configurations and full operation control. HDP operations help simplify operation of distributed multiuser, multitenant and multidata access engines and manage HDP clusters at scale through an integrated web user interface or single pane of glass.

HDP uses Apache Ambari, an open-source management platform for provisioning, managing, monitoring and securing Hadoop clusters. Ambari removes the manual and often error-prone tasks associated with operating Hadoop. It also provides the necessary integration points to fit seamlessly into the enterprise.

Figure 2: Next-generation Hadoop security
(Diagram of the Atlas and Ranger integration. Atlas holds tags for entities and assets in the data lake, such as HDFS files, HBase tables and Hive tables, with Metastore, Falcon pipelines, Apache Storm and Apache NiFi shown as connected components. Through the notification framework, the Atlas client subscribes to a topic and gets metadata updates. Ranger applies classification-based, prohibition-based, time-based and location-based policies.)

Deployment options
HDP offers a range of infrastructure choices to deploy an open and flexible data platform. Users have the flexibility to combine the infrastructure options that best suit their unique use cases.

On premises
Several organizations that have invested in data center infrastructure and managed services and are now considering Hadoop capabilities will find on-premise implementation to be a viable option. HDP is designed to be easily deployed on premises to integrate with existing data centers.

Cloud
HDP can be deployed in the cloud as part of Microsoft Azure HDInsight. Azure HDInsight is a managed service offering on the Microsoft Azure cloud, powered by HDP. This deployment option enables organizations to scale from terabytes to petabytes of data on demand by spinning up any number of nodes at any time. With HDInsight, enterprises can also connect their on-premises Hadoop clusters to the cloud.

Hybrid cloud and Cloudbreak
Cloudbreak is a solution for provisioning Hadoop clusters on a cloud infrastructure.

HDP for teams
Successful deployment of Hadoop in any organization depends on using existing skill sets and resources to adopt the big data architecture. HDP provides valuable tools and capabilities for every role on your big data team.

The data scientist
Apache Spark, part of HDP, plays an important role when it comes to data science. Data scientists commonly use machine learning, a set of techniques and algorithms that can learn from data. These algorithms are often iterative, and Spark’s ability to cache the data in memory greatly accelerates the iterative data processing, making it an ideal processing engine for implementing such algorithms.

The business analyst
HDP provides business analysts with fast access to vast amounts of data through SQL on Hadoop interfaces provided by Apache Hive, Spark SQL and Apache Phoenix. With these interfaces, business analysts can use their favorite business intelligence and business analytics tools to create reports, visualizations, dashboards and scorecards to make more effective insight-driven decisions.

The developer
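
As a rough illustration of the classification-based policies described under Security and governance, the following Python sketch models tags attached to a hierarchy of data assets (database, table, column) and a policy keyed on a tag. It is a toy model only and does not use the Apache Atlas or Apache Ranger APIs. The names Asset, TagPolicy and is_allowed, the "PII" tag, the group names and the rule that child assets inherit their parent's tags are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """Hypothetical data asset (database, table or column) with classification tags."""
    name: str
    tags: set = field(default_factory=set)
    parent: "Asset | None" = None

    def effective_tags(self) -> set:
        # Assumption for this sketch: tags set higher in the hierarchy
        # also apply to every child asset.
        inherited = self.parent.effective_tags() if self.parent else set()
        return self.tags | inherited


@dataclass
class TagPolicy:
    """Hypothetical policy: only the listed groups may read assets carrying `tag`."""
    tag: str
    allowed_groups: set


def is_allowed(user_groups: set, asset: Asset, policies: list) -> bool:
    """Return False if any policy on one of the asset's tags excludes the user."""
    tags = asset.effective_tags()
    for policy in policies:
        if policy.tag in tags and not (user_groups & policy.allowed_groups):
            return False
    return True


# Example hierarchy: database -> table -> column
sales_db = Asset("sales")
customers = Asset("sales.customers", tags={"PII"}, parent=sales_db)
email = Asset("sales.customers.email", parent=customers)
orders = Asset("sales.orders", parent=sales_db)

# One policy defined on a tag rather than on individual tables or columns.
policies = [TagPolicy(tag="PII", allowed_groups={"compliance"})]

print(is_allowed({"analysts"}, email, policies))
print(is_allowed({"compliance"}, email, policies))
print(is_allowed({"analysts"}, orders, policies))
```

The sketch is meant to show why a tag-keyed policy is convenient: tagging the customers table once is enough to cover its columns, and no per-column rule has to be written.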
