Real-Time Big Data Analytics for the Enterprise

White Paper
Intel® Distribution for Apache Hadoop*
Big Data

SAP HANA* and the Intel® Distribution for Apache Hadoop* Software

Executive Summary

Companies are using real-time big data analytics to reshape the competitive landscape in their industries. They do it by capturing, storing, and analyzing volumes and varieties of data that were previously unmanageable, and then extracting insights fast enough to support real-time business processes. What started with a few leading Internet companies has spread to finance, healthcare, government, manufacturing, retail, scientific research, and many other fields.

Yet implementing real-time big data analytics can be challenging, requiring IT organizations to implement mission-critical solutions based, at least in part, on open-source software that does not always meet enterprise requirements. Not only is integration complex, but IT organizations must establish security, compliance, and high availability from the ground up to ensure the system is up to the challenge of housing sensitive data and supporting revenue-generating business processes.

Intel and SAP have addressed these challenges to provide an enterprise-ready solution for real-time big data analytics. With SAP HANA* running on the latest Intel® Xeon® processor E7 family and the Intel® Distribution for Apache Hadoop* software running on the latest Intel Xeon processor E5 family, businesses can ingest, store, and analyze petabytes of polystructured data, and they can generate insights in fractions of a second to support real-time business processes. This solution includes a rich set of data management and business intelligence tools for turning data into high-value insights that can be embedded into other applications and business processes. Just as importantly, the solution is designed to meet enterprise requirements of security, compliance, and high availability so businesses can confidently integrate sensitive data into their analytics environment.

This white paper discusses the value of performing real-time analytics using all available enterprise data and describes how Intel and SAP have overcome the inherent challenges to deliver an enterprise-ready solution.

Table of Contents

Executive Summary
Extending Real-Time Analytics to All Enterprise Data
Solving the Challenges of Big Data Integration
  Advanced Analytics across All Data Sets
  Industry-Leading Performance for Apache Hadoop
  Integrated Data Management
An Enterprise-Ready Platform
  End-to-End Security
  High Availability
  Enterprise-Class Manageability
SAP and Intel: A Shared Vision for Big Data Integration
SAP: Single Point of Contact for Service and Support
Conclusion


Extending Real-Time Analytics to All Enterprise Data

Advances in data analytics are changing the way businesses compete, enabling them to make faster and better decisions based on real-time analysis. Until recently, companies had to make tradeoffs between deep analysis of large data sets and fast time to results. Intel and SAP are eliminating the need to compromise with an analytics platform designed to deliver real-time query performance while acting on petabytes of both structured and unstructured data.

SAP HANA provides a real-time analytics platform using an in-memory database. Organizations can combine large data sets from their operational systems and other sources and perform complex queries in real time, typically in milliseconds. They can even use a single SAP HANA instance as a common foundation for all their applications, both transactional and analytical. This approach streamlines infrastructure and eliminates the physical and operational complexities of moving large amounts of data from operational systems to analytic systems. With these capabilities, SAP HANA answers the business challenge of delivering data-driven intelligence to support real-time business processes.

Big data introduces a new set of challenges. Companies generate enormous volumes of polystructured data from Web logs, sensors, call records, social network posts, emails, and many other sources. They need a cost-effective, massively scalable solution for capturing, storing, and analyzing this data. They also need to be able to integrate their big data into their real-time analytics environment to maximize business value. For example, many companies want to analyze the clickstream trails of online customers in combination with historical purchasing patterns to deliver personalized offers and information. Deep analysis across diverse data sets can improve outcomes in such scenarios, but results are needed quickly to positively impact online transactions.

Intel and SAP have collaborated to meet this challenge by integrating the Intel Distribution for Apache Hadoop (IDH) software with SAP HANA, SAP Data Services, and SAP Business Objects. The result is a real-time analytics platform designed to efficiently ingest, store, integrate, and analyze all enterprise data. The platform offers:

• Real-time analytics with cost-effective storage that can scale to petabytes, and potentially exabytes, of data.
• Transparent data integration and query federation, so advanced analytics can be applied across all data using SAP tools and familiar SQL-based programming models.
• Enterprise-class support for security, compliance, and manageability so businesses can realize the advantages of real-time big data analytics more quickly and with reduced cost and risk.


Solving the Challenges of Big Data Integration

SAP HANA is known for its unmatched query performance at scale. Intel collaborated with SAP engineers to help them optimize their in-memory processing platform to get maximum benefit from the hardware capabilities of the Intel Xeon processor E7 family, including its multicore architecture, large cache, large memory capacity and high-bandwidth I/O channels. Based on these efforts, SAP HANA speeds query processing times by as much as 10,000 times [1] versus traditional data warehouse solutions. The latest Intel Xeon processor E7 v2 family delivers even greater performance benefits and can process much larger in-memory data sets. These new processors support three times more memory than previous-generation processors: up to 6 TB on a four-socket server and up to 12 TB on an eight-socket server. They also provide more cores, threads, and system bandwidth to enable up to 2x faster performance [2] for complex, ad hoc queries, compared to previous-generation SAP HANA platforms.

The distributed architecture of Apache Hadoop addresses very different requirements than SAP HANA. Hadoop enables query performance and data capacity to be scaled cost-effectively across tens to hundreds of standard, two-socket servers based on Intel Xeon processors and configured with direct-attached storage drives. This clustered architecture stores and processes data at a cost-per-terabyte that is far lower than traditional data warehousing systems.

Although Hadoop enables fast processing of massive data sets, queries typically take minutes to hours to complete. This creates challenges when integrating Hadoop into a real-time analytics environment. Intel and SAP address these challenges in two ways. First, IDH is highly optimized for performance on Intel® architecture (see sidebar). Second, Intel and SAP make it easy to generate queries that make efficient use of both platforms.

Advanced Analytics across All Data Sets

SAP HANA and SAP Business Objects provide comprehensive support for advanced analytics, including traditional SQL-based queries, dashboards, predictive analytics, planning, text mining, and more. In combination with IDH, these models can be applied transparently across the data stored in both platforms.

BI users and developers see data stored in IDH as an extension of the data stored in SAP HANA. The queries they generate are automatically federated, as appropriate, across the two platforms. For example, one part of a query might extract customer purchasing data from SAP HANA; another part might search associated Web server logs or call center data records in the Hadoop cluster. The results are then combined and further analyzed in SAP HANA to provide desired insights. As part of this query federation process, some components of the SQL queries generated by BI users and developers are automatically translated into MapReduce* applications that can run natively in Hadoop.

The separate parts of a federated query can be performed simultaneously. They can also be performed asynchronously, so that intermediate results from the Hadoop cluster are available as needed to support real-time processes in SAP HANA. Query performance statistics are provided, so developers can shape queries to address specific latency requirements.

Sidebar: Industry-Leading Performance for Apache Hadoop*

The Intel® Distribution for Apache Hadoop* (IDH) software is optimized with the latest Intel® Xeon® processors, Intel® Solid-State Drives, and 10 gigabit Intel® Ethernet Adapters to deliver:

• Up to 30x higher performance than unoptimized Hadoop software running on legacy hardware. [3]
• Up to 2.6x faster performance than other open-source Hadoop distributions running on the same hardware platform. [4]

Additional optimizations within IDH help to improve performance for other key functions, such as MapReduce* job launches and Hive* queries. (Hive provides data-warehouse-like functionality for Hadoop environments and is a key component for integrating the Intel Distribution with SAP HANA*.)

These and other optimizations help to shorten query completion times. They also allow organizations to perform more queries in the time available, which provides greater agility and better utilization of the infrastructure.
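The query federation described under Advanced Analytics across All Data Sets can be pictured with a small sketch. It is not the SAP HANA or IDH interface: the two data sets, the function names, and the thread pool are all assumptions made for illustration. One sub-query reads purchase rows (standing in for the in-memory database), another counts page views in raw log lines (standing in for a MapReduce-style job on the Hadoop cluster), both run at the same time, and the results are joined by customer.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-ins for the two platforms (hypothetical data, not real APIs).
PURCHASES = [  # imagined as rows held in the in-memory database
    {"customer": "c1", "total": 120.0},
    {"customer": "c2", "total": 45.5},
    {"customer": "c1", "total": 30.0},
]
WEB_LOGS = [  # imagined as raw log records held in the Hadoop cluster
    "c1 GET /product/42",
    "c2 GET /product/7",
    "c1 GET /product/42",
    "c3 GET /home",
]


def purchase_totals(rows):
    """Sub-query 1: total spend per customer."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["total"]
    return totals


def page_views(lines):
    """Sub-query 2: page-view count per customer (a map/reduce-style count)."""
    counts = {}
    for line in lines:
        customer = line.split()[0]
        counts[customer] = counts.get(customer, 0) + 1
    return counts


def federated_query():
    """Run both sub-queries concurrently, then join the results on customer."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        totals_future = pool.submit(purchase_totals, PURCHASES)
        views_future = pool.submit(page_views, WEB_LOGS)
        totals = totals_future.result()
        views = views_future.result()
    # Inner join: keep only customers present in both sources.
    return {
        customer: {"total": totals[customer], "views": views[customer]}
        for customer in sorted(totals.keys() & views.keys())
    }


if __name__ == "__main__":
    print(federated_query())
```

In a real deployment the slower sub-query would typically dominate latency, which is why the text notes that the parts can also run asynchronously and that query performance statistics help developers shape queries.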


[Figure: Real-Time Analytics with Big Data Integration. The diagram shows data sources (weather data, market data, location data, Web logs, call logs, sensor logs) feeding two connected platforms. SAP HANA* Real Time, with SAP Data Services and SAP Business Objects, supports OLAP, ETL, analysis, data mining, and reporting. SAP HANA Smart Data Access is optimized for data relocation and for query federation and acceleration (proxy tables, hot replication, caching). The Intel® Distribution for Apache Hadoop Software comprises Sqoop, Flume, Oozie, Pig*, Mahout*, R*, HCatalog*, Hive*, HBase*, YARN* (+ MapReduce*), HDFS*, and Zookeeper, managed through Intel® Manager for Apache Hadoop* Software (deployment, configuration, monitoring, alerts, and security). A legend distinguishes open source components with some Intel optimization from those with extensive Intel optimization.]

Figure 1. The SAP HANA* Smart Data Access connector has been engineered and optimized by Intel and SAP to simplify and accelerate data sharing and query execution across both platforms. As a result, analysts can achieve fast query results across petabytes of structured and unstructured data.

Much of this functionality is supported through the SAP HANA Smart Data Access connector, which Intel and SAP have optimized for use with IDH (Figure 1). This connector supports data relocation as well as the creation of proxy tables within SAP HANA to simplify and accelerate data access and query execution.

Intel implemented a number of optimizations to improve query performance on Apache Hadoop. One example is hot replication, in which multiple replicas of frequently used data are automatically created to avoid contention. Suppose a company launches a popular new product, and the associated data is under continuous demand. Dozens or even hundreds of replicas can be generated so the data can be accessed and manipulated without bottlenecks.

Another performance-enhancing feature is caching. Frequently used data and intermediate query results are automatically stored in the in-memory database of SAP HANA, so they can be accessed almost instantly when needed.

With these and other optimizations, Intel and SAP help to make the integration between SAP HANA and IDH as seamless and as transparent as possible for BI users and developers.
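The two ideas above, hot replication and caching, can be sketched in a few lines. This is only an illustration under stated assumptions, not the IDH or SAP HANA implementation: the threshold of 1,000 reads per extra replica and the cap of 100 replicas are invented for the example, and `functools.lru_cache` stands in for an in-memory cache in front of a slower cluster read.

```python
from functools import lru_cache

BASE_REPLICAS = 3                # Hadoop's default of three replicas per block
READS_PER_EXTRA_REPLICA = 1000   # hypothetical threshold, chosen for illustration
MAX_REPLICAS = 100               # hypothetical cap, chosen for illustration


def target_replicas(reads_last_hour):
    """Hot replication: grow the replica count with read demand, up to a cap."""
    extra = reads_last_hour // READS_PER_EXTRA_REPLICA
    return min(BASE_REPLICAS + extra, MAX_REPLICAS)


SLOW_READS = []  # records every read that actually reaches the (pretend) cluster


@lru_cache(maxsize=1024)
def read_block(block_id):
    """Caching: only the first request for a block pays for the slow read."""
    SLOW_READS.append(block_id)
    return f"data-for-{block_id}"


if __name__ == "__main__":
    print(target_replicas(0), target_replicas(2500))
    read_block("b1")
    read_block("b1")  # served from the cache; SLOW_READS is unchanged
    print(SLOW_READS)
```

A production policy would also need to shrink replica counts when demand falls and to invalidate cached results when the underlying data changes; both are left out here to keep the sketch short.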


Integrated Data Management

SAP Data Services provides an integrated, enterprise-class platform for data integration, data quality, data profiling, and metadata management. System administrators can use it to load and manage data across both SAP HANA and IDH for SAP. They can also use it to manage data that has been loaded independently into the Hadoop cluster.

An Enterprise-Ready Platform

SAP HANA is engineered specifically to support mission-critical computing environments. Intel implements advanced security and reliability features in the Intel Xeon processor E7 family and related platform components, and works with SAP to ensure they are fully utilized throughout the SAP HANA solution stack.

Apache Hadoop, on the other hand, is an open-source software application that combines features and optimizations generated by many companies and individuals. This development model enables exceptionally fast innovation, which is evidenced by the rapid evolution of the Hadoop software ecosystem. However, because of this rapid evolution, there are gaps in most available Hadoop distributions, particularly with respect to security, availability, and manageability. These gaps have kept many businesses from deploying Hadoop in production environments.

Intel has worked to close those gaps in IDH. IDH includes the full open source solution stack, with all components pre-integrated and optimized to improve performance on Intel architecture. Intel also integrates a combination of open source and proprietary tools to provide a platform that addresses the requirements of enterprise deployments.

End-to-End Security

IDH provides end-to-end security to protect data. Tools and capabilities include:

• Authentication and Access Control. IDH supports user authentication and role-based access controls. Queries generated in SAP Business Objects are authenticated just once for both SAP HANA and IDH, and IDH provides granular access controls for data and services. Users and queries can only access authorized data sets, which helps to protect sensitive data against both internal threats and external hackers.

[Figure: Project Rhino: Establishing comprehensive security for Apache Hadoop*. The diagram shows the Intel® Distribution for Apache Hadoop stack alongside Intel® Manager. Stack components include connectors (Netezza, Oracle, SAP, SQLServer, Teradata, DB2), Sqoop*, Flume*, Kafka*, Oozie*, Pig*, Mahout*, R*, HCatalog*, Hive*, Lucene*, Solr*, Gryphon*, HBase*, YARN* (+ MapReduce*), SLURM*, Zookeeper*, Hadoop-compatible file systems (HDFS, Lustre*, GlusterFS), high availability and disaster recovery, and Rhino security (encryption, authentication, authorization, auditing). A legend distinguishes Intel proprietary components, Intel-optimized open source components, and components that include Intel security enhancements.]

Figure 2. The Intel® Distribution for Apache Hadoop* includes extensive enhancements for enterprise-class security and compliance, and Intel is working on Project Rhino to establish a comprehensive security framework across the Hadoop* ecosystem. The goal is to provide a common authentication and authorization framework with integrated support for regulatory requirements in financial, healthcare, government, and e-commerce environments.

• Fast, transparent data encryption. IDH uses Intel® Data Protection Technology with Advanced Encryption Standard New Instructions [5] (AES-NI), which accelerates encryption and decryption performance by up to 19 times [6], to enable strong data protection without compromising query performance. Data can be encrypted selectively and transparently, both in motion and at rest, to meet security and compliance requirements. Within IDH, transparent encryption is supported in Hive, Pig*, MapReduce, HBase*, and the Hadoop Distributed File System* (HDFS*).
• Governance. All database operations are logged across both SAP HANA and IDH and can be audited to verify that users only access authorized data sets and services. Reports and automated alerts help IT protect data and document compliance.

Intel is working to extend these and other security capabilities across the Hadoop ecosystem through an open source project called Project Rhino (Figure 2). The goal is to establish a comprehensive security framework for Hadoop that will help businesses address security issues and compliance protocols across a wide range of use cases in financial, healthcare, government, and e-commerce environments. Project Rhino will contribute code to the Apache Foundation so these capabilities will be freely available.

High Availability

Big data analytics are often used to improve outcomes in revenue-producing business processes, so high availability is important. SAP HANA provides integrated support for data replication and system failover to prevent downtime. Hadoop implements 3-way data replication by default, so that any data node in a cluster could fail without impacting service or data availability. However, the cluster NameNode and Job Tracker servers, which are required in all Hadoop deployments, are potential single points of failure. IDH provides integrated support for high availability for both these critical servers. Intel is also working on the open source Project Ladon, which is designed to support disaster recovery of Apache Hadoop through multisite data replication.

Enterprise-Class Manageability

SAP HANA is typically delivered as an appliance for onsite deployments. All hardware and software is tightly integrated and optimized to simplify deployment and management. Apache Hadoop, on the other hand, is based on open source software that is designed to run on large numbers of off-the-shelf servers. Management can be complex in this more distributed computing environment, and the challenges increase as a cluster grows.

IDH includes Intel® Manager for Apache Hadoop software, which combines open source and proprietary tools to provide enterprise-level manageability, including:

• A user-friendly interface for managing access controls and for updating the system. Built-in wizards provide workflows and guidance to speed deployment, simplify upgrades, and improve results.
• Automatic cluster configuration and tuning, using the Intel® Active Tuner. Advanced machine-learning algorithms select the best setup based on workload characteristics to deliver optimized query performance quickly and with no need for complex manual tuning.
• Built-in monitoring, with a dashboard that provides a comprehensive view of the cluster and system health.
• Flexible extensibility, with an application programming interface (API) that allows third-party and custom applications to access the functions in Intel Manager for Apache Hadoop.

SAP and Intel: A Shared Vision for Big Data Integration

Intel and SAP continue to jointly engineer, optimize, and enhance the integration of SAP HANA and IDH. The companies are working together to integrate new functionality and to optimize software to derive maximum benefit from advances in hardware. Some objectives of this collaboration include:

• Simplified troubleshooting, so query failures can be identified, diagnosed, and fixed more quickly and efficiently. Future solutions will include built-in analytics for root-cause analysis.
• Enhanced data relocation, so data can be moved more quickly, flexibly, and transparently between the two platforms.
• Stronger security, by further improving integration and by providing more comprehensive, multilayered protections in both hardware and software.

Intel is also deeply involved in hundreds of open source projects to increase Hadoop performance and functionality, and the results of these efforts will continue to increase the capability and value of IDH. Many of these developments are also offered back to the open source community to help drive innovation and interoperability across the broader big data ecosystem.
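Looking back at the access-control and governance capabilities listed under End-to-End Security, the following sketch shows how role-based access checks and an audit trail fit together in principle. It is a minimal illustration, not the IDH or Project Rhino interface: the role names, data set names, and the `AccessController` class are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from role to the data sets that role may read.
ROLE_GRANTS = {
    "analyst": {"sales", "web_logs"},
    "auditor": {"audit_log"},
}


@dataclass
class AccessController:
    """Checks reads against role grants and logs every decision."""

    user_roles: dict                       # user name -> set of role names
    audit_log: list = field(default_factory=list)

    def can_read(self, user, dataset):
        roles = self.user_roles.get(user, set())
        allowed = any(dataset in ROLE_GRANTS.get(role, set()) for role in roles)
        # Governance: record allowed and denied attempts alike.
        self.audit_log.append((user, dataset, "ALLOW" if allowed else "DENY"))
        return allowed


if __name__ == "__main__":
    controller = AccessController(user_roles={"ana": {"analyst"}})
    print(controller.can_read("ana", "sales"))      # role grants access
    print(controller.can_read("ana", "audit_log"))  # role does not grant access
    print(controller.audit_log)
```

Logging denied attempts as well as granted ones is a deliberate choice here, since the text describes auditing as a way to verify that users only access authorized data sets.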


SAP: Single Point of Contact for Service and Support

SAP HANA and IDH are available from SAP sales teams worldwide. SAP offers full support for the joint solution. SAP also offers comprehensive consulting services, from initial planning and assessment through implementation and ongoing optimization. The speed, scale, and flexibility of the platform go far beyond what has been possible in the past, and IT organizations can accelerate deployment by working with experts who have extensive experience with SAP HANA and Apache Hadoop.

Intel Distribution for Apache Hadoop: http://hadoop.intel.com
SAP Big Data: www.sap.com/bigdata

Conclusion

SAP and Intel provide an optimized solution for real-time big data analytics based on SAP HANA and the Intel Distribution for Apache Hadoop. Using this joint solution, data and business analysts can combine the performance of in-memory analytics with the massive scalability of Apache Hadoop. As a result, they can store and analyze petabytes of polystructured data cost effectively at the speeds needed to support real-time business processes.

Intel and SAP have worked closely together to optimize the combined platform to support fast, federated queries that tighten the seams between the two platforms and make it easier for BI users to get the results they want without worrying about the infrastructure. The solution is designed to support enterprise requirements for security, availability, and manageability, so IT organizations can integrate the platform into their datacenter while minimizing cost and risk.

1. Source: Sikka, Vishal, SAP. “The Business Value of Speed! Lessons from 10,000X SAP HANA Performance Club.” August 2012. http://www.saphana.com/community/blogs/blog/2012/08/05/the-business-value-of-speed.

2. Source: Intel internal measurements November 2013. Configurations: Baseline 1.0x: Intel® E7505 Chipset using four Intel® Xeon® processors E7-4870 (4P/10C/20T, 2.4GHz) with 256GB DDR3-1066 memory scoring 110,061 queries per hour. Source: Intel Technical Report #1347. New Generation 2x: Intel® C606J Chipset using four Intel® Xeon® processors E7-4890 v2 (4P/15C/30T, 2.8GHz) with 512GB DDR3-1333 (running 2:1 VMSE) memory scoring 218,406 queries per hour. Source: Intel Technical Report #1347.

3. Source: TeraSort Benchmarks conducted by Intel in December 2012. Custom settings: mapred.reduce.tasks=100 and mapred.job.reuse.jvm.num.tasks=-1. Cluster configuration: One head node (name node, job tracker), 10 workers (data nodes, task trackers), Cisco Nexus* 5020 10 Gigabit switch. Performance measured using Iometer* with Queue Depth 32. Baseline worker node: SuperMicro SYS-1026T-URF 1U servers with two Intel® Xeon® processors X5690 @ 3.47 GHz, 48 GB RAM, 700 GB 7200 RPM SATA hard drives, Intel® Ethernet Server Adapter I350-T2, Apache Hadoop* 1.0.3, Red Hat Enterprise Linux* 6.3, Oracle Java* 1.7.0_05. Baseline storage: 700 GB 7200 RPM SATA hard drives, upgraded storage: Intel® Solid-State Drive 520 Series (the Intel® Solid-State Drive 520 Series is currently not validated for data center usage). Baseline network adapter: Intel® Ethernet Server Adapter I350-T2, upgraded network adapter: Intel® Ethernet Converged Network Adapter X520-DA2. Upgraded software in worker node: Intel® Distribution for Apache Hadoop* software 2.1.1. Note: Solid-state drive performance varies by capacity. More information: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/examples/terasort/package-summary.html.

4. Source: Terasort Benchmarks conducted by Intel. Configuration details: One head node (name node, job tracker), 10 workers (data nodes, task trackers), Dual Intel® Xeon® processor [email protected] GHz, 32 cores per node, 7 x 1 TB dedicated data disks per node, 10 GbE network. System Swap turned off, Kernel Buffer Cache cleared before each performance test.

5. No computer system can provide absolute security. Requires an enabled Intel® processor and software optimized for use of the technology. Consult your system manufacturer and/or software vendor for more information.

6. Source: Intel Internal tests using OpenSSL 1.0.1c* encryption software to encrypt and decrypt a 1 GB text file, with and without AES-NI enabled. Server configuration: 4-socket server with 4 x Intel® Xeon® processor E5-2690 (32 core system, 1 core used in testing), 32 GB memory, CentOS 6.3* operating system, Apache Hadoop Distributed File System* (HDFS*) with namenode, datanode, and the test program all run on the same server, 240 GB Intel® Solid State Drive 320 Series storage. For details, see the Intel Solution Brief, “Fast, Low-Overhead Encryption for Apache Hadoop*.” http://hadoop.intel.com/pdfs/IntelEncryptionforHadoopSolutionBrief.pdf

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT.
EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. A “Mission Critical Application” is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL’S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS ,COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS’ FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined.” Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. 
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

© 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Core, Xeon, Intel Inside, the Intel Inside logo, the Look Inside. logo, and Look Inside. are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. Printed in USA 0214/MR/CMD/PDF Please Recycle 329774-001US