Hitachi Solution for Databases in an Enterprise Data Warehouse Offload Package for Oracle Database with MapR Distribution of Apache Hadoop

Reference Architecture Guide

By Shashikant Gaikwad, Subhash Shinde

December 2018

Feedback

Hitachi Data Systems welcomes your feedback. Please share your thoughts by sending an email message to [email protected]. To assist the routing of this message, use the paper number in the subject and the title of this white paper in the text.

Revision History

Revision: MK-SL-131-00
Changes: Initial release
Date: December 27, 2018

Table of Contents

Solution Overview 2
Business Benefits 2
High Level Infrastructure 3

Key Solution Components 4
Pentaho 6
Hitachi Advanced Server DS120 7
Hitachi Virtual Storage Platform Gx00 Models 7
Hitachi Virtual Storage Platform Fx00 Models 7
Brocade Switches 7
Cisco Nexus Data Center Switches 7
MapR Converged Data Platform 8
Red Hat Enterprise Linux 10

Solution Design 10
Server Architecture 11
Storage Architecture 13
Network Architecture 14
Data Analytics and Performance Monitoring Using Hitachi Storage Advisor 17
Oracle Enterprise Data Workflow Offload 17

Engineering Validation 29
Test Methodology 29
Test Results 30

Hitachi Solution for Databases in an Enterprise Data Warehouse Offload Package for Oracle Database with MapR Distribution of Apache Hadoop Reference Architecture Guide

Use this reference architecture guide to implement Hitachi Solution for Databases in an enterprise data warehouse offload package for Oracle Database. This Oracle converged infrastructure provides a high-performance, integrated solution for advanced analytics using the following applications:

. Hitachi Advanced Server DS120 with Intel Xeon Silver 4110 processors

. Pentaho Data Integration

. MapR distribution for Apache Hadoop

This converged infrastructure establishes best practices for environments where you can copy data in an enterprise data warehouse to an Apache Hive database on top of the Hadoop Distributed File System (HDFS). Then, you obtain your data for analysis from the Hive database instead of from the busy Oracle server.

This reference architecture guide is for you if you are in one of the following roles and need to create a big data management and advanced business analytics platform:

. Data scientist

. Database administrator

. System administrator

. Storage administrator

. Database performance analyzer

. IT professional with the responsibility of planning and deploying an EDW offload solution

To use this reference architecture guide to create your big data infrastructure, you should have familiarity with the following:

. Hitachi Advanced Server DS120

. Pentaho

. Big data and the MapR distribution for Apache Hadoop

. Apache Hive

. Oracle Real Application Cluster Database 12c Release 1

. IP networks

. Red Hat Enterprise Linux


Note — Testing of this configuration was in a lab environment. Many things affect production environments beyond prediction or duplication in a lab environment. Follow the recommended practice of conducting proof-of-concept testing for acceptable results in a non-production, isolated test environment that otherwise matches your production environment before your production implementation of this solution.

Solution Overview

Use this reference architecture to implement Hitachi Solution for Databases in an enterprise data warehouse offload package for Oracle Database with the MapR distribution of Apache Hadoop.

Business Benefits

This solution provides the following benefits:

. Improve database manageability — You can take a “divide and conquer” approach to data management by moving data onto a lower cost storage tier without disrupting access to data.

. Extreme scalability — Leveraging the extreme scalability of the Hadoop Distributed File System, you can offload data from the Oracle servers onto commodity servers running big data solutions.

. Lower total cost of ownership (TCO) — Reduce your capital expenditure by reducing the resources needed to run applications. Using the Hadoop Distributed File System and low-cost storage makes it possible to keep information that is not deemed currently critical, but that you still might want to access, off the Oracle servers.

This approach reduces the number of CPUs needed to run the Oracle database, optimizing your infrastructure. This potentially delivers hardware and software savings, including maintenance and support costs.

Reduce the costs of running your workloads by leveraging less expensive, general purpose servers running a Hadoop cluster.

. Improve availability — Reduce scheduled downtime by allowing database administrators to perform backup operations on a smaller subset of your data. With offloading, perform daily backups on hot data, and less frequent backups on warm data.

For an extremely large database, this can make the difference between having enough time to complete a backup in off-business hours or not.

This also reduces the amount of time required to recover your data.

. Analyze data without affecting the production environment — After data is processed and offloaded through Pentaho Data Integration, Pentaho dashboards can analyze your data without affecting the performance of the Oracle production environment.


High Level Infrastructure

Figure 1 shows the high-level infrastructure for this solution. The configuration of Hitachi Advanced Server DS120 provides the following characteristics:

. Fully redundant hardware

. High compute and storage density

. Flexible and scalable I/O options

. Sophisticated power and thermal design to avoid unnecessary operational expenditures

. Quick deployment and maintenance

Figure 1


To avoid any performance impact to the production database, Hitachi Vantara recommends using a configuration with a dedicated IP network for the following:

. Production Oracle database

. Pentaho server

. Hadoop servers

Uplink speed to the corporate network depends on your environment and requirements. The Cisco Nexus 93180YC-EX switches can support uplink speeds of 40 GbE.

This solution uses Hitachi Unified Compute Platform CI in a solution for an Oracle Database architecture with Hitachi Advanced Server DS220, Hitachi Virtual Storage Platform G600, and two Brocade G620 SAN switches to host Oracle Enterprise Data Warehouse. You can use your existing Oracle database environment or purchase Hitachi Unified Compute Platform CI to host Oracle RAC or a standalone solution to host Enterprise Data Warehouse.

Key Solution Components

The key solution components for this solution are listed in Table 1, “Hardware Components,” on page 4 and Table 2, “Software Components,” on page 6.

TABLE 1. HARDWARE COMPONENTS

Hitachi Virtual Storage Platform G600 (VSP G600) — firmware 83-04-47-40/00, quantity 1
. One controller
. 8 × 16 Gb/s Fibre Channel ports
. 8 × 12 Gb/s backend SAS ports
. 256 GB cache memory
. 40 × 960 GB SSDs, plus 2 spares
. 16 Gb/s × 2 ports CHB

Hitachi Advanced Server DS220 (Oracle host) — firmware 3A10.H3, quantity 1
. 2 Intel Xeon Gold 6140 CPU @ 2.30 GHz
. 768 GB (64 GB × 12) DIMM DDR4 synchronous registered (buffered) 2666 MHz
. Intel XXV710 dual-port 25 GbE NIC cards — driver i40e-2.3.6, quantity 2
. Emulex LightPulse LPe31002-M6 2-port 16 Gb/s Fibre Channel adapter — firmware 11.2.156.27, quantity 2

Hitachi Advanced Server DS120 (Apache Hadoop host) — firmware 3A10.H8, quantity 3
. 2 Intel Xeon Silver 4110 CPU @ 2.10 GHz
. 2 × 128 GB MLC SATADOM for boot
. 384 GB (32 GB × 12) DIMM DDR4 synchronous registered (buffered) 2666 MHz
. Intel XXV710 dual-port 25 GbE NIC cards — driver i40e-2.3.6, quantity 6
. 1.8 TB SAS drives — quantity 4

Hitachi Advanced Server DS120 (Pentaho host) — firmware 3A10.H8, quantity 1
. 2 Intel Xeon Silver 4110 CPU @ 2.10 GHz
. 2 × 128 GB MLC SATADOM for boot
. 128 GB (32 GB × 4) DIMM DDR4 synchronous registered (buffered) 2666 MHz
. Intel XXV710 dual-port 25 GbE NIC cards — driver i40e-2.3.6, quantity 2

Brocade G620 switches — firmware V8.0.1, quantity 2
. 48-port Fibre Channel switch
. 16 Gb/s SFPs
. Brocade hot-pluggable SFP+, LC connector

Cisco Nexus 93180YC-EX switches — firmware 7.0(3)I5(1), quantity 2
. 48 × 10/25 GbE fiber ports
. 6 × 40/100 Gb/s quad SFP (QSFP28) ports

Cisco Nexus 3048TP switch — firmware 7.0(3)I4(2), quantity 1
. 1 GbE 48-port Ethernet switch


TABLE 2. SOFTWARE COMPONENTS

Software | Version | Function

Red Hat Enterprise Linux | Version 7.4, kernel 3.10.0-514.36.5.el7.x86_64 | Operating system for the MapR distribution of the Hadoop environment
Red Hat Enterprise Linux | Version 7.4, kernel 3.10.0-693.11.6.el7.x86_64 | Operating system for the Oracle environment
Microsoft® Windows Server® | 2012 R2 Standard | Operating system for the Pentaho Data Integration environment
Oracle | 12c Release 1 (12.1.0.2.0) | Database software
Oracle Grid Infrastructure | 12c Release 1 (12.1.0.2.0) | Volume management, file system software, and Oracle Automatic Storage Management
Pentaho Data Integration | 8.1 | Extract-transform-load software
MapR distribution for Apache Hadoop | 6.0 | Hadoop distribution from MapR
Apache Hive | 1.1.0 | Hive database
Red Hat Enterprise Linux Device Mapper Multipath | 1.02.140 | Multipath software
Hitachi Storage Navigator [Note 1] | Microcode dependent | Storage management software
Hitachi Storage Advisor [Note 1] | 2.1.0 | Storage orchestration software

[Note 1] These software programs were used for this Oracle Database architecture built on Hitachi Unified Compute Platform CI. They may not be required for your implementation.

Pentaho

A unified data integration and analytics program, Pentaho addresses the barriers that block your organization's ability to get value from all your data. Simplify preparing and blending any data with a spectrum of tools to analyze, visualize, explore, report, and predict. Open, embeddable, and extensible, Pentaho ensures that each member of your team — from developers to business users — can translate data into value.

. Internet of things — Integrate machine data with other data for better outcomes.

. Big data — Accelerate value with Apache Hadoop, NoSQL, and other big data.

. Data integration — Access, manage, and blend any data from any source.

. Business analytics — Turn data into insights with embeddable analytics.

This solution uses Pentaho Data Integration to drive the extract, transform, and load (ETL) process. The end target of this process is an Apache Hive database on top of the Hadoop Distributed File System.


Hitachi Advanced Server DS120

Optimized for performance, high density, and power efficiency in a dual-processor server, Hitachi Advanced Server DS120 delivers a balance of compute and storage capacity. This rack-mounted server has the flexibility to power a wide range of solutions and applications.

The highly scalable memory supports up to 3 TB using 24 slots of 2666 MHz DDR4 RDIMM. DS120 is powered by the Intel Xeon Scalable processor family for complex and demanding workloads. There are flexible OCP and PCIe I/O expansion card options available. This server supports up to 12 storage devices, with up to 4 of them NVMe drives.

Hitachi Virtual Storage Platform Gx00 Models

Hitachi Virtual Storage Platform Gx00 models are based on industry-leading enterprise storage technology. With flash-optimized performance, these systems provide advanced capabilities previously available only in high-end storage arrays. With the Virtual Storage Platform Gx00 models, you can build a high performance, software-defined infrastructure to transform data into valuable information.

Hitachi Storage Virtualization Operating System provides storage virtualization, high availability, superior performance, and advanced data protection for all Virtual Storage Platform Gx00 models. This proven, mature software provides common features to consolidate assets, reclaim space, extend life, and reduce migration effort.

This solution uses Virtual Storage Platform G600, which supports Oracle Real Application Clusters.

Hitachi Virtual Storage Platform Fx00 Models

Hitachi Virtual Storage Platform Fx00 models deliver superior all-flash performance for business-critical applications, with continuous data availability. High-performance network attached storage with non-disruptive deduplication reduces the required storage capacity by up to 90% with the power to handle large, mixed-workload environments.

Hitachi Storage Virtualization Operating System provides storage virtualization, high availability, superior performance, and advanced data protection for all Virtual Storage Platform Fx00 models. This proven, mature software provides common features to consolidate assets, reclaim space, extend life, and reduce migration effort.

Brocade Switches

Brocade and Hitachi Vantara partner to deliver storage networking and data center solutions. These solutions reduce complexity and cost, as well as enable virtualization and increase business agility.

Optionally, this solution uses the following Brocade product:

. Brocade G620 switch, 48-port Fibre Channel

In this solution, SAN switches are optional. Direct connect is possible under certain circumstances. Check the support matrix to ensure support for your choice.

Cisco Nexus Data Center Switches

Cisco Nexus data center switches are built for scale, industry-leading automation, programmability, and real-time visibility.

This solution uses the following Cisco switches to provide Ethernet connectivity:

. Nexus 93180YC-EX, 48-port 10/25 GbE switch

. Nexus 3048TP, 48-port 1 GbE switch


MapR Converged Data Platform

The industry’s leading unified data platform, MapR Converged Data Platform, can run analytics and applications simultaneously with speed, scale, and reliability. It converges all data into a data fabric that can store, manage, process, and analyze data as it happens.

Integrating Apache Hadoop, Apache Spark, and Apache Drill with real-time database capabilities, global event streaming, and scalable enterprise storage, MapR Converged Data Platform powers a new generation of big data applications.

MapR delivers enterprise grade security, reliability, and real-time performance while dramatically lowering hardware and operational costs of your most important applications and data. Supporting a variety of open source projects, MapR is committed to using industry-standard APIs.

Figure 2 shows MapR Converged Data Platform integrating with workloads and deployment models.

Figure 2

MapR supports dozens of open source projects. It extends Hadoop by providing its own custom features. MapR-XD Cloud-Scale Data Store (MapR-XD) is the industry’s only exabyte-scale data store for building intelligent applications with MapR Converged Data Platform.

. MapR-ES — A publish and subscribe messaging system built into the MapR platform, MapR-ES is compatible with Kafka APIs.

. MapR-DB — This is a high performance NoSQL database that lets you run analytics on live data. With MapR-DB, you can handle multiple use cases and workloads on a single cluster. To provide a seamless interface with your current solutions, it can be accessed with HBase APIs and JSON APIs.

. MapR Data Fabric for Kubernetes — The MapR Data Fabric includes a natively integrated Kubernetes volume driver to provide persistent storage volumes for access to any data located on-premises, across clouds, and at the edge.

. Application deployment in containers — Stateful applications can now be deployed in containers for production use cases, machine learning pipelines, and multi-tenant use cases.

. MapR Control System — Use MapR Control System to manage, administer, and monitor the Hadoop cluster.


. Multi-Tenancy — MapR supports multi-tenancy, surpassing the capabilities of YARN.

. High performance — With faster file access and an optimized MapReduce, as a MapR customer you can deploy one-third fewer nodes than with other Hadoop distributions.

. High Availability — MapR automatically eliminates single points of failure.

. Disaster Recovery — MapR provides disaster recovery with mirror copies of data replicated to other sites and consistent point-in-time snapshots.

Figure 3 shows the components of the MapR Data Platform and other systems that it integrates with.

Figure 3

MapR provides open source ecosystem projects to handle a variety of big data management tasks. Projects include Apache Storm, Apache Spark, Apache Hive, Apache Mahout, Apache Hadoop YARN, Apache Sqoop, Apache Flume, and more.

MapR's open interface allows it to interoperate with many commercial tools, such as the following:

. Pentaho

. Oracle

Oracle Database has a multi-tenant architecture so you can consolidate many databases quickly and manage them as a cloud service. Oracle Database also includes in-memory data processing capabilities for analytical performance. Additional database innovations deliver efficiency, performance, security, and availability. Oracle Database comes in two editions: Enterprise Edition and Standard Edition 2.

Oracle Automatic Storage Management (Oracle ASM) is a volume manager and a file system for Oracle database files. This supports single-instance Oracle Database and Oracle Real Application Clusters configurations. Oracle ASM is the recommended storage management solution that provides an alternative to conventional volume managers, file systems, and raw devices.


Red Hat Enterprise Linux

Red Hat Enterprise Linux delivers military-grade security, 99.999% uptime, support for business-critical workloads, and so much more. Ultimately, the platform helps you reallocate resources from maintaining the status quo to tackling new challenges.

Device mapper multipathing (DM-Multipath) allows you to configure multiple I/O paths between server nodes and storage arrays into a single device.

These I/O paths are physical SAN connections that can include separate cables, switches, and controllers. Multipathing aggregates the I/O paths, creating a new device that consists of the aggregated paths.

Solution Design

This describes the reference architecture environment to implement Hitachi Solution for Databases in an enterprise data warehouse offload package for Oracle Database. The environment uses Hitachi Advanced Server DS120 and Hitachi Advanced Server DS220.

The infrastructure configuration includes the following:

. Pentaho server — Configure one server to run Pentaho.

. Hadoop Cluster Servers — Configure at least three servers to run an Apache Hadoop cluster with the Hive2 database. The number of Hadoop servers can be expanded, based on the size of the working set, sharding, and other factors. This solution uses local hard disk drives as RAID-1 storage for the Hadoop cluster hosts.

. IP network connection — Use IP connections through Cisco Nexus switches to connect to the Pentaho server, the Apache Hive servers, and the Oracle server.

. Oracle Database architecture built on Hitachi Unified Compute Platform CI — Use this infrastructure to host Oracle Enterprise Data Warehouse. The infrastructure includes one Hitachi Advanced Server DS220, two Brocade G620 SAN switches, and any Hitachi Virtual Storage Platform Gx00 model or Virtual Storage Platform Fx00 model. (This example uses Virtual Storage Platform G600.) Configure the Oracle database for a single instance or as an Oracle Real Application Clusters environment.

The following are the components in a minimal deployment, single rack configuration of this solution. It uses a single Hitachi Advanced Server DS120.

Every deployment has its own requirements, using different software combinations. These rack deployment examples show some basic configurations. There are many services that can be part of a deployment that are not listed in these examples.

Use MapR Installer to deploy MapR Converged Data Platform.

To design a configuration to meet your needs, contact MapR Sales or Hitachi Vantara Global Sales.

The following recommendation is for a small single rack deployment of a total of 10 nodes.

. Service Node (only) — The 2 nodes labeled Service Node run the following:
. Container location databases (CLDB)
. Web server
. Resource Manager

. ZooKeeper Node (only) — The 3 nodes labeled ZooKeeper Node run Apache ZooKeeper.


. All nodes — All nodes run the following:
. Node Manager
. NFS Gateway
. File server
. Drill

This deployment uses top-of-rack switches.

For complete configuration and deployment information, see Optimized infrastructure for big data analytics on MapR.

Server Architecture

Using the same environment that was used in testing is not a requirement to deploy this solution. However, this architecture provides the compute power for the Apache Hive database to handle complex database queries and a large volume of transaction processing in parallel.

To validate the environment, the following Hitachi Advanced Server DS120 nodes were used:

. Three nodes for the Apache Hadoop host

. One node for the Pentaho Data Integration host

To validate the environment, the following Hitachi Advanced Server DS220 node was used:

. One node for the Oracle host

Table 3, “Hitachi Advanced Server DS120 Specifications,” and Table 4, “Hitachi Advanced Server DS220 Specifications,” describe the details of the server configuration used to validate this solution.

TABLE 3. HITACHI ADVANCED SERVER DS120 SPECIFICATIONS

Server Server Name Role CPU Cores RAM

Hadoop Server 1 hadoopnode1 Hadoop Cluster Node 1 16 384 GB (32 GB × 12)

Hadoop Server 2 hadoopnode2 Hadoop Cluster Node 2 16 384 GB (32 GB × 12)

Hadoop Server 3 hadoopnode3 Hadoop Cluster Node 3 16 384 GB (32 GB × 12)

Pentaho Server edwpdi Pentaho PDI Server 16 128 GB (32 GB × 4)

TABLE 4. HITACHI ADVANCED SERVER DS220 SPECIFICATIONS

Server Server Name Role CPU Cores RAM

Oracle Oracle host Oracle Server 36 768 GB (64 GB × 12)

11 12

Use the configuration settings in Table 5 for the BIOS and Red Hat Enterprise Linux 7.4 kernel parameters on the Apache Hadoop cluster servers.

TABLE 5. APACHE HADOOP SERVER BIOS AND RED HAT ENTERPRISE LINUX 7.4 KERNEL PARAMETERS

Parameter Category Setting Value

BIOS NUMA ENABLE

DISK READ AHEAD DISABLE

Red Hat Enterprise Linux 7.4 kernel ulimit unlimited

vm.dirty_ratio 20

vm.dirty_background_ratio 10

vm.swappiness 1

transparent_hugepage never

I/O scheduler noop

Red Hat Enterprise Linux 7.4 operating system services tuned disable
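The settings in Table 5 can be spot-checked from the operating system before the MapR installation. The following is a minimal sketch, assuming Red Hat Enterprise Linux paths and a data device named sda (a placeholder); adjust the expected values and device name for your environment.

# Minimal check of the Table 5 kernel settings (assumes RHEL paths; "sda" is a placeholder device).
EXPECTED = {
    "/proc/sys/vm/dirty_ratio": "20",
    "/proc/sys/vm/dirty_background_ratio": "10",
    "/proc/sys/vm/swappiness": "1",
}

for path, expected in EXPECTED.items():
    actual = open(path).read().strip()
    status = "OK" if actual == expected else f"MISMATCH (expected {expected})"
    print(f"{path}: {actual} {status}")

# Transparent huge pages: the active value is shown in brackets, for example "always madvise [never]".
thp = open("/sys/kernel/mm/transparent_hugepage/enabled").read()
print("transparent_hugepage:", "OK" if "[never]" in thp else f"MISMATCH ({thp.strip()})")

# I/O scheduler for the data device (placeholder: sda); the active scheduler appears in brackets.
sched = open("/sys/block/sda/queue/scheduler").read()
print("io scheduler:", "OK" if "[noop]" in sched else f"MISMATCH ({sched.strip()})")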

Use the configuration settings in Table 6 for the BIOS and Red Hat Enterprise Linux 7.4 kernel parameters on the Pentaho server.

TABLE 6. PENTAHO SERVER PARAMETERS

Parameter Category Setting Value

BIOS NUMA ENABLE

DISK READ AHEAD DISABLE

Use the parameters in Table 7 for the Apache Hadoop environment.

TABLE 7. PARAMETERS FOR THE APACHE HADOOP ENVIRONMENT

Setting Property Value

Memory Tuning mapred.child.java.opts 2048 MB
Replication factor dfs.replication 3
Map CPU vcores mapreduce.map.cpu.vcores 2
Reduce CPU vcores mapreduce.reduce.cpu.vcores 2
Buffer size io.file.buffer.size 128
DFS Block size dfs.block.size 128
Speculative Execution mapreduce.map.tasks.speculative.execution TRUE
Hive vectorized execution hive.vectorized.execution.enabled TRUE
Hive Server Heap Size hive-env.sh HADOOP_OPTS 16 GB
Spark driver memory spark.driver.memory 12 GB
MapReduce reduce memory mapreduce.reduce.memory.mb 3072
MapReduce I/O sort mapreduce.task.io.sort.mb 512
MapReduce framework mapreduce.framework.name yarn
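Exactly where these properties are set depends on your MapR installation (MapR Control System, or the Hadoop and Hive site files under /opt/mapr). As an illustration only, the following sketch renders a subset of the Table 7 properties into Hadoop's XML property format; the output file name and the -Xmx form of the child JVM option are assumptions, not values taken from the validated environment.

import xml.etree.ElementTree as ET

# Subset of Table 7 expressed as Hadoop-style properties (illustrative values only).
properties = {
    "mapred.child.java.opts": "-Xmx2048m",   # 2048 MB child JVM heap, written in -Xmx form
    "dfs.replication": "3",
    "mapreduce.map.cpu.vcores": "2",
    "mapreduce.reduce.cpu.vcores": "2",
    "mapreduce.map.tasks.speculative.execution": "true",
    "mapreduce.reduce.memory.mb": "3072",
    "mapreduce.task.io.sort.mb": "512",
    "mapreduce.framework.name": "yarn",
}

configuration = ET.Element("configuration")
for name, value in properties.items():
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value

# Write to a scratch file; merge the resulting property elements into your site files by hand.
ET.ElementTree(configuration).write("mapred-site.sample.xml", encoding="utf-8", xml_declaration=True)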

Use the connection parameters in Table 8 for the Oracle environment, if appropriate for your environment.

TABLE 8. CONNECTION PARAMETERS FOR ORACLE ENVIRONMENT

Setting Property Value

Database processes processes 5000

Database sessions sessions 7000

Database transactions transactions 6000

Note — The connection information in Table 8, “Connection Parameters for Oracle Environment,” may differ for your production environment, based on your offload configuration. To execute multiple transformation steps, Pentaho Data Integration opens multiple database connections. Make sure that Oracle Database allows enough connections to offload using Pentaho.
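Because Pentaho Data Integration opens one connection per transformation step, confirm that the limits in Table 8 leave enough headroom before a large offload run. The following is a minimal sketch, assuming the cx_Oracle client package is installed and the account has access to the v$ views; the credentials and connect string are placeholders.

import cx_Oracle  # assumption: Oracle client libraries and the cx_Oracle package are installed

# Placeholder credentials and connect descriptor; replace with values for your environment.
connection = cx_Oracle.connect("system", "password", "oraclehost:1521/ORCL")
cursor = connection.cursor()

# Configured limits (Table 8 values: processes=5000, sessions=7000, transactions=6000).
cursor.execute(
    "SELECT name, value FROM v$parameter WHERE name IN ('processes', 'sessions', 'transactions')"
)
limits = dict(cursor.fetchall())

# Sessions currently in use; the difference is the headroom available to Pentaho connections.
cursor.execute("SELECT COUNT(*) FROM v$session")
in_use = cursor.fetchone()[0]

print(f"limits: {limits}")
print(f"sessions in use: {in_use}, free: {int(limits['sessions']) - in_use}")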

Storage Architecture

This describes the storage architecture for this solution.

Storage Configuration for Hadoop Cluster

This configuration uses recommended practices with Hitachi Advanced Server DS120 and MapR for the design and deployment of storage for Apache Hadoop.

Configure a total of four hard disk drives on each Hadoop node.

For the best performance and to have the size to accommodate the Oracle Enterprise Data Warehouse offloading space, adjust the size of the hard drives to meet your business requirements.

For more information on how Hitachi provides a high-performance, integrated, and converged solution for Oracle Database, see Hitachi Solution for Databases Reference Architecture for Oracle Real Application Clusters Database 12c.


Table 9 shows the storage configuration used for the Hadoop cluster in this solution.

TABLE 9. STORAGE CONFIGURATION FOR APACHE HIVE DATABASE

RAID level: RAID-1
Drive type: 1.2 TB HDD
Number of drives: 4
Total usable capacity: 4.8 TB
File system type: MapR NFS
File system block size: 4 KB
Disk readahead (BIOS setting): Disable

Note — On the node where you install the MapR distribution of Apache Hadoop services, you need 100 GB of space allocated exclusively for Hadoop services data. If you want to use the root partition as the data directory for Hadoop services, create a root partition of 100 GB. If you want to use any other disk, make sure it has more than 100 GB of available space.

File System Recommendation for MapR Distribution for Hadoop

The MapR Data Platform provides a unified data solution for structured data (tables) and unstructured data (files). MapR File System (MapR-FS) is a random read-write distributed file system that allows applications to concurrently read and write directly to disk.

The storage system architecture used by MapR-FS is written in C/C++. It prevents locking contention, eliminating performance impact from Java garbage collection.

For more information on the file system, see MapR File system overview.

Network Architecture

This architecture requires the following separate networks for the Pentaho server and the MapR distribution of Apache Hadoop servers:

. Public Network — This network must be scalable. In addition, it must meet the low-latency needs of the network traffic generated by the servers running applications in the environment.

. Management Network — This network provides BMC connections to the physical servers.

. Data Network — This network provides communication between nodes.

Hitachi Vantara recommends using pairs of 10/25 Gb/s NICs for the public network and 1 Gb/s LOM for the management network.


Observe these points when configuring the public network in your environment:

. For each server in the configuration, use at least two identical, high-bandwidth, low-latency NICs for the public network.

. Use NIC bonding to provide failover and load balancing within a server. If using two dual-port NICs, NIC bonding can be configured across the two cards. A quick bond-status check is sketched below.

. Ensure all NICs are set to full duplex mode.

For the complete network architecture, see Optimized infrastructure for big data analytics on MapR Reference Architecture Guide (PDF).
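This is a minimal sketch of such a bond-status check, assuming a Linux host and a bond named bond0 (a placeholder; use the interface name defined in your environment).

# Print bonding mode and per-slave link status (assumes Linux; bond0 is a placeholder name).
BOND = "/proc/net/bonding/bond0"

try:
    with open(BOND) as f:
        for line in f:
            line = line.strip()
            if line.startswith(("Bonding Mode:", "Slave Interface:", "MII Status:", "Speed:")):
                print(line)
except FileNotFoundError:
    print(f"{BOND} not found: bonding is not configured on this host")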

Figure 4 shows the network configuration in this solution.

Figure 4


Table 10, “Network Configuration and IP Addresses,” shows the network configuration, IP addresses, and name configuration that was used when testing the environment with Hitachi Advanced Server DS120. Your implementation of this solution can differ.

Configure pairs of ports from different physical NIC cards to avoid a single point of failure (SPoF) when installing two NICs on each server. However, if high availability is not a concern for you, this environment supports using one NIC on the MapR Distribution Hadoop servers and the Pentaho server for lower cost.

The IP configuration for the Oracle environment can be adjusted for Oracle Real Application Clusters, as applicable.

TABLE 10. NETWORK CONFIGURATION AND IP ADDRESSES

DS120 Server 1 (Hadoop)
. NIC-0 and NIC-3 — subnet 167, Bond0, IP address 172.17.167.71, public network, 10/25 Gb/s, Cisco Nexus switches 1 and 2, switch port 11
. BMC dedicated NIC — subnet 242, no bond, IP address 172.17.242.165, management network, 1 Gb/s, Cisco Nexus switch 3, switch port 11

DS120 Server 2 (Hadoop)
. NIC-0 and NIC-3 — subnet 167, Bond0, IP address 172.17.167.72, public network, 10/25 Gb/s, Cisco Nexus switches 1 and 2, switch port 12
. BMC dedicated NIC — subnet 242, no bond, IP address 172.17.242.166, management network, 1 Gb/s, Cisco Nexus switch 3, switch port 12

DS120 Server 3 (Hadoop)
. NIC-0 and NIC-3 — subnet 167, Bond0, IP address 172.17.167.73, public network, 10/25 Gb/s, Cisco Nexus switches 1 and 2, switch port 13
. BMC dedicated NIC — subnet 242, no bond, IP address 172.17.242.167, management network, 1 Gb/s, Cisco Nexus switch 3, switch port 13

DS120 Server 4 (Pentaho)
. NIC-0 and NIC-3 — subnet 167, Bond0, IP address 172.17.167.74, public network, 10/25 Gb/s, Cisco Nexus switches 1 and 2, switch port 14
. BMC dedicated NIC — subnet 242, no bond, IP address 172.17.242.74, management network, 1 Gb/s, Cisco Nexus switch 3, switch port 14

DS220 Server (Oracle)
. NIC-0 and NIC-3 — subnet 167, Bond0, IP address 172.17.167.75, public network, 10/25 Gb/s, Cisco Nexus switches 1 and 2, switch port 15
. BMC dedicated NIC — subnet 242, no bond, IP address 172.17.242.75, management network, 1 Gb/s, Cisco Nexus switch 3, switch port 15


Data Analytics and Performance Monitoring Using Hitachi Storage Advisor

Use Hitachi Storage Advisor for data analytics and performance monitoring with this solution.

By reducing storage infrastructure management complexities, Hitachi Storage Advisor simplifies management operations. This helps you to rapidly configure storage systems and IT services for new business applications.

Hitachi Storage Advisor can be used for this Oracle Database architecture, built on Hitachi Unified Compute Platform CI. However, this may not be required for your implementation.

Oracle Enterprise Data Workflow Offload

Use a Python script to generate the Oracle Enterprise Data Workflow mapping of large Oracle data sets to the MapR distribution of Apache Hadoop, and then offload the data using Pentaho Data Integration. Have this script create a transformation Kettle file with Spoon for the data offload.

This auto-generated transformation transfers row data from Oracle database tables or views in a schema to the Apache Hadoop database. Pentaho Data Integration uses this transformation directly.

Pentaho Data Integration

Pentaho Data Integration allows you to ingest, blend, cleanse, and prepare diverse data from any source. With visual tools to eliminate coding and complexity, Pentaho puts all data sources and the best quality data at the fingertips of businesses and IT users.

Using intuitive drag-and-drop data integration, coupled with data-agnostic connectivity, your use of Pentaho Data Integration can span from flat files and RDBMS to MapR Distribution Hadoop and beyond. Go beyond a standard extract-transform-load (ETL) designer to scalable and flexible management for end-to-end data flows.

In this reference architecture, the end target of the ETL process is an Apache Hive database.

To set up Pentaho to connect to a MapR Distribution Hadoop cluster, make sure that the MapR client configuration is done properly and that the correct plugin (shim) is configured. Read the MapR client configuration steps in Setting Up the Client. View the plugin configuration steps in Set Up Pentaho to Connect to a MapR Cluster.

For this solution, configure and set Active Shim to MapR 6.0. Figure 5 shows selecting the MapR distribution for the Pentaho shim in Hadoop.

Figure 5


Script for Automatic Oracle Enterprise Data Workflow Offload

You can use an Oracle Enterprise Data Workflow script that uses existing user accounts with appropriate permissions for Oracle, Hadoop Distributed File System, and Apache Hive database access. Have the script test the Oracle, Hadoop Distributed File System, and Apache Hive connections first and then proceed with further options.

The main options to generate a Pentaho Data Integration transformation are the following (a sketch of the corresponding query logic follows the list):

. Transfer all tables in selected Oracle schema to Apache Hive on Hadoop.

. Transfer Oracle tables based on partition to Apache Hive on Hadoop.

. Transfer specific Oracle table rows to Apache Hive on Hadoop based on date range.

. Transfer specific Oracle table rows to Apache Hive on Hadoop based on column key value.
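The script itself is not reproduced in this guide. As an illustration of the four options above, the following sketch shows how such a script might build the SELECT statement placed into the generated Table Input step for each offload mode; the function, table, and column names are hypothetical.

def build_offload_query(table, mode, **kw):
    """Return the Oracle SELECT used by a generated Table Input step (illustrative only)."""
    if mode == "full_schema":
        # One transformation per table; the caller iterates over every table in the schema.
        return f"SELECT * FROM {table}"
    if mode == "partition":
        # Offload a single named partition of a partitioned table.
        return f"SELECT * FROM {table} PARTITION ({kw['partition_name']})"
    if mode == "date_range":
        # Offload rows whose date column falls inside a range.
        return (
            f"SELECT * FROM {table} "
            f"WHERE {kw['date_column']} BETWEEN "
            f"TO_DATE('{kw['start_date']}', 'YYYY-MM-DD') AND "
            f"TO_DATE('{kw['end_date']}', 'YYYY-MM-DD')"
        )
    if mode == "key_value":
        # Offload rows matching a key column value.
        return f"SELECT * FROM {table} WHERE {kw['key_column']} = '{kw['key_value']}'"
    raise ValueError(f"unknown offload mode: {mode}")


# Example: rows of a hypothetical SALES table for January 2018.
print(build_offload_query("SH.SALES", "date_range",
                          date_column="TIME_ID",
                          start_date="2018-01-01", end_date="2018-01-31"))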

Example Work Flow Offloads Using Pentaho Data Integration

These are examples of enterprise data warehouse offloads performed with Pentaho Data Integration. They were created using the graphical user interface in Pentaho Data Integration.

Full Table Data Copy from Oracle to Apache Hive

It can be a challenge to convert data types between two database systems. With the graphical user interface in Pentaho Data Integration, you can do data type conversion with a few clicks. No coding is needed.

Use the graphical user interface in Pentaho Data Integration to construct a workflow to copy all the data from an Oracle table to Apache Hive.

Define data connections for the Pentaho server so that Pentaho Data Integration can access data from sources like Oracle Database. See Define Data Connections for the Pentaho Server for these procedures.

Figure 6 shows the Pentaho Data Integration work flow for a full table data copy from Oracle to Apache Hive in the user interface.

Figure 6

Your workflow can copy data directly from Oracle to Apache Hive. However, there is a performance penalty if you do that. Read about this performance penalty.

To avoid the performance penalty, create a Pentaho Data Integration work flow with the following steps:

1. Read the Oracle table (table input).

2. Copy the Oracle table to Hadoop Distributed File System (Hadoop file output).

3. Load the Hadoop Distributed File System file into the Apache Hive table (execute the SQL script).


To create a full table data copy work flow, do the following.

1. On the Pentaho Data Integration work flow, double-click Read Oracle Table (Figure 6 on page 18). The Table Input dialog box opens (Figure 7 on page 19).

2. On the Table Input page, provide the connection information and SQL query for the data, and click OK.

Figure 7

3. Set parameters for the file transfer.

(1) To transfer Oracle table input data to Hadoop Distributed File System, double-click Copy Oracle Table to HDFS File (Figure 6 on page 18). The Hadoop File Output dialog box opens to the File tab (Figure 8 on page 20).


Figure 8

(2) Click the Fields tab (Figure 9 on page 21).

(3) Change settings as needed, and then click OK.

Figure 9 shows setting the value for the HDFS output file to convert a date field (TIME_ID) to a string value during offload.


Figure 9

4. Load the Hadoop Distributed File System file into the Apache Hive database.

(1) Double-click Load HDFS File into Hive Table (Figure 6 on page 18). The Execute SQL statements dialog box opens (Figure 10 on page 22).

(2) Change settings as necessary, and then click OK. You can form a Hive SQL (HQL) query to match your requirements.
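For reference, the HQL issued by the Load HDFS File into Hive Table step typically creates the target table, if needed, and loads the delimited file written in step 3. The following is a minimal sketch of equivalent statements run outside Pentaho, assuming the PyHive package, a HiveServer2 endpoint on hadoopnode1, and hypothetical table, column, and path names.

from pyhive import hive  # assumption: PyHive is installed; the PDI step issues equivalent HQL itself

# Placeholder HiveServer2 host, port, and user.
connection = hive.Connection(host="hadoopnode1", port=10000, username="mapr")
cursor = connection.cursor()

# Create the target table; the layout mirrors the fields defined on the Hadoop File Output step.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS sales_offload (
        prod_id INT,
        cust_id INT,
        time_id STRING,
        amount_sold DECIMAL(10,2))
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
    STORED AS TEXTFILE
""")

# Load the HDFS file produced by the Copy Oracle Table to HDFS File step (hypothetical path).
cursor.execute("LOAD DATA INPATH '/offload/sales_offload.txt' OVERWRITE INTO TABLE sales_offload")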


Figure 10


Figure 11 shows the execution of the work flow.

Figure 11


Figure 12 shows the copied table information in Apache Hive Database.

Figure 12

Merge and Join Two Tables Data Copy from Oracle to Apache Hive

Often you need to join two or more enterprise data warehouse tables. This example demonstrates how to use the graphical user interface in Pentaho Data Integration to join two Oracle tables and offload the data to HDFS-Hive.

This example helps you understand how to query two different data sources simultaneously while saving the operational result (a join, in this case) to Hadoop. The same approach also applies to reading data from Oracle and Hive databases together: you can perform the join with one table from the Oracle database and the other table from the Hive database.
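Conceptually, the Merge Join step in Pentaho Data Integration walks two input streams that are already sorted on the join key and emits matching row pairs. You do not implement this yourself, but the following sketch of the algorithm, with hypothetical row data, shows why both inputs must be sorted on the key.

def merge_join(left, right, key):
    """Inner-join two lists of dicts that are already sorted on `key` (conceptual sketch)."""
    joined, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1
        elif left[i][key] > right[j][key]:
            j += 1
        else:
            # Emit every pairing of rows that share this key value.
            k = left[i][key]
            left_block = [row for row in left[i:] if row[key] == k]
            right_block = [row for row in right[j:] if row[key] == k]
            for lrow in left_block:
                for rrow in right_block:
                    joined.append({**lrow, **rrow})
            i += len(left_block)
            j += len(right_block)
    return joined

# Hypothetical example: customers joined to sales on CUST_ID (both inputs sorted on CUST_ID).
customers = [{"CUST_ID": 1, "NAME": "A"}, {"CUST_ID": 2, "NAME": "B"}]
sales = [{"CUST_ID": 1, "AMOUNT": 10}, {"CUST_ID": 1, "AMOUNT": 20}, {"CUST_ID": 2, "AMOUNT": 5}]
print(merge_join(customers, sales, "CUST_ID"))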

Figure 13 on page 25 shows the transformation work flow and the execution results for the merged work flow in the graphical user interface of Pentaho Data Integration.


Figure 13


Figure 14 shows transformation information to load the Hadoop Distributed File System file into the Hive table.

Figure 14


Figure 15 shows simple verification of join data on the Hive database.

Figure 15

If done as part of the Pentaho work flow, sorting rows in large tables can be time-consuming. Hitachi Vantara recommends sorting all the rows in server memory instead of using a memory-plus-disk approach.

On the Sort row dialog box (Figure 16 on page 28), make the following settings:

. Use the Sort size (rows in memory) text box to control how many rows are sorted in server memory.

. Use the Free memory threshold (in %) text box to help avoid filling all available server memory. Make sure to allocate enough RAM to Pentaho Data Integration on the server when you need to do large sorting tasks.

Figure 16 shows controlling the cache size in the Sort rows dialog box from the graphical user interface in Pentaho Data Integration.


Figure 16

Sorting on the database is often faster than sorting externally, especially if there is an index on the sort field or fields. You can use this as another option to improve performance.

More Oracle tables can be joined, one at a time, with the same Pentaho Data Integration sorting and join steps. You can also use Execute SQL Script in Pentaho as another option to join multiple tables. Multiple table inputs with merge joins in one transformation in the Pentaho Community Forums shows how to do this from Pentaho Data Integration.

Pentaho Kettle Performance

By default, all steps used in the Kettle transformation for offloading the data run in parallel and consume CPU and memory resources. Because these resources are limited, plan how many transformation steps should run in parallel.

Form multiple transformations with a limited number of steps and then execute them in sequence. This ensures that all data is copied or offloaded without exhausting the resources on the server.

Decide on the number of steps in a transformation based on the available CPU and memory resources on the Kettle application host. Refer to the Pentaho server hardware information in Table 1, “Hardware Components,” on page 4.

. Environment work flow information

. Multiple Kettle transformations were created to run the copy or offload sequentially, rather than creating multiple parallel transformations through a Kettle job.


. Transformation steps information

. The number of tables in each Kettle transformation was set to 80.

. The number of columns in the source Oracle database tables was set to 22.

. Observations

. The CPU resource utilization was between 55% and 60% throughout the transformation tests. This means that there was no resource depletion within the server. Possibly the number of tables in your work flow could be increased.

. Work flow execution with optimum performance depends on the following:

. How much CPU resource is available?

. How much data is being copied or offloaded in a table row?

. How many tables were included in the transformation?

Engineering Validation

This summarizes the key observations from the test results for the Hitachi Solution for Databases in an enterprise data warehouse offload package for Oracle Database. This environment uses Hitachi Advanced Server DS120, Pentaho 8.0, and the MapR distribution of Hadoop Ecosystem 6.0.0.

When evaluating this Oracle Enterprise Data Warehouse solution, the laboratory environment used the following:

. One Hitachi Unified Compute Platform CI for the Oracle Database environment

. Four Hitachi Advanced Server DS120

. One Hitachi Virtual Storage Platform G600

. Two Brocade G620 SAN switches

Using this same test environment is not a requirement to deploy this solution.

Test Methodology

The source data was preloaded into a sample database schema in the Oracle database.

After preloading the data, a few Pentaho Data Integration work flow examples were developed for offloading data to the Apache Hive database.

Once data was loaded into the Apache Hive database, certain verifications were done to make sure the data was offloaded correctly.

The example work flows found in Oracle Enterprise Data Workflow Offload were used to validate this environment.

Testing involved following this procedure for each example work flow:

1. Verify the following for each Oracle table before running the enterprise data offload work flow:
. Number of rows
. Number of columns
. Data types

2. Run the enterprise data offload work flow.


3. Verify the following for the Apache Hive table after running the enterprise data offload work flow, to see if the numbers matched those in the Oracle table:
. Number of documents (same as number of rows)
. Number of fields (same as number of columns)
. Data types

A minimal scripted check of these counts is sketched below.
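Row-count and column-count verification can also be scripted. The following is a minimal sketch, assuming the cx_Oracle and PyHive packages are available; host names, credentials, and the table name are placeholders.

import cx_Oracle              # assumption: Oracle client and cx_Oracle installed
from pyhive import hive       # assumption: PyHive installed

TABLE = "SALES_OFFLOAD"       # placeholder table name, present in both databases

# Source row and column counts from Oracle (placeholder credentials and connect string;
# user_tab_columns assumes the connected account owns the table).
ora = cx_Oracle.connect("system", "password", "oraclehost:1521/ORCL").cursor()
ora.execute(f"SELECT COUNT(*) FROM {TABLE}")
ora_rows = ora.fetchone()[0]
ora.execute(f"SELECT COUNT(*) FROM user_tab_columns WHERE table_name = '{TABLE}'")
ora_cols = ora.fetchone()[0]

# Target counts from Hive (placeholder HiveServer2 host; DESCRIBE assumes a non-partitioned table).
hv = hive.Connection(host="hadoopnode1", port=10000, username="mapr").cursor()
hv.execute(f"SELECT COUNT(*) FROM {TABLE.lower()}")
hive_rows = hv.fetchone()[0]
hv.execute(f"DESCRIBE {TABLE.lower()}")
hive_cols = len(hv.fetchall())

print(f"rows:    Oracle={ora_rows}  Hive={hive_rows}  match={ora_rows == hive_rows}")
print(f"columns: Oracle={ora_cols}  Hive={hive_cols}  match={ora_cols == hive_cols}")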

Test Results

After running each enterprise data offload work flow example, the test results showed the same number of documents (rows), fields (columns), and data types in the Oracle database and the Apache Hive database.

These results show that you can use Pentaho Data Integration to move data from an Oracle host to Apache Hive on top of Hadoop Distributed File System to relieve the workload on your Oracle host. This provides a cost-effective solution to expanding capacity to relieve server utilization pressures.

For More Information

Hitachi Vantara Global Services offers experienced storage consultants, proven methodologies, and a comprehensive services portfolio to assist you in implementing Hitachi products and solutions in your environment. For more information, see the Services website.

Demonstrations and other resources are available for many Hitachi products. To schedule a live demonstration, contact a sales representative or partner. To view on-line informational resources, see the Resources website.

Hitachi Academy is your education destination to acquire valuable knowledge and skills on Hitachi products and solutions. Our Hitachi Certified Professional program establishes your credibility and increases your value in the IT marketplace. For more information, see the Hitachi Vantara Training and Certification website.

For more information about Hitachi products and services, contact your sales representative or partner, or visit the Hitachi Vantara website.

Hitachi Vantara

Corporate Headquarters
2845 Lafayette Street
Santa Clara, CA 96050-2639 USA
www.HitachiVantara.com | community.HitachiVantara.com

Regional Contact Information
Americas: +1 408 970 1000 or [email protected]
Europe, Middle East and Africa: +44 (0) 1753 618000 or [email protected]
Asia Pacific: +852 3189 7900 or [email protected]

© Hitachi Vantara Corporation 2018. All rights reserved. HITACHI is a trademark or registered trademark of Hitachi, Ltd. VSP is a registered trademark of Hitachi Vantara Corporation. Microsoft and Windows Server are trademarks or registered trademarks of Microsoft Corporation. All other trademarks, service marks, and company names are properties of their respective owners. Notice: This document is for informational purposes only, and does not set forth any warranty, expressed or implied, concerning any equipment or service offered or to be offered by Hitachi Data Systems Corporation. MK-SL-131-00. December 2018.