
Hitachi Solution for Databases in an Enterprise Data Warehouse Offload Package for Oracle Database to Cloudera Distribution of Apache Hadoop

Reference Architecture Guide

By Shashikant Gaikwad, Subhash Shinde

September 2018

Feedback

Hitachi Data Systems welcomes your feedback. Please share your thoughts by sending an email message to [email protected]. To assist the routing of this message, use the paper number in the subject and the title of this white paper in the text.

Revision History

Revision      Changes                                                Date
MK-SL-098-00  Initial release                                        August 20, 2018
MK-SL-098-01  Updated text in Merge and Join Two Tables Data Copy    September 24, 2018

Table of Contents

Solution Overview 2
Business Benefits 2
High Level Infrastructure 3
Key Solution Components 4
Pentaho 6
Hitachi Advanced Server DS120 7
Hitachi Virtual Storage Platform Gx00 Models 7
Hitachi Virtual Storage Platform Fx00 Models 7
Brocade Switches 7
Cisco Nexus Data Center Switches 7
Cloudera 8
Oracle Database 9
Red Hat Enterprise Linux 9
Solution Design 9
Solution Validation 11
Storage Architecture 13
Network Architecture 15
Data Analytics and Performance Monitoring Using Hitachi Storage Advisor 17
Oracle Enterprise Data Offload Workflow 18
Engineering Validation 30
Test Methodology 30
Test Results 31

Hitachi Solution for Databases in an Enterprise Data Warehouse Offload Package for Oracle Database to Cloudera Distribution of Apache Hadoop Reference Architecture Guide

Use this reference architecture guide to implement Hitachi Solution for Databases in an enterprise data warehouse offload package for Oracle Database. This Oracle converged infrastructure provides a high-performance, integrated solution for advanced analytics using the following applications:

. Hitachi Advanced Server DS120 with Intel Xeon Silver 4110 processors

. Pentaho

. Cloudera Distribution Hadoop

This architecture establishes best practices for environments where you can copy data in an enterprise data warehouse to an Apache Hive database on top of Hadoop Distributed File System (HDFS). Then, you obtain your data for analysis from the Hive database instead of from the busy Oracle server.

This reference architecture guide is for you if you are in one of the following roles and need to create a big data management and advanced analytics solution:

. Data scientist

. Database administrator

. System administrator

. Storage administrator

. Database performance analyzer

. IT professional with the responsibility of planning and deploying an EDW offload solution

To use this reference architecture guide to create your big data infrastructure, you should have familiarity with the following:

. Hitachi Advanced Server DS120

. Pentaho

. Big data and Cloudera Distribution Hadoop (CDH)

. Apache Hive

. Oracle Real Application Cluster Database 12c Release 1

. IP networks

. Red Hat Enterprise Linux


Note — Testing of this configuration was in a lab environment. Many things affect production environments beyond prediction or duplication in a lab environment. Follow the recommended practice of conducting proof-of-concept testing for acceptable results in a non-production, isolated test environment that otherwise matches your production environment before your production implementation of this solution.

Solution Overview

Use this reference architecture to implement Hitachi Solution for Databases in an enterprise data warehouse offload package for Oracle Database to Cloudera Distribution Hadoop.

Business Benefits

This solution provides the following benefits:

. Improve database manageability
You can take a "divide and conquer" approach to data management by moving data onto a lower cost storage tier without disrupting access to data.

. Extreme scalability
Leveraging the extreme scalability of Hadoop Distributed File System, you can offload data from the Oracle servers into commodity servers running big data solutions.

. Lower total cost of ownership (TCO)
Reduce your capital expenditure by reducing the resources needed to run applications. Using Hadoop Distributed File System and low-cost storage makes it possible to keep information that is not deemed currently critical, but that you still might want to access, off the Oracle servers.

This approach reduces the number of CPUs needed to run the Oracle database, optimizing your infrastructure. This potentially delivers hardware and software savings, including maintenance and support costs.

Reduce the costs of running your workloads by leveraging less expensive, general purpose servers running Hadoop cluster.

. Improve availability
Reduce scheduled downtime by allowing database administrators to perform backup operations on a smaller subset of your data. With offloading, perform daily backups on hot data, and less frequent backups on warm data.

For an extremely large database, this can make the difference between having enough time to complete a backup in off-business hours or not.

This also reduces the amount of time required to recover your data.

. Analyze data without affecting the production environment
When processing and offloading data through Pentaho, dashboards in Pentaho can analyze your data without affecting the performance of the Oracle production environment.


High Level Infrastructure

Figure 1 shows the high-level infrastructure for this solution. The configuration of Hitachi Advanced Server DS120 provides the following characteristics:

. Fully redundant hardware

. High compute and storage density

. Flexible and scalable I/O options

. Sophisticated power and thermal design to avoid unnecessary operation expenditures

. Quick deployment and maintenance

Figure 1


To avoid any performance impact to the production database, Hitachi Vantara recommends using a configuration with a dedicated IP network for the following:

. Production Oracle database

. Pentaho server

. Hadoop servers

Uplink speed to the corporate network depends on your environment and requirements. The Cisco Nexus 93180YC-EX switches can support uplink speeds of 40 GbE.

This solution uses Hitachi Unified Compute Platform CI for the Oracle Database architecture, with Hitachi Advanced Server DS220, Hitachi Virtual Storage Platform G600, and two Brocade G620 SAN switches hosting the Oracle enterprise data warehouse. You can use your existing Oracle database environment, or purchase Hitachi Unified Compute Platform CI as an Oracle RAC or standalone solution, to host the enterprise data warehouse.

Key Solution Components

The key solution components for this solution are listed in Table 1, “Hardware Components,” on page 4 and Table 2, “Software Components,” on page 6.

TABLE 1. HARDWARE COMPONENTS

Hitachi Virtual Storage Platform G600 (VSP G600), firmware 83-04-47-40/00, quantity 1:
. One controller
. 8 × 16 Gb/s Fibre Channel ports
. 8 × 12 Gb/s backend SAS ports
. 256 GB cache memory
. 40 × 960 GB SSDs, plus 2 spares
. 16 Gb/s × 2 ports CHB

Hitachi Advanced Server DS220 (Oracle host), firmware 3A10.H3, quantity 1:
. 2 Intel Xeon Gold 6140 CPU @ 2.30 GHz
. 768 GB (64 GB × 12) DIMM DDR4 synchronous registered (buffered) 2666 MHz
. Intel XXV710 dual port 25 GbE NIC cards (driver i40e-2.3.6, quantity 2)
. Emulex LightPulse LPe31002-M6 2-port 16 Gb/s Fibre Channel adapter (firmware 11.2.156.27, quantity 2)

Hitachi Advanced Server DS120 (Hadoop hosts), firmware 3A10.H3, quantity 3:
. 2 Intel Xeon Silver 4110 CPU @ 2.10 GHz
. 2 × 64 GB MLC SATADOM for boot
. 384 GB (32 GB × 12) DIMM DDR4 synchronous registered (buffered) 2666 MHz
. Intel XXV710 dual port 25 GbE NIC cards (driver i40e-2.3.6, quantity 6)
. 1.8 TB SAS drives (quantity 4)

Hitachi Advanced Server DS120 (Pentaho host), firmware 3A10.H3, quantity 1:
. 2 Intel Xeon Silver 4110 CPU @ 2.10 GHz
. 2 × 64 GB MLC SATADOM for boot
. 128 GB (32 GB × 4) DIMM DDR4 synchronous registered (buffered) 2666 MHz
. Intel XXV710 dual port 25 GbE NIC cards (driver i40e-2.3.6, quantity 2)

Brocade G620 switches, firmware V8.0.1, quantity 2:
. 48-port Fibre Channel switch
. 16 Gb/s SFPs
. Brocade hot-pluggable SFP+, LC connector

Cisco Nexus 93180YC-EX switches, firmware 7.0(3)I5(1), quantity 2:
. 48 × 10/25 GbE fiber ports
. 6 × 40/100 Gb/s quad SFP (QSFP28) ports

Cisco Nexus 3048TP switch, firmware 7.0(3)I4(2), quantity 1:
. 48-port 1 GbE Ethernet switch


TABLE 2. SOFTWARE COMPONENTS

. Red Hat Enterprise Linux 7.3 (kernel 3.10.0-514.36.5.el7.x86_64): operating system for the Cloudera Distribution Hadoop environment
. Red Hat Enterprise Linux 7.4 (kernel 3.10.0-693.11.6.el7.x86_64): operating system for the Oracle environment
. Microsoft® Windows Server® 2012 R2 Standard: operating system for the Pentaho Data Integration (PDI) environment
. Oracle 12c Release 1 (12.1.0.2.0): database software
. Oracle Grid Infrastructure 12c Release 1 (12.1.0.2.0): volume management, file system software, and Oracle Automatic Storage Management
. Pentaho Data Integration 8.0: extract-transform-load (ETL) software
. Cloudera Distribution Hadoop 5.13: Hadoop distribution
. Apache Hive 1.1.0: Hive database
. Red Hat Enterprise Linux Device Mapper Multipath 1.02.140: multipath software
. Hitachi Storage Navigator [Note 1] (microcode dependent): storage management software
. Hitachi Storage Advisor (HSA) [Note 1] 2.1.0: storage orchestration software

[Note 1] These software programs were used for this Oracle Database architecture built on Hitachi Unified Compute Platform CI. They may not be required for your implementation.

Pentaho

A unified data integration and analytics program, Pentaho addresses the barriers that block your organization's ability to get value from all your data. Simplify preparing and blending any data with a spectrum of tools to analyze, visualize, explore, report, and predict. Open, embeddable, and extensible, Pentaho ensures that each member of your team — from developers to business users — can translate data into value.

. Internet of things — Integrate machine data with other data for better outcomes.

. Big data — Accelerate value with Apache Hadoop, NoSQL, and other big data.

. Data integration — Access, manage, and blend any data from any source. This solution uses Pentaho Data Integration to drive the extract, transform, and load (ETL) process. The end target of this process is an Apache Hive database on top of Hadoop Distributed File System.

. Business analytics — Turn data into insights with embeddable analytics.


Hitachi Advanced Server DS120

Optimized for performance, high density, and power efficiency in a dual-processor server, Hitachi Advanced Server DS120 delivers a balance of compute and storage capacity. This rack mounted server has the flexibility to power a wide range of solutions and applications.

The highly scalable memory supports up to 3 TB using 24 slots of 2666 MHz DDR4 RDIMM. DS120 is powered by the Intel Xeon scalable processor family for complex and demanding workloads. There are flexible OCP and PCIe I/O expansion card options available. This server supports up to 12 storage devices with up to 4 NVMe drives.

Hitachi Virtual Storage Platform Gx00 Models

Hitachi Virtual Storage Platform Gx00 models are based on industry-leading enterprise storage technology. With flash-optimized performance, these systems provide advanced capabilities previously available only in high-end storage arrays. With the Virtual Storage Platform Gx00 models, you can build a high performance, software-defined infrastructure to transform data into valuable information.

Hitachi Storage Virtualization Operating System provides storage virtualization, high availability, superior performance, and advanced data protection for all Virtual Storage Platform Gx00 models. This proven, mature software provides common features to consolidate assets, reclaim space, extend life, and reduce migration effort.

This solution uses Virtual Storage Platform G600, which supports Oracle Real Application Clusters.

Hitachi Virtual Storage Platform Fx00 Models

Hitachi Virtual Storage Platform Fx00 models deliver superior all-flash performance for business-critical applications, with continuous data availability. High-performance network attached storage with non-disruptive deduplication reduces the required storage capacity by up to 90% with the power to handle large, mixed-workload environments.

Hitachi Storage Virtualization Operating System provides storage virtualization, high availability, superior performance, and advanced data protection for all Virtual Storage Platform Fx00 models. This proven, mature software provides common features to consolidate assets, reclaim space, extend life, and reduce migration effort.

Brocade Switches

Brocade and Hitachi Vantara partner to deliver storage networking and data center solutions. These solutions reduce complexity and cost, as well as enable virtualization and cloud computing to increase business agility.

Optionally, this solution uses the following Brocade product:

. Brocade G620 switch, 48-port Fibre Channel

In this solution, SAN switches are optional. Direct connect is possible under certain circumstances. Check the support matrix to ensure support for your choice.

Cisco Nexus Data Center Switches

Cisco Nexus data center switches are built for scale, industry-leading automation, programmability, and real-time visibility.

This solution uses the following Cisco switches to provide Ethernet connectivity:

. Nexus 93180YC-EX, 48-port 10/25 GbE switch

. Nexus 3048TP, 48-port 1GbE Switch


Cloudera

Cloudera is the leading provider of enterprise-ready, big data software and services. Cloudera Enterprise Data Hub is the market-leading Hadoop distribution. It includes Apache Hadoop, Cloudera Manager, related open source projects, and technical support.

Big Data is a generic term to cover a set of components that are used with very large data sets to provide advanced data analytics. Big data usually refers to large volumes of unstructured or semi-structured data.

Usually, a big data solution is part of the Apache Hadoop project. However, big data can include components from many different software companies.

This reference architecture uses Red Hat Enterprise Linux and Cloudera Enterprise Data Hub.

The following is a partial list of the software components and modules that can be used in a Cloudera Enterprise Data Hub deployment:

. Cloudera Manager
Cloudera Manager provides management and monitoring of the Cloudera Hadoop distribution cluster.

. Apache Hadoop Distributed File System
Hadoop Distributed File System (HDFS) is a distributed high-performance file system designed to run on commodity hardware.

. Apache Hadoop Common
These common utilities support the other Hadoop modules. This programming framework supports the distributive processing of large data sets.

. Apache Hadoop YARN
Apache Hadoop YARN is a framework for job scheduling and cluster resource management. This splits the functionalities of the following into separate daemons:

. ResourceManager interfaces with the client to track tasks and assign tasks to NodeManagers.
. NodeManager launches and tracks execution on the worker nodes.

. Apache Hive
Apache Hive is data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.

. Apache ZooKeeper
Apache ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

. ZooKeeper Master Node
ZooKeeper is a high-availability system, whereby two or more nodes can connect to a ZooKeeper master node. The ZooKeeper master node controls the nodes to provide high availability.


. ZooKeeper Standby Master Node
When ZooKeeper runs in a highly-available setup, there can be several nodes configured as ZooKeeper master nodes. Only one of these configured nodes is active as a master node at any time. The others are standby master nodes.

If the currently-active master node fails, then the ZooKeeper cluster itself promotes one of the standby master nodes to the active master node.

Oracle Database

Oracle Database has a multi-tenant architecture so you can consolidate many databases quickly and manage them as a cloud service. Oracle Database also includes in-memory data processing capabilities for analytical performance. Additional database innovations deliver efficiency, performance, security, and availability. Oracle Database comes in two editions: Enterprise Edition and Standard Edition 2.

Oracle Automatic Storage Management (Oracle ASM) is a volume manager and a file system for Oracle database files. This supports single-instance Oracle Database and Oracle Real Application Clusters configurations. Oracle ASM is the recommended storage management solution that provides an alternative to conventional volume managers, file systems, and raw devices.

Red Hat Enterprise Linux

Red Hat Enterprise Linux delivers military-grade security, 99.999% uptime, support for business-critical workloads, and so much more. Ultimately, the platform helps you reallocate resources from maintaining the status quo to tackling new challenges.

Device mapper multipathing (DM-Multipath) allows you to configure multiple I/O paths between server nodes and storage arrays into a single device.

These I/O paths are physical SAN connections that can include separate cables, switches, and controllers. Multipathing aggregates the I/O paths, creating a new device that consists of the aggregated paths.

Solution Design

This describes the reference architecture environment to implement Hitachi Solution for Databases in an enterprise data warehouse offload package for Oracle Database. The environment uses Hitachi Advanced Server DS120 and Hitachi Advanced Server DS220.

The infrastructure configuration includes the following:

. Pentaho server — There is one server configured to run Pentaho.

. Hadoop Cluster Servers — There are at least three servers configured to run Hadoop Cluster with Hive2 database. The number of Hadoop servers can be expanded, based on size of working set, sharding, and other factors. This solution uses local HDDs as JBOD for the Hadoop cluster hosts.

. IP network connection — There are IP connections to connect the Pentaho server, the Apache Hive servers, and the Oracle server through Cisco Nexus switches.

. Oracle Database Architecture built on Hitachi Unified Compute Platform CI — This is the Oracle infrastructure used for hosting Oracle Enterprise Data Warehouse. The infrastructure includes one Hitachi Advanced Server DS220, two Brocade G620 SAN switches, and Hitachi Virtual Storage Platform G600. In your implementation, any Hitachi Virtual Storage Platform Gx00 model or Virtual Storage Platform Fx00 model can be used. Oracle database can be configured for a single instance or an Oracle Real Application Cluster environment.


The following are the components in a minimal deployment, single rack configuration of this solution. It uses Hitachi Advanced Server DS120 servers.

. Switches
. Top-of-rack data switches
. Management switches

. 3 master nodes
. 2 Intel 4110 processors
. 256 GB of RAM
. 10 × 1.8 TB SAS drives
. 2 × 128 GB SATADOMs

. 1 edge node
. 2 Intel 4110 processors
. 128 GB of RAM
. 2 × 1.8 TB SAS drives
. 2 × 128 GB SATADOMs

. 1 Pentaho node
. 2 Intel 4110 processors
. 368 GB of RAM
. 4 × 1.8 TB SAS drives
. 2 × 128 GB SATADOMs

. 9 worker nodes
. 2 Intel 4110 processors
. 368 GB of RAM
. 12 × 1.8 TB SAS drives
. 2 × 128 GB SATADOMs

The worker nodes provide 1.8 TB × 12 × 6 of raw storage, allowing for 129.6 TB of raw storage. After a replication factor of 3 and reserving 20% for overhead, this gives 34.56 TB of usable storage. There are multiple reasons to add more worker nodes:

. Provide more storage
. Improve performance, after monitoring system performance to see if there is a need
. The queries require more than the standard 20% overhead
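The raw and usable capacity arithmetic above can be scripted as a rough sizing aid. The following is a minimal sketch, assuming the drive size, drive count, replication factor, and overhead reservation used in the example above; substitute the values for your own worker node configuration.

# Rough HDFS sizing sketch based on the arithmetic in this section.
# The inputs mirror the example deployment and are assumptions; substitute
# your own drive size, drives per node, node count, replication factor,
# and overhead reservation.

def usable_hdfs_capacity_tb(drive_tb, drives_per_node, worker_nodes,
                            replication=3, overhead=0.20):
    """Return the estimated usable HDFS capacity in TB."""
    raw_tb = drive_tb * drives_per_node * worker_nodes
    return (raw_tb / replication) * (1.0 - overhead)

if __name__ == "__main__":
    raw = 1.8 * 12 * 6                              # 129.6 TB raw, as in the text
    usable = usable_hdfs_capacity_tb(1.8, 12, 6)    # 34.56 TB usable
    print(f"Raw: {raw:.1f} TB, usable: {usable:.2f} TB")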

For complete configuration and deployment information, see Hitachi Solution for Enterprise Data Intelligence with Cloudera.


Solution Validation

For validation purposes, this reference architecture uses three Hitachi Advanced Server DS120 servers for a three-node Hadoop host configuration, and one Advanced Server DS120 server to host the Pentaho Data Integration (PDI) tool. The architecture provides the compute power for the Apache Hive database to handle complex database queries and a large volume of transaction processing in parallel.

Table 3, “Hitachi Advanced Server DS120 Specifications,” and Table 4, “Hitachi Advanced Server DS220 Specifications,” describe the details of the server configuration for this solution.

TABLE 3. HITACHI ADVANCED SERVER DS120 SPECIFICATIONS

. Hadoop Server 1: server name hadoopnode1, role Hadoop Cluster Node 1, 16 CPU cores, 384 GB RAM (32 GB × 12)
. Hadoop Server 2: server name hadoopnode2, role Hadoop Cluster Node 2, 16 CPU cores, 384 GB RAM (32 GB × 12)
. Hadoop Server 3: server name hadoopnode3, role Hadoop Cluster Node 3, 16 CPU cores, 384 GB RAM (32 GB × 12)
. Pentaho Server: server name edwpdi, role Pentaho PDI Server, 16 CPU cores, 128 GB RAM (32 GB × 4)

TABLE 4. HITACHI ADVANCED SERVER DS220 SPECIFICATIONS

. Oracle: server name Oracle host, role Oracle Server, 36 CPU cores, 768 GB RAM (64 GB × 12)

Table 5 shows the server BIOS and Red Hat Enterprise Linux 7.4 kernel parameters for the Apache Hadoop cluster servers.

TABLE 5. BIOS AND RED HAT ENTERPRISE LINUX 7.4 KERNEL PARAMETERS FOR THE APACHE HADOOP CLUSTER SERVERS

BIOS settings:
. NUMA: ENABLE
. Disk read ahead: DISABLE

RHEL 7.4 kernel settings:
. ulimit: unlimited
. vm.dirty_ratio: 20
. vm.dirty_background_ratio: 10
. vm.swappiness: 1
. transparent_hugepage: never
. I/O scheduler: noop

RHEL 7.3 OS services:
. tuned: disable

Table 6 shows the server BIOS and Red Hat Enterprise Linux 7.4 kernel parameters for the Pentaho server.

TABLE 6. PARAMETERS FOR THE PENTAHO SERVER

BIOS settings:
. NUMA: ENABLE
. Disk read ahead: DISABLE

Table 7 shows the parameters for the Apache Hadoop environment.

TABLE 7. PARAMETERS FOR THE APACHE HADOOP ENVIRONMENT

. Memory tuning: mapred.child.java.opts = 2048 MB
. Replication factor: dfs.replication = 3
. Map CPU vcores: mapreduce.map.cpu.vcores = 2
. Reduce CPU vcores: mapreduce.reduce.cpu.vcores = 2
. Buffer size: io.file.buffer.size = 128
. DFS block size: dfs.block.size = 128
. Speculative execution: mapreduce.map.tasks.speculative.execution = TRUE
. Hive vectorized execution: hive.vectorized.execution.enabled = TRUE
. Hive server heap size: hive-env.sh HADOOP_OPTS = 16 GB
. Spark driver memory: spark.driver.memory = 12 GB

Table 8 has the connection parameters for the Oracle environment.

TABLE 8. CONNECTION PARAMETERS FOR ORACLE ENVIRONMENT

. Database processes: processes = 5000
. Database sessions: sessions = 7000
. Database transactions: transactions = 6000

Note — The connection information given in Table 8 may differ for your production environment, based on your offload configuration. To execute multiple transformation steps, Pentaho Data Integration (PDI) opens multiple database connections. Make sure that Oracle Database allows a sufficient number of connections to offload using PDI.
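As a sanity check before starting a large offload run, you can compare current connection usage against the configured limits. The following is a minimal sketch, assuming the cx_Oracle Python driver and a user with privileges to query V$RESOURCE_LIMIT; the connection details are placeholders.

# Minimal sketch: report Oracle processes and sessions headroom before an
# offload run. Assumes the cx_Oracle driver and a user that can query
# V$RESOURCE_LIMIT; connection details are placeholders.
import cx_Oracle

def report_connection_headroom(dsn, user, password):
    with cx_Oracle.connect(user=user, password=password, dsn=dsn) as conn:
        cursor = conn.cursor()
        cursor.execute(
            "SELECT resource_name, current_utilization, limit_value "
            "FROM v$resource_limit "
            "WHERE resource_name IN ('processes', 'sessions')"
        )
        for name, current, limit in cursor:
            print(f"{name}: {current} in use, limit {limit}")

report_connection_headroom("dbhost/orclpdb", "edw_user", "edw_password")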

Storage Architecture

This describes the storage architecture for this solution.

Storage Configuration for Hadoop Cluster

This configuration uses recommended practices with Hitachi Advanced Server DS120 and CDH for the design and deployment of storage for Hadoop.

Configure a total of four HDDs on each Hadoop Node.

For the best performance and size to accommodate the Oracle Enterprise Data Warehouse offloading space, adjust the size of HDDs to meet your business requirements.

For more information on how Hitachi provides a high-performance, integrated, and converged solution for Oracle Database, see Hitachi Solution for Databases Reference Architecture for Oracle Real Application Clusters Database 12c.


Table 9 shows the storage configuration used for the Hadoop cluster in this solution.

TABLE 9. STORAGE CONFIGURATION FOR APACHE HIVE DATABASE

. RAID level: JBOD
. Drive type: 1.2 TB HDD
. Number of drives: 4
. Total usable capacity: 4.8 TB
. File system type: XFS
. File system block size: 4 KB
. Disk readahead (BIOS setting): Disable

Note — On the node where you install Cloudera Management services, you need 100 GB of space allocated exclusively for Cloudera Management services data. If you want to use the root partition as data directories for Cloudera Management services, have a root partition with 100 GB. If you want to use any other disk, make sure the available space on it is more than 100 GB.

File System Recommendation for Cloudera Distribution for Hadoop

Hadoop Distributed File System (HDFS) is designed to run on top of an underlying file system in an operating system. Cloudera recommends that you use any of the following file systems tested on the supported operating systems:

. ext3 — This is the most tested underlying filesystem for HDFS.

. ext4 — This scalable extension of ext3 is supported in more recent Linux releases.

. XFS — This is the default file system in Red Hat Enterprise Linux 7.

Note — Cloudera does not support in-place upgrades from ext3 to ext4. Cloudera recommends that you format disks as ext4 before using them as data directories.


Network Architecture

This architecture requires the following separate networks for the Pentaho server and the Cloudera Distribution Hadoop servers:

. Public Network — This network must be scalable. In addition, it must meet the low latency needs of the network traffic generated by the servers running applications in the environment.

. Management Network — This network provides BMC connections to the physical servers.

. Data Network — This network provides communication between nodes.

Hitachi Vantara recommends using pairs of 10/25 Gb/s NICs for the public network and 1 Gb/s LOM for the management network.

Observe these points when configuring public network in your environment:

. For each server in the configuration, use at least two identical, high-bandwidth, low-latency NICs for the public network.

. Use NIC bonding to provide failover and load balancing within a server. If using two dual-port NICs, NIC bonding can be configured across the two cards.

. Ensure all NICs are set to full duplex mode.

For the complete network architecture, see Hitachi Solution for Enterprise Data Intelligence with Cloudera.

Figure 2 on page 16 shows the network configuration in this solution.


Figure 2

Table 10, “Network Configuration and IP Addresses,” on page 17 shows the network configuration, IP addresses, and name configuration that was used when testing the environment with Hitachi Advanced Server DS120. Your implementation of this solution can differ.

Configure pairs of ports from different physical NIC cards to avoid a single point of failure (SPoF) when installing two NICs on each server. However, if high availability is not a concern for you, this environment supports using one NIC on the Cloudera Distribution Hadoop servers and the Pentaho server for lower cost.

The IP configuration for the Oracle environment can be adjusted for Oracle Real Application Clusters, as applicable.


TABLE 10. NETWORK CONFIGURATION AND IP ADDRESSES

DS120 Server 1 (Hadoop):
. NIC-0: subnet 167, Bond0, IP address 172.17.167.71, public network, 10/25 Gb/s, Cisco Nexus switch 1, switch port 11
. NIC-3: subnet 167, Bond0, public network, 10/25 Gb/s, Cisco Nexus switch 2
. BMC dedicated NIC: subnet 242, IP address 172.17.242.165, management network, 1 Gb/s, Cisco Nexus switch 3, switch port 11

DS120 Server 2 (Hadoop):
. NIC-0: subnet 167, Bond0, IP address 172.17.167.72, public network, 10/25 Gb/s, Cisco Nexus switch 1, switch port 12
. NIC-3: subnet 167, Bond0, public network, 10/25 Gb/s, Cisco Nexus switch 2
. BMC dedicated NIC: subnet 242, IP address 172.17.242.166, management network, 1 Gb/s, Cisco Nexus switch 3, switch port 12

DS120 Server 3 (Hadoop):
. NIC-0: subnet 167, Bond0, IP address 172.17.167.73, public network, 10/25 Gb/s, Cisco Nexus switch 1, switch port 13
. NIC-3: subnet 167, Bond0, public network, 10/25 Gb/s, Cisco Nexus switch 2
. BMC dedicated NIC: subnet 242, IP address 172.17.242.167, management network, 1 Gb/s, Cisco Nexus switch 3, switch port 13

DS120 Server 4 (Pentaho):
. NIC-0: subnet 167, Bond0, IP address 172.17.167.74, public network, 10/25 Gb/s, Cisco Nexus switch 1, switch port 14
. NIC-3: subnet 167, Bond0, public network, 10/25 Gb/s, Cisco Nexus switch 2
. BMC dedicated NIC: subnet 242, IP address 172.17.242.74, management network, 1 Gb/s, Cisco Nexus switch 3, switch port 14

DS220 Server (Oracle):
. NIC-0: subnet 167, Bond0, IP address 172.17.167.75, public network, 10/25 Gb/s, Cisco Nexus switch 1, switch port 15
. NIC-3: subnet 167, Bond0, public network, 10/25 Gb/s, Cisco Nexus switch 2
. BMC dedicated NIC: subnet 242, IP address 172.17.242.75, management network, 1 Gb/s, Cisco Nexus switch 3, switch port 15

Data Analytics and Performance Monitoring Using Hitachi Storage Advisor

Use Hitachi Storage Advisor for data analytics and performance monitoring with this solution.

By reducing storage infrastructure management complexities, Hitachi Storage Advisor simplifies management operations. This helps you to rapidly configure storage systems and IT services for new business applications.

Hitachi Storage Advisor can be used for this Oracle Database architecture, built on Hitachi Unified Compute Platform CI. However, this may not be required for your implementation.


Oracle Enterprise Data Offload Workflow

Use a Python script to generate the Oracle Enterprise Data Workflow mapping of large Oracle data sets to Cloudera Distribution Hadoop, and then offload the data using Pentaho Data Integration. Have this script create a Kettle transformation file, which you can open with Spoon, for the data offload.

This auto-generated transformation transfers row data from Oracle database tables or views in a schema to the Apache Hive database on Hadoop. Pentaho Data Integration uses this transformation directly.
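A common pattern for such a script is to read the table list from the Oracle data dictionary and substitute each table name into a template transformation saved from Spoon. The following is a minimal sketch under those assumptions; the template file name, placeholder token, schema name, and connection details are hypothetical, and the generated .ktr files still rely on the database connections defined in the template.

# Minimal sketch: generate one Kettle transformation (.ktr) per Oracle table by
# substituting table names into a template transformation exported from Spoon.
# The template path, placeholder token (@TABLE@), schema, and connection
# details are hypothetical; adapt them to the transformation you design.
import cx_Oracle

TEMPLATE = "offload_template.ktr"   # exported from Spoon, contains @TABLE@ tokens

def list_tables(dsn, user, password, schema):
    with cx_Oracle.connect(user=user, password=password, dsn=dsn) as conn:
        cursor = conn.cursor()
        cursor.execute(
            "SELECT table_name FROM all_tables WHERE owner = :owner",
            owner=schema.upper(),
        )
        return [row[0] for row in cursor]

def generate_transformations(tables):
    with open(TEMPLATE, encoding="utf-8") as f:
        template = f.read()
    for table in tables:
        with open(f"offload_{table.lower()}.ktr", "w", encoding="utf-8") as out:
            out.write(template.replace("@TABLE@", table))

generate_transformations(list_tables("dbhost/orclpdb", "edw_user", "edw_password", "SH"))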

Pentaho Data Integration

Pentaho Data Integration (PDI) allows you to ingest, blend, cleanse, and prepare diverse data from any source. With visual tools to eliminate coding and complexity, Pentaho puts all data sources and the best quality data at the fingertips of businesses and IT users.

Using intuitive drag-and-drop data integration coupled with data agnostic connectivity, your use of Pentaho Data Integration can span from flat files and RDBMS to Cloudera Distribution Hadoop and beyond. Go beyond a standard extract-transform-load (ETL) designer to scalable and flexible management for end-to-end data flows.

In this reference architecture, the end target of the ETL process is an Apache Hive database.

To set up Pentaho to connect to a Cloudera Distribution Hadoop cluster, make sure that the correct plugin (shim) is configured. View the plugin configuration steps at Set Up Pentaho to Connect to a Cloudera Cluster.

For this solution, configure and set the Active Shim to Cloudera CDH 5.13. Figure 3 shows this setting for the Cloudera Hadoop distribution in PDI.

Figure 3


Script for Automatic Enterprise Data Workflow Offload

You can use an Oracle Enterprise Data Workflow script that uses existing user accounts with appropriate permissions for Oracle, Hadoop Distributed File System, and Apache Hive database access. Have the script test the Oracle, HDFS, and Apache Hive connections first and then proceed with further options.

The main options to generate a Pentaho Data Integration transformation are the following:

. Transfer all tables in selected Oracle schema to Apache Hive on Hadoop.

. Transfer Oracle tables based on partition to Apache Hive on Hadoop.

. Transfer specific Oracle table rows to Apache Hive on Hadoop based on date range.

. Transfer specific Oracle table rows to Apache Hive on Hadoop based on column key value.
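Each of these options ultimately changes only the SELECT statement placed in the generated Table Input step. The following is a minimal sketch of that mapping; the mode names, column names, and values are hypothetical examples, and the actual script's options may differ.

# Minimal sketch: build the Table Input SELECT statement for each offload mode.
# Mode names, column names, partition names, and date values are hypothetical.

def build_offload_query(table, mode="full", **opts):
    if mode == "full":
        return f"SELECT * FROM {table}"
    if mode == "partition":
        return f"SELECT * FROM {table} PARTITION ({opts['partition']})"
    if mode == "date_range":
        return (f"SELECT * FROM {table} "
                f"WHERE {opts['date_column']} >= DATE '{opts['start']}' "
                f"AND {opts['date_column']} < DATE '{opts['end']}'")
    if mode == "key_value":
        return f"SELECT * FROM {table} WHERE {opts['key_column']} = '{opts['value']}'"
    raise ValueError(f"unknown offload mode: {mode}")

# Example: offload only the 2017 rows of SH.SALES based on TIME_ID.
print(build_offload_query("SH.SALES", "date_range",
                          date_column="TIME_ID", start="2017-01-01", end="2018-01-01"))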

Example Workflow Offloads Using Pentaho Data Integration

The following are enterprise data warehouse offload examples that use Pentaho Data Integration. These were created using the graphical user interface in Pentaho Data Integration.

Full Table Data Copy from Oracle to Apache Hive

It can be a challenge to convert data types between two database systems. With the graphical user interface in Pentaho Data Integration, you can do data type conversion with a few clicks. No coding is needed.

Use the graphical user interface in Pentaho Data Integration to construct a workflow to copy all the data from an Oracle table to Apache Hive.

You can define data connections for the Pentaho server so that Pentaho Data Integration can access data from sources like Oracle Database. See Define Data Connections for the Pentaho Server for these procedures.

Figure 4 shows the Pentaho Data Integration workflow for a full table data copy from Oracle to Apache Hive in the user interface.

Figure 4

Your workflow can copy data directly from Oracle to Apache Hive. However, there is a performance penalty if you do that. Read about this performance penalty.

To avoid the performance penalty, create the PDI workflow with the following three steps:

1. Read the Oracle table (table input).
2. Copy the Oracle table to Hadoop Distributed File System (Hadoop file output).
3. Load the HDFS file into the Apache Hive table (execute SQL script).
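Outside of the PDI user interface, the same three-step flow can be expressed directly in code, which may help clarify what each step does. The following is a minimal sketch, assuming the cx_Oracle, hdfs (WebHDFS), and PyHive Python packages; host names, ports, credentials, and table names are placeholders, and this illustrates the flow rather than replacing the PDI transformation.

# Minimal sketch of the three-step offload flow: read from Oracle, stage a
# delimited file in HDFS, then load it into Hive. Assumes the cx_Oracle,
# hdfs (WebHDFS client), and PyHive packages; hosts, ports, credentials,
# and table names are placeholders.
import csv
import io

import cx_Oracle
from hdfs import InsecureClient
from pyhive import hive

def offload_table(oracle_dsn, table, hdfs_path):
    # Step 1: read the Oracle table (equivalent to the Table Input step).
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t")
    with cx_Oracle.connect(user="edw_user", password="edw_password", dsn=oracle_dsn) as conn:
        cursor = conn.cursor()
        cursor.execute(f"SELECT * FROM {table}")
        for row in cursor:
            writer.writerow(row)

    # Step 2: copy the rows to HDFS (equivalent to the Hadoop File Output step).
    client = InsecureClient("http://namenode:50070", user="hdfs")
    client.write(hdfs_path, data=buffer.getvalue(), encoding="utf-8", overwrite=True)

    # Step 3: load the HDFS file into Hive (equivalent to the Execute SQL script step).
    hive_cursor = hive.Connection(host="hiveserver2", port=10000, username="hive").cursor()
    hive_cursor.execute(
        f"LOAD DATA INPATH '{hdfs_path}' OVERWRITE INTO TABLE {table.split('.')[-1].lower()}"
    )

offload_table("dbhost/orclpdb", "SH.SALES", "/staging/sales.tsv")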


To create a full table data copy workflow, do the following.

1. On the Pentaho Data Integration workflow, double-click Read Oracle Table (Figure 4 on page 19). The Table Input dialog box opens (Figure 5).

2. On the Table Input page, provide the connection information and SQL query for the data, and click OK.

Figure 5

3. Set parameters for the file transfer.

(1) To transfer Oracle table input data to Hadoop Distributed File System, double-click Copy Oracle Table to HDFS File (Figure 4 on page 19). The Hadoop File Output dialog box opens to the File tab (Figure 6 on page 21).


Figure 6

(2) Click the Fields tab (Figure 7 on page 22).

(3) Change settings as needed, and then click OK. Figure 7 shows setting the value for the HDFS output file to convert a date field (TIME_ID) to a string value during offload.


Figure 7

4. Load the HDFS file into the Apache Hive database.

(1) Double-click Load HDFS File into Hive Table (Figure 4 on page 19). The Execute SQL statements dialog box opens (Figure 8 on page 23).

(2) Change settings as necessary, and then click OK. You can form a Hive SQL (HQL) query to match your requirements.
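The SQL entered in this step is ordinary HiveQL. The statements below are a minimal, hypothetical example of what the Execute SQL script step might run: they create a target table whose columns match the staged file, with the date column stored as STRING as converted in the previous step, and then load the HDFS file into it. Your table definition, delimiter, and paths will differ.

# Minimal sketch of HiveQL that the Execute SQL script step might run.
# The table definition, field delimiter, and HDFS path are hypothetical.
from pyhive import hive

HQL_STATEMENTS = [
    """
    CREATE TABLE IF NOT EXISTS sales (
        prod_id BIGINT,
        cust_id BIGINT,
        time_id STRING,      -- date converted to a string during the HDFS output step
        amount_sold DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    STORED AS TEXTFILE
    """,
    "LOAD DATA INPATH '/staging/sales.tsv' OVERWRITE INTO TABLE sales",
]

cursor = hive.Connection(host="hiveserver2", port=10000, username="hive").cursor()
for statement in HQL_STATEMENTS:
    cursor.execute(statement)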


Figure 8


Figure 9 shows the execution of the workflow.

Figure 9


Figure 10 shows the copied table information in Apache Hive Database.

Figure 10

Merge and Join Two Tables Data Copy from Oracle to Apache Hive

Often you need to join two or more enterprise data warehouse tables.

This example helps you understand how to query two different data sources simultaneously while saving the operational result (a join, in this case) to Hadoop. This example also applies to reading data from both Oracle and Hive databases: you can perform the join with one table from the Oracle database and the other table from the Hive database.
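To make the join logic concrete outside of the PDI steps, the sketch below joins one Oracle table with one Hive table using pandas. It is a minimal illustration under assumed packages (cx_Oracle, PyHive, pandas); table names, column names, and hosts are placeholders, and in PDI the same result comes from the Table Input, Sort rows, and Merge join steps shown in Figure 11.

# Minimal sketch: join one Oracle table with one Hive table on a shared key.
# Assumes cx_Oracle, PyHive, and pandas; table names, column names, and hosts
# are placeholders. PDI performs the equivalent with Sort rows and Merge join steps.
import cx_Oracle
import pandas as pd
from pyhive import hive

def fetch_df(cursor, sql, columns):
    cursor.execute(sql)
    return pd.DataFrame(cursor.fetchall(), columns=columns)

oracle_cursor = cx_Oracle.connect(user="edw_user", password="edw_password",
                                  dsn="dbhost/orclpdb").cursor()
hive_cursor = hive.Connection(host="hiveserver2", port=10000, username="hive").cursor()

customers = fetch_df(oracle_cursor,
                     "SELECT cust_id, cust_first_name FROM sh.customers",
                     ["cust_id", "cust_first_name"])
sales = fetch_df(hive_cursor,
                 "SELECT cust_id, amount_sold FROM sales",
                 ["cust_id", "amount_sold"])

# Inner join on the shared key, the same result as the Merge join step.
joined = customers.merge(sales, on="cust_id", how="inner")
print(joined.head())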

Figure 11 on page 26 shows the transformation workflow and the execution results for the merged workflow in the graphical user interface of Pentaho Data Integration.


Figure 11


Figure 12 shows transformation information to load the HDFS file into the Hive table.

Figure 12


Figure 13 shows a simple verification of the joined data in the Hive database.

Figure 13

If done as part of the Pentaho workflow, sorting rows in large tables could be time consuming. Hitachi Vantara recommends sorting all the rows in server memory instead of using a memory-plus-disk approach.

On the Sort rows dialog box (Figure 14 on page 29), make the following settings:

. Use the Sort size (rows in memory) text box to control how many rows are sorted in server memory.

. Use the Free memory threshold (in %) text box to help avoid filling all available memory in server memory.

Make sure to allocate enough RAM to Pentaho Data Integration on the server when you need to do large sorting tasks. Figure 14 shows controlling the cache size in the Sort rows dialog box from the graphical user interface in Pentaho Data Integration.


Figure 14

Sorting on the database is often faster than sorting externally, especially if there is an index on the sort field or fields. You can use this as another option to get better performance.

More Oracle tables can be joined, one at a time, with the same Pentaho Data Integration sorting and join steps. You can also use Execute SQL Script in Pentaho as another option to join multiple tables. This example in the Pentaho Community Forums shows how to do this from Pentaho Data Integration.

Pentaho Kettle Performance

By default, all steps used in the Kettle transformation for offloading the data run in parallel and consume CPU and memory resources. Because these resources are limited, plan how many transformations (steps) should run in parallel.

Form multiple transformations with a limited number of steps and then execute them in sequence. This ensures that all data is copied or offloaded without exhausting the resources on the server.

Decide on the number of steps in a transformation based on the available CPU and memory resources on the Kettle application host. Refer to the Pentaho server hardware information in Table 1, “Hardware Components,” on page 4.
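One simple way to implement this grouping in the generation script is to split the full table list into fixed-size batches and emit one transformation per batch, then run the batches one after another (for example, with Pentaho's pan.sh command-line tool). The function below is a minimal, hypothetical sketch; the batch size of 80 mirrors the configuration used during testing.

# Minimal sketch: split the table list into fixed-size batches so that each
# generated Kettle transformation carries a bounded number of table-copy steps,
# and the batches can be executed sequentially. The batch size of 80 mirrors
# the test configuration; adjust it for your host's CPU and memory.

def batch_tables(tables, batch_size=80):
    """Return the table list split into consecutive batches."""
    return [tables[i:i + batch_size] for i in range(0, len(tables), batch_size)]

all_tables = [f"SH.TABLE_{n:03d}" for n in range(1, 201)]   # placeholder table names
for index, batch in enumerate(batch_tables(all_tables), start=1):
    # In the real script, each batch would be written to its own .ktr file and
    # executed in sequence, for example with Pentaho's pan.sh tool.
    print(f"Transformation {index}: {len(batch)} tables")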

. Environment workflow information
. Multiple Kettle transformations were created to run the copy or offload sequentially, rather than creating multiple parallel transformations through a Kettle job.


. Transformation steps information
. The number of tables in each Kettle transformation was set to 80.
. The number of columns in the source Oracle database tables was set to 22.

. Observations
. The CPU resource utilization was between 55% and 60% throughout the transformation tests. This means that there was no resource depletion within the server. Possibly the number of tables in your workflow could be increased.
. Workflow execution with optimum performance depends on the following:
. How much CPU resource is available?
. How much data is being copied or offloaded in a table row?
. How many tables were included in the transformation?

Engineering Validation

This summarizes the key observations from the test results for the Hitachi Solution for Databases in an enterprise data warehouse offload package for Oracle Database. This environment uses Hitachi Advanced Server DS120, Pentaho 8.0, and the Cloudera Distribution of the Hadoop Ecosystem 5.13.

When evaluating this Oracle Enterprise Data Warehouse solution, the laboratory environment used the following:

. One Hitachi Unified Compute Platform CI for the Oracle Database environment

. Four Hitachi Advanced Server DS120

. One Hitachi Virtual Storage Platform G600

. Two Brocade G620 SAN switches

Using this same test environment is not a requirement to deploy this solution.

Test Methodology

The source data was preloaded into a sample database schema in the Oracle database.

After preloading the data, a few example Pentaho Data Integration workflows were developed for offloading data to the Apache Hive database.

Once data was loaded into the Apache Hive database, verification was done to make sure the data was offloaded correctly.

The example workflows found in Oracle Enterprise Data Offload Workflow were used to validate this environment.

Testing involved following this procedure for each example workflow:

1. Verify the following for each Oracle table before running the enterprise data offload workflow:
. Number of rows
. Number of columns
. Data types

2. Run the enterprise data offload workflow.


3. Verify the following for the Apache Hive table after running the enterprise data offload workflow to see if the numbers matched those in the Oracle table:
. Number of documents (same as number of rows)
. Number of fields (same as number of columns)
. Data types
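The row and column checks in steps 1 and 3 can be automated. The following is a minimal sketch, assuming the cx_Oracle and PyHive packages; connection details and table names are placeholders, and it compares row and column counts for a single table before and after offload.

# Minimal sketch: compare row counts and column counts between an Oracle table
# and the corresponding Hive table after an offload run. Assumes cx_Oracle and
# PyHive; connection details and table names are placeholders.
import cx_Oracle
from pyhive import hive

def table_counts(cursor, table):
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    rows = cursor.fetchone()[0]
    cursor.execute(f"SELECT * FROM {table} WHERE 1 = 0")   # fetch metadata only
    columns = len(cursor.description)
    return rows, columns

oracle_cursor = cx_Oracle.connect(user="edw_user", password="edw_password",
                                  dsn="dbhost/orclpdb").cursor()
hive_cursor = hive.Connection(host="hiveserver2", port=10000, username="hive").cursor()

source = table_counts(oracle_cursor, "SH.SALES")
target = table_counts(hive_cursor, "sales")
print(f"Oracle (rows, columns): {source}, Hive (rows, columns): {target}, match: {source == target}")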

Test Results

After running each enterprise data offload workflow example, the test results showed the same number of documents (rows), fields (columns), and data types in the Oracle database as in the Apache Hive database.

These results show that you can use Pentaho Data Integration to move data from an Oracle host to Apache Hive on top of Hadoop Distributed File System to relieve the workload on your Oracle host. This provides a cost-effective solution to expanding capacity to relieve server utilization pressures.

For More Information

Hitachi Vantara Global Services offers experienced storage consultants, proven methodologies and a comprehensive services portfolio to assist you in implementing Hitachi products and solutions in your environment. For more information, see the Services website.

Demonstrations and other resources are available for many Hitachi products. To schedule a live demonstration, contact a sales representative or partner. To view on-line informational resources, see the Resources website.

Hitachi Academy is your education destination to acquire valuable knowledge and skills on Hitachi products and solutions. Our Hitachi Certified Professional program establishes your credibility and increases your value in the IT marketplace. For more information, see the Hitachi Vantara Training and Certification website.

For more information about Hitachi products and services, contact your sales representative, partner, or visit the Hitachi Vantara website.

Hitachi Vantara

Corporate Headquarters: 2845 Lafayette Street, Santa Clara, CA 96050-2639 USA
www.HitachiVantara.com | community.HitachiVantara.com

Regional Contact Information:
Americas: +1 408 970 1000 or [email protected]
Europe, Middle East and Africa: +44 (0) 1753 618000 or [email protected]
Asia Pacific: +852 3189 7900 or [email protected]

© Hitachi Vantara Corporation 2018. All rights reserved. HITACHI is a trademark or registered trademark of Hitachi, Ltd. VSP is a registered trademark of Hitachi Vantara Corporation. Microsoft and Windows Server are trademarks or registered trademarks of Microsoft Corporation. All other trademarks, service marks, and company names are properties of their respective owners. Notice: This document is for informational purposes only, and does not set forth any warranty, expressed or implied, concerning any equipment or service offered or to be offered by Hitachi Data Systems Corporation. MK-SL-098-01. September 2018.