DMX

Install Guide

Version 9.10


Copyright 1990, 2020 Syncsort Incorporated. All rights reserved.

This document contains unpublished, confidential, and proprietary information of Syncsort Incorporated. No disclosure or use of any portion of the contents of this document may be made without the express written consent of Syncsort Incorporated.

Getting technical support: Customers with a valid maintenance contract can get technical assistance via MySupport. There you will find product downloads and documentation for the products to which you are entitled, as well as an extensive knowledge base.


Last Update: 15 May 2020

Contents

DMX Overview
Installing DMX/DMX-h
    DMX-h Overview
    Prerequisites
    Step-by-Step Installation
    Configuring the DMX Run-time Service
    Applying a New License Key to an Existing Installation
Running DMX
    Graphical User Interfaces
    DMX Help
Connecting to Databases from DMX
    Amazon Redshift
    Azure Synapse Analytics (formerly SQL Data Warehouse)
    Databricks
    DB2
    Greenplum
    Hive data warehouses
    Microsoft SQL Server
    Netezza
    NoSQL Databases
    Oracle
    Snowflake
    Sybase
    Teradata
    Vertica
    Other DBMSs
    Defining ODBC Data Sources
Connecting to Message Queues from DMX
    IBM WebSphere MQ
Connecting to Salesforce from DMX
Connecting to SAP from DMX


    Registering DMX in SAP SLD
Connecting to HDFS from DMX
Connecting to Connect:Direct nodes from DMX
    Security
    Installation and Configuration
Connecting to CyberArk Enterprise Password Vault
    CyberArk Licenses
Connecting to Protegrity Data Security Gateway
Connecting to QlikView data eXchange files from QlikView or Qlik Sense
    QlikView desktop installation overview
    Qlik Sense desktop installation overview
Connecting to Tableau Data Extract files from Tableau
    Tableau desktop installation overview
Removing DMX/DMX-h from Your System
DMX installation component options
DMX Management Service installation and configuration
DMX DataFunnel run-time service install and configuration
Technical Support


Documentation Conventions

The following conventions are used in the format sections of the command options in this manual.

Regular type
    Items in regular type must be entered literally, using either lowercase or uppercase ASCII letters. Items may be abbreviated.
    Example: ascending

Italics (non-bold)
    Items in italics (non-bold) represent variables. You must substitute an appropriate numerical or text value for the variable.
    Example: file_name

Braces { }
    Braces indicate that a choice must be made among the items contained in the braces. The choices may be presented in an aligned column, or on one line separated by a vertical bar ( | ).
    Examples: {"a" }, {X"xx" }, or {AND | OR}

Brackets [ ]
    Brackets indicate that an item is optional. A choice may be made among multiple items contained in brackets.
    Examples: [alias], or [+ | -]

Slash /
    A slash identifies a DMX option keyword. The slash must be included when an option keyword is specified.
    Examples: /INFILE, /infile

Double quotes " "
    Double quotation marks that appear in a format statement must be specified literally.
    Example: "b"-"e"

Ellipsis …
    An ellipsis indicates that the preceding argument or group of arguments may be repeated.
    Example: [expression…]

Sequence number
    A sequence number indicates that a series of arguments or values may be specified. The sequence number itself must never be specified.
    Example: field2


DMX Overview

DMX™ is a high-performance data transformation product. With DMX you can design, schedule, and control all your data transformations from a simple graphical interface on your Windows desktop.

Data records can be input from many types of sources such as database tables, SAP systems, Salesforce.com objects, flat files, XML files, pipes, etc. The records can be aggregated, joined, sorted, merged, or just copied to the appropriate target(s). Before output, records can be filtered, reformatted, or otherwise transformed.

Metadata, including record layouts, business rules, transformation definitions, run history and data statistics, can be maintained either within a specific task or in a central repository. The effects of making a change to your application can be analyzed through impact and lineage analysis.

You can run your data transformations directly from your desktop or on any UNIX or Windows server; you can also schedule them for later execution, embed them in batch scripts, or invoke them from your own programs.

Installing DMX/DMX-h

Installed DMX components are dependent on your license key:

• DMX server license key installs components based on whether you select a Standard, Full, Classic, or Custom installation. See DMX installation component options.
• DMX workstation license key installs the development client, Job and Task Editors; the DMX engine, dmxjob/dmexpress; and the service for the development client, which is the DMX Run-time Service, dmxd.

The version of the DMX server software must be at least as high as the version of the DMX client software that is used to develop jobs and connect to the server. Thus, when installing a new version of DMX, ensure that you install the same release of DMX on your client and server machines. If you are upgrading and unable to install both the client and the server at the same time, upgrade the server prior to upgrading the client.

DMX-h Overview

DMX-h is the Hadoop-enabled edition of DMX, providing the following Hadoop functionality:

• ETL Processing in Hadoop – Develop a DMX-h ETL application entirely in the DMX GUI to run seamlessly in the Hadoop MapReduce framework, with no Pig, Hive, or Java programming required. Currently, jobs can be run in either MapReduce or Spark. See the online DMX Help topic "DMX-h".
• Hadoop Sort Acceleration – Seamlessly replace the native sort within Hadoop MapReduce processing with the high-speed DMX engine sort, providing performance benefits without programming changes to existing MapReduce jobs. See the DMX-h Sort User Guide, which is included in the Documentation folder under your DMX software installation directory.
• Apache Spark Integration – Use the Spark mainframe connector to transfer mainframe data to HDFS. See the online DMX Help topic "Spark Mainframe Connector".
• Apache Sqoop Integration – Use the Sqoop mainframe import connector to transfer mainframe data into HDFS. See the online DMX Help topic "Sqoop Mainframe Import Connector".

DMX-h Requirements

DMX-h requires the following:


• DMX-h Edition
• A supported Hadoop MapReduce and/or Spark distribution:
  o MapReduce
    ▪ Cloudera CDH 5.x (5.2 and higher) – YARN (MRv2)
    ▪ Hortonworks Data Platform (HDP) 2.x (2.3 and higher) – YARN
    ▪ Apache Hadoop 2.x (2.2 and higher) – YARN
    ▪ MapR, Community Edition and Enterprise Edition only (previously termed M5 and M7, respectively), 6.x – YARN
    ▪ Pivotal HD 3.0 – YARN

DMX-h is certified as ODPi (1.0 and higher) interoperable.

  o Spark
    ▪ Spark on YARN on the following Hadoop distributions:
      • Cloudera CDH 5.x (5.5 and higher)
      • Hortonworks Data Platform (HDP) 2.3.4, 2.x (2.4 and higher)
      • MapR 5.x (5.1 and higher), Community Edition and Enterprise Edition only (previously named M5 and M7, respectively)
    ▪ Spark on Mesos 0.21.0
    ▪ Spark Standalone 1.5.2 and higher

DMX-h Component Setup and Operation

A DMX-h setup consists of the following:

• Windows workstation
  o DMX must be installed as described in Step-by-Step Installation, Windows Systems.
  o The DMX Job and Task Editors are used for MapReduce job development.
  o MapReduce jobs are submitted to Hadoop via the ETL server from the Job Editor.

• Linux ETL server (edge node)
  o DMX must be installed as described in Step-by-Step Installation, UNIX Systems.
  o The Hadoop client must be installed and configured to connect to the Hadoop cluster.
  o The DMX Run-time Service, dmxd, must be running to respond to jobs run via the Windows workstation; it calls dmxjob with the /HADOOP option, which ultimately calls hadoop to submit jobs to the cluster.

• Hadoop cluster
  o DMX must be installed without dmxd on all nodes in the Hadoop cluster as described in Step-by-Step Installation, Hadoop Cluster.
  o Each mapper and reducer runs the map side or reduce side task(s), respectively.
  o All file descriptors for sources, targets, and intermediate files are carefully connected so they fit into the Hadoop MapReduce flow.

Prerequisites

Before you install DMX on your system, ensure that the following are available:

• DMX software: This is generally downloaded from Syncsort's web site as a self-extracting executable file (Windows) or a tar file (UNIX).


• DMX license key: License keys are sent via e-mail as an attachment called DMExpressLicense.txt. If you need specific system information to obtain a license key, refer to the section below on Getting DMX License Information. If you have a DMX server license key and plan to install DMX installation components, the type of user that you set up depends on whether impersonation privileges are extended. See DMX installation user setup considerations.

• Operating systems: DMX runs on the following operating systems, with the listed release being the minimum supported. Both 32-bit and 64-bit versions are supported, unless otherwise stated: AIX release 6.1 64-bit; HP-UX release 11.31 IA64 64-bit; Linux kernel version 2.6.18 to 2.6.31 with C library version 2.5 to 2.11 on Pentium-class x86_64 64-bit machines; Linux kernel version 2.6.16 with C library version 2.4 on IBM System z 64-bit mainframes; SunOS 5.10 SPARC 64-bit; Windows Vista; Windows 7; Windows 8.x; Windows 10; and Windows Server 2008, 2012, and 2012 R2.
• Java version requirements: On Windows and UNIX/Linux systems, DMX requires Java runtime version 1.7 or higher unless you are only running DMX Sort, which does not use Java. DMX requires JDK 7.
• Communication security protocol: On Windows and UNIX/Linux systems, DMX supports Transport Layer Security (TLS) up to and including TLS version 1.2.
• User rights: Sufficient privileges to install and start Windows Services on Windows platforms, and root privileges to install and start UNIX daemons on UNIX platforms. An umask setting of 022 is required so that other users can run the installed executables. The installation procedure sets and resets umask if required.
• Pluggable Authentication Modules (PAM): If you want to use PAM for authentication on UNIX or Linux platforms, PAM must be installed and configured on the system.
• Database client software: If you want DMX to access data in database tables (either as data source or target), the appropriate database client software must be on the system and accessible via the appropriate shared library or dynamic link library (dll) paths. For example, to access an Oracle database, Oracle Client must be installed on the system where you run DMX; to access a database via ODBC, an ODBC data source must be defined on the system where you run DMX. For details on how to connect to a specific Database Management System (DBMS), refer to the section Connecting to Databases from DMX.
• Message queue client software: If you want DMX to access data in a message queue, the appropriate message queue client software must be on the system and accessible via the appropriate shared library or dynamic link library (dll) paths. For example, to access an IBM WebSphere MQ queue, the IBM WebSphere MQ client must be installed on the system where you run DMX. For details on how to connect to a specific message queue type, refer to the section Connecting to Message Queues from DMX.
• SAP client software: If you want DMX to access data in an SAP system, the appropriate SAP client software must be installed on the system where you run DMX and accessible via the appropriate shared library or dynamic link library (dll) paths. For details on how to connect to an SAP system, refer to the section Connecting to SAP from DMX.
• Hadoop software: If you want DMX to access data in a Hadoop Distributed File System (HDFS), or you want to run DMX-h ETL MapReduce jobs, a Hadoop distribution configured to access the cluster must be installed on the edge/ETL node from which you run DMX. For details on how to connect to HDFS, refer to the section Connecting to HDFS from DMX.
• Connect:Direct software: If you want DMX to access data using a Connect:Direct connection, a Connect:Direct server and client (CLI/API) must be installed on the system where you run DMX and must be configured to access the required Connect:Direct nodes. For details on how to connect to a Connect:Direct node, refer to Connecting to Connect:Direct nodes from DMX.
• QlikView software – DMX supports QlikView data eXchange (QVX) files as targets. To access QVX files as sources from QlikView or Qlik Sense, refer to Connecting to QlikView data eXchange files from QlikView or Qlik Sense.


• Tableau software – DMX supports Tableau Data Extract (TDE) files as targets. To access TDE files as sources from Tableau, refer to Connecting to Tableau Data Extract files from Tableau.

DMX installation user setup considerations

The type of user that you set up to install DMX installation components depends on whether impersonation privileges are extended:

• If you plan to use impersonation when running the DMX Run-time Service, dmxd, you must install as root.
• When running the DataFunnel Run-time Service, dmxrund, considerations exist for the type of user that installs components.

User setup when running dmxrund

If you do not plan to use impersonation when running dmxrund, set up a non-administrative user to install and run on Windows, or set up a service user to install and run on Linux.

Set up a non-administrative/service user

Windows: As the administrative user has impersonation privileges by default, set up a new user who does not have administrative rights.

Linux: To install and run job requests without impersonation, create a service user, dmxuser, and run the installation as dmxuser.

Set up impersonation

If you plan to use impersonation when running dmxrund, no user setup is required to install and run on Windows; set up an impersonated user to install and run on Linux.

Windows: As the administrative user has impersonation privileges by default, no setup is required.

Linux: DMX installation impersonation considerations on Linux follow:

• No impersonation – Running jobs without impersonation does not require root access. Upon receipt of a job submission request from the DMX management service, dmxmgr, dmxrund calls the DMX engine, dmxdfnl, to run the submitted job as the service user, dmxuser.
• Impersonation – Running jobs with impersonation requires root access to impersonate the specified user. While dmxrund is never granted root access, another installed component, dmxexecutor, can enable impersonation. When dmxrund detects that dmxexecutor is installed in the required directory with the correct permissions, dmxrund calls dmxexecutor to impersonate the specified user, which calls the DMX engine, dmxdfnl, to run the submitted jobs.

To install and run job requests with impersonation, do the following:

• Create a service user, dmxuser.
• Create a service group, dmexpress.

Note: If you choose to change the name of the service group, you must update the SERVICE_GROUP property of the DMX custom impersonation configuration properties file.


• Add dmxuser to the service group.
• Run the installation as dmxuser.
• Ensure that the following files are in the specified directories with the specified permissions:

Directory and file: <DMXInstallDir>/bin/dmxexecutor
Permissions: -rwsr-x---
Notes: The 's' represents the set-user identification (setuid) bit and indicates that dmxexecutor is extended impersonation privileges to run submitted jobs as a specific user.

Directory and file: <DMXInstallDir>/conf/dmxexecutor.conf
Permissions: -rwx------
Notes: Updates to dmxexecutor.conf are required only if you choose to customize the impersonation.

Getting DMX License Information

To obtain a license key, you need the computer name, the hardware model, the number of processors, and the operating system of each system on which DMX is to run. You can gather the system information by running the DMX License Information program.

Windows Systems

You can run the DMX License Information program in the following ways:

• From Syncsort’s web site at: http://www.syncsort.com/software/licenseinfo.exe

• If DMX is installed, go to Programs, DMExpress from the Start menu and select License Information. The program prepares a license information document with the system information and then displays it in a Notepad window. You can save the form (File, Save As) and e-mail it to Syncsort or to your local DMX sales agent. The information is then used to create your license key(s).

UNIX Systems

You can run the DMX License Information program in the following ways:

• From Syncsort’s web site at: http://www.syncsort.com/software/licenseinfo.sh

• If DMX is installed, go to the <DMXInstallDir>/bin directory, where <DMXInstallDir> denotes the directory where DMX is installed, and type:

  ./licenseinfo

  The program generates and displays a text file named SyncsortLicenseInfo.txt in the current directory. You can e-mail the file to Syncsort or to your local DMX sales agent. The information is then used to create your license key(s).

Step-by-Step Installation

Windows Systems

Interactive Installation

1. Make sure that any previous version of DMX has been removed (see Removing DMX/DMX-h from Your System later in this guide if necessary).


2. You can install DMX by running <extract dir>\Windows\x86\setup.exe for 32-bit Windows, or <extract dir>\Windows\x64\setup.exe for 64-bit Windows x64, extracted from the downloaded executable, either directly or via Control Panel, Add/Remove Programs.

3. You are prompted to either enter a license key or start a free trial. If you've selected to enter a license key, you can type in the location of the DMExpressLicense.txt file, or browse to it, when prompted. You can also enter the license key manually.
4. Read the terms of the Syncsort License Agreement and confirm your acceptance of them.
5. Review the product options, components, and features that are enabled by your license key.
6. If your license key is a
   • DMX server license key, a menu displays from which you select from among the component options:
     o Standard
     o Full
     o Classic
     o Custom
   For information on these options, see DMX installation component options. Select an option and make the appropriate selections.
   • DMX workstation license key, no component options display for selection.

You are eligible for the classic DMX/DMX-h installation, which installs the development client, Job and Task Editors; the DMX engine, dmxjob/dmexpress; and the service for development client, which is the DMX Run-time Service, dmxd.

7. Confirm the file folder into which you want to install DMX. The file folder is subsequently referred to as <DMXInstallDir>.

8. Select the program folder in which you want the DMX icons to appear.
9. Review the Setup Information; choose Back to change these options, or Install to complete the installation.
10. If your license key enables the DMX Run-time Service, select the configuration options for the Service. You can also configure the DMX Run-time Service later via Control Panel, Administrative Tools, Services.
11. You may be prompted to choose whether to automatically run SyncSort jobs in DMX, either immediately or after subsequent un-installation of SyncSort, depending on the presence of the SyncSort Conversion license option and an existing installation of SyncSort.
12. Upon setup completion, a list of menu shortcuts displays in the DMX program folder, which is available through the Windows Start menu.
13. To run the Connect Portal web UI, you must configure the DMX management service, dmxmgr, including authentication. Then, start the DMX Management Service via Control Panel, Administrative Tools, Services.


14. To run copy projects in Connect Portal, start the DataFunnel Run-time Service via Control Panel, Administrative Tools, Services. See DMX DataFunnel run-time service installation and configuration for more details. 15. To run CDC replication projects in Connect Portal, separately install the latest version of MIMIX Share. The MIMIX Share Listener service starts automatically when you install MIMIX Share, but the listener service must be running to run CDC replication projects in Connect Portal. If you performed a full install, including the development client, the following menu shortcuts display:

• DMExpress
• Apply a New License Key
• DMExpress Application Upgrade
• DMExpress Global Find
• DMExpress Help
• DMExpress Job Editor
• DMExpress Server
• DMExpress Task Editor
• DataFunnel
• License Information
• Reference Guides
• Release Notes

If you performed a standard or classic install, with the development client but not the Management Service, the following menu shortcuts display:

• DMExpress
• Apply a New License Key
• DMExpress Application Upgrade
• DMExpress Global Find
• DMExpress Help
• DMExpress Job Editor
• DMExpress Server
• DMExpress Task Editor
• License Information
• Reference Guides
• Release Notes

If you installed the Management Service only (custom install), the following menu shortcuts display:

• DMExpress
• Apply a New License Key
• DMExpress Help
• DataFunnel
• License Information
• Reference Guides
• Release Notes

If you installed neither the development client nor the Management Service, the following menu shortcuts display:

• DMExpress
• Apply a New License Key


• DMExpress Help
• Documentation
• License Information
• Reference Guides
• Release Notes

If you have ActiveX-based SyncSort applications which you choose to run with DMX, and you subsequently uninstall SyncSort, you may need to re-register the SyncSortX ActiveX control. To register the ActiveX control, open a command prompt and type the following command:

regsvr32.exe <DMXInstallDir>\Programs\SyncSortX.dll

Silent Installation

Silent installation requires a silent setup file that can be recorded during an interactive installation. Installation steps may differ depending on product licensing, so changing the version of DMX or adding or removing packages may require re-recording the silent setup file.

To record the installation options

Open a command prompt; type the full path for the installation program followed by the options:

–r –f1<silent setup file>

where <silent setup file> is the full path for the file to record the installation options. If you are installing from a downloaded image which is located in c:\downloads, you would type a command like:

C:\downloads\DMExpress_1-4_windows.exe –r –f1c:\temp\setup.iss

An interactive installation starts and all the selected installation options are saved in the specified file.

To run the installation in silent mode

Open a command prompt; type the pathname of the install executable followed by the options:

–s –f1<silent setup file> -slog<silent log file>

where <silent setup file> is the full path for the file that was previously used to record the installation options, and <silent log file> is the full path for the installation log file generated by silent installation. If you are installing from a downloaded image which is located in c:\downloads, you would type a command like:

C:\downloads\DMExpress_1-4_windows.exe –s –f1c:\temp\setup.iss

If you do not specify the –slog option, then setup generates a log of the silent installation, setup.log, in the folder from which the setup is run or in the folder where the specified silent setup file is located. Multiple command line options are separated with a space, but there should be no spaces inside a command line option (for example, –slogc:\setup.log is valid, but –slog c:\setup.log is not).

Note: When running silent installation on a machine with User Account Control enabled, an administrator command prompt or batch file can be used to avoid the initial prompt by the operating system requesting elevated privileges. To start a Command Prompt with administrative privileges, right-click the Command Prompt shortcut and select "Run as administrator".


UNIX Systems

Prerequisites for COBOL Support

DMX can be used to accelerate COBOL SORT and MERGE verbs or to process COBOL data files as source or target. In order to use these features, you must have a license to use the COBOL compiler on the system where the DMX task runs.

Micro Focus COBOL or Server Express

The following variables must be set prior to installation: the COBDIR and PATH variables must be set and exported to include the COBOL compiler, and the environment variable for shared libraries on the corresponding platform, listed below, must be set and exported to include all the shareable libraries used by the compiler:

AIX        LIBPATH
HP-UX      SHLIB_PATH
Linux      LD_LIBRARY_PATH
Solaris    LD_LIBRARY_PATH

AcuCorp's ACUCOBOL-GT™ COBOL Development System

Support for ACUCOBOL-GT™ is available on the following UNIX platforms:

Operating System Architecture

HP-UX 64-bit for Itanium

AIX 64-bit on PowerPC

SunOS 64-bit on SPARC processors

The bit level of DMX must match that of the ACUCOBOL-GT™ installation.

Before running the DMX install script, set the environment variable ACUCOBOL:

export ACUCOBOL=<AcuCobolDir>

where <AcuCobolDir> is the location of your ACUCOBOL-GT™ installation. If the environment variable COBDIR is set, unset it:

unset COBDIR

Once DMX has been installed, additional steps need to be performed to enable support for AcuCOBOL. Please refer to the DMX online help topic “Installing support for AcuCOBOL.”

COBOL-IT

Support for COBOL-IT line sequential files is available on the following UNIX platforms:

Operating System Architecture

AIX 64-bit on PowerPC

Linux 64-bit for Intel-compatible processors


The bit level of DMX must match that of the COBOL-IT installation. The minimum supported COBOL-IT version is 3.7.

Before running the DMX install script, do the following:

• Unset the environment variable COBDIR, if set: unset COBDIR

• Set the environment variable COBOLITDIR:

  export COBOLITDIR=<CobolITDir>

  where <CobolITDir> is the location of your COBOL-IT installation.
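For example, a minimal pre-install sequence might look like the following (the /opt/cobol-it path is an assumption; substitute your own COBOL-IT installation directory):

# prepare the environment before running the DMX install script
unset COBDIR
export COBOLITDIR=/opt/cobol-it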

To configure COBOL-IT runtime environment variables, refer to the DMX online help topic, "Installing support for COBOL-IT."

Informix C-ISAM Support

If you plan to use DMX to process Informix C-ISAM files, the environment variable INFORMIXDIR must be set and exported prior to running the install script. The directory $INFORMIXDIR/lib must contain the library libisam.a.

Unikix VSAM Support

If you plan to use DMX to process Unikix VSAM files, the environment variable UNIKIX must be set and exported prior to running the install script. The directory $UNIKIX/lib must contain the library libbcisam.a.

Interactive Installation

1. If you are installing from a tar file that you downloaded from Syncsort's web site, extract the contents of the tar file on your UNIX system using a command similar to:

   tar xvof DMEXPRESS.TAR

   This creates a directory dmexpress under the current directory.

2. Log in as user root if you wish to install or configure the DMX Run-time Service. The DMX Run-time Service allows you to submit tasks or jobs from the DMX Task Editor or Job Editor components, running on remote desktops, to execute on this DMX server. To install using downloaded software, navigate to the dmexpress directory created when you extracted the contents of the tar file and then run the install program. For example:

   cd /usr/tmp/dmexpress
   ./install

3. Depending on your system and the licensed options, you may be asked several questions. For example, on platforms where both a 32-bit and a 64-bit version of DMX are available, you are asked to choose which one you would like to install. You are prompted to either enter a license key or start a free trial. If you've selected to enter a license key, specify the location of the license key file, DMExpressLicense.txt, when prompted.

4. Read the terms of the Syncsort License Agreement and confirm your acceptance of them.


5. Review the product options, components and features that are enabled by your license key. If your license key is a

• DMX server license key, a menu displays from which you select from among the component options:

  DMExpress Components
    DMExpress Engine
    Service for Development Client
    DataFunnel Run-time Service
    Management Service
  System
    Computer name: ...
    License Expiry Date: ...

For information on these options, see DMX installation component options.

• DMX workstation license key, no component options display. You are eligible for the classic DMX/DMX-h installation, which installs the development client, Job and Task Editors; the DMX engine, dmxjob/dmexpress; and the service for development client, which is the DMX Run-time Service, dmxd.

6. Specify the directory into which you want to install DMX. This directory is subsequently referred to as <DMXInstallDir>.
7. If you logged on as root, you are prompted to indicate your choice for configuring the DMX Run-time Service. This allows you to start the service immediately, and to choose to start it with system restart. This also allows you to select PAM authentication if it is available on the system. To configure the DMX Run-time Service at a later time, run the installation procedure as root from the DMX installation directory. See Configuring the DMX Run-time Service for additional information.
8. You may be prompted to choose whether to automatically run SyncSort jobs in DMX, either immediately or after subsequent un-installation of SyncSort, depending on the presence of the SyncSort Conversion license option and an existing installation of SyncSort.
9. If you have a DMX server license, you are given the option to install the DataFunnel Run-time Service and the option to install the Management Service.
10. When the installation procedure completes, update your environment variables. Add <DMXInstallDir>/bin to your PATH, and add <DMXInstallDir>/lib to the shared library path, for example, by updating your profile. The environment variable that must be set for specific platforms is as follows:

   AIX        LIBPATH
   HP-UX      SHLIB_PATH
   Linux      LD_LIBRARY_PATH
   Solaris    LD_LIBRARY_PATH
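For example, on Linux you might append lines like the following to your shell profile (the /usr/dmexpress path is an assumption; substitute your <DMXInstallDir>):

# make the DMX executables and shared libraries available
PATH=/usr/dmexpress/bin:$PATH; export PATH
LD_LIBRARY_PATH=/usr/dmexpress/lib:$LD_LIBRARY_PATH; export LD_LIBRARY_PATH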

11. To run the Connect Portal web UI, you must configure the DMX management service, dmxmgr, including authentication. Then, start the DMX Management Service, dmxmgr. See DMX Management Service installation and configuration for more details.


12. To run copy projects in Connect Portal, start the DataFunnel Run-time Service, dmxrund. See DMX DataFunnel run-time service install and configuration for more details.
13. To run CDC replication projects in Connect Portal, separately install the latest version of MIMIX Share. The MIMIX Share Listener service starts automatically when you install MIMIX Share, but the listener service must be running to run CDC replication projects in Connect Portal.

Silent Installation

A silent installation allows you to easily install DMX on multiple machines with identical options. You simply install interactively on the first machine using the record option to save your responses to installation prompts in a file. Then you run the silent installation on the remaining machines, pointing to the recorded response file. Because the silent installation is non-interactive, it can be scripted to effectively automate installation on many machines.

1. To prepare to run the silent installation, initiate the interactive installation on the first machine as described in the section above, but in step 2, run the install command with the record option, –r, specifying the file in which to store your responses to installation prompts as follows:

   ./install –r <response file>

2. Upon successful completion of the interactive installation, run the install program with the silent option, –s, and the silent log option, –slog, on the remaining machines that require installation as follows:

   ./install –s <response file> -slog <log file>

where:

o <response file> is the full path to the response file generated by the interactive installation.
o <log file> is the full path to the log file generated by the silent installation.

Hadoop Cluster

DMX-h must be installed on all the nodes in the Hadoop cluster using one of the following methods:

• Managed Methods – recommended for large clusters

  o Cloudera Manager Parcel Installation – Store the parcel in the Cloudera Manager local or remote parcel repository (requires root/sudo privileges), then distribute and activate the parcel on the cluster nodes via Cloudera Manager (requires Administrator access to Cloudera Manager). Available as of Cloudera Manager 4.5.

o Apache Ambari Service Installation – Deploy the DMX-h Service Definition Package to the Ambari repository, then install DMX-h on the nodes in the cluster using the Ambari web interface (requires root/sudo privileges). Available as of Ambari 1.7.

o RPM Installation – Deploy the RPM (Red Hat Package Manager) on all nodes in the cluster, then use the RPM to install DMX-h on all nodes in the cluster (requires root/sudo privileges).

• Manual/Silent Installation – Install DMX-h on one node and replicate on all remaining nodes


The DMX Run-time Service (dmxd) only needs to be running on the node(s) to which you want to submit jobs from the DMX GUI; typically, this is the machine designated as the edge node. When installing DMX-h using any of the managed methods, the DMX Run-time Service is not installed. See Installing/Upgrading the DMX Run-time Service for instructions on how to do this on the edge node.

Installation Packages for Managed Methods

There are two separate installation packages for DMX-h, one for the software and another for the license. If you do not already have a license installed, install a license package along with the software package. If the license isn't installed, DMX-h runs in trial mode, which eventually expires and stops working.
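With the RPM method described later, for instance, you can check whether the software and license packages are both present with a query like the following (a sketch; exact package names vary by version and license key):

# list any installed DMX-h software and license packages
rpm -qa | grep -i dmexpress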

If you want to upgrade from a release before the introduction of the second license package, you must install both the software and license packages.

Cloudera Manager Parcel Installation

Note: Cloudera Manager does not support the mixing of parcels with any other managed install method, and doing so could result in your Hadoop cluster not restarting.

Pre-Installation

Execute the following steps on the machine where Cloudera Manager is installed:

1. Run the self-extracting shrink-wrap executables for the software and license packages from the directory where they are located. For the software executable, this is:

   ./dmexpress-<version>-<OS>.parcel.bin

   For the license executable, this is:

   ./dmexpresslicense_<keyID>-<expiration date>-<OS>.parcel.bin

   For example, dmexpresslicense_12345-20190928-el6.parcel.bin

2. Read and accept the Software License Agreement.
3. Enter a target directory in which to put the extracted .parcel, .sha, and manifest.json files. The manifest.json file is required to use DMX via a remote parcel repository. The default is the current folder.

Installation

Install the DMX-h (dmexpress) parcel and the DMX-h license (dmexpresslicense-XXXXX) parcel on all nodes in the cluster as follows:

1. Depending on whether you are using a local parcel repository or a remote parcel repository, do one of the following:

• Local parcel repository – With root/sudo privileges, copy the extracted .parcel and .sha files for software and license to the Cloudera Manager local parcel repository. The default location is /opt/cloudera/parcel-repo/. • Remote parcel repository – With root/sudo privileges, copy the extracted .parcel and manifest.json files for software and license to your remote parcel repository. Ensure that the files have read and execute permissions for all users. As outlined on Cloudera’s Creating and Using a Parcel Repository page, follow the steps to Configure the Cloudera Manager Server to Use the Parcel URL.


2. Logged in to Cloudera Manager as an Administrator user, click on the parcel indicator button in the Cloudera Manager Admin console navigation bar to bring up the Parcels tab of the Hosts page. 3. If not already detected, click on the Check for New Parcels button. Consider the following:

• If you are using a local parcel repository, you can see the "downloaded" parcels on this page, for example, dmexpress 9.8.1 and/or dmexpresslicense_12345-20180928.
• If you are using a remote parcel repository, click on the Download button to download the dmexpress and/or dmexpresslicense-XXXXX parcel from the remote repository. Click on the Distribute button to distribute the dmexpress and/or dmexpresslicense-XXXXX parcel to the nodes in the cluster. By default, the files are written to /opt/cloudera/parcels/parcel_name/ on each node.

4. Upon completion of the distribution, either or both parcels can be activated by clicking on its Activate button. If there was a previously activated distribution of DMX-h, be sure that no DMX-h jobs are running, because Cloudera Manager automatically deactivates the old parcel upon activation of the new parcel, and any running jobs fail.
5. Upon activation, the symbolic link /usr/dmexpress is created/updated to point to the activated DMX installation.
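To confirm the activation from a shell on a cluster node, a quick sketch (assuming the default locations mentioned above):

ls -l /usr/dmexpress                          # should link into the activated parcel
ls /opt/cloudera/parcels | grep -i dmexpress  # distributed parcel directories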

See the Cloudera Manager Enterprise Edition User Guide for details on Managing Parcels.

Apache Ambari Service Installation

Pre-Installation

Execute the following steps on the machine where the Ambari server resides:

1. Run the self-extracting shrink-wrap executables for the software and license packages from the directory where they are located. For the software executable, this is:

   ./dmexpress-<version>-<OS>.ambari-service.bin

   For the license executable, this is:

   ./dmexpresslicense-<keyID>-<expiration date>-<OS>.ambari-service.bin

   For example, dmexpresslicense-12345-20180928-any.ambari-service.bin

2. Read and accept the Software License Agreement.
3. Enter a target directory in which to extract the DMX-h or DMX-h license service folder, or press Enter to accept the default, which is the current directory. If a folder with the same name already exists, you are prompted to overwrite; enter yes to overwrite, or no to exit the extracting process.
4. Enter a target directory in which to copy the DMX-h or DMX-h license service package where it can be found by the Ambari server, or press Enter to accept the default, which is the root path of the latest stack.
5. Enter yes to restart the Ambari server for the new package to be picked up, or no to restart later.
6. If the DMX-h or DMX-h license service definition, respectively, already exists in the repository, you are prompted to upgrade; enter yes to upgrade, or no to exit the process without updating the existing service definition package.
7. Enter the Ambari server's hostname, username, and password, and the cluster name, as prompted, to complete the upgrade.


a. If the credentials entered fail, you can re-run this step manually by executing the following script, where <target directory> is the directory you specified in step 3:

   <target directory>/services/DMXh/package/scripts/prepare_dmxh_upgrade.sh

b. If the credentials entered fail for the license package, execute this script:

   <target directory>/services/DMXhLicense/package/scripts/prepare_dmxh_license_upgrade.sh

8. If there is no license installed, repeat steps 1-7 for the license .bin file.

Installation

Install the DMX-h and/or DMX-h License service on all nodes in the cluster as follows:

1. Log in to the Ambari dashboard and select Actions->Add Service.
2. On the Add Service Wizard page, select DMX-h and/or DMX-h License and click Next.
3. On the Assign Slaves and Clients page, check Client for all nodes, and click Next.
4. On the Configure Services page, click Next to continue with the default options (recommended). Alternatively, if you wish to change the default installation directory, expand the "Advanced" section and make changes to the DMX-h Base Directory setting, ensuring that the same directory is specified for both the DMX-h and DMX-h License tabs, and then click Next.
5. On the Review page, verify the configuration and click Deploy to deploy DMX-h and/or DMX-h License, or click Back to make modifications.
6. On the Install, Start and Test page, wait for the DMX-h and/or DMX-h License service to be successfully installed on each node. If an error occurs, select the "Failures encountered" text to display an error log and identify the problem.

See http://docs.hortonworks.com/ for details on Apache Ambari.

RPM Installation

Pre-Installation

Execute the following steps on one node in, or with access to, the Hadoop cluster:

1. Run the self-extracting shrink-wrap executables for the software and license packages from the directories where they are located. For the software RPM, this is:

   ./dmexpress-<version>-1.x86_64.bin

   For the license RPM, this is:

   ./dmexpresslicense-<keyID>-<expiration date>-<release>.<arch>.bin

   For example, dmexpresslicense-12345-20180927-1.x86_64.bin

2. Read and accept the Software License Agreement.
3. Enter a target directory in which to put the extracted RPM file (the default is the current folder).

Installation

You can deploy the RPM on all nodes in the cluster using configuration management software, or install the DMX-h RPM package on all nodes in the cluster directly:


1. Execute the following command with sudo or root privileges:

   rpm -i dmexpress-<version>-1.x86_64.rpm

The license RPM equivalent command is:

   rpm -i dmexpresslicense-<keyID>-<expiration date>-<release>.<arch>.rpm

This creates a dmexpress folder under the default install location of /usr. To install to a different location (not recommended), use the --prefix option for both license and software installs, such as:

   rpm -i --prefix /some/other/directory dmexpress-<version>-1.x86_64.rpm

Alternatively, the RPM can be installed with your Linux distribution's high-level package manager if it supports RPM. For example, on RHEL and CentOS, the yum command can be used:

   yum install dmexpress-<version>-1.x86_64.rpm

or

   yum install dmexpresslicense-<keyID>-<expiration date>-<release>.<arch>.rpm

If there is an existing package, you can upgrade the software or license RPM instead:

   rpm -U <package>.rpm

or

   yum upgrade <package>.rpm

Manual/Silent Installation

Pre-Installation

The following steps are required prior to running the manual installation:

1. Create a shared directory, hereafter referred to as <shared directory>, that can be accessed by all nodes in the cluster for sharing the following files/folders (otherwise, they would need to be copied to the same location on each node in the cluster):
   • The DMExpressLicense.txt file obtained from the download.
   • The dmexpress sub-directory created upon the dmexpress tar file extraction.
   • The response file for the DMX silent installations (generated upon install on the first node).
2. Extract the DMX software.
   a) Copy DMExpressLicense.txt and the dmexpress tar file to the <shared directory>.
   b) Extract the contents of the dmexpress tar file in the <shared directory> on your UNIX system:

      tar xvof dmexpress_<version>-1_linux_2-6_x86-64_64bit.tar

   This creates a dmexpress/ directory under the current directory, hereafter referred to as the <DMX extract dir>.

Installation

To install DMX-h on each node in the cluster, follow the instructions under UNIX Systems, Silent Installation. You must manually install DMX-h on the first node, specifying a file to record your


responses to the install prompts, and can then silently install DMX-h on the remaining nodes using the recorded response file, ensuring that all nodes are configured consistently.
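For instance, the silent installs could be driven from the first node with a loop like the following sketch (the node names, shared directory path, and response file name are assumptions; adjust them to your cluster):

SHARE=/mnt/dmxshare                      # the <shared directory>
for node in node02 node03 node04         # the remaining cluster nodes
do
    # replay the recorded responses silently on each node
    ssh $node "cd $SHARE/dmexpress && ./install -s $SHARE/install.response -slog /tmp/dmx_install.log"
done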

When running the manual installation on the first machine, respond no to the prompt about installing the DMX Run-time Service unless you want all the nodes in the cluster to install/run it. See Installing/Upgrading the DMX Run-time Service for instructions on installing it on at least one machine to which DMX-h jobs are submitted from the GUI.

Installing/Upgrading the DMX Run-time Service

The DMX Run-time Service (dmxd) must be installed and running on any machine to which DMX-h jobs are submitted from the GUI; typically this is the machine designated as the edge node. If you install/upgrade DMX-h on the edge node using any of the managed installation methods, or using the Manual/Silent installation method where you answer no to the prompt about installing the service, the DMX Run-time Service is not installed/upgraded.

To install/upgrade the DMX Run-time Service on any machine where DMX-h is installed, follow the instructions for UNIX systems in Configuring the DMX Run-time Service.

Cluster in the cloud using Cloudera Director

Using Cloudera Director, you can install DMX-h on all of the nodes of a cluster in Google Cloud Platform (GCP) or in Amazon Web Services (AWS).

Provided that you update the Cloudera Director configuration file, Cloudera Director can install DMX-h as part of a cluster creation process that is initiated from the Cloudera Director command-line interface (CLI).

Note: As Cloudera works toward supporting third-party parcels in Cloudera Director, Syncsort is committed to updating the DMX-h installation procedures in alignment with enhanced Cloudera Director functionality.

Pre-Installation

To enable Cloudera Director to install DMX-h on a cluster in the cloud, update the instancePostCreateScripts section of the Cloudera Director configuration file to invoke a DMX installation script, which you create. At a minimum, the DMX installation script must install the DMX RPM.

Example: instancePostCreateScripts section of a Cloudera Director configuration file

In the following instancePostCreateScripts example, the DMX installation script is copied from a Google Cloud Storage bucket and executed.

instancePostCreateScripts: ["""#!/bin/sh
echo "Installing DMExpress..."
/usr/local/bin/gsutil cp gs://<bucket>/installdmx.sh installdmx.sh
chmod a+x installdmx.sh
sudo ./installdmx.sh
if test $? -ne 0
then
    echo Failed to install DMX on cluster nodes.
    exit 1
fi
echo "Done installing DMX ..."
exit 0
"""]


Example: DMX installation script

#!/bin/bash
version=9.2
shrinkWrapFile=dmexpress-${version}-1.x86_64.bin
shrinkWrapResponse=shrinkWrapResponse.txt
# create the shrink-wrap response file
cat <<EOF > $shrinkWrapResponse
a
EOF
/usr/local/bin/gsutil cp gs://<bucket>/$shrinkWrapFile $shrinkWrapFile
if test $? -ne 0
then
    echo Failed to copy DMX shrinkwrap file from the bucket
    echo ""
    exit 1
fi
chmod a+x $shrinkWrapFile
# extract the rpm
./$shrinkWrapFile < $shrinkWrapResponse > shrinkWrap.out 2>&1
# install the rpm
rpm -i dmexpress-${version}-1.x86_64.rpm
if test $? -ne 0
then
    echo Failed to install DMX RPM package
    echo ""
    exit 1
fi
rm -f $shrinkWrapResponse
rm -f $shrinkWrapFile
rm -f dmexpress-${version}-1.x86_64.rpm

Installation

From the Cloudera Director CLI, create the cluster. When the Cloudera Director cluster deployment completes successfully, DMX-h is installed on all of the nodes in the cluster.

Post-installation

To enable the submission of DMX-h jobs from the DMX Job Editor on a Windows instance, do the following:

1. SSH to the ETL server/edge node and run a preparation script, which you create, to do the following: start the DMX Run-time Service, dmxd; create a UNIX account, dmxuser/dmxuser; and enable password authentication for SSH.

Example: ETL server/edge node preparation script

#!/bin/bash
# (1) start dmxd on master-node
DMEXPRESS_HOME_DIRECTORY=/usr/dmexpress
export DMEXPRESS_HOME_DIRECTORY
if [ "`umask`" != "022" -a "`umask`" != "0022" -a "`umask`" != "000" -a "`umask`" != "00" -a "`umask`" != "0000" -a "`umask`" != "002" -a "`umask`" != "02" -a "`umask`" != "0002" -a "`umask`" != "020" -a "`umask`" != "0020" ]
then
    umask 022 2>/dev/null
fi
if [ ! -f $DMEXPRESS_HOME_DIRECTORY/bin/dmxd ]
then
    echo Failed to locate the DMX Run-time Service 'dmxd'.
    exit 1
fi
mkdir -p $DMEXPRESS_HOME_DIRECTORY/logs
echo "JOBS_DETAILS_DIR=$DMEXPRESS_HOME_DIRECTORY/logs" > $DMEXPRESS_HOME_DIRECTORY/bin/dmxd.conf
echo "DMEXPRESS_EXE=$DMEXPRESS_HOME_DIRECTORY/bin/dmexpress" >> $DMEXPRESS_HOME_DIRECTORY/bin/dmxd.conf
echo "DMEXPRESS_AUTHENTICATION_METHOD=DEFAULT" >> $DMEXPRESS_HOME_DIRECTORY/bin/dmxd.conf
PATH=$DMEXPRESS_HOME_DIRECTORY/bin:$PATH:/usr/bin; export PATH
LD_LIBRARY_PATH=$DMEXPRESS_HOME_DIRECTORY/lib:$LD_LIBRARY_PATH; export LD_LIBRARY_PATH
cd $DMEXPRESS_HOME_DIRECTORY/bin
echo Starting the DMX Run-time Service at `date`...
nohup ./dmxd ./dmxd.conf 1>dmxd.stdout 2>dmxd.stderr &
# (2) create dmxuser
useradd -d /home/dmxuser -m -s /bin/bash "dmxuser"
echo "dmxuser:dmxuser" | chpasswd
if test $? -ne 0
then
    echo Failed to set password for user dmxuser.
    exit 1
fi
# (3) enable password authentication for sftp
cat /etc/ssh/sshd_config | sed -e "s/PasswordAuthentication.*no/PasswordAuthentication yes/" > sshd_config_temp
mv sshd_config_temp /etc/ssh/sshd_config
/etc/init.d/sshd restart
if test $? -ne 0
then
    echo Failed to enable ssh password login.
    exit 1
fi
exit 0

2. As dmxd runs on port 32636 and the SSH service runs on port 22, modify the edge node network rules to allow TCP connections to these ports from the Windows instance.

Configuring the DMX Run-time Service

The DMX Run-time Service needs to be running on any system to which you want to submit tasks or jobs for execution. It is also required for certain other functions, such as file browsing in a multi-locale environment, or viewing server statistics from the client.

The DMX Run-time Service is usually configured during installation to determine the following options:

• automatic restart on system startup
• PAM authentication on supported UNIX and Linux platforms

To change these options, you can reconfigure the DMX Run-time Service as described below.
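For reference, a dmxd.conf on a UNIX system might contain entries like the following (a sketch assembled from the settings mentioned elsewhere in this guide; the paths are assumptions and your installed file may differ):

JOBS_DETAILS_DIR=/usr/dmexpress/logs
DMEXPRESS_EXE=/usr/dmexpress/bin/dmexpress
DMEXPRESS_AUTHENTICATION_METHOD=PAM
DMEXPRESS_TCP_PORT=32636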


Stopping and starting the DMX Run-time Service

The DMX Run-time Service can be stopped and restarted at any time. Before stopping the DMX Run-time Service, please verify that no job or task submitted from the graphical interface is running.

If a DMX client running a version prior to 5.2.5 connects to this DMX Run-time Service, then the Remote Procedure Call (RPC) service must be running on the system when the DMX Run-time Service starts. The RPC service is used to obtain additional ports required to connect to older DMX clients. Refer to the section below on RPC ports used by older DMX clients. Otherwise, the RPC service is not required and all ports associated with it may be blocked.

Windows systems

This procedure requires Administrator-level access. Select the DMExpress Service from Control Panel, Administrative Tools, Services, then select Properties from the pop-up menu. This opens the DMExpress Service Properties dialog. Use the Start (or Stop) button. A progress bar may appear, and the Service status in the properties window changes to Started (or Stopped).

NOTE: In order to submit a job or task, a user must have local login privileges to the machine where the service is running.

UNIX systems

Root-level access is required. Run the install script in the DMX installation directory. For example:

cd /usr/software/DMExpress
./install

This gives you the option of configuring the DMX Run-time Service, where you can choose to stop and/or start the service.

Automatic restart on system startup

Windows systems

Select Automatic in Startup type in the DMX Run-time Service Properties dialog to have the DMX Run-time Service started automatically when the system starts; select Manual otherwise.

UNIX systems

Use the install procedure as described under Stopping and starting the DMX Run-time Service above. You are asked the appropriate questions.

PAM authentication

UNIX systems

To configure DMX to use PAM authentication, do the following:

1. Use the install procedure to stop and restart the service as described above. If you have Pluggable Authentication Modules (PAM) installed and configured on the system, you are asked whether DMExpress should use PAM to authenticate users.

2. Include the PAM library in the system library path. DMExpress specifically looks for the library name libpam.so. If your library has a different name, such as libpam.so.0.81.5, create a symbolic link to it in any directory that is included in the shared library path environment variable. For example, this can be done in the DMX lib directory, specifying the full path to your library:


cd <DMXInstallDir>/lib
ln -s /lib64/libpam.so.0.81.5 libpam.so

3. Modify the configuration of the Active Directory software that handles all network connections to the server running the DMX Run-time Service, dmxd.
   • On Linux systems, have your system administrator create a file named dmxd in the /etc/pam.d/ directory and grant authentication and account management privileges to the dmxd service. Alternatively, you can do the following:

1. Create a file named dmxd in the directory /etc/pam.d 2. Copy the contents of sshd to dmxd

• On UNIX systems, have your system administrator create a dmxd entry in the pam.conf file, which is located in the /etc/ directory, and grant authentication and account management privileges to the dmxd service. Alternatively, you can do the following:

1. Create an entry named dmxd in the file /etc/pam.conf
2. Copy the contents of telnet to the entry created for dmxd

4. Ensure that DMX is configured to use PAM authentication:
   • Check the installation log file, install.log, which was created in the directory where you installed DMX. If PAM is installed on your system, the DMX installation log includes a question asking if DMX should use PAM authentication. Verify that the recorded response is yes [y].
   • Alternatively, if you have root access to the DMX remote server, log in as root and verify that the following appears in the service configuration file, dmxd.conf, which is located in the <DMXInstallDir>/bin directory:

     DMEXPRESS_AUTHENTICATION_METHOD=PAM

Communication ports required by the DMX server

The following TCP and UDP ports are used for communication with the DMX server. When configuring your firewall, make sure the required ports are not blocked anywhere between the system running the DMX server and the systems running the DMX client.

If any DMX client that connects to this server is running a version of DMX older than 5.2.5, additional RPC ports are used, and may be configured. Refer to the section below on RPC ports used by older DMX clients.

Port number/transport    Description

32636/TCP DMX server port, used for communication with DMX client, or with other DMX servers when using Grid Computing. It is not recommended to override this port number; please contact Syncsort technical support if you need to do so. Refer to the section below on Technical Support.

In addition, if a DMX task or job uses a remote UNIX server connection, or a Windows network path with a UNC name, to access data (including source and target files) or metadata (including tasks, jobs and external metadata), the following ports need to be open on the system hosting the files.


Port number/transport    Description

20/TCP,UDP FTP data port, if Secure FTP is not used

21/TCP,UDP FTP control port, if Secure FTP is not used

22/TCP,UDP Secure FTP port, if Secure FTP is used

445/TCP,UDP Windows shares

50070/TCP,UDP Hadoop Distributed File System name node
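For example, on a Linux server running firewalld, the DMX server port could be opened as follows (a sketch; adapt it to your firewall software and security policy):

# open the DMX server port and reload the firewall rules
firewall-cmd --permanent --add-port=32636/tcp
firewall-cmd --reload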

RPC ports used by older DMX clients

DMX clients older than 5.2.5 require additional ports to communicate with the DMX Server. These ports are assigned by the RPC service at the time the DMX Run-time Service starts.

The following ports are used in addition to the standard ports used by the DMX Run-time Service.

Port number/transport    Description

Arbitrary port/TCP    DMX server port used for communication with DMX clients. An arbitrary port is assigned when the DMX Run-time Service is started. The port number can be configured as mentioned below, for example if your security policy does not allow a wide range of ports to be open, or due to the presence of a firewall.

111/TCP,UDP UNIX RPC port mapper

135/TCP,UDP Windows RPC endpoint mapper

Configuring the Server port

Windows systems

On the machine where the DMX Run-time Service is installed, open the DMX Run-time Service Properties dialog and stop the DMX Run-time Service as described above.

In the Start parameters edit box of the properties window, type:

/tcpport <port>

where <port> is the port you want the service to use. For example: /tcpport 7771

Start the DMX Run-time Service as described above. The DMX Run-time Service now uses the port you provided.

UNIX systems

Stop the DMX Run-time Service as described above.


Edit the service configuration file, dmxd.conf, which is located in the <DMXInstallDir>/bin directory, to insert the following line:

DMEXPRESS_TCP_PORT=<port>

where <port> is the port you want the service to use. For example: DMEXPRESS_TCP_PORT=7771

Stop and start the DMX Run-time Service as described above. The DMX Run-time Service now uses the port you provided.

Applying a New License Key to an Existing Installation

Applying a new license key updates your product license to a new licensed version. If your new license enables features or products not installed in your original installation, applying a new license key does not install them automatically.

Windows Systems

Applying a new key interactively

Perform the following steps to apply a new license key to an existing DMX installation:

1. Go to Programs, DMExpress from the Start menu and select Apply a New License Key.
2. Browse to the location of the license key file, DMExpressLicense.txt, or type in the license key manually, when prompted.
3. Read the terms of the Syncsort License Agreement and confirm your acceptance of them.
4. Review the product options, components and features that are enabled by your license key.
5. Confirm the location of the existing DMX installation.

Applying a new key silently
Applying a new license key silently requires a setup file which can be recorded when applying a new license key interactively.

To record the setup file
Open a command prompt; type the full path to the program applykey.exe, followed by the options: –r –f1<setup file> where <setup file> is the full path to the setup file that is created. For example, if DMX is installed in “C:\Program Files\DMExpress\”, type: "C:\Program Files\DMExpress\Programs\applykey.exe" –r –f1c:\temp\setup.iss

An interactive session begins and the options that are selected during the interactive session are recorded in the specified setup file.

To run the applykey.exe program in silent mode
Open a command prompt; type the full path to the program applykey.exe, followed by the options: –s –f1<setup file> -slog<log file>


where <setup file> is the full path to a setup file that was created using the steps above, and <log file> is the full path to the log file which contains any output produced by the silent install run. For example, if DMX is installed in “C:\Program Files\DMExpress\”, type: "C:\Program Files\DMExpress\Programs\applykey.exe" –s –f1c:\temp\setup.iss –slogc:\temp\setup.log

If you do not specify the –slog option, then applykey generates a log, setup.log, in the folder where the silent setup file is located.

Multiple command line options are separated with a space, but there should be no spaces inside a command line option (for example, –slogc:\setup.log is valid, but –slog c:\setup.log is not).

UNIX Systems

Apply a new key interactively
Perform the following steps to apply a new license key to an existing DMX installation:

1. Change to the DMX installation directory and run the applykey program:
cd <DMX installation directory>
./applykey

2. Specify the location of the license key file when prompted.
3. Read the terms of the Syncsort License Agreement and confirm your acceptance of them.
4. Review the product options, components and features that are enabled by your license key.

Applying a new key silently
A silent applykey process allows you to easily apply a new DMX license key on multiple machines with identical options. You simply apply the key interactively on the first machine using the record option to save your responses to applykey prompts in a file. Then you run the silent applykey process on the remaining machines, pointing to the recorded response file. Because the silent applykey process is non-interactive, it can be scripted to effectively automate applying the license key on many machines.

1. To prepare to run the silent applykey process, initiate the interactive applykey process on the first machine as described in the section above, but in step 1, run the applykey command with the record option, –r, specifying the file in which to store your responses to applykey prompts as follows: ./applykey –r <response file>

Note: Before initiating the silent applykey process, ensure that all actively running jobs complete successfully.
2. Upon successful completion of the interactive applykey process, run the applykey program with the silent option, –s, and the silent log option, –slog, on the remaining machines that require the new key as follows: ./applykey –s <response file> -slog <log file>

where:


• <response file> is the full path to the response file generated by the interactive applykey process.
• <log file> is the full path to the log file generated by the silent applykey process.

DMX-h in a Hadoop Cluster
The method for applying a new license key to DMX-h on the nodes of a Hadoop cluster depends on how you originally installed DMX-h in the cluster. Follow the instructions in the appropriate section below.

Cloudera Manager Parcel Apply Key

1. Install and activate the new DMX-h license Cloudera parcel, as described in Cloudera Manager Parcel Installation. The software parcel does not need to be modified.
2. (optional) Uninstall the old DMX-h license Cloudera parcel, as described in Cloudera Manager Parcel Uninstall.

Apache Ambari Server Apply Key

1. Install the new DMX-h Ambari license service definition package, as described in Apache Ambari Service Installation. The software package does not need to be modified. This effectively updates the existing service definition package.

RPM Apply Key

1. Install the new DMX-h license RPM package, as described in RPM Installation. This effectively updates the license key.

Manual/Silent Apply Key
See UNIX systems, Applying a new key silently.

Running DMX
Once you have installed DMX, you can create tasks corresponding to different stages of your process via the DMX Task Editor, and group tasks as jobs and run jobs via the DMX Job Editor. You can schedule jobs to run later or run them from within a batch script. You can obtain more information on both the graphical user interfaces and on running tasks and jobs from the command line from the DMX Online Help.

Graphical User Interfaces
On Windows systems, go to Programs, DMExpress from the Start menu and select DMExpress Task Editor to run the DMX Task Editor. To run the DMX Job Editor, either select it from the Start, Programs, DMExpress menu, or switch to it from within the Task Editor via the Run, Create Job menu item.

DMX Help
To access DMX Help, go to Programs, DMExpress from the Start menu and select DMX Help or select the Help, Topics menu item from within the Task Editor or the Job Editor.


Connecting to Databases from DMX
In order for DMX to access database tables as sources or targets, the appropriate database client software must be on the system and accessible via the appropriate shared library or dynamic link library (dll) paths. The following environment variable must be set to include the path to the database client libraries and exported on the corresponding platform:

Platform    Environment variable
Windows     PATH
AIX         LIBPATH
HP-UX       SHLIB_PATH
Linux       LD_LIBRARY_PATH
Solaris     LD_LIBRARY_PATH

On UNIX systems, the variable needs to be set and exported prior to starting the DMX Run-time Service or running DMX tasks or jobs.

Additional client configuration might be required for a specific DBMS. The configuration steps needed to access a specific DBMS are described in the following sections.

The DMX install program assists you with configuring and/or verifying connections to databases.

On UNIX systems, if you wish to configure and/or verify database connections any time after the installation procedure, run the databaseSetup program as follows:
cd <DMX installation directory>
./databaseSetup

Amazon Redshift

Initial requirements
Before attempting to connect to Amazon Redshift, do the following:

• Configure the DMX server, which can be either an Amazon Elastic Compute Cloud (EC2) instance or your local machine, to accept SSH connections.
• Depending on the DMX server, consider the following:
o EC2 instance – Set the size of the maximum transmission unit (MTU).
o Local machine – Due to throughput on the wide area network (WAN), you may notice a performance lag at design time and at runtime. If the local machine is behind a firewall, you may need to configure a Virtual Private Network (VPN) to connect to the local machine from Amazon Redshift.

• Configure the DMX server to include the Amazon Redshift cluster public key and cluster node IP addresses (steps 1 and 2 are sketched after this list):
1. Retrieve the Amazon Redshift cluster public key and cluster node IP addresses.
2. Add the Amazon Redshift cluster public key to the DMX host's authorized keys file.
3. Configure the DMX host to accept all of the Amazon Redshift cluster node IP addresses.
4. Get the public key for the DMX host.
• Specify Amazon Redshift parameters in the DMX Redshift configuration file.
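A sketch of steps 1 and 2 above, assuming the AWS CLI is installed on a UNIX DMX server; the cluster identifier is illustrative:

aws redshift describe-clusters --cluster-identifier examplecluster \
    --query "Clusters[0].ClusterPublicKey" --output text >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys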


The parameters outlined in the DMX Redshift configuration file, as defined by the DMX_REDSHIFT_INI_FILE environment variable, provide DMX with the values required to access an Amazon S3 bucket and to invoke the Amazon Redshift COPY command. Note: If DMX_REDSHIFT_INI_FILE is not set, DMX issues an error message upon task initiation and the DMX task aborts.
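For example, on UNIX systems the variable might be set and exported before starting the DMX Run-time Service; the path shown is illustrative:

export DMX_REDSHIFT_INI_FILE=/etc/dmx/DMXRedshift.ini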

A sample DMX Redshift configuration file is provided in the DMX installation directory as follows:
Windows: C:\Program Files\DMExpress\Examples\Databases\Redshift\DMXRedshift.ini
UNIX: <DMX installation directory>/etc/DMXRedshift.ini

Installation and configuration
Connectivity between DMX and Amazon Redshift databases is established through the Amazon Redshift ODBC driver and, when loading, through multiple SSH connections.

DMX optimizes load performance to Amazon Redshift databases through the invocation of the Amazon Redshift COPY command.

Amazon Redshift ODBC driver installation

Windows systems
For Windows systems, ODBC driver installation includes the following:

1. Install and configure the Amazon Redshift ODBC 32-bit driver on Windows operating systems.
2. When creating a system DSN entry for the ODBC connection, ensure the following settings on the given dialogs:
• Amazon Redshift ODBC Driver DSN Setup dialog: Use Declare/Fetch is selected.
• Amazon Redshift Data Type Configuration dialog:
o Use Unicode is unselected.
o Show Boolean Column As String is unselected.
o Max Varchar (Default 255) is populated with the value 65530.

UNIX systems
For UNIX systems, ODBC driver installation includes the following:

1. Install the Amazon Redshift ODBC 64-bit driver on Linux operating systems.
2. Configure the ODBC driver on Linux operating systems. When using the unixODBC driver manager, override the standard threading settings in the ODBC section of odbcinst.ini as follows:
[ODBC]
Threading = 1

3. Update odbc.ini with the following name-value pairs:
UseDeclareFetch=1
UseUnicode=0
BoolsAsChar=0
MaxVarchar=65530
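Putting these settings together, a minimal odbc.ini data source entry might look like the following sketch; the DSN name, driver path, host, and database are illustrative, and the key names follow the Amazon Redshift ODBC driver conventions:

[Redshift]
Driver=/opt/amazon/redshiftodbc/lib/64/libamazonredshiftodbc64.so
Host=examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com
Port=5439
Database=dev
UseDeclareFetch=1
UseUnicode=0
BoolsAsChar=0
MaxVarchar=65530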


Azure Synapse Analytics (formerly SQL Data Warehouse)

Azure Synapse Analytics (formerly SQL Data Warehouse) is a cloud-based Enterprise Data Warehouse (EDW) developed by Microsoft. Through JDBC connectivity, DMX-h supports Azure Synapse Analytics as sources and targets.

Azure Synapse Analytics connection requirements
Azure Synapse Analytics requires a JDBC connection configuration with the driver name and location for all connections. The parameters outlined in a DMX Azure Synapse Analytics configuration file include the following (a sample sketch follows the list):

• DriverName - Required JDBC driver name.
• DriverClassPath - Required JDBC class path.
• MAXPARALLELSTREAMS - Optional maximum number of parallel streams created to load data for performance and according to demand.
• STORAGEACCESSKEY - Required. Azure Blob Storage access key for an active account. If the storage access key is missing or invalid, DMX issues an AZSQDWTERR error message and aborts the job.
• WORKTABLECODEC - Optional compression codec to use to compress files in the staging table. DMX currently supports the gzip compression codec only.
• WORKTABLEDIRECTORY - Required. A URL that includes the Blob Storage account name with the endpoint, including the container name. See https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction#blob-storage-resources. For example:

WorkTableDirectory=https://<accountname>.blob.core.windows.net/dmx-azstorage-container

Where <accountname> is the blob storage account name and dmx-azstorage-container is the container name. If the work table directory is missing or invalid, DMX issues an AZSQDWTERR error message and aborts the job.

• WORKTABLESCHEMA - Optional schema name to create the staging data. If this parameter is not set, DMX creates tables in the same schema as the target table.
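A sketch of a DMX Azure Synapse Analytics configuration file using the parameters above; all values are illustrative, and the driver class shown is the Microsoft JDBC Driver for SQL Server:

DriverName=com.microsoft.sqlserver.jdbc.SQLServerDriver
DriverClassPath=/opt/jdbc/mssql/
StorageAccessKey=<storage access key>
WorkTableDirectory=https://myaccount.blob.core.windows.net/dmx-azstorage-container
WorkTableCodec=gzip
WorkTableSchema=dmx_staging
MaxParallelStreams=4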

Defining Azure Synapse Analytics database connections
In the Database Connection dialog, define a connection to an Azure Synapse Analytics database as follows:

• At DBMS, select Azure Synapse Analytics.
• At Access Method, select JDBC.
• At Database, select a previously defined Azure Synapse Analytics JDBC database connection URL.
• At Authentication, select Auto-detect.

DMX requirements to load data into an Azure Synapse Analytics target
Before using DMX to load data into an Azure Synapse Analytics target, do the following:


1. Create or verify that the master database contains a database master key.
2. Enable the db_owner privilege for the user connecting to Azure Synapse Analytics. Alternately, set or verify the following more granular privileges for the connecting user:

EXEC sp_addrolemember 'db_datawriter', '<user>';
GRANT CONTROL TO <user>;

Azure Synapse Analytics target connections
Using an Azure Synapse Analytics JDBC connection, DMX-h can write supported Azure Synapse Analytics data types to Azure Synapse Analytics targets directly for optimal performance.

Defining Azure Synapse Analytics targets
At the Target Database Table dialog, define an Azure Synapse Analytics database table target:

1. At Connection, select a previously defined Azure Synapse Analytics target connection or select Add new... to add a new one.

2. Select a table from the list of Tables, or select Create new... to create a new one.
o User defined SQL statement is not supported.
o All target disposition methods are supported.
3. On the Parameters tab, the following optional parameters are available for Azure Synapse Analytics target database tables. Values specified here take precedence over their corresponding property in the JDBC configuration file, if any.
o Maximum parallel streams - the maximum number of parallel streams that can be established to load data for performance and that are created according to demand.
o Work table directory - Required. A URL that includes the Blob Storage account name with the endpoint, including the container name. See https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction#blob-storage-resources. For example:

WorkTableDirectory=https://<accountname>.blob.core.windows.net/dmx-azstorage-container

Where <accountname> is the Blob storage account name and dmx-azstorage-container is the container name. If the work table directory is missing or invalid, DMX issues an AZSQDWTERR error message and aborts the job.

o Work table codec - specifies the compression algorithm used to compress data staged in Blob storage.
o Work table schema - the schema used to create the staging table.
4. The Set commit interval and Abort task if any record is rejected options are not supported.

Azure Synapse Analytics source connections
Using an Azure Synapse Analytics JDBC connection, DMX can read supported Azure Synapse Analytics data types from any Azure Synapse Analytics table.

Defining Azure Synapse Analytics sources
For all DMX-h ETL jobs, DMX-h supports Azure Synapse Analytics database tables as sources and as lookup sources. At the Source Database Table dialog or at the Lookup Source Database Table dialog, define either an Azure Synapse Analytics database table source or lookup source respectively:


• At Connection, select a previously defined Azure Synapse Analytics source connection or select Add new... to add a new connection.

Databricks
Databricks is a cloud database Platform-as-a-Service for Spark supported on Azure and AWS Cloud Services. Through JDBC connectivity, DMX-h supports Databricks databases as sources and targets.

Databricks connection requirements
Databricks requires a JDBC connection configuration with the driver name and location for all connections.

Before attempting to connect to Databricks, do the following:

• Install DMX server on an Amazon Elastic Compute Cloud (EC2) instance, Azure Virtual Machine (VM), or your local machine.
• Specify JDBC and Spark parallelization parameters in the DMX JDBC configuration file.

The parameters outlined in the DMX JDBC configuration file, as defined by the DMX_JDBC_INI_FILE environment variable, provide DMX with the mandatory and optional values required to access an Amazon S3 bucket or Microsoft Azure blob to invoke a Databricks query.

• DMX accesses Databricks using keys-based authentication. If no access keys are provided, DMX issues a UNIAMCRE error message and aborts the job.

The parameters outlined in a DMX Databricks configuration file include the following (a sample sketch follows the list):

• DriverName - Required JDBC driver name.
• DriverClassPath - Required JDBC class path.
• ANALYZETABLESTATISTICS - When set to y, DMX can run analyze queries that collect table statistics. Default value is n.
• ANALYZECOLUMNSTATISTICS - When set to y, DMX can run analyze queries that collect column statistics. Default value is n.
• MAXPARALLELSTREAMS - Optional integer representing the maximum number of parallel streams that can be established for loading data to the staging data file. By default, MAXPARALLELSTREAMS is set to the number of CPUs available in the client machine.
• WORKTABLEDIRECTORY - Required path to an s3 bucket, Azure blob container, or Databricks File System (DBFS) store in which to stage data. You must mount an s3 bucket or Azure blob container using the Databricks File System (DBFS). Example URLs could include:
o s3a://dev for an S3 bucket
o wasbs://[email protected]/dev for an Azure Blob
o dbfs://dev for a DBFS store
• DBFSMOUNTPOINT – DBFS mount point (DBFS path) required by WORKTABLEDIRECTORY. DBFSMOUNTPOINT is mandatory if the work table directory maps to an S3/Azure URL.


• MAXWORKFILESIZE – Optional integer. The maximum size, in bytes, of a staging file written by a task. The default value is 134217728, which is equivalent to 128 MB.
• WORKTABLESCHEMA - Optional schema name to use for staging data. The default schema for staging data is the same as the target data schema.
• WORKTABLECODEC - A compression codec to compress the files in the staging directory. Valid values are gzip (default), bzip2, and uncompressed.
• AWSACCESSKEYID - A 20-character, alphanumeric string that Amazon provides upon establishing an AWS account. DMX ignores this parameter unless WORKTABLEDIRECTORY is an S3 bucket. If DMX runs in EC2, AWSACCESSKEYID is optional.
• AWSACCESSKEY - The 40-character string, also known as the secret access key, which Amazon provides upon establishing an AWS account. DMX ignores this parameter unless WORKTABLEDIRECTORY is an S3 bucket. If DMX runs in EC2, AWSACCESSKEY is optional.

DMX requires the access key id and the secret access key to send requests to an Amazon S3 bucket unless an AWS temporary session token is required, in which case DMX requires the access key id and AWS temporary session token. See the AWSTOKEN parameter below.

• AWSTOKEN - An AWS temporary session token, granting temporary security credentials (temporary access keys and a security token) to an IAM user to enable access to AWS services. This alternative authentication method replaces a full-access AWS storage access key. DMX ignores this parameter unless WORKTABLEDIRECTORY is an S3 bucket.
• AzureStorageAccessKey - A 512-bit Azure Blob Storage access key for an active account; Microsoft issues two such keys upon establishing an Azure Portal account. If DMX runs in the Azure Blob container, AzureStorageAccessKey is optional. If the storage access is required and the key is missing or invalid, DMX issues an AZSQDWTERR error message and aborts the job. DMX ignores this parameter unless WORKTABLEDIRECTORY is an Azure blob container.
• AzureStorageSAS - A shared access signature (SAS) URI that grants restricted access rights to Azure Storage resources. This alternative authentication method replaces a full-access Azure Storage access key. DMX ignores this parameter unless WORKTABLEDIRECTORY is an Azure blob container.
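A sketch of DMX JDBC configuration file entries for Databricks using the parameters above; all values are illustrative, and the driver class shown is the Simba Spark JDBC 4.1 driver commonly distributed for Databricks:

DriverName=com.simba.spark.jdbc41.Driver
DriverClassPath=/opt/jdbc/databricks/
WorkTableDirectory=dbfs://dev
WorkTableSchema=dmx_staging
WorkTableCodec=gzip
MaxParallelStreams=8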

Defining Databricks database connections
In the Database Connection dialog, define a connection to a Databricks database as follows:

• At DBMS, select Databricks.
• At Access Method, select JDBC.
• At Database, select a previously defined Databricks JDBC database connection URL.
• At Authentication, select Auto-detect.

Databricks target connections
Using a Databricks JDBC connection, DMX-h can write supported Databricks data types to Databricks targets directly for optimal performance.

Defining Databricks targets
At the Target Database Table dialog, define a Databricks database table target:


1. At Connection, select a previously defined Databricks target connection or select Add new... to add a new one.
2. Select a table from the list of Tables, or select Create new... to create a new one.
o User defined SQL statement is not supported.
o All target disposition methods are supported.
3. On the Parameters tab, the following optional parameters are available for Databricks target database tables. Values specified here take precedence over their corresponding property in the JDBC configuration file, if any.
o Analyze table statistics - enables analyze queries that collect table statistics.
o Analyze column statistics - enables analyze queries that collect column statistics.
o Maximum parallel streams - Optional integer representing the maximum number of parallel streams that Connect can establish for loading data into the staging data file. By default, this is set to the number of CPUs available in the client machine.
o Work table directory - the parent-level directory in s3, blob, and/or dbfs in which Connect creates job-specific subdirectories.

When the work table directory is an s3 bucket, you must mount the s3 bucket through DBFS. For more details, see the Databricks documentation concerning Amazon S3.

When the work table directory is an azure blob container, you must mount the blob container through DBFS. For more details, see the Databricks documentation concerning Azure storage.

o Work table schema - the schema used to create the staging table. By default, Connect creates the staging table in the same schema as the target table.
o Work table codec - specifies the compression algorithm used to compress Databricks data. Valid values are gzip (default), bzip2, and uncompressed.
4. The Set commit interval and Abort task if any record is rejected options are not supported.

Databricks source connections
Using a Spark JDBC connection, DMX can read supported Databricks data types from any Databricks table.

Defining Databricks sources
For all DMX-h ETL jobs, DMX-h supports Databricks database tables as sources and as lookup sources.

At the Source Database Table dialog or at the Lookup Source Database Table dialog, define either a Databricks database table source or lookup source respectively:

• At Connection, select a previously defined Databricks source connection or select Add new... to add a new connection.

DB2
Your DB2 client must be installed on the system and configured so that it can connect to databases that you want to access from DMX. For example, you can configure the client by cataloging databases, or by defining database aliases in the db2cli.ini file. Please refer to specific DB2 documentation for details on configuring the client.


Windows Systems
To access DB2 databases, DB2 client software must be accessible via the dynamic link libraries (dll) located under the <DB2 installation directory>/sqllib/bin folder, where <DB2 installation directory> denotes the directory where DB2 is installed.

UNIX Systems
To access DB2 databases, DB2 client software must be accessible via the shared libraries located under the <DB2 instance home directory>/sqllib/lib directory, where <DB2 instance home directory> denotes the home directory of the DB2 instance that you want to use to connect to the database.

Greenplum

Installation and configuration
DMX connects to Greenplum databases through the Greenplum ODBC driver and the Greenplum psql client utility, which is a component of the Greenplum client software.

Install and configure the Greenplum client software on the system on which the DMX client is installed.

To establish a connection to the Greenplum database, install the Greenplum ODBC driver and create, configure, and test the ODBC data source name (DSN).

Greenplum client software installation

Windows systems
For Windows systems, client software installation includes the following:

1. Install the Greenplum client software.
a) Register as a user on the Pivotal Network site.
b) From the Greenplum Clients section of the Pivotal Greenplum Database download site, download the Clients for Windows file, for example: greenplum-clients-<version>-build-<build>-WinXP-x86_32.msi

For information on installing and configuring the Greenplum Windows client software, refer to Greenplum Database Client Tools for Windows. The default Greenplum client installation directory is as follows: C:\Program Files (x86)\Greenplum\greenplum-clients-<version>-build-<build>

2. Verify that the Greenplum psql client utility is in a directory specified in the PATH.
Note: If the Greenplum psql client utility is not in the PATH when DMX initiates a load to the Greenplum database, DMX issues an error message and the task aborts.

UNIX systems
For UNIX systems, client software installation includes the following:

1. Install the Greenplum client software.
a) Register as a user on the Pivotal Network site.


b) From the Greenplum Clients section of the Pivotal Greenplum Database download site, download the applicable Greenplum UNIX client software, for example: greenplum-clients-<version>-build-<build>-<platform>

For information on installing and configuring the Greenplum UNIX client software, refer to Greenplum Database Client Tools for UNIX. The default Greenplum client installation directory is as follows: /usr/local/greenplum-clients-<version>-build-<build>

To set up the system environment variables, run greenplum_clients_path.sh: <installation directory>/greenplum_clients_path.sh

where <installation directory> is the Greenplum client software installation directory.

2. Verify that the Greenplum psql client utility is in a directory specified in the PATH.
Note: If the Greenplum psql client utility is not in the PATH when DMX initiates a load to the Greenplum database, DMX issues an error message and the task aborts.

Greenplum ODBC driver installation and configuration

Windows systems
For Windows systems, driver installation and configuration includes the following:

1. Install the Greenplum ODBC driver. From the Greenplum Connectivity section of the Pivotal Greenplum Database download site, download the Connectivity for Windows driver file, for example: greenplum-connectivity-<version>-build-<build>-WinXP-x86_32.msi

The default Greenplum ODBC driver installation directory is as follows: C:\Program Files (x86)\Greenplum\greenplum-connectivity-<version>-build-<build>\drivers\odbc

2. Verify that the ODBC driver libraries, which are dynamic linked libraries with the .dll extension, are installed successfully.
3. Create and configure the ODBC DSN.
4. When creating a system DSN entry for the ODBC connection, ensure the following on the Greenplum Advanced Options dialog:
o Use Declare/Fetch is selected.
o Show Boolean Column As String is unselected.
o Max Varchar (Default 255) is populated with the value 65530.

UNIX systems
For UNIX systems, the Greenplum driver is provided by DMX and is part of the DMX installation. The Greenplum driver, _Sgplm.so, is installed in the following directory: <DMX installation directory>/ThirdParty/DataDirect/lib


Note: <DMX installation directory> denotes the directory where DMX is installed.

Greenplum driver configuration includes the following:

1. Add an entry for the Greenplum driver in the odbc.ini file. A sample odbc.ini file is shipped with DMX and is located in the following directory: <DMX installation directory>/etc.

In the Greenplum Data Source section of the odbc.ini file, add the Greenplum driver entry: Driver=<DMX installation directory>/ThirdParty/DataDirect/lib/_Sgplm.so

2. Define the embedded DataDirect ODBC Driver Manager as the ODBC driver manager. The DataDirect ODBC Driver Manager is shipped with DMX and is installed in the following directory: <DMX installation directory>/ThirdParty/DataDirect
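Expanding on the driver entry in step 1, a fuller odbc.ini data source section might look like the following sketch; the connection key names (HostName, PortNumber, Database) follow DataDirect driver conventions, and the values are illustrative:

[Greenplum Data Source]
Driver=<DMX installation directory>/ThirdParty/DataDirect/lib/_Sgplm.so
HostName=gpmaster.example.com
PortNumber=5432
Database=dev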

Hive data warehouses
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis of large datasets stored in Hadoop's Distributed File System (HDFS) and other compatible file systems. Hive includes HiveQL, a query language useful for real-time analytics in Hadoop.

DMX-h can connect to Hive data warehouses as:

• sources when running on the ETL server/edge node or in the cluster
• targets when running on the ETL server/edge node or in the cluster

Hive tables can also be accessed as HCatalog sources and targets.

Hive source connections
Using a Hive ODBC or JDBC connection, DMX-h can read supported Hive data types from all of the supported Hive file types, including Apache Avro, Apache Parquet, Optimized Row Columnar (ORC), Record Columnar (RCFile), and text. JDBC is recommended over ODBC. For jobs run in the cluster, DMX-h supports reading from Hive sources using JDBC only.

Note: On an ETL server/edge node, reading from Hive sources via Hive ODBC/JDBC drivers yields low throughput and is best reserved for no more than a few gigabytes of data, such as pre-aggregated data for analytics.

JDBC connectivity
When DMX-h reads from a Hive table in the cluster via JDBC, the data is temporarily staged in compressed or uncompressed format to a text-backed Hive table.

Hive target connections
DMX-h writes supported Hive data types to Hive targets using different methods depending on whether the connection is via JDBC or ODBC. JDBC is recommended over ODBC. For jobs run in the cluster, DMX-h supports writing to Hive targets using JDBC only. Consider the following:

• JDBC - When DMX-h writes to a Hive table via JDBC, the data is generally loaded directly into target tables. Writes are temporarily staged in compressed or non-compressed format to a text-backed Hive table only when one of the following conditions limits direct access:


o A target table is an ACID table
o A target table has one or more partitions
o A target table has any complex type column(s)
o The target table performs Truncate, Upsert, or Upsert and Apply change (CDC) dispositions
o The job runs on localnode or singleclusternode
o A user forces DMX-h to stage data by setting the environment variable DMX_HIVE_TARGET_FORCE_STAGING to 1, which uses the two-step process implemented in previous versions of DMX
• ODBC - based on the file format and whether the Hive table is partitioned, one of the following methods is used to write to Hive:
o Method 1 - When DMX-h writes to a Hive table via ODBC, the data is temporarily staged in parallel streams in compressed or non-compressed format to a text-backed Hive table.
o Method 2 - DMX-h writes to Hive by loading the data in parallel streams directly to the Hadoop file system for optimal performance.

File format                                            Partitioned    Non-partitioned
Apache Avro, Apache Parquet, or delimited text files   Method 1       Method 2
Other file formats                                     Method 1       Method 1

Text-backed Hive table
The type of Hive table to which data is staged temporarily is determined as follows:

• When Work table directory is specified, DMX-h stages the data to a text-backed Hive external table.
• When a work table directory is not specified, DMX-h stages the data to a text-backed Hive managed table.

The temporary text-backed Hive table is deleted at the end of the DMX-h job.

Hive configuration
Connecting to Hive from DMX-h requires the following configuration components:

• Hive JDBC connection and/or Hive ODBC connection
• Windows JAR files for JDBC connections
• Hive table staging
• Hive table creation security (for Hive targets)
• Sentry/Ranger authorization, if being used

Hive ODBC connection
Hive ODBC connections can be used for Hive sources and targets. Configuring Hive ODBC requires the following steps, described in detail in the sections below:

1. Install and configure the Hive ODBC driver


2. Define a Hive ODBC data source

Installing and configuring the Hive ODBC driver
Ensure you have administrator/root privileges on the computer before you install the driver.

Installing and configuring on Windows

1. Go to one of the following Hadoop vendor websites and download the Windows 32-bit Hive ODBC driver and associated documentation. For example:
• Cloudera: http://www.cloudera.com/downloads/connectors/hive/odbc/2-5-21.html
• Hortonworks: http://hortonworks.com/hdp/addons/
• MapR: http://package.mapr.com/tools/MapR-ODBC/. Select the latest version of the file MapR_odbc_<version>_x86.exe.
2. After downloading the file, double-click the file to run the installer.
3. Follow the installer's instructions and use the default settings. For additional information about the installation and configuration settings, see the vendor's documentation.

Installing on Linux

1. Go to one of the following Hadoop vendor websites and download the Linux 64-bit Hive ODBC driver and associated documentation. For example:
• Cloudera: http://www.cloudera.com/downloads/connectors/hive/odbc/2-5-12.html. Download the appropriate RPM for your Linux distribution.
• Hortonworks: http://hortonworks.com/hdp/addons/. Download the appropriate tar file for your Linux distribution and extract the RPMs from it.
• MapR: http://doc.mapr.com/display/MapR/Hive+ODBC+Connector and http://package.mapr.com/tools/MapR-ODBC/. Navigate the directories for your Linux distribution and download the appropriate RPM.
2. Unpack the RPM package to install the driver files in the vendor's default location: rpm -i <RPM file name>.rpm

3. Note the location of the installed files for later configuration steps. The default installation location depends on the vendor and may be one of the following:
• /opt/Cloudera/hiveodbc/
• /usr/lib/hive/lib/native/hiveodbc/
• /opt/mapr/hiveodbc/

4. Continue with Configuring the Hive ODBC driver on Linux.

Configuring the Hive ODBC driver on Linux
The Hive ODBC driver installation includes the file <vendor>.hiveodbc.ini, which you use to configure the specific vendor's Hive ODBC driver. By default, this file begins with a leading period and gets installed in the user's home directory.


1. If you decide not to use the default location and file name for <vendor>.hiveodbc.ini, set an environment variable to locate the file. These examples assume you put the file (without a leading period) in the /etc/ directory:
• Cloudera:
o For Cloudera Hive ODBC driver version 2.5.12 and higher: export CLOUDERAHIVEINI=/etc/cloudera.hiveodbc.ini
o For Cloudera Hive ODBC driver versions prior to 2.5.12: export SIMBAINI=/etc/cloudera.hiveodbc.ini
• Hortonworks:
o For Hortonworks Hive ODBC driver version 0.11 and higher: export SIMBAINI=/etc/hortonworks.hiveodbc.ini
o For Hortonworks Hive ODBC driver versions prior to 0.11: export SIMBAINI=/etc/hortonworks.hiveodbc.ini
• MapR: export MAPRINI=/etc/mapr.hive.odbc.ini

2. Set the following driver manager options in your vendor-specific configuration file, <vendor>.hiveodbc.ini, under the Driver section. The default DMX-h driver manager is unixODBC.

a) Set DriverManagerEncoding to UTF-16.
b) Set ODBCInstLib to identify the ODBC installation's shared library for the ODBC driver manager. The DMX-h default location is <DMX installation directory>/lib/libodbcinstSSL.so.
[Driver]
DriverManagerEncoding=UTF-16
ODBCInstLib=<DMX installation directory>/lib/libodbcinstSSL.so

3. Ensure that the Hive ODBC driver library is included at the beginning of the system library path, LD_LIBRARY_PATH, by running the following command: export LD_LIBRARY_PATH=<Hive ODBC driver installation directory>/lib:$LD_LIBRARY_PATH

4. The default location DMX-h uses for the odbcinst.ini ODBC configuration file is /etc. If you decide to use a different location, set the ODBCSYSINI environment variable to the directory containing your file.
5. Configuration options set in the configuration file odbcinst.ini apply to all Hive connections. Create a section for the Hive ODBC driver and set the following options as follows:
[ODBC Drivers]
<Vendor> Hive ODBC Driver 64-bit=Installed
. . .
[<Vendor> Hive ODBC Driver 64-bit]
Description=<Vendor> Hive ODBC Driver (64-bit)
Driver=<driver installation directory>/lib/<driver file name>

For additional information about the installation and configuration settings, see the vendor's documentation.


Defining a Hive ODBC data source
To identify a Hive ODBC data source, you create a data source name (DSN) and set options required to connect to the data source.

Defining a Hive data source on Windows

1. Start the ODBC Data Source Administrator by following the instructions for Windows systems at Defining ODBC Data Sources.
2. In the Create New Data Source dialog, select your Hive ODBC vendor's driver from the list, and then click Finish.
3. Use the following settings in the ODBC Driver DSN Setup dialog for a Hive ODBC data source:

Data Source Name    Enter a name to identify the Hive DSN.

Host    Enter the IP address or hostname of the Hive server.

Port    Enter the listening port for the Hive service. The default is 10000.

Database    Enter the name of the database schema you want to use. The default schema is default.

Hive Server Type    Select Hive Server 2.

Authentication Mechanism    Most Hive installations use User Name authentication by default. The authentication mechanism for a Hive data source must match the mechanism in use on the Hive server or the connection fails. Check with your Hadoop system administrator.

Advanced Options    Check Use Native Query and then click OK. Some ODBC Hive drivers work with both HiveQL and SQL query languages. This setting enables the use of native HiveQL instead of SQL.

Defining a Hive data source on Linux
Find general instructions for UNIX systems at Defining ODBC Data Sources.

1. The default location DMX-h uses for the odbc.ini ODBC configuration file is /etc/odbc.ini. If you decide to use a different location and file, set the ODBCSYSINI environment variable to the full path and file name of your file.
2. In the odbc.ini file, add a new Hive data source entry to the [ODBC Data Sources] section. Use the format <data source name>=<driver name>:
[ODBC Data Sources]
Sample Hive DSN 64=Hive ODBC Driver 64-bit

3. Configure the new Hive data source by adding a section similar to the following to the odbc.ini file. Note that sample values are shown. Consult your Hadoop system administrator for guidance on settings appropriate for your environment:
[Sample Hive DSN 64]
Driver=<driver installation directory>/lib/<driver file name>
HiveServerType=2


HOST=<Hive server hostname or IP address>
PORT=10000
UseNativeQuery=1
AuthMech=2

The following table lists valid values:

Odbc.ini option    Description

Driver    Set the location of the installed Hive ODBC driver file. Find the driver file, for example, libhortonworkshiveodbc64.so, under the /lib/ directory of your installed files.

HiveServerType    Set the HiveServerType to 2, for HiveServer2, a newer version of HiveServer with improvements and additional features. Valid values: 1 (default) HiveServer; 2 HiveServer2.

HOST    Set the IP address or hostname of the Hive server.

PORT    Set the listening port for the service. The default port for a DMX-h Hive installation is 10000.

UseNativeQuery    Set the UseNativeQuery value to 1. Some Hive ODBC drivers work with both HiveQL and SQL query languages. Valid values: 0 (default) enables the SQL Connector feature; 1 enables the HiveQL query language and disables the SQL Connector feature.

AuthMech    Set the AuthMech value to the number representing the same authentication mechanism as the Hive server. Most Hive installations use User Name authentication (value 2) by default. Valid values: 0 no authentication; 1 Kerberos; 2 (default) User Name; 3 User Name and Password; 4 User Name and Password (SSL); 5 Windows Azure HDInsight Emulator; 6 Windows Azure HDInsight Service; 7 HTTP; 8 HTTPS.
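Once the data source is defined, connectivity can be checked outside of DMX with the unixODBC isql utility, assuming it is installed; the DSN, user name, and password shown are illustrative:

isql -v "Sample Hive DSN 64" hiveuser hivepassword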

Windows JAR files for JDBC connections
For connecting to Hive via JDBC on Windows at both design and run time, you need to download the jars listed here: the set under the comment regarding "using remote hiveserver in non-kerberos mode" and, if using Kerberos authentication, the set under the comment "using kerberos secure mode". The folder(s) containing these jar files must be in the DriverClassPath in the JDBC configuration file.
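For reference, the corresponding JDBC configuration file entries might look like the following sketch; the class name shown is the standard Apache Hive JDBC driver class, and the jar folder path is illustrative:

DriverName=org.apache.hive.jdbc.HiveDriver
DriverClassPath=C:\HiveJdbcJars\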


Hive table staging
When reading Hive sources and connecting via JDBC, Hive stages the data as a text-backed Hive table.

When writing Hive targets, DMX-h stages the data as a text-backed Hive table when connecting via:

• JDBC
• ODBC method 1

Note:

• In all cases, sufficient space as well as CREATE TABLE privileges are needed to stage the tables.
• For the ODBC cases, DMX-h writes the tables using the hive command, which must be in the path.

Hive table creation security
With Hive version 0.13 and higher, the default security does not allow the user who creates the table to read from or write to the table. To enable reading from and writing to the table without having to modify access permissions after creating the table, do the following:

• In hive-site.xml, add the property hive.security.authorization.createtable.owner.grants and set its value to SELECT and UPDATE, as shown in the snippet below.
• Ensure the user has read/write privileges to Hive data files on the Hadoop file system.
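For reference, the hive-site.xml property described above takes the following shape:

<property>
  <name>hive.security.authorization.createtable.owner.grants</name>
  <value>SELECT,UPDATE</value>
</property>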

Sentry and Ranger authorization
DMX-h is compatible with the following authorization schemes:

• Cloudera Sentry
• Apache Ranger

Cloudera Sentry
DMX-h is certified to work with Cloudera's Sentry authorization of Hive databases, which requires the following to be enabled in the Cloudera cluster:

• HDFS Access Control Lists (ACLs)
• automatic synchronization of HDFS ACLs with Sentry privileges

Note: When using Sentry, Hive impersonation is disabled by default. To ensure access to the Work table directory, the default Hive user must have the correct permissions.

Apache Ranger
DMX-h is compatible with Apache Ranger, a framework for enabling, monitoring, and managing data security across the Hadoop platform. Ranger works with Apache Hadoop (HDFS), Apache Hive, Apache Kafka, and YARN, among other Apache projects.

Note: Ranger is currently designated as an Apache incubator project, and there are gaps in what it works with in the Hadoop ecosystem, such as Apache HCatalog. Additionally, it does not work with Amazon S3 or other cloud-based distributed filesystems.


Apache Impala
Apache Impala is a native analytic database for Apache Hadoop. Through JDBC connectivity, DMX-h supports Impala databases as sources and targets when running on the ETL server/edge node, in the cluster, and on a framework-determined node in the cluster.

Connecting to Impala requires configuration steps before the connections can be defined. Connection requirements and behavior differ between Impala sources and targets.

Maximum length
The maximum post-extraction length that DMX-h supports for an Impala database record is 16,777,216 bytes (16 MB).

Impala connections
Connecting to Impala requires configuration steps before the connections can be defined. Connection requirements and behavior differ between Impala sources and targets.

Impala source connections
Using an Impala JDBC connection, DMX-h can read supported Impala data types from all supported Impala file types: Apache Avro, Apache Parquet, Record Columnar (RCFile), Text, and SequenceFile.

Note: As per Impala limitations, DMX-h can read complex data types, which include structures and arrays, only from Parquet-backed tables in Impala.

JDBC connectivity
When DMX-h reads from an Impala database table on an ETL server/edge node or in the cluster via JDBC, the data is staged temporarily in uncompressed format to a text-backed Impala table.

Impala target connections
Using an Impala JDBC connection, DMX-h can write supported Impala data types to Impala targets directly for optimal performance.

Note: As per Impala limitations, DMX-h can write complex data types to Parquet-backed tables in Impala via a Hive database connection only, not via a JDBC connection.

JDBC connectivity
When DMX-h writes to an Impala database table via JDBC, data is generally loaded directly into target tables. Writes are staged temporarily in compressed or non-compressed format to a text-backed Impala table only when one or more of the following conditions limits direct access:

• A target table has one or more partitions
• A parquet-backed target table has any timestamp columns
• A target table performs Truncate or Apply Change (CDC) dispositions
• The job runs on localnode or singleclusternode
• A user forces DMX-h to stage data by setting the environment variable DMX_IMPALA_TARGET_FORCE_STAGING to 1, which uses the two-step process implemented in previous versions of DMX

Update and Upsert dispositions are supported only for Kudu tables.

At run-time, DMX-h accesses the Kudu jars from /opt/cloudera/parcels/CDH/lib/kudu on the edge/master node for Impala access. You can override this default location by using the environment variable KUDU_HOME. For example, export KUDU_HOME=/opt/cloudera/parcels/CDH/lib/kudu sets the location accessed at run-time to /opt/cloudera/parcels/CDH/lib/kudu.

Impala configuration
Connecting to Impala from DMX-h requires the following configuration components:

• Impala JDBC connection
• Impala table staging
• Apache Sentry authorization when applicable

Impala JDBC connection
To connect to Impala via JDBC on Windows at design time, download the JDBC driver and specify the mandatory driver name and driver class path parameters in the JDBC configuration file:

• Download the applicable Cloudera Impala JDBC Simba-based driver. See Configuring Impala to Work with JDBC.

• Set the driver name and driver class path in the JDBC configuration file, as in the sketch below.
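For example, with the Cloudera Impala JDBC 4.1 driver, the configuration file entries might look like the following sketch; the jar folder path is illustrative:

DriverName=com.cloudera.impala.jdbc41.Driver
DriverClassPath=C:\ImpalaJdbcJars\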

Impala table staging
When reading Impala sources and writing Impala targets, DMX-h stages the data as a text-backed Impala table. To stage the tables, sufficient space and CREATE TABLE privileges are required.

Defining Impala database connections
In the Database Connection dialog, the general pattern to define a connection to an Impala database is as follows:

• At DBMS, select Impala.
• At Access Method, select JDBC.
• At Database, select a previously defined Impala JDBC database connection URL.
• At Authentication, select Auto-detect or Kerberos.
Note: When Kerberos authentication is required, ensure that Kerberos is selected.

Defining Impala sources
For all DMX-h ETL jobs, DMX-h supports Impala database tables as sources and as lookup sources.

At the Source Database Table dialog or at the Lookup Source Database Table dialog, define either an Impala database table source or lookup source respectively:

• At Connection, select a previously defined Impala source connection or select Add new... to add a new connection.
• On the Parameters tab, the following optional parameters are available for Impala database table sources and lookup sources:
o Filter - equivalent to the text that follows a WHERE clause in a SQL query, the filter parameter specifies the condition upon which records are extracted from an Impala source table.
o For partitioned Impala database table sources and lookup sources, you can specify a partition predicate in the WHERE clause, which serves as a filter that enables partition pruning and limits scanning to those portions of the table relevant to partitions.


o Work table directory - serves as the parent-level directory beneath which job-specific subdirectories are created for staging data.
o Work table schema - the schema used to create the staging table.
o Impala configuration properties - any Impala configuration property can be entered manually in the parameters grid.

Defining Impala targets
At the Target Database Table dialog, define an Impala database table target:

1. At Connection, select a previously defined Impala target connection or select Add new... to add a new one. 2. Select a table from the list of Tables, or select Create new... to create a new one. By default, DMX-h creates text-backed Impala database tables; to create an Impala table backed by some other file format, follow the instructions in the Create Database Table dialog help topic, with the following modification:

3. Click View SQL. 4. In the SQL textbox, change STORED AS TEXTFILE to STORED AS , where is the keyword for the applicable file format, such as AVRO or PARQUET. • User defined SQL statement is not supported. • All target disposition methods are supported. • All partition columns must be mapped. 5. On the Parameters tab, the following optional parameters are available for Hive target database tables: • Compute table statistics - To optimize subsequent Impala query performance, DMX-h can run Impala analyze queries that collect target table statistics and target column statistics after the load to the Impala target database. o Valid values include true and false (default). If you specify false or if a parameter value is blank, DMX-h does not run the parameter-specific query after the load to the Impala target database. o When Impala auto-analysis is enabled and DMX-h loads via staging table to the Impala target database, Impala automatically computes table statistics, but not column statistics, and stores the table statistics to the metastore. • Maximum parallel streams - the maximum number of parallel streams that can be established to load data for performance and that are created according to demand. This value can also be specified via the environment variable DMX_IMPALA_MAX_WRITE_THREADS. If specified both ways, the parameter value takes precedence. If neither is specified, the default value is either the number of CPUs on the edge node when running on the ETL server/edge node or is 1 for each instance of DMX-h when running on the cluster. • Work table codec - specifies the compression algorithm used to compress Impala data. • Work table directory - serves as the parent-level directory beneath which job-specific subdirectories are created for staging data. • Work table schema - the schema used to create the staging table. • Impala configuration properties - any Impala configuration property can be entered manually in the parameters grid. 6. Set commit interval and Abort task if any record is rejected are not supported.


Microsoft SQL Server
Your SQL Server client must be installed on the system and configured so that it can connect to databases that you want to access from DMX. On 64-bit Windows, a SQL Server Native Client must also be installed. Please refer to specific SQL Server documentation for details on configuring the client.

Windows Systems
A SQL Server data source needs to be defined for each database that you want DMX to access. The data source should be named the same as the SQL Server database it points to. Choose a SQL Native Client as the DBMS driver on 64-bit Windows. See Defining ODBC Data Sources for details on defining data sources on Windows systems.

Netezza

Installation and Configuration
DMX connects to Netezza databases through the Netezza nzload client utility, which is a component of the Netezza client software package, and the Netezza Open Database Connectivity (ODBC) driver.

For Windows and UNIX systems, the client software package includes the Netezza client interface software and the Netezza ODBC driver.

To establish a connection to the Netezza database, install the Netezza client software package on the system on which the DMX client is installed.

Netezza client software package and driver installation

Windows systems
For Windows systems, the client software installation includes the following:

1. Install the Netezza client software package. For procedures on installing the Netezza client software and ODBC driver, refer to the installation chapter of the IBM Netezza System Administration Guide. The default Netezza client installation is located in the following directory: C:\Program Files (x86)\IBM Netezza Tools

The default Netezza ODBC driver installation is located in the following directory: C:\Program Files (x86)\IBM Netezza ODBC Driver

2. Verify that the ODBC driver libraries, which are dynamic linked libraries with the .dll extension, are installed successfully.
3. Verify that the Netezza ODBC driver installation directory is specified in the PATH.
4. Create and configure the ODBC DSN.
5. Specify the Netezza client utilities directory in the PATH. For example, set the PATH as follows:
set PATH=%PATH%;C:\Program Files (x86)\IBM Netezza Tools\bin

Note: If the Netezza nzds and nzload client utilities are not in the PATH when DMX initiates a load to the Netezza database, DMX does the following:


• nzds - DMX issues a performance warning message and establishes only one connection to the Netezza database.
• nzload - DMX issues an error message and the DMX task aborts.
6. To run the nzds client utility, ensure that the database user account has the Manage Hardware privilege. For additional information on required privileges, see the IBM Netezza System Administration Guide and the Netezza Data Loading Guide.

7. Verify the port number used to connect to the Netezza database. When the NZ_DBMS_PORT environment variable is defined, DMX connects to the Netezza database using the value specified in NZ_DBMS_PORT; otherwise, DMX connects to the Netezza database using the default port number, 5480.

UNIX systems
For UNIX systems, the client software installation includes the following:

1. Install the Netezza client software package. For procedures on installing the Netezza ODBC driver and the client software, refer to the installation chapter of the IBM Netezza System Administration Guide. The default Netezza client installation is located in the following directory: /usr/local/nz/bin

The default Netezza ODBC driver installation is located in the following directory: /usr/local/nz/lib64

2. Create and configure the ODBC DSN.
3. Specify the Netezza client utilities directory in the PATH. For example, export the PATH as follows:
export PATH=$PATH:/usr/local/nz/bin

Note: If the Netezza nzds and nzload client utilities are not in the PATH when DMX initiates a load to the Netezza database, DMX does the following:
• nzds - DMX issues a performance warning message and establishes only one connection to the Netezza database.
• nzload - DMX issues an error message and the DMX task aborts.

4. Set NZ_ODBC_INI_PATH to point to the directory where odbc.ini, without the leading period, ".", is located. For example, set NZ_ODBC_INI_PATH as follows: export NZ_ODBC_INI_PATH=$NZ_ODBC_INI_PATH:<directory containing odbc.ini>

5. To run the nzds client utility, ensure that the database user account has the Manage Hardware privilege.


For additional information on required privileges, see the IBM Netezza System Administration Guide and the Netezza Data Loading Guide.

6. Verify the port number used to connect to the Netezza database. When the NZ_DBMS_PORT environment variable is defined, DMX connects to the Netezza database using the value specified in NZ_DBMS_PORT; otherwise, DMX connects to the Netezza database using the default port number, 5480. NoSQL Databases DMX can connect to any NoSQL database, for example, Apache Cassandra, Apache Hbase, and MongoDB, provided that you install the applicable NoSQL database client software and a compliant ODBC driver or JDBC driver.

DMX requires a Level 3.0 compliant ODBC driver or a Level 4.0 compliant JDBC driver to connect to a NoSQL database. Provided that your ODBC or JDBC driver supports NoSQL databases as sources and targets, DMX supports NoSQL databases as sources and targets.

To verify the level of NoSQL database support that your ODBC or JDBC driver provides, contact your ODBC or JDBC driver vendor.

Installation and configuration
DMX connects to NoSQL databases through the client software applicable to your NoSQL database and through a compliant ODBC or JDBC driver.

Client software installation and configuration
To reference client software download information, links, and installation instructions that are applicable to current NoSQL databases, for example Cassandra, Hbase, and MongoDB, consider the following sites:

• Cassandra: http://cassandra.apache.org/
• Hbase: http://hbase.apache.org/
• MongoDB: http://www.mongodb.org/

To establish a connection to a NoSQL database, install the applicable client software on the system on which DMX is installed.

ODBC driver installation
DMX requires a Level 3.0 compliant ODBC driver to connect to a NoSQL database. For driver installation and configuration information, refer to your ODBC driver documentation.

To reference ODBC driver download information for Simba ODBC drivers, for example, consider the following sites:

• Cassandra: http://www.simba.com/connectors/apache-cassandra-odbc
• Hbase: http://www.simba.com/connectors/apache-hbase-odbc
• MongoDB: http://www.simba.com/connectors/mongodb-odbc

The installation documentation applicable to these sites outlines the steps to create the ODBC DSN and provides links to advanced options specific to the Simba driver for Cassandra, Hbase, and MongoDB.


You can also reference Defining ODBC Data Source Names. While you can use any ODBC driver manager to load ODBC drivers for UNIX systems, by default, DMX uses the shipped unixODBC driver manager.

JDBC driver installation
DMX requires a Level 4.0 compliant JDBC driver to connect to a NoSQL database. For driver installation and configuration information, refer to your JDBC driver documentation.

Oracle
Your Oracle client must be installed on the system and configured so that it can connect to databases that you want to access from DMX. Please refer to specific Oracle documentation for details on configuring the client.

Oracle naming method
Oracle supports multiple naming methods to resolve Connect Identifiers. DMX only supports the Oracle Local Naming Method, which uses aliases defined in the tnsnames.ora configuration file on the Oracle client machine. This file is expected to reside in the <Oracle installation directory>/network/admin directory, where <Oracle installation directory> denotes the directory where Oracle is installed, or in the directory pointed to by the TNS_ADMIN environment variable. The Task Editor always reads the list of available databases from tnsnames.ora to automatically populate the list of databases in the Database Connection dialog. The file has to be formatted according to the Oracle documentation on syntax rules for configuration files. Otherwise, DMX may not be able to read the file correctly, resulting in an empty or partial database list in the Database Connection dialog. Verify that TNSNAMES is listed as one of the values of the NAMES.DIRECTORY_PATH parameter in the Oracle Net profile sqlnet.ora. The TNSNAMES field indicates that local naming is enabled.
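For reference, a minimal tnsnames.ora alias entry has the following shape; the alias, host, port, and service name are illustrative:

ORCL =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = dbhost.example.com)(PORT = 1521))
    (CONNECT_DATA = (SERVICE_NAME = orcl.example.com))
  )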

If TNSNAMES is not listed as one of the values of the NAMES.DIRECTORY_PATH parameter in sqlnet.ora, run Oracle Net Configuration Assistant or Oracle Net Manager to add the local naming method and the Oracle databases you want DMX to connect to. The configuration utility updates the Oracle Net profile, sqlnet.ora, located in <Oracle install directory>/network/admin.

Windows systems
To access Oracle databases, Oracle client software must be accessible via the dynamic link libraries (DLLs) located under the <Oracle install directory>\bin folder. The actual location of the Oracle installation is usually stored in the ORACLE_HOME environment variable.

If you have installed the 64-bit version of DMX on 64-bit Windows, there are some important differences with respect to defining a DMX Task and running your application.

UNIX systems
To access Oracle databases, Oracle client software must be accessible via the shared libraries located under the <Oracle install directory>/lib and <Oracle install directory>/network/lib directories. The name of the shared library directory may vary, e.g. lib32 or lib64, depending on the Oracle version.

Snowflake
Snowflake is a cloud data warehouse that separates storage from compute resources in a cloud environment. Through JDBC connectivity, DMX-h supports Snowflake data warehouses as sources and targets.


Snowflake connection requirements
Snowflake requires a JDBC connection configuration with the driver name and location for all connections. Before attempting to connect to Snowflake, do the following:

• Install the DMX server on an Amazon Elastic Compute Cloud (EC2) instance or on your local machine.
• Specify JDBC and cluster parallelization parameters in the DMX JDBC configuration file. The parameters outlined in the DMX JDBC configuration file, as defined by the DMX_JDBC_INI_FILE environment variable, provide DMX with the mandatory and optional values required to access an Amazon S3 bucket and to invoke a Snowflake COPY/MERGE query.

• If DMX runs inside EC2, attach an IAM role to the EC2 instance with the following conditions:
1. The attached IAM role must grant DMX read and write access to objects in the work bucket specified in the configuration file.
2. Configure the IAM role for Snowflake.
3. If the IAM role configured for Snowflake is not the same role attached to EC2, set the IAMROLE parameter in the configuration file to the IAM role configured for Snowflake.

Note: When DMX cannot get temporary security credentials from an IAM role, DMX issues an error message and the DMX task aborts.

• When DMX runs outside of an EC2 instance, DMX accesses Snowflake using key-based authentication. If no access keys are provided, DMX issues a UNIAMCRE error message and aborts the job.

The parameters outlined in a DMX Snowflake configuration file include the following:

• DriverName - Required JDBC driver name.
• DriverClassPath - Required JDBC class path.
• MAXPARALLELSTREAMS - Optional integer representing the maximum number of parallel streams that can be established for loading data to the staging area. By default, MAXPARALLELSTREAMS is set to the number of CPUs available on the client machine.
• WORKTABLEDIRECTORY - Required path to an S3 bucket or local directory. If the path is an S3 URL, s3://, DMX creates an external staging area. If the path is a local directory, file://, DMX creates an internal staging area using the specified local directory.
• WORKTABLESCHEMA - Optional schema name in which to create the staging table. The default schema for the staging table is the same as that of the target table.
• WORKTABLENCRYPTION - Server-side encryption algorithm for encrypting staging data in the S3 bucket. Valid values are AES256 and aws:kms.
• AWSACCESSKEYID - A 20-character, alphanumeric string that Amazon provides upon establishing an AWS account. If DMX runs in EC2, AWSACCESSKEYID is optional.
• AWSACCESSKEY - The 40-character string, also referred to as the secret access key, which Amazon provides upon establishing an AWS account. If DMX runs in EC2, AWSACCESSKEY is optional.

DMX requires the access key id and the secret access key to send requests to an Amazon S3 bucket.

• IAMROLE - Optional Amazon Resource Name (ARN) for an IAM role that Snowflake uses for authentication and authorization if the same role is not attached to EC2. If EC2 and Snowflake share the same role, this parameter is not required.


• LoadViaPut - Optional character. If WORKTABLEDIRECTORY is not set, DMX uses a PUT command to load data when LoadViaPut is set to "y". If the work table directory is not provided and the LoadViaPut parameter is not set to "y", the DMX job aborts with an error message.
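For illustration, a section of the DMX JDBC configuration file for Snowflake might look like the following sketch. The section name, jar path, bucket, and schema are illustrative assumptions; the driver class name shown is that of the Snowflake JDBC driver:

[Snowflake]
# Required JDBC driver class name
DriverName=net.snowflake.client.jdbc.SnowflakeDriver
# Required class path to the Snowflake JDBC driver jar (illustrative location)
DriverClassPath=/opt/jdbc/snowflake-jdbc.jar
# S3 work area; an s3:// path creates an external staging area
WORKTABLEDIRECTORY=s3://my-work-bucket/dmx
# Schema for the staging table (illustrative)
WORKTABLESCHEMA=STAGING
# Cap on parallel load streams
MAXPARALLELSTREAMS=4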

Defining Snowflake database connections
In the Database Connection dialog, define a connection to a Snowflake database as follows:

• At DBMS, select Snowflake.
• At Access Method, select JDBC.
• At Database, select a previously defined Snowflake JDBC database connection URL.
• At Authentication, select Auto-detect.
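For reference, a Snowflake JDBC database connection URL generally follows the pattern below; the account, database, warehouse, and schema names are placeholders:

jdbc:snowflake://myaccount.snowflakecomputing.com/?db=MYDB&warehouse=MYWH&schema=PUBLIC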

Snowflake target connections
Using a Snowflake JDBC connection, DMX-h can write supported Snowflake data types to Snowflake targets directly for optimal performance.

Defining Snowflake targets
At the Target Database Table dialog, define a Snowflake database table target:

1. At Connection, select a previously defined Snowflake target connection or select Add new... to add a new one.
2. Select a table from the list of Tables, or select Create new... to create a new one.
o User defined SQL statement is not supported.
o All target disposition methods are supported.
3. On the Parameters tab, the following optional parameters are available for Snowflake target database tables. Values specified here take precedence over their corresponding property in the JDBC configuration file, if any.
o Maximum parallel streams - the maximum number of parallel streams that can be established to load data for performance; streams are created according to demand.
o Work directory connection - name of the Amazon S3 connection that DMX uses to connect to and write to the work table directory.
o Work table codec - specifies the compression algorithm used to compress Snowflake data.
o Work table directory - serves as the parent-level directory beneath which job-specific subdirectories are created for staging data.
o Work table encryption - server-side encryption algorithm to encrypt the staging data.
o Work table schema - the schema used to create the staging table.
4. The Set commit interval and Abort task if any record is rejected options are not supported.

Snowflake source connections
Using a Snowflake JDBC connection, DMX can read supported Snowflake data types from any Snowflake table.

Defining Snowflake sources
For all DMX-h ETL jobs, DMX-h supports Snowflake database tables as sources and as lookup sources. At the Source Database Table dialog or at the Lookup Source Database Table dialog, define a Snowflake database table source or lookup source, respectively:


• At Connection, select a previously defined Snowflake source connection or select Add new... to add a new connection.

Sybase
Your Sybase client and Open Client Library must be installed on the system and configured so that they can connect to the databases that you want to access from DMX. Refer to the Sybase documentation for details on configuring the client.

Windows Systems
To access Sybase databases, Sybase client software and the Open Client Library must be accessible via the dynamic link libraries (DLLs) located in the installation directory. You can configure the client by using the dsedit utility, provided with the Sybase installation, to define database connections in the sql.ini file.

UNIX Systems
To access Sybase databases, Sybase client software and the Open Client Library must be accessible via the shared libraries located in the Sybase installation directory. You can make the client libraries accessible by running the scripts provided with Sybase, such as <Sybase install directory>/SYBASE.sh, where <Sybase install directory> denotes the directory where Sybase is installed. You can configure the client by using the dsedit utility, provided with the Sybase installation, to define database connections in the interfaces file.

Teradata
In order to define a task that uses a Teradata table, the DMX Task Editor needs to access the Teradata database from the system where the DMX Task Editor is run. This requires the Teradata Call-Level Interface Version 2 for Network-Attached Systems (CLIv2), which is a Teradata Tools and Utilities product, to be installed and configured on that system.

To access the Teradata database at run-time, Teradata FastExport, Teradata Parallel Transporter (TPT), Teradata FastLoad, Teradata MultiLoad, and Teradata Parallel Data Pump, which are Teradata Tools and Utilities products, must be installed and configured on the system where DMX jobs are run.

Installation and configuration
Teradata client software
For Windows and UNIX systems, the client software installation includes the following:

1. On the system where the DMX Task Editor runs, install and configure the Teradata Utility Pack, which includes CLIv2 and the Teradata ODBC driver.
2. On the system where the DMX Job Editor runs, install and configure the Teradata extract and load utilities.

For Windows systems, the default, base installation directory is as follows:
C:\Program Files (x86)\Teradata\Client\

For UNIX systems, the default, base installation directory is as follows:
/opt/teradata/client/


For installation instructions and for information on the subdirectories under which the software components are installed, see the Teradata Tools and Utilities Installation Guide for Windows or UNIX that corresponds to your Teradata client software version.

Note:

• The Teradata installer adds all required directories to the PATH.
• When connecting through CLIv2 using the TTU access method, you do not have to create and configure an ODBC data source.

Vertica
The primary way to connect to Vertica databases is via ODBC. With Vertica version 7 or later, DMX establishes parallel connections using Vertica COPY LOCAL, providing optimal performance, ease-of-use, and dynamic tuning.

With older versions of Vertica, there are cases when the DMX Vertica Load Example Files method may perform better than the ODBC method, as described at the end of this topic in "Choosing a Method."

Connecting via ODBC
When connecting to Vertica databases via ODBC, DMX uses different load methods based on the Vertica version, as shown in the following overview table:

Vertica Version     Load Method
7 or later          Multi-stream Vertica COPY LOCAL via ODBC on both Windows and Linux
6 or later          Linux: Multi-stream Vertica COPY LOCAL via ODBC; Windows: Multi-stream SQL INSERT via ODBC
Earlier than 6      Single-stream SQL INSERT via ODBC

Vertica Version 6 or Later
When connecting to Vertica via ODBC, on Linux as of Vertica version 6, and on Windows as of Vertica version 7, DMX uses Vertica COPY LOCAL to load data, which provides the best possible load performance. If running Vertica 6 on Windows, DMX uses multi-stream SQL INSERT.

Vertica Earlier than Version 6
When connecting to Vertica via ODBC prior to version 6, the Vertica ODBC client driver method loads data using a SQL INSERT statement to a single Vertica initiator node.

Configuring ODBC for Vertica
The Vertica configuration file, vertica.ini, is used by Vertica to determine the absolute path to the file containing the ODBC installer library and the absolute path to the directory containing the Vertica client driver's error message files. The path to vertica.ini is set through the Vertica configuration file environment variable, VERTICAINI.

Note: vertica.ini is different from the DMX node loading configuration file, DMXVertica.ini, which is specified through DMX_VERTICA_INI_FILE.


To configure the Vertica ODBC driver:

1. Follow the instructions for defining ODBC data sources on Windows and UNIX systems. For Vertica ODBC client driver v5.1 or later on UNIX/Linux platforms, specify the following DSN parameters in vertica.ini and set the environment variable VERTICAINI to point to the location of the vertica.ini file:
[Driver]
ODBCInstLib=<DMX install directory>/lib/libodbcinstSSL.so
ErrorMessagesPath=<directory containing the error message files>

where <DMX install directory> is the directory where DMX is installed. The error message files are generally stored in the same directory as the Vertica ODBC driver files.

2. When using the unixODBC driver manager, override the standard threading settings in the ODBC section of odbcinst.ini as follows:
[ODBC]
Threading = 1
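As a sketch, a Vertica data source definition in odbc.ini on Linux might look like the following; the DSN name, host, and database are placeholders, and the driver path reflects a typical Vertica client installation:

[VerticaDSN]
Description = Vertica database (example)
Driver = /opt/vertica/lib64/libverticaodbc.so
Servername = vertica-host.example.com
Database = mydb
Port = 5433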

For additional details, see the Vertica documentation for your version of Vertica.

Other DBMSs
ODBC
Some sources and targets other than the databases mentioned above (e.g. Microsoft Access databases) may be accessed via ODBC, by defining the appropriate ODBC data sources. See Defining ODBC Data Sources for details on defining data sources on Windows and UNIX systems.

JDBC
To access a database management system (DBMS) that is not explicitly supported by DMX, the Java Database Connectivity API (JDBC) can be used when a JDBC driver is provided by the database vendor. The JDBC driver establishes the connection to the database and implements the protocol for transferring queries and results between a client and the database.

Connecting to JDBC sources and targets requires that you define a JDBC configuration file, set DMX and Java environment variables, and specify the database connection URL, which the DBMS JDBC driver uses to connect to a database through the Database Connection dialog.

Connection Overview
Through the DMX_JDBC_INI_FILE environment variable, DMX gains access to the JDBC configuration file, which you define. As per the JDBC driver properties outlined in the JDBC configuration file, DMX determines the JDBC driver class name and the Java class path to the class and dependent classes; establishes the connection with the DBMS; and connects to the source or target database, which is specified in the database connection URL.

JDBC Configuration File
Outlined within the JDBC configuration file are the JDBC driver class name and Java class path for locating the driver class and dependent classes for each DBMS. A separate section in the JDBC configuration file is required for each DBMS.


Format Requirements
The JDBC configuration file is organized in sections. Consider the following format requirements:

• A section header marks the beginning of each section and is specified by a string enclosed in square brackets ([]). The enclosed string specifies the name or alias of the DBMS. To establish a connection to the database in the DBMS, the DBMS name, which is enclosed within brackets ([]) in the section header of the JDBC configuration file, must match the DBMS name specified in the second parameter of the database connection URL.
• Within each section, name-value pairs describe the properties of the JDBC driver. Unless otherwise stated, parameter names are case-insensitive and parameter values are case-sensitive. Each line can contain a maximum of one parameter description, where the parameter value is separated from the parameter name by an equal sign ("="). Extra spaces before and after the equal sign are ignored.

Consider the following parameters for accessing a DBMS through JDBC:

• DriverName - Mandatory - This parameter identifies the JDBC driver class name or Java class.
• DriverClassPath - Mandatory - This parameter identifies the Java class path that points to the JDBC driver. Use a semi-colon (;) to separate different entries in the path.
• SelectStatement and InsertStatement - Optional - If the query language used by the DBMS does not follow the SQL92 standard, these parameters enable you to provide custom queries for select and insert operations. When you provide these parameters, DMX uses the statement templates to create the appropriate select and insert statements. If either the SelectStatement or InsertStatement parameter is not defined, standard SQL is used for the corresponding read/write operation. If the right side of the equal sign is blank for the SelectStatement or InsertStatement parameter, the corresponding read/write operation is not supported. You can use the following placeholders in the statement templates:
o <columns> - location where the actual comma-separated columns should be placed.
o <table> - location where the actual table name should be placed.
• IsSchemaSupported - Optional - This parameter ensures that DMX correctly identifies all specified database tables. Through JDBC calls, DMX can generally determine whether a DBMS supports a schema; however, certain DBMSs, such as Hive, do not return the values that are expected from certain JDBC calls. Under these circumstances, you can set the IsSchemaSupported parameter to ensure that all specified database tables are identified correctly. The values for this parameter can be either true or false; as an exception to the general rule, IsSchemaSupported parameter values are case-insensitive. For information on connecting to Hive through ODBC, see Connecting to Hive data warehouses.

• The character '#' marks the start of a comment, which continues until the end of the line. Comments are permitted anywhere within a JDBC configuration file.
• Empty lines are permitted anywhere within a JDBC configuration file.

DMX and JAVA Environment Variables
DMX_JDBC_INI_FILE
To provide DMX with access to the JDBC configuration file, you must set the DMX environment variable, DMX_JDBC_INI_FILE, to point to the full path of the JDBC configuration file. The full path includes the directory location and the JDBC configuration file name.

Consider the following examples of setting the DMX_JDBC_INI_FILE environment variable:
On Windows:
set DMX_JDBC_INI_FILE=C:\Program Files\DMExpress\Programs\DMXJdbcConfig.ini


On UNIX:
export DMX_JDBC_INI_FILE=/usr/dmexpress/etc/DMXJdbcConfig.ini

JAVA_HOME
After you install the Java Runtime Environment (JRE), you must set the Java environment variable, JAVA_HOME, to point to the JRE installation directory. The bit level (32 or 64) of the installed JRE must match the bit level of the DMX release that you are running.

Consider the following examples of setting the JAVA_HOME environment variable:
On Windows:
set JAVA_HOME=C:\Program Files (x86)\Java\jdk1.7.0_51

On UNIX:
export JAVA_HOME=/usr/java/jdk1.6.0_24

Database Connection URL
To connect to a database using the JDBC access method, you must specify the database connection URL as the database specification in the Database Connection dialog.

MySQL Example
At a minimum, each database connection URL, which the JDBC driver uses to connect to the JDBC source or target, consists of jdbc, which is the required first parameter; the DBMS name; the database host name; the database name; and any additional connection property specification.

To access database db1 in the MySQL DBMS installed on the local computer, consider the following valid database URL:
jdbc:mysql://localhost/db1
where

jdbc - required first parameter to connect to a JDBC source or target.

mysql - DBMS name. This DBMS name must match the DBMS name specified within brackets ([]) in the section header of the JDBC configuration file.

//localhost/db1 - host and database identification string that identifies the db1 database in the local MySQL DBMS installation.
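To tie the URL to the configuration file, a matching JDBC configuration file section might look like the following sketch; the jar location is an illustrative assumption, and the driver class shown is the MySQL Connector/J class (newer Connector/J versions use com.mysql.cj.jdbc.Driver):

[mysql]
# JDBC driver class name for MySQL Connector/J
DriverName=com.mysql.jdbc.Driver
# Class path to the Connector/J jar and any dependent jars
DriverClassPath=/usr/share/java/mysql-connector-java.jar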

For additional information, see MySQL Driver and Data Source Class Names, URL Syntax, and Configuration Properties for Connector/J.

Defining ODBC Data Sources
A data source needs to be defined for each database that you want DMX to access through ODBC. The data source name on the client, where DMX tasks or jobs are defined, has to be the same as the data source name on the server where DMX tasks or jobs run.

Windows Systems
You can define an ODBC data source through the ODBC Data Source Administrator as follows:

• From the Start menu, select Settings, Control Panel, Administrative Tools, Data Sources (ODBC).


• In the ODBC Data Source Administrator dialog, choose the User DSN or System DSN tab, and click the Add button. On 64-bit Windows, select the User DSN tab.
• In the Create New Data Source dialog, select the appropriate DBMS driver from the list, e.g. SQL Server, Microsoft Access Driver (*.mdb), etc. Then, press the Finish button.
• The setup wizard guides you with further driver-specific instructions.

UNIX Systems
On UNIX systems, you may choose to use the ODBC driver manager, unixODBC, that is shipped with DMX, or you may use your own driver manager.

DMX Default Driver Manager
The DMX install and databaseSetup programs assist you in creating unixODBC data sources.

Alternatively, you can define ODBC data sources manually. The ODBC data source manager provides support for ODBC data sources through two configuration files, odbcinst.ini and odbc.ini, that are located in the directory <DMX install directory>/etc. This directory also contains templates and examples of the configuration files. To change the location of the configuration files, set and export the ODBCSYSINI environment variable to point to the new directory where both files reside.

The files need to be set up appropriately before you can access databases via ODBC. This is a one-time configuration step, similar to defining system data sources on Windows.

• <DMX install directory>/etc/odbcinst.ini: This file contains DBMS-specific and system-specific driver definitions. Configuring this file corresponds to selecting the DBMS driver while adding a data source on a Windows system.
• <DMX install directory>/etc/odbc.ini: This file contains DBMS-specific data source definitions, based on the drivers defined previously in the odbcinst.ini file. Configuring this file corresponds to following DBMS driver-specific instructions while adding a data source on a Windows system.

If you wish to remove a data source, delete the section that corresponds to that data source from the odbc.ini file. A section starts with the data source name enclosed by [], and ends at the beginning of the next section or at the end of the file.

64 Bit ODBC
The ODBC headers and libraries that are shipped with the Microsoft Data Access Components (MDAC) 2.7 Software Development Kit (SDK) have changed from earlier versions of ODBC to support 64-bit platforms. Since the ODBC driver for a specific DBMS and the unixODBC libraries are built separately, there may be an incompatibility in the definition of the SQLLEN variable, which was specifically introduced for ODBC access on 64-bit UNIX platforms. On 64-bit UNIX platforms, DMX assumes that the ODBC driver is 64-bit compliant and defaults the value of the SQLLEN variable to 8 bytes. You can override this default, that is, the DMXSQLLEN value corresponding to the specific DBMS driver, in the odbcinst.ini file.

Use Other ODBC Driver Manager
By default, DMX uses the shipped unixODBC driver manager to load all ODBC drivers. Some ODBC drivers, such as the Teradata ODBC driver, may not work with it. You can tell DMX to use a different ODBC driver manager by specifying the option DMXODBCDRIVERMANAGER=No under the driver section in the odbcinst.ini file. You need to make sure that your ODBC driver manager library path (e.g. /usr/odbc/lib for Teradata V12) is in the system library path (e.g. LD_LIBRARY_PATH on Linux) so that it is loaded first by DMX. In addition, you may need to export the ODBCINI environment variable with the absolute path to the file odbc.ini (e.g. export ODBCINI=<DMX install directory>/etc/odbc.ini). Refer to your DBMS documentation for details on this requirement.
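As an illustration of the DMXSQLLEN and DMXODBCDRIVERMANAGER options just described, a driver section in odbcinst.ini might look like the following sketch; the section name and driver path are placeholders for your own ODBC driver installation:

[Teradata]
# Path to the DBMS vendor's ODBC driver library (illustrative)
Driver=/usr/odbc/drivers/tdata.so
# Tell DMX not to load this driver through the shipped unixODBC driver manager
DMXODBCDRIVERMANAGER=No
# Override the assumed 8-byte SQLLEN only if your driver is not 64-bit compliant
DMXSQLLEN=4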


Connecting to Message Queues from DMX
DMX can access message queues as sources or targets when the appropriate message queue client software is installed on the system and accessible. The configuration steps needed to access a specific message queue are described in the following sections.

To connect to a message queue via a Data Connector, follow the installation instructions that accompany the connector.

IBM WebSphere MQ
To create a connection to an IBM WebSphere queue manager, you specify the queue manager name in the Message Queue Connection dialog, and provide a channel definition that includes:

• the channel name,
• the transport type, and
• the connection name, with an optional port number.

Port number
DMX assumes a default port number of 1414 for the port where the server's listener is expecting client communication. You can change the port number by specifying it in parentheses following the connection name. For example, 192.168.2.100(1415) or server-machine.com(1415).

The channel definition may be specified by either:

1. Defining the MQSERVER environment variable, or
2. Using a DMX WebSphere queue manager configuration file.

The MQSERVER environment variable
You can define a channel via the MQSERVER environment variable as defined by IBM.

Example
On Windows:
SET MQSERVER=CHANNEL1/TCP/MQSERVER01

or, to change the default port number:
SET MQSERVER=CHANNEL1/TCP/MQSERVER01(1418)

On Unix:
export MQSERVER=CHANNEL1/TCP/'MQSERVER01'

or, to change the default port number:
export MQSERVER=CHANNEL1/TCP/'MQSERVER01(1418)'

Queue manager configuration file
You can also create channel definitions for one or more queue managers in a configuration file, and provide the fully qualified file name to DMX in the DMX_CONNECTOR_ENV_MQ_WS_INI_FILE environment variable. This populates the Queue manager combo box of the Message Queue Connection dialog with the defined queue managers or their aliases.

The contents of the file must be formatted as follows:


• Anything following a "#" character until the end of the line is a comment. Comments are allowed anywhere.
• Empty lines are allowed anywhere.
• The file is organized in sections. The beginning of each section (the section header) is specified by a string enclosed in square brackets. The enclosed string may be a queue manager name or a queue manager alias.
• The section headers must be unique.
• The lines between section headers contain the channel definition parameters for that particular queue manager or alias. There are 4 supported parameters: queuemanager, channel, transport, and connectionname. The parameter values are separated from the parameter name by an "=" character. The parameter names are case-insensitive, but their values are case-sensitive, except for the transport parameter. Each line may contain at most one parameter definition.
• The queuemanager parameter is used for cases where the section name is not a queue manager name, but a queue manager alias. This allows potential configuration of different channel definition options for the same queue manager. If there is a configuration section with just the name but no parameters, the MQSERVER environment variable definition is used at connection time. This saves you from typing in the queue manager's name in the GUI, as the name appears in the list of known queue managers.
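For example, a hypothetical alias section that defines an alternate parameter set for the same queue manager might look like this; the alias, channel, and host names are placeholders:

[prod.alias]
# The alias points at the real queue manager via the queuemanager parameter
queuemanager = my.local.queue.manager
channel = prod.clients
transport = tcp
connectionname = mq-prod.example.com(1415)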

The connection parameters defined in this file override the MQSERVER environment variable only if all 3 parameters (channel, transport, and connectionname) are defined for a particular queue manager. If you would like to define several different parameter sets for the same queue manager, use an alias for the section name and override the queue manager name by defining the queuemanager parameter inside the parameters section for that alias.

Sample configuration file
A sample configuration file (DMXWebSphereConnector.ini) is installed in the directory:

• On Windows: <DMX install directory>\Examples\WebSphereConnector\DMXWebSphereConnector.ini
• On Unix: <DMX install directory>/etc/DMXWebSphereConnector.ini

where <DMX install directory> is the directory where DMX is installed.

Example
Define the DMX_CONNECTOR_ENV_MQ_WS_INI_FILE environment variable:
SET DMX_CONNECTOR_ENV_MQ_WS_INI_FILE=C:\tmp\DMXWSConfig.ini

Create the DMXWSConfig.ini file at the above location, with the following content:
[my.local.queue.manager]
Channel = all.clients
Transport = tcp
Connectionname = mw-server.com

Connecting to Salesforce from DMX
In order for DMX to connect to Salesforce.com, the SSL client certificate must be up-to-date. The DMX installation includes the file cacert.pem in the <DMX install directory>/CACertificates directory. This is a plain text file containing a set of public keys used for SSL authentication when connecting to Salesforce.com.


If this file goes out-of-date, an HTTPSCVF error is issued when attempting to connect to Salesforce.com. If that happens, go to http://curl.haxx.se/ca/cacert.pem, save the file as cacert.pem to the <DMX install directory>/CACertificates directory, and retry the Salesforce.com connection.

Connecting to SAP from DMX
In order for DMX to access data in an SAP system, SAP's NetWeaver client libraries NW RFC SDK 7.10 with patch level 2 or higher must be on the system and accessible via the appropriate shared library or dynamic link library (DLL) paths. They can be downloaded from SAP's marketplace at http://service.sap.com/swdc. For Windows 64-bit platforms, both the 64-bit and 32-bit NetWeaver client libraries are required.

The following environment variable must be set to include the path to the NetWeaver client libraries, for example, /nwrfcsdk/lib, and exported on the corresponding platform:

Windows     PATH
AIX         LIBPATH
HP-UX       SHLIB_PATH
Linux       LD_LIBRARY_PATH
Solaris     LD_LIBRARY_PATH

On UNIX systems, the variable needs to be set and exported prior to starting the DMX Run-time Service or running DMX tasks or jobs.

The SAP NetWeaver client libraries depend on the corresponding C/C++ libraries that they were built with. The path to the C/C++ libraries must also be included in the library search path.

Windows: Microsoft C Runtime DLLs version 8.0 need to be installed. Refer to SAP Note 684106 at https://service.sap.com/sap/support/notes/684106. The vcredist_x86 package needs to be installed on all Windows platforms. In addition, the vcredist_IA64 and vcredist_x64 packages need to be installed on Windows IA64 and Windows x64 platforms, respectively.
AIX: AIX C++ library libC.a, usually found in /usr/lib.
HP-UX IA64: HP C++ library libCsup.so.1, usually found in /usr/lib/hpux64.
Linux: C library version 2.3.4 or higher, libstdc++.so.6, usually found in /lib/tls and /usr/lib. Refer to SAP Note 1021236 at https://service.sap.com/sap/support/notes/1021236.
Solaris: Sun C++ libraries libCstd.so.1 and libCrun.so.1 for SunOS 5.10 or higher, usually found in /usr/lib/sparcv9 or /usr/lib/64.

If lower versions of these libraries are also on the system, then the path to the libraries of the required version must be before the older versions in the library search path.

On all systems, the path to the DMX library must be before the path to the SAP NetWeaver client libraries in the library search path.

The SAP client library must be configured so that it can connect to SAP systems that you want to access from DMX. For example, you can configure the client by defining SAP system aliases in the sapnwrfc.ini file. Please refer to specific SAP documentation for details on configuring the client. Once configured, the directory where the file is located must be set in the environment variable RFC_INI.

The DMX install program assists you with verifying connections to SAP systems.


On UNIX systems, if you wish to configure and/or verify SAP connections any time after the installation procedure, run the SAPSetup program as follows:
cd <DMX install directory>
./SAPSetup

Registering DMX in SAP SLD
Per SAP recommendation, each DMX server should be registered in your SAP SLD (System Landscape Directory). Refer to the topic "Registration of DMExpress Components in the SAP System Landscape Directory" in the DMX help.

Connecting to HDFS from DMX
In order for DMX to access data located in HDFS, a Hadoop distribution must be installed and configured as follows on the system where the DMX jobs and tasks are executed:

• The hadoop command must be accessible to DMX:

o DMX first looks for the hadoop command in $HADOOP_HOME/bin/hadoop, where the environment variable HADOOP_HOME is set to the directory where Hadoop is installed. Defining environment variables can be done through the Environment Variables tab of the DMX Server dialog.
o If HADOOP_HOME is not defined or the directory can't be found, DMX looks for the hadoop command in the system path, where it is automatically added by some Hadoop distributions.

• The fs.default.name property in the core-site.xml configuration file must be set to point to the Hadoop file system.
• The HTTP namenode daemon must be running on the default port 50070. If you would like to use a different port number, please contact Technical Support.
• If the Hadoop cluster requires Kerberos authentication, you need to use the dmxkinit utility to run your HDFS extract/load jobs/tasks.

Connecting to Connect:Direct nodes from DMX
In establishing connectivity to a Connect:Direct node on the mainframe, DMX initiates file transfers from this node to an open-systems Linux server.

Security
The Connect:Direct proprietary security protocol offers security through authentication and user proxies. User authorities and user proxies are set up during Connect:Direct installation and configuration.

Installation and Configuration
For DMX to access data located on a Connect:Direct node, a Connect:Direct server and Connect:Direct client must be installed on the same Linux machine on which DMX jobs and tasks are executed.

• Configure Connect:Direct to access the required Connect:Direct nodes.


For details on configuring Connect:Direct nodes, refer to the IBM Sterling Connect:Direct product documentation.

• Add a Connect:Direct user for each DMX user who accesses Connect:Direct.
• The DMX server must be configured as the Connect:Direct primary node (pnode) to enable sampling with Connect:Direct connections.
• Prior to starting the DMX Run-time Service or to running DMX tasks or jobs, set the following environment variables:
o NDMAPICFG points to the CLI/API configuration file, ndmapi.cfg, for example:
export NDMAPICFG=<Connect:Direct install directory>/ndm/cfg/cliapi/ndmapi.cfg

o PATH points to the Connect:Direct bin directory, for example:
export PATH=$PATH:<Connect:Direct install directory>/ndm/bin

If you plan to start the DMX Run-time Service using sudo, use the -E option to preserve the environment variable settings.

Note: If the :file.open.exit.program parameter in the user.exits section of the parameter initialization configuration file, <Connect:Direct install directory>/ndm/cfg/<node name>/initparm.cfg, contains any path, including the path to SSConnectDirectFileOpenUserExit, remove the full path such that the parameter value is blank:
:file.open.exit.program=:\

Connecting to CyberArk Enterprise Password Vault

DMX connects to CyberArk Enterprise Password Vault over a TLS-secured HTTPS connection and requires access to an up-to-date TLS client certificate. If the CyberArk server secures the DMX connection with a self-signed certificate, update <DMX install directory>/CACertificates/cacert.pem with the public certificate at the same time you update or install the client certificate, where <DMX install directory> is the directory where DMX is installed.
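For example, one way to add a self-signed server certificate to the bundle is to append its PEM file; the certificate path shown here is a placeholder:

cat /tmp/cyberark_server_cert.pem >> <DMX install directory>/CACertificates/cacert.pem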

For DMX jobs run in a Hadoop cluster, update the client certificate and cacert.pem on the edge node only. DMX distributes TLS configurations, keys, and certificates to the cluster nodes.

If a client certificate file is out-of-date, DMX issues an HTTPSCVF error when it attempts to connect to the CyberArk server.

CyberArk Licenses

DMX can only connect to licensed CyberArk vaults. Check the CyberArk license status if you encounter repeated failures to retrieve a CyberArk password.


Connecting to Protegrity Data Security Gateway

DMX connects to Protegrity Data Security Gateway by making REST API POST requests over HTTP. The DMX Protect and Unprotect functions use Protegrity resources to protect and unprotect data sent to Protegrity. You must configure the Protegrity Gateway server to receive and process REST requests before DMX can use the functions. The API endpoint implementation determines the specific protection methods. Some details needed to set up protection are:

• All REST API calls use the POST method.
• Data is always sent as part of the HTTP message body.
• Data is always sent without any encoding change. The Protegrity server must return protected data with the same encoding as the data input.
• DMX does not pass data with empty or NULL values to the Protegrity server.

Connecting to QlikView data eXchange files from QlikView or Qlik Sense
Qlik is the provider of QlikView and Qlik Sense business intelligence and visualization software applications. DMX supports QlikView data eXchange (QVX) files as targets. Through DMX, you define the QVX file and the QlikView data eXchange reformat layout.

QVX files can be used as data sources for QlikView or Qlik Sense.

QlikView desktop installation overview
To access QVX files as sources from within QlikView:

1. Install the QlikView desktop.
2. At the QlikView desktop:
a) Start QlikView Personal Edition.
b) At the File menu, select Open.
c) In the Open dialog, ensure that the file type is All Files (*.*) and browse to the appropriate QVX file.
d) Select the QVX file and select Open.
e) At the File Wizard dialog, ensure that the File type is Qvx and select Finish.
f) At the Edit Script dialog, select Reload to execute the displayed script, which loads the QVX data.
g) At the Fields tab of the Sheet Properties dialog, select the fields to display on the Main QlikView sheet.
h) To save the data in the QlikView document, select Save.

Qlik Sense desktop installation overview
To access QVX files as sources from within Qlik Sense:

1. Install the Qlik Sense desktop. For information on Qlik Sense, see Qlik Sense help.


2. At the Qlik Sense desktop:
a) Start the Qlik Sense desktop.
b) Select Create a New App.
c) In the Create new app dialog, enter the name of the application and select Create.
d) At the New app created dialog, select Open.
e) At the Qlik Sense desktop, select Quick data load.
f) At the Select file dialog, ensure that the file type is QlikView data exchange files (qvx), browse to the appropriate QVX file, and select Select.
g) At the Select data from .qvx dialog, select the appropriate fields to load and select Load data.
h) When the data loads successfully, a new data sheet is created.
i) To edit the data sheet, select Edit the sheet.

Connecting to Tableau Data Extract files from Tableau
Tableau is a business intelligence application that provides browser-based analytics. DMX supports Tableau Data Extract (TDE) files as targets. Through DMX, you define the TDE file and the Tableau Data Extract reformat layout.

TDE files can be used as data sources for Tableau.

Tableau desktop installation overview
To access TDE files as sources from within Tableau:

1. Install the Tableau desktop.
2. At the Tableau desktop:
a) Start Tableau Desktop.
b) Select Connect to Data.
c) In the File section of the Connect to Data page, select Tableau Data Extract.
d) At the Open dialog, browse to and select the Tableau Data Extract file.
e) At the Tableau Data Extract Connection dialog, enter the name of the data connection for use in Tableau.

The data in the Tableau Data Extract file displays within Tableau Desktop.

Removing DMX/DMX-h from Your System
Windows Systems
Perform the following steps to remove DMX from your system:

1. Ensure that the DMX Task Editor, DMX Job Editor, and DMX Server are closed and no DMX jobs are running.

2. Go to Programs, DMExpress from the Start menu and select Uninstall DMX.


3. Alternatively, you can remove DMX as follows: Go to Settings, Control Panel from the Start menu and double-click Add/Remove Programs. In the list of applications that can be removed, select the entry for DMX. Click Add/Remove and confirm.
4. Delete folders if necessary. If you created any of your own files in the folder where you installed DMX, these files are not removed by the uninstall program.

UNIX Systems
Perform the following steps to remove DMX from your system:

1. Ensure that no DMX jobs are running.

2. If you installed the DMX Run-time Service, you need to uninstall it first. Log in as root and run:
cd <DMX install directory>
./install

When prompted, select to uninstall the service.

3. Remove the DMX directory:
cd <DMX install directory>/..
rm -rf <DMX install directory>

4. Remove any environment variable settings that you added to your profile after the DMX installation, e.g. <DMX install directory>/bin in your PATH.

DMX-h in a Hadoop Cluster
The method for removing DMX-h from the nodes of a Hadoop cluster depends on how you originally installed DMX-h in the cluster. Follow the instructions in the appropriate section below.

Cloudera Manager Parcel Uninstall
Uninstall DMX-h on all nodes in the cluster as follows:

1. Ensure that no DMX-h jobs are running.
2. Uninstall the DMX Run-time Service on any edge/cluster node where it is running.
3. Click the parcel indicator button in the Cloudera Manager Admin console navigation bar to bring up the Parcels tab of the Hosts page.
4. In the currently activated dmexpress parcel, click the Actions button and select Deactivate to deactivate the parcel.
5. Once deactivated, click the Actions button and select Remove From Hosts to remove the parcel from the cluster nodes.
6. Once the parcel is removed from the cluster nodes, click the Actions button and select Delete to delete the parcel from the repository.

Apache Ambari Service Uninstall
Follow the instructions for RPM Uninstall, or:

1. Open the Ambari Web UI and navigate to “Hosts”


2. For each host, choose the “Installed” drop-down next to “Clients”.
3. For both “DMX-h” and “DMX-h License” (if present), choose “UNINSTALL.”

Once uninstalled, either via the UI or using RPM, disable the uninstalled services:

1. Open the Ambari Web UI, and navigate to “Services”.
2. For each of “DMX-h” and “DMX-h License” (if present), choose “Service Actions” -> “Delete Service” and follow the prompts.

RPM Uninstall
Uninstall DMX-h on all nodes in the cluster as follows:

1. Ensure that no DMX-h jobs are running.
2. Uninstall the DMX Run-time Service on any edge/cluster node where it is running.

3. Run the following command with sudo or root privileges using the erase option, -e:
Software: rpm -e dmexpress
License: rpm -e dmexpresslicense-<license site ID>
e.g. rpm -e dmexpresslicense-12345

If you do not know your license site ID, run the following command to find the installed license package name:
rpm -qa | grep dmexpresslicense-

You can also use an RPM wrapper such as yum instead:
yum erase dmexpress

yum erase dmexpresslicense-<license site ID>

Manual/Silent Uninstall
Uninstall DMX-h on the edge/ETL node and each remaining node in the cluster as follows:

1. Ensure that no DMX-h jobs are running.
2. Uninstall the DMX Run-time Service on any edge/cluster node where it is running.
3. Remove the DMX home directory on the edge node and all remaining nodes in the cluster:
cd <DMX install directory>/..
rm -rf <DMX install directory>

4. Remove any environment variable modifications made for DMX, such as the addition of <DMX install directory>/bin to your PATH.

Uninstall the DMX Run-time Service
When instructed to uninstall the DMX Run-time Service, run the install script in the DMX installation directory as root, and select the option to uninstall the DMX Run-time Service. For example:
cd /usr/local/DMExpress
./install


DMX installation component options
DMX installation component options include the following:

• Standard
The standard installation enables you to install the following components on one server:

o Development client, Job Editor and Task Editor
o DMX engine, dmxdfnl/dmxjob/dmexpress
o Service for development client, which is the DMX Run-time Service, dmxd
o DataFunnel Run-time Service, dmxrund. See DMX DataFunnel run-time service installation and configuration.
• Full
The full installation enables you to install all DMX components on one server:

o Development client, DMX Job Editor and Task Editor
o DMX engine, dmxdfnl/dmxjob/dmexpress
o Service for development client, which is the DMX Run-time Service, dmxd
o DataFunnel Run-time Service, dmxrund. See DMX DataFunnel run-time service installation and configuration.
o Management Service, which includes dmxmgr, REST APIs, and the Connect Portal user interface (UI). See DMX Management Service installation and configuration.
• Classic
The classic installation enables you to install traditional DMX components on one server:

o Development client, Job Editor and Task Editor
o DMX engine, dmxjob/dmexpress
o Service for development client, which is the DMX Run-time Service, dmxd
• Custom
The custom installation enables you to install individual components on different servers:

o DMX engine
Installs the DMX engine, dmxdfnl/dmxjob/dmexpress.

o Service for development client
Installs the DMX Run-time Service, dmxd.

o DataFunnel Run-time Service
Installs the DataFunnel Run-time Service, dmxrund. See DMX DataFunnel run-time service installation and configuration.

o Development client
Installs the Job Editor and Task Editor.

o Management Service
Installs the management service, dmxmgr, REST APIs, and the Connect Portal UI. See DMX Management Service installation and configuration.


DMX Management Service installation and configuration
Installation
DMX Management Service executable
The DMX Management Service executable, DMXManager, is installed in the following directory:
Windows: <DMX install directory>\Programs
Linux: <DMX install directory>/bin

The DMX management service configuration file
The DMX management service configuration file, dmxmgr.properties, is installed in the following directory:
Windows: <DMX install directory>\Conf
Linux: <DMX install directory>/conf

Configuration
DMX management service configuration file
Many of the properties within dmxmgr.properties are populated with commented, preliminary default values. Consider each of the name-value pairs among the following properties within the file; uncomment and update them to meet your system requirements:

• Server
• Secure socket layer (SSL)
• Authentication
• Central file repository
• Central database repository
• Logging

Configuration properties as environment variables
You can specify the configuration properties defined within dmxmgr.properties as environment variables by capitalizing the property name and replacing the period separator, ".", with an underscore. The configuration property name authentication.method could be specified as a Linux environment variable, for example, as follows:
export AUTHENTICATION_METHOD=LDAP

Server configuration properties
Server configuration properties are defined through the name-value pairs specified in the DMX management service configuration file, dmxmgr.properties.

Consider the following server configuration properties:

• server.address - The address of the embedded Apache Tomcat web server application. Required.
• server.port - The port that the DMX management service listens on for client requests. Default: 8280. If the port number dedicated to listening to client requests is different from 8280, assign the appropriate value.

Secure socket layer configuration properties
Secure socket layer (SSL) configuration properties are defined through the name-value pairs specified in the DMX management service configuration file, dmxmgr.properties.

By default, the DMX management service disables SSL certification. To enable SSL certification, SSL configuration properties must be added to dmxmgr.properties.

Consider the following SSL configuration properties:

• security.require-ssl - Determines whether SSL certification is required. Values: False (default), True. For SSL certification to be enabled, the property value must be set to True.
• server.ssl.client-auth - Determines whether client authentication occurs during the SSL handshake. For client authentication to occur during the SSL handshake, the property value must be set to want.
• server.ssl.key-alias - Alias of the SSL key. Required when SSL certification is enabled.
• server.ssl.key-password - Password of the SSL key. Required when SSL certification is enabled.
• server.ssl.key-store - Location of the DMX central management server keyStore. Required when SSL certification is enabled.
• server.ssl.key-store-password - Password of the DMX central management server keyStore. Required when SSL certification is enabled.
• server.ssl.trust-store - Location of the DMX central management server trustStore. Required when the DMX DataFunnel run-time service uses SSL.
• server.ssl.trust-store-password - Password of the DMX central management server trustStore. Required when the DMX DataFunnel run-time service uses SSL.

Authentication configuration properties
Authentication configuration properties are defined through the name-value pairs specified in the DMX management service configuration file, dmxmgr.properties.


Consider the following authentication configuration properties:

• authentication.method - The authentication method for authenticating users. Values: LDAP (default) and SIMPLE. If you skip the configuration setup during the installation, the installation process automatically assigns the value LDAP to the authentication.method property. When LDAP is the authentication method, you must provide LDAP authentication configuration properties.
• authentication.login.auto_create_users - Specifies whether new users should be created dynamically upon login. Values: true (default) and false. To successfully call REST APIs, valid user credentials on the authentication backend (for example, on the LDAP active directory) must also be registered with the DMX management service. When the property value is set to false, users are not automatically created and registered on the DMX management service even when they are registered on the LDAP active directory. Any attempt by an unregistered user to call the REST API layer of the DMX management service results in a call failure with status code 401/Unauthorized.
• authentication.login.default_role - The default user roles, which the DMX management service requires to operate, are automatically established as part of the DMX management service installation. Values: role_administrator and role_user (defaults). These roles are assigned dynamically as part of the initial login: the first user who successfully logs into the system is granted the user role role_administrator; any subsequent user who successfully logs into the system is granted the user role role_user. While not required, system administrators can create new, custom user roles and assign existing permissions to the new roles. Examples of possible custom user roles include the following: business user, operator, data scientist, data architect, solution engineer, developer.
• authentication.token.signature_secret - The signature secret used to Secure Hash Algorithm (SHA)-sign generated authentication tokens. If a signature secret value is not specified, a random secret is generated at DMX management service start-up time. When generating the cryptographic signature of an authentication token, a portion of the authentication token segment is signed using a SHA message digest. A signature secret is applied to the message digest. The resulting signature secret value, which is applied to the authentication cookie, is encoded as a Base64 string.
• authentication.token.token_validity - The time in seconds for which a generated token is valid. Default: 36000 seconds, which is equivalent to 10 hours.
• authentication.token.cookie_domain - The domain attribute of the authentication token cookie. The cookie domain specifies to the browser that cookies should only be sent back to the DMX management service for the given domain. If the cookie domain is not specified, the cookie is sent back to the domain on the DMX management service from which the object was requested by default. For additional information, see HTTP Cookie Domain and Path.
• authentication.token.cookie_path - The path attribute of the authentication token cookie. The cookie path specifies to the browser that cookies should only be sent back to the DMX management service for the given path. If the cookie path is not specified, the cookie is sent back to the path on the DMX management service from which the object was requested by default. For additional information, see HTTP Cookie Domain and Path.

LDAP authentication configuration properties
• ldap.url - LDAP URL. Required when authentication.method is set to LDAP.
• ldap.active_directory.user_domain - LDAP active directory user domain. Required when authentication.method is set to LDAP.
• ldap.active_directory.root_domain - LDAP active directory root domain. Required when authentication.method is set to LDAP.
• ldap.search.managerDn - Distinguished name (DN) of the manager, which is the user that performs searches when the LDAP server does not support or has not enabled anonymous searches.
• ldap.search.managerPassword - Password of the manager that performs LDAP searches.
• ldap.search.userBaseDn - The search base DN for finding users.
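Putting these together, LDAP authentication settings in dmxmgr.properties might look like the following sketch; all host, domain, and DN values are placeholders:

authentication.method=LDAP
ldap.url=ldap://ldap.example.com:389
ldap.active_directory.user_domain=users.example.com
ldap.active_directory.root_domain=example.com
ldap.search.managerDn=cn=admin,dc=example,dc=com
ldap.search.managerPassword=secret
ldap.search.userBaseDn=ou=people,dc=example,dc=com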

Central file repository configuration properties
Central file repository configuration properties are defined through the name-value pairs specified in the DMX management service configuration file, dmxmgr.properties.

The DMXDFNL root job and its job dependencies, which include subjobs, tasks, and operational metadata, are stored in the DMX central file repository. The DMX central file repository must be configured to reside on a local file system.

Consider the following file repository configuration properties:


Local central file repository configuration properties
• repository.url - Location of the local DMX central file repository. The default location of the central file repository is the home directory on your local client workstation.

History repository configuration properties
• history.repository.location - Required. Location of the job execution history directory, which is relative to the DMX central file repository. Beneath the top-level history directory, individual job run directories are created and organized by date:
~/.dmexpress/history/{YEAR}/{MONTH}/{DAY}/
{}_{[_]_log.{xml|txt}
{}_{[_].json
The job log is generated in XML or text format; the operational metadata log is generated in JSON format.

Central database repository configuration properties
Central database repository configuration properties are defined through the name-value pairs specified in the DMX management service configuration file, dmxmgr.properties.

The DMXDFNL job definition and runtime connection data are stored in the central database repository. The central database repository must be configured to reside on your local client workstation.

Consider the following database repository configuration properties:

• spring.datasource.url - Required. Location of the local DMX central database repository. The default location of the central database repository is beneath the home directory on your local client workstation: ~/.dmexpress/com.syncsort.dmxmgr/
• spring.datasource.username - Required. Name of the database user with access to the database repository.
• spring.datasource.password - Required. Password associated with the user with access to the database.
• spring.datasource.driverClassName - Required. Identifies the JDBC driver class name or Java class.

Logging configuration properties
Logging configuration properties are defined through the name-value pairs specified in the DMX management service configuration file, dmxmgr.properties.

Consider the following logging configuration properties:

• logging.file - The relative or absolute path to and name of the DMX management service log file; for example: logging.file=${java.io.tmpdir:-/tmp}/dmxmgr.log
• logging.level.* - The level of logging detail written to the DMX management service log file that is defined in logging.file. Valid values include the following: ERROR, WARN, INFO, DEBUG, or TRACE. ERROR, WARN, and INFO level messages are logged by default.

DMX DataFunnel run-time service install and configuration
Installation
DMX DataFunnel run-time service executable
The DMX DataFunnel run-time service executable, dmxrund, is installed in the following directory:
Windows: <DMX install directory>\Programs
Linux: <DMX install directory>/bin

DMX DataFunnel Run-time Service configuration file
The DMX DataFunnel Run-time Service configuration file, dmxrund.conf, is installed in the following directory:


Windows: <DMX install directory>\Conf
Linux: <DMX install directory>/conf

Linux only: DMX impersonation executable
The DMX impersonation executable, dmxexecutor.exe, is installed in the following Linux directory:
<DMX install directory>/bin

Linux only: DMX custom impersonation configuration file
The DMX custom impersonation configuration file, dmxexecutor.conf, is located in the following Linux directory:
<DMX install directory>/conf

Configuration
DMX DataFunnel Run-time Service configuration file
Many of the properties within dmxrund.conf are populated with commented, preliminary default values. Uncomment and update applicable properties to meet your system requirements.

DMX DataFunnel Run-time Service configuration properties
DMX DataFunnel Run-time Service configuration properties are defined through the name-value pairs specified in the DMX DataFunnel Run-time Service configuration file, dmxrund.conf.

Consider the following DataFunnel Run-time Service configuration properties:

Property name: SERVER_PORT
Property description: The DMX execution service port that is assigned to listen for job execution requests from the DMX management service, dmxmgr.
Values: 33636 (default)

Property name: DMEXPRESS_HOME
Property description: The directory where DMX is installed.
Values: Required

Property name: UNPACK_WORK_DIRECTORY
Property description: The working directory where jobs are unpacked.
Values: Required

Property name: SECURITY_ENABLED
Property description: Determines whether Secure Sockets Layer (SSL) security is enabled. For SSL certification to be enabled, the property value must be set to Y.
Values: Y (default), N


Property name: SSL_SERVER_PRIVATE_KEY
Property description: The path to the SSL server private key file, which is in PEM format.
Values: Required when SSL certification is enabled

Property name: SSL_SERVER_CERTIFICATE
Property description: The path to the SSL server certificate public key file, which is in PEM format.
Values: Required when SSL certification is enabled

Property name: SSL_CLIENT_AUTHENTICATION_ENABLED
Property description: Determines whether to authenticate the client.
Values: Y (default), N

Property name: SSL_TRUSTED_CERTIFICATES
Property description: The path to the trusted certificates file, which is in PEM format. This file can contain multiple client certificates in PEM format.
Values: Required when SSL certification is enabled
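Taken together, an SSL-enabled dmxrund.conf might resemble the following sketch; every path shown is a placeholder, not an installed default:

  SERVER_PORT=33636
  DMEXPRESS_HOME=/opt/DMExpress
  UNPACK_WORK_DIRECTORY=/var/tmp/dmx/unpack
  SECURITY_ENABLED=Y
  SSL_SERVER_PRIVATE_KEY=/etc/dmx/ssl/server-key.pem
  SSL_SERVER_CERTIFICATE=/etc/dmx/ssl/server-cert.pem
  SSL_CLIENT_AUTHENTICATION_ENABLED=Y
  SSL_TRUSTED_CERTIFICATES=/etc/dmx/ssl/trusted-clients.pem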

Linux only: DMX custom impersonation configuration file
If dmxexecutor was established as the impersonated user during Linux pre-installation, you can optionally update dmxexecutor.conf; update it only to customize the impersonation process.

DMX custom impersonation configuration properties
To customize impersonation, define DMX custom impersonation configuration properties through the name-value pairs specified in the DMX custom impersonation configuration file, dmxexecutor.conf.

Consider the following custom impersonation configuration properties:

Property name: SERVICE_GROUP
Property description: The service group to which the service user belongs.
Values: dmexpress (default)

Property name: MIN_USERID
Property description: The minimum user identification (ID) number, or security access level, that is assigned for impersonation. If the user ID is less than this minimum value, the user is not impersonated and the job run aborts.
Values: 500 (default)


Property name: BANNED_USERS
Property description: dmxexecutor is prevented from impersonating any user listed as banned; all users not listed as banned qualify for impersonation. Multiple banned users in the list must be separated by commas. Upon receipt of a job submission request:
• from a banned user, dmxexecutor rejects the job request, generates an error, and the job aborts.
• from a user not listed as banned, dmxexecutor calls the DMX engine to run the job.

Property name: ALLOWED_USERS
Property description: Users listed as allowed are the only users that dmxexecutor can impersonate; all users not listed as allowed are disqualified from impersonation. Multiple allowed users in the list must be separated by commas. Upon receipt of a job request:
• from an allowed user, dmxexecutor calls the DMX engine to run the job.
• from a user not listed as allowed, dmxexecutor rejects the job request, generates an error, and the job aborts.
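For example, a dmxexecutor.conf sketch that restricts impersonation might look like the following; the user and group names are placeholders, and because combining the banned and allowed lists is not described here, only one list is active in this sketch:

  SERVICE_GROUP=dmexpress
  MIN_USERID=1000
  # Reject impersonation requests from these accounts
  BANNED_USERS=root,daemon
  # Alternatively, permit only a fixed set of accounts:
  # ALLOWED_USERS=etluser1,etluser2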


Technical Support
If you have a maintenance support agreement for DMX and you encounter difficulties in installing or running DMX, contact Syncsort Incorporated.

In the United States (available 24 hours a day, 7 days a week):
Phone: 1-877-700-8270 or 201-930-8270

E-mail: [email protected]

In other countries:
Contact information can be found by country at https://mysupport.syncsort.com/.
