PUBLIC
Document Version: 4.2 Support Package 12 (14.2.12.0) – 2020-02-06

Data Services Supplement for Big Data

© 2020 SAP SE or an SAP affiliate company. All rights reserved.

Content

1 About this supplement
2 Naming Conventions
3 Big data in SAP Data Services
3.1 Apache Cassandra
    Setting ODBC driver configuration on Linux
    Data source properties for Cassandra
3.2 Apache Hadoop
    Hadoop in Data Services
    Hadoop sources and targets
    Prerequisites to Data Services configuration
    Verify Linux setup with common commands
    Hadoop support for the Windows platform
    Configure Hadoop for text data processing
    Setting up HDFS and Hive on Windows
3.3 Apache Impala
    Connecting Impala using the Cloudera ODBC driver
    Creating an Apache Impala datastore and DSN for Cloudera driver
3.4 Connect to HDFS
    HDFS file location objects
    HDFS file format objects
3.5 Connect to Hive
    Hive adapter datastores
    Hive database datastores
3.6 Upload data to HDFS in the cloud
3.7 Google Cloud Dataproc
    Configure driver and data source name (DSN)
    Hive database datastore for Google Dataproc
    Create a WebHDFS file location
3.8 HP Vertica
    Enable MIT Kerberos for HP Vertica SSL protocol
    Creating a DSN for HP Vertica with Kerberos SSL
    Creating HP Vertica datastore with SSL encryption
    Increasing loading speed for HP Vertica
    HP Vertica data type conversion
    HP Vertica table source
    HP Vertica target table configuration
3.9 MongoDB
    MongoDB metadata
    MongoDB as a source
    MongoDB as a target
    MongoDB template documents
    Preview MongoDB document data
    Parallel Scan
    Reimport schemas
    Searching for MongoDB documents in the repository
3.10 PostgreSQL
    Datastore options for PostgreSQL
    Configure the PostgreSQL ODBC driver
    Import PostgreSQL metadata
    PostgreSQL source, target, and template tables
    PostgreSQL data type conversions
3.11 SAP HANA
    Cryptographic libraries and global.ini settings
    Bulk loading in SAP HANA
    Creating stored procedures in SAP HANA
    SAP HANA database datastores
    Datatype conversion for SAP HANA
    Using spatial data with SAP HANA
3.12 About SAP Vora datastore
    SAP Vora datastore
    Configuring DSN for SAP Vora on Windows
    Configuring DSN for SAP Vora on Unix and Linux
    SAP Vora table source options
    SAP Vora target table options
    SAP Vora data type conversions
3.13 Data Services Connection Manager (Unix)
4 Cloud computing services
4.1 Cloud databases
    Amazon Redshift database
    Azure SQL database
    Google BigQuery
    Snowflake
4.2 Cloud storages
    Amazon S3
    Azure blob storage
    Azure Data Lake Store protocol options
    Google cloud storage

1 About this supplement

This supplement contains information about the big data products that SAP Data Services supports.

The supplement contains information about the following:

● Supported big data products
● Supported cloud computing technologies, including cloud databases and cloud storages

Find basic information in the Reference Guide, Designer Guide, and other applicable supplement guides. For example, to learn about datastores and creating datastores, see the Reference Guide. To learn about Google BigQuery, refer to the Supplement for Google BigQuery.

2 Naming Conventions

We refer to certain systems with shortened names, and we use specific environment variables when we refer to the locations of SAP and SAP Data Services files.

Shortened names

● The terms “Data Services system” and “SAP Data Services” mean the same thing.
● The term “BI platform” refers to “SAP BusinessObjects Business Intelligence platform.”
● The term “IPS” refers to “SAP BusinessObjects Information platform services.”

 Note

Data Services requires BI platform components. However, IPS, a scaled back version of BI, also provides these components.

● CMC refers to the Central Management Console provided by the BI or IPS platform.
● CMS refers to the Central Management Server provided by the BI or IPS platform.

Variables

INSTALL_DIR

The installation directory for the SAP software.

Default location:

● For Windows: C:\Program Files (x86)\SAP BusinessObjects
● For UNIX: $HOME/sap businessobjects

 Note

INSTALL_DIR is not an environment variable. The installation location of SAP software may be different than what we list for INSTALL_DIR, based on the location that your administrator set during installation.

BIP_INSTALL_DIR

The root directory of the BI or IPS platform.

Default location:

● For Windows: <INSTALL_DIR>\SAP BusinessObjects Enterprise XI 4.0

 Example

C:\Program Files (x86)\SAP BusinessObjects\SAP BusinessObjects Enterprise XI 4.0

● For UNIX: <INSTALL_DIR>/enterprise_xi40

 Note

These paths are the same for both BI and IPS.

LINK_DIR

The root directory of the Data Services system.

Default location:

● All platforms: <INSTALL_DIR>\Data Services

 Example

C:\Program Files (x86)\SAP BusinessObjects\Data Services

DS_COMMON_DIR

The common configuration directory for the Data Services system.

Default location:

● If your system is on Windows (Vista and newer): <ALLUSERSPROFILE>\SAP BusinessObjects\Data Services

 Note

The default value of the <ALLUSERSPROFILE> environment variable for Windows Vista and newer is C:\ProgramData.

 Example

C:\ProgramData\SAP BusinessObjects\Data Services

● If your system is on Windows (older versions such as XP): <ALLUSERSPROFILE>\Application Data\SAP BusinessObjects\Data Services

 Note

The default value of the <ALLUSERSPROFILE> environment variable for older Windows versions is C:\Documents and Settings\All Users.

 Example

C:\Documents and Settings\All Users\Application Data\SAP BusinessObjects\Data Services

● UNIX systems (for compatibility)

The installer automatically creates this system environment variable during installation.

 Note

Starting with Data Services 4.2 SP6, users can designate a different default location for DS_COMMON_DIR during installation. If you cannot find DS_COMMON_DIR in the listed default location, ask your System Administrator to find out where your default location is.

DS_USER_DIR

The user-specific configuration directory for the Data Services system.

Default location:

● If you are on Windows (Vista and newer): <USERPROFILE>\AppData\Local\SAP BusinessObjects\Data Services

 Note

The default value of the <USERPROFILE> environment variable for Windows Vista and newer versions is C:\Users\{username}.

● If you are on Windows (older versions such as XP): <USERPROFILE>\Local Settings\Application Data\SAP BusinessObjects\Data Services

 Note

The default value of the <USERPROFILE> environment variable for older Windows versions is C:\Documents and Settings\{username}.

The installer automatically creates this system environment variable during installation.

 Note

This variable is used only for Data Services client applications on Windows. DS_USER_DIR is not used on UNIX platforms.
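On Linux, a quick way to confirm where these directories resolve on your installation is to print the variables from a shell that has the Data Services environment loaded (for example, after sourcing al_env.sh, as described later in this guide). The following is a minimal sketch; the output depends on your installation:

 Sample Code

$ echo $LINK_DIR         # Data Services root directory
$ echo $DS_COMMON_DIR    # common configuration directory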

3 Big data in SAP Data Services

SAP Data Services supports many types of big data through various object types and file formats.

Apache Cassandra [page 11] Apache Cassandra is an open-source data storage system that you can access with SAP Data Services.

Apache Hadoop [page 14] Use SAP Data Services to connect to Apache Hadoop frameworks, including the Hadoop Distributed File System (HDFS) and Hive.

Apache Impala [page 27] Create a database datastore for Apache Impala, which is an open source database for Apache Hadoop.

Connect to HDFS [page 31] Connect to your Hadoop Distributed File System (HDFS) data using an HDFS file format or an HDFS file location.

Connect to Hive [page 45] To connect to the remote Hive server, you create a Hive database datastore or a Hive adapter datastore.

Upload data to HDFS in the cloud [page 69] Upload data processed with Data Services to your HDFS that is managed by SAP Cloud Platform Big Data Services.

Google Cloud Dataproc [page 70] To connect to an Apache Hadoop web interface running on Google Cloud Dataproc clusters, use a Hive database datastore and a WebHDFS file location.

HP Vertica [page 74] Access HP Vertica data in SAP Data Services by creating an HP Vertica database datastore.

MongoDB [page 90] The MongoDB adapter allows you to read data from MongoDB sources and load data to other SAP Data Services targets.

PostgreSQL [page 102] To use your PostgreSQL tables as sources and targets in SAP Data Services, create a PostgreSQL datastore and import your tables and other metadata.

SAP HANA [page 108] Process your SAP HANA data in SAP Data Services by creating an SAP HANA database datastore.

About SAP Vora datastore [page 123] Use the SAP Vora datastore as a source in a data flow, and a template table for the target.

Data Services Connection Manager (Unix) [page 132] Use the Connection Manager after you install Data Services on Unix, to configure ODBC databases and ODBC drivers for repositories, sources, and targets.

3.1 Apache Cassandra

Apache Cassandra is an open-source data storage system that you can access with SAP Data Services.

Data Services natively supports Cassandra as an ODBC data source with a DSN connection. Cassandra uses the generic ODBC driver. Use Cassandra on Windows or Linux operating systems.

Use Cassandra data for the following tasks:

● Use as sources, targets, or template tables
● Preview data
● Query using distinct, where, group by, and order by
● Write scripts using functions such as math, string, date, aggregate, and ifthenelse

Before you use Cassandra with Data Services, ensure that you perform the following setup tasks:

● Add the appropriate environment variables to the al_env.sh file.
● For Data Services on Linux platforms, configure the ODBC driver using the Connection Manager.

 Note

For Data Services on Windows platforms, use the generic ODBC driver.

Setting ODBC driver configuration on Linux [page 11] Use the Connection Manager to configure the ODBC driver for Apache Cassandra on Linux.

Data source properties for Cassandra [page 13] Complete data source properties in the Connection Manager when you configure the ODBC driver for SAP Data Services on Linux.

Related Information

Configure database connectivity for UNIX and Linux

3.1.1 Setting ODBC driver configuration on Linux

Use the Connection Manager to configure the ODBC driver for Apache Cassandra on Linux.

Before you complete the following steps, read the topic and subtopics under “Configure database connectivity for UNIX and Linux” in the Administrator Guide.

The Connection Manager is a command-line utility. To use it with a graphical user interface (UI), install the GTK+2 library. For more information about obtaining and installing GTK+2, see https://www.gtk.org/ . The following steps are for the Connection Manager UI.

1. Open a command prompt and set $ODBCINI to a file in which the Connection Manager defines the DSN. Ensure that the file is readable and writable.

 Sample Code

$ export ODBCINI=<path>/odbc.ini

$ touch $ODBCINI

The Connection Manager uses the $ODBCINI file and other information that you enter for data sources, to define the DSN for Cassandra.

 Note

Do not point to the Data Services ODBC .ini file.

2. Start the Connection Manager user interface by entering the following command:

 Sample Code

$ cd <LINK_DIR>/bin/

$ ./DSConnectionManager.sh

 Note

<LINK_DIR> is the Data Services installation directory.

3. In Connection Manager, open the Data Sources tab, and click Add to display the list of database types.
4. In the Select Database Type dialog box, select Cassandra and click OK.

The Configuration for... dialog box opens. It contains the absolute location of the odbc.ini file that you set in the first step.
5. Provide values for additional connection properties for the Cassandra database type as applicable.
6. Provide the following properties:

○ User name
○ Password

 Note

Data Services does not save these properties for other users.

7. To test the connection, click Test Connection.
8. Click Restart Services to restart services applicable to the Data Services installation location:

If Data Services is installed on the same machine and in the same folder as the IPS or BI platform, restart the following services:
○ EIM Adaptive Process Service
○ Data Services Job Service

If Data Services is not installed on the same machine and in the same folder as the IPS or BI platform, restart the following service:
○ Data Services Job Service
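After the test connection succeeds, you can confirm that the Connection Manager wrote the DSN into the file referenced by $ODBCINI. The following is a minimal check; the DSN name shown is a placeholder for the name you entered:

 Sample Code

$ grep "^\[" $ODBCINI
[Cassandra_DSN]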

Task overview: Apache Cassandra [page 11]

Related Information

Data source properties for Cassandra [page 13] Configure database connectivity for UNIX and Linux

3.1.2 Data source properties for Cassandra

Complete data source properties in the Connection Manager when you configure the ODBC driver for SAP Data Services on Linux.

The Connection Manager configures the $ODBCINI file based on the property values that you enter in the Data Sources tab. The following table lists the properties that are relevant for Apache Cassandra.

Data Source settings for Apache Cassandra

Database Type: Apache Cassandra

Properties on the Data Sources tab:
● User Name
● Database password
● Host Name
● Port
● Database
● Unix ODBC Lib Path
● Driver
● Cassandra SSL Certificate Mode [0:disabled|1:one-way|2:two-way]

Depending on the value you choose for the certificate mode, Data Services may require you to define some or all of the following options:

● Cassandra SSL Server Certificate File
● Cassandra SSL Client Certificate File
● Cassandra SSL Client Key File
● Cassandra SSL Client Key Password
● Cassandra SSL Validate Server Hostname? [0:disabled|1:enabled]

Parent topic: Apache Cassandra [page 11]

Related Information

Setting ODBC driver configuration on Linux [page 11]

3.2 Apache Hadoop

Use SAP Data Services to connect to Apache Hadoop frameworks, including the Hadoop Distributed File System (HDFS) and Hive.

Data Services supports Hadoop on both the Linux and Windows platforms. For Windows support, Data Services uses the Hortonworks Data Platform (HDP) only. HDP accepts data from many sources and in many formats. See the latest Product Availability Matrix (PAM) on the SAP Support Portal for the supported versions of HDP.

For information about deploying Data Services on a Hadoop MapR cluster machine, see SAP Note 2404486 .

For information about accessing your Hadoop in the administered SAP Cloud Platform Big Data Services, see the Supplement for SAP Cloud Platform Big Data Services.

The following table describes the relevant components of Hadoop:

Component Description

HDFS (Hadoop Distributed File System): A distributed file system that stores data on nodes, providing high aggregate bandwidth across the cluster.

Hive: A data warehouse infrastructure that allows SQL-like, on-demand querying of data, in any format, stored in Hadoop.

Pig: A high-level data flow language and execution framework for parallel computation that is built on top of Hadoop. Data Services uses Pig scripts to read from and write to HDFS, including join and push-down operations.

Map/Reduce: A computational paradigm where the application is divided into many small fragments of work. Each fragment may be executed or re-executed on any node in the cluster. Data Services uses map/reduce to do text data processing.

The following table describes all of the objects related to Hadoop in Data Services that you use to work with your Hadoop data.

Hadoop objects and tools

Object Description

Hive adapter: Enables Data Services to connect to a Hive server so that you can work with data from Hadoop.

For complete information about using Data Services adapters, see the Supplement for Adapters.


Hive datastore: Enables Data Services to access data from your Hive data warehouse to use as a source or a target in Data Services processing.

Also use a Hive datastore to access your Hadoop clusters in Google Cloud Dataproc for processing in Data Services.

There are two types of Hive datastores:

● Hive adapter datastore: Use with the Hive adapter.

 Note

To use the Hive adapter datastore, install Data Services on the machine within the Hadoop cluster.

 Note

For complete information about adapters and creating an adapter datastore, see the Supplement for Adapters and the Reference Guide.

● Hive database datastore.

 Note

Install Data Services on any machine.

To access a remote Hive server, configure the Hive database datastore with a supported Hive ODBC driver and a data source name (DSN).

HDFS file format: Contains a description for your HDFS file system structure.

 Note

To use as a source or target, install Data Services within the Hadoop cluster. Use an HDFS file location to connect to HDFS when Data Services is installed outside of the Hadoop cluster.

HDFS file location: Contains the transfer protocol to your HDFS.

Associate a file format with the HDFS file location and use as a source or target in a data flow. Use a file format that is not an HDFS file format. For example, use a flat file format.

 Note

Install Data Services on any machine.


WebHDFS file location: Contains the transfer protocol to your HDFS using the REST API.

Configure a Hive target object to bulk load processed data to your Hadoop clusters in Google Cloud Dataproc.

Hive template table: Use a Hive template table as a target using one of the following two methods:

● Method 1: Use a Hive datastore template table from the Datastore tab in the Designer object library.
● Method 2: Use a template table from the Designer tool palette.

After you have used a Hive template table in an executed data flow, you can use the target Hive template as a source in a data flow.

Hadoop in Data Services [page 17] SAP Data Services has added support for Hadoop in stages, with features added in specific Data Services versions.

Hadoop sources and targets [page 19] Use SAP Data Services objects that you configure for Hive or the Hadoop Distributed File System (HDFS) as sources and targets in data flows.

Prerequisites to Data Services configuration [page 20] Before configuring SAP Data Services to connect to Hadoop, verify that your system configuration is correct.

Verify Linux setup with common commands [page 21] Use common commands to verify that the configuration of your SAP Data Services system on Linux for Hadoop is correct.

Hadoop support for the Windows platform [page 23] SAP Data Services supports Hadoop on the Windows platform using Hortonworks.

Configure Hadoop for text data processing [page 24] SAP Data Services supports text data processing in the Hadoop framework using a MapReduce form of the Entity Extraction transform.

Setting up HDFS and Hive on Windows [page 25] Set system environment variables and use command prompts to configure HDFS and Hive for Windows.

Related Information

Connect to HDFS [page 31] Connect to Hive [page 45]

3.2.1 Hadoop in Data Services

SAP Data Services has added support for Hadoop in stages, with features added in specific Data Services versions.

Use the following table to determine how your version of Data Services supports Hadoop.

Hadoop features in chronological order

4.2 SP1
Hadoop support: Connect to Hadoop Hive and HDFS, Linux only:
● HDFS file format
● Hive adapter
● Hive adapter datastore
More information: For more information about Data Services adapters, see the Supplement for Adapters.

4.2 SP2
Hadoop support: For the Hive adapter datastore, support for:
● Hive Server version 2
● Hive Server sub version 0.11 and later
Users must migrate to new Hive Server versions.
More information: For current version information, consult the Product Availability Matrix (PAM).

4.2 SP3
Hadoop support: Preview Hive table data.
More information: Supplement for Hadoop

4.2 SP4
Hadoop support: Preview HDFS data.
More information: Supplement for Hadoop

4.2 SP5
Hadoop support: Hive datastore support for SQL functions and the SQL transform.
More information: Supplement for Hadoop

4.2 SP6
Hadoop support:
● Hive template table
● SASL-QoP (Simple Authentication and Security Layer - Quality of Protection) with Kerberos on Hive
● Windows support for Hadoop in Data Services
● Hive support for Varchar and Char data types
More information: Supplement for Hadoop, Reference Guide

4.2 SP7
Hadoop support:
● Hive on Spark
● Support for the Hive Beeline CLI to replace the Hive CLI
● Deprecated support for the Hive CLI
More information: Supplement for Hadoop

4.2 SP8
Hadoop support:
● Support for Edge nodes running on Hadoop Hive clusters, Linux only
● Hive template tables that use Parquet, AVRO, and ORC table formats
More information: Supplement for Hadoop

4.2 SP9
Hadoop support: Connect to Hadoop on SAP Cloud Platform Big Data Services.
More information: Supplement for SAP Cloud Platform Big Data Services

4.2 SP10
Hadoop support:
HDFS file location object:
● Supports reading from and loading to the HDFS system with or without Kerberos authentication.
● Data Services installation is not restricted to be inside the Hadoop cluster. Install Data Services on any machine for HDFS and Hive reading and loading.
Hive datastore enhancements:
● Use supported Hive ODBC drivers to connect to the Hive server remotely. Data Services installation is not restricted to be inside the Hadoop cluster.
● Support for Kerberos.
● Support for bulk loading to Hive.
More information: Supplement for Adapters, Supplement for Hadoop, Reference Guide

4.2 SP11
Hadoop support: Support for Knox gateway access for HDFS through the HDFS file location object. For more information about setting up your Hadoop with the Knox gateway, see your Apache Knox documentation.
More information: Supplement for Hadoop, Reference Guide

4.2 SP12
Hadoop support:
● Access a Kerberos-secured Hadoop cluster in SAP Cloud Platform Big Data Services.
● DSN and server name (DSN-less) connections to Hive Server 2.
More information: Supplement for Hadoop. For complete information about accessing Hadoop in SAP Cloud Platform Big Data Services, see the Supplement for SAP Cloud Platform Big Data Services.

4.2 SP12 Patch 1 (14.02.12.01)
Hadoop support: Access your Hadoop clusters in Google Cloud Dataproc using a Hive database datastore. Upload generated data from Data Services using a WebHDFS file location.
More information: Supplement for Hadoop

Parent topic: Apache Hadoop [page 14]

Related Information

Hadoop sources and targets [page 19]
Prerequisites to Data Services configuration [page 20]
Verify Linux setup with common commands [page 21]
Hadoop support for the Windows platform [page 23]
Configure Hadoop for text data processing [page 24]
Setting up HDFS and Hive on Windows [page 25]

3.2.2 Hadoop sources and targets

Use SAP Data Services objects that you configure for Hive or the Hadoop Distributed File System (HDFS) as sources and targets in data flows.

To access data from Hive, use objects that are designed for Hive. For example, use the Hive adapter datastore for jobs that use data from your Hive storage. When you want data from your HDFS, use an HDFS file format or HDFS file location object.

Use other Data Services objects along with Hadoop objects in data flows based on your objectives.

 Example

● Configure a data source name (DSN) using a supported Hive ODBC driver to create a Hive datastore that accesses a remote Hive server.
● Configure a flat file with an HDFS file location object and use as a source or target in a data flow.
● Use an HDFS file location object and a script to access data from a remote source with the copy_from_remote_system function.
● Use an HDFS file location object and a script to upload data from your local server to a remote server using the copy_to_remote_system function.
● Use bulk loading to upload data to Hive or HDFS. Works with a flat file, Hive template table, or a table within the Hive datastore as a target in a data flow.

Parent topic: Apache Hadoop [page 14]

Related Information

Hadoop in Data Services [page 17]
Prerequisites to Data Services configuration [page 20]
Verify Linux setup with common commands [page 21]
Hadoop support for the Windows platform [page 23]
Configure Hadoop for text data processing [page 24]
Setting up HDFS and Hive on Windows [page 25]
Connect to HDFS [page 31]
Connect to Hive [page 45]
Creating a DSN connection with SSL protocol in Windows [page 64]
Configuring bulk loading for Hive [page 67]

3.2.3 Prerequisites to Data Services configuration

Before configuring SAP Data Services to connect to Hadoop, verify that your system configuration is correct.

Ensure that your Data Services system configuration meets the following prerequisites.

For Linux and Windows platforms:

● You configure the machine where the Data Services Job Server is installed to work with Hadoop.
● The machine where the Data Services Job Server is installed has the Pig client installed.
● If you use Hive, verify that the Hive client is installed:
1. Log in to the node.
2. Issue Pig and Hive commands to start the respective interfaces.
● The Data Services Job Server is installed on one of the Hadoop cluster machines, which can be either an Edge (Linux only) or a Data node. To install Data Services on any machine, including a machine in the cluster, use one of the following methods:
○ Use a supported ODBC driver and configure a DSN for the Hive adapter datastore.
○ Set up jobs using the HDFS file location object. Create the HDFS file location object using either the WebHDFS or the HTTPFS connection protocols.
● If you use text data processing, ensure that you copy the necessary text data processing components to the HDFS to enable MapReduce functionality.

For Linux platforms:

● You set the environment for interaction with Hadoop.
● You start the Job Server from an environment that sources the Hadoop environment script.

 Example

For example:

source <LINK_DIR>/hadoop/bin/hadoop_env_setup.sh -e

Parent topic: Apache Hadoop [page 14]

Related Information

Hadoop in Data Services [page 17]

Hadoop sources and targets [page 19]
Verify Linux setup with common commands [page 21]
Hadoop support for the Windows platform [page 23]
Configure Hadoop for text data processing [page 24]
Setting up HDFS and Hive on Windows [page 25]
Setting UNIX environment variables
HDFS file location object options [page 32]
Hive database datastores [page 62]
HDFS file format

3.2.4 Verify Linux setup with common commands

Use common commands to verify that the configuration of your SAP Data Services system on Linux for Hadoop is correct.

When you use the commands in this topic, your output may be different from what we show. If your output is different, it is okay as long as your commands do not result in errors.

Setting up the environment

To set up the Data Services environment for Hadoop, use the following command:

$ cd <LINK_DIR>/bin
$ source ./al_env.sh
$ cd ../hadoop/bin
$ source ./hadoop_env_setup.sh -e

Checking components

Ensure that Hadoop, Pig, and Hive are installed and correctly configured on the machine where Data Services Job Server for Hadoop resides.

Check the Hadoop, Pig, and Hive configuration by using the following command:

$ hadoop fs -ls /

For Hadoop, you should see output similar to the following:

$ hadoop fs -ls /

Found 2 items
drwxr-xr-x - hadoop supergroup 0 2013-03-21 11:47 /tmp
drwxr-xr-x - hadoop supergroup 0 2013-03-14 02:50 /user

For Pig, you should see output similar to the following:

$ pig

INFO org.apache.pig.Main - Logging error messages to: /hadoop/pig_1363897065467.log
INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://machine:9000
INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: machine:9001
grunt> fs -ls /
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2013-03-21 11:47 /tmp
drwxr-xr-x - hadoop supergroup 0 2013-03-14 02:50 /user

grunt> quit

For Hive, you should see output similar to the following:

$ hive

Hive history file=/tmp/hadoop/hive_job_log_hadoop_201303211318_504071234.txt
hive> show databases;
OK
default
Time taken: 1.312 seconds

hive> quit;

Set up or restart the Job Server

If all commands pass, use the following command from within the same shell to set up or restart the Job Server.

<LINK_DIR>/bin/svrcfg

This command provides the Job Server with the proper environment from which it starts engines that call Hadoop, Pig, and Hive.

Parent topic: Apache Hadoop [page 14]

Related Information

Hadoop in Data Services [page 17]
Hadoop sources and targets [page 19]
Prerequisites to Data Services configuration [page 20]
Hadoop support for the Windows platform [page 23]
Configure Hadoop for text data processing [page 24]
Setting up HDFS and Hive on Windows [page 25]

3.2.5 Hadoop support for the Windows platform

SAP Data Services supports Hadoop on the Windows platform using Hortonworks.

Use the supported version of Hortonworks HDP only. See the Product Availability Matrix (PAM) on the SAP Support Portal for the most recent supported version number.

When you use Hadoop on the Windows platform, use Data Services to do the following tasks:

● Use Hive tables as a source or target in your data flows.
● Use HDFS files as a source or target in your data flows using Pig script or the HDFS library API.
● Use an HDFS file location object as a source or target in your data flows. Pig script or the HDFS library API is not required.
● Stage non-Hive data in a data flow using the Data_Transfer transform.
● Preview data for HDFS files and Hive tables.

Requirements

Make sure that you set up your system as follows:

● Install the Data Services Job Server in one of the nodes of the Hadoop cluster.

 Note

Install Data Services on any machine when you use an HDFS file location object with one of the following connection protocols:
○ WebHDFS
○ HTTPFS
Alternatively, use a Hive database datastore configured with DSN and a supported Hive ODBC driver.

● Set the system environment variables, such as PATH and CLASSPATH, so that the Job Server can run as a service.
● Set the permission requirements for the HDFS file system to use HDFS or Hive.

Parent topic: Apache Hadoop [page 14]

Related Information

Hadoop in Data Services [page 17]
Hadoop sources and targets [page 19]
Prerequisites to Data Services configuration [page 20]
Verify Linux setup with common commands [page 21]
Configure Hadoop for text data processing [page 24]
Setting up HDFS and Hive on Windows [page 25]
Connect to HDFS [page 31]
Previewing HDFS file data [page 44]
HDFS file location object options [page 32]
Hive adapter datastores [page 45]
Hive database datastores [page 62]
Pushing the JOIN operation to Hive [page 59]

3.2.6 Configure Hadoop for text data processing

SAP Data Services supports text data processing in the Hadoop framework using a MapReduce form of the Entity Extraction transform.

To use text data processing in Hadoop, run the following Hadoop environment script.

<LINK_DIR>/hadoop/bin/hadoop_env_setup.sh -c

The script copies the language modules and other dependent libraries to the Hadoop file system so that MapReduce can distribute them during the MapReduce job setup. You only have to do this file-copying operation once after an installation or update, or when you want to use custom dictionaries or rule files.

If you use the Entity Extraction transform with custom dictionaries or rule files, copy the custom dictionaries or rule files to the Hadoop file system for distribution. To do so, first copy the files into the languages directory of the Data Services installation, then rerun the Hadoop environment script. For example:

cp /myhome/myDictionary.nc <LINK_DIR>/TextAnalysis/languages

<LINK_DIR>/hadoop/bin/hadoop_env_setup.sh -c

After you complete the Hadoop environment set up, configure the Entity Extraction transform to push down operations to the Hadoop system by connecting it to a single HDFS Unstructured Text source.

Optimize text data processing for the Hadoop framework [page 25] To control the mapper settings for text data processing in the Hadoop framework, use a configuration setting.

Parent topic: Apache Hadoop [page 14]

Related Information

Hadoop in Data Services [page 17]
Hadoop sources and targets [page 19]
Prerequisites to Data Services configuration [page 20]
Verify Linux setup with common commands [page 21]
Hadoop support for the Windows platform [page 23]
Setting up HDFS and Hive on Windows [page 25]

3.2.6.1 Optimize text data processing for the Hadoop framework

To control the mapper settings for text data processing in the Hadoop framework, use a configuration setting.

Use the following Hadoop configuration setting to control the amount of data a mapper can handle and the number of mappers used by a job: mapred.max.split.size .

Set the value for mapred.max.split.size in the Hadoop configuration file. The Hadoop configuration file is typically located at $HADOOP_HOME/conf/core-site.xml, but it may be in a different location depending on the type of Hadoop you use.

By default, the value for mapred.max.split.size is 0. When you keep the default for this configuration setting:

● The software does not limit the amount of data the mapper handles.
● The software runs text data processing with only one mapper.

Change the default configuration value to the amount of data that each mapper can handle.

 Example

A Hadoop cluster contains 20 machines. Each machine is set to run a maximum of 10 mappers. 20 machines x 10 mappers = 200 mappers available in the cluster.

Your input data averages 200 GB. To have the text data processing job consume 100 percent of the available mappers, set mapred.max.split.size to 1073741824 (1 GB).

Calculation: 200 GB ÷ 200 mappers = 1 GB per mapper.

<property>
  <name>mapred.max.split.size</name>
  <value>1073741824</value>
</property>

To have the text data processing job consume 50 percent of the available mappers, set mapred.max.split.size to 2147483648 (2 GB).

Calculation: 200 GB ÷ 100 mappers = 2 GB per mapper.

<property>
  <name>mapred.max.split.size</name>
  <value>2147483648</value>
</property>
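To confirm the value currently set in the configuration file, you can search for the property directly. The following is a minimal check, assuming the default file path mentioned above and the 1 GB value from the first calculation:

 Sample Code

$ grep -A 2 mapred.max.split.size $HADOOP_HOME/conf/core-site.xml
<name>mapred.max.split.size</name>
<value>1073741824</value>
</property>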

Parent topic: Configure Hadoop for text data processing [page 24]

3.2.7 Setting up HDFS and Hive on Windows

Set system environment variables and use command prompts to configure HDFS and Hive for Windows.

Install the SAP Data Services Job Server component.

Perform the following steps to set up HDFS and Hive on Windows:

1. Set the following system environment variable:

HDFS_LIB_DIR = /sap/dataservices/hadoop/tdp

2. Add the following to the PATH: <LINK_DIR>\ext\jre\bin\server.
3. Run the following command:

hadoop classpath --jar c:\temp\hdpclasspath.jar

4. Update CLASSPATH=%CLASSPATH%; c:\temp\hdpclasspath.jar.

CLASSPATH generates the Hadoop and classpath .jar files.
5. Set the location of the Hadoop and classpath .jar files.
6. When the Hadoop CLASSPATH command completes successfully, check the content of the .jar file for the Manifest file.
7. Check that hdfs.dll has symbols exported. If the symbols from hdfs.dll are not exported, install the fix from Hortonworks for the export of symbols. If the symbols from the .dll are still not exported, use the .dll from Hortonworks 2.3.
8. Required only if you use Text Data Processing transforms in jobs, and only once per Data Services installation: run the following command from <LINK_DIR>\bin:

Hadoop_env_setup.bat

The .bat file copies the Text Analysis Language file to the HDFS cache directory.
9. Ensure that the Hadoop or Hive .jar files are installed. The Data Services Hive adapter uses the following .jar files:
○ commons-httpclient-3.0.1.jar
○ commons-logging-1.1.3.jar
○ hadoop-common-2.6.0.2.2.6.0-2800.jar
○ hive-exec-0.14.0.2.2.6.0-2800.jar
○ hive-jdbc-0.14.0.2.2.6.0-2800-standalone.jar
○ hive-jdbc-0.14.0.2.2.6.0-2800.jar
○ hive-metastore-0.14.0.2.2.6.0-2800.jar
○ hive-service-0.14.0.2.2.6.0-2800.jar
○ httpclient-4.2.5.jar
○ httpcore-4.2.5.jar
○ libfb303-0.9.0.jar
○ log4j-1.2.16.jar
○ slf4j-api-1.7.5.jar
○ slf4j-log4j12-1.7.5.jar
10. Run the following commands to set up the permissions on the HDFS file system:

hdfs dfs -chmod -R 777 /mapred

hdfs dfs -mkdir /tmp
hdfs dfs -chmod -R 777 /tmp
hdfs dfs -mkdir /tmp/hive/
hdfs dfs -chmod -R 777 /tmp/hive
hdfs dfs -mkdir -p /sap/dataservices/hadoop/tdp
hdfs dfs -mkdir -p /user/hive
hdfs dfs -mkdir -p /hive/warehouse
hdfs dfs -chown hadoop:hadoop /user/hive
hdfs dfs -chmod -R 755 /user/hive
hdfs dfs -chmod -R 777 /hive/warehouse

Task overview: Apache Hadoop [page 14]

Related Information

Hadoop in Data Services [page 17]
Hadoop sources and targets [page 19]
Prerequisites to Data Services configuration [page 20]
Verify Linux setup with common commands [page 21]
Hadoop support for the Windows platform [page 23]
Configure Hadoop for text data processing [page 24]
Data Services adapters
Hive adapter datastores [page 45]

3.3 Apache Impala

Create a database datastore for Apache Impala, which is an open source database for Apache Hadoop.

Before you create an Apache Impala datastore, import the Cloudera driver and create a data source name (DSN). To create an Apache Impala datastore, open the datastore editor and select ODBC for the Database Type. Then select the DSN you created for this datastore and complete the remaining options as applicable.

Use the datastore to import Impala tables, then use data from Impala as a source or target in a data flow.

Before you work with Apache Impala, be aware of the following limitations:

● Data Services supports Impala 2.5 and later.
● Data Services supports only Impala scalar data types. Data Services does not support complex types such as ARRAY, STRUCT, or MAP.

Connecting Impala using the Cloudera ODBC driver [page 28] For Linux users. Before you create an Impala database datastore, connect to Apache Impala using the Cloudera ODBC driver.

Creating an Apache Impala datastore and DSN for Cloudera driver [page 29] For Windows, create the Apache Impala datastore and create a DSN for the Cloudera driver in Designer.

Related Information

ODBC Common datastore options

3.3.1 Connecting Impala using the Cloudera ODBC driver

For Linux users. Before you create an Impala database datastore, connect to Apache Impala using the Cloudera ODBC driver.

For Linux users. Follow these high-level steps to connect to Apache Impala using the Cloudera ODBC driver. For more in-depth information, consult the Cloudera documentation.

1. Enable Impala Services on the Hadoop server.
2. Download and install the Cloudera ODBC driver (https://www.cloudera.com/downloads/connectors/impala/odbc/2-5-26.html ):
○ For Windows, use ClouderaImpalaODBC64.msi.
○ For SUSE, use ClouderaImpalaODBC-2.5.39.1020-1.x86_64.rpm.
○ For RedHat 7, use ClouderaImpalaODBC-2.5.39.1020-1.el7.x86_64.rpm.
3. Configure an Impala data source name (DSN).
a. Run DSConnectionManager.sh. For more information about DSConnectionManager, see “Using the Connection Manager for UNIX systems” in the Administrator Guide.

 Note

This example has Kerberos and Secure Sockets Layer (SSL) enabled.

The ODBC ini file is

There are available DSN names in the file:

[DSN name 1] [DSN name 2]

Specify the DSN name from the list or add a new one:

imp_ssl_1

Specify the User Name:

Type database password:(no echo)

Retype database password:(no echo)

Specify the Host Name:

Specify the Port:'21050'

Specify the Database:

default

Specify the Unix ODBC Lib Path:

The Unix ODBC Lib Path is based on where you install the driver.

For example, /build/unixODBC-2.3.2/lib.

Specify the Driver:

/opt/cloudera/impalaodbc/lib/64/libclouderaimpalaodbc64.so

Specify the Impala Auth Mech [0:noauth|1:kerberos|2:user|3:user-password]:'0':

1

Specify the Kerberos Host FQDN:

Specify the Kerberos Realm:


Specify the Impala SSL Mode [0:disabled | 1:enabled]:'0'

1

Specify the Impala SSL Server Certificate File:

Testing connection...

Successfully added database source.

4. Create an ODBC datastore with the Impala DSN and then import the Impala tables.
5. Optional. Enable Kerberos authentication.
a. When configuring the Impala DSN in the Connection Manager (see step 3), enable Kerberos by setting Specify the Impala Auth Mech [0:noauth|1:kerberos|2:user|3:user-password]: to 1.
b. Enter the Kerberos Host FQDN.
c. Enter the Kerberos Realm.

 Note

DSConnectionManager does not test the Kerberos connection. It saves all the input to the $ODBCINI file and tests the connection at runtime.

6. Optional. Enable Secure Socket Layer (SSL).
a. When configuring the Impala DSN in the Connection Manager (see step 3), enable SSL by setting Specify the Impala SSL Mode [0:disabled|1:enabled]: to 1.
b. Enter the path to the Impala certificate.pem file.
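Because the Connection Manager saves the Kerberos settings without testing them, you can optionally test the finished DSN yourself with the unixODBC isql utility before creating the datastore. The following is a minimal sketch; the DSN name matches the sample session above, and the user name and password are placeholders:

 Sample Code

$ isql -v imp_ssl_1 <username> <password>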

Task overview: Apache Impala [page 27]

Related Information

Creating an Apache Impala datastore and DSN for Cloudera driver [page 29]

3.3.2 Creating an Apache Impala datastore and DSN for Cloudera driver

For Windows, create the Apache Impala datastore and create a DSN for the Cloudera driver in Designer.

Enable Impala Services in the Hadoop server. Download the Cloudera driver for your platform.

1. Select Tools New Datastore in SAP Data Services Designer.

The datastore editor opens.
2. Select Database from the Datastore Type dropdown list.
3. Select ODBC from the Database Type dropdown list.

If you used the SAP Data Services Connection Manager to create a DSN connection with the Cloudera driver, skip steps 4–9.

4. Click ODBC Admin.

The ODBC Data Source Administrator opens.
5. Open the System DSN tab and select the Cloudera driver that you downloaded from the System Data Sources list. Click Configure.

The Cloudera ODBC Driver for Impala DSN Setup dialog box opens.
6. Enter the required information based on your system, and click Advanced Options.
7. In the Advanced Options dialog box, enable Use SQL Unicode Types.
8. Close Advanced Options.
9. Optional. To enable Kerberos authentication, perform the following substeps in the ODBC Data Source Administrator:
a. In the Cloudera ODBC Driver for Impala DSN Setup dialog box, select Kerberos from the Mechanism dropdown list.
b. Enter the name of the applicable realm in Realm.

A realm is a set of managed nodes that share the same Kerberos database. For example, your realm name might be “Cloudera”.
c. Enter the fully qualified domain name (FQDN) of the Hive Server host in Host FQDN.
d. Enter the service principal name of the Hive server in Service Name.
e. Enable the Canonicalize Principal FQDN option, which canonicalizes the host FQDN in the server service principal name.
10. Optional. To enable Secure Sockets Layer (SSL) protocol, perform the following steps in the ODBC Data Source Administrator:
a. In the Cloudera ODBC Driver for Impala DSN Setup dialog box, select No Authentication (SSL) from the Mechanism dropdown list.
b. Click Advanced Options.
c. Enter or browse to the Cloudera certificate file in Trusted Certificates.

The default path to the Impala certificate.pem file automatically populates.
d. Close Advanced Options, Cloudera ODBC Driver for Impala DSN Setup, and the ODBC Data Source Administrator.
11. In the datastore editor in Data Services, select the Cloudera DSN that you just created from the Data Source Name dropdown list.

The DSN appears in the dropdown list only when you created it with the ODBC Data Source Administrator or the Data Services Connection Manager.
12. Click Advanced and complete the advanced options as necessary.
13. Optional. To process multi byte data in Impala tables, go to the Locale group and set the Code page option to utf-8.
14. In the ODBC Date Function Support group, set the Week option to No.

If you do not set the Week option to No, the result of the Data Services built-in function week_in_year() may be incorrect.

Task overview: Apache Impala [page 27]

Related Information

Connecting Impala using the Cloudera ODBC driver [page 28]

3.4 Connect to HDFS

Connect to your Hadoop Distributed File System (HDFS) data using an HDFS file format or an HDFS file location.

An HDFS file format and an HDFS file location contain your HDFS connection information, including account name, password, security protocol, and so on. Data Services uses this information to access HDFS data during Data Services processing.

Decide which object to use based on the location of your Data Services installation:

● Use an HDFS file format when Data Services is installed within the Hadoop cluster.
● Use an HDFS file location when Data Services is installed anywhere, including within the Hadoop cluster.

If your Hadoop system is managed in SAP Cloud Platform Big Data Service (formerly Altiscale), your connection setup uses information from your Big Data Service account. For complete instructions to connect to your Big Data Service account, see the Supplement for SAP Cloud Platform Big Data Service.

HDFS file location objects [page 31] To use HDFS data as a source, or to upload generated data to HDFS, use an HDFS file location as a source or target in a data flow along with a flat file or template table.

HDFS file format objects [page 37] An HDFS file format stores connection information to an HDFS file.

Related Information

HDFS file format HDFS file location object options [page 32] Upload data to HDFS in the cloud [page 69]

3.4.1 HDFS file location objects

To use HDFS data as a source, or to upload generated data to HDFS, use an HDFS file location as a source or target in a data flow along with a flat file or template table.

When you create the file location:

● Enter the file transfer protocol specifics for your HDFS.

● Define a local and remote server for which you have access permission.

Data Services uses the remote and local server information and the file transfer protocol in a data flow to move data between the local and remote server.

For information about running a Hadoop file location job to append data to HDFS file on a small cluster, see SAP Note 2771182 .

For more information about file location objects, see the Designer Guide.

HDFS file location object options [page 32] Use a Hadoop distributed file system (HDFS) file location to access your Hadoop data for Data Services processing.

Parent topic: Connect to HDFS [page 31]

Related Information

HDFS file format objects [page 37] HDFS file location object options [page 32]

3.4.1.1 HDFS file location object options

Use a Hadoop distributed file system (HDFS) file location to access your Hadoop data for Data Services processing.

Use the HDFS file location as a source or target in an SAP Data Services data flow.

When you create a new HDFS file location, select HDFS from the Protocol dropdown list.

The following table describes the file location options that are specific to the HDFS protocol. For descriptions of general options, see the Reference Guide.

Option Description

Connection section

Protocol Type of file transfer protocol.

Select HDFS.


Communication Protocol Type of protocol to use to access the data in your HDFS.

● WebHDFS: Select when Data Services is not installed as a part of the Hadoop cluster. Ensure that you configure WebHDFS on your server side.
● HTTPFS: Select when Data Services is not installed as a part of the Hadoop cluster.
● HDFS: Select when Data Services is installed as a part of the Hadoop cluster.

Host Name of the computer that hosts the NameNode.

Secondary NameNode Name of the computer that hosts the secondary NameNode.

Port Port number on which the NameNode listens.

User Hadoop user name.

Password Password for the WebHDFS communication protocol method. Required for Knox gateway and topology.

Compression type Specifies not to use compression, or to use gzip compression:

● None: Default setting. The file location object does not use compression.
● gzip: The file location object uses gzip compression. The software compresses the files before upload to Hadoop and decompresses the files after download from Hadoop.

Not applicable when you select HDFS type for Communication protocol.

Connection retry count Number of times the computer tries to create a connection with the remote server after a connection fails.

The default value is 10.

The value cannot be zero.

After the specified number of retries, Data Services issues an error message and stops the job.

Not applicable when you select HDFS type for Communication protocol.


Batch size for uploading data (MB) Size of the data transfer in MB to use for uploading data.

The default value is 5.

Data Services uses different upload methods based on file size:

● Single part uploads for files less than 5 MB.
● Multi part uploads for files larger than 5 MB.

Data Services limits the total upload batch size to 100 MB.

Not applicable when you select HDFS type for Communication protocol.

Batch size for downloading data (MB) Size of the data transfer in MB the software uses to download data from Hadoop.

The default value is 5.

Not applicable when you select HDFS type Communication protocol.

Number of threads Number of upload and download threads for transferring data from and to Hadoop.

The default value is 1.

Not applicable when you select HDFS type Communication protocol.

Authentication type Authentication for the HDFS connection.

● None: Kerberos security is not enabled.

For Kerberos enabled cluster:

● Delegation token: You have a delegation token for authentication of the request.
● Kerberos: Default. You have a password to enter in the Password option.
● Kerberos keytab: You have a generated keytab file. With this option, you do not enter a value for Password, but you enter a location for Keytab file.

Keytab file Generated keytab file name.

Applicable when you select Kerberos keytab for Authentication type.

Kerberos Password Kerberos password.

Applicable when you select Kerberos for Authentication type.


SSL enabled Select Yes to use a Secure Socket Layer (SSL) connection to HDFS.

Not applicable when you select WebHDFS type for Communication protocol.

File System section

Remote directory Path for your HDFS working directory.

Local directory Path for your local working directory.

Replication factor The number of replicated files that HDFS should create.

The default value is 2.

Not applicable when you select HDFS type for Communication protocol.

Proxy section: Complete the Proxy options only when you are using a proxy.

Proxy host Path and host name for the REST API proxy server.

Not applicable when you select HDFS type for Communication protocol.

Proxy port Port number for the REST API Proxy server.

Not applicable when you select HDFS type for Communication protocol.

Proxy Username User name for the REST API proxy server.

Not applicable when you select HDFS type for Communication protocol.

Proxy Password Password for the REST API proxy server.

Not applicable when you select HDFS type for Communication protocol.

HDFS Proxy user Proxy user name configured for the HDFS user.

Not applicable when you select HDFS type for Communication protocol.

Pig section


Working directory Directory path or variable. The software uses this directory when transferring data from the remote server to the local server, and when transferring data from the local server to the remote server.

Applicable only for HDFS type Communication protocol.

Clean up working directory Determines if the software deletes files in the working directory after execution.

● Yes: Default setting. Deletes the working directory files.
● No: Preserves the working directory files.

If you select No, intermediate files remain in both this working directory and the Data Services directory <LINK_DIR>/log/hadoop.

Applicable only for HDFS type Communication protocol.

Custom Pig script Directory path or variable. Location of a custom Pig script, if applicable.

Applicable only for HDFS type Communication protocol.

Knox

Gateway and topology URL to your gateway and topology file. The topology file specifies the Hadoop cluster services that the Knox gateway accesses. Supports only WebHDFS communication protocol.

Server certificate Location of your Knox certificate. If you leave this option blank, Data Services establishes an unsecured connection.
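If you select Kerberos keytab for Authentication type, you can verify the keytab outside Data Services before you configure the file location. The following is a minimal sketch using standard MIT Kerberos client tools; the keytab path and principal are placeholders:

 Sample Code

$ klist -k /path/to/hdfs_user.keytab     # list the principals stored in the keytab
$ kinit -kt /path/to/hdfs_user.keytab [email protected]     # obtain a ticket using the keytab
$ klist     # confirm that the ticket was granted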

Parent topic: HDFS file location objects [page 31]

Related Information

Common options
File location object
Prerequisites to Data Services configuration [page 20]
Setting up HDFS and Hive on Windows [page 25]
Connect to HDFS [page 31]

3.4.2 HDFS file format objects

An HDFS file format stores connection information to an HDFS file.

Use a file format to connect to source or target data when the data is stored in a file instead of a database table. To use an HDFS file format, do the following:

● Create a file format that defines the structure for a file.
● Drag and drop a file format into a data flow and specify whether it is a source or target.
● Specify connection information in the source or target file format editor.

HDFS file format options [page 37] Create a Hadoop distributed file system (HDFS) file format in the File Format Editor in SAP Data Services.

Configuring custom Pig script results as source [page 43] Use an HDFS file format and a custom Pig script to use the results of the PIG script as a source in a data flow.

Previewing HDFS file data [page 44] Preview HDFS file data for delimited and fixed width file types.

Parent topic: Connect to HDFS [page 31]

Related Information

HDFS file location objects [page 31] HDFS file format options [page 37] Previewing HDFS file data [page 44]

3.4.2.1 HDFS file format options

Create a Hadoop distributed file system (HDFS) file format in the File Format Editor in SAP Data Services.

Access the following options in the source or target file editors when you use the HDFS file format in a data flow. Mode refers to creating a new file format, editing a file format, completing source options, or completing target options. The options in the following table appear in all modes.

Option Possible values Description Mode

Data File(s)


NameNode host Computer name, fully quali­ Name of the NameNode All fied domain name, IP ad­ computer. dress, or variable If you use the following de­ fault settings, the local Ha­ doop system uses what is set as the default file system in the Hadoop configuration files.

● NameNode Host: default ● NameNode port: 0

NameNode port Positive integer or variable Port on which the NameNode All listens.

If you use the following de­ fault settings, the local Ha­ doop system uses what is set as the default file system in the Hadoop configuration files.

● NameNode Host: default ● NameNode port: 0

Hadoop user Alphanumeric characters Hadoop user name. All and underscores or variable If you use Kerberos authenti­ cation, include the Kerberos realm in the user name. For example: [email protected].

Authentication
Possible values: Kerberos, Kerberos keytab.
Indicates the type of authentication for the HDFS connection. Select either value for Hadoop and Hive data sources when they are Kerberos enabled.
● Kerberos: Select when you have a password to enter in the Password option.
● Kerberos keytab: Select when you have a generated keytab file. With this option, you do not need to enter a value for Password, but you enter a location for File Location.
A Kerberos keytab file contains a list of authorized users for a specific password. The software uses the keytab information instead of the entered password in the Password option. For more information about keytabs, see the MIT Kerberos documentation at http://web.mit.edu/kerberos/krb5-latest/doc/basic/keytab_def.html.
Mode: All

File Location
Possible values: File path.
Location for the applicable Kerberos keytab that you generated for this connection.

 Note
This option is only available when you choose Kerberos keytab for the Authentication.

Mode: All

Password
Possible values: Alphanumeric characters and underscores, or variable.
Password associated with the selected authentication type. This field is required for Authentication type Kerberos. This field is not applicable for Authentication type Kerberos keytab.
Mode: All

Root directory
Possible values: Directory path or variable.
Root directory path or variable name for the output file.
Mode: All

File name(s)
Possible values: Alphanumeric characters and underscores, or variable.
Select the source connection file name or browse to the file by clicking the dropdown arrow. For added flexibility, you can select a variable for this option or use the * wildcard.
Mode: All

Pig

Working directory
Possible values: Directory path or variable.
The Pig script uses this directory to store intermediate data.

 Note
When you leave this option blank, Data Services creates and uses a directory in /user/sapds_temp within the HDFS.

Mode: All

Clean up working directory
Possible values: Yes, No.
● Yes: Deletes working directory files.
● No: Preserves working directory files.
The software stores the Pig output file and other intermediate files in the working directory. Files include scripts, log files, and the <$LINK_DIR>/log/hadoop directory.

 Note
If you select No, intermediate files remain in both the Pig Working Directory and the Data Services directory <$LINK_DIR>/log/hadoop.

Mode: All

Custom Pig script
Possible values: Directory path or variable.
Location of a custom Pig script. Use the results of the script as a source in a data flow.

A custom Pig script can contain any valid Pig Latin command, including calls to any MapReduce jobs that you want to use with Data Services. See your Pig documentation for information about Pig Latin commands.

Custom Pig scripts must reside on, and be runnable from, the local file system that contains the Data Services Job Server that is configured for Hadoop; it is not the Job Server on HDFS. Any external reference or dependency in the script should be available on the Data Services Job Server machine configured for Hadoop.

To test your custom Pig script, execute the script from the command prompt and check that it finishes without errors (see the example after this table). For example, you could use the following command:

$ pig -f myscript

To use the results of the Pig script as a source, use this HDFS file format as a source in a data flow.
Mode: All

Locale

Code page
Possible values: A Pig code page, such as us-ascii.
The applicable Pig code page. The Default option uses UTF-8 for the code page. Select one of these options for better performance.

 Note
For other types of code pages, Data Services uses HDFS API-based file reading.

Mode: All
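For example, a quick way to smoke-test a custom Pig script before referencing it in the file format is to run it from the command line on the Job Server machine that is configured for Hadoop. This is only a sketch; the script name and output path are hypothetical, and it assumes the pig and hdfs command-line clients are available on that machine:

$ pig -f /home/dsuser/scripts/clean_orders.pig
$ echo $?                                # a return code of 0 means the script finished without errors
$ hdfs dfs -ls /user/dsuser/pig_output   # confirm that the script wrote its output files

If the script runs cleanly here, point Custom Pig script at the same path in the file format and set Root directory and File name(s) to the script's output location.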

Parent topic: HDFS file format objects [page 37]

Related Information

Configuring custom Pig script results as source [page 43] Previewing HDFS file data [page 44] Configuring custom Pig script results as source [page 43]

3.4.2.2 Configuring custom Pig script results as source

Use an HDFS file format and a custom Pig script to use the results of the Pig script as a source in a data flow.

Create a new HDFS file format or edit an existing one. Use the Pig section of the HDFS file format to create or locate a custom Pig script that outputs data.

Follow these steps to use the results of a custom Pig script in your HDFS file format as a source:

1. In the HDFS file format editor, select Delimited for Type in the General section.
2. Enter the location for the custom Pig script results output file in Root directory in the Data File(s) section.
3. Enter the name of the file to contain the results of the custom Pig script in File name(s).
4. In the Pig section, set Custom Pig script to the path of the custom Pig script. The location must be on the machine that contains the Data Services Job Server.
5. Complete the applicable output schema options for the custom Pig script.
6. Set the delimiters for the output file in the Delimiters section.
7. Save the file format.

Use the file format as a source in a data flow. When the software runs the custom Pig script in the HDFS file format, the software uses the script results as source data in the job.

Task overview: HDFS file format objects [page 37]

Related Information

HDFS file format options [page 37] Previewing HDFS file data [page 44] HDFS file format objects [page 37]

3.4.2.3 Previewing HDFS file data

Preview HDFS file data for delimited and fixed width file types.

To preview the first 20 or so rows of an HDFS file:

1. Right-click an HDFS file name in the Format tab of the Local Object Library.
2. Click Edit.

The File Format Editor opens. You can only view the data. Sorting and filtering are not available when you view sample data in this manner.

Use one of the following methods to access HDFS file data so that you can view, sort, and filter the data:

● Right-click an HDFS source or target object in a data flow and click View Data.
● Click the magnifying glass icon located in the lower right corner of the HDFS source or target objects in the data flow.
● Right-click an HDFS file in the Format tab of the Local Object Library, click Properties, and then open the View Data tab.

 Note

By default, the maximum number of rows displayed for data preview and filtering is 1000. Adjust the number lower or higher, up to a maximum of 5000. Perform the following steps to change the maximum number of rows to display:

1. Select Tools > Options > Designer > General.
2. Set the View data sampling size (rows) to the desired number of rows.

Parent topic: HDFS file format objects [page 37]

Related Information

HDFS file format options [page 37] Configuring custom Pig script results as source [page 43] Designer Guide: Viewing and adding filters

Designer Guide: Sorting

3.5 Connect to Hive

To connect to the remote Hive server, you create a Hive database datastore or a Hive adapter datastore.

Before you create a Hive adapter datastore, create the adapter in the Administrator module of the Management Console. For details about the Hive adapter, see the Supplement for Adapters. Use the Hive adapter datastore when SAP Data Services is installed within the Hadoop cluster. Use the Hive adapter datastore for server-named (DSN-less) connections. Also include SSL (or the newer Transport Layer Security, TLS) for secure communication over the network.

Use a Hive database datastore when Data Services is installed on a machine either within the Hadoop cluster or not. Use the Hive database datastore for a DSN or a DSN-less connection. Also include SSL (or the newer TLS) for secure communication over the network.

 Note

SAP Data Services supports Apache Hive and HiveServer2 version 0.11 and higher. Support for DSN-less connections for Hive datastores begins with Data Services 4.2.12. For the most recent compatibility information, see the Product Availability Matrix on the SAP Support Portal.

Hive adapter datastores [page 45] Use a Hive adapter datastore to connect to a Hive server and work with your tables stored in Hadoop.

Hive database datastores [page 62] Use a Hive database datastore to access data in Hadoop through the HiveServer2.

Related Information

Hive adapter installation and configuration [page 47] Hive adapter datastore configuration options [page 48] Creating a DSN connection with SSL protocol in Windows [page 64]

3.5.1 Hive adapter datastores

Use a Hive adapter datastore to connect to a Hive server and work with your tables stored in Hadoop.

Import Hive tables and use them as sources or targets in data flows.

 Note

Data Services supports Apache Hive and HiveServer2 version 0.11 and higher. For the most recent compatibility information, see the Product Availability Matrix (PAM) on the SAP Support Portal.

The Hive adapter datastore requires a supported Hive ODBC driver, such as Cloudera, to connect remotely to the Hive Server.

For more information about Hadoop and Hive, see the Supplement for Hadoop.

Which type of Hive datastore to use

SAP Data Services supports two types of Hive datastores: Hive adapter datastore and Hive database datastore.

Use a Hive database datastore when SAP Data Services is installed on a machine either within the Hadoop cluster or not. Use the Hive database datastore for a DSN or a DSN-less connection. Also include SSL (or the newer TLS) for secure communication over the network. For details, see the Supplement for Hadoop.

Use the Hive adapter datastore when Data Services is installed within the Hadoop cluster. Use the Hive adapter datastore for server-named (DSN-less) connections. Also include SSL (or the newer TLS) for secure communication over the network. For more information about adapters, see the Supplement for Adapters.

Hive adapter installation and configuration [page 47] Add and configure the Hive adapter instance before you run jobs using information from the Hive adapter.

Hive adapter datastore configuration options [page 48] To configure a Hive adapter datastore, include connection information to your data in Hadoop.

Configuring Kerberos authentication for Hive connection [page 50] Data Services supports Kerberos authentication for Hadoop and Hive data sources when you use Hadoop and Hive services that are Kerberos enabled.

SSL connection support for Hive adapter [page 52] SAP Data Services supports SSL connections for the Hive adapter.

Metadata mapping for Hive [page 53] SAP Data Services matches Hive metadata data types to supported data types when it reads data from Hive adapter sources.

Apache Hive data type conversion [page 54] SAP Data Services converts some Apache Hive data types when importing data and when loading data into external tables or files.

Hive adapter source options [page 55] Set specific source options when you use a Hive adapter datastore table as a source in a data flow.

Hive adapter target options [page 56] Set specific Hive target options in the target editor when you use a Hive adapter datastore table as a target in a data flow.

Hive adapter datastore support for SQL function and transform [page 58] The Hive adapter datastore can process data using SQL functions and the SELECT statement in a SQL transform.

Pushing the JOIN operation to Hive [page 59] Stage non-Hive data in a dataflow with the Data Transfer transform before joining it with a Hive source.

About partitions [page 60]

SAP Data Services imports Hive partition columns the same way as regular columns, but displays the partition columns at the end of the table column list.

Previewing Hive table data [page 60] After you import Hive table metadata using a Hive datastore, preview data in Hive tables.

Using Hive template tables [page 61] You can use Hive template tables as targets in your data flow.

Parent topic: Connect to Hive [page 45]

Related Information

Hive database datastores [page 62]

3.5.1.1 Hive adapter installation and configuration

Add and configure the Hive adapter instance before you run jobs using information from the Hive adapter.

Add the Hive adapter, create an adapter instance, configure an adapter instance, and set adapter operations in the Data Services Management Console Administrator. Adapter operations identify the integration options available for the configured adapter instance. After you complete the configuration and set up in Administrator, create a Hive adapter datastore in Data Services.

The Data Services installer automatically installs the Hive adapter. This applies to Data Services version 4.1 SP1 and later.

For steps to add and configure an adapter instance in Administrator, see the Supplement for Adapters.

Parent topic: Hive adapter datastores [page 45]

Related Information

Hive adapter datastore configuration options [page 48] Configuring Kerberos authentication for Hive connection [page 50] SSL connection support for Hive adapter [page 52] Metadata mapping for Hive [page 53] Apache Hive data type conversion [page 54] Hive adapter source options [page 55] Hive adapter target options [page 56] Hive adapter datastore support for SQL function and transform [page 58] Pushing the JOIN operation to Hive [page 59] About partitions [page 60]

Previewing Hive table data [page 60] Using Hive template tables [page 61] Supplement for Adapters: Adapter installation and configuration, Adding and configuring an adapter instance

3.5.1.2 Hive adapter datastore configuration options

To configure a Hive adapter datastore, include connection information to your data in Hadoop.

The following table contains descriptions for the datastore configuration options that apply to the Hive adapter datastore.

Hive adapter datastore option descriptions

Option Description

Datastore Type Select Adapter.

Adapter Instance Name Select the specific instance that you created in the Management Console.

Advanced options

User name Specifies the user name associated with the data to which you are connecting.

If you select Kerberos for the Authentication, include the Kerberos realm with the user name. For example: [email protected].

If you select Kerberos keytab for the Authentication, do not complete the User name option.

Password Specifies the password associated with the data to which you are connecting.

Local working directory Specifies the path to your local working directory.

HDFS working directory Specifies the path to your Hadoop Distributed File System (HDFS) directory. If you leave this blank, Data Services uses /user/sapds_hivetmp as the default.

 Note

If you use the Beeline CLI, enter the directory that your administrator created, and assign permission 755 to each directory in the path (see the example after this table).

String size Specifies the size of the Hive STRING datatype. The default is 100.


SSL enabled Specifies whether to use SSL (Secure Socket Layer), or the newer Transport Layer Security (TLS), for secure communication over the network.

Select Yes to use an SSL connection to connect to the Hive server.

 Note

If you use Kerberos or Kerberos keytab for authentication, set this option to No.

SSL Trust Store Specifies the path and file name of the trust store that verifies credentials and stores certificates.

Trust Store Password Specifies the password associated with the trust store.

Authentication Indicates the type of authentication you are using for the Hive connection:

● None ● Kerberos ● Kerberos keytab

 Note

Complete the remaining Kerberos options based on your selection for Authentication.

Additional Properties Specifies additional connection properties.

For multiple property value pairs, use a semicolon as a delimiter between pairs. End the string of property values with a semicolon.

 Example

name1=value1; name2=value2;

To enable SASL-QOP support, set the Authentication option to Kerberos. Then enter one of the following values, which should match the value on the Hive server:

● Authentication only: ;sasl.qop=auth;
● Authentication with integrity protection: ;sasl.qop=auth-int;
● Authentication with integrity and confidentiality protection: ;sasl.qop=auth-conf;
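As a rough sketch of the HDFS working directory note above (the one about the Beeline CLI), you could prepare the directory and its permissions from a Hadoop client before creating the datastore. The path shown is the documented default, and the commands assume you have rights to create directories under /user:

$ hdfs dfs -mkdir -p /user/sapds_hivetmp
$ hdfs dfs -chmod 755 /user/sapds_hivetmp     # repeat for any parent directory that was newly created
$ hdfs dfs -ls -d /user/sapds_hivetmp         # verify the directory and its permissions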

Parent topic: Hive adapter datastores [page 45]

Related Information

Hive adapter installation and configuration [page 47] Configuring Kerberos authentication for Hive connection [page 50] SSL connection support for Hive adapter [page 52] Metadata mapping for Hive [page 53] Apache Hive data type conversion [page 54] Hive adapter source options [page 55] Hive adapter target options [page 56] Hive adapter datastore support for SQL function and transform [page 58] Pushing the JOIN operation to Hive [page 59] About partitions [page 60] Previewing Hive table data [page 60] Using Hive template tables [page 61] Configuring Kerberos authentication for Hive connection [page 50] Hive adapter datastores [page 45]

3.5.1.3 Configuring Kerberos authentication for Hive connection

Data Services supports Kerberos authentication for Hadoop and Hive data sources when you use Hadoop and Hive services that are Kerberos enabled.

Configure Kerberos authentication in your Hive adapter or Hive database datastore with a DSN or server name (DSN-less) connection.

 Note

● You cannot use SSL and Kerberos or Kerberos keytab authentication together. Set the SSL enabled option to No when using Kerberos authentication. ● To enable SASL-QOP support for Kerberos, enter a sasl.qop value into the Additional Properties field in the datastore editor.

Ensure that your Hive service is Kerberos-enabled and that you have the required Kerberos information to complete the configuration in the datastore.

To configure the datastore for Kerberos authentication, create a Hive datastore or edit an existing Hive datastore, and perform the following steps.

1. Select the authentication type.
   a. In the Hive adapter datastore editor, select Kerberos for Authentication.
   b. In the Hive database datastore editor, select Kerberos for Hive Authentication.
2. Complete the Kerberos options as described in the following table.

Kerberos option descriptions for Hive adapter datastore

Option Description

Kerberos Realm Specifies the name of your Kerberos realm. A realm contains the services, host machines, application servers, and so on, that users can access. For example, BIGDATA.COM.

Kerberos KDC Specifies the server name of the Key Distribution Center (KDC). The KDC database stores secret keys for user machines and services.

Configure the Kerberos KDC with renewable tickets (ticket validity as required by Hadoop Hive installation).

 Note

Data Services supports MIT KDC and Microsoft AD for Kerberos authentication.

Kerberos Hive Principal The Hive principal name for the KDC. The name can be the same as the user name that you use when installing Data Services. Find the Hive service principal information in the hive-site.xml file. For example, hive/<host>@<realm>.

Kerberos Keytab location Location for the applicable Kerberos keytab that you generated for this connection.

A Kerberos keytab file contains a list of authorized users for a specific password. SAP Data Services uses the keytab information instead of the entered password in the Username and Password option. For more information about keytabs, see the MIT Kerberos documentation on the Massachusetts Institute of Technology (MIT) Website.

Kerberos option descriptions for Hive database datastore Option Description

Kerberos Service Name Specifies the name of the Kerberos service for Kerberos authentication.

Kerberos Host FQDN Specifies the Fully Qualified Domain Name (FQDN) for the Kerberos host.

Kerberos Realm Specifies the name of your Kerberos realm. A realm contains the services, host machines, application servers, and so on, that users can access. For example, BIGDATA.COM.

3. Complete the remaining options in the datastore as applicable.
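Before you save the datastore, it can help to confirm that the Kerberos principal and keytab actually work from the Job Server host. A minimal sketch, assuming the MIT Kerberos client tools are installed; the keytab path and principal are hypothetical and should match the values you enter in the datastore:

$ kinit -kt /security/keytabs/dsuser.keytab [email protected]   # authenticate using the keytab
$ klist                                                        # a valid ticket for the BIGDATA.COM realm should be listed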

Task overview: Hive adapter datastores [page 45]

Related Information

Hive adapter installation and configuration [page 47] Hive adapter datastore configuration options [page 48] SSL connection support for Hive adapter [page 52] Metadata mapping for Hive [page 53] Apache Hive data type conversion [page 54] Hive adapter source options [page 55] Hive adapter target options [page 56] Hive adapter datastore support for SQL function and transform [page 58] Pushing the JOIN operation to Hive [page 59] About partitions [page 60] Previewing Hive table data [page 60] Using Hive template tables [page 61]

3.5.1.4 SSL connection support for Hive adapter

SAP Data Services supports SSL connections for the Hive adapter.

The SSL connection support is through a database connection. For a database connection, configure SSL options on the Data Services server side. Provide Data Services with the necessary certificates.

Data Services automatically includes certificates in its Java keystore so that it recognizes an adapter datastore instance as a trusted Web site. However, if there’s an error regarding a certificate, manually add a certificate back into the Java keystore. For instructions, see the Supplement for Adapters.

Parent topic: Hive adapter datastores [page 45]

Related Information

Hive adapter installation and configuration [page 47] Hive adapter datastore configuration options [page 48] Configuring Kerberos authentication for Hive connection [page 50] Metadata mapping for Hive [page 53] Apache Hive data type conversion [page 54] Hive adapter source options [page 55] Hive adapter target options [page 56] Hive adapter datastore support for SQL function and transform [page 58] Pushing the JOIN operation to Hive [page 59] About partitions [page 60] Previewing Hive table data [page 60] Using Hive template tables [page 61]

SSL connection support

3.5.1.5 Metadata mapping for Hive

SAP Data Services matches Hive metadata data types to supported data types when it reads data from Hive adapter sources.

The following table shows the conversion between Hive data types and Data Services data types when Data Services imports metadata from a Hive source or target.

Data type mapping between Hive and Data Services Hive data type Data Services data type

tinyint int

smallint int

int int

bigint decimal(20,0)

float real

double double

string varchar

boolean varchar(5)

complex not supported

Parent topic: Hive adapter datastores [page 45]

Related Information

Hive adapter installation and configuration [page 47] Hive adapter datastore configuration options [page 48] Configuring Kerberos authentication for Hive connection [page 50] SSL connection support for Hive adapter [page 52] Apache Hive data type conversion [page 54] Hive adapter source options [page 55] Hive adapter target options [page 56] Hive adapter datastore support for SQL function and transform [page 58] Pushing the JOIN operation to Hive [page 59] About partitions [page 60] Previewing Hive table data [page 60] Using Hive template tables [page 61]

3.5.1.6 Apache Hive data type conversion

SAP Data Services converts some Apache Hive data types when importing data and when loading data into external tables or files.

The following table shows the conversion between Apache Hive data types and Data Services data types. Data Services converts Apache Hive data types to Data Services data types when you import metadata from an Apache Hive source or target into the repository. Data Services also converts data types back to Apache Hive data types when it loads data into an external table or file.

Hive data type Data Services data type Additional information

TINYINT INT

SMALLINT INT

INT/INTEGER INT

BIGINT DECIMAL(19,0) By default, the precision is 19.

FLOAT DOUBLE

DOUBLE DOUBLE

DECIMAL DECIMAL

VARCHAR VARCHAR

CHAR VARCHAR

STRING VARCHAR(255)

BOOLEAN INT

TIMESTAMP DATETIME

Date Date

INTERVAL Not Supported Available with Hive 1.2.0 and later

complex Not Supported Complex types are array, map, and so on.

If Data Services encounters a column that has an unsupported data type, it does not import the column. However, you can configure Data Services to import unsupported data types. In the applicable datastore, check the Import unsupported data types as VARCHAR of size checkbox located in the left corner of the datastore editor dialog box.

Parent topic: Hive adapter datastores [page 45]

Related Information

Hive adapter installation and configuration [page 47] Hive adapter datastore configuration options [page 48] Configuring Kerberos authentication for Hive connection [page 50] SSL connection support for Hive adapter [page 52] Metadata mapping for Hive [page 53] Hive adapter source options [page 55] Hive adapter target options [page 56] Hive adapter datastore support for SQL function and transform [page 58] Pushing the JOIN operation to Hive [page 59] About partitions [page 60] Previewing Hive table data [page 60] Using Hive template tables [page 61]

3.5.1.7 Hive adapter source options

Set specific source options when you use a Hive adapter datastore table as a source in a data flow.

Open the Adapter Source tab of the source table editor and complete the options. The following table contains Hive-specific options.

Hive adapter source option descriptions Option Description

Clean up working directory Specifies whether Data Services cleans up the working directory after the job completes.

● True: Deletes the working directory after successful job completion.
● False: Doesn't delete the working directory after successful job completion.

Execution engine type Specifies the type of engine to use for executing the job.

● Default: Uses the default Hive engine.
● Spark: Uses the Spark engine to read data from Spark.
● Map Reduce: Uses the Map Reduce engine to read data from Hive.

Parallel process threads Specifies the number of threads for parallel processing.

More than one thread can improve performance by maximizing CPU usage on the Job Server computer. For example, if you have four CPUs, enter 4 for the number of parallel process threads.

Parent topic: Hive adapter datastores [page 45]

Related Information

Hive adapter installation and configuration [page 47] Hive adapter datastore configuration options [page 48] Configuring Kerberos authentication for Hive connection [page 50] SSL connection support for Hive adapter [page 52] Metadata mapping for Hive [page 53] Apache Hive data type conversion [page 54] Hive adapter target options [page 56] Hive adapter datastore support for SQL function and transform [page 58] Pushing the JOIN operation to Hive [page 59] About partitions [page 60] Previewing Hive table data [page 60] Using Hive template tables [page 61] Parallel process threads for flat files

3.5.1.8 Hive adapter target options

Set specific Hive target options in the target editor when you use a Hive adapter datastore table as a target in a data flow.

Open the Adapter Target tab of the target table editor and complete the Hive-specific options as described in the following table.

Hive adapter target option descriptions Option Description

Append Specifies whether Data Services appends new data to the table or partition.

● True: Adds new data to the existing data in the table or partition.
● False: Deletes all existing data and adds new data to the table or partition.

Clean up working directory Specifies whether Data Services cleans up the working directory after the job completes.

● True: Deletes the working directory after successful job completion.
● False: Doesn't delete the working directory after successful job completion.


Dynamic partition Specifies whether Hive evaluates the table partitions when it scans data before loading.

● True: Uses table partitions when scanning data before loading. ● False: Uses static partitions for loading data.

SAP Data Services supports only all-dynamic or only all-static partitions.

Drop and re-create table before loading Specifies whether to drop the existing table and create a new table with the same name before loading.

● True: Drops the existing table and creates a new table before loading data.
● False: Doesn't drop the existing table, but uses the existing table for loading data.

The Drop and re-create table before loading option is applicable only when you use template tables in the design or test environment.

Number of loaders Specifies the number of loaders (threads) to run in parallel for loading data to the target table.

Specify a non-negative integer. The default is 1.

There are two types of loaders based on the number you enter:

● Single loader loading: Loading with one loader. ● Parallel loading: Loading when the number of loaders is greater than one.

Parent topic: Hive adapter datastores [page 45]

Related Information

Hive adapter installation and configuration [page 47] Hive adapter datastore configuration options [page 48] Configuring Kerberos authentication for Hive connection [page 50] SSL connection support for Hive adapter [page 52] Metadata mapping for Hive [page 53] Apache Hive data type conversion [page 54] Hive adapter source options [page 55] Hive adapter datastore support for SQL function and transform [page 58] Pushing the JOIN operation to Hive [page 59]

About partitions [page 60]

3.5.1.9 Hive adapter datastore support for SQL function and transform

The Hive adapter datastore can process data using SQL functions and the SELECT statement in a SQL transform.

When you use a Hive table in a data flow, use the SQL transform and add SQL functions to manipulate the data in the table.

● Use the SQL Transform to select specific data from the Hive table to process.

 Note

The SQL transform supports only a single SELECT statement. Also, SAP Data Services does not support SELECT for table columns with a constant expression.

● Use a sql() function to manipulate data in the following ways:
  ○ Create, drop, or insert into Hive tables
  ○ Return a single string value from a Hive table
  ○ Run a SELECT on a Hive table that contains aggregate functions (max, min, count, avg, and sum)
  ○ Perform inner and outer joins
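Because the SQL transform accepts only a single SELECT statement, it can help to validate the statement outside Data Services first. A sketch using Hive's Beeline client, which is separate from Data Services; the host, port, user, and table names are placeholders:

$ beeline -u "jdbc:hive2://hiveserver-host:10000/default" -n dsuser \
    -e "SELECT id, amount FROM sales WHERE amount > 100"

If the statement runs cleanly in Beeline, paste the same SELECT into the SQL transform.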

Parent topic: Hive adapter datastores [page 45]

Related Information

Hive adapter installation and configuration [page 47] Hive adapter datastore configuration options [page 48] Configuring Kerberos authentication for Hive connection [page 50] SSL connection support for Hive adapter [page 52] Metadata mapping for Hive [page 53] Apache Hive data type conversion [page 54] Hive adapter source options [page 55] Hive adapter target options [page 56] Pushing the JOIN operation to Hive [page 59] About partitions [page 60] Previewing Hive table data [page 60] Using Hive template tables [page 61]

3.5.1.10 Pushing the JOIN operation to Hive

Stage non-Hive data in a dataflow with the Data Transfer transform before joining it with a Hive source.

When you include a join operation in a data flow between Hive and non-Hive data, stage the non-Hive data before the operation for better performance. Staging the data is more efficient because SAP Data Services doesn't have to read all the data from the Hive data source into memory before performing the join.

Before you stage the data, enable the Enable automatic data transfer option in the Hive datastore editor.

When you construct the data flow, add the Data_Transfer transform. Open the transform editor and make the following settings:

● Transfer Type = Table ● Database type = Hive

 Caution

For non-Hive relational databases: In the Data_Transfer transform, if the option Data Transfer Type is set to Automatic, disable the option Enable automatic data transfer. This rule applies to all relational databases except for Hive.

Parent topic: Hive adapter datastores [page 45]

Related Information

Hive adapter installation and configuration [page 47] Hive adapter datastore configuration options [page 48] Configuring Kerberos authentication for Hive connection [page 50] SSL connection support for Hive adapter [page 52] Metadata mapping for Hive [page 53] Apache Hive data type conversion [page 54] Hive adapter source options [page 55] Hive adapter target options [page 56] Hive adapter datastore support for SQL function and transform [page 58] About partitions [page 60] Previewing Hive table data [page 60] Using Hive template tables [page 61] Data_Transfer transform for push-down operations

3.5.1.11 About partitions

SAP Data Services imports Hive partition columns the same way as regular columns, but displays the partition columns at the end of the table column list.

The column attribute Partition Column identifies whether the column is partitioned.

When loading to a Hive target, select whether or not to use the Dynamic partition option on the Adapter Target tab of the target table editor.

Hive evaluates the partitioned data dynamically when it scans the data. If Dynamic partition is not selected, Data Services uses Hive static loading, in which it loads all rows to the same partition. The partitioned data comes from the first row that the loader receives.
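If you are unsure which partitions already exist before choosing between dynamic and static loading, you can inspect the table outside Data Services. A sketch using Beeline; the connection details and table name are placeholders:

$ beeline -u "jdbc:hive2://hiveserver-host:10000/default" \
    -e "SHOW PARTITIONS sales"        # lists the existing partitions of the table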

Parent topic: Hive adapter datastores [page 45]

Related Information

Hive adapter installation and configuration [page 47] Hive adapter datastore configuration options [page 48] Configuring Kerberos authentication for Hive connection [page 50] SSL connection support for Hive adapter [page 52] Metadata mapping for Hive [page 53] Apache Hive data type conversion [page 54] Hive adapter source options [page 55] Hive adapter target options [page 56] Hive adapter datastore support for SQL function and transform [page 58] Pushing the JOIN operation to Hive [page 59] Previewing Hive table data [page 60] Using Hive template tables [page 61] Hive adapter target options [page 56]

3.5.1.12 Previewing Hive table data

After you import Hive table metadata using a Hive datastore, preview data in Hive tables.

To preview Hive table data, first import table metadata using the Hive datastore. Then, right-click a Hive table name in the SAP Data Services Designer object library and click View Data.

Alternatively, click the magnifying glass icon on Hive source and target objects in a data flow or open the View Data tab of the Hive table view.

 Note

The ability to preview Hive table data is available only with Apache Hive version 1.1 and later.

For more information about how to use the View Data tab, see the Designer Guide.

Parent topic: Hive adapter datastores [page 45]

Related Information

Hive adapter installation and configuration [page 47] Hive adapter datastore configuration options [page 48] Configuring Kerberos authentication for Hive connection [page 50] SSL connection support for Hive adapter [page 52] Metadata mapping for Hive [page 53] Apache Hive data type conversion [page 54] Hive adapter source options [page 55] Hive adapter target options [page 56] Hive adapter datastore support for SQL function and transform [page 58] Pushing the JOIN operation to Hive [page 59] About partitions [page 60] Using Hive template tables [page 61] Using View Data

3.5.1.13 Using Hive template tables

You can use Hive template tables as targets in your data flow.

Ensure that the Hive adapter datastore is correctly configured in both SAP Data Services Management Console and SAP Data Services. To add a Hive template table as a target, start to create a data flow in Data Services Designer and perform the following steps:

1. Add a Hive template table object to the data flow.

Use one of two methods:
   ○ Select a template table icon from the toolbar at right and click anywhere in the data flow in the workspace.
   ○ Expand the Template node under the applicable Hive adapter datastore in the object library and drag and drop a template table onto your workspace. Note that the template table has to already exist before it is in the object library.

   The Create Template dialog box opens.
2. Enter a template table name in Template name.
3. Select the applicable Hive datastore name from the In datastore dropdown list.
4. Enter the Hive dataset name in Owner name.
5. Select the format of the table from the Format dropdown list.
6. Click OK to close the Create Template dialog box.

7. Connect the Hive template table to the data flow.
8. Click the template table target icon in the data flow to open the target editor.
9. Open the Target tab and set applicable options. The software completes the input and output schema areas based on the schema in the stated Hive dataset.
10. Save your changes and execute the applicable job.

Data Services opens the applicable Hive project and dataset, and creates the table. The table name is the name that you entered for Template name in the Create Template window. Data Services populates the table with the data generated from the data flow.

Task overview: Hive adapter datastores [page 45]

Related Information

Hive adapter installation and configuration [page 47] Hive adapter datastore configuration options [page 48] Configuring Kerberos authentication for Hive connection [page 50] SSL connection support for Hive adapter [page 52] Metadata mapping for Hive [page 53] Apache Hive data type conversion [page 54] Hive adapter source options [page 55] Hive adapter target options [page 56] Hive adapter datastore support for SQL function and transform [page 58] Pushing the JOIN operation to Hive [page 59] About partitions [page 60] Previewing Hive table data [page 60]

3.5.2 Hive database datastores

Use a Hive database datastore to access data in Hadoop through the HiveServer2.

Use a Hive database datastore for the following tasks:

● Import Hadoop tables and use them as sources and targets in data flows ● Use a Hive template table in your data flow. ● Preview data from tables.

Configure a Hive database datastore with either a DSN or DSN-less connection, and include SSL encryption. Additionally, select one of the following Hive authentications:

● User name and password ● Kerberos ● User name

● No authentication

Supported Hive ODBC Drivers

The Hive database datastore supports the following ODBC drivers:

● Cloudera ● Hortonworks ● MapR

For more information about the specific driver versions currently supported, see the Product Availability Matrix (PAM) on the SAP Support Portal.

Limitations

● Operations such as DELETE and UPINSERT are not natively supported by the HiveServer2. ● Parameterized SQL is not supported; the HiveServer2 does not support the parameter marker.

Configuring ODBC driver in Windows [page 64] For Windows, configure the Hive ODBC driver using the ODBC Drivers Selector utility.

Creating a DSN connection with SSL protocol in Windows [page 64] Use the ODBC Data Source Administrator to create a DSN connection to use with a Hive database datastore, and optionally configure SSL/TLS protocol.

Configuring ODBC driver with SSL protocol for Linux [page 65] For Linux, use the SAP Data Services Connection Manager to configure the ODBC driver and configure SSL protocol for the Hive database datastore.

Configuring bulk loading for Hive [page 67] Use a combination of Hadoop objects to configure bulk loading for Hive targets in a data flow.

Hive database datastore option descriptions [page 68] Complete options in the Hive database datastore to configure connection types, authorizations, and SSL security protocol settings.

Parent topic: Connect to Hive [page 45]

Related Information

Hive adapter datastores [page 45]

3.5.2.1 Configuring ODBC driver in Windows

For Windows, configure the Hive ODBC driver using the ODBC Drivers Selector utility.

Download and install a supported ODBC driver for Hive. Supported drivers include the following:

● Cloudera ● Hortonworks ● MapR

Obtain driver downloads and related information by going to the product Web page for the driver type that you select.

1. Open the ODBC Drivers Selector utility located in \bin\ODBCDriversSelector.exe.
2. Select the ODBC driver for Hive server under the Database versions column.
3. Click the corresponding cell under the ODBC Drivers column and select the correct driver from the dropdown list.
4. Look at the value in the State column. If the status doesn't appear in the State column, click in the empty cell and the applicable status appears.

The state should be “Installed.” If the state is anything other than “Installed,” you may not have properly installed the ODBC driver. Exit the ODBC Drivers Selector utility and check your driver installation for errors. After you correct the errors, or reinstall the ODBC driver, repeat the steps to configure the driver.

Task overview: Hive database datastores [page 62]

Related Information

Creating a DSN connection with SSL protocol in Windows [page 64] Configuring ODBC driver with SSL protocol for Linux [page 65] Configuring bulk loading for Hive [page 67] Hive database datastore option descriptions [page 68]

3.5.2.2 Creating a DSN connection with SSL protocol in Windows

Use the ODBC Data Source Administrator to create a DSN connection to use with a Hive database datastore, and optionally configure SSL/TLS protocol.

Perform the following prerequisites before you configure a DSN connection with SSL protocol:

● Download and install a supported ODBC driver. For information, see Configuring ODBC driver in Windows [page 64]. ● Generate an SSL certificate and key file by following the instructions in your Hive documentation. ● Access the ODBC Data Source Administrator in one of two ways:

○ Create the Hive database datastore, select Use data source name (DSN), and click ODBC Admin....
○ Open the ODBC Data Source Administrator using the Start menu in Windows.

1. In the ODBC Data Source Administrator, select the applicable tab and click Add.

   Tabs: Select User DSN to create a DSN that is visible only to you. Select System DSN to create a DSN that is available to all users.
2. Select the applicable Hive ODBC driver from the list of drivers and click Finish.
3. Complete the applicable options in the DSN Setup dialog box.

 Example

If you installed the Cloudera ODBC driver, the dialog box is Cloudera ODBC Driver for Apache Hive DSN Setup.

The options to complete are based on the type of ODBC driver you use, and the service discovery mode.
4. Optional. Perform the following substeps to configure SSL protocol for your connection to Hive:
   a. Click SSL Options located at the bottom of the DSN Setup dialog box. The SSL Options dialog box opens.
   b. Select to enable SSL and enter values for the SSL options as applicable.

      For descriptions of the SSL options, see Hive database datastore option descriptions [page 68].
   c. Click Apply.
5. Click OK to close the ODBC Data Source Administrator and save your DSN.

Create the Hive database datastore and select the DSN that you just created.

Task overview: Hive database datastores [page 62]

Related Information

Configuring ODBC driver in Windows [page 64] Configuring ODBC driver with SSL protocol for Linux [page 65] Configuring bulk loading for Hive [page 67] Hive database datastore option descriptions [page 68]

3.5.2.3 Configuring ODBC driver with SSL protocol for Linux

For Linux, use the SAP Data Services Connection Manager to configure the ODBC driver and configure SSL protocol for the Hive database datastore.

Perform the following prerequisites before configuring the ODBC driver:

● Download and install a supported ODBC driver. Supported drivers include the following:
  ○ Cloudera
  ○ Hortonworks
  ○ MapR
● Generate an SSL certificate and key file following instructions in your Hive documentation.
● The Connection Manager is a command-line utility. However, install the GTK+2 library to make a UI for Connection Manager. For more information about obtaining and installing GTK+2, see https://www.gtk.org/. The following instructions assume that you have the user interface for Connection Manager.

1. Open the Connection Manager by entering the following command:

$ cd $LINK_DIR/bin/

$ ./DSConnectionManager.sh

2. Select to configure the ODBC driver and enter values for the following information when the utility prompts you:

○ Location of the ODBC inst file
○ Driver version
○ Driver name
○ User name for the database
○ User password for the database
○ Driver location and file name
○ Host name
○ Port number
○ UNIX ODBC library path
3. Press Enter to go to the main menu.
4. Select to create the SSL protocol and enter values for the following parameters when the utility prompts you:

○ Hive authentication: noauth, kerberos, user, or user-passwd.
○ Hive SSL mode: disabled or enabled
○ Hive SSL Server Certificate File
○ Hive two way SSL
○ Hive SSL Client Certificate File
○ Hive SSL Client Key File
○ Hive SSL Client Key Password

 Note

For descriptions for the SSL parameters, see Hive database datastore option descriptions [page 68].
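After the Connection Manager writes the driver and DSN entries, you can sanity-check the connection outside Data Services with unixODBC's isql utility, if it is installed. The DSN name and credentials below are placeholders:

$ isql -v HIVE_DSN dsuser mypassword    # a successful connection shows the SQL> prompt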

Task overview: Hive database datastores [page 62]

Related Information

Configuring ODBC driver in Windows [page 64] Creating a DSN connection with SSL protocol in Windows [page 64]

Configuring bulk loading for Hive [page 67] Hive database datastore option descriptions [page 68]

3.5.2.4 Configuring bulk loading for Hive

Use a combination of Hadoop objects to configure bulk loading for Hive targets in a data flow.

Create the following objects:

● HDFS file location object ● HDFS file format ● Hive database datastore

To set up bulk loading to Hive, follow these steps:

1. Open the Format tab in the Local Object Library and expand the HDFS Files node.
2. Select the HDFS file format that you created for this task and drag it onto your data flow workspace.
3. Select Make Source.
4. Add the applicable transform objects to your data flow.
5. Add a template table as a target to the data flow:
   a. Select the template table icon from the tool palette at right.
   b. Click on a blank space in your data flow workspace.

   The Create Template dialog box opens.
6. Complete Template name with a new name for the target.
7. Select the Hive database datastore that you created for this task from the In datastore dropdown list.
8. Select a format from the Formats dropdown list.
9. Click OK.
10. Connect the template to the data flow.
11. In your data flow workspace, open the target table and open the Bulk Loader Options tab.

   The Bulk Load option is selected by default.
12. Select a mode from the Mode dropdown list.

   Because the target is a newly-created table, there is no data in the table. However, if you use the data flow in subsequent runs, the Mode affects the data in the target table.
   ○ Append: Adds new records generated from Data Services processing to the existing data in the target table.
   ○ Truncate: Replaces all existing records in the existing target table with the records generated from Data Services processing.
13. Select the HDFS file location object that you created for this task from the HDFS File Location drop-down list.
14. Complete the remaining target options as applicable.

Task overview: Hive database datastores [page 62]

Related Information

Configuring ODBC driver in Windows [page 64] Creating a DSN connection with SSL protocol in Windows [page 64] Configuring ODBC driver with SSL protocol for Linux [page 65] Hive database datastore option descriptions [page 68] Data flows Apache Hadoop [page 14]

3.5.2.5 Hive database datastore option descriptions

Complete options in the Hive database datastore to configure connection types, authorizations, and SSL security protocol settings.

The following table contains datastore option descriptions specific to a DSN-less Hive database datastore.

Hive database datastore option descriptions Option Description

Datastore Type Select Database.

Database Type Select Hive.

Database Subtype Select Hive Server2.

Database Version Select the applicable version.

The following options appear when you create a server-named (DSN-less) connection.

Database server name Specifies the server for the client.

Hive Authentication Specifies the type of authentication to use to access the Hive Server.

● User Name and Password: Requires that you enter the authorized user for the Hive Server in User Name, and enter the related password in Password.
● Kerberos: Requires that you enter Kerberos authentication information.
● User Name: Requires that you enter the authorized user.
● No Authentication: No additional information required.

For more information about Kerberos, see Configuring Kerberos authentication for Hive connection [page 50].

Complete the following options for SSL encryption. Applicable for DSN-less connection.

Use SSL encryption Select Yes.

Encryption Parameters Opens the Encryption Parameters dialog box.

To open, double-click in the empty cell next to the option or click the … icon that appears at the end of the cell when you place your cursor in the cell.


Complete the following options in the Encryption Parameters dialog box.

Allow Common Name Host Name Mismatch Specifies whether the CA certificate name can be different than the Hive Server host name.

Select this option to allow the CA certificate name to be different than the Hive Server host name.

Allow Self-Signed Server Certificate Specifies whether to allow a certificate signer to be the certificate approver.

Select this option to allow the same signatures.

Trusted Certificates Specifies the path for the directory of Certificate Authority certificate files.

Two-Way SSL Specifies whether to allow two-way SSL authentication.

Client Certificate File Specifies the location of the client certificate file.

Client Private Key File Specifies the location of the client private key file.

Client Private Key Password Specifies the password to access the client private key file.

Parent topic: Hive database datastores [page 62]

Related Information

Configuring ODBC driver in Windows [page 64] Creating a DSN connection with SSL protocol in Windows [page 64] Configuring ODBC driver with SSL protocol for Linux [page 65] Configuring bulk loading for Hive [page 67]

3.6 Upload data to HDFS in the cloud

Upload data processed with Data Services to your HDFS that is managed by SAP Cloud Platform Big Data Services.

Big Data Services is a Hadoop distribution in the cloud. Big Data Services performs all Hadoop upgrades and patches for you and provides Hadoop support. SAP Cloud Platform Big Data Services was formerly known as Altiscale.

Upload your big data files directly from your computer to Big Data Services. Or, upload your big data files from your computer to an established cloud account, and then to Big Data Services.

 Example

Access data from S3 (Amazon Simple Storage Service) and use the data as a source in Data Services. Then upload the data to your HDFS that resides in Big Data Services in the cloud.

How you choose to upload your data is based on your use case.

For complete information about accessing your Hadoop account in Big Data Services and uploading big data, see the Supplement for SAP Cloud Platform Big Data Services.

3.7 Google Cloud Dataproc

To connect to an Apache Hadoop web interface running on Google Cloud Dataproc clusters, use a Hive database datastore and a WebHDFS file location.

Use a Hive datastore to browse and view metadata from Hadoop and to import metadata for use in data flows. To upload processed data, use a Hadoop file location and a Hive template table. Implement bulk loading in the target editor in a data flow where you use the Hive template table as a target.

Before you use a Hive datastore for accessing Hadoop data through Google Cloud Dataproc, ensure that you properly configure your Dataproc cluster to allow access from third-party clients. For information and instructions, see your Google Cloud Platform documentation at https://cloud.google.com/products/#data-analytics.

Prepare for accessing Google Dataproc by performing the following tasks:

● Download and configure the supported Hortonworks or Cloudera ODBC driver.
● Create a data source name (DSN) for the connection type in the datastore.
● Enter the IP address for the port-forwarding machine in the DSN Host field (see the port-forwarding sketch after this list).

After all prerequisite tasks are complete, create the Hive datastore and the WebHDFS file location.
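How you establish port forwarding to the Dataproc master depends on your environment, and the Google Cloud documentation is the authority here. One common approach, shown only as a sketch, is an SSH tunnel created with the gcloud CLI from the machine that acts as the port-forwarding host; the cluster name and zone are hypothetical, and 10000 is the default HiveServer2 port noted later in this section:

$ gcloud compute ssh my-cluster-m --zone=us-central1-a -- -L 10000:localhost:10000
# add further -L options if you also need to forward the WebHDFS port for the file location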

Configure driver and data source name (DSN) [page 71] The Hive database datastore requires a supported ODBC driver and a data source name (DSN) connection.

Hive database datastore for Google Dataproc [page 72] The Hive database datastore requires user name and password information to your Google Cloud Dataproc cluster.

Create a WebHDFS file location [page 73] Upload generated data to your WebHDFS through Google Cloud Dataproc cluster by creating a file location.

Related Information

Hive database datastores [page 62] Connect to HDFS [page 31] Configuring bulk loading for Hive [page 67]

3.7.1 Configure driver and data source name (DSN)

The Hive database datastore requires a supported ODBC driver and a data source name (DSN) connection.

Before you create the DSN, download the supported ODBC driver from either Cloudera or Hortonworks. For version information, see the Product Availability Matrix (PAM) on the Customer Support portal.

Create a DSN on Windows

Verify that the ODBC driver is installed by using the ODBC Drivers Selector utility. Then, create a DSN using the Microsoft ODBC Data Source Administrator utility.

 Note

For instructions to configure the DSN, see Creating a DSN connection with SSL protocol in Windows [page 64].

The following table contains the options for creating a DSN in the ODBC Data Source Administrator utility applicable to configuring a DSN for HiveServer2.

Option Value

Hive Server Type Hive Server 2

Service Discovery Mode No Service Discovery

Host(s) Internal IP address for the Google Cloud Dataproc VM where you established port forwarding.

Port Enter the Google Cloud port number for HiveServer2. The default port for HiveServer2 in Google Cloud Dataproc is 10000.

Database The name of the Google Cloud Dataproc network.

Create a DSN on Linux

Configure the driver and create a DSN using the SAP Data Services Connection Manager.

 Note

For steps, see Configuring ODBC driver with SSL protocol for Linux [page 65].

Ensure that you enter the internal IP address for the Google Cloud Dataproc VM where you established port forwarding for the Host name.
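On Linux, the Connection Manager writes the DSN entries to the odbc.ini file for you; the exact keys depend on the driver you downloaded. As a rough, hypothetical sketch only, a DSN entry for a Cloudera Hive ODBC driver pointing at the port-forwarding host might look similar to the following. The DSN name, driver path, schema, and IP address are assumptions, not values from this guide.

 Example

[ODBC Data Sources]
Dataproc_Hive=Cloudera ODBC Driver for Apache Hive

[Dataproc_Hive]
# Path to the installed driver library (assumption; use your installation path)
Driver=/opt/cloudera/hiveodbc/lib/64/libclouderahiveodbc64.so
# Hive Server 2 with no Service Discovery
HiveServerType=2
ServiceDiscoveryMode=0
# Internal IP address of the Google Cloud Dataproc VM used for port forwarding
Host=10.160.205.211
# Default HiveServer2 port in Google Cloud Dataproc
Port=10000
Schema=default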

Parent topic: Google Cloud Dataproc [page 70]

Related Information

Hive database datastore for Google Dataproc [page 72]
Create a WebHDFS file location [page 73]
Using the ODBC Drivers Selector for Windows

3.7.2 Hive database datastore for Google Dataproc

The Hive database datastore requires user name and password information to your Google Cloud Dataproc cluster.

For the Hive datastore, the DSN that you previously created contains the connection information to your Google Dataproc cluster. Ensure that you select the applicable DSN when you create the datastore.

The following table contains the applicable settings to make when you create the Hive datastore.

Option Value

Datastore Type Database

Database Type Hive

Datastore Subtype Hive Server 2

Data Source Name The name of the DSN you created when you configured the ODBC driver.

User Name The user name associated with the project where you configured the Google Dataproc cluster.

Password The password associated with the project where you configured the Google Dataproc cluster.

Parent topic: Google Cloud Dataproc [page 70]

Related Information

Configure driver and data source name (DSN) [page 71]
Create a WebHDFS file location [page 73]
Hive database datastores [page 62]

3.7.3 Create a WebHDFS file location

Upload generated data to your WebHDFS through your Google Cloud Dataproc cluster by creating a file location.

The following table contains options in the file location editor that are specific to Google Cloud Dataproc. Complete the other options as applicable.

 Restriction

Before you use the WebHDFS file location, configure the host on the Job Server following the steps in Configuring host for WebHDFS file location [page 73].

Option Value

Protocol Hadoop

Communication Protocol WEBHDFS

Host Internal IP address for the Google Cloud Dataproc VM where you established port forwarding.

User The user name associated with the project where you configured the Google Dataproc cluster.

Password Password associated with the project where you configured the Google Dataproc cluster.

Remote Directory Path for the working directory in Google Cloud.

Local Directory Path for your local working directory.

Replication Factor Google Cloud Dataproc has a default replication factor of 2.

Parent topic: Google Cloud Dataproc [page 70]

Related Information

Configure driver and data source name (DSN) [page 71]
Hive database datastore for Google Dataproc [page 72]
Creating a file location object
HDFS file location object options [page 32]

3.7.3.1 Configuring host for WebHDFS file location

Before you test or use the WebHDFS file location for the first time, you must map the IP address to the Dataproc cluster host name in the HOSTS file.

1. Open the HOSTS file with a text editor.

Location of the HOSTS file:

○ Linux: /etc/hosts
○ Windows: C:\Windows\System32\drivers\etc\hosts

2. On a new line in the hosts file, enter information that maps the IP address for the port forward machine to your Google Dataproc cluster host using the following syntax:

<IP address of the port forward machine> <node name>.c.<project name>.internal

The master node name is the name of your Dataproc cluster followed by “-m”. Also add a line for each worker node as applicable. The worker node name is the name of your Dataproc cluster followed by “-w” and the worker number, for example “-w-0”.

 Example

Add a line to your HOSTS file for the master node and worker nodes using the following information:
○ Internal IP of the port forward machine: 10.160.205.211
○ Dataproc cluster name: My_Cluster
○ Dataproc project name: My_Project
The strings that you add to the HOSTS file appear as follows:

10.160.205.211 My_Cluster-m.c.My_Project.internal

10.160.205.211 My_Cluster-w-0.c.My_Project.internal

10.160.205.211 My_Cluster-w-1.c.My_Project.internal

3. Save and close the hosts file.

3.8 HP Vertica

Access HP Vertica data in SAP Data Services by creating an HP Vertica database datastore.

Use HP Vertica data as sources or targets in data flows. Implement SSL secure data transfer with MIT Kerberos to access HP Vertica data securely. Additionally, configure options in the source or target table editors to enhance HP Vertica performance.

Enable MIT Kerberos for HP Vertica SSL protocol [page 75] SAP Data Services uses MIT Kerberos 5 authentication to securely access an HP Vertica database using SSL protocol.

Creating a DSN for HP Vertica with Kerberos SSL [page 78] To enable SSL for HP Vertica database datastores, first create a data source name (DSN).

Creating HP Vertica datastore with SSL encryption [page 80] SSL encryption protects data as it transfers between the database server and Data Services.

Increasing loading speed for HP Vertica [page 82] SAP Data Services does not support bulk loading for HP Vertica, but there are settings you can make to increase loading speed.

HP Vertica data type conversion [page 83] SAP Data Services converts incoming HP Vertica data types to native data types, and outgoing native data types to HP Vertica data types.

HP Vertica table source [page 85]

Configure options for an HP Vertica table as a source by opening the source editor in the data flow.

HP Vertica target table configuration [page 86] Configure options for an HP Vertica table as a target by opening the target editor in the data flow.

3.8.1 Enable MIT Kerberos for HP Vertica SSL protocol

SAP Data Services uses MIT Kerberos 5 authentication to securely access an HP Vertica database using SSL protocol.

You must have Database Administrator permissions to install MIT Kerberos 5 on your Data Services client machine. Additionally, the Database Administrator must establish a Kerberos Key Distribution Center (KDC) server for authentication. The KDC server must support Kerberos 5 using the Generic Security Service (GSS) API. The GSS API also supports non-MIT Kerberos implementations, such as Java and Windows clients.

 Note

Specific Kerberos and HP Vertica database processes are required before you can enable SSL protocol in Data Services. For complete explanations and processes for security and authentication, consult your HP Vertica user documentation and the MIT Kerberos user documentation.

MIT Kerberos authorizes connections to the HP Vertica database using a ticket system. The ticket system eliminates the need for users to enter a password.

Parent topic: HP Vertica [page 74]

Related Information

Creating a DSN for HP Vertica with Kerberos SSL [page 78]
Creating HP Vertica datastore with SSL encryption [page 80]
Increasing loading speed for HP Vertica [page 82]
HP Vertica data type conversion [page 83]
HP Vertica table source [page 85]
HP Vertica target table configuration [page 86]
Information to edit configuration or initialization file [page 76]
Generate secure key with kinit command [page 78]

3.8.1.1 Information to edit configuration or initialization file

Descriptions of the Kerberos properties to set in the configuration or initialization file.

After you install MIT Kerberos, define the specific Kerberos properties in the Kerberos configuration or initialization file and save it to your domain. For example, save krb5.ini to C:\Windows.

See the MIT Kerberos documentation for information about completing the Unix krb5.conf property file or the Windows krb5.ini property file. Kerberos documentation is located at: http://web.mit.edu/kerberos/krb5-current/doc/admin/conf_files/krb5_conf.html .

Log file locations for Kerberos

[logging] Locations for Kerberos log files

Property Description

default = The location for the Kerberos library log file, krb5libs.log. For example: default = FILE:/var/log/krb5libs.log

kdc = The location for the Key Distribution Center (KDC) log file, krb5kdc.log. For example: kdc = FILE:/var/log/krb5kdc.log

admin_server = The location for the administrator log file, kadmind.log. For example: admin_server = FILE:/var/log/kadmind.log

Kerberos 5 library settings

[libdefaults] Settings used by the Kerberos 5 library

Property Description

default_realm = The default Kerberos realm for your domain. Example: default_realm = EXAMPLE.COM

The realm must be in all capital letters.

dns_lookup_realm = Set to False: dns_lookup_realm = false

dns_lookup_kdc = Set to False: dns_lookup_kdc = false


ticket_lifetime = Set number of hours for the initial ticket request. For example: ticket_lifetime = 24h

The default is 24h.

renew_lifetime = Set number of days a ticket can be renewed after the ticket lifetime expiration. For example: renew_lifetime = 7d

The default is 0.

forwardable = Initial tickets can be forwarded when this value is set to True. For example: forwardable = true

Kerberos realm values

[realms] Value for each Kerberos realm

Property Description

<KERBEROS REALM> = { } Settings for each Kerberos realm. For example:

EXAMPLE.COM = {kdc = <KDC host> admin_server = <admin server host> kpasswd_server = <Kerberos password server host>}

Properties include:

● KDC location
● Admin Server location
● Kerberos Password Server location

 Note

Host and server names are lowercase.

Kerberos domain realm

[domain_realm]

Property Description

<server host name> = <KERBEROS REALM> Maps the server host name to the Kerberos realm name. If you use a domain name, prefix the name with a period (.).
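Putting the preceding sections together, a minimal krb5.ini (or krb5.conf) might resemble the following sketch. The realm, host names, and log file paths are placeholders for illustration only; use the values for your environment.

 Example

[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log

[libdefaults]
 default_realm = EXAMPLE.COM
 dns_lookup_realm = false
 dns_lookup_kdc = false
 ticket_lifetime = 24h
 renew_lifetime = 7d
 forwardable = true

[realms]
 EXAMPLE.COM = {
  kdc = kdc.example.com
  admin_server = kadmin.example.com
  kpasswd_server = kadmin.example.com
 }

[domain_realm]
 .example.com = EXAMPLE.COM
 example.com = EXAMPLE.COM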

Related Information

Generate secure key with kinit command [page 78]

3.8.1.2 Generate secure key with kinit command

Execute the kinit command to generate a secure key.

After you have updated the configuration or initialization file and saved it to the client domain, execute the kinit command to generate a secure key.

For example, enter the following command using your own information for the variables: kinit <user name>@<KERBEROS REALM>

The command should generate the following keys:

Key Description

-k Precedes the service name portion of the Kerberos principal. The default is vertica.

-K Precedes the instance or host name portion of the Kerberos principal.

-h Precedes the machine host name for the server.

-d Precedes the HP Vertica database name that you want to connect to.

-U Precedes the user name of the administrator user.

See the MIT Kerberos ticket management documentation for complete information about using the kinit command to obtain tickets: http://web.mit.edu/kerberos/krb5-current/doc/user/tkt_mgmt.html .
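As a simple illustration only (the principal name and realm below are hypothetical), request a ticket with kinit and then verify the ticket cache with klist:

 Example

kinit dsadmin@EXAMPLE.COM
klist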

3.8.2 Creating a DSN for HP Vertica with Kerberos SSL

To enable SSL for HP Vertica database datastores, first create a data source name (DSN).

This procedure is for HP Vertica users who have database administrator permissions to perform these steps. Users without database administrator permissions may access the HP Vertica database only when they are associated with an authentication method through a GRANT statement.

To create a DSN for HP Vertica, use SAP Data Services 4.2 SP7 Patch 1 (14.2.7.1) or later version.

Install MIT Kerberos 5 and perform all of the required steps for MIT Kerberos authentication for HP Vertica. See your HP Vertica documentation in the security and authentication sections for details.

1. Open the ODBC Data Source Administrator.

Access the ODBC Data Source Administrator either from the Datastore Editor in Data Services Designer or directly from your Start menu.
2. In the ODBC Data Source Administrator, open the System DSN tab and click Add.
3. Select the applicable HP Vertica driver from the list and click Finish.
4. Open the Basic Settings tab and complete the following options:

HP Vertica ODBC DSN Configuration Basic Settings tab

Option Value

DSN Enter the HP Vertica data source name.

Description Optional. Enter a description for this data source.

Database Enter the name of the database that is running on the server.

Server Enter the server name.

Port Enter the port number on which HP Vertica listens for ODBC connections. The default is 5433.

User Name Enter the database user name. The database user has DBADMIN permission, or is associated with the authentication method through a GRANT statement.

5. Optional. Select Test Connection.

 Note

If the connection fails, either continue with the configuration and fix the connection issue later, or reconfigure the connection information and try to test the connection again.

6. Open the Client Settings tab and complete the options as described in the following table.

HP Vertica ODBC DSN Configuration Client Settings tab

Option Value

Kerberos Host Name Enter the name of the host computer where Kerberos is installed.

Kerberos Service Name Enter the applicable value.

SSL Mode Select Require.

Address Family Preference Select None.

Autocommit Select this option.

Driver String Conversions Select Output.


Result Buffer Size (bytes) Enter the applicable value in bytes. Default is 131072.

Three Part Naming Select this option.

Log Level Select No logging from the dropdown list.

7. Click Test Connection. When the connection test is successful, click OK and close the ODBC Data Source Administrator.

Now the HP Vertica DSN that you just created is included in the DSN option in the datastore editor.

Create the HP Vertica database datastore in Data Services Designer and select the DSN that you just created.

Task overview: HP Vertica [page 74]

Related Information

Enable MIT Kerberos for HP Vertica SSL protocol [page 75]
Creating HP Vertica datastore with SSL encryption [page 80]
Increasing loading speed for HP Vertica [page 82]
HP Vertica data type conversion [page 83]
HP Vertica table source [page 85]
HP Vertica target table configuration [page 86]

3.8.3 Creating HP Vertica datastore with SSL encryption

SSL encryption protects data as it transfers between the database server and Data Services.

An administrator must install MIT Kerberos 5 and enable Kerberos for HP Vertica SSL protocol. Additionally, an administrator must create an SSL data source name (DSN) using the ODBC Data Source Administrator. Then the DSN is available to choose when you create the datastore. See the Administrator Guide for more information about configuring MIT Kerberos.

SSL encryption for HP Vertica is available in SAP Data Services version 4.2 Support Package 7 Patch 1 (14.2.7.1) or later.

 Note

Enabling SSL encryption slows down job performance.

 Note

An HP Vertica database datastore requires that you choose DSN as a connection method. DSN-less connections are not allowed for HP Vertica datastore with SSL encryption.

1. In Designer, select Project > New > Datastore.
2. Complete the options as you would for an HP Vertica database datastore. Complete the following options specifically for SSL encryption:

SSL-specific options

Option Value

Use Data Source Name (DSN) Select this option.

Data Source Name Select the HP Vertica SSL DSN data source file that was created previously in the ODBC Data Source Administrator.

3. Complete the remaining applicable advanced options and save your datastore.

Task overview: HP Vertica [page 74]

Related Information

Enable MIT Kerberos for HP Vertica SSL protocol [page 75]
Creating a DSN for HP Vertica with Kerberos SSL [page 78]
Increasing loading speed for HP Vertica [page 82]
HP Vertica data type conversion [page 83]
HP Vertica table source [page 85]
HP Vertica target table configuration [page 86]
HP Vertica datastore options [page 81]

3.8.3.1 HP Vertica datastore options

Create an HP Vertica database datastore to use as a source or target in a data flow.

The following tables contain datastore configuration options specific to HP Vertica datastore.

Main window

HP Vertica option Description

Database version Select your HP Vertica client version from the drop-down list. This is the version of HP Vertica that this datastore accesses.


Data source name Required. Select a DSN from the dropdown list if you have already defined one. If you haven't defined a DSN previously, click ODBC Admin to define a DSN.

You must first install and configure MIT Kerberos 5 and perform other HP Vertica setup tasks before you can define a DSN.

For more information about HP Vertica MIT Kerberos and DSN for HP Vertica, read the Server Management section of the Administrator Guide.

User name Enter the user name of the account through which SAP Data Services accesses the database.

Password Enter the database password for the user that you entered in User Name.

Related Information

About working with Aliases
Common datastore options

3.8.4 Increasing loading speed for HP Vertica

SAP Data Services does not support bulk loading for HP Vertica, but there are settings you can make to increase loading speed.

For complete details about connecting to HP Vertica, consult the Connecting to HP Vertica guide at https://my.vertica.com/ (copy and paste the URL in your browser to follow the link). Select Documentation and click the applicable version from the dropdown list.

When you load data to an HP Vertica target in a data flow, the software automatically executes an HP Vertica COPY LOCAL statement. This statement makes the ODBC driver read and stream the data file from the client to the server.
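As a simplified illustration of the kind of statement involved (the table and file names are hypothetical, and the exact statement that the software generates may differ), a COPY LOCAL statement streams a client-side file to the server:

 Example

COPY sales.orders FROM LOCAL '/tmp/orders.csv' DELIMITER ',';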

You can further increase loading speed by increasing rows per commit and enabling native connection load balancing:

1. When you configure the ODBC driver for HP Vertica, enable the option to use native connection load balancing.
2. In Designer, open the applicable data flow.
3. In the workspace, double-click the HP Vertica datastore target object to open it.
4. Open the Options tab in the lower pane.

5. Increase the number of rows in the Rows per commit option.

Task overview: HP Vertica [page 74]

Related Information

Enable MIT Kerberos for HP Vertica SSL protocol [page 75]
Creating a DSN for HP Vertica with Kerberos SSL [page 78]
Creating HP Vertica datastore with SSL encryption [page 80]
HP Vertica data type conversion [page 83]
HP Vertica table source [page 85]
HP Vertica target table configuration [page 86]

3.8.5 HP Vertica data type conversion

SAP Data Services converts incoming HP Vertica data types to native data types, and outgoing native data types to HP Vertica data types.

The following table contains HP Vertica data types and the native data types to which Data Services converts them.

HP Vertica data type Data Services data type

Boolean Int

Integer, INT, BIGINT, INT8, SMALLINT, TINYINT Decimal

FLOAT Double

Money Decimal

Numeric Decimal

Number Decimal

Decimal Decimal

Binary, Varbinary, Long Varbinary Blob

Long Varchar Long

Char Varchar

Varchar Varchar


Char(n), Varchar(n) Varchar(n)

DATE Date

TIMESTAMP Datetime

TIMESTAMPTZ Varchar

Time Time

TIMETZ Varchar

INTERVAL Varchar

The following table contains native data types and the HP Vertica data types to which Data Services outputs them. Data Services outputs the converted data types to HP Vertica template tables or Data_Transfer transform tables.

Data Services data type HP Vertica data type

Blob Long Varbinary

Date Date

Datetime Timestamp

Decimal Decimal

Double Float

Int Int

Interval Float

Long Long Varchar

Real Float

Time Time

Varchar Varchar

Timestamp Timestamp

Parent topic: HP Vertica [page 74]

Related Information

Enable MIT Kerberos for HP Vertica SSL protocol [page 75]
Creating a DSN for HP Vertica with Kerberos SSL [page 78]
Creating HP Vertica datastore with SSL encryption [page 80]
Increasing loading speed for HP Vertica [page 82]
HP Vertica table source [page 85]
HP Vertica target table configuration [page 86]

3.8.6 HP Vertica table source

Configure options for an HP Vertica table as a source by opening the source editor in the data flow.

HP Vertica source table options

Option Description

Table name Specifies the table name for the source table.

Table owner Specifies the table owner.

You cannot edit the value. Data Services automatically populates it with the name that you entered when you created the HP Vertica table.

Datastore name Specifies the name of the related HP Vertica datastore.

Database type Specifies the database type.

You cannot edit this value. Data Services automatically populates it with the database type that you chose when you created the datastore.

Parent topic: HP Vertica [page 74]

Related Information

Enable MIT Kerberos for HP Vertica SSL protocol [page 75]
Creating a DSN for HP Vertica with Kerberos SSL [page 78]
Creating HP Vertica datastore with SSL encryption [page 80]
Increasing loading speed for HP Vertica [page 82]
HP Vertica data type conversion [page 83]
HP Vertica target table configuration [page 86]

3.8.7 HP Vertica target table configuration

Configure options for an HP Vertica table as a target by opening the target editor in the data flow.

Options tab: General options

Option Description

Column comparison Specifies how the software maps input columns to output columns:

● Compare by position: Maps source columns to target columns by position, and ignores column names.
● Compare by name: Maps source columns to target columns by column name. Compare by name is the default setting.

Data Services issues validation errors when the data types of the columns do not match.

Number of loaders Specifies the number of loaders Data Services uses to load data to the target.

Enter a positive integer. The default is 1.

There are different types of loading:

● Single loader loading: Loading with one loader.
● Parallel loading: Loading with two or more loaders.

With parallel loading, each loader receives the number of rows indicated in the Rows per commit option, and processes the rows in parallel with other loaders.

 Example

For example, if Rows per commit = 1000 and Number of Loaders = 3:

● First 1000 rows go to the first loader
● Second 1000 rows go to the second loader
● Third 1000 rows go to the third loader
● Fourth 1000 rows go to the first loader

Options tab: Error handling options

Option Description

Use overflow file Specifies whether Data Services uses a recovery file for rows that it could not load.

● No: Data Services does not save information about unloaded rows. The default setting is No.
● Yes: Data Services loads data to an overflow file when it cannot load a row. When you select Yes, also complete File Name and File Format.

File name, File format Specifies the file name and file format for the overflow file. Applicable only when you select Yes for Use overflow file. Enter a file name or specify a variable.

The overflow file can include the data rejected and the operation being performed (write_data) or the SQL command used to produce the rejected operation (write_sql).

Update control

Option Description

Use input keys Specifies whether Data Services uses the primary keys from the input table when the target table does not have a primary key.

● Yes: Uses the primary keys from the input table when the target table does not have primary keys.
● No: Does not use primary keys from the input table when the target table does not have primary keys. No is the default setting.

Update key columns Specifies whether Data Services updates key column values when it loads data to the target table.

● Yes: Updates key column values when it loads data to the target table.
● No: Does not update key column values when it loads data to the target table. No is the default setting.


Auto correct load Specifies whether Data Services uses auto correct loading when it loads data to the target table. Auto correct loading ensures that Data Services does not duplicate the same row in a target table. Auto correct load is useful for data recovery operations.

● Yes: Uses auto correct loading.

 Note

Not applicable for targets in real time jobs or target tables that contain LONG columns.

● No: Does not use auto correct loading. No is the default setting.

For more information about auto correct loading, read about recovery mechanisms in the Designer Guide.

Ignore columns with value Specifies a value that might appear in a source column and that you do not want updated in the target table during auto correct loading.

Enter a string excluding single or double quotation marks. The string can include spaces.

When Data Services finds the string in the source column, it does not update the corresponding target column during auto correct loading.

Transaction control

Option Description

Include in transaction Specifies that this target table is included in the transaction processed by a batch or real-time job.

● No: This target table is not included in the transaction processed by a batch or real-time job. No is the default setting.
● Yes: The target table is included in the transaction processed by a batch or real-time job. Selecting Yes enables Data Services to commit data to multiple tables as part of the same transaction. If loading fails for any of the tables, Data Services does not commit any data to any of the tables.

 Note

Ensure that the tables are from the same datastore.

Data Services does not push down a complete operation to the database when transactional loading is enabled.

Data Services may buffer rows to ensure the correct load order. If the buffered data is larger than the virtual memory available, Data Services issues a memory error.

If you choose to enable transactional loading, the following options are not available:

● Rows per commit
● Use overflow file and overflow file specification
● Number of loaders

Parent topic: HP Vertica [page 74]

Related Information

Enable MIT Kerberos for HP Vertica SSL protocol [page 75]
Creating a DSN for HP Vertica with Kerberos SSL [page 78]
Creating HP Vertica datastore with SSL encryption [page 80]
Increasing loading speed for HP Vertica [page 82]
HP Vertica data type conversion [page 83]
HP Vertica table source [page 85]

3.9 MongoDB

The MongoDB adapter allows you to read data from MongoDB sources and load data to other SAP Data Services targets.

MongoDB is an open-source document database that stores JSON-like documents in a binary format called BSON. MongoDB has dynamic schemas instead of traditional schema-based data.

Data Services needs metadata to gain access to MongoDB data for task design and execution. Use Data Services processes to generate schemas by converting each row of the BSON file into XML and converting XML to XSD.

Data Services uses the converted metadata in XSD files to access MongoDB data.

MongoDB metadata [page 90] Use data from MongoDB as a source or target in a data flow, and also create templates.

MongoDB as a source [page 91] Use MongoDB as a source in Data Services and flatten the nested schema by using the XML_Map transform.

MongoDB as a target [page 94] Configure options for MongoDB as a target in your data flow using the target editor.

MongoDB template documents [page 96] Use template documents as a target in one data flow or as a source in multiple data flows.

Preview MongoDB document data [page 98] Use the data preview feature in SAP Data Services Designer to view a sampling of data from a MongoDB document.

Parallel Scan [page 99] SAP Data Services uses the MongoDB Parallel Scan process to improve performance while it generates metadata for big data.

Reimport schemas [page 100] When you reimport documents from your MongoDB datastore, SAP Data Services uses the current datastore settings.

Searching for MongoDB documents in the repository [page 101] SAP Data Services enables you to search for MongoDB documents in your repository from the object library.

3.9.1 MongoDB metadata

Use data from MongoDB as a source or target in a data flow, and also create templates.

The embedded documents and arrays in MongoDB are represented as nested data. SAP Data Services converts MongoDB BSON files to XML and then to XSD. Data Services saves the XSD file to the following location: LINK_DIR\ext\mongo\mcache.

Restrictions and limitations

Data Services has the following restrictions and limitations for working with MongoDB:

● In the MongoDB collection, the tag name cannot contain special characters that are invalid for the XSD file. For example, the following special characters are invalid for XSD files: >, <, &, /, \, #, and so on. If special characters exist, Data Services removes them.
● MongoDB data is always changing, so the XSD may not reflect the entire data structure of all the documents in the MongoDB.
● Data Services does not support projection queries on adapters.
● Data Services ignores any new fields that you add after the metadata schema creation that were not present in the common documents.
● Data Services does not support push down operators when you use MongoDB as a target.

Parent topic: MongoDB [page 90]

Related Information

MongoDB as a source [page 91]
MongoDB as a target [page 94]
MongoDB template documents [page 96]
Preview MongoDB document data [page 98]
Parallel Scan [page 99]
Reimport schemas [page 100]
Searching for MongoDB documents in the repository [page 101]
Formatting XML documents
Source and target objects

3.9.2 MongoDB as a source

Use MongoDB as a source in Data Services and flatten the nested schema by using the XML_Map transform.

The following examples illustrate how to use various objects to process MongoDB sources in data flows.

Example 1: Change the schema of a MongoDB source using the Query transform, and load output to an XML target.

 Note

Specify conditions in the Query transform. Some conditions can be pushed down and others are processed by Data Services.

Example 2: Set up a data flow where Data Services reads the schema and then loads the schema directly into an XML template file.

Example 3: Flatten a schema using the XML_Map transform and then load the data to a table or flat file.

 Note

Specify conditions in the XML_Map transform. Some conditions can be pushed down and others are processed by Data Services.

Parent topic: MongoDB [page 90]

Related Information

MongoDB metadata [page 90]
MongoDB as a target [page 94]
MongoDB template documents [page 96]
Preview MongoDB document data [page 98]
Parallel Scan [page 99]
Reimport schemas [page 100]
Searching for MongoDB documents in the repository [page 101]
MongoDB query conditions [page 93]
Push down operator information [page 93]

3.9.2.1 MongoDB query conditions

Use query criteria to retrieve documents from a MongoDB collection.

Use query criteria as a parameter of the db.<collection>.find() method. Add MongoDB query conditions to a MongoDB table as a source in a data flow.

To add a MongoDB query condition, enter a value next to the Query criteria parameter in the source editor Adapter Source tab. Ensure that the query criteria is in MongoDB query format. For example, { type: { $in: ['food', 'snacks'] } }

 Example

Given a value of {prize:100}, MongoDB returns only rows that have a field named “prize” with a value of 100. If you do not specify any query criteria, MongoDB returns all the rows.

Configure a Where condition so that Data Services pushes down the condition to MongoDB. Specify a Where condition in a Query or XML_Map transform, and place the Query or XML_Map transform after the MongoDB source object in the data flow. MongoDB returns only the rows that you want.

For more information about the MongoDB query format, consult the MongoDB Web site.

 Note

If you use the XML_Map transform, it may have a query condition with a SQL format. Data Services converts the SQL format to the MongoDB query format and uses the MongoDB specification to push down operations to the source database. In addition, be aware that Data Services does not support push down of queries on nested arrays.
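As a rough illustration of that conversion (the field names are hypothetical and the generated query may differ in form), a Where condition in a Query or XML_Map transform corresponds to an equivalent MongoDB query document:

 Example

Where condition in the transform: (TYPE = 'food') AND (PRIZE > 50)

Equivalent MongoDB query criteria: { $and: [ { type: 'food' }, { prize: { $gt: 50 } } ] }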

Related Information

Push down operator information [page 93]

3.9.2.2 Push down operator information

SAP Data Services processes push down operators with a MongoDB source in specific ways based on the circumstance.

Push down behavior:

● Data Services does not push down Sort by conditions.
● Data Services pushes down Where conditions.
● Data Services does not push down the nested array when you use a nested array in a Where condition.
● Data Services does not support push down operators when you use MongoDB as a target.

Data Services supports the following operators when you use MongoDB as a source:

● Comparison operators: =, !=, >, >=, <, <=, like, and in.

● Logical operators: and and or in the SQL query.

3.9.3 MongoDB as a target

Configure options for MongoDB as a target in your data flow using the target editor.

About the <_id> field

SAP Data Services considers the <_id> field in MongoDB data as the primary key. If you create a new MongoDB document and include a field named <_id>, Data Services recognizes that field as the unique BSON ObjectId. If a MongoDB document contains more than one <_id> field at different levels, Data Services considers only the <_id> field at the first level as the BSON ObjectId.
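For example, in the following hypothetical document, only the top-level <_id> is treated as the BSON ObjectId; the <_id> nested inside the order subdocument is handled as ordinary data:

 Example

{
 "_id": ObjectId("507f1f77bcf86cd799439011"),
 "customer": "ACME",
 "order": { "_id": "line-01", "qty": 2 }
}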

The following table contains descriptions for options in the Adapter Target tab of the target editor.

Adapter Target tab options
Option Description

Use auto correct Specifies the mode Data Services uses for MongoDB as a target datastore.

● True: Uses Upsert mode for the writing behavior. Updates the document with the same <_id> field or inserts a new document.

 Note

Selecting True may slow the performance of writing operations.

● False: Uses Insert mode for writing behavior. If documents have the same <_id> field in the MongoDB collection, then Data Services issues an error message.

Write concern level Specifies the MongoDB write concern level that Data Services uses for reporting the success of a write operation. Enable or disable different levels of acknowledgement for writing opera­ tions.

● Acknowledged: Provides acknowledgment of write operations on a standalone mongod or the primary in a replica set. Acknowledged is the default setting.
● Unacknowledged: Disables the basic acknowledgment and only returns errors of socket exceptions and networking errors.
● Replica Set Acknowledged: Guarantees that write operations have propagated successfully to the specified number of replica set members, including the primary.
● Journaled: Acknowledges the write operation only after MongoDB has committed the data to a journal.
● Majority: Confirms that the write operations have propagated to the majority of voting nodes.


Use bulk Specifies whether Data Services executes writing operations in bulk. Bulk may provide better performance.

● True: Runs write operations in bulk for a single collection to optimize CRUD efficiency. If a bulk contains more than 1000 write operations, MongoDB automatically splits it into multiple bulk groups.
● False: Does not run write operations in bulk.

For more information about bulk, ordered bulk, and bulk maximum rejects, see the MongoDB documentation at http://help.sap.com/disclaimer?site=http://docs.mongodb.org/manual/core/bulk-write-operations/ .

Use ordered bulk Specifies the order in which Data Services executes write operations: Serial or Parallel.

● True: Executes write operations in serial.
● False: Executes write operations in parallel. False is the default setting. MongoDB processes the remaining write operations even when there are errors.

Documents per commit Specifies the maximum number of documents that are loaded to a target before the software saves the data.

● Blank: Uses the maximum of 1000 documents. Blank is the default setting.
● Enter any integer to specify a number other than 1000.

Bulk maximum rejects Specifies the maximum number of acceptable errors before Data Services fails the job.

 Note

Data Services continues to load to the target MongoDB even when the job fails.

Enter an integer. Enter -1 so that Data Services ignores and does not log bulk loading errors.

If the number of actual errors is less than, or equal to the number you specify here, Data Services allows the job to succeed and logs a summary of errors in the adapter instance trace log.

Applicable only when you select True for Use ordered bulk.

Delete data before loading Deletes existing documents in the current collection before loading occurs. Retains all the con­ figuration, including indexes, validation rules, and so on.

Drop and re-create Specifies whether Data Services drops the existing MongoDB collection and creates a new one with the same name before loading occurs.

● True: Drops the existing MongoDB collection and creates a new one with the same name before loading. Ignores the value of Delete data before loading. True is the default setting.
● False: Does not drop the existing MongoDB collection and create a new one with the same name before loading.

This option is available for template documents only.


Use audit Specifies whether Data Services creates audit files that contain write operation information.

● True: Creates audit files that contain write operation information. Stores audit files in the /adapters/audits/ directory. The name of the file is .txt.
● False: Does not create and store audit files.

Data Services behaves in the following way when a regular load fails:

● Use audit = False: Data Services logs loading errors in the job trace log.
● Use audit = True: Data Services logs loading errors in the job trace log and in the audit log.

Data Services behaves in the following way when a bulk load fails:

● Use audit = False: Data Services creates a job trace log that provides only a summary. It does not contain details about each row of bad data. There is no way to obtain details about bad data.
● Use audit = True: Data Services creates a job trace log that provides only a summary but no details. However, the job trace log provides information about where to find details about each row of bad data in the audit file.

Parent topic: MongoDB [page 90]

Related Information

MongoDB metadata [page 90]
MongoDB as a source [page 91]
MongoDB template documents [page 96]
Preview MongoDB document data [page 98]
Parallel Scan [page 99]
Reimport schemas [page 100]
Searching for MongoDB documents in the repository [page 101]

3.9.4 MongoDB template documents

Use template documents as a target in one data flow or as a source in multiple data flows.

Template documents are useful in early application development when you design and test a project. After you import data for the MongoDB datastore, Data Services stores the template documents in the object library. Find template documents in the Datastore tab of the object library.

When you import a template document, the software converts it to a regular document. You can use the regular document as a target or source in your data flow.

 Note

Template documents are available in Data Services 4.2.7 and later. If you upgrade from a previous version, open an existing MongoDB datastore and then click OK to close it. Data Services updates the datastore so that you see the Template Documents node and any other template document related options.

Template documents are similar to template tables. For information about template tables, see the Data Services User Guide and the Reference Guide.

Parent topic: MongoDB [page 90]

Related Information

MongoDB metadata [page 90]
MongoDB as a source [page 91]
MongoDB as a target [page 94]
Preview MongoDB document data [page 98]
Parallel Scan [page 99]
Reimport schemas [page 100]
Searching for MongoDB documents in the repository [page 101]

3.9.4.1 Creating template documents

Create MongoDB template documents and use them as targets or sources in data flows.

In SAP Data Services Designer, open or create a data flow in which you plan to use MongoDB documents.

1. Click the template icon from the tool palette.
2. Click inside a data flow in the workspace.

The Create Template dialog box opens.
3. Enter a name for the template in Template name.

 Note

The name is the MongoDB collection namespace: database.collection. Do not exceed 120 bytes.

4. Select the related MongoDB datastore from the In datastore dropdown list.
5. Click OK.
6. To use the template document as a target in the data flow, connect the template document to an input object.
7. Click Save.

When you link the template document as a target in the data flow, Data Services automatically generates a schema based on the source object. The template document icon changes in the workspace. The template document appears in the Template Documents node under the applicable MongoDB datastore in the object library.

To use the template document as a source, drag it from the object library onto a new data flow in the workspace.

Related Information

Preview MongoDB document data [page 98]
MongoDB as a source [page 91]
MongoDB as a target [page 94]

3.9.4.2 Convert a template document into a regular document

SAP Data Services enables you to convert an imported template document into a regular document.

Use one of the following methods to import a MongoDB template document:

● Open a data flow and select one or more template target documents in the workspace. Right-click, and choose Import Document.
● Select one or more template documents in the Local Object Library, right-click, and choose Import Document.

The icon changes and the document appears under Documents instead of Template Documents in the object library.

 Note

The Drop and re-create target configuration option is available only for template target documents. Therefore it is not available after you convert the template target into a regular document.

3.9.5 Preview MongoDB document data

Use the data preview feature in SAP Data Services Designer to view a sampling of data from a MongoDB document.

Choose one of the following methods to preview MongoDB document data:

● Expand an applicable MongoDB datastore in the object library. Right-click the MongoDB document and select View Data from the dropdown menu.
● Right-click the MongoDB document in a data flow and select View Data from the dropdown menu.
● Click the magnifying glass icon in the lower corner of either a MongoDB source or target object in a data flow.

 Note

By default, Data Services displays a maximum of 100 rows. Change this number by setting the Rows To Scan option in the applicable MongoDB datastore editor. Entering -1 displays all rows.

For more information about viewing data, see the Designer Guide.

Parent topic: MongoDB [page 90]

Related Information

MongoDB metadata [page 90]
MongoDB as a source [page 91]
MongoDB as a target [page 94]
MongoDB template documents [page 96]
Parallel Scan [page 99]
Reimport schemas [page 100]
Searching for MongoDB documents in the repository [page 101]
MongoDB adapter datastore configuration options

3.9.6 Parallel Scan

SAP Data Services uses the MongoDB Parallel Scan process to improve performance while it generates metadata for big data.

To generate metadata, Data Services first scans all documents in the MongoDB collection. This scanning can be time consuming. However, when Data Services uses the Parallel Scan command parallelCollectionScan, it uses multiple parallel cursors to read all the documents in a collection. Parallel Scan can increase performance.

 Note

Parallel Scan works with MongoDB server version 2.6.0 and above.

For more information about the parallelCollectionScan command, consult your MongoDB documentation.
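For reference, the underlying command can also be issued manually from the mongo shell. The collection name and cursor count below are hypothetical; Data Services runs the command for you during metadata generation.

 Example

db.runCommand( { parallelCollectionScan: "orders", numCursors: 4 } )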

Parent topic: MongoDB [page 90]

Related Information

MongoDB metadata [page 90]

MongoDB as a source [page 91]
MongoDB as a target [page 94]
MongoDB template documents [page 96]
Preview MongoDB document data [page 98]
Reimport schemas [page 100]
Searching for MongoDB documents in the repository [page 101]

3.9.7 Reimport schemas

When you reimport documents from your MongoDB datastore, SAP Data Services uses the current datastore settings.

Reimport a single MongoDB document by right-clicking the document and selecting Reimport from the dropdown menu.

To reimport all documents, right-click an applicable MongoDB datastore or right-click on the Documents node and select Reimport All from the dropdown menu.

 Note

When you enable Use Cache, Data Services uses the cached schema.

When you disable Use Cache, Data Services looks in the sample directory for a sample BSON file with the same name. If there is a matching file, the software uses the schema from the BSON file. If there isn't a matching BSON file in the sample directory, the software reimports the schema from the database.

Parent topic: MongoDB [page 90]

Related Information

MongoDB metadata [page 90]
MongoDB as a source [page 91]
MongoDB as a target [page 94]
MongoDB template documents [page 96]
Preview MongoDB document data [page 98]
Parallel Scan [page 99]
Reimport schemas [page 100]

3.9.8 Searching for MongoDB documents in the repository

SAP Data Services enables you to search for MongoDB documents in your repository from the object library.

1. Right-click in any tab in the object library and choose Search from the dropdown menu. The Search dialog box opens.
2. Select the applicable MongoDB datastore name from the Look in dropdown menu.

The datastore is the one that contains the document for which you are searching.
3. Select Local Repository to search the entire repository.
4. Select Documents from the Object Type dropdown menu.
5. Enter the criteria for the search.
6. Click Search. Data Services lists matching documents in the lower pane of the Search dialog box. A status line at the bottom of the Search dialog box shows statistics such as total number of items found, amount of time to search, and so on.

Task overview: MongoDB [page 90]

Related Information

MongoDB metadata [page 90]
MongoDB as a source [page 91]
MongoDB as a target [page 94]
MongoDB template documents [page 96]
Preview MongoDB document data [page 98]
Parallel Scan [page 99]
Reimport schemas [page 100]
Designer Guide: Searching for objects

3.10 PostgreSQL

To use your PostgreSQL tables as sources and targets in SAP Data Services, create a PostgreSQL datastore and import your tables and other metadata.

Required versions

Download and install the PostgreSQL Server version 10.X from the official PostgreSQL Web site. Check the Product Availability Matrix (PAM) on the SAP Support Portal to ensure that you have the supported PostgreSQL version for your version of Data Services.

 Note

Data Services supports PostgreSQL on Windows beginning with version 4.2.12 (14.02.12.00). Data Services supports PostgreSQL on Linux beginning with version 4.2.12 Patch 1 (14.02.12.01).

Obtain the ODBC driver that is compatible with your version of PostgreSQL. To avoid potential processing problems, download the ODBC driver from the official PostgreSQL Web site.

DSN or DSN-less connections

Create a PostgreSQL datastore using either a DSN or DSN-less connection.

 Note

Currently, Data Services does not support SSL connections for PostgreSQL.

Pushdown functions

Data Services supports the basic pushdown functions for PostgreSQL. For a list of pushdown functions that Data Services supports for PostgreSQL, see SAP Note 2212730.

UTF-8 encoding

To process PostgreSQL tables as sources in data flows, Data Services requires that all data in PostgreSQL tables use UTF-8 encoding. Additionally, Data Services outputs data to PostgreSQL target tables using UTF-8 encoding.
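To confirm that a database uses UTF-8 encoding, you can, for example, run the following statements in psql; the database name is hypothetical.

 Example

SHOW server_encoding;
SELECT pg_encoding_to_char(encoding) FROM pg_database WHERE datname = 'sales';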

Conversion to or from internal data types

Data Services converts PostgreSQL data types to data types that it can process. After processing, Data Services outputs data and converts the data types back to the corresponding PostgreSQL data types.

Datastore options for PostgreSQL [page 103] Complete options in the datastore editor to set the datastore type, database version, database access information, and DSN information if applicable.

Configure the PostgreSQL ODBC driver [page 104] Configure the PostgreSQL ODBC driver for Windows or Linux to update the configuration file with the applicable driver information.

Import PostgreSQL metadata [page 105] Use the PostgreSQL database datastore to access the schemas and tables in the defined database.

PostgreSQL source, target, and template tables [page 106] Use PostgreSQL tables as sources and targets in data flows and use PostgreSQL table schemas for template tables.

PostgreSQL data type conversions [page 106] When you import metadata from a PostgreSQL table into the repository, SAP Data Services converts PostgreSQL data types to Data Services native data types for processing.

3.10.1 Datastore options for PostgreSQL

Complete options in the datastore editor to set the datastore type, database version, database access information, and DSN information if applicable.

The first set of options define the datastore type (database) and the PostgreSQL version information.

PostgreSQL datastore option descriptions
Option Value

Datastore Type Select Database

Database Type Select PostgreSQL

Database Version Select PostgreSQL 10.X

To create a server-name (DSN-less) datastore for PostgreSQL, complete the database-specific options described in the following table.

PostgreSQL database option descriptions for DSN-less connection
Option Description

Database server name Specifies the database server address. Enter localhost or an IP address.

Database name Specifies the database name to which this datastore connects.

Port Specifies the port number that this datastore uses to access the database.


User name Specifies the name of the user authorized to access the database.

Password Specifies the password related to the specified User name.

Enable Automatic Data Transfer Specifies that any data flow that uses the tables imported with this datastore may use the Data_Transfer transform. Data_Transfer uses transfer tables to push down certain operations to the database server for more efficient processing.

When you create a DSN connection, the remaining options include information that you don't enter in the ODBC Data Source Administrator.

PostgreSQL DSN option descriptions
Option Description

Data Source Name Specifies the name of the DSN you create in the ODBC Data Source Administrator.

Ensure that you create the DSN so that it appears in the dropdown list.

User Name Specifies the user name to access the data source defined in the DSN.

Password Specifies the password related to the User Name value.

Parent topic: PostgreSQL [page 102]

Related Information

Configure the PostgreSQL ODBC driver [page 104]
Import PostgreSQL metadata [page 105]
PostgreSQL source, target, and template tables [page 106]
PostgreSQL data type conversions [page 106]
Properties for ODBC data sources using DSN connections

3.10.2 Configure the PostgreSQL ODBC driver

Configure the PostgreSQL ODBC driver for Windows or Linux to update the configuration file with the applicable driver information.

For Windows, use the ODBC Drivers Selector to verify the ODBC driver is installed. For Linux, configure the ODBC driver using the SAP Data Services Connection Manager. For information about configuring the ODBC driver, see the Server Maintenance section of the Administrator Guide.
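On Linux, the Connection Manager updates the odbc.ini file for you. As a rough sketch only (the DSN name, driver path, and connection values are assumptions, not values from this guide), the resulting DSN entry for the PostgreSQL ODBC driver is similar to the following:

 Example

[PostgreSQL_DSN]
Description=PostgreSQL source for Data Services
# Path to the installed psqlODBC driver library (assumption)
Driver=/usr/lib64/psqlodbcw.so
Servername=localhost
Port=5432
Database=sales
Username=dsuser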

Parent topic: PostgreSQL [page 102]

Related Information

Datastore options for PostgreSQL [page 103]
Import PostgreSQL metadata [page 105]
PostgreSQL source, target, and template tables [page 106]
PostgreSQL data type conversions [page 106]
Using the ODBC Drivers Selector for Windows
Using the DS Connection Manager

3.10.3 Import PostgreSQL metadata

Use the PostgreSQL database datastore to access the schemas and tables in the defined database.

Open the datastore and view the metadata available to import. For PostgreSQL, import schemas and the related tables. Each table resides under a specific schema, and a table name appears under its schema in the object library.

Import metadata by browsing, by name, or by searching.

Parent topic: PostgreSQL [page 102]

Related Information

Datastore options for PostgreSQL [page 103]
Configure the PostgreSQL ODBC driver [page 104]
PostgreSQL source, target, and template tables [page 106]
PostgreSQL data type conversions [page 106]
Datastore metadata
Imported metadata from database datastores

3.10.4 PostgreSQL source, target, and template tables

Use PostgreSQL tables as sources and targets in data flows and use PostgreSQL table schemas for template tables.

Drag the applicable PostgreSQL table onto your workspace and connect it to a data flow as a source or target. Also, use a template table as a target in a data flow and save it to use as a future source in a different data flow.

See the Designer Guide to learn about using template tables. Additionally, see the Reference Guide for descriptions of options to complete for source, target, and template tables.

Parent topic: PostgreSQL [page 102]

Related Information

Datastore options for PostgreSQL [page 103]
Configure the PostgreSQL ODBC driver [page 104]
Import PostgreSQL metadata [page 105]
PostgreSQL data type conversions [page 106]

3.10.5 PostgreSQL data type conversions

When you import metadata from a PostgreSQL table into the repository, SAP Data Services converts PostgreSQL data types to Data Services native data types for processing.

After processing, Data Services converts data types back to PostgreSQL data types when it outputs the generated data to the target.

The following table contains PostgreSQL data types and the corresponding Data Services data types.

Data type conversion for PostgreSQL
PostgreSQL data type Converts to or from Data Services data type Notes

Boolean/Integer/Smallint Int

Serial/Smallserial/Serial4/OID Int

Bigint/BigSerial/Serial8 Decimal(19,0)

Float(1)-Float(24), Real real

Float(25)-Float(53), Double precision double

Money double

Numeric(precision, scale) Decimal(precision, scale)

Numeric/Decimal Decimal(28,6)


Bytea Blob

Char(n) Fixedchar(n)

Text/varchar(n) Varchar(n)

DATE Date

TIMESTAMP Datetime

TIMESTAMPTZ Varchar(127)

TIMETZ Varchar(127)

INTERVAL Varchar(127)

If Data Services encounters a column that has an unsupported data type, it does not import the column. However, you can configure Data Services to import unsupported data types by checking the Import unsupported data types as VARCHAR of size option in the datastore editor dialog box.

 Note

When you import tables that have specific PostgreSQL native data types, Data Services saves the data type as varchar or integer, and includes an attribute setting for Native Type. The following table contains the column data type in which Data Services saves the PostgreSQL native data type, and the corresponding attribute.

Data Services saves PostgreSQL native data types

PostgreSQL column native data type Data Services saves as data type Data Services attribute

json Varchar Native Type = JSON
jsonb Varchar Native Type = JSONB
xml Varchar Native Type = XML
uuid Varchar Native Type = UUID

bool Integer Native Type = BOOL
text Integer Native Type = TEXT
bigint Integer Native Type = INT8

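For illustration only, the following is a hypothetical PostgreSQL table whose columns carry these native types; the table and column names are examples, not part of the product documentation. When you import such a table, each of these columns is saved with the corresponding Native Type attribute shown above.

 Sample Code

-- Hypothetical PostgreSQL table whose columns Data Services imports
-- with a Native Type attribute. Names are example values only.
CREATE TABLE app_events
(
    event_id  uuid,
    payload   json,
    details   jsonb,
    config    xml,
    is_active bool
);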
Parent topic: PostgreSQL [page 102]

Related Information

Datastore options for PostgreSQL [page 103] Configure the PostgreSQL ODBC driver [page 104] Import PostgreSQL metadata [page 105] PostgreSQL source, target, and template tables [page 106]

3.11 SAP HANA

Process your SAP HANA data in SAP Data Services by creating an SAP HANA database datastore.

Use data that you import using the SAP HANA database datastore as sources and targets in Data Services data flows. Protect your HANA data during network transmission using SSL protocol and cryptographic libraries. Create stored procedures and enable bulk loading for faster reading and loading. Additionally, load spatial and complex spatial data from Oracle to SAP HANA.

 Note

Beginning with SAP HANA 2.0 SP1, access databases only through a multitenant database container (MDC). If you use a version of SAP HANA that is earlier than 2.0 SP1, access only a single database.

Cryptographic libraries and global.ini settings [page 108] When you create an SAP HANA database datastore with SSL/TLS encryption, configure both server side and client side for SSL/TLS authentication.

Bulk loading in SAP HANA [page 110] SAP Data Services improves bulk loading for SAP HANA by using a staging mechanism to load data to the target table.

Creating stored procedures in SAP HANA [page 112] SAP Data Services supports SAP HANA stored procedures with zero, one, or more output parameters.

SAP HANA database datastores [page 113] To access SAP HANA data for SAP Data Services processes, configure an SAP HANA database datastore with either a data source name (DSN) or a server name (DSN-less) connection.

Datatype conversion for SAP HANA [page 119] SAP Data Services performs data type conversions when it imports metadata from SAP HANA sources or targets into the repository and when it loads data into an external SAP HANA table or file.

Using spatial data with SAP HANA [page 121] SAP Data Services supports spatial data such as point, line, polygon, collection, for specific databases.

3.11.1 Cryptographic libraries and global.ini settings

When you create an SAP HANA database datastore with SSL/TLS encryption, configure both server side and client side for SSL/TLS authentication.

On the server side, the process of configuring the ODBC driver and SSL/TLS protocol automatically sets the applicable settings in the communications section of the global.ini file.

SAP HANA uses the SAP CommonCryptoLib library for SSL/TLS encryption. CommonCryptoLib (libsapcrypto.sar) is installed by default as part of the SAP HANA server installation, which places it in $DIR_EXECUTABLE.

 Note

The SAP CommonCrypto library was formerly known as the SAPCrypto library. The two libraries are the same.

 Note

Support for OpenSSL in SAP HANA is deprecated. If you were using OpenSSL, we recommend that you migrate to CommonCryptoLib. For more information, see SAP Note 2093286.

Parent topic: SAP HANA [page 108]

Related Information

Bulk loading in SAP HANA [page 110] Creating stored procedures in SAP HANA [page 112] SAP HANA database datastores [page 113] Datatype conversion for SAP HANA [page 119] Using spatial data with SAP HANA [page 121]

3.11.1.1 Obtaining the SAP CommonCryptoLib file in Windows and Unix

The SAP CommonCryptoLib files are required for using SSL/TLS encryption in your SAP HANA database datastores.

1. For Windows:
   a. Create a local folder to store the CommonCryptoLib files.
   b. Download and install the applicable version of SAPCAR from the SAP download center. Use SAPCAR to extract the SAP CommonCryptoLib libraries.
   c. Obtain the SAP CommonCryptoLib library file from the SAP download center.
   d. Use SAPCAR to extract the library files from libsapcrypto.sar to the local folder that you created to store the files.
   e. Create a system variable named SECUDIR and point it to the local folder that you created for the CommonCryptoLib library files. To create a system variable for Windows, access Control Panel and open System.
   f. Append %SECUDIR% to the PATH variable in Environment Variables.
   g. Restart Windows.
2. For Unix:
   a. Create a local folder to store the CommonCryptoLib files.
   b. Obtain the SAP CommonCryptoLib library libsapcrypto.sar file from https://launchpad.support.sap.com/#/softwarecenter/template/products/%20_APP=00200682500000001943&_EVENT=DISPHIER&HEADER=Y&FUNCTIONBAR=N&EVENT=TREE&NE=NAVIGATE&ENR=67838200100200022586&V=MAINT&TA=ACTUAL&PAGE=SEARCH/COMMONCRYPTOLIB%208 .

   c. Use SAPCAR to extract the library files from libsapcrypto.sar to the local folder that you created to store the files.
   d. Create a system variable named SECUDIR and point it to the local folder that you created to store the files.

export SECUDIR=/PATH/

e. Append $SECUDIR to PATH.

export PATH=$SECUDIR:$PATH

f. Restart the Job Server.

3.11.2 Bulk loading in SAP HANA

SAP Data Services improves bulk loading for SAP HANA by using a staging mechanism to load data to the target table.

When Data Services uses changed data capture (CDC) or auto correct load, it uses a temporary staging table to load the target table. Data Services loads the data to the staging table and applies the operation codes INSERT, UPDATE, and DELETE to update the target table. With the Bulk load option selected in the target table editor, any one of the following conditions triggers the staging mechanism:

● The data flow contains a Map CDC Operation transform.
● The data flow contains a Map Operation transform that outputs UPDATE or DELETE rows.
● The data flow contains a Table Comparison transform.
● The Auto correct load option in the target table editor is set to Yes.

If none of these conditions are met, the input data contains only INSERT rows. Therefore Data Services performs only a bulk insert operation, which does not require a staging table or the need to execute any additional SQL.

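For orientation only, the following hedged SQL sketch shows the general kind of work that a staging mechanism performs. The table, column, and operation-code values are hypothetical, and Data Services generates its own statements internally; this is not the SQL the software actually issues.

 Sample Code

-- Illustrative, simplified sketch of a staging mechanism.
-- Names and OP_CODE values are hypothetical.

-- 1. Rows arriving from the data flow are bulk inserted into a staging table.
INSERT INTO "STG_CUSTOMER" ("CUSTOMER_ID", "NAME", "OP_CODE")
VALUES (1001, 'ACME', 'U');

-- 2. UPDATE rows in the staging table are applied to the target.
UPDATE "CUSTOMER"
   SET "NAME" = (SELECT S."NAME" FROM "STG_CUSTOMER" S
                  WHERE S."CUSTOMER_ID" = "CUSTOMER"."CUSTOMER_ID")
 WHERE "CUSTOMER_ID" IN
       (SELECT "CUSTOMER_ID" FROM "STG_CUSTOMER" WHERE "OP_CODE" = 'U');

-- 3. DELETE rows are applied, then INSERT rows are added.
DELETE FROM "CUSTOMER"
 WHERE "CUSTOMER_ID" IN
       (SELECT "CUSTOMER_ID" FROM "STG_CUSTOMER" WHERE "OP_CODE" = 'D');

INSERT INTO "CUSTOMER" ("CUSTOMER_ID", "NAME")
SELECT "CUSTOMER_ID", "NAME" FROM "STG_CUSTOMER" WHERE "OP_CODE" = 'I';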
By default, Data Services automatically detects the SAP HANA target table type. Then Data Services updates the table based on the table type for optimal performance.

The bulk loader for SAP HANA is scalable and supports UPDATE and DELETE operations. Therefore, the following options in the target table editor are also available for bulk loading:

● Use input keys: Uses the primary keys from the input table when the target table does not contain a primary key.
● Auto correct load: If a matching row to the source table does not exist in the target table, Data Services inserts the row in the target. If a matching row exists, Data Services updates the row based on other update settings in the target editor.

Find these options in the target editor under Update Control.

For more information about SAP HANA bulk loading and option descriptions, see the Data Services Supplement for Big Data.

Parent topic: SAP HANA [page 108]

Related Information

Cryptographic libraries and global.ini settings [page 108] Creating stored procedures in SAP HANA [page 112] SAP HANA database datastores [page 113] Datatype conversion for SAP HANA [page 119] Using spatial data with SAP HANA [page 121] SAP HANA target table options [page 111]

3.11.2.1 SAP HANA target table options

When you use SAP HANA tables as targets in a data flow, configure options in the target editor.

The following tables describe options in the target editor that are applicable to SAP HANA. For descriptions of the common options, see the Reference Guide.

Options

Option Description

Table type Specifies the table type when you use SAP HANA template table as target.

● Column Store: Creates tables organized by column. Column Store is the default setting.

 Note

Data Services does not support blob, dbblob, and clob data types for column store table types.

● Row Store: Creates tables organized by row.

Bulk loading

Option Description

Bulk load Specifies whether Data Services uses bulk loading to load data to the target.

● Selected: Uses bulk loading to load data to the target. ● Not selected: Does not use bulk loading to load data to the target.

Mode Specifies the mode that Data Services uses for loading data to the target table:

● Append: Adds new records to the table. Append is the default setting. ● Truncate: Deletes all existing records in the table and then adds new records.

Commit size Specifies the maximum number of rows that Data Services loads to the staging and target tables before it saves the data (commits).

● default: Uses a default commit size based on the target table type.
  ○ Column Store: Default commit size is 10,000.
  ○ Row Store: Default commit size is 1,000.
● Enter a value that is greater than 1.

Option Description

Update method Specifies how Data Services applies the input rows to the target table.

● default: Uses an update method based on the target table type:
  ○ Column Store: Uses UPDATE to apply the input rows.
  ○ Row Store: Uses DELETE-INSERT to apply the input rows.
● UPDATE: Issues an UPDATE to the target table.
● DELETE-INSERT: Issues a DELETE to the target table for data that matches the old data in the staging table. Issues an INSERT with the new data.

 Note

Do not use DELETE-INSERT if the update rows contain data for only some of the columns in the target table. If you use DELETE-INSERT, Data Services replaces missing data with NULLs.

Related Information

3.11.3 Creating stored procedures in SAP HANA

SAP Data Services supports SAP HANA stored procedures with zero, one, or more output parameters.

Data Services supports scalar data types for input and output parameters. Data Services does not support table data types. If you try to import a procedure with table data type, the software issues an error. Data Services does not support data types such as binary, blob, clob, nclob, or varbinary for SAP HANA procedure parameters.

Procedures can be called from a script or from a Query transform as a new function call.

 Example

Syntax

The SAP HANA syntax for the stored procedure:

CREATE PROCEDURE GET_EMP_REC (IN EMP_NUMBER INTEGER, OUT EMP_NAME VARCHAR(20), OUT EMP_HIREDATE DATE) AS

BEGIN SELECT ENAME, HIREDATE INTO EMP_NAME, EMP_HIREDATE FROM EMPLOYEE WHERE EMPNO = EMP_NUMBER;

END;

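As a quick sanity check outside of Data Services, you can call the procedure directly in an SAP HANA SQL console; the employee number below is an example value only.

 Sample Code

-- Hypothetical test of the procedure above, run in an SAP HANA SQL console.
-- The employee number 7369 is an example value; output parameters are
-- returned to the console.
CALL GET_EMP_REC (7369, ?, ?);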
Limitations

SAP HANA provides limited support for user-defined functions that can return one or several scalar values. These user-defined functions are usually written in L. If you use user-defined functions, limit them to the projection list and the GROUP BY clause of an aggregation query on top of an OLAP cube or a column table. These functions are not supported by Data Services.

SAP HANA procedures cannot be called from a WHERE clause.

Parent topic: SAP HANA [page 108]

Related Information

Cryptographic libraries and global.ini settings [page 108] Bulk loading in SAP HANA [page 110] SAP HANA database datastores [page 113] Datatype conversion for SAP HANA [page 119] Using spatial data with SAP HANA [page 121] Creating stored procedures in a database

3.11.4 SAP HANA database datastores

To access SAP HANA data for SAP Data Services processes, configure an SAP HANA database datastore with either a data source name (DSN) or a server name (DSN-less) connection.

You can optionally include secure socket layer (SSL) or transport layer security (TLS) for secure transfer of data over a network.

 Note

DSN with SSL is applicable for SAP Data Services beginning with version 4.2 SP7 (14.2.7.0). DSN-less with SSL is applicable for SAP Data Services beginning with version 4.2 SP12 (14.2.12.0).

 Note

Enabling SSL encryption slows down job performance but may be necessary for security purposes.

The SAP HANA database datastore requires downloading the HANA ODBC driver. SSL encryption requires you to have the SAP CommonCrypto library and an SAP HANA SSL certificate and key files.

For more information about SAP HANA, SSL, cryptographic libraries, and settings for secure external connections in the global.ini file, see the “SAP HANA Network and Communication Security” section of the SAP HANA Security Guide at https://help.sap.com/viewer/b3ee5778bc2e4a089d3299b82ec762a7/2.0.03/en-US/dcd7bf45bb571014b6fa8b64bb6fdef3.html?q=sap%20hana%20network%20and%20security.

Parent topic: SAP HANA [page 108]

Related Information

Cryptographic libraries and global.ini settings [page 108] Bulk loading in SAP HANA [page 110] Creating stored procedures in SAP HANA [page 112] Datatype conversion for SAP HANA [page 119] Using spatial data with SAP HANA [page 121]

3.11.4.1 SAP HANA datastore prerequisites and option descriptions

To use SAP HANA data for SAP Data Services processes, create an SAP HANA database datastore and import SAP HANA data.

Prerequisites tasks

Perform the following prerequisite tasks before you create the SAP HANA datastore:

● Download, install, and configure the SAP HANA ODBC driver.
  ○ For Windows, configure the driver using the ODBC Drivers Selector utility.
  ○ For UNIX, configure the driver using the SAP Data Services Connection Manager.
● Optional. If you plan to use SSL/TLS encryption, download the SAP CommonCrypto library and set the PATH environment variable as instructed in Obtaining the SAP CommonCryptoLib file in Windows and Unix [page 109].
● If applicable, create a DSN connection using the ODBC Data Source Administrator (Windows) or the SAP Data Services Connection Manager (Unix). If you plan to use SSL/TLS encryption, also set SSL options in the Connection Manager.
● If you have SAP HANA version 2.0 SPS 01 or later with multitenancy database containers (MDC), specify the port number and the database server name specific to the tenant database you are accessing.

The following tables contain the SAP HANA-specific options in the datastore editor.

SAP HANA datastore options

Option Value

Use Data Source Name (DSN) Specifies whether to use a data source name connection.

● Select to create a datastore using a data source name. ● Do not select to create a datastore using a server-name (DSN-less) connection.

The following options appear when you select to use a DSN connection:

Data Source Name Select the SAP HANA SSL DSN (data source name) that you created previously (see Prerequisites).

User Name Enter the user name and password connected to the DSN.

Password


The following options appear when you create a DSN-less connection:

Database server name Specifies the name of the computer where the SAP HANA server is located.

If you are connecting to SAP HANA MDC, enter the SAP HANA database server name for the applicable tenant database.

Port Enter the port number to connect to the SAP HANA Server. The default is 30015.

If you are connecting to SAP HANA 2.0 SPS 01 MDC or later, enter the port number of the specific tenant database.

 Note

See SAP HANA documentation to learn how to find the specific tenant database port number.

Advanced options

Option Description

Database name Optional. Enter the specific tenant database name. Applicable for SAP HANA version 2.0 SPS 01 MDC and later.

Additional connection parameters Enter information for any additional parameters that the data source ODBC driver and database supports. Use the following format:

Option Description

Use SSL encryption Specifies to use SSL/TLS encryption for the datastore connection to the database.

● Yes: Creates the datastore with SSL/TLS encryption. ● No: Creates the datastore without SSL/TLS encryption.

Enabled only when you create a server-name (DSN-less) connection.

Encryption parameters Opens the Encryption Parameters dialog box.

Either double-click in the empty cell or click in the empty cell and click the … icon that appears at the end.

Enabled only when you select Yes for Use SSL encryption.

The following options are in the Encryption Parameters dialog box:

Validate Certificate Specifies whether the software validates the SAP HANA server SSL certificate. If you do not select this option, none of the other SSL options are available to complete.

Crypto Provider Specifies the crypto provider used for SSL communication. Data Services populates Crypto Provider with commoncrypto. SAP CommonCryptoLib library is the only supported cryptographic library for SAP HANA.

Certificate host Specifies the host name used to verify the server identity.

● Leave blank to use the value in Database server name. ● Enter a string that contains the SAP HANA server hostname. ● Enter the wildcard character “*” so that Data Services does not validate the certificate host.

Key Store Specifies the location and file name for your key store. Applicable when you use the SAP CommonCryptoLib library.

For information about creating a database datastore, see the Designer Guide.
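The Port option described above requires the SQL port of the specific tenant database when you connect to an MDC system. As a hedged pointer only (it is not part of the Data Services configuration itself), the following query run against the system database is one common way to look up that port; the view and column names come from standard SAP HANA system views and may vary by release.

 Sample Code

-- Run in the SAP HANA system database to list tenant SQL ports
-- (assumes the SYS_DATABASES.M_SERVICES monitoring view is available).
SELECT DATABASE_NAME, SERVICE_NAME, SQL_PORT
  FROM SYS_DATABASES.M_SERVICES
 WHERE SQL_PORT <> 0;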

Related Information

Defining a database datastore

3.11.4.2 Configuring DSN for SAP HANA on Windows

To use a DSN connection for an SAP HANA datastore, configure a DSN connection for Windows using the ODBC Data Source Administrator.

Optionally include SSL/TLS encryption settings when you configure the DSN.

Perform the following tasks before you follow the steps to configure the DSN, including settings for SSL/TLS:

● Download and install the supported SAP HANA ODBC driver. Use the ODBC Drivers Selector to configure the driver in Windows.
● Download the SAP cryptographic library for SSL/TLS encryption. Set the PATH environment variable as instructed in Obtaining the SAP CommonCryptoLib file in Windows and Unix [page 109].
● For SSL/TLS encryption:
  ○ Copy sapsrv.pse from SECUDIR of the SAP HANA server.
  ○ Paste sapsrv.pse to SECUDIR of the client.
  ○ Rename sapsrv.pse to sapcli.pse.

The following steps include instructions to configure options for SSL/TLS encryption. If you don't want to include SSL/TLS encryption, skip those steps:

1. Open the ODBC Data Source Administrator.

Access the ODBC Data Source Administrator either from the datastore editor in Data Services Designer or directly from your Start menu.
2. In the ODBC Data Source Administrator, open the System DSN tab and click Add.
3. Select the SAP HANA ODBC driver and click Finish.
4. Enter a unique name in Data Source Name, and enter a description if applicable.
5. Enter the server name and port number in the following format: :.
6. For SSL/TLS encryption, click Settings.
7. In the SSL Connection group, select the following options:
   ○ Connect using SSL
   ○ Validate the SSL certificate (optional)
8. Either leave Certificate host blank or enter a value:
   If you leave Certificate host blank, Data Services uses the value in Database server name. If you don't want the value from the Database server name, enter one of the following values:
   ○ Enter a string that contains the SAP HANA server hostname.
   ○ Enter the wildcard character “*” so that Data Services does not validate the certificate host.
9. If you use the SAP CommonCryptoLib library files, specify the location and file name for your key store in Key Store.
10. Click OK.

For descriptions of all SSL options, see SAP HANA datastore prerequisites and option descriptions [page 114].

Related Information

Using the ODBC Drivers Selector for Windows

3.11.4.3 Configuring DSN for SAP HANA on Unix

Configure a DSN connection for an SAP HANA database datastore for Unix using the SAP Data Services Connection Manager.

Optionally include SSL/TLS encryption settings when you configure the DSN.

Perform the following tasks before you follow the steps to configure DSN for SAP HANA:

● Download and install the supported HANA ODBC driver.
● For SSL/TLS encryption, download the SAP cryptographic library and set the PATH environment variable as instructed in Obtaining the SAP CommonCryptoLib file in Windows and Unix [page 109].
● For SSL/TLS encryption, copy the applicable SAP HANA SSL certificate from the HANA server and paste it to a secure location.
● Use the GTK+2 library to make a graphical user interface for the Connection Manager. Connection Manager is a command-line utility. To use it with a UI, install the GTK+2 library. For more information about obtaining and installing GTK+2, see https://www.gtk.org/ .

The following instructions assume that you have the user interface for Connection Manager.

1. Export $ODBCINI to a file in the same computer as the SAP HANA data source. For example:

export ODBCINI=/odbc.ini

2. Start SAP Data Services Connection Manager by entering the following command:

$LINK_DIR/bin/DSConnectionManager.sh

3. Click the Data Sources tab and click Add to display the list of database types. 4. On the Select Database Type window, select the SAP HANA database type and click OK.

The configuration page opens with some of the connection information automatically completed:
   ○ Absolute location of the odbc.ini file
   ○ Driver for SAP HANA
   ○ Driver version
5. Complete the following options:

   ○ DSN Name
   ○ Specify the Driver Name
   ○ Specify the Server Name
   ○ Specify the Server Instance
   ○ Specify the User Name
   ○ Type the database password
   ○ Specify the Host Name
   ○ Specify the Port
6. If you want to include SSL/TLS encryption, select y for Specify the SSL Encryption Option and complete the following SSL options:

   ○ Specify the SSL Encryption Option
   ○ Specify the Validate Server Certificate Option
   ○ Specify the HANA SSL provider
   ○ Specify the SSL Certificate File

   ○ Specify the SSL Key File
   ○ Specify the SSL Host Name in Certificate

For descriptions of the DSN and SSL options, see SAP HANA datastore prerequisites and option descriptions [page 114].

Related Information

Using the DS Connection Manager

3.11.5 Datatype conversion for SAP HANA

SAP Data Services performs data type conversions when it imports metadata from SAP HANA sources or targets into the repository and when it loads data into an external SAP HANA table or file.

Data Services uses its own conversion functions instead of conversion functions that are specific to the database or application that is the source of the data.

Additionally, if you use a template table or Data_Transfer table as a target, Data Services converts from internal data types to the data types of the respective DBMS.

Parent topic: SAP HANA [page 108]

Related Information

Cryptographic libraries and global.ini settings [page 108] Bulk loading in SAP HANA [page 110] Creating stored procedures in SAP HANA [page 112] SAP HANA database datastores [page 113] Using spatial data with SAP HANA [page 121]

3.11.5.1 SAP HANA datatypes

SAP Data Services converts SAP HANA data types when you import metadata from an SAP HANA source or target into the repository.

Data Services converts data types back to SAP HANA data types when you load data into SAP HANA after processing.

Data type conversion on import

SAP HANA data type Converts to Data Services data type

integer int

tinyint int

smallint int

bigint decimal

char varchar

nchar varchar

varchar varchar

nvarchar varchar

decimal or numeric decimal

float double

real real

double double

date date

time time

timestamp datetime

clob long

nclob long

blob blob

binary blob

varbinary blob

The following table shows the conversion from internal data types to SAP HANA data types in template tables.

Data type conversion on load to template table

Data Services data type Converts to SAP HANA data type

blob blob

date date

datetime timestamp

decimal decimal

double double

int integer

interval real

long clob/nclob

real decimal

time time

timestamp timestamp


varchar varchar/nvarchar

3.11.6 Using spatial data with SAP HANA

SAP Data Services supports spatial data such as point, line, polygon, and collection for specific databases.

Data Services supports spatial data in the following databases:

● Microsoft SQL Server for reading ● Oracle for reading ● SAP HANA for reading and loading

When you import a table with spatial data columns, Data Services imports the spatial type columns as character-based large objects (clob). The column attribute is Native Type, which has the value of the actual data type in the database. For example, the native type for Oracle is SDO_GEOMETRY, for Microsoft SQL Server it is geometry/geography, and for SAP HANA it is ST_GEOMETRY.

Limitations

● You cannot create template tables with spatial types because spatial columns are imported into Data Services as clob. ● You cannot manipulate spatial data inside a data flow because the spatial utility functions are not supported.

Parent topic: SAP HANA [page 108]

Related Information

Cryptographic libraries and global.ini settings [page 108] Bulk loading in SAP HANA [page 110] Creating stored procedures in SAP HANA [page 112] SAP HANA database datastores [page 113] Datatype conversion for SAP HANA [page 119]

3.11.6.1 Loading spatial data to SAP HANA

Load spatial data from Oracle or Microsoft SQL Server to SAP HANA.

Learn more about spatial data by reading the SAP HANA documentation.

1. Import a source table from Oracle or Microsoft SQL Server to SAP Data Services.
2. Create a target table in SAP HANA with the appropriate spatial columns.
3. Import the SAP HANA target table into Data Services.
4. Create a data flow with an Oracle or Microsoft SQL Server source as reader. Include any necessary transformations.
5. Add the SAP HANA target table as a loader. Make sure not to change the data type of spatial columns inside the transformations.
6. Build a job that includes the data flow and run it to load the data into the target table.
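For step 2, the following is a minimal sketch of a spatial target table in SAP HANA; the table name, column names, and the SRID 4326 are example values only, not part of the documented procedure.

 Sample Code

-- Minimal sketch of an SAP HANA target table with a spatial column.
-- Table and column names, and the SRID 4326, are example values only.
CREATE COLUMN TABLE "SPATIAL_TARGET"
(
    "ID"    INTEGER,
    "SHAPE" ST_GEOMETRY(4326)
);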

3.11.6.2 Loading complex spatial data from Oracle to SAP HANA

Complex spatial data is data such as circular arcs and LRS geometries.

1. Create an Oracle datastore for the Oracle table.

For instructions, see the guide Supplement for Oracle Applications.
2. Import a source table from Oracle to SAP Data Services using the Oracle datastore.
3. Create a target table in SAP HANA with the appropriate spatial columns.
4. Import the SAP HANA target table into Data Services.
5. Create a data flow in Data Services, but instead of including an Oracle source, include a SQL transform as reader.
6. Retrieve the data from the Oracle database directly. First, open the SQL transform, then add the SQL Select statement. Add the SQL Select statement by calling the following functions against the spatial data column:
   ○ SDO_UTIL.TO_WKTGEOMETRY
   ○ SDO_GEOM.SDO_ARC_DENSIFY

For example, in the SQL below, the table name is “Points”. The “geom” column contains the following geospatial data:

SELECT
    SDO_UTIL.TO_WKTGEOMETRY(
        SDO_GEOM.SDO_ARC_DENSIFY(
            geom,
            (MDSYS.SDO_DIM_ARRAY(
                MDSYS.SDO_DIM_ELEMENT('X',-83000,275000,0.0001),
                MDSYS.SDO_DIM_ELEMENT('Y',366000,670000,0.0001)
            )),
            'arc_tolerance=0.001'
        )
    )
FROM "SYSTEM"."POINTS"

For more information about how to use these functions, see the Oracle Spatial Developer's Guide on the Oracle Web page at SDO_GEOM Package (Geometry) .
7. Build a job in Data Services that includes the data flow and run it to load the data into the target table.

3.12 About SAP Vora datastore

Use the SAP Vora datastore as a source in a data flow, and a template table for the target.

With an SAP Vora datastore, access Vora tables by using the SAP HANA ODBC driver and the SAP HANA wire protocol.

SAP Data Services loads data from the Vora target template table to a CSV staging file in one of the following file types:

● Locally configured ● HDFS ● Amazon S3 HDFS

The software loads the table from the local file and appends data to the existing table in SAP Vora.

Perform the following tasks with the SAP Vora datastore:

● Import Vora tables. ● Append data to existing Vora tables using INSERT. ● Utilize bulk loading. ● View Vora table data in Data Services. ● Browse metadata.

Consider the following limitations when you use an SAP Vora datastore:

● The datastore does not work for SAP Vora views and partitions. ● The datastore uses the SAP Vora relational disk engine. It is not applicable for other engines such as SAP Vora graph engine or collection engine. ● The datastore does not permit partial column mapping.

The following are SAP Vora datastore requirements:

● Use with SAP Vora version 2.0 and later versions. To access SAP Vora with versions earlier than 2.0, use the ODBC datastore. ● Use the SAP HANA version 2.0 Support Package 2 ODBC driver for the SAP HANA wire protocol. ● Ensure that the datastore user is registered as an SAP Vora “Vora user.” For details about user types, see your SAP Vora Developer Guide.

SAP Vora datastore [page 124] Access table data in SAP Vora using an SAP Vora datastore as a source or target in a data flow.

Configuring DSN for SAP Vora on Windows [page 125] With SAP Vora on a Windows platform, configure a DSN type connection while you create the datastore.

Configuring DSN for SAP Vora on Unix and Linux [page 126] With SAP Vora on Unix or Linux environments, configure a DSN type connection using the Connection Manager.

SAP Vora table source options [page 127] Use tables imported with a SAP Vora datastore as sources in your data flows.

SAP Vora target table options [page 128] Use an SAP Vora datastore as a target in a data flow.

SAP Vora data type conversions [page 130] SAP Vora has different data types than SAP Data Services. Therefore, Data Services must perform data conversion upon reading data from and loading data to SAP Vora tables.

3.12.1 SAP Vora datastore

Access table data in SAP Vora using an SAP Vora datastore as a source or target in a data flow.

The following table describes the options in the datastore editor specific to SAP Vora.

SAP Vora datastore options

Option Description

Data Source Name Specifies the data source name to use for the datastore. A DSN is required for SAP Vora datastores.

If you have already created a DSN, select it from the dropdown list. To create a new DSN, select ODBC Admin to open the ODBC Administrator, where you can create a DSN.

 Note

Ensure that you use the SAP HANA version 2.0 SP2 ODBC driver or a later version.

ODBC Admin Opens the ODBC Administrator to create a DSN.

Parent topic: About SAP Vora datastore [page 123]

Related Information

Configuring DSN for SAP Vora on Windows [page 125] Configuring DSN for SAP Vora on Unix and Linux [page 126] SAP Vora table source options [page 127] SAP Vora target table options [page 128] SAP Vora data type conversions [page 130] Common datastore options

3.12.2 Configuring DSN for SAP Vora on Windows

With SAP Vora on a Windows platform, configure a DSN type connection while you create the datastore.

Download and install the SAP HANA ODBC driver version 2.0 SP2 or later. Then open the applicable SAP Vora datastore in the datastore editor.

1. Click ODBC Admin.

The ODBC Data Source Administrator dialog box opens. 2. Open the System DSN tab and click Add. 3. Select the HDBODBC driver from the list.

The HDBODBC driver appears in the list only if you have downloaded and installed the driver as instructed in Prerequisites. 4. Click Finish.

The ODBC Configuration for SAP HANA dialog box opens. 5. Enter a name in Data Source Name. Optionally enter a description in Description. 6. Enter the server name and port number separated with a colon in Server:Port.

 Example

vora:30115

7. If the Vora 2.x server has TLS enabled, click Settings.

The Advanced ODBC Connection Property Setup dialog box opens. 8. Check Connect using SSL to enable SSL and click OK. 9. Click Connect to test the connection. 10. When the connection tests successfully, click OK.

Task overview: About SAP Vora datastore [page 123]

Related Information

SAP Vora datastore [page 124] Configuring DSN for SAP Vora on Unix and Linux [page 126] SAP Vora table source options [page 127] SAP Vora target table options [page 128] SAP Vora data type conversions [page 130]

3.12.3 Configuring DSN for SAP Vora on Unix and Linux

With SAP Vora on Unix or Linux environments, configure a DSN type connection using the Connection Manager.

Download and install the SAP HANA ODBC driver version 2.0 SP2 or later. The file name is libodbcHDB.so.

Use the GTK+2 toolkit to create a graphical user interface for Connection Manager. The GTK+2 is a free multiplatform toolkit that creates user interfaces. For more information about obtaining and installing GTK+2, see https://www.gtk.org/ . The following instructions assume that you have the GUI for Connection Manager.

1. In a Command Prompt, open the Connection Manager as follows:

 Sample Code

$ cd $LINK_DIR/bin/

$ ./DSConnectionManager.sh

The SAP Data Services Connection Manager dialog box opens. 2. In the Data Sources tab, select SAP Vora and click Add.

The Configuration for SAP Vora dialog box opens. 3. Enter the remaining options as described in the following table.

Driver options

Option Description

ODBC ini File Enter the absolute pathname for the odbc.ini file.

DSN Name Select the DSN name from the dropdown list.

User Name Enter the user name to access the SAP Vora table.

Password Enter the password to access the SAP Vora table.

Driver Enter the location and name of the SAP HANA ODBC driver (libodbcHDB.so).

Host Name Enter the server name.

Port Enter the port number.

SSL Encryption Option Select y if the Vora server has TLS enabled.

Select n if the Vora server does not have TLS enabled.

4. Optional. Click Test Connection. When the connection is successful, click OK. 5. Click Close to close the Connection Manager.

Task overview: About SAP Vora datastore [page 123]

Related Information

SAP Vora datastore [page 124] Configuring DSN for SAP Vora on Windows [page 125] SAP Vora table source options [page 127] SAP Vora target table options [page 128] SAP Vora data type conversions [page 130]

3.12.4 SAP Vora table source options

Use tables imported with an SAP Vora datastore as sources in your data flows.

When you drag the datastore table onto the Data Services workspace as a source, the software automatically completes the options in the following table. You cannot edit these options.

Source options

Option Description

Table name Name of the SAP Vora table.

Table owner Name of the SAP Vora table owner.

Datastore name Name of the SAP Vora datastore.

Database type Vora.

Parent topic: About SAP Vora datastore [page 123]

Related Information

SAP Vora datastore [page 124] Configuring DSN for SAP Vora on Windows [page 125] Configuring DSN for SAP Vora on Unix and Linux [page 126] SAP Vora target table options [page 128] SAP Vora data type conversions [page 130]

3.12.5 SAP Vora target table options

Use an SAP Vora datastore as a target in a data flow.

The following table contains options and descriptions specific for configuring an SAP Vora datastore as a target in a data flow. Find all other option descriptions for target tables in the topic “Common target table options” in the Reference Guide.

Option tab

Option Description

Drop and re-create table Specifies whether the software drops the existing table and creates a new table with the same name before loading. This option is required for SAP Vora.

Table type Specifies the type of SAP Vora table to process.

● IN-MEMORY: Uses the relational in-memory store in SAP Vora. Loads relational data into the main memory for fast access. IN-MEMORY is the default setting. Bulk loading is required for IN-MEMORY tables. If you choose IN-MEMORY, configure a valid file location supported by SAP Vora. Specify the file location in the Vora Import File Location option.
● DISK: Performs regular table loading. Uses the disk engine in SAP Vora. Provides relational data processing for data sets that do not fit into main memory.

Read about the engines in the SAP Vora Installation and Administration Guide.

If the existing target table is a DATASOURCE table, set the following bulk loading options.

Bulk Loader options

Option Description

Bulk Load Specifies whether the software uses SAP Vora bulk loading options to write data.

● Selected: Enables bulk loading. ● Not Selected: Disables bulk loading.

 Note

When you enable bulk loading, also set the number of loaders in Number of Loaders option.

Option Description

Vora Import File Location Specifies the location of the local file to use for loading SAP Vora tables.

● Local ● HDFS ● WEBHDFS ● S3 ● ADL (Azure Data Lake Store)

If you select Bulk Load, the local file system should be WEBHDFS, S3, or ADL.

Clean up bulk loader directory after load Specifies whether the software deletes the data files after successfully completing the bulk load.

● Selected: Deletes the data files after successfully completing the bulk load. Selected is the default.
● Not selected: Does not delete the data files after successfully completing the bulk load.

If the bulk load does not successfully complete, the data file and auxiliary files remain in the bulk loader directory. Ensure that you manually delete the data files. If you do not select Bulk Load, the files remain in the local file system for you to clean up manually.

General settings

Option Description

Number of loaders Specifies the number of loaders the software uses when bulk loading is enabled.

● The default is 1. ● Enter more than 1 for parallel loading.

For parallel loading, each loader receives the number of rows indicated in the Rows per commit option in turn. Then the software applies the rows in parallel with the other loaders.

Applicable for DATASOURCE and TRANSACTIONAL table types only.

The following table describes the Vora template tables and the type of table created based on whether the Bulk Load option is selected or not selected.

Template table Bulk Load option Table type created

IN-MEMORY Selected DATASOURCE


DISK Not selected (default) STREAMING

DISK Selected DATASOURCE

Parent topic: About SAP Vora datastore [page 123]

Related Information

SAP Vora datastore [page 124] Configuring DSN for SAP Vora on Windows [page 125] Configuring DSN for SAP Vora on Unix and Linux [page 126] SAP Vora table source options [page 127] SAP Vora data type conversions [page 130]

3.12.6 SAP Vora data type conversions

SAP Vora has different data types than SAP Data Services. Therefore, Data Services must perform data conversion upon reading data from and loading data to SAP Vora tables.

The following table shows the conversion between SAP Vora data types and Data Services data types.

SAP Vora data type to SAP Data Services data type

SAP Vora data type SAP Data Services data type

integer int

tinyint int

smallint int

bigint decimal

char varchar

varchar varchar

real real

double double

decimal decimal


boolean int

SAP Data Services data type to SAP Vora data type

SAP Data Services data type SAP Vora data type

int integer

varchar varchar

interval real

real real

double double

decimal decimal

date date

time time

datetime timestamp

timestamp timestamp

blob varchar

long varchar

Parent topic: About SAP Vora datastore [page 123]

Related Information

SAP Vora datastore [page 124] Configuring DSN for SAP Vora on Windows [page 125] Configuring DSN for SAP Vora on Unix and Linux [page 126] SAP Vora table source options [page 127] SAP Vora target table options [page 128]

3.13 Data Services Connection Manager (Unix)

Use the Connection Manager after you install Data Services on Unix to configure ODBC databases and ODBC drivers for repositories, sources, and targets.

The Connection Manager is a command-line utility. However, a graphical user interface (GUI) is available.

 Note

To use the graphical user interface for Connection Manager, you must install the GTK+2 library. The GTK+2 is a free multi-platform toolkit that creates user interfaces. For more information about obtaining and installing GTK+2, see https://www.gtk.org/ .

To use DSConnectionManager.sh from the command line, use the -c parameter, which must be the first parameter.

If an error occurs when using the Connection Manager, use the -d option to show details in the log.

For example:

$LINK_DIR/bin/DSConnectionManager.sh -c -d

 Note

For Windows installation, use the ODBC Driver Selector to configure ODBC databases and drivers for repositories, sources, and targets.

Related Information

Using the ODBC Drivers Selector for Windows

4 Cloud computing services

SAP Data Services provides access to various cloud databases and storages to use for reading or loading big data.

Cloud databases [page 133] Access various cloud databases through file location objects and file format objects.

Cloud storages [page 155] Access various cloud storages through file location objects and gateways.

4.1 Cloud databases

Access various cloud databases through file location objects and file format objects.

SAP Data Services supports many cloud database types to use as readers and loaders in a data flow.

Amazon Redshift database [page 134] Redshift is a cloud database designed for large data files.

Azure SQL database [page 141] Developers and administrators who use Microsoft SQL Server can store on-premise SQL Server workloads on an Azure virtual machine in the cloud.

Google BigQuery [page 142] The Google BigQuery datastore contains access information and passwords so that the software can open your Google BigQuery account on your behalf.

Snowflake [page 149] Snowflake provides a data warehouse that is built for the cloud.

Parent topic: Cloud computing services [page 133]

Related Information

Cloud storages [page 155]

4.1.1 Amazon Redshift database

Redshift is a cloud database designed for large data files.

In SAP Data Services, you create a database datastore to access your data from Amazon Redshift. Additionally, load Amazon S3 data files into Redshift using the built-in function load_from_s3_to_redshift.

Amazon Redshift datastores [page 134] Use an Amazon Redshift datastore to import and load tables, load Amazon S3 data files, and more.

Configuring Redshift as source using DSConnectionManager [page 135] Use the DSConnection Manager to configure Amazon Redshift as a source for Data Services.

Amazon Redshift source [page 136] Option descriptions for using an Amazon Redshift database table as a source in a data flow.

Amazon Redshift target table options [page 137] Descriptions of options for using an Amazon Redshift table as a target in a data flow.

Amazon Redshift data types [page 139] SAP Data Services converts Redshift data types to the internal data types when it imports metadata from a Redshift source or target into the repository.

4.1.1.1 Amazon Redshift datastores

Use an Amazon Redshift datastore to import and load tables, load Amazon S3 data files, and more.

Use a Redshift database datastore for the following tasks:

● Import tables ● Read or load Redshift tables in a data flow ● Preview data ● Create and import template tables ● Load Amazon S3 data files into a Redshift table using the built-in function load_from_s3_to_redshift

The following table describes the options specific for Redshift when you create or edit a datastore.

Main window options

Option Description

Enable Automatic Data Transfer Select to enable transfer tables in this datastore. The Data_Transfer transform uses transfer tables to push down subsequent database operations.

This option is enabled by default.

Use an Amazon Redshift ODBC driver to connect to the Redshift cluster database. The Redshift ODBC driver connects to Redshift on Windows and Linux platforms only.

For information about downloading and installing the Amazon Redshift ODBC driver, see the Amazon Redshift documentation on the Amazon website.

 Note

Enable secure socket layer (SSL) settings in the Amazon Redshift ODBC Driver. In the Amazon Redshift ODBC Driver DSN Setup window, set the SSL Authentication option to allow.

For details about Amazon Redshift support, see the Supplement for Big Data.

Related Information

Amazon Redshift database [page 134]

4.1.1.2 Configuring Redshift as source using DSConnectionManager

Use the DSConnection Manager to configure Amazon Redshift as a source for Data Services.

1. Download and install the Amazon Redshift ODBC driver for Linux. For more information, read about installing the Redshift ODBC driver for Linux in the Amazon Redshift Management Guide on the Amazon website ( http://docs.aws.amazon.com/redshift/latest/mgmt/install-odbc-driver-linux.html ).

After installing the ODBC driver on Linux, configure the following files:

○ amazon.redshiftodbc.ini ○ odbc.ini ○ odbcinst.ini

For more information about configuring these .ini files, see the Amazon Redshift Management Guide on the Amazon website (http://docs.aws.amazon.com/redshift/latest/mgmt/odbc-driver-configure-linux-mac.html ).
2. At the end of /opt/amazon/redshiftodbc/lib/64/amazon.redshiftodbc.ini, add a line to point to the libodbcinst.so file. This file is in the unixODBC/lib directory.

For example, ODBCInstLib=/home/ec2-user/unixODBC/lib/libodbcinst.so.

In addition, in the [Driver] section of the amazon.redshiftodbc.ini file, set DriverManagerEncoding to UTF-16.

For example,

[Driver]

DriverManagerEncoding=UTF-16

3. Configure the Linux ODBC environment. a. Run DSConnectionManager.sh and configure a data source for Redshift.

 Note

The Unix ODBC Lib Path is based on where you install the driver. For example, for Unix ODBC 2.3.4 the path would be /build/unixODBC-232/lib.

Specify the DSN name from the list or add a new one:

DS42_REDSHIFT

Specify the User Name:

Type database password:(no echo)

Retype database password:(no echo)

Specify the Unix ODBC Lib Path:

/build/unixODBC-232/lib

Specify the Driver:

/opt/amazon/redshiftodbc/lib/64/libamazonredshiftodbc64.so

Specify the Driver Version:'8'

8

Specify the Host Name:

Specify the Port:

Specify the Database:

Specify the Redshift SSL certificate verification mode [require|allow|disable|prefer|verify-ca|verify-full]:'require'

require Testing connection...

Successfully added database source.

Related Information

Amazon Redshift data types [page 139] Amazon Redshift source [page 136] Amazon Redshift target table options [page 137] Amazon S3 file location protocol options [page 156] load_from_s3_to_redshift Amazon Redshift datastores [page 134]

4.1.1.3 Amazon Redshift source

Option descriptions for using an Amazon Redshift database table as a source in a data flow.

When you use an Amazon Redshift table as a source, the software supports the following features:

● All Redshift data types ● Optimized SQL ● Basic push-down functions

The following list contains behavior differences from Data Services when you use certain functions with Amazon Redshift:

● When using add_month(datetime, int), pushdown doesn't occur if the second parameter is not in an integer data type.
● When using cast(input as ‘datatype’), pushdown does not occur if you use the real data type.
● When using to_char(input, format), pushdown doesn't occur if the format is ‘XX’ or a number such as ‘099’, ‘999’, ‘99D99’, ‘99G99’.
● When using to_date(date, format), pushdown doesn't occur if the format includes a time part, such as ‘YYYY-MM-DD HH:MI:SS’.

For more about push-down functions, see SAP Note 2212730, “SAP Data Services push-down operators, functions, and transforms”. Also read about maximizing push-down operations in the Performance Optimization Guide.

The following table lists source options when you use an Amazon Redshift table as a source:

Option Description

Table name Name of the table that you added as a source to the data flow.

Table owner Owner that you entered when you created the Redshift table.

Datastore name Name of the Redshift datastore.

Database type Database type that you chose when you created the datastore. You cannot change this option.

The Redshift source table also uses common table source options.

Related Information

Amazon Redshift data types [page 139] Amazon Redshift target table options [page 137] Amazon Redshift datastores [page 134] Viewing Optimized SQL

4.1.1.4 Amazon Redshift target table options

Descriptions of options for using an Amazon Redshift table as a target in a data flow.

The Amazon Redshift target supports the following features:

● input keys
● auto correct
● data deletion from a table before loading
● transactional loads
● load triggers, pre-load commands, and post-load commands
● bulk loading: When you use the bulk load feature, Data Services generates files and saves the files to the bulk load directory that is defined in the Amazon Redshift datastore. If there is no value set for the bulk load directory, the software saves the data files to the default bulk load location at: %DS_COMMON_DIR%/log/BulkLoader. Data Services then copies the files to Amazon S3 and executes the Redshift copy command to upload the data files to the Redshift table.

 Note

The Amazon Redshift primary key is informational only and the software does not enforce key constraints for the primary key. Be aware that using SELECT DISTINCT may return duplicate rows if the primary key is not unique.

 Note

The Amazon Redshift ODBC driver does not support parallel loading via ODBC into a single table. Therefore, the Number of Loaders option in the Options tab is not applicable for a regular loader.

Bulk loader tab

Option Description

Bulk load Select to use bulk loading options to write the data.

Mode Select the mode for loading data in the target table:

● Append: Adds new records to the table.

 Note

Append mode does not apply to template tables.

● Truncate: Deletes all existing records in the table, and then adds new records.

S3 file location Enter or select the path to the Amazon S3 configuration file. You can enter a variable for this option.

Maximum rejects Enter the maximum number of acceptable errors. After the maximum is reached, the software stops bulk loading. Set this option when you expect some errors. If you enter 0, or if you do not specify a value, the software stops the bulk loading when the first error occurs.

Column delimiter Enter a single-character column delimiter.

Text delimiter Enter a single-character text delimiter.

If you specify a Text delimiter other than a single quote (‘) together with a comma (,) as the Column delimiter, Data Services treats the data file as a .csv file.

Generate files only Enable to generate data files that you can use for bulk loading.

When enabled, the software loads data into data files instead of the target in the data flow. The software writes the data files into the bulk loader directory specified in the datastore definition.

If you do not specify a bulk loader directory, the software writes the files to <%DS_COMMON_DIR%>\log\bulkloader\. Then you manually copy the files to the Amazon S3 remote system.

The file name is ___.dat, where is the name of the target table.

Option Description

Clean up bulk loader directory after load Enable to delete all bulk load-oriented files from the bulk load directory and the Amazon S3 remote system after the load is complete.

Parameters Allows you to enter some Amazon Redshift copy command data conversion parameters, such as escape, emptyasnull, blanksasnull, ignoreblanklines, and so on. These parameters define how to insert data to a Redshift table. For more information about the parameters, see https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html#r_COPY-syntax-overview-optional-parameters .
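For orientation only, the following hedged sketch shows the general shape of a Redshift COPY statement that uses several of these data conversion parameters. The table name, S3 path, and IAM role are placeholders, and the exact statement that Data Services generates may differ.

 Sample Code

-- Hedged illustration of a Redshift COPY statement with data conversion
-- parameters. Table, bucket, and IAM role values are placeholders only.
COPY sales_stage
FROM 's3://my-bucket/bulkload/sales_stage.dat'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
DELIMITER ','
EMPTYASNULL
BLANKSASNULL
IGNOREBLANKLINES;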

General settings

Option Description

Number of loaders Sets the number of threads to generate multiple data files for a parallel load job. Enter a positive integer for the number of loaders (threads).

Related Information

Amazon Redshift source [page 136] Amazon Redshift datastores [page 134]

4.1.1.5 Amazon Redshift data types

SAP Data Services converts Redshift data types to the internal data types when it imports metadata from a Redshift source or target into the repository.

The following table lists the internal data type that Data Services uses in place of the Redshift data type.

Convert Redshift to Data Services internal data type

Redshift data type Converts to Data Services data type

smallint int

integer int

bigint decimal(19,0)

decimal decimal

real real

float double

boolean varchar(5)


char char

 Note

The char data type doesn't support multibyte characters. The maximum range is 4096 bytes.

nchar char

varchar varchar

nvarchar  Note

The varchar and nvarchar data types support UTF8 multibyte characters. The size is the number of bytes and the max range is 65535.

 Caution

If you try to load multibyte characters into a char or nchar data type column, Redshift issues an error. Redshift internally converts nchar and nvarchar data types to char and varchar. The char data type in Redshift doesn't support multibyte characters. Use overflow to catch the unsupported data or, to avoid this problem, create a varchar column instead of using the char data type.

date date

timestamp datetime

text varchar(256)

bpchar char(256)

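Following the caution above, using a varchar column instead of char avoids the multibyte limitation when you design a Redshift table that holds such data; the table and column names below are examples only.

 Sample Code

-- Example only: prefer VARCHAR over CHAR for columns that may hold
-- multibyte (UTF-8) text in Amazon Redshift.
CREATE TABLE customer_names
(
    customer_id INTEGER,
    full_name   VARCHAR(200)  -- multibyte-safe; CHAR would reject multibyte data
);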
The following data type conversions apply when you use a Redshift template table as the target.

Data Services internal data type to Redshift data type

Data Services data type Redshift template table data type

blob varchar(max)

date date

datetime datetime

decimal decimal

double double precision

int integer


interval float

long varchar(8190)

real float

time varchar(25)

timestamp datetime

varchar varchar/nvarchar

char char/nchar

4.1.2 Azure SQL database

Developers and administrators who use Microsoft SQL Server can store on-premise SQL Server workloads on an Azure virtual machine in the cloud.

The Azure virtual machine supports both Unix and Windows platforms.

Moving files to and from Azure containers [page 141] Use scripts along with a file location object to move files (called blobs when in a container) from an Azure container to your local directory or to move blobs processed in SAP Data Services into your Azure container.

4.1.2.1 Moving files to and from Azure containers

Use scripts along with a file location object to move files (called blobs when in a container) from an Azure container to your local directory or to move blobs processed in SAP Data Services into your Azure container.

Use an existing container or create one if it does not exist. The files can be any type. Data Services does not internally manipulate files. Currently, Data Services supports block blobs in the container storage type.

Use a file format to describe a blob file and use it within a data flow to perform extra operations on the file. The file format can also be used in a script to automate upload and to delete the local file.

The following are the high-level steps for uploading files to a container storage blob in Microsoft Azure.

1. Create a storage account in Azure and take note of the primary shared key. For more information, see Microsoft documentation or Microsoft technical support.
2. Create a file location object with the Azure Cloud Storage protocol.
3. Create a job in Data Services Designer.
4. Add a script containing the appropriate function to the job.

To move files between remote and local directories, use the following scripts:

○ copy_to_remote_system
○ copy_from_remote_system

To access a subfolder in your Azure container, specify the subfolder in the script.

 Example

copy_to_remote_system('New_FileLocation', '*', '/<remote directory>/<subfolder>/')

A script that contains this function copies all of the files from the local directory specified in the file location object to the container specified in the same object. When you include the remote directory and subfolder in the script, the function copies all of the files from the local directory to the subfolder specified in the script.

5. Save and run the job.
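 Example

A minimal sketch of the reverse direction, assuming that copy_from_remote_system accepts the same arguments as copy_to_remote_system in the example above; the file location name New_FileLocation and the subfolder name 2016/ are illustrative values only, not values defined by this guide.

 Sample Code

copy_from_remote_system('New_FileLocation', '*', '/<remote directory>/2016/');

A script that contains this call copies all blobs from the 2016 virtual folder of the container specified in the file location object into the local directory defined in the same object.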

Related Information

File location object Azure Cloud Storage protocol [page 160] copy_from_remote_system copy_to_remote_system

4.1.3 Google BigQuery

The Google BigQuery datastore contains access information and passwords so that the software can open your Google BigQuery account on your behalf.

After accessing your account, SAP Data Services can load data to or extract data from your Google BigQuery projects:

● Extract data from a Google BigQuery table to use as a source for Data Services processes.
● Load generated data from Data Services to Google BigQuery for analysis.
● Automatically create and populate a table in your Google BigQuery dataset by using a Google BigQuery template table.

For complete information about Data Services and Google BigQuery, see the Supplement for Google BigQuery.

Datastore option descriptions [page 143] Complete datastore common options, and options that are specific for Google BigQuery in the datastore editor.

Google BigQuery target table [page 144] Option descriptions for the Target tab in the datastore explorer for the Google BigQuery datastore table.

Optimize data extraction performance [page 145] When you have larger data files to extract from Google BigQuery, create a file location object that uses Google Cloud Storage (GCS) protocol to optimize data extraction.

load_from_gcs_to_gbq [page 146] Use the load_from_gcs_to_gbq function to transfer data from Google Cloud Storage into Google BigQuery tables.

gbq2file [page 148] Use the gbq2file function to optimize software performance when you export large-volume Google BigQuery results to a user-specified file on your local machine.

4.1.3.1 Datastore option descriptions

Complete datastore common options, and options that are specific for Google BigQuery in the datastore editor.

The following table contains option descriptions specific to Google BigQuery.

Google BigQuery datastore option descriptions

Option Instruction

Datastore Name Enter a unique name for the datastore.

Datastore Type Select Google BigQuery.

Web Service URL Accept the default: https://www.googleapis.com/bigquery/v2.

Authentication Server URL Consists of the Google URL plus the name of the Web access service provider, OAuth 2.0.

Accept the default: https://accounts.google.com/o/oauth2/token.

Authentication Access Scope Grants Data Services read and write access to your Google projects.

Accept the default: https://www.googleapis.com/auth/bigquery.

Service Account Email Address Paste the service account e-mail address from your Google project.

Service Account Private Key Specifies the local path and file name for the service account private key.

Prior to creating the datastore, generate the Service Account Private Key in your Google BigQuery account. Select the format, either .p12 or .JSON, and specify a local file location to save the private key.


Service Account Signature Algorithm Algorithm that Data Services uses to obtain an access token from the Authentication Server. Data Services uses the algorithm to sign JSON Web Tokens with your service account private key to obtain the access token.

Accept the default: SHA256withRSA

Substitute Access Email Address Optional. Enter the substitute e-mail address from your Google BigQuery datastore.

Cloud Storage File Location object Specifies the location for a cloud storage file location object.

Applicable when you download datasets that are larger than 10 MB. When you download data sets that are smaller than 10 MB, leave blank.

Create a file location object with your Google Cloud Platform (GCP) access information.

Using your Google Cloud Storage (GCS) account for reading Google BigQuery data may improve processing performance.

For more information, see Optimize data extraction performance [page 145].

Dataset name for temporary results Specifies a name to store results of large queries temporarily.

Leave this option blank to have Data Services use a hidden dataset.

Google defines large results as results that require special handling by Google. For more information about large results, see "Writing large query results" at https://cloud.google.com/bigquery/docs/writing-results#large-results .

4.1.3.2 Google BigQuery target table

Option descriptions for the Target tab in the datastore explorer for the Google BigQuery datastore table.

When you include a Google BigQuery table in a data flow, you edit the target information for the target table. Double-click the target table in the data flow to open the target editor.

Options specific to Google BigQuery

Option Description

Make Port Creates an embedded data flow port from a source or target file.

Default is No. Choose Yes to make a source or target file an embedded data flow port.

For more information, see “Creating embedded data flows” in the Designer Guide.

Mode Designates how Data Services updates the Google BigQuery table. The default is Truncate.

● Append: Adds new records generated from Data Services processing to the existing Google BigQuery table.
● Truncate: Replaces all existing records from the Google project table with the uploaded data from Data Services.

Number of loaders Sets the number of threads to use for processing.

Enter a positive integer for the number of loaders (threads).

Each loader starts one resumable load job in Google BigQuery to load data.

Loading with one loader is known as single loader loading. Loading when the number of loaders is greater than 1 is known as parallel loading. You can specify any number of loaders.

Maximum failed records per loader Sets the maximum number of records that can fail per loader before Google stops loading records. The default is zero (0).

The Target tab also displays the Google table name and the datastore used to access the table.

4.1.3.3 Optimize data extraction performance

When you have larger data files to extract from Google BigQuery, create a file location object that uses Google Cloud Storage (GCS) protocol to optimize data extraction.

Consider the following factors before you decide to use the GCS file location object for optimization. Compare the time saved using optimization against the potential fees from using your GCS account in this manner. Additionally, the optimization may not be beneficial for smaller data files of less than or equal to 10 MB.

What you need

Required information to complete the GCS file location object includes the following:

● GCS bucket name
● Bucket folder structure
● Authentication access scope
● Service account private key file

How to set it up

1. Create a GCS file location object in Designer.
2. Select gzip for the compression type in the GCS file location.
3. Create a Google BigQuery datastore object in Designer.
4. In the datastore, complete the Use Google Cloud Storage for Reading option by selecting the GCS file location name from the dropdown list.
5. Create a data flow in Designer and add a SQL transform as a reader.
6. Open the SQL transform and enter a SQL statement in the SQL tab to specify the data to extract.
7. Set up the remaining data flow in Designer.

4.1.3.4 load_from_gcs_to_gbq

Use the load_from_gcs_to_gbq function to transfer data from Google Cloud Storage into Google BigQuery tables.

 Syntax

load_from_gcs_to_gbq("<datastore name>", "<file name>", "<table name>", "<mode>", "<file format>")

Return value

int

Returns 1 if function is successful. Returns 0 if function is not successful.

Where

<datastore name> Name of the Google BigQuery datastore.

<file name> Name of the file to copy from the remote server in the format gs://bucket/filename. You can use wildcards.

<table name> Name of the Google BigQuery table in the format dataset.table.

<mode> Optional. The write mode values are:

● Append
● Truncated

Append is the default.

<file format> The format of the data files using one of the following values:

● CSV: For CSV files. This is the default value.
● DATASTORE_BACKUP: For datastore backups.
● NEWLINE_DELIMITED_JSON: For newline-delimited JSON.
● AVRO: For Avro.

Details

Data Services uses the local and remote paths and Google Cloud Storage protocol information from the named file location object. After the data is in a Google BigQuery table, use it as a source in a data flow.

 Example

To copy a file json08_from_gbq.json from a Google BigQuery datastore named NewGBQ1 on a remote server to a Google BigQuery table named test.json08 on a local server, set up the following script:

 Sample Code

load_from_gcs_to_gbq('NewGBQ1', 'gs://test-bucket_1229/from_gbq/ json08_from_gbq.json', 'test.json08', 'append', 'NEWLINE_DELIMITED_JSON');

4.1.3.5 gbq2file

Use the gbq2file function to optimize software performance when you export large-volume Google BigQuery results to a user-specified file on your local machine.

 Syntax

gbq2file('<datastore name>', '<query name>', '<local file name>', '<GCS file location object>', '<field delimiter>', '<row delimiter>');

Return value

int

Returns 1 if function is successful. Returns 0 if function is not successful.

Where

<datastore name> Name of the Google BigQuery application datastore in Data Services.

<query name> Name of the query in Google BigQuery.

<local file name> Local file location and file name in which to store the Google data.

Should be the same location as your local server.

<GCS file location object> Name of the Google Cloud Storage file location object in Data Services.

<field delimiter> Optional. The field delimiter to use between fields in the exported data. The default is a comma.

<row delimiter> For example, /013

 Note

Default is 10, hex 0A.

Details

The software uses information in the associated Google cloud storage (GCS) file location object to identify your GCS connection information, bucket name, and compression information.

How the function works:

1. The function saves your Google BigQuery results to a temporary table in Google.
2. The function uses an export job to export data from the temporary table to GCS.

 Note

If the data is larger than 1 GB, Google exports the data in multiple files.

3. The function transfers the data from your Google Cloud Storage to the local file that you specified.
4. After the transfer is complete, the function deletes the temporary table and any files from Google Cloud Storage.

For details about creating a Google BigQuery application datastore, see the Supplement for Google BigQuery.
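 Example

A minimal usage sketch, assuming the parameter order shown in the Where table above and that the optional row delimiter can be omitted; the datastore name NewGBQ1, the query name daily_sales, the local file path /tmp/daily_sales.csv, and the file location object name GCS_FileLoc are illustrative values only, not values defined by this guide.

 Sample Code

gbq2file('NewGBQ1', 'daily_sales', '/tmp/daily_sales.csv', 'GCS_FileLoc', ',');

The function writes the query results to /tmp/daily_sales.csv, using the connection and bucket details from the GCS_FileLoc file location object for the intermediate export.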

Related Information

Google Cloud Storage file location object [page 167]

4.1.4 Snowflake

Snowflake provides a data warehouse that is built for the cloud.

After connecting to Snowflake, you can do the following:

● Import tables
● Read or load Snowflake tables in a data flow
● Create and load data into template tables
● Browse and import the tables located under different schemas (for example, Netezza)
● Preview data
● Push down base SQL functions and Snowflake-specific SQL functions (see SAP Note 2212730 )
● Bulk load data (possible through an AWS S3 file location or an Azure Cloud Storage file location)

Using an ODBC driver to connect to Snowflake [page 150] Use the DSConnection Manager to configure Snowflake as a source for Data Services.

Snowflake source [page 151] Option descriptions for using a Snowflake database table as a source in a data flow.

Snowflake data types [page 152] SAP Data Services converts Snowflake data types to the internal data types when it imports metadata from a Snowflake source or target into the repository.

Snowflake target table options [page 153] Descriptions of options for using a Snowflake table as a target in a data flow.

Related Information

Creating a target template table

4.1.4.1 Using an ODBC driver to connect to Snowflake

Use the DSConnection Manager to configure Snowflake as a source for Data Services.

1. Download and install the Snowflake ODBC driver from the Snowflake website. Data Services supports Snowflake ODBC Driver version 2.16.0 and higher.

For more information about the Snowflake ODBC driver, see the Snowflake User Guide on the Snowflake website.
2. Configure a Data Source Name (DSN). For information about setting up a DSN-less connection, see "Using the ODBC Driver Selector on Windows for server name connections" in the Administrator Guide.

For Windows:
a. Open ODBC Data Source Administrator from your Windows Start menu, or click the ODBC Administrator button in the Datastore Editor when you create the Snowflake datastore in Data Services.
b. In the ODBC Administrator, open the System DSN tab and select the ODBC driver for Snowflake that you just installed.
c. Click Configure.
d. Enter the required information into the Snowflake Configuration Dialog window and click Save. For information about the connection parameters, see the Snowflake User Guide.

For Linux:
a. Run the Connection Manager utility: DSConnectionManager.sh. Find complete instructions for using the Connection Manager in the Administrator Guide.
b. Type the number that corresponds to the database type. For example, type 18 for Snowflake.
c. Complete the remaining options. The Connection Manager creates the DSN. For information about the connection parameters, see the Snowflake User Guide.
3. After you have configured the Snowflake DSN with Connection Manager, check to see that the Connection Manager created the following configuration files:

○ odbc.ini
○ ds_odbc.ini

 Note

If you currently have other ODBC connections configured, check to see that the connection information is added to the odbc.ini and ds_odbc.ini configuration files.

The odbc.ini file is located in $ODBCINI:

Driver=/usr/lib64/snowflake/odbc/lib/libSnowflake.so
UID=sapdhuser
PWD=
SERVER=
PORT=
DATABASE=DS-DB
SCHEMA=
WAREHOUSE=
ROLE=

The ds_odbc.ini file is located in <…>/bin:

Driver=/usr/local/unixODBC-2.3.2/lib/libodbc.so

4. Create a Snowflake database datastore using the DSN you just created. For more information, see “Defining a database datastore” in the Designer Guide.

Related Information

Configure drivers with data source name (DSN) connections Using the ODBC Drivers Selector for Windows Defining a database datastore

4.1.4.2 Snowflake source

Option descriptions for using a Snowflake database table as a source in a data flow.

When you use a Snowflake table as a source, the software supports the following features:

● All Snowflake data types
● SQL functions and Snowflake-specific SQL functions
● Push-down ODBC generic functions

For more about push-down functions, see SAP Note 2212730 , "SAP Data Services push-down operators, functions, and transforms". Also read about maximizing push-down operations in the Performance Optimization Guide.

The following table lists source options when you use a Snowflake table as a source:

Option Description

Table name Name of the table that you added as a source to the data flow.

Table owner Owner that you entered when you created the Snowflake table.

Datastore name Name of the Snowflake datastore.

Database type Database type that you chose when you created the datastore. You cannot change this option.

Table Schema Name of the table schema.

Data Services Supplement for Big Data Cloud computing services PUBLIC 151 Performance settings

Option Description

Join Rank Specifies the rank of the data file relative to other tables and files joined in a data flow.

Enter a positive integer. The default value is 0.

The software joins sources with higher join ranks before joining sources with lower join ranks.

If the data flow includes a Query transform, the join rank specified in the Query transform overrides the Join Rank specified in the File Format Editor.

For new jobs, specify the join rank only in the Query transform editor.

For more information about setting “Join rank”, see “Source- based performance options” in the Performance Optimiza­ tion Guide.

Cache Specifies whether the software reads the data from the source and loads it into memory or pageable cache.

● Yes: Always caches the source unless the source is the outer-most source in a join. Yes is the default setting.
● No: Never caches the source.

If the data flow includes a Query transform, the cache setting specified in the Query transform overrides the Cache setting specified in the File Format Editor.

For new jobs, specify the cache only in the Query transform editor.

For more information about caching, see “Using Caches” in the Performance Optimization Guide.

Array fetch size Indicates the number of rows retrieved in a single request to a source database. The default value is 1000. Higher numbers reduce the number of requests, lowering network traffic and possibly improving performance. The maximum value is 5000.

4.1.4.3 Snowflake data types

SAP Data Services converts Snowflake data types to the internal data types when it imports metadata from a Snowflake source or target into the repository.

The following table lists the internal data type that Data Services uses in place of the Snowflake data type.

Snowflake data type Converts to Data Services data type Notes

byteint/tinyint decimal(38,0)

smallint decimal(38,0)

int/integer decimal(38,0)

bigint decimal(38,0)

number/numeric/decimal decimal The default precision is 38.

float double

double double

real double

varchar varchar

char varchar Default is 1 byte.

string/text varchar Default is 16 MB.

boolean int

binary blob

varbinary blob

datetime/timestamp datetime

date date

time time

semi-structure not supported VARIANT, OBJECT, ARRAY

If Data Services encounters a column that has an unsupported data type, it does not import the column. However, you can configure Data Services to import unsupported data types by checking the Import unsupported data types as VARCHAR of size option in the datastore editor dialog box.

4.1.4.4 Snowflake target table options

Descriptions of options for using a Snowflake table as a target in a data flow.

The Snowflake target supports the following features:

● transactional loads
● load triggers, pre-load commands, and post-load commands
● bulk loading

Bulk loader tab

Option Description

Bulk load Select to use bulk loading options to write the data.


Mode Select the mode for loading data in the target table:

● Append: Adds new records to the table.

 Note

Append mode does not apply to template tables.

● Truncate: Deletes all existing records in the table, and then adds new records.

Remote Storage Select the remote storage method:

● Amazon S3: Utilizes the S3 file location to copy the local data file into the staging storage temporarily and then load it to the target Snowflake table. Both the staging local file and its duplicate in the staging storage can be cleaned up according to the bulk loader settings.
● Microsoft Azure: Utilizes the Azure file location to copy the local data file into the staging storage temporarily and then load it to the target Snowflake table. Both the staging local file and its duplicate in the staging storage can be cleaned up according to the bulk loader settings.

File Location Enter or select the corresponding Amazon S3 or Microsoft Azure file location. You can enter a variable for this option.

Generate files only Enable to generate data files that you can use for bulk loading.

When enabled, the software loads data into data files instead of the target in the data flow. The software writes the data files into the bulk loader directory specified in the datastore definition.

If you do not specify a bulk loader directory, the software writes the files to %DS_COMMON_DIR%/log/BulkLoader/. Then you manually copy the files to the Amazon S3 or Microsoft Azure remote system.

The file name is SFBL_<table name>_<…>.dat, where <table name> is the name of the target table.

Clean up bulk loader directory after load Enable to delete all bulk load-oriented files from the bulk load directory after the load is complete.

General settings

Option Description

Column comparison Specifies how the software maps the input columns to persistent cache table columns.

● Compare by position: The software disregards the column names and maps source columns to target columns by position.
● Compare by name: The software maps source columns to target columns by name. Compare by name is the default setting.

Number of loaders Sets the number of threads to generate multiple data files for a parallel load job. Enter a positive integer for the number of loaders (threads).

The Snowflake target table also uses some common options. See the Related Links in this topic for more information.

Related Information

Error Handling options Transaction Control Amazon S3 [page 156] Azure Data Lake Store protocol options [page 165]

4.2 Cloud storages

Access various cloud storages through file location objects and gateways.

File location objects specify file transfer protocols so that SAP Data Services safely transfers data from server to server.

For information about using SAP Big Data Services to access Hadoop in the cloud, see the Supplement for SAP Cloud Platform Big Data Services.

Amazon S3 [page 156] Amazon Simple Storage Service (S3) is a product of Amazon Web Services that provides scalable storage in the cloud.

Azure blob storage [page 160] Blob data is unstructured data that is stored as objects in the cloud. Blob data is text or binary data such as documents, media files, or application installation files.

Azure Data Lake Store protocol options [page 165] Use an Azure Data Lake Store file location object to read data from and upload data to your Azure Data Lake Store.

Google cloud storage [page 167] Use a Google file location object to access data in your Google cloud account.

Parent topic: Cloud computing services [page 133]

Related Information

Cloud databases [page 133] Upload data to HDFS in the cloud [page 69]

4.2.1 Amazon S3

Amazon Simple Storage Service (S3) is a product of Amazon Web Services that provides scalable storage in the cloud.

Store large volumes of data in an Amazon S3 cloud storage account. Then use SAP Data Services to securely download your data to a local directory. Configure a file location object to specify both your local directory and your Amazon S3 directory.

Data Services provides built-in functions for processing data that you can use with data from S3 and data that you load to S3. There is one built-in function specifically for moving data from S3 to Amazon Redshift named load_from_s3_to_redshift.

Also use the copy_to_remote_system and copy_from_remote_system functions. Data Services concatenates the remote directory that you specify in the copy function with the information in the file location object to form a full directory structure that includes subfolders.
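 Example

The following is a minimal sketch of an upload script for Amazon S3, assuming a file location object named S3_FileLocation that uses the Amazon S3 protocol and a staging/2020/ subfolder in the bucket; both names are illustrative only, and the arguments follow the same pattern as the copy_to_remote_system example in the topic about moving files to and from Azure containers.

 Sample Code

copy_to_remote_system('S3_FileLocation', '*.dat', 'staging/2020/');

Data Services concatenates staging/2020/ with the bucket and remote directory information in the file location object, so the .dat files from the local directory land in that subfolder of the S3 bucket.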

Amazon S3 file location protocol options [page 156] When you configure a file location object for Amazon S3, complete all applicable options, especially the options specific to Amazon S3.

Related Information load_from_s3_to_redshift copy_from_remote_system copy_to_remote_system

4.2.1.1 Amazon S3 file location protocol options

When you configure a file location object for Amazon S3, complete all applicable options, especially the options specific to Amazon S3.

Use a file location object to access data or upload data stored in your Amazon S3 account. To view options common to all file location objects, see the Reference Guide. The following table describes the file location options that are specific to the Amazon S3 protocol.

 Restriction

You must have "s3:ListBucket" rights in order to view a list of buckets or a specific bucket.

Option Description

Access Key Specifies the Amazon S3 identification input value.

Secret Key Specifies the Amazon S3 authorization input value.


Region Specifies the name of the region you are transferring data to and from; for example, "South America (Sao Paulo)".

Server-Side Encryption Specifies the type of encryption method to use.

Amazon S3 uses a key to encrypt data at the object level as it writes to disks in the data centers and then decrypts it when the user accesses it:

● None
● Amazon S3-Managed Keys
● AWS KMS-Managed Keys
● Customer-Provided Keys

Data Services displays either one or none of the three remaining encryption options based on your selection here.

Encryption Algorithm Specifies the encryption algorithm to use to encode the data. For example, AES256 or aws:kms.

AWS KMS Key ID Specifies whether to create and manage encryption keys via the Encryption Keys section in the AWS IAM console.

Leave this option blank to use a default key that is unique to you, the service you're using, and the region in which you're working.

AWS KMS Encryption Context Specifies the encryption context of the data.

The value is a base64-encoded UTF-8 string holding JSON with the encryption context key-value pairs.

 Example

If the encryption context is {"fullName": "John Connor" }, you need it base64-encoded:

echo '{"fullName": "John Connor" }' | openssl enc -base64
eyJmdWxsTmFtZSI6ICJKb2huIENvbm5vciIgfSANCg==

Enter eyJmdWxsTmFtZSI6ICJKb2huIENvbm5vciIgfSANCg== in the encryption context option.

Customer Key Specifies the key.

Enter a value less-than or equal-to 256 bits.


Communication Protocol/Endpoint URL Specifies the communication protocol you use with S3.

● http
● https
● Enter the endpoint URL

If you choose to enter the endpoint URL, consider the following information:

● If the endpoint URL is for http, and it contains a region, ensure that you use a dash before the region.

 Example

For example, enter http://s3-<region>.amazonaws.com. For the U.S. East (N. Virginia) region, the endpoint is http://s3.amazonaws.com. Notice the period instead of a dash.

● If the endpoint URL is for https, enter the endpoint URL using either a dash or a period.

 Example

Enter either https://s3-<region>.amazonaws.com or https://s3.<region>.amazonaws.com.

Compression Type Specifies the compression type to use.

The software compresses the files before uploading to S3 and decompresses the files after download from S3.

 Note

When you upload a file to the Amazon S3 cloud server using the copy_to_remote_system() function and gzip compression, Data Services adds a .gz file extension to the file name. For example, sample.txt.gz.

When you download the file, Data Services decompresses the file and removes the .gz extension from the file name. For example, sample.txt.

Connection Retry Count Specifies the number of times the software should try to upload or download data before stopping the upload or download.


Batch size for uploading data, MB Specifies the size of the data transfer to use for uploading data to S3.

Data Services uses single-part uploads for files less than 5 MB in size, and multi-part uploads for files larger than 5 MB. Data Services limits the total upload batch size to 100 MB.

Batch size for downloading data, MB Specifies the size of the data transfer Data Services uses to download data from S3.

Number of threads Specifies the number of upload and download threads for transferring data to S3.

Storage Class Specifies the S3 cloud storage class to use to restore files.

● STANDARD: Default storage class.
● REDUCED_REDUNDANCY: For noncritical, reproducible data.
● STANDARD_IA: Stores object data redundantly across multiple geographically separated availability zones.
● ONEZONE_IA: Stores object data in only one availability zone.

 Note

The GLACIER storage class is not supported. Data Services can't specify the GLACIER storage class during object creation.

For more information about the storage classes, see the Amazon AWS documentation.

Remote directory Optional. Specifies the name of the directory for Amazon S3 to transfer files to and from.

Bucket Specifies the name of the Amazon S3 bucket that contains the data.

Local directory Optional. Specifies the name of the local directory to use to create the files. If you leave this field empty, Data Services uses the default Data Services workspace.

Proxy host, port, user name, password Specifies proxy information when you use a proxy server.

Related Information

Amazon Redshift datastores [page 134] load_from_s3_to_redshift Common options

4.2.2 Azure blob storage

Blob data is unstructured data that is stored as objects in the cloud. Blob data is text or binary data such as documents, media files, or application installation files.

Access Azure blob storage by creating an Azure cloud file location object.

Azure Cloud Storage protocol [page 160] When you configure a file location object for Azure Cloud Storage, complete all applicable options, especially the options specific to Azure Cloud Storage.

Number of threads for Azure blobs [page 164] The number of threads is the number of parallel uploaders or downloaders to be run simultaneously when you upload or download blobs.

Related Information

Moving files to and from Azure containers [page 141]

4.2.2.1 Azure Cloud Storage protocol

When you configure a file location object for Azure Cloud Storage, complete all applicable options, especially the options specific to Azure Cloud Storage.

Use a file location object to access data or upload data stored in your Azure Cloud Storage account. To view options common to all file location objects, see the Reference Guide. The following table lists the file location object descriptions for the Azure Cloud Storage protocol.

Option Description

Name Specifies the file name for the file location object.

Protocol Specifies the type of file transfer protocol.

For Azure, the protocol is Azure Cloud Storage.

Account Name Specifies the name for the Azure storage account in the Azure Portal.

Storage Type Specifies the storage type to access. Data Services supports only one type for Azure Cloud Storage: Container storage, block blobs.


Authorization Type Indicates whether you use an account-level or service-level shared access signature (SAS). If you use a service-level SAS, indicate whether you access a resource in a file (blob) or in a container service.

● Primary Shared Key: Authentication for Azure Storage Services using an account-level SAS. Accesses resources in one or more storage services.
● File (Blob) Shared Access Signature: Authentication for Azure blob storage services using a service-level SAS. Select to access a specific file (blob).
● Container Shared Access Signature: Authentication for Azure container storage services using a service-level SAS. Select to access blobs in a container.

Shared Access Signature URL Specifies the access URL that enables access to a specific file (blob) or blobs in a container. Azure recommends that you use HTTPS instead of HTTP.

To access blobs in a container, include the following elements: https://<account>/<container>/

To access a specific file (blob), include the following elements: https://<account>/<container>/<file (blob)>/

Account Shared Key Specifies the Account Shared Key. Obtain a copy from the Azure portal in the storage account information.

 Note

For security, the software does not export the account shared key when you export a data flow or file location object that specifies Azure Cloud Storage as the protocol.

Connection Retry Count Specifies the number of times the computer tries to create a connection with the remote server after a connection fails.

The default value is 10. The value cannot be zero.

After the specified number of retries, Data Services issues an error message and stops the job.


Batch size for uploading data, MB Specifies the maximum size of a data block per request when transferring data files. The limit is 4 MB.

If you use SAP Data Services version 4.2 SP 11 or later versions, the limit is 100 MB.

 Caution

Accept the default setting unless you are an experienced user with an understanding of your network capacities in relation to bandwidth, network traffic, and network speed.

Batch size for downloading data, MB Specifies the maximum size of a data range to be downloaded per request when transferring data files. The limit is 4 MB.

If you use SAP Data Services version 4.2 SP 11 or later versions, the limit is 100 MB.

 Caution

Accept the default setting unless you are an experienced user with an understanding of your network capacities in relation to bandwidth, network traffic, and network speed.

Number of threads Specifies the number of upload and download threads for transferring data to Azure Cloud Storage. The default value is 1.

When you set this parameter correctly, it could decrease the download and upload time for blobs. For more information, see Number of threads for Azure blobs [page 164].


Remote Path Prefix Optional. Specifies the file path for the remote server, excluding the server name. You must have permission to this directory.

If you leave this option blank, the software assumes that the remote path prefix is the user home directory used for FTP.

When an associated file format is used as a reader in a data flow, the software accesses the remote directory and transfers a copy of the data file to the local directory for processing.

When an associated file format is used as a loader in a data flow, the software accesses the local directory location and transfers a copy of the processed file to the remote directory.

Container type storage is a flat file storage system and does not support subfolders. However, Microsoft allows forward slashes in names to form the remote path prefix, which acts as a virtual folder in the container where you upload the files.

 Example

You currently have a container for finance database files. You want to create a virtual folder for each year. For 2016, you set the remote path prefix to: 2016/. When you use this file location, all of the files upload into the virtual folder “2016”.

Local Directory Specifies the path of your local server directory for the file upload or download.

Requirements for local server:

● must exist
● located where the Job Server resides
● you have appropriate permissions for this directory

When an associated file format is used as a reader in a data flow, the software accesses the remote directory and transfers a copy of the data file to the local directory for processing.

 Note

This does not apply to FTP. It applies only to Azure Blob Storage locations.

When an associated file format is used as a loader in a data flow, the software accesses the local directory location and transfers a copy of the processed file to the remote directory.


Container Specifies the Azure container name for uploading or downloading blobs to your local directory.

If you specified the connection information, including account name, shared key, and proxy information (if applicable), click the Container field. The software sends a request to the server for a list of existing containers for the specific account. Either select an existing container or specify a new one. When you specify a new one, the software creates it when you run a job using this file location object.

Proxy Host, Port, User Name, Password Optional. Specifies the proxy information when you use a proxy server.

Related Information

Common options File location object

4.2.2.2 Number of threads for Azure blobs

The number of threads is the number of parallel uploaders or downloaders to be run simultaneously when you upload or download blobs.

The Number of threads setting affects the efficiency of downloading and uploading blobs to or from Azure Cloud Storage.

Determine the number of threads

To determine the number of threads to set for the Azure file location object, base the number of threads on the number of logical cores in the processor that you use.

Example thread settings

Processor logical cores Set Number of threads

8 8

16 16

The software automatically re-adjusts the number of threads based on the blob size you are uploading or downloading. For example, when you upload or download a small file, the software may adjust to use fewer threads and use the block or range size you specified in the Batch size for uploading data, MB or Batch size for downloading data, MB options.

Upload Blob to an Azure container

When you upload a large file to an Azure container, the software may divide the file into the same number of lists of blocks as the setting you have for Number of threads in the file location object. For example, when the Number of threads is set to 16 for a large file upload, the software divides the file into 16 lists of blocks. Additionally, each thread reads the blocks simultaneously from the local file and also uploads the blocks simultaneously to the Azure container.

When all the blocks are successfully uploaded, the software sends a list of commit blocks to the Azure Blob Service to commit the new blob.

If there is an upload failure, the software issues an error message. If the blobs already existed in the Azure container before the upload failure, they stay intact.

When you set the number of threads correctly, you may see a decrease in upload time for large files.

Download Blob from an Azure container

When you download a large file from the Azure container to your local storage, the software may divide the file into as many lists of ranges as the Number of threads setting in the file location object. For example, when the Number of threads is set to 16 for a large file download to your local container, the software divides the blob into 16 lists of ranges. Additionally, each thread downloads the ranges simultaneously from the Azure container and also writes the ranges simultaneously to your local storage.

When your software downloads a blob from an Azure container, it creates a temporary file to hold the ranges downloaded by all of the threads. When all of the ranges are successfully downloaded, the software deletes the existing file from your local storage if it existed, and renames the temporary file using the name of the file that was deleted from local storage.

If there is a download failure, the software issues an error message. The existing data in local storage stays intact if it existed before the download failure.

When you set the number of threads correctly, you may see a decrease in download time.

4.2.3 Azure Data Lake Store protocol options

Use an Azure Data Lake Store file location object to read data from and upload data to your Azure Data Lake Store.

When you create the file location object, select Azure Data Lake Store from the Protocol dropdown list.

The following table describes the file location options that are specific to the Azure Data Lake Store protocol.

Option Description

Data Lake Store name Name of the Azure Data Lake Store to access.

Optionally use a substitution parameter.

Service Principal ID Obtain from your Azure Data Lake Store administrator.

Tenant ID Obtain from your Azure Data Lake Store administrator.

Password Obtain from your Azure Data Lake Store administrator.

Batch size for uploading data (MB) Maximum size of a data block to upload per request when transferring data files. The default setting is 5 MB.

 Caution

Keep the default setting unless you are an experienced user with an understanding of your network capacities in relation to bandwidth, network traffic, and network speed.

Batch size for downloading data (MB) Maximum size of a data range to download per request when transferring data files. The default setting is 5 MB.

 Caution

Keep the default setting unless you are an experienced user with an understanding of your network capacities in relation to bandwidth, network traffic, and network speed.

Number of threads Number of parallel uploaders or downloaders to run simultaneously. The default value is 1.

Remote path prefix Directory path for your files in the Azure Data Lake Store. Obtain the directory path from Azure Data Lake Store Properties.

 Example

If the directory in your Azure Data Lake Store Properties is adl://<Data Lake Store name>.azuredatalakestore.net/<folder name>/, the remote path prefix value is /<folder name>.

Permission to access this directory required.

Optionally use substitution parameter.


Local directory Path to the local directory for your local Data Lake Store data.

Permission to access this directory required.

Optionally use substitution parameter.

Related Information

File location object Variables and Parameters Common options

4.2.4 Google cloud storage

Use a Google file location object to access data in your Google cloud account.

Google Cloud Storage file location object [page 167] A Google Cloud Storage (GCS) file location contains file transfer protocol information for moving data between and GCS and SAP Data Services.

File location option descriptions [page 168] Create a Google Cloud Storage (GCS) file location object and use it in data flows that extract large data files, 10 MB and larger, from GCS into SAP Data Services.

4.2.4.1 Google Cloud Storage file location object

A Google Cloud Storage (GCS) file location contains file transfer protocol information for moving data between GCS and SAP Data Services.

Specifically, use the GCS file location when you extract large data files (greater than 10 MB) from GCS to use as a source in a data flow. The GCS protocol optimizes data extraction for large data files.

After you create the GCS file location, select the file location name in the datastore option Use Google Cloud Storage for Reading.

Related Information

Use Google BigQuery as a source

4.2.4.2 File location option descriptions

Create a Google Cloud Storage (GCS) file location object and use it in data flows that extract large data files, 10 MB and larger, from GCS into SAP Data Services.

The following table lists the file location object descriptions for the Google Cloud Storage protocol.

GCS file location option descriptions

Option Description

Name Specifies a file name for the file location object.

Protocol Specifies the type of file transfer protocol.

Select Google Cloud Storage.

Project Specifies the Google project name. For example, when you use the file location for Google BigQuery, enter the BigQuery project name.

Upload URL Accept the default, https://www.googleapis.com/upload/storage/v1.

Download URL Accept the default, https://www.googleapis.com/storage/v1.

Authentication Server URL Specifies the Google URL plus the name of the Web access service provider, OAuth 2.0.

Accept the default, https://accounts.google.com/o/oauth2/token.

Authentication Access Scope Specifies the specific type of data access permission.

● Read-only: Access to read data, including listing buckets.
● Read-write: Access to read and change data, but not metadata like ACLs.
● Full-control: Full control over data, including the ability to modify ACLs.
● Cloud-platform.read-only: View your data across Google Cloud Platform services. For Google Cloud Storage, this option is the same as devstorage.read-only.
● Cloud-platform: View and manage data across all Google Cloud Platform services. For Google Cloud Storage, this option is the same as devstorage.full-control. Cloud-platform is the default.

Service Account Email Address Specifies the e-mail address from your Google project. This e-mail is the same as the service account e-mail address that you enter into the applicable Google BigQuery datastore.


Service Account Private Key Specifies the P12 or JSON file you generated from your Google project and stored locally.

Click Browse and open the location where you saved the file. Select the .p12 or .JSON file and click Open.

Service Account Signature Algorithm Specifies the algorithm type that Data Services uses to sign JSON Web tokens.

Accept the default: SHA256withRSA.

Data Services uses this value, along with your service account private key, to obtain an access token from the Authentication Server.

Substitute Access Email Address Optional. Enter the substitute e-mail address from your Google BigQuery datastore.

Web Service URL Specifies the Data Services web services server URL that the data flow uses to access the Web server.

Compression Type Specifies the type of compression to use.

● None: Does not use compression.
● gzip: Uses gzip for compression. Enables you to upload gzip files to GCS.

Connection Retry Count Specifies the number of times Data Services tries to create a connection with the remote server after a connection fails.

The default value is 10. The value cannot be zero.

After the specified number of retries, Data Services issues an error notification and stops the job.

Batch size for uploading data, MB Specifies the maximum size for a block of data to upload per request.

The limit is 5 TB.

For each request, Data Services sends a block of data in the specified size for uploading to GCS when transferring data files.

Batch size for downloading data, MB Specifies the maximum size for a block of data to download per request.

The limit is 5 TB.

For each request, Data Services downloads a block of data in the specified size.


Number of threads Specifies the number of threads to run in parallel when transferring data to GCS.

Enter a number from 1 to 30. The default is 1.

If you enter any number outside this range, the software automatically adjusts the number at runtime.

Bucket Specifies the bucket name, which is the name of the basic container that holds your data in GCS.

Select a bucket name from the dropdown list.

For uploading data only, create a new bucket by entering the name of the bucket. If the bucket does not exist in Google Cloud Storage, Google creates the bucket when you perform an upload for the specified bucket.

 Note

If you attempt to download the bucket and it does not exist in Google, the software issues an error.

Remote Path Prefix Optional. Specifies a folder structure of the Google Cloud Storage bucket.

Ensure that the path prefix ends with a forward slash (/). For example, test_folder1/folder2/. You must have permission to this directory.

If you leave this option blank, the software assumes the home directory of your file transfer protocol.

Reader: When an associated file format is used as a reader in a data flow, the software accesses the remote directory and transfers a copy of the data file to the local directory for processing.

Loader: When an associated file format is used as a loader in a data flow, the software accesses the local directory location and transfers a copy of the processed file to the remote directory.


Local Directory Specifies the file path of the local server that you use for this file location object.

The local server directory is located where the Job Server resides. You must have permission to this directory.

 Note

If this option is blank, the software assumes the directory %DS_COMMON_DIR%/workspace as the default directory.

Reader: When an associated file format is used as a reader in a data flow, the software accesses the remote directory and transfers a copy of the data file to the local directory for processing.

Loader: When an associated file format is used as a loader in a data flow, the software accesses the local directory location and transfers a copy of the processed file to the remote directory.

Proxy Host, Port, User Name, Password Optional. If you use a proxy server, specifies the proxy information.
