IBM BigInsights

IBM Open Platform with Apache Hadoop and BigInsights 4.2 Technical Preview

Version 4 Release 2


Edition notice - early release documentation

This document contains proprietary information. All information contained herein shall be kept in confidence. None of this information shall be divulged to persons other than (a) IBM employees authorized by the nature of their duties to receive such information, or (b) individuals with a need to know in organizations authorized by IBM to receive this document in accordance with the terms (including confidentiality) of an agreement under which it is provided. This information might include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements or changes in the product or the programs described in this publication at any time without notice.

© Copyright IBM Corporation 2013, 2016.
US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents

Chapter 1. Introduction to 4.2
  Introduction

Chapter 2. What's New in 4.2
  What's new for Version 4.2
  Open source technologies

Chapter 3. Installing IBM Open Platform with Apache Hadoop
  Get ready to install
  Preparing your environment
  Configuring LDAP server authentication on Red Hat Enterprise Linux 6.7 and 7.2
  Creating a mirror repository for the IBM Open Platform with Apache Hadoop software
  Running the installation package
  Upgrading the Java (JDK) version
  Installing and configuring Ranger in the Ambari web interface
  Configuring MySQL for Ranger
  Installing Ranger plugins
  Set up user sync from LDAP/AD/Unix to Ranger
  Installing Ranger authentication
  Ranger KMS set up and usage
  Cleaning up nodes before reinstalling software
  HostCleanup.ini file
  HostCleanup_Custom_Actions.ini file

Chapter 4. Installing the IBM BigInsights value-added services on IBM Open Platform with Apache Hadoop
  Users, groups, and ports for BigInsights value-add services
  Preparing to install the BigInsights value-add services
  Obtaining the BigInsights value-add services
  Installing the BigInsights value-add packages
  Installing BigInsights Home
  Installing the BigInsights - Big SQL service
  Installing the Text Analytics service
  Enabling Knox for value-add services
  Removing BigInsights value-add services

Chapter 5. Some new or enhanced features for 4.2
  Impersonation in Big SQL
  ANALYZE command
  Auto-analyze
  HCAT_SYNC_OBJECTS stored procedure
  Big SQL integration with Spark
  EXECSPARK table function

Chapter 6. Known problems

Index

Chapter 1. Introduction to 4.2

Introduction

Welcome to the Technical Preview of the IBM® Open Platform with Apache Hadoop and IBM BigInsights 4.2. This README contains information to help ensure the successful installation and operation of IBM Open Platform with Apache Hadoop (IOP) and the BigInsights value-add services.

The information contained in this Technical Preview documentation might not completely describe the functionality that is available in the 4.2 release. The information represents a snapshot of the full 4.2 release. It describes how to install the product and some of the highlights of the 4.2 release. Because the product documentation is still being refined, you might find links that are not valid. Contact your IBM representative for help in those cases.

Description

IBM Open Platform with Apache Hadoop and IBM BigInsights Version 4.2 deliver enterprise Hadoop capabilities with easy-to-use analytic tools and visualization for business analysts and data scientists, rich developer tools, powerful analytic functions, complete administration and management capabilities, and the latest versions of Apache Hadoop and associated projects. This 4.2 release provides full function SQL query capability, with security and performance benefits, to data that is stored in Hadoop.

Obtaining the Technical Preview for 4.2

TECHNICAL PREVIEW DOWNLOAD ONLY

Accept the IBM BigInsights Early Release license agreement:
http://www14.software.ibm.com/cgi-bin/weblap/lap.pl?popup=Y&li_formnum=L-MLOY-9YB5S9&accepted_url=http://ibm-open-platform.ibm.com/repos/beta/4.2.0.0/&title=IBM+BigInsights+Beta+License&declined_url=http://www-01.ibm.com/software/data/infosphere/hadoop/trials.html

Then select the appropriate repository file for your environment:

RHEL6
   https://ibm-open-platform.ibm.com/repos/beta/4.2.0.0/rhel6/

   Use the following TAR files:
   BIPremium-4.2.0.0-beta1.el6.x86_64.tar.gz
   ambari-2.2.0.0-beta1.el6.x86_64.tar.gz
   iop-4.2.0.0-beta1-el6.x86_64.tar.gz
   iop-utils-4.2.0.0-beta1.el6.x86_64.tar.gz

RHEL7
   https://ibm-open-platform.ibm.com/repos/beta/4.2.0.0/rhel7/

   Use the following TAR files:
   BIPremium-4.2.0.0-beta1.el7.x86_64.tar.gz
   ambari-2.2.0.0-beta1.el7.x86_64.tar.gz
   iop-4.2.0.0-beta1-el7.x86_64.tar.gz
   iop-utils-4.2.0.0-beta1.el7.x86_64.tar.gz

Chapter 2. What's New in 4.2

What's new for Version 4.2

New features for Version 4.2

Note: There is no UPGRADE path to or from the IBM Open Platform with Apache Hadoop and BigInsights Version 4.2 technical preview.

Major milestones
v ODPi compliant.
v Express upgrade is available. You can quickly upgrade the entire cluster while it is shut down.
v Apache Spark ecosystem.
v Apache Hadoop ecosystem.

Operating Systems
   Refer to the System Requirements for the most up-to-date information on operating system support:
   v RHEL 6.7+
   v RHEL 7.2

Open Source
   The following open source technologies are now supported:
   v Ranger 0.5.2
   v Phoenix 4.6.1
   v Titan 1.0.0 (Titan server and OLAP are not integrated in IBM Open Platform with Apache Hadoop 4.2)
   The following open source technologies are updated:
   v Ambari 2.2.0
   v Flume 1.6.0
   v Hadoop 2.7.2
   v HBase 1.2.0
   v Kafka 0.9.0.1
   v Knox 0.7.0
   v Slider 0.90.2
   v Solr 5.5
   v Spark 1.6.1

BigInsights Big SQL updates
1. BigInsights - Big SQL is now packaged as part of the IBM BigInsights Premium package.
2. Big SQL and Spark integration is now available as a technical preview. You can invoke Spark jobs from Big SQL by using a table UDF abstraction. The following example calls the SYSHADOOP.EXECSPARK user-defined function to kick off a Spark job that reads a JSON file stored on HDFS:

   SELECT *
     FROM TABLE(SYSHADOOP.EXECSPARK(
          language => 'scala',
          class => 'com.ibm.biginsights.bigsql.examples.ReadJsonFile',
          uri => 'hdfs://host.port.com:8020/user/bigsql/demo.json',
          card => 100000)) AS doc, products
    WHERE doc.country IS NOT NULL
      AND doc.language = products.language;
3. Support for update and delete on Big SQL HBase tables.
4. An impersonation feature, which allows a service user to securely access data in Hadoop on behalf of another user.
5. An auto-analyze feature that runs the ANALYZE command automatically under certain conditions. In addition, the ANALYZE command now has a FOR ALL COLUMNS clause.
6. ANALYZE command improvements, some of which are listed in the following table:

Table 1. Big SQL ANALYZE improvements

Analyze v2
   There are major performance and memory improvements due to the removal of all dependencies on Hive and Map/Reduce. The ANALYZE command with no Map/Reduce dependency is called Analyze v2, which is the default for BigInsights 4.2. You can use Analyze v1 by setting the biginsights.stats.use.v2 property to false. However, Analyze v1 is deprecated and will be removed in future releases of Big SQL.

Cumulative statistics
   When you run the ANALYZE command against a table on a set of columns, and then later run ANALYZE on a second set of columns, the statistics that are gathered from the first ANALYZE command are merged with the statistics that are gathered from the second ANALYZE command.

SYSTEM sampling
   Instead of scanning an entire table, you can specify a percentage of the splits that ANALYZE can run against. Big SQL extrapolates the statistics for the whole table based on the sample of the table that it gathered statistics on. The SYSTEM option can reduce the time to run the ANALYZE command with minor impact on query performance.

FOR ALL COLUMNS
   By using this option, you can collect statistics on all of the columns of the table.

7. Some maintenance improvements:

Table 2. Big SQL maintenance enhancements

Automatic analyze
   By default, ANALYZE is run automatically after a successful LOAD or HCAT_SYNC_OBJECTS call, which automatically gathers statistics on the table to improve query performance. Also, Big SQL determines whether a table has changed significantly and automatically schedules an ANALYZE, if necessary.

Automatic HCAT_SYNC_OBJECTS
   By default, Big SQL automatically synchronizes the Big SQL and Hive catalogs so that when data is added to Hive it can be accessed automatically by Big SQL. In addition, table metadata is preserved through the support of ALTER column. You can customize Big SQL metadata synchronization with Hive with configuration options.

8. Some performance improvements:

Table 3. Big SQL performance improvements

Concurrency improvements
   Improved performance for high concurrency Hadoop, HBase, and Hybrid workloads, which allows for greater throughput and improved CPU utilization by default on Big SQL clusters.

Join range predicate runtime filtering
   For joins on partitioned and non-partitioned tables, runtime filters are automatically injected to reduce unnecessary I/O during join processing.

Deferred partition pruning
   An improvement to the partition pruning feature that allows Big SQL to eliminate more partitions during query execution. You can now consider partitioning on a join column to improve performance even further.

More flexible partitioning options
   In the CREATE HADOOP TABLE statement, you can now partition a table on an expression of a previously defined column. You can apply an expression on a column with many distinct values, and the evaluation of that expression is used as the partitioning key. For example, ... PARTITIONED BY (MONTH(order_date) AS order_month) ...

Additional partition pruning data types
   The following data types now yield better performance when used in the PARTITIONED BY clause of the CREATE HADOOP TABLE statement:
   v DECIMAL
   v DATE stored as DATE

Partitioned tables
   Improved performance for queries against Hadoop tables with tens of thousands of partitions.

LOAD improvements
   Significant performance improvements when loading tables with tens of thousands of partitions.

HBase query improvements
   Significant performance improvements for small HBase queries.

9. Some of the installation improvements:
v Relaxed passwordless SSH requirements for the root user ID.
v An enhanced installation pre-checker and an automated post-checker advisor.
v Upgrade and patch management (4.1 fp2+ to 4.2, Ambari express upgrade).
v Enhanced storage configuration (automatic discovery of multi-disk configuration for database storage and smarter defaults).
v A new Big SQL high availability Ambari interface, with multiple head node support, and automatic enablement by adding new head nodes.
10. Some administration improvements:
v Simplified cluster topology changes, such as decommissioning dead nodes.
v Parallel (and online) cluster topology changes, such as adding and dropping nodes.
v Simplified service configuration management (global configuration updates).
v The ability to install and decommission Big SQL high availability head nodes online.
v Automatic failover and failback.
v Some manual administration options for Big SQL high availability management that are automatic through the Ambari dashboard.
v Automatic database engine diagnostics log management.
v Runtime diagnostics collection tool for problem determination.
v Performance monitoring tool with data collection for serviceability.
11. Some native C++ and Java I/O engine memory management enhancements:
v More optimal distribution of resources for high demand (enterprise) environments.
v Improvements to utilize configured memory more efficiently.
v Optimized internal configuration for large result set transfers.
v Serviceability enhancements for memory management.
v Optimizations for high concurrency, large data volume query workloads.
12. Big SQL disaster recovery improvements:
v Online backup of the Big SQL metastore data (local tables), and an offline restore from a remote site.
v Ability to do regular backups and restores that are configured to meet your recovery window requirements.
13. Deeper warehouse-style storage integration:
v BLU acceleration is integrated within Big SQL 4.2 as a technical preview: in-memory processing of columnar data for analytic workloads. BLU acceleration is a revolutionary query processing technology built by IBM that introduces a set of advanced query processing technologies.
v Ability to create BLU tables on the head node.
v Support for update and delete and transactional workloads.
v By using BLU tables, you can join Hadoop (HDFS) data with local (row-stored) tables.
v The Big SQL high availability feature makes data automatically highly available.
14. When you work with relational database products other than Big SQL, the following Big SQL enhancements help reduce the time and complexity of enabling applications that were written for those other relational database products.

Note: The links in the following table will be "live" at the time of the product's General Availability.

Table 4. SQL compatibility enhancements

New built-in aggregate and scalar functions
   The following new built-in functions increase functionality and compatibility with other relational database management systems:
   v DATE_PART
   v EXTRACT (additional parameters such as EPOCH)
   v HASH
   v HASH4
   v HASH8
   v NEXT_YEAR
   v OVERLAPS
   v POW
   v THIS_QUARTER
   v THIS_WEEK
   v TIMEZONE

Syntax alternatives
   The following SQL syntax alternatives can now be used:
   v LIMIT ... OFFSET is a syntax alternative for a FETCH FIRST ... OFFSET clause.
   v ISNULL and NOTNULL are syntax alternatives for the IS NULL and IS NOT NULL predicates.

Extensions that enhance SQL compatibility
   Big SQL has added several additional SQL extensions to improve SQL compatibility with other vendors:
   v OFFSET clause is now available with FETCH FIRST n ROWS ONLY. LIMIT / OFFSET syntax may also be used in Big SQL.
   v ORDER BY now supports ASC NULLS FIRST and DESC NULLS FIRST (enhanced NULL ordering support).
   v Oracle style join syntax (using +).
   v CONNECT BY support for hierarchical queries.
   v Support for ROWNUM.
   v DUAL table support (such as SELECT * FROM dual).
   v ISNULL can be used as a synonym for IS NULL.
   v NOTNULL can be used as a synonym for IS NOT NULL.
   v NOW special register can be used as a synonym for CURRENT TIMESTAMP.
   v Netezza style CREATE TEMPORARY TABLE.
   v Netezza join syntax (USING clause).
   v Netezza style casting.
   v Extensions to NULLS FIRST / NULLS LAST.
   v You can now use the global variable SQL_COMPAT to activate the following Netezza Performance Server (NPS) compatibility features (SQL_COMPAT='NPS'):
     – Double-dot notation (for database objects), such that .. is interpreted as .
     – TRANSLATE parameter syntax.
     – Operators: The operators ^ and ** are both interpreted as the exponential operator. The operator # is interpreted as bitwise XOR.
     – Use of ordinal and column aliases in the GROUP BY clause. You can specify the ordinal position or exposed name of a SELECT clause column when grouping the results of a query.
     – Netezza style procedural language (NZPLSQL) can be used in addition to the SQL PL language.

New CREATE FUNCTION statement for user-defined aggregate functions
   The new CREATE FUNCTION (aggregate interface) statement allows you to create your own aggregate functions, by using your choice of programming language. To see examples about migrating to this latest Big SQL feature, see Migrate your SQL code to use the user-defined aggregate function.

User-defined aggregate functions
   Big SQL now provides functionality that allows the creation of user-defined aggregate functions for use anywhere in SQL that a built-in aggregate can be used, including OLAP specifications. The functions can be written in either Java or C. By using user-defined aggregate functions, you can now do some of the following:
   v Create a SUM function that returns 0 instead of null when no rows are in a group.
   v Create a MUL function to generate the product of a group of numbers by multiplying them all.
   v Create a function to approximate the number of distinct values in a multiset (HyperLogLog).

Expression operators
   v power (**)
   v modulo (%)
   v bit operators (&, |, and ^ (XOR))

BigInsights - Text Analytics and Web tooling enhancements
v BigInsights - Text Analytics is now packaged as part of the IBM BigInsights Premium package.
v Spark support: the Run on Cluster feature will also support Spark jobs.
v Embedded lightweight AQL editor.
v Import/export of projects.
v Min/max of concepts in a sequence.
v Input/output spec separation.

BigInsights - BigSheets
   No new features for version 4.2.

BigInsights - Big R
   No new features for version 4.2.

Open source technologies

The following open source technologies are included with IBM Open Platform with Apache Hadoop, by release.

Table 5. Open source technology versions by IBM BigInsights value-add services release

Open source technology            4.1.0.0    4.1.0.1    4.1.0.2    4.2
Ambari                            2.1.0      2.1.0      2.1.0      2.2
Flume                             1.5.2      1.5.2      1.5.2      1.6.0
Hadoop (HDFS, YARN, MapReduce)    2.7.1      2.7.1      2.7.1      2.7.2
HBase                             1.1.1      1.1.1      1.1.1      1.2.0
Hive                              1.2.1      1.2.1      1.2.1      1.2.1
Kafka                             0.8.2.1    0.8.2.1    0.8.2.1    0.9.0.1
Knox                              0.6.0      0.6.0      0.6.0      0.7.0
Oozie                             4.2.0      4.2.0      4.2.0      4.2.0
Phoenix                           N/A        N/A        N/A        4.6.1
Pig                               0.15.0     0.15.0     0.15.0     0.15.0
Ranger                            N/A        N/A        N/A        0.5.2
Slider                            0.80.0     0.80.0     0.80.0     0.90.2
Solr                              5.1.0      5.1.0      5.1.0      5.5
Spark                             1.4.1      1.4.1      1.5.1      1.6.1
Sqoop                             1.4.6      1.4.6      1.4.6      1.4.6
Titan                             N/A        N/A        N/A        1.0.0
ZooKeeper                         3.4.6      3.4.6      3.4.6      3.4.6

Chapter 3. Installing IBM Open Platform with Apache Hadoop

IBM Open Platform consists entirely of Apache Hadoop open source components, such as Apache Ambari, HDFS, Flume, Hive, and ZooKeeper. After you install IBM Open Platform, you can add additional IBM value-add service modules. These value-add service modules are installed separately, and they are included in the IBM BigInsights® Premium package.

What is Apache Ambari?

Ambari is a system for provisioning, managing, and monitoring Apache Hadoop clusters. With Ambari, system administrators can perform the following tasks within a centralized web interface:

Provision a cluster
   Ambari provides a wizard to install and configure Hadoop services across any number of hosts in a cluster.

Manage a cluster
   You use Ambari to start, stop, and reconfigure Hadoop services across an entire cluster.

Monitor a cluster
   Ambari has a dashboard to monitor the health and status of a cluster. Ambari uses the Ambari Metrics System for metrics and system alerts.

The Ambari architecture includes a server, agents, and web interface. The Ambari server collects data from your cluster. An Ambari agent is installed on each host so that the Ambari server can control the host. Each host also has a copy of the Ambari Metrics system to collect metric information.

Get ready to install

Prepare your environment before you begin the installation of IBM Open Platform with Apache Hadoop.

Meet the minimum system requirements

Make sure that your system meets the minimum requirements by reviewing the following documentation:
v The detailed system requirements: https://www.ibm.com/support/docview.wss?uid=swg27027565
v The release notes in the IBM Open Platform with Apache Hadoop Knowledge Center.

Host name information that you should have

FQDN
   You must know the fully qualified domain name (FQDN) of each host or node that is going to be part of your IBM Open Platform with Apache Hadoop cluster. The Ambari installation wizard does not support IP addresses, so you must use the FQDN. You can determine the FQDN by running the following command from the Linux command line of each node in the cluster:

   hostname -f

Preexisting database instances
   If you plan to install Hive/HCatalog (typical), and you want to use a preexisting instance of MySQL/MariaDB (less typical), then you must know the host name, the database name, and the password for that instance.

Components on each host
   Identify the components that you might want to set up on each host.

Think about what directories on each host to use for data
   Begin thinking about the base directories to use for storing the following data:
   v NameNode data
   v DataNode data
   v MapReduce data
   v ZooKeeper data (if you install ZooKeeper)
   v Various log files, PID files, and database files (depending on your installation type)
   You do not need the answers yet, but it might be useful to plan your installation strategy.

Base operating system repositories

The Ambari installer pulls some packages from the base operating system repositories. If you do not have these base operating system repositories available to all your machines at the time of installation, you might see some issues. For example, if your operating system is RHEL, make sure that you have the Red Hat Linux and Red Hat Optional repository channels configured and set up prior to installation.

Users and groups for IBM Open Platform with Apache Hadoop

Each service in Hadoop is run under the ownership of a corresponding UNIX account. These accounts are known as service users. The following information lists the users and groups that are created at the time you install IBM Open Platform with Apache Hadoop. In some cases, you will want to create users and groups prior to the installation of IBM Open Platform with Apache Hadoop to ensure consistency across the nodes in your cluster.

UIDs and GIDs must be consistent across all nodes. If you use local service IDs for the IBM Open Platform with Apache Hadoop services, ensure that the UIDs and GIDs are consistent across the cluster by creating them manually.

The following table lists the users and groups that are used by various services of the IBM Open Platform with Apache Hadoop.

If these users are not pre-created, they are created as part of the installation. If you are using LDAP, you can pre-create these IDs in LDAP before installing.

Important: For LDAP users, ensure that group memberships for each service ID are as indicated in the table.

If you are pre-creating the service IDs, create the IDs in all nodes of your cluster, and ensure consistent UIDs and GIDs across all of the nodes to make sure the services function properly.

Tip: If you are not using LDAP for your service users and groups, and you are depending on the users and groups that are created by Ambari, you might see some user ID and group ID inconsistencies across the different nodes in your cluster. With IBM Open Platform with Apache Hadoop, the Ambari installation wizard creates new service user accounts and preserves any existing service user accounts. These are the accounts that are used to configure Hadoop services. To avoid the possibility of UID and GID inconsistencies, create the service users and groups before you install IBM Open Platform with Apache Hadoop.

Table 6. Users and groups for IBM Open Platform with Apache Hadoop

User                                     Group        Service
apache (the user account is optional)    apache
ams                                      hadoop       Ambari metric service
postgres                                 postgres
hive                                     hadoop       Hive
oozie                                    hadoop       Oozie
ambari-qa                                hadoop
flume                                    hadoop
hdfs                                     hadoop       HDFS
solr                                     hadoop
knox                                     hadoop       Knox
spark                                    hadoop       Spark
mapred                                   hadoop       MapReduce
hbase                                    hadoop       HBase
zookeeper                                hadoop       ZooKeeper
sqoop                                    hadoop       Sqoop
yarn                                     hadoop       YARN
hcat                                     hadoop       HCat, WebHCat
rrdcached                                rrdcached
mysql                                    mysql
hadoop (the user account is optional)    hadoop       Hadoop
kafka                                    hadoop       Kafka
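If you choose to pre-create the service accounts, the following is a minimal sketch of the kind of commands you might run identically on every node. The numeric UID and GID values are examples only, not values required by IBM Open Platform with Apache Hadoop; pick IDs that are unused across your whole cluster.

   groupadd -g 1010 hadoop              # use the same GID on every node
   useradd -u 1011 -g hadoop hdfs       # use the same UID for each account on every node
   useradd -u 1012 -g hadoop yarn
   useradd -u 1013 -g hadoop mapred
   useradd -u 1014 -g hadoop hive
   useradd -u 1015 -g hadoop hbase
   # ...repeat for the remaining service users in Table 6, keeping each
   # user's UID and the hadoop GID identical on all nodes
   id hdfs                              # verify: uid and gid should match across nodes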

Default Ports created by a typical installation

Before you install IBM Open Platform with Apache Hadoop software, use the values in this table to plan for any conflicts that might exist in your system.

Table 7. Default ports

Component                                       Default port
Ambari server (HTTP) port                       8080
Ambari server (HTTPS) port                      8440
Ambari server (HTTPS) port                      8441
dfs.datanode.address                            50010

dfs.datanode.http.address                       50075
dfs.datanode.https.address                      50475
dfs.datanode.ipc.address                        8010
dfs.https.port                                  50470
dfs.journalnode.http-address                    8480
dfs.namenode.http-address                       50070
dfs.namenode.https-address                      50470
dfs.namenode.secondary.http-address             50090
hbase.master.info.port                          60100
hbase.regionserver.info.port                    60030
HBase Master Port                               60000
HBase root                                      8020
HiveServer2 Port                                10000
hive.metastore.uri                              9083
hive.server2.thrift.http.port                   10001
knox gateway.port                               8443
Kafka Port                                      6667
mapreduce.shuffle.port                          13562
Oozie Server base                               11000
Oozie Server Admin Port                         11001
Solr                                            8983
spark_history_ui_port                           18080
spark_thriftserver_port                         10002
templeton.port                                  50111
yarn.nodemanager.address                        45454
yarn.resourcemanager.admin.address              8188
yarn.resourcemanager.address                    8050
yarn.resourcemanager.admin.address              8141
yarn.resourcemanager.resource-tracker.address   8025
yarn.resourcemanager.scheduler.address          8030
yarn.resourcemanager.webapp.address             8088
yarn.timeline-service.webapp.https.address      8190
yarn.timeline-service.address                   10200
ZooKeeper Server Port                           2181
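Before installation, you might quickly check whether anything on a host is already listening on the ports you plan to use. A minimal sketch follows; the port list is only an example subset of Table 7, so extend it to match the services you plan to install.

   for port in 8080 8440 8441 50010 50070 60000 10000 2181; do
     # netstat -lnt lists listening TCP sockets; a match means the port is taken
     netstat -lnt | grep -q ":$port " && echo "Port $port is already in use"
   done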

Setting up port forwarding from private to edge nodes

This is an optional task that depends on the network setup and specific communication requirements of your environment.

If the data nodes are on a private network, by default, there is no direct network path between the data nodes and the external (corporate or internet) network. This design forces communications to be routed through a management or edge node for security reasons. There are scenarios where nodes on the internal Hadoop network should be permitted to initiate network communications directly to the outside network, such as the following examples:
v When you install IBM Open Platform with Apache Hadoop or the BigInsights value-add services, software is downloaded from a network location that might be external to the Hadoop network. In this case, the Ambari clients on the data nodes are not able to get the files they need to perform an install.
v As part of general operating system maintenance, such as updating RPMs, the data nodes on the internal network need to reach an RPM repository on the corporate network or the internet.
v Sqoop is a Map/Reduce job where each data node initiates its own JDBC connection to a data source that exists outside of the Hadoop-only network.

Run the following commands as root on a management node to enable port forwarding between data nodes and management nodes. Data nodes can then initiate communication to servers outside of the private network and receive data, but external servers remain unable to directly address Hadoop nodes on the internal network.
   echo 1 > /proc/sys/net/ipv4/ip_forward
   /sbin/iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
   /sbin/iptables -A FORWARD -i eth0 -o eth1 -m state --state RELATED,ESTABLISHED -j ACCEPT
   /sbin/iptables -A FORWARD -i eth1 -o eth0 -j ACCEPT

These commands assume eth0 is the external network interface. The eth1 traffic is routed to outside of the network as if it were coming from the management node. Adjust the interface names (eth0, eth1) to match your environment.
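Note that the echo and iptables commands above do not survive a reboot on their own. Whether and how you persist them depends on your environment; the following is a minimal sketch for RHEL 6.x (on RHEL 7.x with firewalld disabled, you need the iptables-services package or an equivalent mechanism):

   # Persist IP forwarding across reboots
   echo "net.ipv4.ip_forward = 1" >> /etc/sysctl.conf
   sysctl -p
   # Persist the current iptables rules (RHEL 6.x writes them to /etc/sysconfig/iptables)
   service iptables save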

Test the configuration by logging in to a data node that is part of the internal network only. Then ping or download a file from an external server or web site. The operation should succeed.

Configure your browser

To run the installation program successfully, make sure that your browser is supported. For details, see the system requirements.

Make sure that JavaScript is enabled in your browser.

Configure authentication

You configure LDAP authentication for IBM® Open Platform with Apache Hadoop in three steps.
1. Decide on the directory service you plan to use. IBM Open Platform with Apache Hadoop supports any directory service that complies with the LDAP v3 protocol (for example, OpenLDAP or Microsoft Active Directory).
2. During the installation of IBM Open Platform with Apache Hadoop, you install and enable the Knox service in your cluster. This gateway service must be configured to communicate with the directory service you chose in step 1.
3. If you do not have a directory service configured, you can use the OpenLDAP service bundled with your Linux distribution (for example, see Configuring LDAP server authentication on Red Hat Enterprise Linux 6.7 and 7.2). It is advised that you run this service on a separate node outside of the cluster. You will use this directory service to authenticate all users that need access to the cluster.

Preparing your environment

In addition to product prerequisites, there are tasks common to all IBM Open Platform with Apache Hadoop installation paths. You must complete these common tasks before you start an installation.

Before you begin

The Ambari installer pulls some packages from the base operating system repositories. If you do not have these base operating system repositories available to all your machines at the time of installation, you might see some issues. For example, if your operating system is RHEL, make sure that you have the Red Hat Linux and Red Hat Optional repository channels configured and set up prior to installation.

Use the root user account to perform the following steps. To make sure that you are using the root user account, run the following command from the Linux terminal:
   whoami

Procedure
1. Ensure that adequate disk space exists for the root partition. Issue the following command to return a list of available disks in your cluster. You use the disk partition names when specifying the cache and data directories for your distributed file system.
   df -h

   Here is an example of the output from the command:

   Table 8. Example of df command output

   Filesystem    Size    Used    Avail    Use%    Mounted on
   /dev/sda3     95G     5.4G    85G      7%      /
   tmpfs         5.9G    0       5.9G     0%      /dev/shm
   /dev/sda1     190M    73M     108M     41%     /boot

   Table 9. Estimated space needed

   Directory         Minimum disk space    Recommended disk space for a production environment
   root partition    40 GB                 100 GB

   Many directories are installed in the root partition during the IBM Open Platform with Apache Hadoop installation, so you need enough space for these directories and users.
2. Resolve host names and configure your network.
   a. Make sure that all characters in host names are lowercase.
   b. If you want to change the host names in your cluster, and ensure that they persist between system reboots, do these steps:
      RHEL 6.x

      1) Change the current host name by editing the following configuration file:
         /etc/sysconfig/network
      2) Save the file and then restart the network service.
         /etc/init.d/network restart
      RHEL 7.x
      1) Change the current host name by running the following command:
         hostnamectl set-hostname <new_host_name>
      2) Verify the host name:
         hostnamectl status
      3) Restart the host name service.
         systemctl restart systemd-hostnamed
      Restart the Linux machine to complete the changes.
   c. Edit the /etc/hosts file. Ensure that the host names for all cluster nodes are resolved. The host names must be configured to the same IP addresses as the actual servers, because IBM Open Platform with Apache Hadoop does not support dynamic IP addresses. All hosts in your system must be configured for DNS and reverse DNS. You can resolve host names by using DNS servers, or by ensuring that the host names are mapped correctly in the /etc/hosts file across all nodes in the cluster.

      Note: If you are unable to configure DNS and reverse DNS, contact your Linux administrator.
      In the /etc/hosts file, ensure that localhost is mapped to the loopback address 127.0.0.1, as shown in the following example.
      # Do not remove the following line, or various programs
      # that require network functionality will fail.
      127.0.0.1 localhost.localdomain localhost
      ::1 localhost6.localdomain6 localhost6
      192.0.2.* server_name.com server_name
   d. Edit the /etc/hosts file to include the IP address, fully qualified domain name, and short name of each host in your cluster, separated by spaces. You can edit this file on each node in your cluster, or edit the file on the first node in your cluster and copy it to every other node by using SCP. The format is IP_address domain_name short_name. In the following example, assume that node1 is the host that is used for the Ambari setup and the Ambari server:
      127.0.0.1 localhost.localdomain localhost
      123.123.123.123 node1.abc.com node1
      123.123.123.124 node2.abc.com node2
      123.123.123.125 node3.abc.com node3
   e. If your cluster includes nodes that use private networks only, then you must configure a default gateway to a host that can access the management node, which must reside on a public network. Otherwise, skip this step and continue with the next step.

Important: You must always use a public host name for the management node.

      1) On all private nodes in your cluster, edit the /etc/sysconfig/network-scripts/ifcfg-eth0 file and add the private IP address of the management node. This file contains the private network configuration for the Network Interface Controller (NIC).
         GATEWAY=management_node_IP

         management_node_IP is the private IP of your management node, such as 192.0.2.21.
      2) Save your changes and then restart your network.
         RHEL 6.x
            service network restart
         RHEL 7.x
            systemctl restart network.service
      3) Check the routing tables to ensure that your gateway is enabled.
         route -n
         You should see the gateway that you added as the last line in the kernel IP routing table. You install BigInsights by using the public host name for the management node, and then use private or public host names for other nodes in your cluster.
         Kernel IP routing table
         Destination    Gateway       Genmask    Flags    Metric    Ref    Use    Iface
         ...
         0.0.0.0        192.0.2.21    0.0.0.0    UG       0         0      0      eth0
3. You must set up passwordless SSH connections between the Ambari server host and all other cluster hosts so that the Ambari server can install the Ambari agent automatically on each host.
   a. If you are not already logged in to the master node, which is the node that you designate as the Ambari server host, log in to your Linux master node as root.
   b. On this Ambari server host, generate the public and private SSH keys with the following command:
      ssh-keygen

      When you are asked to enter a passphrase, press the Enter key to make sure that the passphrase is empty. Otherwise, the host registration at Ambari fails with the following error: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
   c. From the master node (assume that node1.abc.com is the master node), which is designated to contain the Ambari server host, copy the SSH public key (id_rsa.pub) to the root account on your target hosts, using the following commands:
      ssh-copy-id -i ~/.ssh/id_rsa.pub root@node1.abc.com
      ssh-copy-id -i ~/.ssh/id_rsa.pub root@node2.abc.com
      ssh-copy-id -i ~/.ssh/id_rsa.pub root@node3.abc.com
      ...

      Ensure that permissions on your .ssh directory are set to 700 (directory owner can read, write, and execute) and the permissions on the authorized_keys file in that directory are set to either 600 (file owner can read and write) or 640 (file owner can read and write, and users in the group can read). You can determine the permission levels by issuing the following command from the /root/ directory and from the /root/.ssh directory:

      ls -l

      If you need to change the permissions, run the following commands:
      chmod 700 /root/.ssh
      chmod 600 /root/.ssh/authorized_keys
   d. From the Ambari server host, connect to each host in the cluster by using SSH. For example, enter the following command:
      ssh root@node1.abc.com

      You may see this warning on your first connection:
      Are you sure you want to continue connecting (yes/no)?

      Enter yes. Repeat the connection attempt from the master node to each child node to make sure that the master node can connect to each child node in the cluster without a password:
      ssh root@node2.abc.com
      ssh root@node3.abc.com
   e. Save a copy of the SSH private key (id_rsa) on the machine where you will run the Ambari installation wizard. The file is in $HOME/.ssh/ by default. You have the option to use the contents of this file during the installation, and saving the contents to a convenient temporary file might save you some steps.
4. Disable firewalls and IPv6.
   a. Run the following commands in succession to disable the firewall (iptables) on all nodes in your cluster.

      Important: Ensure that you reenable the firewall on all nodes in your cluster after installation.
      RHEL 6.x
         chkconfig iptables off

         /etc/init.d/iptables stop
      RHEL 7.x
         systemctl stop firewalld.service

         systemctl disable firewalld.service
   b. For Linux x86_64 systems only, for each client node in your cluster, disable Transparent Huge Pages. To do this, run the following command on each Ambari client node:
      echo never > /sys/kernel/mm/transparent_hugepage/enabled

      Since this change is temporary, add the following commands to your /etc/rc.local file so that the setting is applied automatically when you reboot.
      if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
        echo never > /sys/kernel/mm/transparent_hugepage/enabled
      fi
   c. On all servers in your cluster, disable IPv6.
      1) From the command line, enter ifconfig to check whether IPv6 is running. In the output, an entry for inet6 indicates that IPv6 is running.
      2) Run the command to disable IPv6, based on your operating system:


Red Hat Enterprise Linux 6.x
1. Write the changes to the /etc/sysctl.conf file with the following statements, if the lines do not already exist.
   Open the file to edit it:
   vi /etc/sysctl.conf
   Add these lines if they do not already exist:
   net.ipv6.conf.all.disable_ipv6 = 1
   net.ipv6.conf.default.disable_ipv6 = 1
   net.ipv6.conf.lo.disable_ipv6 = 1
   Make sure the change takes effect:
   sysctl -p

   If a line exists with a different value, change the value according to the example.
2. Verify that IPv6 is disabled. From the command line, enter ifconfig to check whether IPv6 is running. IPv6 is disabled if no line containing inet6 is listed in the output. For example, the shell command ifconfig | grep "inet6 addr:" | wc -l returns a value of "0".

Red Hat Enterprise Linux 7.x
The default settings are in this path:
   /usr/lib/sysctl.d/00-system.conf

To override the default settings, update the 00-system.conf file, or create a file such as the following example:
   /etc/sysctl.d/<file_name>.conf

Run sysctl --system to commit the changes.
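For RHEL 7.x, a minimal sketch of such an override file follows. The file name 99-disable-ipv6.conf is only an example, and the settings are the same three lines that are used for RHEL 6.x above.

   cat > /etc/sysctl.d/99-disable-ipv6.conf <<'EOF'
   net.ipv6.conf.all.disable_ipv6 = 1
   net.ipv6.conf.default.disable_ipv6 = 1
   net.ipv6.conf.lo.disable_ipv6 = 1
   EOF
   sysctl --system                  # reload all sysctl configuration files
   ifconfig | grep -c inet6         # prints 0 when IPv6 is disabled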

5. Check that all devices have a Universally Unique Identifier (UUID) and that the devices are mapped to the mount point.
   a. Display the currently assigned UUIDs for all devices in your cluster:
      blkid
      The output lists all devices and their UUIDs. In the following example, three disks are listed: /dev/sda3, /dev/sda1, and /dev/sda2.
      /dev/sda3: UUID="1632fdf8-2283-4771-9fdd-664964ee7fcf" TYPE="ext3"
      /dev/sda1: UUID="8ed83d7a-4e5f-44a1-8448-533da7109312" TYPE="ext3"
      /dev/sda2: UUID="59f180e3-931f-4b50-aa94-4b3cb0ab2c0a" TYPE="swap"
   b. Hard code the mapping references for devices to the mount point by updating the following file:
      /etc/fstab

These references ensure that device mapping does not change if a device becomes unavailable or stops functioning. The mount point must exist before you create mapping references.

      Important: Before you edit /etc/fstab, save a copy of the original file.
      #/dev/sda3
      UUID=1632fdf8-2283-4771-9fdd-664964ee7fcf /     ext3 defaults 1 1
      #/dev/sda1
      UUID=8ed83d7a-4e5f-44a1-8448-533da7109312 /boot ext3 defaults 1 2
      #/dev/sda2
      UUID=59f180e3-931f-4b50-aa94-4b3cb0ab2c0a swap  swap defaults 0 0
      In the previous example, both /dev/sda3, which is the root file system, and /dev/sda1 are included in the backup dump, as indicated by the first integer listed (1). The second integer determines the order in which file systems are checked. In the previous example, /dev/sda3 is checked first, /dev/sda1 is checked second, and /dev/sda2 is not checked.
6. Before installing the Ambari server, confirm that your environment does not include any existing Ambari installation files, by running a search for the string ambari. The following command returns nothing if no Ambari installation files exist:
      yum list installed | grep -i ambari
   If files exist, follow the procedure in “Cleaning up nodes before reinstalling software”.
7. Ensure that the ulimit properties for your operating system are configured.
   a. Edit the /etc/security/limits.conf file.
   b. Ensure that the nofile and nproc properties contain the following values or greater. The nofile parameter sets the maximum number of files that can be open, and the nproc parameter sets the maximum number of processes that can run. The following values are the minimum values that are required:
      nofile 65536
      nproc 65536
8. Synchronize the clocks of all servers in the cluster by using an internal or external Network Time Protocol (NTP/NTPD) source. The IBM Open Platform with Apache Hadoop installation program synchronizes the other server clocks with the master server during installation. You must enable the NTP/NTPD service on the management node and allow the clients to synchronize with the master node.


Red Hat Enterprise Linux 6.x
1. Ensure that NTP is installed by running the following command:
   rpm -qa | grep ntp

   If it is not installed (which means the grep command returned no output), run the yum install ntp command to install it.
2. From the /etc directory, open the ntp.conf script.
   vi /etc/ntp.conf
3. In the ntp.conf script, search for the line that begins with # Please consider joining the pool (http://www.pool.ntp.org/join.html). After this line, insert one or more of the following time servers.
   server1.rhel.pool.ntp.org
   server2.rhel.pool.ntp.org
   server3.rhel.pool.ntp.org
   Where serverN represents time servers that your organization can access. The rhel.pool.ntp.org is a domain address where the servers are hosted, if the operating system you are using is Red Hat Enterprise Linux. This can change for other Linux distributions. Refer to your Linux distribution documentation for details.
   Note: Ensure that you can ping the time servers listed. If they are unreachable, specify a time server that can be reached, perhaps one within your organization's network.
4. After the configuration, synchronize the servers manually with the following command:
   /usr/sbin/ntpdate pool.ntp.org
5. Update the NTPD service with the time servers that you specified.
   chkconfig --add ntpd
6. Stop and then start the NTPD service.
   service ntpd stop
   service ntpd start
7. Enable the NTPD service to start automatically on reboot by running chkconfig ntpd on.
8. Verify that the clocks are synchronized with a time server by running ntpstat. Running ntpstat fails if the clocks are not synchronized. If the clocks remain unsynchronized, try restarting the service.


Red Hat Enterprise Linux 7.x
1. Ensure that NTP is installed by running the following command:
   rpm -qa | grep ntp

   If it is not installed (which means the grep command returned no output), run the yum install ntp command to install it.
2. From the /etc directory, open the ntp.conf script.
   vi /etc/ntp.conf
3. In the ntp.conf script, search for the line that begins with # Please consider joining the pool (http://www.pool.ntp.org/join.html). After this line, insert one or more of the following time servers.
   server1.rhel.pool.ntp.org
   server2.rhel.pool.ntp.org
   server3.rhel.pool.ntp.org
   Where serverN represents time servers that your organization can access. The rhel.pool.ntp.org is a domain address where these servers are hosted, if the operating system you are using is Red Hat Enterprise Linux. This can change for other Linux distributions. Refer to your Linux distribution documentation for details.
   Note: Ensure that you can ping the time servers listed. If they are unreachable, specify a time server that can be reached, perhaps one within your organization's network.
4. After the configuration, synchronize the servers manually with the following command:
   /usr/sbin/ntpdate pool.ntp.org
5. Stop and then start the NTPD service.
   systemctl stop ntpd.service
   systemctl start ntpd.service
6. Enable the NTPD service to start automatically on reboot:
   systemctl enable ntpd.service
7. Verify that the clocks are synchronized with a time server by running ntpstat. Running ntpstat fails if the clocks are not synchronized. If the clocks remain unsynchronized, try restarting the service.

9. You must disable SELinux before installing IBM Open Platform with Apache Hadoop, and it must remain disabled for IBM Open Platform with Apache Hadoop to function. To disable SELinux temporarily, run the following command on each host in your cluster:
   setenforce 0
   Then, disable SELinux permanently by editing the SELinux configuration file and setting the SELINUX parameter to disabled on each host. This ensures that SELinux remains disabled if the system is rebooted.
   vi /etc/selinux/config
   # This file controls the state of SELinux on the system.
   # SELINUX= can take one of these three values:
   # enforcing - SELinux security policy is enforced.
   # permissive - SELinux prints warnings instead of enforcing.
   # disabled - SELinux is fully disabled.

   SELINUX=disabled
   # SELINUXTYPE= type of policy in use. Possible values are:
   # targeted - Only targeted network daemons are protected.
   # strict - Full SELinux protection.
   SELINUXTYPE=targeted
10. When you install on a Red Hat Enterprise Linux operating system, for all nodes in your cluster, ensure that the ZONE parameter value is valid, which means that it must match an actual file name in /usr/share/zoneinfo. If the ZONE parameter value is set to a file that does not exist, modify the value to refer to the correct time zone. As an example, sometimes spaces exist in the time zone name. Replace the space with an underscore (_) to match the correct time zone file. If this value is not correct, the Open JDK software picks the wrong time zone information, which results in invalid time stamp values.
   RHEL 6.x
      The ZONE parameter value is in /etc/sysconfig/clock, and it must refer to a valid file under /usr/share/zoneinfo. For example, view the contents of the /etc/sysconfig/clock file.
      # The time zone of the system is defined by the contents of /etc/localtime.
      # This file is only for evaluation by system-config-date, do not rely on its
      # contents elsewhere.
      ZONE="America/Los Angeles"

      You see that the value of ZONE is America/Los Angeles, which does not match the /usr/share/zoneinfo/America/Los_Angeles file name. Change the value in the /etc/sysconfig/clock file to the value of the actual file name:
      a. Edit the /etc/sysconfig/clock file:
         vi /etc/sysconfig/clock
      b. Set the value of ZONE to match the file name:
         ZONE="America/Los_Angeles"
   RHEL 7.x
      Use the following command to set the time zone:
         timedatectl set-timezone [timezone]

where timezone is the zone. This command updates the symbolic link for /etc/localtime.

      For example, if the file name is actually /usr/share/zoneinfo/America/New_York, then issue the following command:
         timedatectl set-timezone America/New_York

      The result is the following output:
         /etc/localtime -> ../usr/share/zoneinfo/America/New_York
11. Optional: Use these optional steps to explicitly create a non-root user and group with sudo privileges. Otherwise, default IDs will be used.
   a. On every node in your cluster, as the root user, create the somegroup group and then add the userx user to it.
      1) Add the somegroup group.
         groupadd -g 123 somegroup
      2) Add the userx user to the somegroup group.
         useradd -g somegroup -u 123 userx
      3) Set the password for the userx user.

         passwd userx
   b. On the intended master node, add the userx user to the sudoers group. (Repeat this step for all nodes.)
      1) Edit the sudoers file.
         sudo visudo -f /etc/sudoers
      2) Comment out the following line.
         Defaults requiretty
      3) Locate the following line.
         # %wheel ALL=(ALL) NOPASSWD: ALL
         Replace that line with the following lines, depending on what type of access is required.
         ## Permits users in the somegroup group to run all commands without
         ## supplying a password
         somegroup ALL=(ALL) NOPASSWD:ALL

Configuring LDAP server authentication on Red Hat Enterprise Linux 6.7 and 7.2

If you want to use LDAP authentication on RHEL 6.7 or 7.2 for your users and groups, you must configure your LDAP server before installing IBM Open Platform with Apache Hadoop. You must complete this procedure on every node in your cluster.

Before you begin

You need the following information to complete this procedure. You can find this information in the ldap.conf file in the /etc/openldap directory.
v LDAP server URI, such as ldap://10.0.0.1.
v LDAP server search base, such as dc=example,dc=com.

Add users and user groups to your LDAP configuration. All users of Oozie services, ZooKeeper services, or monitoring services must belong to the hadoop group. For more information about potential groups, see “Get ready to install”.

To disable LDAP authentication, use the following command:
   sudo /usr/bin/authconfig --disableldap --disableldapauth --ldapserver=ldap://your-ldap-server-name:port --ldapbasedn="dc=your-ldap-dc,dc=your-ldap-dc" --update
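Before you start the procedure, you might confirm that a node can reach the LDAP server and that the search base is correct. A minimal sketch, using the example URI and search base from above (substitute your own values):

   # Anonymous search of the base entry; a result confirms connectivity and the base DN
   ldapsearch -x -H ldap://10.0.0.1 -b "dc=example,dc=com" -s base "(objectclass=*)"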

Procedure
1. Install the following required packages.
   yum install authconfig

yum install pam_ldap

   yum install openldap openldap-clients openldap-servers sssd
2. Configure your OpenLDAP server.
   a. Change the directory to /etc/openldap/slapd.d/cn\=config. Then, update the olcDatabase\=\{2\}bdb.ldif parameter to point to the LDAP server config file. In the LDAP server config file, modify the olcSuffix entry to identify your domain. For example, if your domain is example.com, then your suffix looks like the following example.
      olcSuffix "dc=example,dc=com"
   b. Modify the olcRootDN entry to reflect the name of the privileged user who has unrestricted access to your OpenLDAP directory. For example, if the privileged user is ldapadmin and the domain is example.com, then your olcRootDN looks like the following example.
      olcRootDN "cn=ldapadmin,dc=example,dc=com"
   c. Enter a password for your OpenLDAP server by using the olcRootPW parameter. Using a password provides the capability to configure, test, and correct your OpenLDAP system over your local network.
      olcRootPW password
      Alternatively, you can use the slappasswd command to generate an encrypted password that you can copy and paste into the slapd.conf file. The command prompts you to enter a password and then generates an encrypted password.
   d. From the /etc/init.d directory, run the ldap script to start your OpenLDAP server.
      /etc/init.d/slapd start
3. Configure the LDAP user stores and enable your machine to authenticate to your remote LDAP server. You must use the full LDAP URL for your LDAP server.
   /usr/sbin/authconfig --enableldapauth --ldapserver=ldap://ldap.example.com/ --ldapbasedn="dc=ibm,dc=com" --update
4. Configure the LDAP client by using sssd. The sssd configuration is located at /etc/sssd/sssd.conf. Example sssd.conf:
   [sssd]
   config_file_version = 2
   services = nss, pam
   domains = default

   [nss]
   filter_users = root,ldap,named,avahi,haldaemon,dbus,radiusd,news,nscd

[pam]

   [domain/default]
   auth_provider = ldap
   id_provider = ldap
   ldap_schema = rfc2307
   ldap_search_base = ou=im,dc=example,dc=com
   ldap_group_member = memberuid
   ldap_tls_reqcert = never
   ldap_id_use_start_tls = False
   chpass_provider = ldap
   ldap_uri = ldap://ldap.example.com:389/
   ldap_tls_cacertdir = /etc/openldap/cacerts
   entry_cache_timeout = 600
   ldap_network_timeout = 3
   #ldap_access_filter = (&(object)(object))
   ldap_default_bind_dn = cn=Manager,ou=im,dc=example,dc=com
   ldap_default_authtok_type = password
   ldap_default_authtok = YOUR_PASSWORD
   cache_credentials = True
   enumerate = true

Note:

v There are a very large number of LDAP calls made by a Hadoop cluster.
v All IBM Open Platform with Apache Hadoop and IBM BigInsights value-add service users can be local. Adding them to the filter_users clause prevents any call to LDAP. For more information, see http://www-01.ibm.com/support/docview.wss?uid=swg21962541.
5. Edit /etc/nsswitch.conf to make sure that account resolution uses sss.
   passwd: files sss
   shadow: files sss
   group: files sss
6. From the /etc/init.d directory, run the sssd script to start your LDAP client.
   /etc/init.d/sssd start
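After sssd is running, you might verify that LDAP accounts resolve through the sss source configured in /etc/nsswitch.conf. A minimal sketch (the user and group names are examples):

   getent passwd some_ldap_user     # should print the LDAP entry for the user
   getent group hadoop              # should list the expected group members
   id some_ldap_user                # should show the user's UID, GID, and groups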

Creating a mirror repository for the IBM Open Platform with Apache Hadoop software

You can create a mirror of the IBM hosted repository on a machine within your enterprise network and instruct Ambari to use that local repository. You can use this approach when internet access is restricted.

Before you begin

Ensure that you have met all the prerequisites described in Preparing to install IBM Open Platform with Apache Hadoop.

Procedure
1. On a server that is accessible to your cluster, enable network access from all hosts in your cluster to a mirror server. This mirror server can be defined in DNS, or you can add an entry for the mirror server in /etc/hosts on each node of your cluster.
2. Create an HTTP server on the mirror server, such as Apache httpd. Apache httpd might already be installed on your mirror server. If it is not already installed, install it with the yum install httpd command. Start this mirror web server. For Apache httpd, you can start it by using the following command:
   apachectl start
3. Ensure that any firewall settings allow inbound HTTP access from your cluster nodes to your mirror web server.
4. On the mirror web server, create a directory for your IOP repositories, such as /repos. For Apache httpd with document root /var/www/html, type the following command:
   mkdir -p /var/www/html/repos
5. Obtain the following tarballs for the IBM Open Platform repository, using either wget or curl -O.
   TECHNICAL PREVIEW DOWNLOAD ONLY
   Accept the IBM BigInsights Early Release license agreement:
   http://www14.software.ibm.com/cgi-bin/weblap/lap.pl?popup=Y&li_formnum=L-MLOY-9YB5S9&accepted_url=http://ibm-open-platform.ibm.com/repos/beta/4.2.0.0/&title=IBM+BigInsights+Beta+License&declined_url=http://www-01.ibm.com/software/data/infosphere/hadoop/trials.html

   Then select the appropriate repository file for your environment:
   RHEL6
      https://ibm-open-platform.ibm.com/repos/beta/4.2.0.0/rhel6/

      Use the following TAR files:
      BIPremium-4.2.0.0-beta1.el6.x86_64.tar.gz
      ambari-2.2.0.0-beta1.el6.x86_64.tar.gz
      iop-4.2.0.0-beta1-el6.x86_64.tar.gz
      iop-utils-4.2.0.0-beta1.el6.x86_64.tar.gz
   RHEL7
      https://ibm-open-platform.ibm.com/repos/beta/4.2.0.0/rhel7/

Use the following TAR files: BIPremium-4.2.0.0-beta1.el7.x86_64.tar.gz ambari-2.2.0.0-beta1.el7.x86_64.tar.gz iop-4.2.0.0-beta1-el7.x86_64.tar.gz iop-utils-4.2.0.0-beta1.el7.x86_64.tar.gz

Note: If you use a Windows system to download the files, you can also cut and paste the URLs into a web browser and proceed to download the files. You can then transfer the files to the system that will host your mirror/repository files. 6. Extract the IOP repository tarballs in the repos directory under document root. For Apache httpd, issue the following commands: cd /var/www/html/repos tar xzvf 7. Test your local repository by browsing to the web directory: http:///repos
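The following commands sketch steps 6 and 7 end to end. They are an example only: the mirror host name mirror.example.com, the /tmp download location, and the RHEL 6 tarball names are assumptions carried over from the lists above, so adjust them for your environment.

cd /var/www/html/repos
tar xzvf /tmp/ambari-2.2.0.0-beta1.el6.x86_64.tar.gz
tar xzvf /tmp/iop-4.2.0.0-beta1-el6.x86_64.tar.gz
tar xzvf /tmp/iop-utils-4.2.0.0-beta1.el6.x86_64.tar.gz
tar xzvf /tmp/BIPremium-4.2.0.0-beta1.el6.x86_64.tar.gz
# verify that the extracted repositories are served over HTTP
curl -s http://mirror.example.com/repos/ | head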

Running the installation package

To install the IBM Open Platform with Apache Hadoop software, run the installation commands, start the Ambari server, and complete the installation wizard steps.

Before you begin

TECHNICAL PREVIEW DOWNLOAD ONLY

See the download information for the technical preview.

If you need overview information about the Ambari server, see Chapter 3, “Installing IBM Open Platform with Apache Hadoop,” on page 11.

UIDs and GIDs must be consistent across all nodes. If you use local service IDs for IBM Open Platform with Apache Hadoop services, ensure that the UIDs and GIDs are consistent across the cluster by creating them manually. For more information about which users and groups to create, see Table 6 on page 13.

Procedure
1. Log in to your Linux cluster on the master node as root. You can log in as a user with root privileges, but this is not typical.
a. Verify that you are the root user on the management node by running the following commands:
hostname -f
This command returns the current hostname (such as node1.abc.com) of the node that will contain the Ambari server.

whoami
This command returns the user account, root.
b. Log in to the system as the root user:
ssh root@node1.abc.com
2. Ensure that the nc package is installed on all nodes. If you installed the Basic Server option, you might not already have the nc package, which can result in data node failures.
RHEL
yum install -y nc

Your output shows either that the nc package was installed, or that it is already installed and there is nothing for you to do:

... Nothing to do

... Updated: nc.x86_64 0:1.84-24.el6 Complete!

3. If you are using a mirror repository, complete the following steps.
a. Create the ambari.repo file. You can use the following command to copy it from where the Ambari tarball is extracted:
cp /var/www/html/repos/ambari.repo /etc/yum.repos.d/
b. Edit the file /etc/yum.repos.d/ambari.repo. Replace the value for baseurl with your mirror URL. The original baseurl might look like one of the following:
RHEL6
- Ambari: https://ibm-open-platform.ibm.com/repos/Ambari/rhel/6/x86_64/2.2.x/Updates/2.2.0_Spark-1.x.x/BI-AMBARI-2.2.0-Spark-1.x.x-20160105_1211.el6.x86_64.tar.gz
- IOP: https://ibm-open-platform.ibm.com/repos/IOP/rhel/6/x86_64/4.x.x/IOP-4.x-Spark-1.x.x-20151210_1028.el6.x86_64.tar.gz
- IOP-UTILS: https://ibm-open-platform.ibm.com/repos/IOP-UTILS/rhel/6/x86_64/1.1/
RHEL7
- Ambari: https://ibm-open-platform.ibm.com/repos/Ambari/rhel/7/x86_64/2.2.x/Updates/2.2.0_Spark-1.x.x/BI-AMBARI-2.2.0-Spark-1.x.x-20160105_1212.el7.x86_64.tar.gz
- IOP: https://ibm-open-platform.ibm.com/repos/IOP/rhel/7/x86_64/4.x.x/Updates/4.x.0.0_Spark-1.x.x/IOP-4.x-Spark-1.x.x-20151209_2001.el7.x86_64.tar.gz
- IOP-UTILS: https://ibm-open-platform.ibm.com/repos/IOP-UTILS/rhel/7/x86_64/1.1/
For example:
baseurl=http://web_server/repos/Ambari/rhel/6/x86_64/2.2.x/Updates/2.2.0_Spark-1.x.x/...
Remember, the Linux version number and the platform might be different.
c. Perform one of the following two actions:
- Disable gpgcheck in the ambari.repo file. To disable signature validation, change gpgcheck=1 to gpgcheck=0.
- Keep gpgcheck enabled and change the public key file location to the mirror Ambari repository. Replace gpgkey=http://ibm-open-platform.ibm.com/repos/Ambari/rhel/6/x86_64/2.2.x/Updates/2.2.0_Spark-1.x.x/.../BI-GPG-KEY.public with the mirror Ambari repository location. For example: gpgkey=http://web_server/repos/Ambari/rhel/6/x86_64/2.2.x/Updates/2.2.0_Spark-1.x.x/.../BI-GPG-KEY.public. Remember, the Linux version number and the platform might be different.

Note:
- The IBM hosted repository uses HTTPS. If your local mirror is not configured for HTTPS, use http:// instead of https://.
- If you are installing on an operating system other than RHEL6, the paths will be slightly different. Modify as appropriate.
4. Clean the yum cache on each node so that the right packages from the remote repository are seen by your local yum:
yum clean all

The output might look like the following text:

Loaded plugins: product-id, refresh-packagekit, rhnplugin, security,
              : subscription-manager
Cleaning repos: BI_AMBARI-2.2.xxxx rhel-x86_64-server-6
Cleaning up Everything
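As an illustration of the edit in step 3, a mirrored ambari.repo might look like the following sketch. The repository section name, the web_server host name, and gpgcheck=0 are assumptions for a mirror without signature checking; keep the values that match your own mirror layout.

[BI_AMBARI-2.2.x]
name=Ambari mirror
baseurl=http://web_server/repos/Ambari/rhel/6/x86_64/2.2.x/Updates/2.2.0_Spark-1.x.x
enabled=1
gpgcheck=0

After cleaning the cache, you can confirm that yum resolves the mirror with:
yum repolist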

5. Install the Ambari server on the intended management node, using the following command:
yum install ambari-server

Accept the install defaults. 6. Update the following file with the mirror repository URLs.

For Mirror Repository: /var/lib/ambari-server/resources/stacks/BigInsights/4.x/repos/repoinfo.xml

In the file, change the information from the Original content to the Modified content. Modify according to your level of Linux and platform:

The repoinfo.xml file contains a <baseurl>, <repoid>, and <reponame> entry for each repository. Keep the repoid and reponame values (IOP-4.x / IOP and IOP-UTILS-1.0 / IOP-UTILS) and change only the baseurl values from the IBM hosted repository (https://ibm-open-platform.ibm.com/repos/...) to your local mirror, for example:

IOP:        http://<web_server>/repos/IOP/rhel/6/x86_64/4.x
IOP-UTILS:  http://<web_server>/repos/IOP-UTILS/rhel/6/x86_64/1.0
            (original: http://ibm-open-platform.ibm.com/repos/IOP-UTILS/rhel/6/x86_64/1.0)

Tip:

If you use a local repository URL and you must modify the repo URL, there are three ways to change it:
a. Change the repoinfo.xml file manually, then restart the Ambari server.
b. During the cluster installation, change the repos on the Ambari web interface at the Select Stack step.
c. After you complete an installation, use the Ambari web tool:
1) From the Ambari web dashboard, in the menu bar, click admin > Manage Ambari.
2) From the Clusters panel, click Versions > <stack version>.
3) Change the URL as needed, and click Save.
There is no need to restart the Ambari server for the second or third option.

Edit the /etc/ambari-server/conf/ambari.properties file. Change the information from the Original content to the Modified content.

Original content:
openjdk1.8.url=http://ibm-open-platform.ibm.com/repos/IOP-UTILS/rhel/6/x86_64/1.1/openjdk/jdk-1.8.0.tar.gz
Modified content:
openjdk1.8.url=http://<web_server>/repos/IOP-UTILS/rhel/6/x86_64/1.1/openjdk/jdk-1.8.0.tar.gz
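If you prefer to make the ambari.properties change from the command line, the following one-line sketch performs the same edit. The mirror host name web_server and the RHEL 6 path are assumptions carried over from the table above; adjust both for your mirror. The change is picked up the next time the Ambari server starts.

sed -i 's|^openjdk1.8.url=.*|openjdk1.8.url=http://web_server/repos/IOP-UTILS/rhel/6/x86_64/1.1/openjdk/jdk-1.8.0.tar.gz|' /etc/ambari-server/conf/ambari.properties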

7. Set up the Ambari server:
a. Run the following command and accept the default settings:
ambari-server setup
The IBM Open Platform with Apache Hadoop installation includes OpenJDK 1.8, which is the default. The Ambari server setup also allows you to reuse an existing JDK. The command is:
ambari-server setup -j /full/path/to/JDK

The JDK path set by the -j parameter must be identical and valid on every node in the cluster. For a list of the currently supported JDK versions, see “Upgrading the Java (JDK) version” on page 36.

Tip: You might need to reboot your nodes if you see a message that SELinux is still enabled.
8. Start the Ambari server, using the following command:
ambari-server start
9. If the Ambari server had been installed on your node previously, the node might contain old cluster information. Reset the Ambari server to clean up its cluster information in the database, using the following commands:
ambari-server stop

ambari-server reset

ambari-server start
10. Access the Ambari web user interface from a web browser by using the server name (the fully qualified domain name) on which you installed the software, and port 8080. For example, enter the following string in your browser:
http://node1.abc.com:8080

Note: In some networks, port 8080 is already in use. To use another port, do the following:
a. Edit the ambari.properties file:
vi /etc/ambari-server/conf/ambari.properties
b. Add a line in the file to select another port:
client.api.port=8081
c. Save the file and restart the Ambari server:
ambari-server restart
11. Log in to the Ambari server (http://<ambari-server>:8080) with the default username and password: admin/admin. Port 8080 is the default. If you changed the default port in the previous step, use the modified port number.
12. On the Welcome page, click Launch Install Wizard.
13. On the Get Started page, enter a name for the cluster that you want to create. The name cannot contain blank spaces or special characters. Click Next.
14. On the Select Stack page, click the stack version that you want to install:

Option: IBM Open Platform with Apache Hadoop
Description: The administrator selects BigInsights 4.2 as the stack.

Click Next.
15. On the Install Options page, complete the following two steps:
a. In Target Hosts, list all of the nodes that will be used in the IBM Open Platform with Apache Hadoop cluster. Specify one node per line, as in the following example:
node1.abc.com
node2.abc.com
node3.abc.com
node4.abc.com
The host name must be the fully qualified domain name (FQDN).
b. In Host Registration Information, select one of the two options:
- Provide your SSH Private Key to automatically register hosts

Click SSH Private Key. The private key file is /root/.ssh/id_rsa if the root user installed the Ambari server. If you installed as a non-root user, then the default private key is in the .ssh directory in the non-root user's home directory. You have the option of browsing to the .ssh/id_rsa file and letting the Ambari web interface upload the contents of the key file, or you can open the file and copy and paste the contents into the SSH key field. For more information about the key file, see "Preparing your environment" on page 16. Click the Register and Confirm button.
- Perform manual registration on hosts and do not use SSH
You can choose this option when the ambari-agents are manually installed on all nodes, and they are running. In this case, no password-less SSH setup is required. For more information, see https://ambari.apache.org/1.2.0/installing-hadoop-using-ambari/content/ambari-chap6-1.html.
16. On the Confirm Hosts page, check that the correct hosts for your cluster have been located. If hosts were selected in error, remove the hosts one by one by following these steps:
a. Click the box next to the server to be removed.
b. Click Remove in the Action column.
If warnings are found during the check process, you can click Click here to see the warnings to see what caused the warnings. The Host Checks page identifies any issues with the hosts. For example, a host might have Transparent Huge Pages or firewall issues. For information on how to address these issues, see "Preparing your environment" on page 16. After you resolve the issues, click Rerun Checks on the Host Checks page. When you have confirmed the hosts, click Next.
17. On the Choose Services page, select the services that you want to install. You must select to install HDFS. Ambari shows a confirmation message to install the required service dependencies. For example, when selecting Oozie only, the Ambari web interface shows messages for accepting YARN/MR2, HDFS, and Zookeeper installations. Click Next.
18. On the Assign Masters page, assign the master nodes to hosts in your cluster for the services that you selected. Refer to the right column for the default service assignments by host. You can accept the current default assignments. To assign a new host to run services, click the dropdown list next to the master node in the left column and select a new host. Click Next. To see a suggested layout of services, see Suggested services layout for IBM Open Platform with Apache Hadoop.
19. On the Assign Slaves and Clients page, assign the slave and client components to hosts in your cluster. You can accept the default assignments.

Tip: If you anticipate adding the Big SQL service, you must include all clients on all the anticipated Big SQL worker nodes. Big SQL specifically needs the HDFS, Hive, HBase, Sqoop, HCat, and Oozie clients.

Click all or none to decide the host assignments. Or, you can select one or more components next to a selected host (that is, DataNode, NodeManager, RegionServer, Flume, Client). Click Next.
20. On the Customize Services page, select configuration settings for the selected services. Default values are filled in automatically when available, and they are the recommended values. The installation wizard prompts you for required fields (such as password entries) by displaying a number in a circle next to an installed service. Click the number and enter the requested information in the field outlined in red. Make sure that the service port that is set is not already used by another component.

Important: For example, the Knox gateway port is, by default, set to 8443. But when the Ambari server is set up with HTTPS, and the SSL port is set up using 8443, then you must change the Knox gateway port to some other value.

Note: If you are working in an LDAP environment where users are set up centrally by the LDAP administrator and therefore already exist, selecting the defaults can cause the installation to fail. Open the Misc tab, and check the box to ignore user modification errors.
21. When you have completed the configuration of the services, click Next.
22. On the Review page, verify that your settings are correct. Click Deploy.
23. The Install, Start, and Test page shows the progress of the installation. The progress bar at the top of the page gives the overall status, while the main section of the page gives the status for each host. Logs for a specific task can be displayed by clicking the task. Click the link in the Message column to find out what tasks have been completed for a specific host or to see the warnings that have been encountered. When the Ambari web interface displays messages about the install status and displays the Next button, click it to proceed to the next page.
24. On the Summary page, review the accomplished tasks. Click Complete to go to the IBM Open Platform with Apache Hadoop dashboard.

What to do next

MySQL/MariaDB
If you plan to use MySQL/MariaDB for the Hive Metastore, ensure that the mysqld service starts on reboot. Run the following commands on the node that will host the Hive Metastore and the MySQL/MariaDB:

Note: Install MySQL in the RHEL 6.x operating system. Install MariaDB in the RHEL 7.x operating system.
RHEL 6.x
chkconfig mysqld on
service mysqld start
RHEL 7.x
systemctl start mariadb.service
systemctl enable mariadb.service

postgresql
Ensure that the postgresql service, which is used by Ambari, starts automatically on reboot. Run the following commands on the node that will host the Ambari database and postgres:

RHEL 6.x
chkconfig postgresql on
service postgresql start
RHEL 7.x
systemctl start postgresql.service
systemctl enable postgresql.service

HDFS caching
HDFS caching is supported in the IBM Open Platform with Apache Hadoop. To make sure that it can be started successfully, you must change two configuration settings:
1. From the Linux command line where the Ambari server is running, edit the /etc/security/limits.conf file:
hdfs - memlock 200000

This value must be set equal to or greater than the value that you set for the dfs.datanode.max.locked.memory property. Also, consider the memory available on the server when you set the memlock value. For more information about the values, see HDFS caching.
2. Open the Ambari web interface, and select the HDFS service.
3. Stop the HDFS service.
4. Click the Configs tab.
5. Expand the Advanced hdfs-site section and locate the following property to add the value:
dfs.datanode.max.locked.memory = 200000

Restriction: To make sure that data is cached, the lowest value for this property must be larger than the virtual memory page size (the default page size is 4096 bytes; check it with getconf PAGE_SIZE).
6. Restart the HDFS service. (A verification sketch for these settings follows at the end of this topic.)

Ambari dashboard user name and password
You can change the default username and password and configure users and groups after the first login to the Ambari web interface.

Troubleshooting
Be sure to check the available troubleshooting information if you have problems.

Non-root user

Tip:

If you install IBM Open Platform with Apache Hadoop as the non-root user, which is not typical, preface the instructions with sudo where the instruction would normally require the root user.
Related information:
Information on installing with Ambari Blueprints
Information on using Ambari Blueprints
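The following commands are a minimal verification sketch for the HDFS caching settings described above. They assume the hdfs service user and the values from the example steps; run them on each DataNode and adjust the names to your environment.

# virtual memory page size; dfs.datanode.max.locked.memory must exceed this value
getconf PAGE_SIZE
# effective locked-memory limit for the hdfs user (reported in KB), set in /etc/security/limits.conf
su -s /bin/bash -c 'ulimit -l' hdfs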

Upgrading the Java (JDK) version

When you installed IBM Open Platform with Apache Hadoop, you selected the Java Development Kit (JDK) to use, or you provided a path to a custom JDK that was already installed on your hosts. You can change the JDK version after you complete the initial installation of IBM Open Platform with Apache Hadoop.

About this task

The JDK version that you use depends on which IBM Open Platform with Apache Hadoop stack you plan to install in your cluster. Use the following table as a guide for which JDK works with which version of IBM Open Platform with Apache Hadoop.
Table 10. Acceptable JDK versions
IBM Open Platform with Apache Hadoop release    JDK version
4.2                                             Open JDK v1.8
4.1.0.2                                         Open JDK v1.8, v1.7
4.1.0.1                                         Open JDK v1.8, v1.7
4.1.0.0                                         Open JDK v1.8, v1.7
4.0.0.1                                         Open JDK v1.7
4.0.0.0                                         Open JDK v1.7

Important: If you upgrade from a previous version of IBM Open Platform with Apache Hadoop to a later version, such as from IBM Open Platform with Apache Hadoop 4.1 to IBM Open Platform with Apache Hadoop 4.2, do not change the JDK version until your IOP upgrade is successfully completed and the cluster is running.

Procedure
1. Re-run the Ambari server setup:
ambari-server setup
2. When you are prompted to change the JDK, enter y:
Do you want to change Open JDK [y/n] (n)? y
3. At the prompt to select a JDK, enter 1 to change the JDK to v1.8:
Do you want to change the current JDK [y/n] (n)?
[1] OpenJDK 1.8.0
[2] Custom JDK
==============================================================================
Enter choice (1):
a. If you select OpenJDK 1.8, the JDK is automatically downloaded and installed on the Ambari server host. You must have an internet connection to download the JDK. Install this JDK on all the hosts in the cluster at this same path.
b. If you select Custom JDK, verify or add the custom JDK path on all the hosts in the cluster. Select this option if you want to use a different JDK or you do not have an internet connection (and have pre-installed the JDK on all hosts).
4. When the Ambari setup completes, restart all services from the Ambari dashboard so that the new JDK is used by the Hadoop services.
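Because the JDK path must be identical and valid on every node, a quick check such as the following sketch can save a failed restart. The host names and the JDK path shown are assumptions; substitute the hosts in your cluster and the path that you gave to ambari-server setup -j (or the path where OpenJDK 1.8 was downloaded).

for h in node1.abc.com node2.abc.com node3.abc.com node4.abc.com; do
  echo "== $h"
  ssh root@$h "/usr/lib/jvm/java-1.8.0-openjdk/bin/java -version"
done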

Installing and configuring Ranger in the Ambari web interface

After you install IBM Open Platform with Apache Hadoop, you can add the Ranger open source service.

Before you begin
1. Make sure that you have installed at least one of the Hadoop services, such as Hadoop, HBase, or Hive.
2. Install and configure one of the databases that Ranger can use, such as MySQL, PostgreSQL, or Oracle. See Configuring MySQL for Ranger for the steps to configure MySQL for Ranger.
3. Install a JDBC connector for the database server on the node where the Ambari server is installed.

Procedure
1. Log in to your Ambari cluster with your designated user credentials.
2. On the main Ambari dashboard, in the left navigation menu, click Actions, then select Add Service.
3. On the Add Service Wizard, select Ranger, then click Next.
4. The Ranger Requirements page appears. Ensure that you have met all of the installation requirements, then select the "I have met all the requirements above" check box and click Proceed.
5. You are then prompted to select the host where Ranger Admin and Ranger UserSync will be installed. This host should have DB admin access to the Ranger DB host. Make a note of the Ranger Admin host for use in subsequent installation steps. Click Next when finished to continue with the installation and open the Customize Services page.

Note:

The Ranger Admin and Ranger Usersync services must be installed on the same cluster node.
6. Specify the Ranger settings on the Customize Services page. You must specify all of the Ranger, Ranger Admin, and Ranger Audit settings on the Customize Services page before clicking Next at the bottom of the page to continue with the installation. The page shows that some configurations have to be set (highlighted in red). The other configuration properties have default values. Refer to the tables below for the configuration details under each tab.
Table 11. Ranger admin settings
DB FLAVOR
  Description: Select the "DB Flavor" (installed database type) that you are using with Ranger.
  Default value: MYSQL

Ranger DB host
  Description: The server/host where the installed database is present. Based on the DB FLAVOR and version, this value might need to include the port as well.
  Example value/comment: iopserver.ambari.ibm.com (for MySQL)
Ranger DB name
  Description: The name of the Ranger Policy database. For Oracle, the tablespace name should be given here.
  Default value: ranger
  Example value/comment: The DB is automatically created by the Ranger installation.
Driver class name
  Description: JDBC driver class name used to connect to the Ranger database.
  Default value: com.mysql.jdbc.Driver
Ranger DB username
  Description: The username used to create the Policy database.
  Default value: rangeradmin
  Example value/comment: Automatically created by the Ranger installation.
Ranger DB Password
  Description: The password for the Ranger Policy database user.
  Default value: N/A
  Example value/comment: Admin-password
JDBC connect string
  Description: Ranger database JDBC connect string for the Ranger DB user.
  Default value: jdbc
  Example value/comment: Auto-filled based on the other values.
Setup Database and Database User
  Description: Enabling this option automatically creates the Ranger databases and users.
  Default value: Yes
Database Administrator (DBA) username
  Description: The Ranger database user that has administrative privileges to create database schemas and users.
  Default value: root
  Example value/comment: If you set up a non-root user during the prerequisite setup, update this field (for example, rangerdba).
Database Administrator (DBA) password
  Description: The root password for the Ranger database user.
  Default value: N/A
  Example value/comment: rangerdba
JDBC connect string for root user
  Description: Ranger database JDBC connect string for the root user.
  Default value: jdbc
  Example value/comment: Auto-filled based on the other values.

7. After completing the database configuration, click Test Connection. The result should be a successful connection. If not, recheck your database configuration values.
8. Update the Ranger User information.

Table 12. Ranger User settings
Enable User Sync
  Description: Enables Ranger usersync to sync the users based on the Sync Source.
  Default value: Yes
Sync Source
  Description: The users to be synced, based on the authentication method.
  Default value: UNIX

9. Make sure that the configurations in the Ranger Plugin tab remain disabled until you complete the installation. The configurations under this tab enable the Ranger plugins for the different Hadoop services. These configurations are disabled by default.
10. In the Ranger Audit tab, make sure that the Audit to Solr option is disabled and the ranger.audit.source.type property is set to db. Ranger can store the audit records to Solr for temporary use, and to HDFS and/or to a database for the long term. Currently, IBM Open Platform with Apache Hadoop version 4.2 supports storing the audit records to HDFS or a database only. In the Advanced tab of the Ranger configuration, in the Advanced ranger-admin-site section, set the value of ranger.audit.source.type to db.
Table 13. Ranger audit settings
Audit to HDFS
  Description: Allows the audit records to be stored to HDFS.
  Default value: ON
  Example value/comment: This property is overridable at the service level.
Destination HDFS Directory
  Description: The HDFS directory to write audit records to.
  Default value: hdfs://iopserver.ambari.ibm.com:8020/ranger/audit
  Example value/comment: Make sure all service users have the required permissions. This property is overridable at the service level.
Audit to DB
  Description: Enable audit to DB for all Ranger supported services.
  Default value: OFF
  Example value/comment: This property is overridable at the service level.
Ranger Audit DB name
  Description: Audit database name.
  Default value: ranger_audit
  Example value/comment: Automatically created by the Ranger installation.
Ranger Audit DB username
  Description: The username for storing the audit log information.
  Default value: rangerlogger
  Example value/comment: Automatically created by the Ranger installation.
Ranger Audit DB Password
  Description: The password for the Ranger audit user.
  Default value: N/A
  Example value/comment: Audit-password

11. Update the property values in the Advanced tab.

Table 14. Ranger Advanced tab
Ranger Admin username for Ambari
  Description: Ambari user created for creating repositories and policies in Ranger Admin for each of the Ranger supported services.
  Default value: amb_ranger_admin
Ranger Admin user's password for Ambari
  Description: This password is used only by the Ambari agent, together with the "Ranger Admin username for Ambari".
  Example value/comment: This has a default value and can be changed.
External URL
  Description: The Ranger Policy Manager host.
  Default value: http://<ranger-host>:6080
HTTP Enabled
  Description: A check box that specifies whether HTTP authentication is enabled. If HTTP is not enabled, only HTTPS is allowed.
  Default value: Selected
  Example value/comment: Selected
Authentication method
  Description: The type of authentication method used to log in to the Policy Manager. Only users created within the Policy Manager tool can log in. The available authentication methods are LDAP, Active Directory, UNIX, and NONE. If NONE is selected, Ranger uses the local user database for authentication, and only internal Ranger users can log in.
  Default value: UNIX
Ranger Group
  Description: The value used to create groups and assign permissions. This is the OS-level group that will be created and used to start the Ranger Admin and Ranger Usersync services.
  Default value: ranger
  Example value/comment: ranger

Ranger User
  Description: The value used to create users and assign permissions. This is the OS-level user that will be created and used to start the Ranger Admin and Ranger Usersync services.
  Default value: ranger
  Example value/comment: ranger

After you have finished specifying all of the settings on the Customize Services page, click Next at the bottom of the page to continue with the installation.
12. Complete the Ranger installation.
a. On the Review page, carefully review all of your settings and configurations. If everything looks good, click Deploy to install Ranger on the Ambari server.
b. When you click Deploy, Ranger is installed on the specified host on your Ambari server. A progress bar displays the installation progress.
c. When the installation is complete, a Summary page displays the installation details.

What to do next

The Ranger installation triggers a restart of the other services that have Ranger plugins. This restart loads the necessary Ranger plugin configurations for the Ranger supported services. Restart the affected services.
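After the restarts complete, a quick reachability check of the Ranger Admin web interface confirms that the service came up. This is a sketch only; the host name is an assumption, and 6080 is the default Ranger Admin port from the settings above.

# expect an HTTP 200 from the Ranger Admin login page
curl -s -o /dev/null -w "%{http_code}\n" http://ranger-host.example.com:6080/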

If you would like to set up Ranger user sync with LDAP/AD, see Automatic user sync from LDAP/AD/Unix to Ranger. To set up Ranger authentication with LDAP or AD instead of the default UNIX authentication, see Ranger Authentication setup using LDAP/AD or Unix.

After the Ranger installation is complete and the necessary services are restarted, you can enable the individual component plugins. For more information, see Ranger Hadoop Plugin Usage for HDFS, Hive, HBase and Knox and Yarn Ranger Plugin Usage.

Configuring MySQL for Ranger

Procedure
1. Download and install the MySQL server. If you have installed and configured Hive with MySQL server, you can use the same MySQL Server instance for Ranger.
2. You can use the MySQL root user, or create a non-root user, to create the Ranger databases. For example, use the following series of commands to create the rangerdba user with password rangerdba.
a. Log in as the root user, then use the following commands to create the rangerdba user and grant it adequate privileges.

CREATE USER 'rangerdba'@'localhost' IDENTIFIED BY 'rangerdba';
GRANT ALL PRIVILEGES ON *.* TO 'rangerdba'@'localhost' WITH GRANT OPTION;
CREATE USER 'rangerdba'@'%' IDENTIFIED BY 'rangerdba';
GRANT ALL PRIVILEGES ON *.* TO 'rangerdba'@'%' WITH GRANT OPTION;
FLUSH PRIVILEGES;
b. Use the exit command to exit MySQL.
c. Test that rangerdba is able to connect to the database by using the following command:
mysql -u rangerdba -prangerdba

After testing the rangerdba login, use the exit command to exit MySQL.
3. Use the following command to confirm that the mysql-connector-java.jar file is in the Java share directory. This command must be run on the server where the Ambari server is installed.
ls /usr/share/java/mysql-connector-java.jar
If the file is not in the Java share directory, use the following command to install the MySQL connector .jar file:
yum install mysql-connector-java*
4. Use the following command format to set the JDBC driver path based on the location of the MySQL JDBC driver .jar file. This command must be run on the server where the Ambari server is installed.
ambari-server setup --jdbc-db={database-type} --jdbc-driver={/jdbc/driver/path}
For example:
ambari-server setup --jdbc-db=mysql --jdbc-driver=/usr/share/java/mysql-connector-java.jar

Installing Ranger plugins

The Ranger plugins can be set up and configured from the individual service or from the Ranger service. The examples here show you how to enable plugins from the Ranger service.

Ranger Hadoop plugin usage for HDFS, Hive, HBase, Knox, and Yarn

About this task

This topic shows you how to configure and enable five of the Ranger plugins: HDFS, Hive, HBase, Knox, and Yarn. You can select the services that you would like Ranger to control.

Procedure
1. Set up and configure the HDFS plugin for Ranger:
a. From the Ambari web interface, select the Ranger service and then open the Configs tab. Select the Ranger Plugin tab.
b. In the Ranger Plugin section, enable the HDFS Ranger Plugin, and then click Save.
c. The Dependent Configurations window opens to confirm the configurations that are updated. Click OK.
d. Before you restart HDFS, open the HDFS configuration tab and verify the following changes in the Advanced ranger-hdfs-plugin-properties:
- Enable Ranger for HDFS: Check

- Policy user for HDFS: ambari-qa
- Ranger repository config user: hadoop
- common.name.for.certificate: a single space without the quotes: " "
- REPOSITORY_CONFIG_PASSWORD: the password you set for the above user, such as hadoop-password
e. Restart the HDFS service. The Ranger plugin for HDFS is now enabled. You can verify this by logging in to the Ranger web interface; the HDFS repository appears in the access manager list with a default policy.
f. Verify that the policy is synched by navigating to Ranger > Audit > Plugins and checking for a 200 response code.
2. Set up and configure the Hive plugin for Ranger:
Do not use the Hive CLI after enabling the Ranger Hive plugin. Using the Hive CLI can break the install or lead to other unpredictable behavior. Instead, use the HiveServer2 Beeline CLI after you enable the Ranger Hive plugin. Ranger communicates through HiveServer2.
a. From the Ambari web interface, select the Ranger service and then open the Configs tab. Select the Ranger Plugin tab.
b. In the Ranger Plugin section, enable the Hive Ranger Plugin, and then click Save.
c. The Dependent Configurations window opens to confirm the configurations that are updated. Click OK.
d. Before you restart Hive, open the Hive configuration tab and verify the following change in the Hive > Configs > Settings tab:
- In the Security section, in the Choose Authorization selection, Ranger is showing.
e. In the Hive > Configs > Advanced tab, validate that there is a default policy_user and repository config user in the Advanced ranger-hive-plugin-properties.
f. Restart the Hive service. The Ranger plugin for Hive is enabled.
g. To verify, log in to the Ranger web interface and notice that the Hive repository appears in the access manager list with a default policy.
h. Verify that the policy is synced up by going to Ranger > Audit > Plugins and checking for a 200 response code.
3. Set up and configure the HBase plugin for Ranger:
a. From the Ambari web interface, select the Ranger service and then open the Configs tab. Select the Ranger Plugin tab.
b. In the Ranger Plugin section, enable the HBase Ranger Plugin, and then click Save.
c. The Dependent Configurations window opens to confirm the configurations that are updated. Click OK.
d. Before you restart HBase, open the HBase configuration tab and verify the following change in the HBase > Configs > Settings > Security tab:
- In the Security section, in the Enable Authorization selection, HBase is showing.
e. In the HBase > Configs > Advanced tab, validate that there is a default policy_user and repository config user in the Advanced ranger-hbase-plugin-properties.
f. Restart the HBase service. The Ranger plugin for HBase is enabled.

g. To verify, log in to the Ranger web interface and notice that the HBase repository appears in the access manager list with a default policy.
h. Verify that the policy is synced up by going to Ranger > Audit > Plugins and checking for a 200 response code.
4. Set up and configure the Knox plugin for Ranger:
a. From the Ambari web interface, select the Ranger service and then open the Configs tab. Select the Ranger Plugin tab.
b. In the Ranger Plugin section, enable the Knox Ranger Plugin, and then click Save.
c. The Dependent Configurations window opens to confirm the configurations that are updated. Verify that the topology has been updated with the authorization of XASecurePDPKnox.
d. Click OK.
e. Before you restart Knox, open the Knox configuration tab and add ambari-qa (or the default Ranger policy user) to the ldif file. In the Knox > Configs > Settings tab, open the Advanced users-ldif text box. Scroll to the bottom of the text box and add the following lines:
# entry for sample user ambari-qa
dn: uid=ambari-qa,ou=people,dc=hadoop,dc=apache,dc=org
objectclass:top
objectclass:person
objectclass:organizationalPerson
objectclass:inetOrgPerson
cn: ambari-qa
sn: ambari-qa
uid: ambari-qa
userPassword:ambari-password
f. Save the configuration.
g. Restart the Knox service. The Ranger plugin for Knox is enabled.
h. To verify, log in to the Ranger web interface and notice that the Knox repository appears in the access manager list with a default policy.
i. Verify that the policy is synced up by going to Ranger > Audit > Plugins and checking for a 200 response code.
5. Set up and configure the Yarn plugin for Ranger:
a. Select the Yarn service in the Ambari dashboard and open the Config tab.
b. In the Advanced tab, look for the ranger-yarn-plugin-properties and select Enable ranger for yarn.
c. Click Save. You will see other suggestions on properties to change. Click OK to save.
d. If you see a warning that the Ranger enable property needs to be set to Yes, ignore it.
e. Restart the Yarn service.
f. Open the Ranger web interface and notice that the Yarn agent appears in the list of agents with a 200 response code.
g. Configure Yarn to use only Ranger ACLs and ignore YARN ACLs by adding a Custom ranger-yarn-security property:
ranger.add-yarn-authorization = false
h. Save and restart the Yarn service.
i. Create a test user, such as test_user, and then log in: su test_user
j. Run a test Spark job:

cd /usr/iop/current/spark-client
bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
k. The job should fail with a message like: Denied: test_user cannot submit spark application.
l. Try to log in as hdfs (su hdfs) or as an accepted user. Then, you can submit the Yarn job.

Set up user sync from LDAP/AD/Unix to Ranger

Ranger User Sync pulls in users from UNIX, LDAP, AD, or a file, and populates Ranger's local user tables with these users.

About this task

You can use these users later when you create Ranger policies to grant access to the different services, such as HDFS or Hive. After you set up usersync, the users are periodically pulled from the specified source into Ranger automatically.

Note: Ranger User Sync can also be configured with LDAP/AD after the Ranger installation.

The following information describes how to configure the Ranger User Sync for either UNIX or LDAP/AD.

Procedure
1. Configure Ranger user sync for UNIX:
a. On the Ranger Customize Services page, select the Ranger User Info tab.
b. Click Yes under Enable User Sync.
c. Use the Sync Source drop-down to select UNIX, then set the following properties:
Table 15. UNIX user sync properties
Minimum user ID
  Description: Only sync users above this user ID.
  Default value: 500
Password file
  Description: The location of the password file on the Linux server.
  Default value: /etc/passwd
Group file
  Description: The location of the groups file on the Linux server.
  Default value: /etc/group
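To preview which accounts UNIX usersync would pull in with these defaults, you can list the entries in /etc/passwd whose UID meets the Minimum user ID. This is a sketch only; 500 is the default from Table 15, so substitute your own threshold if you changed it.

# users at or above the default Minimum user ID of 500
awk -F: '$3 >= 500 {print $1, $3}' /etc/passwd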

2. Configure Ranger user sync for LDAP/AD:

Note:

To ensure that LDAP/AD group level authorization is enforced in Hadoop, set up Hadoop group mapping for LDAP/AD.

The configurations shown are examples only. Modify them to fit your own LDAP configuration. More options are available, but they generally do not need to change.
a. On the Customize Services page, select the Ranger User Info tab.
b. Click Yes under Enable User Sync.
c. Use the Sync Source drop-down to select LDAP/AD.
d. Set the following properties on the Common Configs tab.
Table 16. LDAP/AD common configurations
LDAP/AD URL
  Description: Add the URL, depending upon the LDAP/AD sync source.
  Default value: ldap://{host}:{port}
  Sample values: ldap://ldap.example.com:389 or ldaps://ldap.example.com:636
Bind Anonymous
  Description: If Yes is selected, the Bind User and Bind User Password are not required.
  Default value: NO
Bind User
  Description: The full distinguished name (DN), including common name (CN), of an LDAP/AD user account. The LDAP/AD bind user is used to connect to LDAP and query for users and groups, so this user should have privileges to search for users. This user could be a read-only LDAP user.
  Sample values: cn=admin,dc=example,dc=com or admin@example.com
Bind User Password
  Description: The password of the Bind User.

e. Set the following properties on the User Configs tab.
Table 17. LDAP/AD user configs
Username Attribute
  Description: The LDAP user name attribute.
  Sample values: sAMAccountName for AD; uid or cn for OpenLDAP
User Object Class
  Description: Object class to identify user entries.
  Default value: person
  Sample values: top, person, organizationalPerson, user, or posixAccount
User Search Base
  Description: Search base for users.
  Sample values: cn=users,dc=example,dc=com

User Search Filter
  Description: Optional additional filter constraining the users selected for syncing.
  Sample values: A filter to retrieve all the users: cn=*
  A filter to retrieve all the users who are members of groupA or groupB:
  (|(memberof=CN=GroupA,OU=groups,DC=example,DC=com)(memberof=CN=GroupB,OU=groups,DC=example,DC=com))
User Search Scope
  Description: This value is used to limit the user search to the depth from the search base.
  Default value: sub
  Sample values: base, one, or sub
User Group Name Attribute
  Description: Attribute from the user entry whose values are treated as group values to be pushed into the Policy Manager database. You can provide multiple attribute names separated by commas.
  Default value: memberof,ismemberof
  Sample values: memberof, ismemberof, or gidNumber
Group User Map Sync
  Description: Sync specific groups for users.
  Default value: No
  Sample values: Yes

f. Set up the following properties on the Group Configs tab.
Table 18. LDAP/AD group configs
Enable Group Sync
  Description: If Enable Group Sync is set to No, the group names the users belong to are derived from the "User Group Name Attribute". In this case no additional group filters are applied. If Enable Group Sync is set to Yes, the groups the users belong to are retrieved from LDAP/AD using the following group-related attributes.
  Default value: No
  Sample values: Yes
Group Member Attribute
  Description: The LDAP group member attribute name.
  Sample values: member
Group Name Attribute
  Description: The LDAP group name attribute.
  Sample values: distinguishedName for AD, cn for OpenLDAP
Group Object Class
  Description: The LDAP group object class.
  Sample values: group, groupOfNames, or posixGroup
Group Search Base
  Description: Search base for groups.
  Sample values: ou=groups,DC=example,DC=com
Group Search Filter
  Description: Optional additional filter constraining the groups selected for syncing.
  Sample values: A filter to retrieve all groups: cn=*
  A filter to retrieve only the groups whose cn is Engineering or Sales:
  (|(cn=Engineering)(cn=Sales))

If you perform the Ranger User Sync setup after the Ranger installation, save the above settings and then restart Ranger. Verify the user sync by clicking Settings > Users/Groups > Users in the Ranger user interface.

Installing Ranger authentication

The authentication method determines who is allowed to log in to the Ranger web interface, such as local UNIX, AD, or LDAP.

About this task

To log in to Ranger as admin/admin, leave the default value (such as UNIX) for authentication. You can skip the extra configurations that are listed here. However, if you use other users, then follow the steps described here to set up Ranger authentication using UNIX, LDAP, or AD.

Note: The default authentication is set up as UNIX authentication. The method of authentication can be configured during or after the Ranger installation. It can also be changed later. In the configurations listed, some of the properties contain values such as {{xyz}}; these are macro variables that are derived from other specified values in order to streamline the configuration process. Macro variables can be edited if required. If you need to restore the original value, click the Set Recommended symbol at the right of the property box.

Procedure
1. Configure Ranger UNIX authentication.
a. Select the Advanced tab on the Ranger Customize Services page.
b. Specify the following settings under Ranger Settings:

Table 19. Ranger UNIX authentication
External URL
  Description: Ranger Policy Manager address.
  Sample values: http://<ranger-host>:6080
Authentication method
  Description: The authentication method to be used for Ranger.
  Default value: UNIX
  Sample values: UNIX
HTTP enabled
  Description: Allows you to select HTTP or HTTPS communication for the Ranger admin console. If you disable HTTP, only HTTPS is allowed.
  Default value: HTTP is enabled by default.
  Sample values: Keep the default.

c. After you select the UNIX authentication, set the following properties in the UNIX Authentication Settings:
Table 20. UNIX authentication settings
Allow remote login
  Description: Flag to enable or disable remote login. Only applies to UNIX authentication.
  Default value: true
  Sample values: true
ranger.unixauth.service.hostname
  Description: The address of the host where the UNIX authentication service is running.
  Default value: {{ugsync_host}}
  Sample values: {{ugsync_host}}
ranger.unixauth.service.port
  Description: The port number on which the UNIX authentication service is running.
  Default value: 5151
  Sample values: 5151

2. Configure Ranger LDAP authentication.
a. Select the Advanced tab on the Ranger Customize Services page.
b. Under Ranger Settings, specify the Ranger Policy Manager host address in the External URL box in the format http://<ranger-host>:6080.
c. Under Ranger Settings, for the authentication method, select LDAP.
d. Under LDAP Settings, set the following properties:
Table 21. LDAP authentication settings
ranger.ldap.base.dn
  Description: The Distinguished Name (DN) of the starting point for directory server searches.
  Default value: dc=example,dc=com
  Sample values: dc=example,dc=com

Bind User
  Description: The full Distinguished Name (DN), including Common Name (CN), of an LDAP user account that has privileges to search for users. This is a macro variable value that is derived from the Bind User value under Ranger User Info > Common Configs.
  Default value: {{ranger_ug_ldap_bind_dn}}
  Sample values: {{ranger_ug_ldap_bind_dn}}
Bind User Password
  Description: Password for the Bind User. This is a macro variable value that is derived from the Bind User Password value under Ranger User Info > Common Configs.
ranger.ldap.group.roleattribute
  Description: The LDAP group role attribute.
  Default value: cn
  Sample values: cn
ranger.ldap.referral
  Description: See the description below.
  Default value: ignore
  Sample values: follow | ignore | throw
LDAP URL
  Description: The LDAP server URL. This is a macro variable value that is derived from the LDAP/AD URL value under Ranger User Info > Common Configs.
  Default value: {{ranger_ug_ldap_url}}
  Sample values: {{ranger_ug_ldap_url}}
ranger.ldap.user.dnpattern
  Description: The user DN pattern is expanded when a user is being logged in. For example, if the user "ldapadmin" attempted to log in, the LDAP server would attempt to bind against the DN "uid=ldapadmin,ou=users,dc=example,dc=com" using the password the user provided.
  Default value: uid={0},ou=users,dc=xasecure,dc=net
  Sample values: cn=ldapadmin,ou=Users,dc=example,dc=com

User Search Filter
  Description: The search filter used for Bind Authentication. This is a macro variable value that is derived from the User Search Filter value under Ranger User Info > User Configs.
  Default value: {{ranger_ug_ldap_user_searchfilter}}
  Sample values: {{ranger_ug_ldap_user_searchfilter}}

There are three possible values for ranger.ldap.referral: follow, throw, and ignore. When searching a directory, the server might return several search results, along with a few continuation references that show where to obtain further results. These results and references might be interleaved at the protocol level.
Recommended settings for ranger.ldap.referral:
follow
  When this property is set to follow, the LDAP service provider processes all of the normal entries first, and then follows the continuation references.
throw
  When this property is set to throw, all of the normal entries are returned in the enumeration first, before the ReferralException is thrown. By contrast, a "referral" error response is processed immediately when this property is set to follow or throw.
ignore
  When this property is set to ignore, the server returns referral entries as ordinary entries (or plain text). This might return partial results for the search.
3. Configure Ranger Active Directory authentication.
a. Select the Advanced tab on the Ranger Customize Services page.
b. Under Ranger Settings, specify the Ranger Policy Manager host address in the External URL box in the format http://<ranger-host>:6080.
c. Under Ranger Settings, for Authentication method, select ACTIVE_DIRECTORY.
d. Under AD Settings, set the following properties:
Table 22. AD settings
ranger.ldap.ad.base.dn
  Description: The Distinguished Name (DN) of the starting point for directory server searches.
  Default value: dc=example,dc=com
  Sample values: dc=example,dc=com

ranger.ldap.ad.bind.dn
  Description: The full Distinguished Name (DN), including Common Name (CN), of an LDAP user account that has privileges to search for users. This is a macro variable value that is derived from the Bind User value under Ranger User Info > Common Configs.
  Default value: {{ranger_ug_ldap_bind_dn}}
  Sample values: {{ranger_ug_ldap_bind_dn}}
ranger.ldap.ad.bind.password
  Description: Password for the bind.dn. This is a macro variable value that is derived from the Bind User Password value under Ranger User Info > Common Configs.
Domain Name (Only for AD)
  Description: The domain name of the AD authentication service.
  Sample values: dc=example,dc=com
ranger.ldap.ad.referral
  Description: See the description below.
  Default value: ignore
  Sample values: follow | ignore | throw
ranger.ldap.ad.url
  Description: The AD server URL. This is a macro variable value that is derived from the LDAP/AD URL value under Ranger User Info > Common Configs.
  Default value: {{ranger_ug_ldap_url}}
  Sample values: {{ranger_ug_ldap_url}}
ranger.ldap.ad.user.searchfilter
  Description: The search filter used for Bind Authentication. This is a macro variable value that is derived from the User Search Filter value under Ranger User Info > User Configs.
  Default value: {{ranger_ug_ldap_user_searchfilter}}
  Sample values: {{ranger_ug_ldap_user_searchfilter}}

There are three possible values for ranger.ldap.ad.referral: follow, throw, and ignore.
Recommended settings for ranger.ldap.ad.referral:
follow
  When this property is set to follow, the LDAP service provider processes all of the normal entries first, and then follows the continuation references.
throw
  When this property is set to throw, all of the normal entries are returned in the enumeration first, before the ReferralException is thrown. By contrast, a "referral" error response is processed immediately when this property is set to follow or throw.
ignore
  When this property is set to ignore, the server returns referral entries as ordinary entries (or plain text). This might return partial results for the search.

Note: When you save the authentication method as Active Directory, during the Ranger install, a Dependent Configurations pop-up might appear that recommends that you set the authentication method as LDAP. This recommended configuration should not be applied for AD, so clear (or un-check) the ranger.authentication.method check box, then click OK.
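Before you commit LDAP or AD values in the Ranger configuration, it can help to confirm the Bind User, search base, and search filter directly against the directory. The following ldapsearch sketch uses the sample values from the tables above (the host, Bind DN, base DN, and uid filter are all assumptions); substitute your own values and enter the Bind User password when prompted.

ldapsearch -x -H ldap://ldap.example.com:389 \
  -D "cn=admin,dc=example,dc=com" -W \
  -b "dc=example,dc=com" "(uid=ldapadmin)"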

Ranger KMS set up and usage

Ranger Key Management Server (KMS) is based on the Hadoop KMS developed by the Apache community. It extends the native Hadoop KMS functions by letting you store keys in a secure database.

There are three main functions within the Ranger KMS:
Key management
  Ranger admin provides the ability to create, update, or delete keys by using the Ambari dashboard or REST APIs.
Access control policies
  Ranger admin provides the ability to manage access control policies within Ranger KMS. The access policies control permissions to generate or manage keys, adding another layer of security for data encrypted in Hadoop.
Audit
  Ranger provides a full audit trace of all actions that are done by the Ranger KMS.

Configuring

Before you can use Ranger KMS, you must first install Ranger on a Kerberized cluster.

After you complete the initial configuration, specify the REPOSITORY_CONFIG_USERNAME in the Ranger > Configs > Advanced tab, in the Advanced kms-properties section. Ranger uses this user to connect to the Ranger KMS server and to look up keys for creating access policies. The default value is keyadmin.

The user specified in REPOSITORY_CONFIG_USERNAME is set up to proxy into KMS in Kerberos mode.

The properties in the Custom kms-site section allow specified system users, such as Hive or Oozie, to proxy on behalf of others when they communicate with Ranger KMS. Add the following properties:
- hadoop.kms.proxyuser.hive.users
- hadoop.kms.proxyuser.oozie.users
- hadoop.kms.proxyuser.HTTP.users
- hadoop.kms.proxyuser.ambari.users
- hadoop.kms.proxyuser.yarn.users

- hadoop.kms.proxyuser.hive.hosts
- hadoop.kms.proxyuser.oozie.hosts
- hadoop.kms.proxyuser.HTTP.hosts
- hadoop.kms.proxyuser.ambari.hosts
- hadoop.kms.proxyuser.yarn.hosts

Add the following properties to the kms-site section, replacing keyadmin with the correct value for your environment:
- hadoop.kms.proxyuser.keyadmin.groups=*
- hadoop.kms.proxyuser.keyadmin.hosts=*
- hadoop.kms.proxyuser.keyadmin.users=*

Confirm the settings of the Advanced kms-site group:
- hadoop.kms.authentication.type=kerberos
- hadoop.kms.authentication.kerberos.keytab=/etc/security/keytabs/spnego.service.keytab
- hadoop.kms.authentication.kerberos.principal=*

Then configure HDFS encryption to use Ranger KMS access.
To use Ranger KMS for HDFS data-at-rest encryption:
1. Create a link to /etc/hadoop/conf/core-site.xml under /etc/ranger/kms/conf:
sudo ln -s /etc/hadoop/conf/core-site.xml /etc/ranger/kms/conf/core-site.xml
2. In the HDFS Advanced core-site, specify the value for hadoop.security.key.provider.path.
3. In the Advanced hdfs-site, specify a value for dfs.encryption.key.provider.uri.
4. Set the Ranger KMS host name in the following format:
kms://http@<kms_host>:9292/kms
5. Restart the Ranger KMS service and the HDFS service.
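Once HDFS points at Ranger KMS, the standard Hadoop key and encryption-zone commands can be used for a quick end-to-end check. This is a sketch only: the KMS host, the key name testkey, and the /secure_zone path are assumptions, and the commands should be run as users that your KMS access policies allow.

# create and list a key through the Ranger KMS provider
hadoop key create testkey -provider kms://http@kms-host.example.com:9292/kms
hadoop key list -provider kms://http@kms-host.example.com:9292/kms
# create an encryption zone that uses the new key
hdfs dfs -mkdir /secure_zone
hdfs crypto -createZone -keyName testkey -path /secure_zone
hdfs crypto -listZones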

Ambari automatically creates a repository in Ranger for the Ranger KMS service to manage Ranger KMS access policies. This repository configuration user also needs to be a principal in Kerberos with a password.

Using

Open your browser to http://<ranger-host>:6080. Log in as the Ranger KMS user (remember, the default is keyadmin:keyadmin).

The browser window that opens is meant to separate encryption work (keys and policies) from Hadoop cluster management and access policy management work.
List and create keys
1. Click the Encryption tab.
2. In the Select service drop-down menu, select a service.

Cleaning up nodes before reinstalling software

Before you reinstall IBM Open Platform with Apache Hadoop to manage your cluster, prepare the environment, including uninstalling and removing all installed components and artifacts.

Before you begin

Run all of the following clean-up scripts or commands from the /usr/lib/python<versionNumber>/site-packages/ambari_agent directory. The example commands assume a Red Hat Linux environment.

Make sure that you installed the Ambari agent on the same host that runs the Ambari server. The clean-up scripts are part of the Ambari agent. You will need to run the scripts on each Ambari agent node in the cluster. About this task

During the installation process, you might have already installed one or more of the following components:
- Ambari server
- IBM Open Platform with Apache Hadoop components, including services like YARN, HDFS, Hive, and Zookeeper
- Ambari agents

You should clean your system before proceeding with reinstalling IBM Open Platform with Apache Hadoop. Do these steps on each node in your cluster.

Procedure
1. Uninstall the Ambari and the IBM Open Platform with Apache Hadoop components by running HostCleanup.py against the HostCleanup.ini and HostCleanup_Custom_Actions.ini files that are part of the IBM Open Platform.
a. Stop all BigInsights services and IBM Open Platform with Apache Hadoop services, by either using the Ambari administration interface or commands that invoke Ambari REST APIs. In the Ambari web interface, click Actions > Stop all, then wait for all of the services to stop. Or, use the following steps to stop the services in MyCluster on the master node server_name, port 8080:
#!/bin/bash
#Replace the following variables with values
#specific to your cluster

ambari_server=localhost
ambari_port=8080
ambari_user=admin
ambari_password=admin
ambari_cluster=MyCluster

#MAKE SURE CODE APPEARS ON ONE LINE

services=$(curl --silent -u ${ambari_user}:${ambari_password} -X GET http://${ambari_server}:${ambari_port}/api/v1/clusters/${ambari_cluster}/services | grep service_name | sed -e 's,.*:.*"\(.*\)".*,\1,g')

for serv in $services
do
curl -u ${ambari_user}:${ambari_password} -H 'X-Requested-By: ambari' -X PUT -d '{"RequestInfo": {"context" :"Stop service"}, "Body": {"ServiceInfo": {"state": "INSTALLED"}}}' http://${ambari_server}:${ambari_port}/api/v1/clusters/${ambari_cluster}/services/$serv

b. From a terminal window on the Ambari server node, stop the Ambari server:
   ambari-server stop
c. On each Ambari agent node, stop the agent:
   ambari-agent stop
d. Run the HostCleanup.py script on all nodes of your cluster with the following command:
   sudo python /usr/lib/python-versionNumber/site-packages/ambari_agent/HostCleanup.py \
   --silent --skip=users \
   -f /etc/ambari-agent/conf/HostCleanup.ini,/etc/ambari-agent/conf/HostCleanup_Custom_Actions.ini \
   -o /cleanup.log

Note:
v Be sure to determine the exact Python version to replace the versionNumber.
v Remove --silent if you want to run the command with command-line prompts.
v Remove --skip=users if you want to delete the service users. If all of the users are managed with an LDAP server, run the command with the --skip=users parameter.

The script is installed as part of the ambari_agent python module in /usr/lib/python-version/site-packages/ambari_agent. It is available only on each Ambari agent node. To use the script on the Ambari server, you must first install the Ambari agent on the server node. To have HostCleanup.py do a more thorough job of removing binaries, users, and other artifacts from a node, IBM Open Platform with Apache Hadoop provides a cleanup configuration file. The -f parameter points to this cleanup configuration input file of users, processes, or packages to clean up. The cleanup configuration file is packaged and installed as /etc/ambari-agent/conf/HostCleanup.ini. The -o parameter specifies an output log file.

HostCleanup.py performs the following functions:
v Stops the Ambari server and the Ambari agent, if they are still running.
v Stops the Linux processes started by a list of service users. The users are defined in the HostCleanup.ini file. You can also specify a list of Linux processes to be stopped.
v Removes the RPM packages that are listed in the HostCleanup.ini file.
v Removes the service users that are listed in the HostCleanup.ini file.
v Deletes directories, symbolic links, and files that are listed in the HostCleanup.ini file.
v Deletes repositories that are defined in the HostCleanup.ini file.
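For example, on a node where the Ambari agent was installed under Python 2.6 (a common but hypothetical layout on RHEL 6), the command might look like the following; adjust the python directory to match your node:

   sudo python /usr/lib/python2.6/site-packages/ambari_agent/HostCleanup.py \
   --silent --skip=users \
   -f /etc/ambari-agent/conf/HostCleanup.ini,/etc/ambari-agent/conf/HostCleanup_Custom_Actions.ini \
   -o /cleanup.log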

e. Remove the Ambari RPMs, directories, and symbolic links:
   1) On each Ambari node, run the following command:
      yum erase -y ambari-*
   2) On the Ambari server node, run the following command:
      rm -rf /usr/lib/ambari-server
   3) On each Ambari agent node, run the following command:
      rm -rf /usr/lib/python2.6/site-packages/ambari_agent
   4) If you have not already removed links by using the RPM uninstall, unlink all of the broken links in /usr/lib/python2.6/site-packages/ that point to the Ambari directories. The following is an example:
      COMMON_DIR="/usr/lib/python2.6/site-packages/ambari-commons"
      RESOURCE_MANAGEMENT_DIR="/usr/lib/python2.6/site-packages/resource-management"
      JINJA_DIR="/usr/lib/python2.6/site-packages/ambari_jinja2"
      SIMPLEJSON_DIR="/usr/lib/python2.6/site-packages/ambari_simplejson"

You can remove the broken links with the following code:
   rm -rf /usr/lib/python2.6/site-packages/ambari* /usr/lib/python2.6/site-packages/resource-management
f. Optional: Remove postgresql:
   yum erase -y postgresql-*

After you remove postgresql, on the Ambari server node, clean up the postgresql database by running the following command: rm -rf /var/lib/pgsql*

If you want to keep the postgresql RPM installed, make sure to run ambari-server reset before you deploy the cluster.
g. Optional: Remove MySQL:
   yum erase -y mysql mysql-devel mysql-server
   rm -rf /var/lib/mysql/*

Note: This step is optional. It is possible to reuse MySQL by dropping the database.
h. Clean the yum cache and remove the old ambari.repo file:
   yum clean all
   rm -f /etc/yum.repos.d/ambari.repo
2. Optional: Clean the artifacts by using the following code to remove folders that were created during the installation process, if they still exist:

Tip: These folders should already have been removed in Step 1d on page 56.
   sudo rm -rf /usr/iop/
   sudo rm -rf /hadoop
   sudo rm -rf /etc/hadoop/ /etc/hive /etc/hbase/ /etc/oozie/ /etc/zookeeper/ /tmp/spark

HostCleanup.ini file

The default HostCleanup.ini file is populated with the necessary elements to clean by using the HostCleanup.py script. You can modify the HostCleanup.ini file to include additional artifacts to clean.

The HostCleanup.ini file is a static file that is included with the IBM Open Platform with Apache Hadoop. It includes a full list of RPMs, users, and directories to be removed. You can use the file as it is, or modify it to the specific configuration of your node.

The HostCleanup.py script reads cluster configuration information from the HostCleanup.ini file that lists the artifacts to remove. You pass this configuration

file in by calling the script with the -f parameter. If a user runs HostCleanup.py without the -f parameter, the python script uses the default clean-up configuration file /var/lib/ambari-agent/data/hostcheck.result.

The HostCleanup.ini file is organized in sections. Each section starts with "[" + section name + "]". Each section contains the specific properties that the HostCleanup.py script reads. Remove the properties that you do not want HostCleanup.py to read. Remove entire sections if they do not apply to the specific Ambari node.

The properties are key/value pairs that follow the conventional key=value property format. All property values are comma-separated strings.

Table 23. HostCleanup.ini sections

Section in HostCleanup.ini    Property keys allowed
processes                     proc_list, proc_owner_list, proc_identifier
users                         usr_list
usr_homedir                   usr_homedir_list
directories                   dir_list
alternatives                  symlink_list

processes section
   proc_list: A list of Linux pid numbers that HostCleanup.py kills during the clean up.
   proc_owner_list: HostCleanup.py kills Linux processes owned by the users in this list.
   proc_identifier: HostCleanup.py kills Linux Java processes that can be identified by the text string listed in this property. Make sure the text is unique to avoid killing more processes than needed.
users section
   usr_list: A list of users that HostCleanup.py will remove during the cleanup.

Attention: A user might not have permission to remove a user ID that is exposed through LDAP. Therefore, call the script with --skip="users" to skip removing users, if necessary.
usr_homedir section
   usr_homedir_list: A list of user home directories that HostCleanup.py will remove as part of the remove-user operation. The --skip="users" parameter also causes HostCleanup.py to skip removing user home directories.
directories section
   dir_list: A list of directories to be removed by HostCleanup.py. Use the wildcard * to shorten the list as needed.
alternatives section
   symlink_list: A list of symbolic links that HostCleanup.py unlinks under the /etc/alternatives directory.
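The following is a minimal, hypothetical HostCleanup.ini fragment that illustrates the section and key=value layout; the values shown are examples only and are not the shipped defaults:

   [processes]
   proc_owner_list=hdfs,yarn,hive,zookeeper
   proc_identifier=org.apache.hadoop

   [users]
   usr_list=hdfs,yarn,hive,zookeeper

   [usr_homedir]
   usr_homedir_list=/home/hdfs,/home/yarn,/home/hive,/home/zookeeper

   [directories]
   dir_list=/usr/iop,/hadoop,/etc/hadoop,/etc/hive,/etc/zookeeper

   [alternatives]
   symlink_list=hadoop-conf,hive-conf,zookeeper-conf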

HostCleanup_Custom_Actions.ini file

The default HostCleanup_Custom_Actions.ini file contains additional clean-up sections and works along with the HostCleanup.ini file.

Table 24. HostCleanup_Custom_Actions.ini sections

Section in HostCleanup_Custom_Actions.ini    Property keys allowed
packages                                     pkg_list
repositories                                 repo_list

packages section
   pkg_list: A list of RPMs that HostCleanup.py will remove during the clean up. You can use the wildcard * to shorten the list.
repositories section
   repo_list: A list of repository files to be removed by HostCleanup.py. Specify the file name only; no extension is required because the Python script handles repo files based on the OS type.
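A minimal, hypothetical HostCleanup_Custom_Actions.ini fragment might look like the following; the package and repository names are illustrative only:

   [packages]
   pkg_list=hadoop_*,hive_*,zookeeper_*,bigsql*

   [repositories]
   repo_list=IOP,IOP-UTILS,BIGINSIGHTS-VALUEPACK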

Chapter 4. Installing the IBM BigInsights value-added services on IBM Open Platform with Apache Hadoop

After you install the IBM Open Platform with Apache Hadoop, you can add additional service modules to your cluster.

The set of value-add services that you can install depends on the service module that you add.

Users, groups, and ports for BigInsights value-add services

When you install any of the IBM BigInsights value-added services, the installation adds user accounts and their related directories that you should be aware of.

Prerequisite

If you create these user IDs before the installation as LDAP users or local IDs, make sure of the following prerequisites for all users:
v The default shell is bash. Do not set it to /sbin/nologin.
v Either ensure that passwords do not expire, or ensure that you have a process for managing password changes so that they are always changed before they expire.

UIDs and GIDs must be consistent across all nodes. If you use local service IDs for the BigInsights value-add services, ensure that the UIDs and GIDs are consistent across the cluster by creating them manually before you install the service (a sketch of this is shown below), or decide to have Ambari manage the UID for the service users by selecting options in the Customize Services page in the installation wizard.
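For example, to pre-create the BigSheets service ID manually with the default UID and GID values from Table 25, a minimal sketch might look like the following; run the same commands with the same numeric IDs on every node, and adjust the values if your site uses different ones (the choice of primary versus secondary group here is an assumption):

   groupadd -g 500 hadoop              # skip if the hadoop group already exists
   groupadd -g 1200 bigsheets
   useradd -u 1200 -g hadoop -G bigsheets -s /bin/bash -m bigsheets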

Users, groups, and ports

The following table lists the users, groups, and ports that are created by the installation of the value-added services.

Table 25. Users, groups, and ports created by the installation of the value-added modules

Service: BigSheets
   Service user: bigsheets
   Default UID: 1200
   Groups [Default GID]: hadoop [500], bigsheets [1200]
   Default ports and protocol:
   v 31000 http - BigSheets web app address
   v 31001 - BigSheets PigLocalServer
   v 31005 https - BigSheets web app stop address
   v 31050 jdbc - stored BigSheets metadata

Service: Text Analytics
   Service user: tauser
   Default UID: 2826
   Groups [Default GID]: hadoop [500]
   Default ports and protocol:
   v 32000 http - Text Analytics web app address
   v 32005 https - Text Analytics web app stop address

Service: Big SQL
   Service user: bigsql
   Default UID: 2824
   Groups [Default GID]: hadoop [500]
   Default ports and protocol:
   v 33000 http - Big SQL monitoring
   v 32051 jdbc - Big SQL JDBC
   v 28051 - fast communication start port
   v 7053 http - Scheduler service port
   v 7054 http - Scheduler administrative port
   v 7056 http - server port for Scheduler
   v 7055 jmx - Java Management Extensions (JMX) port for Scheduler

Service: BigInsights Home
   Service user: uiuser
   Default UID: 1004
   Groups [Default GID]: hadoop [500]
   Default ports and protocol:
   v 30000 http - landing page web app address
   v 30005 - Jetty stop port

Service: Big R
   Service user: bigr
   Default UID: 1217
   Groups [Default GID]: hadoop [500], bigr [1202]
   Default ports and protocol:
   v 7052 jdbc
   v 8152 jmx

Service: Data Server Manager (the web manager application service for Big SQL)
   Service user: dsmadmin
   Default UID: 2827

Preparing to install the BigInsights value-add services

Before you install any of the value-add services, you must prepare your environment by installing IBM Open Platform with Apache Hadoop, setting the default Java version, and following a list of prerequisite steps for each value-add service that you plan to install.

To prepare your environment:
1. You must have already downloaded, installed, and deployed the IBM Open Platform with Apache Hadoop (IOP). You must be an administrator of the IOP and have root access to the cluster, or be a non-root user with root (sudo) privileges. The Ambari server must be running.
2. The IOP installation includes OpenJDK 1.8. For all nodes in your cluster, make sure that the default Java is set to the same JDK version that was selected during the IOP install. Run the following command on all nodes to verify the version:
   java -version

Tip: During the Ambari setup, you will have the opportunity to use Open JDK 1.8, or a Custom JDK. The default is OpenJDK 1.8.

Note: If the java command is not found, the JDK might be installed but the PATH environment variable might not be set. For OpenJDK 1.8, the location is /usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64. The Java commands are located in the /bin/ subdirectory. For each node where the version is not correct, run the following commands to create a symbolic link to the proper version:
v For OpenJDK 1.8:
   cd /etc/alternatives
   unlink java
   ln -s /usr/jdk64/java-1.8.0-openjdk-<version>.x86_64/bin/java
Then, run the following command on all nodes after the change to verify the JDK version:
   java -version
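One way to run the verification on every node at once is a small SSH loop. The following is a sketch that assumes password-less SSH as root and a hypothetical nodes.txt file that lists one host name per line:

   for host in $(cat nodes.txt); do
     echo "=== $host ==="
     ssh root@$host "java -version" 2>&1    # java -version prints to stderr
   done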

3. Make sure that all host names are listed as lower case in either the /etc/hosts file or the DNS configuration.
4. Add umask 022 in the .bashrc file of the root user on all nodes. This setting makes newly created files readable, and directories searchable, by all users. A quick check for steps 3 and 4 is sketched after the note below.

Note: After you complete the installation, you can revert to the default umask setting for a secure environment.
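The following is a minimal sketch of checking the host name case and applying the umask on a node; it assumes a Red Hat-style environment, so adapt as needed:

   # Warn if the fully qualified host name contains upper-case characters (step 3)
   hostname -f | grep -q '[A-Z]' && echo "WARNING: host name is not all lower case"

   # Add the umask to the root user's .bashrc if it is not already there (step 4)
   grep -q '^umask 022' /root/.bashrc || echo 'umask 022' >> /root/.bashrc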

5. Do the following prerequisites for the services that you will be installing:

Important:
v When you install any of these value-added services as the non-root user, preface the instructions with sudo where the instruction would normally require the root user.
v UIDs and GIDs must be consistent across all nodes. If you use local service IDs for the BigInsights value-add services, ensure that the UIDs and GIDs are consistent across the cluster by creating them manually. For more information about what users and groups to create, see "Users, groups, and ports for BigInsights value-add services" on page 61.

BigInsights - BigSheets:
a. You must be an IBM Open Platform administrator to add services.
b. If you are also going to work with the BigInsights - Big SQL service, consider installing that service before installing the BigInsights - BigSheets service.
c. Ensure that the bigsheets user has already been created and that the UID for BigSheets is uniform across the cluster, or else that the user does not exist on any node. If the user does not exist, the BigSheets installation code creates the user on all nodes with a default UID of 1200. If the user already exists at the time of installation, take note of the existing UID value and provide that value to the bigsheets.userid property during installation.

Important: The UID must be consistent across all nodes.
d. Set up password-less SSH for the root user between the BigSheets master node and all nodes in the cluster (a minimal sketch is shown after this list of prerequisites).
e. Set up password-less SSH between the BigSheets master node and itself.
f. As part of the BigSheets service install, changes are made to the Hadoop configuration files core-site.xml and yarn-site.xml. In core-site.xml, the following two properties are added:
   hadoop.proxyuser.bigsheets.groups
   hadoop.proxyuser.bigsheets.hosts

The BigSheets service user must be able to proxy as other users in the Hadoop system. The property hadoop.proxyuser.bigsheets.groups is added with the value *, and the property hadoop.proxyuser.bigsheets.hosts is set to the host name of the BigSheets master node. In yarn-site.xml, the BigSheets service user is also added to yarn.admin.acl.
g. Make sure that in the configuration file hdfs-site.xml the property dfs.namenode.acls.enabled is set to true. This is the default value set when IBM Open Platform with Apache Hadoop is installed.

h. Make sure that the Java development package is installed on all nodes. If it is not there, run the following command on all nodes:
   yum install java-<version>-openjdk-devel
i. Make sure that the ports listed in "Users, groups, and ports for BigInsights value-add services" on page 61 are not in use by other programs or components.
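The following is a minimal sketch of setting up password-less SSH for root from the BigSheets master node to another node; the host name node1.example.com is hypothetical. Repeat the ssh-copy-id step for every node in the cluster, including the master node itself:

   # On the BigSheets master node, as root
   ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa     # generate a key without a passphrase (skip if one already exists)
   ssh-copy-id root@node1.example.com           # copy the public key to each node
   ssh root@node1.example.com hostname          # verify that no password prompt appears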

BigInsights - Big SQL:

Note: The user names and home directory locations used here might vary depending on your user environment. Adapt these instructions to match your local user environment.

Restriction: Ambari must run as root, either from root directly or from sudo, in BigInsights - Big SQL.
a. You must install the Big SQL service with at least two nodes in the cluster, with at least one node designated as the Big SQL master, to see the best performance.
b. It is recommended that you run the Big SQL pre-installation checker utility.
c. Confirm that you have Hive metastore connectivity from the node where Big SQL will be installed, even if Big SQL will be on the same node as Hive. You can test this connectivity by opening the Hive shell from the command line and running a simple command. Do the following steps:
   1) Authenticate as the hive user: su hive
   2) Open the Hive shell by typing the following from the command line: hive
   3) Run a command such as the following that displays tables: hive> show tables;
d. Ensure that /home is not mounted with the nosuid option. You can run the mount command in the Linux command line with no options to display information about all known mount points.
e. Install the package ksh on all nodes: yum install ksh
f. Set up password-less SSH for the root user from all Big SQL head nodes to all Big SQL worker nodes in the cluster. Every Big SQL worker node must be able to use password-less SSH (as root) to the Big SQL head node.

Note: If you are a non-root user, make sure that user account can also perform password-less SSH from the Big SQL head node to all of the other nodes.
g. On all nodes, modify the /etc/sudoers file with the following change:
   1) Find the line Defaults requiretty and comment it out by using a # prefix: #Defaults requiretty
h. If the bigsql user ID does not yet exist, the installer will create it.

   If you have pre-created the Big SQL service ID (locally, as bigsql, or a non-default user ID), ensure that the bigsql UID or the non-default UID is the same across all of the nodes. You can determine the UID on each node with the following command:
   cat /etc/passwd | grep bigsql

or possibly, for a non-default user ID:
   cat /etc/passwd | grep <non-default user ID>

or id bigsql

Also, if you pre-create the bigsql user ID (or a user ID other than bigsql), make sure that you set up password-less SSH for that user ID from every node to every node, and from every node to itself. Similarly, make sure that the home directory path for Big SQL (such as /home/bigsql, or /home/notbigsql for a non-default ID) is the same on all nodes. A short loop that compares the service UID across nodes is sketched after this list of Big SQL prerequisites. For information about users and groups for the Big SQL service, see "Users, groups, and ports for BigInsights value-add services" on page 61.
i. Make sure that you added the HDFS, Hive, HBase, Sqoop, HCat, and Oozie clients to all nodes where you intend to install Big SQL components during the installation of IBM Open Platform with Apache Hadoop. If not, you can add them by using Ambari.
   v For example, to confirm where Hive clients are installed, click the Hive service in the Ambari interface, and click the Hive Clients link in the summary panel to see which nodes have clients installed.
   v To see what components are installed for all nodes, click the Hosts tab of the Ambari interface. Then, click the components link for each node to see what has been installed.
j. Make sure these IBM Open Platform with Apache Hadoop services are running:
   v Hive
   v HDFS
   v HBase
   v Oozie
   v Sqoop
   v Knox, and the LDAP server if you use LDAP. If you are not using an LDAP server, start the Knox Demo LDAP service. For more information about the Knox Demo LDAP service, see Step 6 on page 67.
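The following is a minimal sketch of the UID check mentioned in item h; it assumes password-less SSH as root and a hypothetical nodes.txt file with one host name per line. Every line of output should show the same UID and GID:

   for host in $(cat nodes.txt); do
     echo -n "$host: "
     ssh root@$host "id bigsql"     # substitute your non-default Big SQL user ID if you use one
   done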

BigInsights - Text Analytics:
a. Ensure that the YARN, HDFS, and MapReduce2 services are running.
b. Ensure that the jar command is available to the Ambari agent running on the Text Analytics master node. To verify that the jar command is available, do the following:
   1) Determine which user performed the Ambari install (that is, the user who owns the Ambari service process). This should be root for IOP 4.2.
   2) Determine which node in your cluster you would like to use as the Text Analytics master node.
   3) Log in to the node you identified in step 2 as the user identified in step 1.
   4) Type which jar in the command line.
      v If you see a valid path to the jar command, no further steps are needed for this prerequisite.
      v If you see a message indicating that the jar command cannot be found, you will need to include the jar executable in the default PATH for the user identified in step 1. For example, you can append any Java bin directory that contains jar to your PATH in the .bashrc file of the relevant user: export PATH=/usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64/bin/:$PATH. You should then see a valid path to the jar command when logging on as this user.
   5) If you applied any changes to your node in order to make the jar command available, you must ensure that your current logged-in session picks up the changes, then restart the Ambari agent from that session. For example, if you modified the user's .bashrc file, you should run source ~/.bashrc to apply the changes. To restart the Ambari agent, run the ambari-agent restart command.
BigInsights - Big R:
a. Ensure that yum-config-manager is installed. It is included with the yum-utils package.
   yum install -y yum-utils
6. Knox requires that LDAP is running, even if your cluster is not configured for LDAP. The Knox service that is part of IBM Open Platform with Apache Hadoop provides a Demo LDAP server by default. Make sure that the Knox service is started, and that the LDAP server is started if you are using LDAP.
   v Click the Knox service.
   v In the Summary page, click Service Actions, and find Start LDAP server in the drop-down menu.

Note: If Kerberos is enabled, the users referenced in the Demo LDAP configuration must also exist as operating system users on all nodes of the cluster in order for all value-add operations to succeed. If you create the Demo LDAP users on the operating system, such as the guest user, make sure that the UID assigned to the user is the same on all nodes. Ensure that the user UID is greater than the value set in the YARN configuration for Minimum user ID for submitting job. By default, the value is set to 1000.
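For example, to create the Demo LDAP guest user as an operating system user, a minimal sketch might look like the following; the UID 1005 is an arbitrary example above the default 1000 minimum, and the same command should be run on every node:

   useradd -u 1005 -s /bin/bash -m guest
   id guest    # confirm that the same UID is reported on every node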

Obtaining the BigInsights value-add services

If you have acquired software licenses for the BigInsights value-add services, you can download the software from Passport Advantage.

BigInsights value-add modules

The following value-add modules can be deployed on IBM Open Platform with Apache Hadoop:

TECHNICAL PREVIEW DOWNLOAD ONLY
Accept the IBM BigInsights Early Release license agreement:
http://www14.software.ibm.com/cgi-bin/weblap/lap.pl?popup=Y&li_formnum=L-MLOY-9YB5S9&accepted_url=http://ibm-open-platform.ibm.com/repos/beta/4.2.0.0/&title=IBM+BigInsights+Beta+License&declined_url=http://www-01.ibm.com/software/data/infosphere/hadoop/trials.html

Then select the appropriate repository file for your environment: RHEL6 https://ibm-open-platform.ibm.com/repos/beta/4.2.0.0/rhel6/

Use the following TAR files:
   BIPremium-4.2.0.0-beta1.el6.x86_64.tar.gz
   ambari-2.2.0.0-beta1.el6.x86_64.tar.gz
   iop-4.2.0.0-beta1-el6.x86_64.tar.gz
   iop-utils-4.2.0.0-beta1.el6.x86_64.tar.gz

RHEL7 https://ibm-open-platform.ibm.com/repos/beta/4.2.0.0/rhel7/

Use the following TAR files:

BIPremium-4.2.0.0-beta1.el7.x86_64.tar.gz
ambari-2.2.0.0-beta1.el7.x86_64.tar.gz
iop-4.2.0.0-beta1-el7.x86_64.tar.gz
iop-utils-4.2.0.0-beta1.el7.x86_64.tar.gz

IBM BigInsights Premium module
This module provides Big SQL, Text Analytics, BigSheets, Big R, and the BigInsights Home services. The license for this module also provides limited-use licenses for other software so that you can get even more value out of Hadoop. These additional software packages are optional and can also be downloaded from Passport Advantage. These packages are installed according to their own documentation. This additional software includes:
v InfoSphere Data Click
v InfoSphere Information Server
For more information about this additional software, see Additional related software.

Installing the BigInsights value-add packages

After you have prepared the environment and acquired the software, follow these steps to install the value-added services.

Before you begin
v Ensure that you installed an Apache Hadoop platform, such as IBM Open Platform with Apache Hadoop.
v Ensure that you followed the steps in "Preparing to install the BigInsights value-add services" on page 63.
v When you install any of the BigInsights value-added services as the non-root user, preface the instructions with sudo, where the instruction would normally require the root user.

About this task

Where you perform the following steps depends on whether the Hadoop cluster has direct internet access.
v If the Hadoop cluster does not have direct internet access, perform the steps from a Linux host with direct internet access. Then, transfer the files, as required, to a local repository mirror.

TECHNICAL PREVIEW DOWNLOAD ONLY
Accept the IBM BigInsights Early Release license agreement:
http://www14.software.ibm.com/cgi-bin/weblap/lap.pl?popup=Y&li_formnum=L-MLOY-9YB5S9&accepted_url=http://ibm-open-platform.ibm.com/repos/beta/4.2.0.0/&title=IBM+BigInsights+Beta+License&declined_url=http://www-01.ibm.com/software/data/infosphere/hadoop/trials.html

Then select the appropriate repository file for your environment: RHEL6 https://ibm-open-platform.ibm.com/repos/beta/4.2.0.0/rhel6/

Use the following TAR files:

BIPremium-4.2.0.0-beta1.el6.x86_64.tar.gz
ambari-2.2.0.0-beta1.el6.x86_64.tar.gz
iop-4.2.0.0-beta1-el6.x86_64.tar.gz
iop-utils-4.2.0.0-beta1.el6.x86_64.tar.gz

RHEL7 https://ibm-open-platform.ibm.com/repos/beta/4.2.0.0/rhel7/

Use the following TAR files:
   BIPremium-4.2.0.0-beta1.el7.x86_64.tar.gz
   ambari-2.2.0.0-beta1.el7.x86_64.tar.gz
   iop-4.2.0.0-beta1-el7.x86_64.tar.gz
   iop-utils-4.2.0.0-beta1.el7.x86_64.tar.gz

Procedure
1. Set up a local repository. A local repository is required if the Hadoop cluster cannot connect directly to the internet, or if you wish to avoid multiple downloads of the same software when installing services across multiple nodes. In the following steps, the host that performs the repository mirror function is called the repository server. If you do not have an additional Linux host, you can use one of the Hadoop management nodes. The repository server must be accessible over the network by the Hadoop cluster. The repository server requires an HTTP web server. The following instructions describe how to set up a repository server by using a Linux host with an Apache HTTP server.
   a. On the repository server, if the Apache HTTP server is not installed, install it:
      yum install httpd
   b. On the repository server, ensure that the createrepo package is installed:
      yum install createrepo
   c. Make sure there is network access from all nodes in your cluster to the repository server. If data nodes are on a private network and the repository server is external to the Hadoop cluster, you might need to configure iptables, as described in "Get ready to install" on page 11.
   d. On the repository server, create a directory for your value-add repository, such as /repos/valueadds. For example, for Apache httpd, the default is /var/www/html/repos.
      mkdir /var/www/html/repos/valueadds
   e. Extract the TAR files that you downloaded from the Technical Preview site into this directory.
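For example, on a RHEL 6 repository server, the extraction in step 1e might look like the following sketch; the source path /path/to is hypothetical (wherever you downloaded the files), and the file name is one of the RHEL 6 TAR files listed earlier:

   cd /var/www/html/repos/valueadds
   tar -xzf /path/to/BIPremium-4.2.0.0-beta1.el6.x86_64.tar.gz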

   f. Start the web server. If you use Apache httpd, start it by using either of the following commands:
      apachectl start
      or
      service httpd start
      Ensure that any firewall settings allow inbound HTTP access from your cluster nodes to the mirror web server.
   g. Test your local repository by browsing to the web directory:
      http://<your_repository_server>/repos/valueadds
      You should see all of the files that you copied to the repository server.
   h. On the repository server, run the createrepo command to initialize the repository:
      createrepo /var/www/html/repos/valueadds

   i. On the Ambari server node, navigate to the /var/lib/ambari-server/resources/stacks/BigInsights/<version>/repos/repoinfo.xml file. If the file does not exist, create it. Ensure that the baseurl element for the BIGINSIGHTS-VALUEPACK entry points to your repository server. Remember, there might be multiple os sections. Make sure that the URL you tested in step 1g on page 70 matches exactly the value indicated in the baseurl element.


For example, the repoinfo.xml might look like the following content after you change http://ibm-open-platform.ibm.com/repos/BigInsights-Valuepacks/ to become http://your.mirror.web.server/repos/valueadds:

   <repo>
      <baseurl>http://your.mirror.web.server/repos/valueadds</baseurl>
      <repoid>BIGINSIGHTS-VALUEPACK.version_number</repoid>
      <reponame>BIGINSIGHTS-VALUEPACK.version_number</reponame>
   </repo>

Note: The new section might appear as a single line.

Tip: If you later find an error in this configuration file, make corrections and run the following command: yum clean all

Tip: If you are using a local repository URL and you modify the URL at any time, you must remember to update the baseURL. You can update the repoinfo.xml file, or update the fields on the Ambari web tool. Here are the steps to use the Ambari web tool: 1) From the Ambari web dashboard, in the menu bar, click admin > Manage Ambari.

2) From the Clusters panel, click Versions, and select the version entry that you want to edit.
3) Change the URL as needed, and click Save.
2. When the module is installed, restart the Ambari server:
   sudo ambari-server restart

3. Open the Ambari web interface and log in. The default address is the following URL:
   http://<ambari_server_host>:8080

The default login name is admin and the default password is admin.
4. Click Actions > Add Service. In the list of services, you will see the services that you previously added as well as the BigInsights services that you can now add.

What to do next

Select the service that you want to install and deploy. Even though your module might contain multiple services, install the specific service that you want and the BigInsights Home service. Installing one value-add service at a time is recommended. Follow the service specific installation instructions for more information.

To see a suggested layout of services, see Suggested services layout for IBM Open Platform with Apache Hadoop.

Installing BigInsights Home

The BigInsights Home service is the main interface to launch BigInsights - BigSheets, BigInsights - Text Analytics, and BigInsights - Big SQL.

Before you begin

The BigInsights Home service requires Knox to be installed, configured, and started.

Procedure
1. Open a browser and access the Ambari server dashboard. The following is the default URL:
   http://<ambari_server_host>:8080
   The default user name is admin, and the default password is admin.
2. In the Ambari dashboard, click Actions > Add Service.
3. In the Add Service Wizard > Choose Services, select the BigInsights – BigInsights Home service. Click Next. If you do not see the option for BigInsights – BigInsights Home, follow the instructions described in "Installing the BigInsights value-add packages" on page 69.
4. In the Assign Masters page, select a management node (edge node) that your users can communicate with. BigInsights Home is a web application that your users must be able to open with a web browser.
5. In the Assign Slaves and Clients page, make selections to assign slaves and clients. The nodes that you select will have JSqsh (an open source, command-line interface to SQL for Big SQL and other database engines).
6. Click Next to review any options that you might want to customize. If you want to change the default UID for the BigInsights Home service account, select the Misc tab in the Customize Services page. You can manage the UID for this service account so that Ambari will create the user with the appropriate UID.
7. Click Deploy.

If the BigInsights – BigInsights Home service fails to install, run the remove_value_add_services.sh cleanup script. The following code is an example command:
   cd /usr/ibmpacks/bin/
   ./remove_value_add_services.sh -u admin -p admin -x 8080 -s WEBUIFRAMEWORK -r

For more information about cleaning the value-add service environment, see "Removing BigInsights value-add services" on page 90.
8. After installation is complete, click Next > Complete.

What to do next
1. Click the Knox service from the Ambari web interface to see the summary page.
2. Select Service Actions > Restart All to restart the service and all of its components.
3. If you are using LDAP, you must also start LDAP if it is not already started.
4. Click the BigInsights Home service in the Ambari user interface.
5. Select Service Actions > Restart All to restart the service and all of its components.
6. Make sure that the Knox service is enabled.
7. Open the BigInsights Home page from a web browser. The URL for BigInsights Home is
   https://<knox_host>:<knox_port>/<knox_gateway_path>/default/BigInsightsWeb/index.html
   where:
   knox_host
      The host where Knox is installed and running
   knox_port
      The port where Knox is listening (by default, this is 8443)
   knox_gateway_path
      The value entered in the gateway.path field in the Knox configuration (by default, this is 'gateway')

For example, the URL might look like the following address:
   https://myhost.company.com:8443/gateway/default/BigInsightsWeb/index.html
If you are using the Knox Demo LDAP, a default user ID and password are created for you. When you access the web page, use the following preset credentials:
   User Name = guest
   Password = guest-password

8. To invalidate your session, click the Menu icon on the BigInsights Home page and select Sign Out. You will need to re-authenticate before being able to display the BigInsights Home page again.

Installing the BigInsights - Big SQL service

To extend the power of the Open Platform for Apache Hadoop, install and deploy the BigInsights - Big SQL service, which is the IBM SQL interface to the Hadoop-based platform, IBM Open Platform with Apache Hadoop.

Before you begin

Make sure that you do the prerequisite steps listed in Step 5 on page 64, Preparing to install the BigInsights value-add services.

Remember, you must install the Big SQL service with at least two nodes in the cluster.

Restriction:
v Non-root Ambari installations are not supported.
v You must install the Big SQL service with at least two nodes in the cluster.
v When installing the Big SQL service, ensure that the NodeManager component is not installed on either the head node or the secondary head node, as this is not a supported configuration.

Procedure
1. Open a browser and access the Ambari server dashboard. The following is the default URL:

   http://<ambari_server_host>:8080

The default user name is admin, and the default password is admin.
2. In the Ambari web interface, click Actions > Add Service.
3. In the Add Service Wizard, Choose Services, select the BigInsights - Big SQL service and the BigInsights Home service. Click Next. If you do not see the option to select the BigInsights - Big SQL service, complete the steps in "Installing the BigInsights value-add packages" on page 69.
4. In the Assign Masters page, decide on which nodes of your cluster you want to run the specified components, or accept the default nodes. Follow these guidelines:
   v For the Big SQL monitoring and editing tool, you can assign the Data Server Manager (DSM) to a different node from the Hive metastore.

Tip: If you install Big SQL before you install the DSM service, you will not see DSM in any selections.
5. Click Next.
6. In the Assign Slaves and Clients page, accept the defaults, or make specific assignments for your nodes. Follow these guidelines:
   v Select the non-head nodes for the Big SQL Worker components. You must select at least one node as the worker node.
7. In the Customize Services page, accept the recommended configurations for the Big SQL service, or customize the configuration by expanding the configuration files and modifying the values. Make sure that you have a valid bigsql_user and bigsql_user_password in the appropriate fields in the Advanced bigsql-users-env section. If you want to change the default UID for the BigInsights - Big SQL service account, select the Misc tab in the Customize Services page. You can manage the UID for this service account so that Ambari will create the user with the appropriate UID.
8. You can review your selections in the Review page before accepting them. If you want to modify any values, click the Back button. If you are satisfied with your setup, click Deploy.
9. In the Install, Start and Test page, the Big SQL service is installed and verified. If you have multiple nodes, you can see the progress on each node. When the installation is complete, either view the errors or warnings by clicking the link, or click Next to see a summary and then the new service added to the list of services.
   If the BigInsights – Big SQL service fails to install, review the errors, correct the problems, and click Retry in Ambari. If the install still fails and Big SQL needs to be uninstalled, run the remove_value_add_services.sh cleanup script. The following code is an example of the command:
      cd /usr/ibmpacks/bin/
      ./remove_value_add_services.sh -u admin -p admin -x 8080 -s BIGSQL -r
   When remove_value_add_services.sh completes, if the Big SQL service still appears in the Ambari list of services, you can remove the Big SQL service from Ambari by running the following REST API call:
      curl -u admin:admin -H "X-Requested-By: ambari" -X DELETE http://ambari_server_host_name:8080/api/v1/clusters/cluster_name/services/BIGSQL

Then, restart the Ambari server.

For more information about cleaning the value-add service environment, see "Removing BigInsights value-add services" on page 90.
10. The following properties in the hdfs-site.xml, core-site.xml, and yarn-site.xml sections of the configuration are updated for you by the installation of Big SQL. You can verify that these properties are configured.
   a. In the Ambari web interface, click the HDFS service.
   b. Click the Configs tab and then the Advanced tab.
      1) Expand the Custom core-site section to see the following properties:
         Key: hadoop.proxyuser.bigsql.groups
         Value: *
         Key: hadoop.proxyuser.bigsql.hosts
         Value: Substitute with the fully qualified host name where the Big SQL master is installed
      2) Expand the Custom hdfs-site section to see the following property:
         Key: dfs.namenode.acls.enabled
         Value: true
   c. In the Ambari web interface, click the YARN service.
   d. Click the Configs tab and then the Advanced tab. Expand the Resource Manager section.
   e. Find the yarn.admin.acl property.
      1) In the Value text field for the yarn.admin.acl property, find the bigsql user. It might look like the following value: yarn,bigsql.
11. Restart the HDFS, YARN, MapReduce2, and Big SQL services, if needed.
   a. For each service that requires restart, select the service.
   b. Click Service Actions.
   c. Click Restart All.
12. A web application interface for Big SQL monitoring and editing is available to your end users to work with Big SQL. You access this monitoring utility from the IBM BigInsights Home service. If you have not added the BigInsights Home service yet, do that now.
13. Restart the Knox service. Also start the Knox Demo LDAP service if you have not configured your own LDAP.
14. Restart the BigInsights Home service.
15. To run SQL statements from the Big SQL monitoring and editing tool, type the following address in your browser to open the BigInsights Home service:
   https://<knox_host>:<knox_port>/<knox_gateway_path>/default/BigInsightsWeb/index.html

Where:
   knox_host
      The host where Knox is installed and running
   knox_port
      The port where Knox is listening (by default, this is 8443)
   knox_gateway_path
      The value entered in the gateway.path field in the Knox configuration (by default, this is 'gateway')

For example, the URL might look like the following address:

   https://myhost.company.com:8443/gateway/default/BigInsightsWeb/index.html
If you use the Knox Demo LDAP service, the default credential is:
   userid = guest
   password = guest-password
To invalidate your session, click the Menu icon on the BigInsights Home page and select Sign Out. You will need to re-authenticate before being able to display the BigInsights Home page again. Your end users can also use the JSqsh client, which is a component of the BigInsights - Big SQL service.
16. If the BigInsights - Big SQL service shows as unavailable, there might have been a problem with post-installation configuration. Run the following commands as root (or sudo) where the Big SQL monitoring utility (DSM) server is installed:
   a. Run the dsmKnoxSetup script:
      cd /usr/ibmpacks/IBM-DSM/<version>/ibm-datasrvrmgr/bin/
      ./dsmKnoxSetup.sh <knox_host>

where <knox_host> is the node where the Knox gateway service is running.
   b. Make sure that you do not stop and restart the Knox gateway service within Ambari. If you do, then run the dsmKnoxSetup script again.
   c. Restart the BigInsights Home service so that the Big SQL monitoring utility (DSM) can be accessed from the BigInsights Home interface.
17. For HBase, do the following post-installation steps:
   a. For all nodes where HBase is installed, check that the symbolic links to hive-serde.jar and hive-common.jar in the hbase/lib directory are valid.
      v To verify that the symbolic links are created and valid:
         namei /usr/iop/<version>/hbase/lib/hive-serde.jar
         namei /usr/iop/<version>/hbase/lib/hive-common.jar
      v If they are not valid, do the following steps:
         cd /usr/iop/<version>/hbase/lib
         rm -rf hive-serde.jar
         rm -rf hive-common.jar
         ln -s /usr/iop/<version>/hive/lib/hive-serde.jar hive-serde.jar
         ln -s /usr/iop/<version>/hive/lib/hive-common.jar hive-common.jar
   b. After installing the Big SQL service and fixing the symbolic links, restart the HBase service from the Ambari web interface.
18. If you are planning to use the HA feature with Big SQL, ensure that the local user "bigsql" exists in /etc/passwd on all nodes in the cluster; otherwise, you will not be able to use the HA feature with Big SQL.
19. If you want to create and access catalog tables in BigSheets to work with Big SQL, do the following one-time step after Big SQL and BigSheets are both installed, in the exact order listed:
   a. Restart BigSheets.
   b. Restart Hive.
   c. Restart Big SQL.
20. If you want to load data by using the LOAD command into Hadoop tables from files that are compressed by using the lzo codec, do the following steps:
   a. Update the HDFS configuration.
      1) Click the HDFS service and open the Configs > Advanced tabs.
      2) Expand the Advanced core-site section.

      3) Edit the io.compression.codecs field and append the following value to it:
         com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec
      4) Restart the HDFS and Big SQL services.
   b. Make sure the following packages are installed on all nodes in the cluster:
      v lzo
      v lzop
      v hadoop-lzo
      v hadoop-lzo-native

What to do next

After you add Big SQL worker nodes, make sure that you stop and then restart the Hive service.

For information about using BigInsights - Big SQL, see Analyzing and manipulating big data with Big SQL.

Preinstallation checker utility for Big SQL

Before you install the BigInsights - Big SQL service, run the preinstallation checker utility on all nodes that will be a part of the Big SQL service to verify that your Linux environment is ready to install Big SQL.

The preinstallation checker utility is invoked automatically as part of the installation. It checks each worker node before running the installation on that host. You can also invoke the checker utility manually on each node, which can validate the machine before the service is installed.

Before any installation attempt, the preinstallation checker exists only on the Ambari server node. It is located in /var/lib/ambari-server/resources/stacks/BigInsights/<version>/services/BIGSQL/package/scripts/.

After the first installation attempt, the Big SQL preinstallation checker utility is located at /var/lib/ambari-agent/cache/stacks/BigInsights/<version>/services/BIGSQL/package/scripts/.

The resulting log file is stored in /tmp/bigsql/logs/bigsql-precheck-<timestamp>.log. Read the log file to get more details about any issues that are found. If one or more checks fails, determine the problem by reading the log and then re-run the utility. If you run the utility before the Big SQL service is installed, some checks are skipped.

The following syntax and examples describe how you can use the Big SQL preinstallation checker utility.

Syntax
   bigsql-precheck.sh [options]

The following [options] can be included:
-A
   Run the preinstallation checker utility on all hosts.
-b hbase user
   The user name for the HBase service. The default value is hbase.

-f bigsql port
   The port number used by Big SQL.
-F
   Force all utility warnings to be logged as errors.
-h
   Display the Help screen.
-H bigsqlhome
   The path to BIGSQL_DIST_HOME. The default value is /usr/ibmpacks/current/bigsql/bigsql.
-i hive user
   The user name for the Hive service. The default value is hive.
-l hostlist
   The path and file name listing all of the Big SQL hosts.
-L logdir
   The name of the log directory that you want to use. The default is /tmp/bigsql/logs.
-M mode
   The phase of the check. The default is PRE_ADD_HOST.
-N namenode
   The NameNode host name.
-p DB2 port
   The DB2 fast communication manager (FCM) port number.
-P
   Parallel mode. Use for a large cluster. If used with Verbose mode, printing is asynchronous.
-R hdfs principal
   The principal name for the HDFS user when you use Kerberos.
-t timestamp
   The timestamp format to append to the log file name (to ensure consistency in output file names). The default format is YYYY-MM-DD_hh.mm.ss.[fraction].
-T hdfs keytab
   The keytab for the HDFS user when you use Kerberos.
-u bigsql user
   The user name for the Big SQL user. The default value is bigsql.
-v vardir
   The path to the /var/ directory. The default value is /var/ibm/bigsql.
-V
   Verbose mode. Specifies to display test results on the shell stdout. This option is required to see the output.
-x hdfs user
   The user name for the HDFS user. The default value is hdfs.
-z uid check
   The Big SQL user ID test string (for IBM internal use).

Usage notes

The Big SQL preinstallation checker utility performs data collection on all hosts specified by the -l parameter. The tests can be run in parallel.

On each host, the utility checks for the following issues:
v A Korn shell, /bin/ksh, exists on all nodes.
v The db2set DB2RSHCMD is set, and the userprofile is not empty and is larger than X bytes.
v If you are using LDAP, the local bigsql user ID and group ID match exactly, or do not exist yet in the LDAP definition.
v There is enough disk space.
v The /etc/hosts file contains host names in both long and short format. The file should also contain an entry for db2nodes.cfg on each node.
v The time difference between nodes is less than MAX_TIME_DIFF. Otherwise, a CREATE DATABASE might not work.
v The bigsql user ID and group ID are the same across all of the nodes, or it is possible to make them the same across all nodes.
v The /tmp/bigsql/ directory is writable by the bigsql user ID.
v The line with requiretty is commented out with # in the /etc/sudoers file.
v The Big SQL /home/ resolves to the same path across all nodes.
v HDFS is available to the bigsql user ID.
v The proper privileges have been granted to the installation directories.
v SQLLIB already exists in the Big SQL user /home/.
v There are no DB2 entries in /etc/services.
v Determine potential errors with db2ckgpfs.
v Validate the existence of password-less ssh.
v Validate that all of the client requirements are met.
v Check that the Big SQL user name is within the required length.
v Check that there are sufficient resources for Tivoli System Automation (TSA).
v Check if the fast communication manager (FCM) port is in use.
v Check if the DB2 communication manager (DB2COMM) port is in use.
v Check that the HBase user has a valid login shell.
v Check the Hive primary group.
v Check umask.
v Check for upper case in the host name.
v Ensure that the Big SQL home directory is not mounted with the nosuid option set.

Examples
1. Run the pre-checker utility with the minimum required parameters and use the default log file:
   ./bigsql-precheck.sh -V
2. Run the pre-checker utility on a list of Big SQL hosts, with the HDFS, HBase, and Hive user names set, the Big SQL and DB2 FCM ports specified, in Verbose mode, and with warnings logged as errors:

   ./bigsql-precheck.sh -u bigsql -l mick_nodes_list.cfg -z bigsql,2824,hadoop,515 -x hdfs -b hbase -i hive -p 32051 -f 28051 -V -F

Installing the BigInsights - Data Server Manager service

To extend the power of the Open Platform for Apache Hadoop, install and deploy the BigInsights - Data Server Manager service, which is the IBM web-based monitoring tool for BigInsights - Big SQL.

Before you begin

Make sure that you do the prerequisite steps listed in Step 5 on page 64, Preparing to install the BigInsights value-add services.

The BigInsights - Data Server Manager service requires Knox to be installed, configured, and started.

About this task

Follow these steps to install the BigInsights - Data Server Manager service, and make it available on the BigInsights Home page. From there, you can launch the Data Server Manager.

Procedure
1. Open a browser and access the Ambari server dashboard. The following is the default URL:
   http://<ambari_server_host>:8080

The default user name is admin, and the default password is admin.
2. In the Ambari web interface, click Actions > Add Service.
3. In the Add Service Wizard, Choose Services, select the BigInsights - Data Server Manager service, and the BigInsights Home service if it is not already installed. Click Next. If you do not see the option to select the BigInsights - Data Server Manager service, install the appropriate module and restart Ambari, as described in "Installing the BigInsights value-add packages" on page 69.
4. In the Assign Masters page, decide which node of your cluster you want to run the Data Server Manager master.
5. Click Next.
6. In the Assign Slaves and Clients page, all of the defaults are automatically accepted and the next page appears automatically. Because the BigInsights - Data Server Manager service does not have any slaves or clients, the Assign Slaves and Clients page is skipped during the installation.
7. In the Customize Services page, accept the recommended configurations for the Data Server Manager service, or customize the configuration by expanding the configuration files and modifying the values. Make sure that you enter the following information in the Advanced dsm-config section:
   dsm_admin_user field
      Type a Knox user name that will become the administrator for Data Server Manager.
   If you want to change the default UID for the BigInsights - Data Server Manager service account, select the Misc tab in the Customize Services page. You can manage the UID for this service account so that Ambari will create the user with the appropriate UID.
8. Click Next.
9. You can review your selections in the Review page before accepting them. If you want to modify any values, click the Back button. If you are satisfied with your setup, click Deploy.
10. In the Install, Start and Test page, the Data Server Manager service is installed and verified. If you have multiple nodes, you can see the progress on each node. When the installation is complete, either view the errors or warnings by clicking the link, or click Next to see a summary and then the new service added to the list of services.
11. Click Complete. If the BigInsights – Data Server Manager service fails to install, run the remove_value_add_services.sh cleanup script. The following code is an example of the command:
   cd /usr/ibmpacks/bin/
   ./remove_value_add_services.sh -u admin -p admin -x 8080 -s DATASERVERMANAGER -r

For more information about cleaning the value-add service environment, see "Removing BigInsights value-add services" on page 90.
12. Make sure that the Knox service is enabled.

13. Access the BigInsights – Data Server Manager service from the BigInsights – Home service. Before you can launch the BigInsights – Home service to access DSM, ensure that the following services are installed and started:
   v If the BigInsights – Home service is not installed, see "Installing BigInsights Home" on page 72.
   v If the BigInsights – Big SQL service is not installed, see "Installing the BigInsights - Big SQL service" on page 74.
   v Make sure that you restart the BigInsights – Home service so that the BigInsights – Big SQL service icon displays on the Home page.
14. Launch the BigInsights – Home service by typing the following address in your web browser:
   https://<knox_host>:<knox_port>/<knox_gateway_path>/default/BigInsightsWeb/index.html

Where:
   knox_host
      The host where Knox is installed and running
   knox_port
      The port where Knox is listening (by default, this is 8443)
   knox_gateway_path
      The value entered in the gateway.path field in the Knox configuration (by default, this is 'gateway')

For example, the URL might look like the following address:
   https://myhost.company.com:8443/gateway/default/BigInsightsWeb/index.html
15. Click the BigInsights Big SQL icon to launch the Data Server Manager.

What to do next

For information about using BigInsights - Data Server Manager, see http://www.ibm.com/support/knowledgecenter/SS5Q8A_1.1.0/com.ibm.datatools.dsweb.ots.over.doc/topics/welcome.html?lang=en.

Installing the Text Analytics service

The Text Analytics service provides powerful text extraction capabilities. You can extract structured information from unstructured and semi-structured text.

Before you begin

Make sure that you do the prerequisite steps listed in Step 5 on page 64, Preparing to install the BigInsights value-add services.

Follow the steps in Installing the value-add services in the IBM Open Platform with Apache Hadoop before you begin the steps below.

Procedure
1. Open a browser and access the Ambari server dashboard. The following is the default URL:
   http://<ambari_server_host>:8080

The default user name is admin, and the default password is admin.
2. In the Ambari dashboard, click Actions > Add Service.
3. In the Add Service Wizard, Choose Services, select the BigInsights - Text Analytics service. If you do not see the option to select the BigInsights - Text Analytics service, complete the steps in "Installing the BigInsights value-add packages" on page 69.
4. To assign master nodes, select the Text Analytics Master server node.
5. Click Next. The Assign Slaves and Clients page displays.
6. Assign slave and client components to the hosts on which you want them to run. An asterisk (*) after a host name indicates that the host is assigned a master component.
   a. To assign slave nodes and clients, click All in the Clients column. The client package that is installed contains runtime binaries that are needed to run Text Analytics.

Important: This client needs to be installed on all DataNodes that belong to your cluster. You can also optionally choose nodes that do not have a DataNode on them, as the Text Analytics service will deploy it for you. Client nodes install only the Text Analytics Runtime artifacts (/usr/ibmpacks/current/text-analytics-runtime). Choose one or more clients. You do not have to choose the Master node as a client because it already installs the Text Analytics Runtime.
7. Click Next and select BigInsights - Text Analytics.
8. Expand Advanced ta-database-config and enter the password in the database.password field. Recommended configurations for the service are completed automatically, but you can edit these default settings as desired. The database server can be either MySQL or MariaDB. There are two options:
   v database.create.new = Yes (default)
      a. The database server will be MySQL for the RHEL 6 platform and MariaDB for the RHEL 7 platform.
      b. You must enter the password for the database.
      c. You must ensure that the default port, 32050, is free. You can change the port to any free port.
      d. You can change the database.username, but any changes to the database.hostname are ignored.
   v database.create.new = No
      a. You must enter the database.hostname, database.port (where the existing database server instance is running), database.user, and database.password. Ensure that the user and password have full access to create a database in the existing database server instance that you specify. Especially if it is a remote MySQL server instance, ensure that all permissions are given to the user and password to access this remote instance. Ensure that the server instance is up and running so that the Text Analytics service can be started successfully.

9. Expand Advanced ta-web-tooling-config and select the type of tokenizer you would like to use to run and build the Text Analytics extractors. This can also be modified later. The Text Analytics service provides support for two types of tokenizers:
   a. standard (default): The standard tokenizer uses white space and punctuation to split tokens. Since a white space tokenizer is efficient, you can use this tokenizer for alphabetic languages like English and Spanish. The standard tokenizer will generally perform more efficiently than the multilingual tokenizer, but is not suitable to process text in languages without word boundary indicators, such as Chinese or Japanese. The standard tokenizer also does not support part-of-speech extraction or tagging, and none of the POS-based extractors are visible or usable when the standard tokenizer has been chosen.
   b. multilingual: The multilingual tokenizer uses white space and punctuation to split tokens, but also has algorithms for processing ideographic languages, such as Chinese or Japanese. Refer to the Language Support table for the list of languages that the multilingual tokenizer supports.

Note: Whenever you modify the tokenizer type, be sure to restart the Text Analytics service for the changes to take effect.
10. If you want to change the default UID for this BigInsights - Text Analytics service account, select the Misc tab in the Customize Services page. You can manage the UID for this service account so that Ambari will create the user with the appropriate UID.
11. Click Next and in the Review screen that opens, click Deploy.
12. After installation is complete, click Next > Complete.
13. After the installation is successful, click Next and Complete.
    If the BigInsights - Text Analytics service fails to install, run the remove_value_add_services.sh cleanup script. The following code is an example command:
    cd /usr/ibmpacks/bin/
    remove_value_add_services.sh -u admin -p admin -x 8080 -s TEXTANALYTICS -r

For more information about cleaning the value-add service environment, see "Removing BigInsights value-add services" on page 90.
14. Update the Ambari server to allow the Text Analytics Web Tooling service user to impersonate other users. This step is required for two reasons:
    v To support the ability to browse the HDFS file system as the logged-in user from within the Text Analytics Web Tooling user interface.
    v To support the execution of text extractors against data on HDFS by using Hadoop MapReduce jobs, where the jobs are issued on behalf of the logged-in user.
    To give impersonation privileges to the Text Analytics Web Tooling service user (which defaults to tauser), do the following steps:
    a. From the Ambari web interface, click the HDFS service.
    b. At the top of the screen, click the Configs tab.
    c. Scroll down and expand the Custom core-site section.
    d. Click Add property to add the following two properties:

Table 26. Add property

Key: hadoop.proxyuser.<textanalytics user name>.groups
Value: *, or a value that allows the Text Analytics service user to impersonate the appropriate users.

Key: hadoop.proxyuser.<textanalytics user name>.hosts
Value: The fully qualified host name of the node that you selected as your master node in Step 4.

In the key names, <textanalytics user name> is the value of the textanalytics_user field from the Advanced ta-service-config section of the Text Analytics Web Tooling install configuration (the default value for that field is tauser).
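As an illustration only, if you keep the default service user tauser and your Text Analytics master node is named ta-master.example.com (a hypothetical host name), the resulting custom core-site entries would look similar to the following sketch:

   <property>
     <name>hadoop.proxyuser.tauser.groups</name>
     <value>*</value>
   </property>
   <property>
     <name>hadoop.proxyuser.tauser.hosts</name>
     <value>ta-master.example.com</value>
   </property>

The groups value of * is shown only as an example; restrict it to specific groups if your security policy requires it.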

Remember: Click Save, located at the top-right section of the Ambari interface, and restart the HDFS, YARN, and MAPREDUCE2 services.
15. Make sure that the Knox service is enabled.
16. Restart the Knox service. If you have not configured an LDAP service, start the Knox Demo LDAP service.

17. Open the BigInsights Home and launch Text Analytics at the following address:
    https://<knox_host>:<knox_port>/<knox_gateway_path>/default/BigInsightsWeb/index.html

Where:
knox_host
   The host where Knox is installed and running.
knox_port
   The port where Knox is listening (by default, this is 8443).
knox_gateway_path
   The value entered in the gateway.path field in the Knox configuration (by default, this is 'gateway').

For example, the URL might look like the following address: https://myhost.company.com:8443/gateway/default/BigInsightsWeb/index.html

If you use the Knox Demo LDAP service and have not modified the default configuration, the default credentials to log in to the BigInsights - Home service are:
   userid = guest
   password = guest-password
To invalidate a session, click the Menu icon on the BigInsights Home page and select Sign Out. You will need to re-authenticate before being able to display the BigInsights Home page again.

Note: If you do not see the Text Analytics service from BigInsights Home, restart the BigInsights Home service in the Ambari interface.

What to do next

For information about using BigInsights - Text Analytics, see Analyzing big data with Text Analytics.

Enabling Knox for value-add services

To enable Knox after installing the BigInsights value-add services, a post-installation script must be executed. Besides making sure that the Knox service is up and running, this script also sets up the URL paths to BigSheets, Text Analytics, and Data Server Manager in Knox, which makes those URLs accessible.

About this task

After you install IBM Open Platform with Apache Hadoop, the Knox service, and the BigInsights value-add services (BigInsights Home, BigSheets, Text Analytics, and/or Data Server Manager), a post-installation script needs to be executed to enable Knox for the value-add services. The script updates the Ambari topology and Knox configurations and restarts the Ambari server and Knox server as part of the process to enable Knox for the value-add services. The script needs to be run after installing any of the following value-add services: BigInsights Home, BigSheets, Text Analytics, and Data Server Manager. If the value-add services are installed at different times, the script needs to be rerun anytime one of the value-add services is installed.

Note: The Ambari server and Knox server are restarted as part of running this script.

Procedure
1. The following directory contains the Knox setup scripts and cleanup scripts:
   /usr/ibmpacks/bin/
2. The BigInsights value-add services include scripts to help you enable Knox for the following value adds: BigInsights Home, BigSheets, Text Analytics, and Data Server Manager.

Option: Knox enablement for value adds
Description: knox_setup.sh -u <ambari_admin_user> -p <ambari_admin_password> -y

The knox_setup.sh script detects the appropriate http or https protocol automatically. It first tries http and, if it fails, it switches to https.

The script also gets the port information from the /etc/ambari-server/conf/ambari.properties file.
Use the following parameter definitions:
-u The Ambari administrator user name.
-p The Ambari administrator password. This parameter is optional.
-y Assume yes as the answer to any question. This parameter is optional.
If you install IBM Open Platform with Apache Hadoop as the non-root user, preface the knox_setup.sh command with sudo, where the instruction would normally require the root user.

Example

Knox setup examples for value-add services:

Note: The examples use the default Ambari port number, 8080. If you modified the port number, adjust for your environment.
v Normal Knox enablement of value-add services:
  sudo ./knox_setup.sh -u admin -p admin

This example includes the optional password parameter.
v Knox enablement of value-add services with the assume yes option:
  sudo ./knox_setup.sh -u admin -y

This example includes the optional -y parameter, which means you want to assume yes to any system prompt.

The following is example output from running the Knox setup scripts:

[root@ambari.server.name 2.0-SNAPSHOT]# ./knox_setup.sh -u admin -p admin
Protocol : http
***********************************************************************
Is Ambari User Name admin correct? y
***********************************************************************
Is Ambari Cluster Name ambari.server.name correct? y
***********************************************************************
Is Ambari port 8080 correct? y
***********************************************************************
Is Knox gateway server knox.server.name correct? y
***********************************************************************
Is IBM BigInsights IOP installed? y
***********************************************************************
***********************************************************************
Updating KNOX jars:
***********************************************************************
gateway-service-bigsheets-5.3-SNAPSHOT.jar                    100%   13KB  13.1KB/s   00:00
gateway-service-dsm-1.2-SNAPSHOT.jar                          100%   13KB  13.3KB/s   00:00
gateway-service-text-analytics-web-tooling-3.0-SNAPSHOT.jar   100%   23KB  22.8KB/s   00:00
gateway-service-web-ui-2.5-SNAPSHOT.jar                       100%   17KB  16.7KB/s   00:00
Completed
***********************************************************************
Update Topology:
***********************************************************************
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
105   737  105   737    0     0  73832      0 --:--:-- --:--:-- --:--:-- 81888
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  5148  100  5148    0     0   699k      0 --:--:-- --:--:-- --:--:-- 1005k
Checking for updates for conf/knox_conf.json
roles== WEBUIFRAMEWORK
Updating URL for = WEBUIFRAMEWORK to http://{{ui_server_host}}:{{ui_server_port}}/biginsights
roles== BIGSHEETS
Updating URL for = BIGSHEETS to http://{{bigsheets_server_host}}:{{bigsheets_server_port}}/bigsheets
roles== TEXTANALYTICS
Updating URL for = TEXTANALYTICS to http://{{ta_server_host}}:{{ta_server_port}}/TextAnalyticsWeb
roles== DSM
Updating URL for = DSM to http://{{dsm_server_host}}:{{dsm_server_port}}/console
Provider iop-util already exists, no updates needed for provider
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  4800    0     0  100  4800      0   243k --:--:-- --:--:-- --:--:--  260k
***********************************************************************
Updating KNOX keystore file at Text Analytics master node: textanalytics.server.name.svl.ibm.com
***********************************************************************
gateway.jks                                                   100% 1374    1.3KB/s   00:00
gateway.jks                                                   100% 1374    1.3KB/s   00:00
Completed
***********************************************************************
Cleanup KNOX deployment files:
***********************************************************************
Completed
***********************************************************************
Updating KNOX params.py file:
***********************************************************************
Completed
***********************************************************************
Restarting AMBARI:
***********************************************************************
Using python  /usr/bin/python2.6
Restarting ambari-server
Using python  /usr/bin/python2.6
Stopping ambari-server
Ambari Server stopped
Using python  /usr/bin/python2.6
Starting ambari-server
Ambari Server running with administrator privileges.
Organizing resource files at /var/lib/ambari-server/resources...
Server PID at: /var/run/ambari-server/ambari-server.pid
Server out at: /var/log/ambari-server/ambari-server.out
Server log at: /var/log/ambari-server/ambari-server.log
Waiting for server start......
Ambari Server 'start' completed successfully.
***********************************************************************
Stop KNOX:
***********************************************************************
Stopping KNOX Service
KNOX stopped
Stopping KNOX succeeded
***********************************************************************
Redeploy KNOX:
***********************************************************************
Completed
***********************************************************************
Start KNOX:
***********************************************************************
Starting KNOX Service
KNOX started
Starting KNOX succeeded
[root@ambari.server.name 2.0-SNAPSHOT]#

Removing BigInsights value-add services

When you remove a BigInsights value-add service, make sure that there are no remaining files that might cause problems for future installations.

About this task

These clean-up processes do not remove the Ambari server, nor do they impact any of the IBM Open Platform with Apache Hadoop configurations. The scripts will remove the top-level value-add RPMs.

The cleanup for BigSheets removes data on HDFS for the child workbooks. If you want to save any of the child workbook data, use the Export Data option from the BigSheets home page for each of the child workbooks and save the data on HDFS. For information on how to export data, see Exporting data from a BigSheets workbook.

Procedure
1. Navigate to the directory that contains the cleanup scripts:
   cd /usr/ibmpacks/bin/
2. The BigInsights value-add services include scripts to help you remove the value-add services and to clean up your environment:

Option: Service cleanup
Description: remove_value_add_services.sh -u <ambari_admin_user> -p <ambari_admin_password> -x <ambari_port> -s <service> -a <stop_attempts> -b <remove_attempts> -r -l -c <run_as_user>

Option: Assembly cleanup
Description: remove_value_add_services_and_assembly.sh -A -u <ambari_admin_user> -p <ambari_admin_password> -x <ambari_port> -s <assembly> -q -l -c <run_as_user>

Option: Service and assembly cleanup
Description: remove_value_add_services_and_assembly.sh -u <ambari_admin_user> -p <ambari_admin_password> -x <ambari_port> -s <service_or_assembly> -a <stop_attempts> -b <remove_attempts> -q -r -l -c <run_as_user>

Use the following parameter definitions:
-A This option is mandatory. Performs assembly removal only.
-u The Ambari administrator user name.
-p The Ambari administrator password.
-x The Ambari server port number.

-s Depending on the script that you run, the service to remove, or the service assembly to remove. The following values are allowed:
   Service
      v TEXTANALYTICS
      v WEBUIFRAMEWORK (this is the BigInsights Home)
      v BIGSHEETS
      v BIGSQL
      v BIGR
      v DATASERVERMANAGER
   Assembly
      v DS (this is the Data Scientist module)
      v Analytics (this is the Analyst module)
      v ALL
-f This option is optional. The FORCE option allows you to continue removing the service, even if intermediate steps fail.

Note: This might result in an Ambari unknown service state.
-q This option is optional. Depending on the script that you run, specifies to remove stack files that are associated with the service, or removes the yum repo and the cache.

Note: When running the remove service script, this option prevents a reinstallation.
-a This option is optional. Specifies the number of attempts to stop the service.
-b This option is optional. Specifies the number of attempts to remove the service.
-r This option is optional. Removes service users.

Attention: Although this option is generally optional, be aware of the special considerations that are required for the bigsql user ID.

The bigsql user ID can be created on all nodes automatically as part of the Big SQL service installation. The Big SQL installation relies on the proper setup of root passwordless SSH. If the Big SQL service installation failed because of improper passwordless SSH setup for root, you must use the -r option to ensure a full cleanup of the bigsql user ID.
-l This option is optional. It enables secure https mode.
-c This option is optional. Runs the script as the specified user when a non-root user is needed.
3. Complete the removal process with the following steps:
   a. Edit the file /var/lib/ambari-server/resources/stacks/BigInsights/<stack_version>/repos/repoinfo.xml to remove the reference to BIGINSIGHTS-VALUEPACKS.
   b. Restart the Ambari service to make sure that the cache is cleared:

Note: Some previously launched services, such as Knox, may take several minutes to refresh. Ambari may display these services in a warning state, and attempts to manually launch these services may fail during the refresh.
      sudo ambari-server restart
   c. There are files that are left in /var/lib/ambari-server/resources/stacks/BigInsights/4.0/services/$SERVICE/package/archive.zip. These can remain and have no impact on future service additions.

Example
Service cleanup examples:
Normal removal
   Remove the Big SQL service:
   >sudo remove_value_add_services.sh -u admin -p admin -x 8081 -s BIGSQL

   Remove the BigR service:
   >sudo remove_value_add_services.sh -u admin -p admin -x 8081 -s BIGR
Removal including users
   Remove the Big SQL service including the users:
   >sudo remove_value_add_services.sh -u admin -p admin -x 8081 -s BIGSQL -r
Run as non-root user
   Remove the Big SQL service as a non-root user (with sudo privilege) biadmin:
   >sudo remove_value_add_services.sh -u admin -p admin -x 8081 -s BIGSQL -r -c biadmin
Assembly cleanup examples:
Normal removal
   Remove the Data Scientist assembly:
   >sudo remove_value_add_services_and_assembly.sh -u admin -p admin -x 8081 -s DS
Removal with repo clean
   Remove the Data Scientist assembly:
   >sudo remove_value_add_services_and_assembly.sh -u admin -p admin -x 8081 -s DS -q
Run with repo clean and non-root user
   Remove the Data Scientist assembly and run as user biadmin:
   >sudo remove_value_add_services_and_assembly.sh -u admin -p admin -x 8081 -s DS -q -c biadmin
Service and assembly cleanup examples:
Normal removal
   Remove all:
   >sudo remove_value_add_services_and_assembly.sh -u admin -p admin -x 8081 -s All

   Remove the Analyst services and assembly:
   >sudo remove_value_add_services_and_assembly.sh -u admin -p admin -x 8081 -s Analyst

Removal including users
   Remove all services and assemblies and users:
   >sudo remove_value_add_services_and_assembly.sh -u admin -p admin -x 8081 -s All -r
Run as non-root user
   Remove all and run as user biadmin:
   >sudo remove_value_add_services_and_assembly.sh -u admin -p admin -x 8081 -s All -r -c biadmin
Removal with repo clean and non-root user and removing service users
   Remove the Data Scientist service and assemblies and run as user biadmin:
   >sudo remove_value_add_services_and_assembly.sh -u admin -p admin -x 8081 -s DS -r -q -c biadmin

Chapter 5. Some new or enhanced features for 4.2

Impersonation in Big SQL

Impersonation is the ability to allow a service user to securely access data in Hadoop on behalf of another user. In Big SQL, you can enable impersonation at the global level to enable impersonation of the connected user for actions on Hadoop tables. So, while performing CREATE HADOOP TABLE or running a query or load operation, the user that connects to Big SQL will be the one that performs the action in Hive or HDFS. Any DDL operation on Hadoop tables and all schema DDL statements will be impersonated as the connected user when performed in Hive. Big SQL performs native I/O on Hadoop tables by using its own I/O engines. These I/O operations for a query, insert, or load will be performed as the connected user. So, all authorizations need to be set up in Hive and HDFS to allow the connected user to perform the operation.

Why use impersonation

By default, the bigsql user does all of the read or write operations on HDFS, Hive, and HBase that are required for the Big SQL service. When the bigsql user is the sole owner and user of the data, you do not need impersonation.

Without impersonation, you can set up the HDFS permissions and import them into Big SQL by using the IMPORT HDFS AUTHORIZATIONS option in the HCAT_SYNC_OBJECTS stored procedure. But there is still some separation of the control of access between HDFS and Big SQL.

You might need impersonation if the data that you want to analyze is produced outside of the Big SQL service, or if there is sharing of data between multiple services in your cluster.

Without impersonation, you can create a procedure to operate on tables that you own or control, or create a view to select only a few columns from the entire table. You, as the object creator, can then GRANT EXECUTE on the procedure or select on the view to other users, while not giving any permissions on the underlying table to those other users.

With impersonation, those other users must have permissions on the tables in HDFS to perform the I/O operation. So, there is potential for loss of granularity in authorization control with impersonation enabled. This is similar to Hive impersonation behavior.

Verify that Big SQL impersonation is possible

You must ensure that the bigsql user is listed in the HDFS hadoop.proxyuser configuration properties. These properties allow the bigsql user to impersonate other users. They are added, by default, during the Big SQL installation.

You can verify that these properties are configured by doing the following steps:
1. From the Ambari dashboard HDFS service, click the Configs > Advanced tab.
2. Expand the Custom core-site section to view the following properties:

hadoop.proxyuser.<bigsql user>.groups
   The value is *, or set to allow the bigsql user to appropriately impersonate the desired users.
hadoop.proxyuser.<bigsql user>.hosts
   The value should be the fully qualified host name where the Big SQL master is installed.
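One way to spot-check the effective values from a shell, assuming the default bigsql service user name, is the hdfs getconf command. This is only a sketch; it reads the client-side configuration on the node where you run it, and you can equally well check the values on the Ambari configuration pages.

   # Print the effective proxy-user settings for the bigsql user
   hdfs getconf -confKey hadoop.proxyuser.bigsql.groups
   hdfs getconf -confKey hadoop.proxyuser.bigsql.hosts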

Big SQL creates the impersonated table as the connected user in Hive. It is presumed that the connected user has the appropriate authority to create a table in the specific schema, which is governed by either the Storage Based Authorization method (the default) or the SQL Standard Based Authorization method. For more information about these authorization methods, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Authorization.

When impersonating another user, the bigsql user will use HDFS Secure Impersonation. So make sure that the bigsql user is listed in the HDFS configuration property.

How to enable Big SQL impersonation from the Installation wizard

When you install the Big SQL service from the Ambari installation wizard, you can enable Big SQL impersonation. In the wizard, select enable_impersonation, which is described in "bigsql.alltables.io.doAs." You can also select public_table_access, which issues GRANT statements as described in "bigsql.impersonation.create.table.grant.public."

How to enable Big SQL impersonation or switch impersonation on or off

Follow these steps to enable or switch Big SQL impersonation:
1. Modify the properties in bigsql-conf.xml to enable global impersonation for Hadoop tables (a sample configuration sketch follows these steps):
   bigsql.alltables.io.doAs
      The default value is false, which means that impersonation is not enabled on all table operations. If the value is true, impersonation is enabled for all table operations. This is recommended if Hive impersonation is enabled and you want to access data through Hive and Big SQL for all users and tables, and include the security controls in HDFS for all users.
   bigsql.impersonation.create.table.grant.public
      The default value is false. If set to true, any new Hadoop table is granted INSERT, SELECT, UPDATE, and DELETE to public. This is used in conjunction with global impersonation only, to allow for all I/O authorization controls to happen only in HDFS and not in Big SQL. By default, ADMIN authority is not granted to public so that security of the underlying data is maintained.
2. Update HDFS and Hive as appropriate to switch impersonation on or off:
   Switching from non-impersonation to impersonation mode
      When you switch from non-impersonation mode to impersonation mode, make sure that you update the following components according to your needs:
      HDFS With impersonation, all authorization is managed in HDFS, so

set up the appropriate access controls in HDFS. The owner of tables in Big SQL could be set up as an owner in HDFS. Any other users or groups that are granted access in Big SQL must be set up by using chmod at the group or public level or by using HDFS ACLs as applicable.
      Hive Tables that are created in Big SQL without impersonation are created and owned by the bigsql user in Hive. Make sure to grant the owner ADMIN privileges in Hive so that the owner can administer the table with impersonation. Also, any other users that are granted ADMIN-like privileges in Big SQL must be granted ADMIN privileges in Hive.
   Switching from impersonation to non-impersonation mode
      When you switch from impersonation mode to non-impersonation mode, make sure that you update the following components according to your needs:
      HDFS With impersonation, all of the table directories in HDFS can be owned by the table creator, depending on the inheritPerms settings in Hive. Without impersonation, all access will happen as the bigsql user. So, set up the appropriate access controls in HDFS so that the bigsql user can read or write all the tables.
      Hive Tables that are created in Big SQL with impersonation are created and owned by the creator of the table. Make sure to grant the bigsql user ADMIN privileges in Hive so that the bigsql user can administer the table in Hive without impersonation in Big SQL.
      Big SQL Tables that are created with impersonation while the bigsql.impersonation.create.table.grant.public property is set have INSERT, SELECT, DELETE, and UPDATE operations granted to public. After switching to non-impersonation, all the authorization happens in Big SQL only, since HDFS and Hive access would be as the bigsql user. Make sure to revoke the public grants and add any grants specific to usage based on the existing sharing requirements.

Note: If you switch impersonation off after running statements on tables with impersonation on, you must remember to update the HDFS and Hive configurations as discussed before accessing those tables again.
3. Restart Big SQL.
4. Run the HCAT_SYNC_OBJECTS procedure with the skip option at the table level:
   CALL SYSHADOOP.HCAT_SYNC_OBJECTS('USER2','T1', 'T', 'SKIP');

The following HCAT_SYNC_OBJECTS example is for all tables in a schema:
   CALL SYSHADOOP.HCAT_SYNC_OBJECTS('USER2','.*', 'T', 'SKIP');
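For reference, the following is a minimal sketch of how the two impersonation properties from step 1 might appear in bigsql-conf.xml; the values shown are illustrative, and you should keep only the settings that you actually want.

   <!-- Enable global impersonation for Hadoop tables -->
   <property>
     <name>bigsql.alltables.io.doAs</name>
     <value>true</value>
   </property>
   <!-- Optionally grant INSERT, SELECT, UPDATE, and DELETE on new Hadoop tables to PUBLIC -->
   <property>
     <name>bigsql.impersonation.create.table.grant.public</name>
     <value>true</value>
   </property>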

Authorization control with impersonation

By default, a result of the CREATE HADOOP TABLE statement is that GRANT statements are applied for a connected user in Hive. This is true only when impersonation is not enabled. If impersonation is enabled, this GRANT is not

required, since the connected user is the one that creates the table in Hive, so that user has all of the required privileges in Hive. Therefore, if impersonation is enabled, the GRANT is not applied.

Unlike Hive, Big SQL allows GRANT/REVOKE on impersonated tables. Any access by using Big SQL is checked against authorization controls in Big SQL, as well as by HDFS when the actual access happens as the connected user with impersonation. If a GRANT is done in Big SQL to grant certain privileges to other users, groups, or roles, the privileges are not replicated to HDFS or Hive. You must update these services manually to ensure proper operation with impersonation. For example, if user1 grants SELECT permission on its table T1 to user2, then user2 must also be given read (r) permission in HDFS on the table location and files, as well as execute (x) permission on the schema.db directory.
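As a sketch of that example (the user names, schema, and warehouse paths are hypothetical), the Big SQL grant might look like this:

   GRANT SELECT ON TABLE user1.t1 TO USER user2;

and the matching HDFS change, assuming HDFS ACLs are enabled (chmod or group-level changes are an alternative), might look like this:

   # Give user2 execute permission on the schema directory and read access to the table files
   hdfs dfs -setfacl -m user:user2:--x /apps/hive/warehouse/user1.db
   hdfs dfs -setfacl -R -m user:user2:r-x /apps/hive/warehouse/user1.db/t1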

Usage Notes

If global impersonation is enabled, a user <username> can run LOAD HADOOP successfully only if the following are true:
v The HDFS directory /user/<username>/ must exist with READ, WRITE, and EXECUTE permissions for the user.
v The user <username> must exist on each node of the cluster.
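A sketch of the HDFS side of that prerequisite, for a hypothetical user named user1 (the group assignment is also illustrative):

   # Create the user's HDFS home directory and give the user full access to it
   sudo -u hdfs hadoop fs -mkdir -p /user/user1
   sudo -u hdfs hadoop fs -chown user1:user1 /user/user1
   sudo -u hdfs hadoop fs -chmod 700 /user/user1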

You can set up permissions in HDFS and use HCAT_SYNC_OBJECTS to sync up GRANTs in Big SQL to create the same authorization controls in Big SQL. For HCAT_SYNC_OBJECTS to be able to sync up any tables, they must be accessible to the connected user. By default, the ability to run the HCAT_SYNC_OBJECTS procedure is available only to the bigsql user. It should be granted only to appropriate users as required.

Restrictions and notices
v Any impersonation behavior is used for Hadoop tables only.
v Any storage handlers or SERDEs that are used in a CREATE HADOOP TABLE statement also see the impersonation behavior.
v Impersonation is not used for tables that are created with the CREATE HBASE TABLE statement, even if the bigsql.alltables.io.doAs property is set to true. For HBase tables, the bigsql user creates a logical Big SQL table over an HBase table in Hive in the <schema>.db directory in the Hive warehouse. You must have appropriate permissions set up in HDFS for the <schema>.db directory to allow for this. If the containing schema is created implicitly during a CREATE HBASE TABLE statement, it will be owned by the bigsql user. Any attempt to drop the schema explicitly will be tried as the connected user, so make sure there are appropriate permissions in HDFS for the connected user to perform the drop operation.

Best practices
Enable impersonation in Hive
   If impersonation is enabled in Big SQL, all Hive metadata operations and all HDFS I/O operations are performed as the connected user. In order to ensure that the connected user has appropriate permissions on HDFS, it is recommended to enable impersonation in Hive as well, to have proper authorizations set up in HDFS for access through Hive as well as Big SQL.
The best way to set up permissions in HDFS
   Use one of the following suggestions to set up HDFS permissions:

v Create schemas before setting impersonation, and then set up permissions according to the users that might be allowed to operate in those schemas.
v Change the permissions level on the HDFS warehouse and disable the inheritPerms property in Hive.
  1. Let the umask setting of the current user account dictate the permissions of directories and files that are created by them. You can verify or modify the fs.permissions.umask-mode from the Ambari web interface by selecting the HDFS service and clicking the Configs > Advanced tabs. Then, expand the Advanced hdfs-site section. If the umask setting is updated, you must restart HDFS and possibly MapReduce and Yarn.

     Restriction: If you set the umask to 077, then when global impersonation is enabled, HADOOP tables and HBASE tables cannot be created in the same schema. This restriction is because the schema directory for an HBASE table must currently be owned by the bigsql service user. The schema directory for a HADOOP table is owned by the current user.
  2. Disable the inheritPerms property in the Hive service.
     a. Open the Hive service and click the Configs > Advanced tabs.
     b. Expand the Advanced hive-site section.
     c. Locate the hive.warehouse.subdir.inherit.perms property and set the value to false.
     d. Click Save and then restart the Hive service.
  3. Set the HDFS /apps/hive/warehouse directory to 777:
     sudo -u hdfs hadoop fs -chmod 777 /apps/hive/warehouse
v HDFS files support the sticky bit setting. You can set the sticky bit on directories, which prevents anyone except the superuser, directory owner, or file owner from deleting or moving the files within the directory. There is no effect on individual files.
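As an illustration of the sticky bit suggestion (the path shown is only an example):

   # Set the sticky bit so that only the superuser, the directory owner, or the file owner
   # can delete or move files within the directory
   sudo -u hdfs hadoop fs -chmod +t /apps/hive/warehouse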

ANALYZE command

Use the ANALYZE command to gather statistics for any Big SQL table. These statistics are used by the Big SQL optimizer to determine the most optimal access plans to efficiently process your queries. The more statistics that you collect on your tables, the better decisions the optimizer can make to provide the best possible access plans. When you run ANALYZE on a table to collect these statistics, queries against that table generally run faster.

There are two levels of statistics that you can collect:
Table level
   You can gather statistics about table-level characteristics, such as the number of records.
Column level
   You can gather statistics about your columns, such as the number of distinct values. You can also gather statistics for column groups, which is useful if columns have a relationship.
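For instance, mirroring the examples later in this section (the table and column names here are hypothetical), the first statement collects table-level statistics only, and the second adds column and column-group statistics:

   -- Table-level statistics only
   ANALYZE TABLE myschema.sales COMPUTE STATISTICS;

   -- Table-level plus column-level statistics, including the (country, city) column group
   ANALYZE TABLE myschema.sales COMPUTE STATISTICS FOR COLUMNS (country, city), order_date;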

Authorization
v CONTROL privilege on the table. The creator of the table automatically has this privilege. You can grant this privilege to users and roles, among others.
v DATAACCESS privilege on the database. The creator of the database and DBADM automatically have this privilege. You can grant DBADM access to others.

Syntax

ANALYZE TABLE table-name COMPUTE STATISTICS
   [ analyze-col | NOSCAN | PARTIALSCAN | COPYHIVE ]
   [ table-sampling ]

analyze-col:
   [ INCREMENTAL | FULL ] { FOR ALL COLUMNS [ colgroup ] | FOR COLUMNS colobj }

colgroup:
   ( cola, colb, ... ) [, ( ... ) ]...

colobj:
   { coln | colgroup } [, { coln | colgroup } ]...

table-sampling:
   TABLESAMPLE { SYSTEM | BERNOULLI } ( numeric-literal )

Description
table-name
   The name of the table that you want to analyze. You can specify any Big SQL table (including DB2 regular tables) or views.
COMPUTE STATISTICS
   Gathers statistics. You can include the following options:

analyze-col
   INCREMENTAL
      Only partitions that do not have updated statistics are scanned. The option is ignored for HBase tables, or if the table is not partitioned.
   FULL
      For a partitioned table, this value results in a full scan of the table to collect statistics. On a non-partitioned table, the scan is always a FULL scan.
NOSCAN
   When you specify the optional parameter NOSCAN, there is some performance improvement in the ANALYZE command because ANALYZE does not scan files. By using NOSCAN, ANALYZE gathers only the following statistics:
   v Number of files
   v Table size in bytes
   v Number of partitions
PARTIALSCAN
   The PARTIALSCAN option is valid for tables that use the RCFile format only. This option is not valid for HBase tables. Only the block header information of the file is accessed to get the file size in bytes, and the number of files.
COPYHIVE
   No statistics are gathered on the table. The statistics that the Hive metastore has on the table and its columns are copied to the Big SQL metastore.
FOR COLUMNS
   Table statistics and column-level statistics are gathered for the columns that you specify. You must specify at least one column, or one column group, as a parameter in FOR COLUMNS, or ANALYZE returns a syntax error.
   colobj
      You can specify columns, column groups, or both. Separate each column group or column name with a comma. Enclose column groups within parentheses. For a column group, only the number of distinct values is gathered on the grouping of columns. You can intermix an individual column and column groups. For example, ...FOR COLUMNS (c1,c3),c2,(c4,c5),c7,c8...;
FOR ALL COLUMNS
   Table statistics and column-level statistics are gathered for all columns of the table. This option is used when ANALYZE is automatically triggered by Big SQL.
   colgroup
      An optional list of column groups only can be included between the COLUMNS and TABLESAMPLE (optional) clauses. Any individual columns that you specify trigger a syntax error. Enclose column groups within parentheses. For a column group, only the number of distinct values is gathered on the grouping of columns.
TABLESAMPLE SYSTEM | BERNOULLI (numeric-literal)
   SYSTEM

      This parameter is supported for all Big SQL table types, including views, and is supported in Analyze v2 only. You use this parameter to collect statistics on a sample of HDFS splits. The term splits means the division of work that Big SQL generates to compute data in parallel, which can vary according to file type (such as text file or parquet), table type (HADOOP or HBASE), and configuration settings. These sample statistics are then used to extrapolate the statistics of the entire table. The numeric-literal is the target percentage of splits to scan during ANALYZE. Therefore, if the value of the numeric-literal is 10, it might mean that 10% of the splits are sampled. For example, if the table has data that resides on 10 splits, 1 split is used in the sample. However, if the table is small enough and resides on 2 splits, then TABLESAMPLE SYSTEM (10) scans 1 split, which is about 50% of the table. ANALYZE makes adjustments for small tables so that statistics are extrapolated correctly. A default sample size of 10% is used when ANALYZE is automatically triggered in Big SQL. This value speeds up the performance of the ANALYZE statement with little impact on query performance.
   BERNOULLI
      This parameter is supported only for statistical views. You use this parameter to collect statistics on a sample of the rows from the statistical view rather than all of the rows. The numeric-literal is the target percentage of rows to analyze. Statistical views can be very large because they can potentially describe join operations between multiple large tables. When you use this parameter, you can significantly reduce the time it takes to run ANALYZE for a statistical view. Bernoulli sampling considers each row individually, including the row with probability P/100 (where P is the value of the numeric-literal) and excluding it with probability 1-P/100. Therefore, if the value of the numeric-literal is 10, which represents a 10% sample, each row is included with a probability of 0.1, and is excluded with a probability of 0.9.

Usage notes
v Results are written to the DB2 stats catalogs and to HDFS.
v You must include at least one column name in the ANALYZE command if you specify FOR COLUMNS.
v There have been major performance and memory improvements to ANALYZE by removing all dependencies on Hive and Map/Reduce. The ANALYZE statement that does not use Map/Reduce is called Analyze v2. You can modify a property in Big SQL that disables or re-enables Analyze v2. The recommendation is always to use Analyze v2 over Analyze v1. For instructions about turning Analyze v2 on or off, see "Enabling or disabling ANALYZE v2" on page 104.
v Analyze v2 is enabled by default. When you run the ANALYZE command against a table on a set of columns, and then later run ANALYZE on a second set of columns, the statistics that are gathered from the first ANALYZE command are merged with the statistics that are gathered from the second ANALYZE command. Therefore, if you decide that you need to run ANALYZE on additional columns after the first ANALYZE command is run, then run ANALYZE on the second set of columns only. This practice speeds up the time that it takes for ANALYZE to complete. Note the difference in usage for Analyze

v2. If you use Analyze v1, then ANALYZE statements must always contain all columns that you need to collect statistics for at all times. You can modify a property in Big SQL that disables or re-enables cumulative statistics. For instructions on turning cumulative statistics on or off, see "Enabling or disabling cumulative statistics" on page 105.
v Since gathering statistics is crucial for good query performance, Analyze v2 is triggered after a successful LOAD. The ANALYZE statement runs by default on all the columns in the table using a 10% table sample. The following is an example of an ANALYZE statement that runs after a successful LOAD:
   ANALYZE TABLE schema.table COMPUTE STATISTICS FOR ALL COLUMNS TABLESAMPLE SYSTEM(10);
  You can modify a property in Big SQL that disables or enables an automatic analyze after LOAD. For instructions on enabling or disabling ANALYZE after LOAD, see Configuring for automatic analyze after LOAD.
v Since gathering statistics is crucial for good query performance, ANALYZE v2 is automatically triggered after HCAT_SYNC_OBJECTS is called to ingest data into Big SQL from Hive. The ANALYZE command runs on all the columns in the table by default using a 10% table sample. The following is an example of an ANALYZE statement that is run after a successful HCAT_SYNC_OBJECTS call:
   ANALYZE TABLE schema.table COMPUTE STATISTICS FOR ALL COLUMNS TABLESAMPLE SYSTEM(10);
  You can modify a property in Big SQL that disables or enables an automatic analyze after HCAT_SYNC_OBJECTS. For instructions on enabling or disabling ANALYZE after HCAT_SYNC_OBJECTS, see "Configuring automatic ANALYZE after HCAT_SYNC_OBJECTS" on page 106.
v When a Big SQL table increases significantly, the statistics that were previously gathered, either by a manual ANALYZE or automatic analyze after a successful LOAD or a call to HCAT_SYNC_OBJECTS, will become out of date. As a result, ANALYZE must be run for optimal query performance. Big SQL automatically checks to see whether Hadoop or HBase tables changed significantly, and if so, ANALYZE v2 is also run automatically.
v Although the memory footprint of the ANALYZE command is considerably reduced because of major revisions to ANALYZE v2, here are some ways to reduce the memory footprint of the ANALYZE command even further:
   - If you do not require distribution statistics for columns, turn them off by setting the following properties to zero (0):
     biginsights.stats.hist.num
        The number of histogram buckets.
        set hadoop property biginsights.stats.hist.num=0;
     biginsights.stats.mfv.num
        The number of most frequent values.
        set hadoop property biginsights.stats.mfv.num=0;

By setting these properties to 0, the ANALYZE command can use less memory and less storage space. With the properties set to 0, histogram and MFV statistics are not collected. However, the basic statistics like min, max, cardinality, and number of distinct values (NDV) are still collected. – Since automatic ANALYZE statements are executed against all of the columns in the table, for cases where the table has a lot of columns, the storage and memory requirements increase when distribution and MFV statistics are collected. By default, if a table has over 50 columns then any ANALYZE

statement using the FOR ALL COLUMNS clause will not collect distribution and MFV statistics. If you want to increase or decrease this value, toggle it by setting the following property:
   SET HADOOP PROPERTY biginsights.stats.wide.table.min.columns=50;
v The NOSCAN option allows ANALYZE to complete much faster. However, it can potentially result in not having the best performance enhancements that can otherwise be achieved.
v It is a good idea to gather statistics on all Big SQL tables that are used in your queries. Collect column-level statistics for all columns that your predicates reference, including join predicates. Collect column grouping statistics for all sets of columns that your predicates reference that have an inter-relationship, such as (country, city).
v When you specify the ANALYZE command to gather statistics for HBase tables, statistics are also gathered for the rowkey. These statistics are stored in the Big SQL metastore.
v You can query SYSCAT.TABLES to determine the total number of partitions, the number of files, and the total size in kilobytes of a HADOOP table, if you have run the ANALYZE command.
v Do not include ARRAY, ROW, or STRUCT data types in the FOR COLUMNS clause. You can ANALYZE a table that contains an ARRAY, ROW, or STRUCT data type, but no statistics are returned for those particular columns.
v Using the TABLESAMPLE clause in ANALYZE has no effect if the number of blocks is small, that is, for tables with fewer than 20 blocks.

Enabling or disabling ANALYZE v2

Big SQL defaults to using ANALYZE v2 instead of ANALYZE v1 starting with Big SQL 4.2, because ANALYZE v1 was very memory and CPU intensive due to the use of Hive and Map/Reduce. The recommended setting is to stay with Analyze v2. To choose a version, set the biginsights.stats.use.v2 property to true for ANALYZE v2 or false for ANALYZE v1, either as a session variable or as a system-wide property within the configurations:
Session variable
   Run the following command within the Big SQL shell or interface:
   SET HADOOP PROPERTY biginsights.stats.use.v2=true;

Setting the value to true to stay with ANALYZE v2, which is the default, is recommended.
System-wide property
   Update the configuration properties:
   1. Open the bigsql-conf.xml configuration file at $BIGSQL_HOME/conf/bigsql-conf.xml on the head node only.
   2. Add the biginsights.stats.use.v2 property with a value of true.

Setting the value to true to stay with ANALYZE v2, which is the default, is recommended.
   3. Restart the Big SQL service.
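Assuming that bigsql-conf.xml follows the usual Hadoop-style property format, the entry added in step 2 might look like the following sketch; the same layout applies to the other biginsights.stats.* properties described in the sections that follow.

   <property>
     <name>biginsights.stats.use.v2</name>
     <value>true</value>
   </property>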

Enabling or disabling cumulative statistics

The default behavior in ANALYZE v2 is to keep the previously collected statistics. To change the behavior of ANALYZE so that you can keep or remove the previously collected statistics, set the biginsights.stats.cumulative property to true or false, either as a session variable or as a system-wide property within the configurations:
Session variable
   Run the following command within the Big SQL shell or interface:
   SET HADOOP PROPERTY biginsights.stats.cumulative=true;
System-wide property
   Update the configuration properties:
   1. Open the bigsql-conf.xml configuration file at $BIGSQL_HOME/conf/bigsql-conf.xml on the head node only.
   2. Add the biginsights.stats.cumulative property with a value of true.
   3. Restart the Big SQL service.

Configuring automatic ANALYZE after LOAD

To change the behavior of the automatic analyze process that follows a LOAD statement, set the biginsights.stats.auto.analyze.post.load property to ONCE, NEVER, or ALWAYS. The default value is ONCE.
ONCE
   This is the default. ANALYZE is run after LOAD completes if an ANALYZE command has never been run for the specified table.
NEVER
   ANALYZE is never run after the LOAD completes.
ALWAYS
   ANALYZE is always run after the LOAD completes.

You can set the value as a session variable or as a system-wide property within the configurations.
Session variable
   Run the following command within the Big SQL shell or interface:
   SET HADOOP PROPERTY biginsights.stats.auto.analyze.post.load=ALWAYS;
System-wide property
   Update the configuration properties:
   1. Open the bigsql-conf.xml configuration file at $BIGSQL_HOME/conf/bigsql-conf.xml on the head node only.
   2. Add the biginsights.stats.auto.analyze.post.load property with a value of ALWAYS.
   3. Restart the Big SQL service.

Configuring automatic ANALYZE after HCAT_SYNC_OBJECTS

To change the behavior of the automatic analyze process that follows an HCAT_SYNC_OBJECTS call, set the biginsights.stats.auto.analyze.post.syncobj property to ONCE, NEVER, or COPYHIVE. The default value is ONCE.
ONCE
   This is the default. ANALYZE is run after HCAT_SYNC_OBJECTS completes if an ANALYZE command has never been run for the specified table.
NEVER
   ANALYZE is never run after HCAT_SYNC_OBJECTS completes.
COPYHIVE
   ANALYZE copies the statistics that are gathered from Hive after HCAT_SYNC_OBJECTS completes.

You can set the value as a session variable or as a system-wide property within the configurations.
Session variable
   Run the following command within the Big SQL shell or interface:
   SET HADOOP PROPERTY biginsights.stats.auto.analyze.post.syncobj=NEVER;
System-wide property
   Update the configuration properties:
   1. Open the bigsql-conf.xml configuration file at $BIGSQL_HOME/conf/bigsql-conf.xml on the head node only.
   2. Add the biginsights.stats.auto.analyze.post.syncobj property with a value of NEVER.
   3. Restart the Big SQL service.

Examples
Example 1: Analyzing a non-partitioned table.
   ANALYZE TABLE myschema.Table2 COMPUTE STATISTICS FOR COLUMNS (c1,c2),c3,c4;

This statement gathers statistics for Table2, along with column statistics for c3 and c4 and column grouping statistics for (c1,c2). When you run a query on Table2 after you use the ANALYZE command, the query generally runs faster.
Example 2: Analyze a table and specific columns and then use SYSSTAT.COLUMNS to view the statistics.
Gather statistics on table MRK_PROD_SURVEY_TARG_FACT:
   ANALYZE TABLE gosalesdw.MRK_PROD_SURVEY_TARG_FACT COMPUTE STATISTICS
   FOR COLUMNS month_key,product_key,product_survey_key,product_topic_target;

Select from the SYSSTAT.COLUMNS table to display the statistics:

SELECT cast(COLNAME as varchar(20)) AS "COL_NAME",
       COLCARD,
       cast(HIGH2KEY as varchar(100)) AS "HIGH2KEY",
       cast(LOW2KEY as varchar(100)) AS "LOW2KEY",
       NUMNULLS, AVGCOLLEN
  from SYSSTAT.COLUMNS
 WHERE TABNAME = 'MRK_PROD_SURVEY_TARG_FACT'
   and (COLNAME='PRODUCT_KEY' or COLNAME='PRODUCT_TOPIC_TARGET')
 order by COLNAME;

Part of the statistics output is shown in the following example:

+----------------------+---------+----------+---------+----------+-----------+
| COL_NAME             | COLCARD | HIGH2KEY | LOW2KEY | NUMNULLS | AVGCOLLEN |
+----------------------+---------+----------+---------+----------+-----------+
| PRODUCT_KEY          |      90 | 30132    | 30001   |        0 |         4 |
| PRODUCT_TOPIC_TARGET |       8 | 1.0      | 0.495   |        0 |         8 |
+----------------------+---------+----------+---------+----------+-----------+

Example 3: Analyzing an HBase table.
   CREATE HBASE TABLE HBTable (
      c1 int, c2 int, c3 int, c4 varchar(20), c5 varchar(40), c6 varchar(90)
   )
   COLUMN MAPPING (
      KEY MAPPED BY (c1,c2,c3),
      cf1:cq1 MAPPED BY (c4,c5) ENCODING DELIMITED FIELDS TERMINATED BY '\b',
      cf1:cq2 MAPPED BY (c6) ENCODING BINARY
   )
   DEFAULT ENCODING BINARY;

   ANALYZE TABLE HBTable COMPUTE STATISTICS FOR COLUMNS (c1,c2,c3,c4,c5,c6);

This statement gathers statistics for the table HBTable, along with column statistics for all of the columns in the table. The key is mapped to c1, c2, and c3. Columns c4 and c5 are part of column family cf1:cq1. Column c6 is part of column family cf1:cq2.
Example 4: Determining who has DATAACCESS on the database:
Issue a query on the syscat.dbauth table.
   SELECT CHAR(GRANTOR, 12)GRANTOR, CHAR(GRANTEE, 12)GRANTEE,
          DBADMAUTH, DATAACCESSAUTH
     FROM syscat.dbauth
    ORDER BY grantee, grantor;

GRANTOR      GRANTEE      DBADMAUTH DATAACCESSAUTH
------------ ------------ --------- --------------
SYSIBM       DB2INST      Y         Y
SYSIBM       PUBLIC       N         N

Example 5: Determining table-level privileges:
Issue a query on the syscat.tabauth table.
   SELECT CHAR(GRANTOR, 12)GRANTOR, CHAR(GRANTEE, 12)GRANTEE,
          CHAR(TABNAME, 15)tabname, CONTROLAUTH
     FROM syscat.tabauth
    WHERE tabname='T1';

GRANTOR      GRANTEE      TABNAME         CONTROLAUTH
------------ ------------ --------------- -----------
SYSIBM       DB2INST      T1              Y
SYSIBM       JOESHMO      T1              Y

Example 6: Find out the number of partitions, the number of files, and the size of the Hadoop table:
   SELECT NPARTITIONS, NFILES, TABLESIZE
     FROM SYSCAT.TABLES
    WHERE TABNAME='my_table';

Example 7: Use the TABLESAMPLE BERNOULLI parameter to collect statistics on your view:
Create the view:
   CREATE VIEW SS_GVIEW AS (
      SELECT t2.*, t3.*, t4.*, DATE(t2.D_DATE) AS D_D_DATE
        FROM STORE_SALES AS t1, DATE_DIM as t2, TIME_DIM as t3, STORE as t4
       WHERE t1.SS_SOLD_DATE_SK=t2.D_DATE_SK
         AND t1.SS_SOLD_TIME_SK=t3.T_TIME_SK
         AND t1.SS_STORE_SK=t4.S_STORE_SK
   );

Make the view a statistical view by enabling query optimization: ALTER VIEW SS_GVIEW ENABLE QUERY OPTIMIZATION;

Run ANALYZE on the view with a 1% Bernoulli sampling:
   ANALYZE TABLE SS_GVIEW COMPUTE STATISTICS TABLESAMPLE BERNOULLI (1);
Example 8: Use the TABLESAMPLE SYSTEM parameter to collect statistics on your Hadoop or HBase table:
   ANALYZE TABLE myschema.Table2 COMPUTE STATISTICS FOR COLUMNS (c1,c2),c3,c4 TABLESAMPLE SYSTEM (10);
Example 9: Use the cumulative statistics feature of Analyze v2 to collect statistics on an additional set of columns:
   ANALYZE TABLE myschema.Table2 COMPUTE STATISTICS FOR COLUMNS (c1,c2), c3, c4;
   ANALYZE TABLE myschema.Table2 COMPUTE STATISTICS FOR COLUMNS c5,c6;
Example 10: Collect table statistics with no columns specified:
   ANALYZE TABLE myTable COMPUTE STATISTICS TABLESAMPLE BERNOULLI (5);
   ANALYZE TABLE myTable COMPUTE STATISTICS TABLESAMPLE SYSTEM (10);
Example 11: Run ANALYZE on some column groups of the myTable table with TABLESAMPLE SYSTEM:
   ANALYZE TABLE myTable COMPUTE STATISTICS FOR COLUMNS (I,J), (K,L,M) TABLESAMPLE SYSTEM (10);

Auto-analyze

You can use the auto-analyze feature to automatically determine if a table should be analyzed. If you add new data, or if you have never run ANALYZE on a table, then auto-analyze schedules an ANALYZE command to run.

Running auto-analyze

When more than 50% of the data is new, or the table has never been analyzed, then auto-analyze detects and runs an analyze. For Big SQL HBase tables, if a major compaction has been done since the last analyze, then auto-analyze detects and runs an analyze.

108 BigInsights: IBM Open Platform with Apache Hadoop and BigInsights 4.2 Technical Preview Use the DB2 administrator task scheduler (ATS) to schedule and run tasks that execute an ANALYZE command. Analyze tasks are added to ATS when auto-analyze detects that an ANALYZE command should be run. These analyze tasks are scheduled to run once only and immediately.

There is also one task that is scheduled to run every 10 minutes indefinitely to check if there are tables that need to be analyzed. This task is added during the installation of Big SQL. If you need a different schedule, modify the default cron schedule. You can also use multiple schedules as shown in "Using multiple schedules for checking analyze."

Modifying the cron schedule for auto-analyze

The task that checks for tables to be analyzed has a default cron schedule of '0,10,20,30,40,50 * * * *'. This means that the task runs every 10 minutes indefinitely. Change the cron schedule by using the ATS procedure SYSPROC.ADMIN_TASK_UPDATE.
Example that will run once every day at midnight:

   CALL SYSPROC.ADMIN_TASK_UPDATE('BIGSQL_CHECK_ANALYZE', NULL, NULL, NULL, '0 0 * * *', NULL, NULL);
Example that will disable the task so that it does not run:
   CALL SYSPROC.ADMIN_TASK_UPDATE('BIGSQL_CHECK_ANALYZE', NULL, NULL, 0, NULL, NULL, NULL);

For more information about setting a task cron schedule, see UNIX cron format.

Using multiple schedules for checking analyze

There can be multiple tasks with different schedules for calling the BIGSQL_CHECK_ANALYZE procedure and with different maximum concurrent task settings. These tasks must be added as the bigsql administrator user only.

The following examples show how to add two BIGSQL_CHECK_ANALYZE tasks with different schedules and different maximum concurrent tasks. In these examples, one task is for daytime, and it runs from 7:00-18:59, every 15 minutes, with a maximum of 1 concurrent analyze task allowed. The other task is for nighttime, and it runs from 00:00-6:59 and 19:00-23:59, every half hour, with a maximum of 5 concurrent analyze tasks allowed.
Daytime schedule
   CALL SYSPROC.ADMIN_TASK_ADD('BIGSQL_CHECK_ANALYZE days max 1', NULL, NULL, NULL,
        '0,15,30,45 7-18 * * *', 'SYSHADOOP', 'BIGSQL_CHECK_ANALYZE', 'VALUES(1)', NULL, NULL);
Nighttime schedule
   CALL SYSPROC.ADMIN_TASK_ADD('BIGSQL_CHECK_ANALYZE nights max 5', NULL, NULL, NULL,
        '0,30 0-6,19-23 * * *', 'SYSHADOOP', 'BIGSQL_CHECK_ANALYZE', 'VALUES(5)', NULL, NULL);

Big SQL procedures for auto-analyze
SYSHADOOP.BIGSQL_RUN_ANALYZE
   This procedure is called by the analyze task to run an ANALYZE command. This procedure is an internal procedure that is intended to be used only by auto-analyze.
SYSHADOOP.BIGSQL_CHECK_ANALYZE
   This procedure is called by a scheduled task that checks if there are tables that should be analyzed. This procedure is an internal procedure that is intended to be used only by auto-analyze. This procedure has one parameter:
   SYSHADOOP.BIGSQL_CHECK_ANALYZE(maxConcurrentTasks)
   maxConcurrentTasks
      The value is an integer that is used to override the property setting of biginsights.stats.auto.analyze.concurrent.max. When the value is set to an integer larger than zero (0), it is used as the maximum number of concurrent tasks.
SYSHADOOP.BIGSQL_AUTO_ANALYZE_STATUS
   This procedure is intended to be used by the Big SQL administrator to check the status of auto-analyze tasks. Successful analyze tasks show a STATUS of COMPLETED and SQLCODE 0. Tasks that are waiting to run contain NULL values for STATUS and END_TIME. A task that is running shows a STATUS of RUNNING. If an error was encountered, the SQLCODE and MESSAGE show the reason the task failed. The procedure signature has two parameters:
   SYSHADOOP.BIGSQL_AUTO_ANALYZE_STATUS(schema, table)
   schema
      When not NULL, the results are filtered to show tables in this schema only. This parameter can be NULL to not filter the results. A wildcard % can be used.
   table
      When not NULL, the results are filtered to show tables matching this name only. This parameter can be NULL to not filter the results. A wildcard % can be used.

An example that shows all auto-analyze tasks:

CALL SYSHADOOP.BIGSQL_AUTO_ANALYZE_STATUS(NULL, NULL);

Table 27. Status

TASKNAME | TASKID | ANALYZE_INPUT | STATUS | BEGIN_TIME | END_TIME | SQLCODE | SQLSTATE | MESSAGE
Analyze 1458949355682 BIGSQL.HCOUNTRY_FILE | 102 | VALUES('BIGSQL','HCOUNTRY_FILE',10,NULL) | COMPLETED | 2016-03-25 16:50:34.112 | 2016-03-25 16:50:46.831 | 0 | | SQL0000W Statement processing was successful.
Analyze 1458912345 TEST.GENDATA_1YR | 83 | VALUES('TEST','GENDATA_1YR',10,NULL) | COMPLETED | 2016-03-24 18:09:18.384 | 2016-03-24 18:10:23.758 | 0 | | SQL0000W Statement processing was successful.
Analyze 1458912222 UIUSER.T_UIUSER | 42 | VALUES('UIUSER','T_UIUSER',10,NULL) | COMPLETED | 2016-03-17 11:17:39.281 | 2016-03-17 11:18:37.945 | 0 | | SQL0000W Statement processing was successful.
Analyze 1458987654 TEST.BAD | 84 | VALUES('TEST','BAD',10,NULL) | COMPLETED | 2016-03-24 19:04:19.045 | 2016-03-24 19:15:52.347 | -4302 | 58040 | SQL4302N Procedure or user-defined function "SYSHADOOP.BIGSQL_RUN_A", specific name "BIGSQL_RUN_ANALYZE" aborted with an exception "[BSL-0-48470644e] Error run".
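To narrow the output to a particular schema or table, pass filters instead of NULL. The following call is a small sketch based on the parameter descriptions above; the schema name and table pattern are illustrative only:

CALL SYSHADOOP.BIGSQL_AUTO_ANALYZE_STATUS('TEST', 'GENDATA%');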

You can also see the status and history of the analyze tasks by querying the ATS views SYSTOOLS.ADMIN_TASK_STATUS and SYSTOOLS.ADMIN_TASK_LIST. These views show all ATS tasks. The BIGSQL_AUTO_ANALYZE_STATUS procedure shows auto-analyze tasks only.
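For example, the following query is a minimal sketch of reading the ATS status view directly; the LIKE filter assumes that auto-analyze task names begin with 'Analyze', as in the sample output in Table 27:

SELECT NAME, STATUS, BEGIN_TIME, END_TIME, SQLCODE
FROM SYSTOOLS.ADMIN_TASK_STATUS
WHERE NAME LIKE 'Analyze%';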

For more information about viewing tasks and status, see ADMIN_TASK_LIST and ADMIN_TASK_STATUS.

Configuration properties for auto-analyze

Use these properties to set the behavior of auto-analyze.

biginsights.stats.auto.analyze.concurrent.max
The default value is 1. This property limits the maximum number of analyze tasks that can be in the task queue at a time. When the scheduler is asked for tables to be analyzed, it checks how many tasks are already in the queue (running or waiting) and limits the number returned so that the total does not exceed this maximum. Any remaining tables are returned on subsequent requests to the scheduler, which prevents too many ANALYZE commands from being started concurrently.

biginsights.stats.auto.analyze.task.retention.time
The default value is 1MONTH. This property controls the housekeeping purge of old tasks that have completed. When old tasks are deleted, their history is also deleted. The available values are 1MONTH, 1WEEK, FOREVER, or NONE. When set to FOREVER, the completed analyze tasks and history are not deleted. When set to NONE, the completed analyze tasks are deleted. When set to 1MONTH or 1WEEK, tasks older than 1 month or 1 week are deleted. Otherwise, you can use the SYSPROC.ADMIN_TASK_REMOVE procedure to manually purge old tasks periodically.

biginsights.stats.auto.analyze.newdata.min
The default is 50. This is the minimum percentage of new data that must be added to a table for auto-analyze to detect that the table needs to be analyzed. If less than this amount of data has been added to the table, auto-analyze does not run an ANALYZE command. The value must be an integer larger than 0.

Disable and enable auto-analyze

At the time of installation, auto-analyze is enabled by default with the default cron schedule. To disable auto-analyze, either modify the cron schedule of the task or remove the ATS task that checks for tables to be analyzed.

CLOUD environment: You might not be able to add or remove the tasks as the bigsql user. In that case, always modify the cron schedule of the task as the biadmin user and never remove the task. The task must run as the bigsql user.
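For example, the following call is a hedged sketch of disabling the check in such an environment without removing the task; it reuses the ADMIN_TASK_UPDATE disable form shown earlier (setting the maximum number of invocations to 0) and would be run as the biadmin user in this case:

CALL SYSPROC.ADMIN_TASK_UPDATE('BIGSQL_CHECK_ANALYZE', NULL, NULL, 0, NULL, NULL, NULL);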

Examples

Example of how to disable (run as the bigsql administrator user only):

CALL SYSPROC.ADMIN_TASK_REMOVE('BIGSQL_CHECK_ANALYZE', NULL);

Example of how to enable (run as the bigsql administrator user only). To enable auto-analyze, add the ATS task that checks for tables to be analyzed:

CALL SYSPROC.ADMIN_TASK_ADD('BIGSQL_CHECK_ANALYZE', NULL, NULL, NULL, '0,10,20,30,40,50 * * * *', 'SYSHADOOP', 'BIGSQL_CHECK_ANALYZE', 'VALUES(0)', NULL, NULL);

Example of how to manually remove old completed analyze tasks:

CALL SYSPROC.ADMIN_TASK_REMOVE('Analyze 1458912222 UIUSER.COUNTRY', NULL);

Impersonation and auto-analyze

When you enable impersonation for Big SQL, the ANALYZE command runs as the table owner. Otherwise, the ANALYZE command runs as the bigsql administration user. The ATS tasks are always run as the bigsql administration user.

HCAT_SYNC_OBJECTS stored procedure

The HCAT_SYNC_OBJECTS stored procedure imports the definition of the Hive objects into the local database catalogs, and can also assign ownership to a user. This action makes the objects available for use within queries.

Syntax

HCAT_SYNC_OBJECTS( schema, object-name [, object-types [, exists-action [, error-action [, options ]]]] )

exists-action: 'SKIP' | 'REPLACE' | 'ERROR' | 'MODIFY'

error-action: 'STOP' | 'CONTINUE'

options: a comma-separated list of 'IMPORT HDFS AUTHORIZATIONS' and 'TRANSFER OWNERSHIP TO username'

Authorization

Only the bigsql user or a user with Big SQL administrative privileges can run this Hadoop procedure. However, the bigsql user can grant execute permission on the HCAT_SYNC_OBJECTS procedure to any user, group, or role.

Description

schema
The name of the schema that contains objects to be imported. You can use regular expressions to match multiple schemas. For schemas that you previously defined in the Hive catalogs by using the CREATE TABLE (HADOOP) statement, the schema name is matched on the name as it was originally specified when the schema was created. For schemas that you created outside of a CREATE TABLE (HADOOP) statement, the schema name is matched on the actual name as defined in the Hive metastore.

object-name
The name of the object to be imported from within the schema. You can use regular expressions to match multiple objects. For objects that you previously defined in the Hive catalogs by using the CREATE TABLE (HADOOP) statement, the object name is matched on the name as it was originally specified when the object was created. For objects that you created outside of a CREATE TABLE (HADOOP) statement, the object name is matched on the actual name as defined in the Hive metastore.

object-types
A string of characters that indicates the types of objects to be imported. The following list contains the valid types:
t   Indicates that table objects are to be imported without associated constraints.
T   Indicates that table objects, and all the associated constraints, are to be imported.
v   Indicates that view objects are to be imported.
a   Indicates that all supported objects are to be imported. This value is equivalent to specifying the string Tv.
If the object-types argument is not specified, a is the default.

exists-action
Indicates the action that the process takes if an object that is being imported already exists within the Big SQL catalogs. The following actions are valid:

SKIP
Indicates that objects that are already defined should be skipped. This value is the default action.

REPLACE
Indicates that the existing objects should be removed from the Big SQL catalogs and replaced with the definition that is stored in the Hive metastore. When an object is replaced, only the metadata that resides in the Big SQL catalogs is lost. This metadata includes permissions on the object and referential constraints (if constraints are not being imported from the Hive catalogs). The REPLACE process consists of dropping the original object and re-creating the object as defined by the Hive metastore.

Important: All objects that are created by the HCAT_SYNC_OBJECTS procedure are owned by the user that runs the procedure.

ERROR
Indicates that the presence of an existing object should be considered an error. If error-action is CONTINUE, this action is the equivalent of specifying SKIP.

MODIFY
Indicates that the procedure attempts to modify the existing objects without removing them from the Big SQL catalog. There are two types of actions that can be performed: changing a column type, and dropping or appending columns. The actions are implemented as ALTER HADOOP TABLE statements. If an object cannot be modified in terms of ALTER TABLE, the procedure returns an ERROR, and the user is expected to update it in a different way, such as by using the REPLACE action. When MODIFY changes a column data type, there are some limitations that can result in an ERROR:
- Big SQL does not support the change from the original type to the new type.
- The type in the Hive metastore is not a valid Big SQL type.
For more details about valid Big SQL types, see ALTER HADOOP TABLE, Data types that are supported by Big SQL, Data types migrated from Hive applications, and Understanding data types.

Note: Several Hive types are mapped into the same Big SQL type, and in these cases no alteration of the object is needed; the result is a SKIP. Big SQL uses Hive comments to store information about the data type. As a consequence, modifying the Hive type of a column with comments is unsafe and leads to undefined behavior. As a general rule, if the type of a column is changed in Hive, the existing comment should be dropped. For complex type columns, it is not possible to change the column type, so an ERROR is returned and the user is expected to REPLACE the table. There are limitations on how columns can be removed or inserted in the existing table. Changing a column name is not supported for safety reasons. A table can be safely modified if columns are either removed or appended at the end of the object definition. The following table summarizes the most common cases:

Columns in Big SQL catalog | Columns in Hive metastore | Notes
A B C D | A C D | Column B is removed by DROP COLUMN.
A B C D | A B C D E | Column E is appended by ADD COLUMN.
A B C D | A B C E | The alteration is performed by dropping column D and appending column E. This is not a renaming action, which is not supported. It is possible to drop the old column and add the new one because the column is at the end of the object definition, which makes it an append.
A B C D | A X B C D | Column X is inserted inside the object definition. Because this operation is not supported by MODIFY, the procedure returns an ERROR. The solution is to use the REPLACE action.
A B C D | A X C D | Column B is apparently renamed to X. Because renaming is not supported, and columns cannot be added inside an object definition, the procedure returns an ERROR. The solution is to use the REPLACE action.
A B C D | B A C D | Columns A and B are swapped, so the procedure returns an ERROR. The solution is to use the REPLACE action.

error-action
Specifies the action that the process should take if there is an error during the import. The value is a string that contains one of the following actions:

STOP
Stops processing immediately and reports the nature of the error. All import activity is rolled back.

CONTINUE
Reports the error in the final results, but continues the import process.

'options'
A string that contains a comma-delimited list of options:

IMPORT HDFS AUTHORIZATIONS
If specified, the HDFS authorizations on the tables are imported automatically. GRANT statements are issued to the owner, group, and other roles based on the read/write permissions in the HDFS directory for the specified tables. This option is applicable to table objects only. If you also specify exists-action=SKIP, tables that already exist are not created again, but the HDFS authorizations are imported. If you also specify exists-action=REPLACE, the tables are replaced and the HDFS authorizations are imported. For example, assume that the permissions on the file location are -rwxr-xr-- hdfs biadmin. The following GRANT statements are automatically issued when you specify the IMPORT HDFS AUTHORIZATIONS option (where <table> stands for the table that is being synchronized):

GRANT SELECT ON <table> TO USER hdfs;
GRANT UPDATE ON <table> TO USER hdfs;
GRANT DELETE ON <table> TO USER hdfs;
GRANT INSERT ON <table> TO USER hdfs;
GRANT SELECT ON <table> TO GROUP biadmin;
GRANT SELECT ON <table> TO PUBLIC;

TRANSFER OWNERSHIP TO username
Indicates that the ownership of the table is transferred to the value in username. That value can be HIVEOWNER, which transfers ownership to the original Hive owner. If you omit the TRANSFER OWNERSHIP TO clause, no transfer action is taken.

Usage notes

The procedure returns the following results:

Table 28. Results of HCAT_SYNC_OBJECTS

Column | Type | Description
OBJSCHEMA | VARCHAR(128) | The name of the schema from which an object is attempted to be imported.
OBJNAME | VARCHAR(128) | The name of the object that is attempted to be imported.
OBJATTRIB | VARCHAR(128) | For constraints, this column indicates the name of a constraint that was imported. It is NULL for all other object types.
TYPE | VARCHAR(1) | The type, which is designated by one of the following characters: T (table), V (view), C (constraint).
STATUS | VARCHAR(10) | The status can be one of the following results: OK (the object was imported), REPLACE (the object was imported and replaced an existing object), SKIP (the object was skipped because there was an existing object), SKIP_WARN (the object was skipped because of a non-error condition), ERROR (the object could not be imported), WARN_ANALYZE (the table was imported into Big SQL successfully, but the ANALYZE failed).
DETAILS | VARCHAR(4000) | Contains more details about the status. If there are no more details, the value is NULL.

After you import a table into Big SQL, you can control whether an ANALYZE command runs. To change the behavior of ANALYZE after an HCAT_SYNC_OBJECTS statement is run, set the biginsights.stats.auto.analyze.post.syncobj property to ONCE, NEVER, or COPYHIVE. The default value is ONCE.

ONCE
This is the default. ANALYZE is run after HCAT_SYNC_OBJECTS completes if an ANALYZE command has never been run for the specified table.

NEVER
ANALYZE is never run after HCAT_SYNC_OBJECTS completes.

COPYHIVE
ANALYZE copies the statistics that are gathered from Hive after HCAT_SYNC_OBJECTS completes.

You can set the value as a session variable or as a system-wide configuration property.

Session variable
Run the following command within the Big SQL shell or interface:

SET HADOOP PROPERTY biginsights.stats.auto.analyze.post.syncobj=NEVER;

System-wide property
Update the configuration properties:
1. Open the bigsql-conf.xml configuration file at $BIGSQL_HOME/conf/bigsql-conf.xml on the head node only.
2. Add the following property to the file:

   <property>
     <name>biginsights.stats.auto.analyze.post.syncobj</name>
     <value>NEVER</value>
   </property>

3. Restart the Big SQL service.

The HCAT_SYNC_OBJECTS routine uses the maximum default STRING length that is defined in the bigsql.string.size property when it is set. To ensure that you do not exceed any row limits defined by your database manager, set the bigsql.string.size property to a value smaller than the current default of VARCHAR(32672) before you run the HCAT_SYNC_OBJECTS routine. The HCAT_SYNC_OBJECTS routine can estimate the string size to best fit all the columns within the row limit. The following statements are an example of setting the bigsql.string.size property before you run the routine:

SET HADOOP PROPERTY bigsql.string.size=4096;
CALL SYSHADOOP.HCAT_SYNC_OBJECTS ...

For more information about the STRING data type, see Data types that are supported by Big SQL.

When you import tables with their constraints by using the T or a object-types, the matching tables are imported first, and then a second pass attempts to import all constraints that are associated with those tables. A constraint that cannot be imported because of a missing reference produces a SKIP_WARN message to indicate that it was skipped. A constraint that cannot be imported for any other reason is considered an ERROR.
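For example, the following call is a minimal sketch of importing all tables in a schema together with their constraints; the EXAMPLES schema and the regular expression are illustrative only:

CALL SYSHADOOP.HCAT_SYNC_OBJECTS('EXAMPLES', '.*', 'T', 'SKIP', 'CONTINUE');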

Views in the Hive catalogs are generally defined in HiveQL or by Big SQL 1.0. Views are imported only if they meet the following criteria:
- The SQL in the view is fully supported by Big SQL.
- All objects that the view references exist in Big SQL.
Any view that does not meet these criteria is an ERROR.

A successfully imported view might not have the same behavior as the original view. For example, if the view contains a Hive function that also exists in Big SQL with a different behavior, the view might not be usable.

Examples

1. Import all objects within a schema. In the following example, the schema name is EXAMPLES. The statement requests that all objects within that schema be imported. If an object exists, the current object is replaced with the definition in the Hive metastore. If there is an error, the error is reported and the import continues.

CALL SYSHADOOP.HCAT_SYNC_OBJECTS('EXAMPLES', '.*', 'a', 'REPLACE', 'CONTINUE');

The following output shows that there were two objects in the schema that the import tried to process:

+-----------+------------+-----------+------+-----------+--------------------+
| OBJSCHEMA | OBJNAME    | OBJATTRIB | TYPE | STATUS    | DETAILS            |
+-----------+------------+-----------+------+-----------+--------------------+
| EXAMPLES  | HIVE_TABLE | [NULL]    | T    | SKIP_WARN | Column "C1", type  |
|           |            |           |      |           | "decfloat" is not  |
|           |            |           |      |           | supported          |
+-----------+------------+-----------+------+-----------+--------------------+
| EXAMPLES  | My Table   | [NULL]    | T    | OK        | [NULL]             |
+-----------+------------+-----------+------+-----------+--------------------+

2. Import objects and assign ownership to a user called user1, which also grants access on the table:

CALL SYSHADOOP.HCAT_SYNC_OBJECTS('EXAMPLES', '.*', 'a', 'REPLACE', 'CONTINUE', 'TRANSFER OWNERSHIP TO user1');
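3. Import only table objects, leave existing definitions in place, and import the HDFS permissions. This is a sketch that combines the SKIP action with the IMPORT HDFS AUTHORIZATIONS option described earlier; the schema name is illustrative:

CALL SYSHADOOP.HCAT_SYNC_OBJECTS('EXAMPLES', '.*', 't', 'SKIP', 'CONTINUE', 'IMPORT HDFS AUTHORIZATIONS');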

Big SQL integration with Apache Spark

As of BigInsights 4.2, Big SQL is tightly integrated with Spark. This integration enables the development of hybrid applications where Spark jobs can be executed from Big SQL SELECT statements and the results efficiently streamed in parallel from Spark to Big SQL.

Big SQL applications can treat Spark as a powerful analytic co-processor that complements the rich SQL functionality that is available in Big SQL. Big SQL users, who already enjoy the best SQL performance, richest SQL language, and extensive enterprise capabilities, can also leverage Spark for its non-relational distributed processing capabilities and rich analytic libraries. The capabilities of both engines are seamlessly blended in a single SQL statement, with large volumes of data flowing efficiently between them.

A built-in table function can make Spark functionality directly available at the SQL language level, and a Big SQL SELECT statement can invoke Spark jobs directly and process the results of those jobs as though they were tables. For example, you can use the “EXECSPARK table function” to invoke Spark jobs from Big SQL.

Polymorphic table functions

When compiling a query, the SQL compiler needs to know the names and data types of columns in the tables that are specified in the FROM clause of the query. In some cases, the schema of the result that is generated by SYSHADOOP.EXECSPARK is not fixed, because it depends on the value of an optional argument to the table function. But the SQL compiler needs to know the schema before that value is processed. Polymorphism is a convenient feature that enables dynamic interaction with an external entity (in this case, Spark) to produce data whose schema is not known up front and that might depend on input arguments.

When dealing with a polymorphic table function (PTF), the SQL engine invokes the function twice: first to inquire about the output schema so that the query can be compiled and an execution plan can be created, and then at run time. A polymorphic table function is not a single piece of executable code. Each invocation can call a different method with the same input arguments and different results (for example, one invocation returns a schema, and the other returns the data). Such methods can be referred to as describe methods and execute methods, depending on their purpose. Big SQL defines a Java interface called SparkPtf. Classes that are used to specify Spark jobs in EXECSPARK, such as ReadJsonFile, must implement this interface. The main methods of the SparkPtf interface are the describe and execute methods. The describe method returns a Spark StructType that specifies the schema of the result, and the execute method returns the actual result, which is always a Spark data frame, and which is transparently mapped to a Big SQL result set.

When a Big SQL query contains an invocation to EXECSPARK, the Big SQL engine offloads the PTF execution to a “slave” Spark application known as the Big SQL Spark gateway. This gateway is a long-running Spark application that is fully controlled by Big SQL. It can be started and stopped by bigsql (that is, the user who normally starts and stops the Big SQL service on a cluster), by using the same script that is used to manage other services.

Building a Spark PTF

Writing a polymorphic table function involves writing a Java or Scala class that implements the SparkPtf interface. For more information, see “EXECSPARK table function.”

Compiling a PTF

The classpath that is used to compile the class for a PTF must include the spark-assembly.jar file (from Spark), which contains all of the Spark libraries, and the bigsql-spark.jar file (from Big SQL), which includes the definition of the SparkPtf interface. For example, if the ReadJsonFile class is in a text file named ReadJsonFile.scala, you can compile it from the command line by using the following commands:

$ scalac -cp /usr/iop/current/spark-client/lib/spark-assembly.jar:\
/usr/ibmpacks/current/bigsql/bigsql/lib/java/bigsql-spark.jar ReadJsonFile.scala
$ jar cvf examples.jar com

It is assumed that the shell from which these commands are invoked is running on a cluster node that contains Big SQL and Spark. To compile on a different machine, the two JAR files must be copied over and the paths in the command must be updated accordingly. The JAR file that contains the PTF must be copied to the same location on all the nodes in the cluster. A convenient location might be /usr/ibmpacks/current/bigsql/bigsql/userlib.

JAR files that contain PTFs must be added to the classpath of the Spark gateway’s driver and executors (properties spark.driver.extraClassPath and spark.executor.extraClassPath in $BIGSQL_HOME/conf/bigsql-spark.conf). You must then restart the Spark gateway so that changes to bigsql-spark.conf can be picked up.

Reading Big SQL tables inside a PTF

A PTF can contain arbitrary Spark code, including Spark SQL statements. In a BigInsights cluster, Big SQL and Spark share the Hive metastore, and the Spark gateway runs under user bigsql. This makes it possible to use Spark SQL statements inside PTFs to query Big SQL tables.

Configuring the Big SQL Spark gateway

The Spark gateway must be configured before it is used for the first time. Configuration involves specifying the values of various Spark settings in the $BIGSQL_HOME/conf/bigsql-spark.conf file, which does not exist when Big SQL is installed. Instead, a template for this file exists under $BIGSQL_HOME/conf/templates/bigsql-spark.conf. You can copy this template to the $BIGSQL_HOME/conf directory and then update it.

The key configuration parameters to set are the number of executors, the amount of memory for the driver and executors, the number of cores for executors, and the location of the JAR files for PTFs and libraries that are used in the PTFs. The template contains default values for these properties, which are set to fairly low values and should be tuned to match available resources and the kind of Spark jobs that will be executed as PTFs. A reasonable value for the spark.executor.instances property is the number of Big SQL workers. The Spark documentation contains guidelines for the configuration and tuning of other properties. The spark.driver.extraClassPath and spark.executor.extraClassPath properties in the template contain a JAR file named ptf-examples.jar, which contains the ReadJsonFile class. The properties that are specified in bigsql-spark.conf must be appropriate for running a Spark application in YARN client mode, because that is how the Big SQL Spark gateway runs. You can customize the log configuration, in particular the log level for the gateway, by modifying $BIGSQL_HOME/conf/log4j-spark.properties.

Managing the Big SQL Spark gateway

After the Spark gateway has been configured, you can start it by using the following command: $BIGSQL_HOME/bin/bigsql start -spark. The start command launches the Spark gateway in YARN client mode and executes a very simple PTF to ensure that the gateway is working correctly.

To stop the Spark gateway, the bigsql user must run the following command: $BIGSQL_HOME/bin/bigsql stop -spark. The Spark gateway does not start when -all is specified in the bigsql start command, because (depending on the configuration) the gateway can consume a nontrivial amount of memory even when it is idle.

In cases where the Spark gateway becomes unresponsive to the bigsql stop command, you can use the forcestop command: $BIGSQL_HOME/bin/bigsql forcestop -spark. You can check the status of the Spark gateway by running the status command: $BIGSQL_HOME/bin/bigsql status -spark. If the gateway is running, this command also returns the status and location of the YARN containers for the executors.

Security considerations

The Big SQL Spark gateway runs under the same user ID as other Big SQL engine processes: bigsql. One implication of this is that PTFs have unrestricted access to all Big SQL tables, and for this reason, only bigsql has EXECUTE privilege on SYSHADOOP.EXECSPARK. This does not mean, however, that only the bigsql user can use PTFs. User bigsql could, in principle, grant EXECUTE privilege on SYSHADOOP.EXECSPARK to other users, but this is not a recommended practice on production systems, because it virtually disables all data access control.

You can control the use of SYSHADOOP.EXECSPARK by using SQL stored procedures to wrap specific invocations of SYSHADOOP.EXECSPARK. User bigsql can grant EXECUTE privilege on specific stored procedures to different users. The following script (executed by user bigsql) shows how bigsql can grant access to the ReadJsonFile PTF to user joe:

call sysproc.set_routine_opts('DYNAMICRULES BIND') %

create or replace procedure utils.read_json(in path varchar(1024))
begin
  declare stmt varchar(2048); -- buffer that holds the dynamic SQL text
  declare st statement;
  declare result cursor with return for st;
  set stmt = 'select * from table(SYSHADOOP.EXECSPARK(' ||
             'language => ''java'', ' ||
             'class => ''com.ibm.biginsights.bigsql.examples.ReadJsonFile'', ' ||
             'path => ''' || path || ''')) as res';
  prepare st from stmt;
  open result;
end %

grant execute on procedure utils.read_json to user joe %

Note that because the procedure is compiled with the DYNAMICRULES BIND option, authorization checking of any dynamic SQL statement that is issued from within the procedure will be done against user bigsql (the owner of the procedure). The invocation of SYSHADOOP.EXECSPARK is done by using dynamic SQL. Currently, this function cannot be invoked statically, because the schema of the result cannot be determined at compile time. The procedure returns the result of the PTF as a result set.

After being granted access, user joe can invoke the procedure by using a CALL statement, such as in the following example:

call utils.read_json('hdfs://host.port.com:8020/user/bigsql/demo.json')

EXECSPARK table function

By using the SYSHADOOP.EXECSPARK table function, you can invoke Apache Spark jobs from Big SQL.

Syntax

EXECSPARK( 'language', 'class' [, arg, ...] )

Description

language
Specifies the language in which the Spark job is written. Valid values are 'scala' and 'java'.

class
Specifies the fully qualified name of the main class (for example, 'com.ibm.biginsights.bigsql.examples.ReadJsonFile').

arg
Specifies one or more optional arguments. These values are passed to the Spark job.

Usage

The class that is specified in the class argument must implement the interface com.ibm.biginsights.bigsql.spark.preview.SparkPtf:

StructType describe(org.apache.spark.sql.SQLContext ctx, java.util.Map arguments);
DataFrame execute(org.apache.spark.sql.SQLContext ctx, java.util.Map arguments);
void destroy(org.apache.spark.sql.SQLContext ctx, java.util.Map arguments);
long cardinality(org.apache.spark.sql.SQLContext ctx, java.util.Map arguments);

All the methods in the SparkPtf interface have the same two parameters: an SQLContext object and a map that contains the arguments provided in the invocation. The SQLContext object that is passed to all the methods is an instance of Spark’s HiveContext. It therefore can be used to query tables that are registered with the Hive metastore.

The arguments object is a map that contains all the arguments to SYSHADOOP.EXECSPARK, except language and class. If the arguments are explicitly named by using the arrow syntax, the map key will be the specified name in uppercase characters; otherwise, the key will be the ordinal position (1, 2, and so on).

The PTF class is instantiated twice, once at query compilation time and once at query execution time. The destroy method is called when the engine no longer needs the PTF instance, and provides the user code an opportunity to perform resource cleanup. The cardinality method is called at compilation time so that the PTF can provide an estimate of the number of rows to be returned by the execute method. This information can help the Big SQL optimizer make better query planning decisions.

In the following example, the keys in the argument map are INTARG, DECIMALARG, STRINGARG, and 6.

SELECT *
FROM TABLE(SYSHADOOP.EXECSPARK(
  language => 'java',
  class => 'com.samples.MyPtf',
  intarg => CAST(111 AS INT),
  decimalarg => 22.2,
  stringarg => 'hello',
  CAST(33.3 AS DOUBLE)
)) AS j

The last argument gets a key value of 6 because it is the sixth argument, and SQL indexing starts at 1. The values of the arguments to SYSHADOOP.EXECSPARK must be constants. The SQL type of a constant is inferred by the compiler according to standard SQL rules, and you can use the CAST expression to cast a literal to a different type than the default. In this example, 33.3 would be interpreted as a decimal value by default, but the CAST expression turns it into a double value. Example

Use SYSHADOOP.EXECSPARK to invoke a Spark job that reads a JSON file stored on the HDFS. For example:

SELECT *
FROM TABLE(SYSHADOOP.EXECSPARK(
  language => 'scala',
  class => 'com.ibm.biginsights.bigsql.examples.ReadJsonFile',
  uri => 'hdfs://host.port.com:8020/user/bigsql/demo.json'
)) AS doc
WHERE doc.country IS NOT NULL
LIMIT 5

The following example shows a random record from this JSON file: {"Country":null, "Direction":"UP", "Language":"English"}

The output might look like the following text:

COUNTRY DIRECTION LANGUAGE
------- --------- --------
DE      UP        German
RU      DOWN      Russian
US      UP        English
AU      -         English
US      UP        English

5 record(s) selected

In this example, each argument is explicitly named by using the arrow syntax to improve readability.

The first two arguments, language and class, are mandatory because they specify the executable code for the Spark job; language specifies that the job is a Scala program, and class specifies the fully qualified name of the main class (com.ibm.biginsights.bigsql.examples.ReadJsonFile). Any other argument (in this case, only uri) is passed to the Spark job itself. In this example, when the Spark job instantiates class ReadJsonFile, the instance receives the specified URI as an input argument.

Chapter 6. Known problems

At the time of publication of this technical preview, the following issues are known.

Known problems are documented in the form of individual technotes, techdocs, or APARs in the Support knowledge base for BigInsights at https://www.ibm.com/support/entry/portal/Overview/Software/Information_Management/InfoSphere_BigInsights. As problems are discovered and resolved, the IBM Support team updates the knowledge base. By searching the knowledge base, you can quickly find workarounds or solutions to problems.

The following link starts customized queries of the live Support knowledge base: http://www.ibm.com/support/search.wss?tc=SSPT3X&rank=8&sort=desc&atrn1=SWVersion&atrv1=4.2.0

Ranger
Ranger is not supported for the BigInsights value-add services.

BigInsights - Big SQL
You might encounter an SQL5105N error when you drop a table that is defined in the HDFS encryption zone (a secure directory). This error occurs because a trash interval property is enabled by default, and dropping a table moves the contents to a trash folder in your home directory, which is a non-encryption zone; this move is not allowed. Currently, you must drop such tables from Hive by using the PURGE clause.

Ambari heat maps and metrics with SSL enabled
There are some issues with HDFS and YARN metrics and heat maps after you enable SSL for Hadoop and then restart services. No data is available for several of the metric widgets in HDFS and YARN, or the heat map for YARN does not open and the HDFS heat map appears gray for most of the metrics. This issue might be related to JIRA AMBARI-14680.

Titan and Solr integration does not work when you add Titan into a Kerberos-enabled cluster
After you upgrade a Kerberos-enabled cluster from IBM Open Platform with Apache Hadoop 4.1 to 4.2, and you add Titan into the cluster, it fails when you build an index. You see an error from the server: Can not find the specified config set: titan. Do the following workaround after you add Titan into a Kerberos-enabled cluster:
1. Create the Solr configuration directory and change its permissions, owner, and group:
   $ mkdir /usr/iop/current/solr-server/server/solr/configsets/titan
   $ chmod 775 /usr/iop/current/solr-server/server/solr/configsets/titan
   $ chown solr:hadoop /usr/iop/current/solr-server/server/solr/configsets/titan
2. Switch to the hdfs user to get the configuration directory and JAR:
   $ su hdfs
3. Get the Titan configuration files and JAR from HDFS on each of the Solr server nodes:
   $ hadoop fs -get /apps/titan/conf /usr/iop/current/solr-server/server/solr/configsets/titan
   $ hadoop fs -get /apps/titan/jts-1.13.jar /usr/iop/current/solr-server/server/solr/configsets/titan

4. Revert to the original admin user:
   $ exit
5. Move the JAR that you got from HDFS to the Solr server lib directory on each of the Solr server nodes:
   $ mv /usr/iop/current/solr-server/server/solr/configsets/titan/jts-1.13.jar /usr/iop/current/solr-server/server/solr-webapp/webapp/WEB-INF/lib
6. Restart Solr from the Ambari dashboard.
7. Create a collection for Titan and Solr integration on either of the Solr nodes:
   $ sudo su -c "SOLR_INCLUDE=/etc/solr/conf/solr.in.sh /usr/iop/4.2.0.0/solr/bin/solr create -c titan -s 2 -rf 1 -d titan" - solr

Note: The Titan referenced here is the Titan configuration setting index.search.solr.configset. The value titan is the default value for the configuration property in IBM Open Platform with Apache Hadoop. If you change that value, also modify the commands referenced here.

YARN does not start on RHEL when Cgroups is enabled (default mode)
If YARN does not start on RHEL, make the following changes:
1. Disable CPU Scheduling and CPU Isolation by selecting the YARN service and clicking the Configs tab. Then open the Settings page. You will find both the CPU Scheduling toggle switch and the CPU Isolation toggle switch in the Node section.
2. Change the scheduler class by opening the Configs tab and then opening the Advanced page. Expand the Isolation section. Change yarn.nodemanager.container-executor.class to org.apache.hadoop.yarn.server.nodemanager.util.DefaultLCEResourcesHandler.
