IBM Open Platform with Apache Hadoop and BigInsights 4.2 Technical Preview
Version 4 Release 2

Edition notice - early release documentation

This document contains proprietary information. All information contained herein shall be kept in confidence. None of this information shall be divulged to persons other than (a) IBM employees authorized by the nature of their duties to receive such information, or (b) individuals with a need to know in organizations authorized by IBM to receive this document in accordance with the terms (including confidentiality) of an agreement under which it is provided.

This information might include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements or changes in the product or the programs described in this publication at any time without notice.

© Copyright IBM Corporation 2013, 2016.
US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents

Chapter 1. Introduction to 4.2
  Introduction

Chapter 2. What's New in 4.2
  What's new for Version 4.2
  Open source technologies

Chapter 3. Installing IBM Open Platform with Apache Hadoop
  Get ready to install
  Preparing your environment
  Configuring LDAP server authentication on Red Hat Enterprise Linux 6.7 and 7.2
  Creating a mirror repository for the IBM Open Platform with Apache Hadoop software
  Running the installation package
  Upgrading the Java (JDK) version
  Installing and configuring Ranger in the Ambari web interface
  Configuring MySQL for Ranger
  Installing Ranger plugins
  Set up user sync from LDAP/AD/Unix to Ranger
  Installing Ranger authentication
  Ranger KMS set up and usage
  Cleaning up nodes before reinstalling software
  HostCleanup.ini file
  HostCleanup_Custom_Actions.ini file

Chapter 4. Installing the IBM BigInsights value-added services on IBM Open Platform with Apache Hadoop
  Users, groups, and ports for BigInsights value-add services
  Preparing to install the BigInsights value-add services
  Obtaining the BigInsights value-add services
  Installing the BigInsights value-add packages
  Installing BigInsights Home
  Installing the BigInsights - Big SQL service
  Installing the Text Analytics service
  Enabling Knox for value-add services
  Removing BigInsights value-add services

Chapter 5. Some new or enhanced features for 4.2
  Impersonation in Big SQL
  ANALYZE command
  Auto-analyze
  HCAT_SYNC_OBJECTS stored procedure
  Big SQL integration with Apache Spark
  EXECSPARK table function

Chapter 6. Known problems

Index

Chapter 1. Introduction to 4.2

Introduction

Welcome to the Technical Preview of the IBM® Open Platform with Apache Hadoop and IBM BigInsights 4.2. This README contains information to ensure the successful installation and operation of the IOP and the BigInsights value-add services.

The information contained in this Technical Preview documentation might not completely describe the functionality that is available in the 4.2 release. The information represents a snapshot of the full 4.2 release. It describes how to install the product and some of the highlights of the 4.2 release.
Because the product documentation is still being refined, you might find links that are not valid. Contact your IBM representative for help in those cases.

Description

IBM Open Platform with Apache Hadoop and IBM BigInsights Version 4.2 deliver enterprise Hadoop capabilities with easy-to-use analytic tools and visualization for business analysts and data scientists, rich developer tools, powerful analytic functions, complete administration and management capabilities, and the latest versions of Apache Hadoop and associated projects. This 4.2 release provides full-function SQL query capability, with security and performance benefits, for data that is stored in Hadoop.

Obtaining the Technical Preview for 4.2

TECHNICAL PREVIEW DOWNLOAD ONLY

Accept the IBM BigInsights Early Release license agreement:

http://www14.software.ibm.com/cgi-bin/weblap/lap.pl?popup=Y&li_formnum=L-MLOY-9YB5S9&accepted_url=http://ibm-open-platform.ibm.com/repos/beta/4.2.0.0/&title=IBM+BigInsights+Beta+License&declined_url=http://www-01.ibm.com/software/data/infosphere/hadoop/trials.html

Then select the appropriate repository file for your environment:

RHEL6
  https://ibm-open-platform.ibm.com/repos/beta/4.2.0.0/rhel6/
  Use the following TAR files:
    BIPremium-4.2.0.0-beta1.el6.x86_64.tar.gz
    ambari-2.2.0.0-beta1.el6.x86_64.tar.gz
    iop-4.2.0.0-beta1-el6.x86_64.tar.gz
    iop-utils-4.2.0.0-beta1.el6.x86_64.tar.gz

RHEL7
  https://ibm-open-platform.ibm.com/repos/beta/4.2.0.0/rhel7/
  Use the following TAR files:
    BIPremium-4.2.0.0-beta1.el7.x86_64.tar.gz
    ambari-2.2.0.0-beta1.el7.x86_64.tar.gz
    iop-4.2.0.0-beta1-el7.x86_64.tar.gz
    iop-utils-4.2.0.0-beta1.el7.x86_64.tar.gz

Chapter 2. What's New in 4.2

What's new for Version 4.2

New features for Version 4.2

Note: There is no UPGRADE path to or from the IBM Open Platform with Apache Hadoop and BigInsights Version 4.2 technical preview.
Major milestones

- ODPi compliant.
- Express upgrade is available. You can quickly upgrade the entire cluster while it is shut down.
- Apache Spark ecosystem.
- Apache Hadoop ecosystem.

Operating Systems

Refer to the System Requirements for the most up-to-date information on operating system support:

- RHEL 6.7+
- RHEL 7.2

Open Source

The following open source technologies are now supported:

- Ranger 0.5.2
- Phoenix 4.6.1
- Titan 1.0.0 (Titan server and OLAP are not integrated in IBM Open Platform with Apache Hadoop 4.2)

The following open source technologies are updated:

- Ambari 2.2.0
- Flume 1.6.0
- Hadoop 2.7.2
- HBase 1.2.0
- Kafka 0.9.0.1
- Knox 0.7.0
- Slider 0.90.2
- Solr 5.5
- Spark 1.6.1

BigInsights Big SQL updates

1. BigInsights - Big SQL is now packaged as part of the IBM BigInsights Premium package.

2. Big SQL and Spark integration is now available as a technical preview. You can invoke Spark jobs from Big SQL by using a table UDF abstraction. The following example calls the SYSHADOOP.EXECSPARK user-defined function to kick off a Spark job that reads a JSON file stored on HDFS:

       SELECT *
       FROM TABLE(SYSHADOOP.EXECSPARK(
              language => 'scala',
              class => 'com.ibm.biginsights.bigsql.examples.ReadJsonFile',
              uri => 'hdfs://host.port.com:8020/user/bigsql/demo.json',
              card => 100000)) AS doc, products
       WHERE doc.country IS NOT NULL
         AND doc.language = products.language;

3. Support for update and delete on Big SQL HBase tables.

4. An impersonation feature, which allows a service user to securely access data in Hadoop on behalf of another user.

5. An auto-analyze feature that runs the ANALYZE command automatically under certain conditions. In addition, the ANALYZE command now has a FOR ALL COLUMNS clause.

6. ANALYZE command improvements, some of which are listed in the following table:

   Table 1. Big SQL ANALYZE improvements

   Enhancement | Description
   Analyze v2 | There are major performance and memory improvements due to the removal of all dependencies on Hive and Map/Reduce. The ANALYZE command with no Map/Reduce dependency is called Analyze v2, which is the default for BigInsights 4.2. You can use Analyze v1 by setting the biginsights.stats.use.v2 property to false. However, Analyze v1 is deprecated and will be removed in future releases of Big SQL.
   Cumulative statistics | When you run the ANALYZE command against a table on a set of columns, and then later run ANALYZE on a second set of columns, the statistics that are gathered from the first ANALYZE command are merged with the statistics that are gathered from the second ANALYZE command.
   SYSTEM sampling | Instead of scanning an entire table, you can specify a percentage of the splits that ANALYZE can run against. Big SQL extrapolates the statistics for the whole table based on the sample of the table that it gathered statistics on. The SYSTEM option can reduce the time to run the ANALYZE command with minor impact on query performance.
   FOR ALL COLUMNS | By using this option, you can collect statistics on all of the columns of the table.

7. Some maintenance improvements:

   Table 2. Big SQL maintenance enhancements

   Enhancement | Description
   Automatic analyze | By default, ANALYZE is run automatically after a successful LOAD or HCAT_SYNC_OBJECTS call, which automatically gathers statistics on the table to improve query performance. Also, Big SQL determines whether a table has changed significantly and automatically schedules an ANALYZE, if necessary.
   Automatic HCAT_SYNC_OBJECTS | By default, Big SQL automatically synchronizes the Big SQL and Hive catalogs so that when data is added to Hive it can be accessed automatically by Big SQL. In addition, table metadata is preserved through the support of ALTER column. You can customize Big SQL metadata synchronization with Hive by using configuration options.

8. Some performance improvements:

   Table 3. Big SQL performance improvements

   Enhancement | Description
   Concurrency improvements | Improved performance for high-concurrency Hadoop, HBase, and hybrid workloads, which allows for greater throughput and improved CPU utilization by default on Big SQL clusters.
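
As an illustration of the ANALYZE options summarized in Table 1, the following statements sketch how the FOR ALL COLUMNS clause, SYSTEM sampling, and cumulative statistics might be exercised. The table bigsql.sales and its columns are hypothetical, the sampling percentage is arbitrary, and the exact clause syntax is an assumption that should be confirmed against the ANALYZE command reference for your release.

```sql
-- Hypothetical table and column names, used for illustration only.

-- FOR ALL COLUMNS: collect statistics on every column of the table.
ANALYZE TABLE bigsql.sales COMPUTE STATISTICS FOR ALL COLUMNS;

-- SYSTEM sampling: run against roughly 10 percent of the splits and let
-- Big SQL extrapolate the statistics for the whole table.
ANALYZE TABLE bigsql.sales COMPUTE STATISTICS FOR ALL COLUMNS
  TABLESAMPLE SYSTEM (10);

-- Cumulative statistics: the second statement gathers statistics on a new
-- set of columns, which are merged with those gathered by the first.
ANALYZE TABLE bigsql.sales COMPUTE STATISTICS FOR COLUMNS order_id, region;
ANALYZE TABLE bigsql.sales COMPUTE STATISTICS FOR COLUMNS amount;
```

Sampling shortens the ANALYZE run at the cost of extrapolated rather than exact statistics, so it is likely to be most useful on large tables.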