IBM InfoSphere BigInsights Version 2.1: Installation Guide Chapter 1

IBM InfoSphere BigInsights Version 2.1 Installation Guide GC19-4100-00

Note
Before using this information and the product that it supports, read the information in “Notices and trademarks” on page 89.

© Copyright IBM Corporation 2013.
US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents

Chapter 1. Introduction to InfoSphere BigInsights
  InfoSphere BigInsights features and architecture
  Hadoop Distributed File System (HDFS)
  IBM General Parallel File System
  Adaptive MapReduce
  Hadoop MapReduce
  Additional Hadoop technologies
  Text Analytics
  IBM Big SQL
  InfoSphere BigInsights Console
  InfoSphere BigInsights Tools for Eclipse
  Integration with other IBM products

Chapter 2. Planning to install InfoSphere BigInsights
  Reviewing system requirements and release notes
  Directories created when installing InfoSphere BigInsights
  InfoSphere BigInsights installed components
  InfoSphere BigInsights installation worksheet
  Planning for high availability
  GPFS installation paths
  Security architecture
  Choosing user security and authentication
  Mapping users and groups to roles

Chapter 3. Preparing to install InfoSphere BigInsights
  Choosing a user to install the product with
  Configuring your browser
  Obtaining InfoSphere BigInsights software
  Preparing to run the installation program
  Configuring LDAP authentication
  Selecting prerequisite checker options
  Preparing to install GPFS
  Discovering devices with uncommon names
  Enabling adaptive MapReduce
  Creating a private SSL certificate for a secure InfoSphere BigInsights Console
  Certificate authority sample

Chapter 4. Installing InfoSphere BigInsights software
  Installing GPFS by using InfoSphere BigInsights scripts
  Node descriptor file
  Stanza file
  bi_gpfs.cfg configuration file
  Installing GPFS by using administration commands
  Installing InfoSphere BigInsights by using the wizard
  Installing InfoSphere BigInsights by using a response file
  Installing the InfoSphere BigInsights Tools for Eclipse
  Configuring access to the default task controller
  Installing and configuring a Linux Task Controller

Chapter 5. Upgrading InfoSphere BigInsights software
  Preparing to upgrade software
  Upgrading InfoSphere BigInsights
  Upgrading the InfoSphere BigInsights Tools for Eclipse
  Migrating from HDFS to GPFS

Chapter 6. Removing InfoSphere BigInsights software
  Removing InfoSphere BigInsights by using scripts
  Removing InfoSphere BigInsights manually
  Removing the InfoSphere BigInsights Tools for Eclipse

Chapter 7. Installation problems and workarounds
  Installation program hangs and progress does not update
  Cannot install the Linux Expect package
  Installing optional components
  Incorrect hostname information for monitoring adaptor
  Installation failure due to insufficient prerequisites
  Hadoop data nodes are in uncertain status
  Local names do not match the managed nodes
  NameNode in safe mode causes errors
  Incorrect HBase Sudo policy
  Administrative user is not listed in AllowUsers property
  Disk discovery fails due to node passwordless SSH errors
  HBase status shows as “Unavailable” during installation
  A previous GPFS installation failed
  Linux Standard Base package is not installed
  Linux system does not have prerequisite kernel or C++ packages
  Stale mounts cause installation errors
  Unable to load one or more GPFS kernel extensions
  Installing GPFS by using the mmcrfs command fails
  Cluster status displays as “Running”, even when the file system is down
  Applications hang when running as a non-administrator user
  Users cannot log in to the InfoSphere BigInsights Console when using LDAP authentication

Notices and trademarks
Providing comments on the documentation

Chapter 1. Introduction to InfoSphere BigInsights

InfoSphere® BigInsights™ is a software platform for discovering, analyzing, and visualizing data from disparate sources. You use this software to help process and analyze the volume, variety, and velocity of data that continually enters your organization every day.

InfoSphere BigInsights helps your organization to understand and analyze massive volumes of unstructured information as easily as smaller volumes of information. The flexible platform is built on an Apache Hadoop open source framework that runs in parallel on commonly available, low-cost hardware. You can easily scale the platform to analyze hundreds of terabytes, petabytes, or more of raw data that is derived from various sources. As information grows, you add more hardware to support the influx of data.

InfoSphere BigInsights helps application developers, data scientists, and administrators in your organization quickly build and deploy custom analytics to capture insight from data. This data is often integrated into existing databases, data warehouses, and business intelligence infrastructure. By using InfoSphere BigInsights, users can extract new insights from this data to enhance knowledge of your business.

InfoSphere BigInsights incorporates tooling for numerous users, speeding time to value and simplifying development and maintenance:
- Software developers can use the Eclipse-based plug-in to develop custom text analytic functions to analyze loosely structured or largely unstructured text data.
- Administrators can use the web-based management console to inspect the status of the software environment, review log records, assess the overall health of the system, and more.
- Data scientists and business analysts can use the data analysis tool to explore and work with unstructured data in a familiar spreadsheet-like environment.

InfoSphere BigInsights features and architecture

InfoSphere BigInsights provides distinct capabilities for discovering and analyzing business insights that are hidden in large volumes of data. These technologies and features combine to help your organization manage data from the moment that it enters your enterprise.

[Figure: IBM Big Data Platform. Analytic applications (business intelligence, exploration and visualization, functional applications, industry applications, predictive analysis, content analysis) sit on top of the platform components: visualization and discovery, applications development, systems management, accelerators, Hadoop system, stream computing, data warehouse, and information integration and governance, together with cloud, mobile, and security.]

By combining these technologies, InfoSphere BigInsights extends the Hadoop open source framework with enterprise-grade security, governance, availability, integration into existing data stores, tools that simplify developer productivity, and more.

Hadoop is a computing environment built on top of a distributed, clustered file system that is designed specifically for large-scale data operations. Hadoop is designed to scan through large data sets to produce its results through a highly scalable, distributed batch processing system. Hadoop comprises two main components: a file system, known as the Hadoop Distributed File System (HDFS), and a programming paradigm, known as Hadoop MapReduce. To develop applications for Hadoop and interact with HDFS, you use additional technologies and programming languages such as Pig, Hive, Jaql, Flume, and many others.

Apache Hadoop helps enterprises harness data that was previously difficult to manage and analyze. InfoSphere BigInsights features Hadoop and its related technologies as a core component.

Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) allows applications to run across multiple servers. HDFS is highly fault tolerant, runs on low-cost hardware, and provides high-throughput access to data.

Data in a Hadoop cluster is broken into smaller pieces called blocks, and then distributed throughout the cluster. Blocks, and copies of blocks, are stored on other servers in the Hadoop cluster.
That is, an individual file is stored as smaller blocks that are replicated across multiple servers in the cluster.

Each HDFS cluster has a number of DataNodes, with one DataNode for each node in the cluster. DataNodes manage the storage that is attached to the nodes on which they run. When a file is split into blocks, the blocks are stored in a set of DataNodes that are spread throughout the cluster. DataNodes are responsible for serving read and write requests from the clients on the file system, and also handle block creation, deletion, and replication.

Also on each HDFS cluster is a single NameNode, which is a primary server that regulates access to files by clients, and tracks all data files in HDFS. The NameNode determines the mapping of blocks to DataNodes, and handles operations
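
The way that HDFS splits a file into blocks and keeps copies on several DataNodes can be illustrated with a small model. The following Python sketch is not part of InfoSphere BigInsights or Hadoop and does not call any Hadoop API. The function names, the 4-byte block size, the DataNode names, and the round-robin placement are all assumptions made for illustration; the real placement policy in HDFS is more elaborate.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut a file's bytes into fixed-size blocks; the last block may be shorter."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, datanodes: List[str],
                 replication: int) -> Dict[int, List[str]]:
    """Toy NameNode: map each block index to `replication` distinct DataNodes.

    Round-robin placement is an assumption of this sketch, not the policy
    that HDFS actually applies.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    return {
        block_id: [datanodes[(block_id + r) % len(datanodes)]
                   for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_after_failure(block_map: Dict[int, List[str]],
                           failed_node: str) -> bool:
    """A file stays readable if every block has a replica on a surviving node."""
    return all(any(node != failed_node for node in nodes)
               for nodes in block_map.values())


# Hypothetical 10-byte file, 4-byte blocks, four DataNodes, three replicas.
blocks = split_into_blocks(b"x" * 10, block_size=4)
block_map = place_blocks(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)
for block_id, nodes in block_map.items():
    print(block_id, len(blocks[block_id]), nodes)
```

In this model, the dictionary that `place_blocks` returns plays the role of the block-to-DataNode mapping that the NameNode maintains, and `readable_after_failure` shows why replication makes the file system fault tolerant: losing any single DataNode still leaves a copy of every block.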
