Hortonworks Data Platform Operations (May 17, 2018)


Hortonworks Data Platform: Apache Ambari Operations

Copyright © 2012-2018 Hortonworks, Inc. Some rights reserved.

The Hortonworks Data Platform, powered by Apache Hadoop, is a massively scalable and 100% open source platform for storing, processing and analyzing large volumes of data. It is designed to deal with data from many sources and formats in a very quick, easy and cost-effective manner. The Hortonworks Data Platform consists of the essential set of Apache Hadoop projects including MapReduce, Hadoop Distributed File System (HDFS), HCatalog, Pig, Hive, HBase, ZooKeeper and Ambari. Hortonworks is the major contributor of code and patches to many of these projects. These projects have been integrated and tested as part of the Hortonworks Data Platform release process, and installation and configuration tools have also been included.

Unlike other providers of platforms built using Apache Hadoop, Hortonworks contributes 100% of our code back to the Apache Software Foundation. The Hortonworks Data Platform is Apache-licensed and completely open source. We sell only expert technical support, training and partner-enablement services. All of our technology is, and will remain free and open source.

Please visit the Hortonworks Data Platform page for more information on Hortonworks technology. For more information on Hortonworks services, please visit either the Support or Training page. Feel free to Contact Us directly to discuss your specific needs.

Except where otherwise noted, this document is licensed under the Creative Commons Attribution-ShareAlike 4.0 License: http://creativecommons.org/licenses/by-sa/4.0/legalcode


Table of Contents

1. Ambari Operations: Overview
   1.1. Ambari Architecture
   1.2. Accessing Ambari Web
2. Understanding the Cluster Dashboard
   2.1. Viewing the Cluster Dashboard
      2.1.1. Scanning Operating Status
      2.1.2. Viewing Details from a Metrics Widget
      2.1.3. Linking to Service UIs
      2.1.4. Viewing Cluster-Wide Metrics
   2.2. Modifying the Cluster Dashboard
      2.2.1. Replace a Removed Widget to the Dashboard
      2.2.2. Reset the Dashboard
      2.2.3. Customizing Metrics Display
   2.3. Viewing Cluster Heatmaps
3. Managing Hosts
   3.1. Understanding Host Status
   3.2. Searching the Hosts Page
   3.3. Performing Host-Level Actions
   3.4. Managing Components on a Host
   3.5. Decommissioning a Master or Slave
      3.5.1. Decommission a Component
   3.6. Delete a Component
   3.7. Deleting a Host from a Cluster
   3.8. Setting Maintenance Mode
      3.8.1. Set Maintenance Mode for a Service
      3.8.2. Set Maintenance Mode for a Host
      3.8.3. When to Set Maintenance Mode
   3.9. Add Hosts to a Cluster
   3.10. Establishing Rack Awareness
      3.10.1. Set the Rack ID Using Ambari
      3.10.2. Set the Rack ID Using a Custom Topology Script
4. Managing Services
   4.1. Starting and Stopping All Services
   4.2. Displaying Service Operating Summary
      4.2.1. Alerts and Health Checks
      4.2.2. Modifying the Service Dashboard
   4.3. Adding a Service
   4.4. Performing Service Actions
   4.5. Rolling Restarts
      4.5.1. Setting Rolling Restart Parameters
      4.5.2. Aborting a Rolling Restart
   4.6. Monitoring Background Operations
   4.7. Removing A Service
   4.8. Operations Audit
   4.9. Using Quick Links
   4.10. Refreshing YARN Capacity Scheduler
   4.11. Managing HDFS
      4.11.1. Rebalancing HDFS
      4.11.2. Tuning Garbage Collection
      4.11.3. Customizing the HDFS Home Directory
   4.12. Managing Atlas in a Storm Environment
   4.13. Enabling the Oozie UI
5. Managing Service High Availability
   5.1. NameNode High Availability
      5.1.1. Configuring NameNode High Availability
      5.1.2. Rolling Back NameNode HA
      5.1.3. Managing Journal Nodes
   5.2. ResourceManager High Availability
      5.2.1. Configure ResourceManager High Availability
      5.2.2. Disable ResourceManager High Availability
   5.3. HBase High Availability
   5.4. Hive High Availability
      5.4.1. Adding a Hive Metastore Component
      5.4.2. Adding a HiveServer2 Component
      5.4.3. Adding a WebHCat Server
   5.5. Storm High Availability
      5.5.1. Adding a Nimbus Component
   5.6. Oozie High Availability
      5.6.1. Adding an Oozie Server Component
   5.7. Apache Atlas High Availability
   5.8. Enabling Ranger Admin High Availability
6. Managing Configurations
   6.1. Changing Configuration Settings
      6.1.1. Adjust Smart Config Settings
      6.1.2. Edit Specific Properties
      6.1.3. Review and Confirm Configuration Changes
      6.1.4. Restart Components
   6.2. Manage Host Config Groups
   6.3. Configuring Log Settings
   6.4. Set Service Configuration Versions
      6.4.1. Basic Concepts
      6.4.2. Terminology
      6.4.3. Saving a Change
      6.4.4. Viewing History
      6.4.5. Comparing Versions
      6.4.6. Reverting a Change
      6.4.7. Host Config Groups
   6.5. Download Client Configuration Files
7. Administering the Cluster
   7.1. Using Stack and Versions Information
   7.2. Viewing Service Accounts
   7.3. Enabling Kerberos and Regenerating Keytabs
      7.3.1. Regenerate Keytabs
      7.3.2. Disable Kerberos
   7.4. Enable Service Auto-Start
8. Managing Alerts and Notifications
   8.1. Understanding Alerts
      8.1.1. Alert Types
   8.2. Modifying Alerts
   8.3. Modifying Alert Check Counts
   8.4. Disabling and Re-enabling Alerts
   8.5. Tables of Predefined Alerts
      8.5.1. HDFS Service Alerts
      8.5.2. HDFS HA Alerts
      8.5.3. NameNode HA Alerts
      8.5.4. YARN Alerts
      8.5.5. MapReduce2 Alerts
      8.5.6. HBase Service Alerts
      8.5.7. Hive Alerts
      8.5.8. Oozie Alerts
      8.5.9. ZooKeeper Alerts
      8.5.10. Ambari Alerts
      8.5.11. Ambari Metrics Alerts
      8.5.12. SmartSense Alerts
   8.6. Managing Notifications
   8.7. Creating and Editing Notifications
   8.8. Creating or Editing Alert Groups
   8.9. Dispatching Notifications
   8.10. Viewing the Alert Status Log
      8.10.1. Customizing Notification Templates
9. Using Ambari Core Services
   9.1. Understanding Ambari Metrics
      9.1.1. AMS Architecture
      9.1.2. Using Grafana
      9.1.3. Grafana Dashboards Reference
      9.1.4. AMS Performance Tuning
      9.1.5. AMS High Availability
      9.1.6. AMS Security
   9.2. Ambari Log Search (Technical Preview)
      9.2.1. Log Search Architecture
      9.2.2. Installing Log Search
      9.2.3. Using Log Search
   9.3. Ambari Infra
      9.3.1. Archiving & Purging Data
      9.3.2. Performance Tuning for Ambari Infra


1. Ambari Operations: Overview

Hadoop is a large-scale, distributed data storage and processing infrastructure using clusters of commodity hosts networked together. Monitoring and managing such complex distributed systems is not simple. To help you manage the complexity, Apache Ambari collects a wide range of information from the cluster's nodes and services and presents it to you in an easy-to-use, centralized interface: Ambari Web.

Ambari Web displays information such as service-specific summaries, graphs, and alerts. You use Ambari Web to create and manage your HDP cluster and to perform basic operational tasks, such as starting and stopping services, adding hosts to your cluster, and updating service configurations. You also can use Ambari Web to perform administrative tasks for your cluster, such as enabling Kerberos security and performing Stack upgrades. Any user can view Ambari Web features. Users with administrator-level roles can access more options than operator-level or view-only users can. For example, an Ambari administrator can manage cluster security, an operator user can monitor the cluster, but a view-only user can only access features to which an administrator grants required permissions.

More Information

Hortonworks Data Platform Apache Ambari Administration

Hortonworks Data Platform Apache Ambari Upgrade

1.1. Ambari Architecture

The Ambari Server collects data from across your cluster. Each host has a copy of the Ambari Agent, which allows the Ambari Server to control each host.

The following graphic is a simplified representation of Ambari architecture:


Ambari Web is a client-side JavaScript application that calls the Ambari REST API (accessible from the Ambari Server) to access cluster information and perform cluster operations. After authenticating to Ambari Web, the application authenticates to the Ambari Server. Communication between the browser and server occurs asynchronously using the REST API.

The Ambari Web UI periodically accesses the Ambari REST API, which resets the session timeout. Therefore, by default, Ambari Web sessions do not time out automatically. You can configure Ambari to time out after a period of inactivity.
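The inactivity timeout is controlled through ambari.properties on the Ambari Server host. The following is a minimal sketch only; the property name and file location follow the Ambari Web Inactivity Timeout reference linked below, and you should verify them for your Ambari version:

# Sketch: set a 10-minute (600-second) inactivity timeout for Ambari Web
# sessions, then restart Ambari Server so the change takes effect.
echo "user.inactivity.timeout.default=600" >> /etc/ambari-server/conf/ambari.properties
ambari-server restart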

More Information

Ambari Web Inactivity Timeout

1.2. Accessing Ambari Web

To access Ambari Web:

Steps

1. Open a supported browser.

2. Enter the Ambari Web URL:

http://<your.ambari.server>:8080

The Ambari Web login page displays in your browser.


3. Enter your user name and password.

If you are an Ambari administrator accessing the Ambari Web UI for the first time, use the default Ambari administrator account, admin/admin.

4. Click Sign In.

If Ambari Server is stopped, you can restart it from a command line on the Ambari Server host machine:

ambari-server start

Typically, you start the Ambari Server and Ambari Web as part of the installation process.
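The ambari-server command also supports other common lifecycle operations on the Ambari Server host:

# Check whether the Ambari Server process is running
ambari-server status

# Stop and restart the Ambari Server
ambari-server stop
ambari-server restart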

Ambari administrators access the Ambari Admin page from the Manage Ambari option in Ambari Web:


More Information

Ambari Administration Overview

Hortonworks Data Platform Apache Ambari Installation


2. Understanding the Cluster Dashboard

You monitor your Hadoop cluster using the Ambari Web Cluster dashboard. You access the Cluster dashboard by clicking Dashboard at the top of the Ambari Web UI main window:

More Information

• Viewing the Cluster Dashboard [5]

• Modifying the Cluster Dashboard [9]

• Viewing Cluster Heatmaps [11]

2.1. Viewing the Cluster Dashboard

Ambari Web UI displays the Dashboard page as the home page. Use Dashboard to view the operating status of your cluster.

The left side of Ambari Web displays the list of Hadoop services currently running in your cluster. Dashboard includes Metrics, Heatmaps, and Config History tabs; by default, the Metrics tab is displayed. On the Metrics page, multiple widgets represent the operating status of services in your HDP cluster. Most widgets display a single metric: for example, HDFS Disk Usage, represented by a load chart and a percentage figure:

Metrics Widgets and Descriptions


HDFS metrics

HDFS Disk Usage: The percentage of distributed file system (DFS) space used, which is a combination of DFS and non-DFS used

Data Nodes Live: The number of DataNodes operating, as reported from the NameNode

NameNode Heap: The percentage of NameNode Java Virtual Machine (JVM) heap memory used

NameNode RPC: The average RPC queue latency

NameNode CPU WIO: The percentage of CPU wait I/O

NameNode Uptime: The NameNode uptime calculation

YARN metrics (HDP 2.1 or later stacks)

ResourceManager Heap: The percentage of ResourceManager JVM heap memory used

ResourceManager Uptime: The ResourceManager uptime calculation

NodeManagers Live: The number of NodeManagers operating, as reported from the ResourceManager

YARN Memory: The percentage of available YARN memory (used versus total available)

HBase metrics

HBase Master Heap: The percentage of HBase Master JVM heap memory used

HBase Ave Load: The average load on the HBase server

HBase Master Uptime: The HBase Master uptime calculation

Region in Transition: The number of HBase regions in transition

Storm metrics (HDP 2.1 or later stacks)

Supervisors Live: The number of supervisors operating, as reported by the Nimbus server

More Information

Modifying the Service Dashboard [30]

Scanning Operating Status [6]

2.1.1. Scanning Operating Status

The service summary list on the left side of Ambari Web lists all of the Apache component services that are currently monitored. The icon shape, color, and action to the left of each item indicates the operating status of that item:


Status Indicators

solid green: All masters are running.
blinking green: Starting up
solid red: At least one master is down.
blinking red: Stopping

Click a service name to open the Services page, on which you can see more detailed information about that service.

2.1.2. Viewing Details from a Metrics Widget

To see more detailed information about a service, hover your cursor over a Metrics widget:

• To remove a widget, click the white X.

• To edit the display of information in a widget, click the edit (pencil) icon.

More Information

Customizing Metrics Display [11]

2.1.3. Linking to Service UIs

The HDFS Links and HBase Links widgets list HDP components for which links to more metrics information, such as thread stacks, logs, and native component UIs, are available. For example, you can link to NameNode, Secondary NameNode, and DataNode components for HDFS by using the links shown in the following example:


Choose the More drop-down to select from the list of links available for each service. The Ambari Dashboard includes additional links to metrics for the following services:

HDFS

NameNode UI: Links to the NameNode UI

NameNode Logs: Links to the NameNode logs

NameNode JMX: Links to the NameNode JMX servlet

Thread Stacks: Links to the NameNode thread stack traces

HBase

HBase Master UI: Links to the HBase Master UI

HBase Logs: Links to the HBase logs

ZooKeeper Info: Links to ZooKeeper information

HBase Master JMX: Links to the HBase Master JMX servlet

Debug Dump: Links to debug information

Thread Stacks: Links to the HBase Master thread stack traces

2.1.4. Viewing Cluster-Wide Metrics

From the Metrics tab, you can also view the following cluster-wide metrics:

These metrics widgets show the following information:

Memory usage: Cluster-wide memory used, including memory that is cached, swapped, used, and shared

Network usage: The cluster-wide network utilization, including in and out

CPU Usage: Cluster-wide CPU information, including system, user, and wait I/O

Cluster Load: Cluster-wide load information, including total number of nodes, total number of CPUs, number of running processes, and 1-minute load

You can customize this display as follows:


• To remove a widget, click the white X.

• To magnify the chart or itemize the widget display, hover your cursor over the widget.

• To remove or add metrics, select the item on the widget legend.

• To see a larger view of the chart, select the magnifying glass icon.

Ambari displays a larger version of the widget in a separate window:

You can use the larger view in the same ways that you use the dashboard.

To close the larger view, click OK.

2.2. Modifying the Cluster Dashboard

You can modify the content of the Ambari Cluster dashboard in the following ways:

• Replace a Removed Widget to the Dashboard [10]


• Reset the Dashboard [10]

• Customizing Metrics Display [11]

2.2.1. Replace a Removed Widget to the Dashboard

To replace a widget that has been removed from the dashboard:

Steps

1. Select Metric Actions:

2. Click Add.

3. Select a metric, such as Region in Transition.

4. Click Apply.

2.2.2. Reset the Dashboard

To reset all widgets on the dashboard to display default settings:

Steps

1. Click Metric Actions:

2. Click Edit.


3. Click Reset all widgets to default.

2.2.3. Customizing Metrics Display

Although not all widgets can be edited, you can customize the way that some of them display metrics by using the Edit (pencil) icon, if one is displayed.

Steps

1. Hover your cursor over a widget.

2. Click Edit.

The Customize Widget window appears:

3. Follow the instructions in Customize Widget to customize widget appearance.

In this example, you can adjust the thresholds at which the HDFS Capacity bar chart changes color, from green to orange to red.

4. To save your changes and close the editor, click Apply.

5. To close the editor without saving any changes, choose Cancel.

2.3. Viewing Cluster Heatmaps

As described earlier, the Ambari web interface home page is divided into a status summary panel on the left, and Metrics, Heatmaps, and Config History tabs at the top, with the Metrics page displayed by default. When you want to view a graphical representation of your overall cluster utilization, clicking Heatmaps provides you with that information, using simple color coding known as a heatmap:

A colored block represents each host in your cluster. You can see more information about a specific host by hovering over its block, which causes a separate window to display metrics about HDP components installed on that host.


Colors displayed in the block represent usage in a unit appropriate for the selected set of metrics. If any data necessary to determine usage is not available, the block displays Invalid data. You can solve this issue by changing the default maximum values for the heatmap, using the Select Metric menu:

Heatmaps supports the following metrics:

Host/Disk Space Used %: disk.disk_free and disk.disk_total

Host/Memory Used %: memory.mem_free and memory.mem_total

Host/CPU Wait I/O %: cpu.cpu_wio

HDFS/Bytes Read: dfs.datanode.bytes_read

HDFS/Bytes Written: dfs.datanode.bytes_written

HDFS/Garbage Collection Time: jvm.gcTimeMillis

HDFS/JVM Heap Memory Used: jvm.memHeapUsedM

YARN/Garbage Collection Time: jvm.gcTimeMillis

YARN/JVM Heap Memory Used: jvm.memHeapUsedM

YARN/Memory used %: UsedMemoryMB and AvailableMemoryMB

HBase/RegionServer read request count: hbase.regionserver.readRequestsCount

HBase/RegionServer write request count: hbase.regionserver.writeRequestsCount

HBase/RegionServer compaction queue size: hbase.regionserver.compactionQueueSize

HBase/RegionServer regions: hbase.regionserver.regions

HBase/RegionServer memstore sizes: hbase.regionserver.memstoreSizeMB


3. Managing Hosts

As a Cluster administrator or Cluster operator, you need to know the operating status of each host. You also need to know which hosts have issues that require action. You can use the Ambari Web Hosts page to manage multiple Hortonworks Data Platform (HDP) components, such as DataNodes, NameNodes, NodeManagers, and RegionServers, running on hosts throughout your cluster. For example, you can restart all DataNode components, optionally controlling that task with rolling restarts. Ambari Hosts enables you to filter your selection of host components to manage, based on operating status, host health, and defined host groupings.
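Because Ambari Web itself uses the Ambari REST API (see Ambari Architecture), you can also retrieve host information from a script. The following is a sketch only, assuming the default admin account, the default port, and the example cluster name MyCluster used elsewhere in this guide:

# List all hosts known to this Ambari instance
curl -u admin:admin http://<your.ambari.server>:8080/api/v1/hosts

# Show the components installed on one host of cluster "MyCluster"
curl -u admin:admin http://<your.ambari.server>:8080/api/v1/clusters/MyCluster/hosts/c6403.ambari.apache.org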

The Hosts tab enables you to perform the following tasks:

• Understanding Host Status [13]

• Searching the Hosts Page [14]

• Performing Host-Level Actions [17]

• Managing Components on a Host [18]

• Decommissioning a Master or Slave [19]

• Delete a Component [20]

• Deleting a Host from a Cluster [21]

• Setting Maintenance Mode [21]

• Add Hosts to a Cluster [24]

• Establishing Rack Awareness [25]

3.1. Understanding Host Status

You can view the individual hosts in your cluster on the Ambari Web Hosts page. The hosts are listed by fully qualified domain name (FQDN) and accompanied by a colored icon that indicates the host's operating status:

Red Triangle: At least one master component on that host is down. You can hover your cursor over the host name to see a tooltip that lists affected components.

Orange: At least one slave component on that host is down. Hover to see a tooltip that lists affected components.

Yellow: Ambari Server has not received a heartbeat from that host for more than 3 minutes.

Green: Normal running state.

Maintenance Mode: A black "medical bag" icon indicates a host in maintenance mode.

Alert: A red square with a white number indicates the number of alerts generated on a host.

A red icon overrides an orange icon, which overrides a yellow icon. In other words, a host that has a master component down is accompanied by a red icon, even though it might have slave component or connection issues as well. Hosts that are in maintenance mode or are experiencing alerts are accompanied by an icon to the right of the host name.

The following example Hosts page shows three hosts, one having a master component down, one having a slave component down, one running normally, and two with alerts:

More Information

Maintenance Mode

Alerts

3.2. Searching the Hosts Page

You can search the full list of hosts, filtering your search by host name, component attribute, and component operating status. You can also search by keyword, simply by typing a word in the search box.

The Hosts search tool appears above the list of hosts:

Steps

1. Click the search box.

Available search types appear, including:

Search by Host Attribute: Search by host name, IP address, host status, and other attributes.

Search by Service: Find hosts that are hosting a component from a given service.

Search by Component: Find hosts that are hosting a component in a given state, such as started, stopped, maintenance mode, and so on.

Search by keyword: Type any word that describes what you are looking for in the search box. This becomes a text filter.

2. Click a Search type.

A list of available options appears, depending on your selection in step 1.

For example, if you click Service, current services appear:


3. Click an option (in this example, the YARN service).

The hosts that match your current search criteria are listed on the Hosts page.

4. Click option(s) to further refine your search.

Examples of searches that you can perform, based on specific criteria:

• Find all hosts with a DataNode

• Find all hosts with a DataNode that are stopped

• Find all hosts with an HDFS component

• Find all hosts with an HDFS or HBase component

3.3. Performing Host-Level Actions

Use the Actions UI control to act on hosts in your cluster. Actions that you perform that comprise more than one operation, possibly on multiple hosts, are also known as bulk operations.

The Actions control comprises a workflow that uses a sequence of three menus to refine your search: a hosts menu, a menu of objects based on your host choice, and a menu of actions based on your object choice.

For example, if you want to restart the RegionServers on any host in your cluster on which a RegionServer exists:

Steps

1. In the Hosts page, select or search for hosts running a RegionServer:

2. Using the Actions control, click Filtered Hosts > RegionServers > Restart:

3. Click OK to start the selected operation.

4. Optionally, monitor background operations to follow, diagnose, or troubleshoot the restart operation.


More Information

Monitoring Background Operations [38]

3.4. Managing Components on a Host

To manage components running on a specific host, click one of the FQDNs listed on the Hosts page. For example, if you click c6403.ambari.apache.org, that host's page appears. Clicking the Summary tab displays a Components pane that lists all components installed on that host:

To manage all of the components on a single host, you can use the Host Actions control at the top right of the display to start, stop, restart, delete, or turn on maintenance mode for all components installed on the selected host.

Alternatively, you can manage components individually, by using the drop-down menu shown next to an individual component in the Components pane. Each component's menu is labeled with the component's current operating status. Opening the menu displays your available management options, based on that status: for example, you can decommission, restart, or stop the DataNode component for HDFS, as shown here:


3.5. Decommissioning a Master or Slave

Decommissioning is a process that supports removing components and their hosts from the cluster. You must decommission a master or slave running on a host before removing it or its host from service. Decommissioning helps you to prevent potential loss of data or disruption of service. Decommissioning is available for the following component types:

• DataNodes

• NodeManagers

• RegionServers


Decommissioning executes the following tasks:

For DataNodes: Safely replicates the HDFS data to other DataNodes in the cluster

For NodeManagers: Stops accepting new job requests from the masters and stops the component

For RegionServers: Turns on drain mode and stops the component

3.5.1. Decommission a Component

To decommission a component (a DataNode, in the following example):

Steps

1. Using Ambari Web, browse the Hosts page.

2. Find and click the FQDN of the host on which the component resides.

3. Using the Actions control, click Selected Hosts > DataNodes > Decommission:

The UI shows Decommissioning status while in process:

When this DataNode decommissioning process is finished, the status display changes to Decommissioned (shown here for NodeManager).
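You can also submit the decommission operation as a custom command through the Ambari REST API. The following is a sketch only, using the default admin account and the example cluster and host names from this guide; adjust all names for your environment:

# Ask the NameNode to decommission the DataNode on c6403.ambari.apache.org
curl -u admin:admin -H 'X-Requested-By: ambari' -X POST \
  -d '{
    "RequestInfo": {
      "context": "Decommission DataNode",
      "command": "DECOMMISSION",
      "parameters": {"slave_type": "DATANODE", "excluded_hosts": "c6403.ambari.apache.org"}
    },
    "Requests/resource_filters": [{"service_name": "HDFS", "component_name": "NAMENODE"}]
  }' \
  http://<your.ambari.server>:8080/api/v1/clusters/MyCluster/requests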

3.6. Delete a Component

To delete a component:

Steps

1. Using Ambari Web, browse the Hosts page.

2. Find and click the FQDN of the host on which the component resides.

3. In Components, find a decommissioned component.

4. If the component status is Started, stop it.


A decommissioned slave component may restart in the decommissioned state.

5. Click Delete from the component drop-down menu.

Deleting a slave component, such as a DataNode, does not automatically inform a master component, such as a NameNode, to remove the slave component from its exclusion list. Adding a deleted slave component back into the cluster presents the following issue: the added slave remains decommissioned from the master's perspective. Restart the master component as a workaround.

6. To enable Ambari to recognize and monitor only the remaining components, restart services.
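If you prefer to script this cleanup, a stopped component can also be deleted through the Ambari REST API. A sketch, again assuming the example cluster and host names used in this guide:

# Delete the DataNode component definition from host c6403.ambari.apache.org
curl -u admin:admin -H 'X-Requested-By: ambari' -X DELETE \
  http://<your.ambari.server>:8080/api/v1/clusters/MyCluster/hosts/c6403.ambari.apache.org/host_components/DATANODE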

More Information

Review and Confirm Configuration Changes [82]

3.7. Deleting a Host from a Cluster

Deleting a host removes the host from the cluster.

Prerequisites

Before deleting a host, you must complete the following prerequisites:

• Stop all components running on the host.

• Decommission any DataNodes running on the host.

• Move from the host any master components, such as NameNode or ResourceManager, running on the host.

• Turn off host Maintenance Mode, if it is on.

To delete a host:

Steps

1. Using Ambari Web, browse the hosts page to find and click the FQDN of the host that you want to delete.

2. On the Host-Details page, click Host Actions.

3. Click Delete.

More Information

Review and Confirm Configuration Changes [82]

3.8. Setting Maintenance Mode

Setting Maintenance Mode enables you to suppress alerts and omit bulk operations for specific services, components, and hosts in an Ambari-managed cluster when you want to focus on performing hardware or software maintenance, changing configuration settings, troubleshooting, decommissioning, or removing cluster nodes.

Explicitly setting Maintenance Mode for a service implicitly sets Maintenance Mode for the components and hosts that run the service. While Maintenance Mode prevents bulk operations from being performed on the service, component, or host, you may explicitly start and stop a service, component, or host while in Maintenance Mode.

The following sections provide examples of how to use Maintenance Mode in a three-node, Ambari-managed cluster installed using default options and having one DataNode, on host c6403. They describe how to explicitly turn on Maintenance Mode for the HDFS service, alternative procedures for explicitly turning on Maintenance Mode for a host, and the implicit effects of turning on Maintenance Mode for a service, a component, and a host.

More Information

Set Maintenance Mode for a Service [22]

Set Maintenance Mode for a Host [22]

When to Set Maintenance Mode [23]

3.8.1. Set Maintenance Mode for a Service

1. Using Services, select HDFS.

2. Select Service Actions, then choose Turn On Maintenance Mode.

3. Choose OK to confirm.

Notice, on Services Summary, that Maintenance Mode turns on for the NameNode and SNameNode components.
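Maintenance Mode for a service can also be toggled through the Ambari REST API by setting its maintenance_state. A sketch, using the example cluster name MyCluster:

# Turn Maintenance Mode on for the HDFS service (use "OFF" to turn it off)
curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
  -d '{"RequestInfo": {"context": "Turn on HDFS maintenance"},
       "Body": {"ServiceInfo": {"maintenance_state": "ON"}}}' \
  http://<your.ambari.server>:8080/api/v1/clusters/MyCluster/services/HDFS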

3.8.2. Set Maintenance Mode for a Host

To set Maintenance Mode for a host by using the Host Actions control:

Steps

1. Using Hosts, select c6401.ambari.apache.org.

2. Select Host Actions, then choose Turn On Maintenance Mode.

3. Choose OK to confirm.

Notice on Components, that Maintenance Mode turns on for all components.

To set Maintenance Mode for a host by using the Actions control:

Steps

1. Using Hosts, click c6403.ambari.apache.org.

2. In Actions > Selected Hosts > Hosts, choose Turn On Maintenance Mode.

3. Choose OK.


Your list of hosts shows that Maintenance Mode is set for hosts c6401 and c6403:

If you hover your cursor over each Maintenance Mode icon appearing in the hosts list, you see the following information:

• Hosts c6401 and c6403 are in Maintenance Mode.

• On host c6401, HBaseMaster, HDFS client, NameNode, and ZooKeeper Server are also in Maintenance Mode.

• On host c6403, 15 components are in Maintenance Mode.

• On host c6402, HDFS client and Secondary NameNode are in Maintenance Mode, even though the host is not.

Notice also how the DataNode is affected by setting Maintenance Mode on this host:

• Alerts are suppressed for the DataNode.

• DataNode is omitted from HDFS Start/Stop/Restart All, Rolling Restart.

• DataNode is omitted from all Bulk Operations except Turn Maintenance Mode ON/OFF.

• DataNode is omitted from Start All and Stop All components.

• DataNode is omitted from a host-level restart/restart all/stop all/start.

3.8.3. When to Set Maintenance Mode

Four common instances in which you might want to set Maintenance Mode are to perform maintenance, to test a configuration change, to delete a service completely, and to address alerts:

You want to perform hardware, firmware, or OS maintenance on a host. While performing maintenance, you want to be able to do the following:

• Prevent alerts generated by all components on this host.

• Be able to stop, start, and restart each component on the host.

• Prevent host-level or service-level bulk operations from starting, stopping, or restarting components on this host.


To achieve these goals, explicitly set Maintenance Mode for the host. Putting a host in Maintenance Mode implicitly puts all components on that host in Maintenance Mode.

You want to test a service configuration change. You will stop, start, and restart the service using a "rolling" restart to test whether restarting activates the change. To test configuration changes, you want to ensure the following conditions:

• No alerts are generated by any components in this service.

• No host-level or service-level bulk operations start, stop, or restart components in this service.

To achieve these goals, explicitly set Maintenance Mode for the service. Putting a service in Maintenance Mode implicitly turns on Maintenance Mode for all components in the service.

You want to stop a service. To stop a service completely, you want to ensure the following conditions:

• No warnings are generated by the service.

• No components start, stop, or restart due to host- level actions or bulk operations.

To achieve these goals, explicitly set Maintenance Mode for the service. Putting a service in Maintenance Mode implicitly turns on Maintenance Mode for all components in the service.

You want to stop a host component from generating alerts. To stop a host component from generating alerts, you must be able to do the following:

• Check the component.

• Assess warnings and alerts generated for the component.

• Prevent alerts generated by the component while you check its condition.

To achieve these goals, explicitly set Maintenance Mode for the host component. Putting a host component in Maintenance Mode prevents host-level and service-level bulk operations from starting or restarting the component. You can restart the component explicitly while Maintenance Mode is on.

3.9. Add Hosts to a Cluster

To add new hosts to your cluster:

Steps


1. Browse to the Hosts page and select Actions > +Add New Hosts.

The Add Host wizard provides a sequence of prompts similar to those in the Ambari Cluster Install wizard.

2. Follow the prompts, providing information similar to that provided to define the first set of hosts in your cluster:

Next Steps

Review and confirm all recommended configuration changes.

Note that if you are adding a new host to your cluster, the HDP and Ambari components previously deployed in your cluster are not automatically added to the new host.

More Information

Review and Confirm Configuration Changes [82]

Install Options

3.10. Establishing Rack Awareness

You can establish rack awareness in two ways. Either you can set the rack ID using Ambari or you can set the rack ID using a custom topology script.


More Information

Set the Rack ID Using Ambari [26]

Set the Rack ID Using a Custom Topology Script [27]

3.10.1. Set the Rack ID Using Ambari

By setting the Rack ID, you can enable Ambari to manage rack information for hosts, including displaying hosts in heatmaps by Rack ID and enabling users to filter and find hosts on the Hosts page by Rack ID.

If HDFS is installed in your cluster, Ambari passes this Rack ID information to HDFS by using a topology script. Ambari generates a topology script at /etc/hadoop/conf/topology.py and sets the net.topology.script.file.name property in core-site automatically. This topology script reads a mappings file /etc/hadoop/conf/topology_mappings.data that Ambari automatically generates. When you make changes to Rack ID assignment in Ambari, this mappings file will be updated when you push out the HDFS configuration. HDFS uses this topology script to obtain Rack information about the DataNode hosts.

There are two ways using Ambari Web to set the Rack ID: for multiple hosts, using Actions, or for individual hosts, using Host Actions.

To set the Rack ID for multiple hosts:

Steps

1. Using Actions, click selected, filtered, or all hosts.

2. Click Hosts.

3. Click Set Rack.

To set the Rack ID on an individual host:

Steps

1. Browse to the Host page.

2. Click Host Actions.

3. Click Set Rack.
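You can also set the Rack ID for a host through the Ambari REST API by updating its rack_info field. A sketch, using the example cluster and host names from this guide:

# Assign host c6401.ambari.apache.org to rack /rack-1 in cluster MyCluster
curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
  -d '{"Hosts": {"rack_info": "/rack-1"}}' \
  http://<your.ambari.server>:8080/api/v1/clusters/MyCluster/hosts/c6401.ambari.apache.org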


3.10.2. Set the Rack ID Using a Custom Topology Script

If you do not want to have Ambari manage the rack information for hosts, you can use a custom topology script. To do this, you must create your own topology script and manage distributing the script to all hosts. Note also that because Ambari will have no access to host rack information, heatmaps will not display by rack in Ambari Web.

To set the Rack ID using a custom topology script:

Steps

1. Browse to Services > HDFS > Configs.

2. Modify net.topology.script.file.name to your own custom topology script.

For example: /etc/hadoop/conf/topology.sh:

3. Distribute that topology script to your hosts.

You can now manage the rack mapping information for your script outside of Ambari.
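A minimal custom topology script might look like the following sketch. Hadoop invokes the script with one or more host names or IP addresses as arguments and expects one rack path per line in reply. The lookup file /etc/hadoop/conf/topology.data and its "<host-or-ip> <rack>" format are assumptions you define yourself:

#!/bin/bash
# Minimal sketch of a custom topology script (for example,
# /etc/hadoop/conf/topology.sh). For each host name or IP address passed
# as an argument, print its rack path from the lookup file, or
# /default-rack if the host is not listed.
MAP_FILE=/etc/hadoop/conf/topology.data
for host in "$@"; do
  rack=$(awk -v h="$host" '$1 == h { print $2 }' "$MAP_FILE")
  echo "${rack:-/default-rack}"
done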


4. Managing Services

You use the Services tab of the Ambari Web UI home page to monitor and manage selected services running in your Hadoop cluster.

All services installed in your cluster are listed in the leftmost panel:

The Services tab enables you to perform the following tasks:

• Starting and Stopping All Services [29]

• Displaying Service Operating Summary [29]

• Adding a Service [32]

• Changing Configuration Settings [80]

• Performing Service Actions [36]

• Rolling Restarts [36]

• Monitoring Background Operations [38]

• Removing A Service [40]


• Operations Audit [40]

• Using Quick Links [40]

• Refreshing YARN Capacity Scheduler [41]

• Managing HDFS [41]

• Managing Atlas in a Storm Environment [43]

• Enabling the Oozie UI [44]

4.1. Starting and Stopping All Services

To start or stop all listed services simultaneously, click Actions and then click Start All or Stop All:
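The equivalent bulk operation is available through the Ambari REST API by setting the desired state of all services. A sketch, for the example cluster MyCluster (use "STARTED" instead of "INSTALLED" to start them):

# Stop all services ("INSTALLED" is the Ambari state for a stopped service)
curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
  -d '{"RequestInfo": {"context": "Stop all services"},
       "Body": {"ServiceInfo": {"state": "INSTALLED"}}}' \
  http://<your.ambari.server>:8080/api/v1/clusters/MyCluster/services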

4.2. Displaying Service Operating Summary

Clicking the name of a service from the list displays a Summary tab containing basic information about the operational status of that service, including any alerts. To refresh the monitoring panels and display information about a different service, you can click a different name from the list.

Notice the colored icons next to each service name, indicating service operating status and any alerts generated for the service.

You can click one of the View Host links, as shown in the following example, to view components and the host on which the selected service is running:


4.2.1. Alerts and Health Checks

In the Summary tab, you can click Alerts to see a list of all health checks and their status for the selected service. Critical alerts are shown first. To see an alert definition, click the text title of the alert message in the list. The following example shows the results when you click HBase > Services > Alerts > HBase Master Process:

4.2.2. Modifying the Service Dashboard

Depending on the service, the Summary tab includes a Metrics dashboard that is by default populated with important service metrics to monitor:

If you have the Ambari Metrics service installed and are using Apache HDFS, Apache Hive, Apache HBase, or Apache YARN, you can customize the Metrics dashboard. You can add and remove widgets from the Metrics dashboard, and you can create new widgets and delete widgets. Widgets can be private to you and your dashboard, or they can be shared in a Widget Browser library.

You must have the Ambari Metrics service installed to be able to view, create, and customize the Metrics dashboard.

4.2.2.1. Adding or Removing a Widget

To add or remove a widget in the HDFS, Hive, HBase, or YARN service Metrics dashboard:

1. Either click + to launch the Widget Browser, or click Browse Widgets from Actions > Metrics.


2. The Widget Browser displays the widgets available to add to your service dashboard, including widgets already included in your dashboard, shared widgets, and widgets you have created. Widgets that are shared are identified by the icon highlighted in the following example:

3. If you want to display only the widgets you have created, select the “Show only my widgets” check box.

4. If you want to remove a widget shown as added to your dashboard, click to remove it.

5. If you want to add an available widget that is not already added, click Add.

4.2.2.2. Creating a Widget

1. Click + to launch the Widget Browser.

2. Either click the Create Widget button, or click Create Widget in the Actions menu of the Metrics header.

3. Select the type of widget to create.

4. Depending on the service and type of widget, you can select metrics and use operators to create an expression to be displayed in the widget.

A preview of the widget is displayed as you build the expression.

5. Enter the widget name and description.

6. Optionally, choose to share the widget.

Sharing the widget makes the widget available to all Ambari users for this cluster. After a widget is shared, other Ambari Admins or Cluster Operators can modify or delete the widget. This cannot be undone.

4.2.2.3. Deleting a Widget

1. Click + to launch the Widget Browser. Alternatively, you can choose Browse Widgets from the Actions menu in the Metrics header.

2. The Widget Browser displays the available widgets to add to your Service Dashboard. This is a combination of shared widgets and widgets you have created. Widgets that are shared are identified by the icon highlighted in the following example.


3. If a widget is already added to your dashboard, it is shown as Added. Click to remove.

4. For widgets that you created, you can select the More… option to delete.

5. For widgets that are shared, if you are an Ambari Admin or Cluster Operator, you will also have the option to delete.

Deleting a shared widget removes the widget from all users. This cannot be undone.

4.2.2.4. Export Widget Graph Data

You can export the metrics data from widget graphs using the Export capability.

1. Hover your cursor over the widget graph, or click the graph to zoom, to display the Export icon.

2. Click the icon and specify either CSV or JSON format.

4.2.2.5. Setting Display Timezone

You can set the timezone used for displaying metrics data in widget graphs.

1. In Ambari Web, click your user name and select Settings.

2. In the Locale section, select the Timezone.

3. Click Save.

The Ambari Web UI reloads, and graphs are displayed using the timezone you have set.

4.3. Adding a Service

The Ambari installation wizard installs all available Hadoop services by default. You can choose to deploy only some services initially, and then add other services as you need them. For example, many customers deploy only core Hadoop services initially. The Add Service option of the Actions control enables you to deploy additional services without interrupting operations in your Hadoop cluster. When you have deployed all available services, the Add Service control display is dimmed, indicating that it is unavailable.

To add a service, follow the steps shown in this example of adding the Apache Falcon service to your Hadoop cluster:

1. Click Actions > Add Service.


The Add Service wizard opens.

2. Click Choose Services.

The Choose Services pane displays, showing a table in which services that are already active appear with a green background and selected check boxes.

3. In the Choose Services pane, select the empty check box next to the service that you want to add, and then click Next.

Notice that you can also select all services listed by selecting the checkbox next to the Service table column heading.


4. In Assign Masters, confirm the default host assignment.

The Add Services Wizard indicates hosts on which the master components for a chosen service will be installed. A service chosen for addition shows a grey check mark.

Alternatively, use the drop-down menu to choose a different host machine to which master components for your selected service will be added.

5. If you are adding a service that requires slaves and clients, in the Assign Slaves and Clients control, accept the default assignment of slave and client components to hosts by clicking Next.

Alternatively, select hosts on which you want to install slave and client components (at least one host for the slave of each service being added), and click Next.

Host Roles Required for Added Services

Service Added    Host Role Required
YARN             NodeManager
HBase            RegionServer

6. In Customize Services, accept the default configuration properties.

Alternatively, edit the default values for configuration properties, if necessary. Choose Override to create a configuration group for this service. Then, choose Next:


7. In Review, verify that the configuration settings match your intentions, and then, click Deploy:

8. Monitor the progress of installing, starting, and testing the service, and when that finishes successfully, click Next:

9. When you see the summary display of installation results, click Complete:

10. Review and confirm recommended configuration changes.

11. Restart any other components that have stale configurations as a result of adding services.

More Information


Review and Confirm Configuration Changes [82]

Choose Services

Apache Spark Component Guide

Apache Storm Component Guide

Apache Ambari Kerberos Configuration

Apache Kafka Component Guide


Installing and Configuring Apache Atlas

Installing Ranger Using Ambari

Installing Hue

Apache Solr Search Installation

Installing Ambari Log Search (Technical Preview)

Installing Druid (Technical Preview)

4.4. Performing Service Actions

Manage a selected service on your cluster by performing service actions. In the Services tab, click Service Actions and click an option. Available options depend on the service you have selected; for example, HDFS service action options include:

Clicking Turn On Maintenance Mode suppresses alerts and status indicator changes generated by the service, while allowing you to start, stop, restart, move, or perform maintenance tasks on the service.

More Information

Setting Maintenance Mode [21]

Enable Service Auto-Start [101]

4.5. Rolling Restarts

When you restart multiple services, components, or hosts, use rolling restarts to distribute the task. A rolling restart stops and then starts multiple running slave components, such as DataNodes, NodeManagers, RegionServers, or Supervisors, using a batch sequence.


Important

Rolling restarts of DataNodes should be performed only during cluster maintenance.

You set rolling restart parameter values to control the batch size, the wait time between batches, the tolerance for failures, and the scope of restarts of many components across large clusters.

To run a rolling restart, follow these steps:

1. From the service summary pane on the left of the Service display, click a service name.

2. On the service Summary page, click a link, such as DataNodes or RegionServers, of any components that you want to restart.

The Hosts page lists any host names in your cluster on which that component resides.

3. Using the host-level Actions menu, click the name of a slave component option, and then click Restart.

4. Review and set values for Rolling Restart Parameters.

5. Optionally, reset the flag to restart only components with changed configurations.

6. Click Trigger Restart.

After triggering the restart, you should monitor the progress of the background operations.

More Information

Setting Rolling Restart Parameters [37]

Monitoring Background Operations [38]

Performing Host-Level Actions [17]

Aborting a Rolling Restart [38]

4.5.1. Setting Rolling Restart Parameters

When you choose to restart slave components, you should use parameters to control how restarts of components roll. Parameter values based on ten percent of the total number of components in your cluster are set as default values. For example, default settings for a rolling restart of components in a three-node cluster restarts one component at a time, waits two minutes between restarts, proceeds if only one failure occurs, and restarts all existing components that run this service. Enter integer, non-zero values for all parameters.

Batch Size: Number of components to include in each restart batch.

Wait Time: Time (in seconds) to wait between queuing each batch of components.

Tolerate up to x failures: Total number of restart failures to tolerate, across all batches, before halting the restarts and not queuing batches.


If you trigger a rolling restart of components, the default value of Restart components with stale configs is “true.” If you trigger a rolling restart of services, this value is “false.”

More Information

Rolling Restarts [36]

4.5.2. Aborting a Rolling Restart

To abort future restart operations in the batch, click Abort Rolling Restart:

More Information

Rolling Restarts [36]

4.6. Monitoring Background Operations

You can use the Background Operations window to monitor progress and completion of a task that comprises multiple operations, such as a rolling restart of components. The Background Operations window opens by default when you run such a task. For example, to monitor the progress of a rolling restart, click elements in the Background Operations window:

1. Click the right-arrow for each operation to show restart operation progress on each host:


2. After restart operations are complete, you can click either the right-arrow or host name to view log files and any error messages generated on the selected host:

3. Optionally, you can use the Copy, Open, or Host Logs icons located at the upper-right of the Background Operations window to copy, open, or view logs for the rolling restart.

For example, choose Host Logs to view error and output logs information for host c6403.ambari.apache.org:

As shown here, you can also select the check box at the bottom of the Background Operations window to hide the window when performing tasks in the future.
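The Background Operations window is backed by the requests resource of the Ambari REST API, so you can also follow long-running tasks from a script. A sketch, again using the example cluster name MyCluster:

# List recent requests and their status
curl -u admin:admin \
  "http://<your.ambari.server>:8080/api/v1/clusters/MyCluster/requests?fields=Requests/request_context,Requests/request_status"

# Inspect the individual tasks of one request (here, request id 42)
curl -u admin:admin http://<your.ambari.server>:8080/api/v1/clusters/MyCluster/requests/42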


4.7. Removing A Service

Important

Removing a service is not reversible, and all configuration history will be lost.

To remove a service:

1. Click the name of the service from the left panes of the Services tab.

2. Click Service Actions > Delete.

3. As prompted, remove any dependent services.

4. As prompted, stop all components for the service.

5. Confirm the removal.

After the service is stopped, you must confirm the removal to proceed.

More Information

Review and Confirm Configuration Changes [82]

4.8. Operations Audit

When you perform an operation using Ambari, such as user login or logout, stopping or starting a service, and adding or removing a service, Ambari creates an entry in an audit log. By reading the audit log, you can determine who performed the operation, when the operation occurred, and other, operation-specific information. You can find the Ambari audit log on your Ambari server host, at:

/var/log/ambari-server/ambari-audit.log

When you change configuration for a service, Ambari creates an entry in the audit log, and creates a specific log file, at:

ambari-config-changes.log

By reading the configuration change log, you can find out even more information about each change. For example:

2016-05-25 18:31:26,242 INFO - Cluster 'MyCluster' changed by: 'admin'; service_name='HDFS' config_group='default' config_group_id='-1' version='2'
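Because both logs are plain text, you can inspect them with standard tools. A small sketch; the grep patterns are illustrative rather than a fixed log schema, and the configuration change log is assumed to reside alongside the audit log in /var/log/ambari-server:

# Show the most recent entries in the Ambari audit log
tail -20 /var/log/ambari-server/ambari-audit.log

# Find configuration changes made by user 'admin' to the HDFS service
grep "changed by: 'admin'" /var/log/ambari-server/ambari-config-changes.log | grep "service_name='HDFS'"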

More Information

Changing Configuration Settings

4.9. Using Quick Links

Select Quick Links options to access additional sources of information about a selected service. For example, HDFS Quick Links options include the following:


Quick Links are not available for every service.

4.10. Refreshing YARN Capacity Scheduler

This topic describes how to “refresh” the Capacity Scheduler from Ambari when you add or modify existing queues. After you modify the Capacity Scheduler configuration, YARN enables you to refresh the queues without restarting your ResourceManager, provided that you have made no destructive changes (such as completely removing a queue) to your configuration. If you attempt to refresh queues after performing a destructive change, the refresh operation fails with the message: Failed to re-init queues. In that case, you must restart the ResourceManager for the Capacity Scheduler change to take effect.

To refresh the Capacity Scheduler, follow these steps:

1. In Ambari Web, browse to Services > YARN > Summary.

2. Click Service Actions, and then click Refresh YARN Capacity Scheduler.

3. Confirm that you want to perform this operation.

The refresh operation is submitted to the YARN ResourceManager.
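Under the hood, this corresponds to YARN's queue refresh operation, which you can also run from a command line on the ResourceManager host. A sketch; run it as a user with YARN administrator rights, such as the yarn user:

# Re-read the capacity-scheduler configuration and refresh the queues
yarn rmadmin -refreshQueues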

More Information

ResourceManager High Availability [66]

4.11. Managing HDFS

This section contains information specific to rebalancing and tuning garbage collection in Hadoop Distributed File System (HDFS).

More Information

Rebalancing HDFS [42]

Tuning Garbage Collection [42]


Customizing the HDFS Home Directory [43]

NameNode High Availability [46]

4.11.1. Rebalancing HDFS

HDFS provides a “balancer” utility to help balance the blocks across DataNodes in the cluster. To initiate a balancing process, follow these steps:

1. In Ambari Web, browse to Services > HDFS > Summary.

2. Click Service Actions, and then click Rebalance HDFS.

3. Enter the Balance Threshold value as a percentage of disk capacity.

4. Click Start.

You can monitor or cancel a rebalance process by opening the Background Operations window in Ambari.
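The same balancer utility can be run from a command line on an HDFS client host, which is useful for scripting. A sketch, run as the hdfs user:

# Rebalance until every DataNode's utilization is within 10% of the
# cluster-wide average (the threshold is a percentage of disk capacity)
sudo -u hdfs hdfs balancer -threshold 10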

More Information

Monitoring Background Operations [38]

Tuning Garbage Collection [42]

4.11.2. Tuning Garbage Collection

The Concurrent Mark Sweep (CMS) garbage collection (GC) process includes a set of heuristic rules used to trigger garbage collection. This makes garbage collection less predictable and tends to delay collection until capacity is reached, creating a Full GC error (which might pause all processes).

Ambari sets default parameter values for many properties during cluster deployment. Within the export HADOOP_NAMENODE_OPTS= clause of the hadoop-env template, two parameters that affect the CMS GC process have the following default settings:

• -XX:+UseCMSInitiatingOccupancyOnly prevents the use of GC heuristics.

• -XX:CMSInitiatingOccupancyFraction=<percent> tells the Java VM when the CMS collector should be triggered.

If this percent is set too low, the CMS collector runs too often; if it is set too high, the CMS collector is triggered too late, and concurrent mode failure might occur. The default setting for -XX:CMSInitiatingOccupancyFraction is 70, which means that the application should utilize less than 70% of capacity.
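For example, the relevant portion of the clause in the hadoop-env template might look like the following sketch. The values shown are the defaults described above; the surrounding options in your actual template will differ:

# Excerpt of the NameNode options in the hadoop-env template:
# disable CMS heuristics and trigger collection at 70% occupancy
export HADOOP_NAMENODE_OPTS="-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 ${HADOOP_NAMENODE_OPTS}"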

To tune garbage collection by modifying the NameNode CMS GC parameters, follow these steps:

1. In Ambari Web, browse to Services > HDFS.


2. Open the Configs tab and browse to Advanced > Advanced hadoop-env.

3. Edit the hadoop-env template.

4. Save your configurations and restart, as prompted.

More Information

Rebalancing HDFS [42]

4.11.3. Customizing the HDFS Home Directory

By default, the HDFS home directory is set to /user/<user_name>. You can use the dfs.user.home.base.dir property to customize the HDFS home directory.

1. In Ambari Web, browse to Services > HDFS > Configs > Advanced.

2. Click Custom hdfs-site, then click Add Property.

3. On the Add Property pop-up, add the following property:

dfs.user.home.base.dir=<home_directory>

Where <home_directory> is the path to the new home directory.

4. Click Add, then save the new configuration and restart, as prompted.

4.12. Managing Atlas in a Storm Environment

When you update the Apache Atlas configuration settings in Ambari, Ambari marks the services that require a restart. To restart these services, follow these steps:

1. In Ambari Web, click the Actions control.

2. Click Restart All Required.

Important

Apache Oozie requires a restart after an Atlas configuration update, but might not be marked as requiring restart in Ambari. If Oozie is not included, follow these steps to restart Oozie:

1. In Ambari Web, click Oozie in the services summary pane on the left of the display.

2. Click Service Actions > Restart All.

More Information

Installing and Configuring Atlas Using Ambari

Storm Guide


4.13. Enabling the Oozie UI

Ext JS is GPL-licensed software and is no longer included in builds of HDP 2.6. Because of this, the Oozie WAR file is not built to include the Ext JS-based user interface unless Ext JS is manually installed on the Oozie server. If you add Oozie using Ambari 2.6.1.0 to an HDP 2.6.4 or greater stack, no Oozie UI is available by default. If you want an Oozie UI, you must manually install Ext JS on the Oozie server host, then restart Oozie. During the restart operation, Ambari rebuilds the Oozie WAR file to include the Ext JS-based Oozie UI.

Steps

1. Log in to the Oozie Server host.

2. Download and install the Ext JS package.

CentOS RHEL Oracle Linux 6:

wget https://archive.cloudera.com/p/HDP-UTILS-GPL/1.1.0.22/repos/centos6/extjs/extjs-2.2-1.noarch.rpm

rpm -ivh extjs-2.2-1.noarch.rpm

CentOS RHEL Oracle Linux 7:

wget https://archive.cloudera.com/p/HDP-UTILS-GPL/1.1.0.22/repos/centos7/extjs/extjs-2.2-1.noarch.rpm

rpm -ivh extjs-2.2-1.noarch.rpm

CentOS RHEL Oracle Linux 7 (PPC):

wget https://archive.cloudera.com/p/HDP-UTILS-GPL/1.1.0.22/repos/centos7-ppc/extjs/extjs-2.2-1.noarch.rpm

rpm -ivh extjs-2.2-1.noarch.rpm

SUSE11SP3:

wget https://archive.cloudera.com/p/HDP-UTILS-GPL/1.1.0.22/repos/suse11sp3/extjs/extjs-2.2-1.noarch.rpm

rpm -ivh extjs-2.2-1.noarch.rpm

SUSE11SP4:

wget https://archive.cloudera.com/p/HDP-UTILS-GPL/1.1.0.22/repos/suse11sp4/extjs/extjs-2.2-1.noarch.rpm

rpm -ivh extjs-2.2-1.noarch.rpm

SLES12:

wget https://archive.cloudera.com/p/HDP-UTILS-GPL/1.1.0.22/repos/sles12/extjs/extjs-2.2-1.noarch.rpm

rpm -ivh extjs-2.2-1.noarch.rpm

Ubuntu12:

wget https://archive.cloudera.com/p/HDP-UTILS-GPL/1.1.0.22/repos/ubuntu12/pool/main/e/extjs/extjs_2.2-2_all.deb

dpkg -i extjs_2.2-2_all.deb

Ubuntu14:

wget https://archive.cloudera.com/p/HDP-UTILS-GPL/1.1.0.22/repos/ubuntu14/pool/main/e/extjs/extjs_2.2-2_all.deb

dpkg -i extjs_2.2-2_all.deb

Ubuntu16:

wget https://archive.cloudera.com/p/HDP-UTILS-GPL/1.1.0.22/repos/ubuntu16/pool/main/e/extjs/extjs_2.2-2_all.deb

dpkg -i extjs_2.2-2_all.deb

Debian6:

wget https://archive.cloudera.com/p/HDP-UTILS-GPL/1.1.0.22/repos/debian6/pool/main/e/extjs/extjs_2.2-2_all.deb

dpkg -i extjs_2.2-2_all.deb

Debian7:

wget https://archive.cloudera.com/p/HDP-UTILS-GPL/1.1.0.22/repos/debian7/pool/main/e/extjs/extjs_2.2-2_all.deb

dpkg -i extjs_2.2-2_all.deb

3. Remove the following file:

rm /usr/hdp/current/oozie-server/.prepare_war_cmd

4. Restart Oozie Server from the Ambari UI.

Ambari rebuilds the Oozie WAR file.
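As an optional sanity check, you can confirm that the rebuilt WAR now serves the UI; the host name here is a hypothetical stand-in for your Oozie Server host, and 11000 is the default Oozie port:

curl -I http://oozie-host.example.com:11000/oozie/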


5. Managing Service High Availability

Ambari web provides a wizard-driven user experience that enables you to configure high availability of the components in many Hortonworks Data Platform (HDP) stack services. High availability is assured through establishing primary and secondary components. In the event that the primary component fails or becomes unavailable, the secondary component is available. After configuring high availability for a service, Ambari enables you to manage and disable (roll back) high availability of components in that service.

• NameNode High Availability [46]

• ResourceManager High Availability [66]

• HBase High Availability [69]

• Hive High Availability [74]

• Oozie High Availability [76]

• Apache Atlas High Availability [77]

• Enabling Ranger Admin High Availability [79]

5.1. NameNode High Availability

To ensure that another NameNode in your cluster is always available if the primary NameNode host fails, you should enable and configure NameNode high availability on your cluster using Ambari Web.

More Information

Configuring NameNode High Availability [46]

Rolling Back NameNode HA [51]

Managing Journal Nodes [61]

5.1.1. Configuring NameNode High Availability

Prerequisites

• Verify that you have at least three hosts in your cluster and are running at least three Apache ZooKeeper servers.

• Verify that the Hadoop Distributed File System (HDFS) and ZooKeeper services are not in Maintenance Mode.

HDFS and ZooKeeper must stop and start when enabling NameNode HA. Maintenance Mode will prevent those start and stop operations from occurring. If the HDFS or ZooKeeper services are in Maintenance Mode, the NameNode HA wizard will not complete successfully.

Steps

1. In Ambari Web, select Services > HDFS > Summary.

2. Click Service Actions, then click Enable NameNode HA.

3. The Enable HA wizard launches. This wizard describes the set of automated and manual steps you must take to set up NameNode high availability.

4. On the Get Started page, type in a Nameservice ID and click Next.

You use this Nameservice ID instead of the NameNode FQDN after HA is set up.

5. On the Select Hosts page, select a host for the additional NameNode and the JournalNodes, and then click Next:


6. On the Review page, confirm your host selections and click Next:


7. Follow the directions on the Manual Steps Required: Create Checkpoint on NameNode page, and then click Next:

You must log in to your current NameNode host and run the commands to put your NameNode into safe mode and create a checkpoint.

8. When Ambari detects success and the message on the bottom of the window changes to Checkpoint created, click Next.

9. On the Configure Components page, monitor the configuration progress bars, then click Next:

10. Follow the instructions on the Manual Steps Required: Initialize JournalNodes page and then click Next:

You must log in to your current NameNode host to run the command to initialize the JournalNodes.

11. When Ambari detects success and the message on the bottom of the window changes to JournalNodes initialized, click Next.

12. On the Start Components page, monitor the progress bars as the ZooKeeper servers and NameNode start; then click Next:


Note

In a cluster with Ranger enabled, and with Hive configured to use MySQL, Ranger will fail to start if MySQL is stopped. To work around this issue, start the Hive MySQL database and then retry starting components.

13. On the Manual Steps Required: Initialize NameNode HA Metadata page, complete each step, using the instructions on the page, and then click Next.

For this step, you must log in to both the current NameNode and the additional NameNode. Make sure you are logged in to the correct host for each command. Click OK to confirm, after you complete each command.

14. On the Finalize HA Setup page, monitor the progress bars as the wizard completes HA setup, then click Done to finish the wizard.

After the Ambari Web UI reloads, you may see some alert notifications. Wait a few minutes until all the services restart.

15. Restart any components using Ambari Web, if necessary.

16. If you are using Hive, you must manually change the Hive Metastore FS root to point to the Nameservice URI instead of the NameNode URI. You created the Nameservice ID in the Get Started step.

Steps

a. Find the current FS root on the Hive host:

hive --config /etc/hive/conf/conf.server --service metatool -listFSRoot


The output should look similar to: Listing FS Roots... hdfs://<namenode-host>/apps/hive/warehouse.

b. Change the FS root:

$ hive --config /etc/hive/conf/conf.server --service metatool -updateLocation <new-location> <old-location>

For example, if your Nameservice ID is mycluster, you input:

$ hive --config /etc/hive/conf/conf.server --service metatool -updateLocation hdfs://mycluster/apps/hive/warehouse hdfs://c6401.ambari.apache.org/apps/hive/warehouse

The output looks similar to:

Successfully updated the following locations... Updated X records in SDS table

Important

The Hive configuration path for a default HDP 2.3.x or later stack is /etc/hive/conf/conf.server.

The Hive configuration path for a default HDP 2.2.x or earlier stack is /etc/hive/conf.

17. Adjust the ZooKeeper Failover Controller retries setting for your environment:

a. Browse to Services > HDFS > Configs > Advanced core-site.

b. Set ha.failover-controller.active-standby-elector.zk.op.retries=120.

Next Steps

Review and confirm all recommended configuration changes.

More Information

Review and Confirm Configuration Changes [82]

5.1.2. Rolling Back NameNode HA

To disable (roll back) NameNode high availability, perform these tasks (depending on your installation):

1. Stop HBase [52]

2. Checkpoint the Active NameNode [52]

3. Stop All Services [53]

4. Prepare the Ambari Server Host for Rollback [53]


5. Restore the HBase Configuration [54]

6. Delete ZooKeeper Failover Controllers [55]

7. Modify HDFS Configurations [55]

8. Re-create the Secondary NameNode [57]

9. Re-enable the Secondary NameNode [58]

10. Delete All JournalNodes [59]

11. Delete the Additional NameNode [60]

12. Verify the HDFS Components [60]

13. Start HDFS [61]

More Information

Configuring NameNode High Availability [46]

5.1.2.1. Stop HBase

1. In the Ambari Web cluster dashboard, click the HBase service.

2. Click Service Actions > Stop.

3. Wait until HBase has stopped completely before continuing.

5.1.2.2. Checkpoint the Active NameNode

If HDFS is used after you enable NameNode HA, but you want to revert to a non-HA state, you must checkpoint the HDFS state before proceeding with the rollback.

If the Enable NameNode HA wizard failed and you need to revert, you can omit this step and proceed to stop all services.

Checkpointing the HDFS state requires different syntax, depending on whether Kerberos security is enabled on the cluster or not:

• If Kerberos security has not been enabled on the cluster, use the following commands on the Active NameNode host, as the HDFS service user, to save the namespace:

sudo su -l <HDFS_USER> -c 'hdfs dfsadmin -safemode enter'
sudo su -l <HDFS_USER> -c 'hdfs dfsadmin -saveNamespace'

• If Kerberos security has been enabled on the cluster, use the following commands to save the namespace:

sudo su -l <HDFS_USER> -c 'kinit -kt /etc/security/keytabs/nn.service.keytab nn/<HOSTNAME>@<REALM>; hdfs dfsadmin -safemode enter'
sudo su -l <HDFS_USER> -c 'kinit -kt /etc/security/keytabs/nn.service.keytab nn/<HOSTNAME>@<REALM>; hdfs dfsadmin -saveNamespace'

In this example, <HDFS_USER> is the HDFS service user (for example, hdfs), <HOSTNAME> is the Active NameNode hostname, and <REALM> is your Kerberos realm.

More Information

Stop All Services [53]

5.1.2.3. Stop All Services

After stopping HBase and, if necessary, checkpointing the Active NameNode, stop all services:

1. In Ambari Web, click the Services tab.

2. Click Stop All.

3. Wait for all services to stop completely before continuing.

5.1.2.4. Prepare the Ambari Server Host for Rollback

To prepare for the rollback procedure:

Steps

1. Log in to the Ambari server host.

2. Set the following environment variables:

export AMBARI_USER=<AMBARI_USERNAME>
Substitute the value of the administrative user for Ambari Web. The default value is admin.

export AMBARI_PW=<AMBARI_PASSWORD>
Substitute the value of the administrative password for Ambari Web. The default value is admin.

export AMBARI_PORT=<AMBARI_PORT>
Substitute the Ambari Web port. The default value is 8080.

export AMBARI_PROTO=<AMBARI_PROTOCOL>
Substitute the value of the protocol for connecting to Ambari Web. Options are http or https. The default value is http.

export CLUSTER_NAME=<CLUSTER_NAME>
Substitute the name of your cluster, which you set during installation: for example, mycluster.

export NAMENODE_HOSTNAME=<NN_HOSTNAME>
Substitute the FQDN of the host for the non-HA NameNode: for example, nn01.mycompany.com.

export ADDITIONAL_NAMENODE_HOSTNAME=<ANN_HOSTNAME>
Substitute the FQDN of the host for the additional NameNode in your HA setup.

export SECONDARY_NAMENODE_HOSTNAME=<SNN_HOSTNAME>
Substitute the FQDN of the host for the secondary NameNode for the non-HA setup.

export JOURNALNODE1_HOSTNAME=<JOUR1_HOSTNAME>
Substitute the FQDN of the host for the first JournalNode.

export JOURNALNODE2_HOSTNAME=<JOUR2_HOSTNAME>
Substitute the FQDN of the host for the second JournalNode.

export JOURNALNODE3_HOSTNAME=<JOUR3_HOSTNAME>
Substitute the FQDN of the host for the third JournalNode.
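For example, on a cluster named mycluster, reachable over HTTP on the default port with the default admin credentials, the first few variables would be set as follows (values drawn from the defaults above; adjust for your environment):

export AMBARI_USER=admin
export AMBARI_PW=admin
export AMBARI_PORT=8080
export AMBARI_PROTO=http
export CLUSTER_NAME=mycluster
export NAMENODE_HOSTNAME=nn01.mycompany.com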

3. Double check that these environment variables are set correctly.

5.1.2.5. Restore the HBase Configuration

If you have installed HBase, you might need to restore a configuration to its pre-HA state.

Note

For Ambari 2.6.0 and higher, configs.sh is not supported and will fail. Use configs.py instead.

1. From the Ambari server host, determine whether your current HBase configuration must be restored:

/var/lib/ambari-server/resources/scripts/configs.py -u <AMBARI_USER> -p <AMBARI_PW> -port <AMBARI_PORT> get localhost <CLUSTER_NAME> hbase-site

For the placeholder values, use the environment variables that you set when preparing the Ambari Server host for rollback.

If hbase.rootdir is set to the NameService ID you set up using the Enable NameNode HA wizard, you must revert hbase-site to non-HA values. For example, in "hbase.rootdir":"hdfs://<name-service-id>:8020/apps/hbase/data", the hbase.rootdir property points to the NameService ID and the value must be rolled back.

If hbase.rootdir points instead to a specific NameNode host, it does not need to be rolled back. For example, in "hbase.rootdir":"hdfs://<namenode-hostname>:8020/apps/hbase/data", the hbase.rootdir property points to a specific NameNode host and not a NameService ID. This does not need to be rolled back; you can proceed to delete ZooKeeper failover controllers.

2. If you must roll back the hbase.rootdir value, on the Ambari Server host, use the configs.py script to make the necessary change:

/var/lib/ambari-server/resources/scripts/configs.py -u <AMBARI_USER> -p <AMBARI_PW> -port <AMBARI_PORT> set localhost <CLUSTER_NAME> hbase-site hbase.rootdir hdfs://<NAMENODE_HOSTNAME>:8020/apps/hbase/data

For the placeholder values, use the environment variables that you set when preparing the Ambari Server host for rollback.


3. On the Ambari server host, verify that the hbase.rootdir property has been restored properly:

/var/lib/ambari-server/resources/scripts/configs.py -u <AMBARI_USER> -p <AMBARI_PW> -port <AMBARI_PORT> get localhost <CLUSTER_NAME> hbase-site

The hbase.rootdir property should now point to the NameNode hostname, not the NameService ID.

More Information

Prepare the Ambari Server Host for Rollback [53]

Delete ZooKeeper Failover Controllers [55]

5.1.2.6. Delete ZooKeeper Failover Controllers

Prerequisites

If the following command on the Ambari Server host returns a non-empty items array, ZooKeeper (ZK) Failover Controllers still exist and you must delete them:

curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i <AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/host_components?HostRoles/component_name=ZKFC

To delete the failover controllers:

Steps

1. On the Ambari Server host, issue the following DELETE commands:

curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X DELETE <AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/hosts/<NAMENODE_HOSTNAME>/host_components/ZKFC

curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X DELETE <AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/hosts/<ADDITIONAL_NAMENODE_HOSTNAME>/host_components/ZKFC

2. Verify that the controllers are gone:

curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i <AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/host_components?HostRoles/component_name=ZKFC

This command should return an empty items array.

5.1.2.7. Modify HDFS Configurations

You may need to modify your hdfs-site configuration and/or your core-site configuration.


Note

For Ambari 2.6.0 and higher, configs.sh is not supported and will fail. Use configs.py instead.

Prerequisites

Check whether you need to modify your hdfs-site configuration, by executing the following command on the Ambari Server host:

/var/lib/ambari-server/resources/scripts/configs.py -u <AMBARI_USER> -p <AMBARI_PW> -port <AMBARI_PORT> get localhost <CLUSTER_NAME> hdfs-site

If you see any of the following properties, you must delete them from your configuration.

• dfs.nameservices

• dfs.client.failover.proxy.provider.<NAMESERVICE_ID>

• dfs.ha.namenodes.<NAMESERVICE_ID>

• dfs.ha.fencing.methods

• dfs.ha.automatic-failover.enabled

• dfs.namenode.http-address.<NAMESERVICE_ID>.nn1

• dfs.namenode.http-address.<NAMESERVICE_ID>.nn2

• dfs.namenode.rpc-address.<NAMESERVICE_ID>.nn1

• dfs.namenode.rpc-address.<NAMESERVICE_ID>.nn2

• dfs.namenode.shared.edits.dir

• dfs.journalnode.edits.dir

• dfs.journalnode.http-address

• dfs.journalnode.kerberos.internal.spnego.principal

• dfs.journalnode.kerberos.principal

• dfs.journalnode.keytab.file

Where <NAMESERVICE_ID> is the NameService ID you created when you ran the Enable NameNode HA wizard.

To modify your hdfs-site configuration:

Steps

1. On the Ambari Server host, execute the following for each property you found:


/var/lib/ambari-server/resources/scripts/configs.py -u <AMBARI_USER> -p <AMBARI_PW> -port <AMBARI_PORT> delete localhost <CLUSTER_NAME> hdfs-site property_name

Replace property_name with the name of each of the properties to be deleted.
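Alternatively, assuming the environment variables from the rollback preparation are still set, a loop such as the following sketch deletes several properties in one pass; extend the list to cover every property you found:

for prop in dfs.nameservices dfs.ha.fencing.methods dfs.ha.automatic-failover.enabled; do
  /var/lib/ambari-server/resources/scripts/configs.py -u "$AMBARI_USER" -p "$AMBARI_PW" -port "$AMBARI_PORT" delete localhost "$CLUSTER_NAME" hdfs-site "$prop"
done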

2. Verify that all of the properties have been deleted:

/var/lib/ambari-server/resources/scripts/configs.py -u <AMBARI_USER> -p <AMBARI_PW> -port <AMBARI_PORT> get localhost <CLUSTER_NAME> hdfs-site

None of the properties listed above should be present.

3. Determine whether you must modify your core-site configuration:

/var/lib/ambari-server/resources/scripts/configs.py -u <AMBARI_USER> -p <AMBARI_PW> -port <AMBARI_PORT> get localhost <CLUSTER_NAME> core-site

4. If you see the property ha.zookeeper.quorum, delete it:

/var/lib/ambari-server/resources/scripts/configs.py -u <AMBARI_USER> -p <AMBARI_PW> -port <AMBARI_PORT> delete localhost <CLUSTER_NAME> core-site ha.zookeeper.quorum

5. If the property fs.defaultFS is set to the NameService ID, it must be reverted to its non-HA value. For example, "fs.defaultFS":"hdfs://<NAMESERVICE_ID>" must be modified because it points to a NameService ID; "fs.defaultFS":"hdfs://<namenode-hostname>" does not need to change, because it points to a specific NameNode, not to a NameService ID.

6. Revert the property fs.defaultFS to the NameNode host value:

/var/lib/ambari-server/resources/scripts/configs.py -u <AMBARI_USER> -p <AMBARI_PW> -port <AMBARI_PORT> set localhost <CLUSTER_NAME> core-site fs.defaultFS hdfs://<NAMENODE_HOSTNAME>

7. Verify that the core-site properties are now properly set:

/var/lib/ambari-server/resources/scripts/configs.py -u <AMBARI_USER> -p <AMBARI_PW> -port <AMBARI_PORT> get localhost <CLUSTER_NAME> core-site

The property fs.defaultFS should be the NameNode host and the property ha.zookeeper.quorum should not appear.

5.1.2.8. Re-create the Secondary NameNode

You may need to recreate your secondary NameNode.


Prerequisites

Check whether you need to recreate the secondary NameNode, on the Ambari Server host:

curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X GET <AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/host_components?HostRoles/component_name=SECONDARY_NAMENODE

If this returns an empty items array, you must recreate your secondary NameNode. Otherwise you can proceed to re-enable your secondary NameNode.

To recreate your secondary NameNode:

Steps

1. On the Ambari Server host, run the following command:

curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X POST -d '{"host_components" : [{"HostRoles":{"component_name":"SECONDARY_NAMENODE"}}] }' <AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/hosts?Hosts/host_name=<SECONDARY_NAMENODE_HOSTNAME>

2. Verify that the secondary NameNode now exists. On the Ambari Server host, run the following command:

curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X GET <AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/host_components?HostRoles/component_name=SECONDARY_NAMENODE

This should return a non-empty items array containing the secondary NameNode.

More Information

Re-enable the Secondary NameNode [58]

5.1.2.9. Re-enable the Secondary NameNode

To re-enable the secondary NameNode:

Steps

1. Run the following commands on the Ambari Server host:

curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X PUT -d '{"RequestInfo":{"context":"Enable Secondary NameNode"},"Body":{"HostRoles":{"state":"INSTALLED"}}}' <AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/hosts/<SECONDARY_NAMENODE_HOSTNAME>/host_components/SECONDARY_NAMENODE

2. Analyze the output:


• If this returns 200, proceed to delete all JournalNodes.

• If it returns 202, wait a few minutes, and then run the following command:

curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X GET "<AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/host_components?HostRoles/component_name=SECONDARY_NAMENODE&fields=HostRoles/state"

Wait for the response "state" : "INSTALLED" before proceeding.

More Information

Delete All JournalNodes [59]

5.1.2.10. Delete All JournalNodes

You may need to delete JournalNodes.

Prerequisites

Check to see if you need to delete JournalNodes, on the Ambari Server host:

curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X GET <AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/host_components?HostRoles/component_name=JOURNALNODE

If this returns an empty items array, you can go on to Delete the Additional NameNode. Otherwise you must delete the JournalNodes.

To delete the JournalNodes:

Steps

1. On the Ambari Server host, run the following command:

curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X DELETE <AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/hosts/<JOURNALNODE1_HOSTNAME>/host_components/JOURNALNODE

curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X DELETE <AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/hosts/<JOURNALNODE2_HOSTNAME>/host_components/JOURNALNODE

curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X DELETE <AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/hosts/<JOURNALNODE3_HOSTNAME>/host_components/JOURNALNODE

2. Verify that all the JournalNodes have been deleted. On the Ambari Server host:

curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X GET <AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/host_components?HostRoles/component_name=JOURNALNODE

This should return an empty items array.

More Information

Delete the Additional NameNode [60]

Delete All JournalNodes [59]

5.1.2.11. Delete the Additional NameNode

You may need to delete your Additional NameNode.

Prerequisites

Check to see if you need to delete your Additional NameNode, on the Ambari Server host:

curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X GET <AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/host_components?HostRoles/component_name=NAMENODE

If the items array contains two NameNodes, the Additional NameNode must be deleted.

To delete the Additional NameNode that was set up for HA:

Steps

1. On the Ambari Server host, run the following command:

curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X DELETE <AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/hosts/<ADDITIONAL_NAMENODE_HOSTNAME>/host_components/NAMENODE

2. Verify that the Additional NameNode has been deleted:

curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X GET <AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/host_components?HostRoles/component_name=NAMENODE

This should return an items array that shows only one NameNode.

5.1.2.12. Verify the HDFS Components

Before starting HDFS, verify that you have the correct components:

1. Go to Ambari Web UI > Services; then select HDFS.

2. Check the Summary panel and ensure that the first three lines look like this:

• NameNode

• SNameNode


• DataNodes

You should not see a line for JournalNodes.

5.1.2.13. Start HDFS

1. In the Ambari Web UI, click Service Actions, then click Start.

2. If the progress bar does not show that the service has completely started and has passed the service checks, repeat Step 1.

3. To start all of the other services, click Actions > Start All in the Services navigation panel.

5.1.3. Managing Journal Nodes

After you enable NameNode high availability in your cluster, you must maintain at least three active JournalNodes. You can use the Manage JournalNode wizard to assign, add, or remove JournalNodes on hosts in your cluster. The Manage JournalNode wizard enables you to assign JournalNodes, review and confirm required configuration changes, and restart all components in the cluster to take advantage of the changes made to JournalNode placement and configuration.

Please note that this wizard will restart all cluster services.

Prerequisites

• NameNode high availability must be enabled in your cluster

To manage JournalNodes in your cluster:

Steps

1. In Ambari Web, select Services > HDFS > Summary.

2. Click Service Actions, then click Manage JournalNodes.

3. On the Assign JournalNodes page, make assignments by clicking the + and - icons and selecting host names in the drop-down menus. The Assign JournalNodes page enables you to maintain three current JournalNodes by updating each time you make an assignment.

When you complete your assignments, click Next.

4. On the Review page, verify the summary of your JournalNode host assignments and the related configuration changes. When you are satisfied that all assignments match your intentions, click Next:

5. Using a remote shell, complete the steps on the Save Namespace page. When you have successfully created a checkpoint, click Next:


6. On the Add/Remove JournalNodes page, monitor the progress bars, then click Next:

7. Follow the instructions on the Manual Steps Required: Format JournalNodes page and then click Next:


8. In the remote shell, confirm that you want to initialize JournalNodes by entering Y at the following prompt:

Re-format filesystem in QJM to [host.ip.address.1, host.ip.address.2, host.ip.address.3,] ? (Y or N) Y

9. On the Start Active NameNodes page, monitor the progress bars as services re-start; then click Next:

10. On the Manual Steps Required: Bootstrap Standby NameNode page, complete each step, using the instructions on the page, and then click Next.

11. In the remote shell, confirm that you want to bootstrap the standby NameNode by entering Y at the following prompt:

RE-format filesystem in Storage Directory /grid/0/hadoop/hdfs/namenode ? (Y or N) Y

12. On the Start All Services page, monitor the progress bars as the wizard starts all services, then click Done to finish the wizard.

After the Ambari Web UI reloads, you may see some alert notifications. Wait a few minutes until all the services restart and alerts clear.

13. Restart any components using Ambari Web, if necessary.

Next Steps

Review and confirm all recommended configuration changes.

More Information


Review and Confirm Configuration Changes [82]

Configuring NameNode High Availability [46]

5.2. ResourceManager High Availability

If you are working in an HDP 2.2 or later environment, you can configure high availability for ResourceManager by using the Enable ResourceManager HA wizard.

Prerequisites

You must have at least three:

• hosts in your cluster

• Apache ZooKeeper servers running

More Information

Configure ResourceManager High Availability [66]

Disable ResourceManager High Availability [67]

5.2.1. Configure ResourceManager High Availability

To access the wizard and configure ResourceManager high availability:

Steps

1. In Ambari Web, browse to Services > YARN > Summary.

2. Select Service Actions and choose Enable ResourceManager HA.

The Enable ResourceManager HA wizard launches, describing a set of automated and manual steps that you must take to set up ResourceManager high availability.

3. On the Get Started page, read the overview of enabling ResourceManager HA; then click Next to proceed:

4. On the Select Host page, accept the default selection, or choose an available host, then click Next to proceed.


5. On the Review Selections page, expand YARN, if necessary, to review all the configuration changes proposed for YARN. Click Next to approve the changes and start automatically configuring ResourceManager HA.

6. On the Configure Components page, click Complete when all the progress bars finish.

More Information

Disable ResourceManager High Availability [67]

5.2.2. Disable ResourceManager High Availability

To disable ResourceManager high availability, you must delete one ResourceManager and keep one ResourceManager. This requires using the Ambari API to modify the cluster configuration to delete the ResourceManager and using the ZooKeeper client to update the znode permissions.

Prerequisites

Because these steps involve using the Ambari REST API, you should test and verify them in a test environment prior to executing against a production environment.

To disable ResourceManager high availability:


Steps

1. In Ambari Web, stop YARN and ZooKeeper services.

2. On the Ambari Server host, use the Ambari API to retrieve the YARN configurations into a JSON file:

Note

For Ambari 2.6.0 and higher, configs.sh is not supported and will fail. Use configs.py instead.

/var/lib/ambari-server/resources/scripts/configs.py get ambari.server cluster.name yarn-site yarn-site.json

In this example, ambari.server is the hostname of your Ambari Server and cluster.name is the name of your cluster.

3. In the yarn-site.json file, change yarn.resourcemanager.ha.enabled to false and delete the following properties:

• yarn.resourcemanager.ha.rm-ids

• yarn.resourcemanager.hostname.rm1

• yarn.resourcemanager.hostname.rm2

• yarn.resourcemanager.webapp.address.rm1

• yarn.resourcemanager.webapp.address.rm2

• yarn.resourcemanager.webapp.https.address.rm1

• yarn.resourcemanager.webapp.https.address.rm2

• yarn.resourcemanager.cluster-id

• yarn.resourcemanager.ha.automatic-failover.zk-base-path

4. Verify that the following properties in the yarn-site.json file are set to the ResourceManager hostname you are keeping:

• yarn.resourcemanager.hostname

• yarn.resourcemanager.admin.address

• yarn.resourcemanager.webapp.address

• yarn.resourcemanager.resource-tracker.address

• yarn.resourcemanager.scheduler.address

• yarn.resourcemanager.webapp.https.address

• yarn.timeline-service.webapp.address

• yarn.timeline-service.webapp.https.address

• yarn.timeline-service.address

• yarn.log.server.url

5. Search the yarn-site.json file and remove any references to the ResourceManager hostname that you are removing.

6. Search the yarn-site.json file and remove any properties that might still be set for ResourceManager IDs: for example, rm1 and rm2.
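For steps 5 and 6, a quick search of the file helps confirm that nothing was missed; for example (old-rm-host.example.com is a hypothetical stand-in for the ResourceManager host being removed):

grep -nE 'rm1|rm2|old-rm-host.example.com' yarn-site.json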

7. Save the yarn-site.json file and set that configuration against the Ambari Server:

/var/lib/ambari-server/resources/scripts/configs.py set ambari.server cluster.name yarn-site yarn-site.json

8. Using the Ambari API, delete the ResourceManager host component for the host that you are deleting:

curl --user admin:admin -i -H "X-Requested-By: ambari" -X DELETE http://ambari.server:8080/api/v1/clusters/cluster.name/hosts/hostname/host_components/RESOURCEMANAGER

9. In Ambari Web, start the ZooKeeper service.

10. On a host that has the ZooKeeper client installed, use the ZooKeeper client to change znode permissions:

/usr/hdp/current/zookeeper-client/bin/zkCli.sh
getAcl /rmstore/ZKRMStateRoot
setAcl /rmstore/ZKRMStateRoot world:anyone:rwcda

11. In Ambari Web, restart the ZooKeeper service and start the YARN service.

Next Steps

Review and confirm all recommended configuration changes.

More Information

Review and Confirm Configuration Changes [82]

5.3. HBase High Availability

To help you achieve redundancy for high availability in a production environment, Apache HBase supports deployment of multiple HBase Masters in a cluster. If you are working in a Hortonworks Data Platform (HDP) 2.2 or later environment, Apache Ambari enables simple setup of multiple HBase Masters.

During the Apache HBase service installation and depending on your component assignment, Ambari installs and configures one HBase Master component and multiple RegionServer components. To configure high availability for the HBase service, you can run two or more HBase Master components. HBase uses ZooKeeper for coordination of the active Master in a cluster running two or more HBase Masters. This means that when the primary HBase Master fails, the client is automatically routed to the secondary Master.

Set Up Multiple HBase Masters Through Ambari

Hortonworks recommends that you use Ambari to configure multiple HBase Masters. Complete the following tasks:

Add a Secondary HBase Master to a New Cluster

When installing HBase, click the "+" sign displayed to the right of the existing HBase Master name, and then select a node on which to deploy a secondary HBase Master:

Add a New HBase Master to an Existing Cluster

1. Log in to the Ambari management interface as a cluster administrator.

2. In Ambari Web, browse to Services > HBase.

3. In Service Actions, click + Add HBase Master.

4. Choose the host on which to install the additional HBase master; then click Confirm Add.

Ambari installs the new HBase Master and reconfigures HBase to manage multiple Master instances.

Set Up Multiple HBase Masters Manually

Before you can configure multiple HBase Masters manually, you must configure the first node (node-1) on your cluster by following the instructions in the Installing, Configuring, and Deploying a Cluster section in Apache Ambari Installation Guide. Then, complete the following tasks:

1. Configure Passwordless SSH Access

2. Prepare node-1

3. Prepare node-2 and node-3

4. Start and test your HBase Cluster

Configure Passwordless SSH Access

The first node on the cluster (node-1) must be able to log in to other nodes on the cluster and then back to itself in order to start the daemons. You can accomplish this by using the same user name on all hosts and by using passwordless Secure Shell (SSH) login:


1. On node-1, stop HBase service.

2. On node-1, log in as an HBase user and generate an SSH key pair:

$ ssh-keygen -t rsa

The system prints the location of the key pair to standard output. The default name of the public key is id_rsa.pub.

3. Create a directory to hold the shared keys on the other nodes:

• On node-2, log in as an HBase user and create an .ssh/ directory in your home directory.

• On node-3, repeat the same procedure.

4. Use Secure Copy (scp) or any other standard secure means to copy the public key from node-1 to the other two nodes.

On each node in the cluster, create a new file called .ssh/authorized_keys (if it does not already exist) and append the contents of the id_rsa.pub file to it:

$ cat id_rsa.pub >> ~/.ssh/authorized_keys

Ensure that you do not overwrite your existing .ssh/authorized_keys files by concatenating the new key onto the existing file using the >> operator rather than the > operator.
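Where the ssh-copy-id utility is available, it automates creating the remote .ssh/ directory and appending the key (steps 3 and 4) in one command per node; the hbase user name here is illustrative:

ssh-copy-id -i ~/.ssh/id_rsa.pub hbase@node-2.test.com
ssh-copy-id -i ~/.ssh/id_rsa.pub hbase@node-3.test.com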

5. Use Secure Shell (SSH) from node-1 to either of the other nodes using the same user name.

You should not be prompted for a password.

6. On node-2, repeat Step 5, because it runs as a backup Master.

Prepare node-1

Because node-1 should run your primary Master and ZooKeeper processes, you must stop the RegionServer from starting on node-1:

1. Edit conf/regionservers by removing the line that contains localhost and adding lines with the host names or IP addresses for node-2 and node-3.

Note

If you want to run a RegionServer on node-1, you should refer to it by the hostname the other servers would use to communicate with it. For example, for node-1, it is referred to as node-1.test.com.

2. Configure HBase to use node-2 as a backup Master by creating a new file in conf/ called backup-masters, and adding a new line to it with the host name for node-2: for example, node-2.test.com.

3. Configure ZooKeeper on node-1 by editing conf/hbase-site.xml and adding the following properties:


<property>
  <name>hbase.zookeeper.quorum</name>
  <value>node-1.test.com,node-2.test.com,node-3.test.com</value>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/usr/local/zookeeper</value>
</property>

This configuration directs HBase to start and manage a ZooKeeper instance on each node of the cluster. You can learn more about configuring ZooKeeper in the Apache ZooKeeper documentation.

4. Change every reference in your configuration to node-1 as localhost to point to the host name that the other nodes use to refer to node-1: in this example, node-1.test.com.

Prepare node-2 and node-3

Before preparing node-2 and node-3, each node of your cluster must have the same configuration information.

node-2 runs as a backup Master server and a ZooKeeper instance.

1. Download and unpack HBase on node-2 and node-3.

2. Copy the configuration files from node-1 to node-2 and node-3.

3. Copy the contents of the conf/ directory to the conf/ directory on node-2 and node-3.

Start and Test your HBase Cluster

1. Use the jps command to ensure that HBase is not running.

2. Kill HMaster, HRegionServer, and HQuorumPeer processes, if they are running.

3. Start the cluster by running the start-hbase.sh command on node-1.

Your output is similar to this:

$ bin/start-hbase.sh
node-3.test.com: starting zookeeper, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-zookeeper-node-3.test.com.out
node-1.example.com: starting zookeeper, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-zookeeper-node-1.test.com.out
node-2.example.com: starting zookeeper, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-zookeeper-node-2.test.com.out
starting master, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-master-node-1.test.com.out
node-3.test.com: starting regionserver, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-regionserver-node-3.test.com.out
node-2.test.com: starting regionserver, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-regionserver-node-2.test.com.out
node-2.test.com: starting master, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-master-node2.test.com.out


ZooKeeper starts first, followed by the Master, then the RegionServers, and finally the backup Masters.

4. Run the jps command on each node to verify that the correct processes are running on each server.

You might see additional Java processes running on your servers as well, if they are used for any other purposes.

Example 1. node-1 jps Output

$ jps
20355 Jps
20071 HQuorumPeer
20137 HMaster

Example 2. node-2 jps Output

$ jps
15930 HRegionServer
16194 Jps
15838 HQuorumPeer
16010 HMaster

Example 3. node-3 jps Output

$ jps
13901 Jps
13639 HQuorumPeer
13737 HRegionServer

ZooKeeper Process Name

Note

The HQuorumPeer process is a ZooKeeper instance which is controlled and started by HBase. If you use ZooKeeper this way, it is limited to one instance per cluster node and is appropriate for testing only. If ZooKeeper is run outside of HBase, the process is called QuorumPeer. For more about ZooKeeper configuration, including using an external ZooKeeper instance with HBase, see the zookeeper section of the Apache HBase documentation.

5. Browse to the Web UI and test your new connections.

You should be able to connect to the UI for the Master http://node-1.test.com:16010/ or the secondary master at http://node-2.test.com:16010/. If you can connect through localhost but not from another host, check your firewall rules. You can see the web UI for each of the RegionServers at port 16030 of their IP addresses, or by clicking their links in the web UI for the Master.

Web UI Port Changes

Note

In HBase newer than 0.98.x, the HTTP ports used by the HBase Web UI changed from 60010 for the Master and 60030 for each RegionServer to 16010 for the Master and 16030 for the RegionServer.

5.4. Hive High Availability

The Apache Hive service has multiple, associated components. The primary Hive components are Hive Metastore and HiveServer2. You can configure high availability for the Hive service in HDP 2.2 or later by running two or more of each of those components. The relational database that backs the Hive Metastore should also be made highly available, using best practices defined for the database system in use and after consultation with your in-house DBA.

More Information

Adding a Hive Metastore Component [74]

5.4.1. Adding a Hive Metastore Component

Prerequisites

If you have ACID enabled in Hive, ensure that the Run Compactor setting is enabled (set to True) on only one Hive metastore host.

Steps

1. In Ambari Web, browse to Services > Hive.

2. In Service Actions, click the + Add Hive Metastore option.

3. Choose the host to install the additional Hive Metastore; then click Confirm Add.

4. Ambari installs the component and reconfigures Hive to handle multiple Hive Metastore instances.

Next Steps

Review and confirm all recommended configuration changes.

More Information

Review and Confirm Configuration Changes [82]

Using Host Config Groups

5.4.2. Adding a HiveServer2 Component

Steps

1. In Ambari Web, browse to the host on which you want to install another HiveServer2 component.


2. On the Host page, click +Add.

3. Click HiveServer2 from the list.

Ambari installs the additional HiveServer2.

Next Steps

Review and confirm all recommended configuration changes.

More Information

Review and Confirm Configuration Changes [82]

5.4.3. Adding a WebHCat Server

Steps

1. In Ambari Web, browse to the host on which you want to install another WebHCat Server.

2. On the Host page, click +Add.

3. Click WebHCat from the list.

Ambari installs the new server and reconfigures Hive to manage multiple Metastore instances.

Next Steps

Review and confirm all recommended configuration changes.

More Information

Review and Confirm Configuration Changes [82]

5.5. Storm High Availability

In HDP 2.3 or later, you can configure high availability for the Apache Storm Nimbus server by adding a Nimbus component from Ambari.

5.5.1. Adding a Nimbus Component

Steps

1. In Ambari Web, browse to Services > Storm.

2. In Service Actions, click the + Add Nimbus option.

3. Click the host on which to install the additional Nimbus; then click Confirm Add.

Ambari installs the component and reconfigures Storm to handle multiple Nimbus instances.


Next Steps

Review and confirm all recommended configuration changes.

More Information

Review and Confirm Configuration Changes [82]

5.6. Oozie High Availability

To set up high availability for the Oozie service in HDP 2.2 or later, you can run two or more instances of the Oozie Server component.

Prerequisites

• The relational database that backs the Oozie Server should also be made highly available, using best practices defined for the database system in use and after consultation with your in-house DBA. Using the default installed Derby database instance is not supported with multiple Oozie Server instances; therefore, you must use an existing relational database. When using Derby for the Oozie Server, you do not have the option to add Oozie Server components to your cluster.

• High availability for Oozie requires the use of an external virtual IP address or load balancer to direct traffic to the Oozie servers.

More Information

Adding an Oozie Server Component [76]

5.6.1. Adding an Oozie Server Component

Steps

1. In Ambari Web, browse to the host on which you want to install another Oozie Server.

2. On the Host page, click the +Add button.

3. Click Oozie Server from the list.

Ambari installs the new Oozie Server.

4. Configure your external load balancer and then update the Oozie configuration.

5. Browse to Services > Oozie > Configs.

6. In oozie-site, add the following property values:

oozie.zookeeper.connection.string
List of ZooKeeper hosts with ports: for example,
c6401.ambari.apache.org:2181,c6402.ambari.apache.org:2181,c6403.ambari.apache.org:2181

oozie.services.ext
org.apache.oozie.service.ZKLocksService,org.apache.oozie.service.ZKXLogStreamingService,org.apache.oozie.service.ZKJobsConcurrencyService

oozie.base.url
http://<loadbalancer.hostname>:11000/oozie

7. In oozie-env, uncomment the oozie_base_url property and change its value to point to the load balancer:

export oozie_base_url="http://<loadbalancer.hostname>:11000/oozie"

8. Restart Oozie.

9. Update the HDFS configuration properties for the Oozie proxy user:

a. Browse to Services > HDFS > Configs.

b. In core-site, update the hadoop.proxyuser.oozie.hosts property to include the newly added Oozie Server host.

Use commas to separate multiple host names.
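For example, if the cluster previously listed one Oozie host, the updated value might look like the following (host names are illustrative):

hadoop.proxyuser.oozie.hosts=oozie1.example.com,oozie2.example.com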

10. Restart services.

Next Steps

Review and Confirm Configuration Changes [82]

More Information

Enabling the Oozie UI [44]

5.7. Apache Atlas High Availability

Prerequisites

In Ambari 2.4.0.0, adding or removing Atlas Metadata Servers requires manually editing the atlas.rest.address property.

Steps

1. Click Hosts on the Ambari dashboard; then select the host on which to install the standby Atlas Metadata Server.

2. On the Summary page of the new Atlas Metadata Server host, click Add > Atlas Metadata Server and add the new Atlas Metadata Server.

Ambari adds the new Atlas Metadata Server in a Stopped state.

3. Click Atlas > Configs > Advanced.


4. Click Advanced application-properties and append the atlas.rest.address property with a comma and the value for the new Atlas Metadata Server: ,http(s)://<host_name>:<port_number>.

The default protocol is "http". If the atlas.enableTLS property is set to true, use "https". Also, the default HTTP port is 21000 and the default HTTPS port is 21443. These values can be overridden using the atlas.server.http.port and atlas.server.https.port properties, respectively.
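For example, with two Atlas Metadata Servers and TLS disabled, the resulting value might look like the following (host names are illustrative):

atlas.rest.address=http://atlas1.example.com:21000,http://atlas2.example.com:21000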

5. Stop all of the Atlas Metadata Servers that are currently running.

Important

You must use the Stop command to stop the Atlas Metadata Servers. Do not use a Restart command: this attempts to first stop the newly added Atlas Server, which at this point does not contain any configurations in /etc/atlas/conf.

6. On the Ambari dashboard, click Atlas > Service Actions > Start.

Ambari automatically configures the following Atlas properties in the /etc/atlas/conf/atlas-application.properties file:

• atlas.server.ids

• atlas.server.address.$id

• atlas.server.ha.enabled

7. To refresh the configuration files, restart the following services that contain Atlas hooks:

• Hive

• Storm

• Falcon

• Oozie

8. Click Actions > Restart All Required to restart all services that require a restart.

When you update the Atlas configuration settings in Ambari, Ambari marks the services that require restart.

9. Click Oozie > Service Actions > Restart All to restart Oozie along with the other services.

Apache Oozie requires a restart after an Atlas configuration update, but may not be included in the services marked as requiring restart in Ambari.

Next Steps

Review and confirm all recommended configuration changes.


More Information

Review and Confirm Configuration Changes [82]

5.8. Enabling Ranger Admin High Availability

You can configure Ranger Admin high availability (HA) with or without SSL on an Ambari-managed cluster. Please note that the configuration settings used in this section are sample values. You should adjust these settings to reflect your environment (folder locations, passwords, file names, and so on).

Prerequisites

Steps

• HTTPD setup for HTTP - Enable Ranger Admin HA with Ambari, begins at step 16.

• HTTPD setup for HTTPS - Enable Ranger Admin HA with Ambari, begins at step 14.


6. Managing Configurations

You can optimize performance of Hadoop components in your cluster by adjusting configuration settings and property values. You can also use Ambari Web to set up and manage groups and versions of configuration settings in the following ways:

• Changing Configuration Settings [80]

• Manage Host Config Groups [84]

• Configuring Log Settings [87]

• Set Service Configuration Versions [89]

• Download Client Configuration Files [94]

More Information

Adjust Smart Config Settings [81]

Edit Specific Properties [82]

Review and Confirm Configuration Changes [82]

Restart Components [84]

6.1. Changing Configuration Settings

You can optimize service performance using the Configs page for each service. The Configs page includes several tabs you can use to manage configuration versions, groups, settings, properties, and values. You can adjust settings, called "Smart Configs," that control memory allocation for each service at a macro level. Adjusting Smart Configs requires related configuration settings to change throughout your cluster. Ambari prompts you to review and confirm all recommended changes and restart affected services.

Steps

1. In Ambari Web, click a service name in the service summary list on the left.

2. From the service Summary page, click the Configs tab, then use one of the following tabs to manage configuration settings.

Use the Configs tab to manage configuration versions and groups.

Use the Settings tab to manage "Smart Configs" by adjusting the green slider buttons.

Use the Advanced tab to edit specific configuration properties and values.

3. Click Save.


Next Steps

Enter a description for this version that includes your current changes, review and confirm each recommended change, and then restart all affected services.

More Information

Adjust Smart Config Settings [81]

Edit Specific Properties [82]

Review and Confirm Configuration Changes [82]

Restart Components [84]

6.1.1. Adjust Smart Config Settings

Use the Settings tab to manage "Smart Configs" by adjusting the green slider buttons.

Steps

1. On the Settings tab, click and drag a green-colored slider button to the desired value.


2. Edit values for any properties that display the Override option.

Edited values, also called stale configs, show an Undo option.

3. Click Save.

Next Steps

Enter a description for this version that includes your current changes, review and confirm each recommended change, and then restart all affected services.

More Information

Edit Specific Properties [82]

Review and Confirm Configuration Changes [82]

Restart Components [84]

6.1.2. Edit Specific Properties

Use the Advanced tab of the Configs page for each service to access groups of individual properties that affect performance of that service.

Steps

1. On a service Configs page, click Advanced.

2. On a service Configs Advanced page, expand a category.

3. Edit the value for any property.

Edited values, also called stale configs, show an Undo option.

4. Click Save.

Next Steps

Enter a description for this version that includes your current changes, review and confirm each recommended change, and then restart all affected services.

More Information

Review and Confirm Configuration Changes [82]

Restart Components [84]

6.1.3. Review and Confirm Configuration Changes

When you change a configuration property value, the Ambari Stack Advisor captures and recommends changes to all related configuration properties affected by your original change. Changing a single property (a "Smart Configuration"), and other actions, such as adding or deleting a service, host, or ZooKeeper Server, moving a master, or enabling high availability for a component, all require that you review and confirm related configuration changes. For example, if you increase the Minimum Container Size (Memory) setting for YARN, Dependent Configurations lists all recommended changes that you must review and (optionally) accept.

Types of changes are highlighted in the following colors:

• Value changes: Yellow

• Added properties: Green

• Deleted properties: Red

To review and confirm changes to configuration properties:

Steps

1. In Dependent Configurations, for each listed property review the summary information.

2. If the change is acceptable, proceed to review the next property in the list.

3. If the change is not acceptable, click the check mark in the blue box to the right of the listed property change.

Clicking the check mark clears the box. Changes for which you clear the box are not confirmed and will not occur.

4. After reviewing all listed changes, click OK to confirm that all marked changes occur.

Next Steps

You must restart any components marked for restart to utilize the changes you confirmed.


More Information

Restart Components [84]

6.1.4. Restart Components

After editing and saving configuration changes, a Restart indicator appears next to components that require restarting to use the updated configuration values.

Steps

1. Click the indicated Components or Hosts links to view details about the requested restart.

2. Click Restart and then click the appropriate action.

For example, options to restart YARN components include the following:

More Information

Review and Confirm Configuration Changes [82]

6.2. Manage Host Config Groups

Ambari initially assigns all hosts in your cluster to one default configuration group for each service you install. For example, after deploying a three-node cluster with default configuration settings, each host belongs to one configuration group that has default configuration settings for the HDFS service.

To manage Configuration Groups:

Steps

1. Click a service name, then click Configs.

2. In Configs, click Manage Config Groups.

To create new groups, reassign hosts, and override default settings for host components, you can use the Manage Configuration Groups control:


To create a new configuration group:

Steps

1. In Manage Config Groups, click Create New Configuration Group.

2. Name and describe the group; then choose OK.

To add hosts to the new configuration group:

Steps

1. In Manage Config Groups, click a configuration group name.

2. Click Add Hosts to selected Configuration Group.


3. Using Select Configuration Group Hosts, click Components, then click a component name from the list.

Choosing a component filters the list of hosts to only those on which that component exists for the selected service. To further filter the list of available host names, use the Filter drop-down list. The host list is filtered by IP address, by default.

4. After filtering the list of hosts, click the check box next to each host that you want to include in the configuration group.

5. Choose OK.

6. In Manage Configuration Groups, choose Save.

To edit settings for a configuration group:

Steps

1. In Configs, click a group name.

2. Click a Config Group; then expand components to expose settings that allow Override.

3. Provide a non-default value; then click Override or Save.

Configuration groups enforce configuration properties that allow override, based on installed components for the selected service and group.


4. Override prompts you to choose one of the following options:

a. Either click the name of an existing configuration group (to which the property value override provided in Step 3 applies),

b. Or create a new configuration group (which includes default properties, plus the property override provided in Step 3).

c. Click OK.

5. In Configs, choose Save.

6.3. Configuring Log Settings

Ambari uses sets of properties called Log4j properties to control logging activities for each service running in your Hadoop cluster. Initial, default values for each property reside in a <service_name>-log4j template file. Log4j properties and values that limit the size and number of backup log files for each service appear above the log4j template file. To access the default Log4j settings for a service, in Ambari Web browse to <service_name> > Configs > Advanced <service_name>-log4j. For example, the Advanced yarn-log4j property group for the YARN service looks like:


To change the limits for the size and number of backup log files for a service:

Steps

1. Edit the values for the backup file size and # of backup files properties.

2. Click Save.
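In the log4j template itself, these limits correspond to standard RollingFileAppender settings; the following lines are illustrative, and the appender name and exact property names vary by service template:

log4j.appender.RFA.MaxFileSize=256MB
log4j.appender.RFA.MaxBackupIndex=10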

To customize Log4j settings for a service:

Steps

1. Edit values of any properties in the log4j template.

2. Copy the content of the log4j template file.

3. Browse to the custom log4j properties group.

4. Paste the copied content into the custom log4j properties, overwriting the default content.

5. Click Save.

6. Review and confirm any recommended configuration changes, as prompted.

7. Restart affected services, as prompted.


Restarting components in the service pushes the configuration properties displayed in Custom log4j.properties to each host running components for that service.

If you have customized logging properties that define how activities for each service are logged, you see refresh indicators next to each service name after upgrading to Ambari 1.5.0 or higher. Ensure that the logging properties displayed in Custom log4j.properties include any customization.

Optionally, you can create configuration groups that include custom logging properties.

More Information

Review and Confirm Configuration Changes [82]

Restart Components [84]

Adjust Smart Config Settings [81]

Manage Host Config Groups [84]

6.4. Set Service Configuration Versions

Ambari enables you to manage configurations associated with a service. You can make changes to configurations, see a history of changes, compare and revert changes, and push configuration changes to the cluster hosts.

• Basic Concepts [89]

• Terminology [90]

• Saving a Change [90]

• Viewing History [91]

• Comparing Versions [92]

• Reverting a Change [93]

• Host Config Groups [93]

6.4.1. Basic Concepts

It is important to understand how service configurations are organized and stored in Ambari. Properties are grouped into configuration types. A set of config types composes the set of configurations for a service.


For example, the Hadoop Distributed File System (HDFS) service includes the hdfs-site, core-site, hdfs-log4j, hadoop-env, and hadoop-policy config types. If you browse to Services > HDFS > Configs, you can edit the configuration properties for these config types.

Ambari performs configuration versioning at the service level. Therefore, when you modify a configuration property in a service, Ambari creates a service config version. The following figure shows V1 and V2 of a service config version with a change to a property in Config Type A. After changing a property value in Config Type A in V1, V2 is created.
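
Service config versions are also queryable outside the web UI. As an illustrative sketch (server address, credentials, and cluster name are placeholders), the Ambari REST API lists the versions recorded for a service:

curl -u admin:admin -H 'X-Requested-By: ambari' \
  'http://ambari.server:8080/api/v1/clusters/MyCluster/configurations/service_config_versions?service_name=HDFS'

Each entry in the response corresponds to one service config version (V1, V2, and so on) shown in the web UI.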

6.4.2. Terminology

The following table lists configuration versioning terms and concepts that you should know.

configuration property Configuration property managed by Ambari, such as NameNode heap size or replication factor

configuration type (config type) Group of configuration properties: for example, hdfs-site

service configurations Set of configuration types for a particular service: for example, hdfs-site and core-site as part of the HDFS service configuration

change notes Optional notes to save with a service configuration change

service config version (SCV) A particular version of a configuration for a specific service

host config group (HCG) A set of configuration properties to apply to a specific set of hosts

6.4.3. Saving a Change

1. In Configs, change the value of a configuration property.

2. Choose Save.


3. Optionally, enter notes that describe the change.

4. Click Cancel to continue editing, Discard to leave the control without making any changes, or Save to confirm your change.

6.4.4. Viewing History

You can view your configuration change history in two places in Ambari Web: on the Dashboard page, Config History tab, and on each service page's Configs tab.

The Dashboard > Config History tab shows a table of all versions across all services, with each version number and the date and time the version was created. You can also see which user authored the change, and any notes about the change. Using this table, you can filter, sort, and search across versions.

The Service > Configs tab shows you only the most recent configuration change, although you can use the version scrollbar to see earlier versions. Using this tab enables you to quickly access the most recent changes to a service configuration.

Using this view, you can click any version in the scrollbar to view it, and hover your cursor over it to display an option menu that enables you to compare versions and perform a revert operation, which makes any config version that you select the current version.


6.4.5. Comparing Versions

When browsing the version scroll area on the Services > Configs tab, you can hover your cursor over a version to display options to view, compare, or revert (make current).

To compare two service configuration versions:

Steps

1. Navigate to a specific configuration version: for example, V6.

2. Using the version scrollbar, find the version you want to compare to V6.

For example, if you want to compare V6 to V2, find V2 in the scrollbar.

3. Hover your cursor over V2 to display the option menu, and click Compare.

Ambari displays a comparison of V6 to V2, with an option to revert to V2 (Make V2 Current). Ambari also filters the display to show only changed properties, under the Filter control.


6.4.6. Reverting a Change

You can revert to an older service configuration version by using the Make Current feature. Make Current creates a new service configuration version with the configuration properties from the version you are reverting: effectively, a clone.

After initiating the Make Current operation, you are prompted, on the Make Current Confirmation control, to enter notes for the clone and save it (Make Current). The notes text includes text about the version being cloned.

There are multiple methods to revert to a previous configuration version:

• View a specific version and click Make V* Current.

• Use the version navigation menu and click Make Current.

• Hover your cursor over a version in the version scrollbar and click Make Current.

• Perform a comparison and click Make V* Current.

6.4.7. Host Config Groups

Service configuration versions are scoped to a host config group. For example, changes made in the default group can be compared and reverted in that config group. The same applies to custom config groups.

The following workflow shows multiple host config groups and the service configuration versions created in each config group.


6.5. Download Client Configuration Files

Client configuration files include: .xml files, env-sh scripts, and log4j properties used to configure Hadoop services. For services that include client components (most services except SmartSense and Ambari Metrics Service), you can download the client configuration files associated with that service. You can also download the client configuration files for your entire cluster as a single archive.

To download client configuration files for a single service:

Steps

1. In Ambari Web, browse to the service for which you want the configurations.

2. Click Service Actions.

3. Click Download Client Configs.

Your browser downloads a "tarball" archive containing only the client configuration files for that service to your default, local downloads directory.

4. If prompted to save or open the client configs bundle, click Save File, and then click OK.

To download all client configuration files for your entire cluster:

Steps

1. In Ambari Web, click Actions at the bottom of the service summary list.

2. Click Download Client Configs.

Your browser downloads a "tarball" archive containing all client configuration files for your cluster to your default, local downloads directory.
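
If you need the same archives in a script, client configurations are also available through the Ambari REST API. This is a hedged sketch; the server address, credentials, and cluster name are placeholders, and the format parameter should be verified against your Ambari version:

curl -u admin:admin -H 'X-Requested-By: ambari' \
  'http://ambari.server:8080/api/v1/clusters/MyCluster/services/HDFS/components/HDFS_CLIENT?format=client_config_tar' \
  -o HDFS_CLIENT-configs.tar.gz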


7. Administering the Cluster

Using the Ambari Web Admin options:

Any user can view information about the stack and the versions of each service added to it.

Cluster administrators can

• enable Kerberos security

• regenerate required keytabs

• view service user names and values

• enable auto-start for services

Ambari administrators can

• add new services to the stack

• upgrade the stack to a new version by using the link to the Ambari administration interface

Related Topics

Hortonworks Data Platform Apache Ambari Administration

Using Stack and Versions Information [96]

Viewing Service Accounts [98]

Enabling Kerberos and Regenerating Keytabs [99]

Enable Service Auto-Start [101]

Managing Versions

7.1. Using Stack and Versions Information

The Stack tab includes information about the services installed and available in the cluster stack. Any user can browse the list of services. As an Ambari administrator, you can also click Add Service to start the wizard to install each service into your cluster.


The Versions tab includes information about which version of software is currently installed and running in the cluster. As an Ambari administrator, you can initiate an automated cluster upgrade from this page.


More Information

Adding a Service

Hortonworks Data Platform Apache Ambari Administration

Hortonworks Data Platform Apache Ambari Upgrade

7.2. Viewing Service Accounts

As a Cluster administrator, you can view the list of Service Users and Group accounts used by the cluster services.

Steps

In Ambari Web UI > Admin, click Service Accounts.


More Information

Defining Users and Groups for an HDP 2.x Stack

7.3. Enabling Kerberos and Regenerating Keytabs

As a Cluster administrator, you can enable and manage Kerberos security in your cluster.

Prerequisites

Before enabling Kerberos in your cluster, you must prepare the cluster, as described in Configuring Ambari and Hadoop for Kerberos.

Steps

In the Ambari web UI > Admin menu, click Enable Kerberos to launch the Kerberos wizard.

After Kerberos is enabled, you can regenerate keytabs and disable Kerberos from the Ambari web UI > Admin menu.

More Information

Regenerate Keytabs [100]


Disable Kerberos [100]

Configuring Ambari and Hadoop for Kerberos

7.3.1. Regenerate Keytabs

As a Cluster administrator, you can regenerate the keytabs required to maintain Kerberos security in your cluster.

Prerequisites

Before regenerating keytabs in your cluster:

• your cluster must be Kerberos-enabled

• you must have KDC Admin credentials

Steps

1. Browse to Admin > Kerberos.

2. Click Regenerate Keytabs.

3. Confirm your selection to proceed.

4. Ambari connects to the Kerberos Key Distribution Center (KDC) and regenerates the keytabs for the service and Ambari principals in the cluster. Optionally, you can regenerate keytabs for only those hosts that are missing keytabs: for example, hosts that were not online or available from Ambari when enabling Kerberos.

5. Restart all services.
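
The regeneration can also be triggered through the Ambari REST API. The following is a sketch with placeholder server address, credentials, and cluster name; the regenerate_keytabs parameter accepts all, and a value of missing targets only hosts without keytabs:

curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
  -d '{"Clusters": {"security_type": "KERBEROS"}}' \
  'http://ambari.server:8080/api/v1/clusters/MyCluster?regenerate_keytabs=all'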

More Information

Disable Kerberos [100]

Configuring Ambari and Hadoop for Kerberos

Managing KDC Admin Credentials

7.3.2. Disable Kerberos

As a Cluster administrator, you can disable Kerberos security in your cluster.

Prerequisites

Before disabling Kerberos security in your cluster, your cluster must be Kerberos-enabled.

Steps

1. Browse to Admin > Kerberos.


2. Click Disable Kerberos.

3. Confirm your selection.

Cluster services are stopped and the Ambari Kerberos security settings are reset.

4. To re-enable Kerberos, click Enable Kerberos and follow the wizard.

More Information

Configuring Ambari and Hadoop for Kerberos

7.4. Enable Service Auto-Start

As a Cluster Administrator or Cluster Operator, you can enable each service in your stack to re-start automatically. Enabling auto-start for a service causes the ambari-agent to attempt re-starting service components in a stopped state without manual effort by a user. Auto-Start Services is enabled by default, but only the Ambari Metrics Collector component is set to auto-start by default.

As a first step, you should enable auto-start for the worker nodes in the core Hadoop services; for example, the DataNode and NodeManager components in HDFS and YARN. You should also enable auto-start for all components in the SmartSense service. After enabling auto-start, monitor the operating status of your services on the Ambari Web dashboard. Auto-start attempts do not display as background operations. To diagnose issues with service components that fail to start, check the Ambari agent logs, located at /var/log/ambari-agent.log on the component host.
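
For example, a quick way to check whether the agent has attempted automatic restarts is to search its log for recovery messages. The exact wording of the log entries varies by Ambari version, so treat this as a starting point rather than a fixed interface:

grep -i recover /var/log/ambari-agent.log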

To manage the auto-start status for components in a service:

Steps

1. In Auto-Start Services, click a service name.

2. For at least one component, click the grey area in the Auto-Start Services control to change its status to Enabled.


The green icon to the right of the service name indicates the percentage of components with auto-start enabled for the service.

3. To enable auto-start for all components in the service, click Enable All.

The green icon fills to indicate all components have auto-start enabled for the service.

4. To disable auto-start for all components in the service, click Disable All.

The green icon clears to indicate that all components have auto-start disabled for the service.

5. To clear all pending status changes before saving them, click Discard.

6. When you finish changing your auto-start status settings, click Save.


To disable Auto-Start Services:

Steps

1. In Ambari Web, click Admin > Service Auto-Start.


2. In Service Auto Start Configuration, click the grey area in the Auto-Start Services control to change its status from Enabled to Disabled.

3. Click Save.

More Information

Monitoring Background Operations [38]


8. Managing Alerts and Notifications

Ambari uses a predefined set of seven types of alerts (web, port, metric, aggregate, script, server, and recovery) for each cluster component and host. You can use these alerts to monitor cluster health and to alert other users to help you identify and troubleshoot problems. You can modify alert names, descriptions, and check intervals, and you can disable and re-enable alerts.

You can also create groups of alerts and set up notification targets for each group so that you can notify different parties interested in certain sets of alerts by using different methods.

This section provides you with the following information:

• Understanding Alerts [104]

• Modifying Alerts [106]

• Modifying Alert Check Counts [106]

• Disabling and Re-enabling Alerts [107]

• Tables of Predefined Alerts [107]

• Managing Notifications [118]

• Creating and Editing Notifications [118]

• Creating or Editing Alert Groups [120]

• Dispatching Notifications [121]

• Viewing the Alert Status Log [121]

8.1. Understanding Alerts

Ambari predefines a set of alerts that monitor the cluster components and hosts. Each alert is defined by an alert definition, which specifies the alert type, check interval, and thresholds. When a cluster is created or modified, Ambari reads the alert definitions and creates alert instances for the specific items to monitor in the cluster. For example, if your cluster includes Hadoop Distributed File System (HDFS), there is an alert definition to monitor "DataNode Process". An instance of that alert definition is created for each DataNode in the cluster.

Using Ambari Web, you can browse the list of alerts defined for your cluster by clicking the Alerts tab. You can search and filter alert definitions by current status, by last status change, and by the service the alert definition is associated with (among other things). You can click an alert definition name to view details about that alert, to modify the alert properties (such as check interval and thresholds), and to see the list of alert instances associated with that alert definition.
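
The same definitions are available from the Ambari REST API, which can be convenient for scripting reviews of alert coverage. A sketch with placeholder server address, credentials, and cluster name:

curl -u admin:admin -H 'X-Requested-By: ambari' \
  http://ambari.server:8080/api/v1/clusters/MyCluster/alert_definitions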

Each alert instance reports an alert status, defined by severity. The most common severity levels are OK, WARNING, and CRITICAL, but there are also severities for UNKNOWN and NONE. Alert notifications are sent when alert status changes (for example, status changes from OK to CRITICAL).

More Information

Managing Notifications [118]

Tables of Predefined Alerts [107]

8.1.1. Alert Types

Alert thresholds and the threshold units depend on the type of the alert. The following table lists the types of alerts, their possible statuses, and the units in which thresholds can be configured, when thresholds are configurable:

WEB Alert Type WEB alerts watch a web URL on a given component; the alert status is determined based on the HTTP response code. Therefore, you cannot change which HTTP response codes determine the thresholds for WEB alerts. You can customize the response text for each threshold and the overall web connection timeout. A connection timeout is considered a CRITICAL alert. Threshold units are based on seconds.

The response code and corresponding status for WEB alerts is as follows:

• OK status if the web URL responds with a code under 400.

• WARNING status if the web URL responds with code 400 and above.

• CRITICAL status if Ambari cannot connect to the web URL.

PORT Alert Type PORT alerts check the response time to connect to a given port; the threshold units are based on seconds.

METRIC Alert Type METRIC alerts check the value of a single or multiple metrics (if a calculation is performed). The metric is accessed from a URL endpoint available on a given component. A connection timeout is considered a CRITICAL alert.

The thresholds are adjustable and the units for each threshold depend on the metric. For example, in the case of CPU utilization alerts, the unit is percentage; in the case of RPC latency alerts, the unit is milliseconds.

AGGREGATE Alert Type AGGREGATE alerts aggregate the alert status as a percentage of the alert instances affected. For example, the Percent DataNode Process alert aggregates the DataNode Process alert.

SCRIPT Alert Type SCRIPT alerts execute a script that determines status such as OK, WARNING, or CRITICAL. You can customize the response text and values for the properties and thresholds for the SCRIPT alert.

SERVER Alert Type SERVER alerts execute a server-side runnable class that determines the alert status such as OK, WARNING, or CRITICAL.

RECOVERY Alert Type RECOVERY alerts are handled by the Ambari Agents that are monitoring for process restarts. Alert status OK, WARNING, and CRITICAL are based on the number of times a process is restarted automatically. This is useful for knowing when processes are terminating and being restarted automatically by Ambari.

8.2. Modifying Alerts

General properties for an alert include name, description, check interval, and thresholds.

The check interval defines the frequency with which Ambari checks alert status. For example, "1 minute" value means that Ambari checks the alert status every minute.

The configuration options for thresholds depend on the alert type.

To modify the general properties of alerts:

Steps

1. Browse to the Alerts section in Ambari Web.

2. Find the alert definition and click to view the definition details.

3. Click Edit to modify the name, description, check interval, and thresholds (as applicable).

4. Click Save.

5. Changes take effect on all alert instances at the next check interval.

More Information

Alert Types [105]

8.3. Modifying Alert Check Counts

Ambari enables you to set the number of alert checks to perform before dispatching a notification. If the alert state changes during a check, Ambari attempts to check the condition a specified number of times (the check count) before dispatching a notification.

Alert check counts are not applicable to AGGREGATE alert types. A state change for an AGGREGATE alert results in a notification dispatch.

If your environment experiences transient issues resulting in false alerts, you can increase the check count. In this case, the alert state change is still recorded, but as a SOFT state change. If the alert condition is still triggered after the specified number of checks, the state change is then considered HARD, and notifications are sent.


You generally want to set the check count value globally for all alerts, but you can also override that value for individual alerts if a specific alert experiences transient issues.

To modify the global alert check count:

Steps

1. Browse to the Alerts section in Ambari Web.

2. In the Ambari Web, Actions menu, click Manage Alert Settings.

3. Update the Check Count value.

4. Click Save.

Changes made to the global alert check count might require a few seconds to appear in the Ambari UI for individual alerts.

To override the global alert check count for individual alerts:

Steps

1. Browse to the Alerts section in Ambari Web.

2. Select the alert for which you want to set a specific Check Count.

3. On the right, click the Edit icon next to the Check Count property.

4. Update the Check Count value.

5. Click Save.
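
If you prefer to script this override, alert definitions can be updated through the REST API. This sketch assumes the repeat_tolerance and repeat_tolerance_enabled field names used by the Ambari alert definition resource; verify them against your Ambari version, and note that the definition id (42 here), server address, and credentials are placeholders:

curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
  -d '{"AlertDefinition": {"repeat_tolerance_enabled": true, "repeat_tolerance": 3}}' \
  http://ambari.server:8080/api/v1/clusters/MyCluster/alert_definitions/42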

More Information

Managing Notifications [118]

8.4. Disabling and Re-enabling Alerts

You can optionally disable alerts. When an alert is disabled, no alert instances are in effect and Ambari no longer performs the checks for the alert. Therefore, no alert status changes are recorded and no notifications (that is, no emails or SNMP traps) are dispatched.

1. Browse to the Alerts section in Ambari Web.

2. Find the alert definition. Click the Enabled or Disabled text to enable/disable the alert.

3. Alternatively, you can click on the alert to view the definition details and click Enabled or Disabled to enable/disable the alert.

4. When prompted, confirm that you want to enable or disable the alert.

8.5. Tables of Predefined Alerts

• HDFS Service Alerts [108]

• HDFS HA Alerts [111]


• NameNode HA Alerts [112]

• YARN Alerts [113]

• MapReduce2 Alerts [114]

• HBase Service Alerts [114]

• Hive Alerts [115]

• Oozie Alerts [116]

• ZooKeeper Alerts [116]

• Ambari Alerts [116]

• Ambari Metrics Alerts [117]

• SmartSense Alerts [118]

8.5.1. HDFS Service Alerts

Alert: NameNode Blocks Health
Alert Type: METRIC
Description: This service-level alert is triggered if the number of corrupt or missing blocks exceeds the configured critical threshold.
Potential Causes: Some DataNodes are down and the replicas that are missing blocks are only on those DataNodes. The corrupt or missing blocks are from files with a replication factor of 1. New replicas cannot be created because the only replica of the block is missing.
Possible Remedies: For critical data, use a replication factor of 3. Bring up the failed DataNodes with missing or corrupt blocks. Identify the files associated with the missing or corrupt blocks by running the Hadoop fsck command. Delete the corrupt files and recover them from backup, if one exists.

Alert: NFS Gateway Process
Alert Type: PORT
Description: This host-level alert is triggered if the NFS Gateway process cannot be confirmed as active.
Potential Causes: NFS Gateway is down.
Possible Remedies: Check for a non-operating NFS Gateway in Ambari Web.

Alert: DataNode Storage
Alert Type: METRIC
Description: This host-level alert is triggered if storage capacity is full on the DataNode (90% critical). It checks the DataNode JMX Servlet for the Capacity and Remaining properties.
Potential Causes: Cluster storage is full. If cluster storage is not full, the DataNode is full.
Possible Remedies: If the cluster still has storage, use the load balancer to distribute the data to relatively less-used DataNodes. If the cluster is full, delete unnecessary data or add additional storage by adding either more DataNodes or more or larger disks to the DataNodes. After adding more storage, run the load balancer.

Alert: DataNode Process
Alert Type: PORT
Description: This host-level alert is triggered if the individual DataNode processes cannot be established to be up and listening on the network for the configured critical threshold, in seconds.
Potential Causes: The DataNode process is down or not responding. The DataNode is not down but is not listening to the correct network port/address.
Possible Remedies: Check for non-operating DataNodes in Ambari Web. Check for any errors in the DataNode logs (/var/log/hadoop/hdfs) and restart the DataNode, if necessary. Run the netstat -tuplpn command to check if the DataNode process is bound to the correct network port.

Alert: DataNode Web UI
Alert Type: WEB
Description: This host-level alert is triggered if the DataNode web UI is unreachable.
Potential Causes: The DataNode process is not running.
Possible Remedies: Check whether the DataNode process is running.

Alert: NameNode Host CPU Utilization
Alert Type: METRIC
Description: This host-level alert is triggered if CPU utilization of the NameNode exceeds certain thresholds (200% warning, 250% critical). It checks the NameNode JMX Servlet for the SystemCPULoad property. This information is available only if you are running JDK 1.7.
Potential Causes: Unusually high CPU utilization might be caused by a very unusual job or query workload, but this is generally the sign of an issue in the daemon.
Possible Remedies: Use the top command to determine which processes are consuming excess CPU. Reset the offending process.

Alert: NameNode Web UI
Alert Type: WEB
Description: This host-level alert is triggered if the NameNode web UI is unreachable.
Potential Causes: The NameNode process is not running.
Possible Remedies: Check whether the NameNode process is running.

Alert: Percent DataNodes with Available Space
Alert Type: AGGREGATE
Description: This service-level alert is triggered if the storage is full on a certain percentage of DataNodes (10% warn, 30% critical).
Potential Causes: Cluster storage is full. If cluster storage is not full, the DataNode is full.
Possible Remedies: If the cluster still has storage, use the load balancer to distribute the data to relatively less-used DataNodes. If the cluster is full, delete unnecessary data or increase storage by adding either more DataNodes or more or larger disks to the DataNodes. After adding more storage, run the load balancer.

Alert: Percent DataNodes Available
Alert Type: AGGREGATE
Description: This alert is triggered if the number of non-operating DataNodes in the cluster is greater than the configured critical threshold. This aggregates the DataNode Process alert.
Potential Causes: DataNodes are down. DataNodes are not down but are not listening to the correct network port/address.
Possible Remedies: Check for non-operating DataNodes in Ambari Web. Check for any errors in the DataNode logs (/var/log/hadoop/hdfs) and restart the DataNode hosts/processes. Run the netstat -tuplpn command to check if the DataNode process is bound to the correct network port.

Alert: NameNode RPC Latency
Alert Type: METRIC
Description: This host-level alert is triggered if the NameNode operations RPC latency exceeds the configured critical threshold. Typically, an increase in the RPC processing time increases the RPC queue length, causing the average queue wait time to increase for NameNode operations.
Potential Causes: A job or an application is performing too many NameNode operations.
Possible Remedies: Review the job or the application for potential bugs causing it to perform too many NameNode operations.

Alert: NameNode Last Checkpoint
Alert Type: SCRIPT
Description: This alert triggers if the last time that the NameNode performed a checkpoint was too long ago, or if the number of uncommitted transactions is beyond a certain threshold.
Potential Causes: Too much time elapsed since the last NameNode checkpoint. Uncommitted transactions beyond the threshold.
Possible Remedies: Set NameNode checkpoint. Review the threshold for uncommitted transactions.

Alert: Secondary NameNode Process
Alert Type: WEB
Description: This alert is triggered if the Secondary NameNode process cannot be confirmed to be up and listening on the network. This alert is not applicable when NameNode HA is configured.
Potential Causes: The Secondary NameNode is not running.
Possible Remedies: Check that the Secondary NameNode process is running.

Alert: NameNode Directory Status
Alert Type: METRIC
Description: This alert checks if the NameNode NameDirStatus metric reports a failed directory.
Potential Causes: One or more of the directories are reporting as not healthy.
Possible Remedies: Check the NameNode UI for information about unhealthy directories.

Alert: HDFS Capacity Utilization
Alert Type: METRIC
Description: This service-level alert is triggered if the HDFS capacity utilization exceeds the configured critical threshold (80% warn, 90% critical). It checks the NameNode JMX Servlet for the CapacityUsed and CapacityRemaining properties.
Potential Causes: Cluster storage is full.
Possible Remedies: Delete unnecessary data. Archive unused data. Add more DataNodes. Add more or larger disks to the DataNodes. After adding more storage, run the load balancer.

Alert: DataNode Health Summary
Alert Type: METRIC
Description: This service-level alert is triggered if there are unhealthy DataNodes.
Potential Causes: A DataNode is in an unhealthy state.
Possible Remedies: Check the NameNode UI for the list of non-operating DataNodes.

Alert: HDFS Pending Deletion Blocks
Alert Type: METRIC
Description: This service-level alert is triggered if the number of blocks pending deletion in HDFS exceeds the configured warning and critical thresholds. It checks the NameNode JMX Servlet for the PendingDeletionBlock property.
Potential Causes: A large number of blocks are pending deletion.

Alert: HDFS Upgrade Finalized State
Alert Type: SCRIPT
Description: This service-level alert is triggered if HDFS is not in the finalized state.
Potential Causes: The HDFS upgrade is not finalized.
Possible Remedies: Finalize any upgrade you have in process.

Alert: DataNode Unmounted Data Dir
Alert Type: SCRIPT
Description: This host-level alert is triggered if one of the data directories on a host was previously on a mount point and became unmounted.
Potential Causes: If the mount history file does not exist, an error is reported when a host has one or more mounted data directories as well as one or more unmounted data directories on the root partition. This may indicate that a data directory is writing to the root partition, which is undesirable.
Possible Remedies: Check the data directories to confirm they are mounted as expected.

Alert: DataNode Heap Usage
Alert Type: METRIC
Description: This host-level alert is triggered if heap usage goes past thresholds on the DataNode. It checks the DataNode JMX Servlet for the MemHeapUsedM and MemHeapMaxM properties. The threshold values are percentages.

Alert: NameNode Client RPC Queue Latency
Alert Type: SCRIPT
Description: This service-level alert is triggered if the deviation of RPC queue latency on the client port has grown beyond the specified threshold within a given period. This alert monitors Hourly and Daily periods.

Alert: NameNode Client RPC Processing Latency
Alert Type: SCRIPT
Description: This service-level alert is triggered if the deviation of RPC latency on the client port has grown beyond the specified threshold within a given period. This alert monitors Hourly and Daily periods.

Alert: NameNode Service RPC Queue Latency
Alert Type: SCRIPT
Description: This service-level alert is triggered if the deviation of RPC latency on the DataNode port has grown beyond the specified threshold within a given period. This alert monitors Hourly and Daily periods.

Alert: NameNode Service RPC Processing Latency
Alert Type: SCRIPT
Description: This service-level alert is triggered if the deviation of RPC latency on the DataNode port has grown beyond the specified threshold within a given period. This alert monitors Hourly and Daily periods.

Alert: HDFS Storage Capacity Usage
Alert Type: SCRIPT
Description: This service-level alert is triggered if the increase in storage capacity usage deviation has grown beyond the specified threshold within a given period. This alert monitors Daily and Weekly periods.

Alert: NameNode Heap Usage
Alert Type: SCRIPT
Description: This service-level alert is triggered if the NameNode heap usage deviation has grown beyond the specified threshold within a given period. This alert monitors Daily and Weekly periods.

8.5.2. HDFS HA Alerts

Alert: JournalNode Web UI
Alert Type: WEB
Description: This host-level alert is triggered if the individual JournalNode process cannot be established to be up and listening on the network for the configured critical threshold, given in seconds.
Potential Causes: The JournalNode process is down or not responding. The JournalNode is not down but is not listening to the correct network port/address.
Possible Remedies: Check if the JournalNode process is running.

Alert: NameNode High Availability Health
Alert Type: SCRIPT
Description: This service-level alert is triggered if either the Active NameNode or the Standby NameNode is not running.
Potential Causes: The Active, Standby, or both NameNode processes are down.
Possible Remedies: On each host running NameNode, check for any errors in the logs (/var/log/hadoop/hdfs/) and restart the NameNode host/process using Ambari Web. On each host running NameNode, run the netstat -tuplpn command to check if the NameNode process is bound to the correct network port.

Alert: Percent JournalNodes Available
Alert Type: AGGREGATE
Description: This service-level alert is triggered if the number of down JournalNodes in the cluster is greater than the configured critical threshold (33% warn, 50% crit). It aggregates the results of JournalNode process checks.
Potential Causes: JournalNodes are down. JournalNodes are not down but are not listening to the correct network port/address.
Possible Remedies: Check for dead JournalNodes in Ambari Web.

Alert: ZooKeeper Failover Controller Process
Alert Type: PORT
Description: This alert is triggered if the ZooKeeper Failover Controller process cannot be confirmed to be up and listening on the network.
Potential Causes: The ZKFC process is down or not responding.
Possible Remedies: Check if the ZKFC process is running.

8.5.3. NameNode HA Alerts

Alert: JournalNode Process
Alert Type: WEB
Description: This host-level alert is triggered if the individual JournalNode process cannot be established to be up and listening on the network for the configured critical threshold, given in seconds.
Potential Causes: The JournalNode process is down or not responding. The JournalNode is not down but is not listening to the correct network port/address.
Possible Remedies: Check if the JournalNode process is running.

Alert: NameNode High Availability Health
Alert Type: SCRIPT
Description: This service-level alert is triggered if either the Active NameNode or the Standby NameNode is not running.
Potential Causes: The Active, Standby, or both NameNode processes are down.
Possible Remedies: On each host running NameNode, check for any errors in the logs (/var/log/hadoop/hdfs/) and restart the NameNode host/process using Ambari Web. On each host running NameNode, run the netstat -tuplpn command to check if the NameNode process is bound to the correct network port.

Alert: Percent JournalNodes Available
Alert Type: AGGREGATE
Description: This service-level alert is triggered if the number of down JournalNodes in the cluster is greater than the configured critical threshold (33% warn, 50% crit). It aggregates the results of JournalNode process checks.
Potential Causes: JournalNodes are down. JournalNodes are not down but are not listening to the correct network port/address.
Possible Remedies: Check for non-operating JournalNodes in Ambari Web.

Alert: ZooKeeper Failover Controller Process
Alert Type: PORT
Description: This alert is triggered if the ZooKeeper Failover Controller process cannot be confirmed to be up and listening on the network.
Potential Causes: The ZKFC process is down or not responding.
Possible Remedies: Check if the ZKFC process is running.


8.5.4. YARN Alerts

Alert: App Timeline Web UI
Alert Type: WEB
Description: This host-level alert is triggered if the App Timeline Server Web UI is unreachable.
Potential Causes: The App Timeline Server is down. The App Timeline Service is not down but is not listening to the correct network port/address.
Possible Remedies: Check for a non-operating App Timeline Server in Ambari Web.

Alert: Percent NodeManagers Available
Alert Type: AGGREGATE
Description: This alert is triggered if the number of down NodeManagers in the cluster is greater than the configured critical threshold. It aggregates the results of NodeManager process alert checks.
Potential Causes: NodeManagers are down. NodeManagers are not down but are not listening to the correct network port/address.
Possible Remedies: Check for non-operating NodeManagers. Check for any errors in the NodeManager logs (/var/log/hadoop/yarn) and restart the NodeManager hosts/processes, as necessary. Run the netstat -tuplpn command to check if the NodeManager process is bound to the correct network port.

Alert: ResourceManager Web UI
Alert Type: WEB
Description: This host-level alert is triggered if the ResourceManager Web UI is unreachable.
Potential Causes: The ResourceManager process is not running.
Possible Remedies: Check if the ResourceManager process is running.

Alert: ResourceManager RPC Latency
Alert Type: METRIC
Description: This host-level alert is triggered if the ResourceManager operations RPC latency exceeds the configured critical threshold. Typically, an increase in the RPC processing time increases the RPC queue length, causing the average queue wait time to increase for ResourceManager operations.
Potential Causes: A job or an application is performing too many ResourceManager operations.
Possible Remedies: Review the job or the application for potential bugs causing it to perform too many ResourceManager operations.

Alert: ResourceManager CPU Utilization
Alert Type: METRIC
Description: This host-level alert is triggered if CPU utilization of the ResourceManager exceeds certain thresholds (200% warning, 250% critical). It checks the ResourceManager JMX Servlet for the SystemCPULoad property. This information is only available if you are running JDK 1.7.
Potential Causes: Unusually high CPU utilization can be caused by a very unusual job/query workload, but this is generally the sign of an issue in the daemon.
Possible Remedies: Use the top command to determine which processes are consuming excess CPU. Reset the offending process.

Alert: NodeManager Web UI
Alert Type: WEB
Description: This host-level alert is triggered if the NodeManager process cannot be established to be up and listening on the network for the configured critical threshold, given in seconds.
Potential Causes: The NodeManager process is down or not responding. The NodeManager is not down but is not listening to the correct network port/address.
Possible Remedies: Check if the NodeManager is running. Check for any errors in the NodeManager logs (/var/log/hadoop/yarn) and restart the NodeManager, if necessary.

Alert: NodeManager Health Summary
Alert Type: SCRIPT
Description: This host-level alert checks the node health property available from the NodeManager component.
Potential Causes: The NodeManager Health Check script reports issues or is not configured.
Possible Remedies: Check in the NodeManager logs (/var/log/hadoop/yarn) for health check errors and restart the NodeManager if necessary. Check in the ResourceManager UI logs (/var/log/hadoop/yarn) for health check errors.

Alert: NodeManager Health
Alert Type: SCRIPT
Description: This host-level alert checks the nodeHealthy property available from the NodeManager component.
Potential Causes: The NodeManager process is down or not responding.
Possible Remedies: Check in the NodeManager logs (/var/log/hadoop/yarn) for health check errors and restart the NodeManager if necessary.

8.5.5. MapReduce2 Alerts

Alert: History Server Web UI
Alert Type: WEB
Description: This host-level alert is triggered if the HistoryServer Web UI is unreachable.
Potential Causes: The HistoryServer process is not running.
Possible Remedies: Check if the HistoryServer process is running.

Alert: History Server RPC Latency
Alert Type: METRIC
Description: This host-level alert is triggered if the HistoryServer operations RPC latency exceeds the configured critical threshold. Typically, an increase in the RPC processing time increases the RPC queue length, causing the average queue wait time to increase for HistoryServer operations.
Potential Causes: A job or an application is performing too many HistoryServer operations.
Possible Remedies: Review the job or the application for potential bugs causing it to perform too many HistoryServer operations.

Alert: History Server CPU Utilization
Alert Type: METRIC
Description: This host-level alert is triggered if the percent of CPU utilization on the HistoryServer exceeds the configured critical threshold.
Potential Causes: Unusually high CPU utilization can be caused by a very unusual job/query workload, but this is generally the sign of an issue in the daemon.
Possible Remedies: Use the top command to determine which processes are consuming excess CPU. Reset the offending process.

Alert: History Server Process
Alert Type: PORT
Description: This host-level alert is triggered if the HistoryServer process cannot be established to be up and listening on the network for the configured critical threshold, given in seconds.
Potential Causes: The HistoryServer process is down or not responding. The HistoryServer is not down but is not listening to the correct network port/address.
Possible Remedies: Check that the HistoryServer is running. Check for any errors in the HistoryServer logs (/var/log/hadoop/mapred) and restart the HistoryServer, if necessary.

8.5.6. HBase Service Alerts

Alert: Percent RegionServers Available
Description: This service-level alert is triggered if the configured percentage of RegionServer processes cannot be determined to be up and listening on the network for the configured critical threshold. The default setting is 10% to produce a WARN alert and 30% to produce a CRITICAL alert. It aggregates the results of RegionServer process down checks.
Potential Causes: Misconfiguration or less-than-ideal configuration caused the RegionServers to crash. Cascading failures brought on by some workload caused the RegionServers to crash. The RegionServers shut themselves down because there were problems in the dependent services, ZooKeeper or HDFS. GC paused the RegionServer for too long and the RegionServers lost contact with ZooKeeper.
Possible Remedies: Check the dependent services to make sure they are operating correctly. Look at the RegionServer log files (usually /var/log/hbase/*.log) for further information. If the failure was associated with a particular workload, try to understand the workload better. Restart the RegionServers.

Alert: HBase Master Process
Description: This alert is triggered if the HBase master processes cannot be confirmed to be up and listening on the network for the configured critical threshold, given in seconds.
Potential Causes: The HBase master process is down. The HBase master has shut itself down because there were problems in the dependent services, ZooKeeper or HDFS.
Possible Remedies: Check the dependent services. Look at the master log files (usually /var/log/hbase/*.log) for further information. Look at the configuration files (/etc/hbase/conf). Restart the master.

Alert: HBase Master CPU Utilization
Description: This host-level alert is triggered if CPU utilization of the HBase Master exceeds certain thresholds (200% warning, 250% critical). It checks the HBase Master JMX Servlet for the SystemCPULoad property. This information is only available if you are running JDK 1.7.
Potential Causes: Unusually high CPU utilization can be caused by a very unusual job/query workload, but this is generally the sign of an issue in the daemon.
Possible Remedies: Use the top command to determine which processes are consuming excess CPU. Reset the offending process.

Alert: RegionServers Health Summary
Description: This service-level alert is triggered if there are unhealthy RegionServers.
Potential Causes: The RegionServer process is down on the host. The RegionServer process is up and running but not listening on the correct network port (default 60030).
Possible Remedies: Check for dead RegionServers in Ambari Web.

Alert: HBase RegionServer Process
Description: This host-level alert is triggered if the RegionServer processes cannot be confirmed to be up and listening on the network for the configured critical threshold, given in seconds.
Potential Causes: The RegionServer process is down on the host. The RegionServer process is up and running but not listening on the correct network port (default 60030).
Possible Remedies: Check for any errors in the logs (/var/log/hbase/) and restart the RegionServer process using Ambari Web. Run the netstat -tuplpn command to check if the RegionServer process is bound to the correct network port.

8.5.7. Hive Alerts

Alert: HiveServer2 Process
Description: This host-level alert is triggered if the HiveServer cannot be determined to be up and responding to client requests.
Potential Causes: The HiveServer2 process is not running. The HiveServer2 process is not responding.
Possible Remedies: Using Ambari Web, check the status of the HiveServer2 component. Stop and then restart it.

Alert: HiveMetastore Process
Description: This host-level alert is triggered if the Hive Metastore process cannot be determined to be up and listening on the network for the configured critical threshold, given in seconds.
Potential Causes: The Hive Metastore service is down. The database used by the Hive Metastore is down. The Hive Metastore host is not reachable over the network.
Possible Remedies: Using Ambari Web, stop the Hive service and then restart it.

Alert: WebHCat Server Status
Description: This host-level alert is triggered if the WebHCat server cannot be determined to be up and responding to client requests.
Potential Causes: The WebHCat server is down. The WebHCat server is hung and not responding. The WebHCat server is not reachable over the network.
Possible Remedies: Restart the WebHCat server using Ambari Web.

8.5.8. Oozie Alerts

Alert: Oozie Server Web UI
Description: This host-level alert is triggered if the Oozie server Web UI is unreachable.
Potential Causes: The Oozie server is down. The Oozie server is not down but is not listening to the correct network port/address.
Possible Remedies: Check for a dead Oozie Server in Ambari Web.

Alert: Oozie Server Status
Description: This host-level alert is triggered if the Oozie server cannot be determined to be up and responding to client requests.
Potential Causes: The Oozie server is down. The Oozie server is hung and not responding. The Oozie server is not reachable over the network.
Possible Remedies: Restart the Oozie service using Ambari Web.

8.5.9. ZooKeeper Alerts

Alert: Percent ZooKeeper Servers Available
Alert Type: AGGREGATE
Description: This service-level alert is triggered if the configured percentage of ZooKeeper processes cannot be determined to be up and listening on the network for the configured critical threshold, given in seconds. It aggregates the results of ZooKeeper process checks.
Potential Causes: The majority of your ZooKeeper servers are down and not responding.
Possible Remedies: Check the dependent services to make sure they are operating correctly. Check the ZooKeeper logs (/var/log/hadoop/zookeeper.log) for further information. If the failure was associated with a particular workload, try to understand the workload better. Restart the ZooKeeper servers from the Ambari UI.

Alert: ZooKeeper Server Process
Alert Type: PORT
Description: This host-level alert is triggered if the ZooKeeper server process cannot be determined to be up and listening on the network for the configured critical threshold, given in seconds.
Potential Causes: The ZooKeeper server process is down on the host. The ZooKeeper server process is up and running but not listening on the correct network port (default 2181).
Possible Remedies: Check for any errors in the ZooKeeper logs (/var/log/hbase/) and restart the ZooKeeper process using Ambari Web. Run the netstat -tuplpn command to check if the ZooKeeper server process is bound to the correct network port.

8.5.10. Ambari Alerts

Alert: Host Disk Usage
Alert Type: SCRIPT
Description: This host-level alert is triggered if the amount of disk space used on a host goes above specific thresholds (50% warn, 80% crit).
Potential Causes: The amount of free disk space left is low.
Possible Remedies: Check the host for disk space to free, or add more storage.

Alert: Ambari Agent Heartbeat
Alert Type: SERVER
Description: This alert is triggered if the server has lost contact with an agent.
Potential Causes: The Ambari Server host is unreachable from the Agent host. The Ambari Agent is not running.
Possible Remedies: Check the connection from the Agent host to the Ambari Server. Check that the Agent is running.

Alert: Ambari Server Alerts
Alert Type: SERVER
Description: This alert is triggered if the server detects that there are alerts which have not run in a timely manner.
Potential Causes: Agents are not reporting alert status. Agents are not running.
Possible Remedies: Check that all Agents are running and heartbeating.

Alert: Ambari Server Performance
Alert Type: SERVER
Description: This alert is triggered if the Ambari Server detects that there is a potential performance problem with Ambari.
Potential Causes: This type of issue can arise for many reasons, but is typically attributed to slow database queries and host resource exhaustion.
Possible Remedies: Check your Ambari Server database connection and database activity. Check your Ambari Server host for resource exhaustion, such as memory.

8.5.11. Ambari Metrics Alerts

Alert: Metrics Collector Process
Description: This alert is triggered if the Metrics Collector cannot be confirmed to be up and listening on the configured port for a number of seconds equal to the threshold.
Potential Causes: The Metrics Collector process is not running.
Possible Remedies: Check that the Metrics Collector is running.

Alert: Metrics Collector – ZooKeeper Server Process
Description: This host-level alert is triggered if the Metrics Collector ZooKeeper Server Process cannot be determined to be up and listening on the network.
Potential Causes: The Metrics Collector process is not running.
Possible Remedies: Check that the Metrics Collector is running.

Alert: Metrics Collector – HBase Master Process
Description: This alert is triggered if the Metrics Collector HBase Master Processes cannot be confirmed to be up and listening on the network for the configured critical threshold, given in seconds.
Potential Causes: The Metrics Collector process is not running.
Possible Remedies: Check that the Metrics Collector is running.

Alert: Metrics Collector – HBase Master CPU Utilization
Description: This host-level alert is triggered if CPU utilization of the Metrics Collector exceeds certain thresholds.
Potential Causes: Unusually high CPU utilization is generally the sign of an issue in the daemon configuration.
Possible Remedies: Tune the Ambari Metrics Collector.

Alert: Metrics Monitor Status
Description: This host-level alert is triggered if the Metrics Monitor process cannot be confirmed to be up and running on the network.
Potential Causes: The Metrics Monitor is down.
Possible Remedies: Check whether the Metrics Monitor is running on the given host.

Alert: Percent Metrics Monitors Available
Description: This is an AGGREGATE alert of the Metrics Monitor Status.
Potential Causes: Metrics Monitors are down.
Possible Remedies: Check that the Metrics Monitors are running.

Alert: Metrics Collector - Auto-Restart Status
Description: This alert is triggered if the Metrics Collector has been auto-started a number of times equal to the start threshold in a 1 hour timeframe. By default, if restarted 2 times in an hour, you receive a Warning alert. If restarted 4 or more times in an hour, you receive a Critical alert.
Potential Causes: The Metrics Collector is running but is unstable and causing restarts. This could be due to improper tuning.
Possible Remedies: Tune the Ambari Metrics Collector.

Alert: Grafana Web UI
Description: This host-level alert is triggered if the AMS Grafana Web UI is unreachable.
Potential Causes: The Grafana process is not running.
Possible Remedies: Check whether the Grafana process is running. Restart it if it has gone down.

More Information

Tuning Ambari Metrics

8.5.12. SmartSense Alerts

Alert: SmartSense Server Process
Description: This alert is triggered if the HST server process cannot be confirmed to be up and listening on the network for the configured critical threshold, given in seconds.
Potential Causes: The HST server is not running.
Possible Remedies: Start the HST server process. If startup fails, check the hst-server.log.

Alert: SmartSense Bundle Capture Failure
Description: This alert is triggered if the last triggered SmartSense bundle failed or timed out.
Potential Causes: Some nodes time out during capture or fail during data capture. It could also be that the upload to Hortonworks fails.
Possible Remedies: From the "Bundles" page, check the status of the bundle. Next, check which agents have failed or timed out, and review their logs. You can also initiate a new capture.

Alert: SmartSense Long Running Bundle
Description: This alert is triggered if the SmartSense in-progress bundle might not complete successfully on time.
Potential Causes: Service components that are being collected may not be running, or some agents may be timing out during data collection/upload.
Possible Remedies: Restart the services that are not running. Force-complete the bundle and start a new capture.

Alert: SmartSense Gateway Status
Description: This alert is triggered if the SmartSense Gateway server process is enabled but cannot be reached.
Potential Causes: The SmartSense Gateway is not running.
Possible Remedies: Start the gateway. If the gateway start fails, review hst-gateway.log.

8.6. Managing Notifications

Using alert groups and notifications enables you to create groups of alerts and set up notification targets for each group in such a way that you can notify different parties interested in certain sets of alerts by using different methods. For example, you might want your Hadoop Operations team to receive all alerts by email, regardless of status, while at the same time you want your System Administration team to receive only RPC and CPU-related alerts that are in Critical state, and only by Simple Network Management Protocol (SNMP).

To achieve these different results, you can have one alert notification that manages email for all alert groups for all severity levels, and a different alert notification group that manages SNMP on critical-severity alerts for an alert group that contains the RPC and CPU alerts.

8.7. Creating and Editing Notifications

To create or edit alert notifications:

Steps


1. In Ambari Web, click Alerts.

2. On the Alerts page, click the Actions menu, then click Manage Notifications.

3. In Manage Alert Notifications, click + to create a new alert notification.

In Create Alert Notification,

• In Name, enter a name for the notification

• In Groups, click All or Custom to assign the notification to every group or to a custom set of groups that you specify

• In Description, type a phrase that describes the notification

• In Method, click EMAIL, SNMP (for MIB-based), or Custom SNMP as the method by which Ambari server handles delivery of this notification.

4. Complete the fields for the notification method you selected.

• For email notification, provide information about your SMTP infrastructure, such as SMTP server, port, to and from addresses, and whether authentication is required to relay messages through the server.

You can add custom properties to the SMTP configuration based on Javamail SMTP options.

Email To A comma-separated list of one or more email addresses to which to send the alert email

SMTP Server The FQDN or IP address of the SMTP server to use to relay the alert email

SMTP Port The SMTP port on the SMTP server

Email From A single email address to be the originator of the alert email

Use Authentication Determine whether your SMTP server requires authentication before it can relay messages. Be sure to also provide the username and password credentials.

• For MIB-based SNMP notification, provide the version, community, host, and port to which the SNMP trap should be sent:

Version SNMPv1 or SNMPv2c, depending on the network environment

Hosts A comma-separated list of one or more host FQDNs to which to send the trap

Port The port on which a process is listening for SNMP traps

For SNMP notifications, Ambari uses a "MIB", a text file manifest of alert definitions, to transfer alert information from cluster operations to the alerting infrastructure. A MIB summarizes how object IDs map to objects or attributes.



You can find the MIB file for your cluster on the Ambari Server host, at: /var/lib/ambari-server/resources/APACHE-AMBARI-MIB.txt
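
To confirm that traps are arriving at your notification target, you can run a stock net-snmp trap receiver on that host. This is a minimal sketch using standard snmptrapd configuration; the community string is a placeholder and is unrelated to Ambari itself:

# /etc/snmp/snmptrapd.conf: log v1/v2c traps sent with the 'public' community
authCommunity log public

# Run snmptrapd in the foreground, logging received traps to stdout
snmptrapd -f -Lo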

• For Custom SNMP notification, provide the version, community, host, and port to which the SNMP trap should be sent.

Also, the OID parameter must be configured properly for SNMP trap context. If no custom, enterprise-specific OID is used, you should use the following:

Version SNMPv1 or SNMPv2c, depending on the network environment

OID 1.3.6.1.4.1.18060.16.1.1

Hosts A comma-separated list of one or more host FQDNs to which to send the trap

Port The port on which a process is listening for SNMP traps

5. Click Save.

More Information

Managing Notifications [118]

Javamail SMTP options

8.8. Creating or Editing Alert Groups

To create or edit alert groups:


Steps

1. In Ambari Web, click Alerts.

2. On the Alerts page, click the Actions menu, then click Manage Alert Groups.

3. In Manage Alert Groups, click + to create a new alert group.

4. In Create Alert Group, enter a group name and click Save.

5. By clicking on the custom group in the list, you can add or delete alert definitions from this group, and change the notification targets for the group.

6. When you finish your assignments, click Save.

8.9. Dispatching Notifications

When an alert is enabled and the alert status changes (for example, from OK to CRITICAL or CRITICAL to OK), Ambari sends either an email or SNMP notification, depending on how notifications are configured.

For email notifications, Ambari sends an email digest that includes all alert status changes. For example, if two alerts become critical, Ambari sends one email message stating that Alert A is CRITICAL and Alert B is CRITICAL. Ambari does not send another email notification until the status changes again.

For SNMP notifications, Ambari sends one SNMP trap per alert status change. For example, if two alerts become critical, Ambari sends two SNMP traps, one for each alert, and then sends two more when the two alerts change status again.

8.10. Viewing the Alert Status Log

Whether or not Ambari is configured to send alert notifications, it writes alert status changes to a log on the Ambari Server host. To view this log:

Steps

1. On the Ambari server host, browse to the log directory:

cd /var/log/ambari-server/

2. View the ambari-alerts.log file.

3. Log entries include the time of the status change, the alert status, the alert definition name, and the response text:

2015-08-10 22:47:37,120 [OK] [HARD] [STORM] (Storm Server Process) TCP OK - 0.000s response on port 8744
2015-08-11 11:06:18,479 [CRITICAL] [HARD] [AMBARI] [ambari_server_agent_heartbeat] (Ambari Agent Heartbeat) c6401.ambari.apache.org is not sending heartbeats
2015-08-11 11:08:18,481 [OK] [HARD] [AMBARI] [ambari_server_agent_heartbeat] (Ambari Agent Heartbeat) c6401.ambari.apache.org is healthy
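
Because each entry is a single line with bracketed fields, standard text tools are enough for triage. For example, to list only hard CRITICAL transitions:

grep '\[CRITICAL\] \[HARD\]' /var/log/ambari-server/ambari-alerts.log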


8.10.1. Customizing Notification Templates

The notification template content produced by Ambari is tightly coupled to a notification type. Email and SNMP notifications both have customizable templates that you can use to generate content. This section describes the steps necessary to change the template used by Ambari when creating alert notifications.

Alert Templates XML Location

By default, an alert-templates.xml file ships with Ambari. This file contains templates for every known type of notification (for example, EMAIL and SNMP). The file is bundled in the Ambari Server .jar file so that the template is not exposed on disk; however, that file is used in the following text as a reference example.

When you customize the alert template, you are effectively overriding the default alert template's XML, as follows:

1. On the Ambari Server host, browse to the /etc/ambari-server/conf directory.

2. Edit the ambari.properties file.

3. Add an entry for the location of your new template:

alerts.template.file=/foo/var/alert-templates-custom.xml

4. Save the file and restart Ambari Server.

After you restart Ambari, any notification types defined in the new template override those bundled with Ambari. If you choose to provide your own template file, you only need to define notification templates for the types that you wish to override. If a notification template type is not found in the customized template, Ambari will default to the templates that ship with the JAR.

Alert Templates XML Structure

The structure of the template file is defined as follows. Each element declares what type of alert notification it should be used for.

<alert-templates>
  <alert-template type="EMAIL">
    <subject>
      Subject Content
    </subject>
    <body>
      Body Content
    </body>
  </alert-template>
  <alert-template type="SNMP">
    <subject>
      Subject Content
    </subject>
    <body>
      Body Content
    </body>
  </alert-template>
</alert-templates>


Template Variables

The template uses Apache Velocity to render all tokenized content. The following variables are available for use in your template:

$alert.getAlertDefinition() The definition that this alert is an instance of.

$alert.getAlertText() The specific alert text.

$alert.getAlertName() The name of the alert.

$alert.getAlertState() The alert state (OK, WARNING, CRITICAL, or UNKNOWN).

$alert.getServiceName() The name of the service that the alert is defined for.

$alert.hasComponentName() True if the alert is for a specific service component.

$alert.getComponentName() The component, if any, that the alert is defined for.

$alert.hasHostName() True if the alert was triggered for a specific host.

$alert.getHostName() The hostname, if any, that the alert was triggered for.

$ambari.getServerUrl() The Ambari Server URL.

$ambari.getServerVersion() The Ambari Server version.

$ambari.getServerHostName() The Ambari Server hostname.

$dispatch.getTargetName() The notification target name.

$dispatch.getTargetDescription() The notification target description.

$summary.getAlerts(service,alertState) A list of all alerts for a given service or alert state (OK|WARNING|CRITICAL|UNKNOWN)

$summary.getServicesByAlertState(alertState) A list of all services for a given alert state (OK|WARNING|CRITICAL|UNKNOWN)

$summary.getServices() A list of all services that are reporting an alert in the notification.

$summary.getCriticalCount() The CRITICAL alert count.

$summary.getOkCount() The OK alert count.

$summary.getTotalCount() The total alert count.

$summary.getUnknownCount() The UNKNOWN alert count.

$summary.getWarningCount() The WARNING alert count.

$summary.getAlerts() A list of all of the alerts in the notification.
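As a minimal sketch of how these variables combine in a template body (Velocity syntax; the layout is illustrative and not the shipped template):

$alert.getAlertState(): $alert.getAlertName() ($alert.getServiceName())
#if( $alert.hasHostName() )
Host: $alert.getHostName()
#end
$alert.getAlertText()

Summary: $summary.getCriticalCount() CRITICAL, $summary.getWarningCount() WARNING, $summary.getOkCount() OK of $summary.getTotalCount() alerts.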

Example: Modify Alert EMAIL Subject


The following example illustrates how to change the subject line of all outbound email notifications to include a hard-coded identifier:

1. Download the alert-templates.xml code as your starting point.

2. On the Ambari Server, save the template to a location such as /var/lib/ambari-server/resources/alert-templates-custom.xml.

3. Edit the alert-templates-custom.xml file and modify the subject line for the EMAIL template:
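For example, the modified subject might look like the following sketch (ACME-ALERT is an illustrative hard-coded identifier, and the surrounding markup may differ slightly from the shipped template):

<alert-template type="EMAIL">
  <subject>
    <![CDATA[ACME-ALERT: $alert.getAlertState() - $alert.getAlertName()]]>
  </subject>
  ...
</alert-template>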

4. Save the file.

5. Browse to the /etc/ambari-server/conf directory.

6. Edit the ambari.properties file.

7. Add an entry for the location of your new template file.

alerts.template.file=/var/lib/ambari-server/resources/alert-templates-custom.xml

8. Save the file and restart Ambari Server.


9. Using Ambari Core Services

The Ambari core services enable you to monitor, analyze, and search the operating status of hosts in your cluster. This chapter describes how to use and configure the following Ambari Core Services:

• Understanding Ambari Metrics [125]

• Ambari Log Search (Technical Preview) [181]

• Ambari Infra [185]

9.1. Understanding Ambari Metrics

Ambari Metrics System (AMS) collects, aggregates, and serves Hadoop and system metrics in Ambari-managed clusters.

• AMS Architecture [125]

• Using Grafana [126]

• Grafana Dashboards Reference [131]

• AMS Performance Tuning [169]

• AMS High Availability [174]

9.1.1. AMS Architecture

AMS has four components: Metrics Monitors, Hadoop Sinks, Metrics Collector, and Grafana.

• Metrics Monitors on each host in the cluster collect system-level metrics and publish them to the Metrics Collector.

• Hadoop Sinks plug in to Hadoop components to publish Hadoop metrics to the Metrics Collector.

• The Metrics Collector is a daemon that runs on a specific host in the cluster and receives data from the registered publishers, the Monitors, and the Sinks.

• Grafana is a daemon that runs on a specific host in the cluster and serves pre-built dashboards for visualizing metrics collected in the Metrics Collector.

The following high-level illustration shows how the components of AMS work together to collect metrics and make those metrics available to Ambari.
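If you want to verify the collection path end to end, you can post a test metric to the Metrics Collector REST endpoint that the Monitors and Sinks publish to; this is a minimal sketch, assuming the Collector runs on ams-collector.example.com with its default port 6188, and using an arbitrary test metric name and values:

# Publish a single data point; timestamp values are milliseconds since epoch.
curl -i -X POST -H "Content-Type: application/json" \
  http://ams-collector.example.com:6188/ws/v1/timeline/metrics \
  -d '{
    "metrics": [{
      "metricname": "test.smoke.metric",
      "appid": "amssmoketest",
      "hostname": "ams-collector.example.com",
      "timestamp": 1432075898000,
      "starttime": 1432075898000,
      "metrics": { "1432075898000": 0.96 }
    }]
  }'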


9.1.2. Using Grafana

Ambari Metrics System includes Grafana with pre-built dashboards for advanced visualization of cluster metrics.

• Accessing Grafana [126]

• Viewing Grafana Dashboards [127]

• Viewing Selected Metrics on Grafana Dashboards [129]

• Viewing Metrics for Selected Hosts [130]

More Information

http://grafana.org/

9.1.2.1. Accessing Grafana

To access the Grafana UI:

Steps

1. In Ambari Web, browse to Services > Ambari Metrics > Summary.

2. Select Quick Links and then choose Grafana.

A read-only version of the Grafana interface opens in a new tab in your browser.


9.1.2.2. Viewing Grafana Dashboards

On the Grafana home page, Dashboards provides a short list of links to AMS, Ambari server, Druid and HBase metrics.

To view specific metrics included in the list:

Steps

1. In Grafana, browse to Dashboards.

2. Click a dashboard name.

3. To see more available dashboards, click the Home list.


4. Scroll down to view the whole list.

5. Click a dashboard name, for example System - Servers.

The System - Servers dashboard opens.


9.1.2.3. Viewing Selected Metrics on Grafana Dashboards

On a dashboard, expand one or more rows to view detailed metrics. The following steps continue the previous example, using the System - Servers dashboard:

1. In the System - Servers dashboard, click a row name. For example, click System Load Average - 1 Minute.

The row expands to display a chart that shows metrics information: in this example, the System Load Average - 1 Minute and the System Load Average - 15 Minute rows.

9.1.2.4. Viewing Metrics for Selected Hosts

By default, Grafana shows metrics for all hosts in your cluster. You can limit the displayed metrics to one or more hosts by selecting them from the Hosts menu:

1. Expand Hosts.

2. Select one or more host names.

A check mark appears next to selected host names.

Note

Selections in the Hosts menu apply to all metrics in the current dashboard. Grafana refreshes the current dashboards when you select a new set of hosts.

9.1.3. Grafana Dashboards Reference

Ambari Metrics System includes Grafana with pre-built dashboards for advanced visualization of cluster metrics.

• AMS HBase Dashboards [131]

• Ambari Dashboards [139]

• HDFS Dashboards [141]

• YARN Dashboards [145]

• Hive Dashboards [148]

• Hive LLAP Dashboards [150]

• HBase Dashboards [154]

• Kafka Dashboards [163]

• Storm Dashboards [165]

• System Dashboards [166]

• NiFi Dashboard [168]

9.1.3.1. AMS HBase Dashboards

AMS HBase refers to the HBase instance managed independently by the Ambari Metrics Service. It has no connection with the cluster HBase service. AMS HBase Grafana dashboards track the same metrics as the regular HBase dashboards, but for the AMS-owned instance.

The following Grafana dashboards are available for AMS HBase:

• AMS HBase - Home [132]

• AMS HBase - RegionServers [133]


• AMS HBase - Misc [138]

9.1.3.1.1. AMS HBase - Home

The AMS HBase - Home dashboards display basic statistics about an HBase cluster. These dashboards provide insight into the overall status of the HBase cluster.

REGIONSERVERS / REGIONS
• Num RegionServers: Total number of RegionServers in the cluster.
• Num Dead RegionServers: Total number of RegionServers that are dead in the cluster.
• Num Regions: Total number of regions in the cluster.
• Avg Num Regions per RegionServer: Average number of regions per RegionServer.

NUM REGIONS/STORES
• Num Regions / Stores - Total: Total number of regions and stores (column families) in the cluster.
• Store File Size / Count - Total: Total data file size and number of store files.

NUM REQUESTS
• Num Requests - Total: Total number of requests (read, write, and RPCs) in the cluster.
• Num Request - Breakdown - Total: Total number of get, put, mutate, and other requests in the cluster.

REGIONSERVER MEMORY
• RegionServer Memory - Average: Average used, max, or committed on-heap and offheap memory for RegionServers.
• RegionServer Offheap Memory - Average: Average used, free, or committed on-heap and offheap memory for RegionServers.

MEMORY - MEMSTORE BLOCKCACHE
• Memstore - BlockCache - Average: Average blockcache and memstore sizes for RegionServers.

BLOCKCACHE
• Num Blocks in BlockCache - Total: Total number of (hfile) blocks in the blockcaches across all RegionServers.
• BlockCache Hit/Miss/s Total: Total number of blockcache hits, misses, and evictions across all RegionServers.
• BlockCache Hit Percent - Average: Average blockcache hit percentage across all RegionServers.

OPERATION LATENCIES - GET/MUTATE
• Get Latencies - Average: Average min, median, max, 75th, 95th, and 99th percentile latencies for the Get operation across all RegionServers.
• Mutate Latencies - Average: Average min, median, max, 75th, 95th, and 99th percentile latencies for the Mutate operation across all RegionServers.

OPERATION LATENCIES - DELETE/INCREMENT
• Delete Latencies - Average: Average min, median, max, 75th, 95th, and 99th percentile latencies for the Delete operation across all RegionServers.
• Increment Latencies - Average: Average min, median, max, 75th, 95th, and 99th percentile latencies for the Increment operation across all RegionServers.

OPERATION LATENCIES - APPEND/REPLAY
• Append Latencies - Average: Average min, median, max, 75th, 95th, and 99th percentile latencies for the Append operation across all RegionServers.
• Replay Latencies - Average: Average min, median, max, 75th, 95th, and 99th percentile latencies for the Replay operation across all RegionServers.

REGIONSERVER RPC
• RegionServer RPC - Average: Average number of RPCs, active handler threads, and open connections across all RegionServers.
• RegionServer RPC Queues - Average: Average number of calls in different RPC scheduling queues and the size of all requests in the RPC queue across all RegionServers.
• RegionServer RPC Throughput - Average: Average sent and received bytes from the RPC across all RegionServers.

9.1.3.1.2. AMS HBase - RegionServers

The AMS HBase - RegionServers dashboards display metrics for RegionServers in the monitored HBase cluster, including some performance-related data. These dashboards help you view basic I/O data and compare load among RegionServers.

NUM REGIONS
• Num Regions: Number of regions in the RegionServer.

STORE FILES
• Store File Size: Total size of the store files (data files) in the RegionServer.
• Store File Count: Total number of store files in the RegionServer.

NUM REQUESTS
• Num Total Requests /s: Total number of requests (both read and write) per second in the RegionServer.
• Num Write Requests /s: Total number of write requests per second in the RegionServer.
• Num Read Requests /s: Total number of read requests per second in the RegionServer.

NUM REQUESTS - GET / SCAN
• Num Get Requests /s: Total number of Get requests per second in the RegionServer.
• Num Scan Next Requests /s: Total number of Scan requests per second in the RegionServer.

NUM REQUESTS - MUTATE / DELETE
• Num Mutate Requests /s: Total number of Mutate requests per second in the RegionServer.
• Num Delete Requests /s: Total number of Delete requests per second in the RegionServer.

NUM REQUESTS - APPEND / INCREMENT
• Num Append Requests /s: Total number of Append requests per second in the RegionServer.
• Num Increment Requests /s: Total number of Increment requests per second in the RegionServer.
• Num Replay Requests /s: Total number of Replay requests per second in the RegionServer.

MEMORY
• RegionServer Memory Used: Heap memory used by the RegionServer.
• RegionServer Offheap Memory Used: Offheap memory used by the RegionServer.

MEMSTORE
• Memstore Size: Total Memstore memory size of the RegionServer.

BLOCKCACHE - OVERVIEW
• BlockCache - Size: Total BlockCache size of the RegionServer.
• BlockCache - Free Size: Total free space in the BlockCache of the RegionServer.
• Num Blocks in Cache: Total number of hfile blocks in the BlockCache of the RegionServer.

BLOCKCACHE - HITS/MISSES
• Num BlockCache Hits /s: Number of BlockCache hits per second in the RegionServer.
• Num BlockCache Misses /s: Number of BlockCache misses per second in the RegionServer.
• Num BlockCache Evictions /s: Number of BlockCache evictions per second in the RegionServer.
• BlockCache Caching Hit Percent: Percentage of BlockCache hits per second for requests that requested cache blocks in the RegionServer.
• BlockCache Hit Percent: Percentage of BlockCache hits per second in the RegionServer.

OPERATION LATENCIES - GET
• Get Latencies - Mean / Median / 75th / 95th / 99th Percentile / Max: Mean, median, 75th, 95th, and 99th percentile, and maximum latencies for the Get operation in the RegionServer.

OPERATION LATENCIES - SCAN NEXT
• Scan Next Latencies - Mean / Median / 75th / 95th / 99th Percentile / Max: Mean, median, 75th, 95th, and 99th percentile, and maximum latencies for the Scan operation in the RegionServer.

OPERATION LATENCIES - MUTATE
• Mutate Latencies - Mean / Median / 75th / 95th / 99th Percentile / Max: Mean, median, 75th, 95th, and 99th percentile, and maximum latencies for the Mutate operation in the RegionServer.

OPERATION LATENCIES - DELETE
• Delete Latencies - Mean / Median / 75th / 95th / 99th Percentile / Max: Mean, median, 75th, 95th, and 99th percentile, and maximum latencies for the Delete operation in the RegionServer.

OPERATION LATENCIES - INCREMENT
• Increment Latencies - Mean / Median / 75th / 95th / 99th Percentile / Max: Mean, median, 75th, 95th, and 99th percentile, and maximum latencies for the Increment operation in the RegionServer.

OPERATION LATENCIES - APPEND
• Append Latencies - Mean / Median / 75th / 95th / 99th Percentile / Max: Mean, median, 75th, 95th, and 99th percentile, and maximum latencies for the Append operation in the RegionServer.

OPERATION LATENCIES - REPLAY
• Replay Latencies - Mean / Median / 75th / 95th / 99th Percentile / Max: Mean, median, 75th, 95th, and 99th percentile, and maximum latencies for the Replay operation in the RegionServer.

RPC - OVERVIEW
• Num RPC /s: Number of RPCs per second in the RegionServer.
• Num Active Handler Threads: Number of active RPC handler threads (to process requests) in the RegionServer.
• Num Connections: Number of connections to the RegionServer.

RPC - QUEUES
• Num RPC Calls in General Queue: Number of RPC calls in the general processing queue in the RegionServer.
• Num RPC Calls in Priority Queue: Number of RPC calls in the high priority (for system tables) processing queue in the RegionServer.
• Num RPC Calls in Replication Queue: Number of RPC calls in the replication processing queue in the RegionServer.
• RPC - Total Call Queue Size: Total data size of all RPC calls in the RPC queues in the RegionServer.

RPC - CALL QUEUED TIMES
• RPC - Call Queued Time - Mean / Median / 75th / 95th / 99th Percentile / Max: Mean, median, 75th, 95th, and 99th percentile, and maximum latencies for RPC calls to stay in the RPC queue in the RegionServer.

RPC - CALL PROCESS TIMES
• RPC - Call Process Time - Mean / Median / 75th / 95th / 99th Percentile / Max: Mean, median, 75th, 95th, and 99th percentile, and maximum latencies for RPC calls to be processed in the RegionServer.

RPC - THROUGHPUT
• RPC - Received bytes /s: Received bytes from the RPC in the RegionServer.
• RPC - Sent bytes /s: Sent bytes from the RPC in the RegionServer.

WAL - FILES
• Num WAL - Files: Number of Write-Ahead-Log files in the RegionServer.
• Total WAL File Size: Total file size of Write-Ahead-Logs in the RegionServer.

WAL - THROUGHPUT
• WAL - Num Appends /s: Number of append operations per second to the filesystem in the RegionServer.
• WAL - Num Sync /s: Number of sync operations per second to the filesystem in the RegionServer.

WAL - SYNC LATENCIES
• WAL - Sync Latencies - Mean / Median / 75th / 95th / 99th Percentile / Max: Mean, median, 75th, 95th, and 99th percentile, and maximum latencies for the Write-Ahead-Log sync operation to the filesystem in the RegionServer.

WAL - APPEND LATENCIES
• WAL - Append Latencies - Mean / Median / 75th / 95th / 99th Percentile / Max: Mean, median, 75th, 95th, and 99th percentile, and maximum latencies for the Write-Ahead-Log append operation to the filesystem in the RegionServer.

WAL - APPEND SIZES
• WAL - Append Sizes - Mean / Median / 75th / 95th / 99th Percentile / Max: Mean, median, 75th, 95th, and 99th percentile, and maximum data sizes for the Write-Ahead-Log append operation to the filesystem in the RegionServer.

SLOW OPERATIONS
• WAL Num Slow Append /s: Number of append operations per second to the filesystem that took more than 1 second in the RegionServer.
• Num Slow Gets /s: Number of Get requests per second that took more than 1 second in the RegionServer.
• Num Slow Puts /s: Number of Put requests per second that took more than 1 second in the RegionServer.
• Num Slow Deletes /s: Number of Delete requests per second that took more than 1 second in the RegionServer.

FLUSH/COMPACTION QUEUES
• Flush Queue Length: Number of Flush operations waiting to be processed in the RegionServer. A higher number indicates flush operations being slow.
• Compaction Queue Length: Number of Compaction operations waiting to be processed in the RegionServer. A higher number indicates compaction operations being slow.
• Split Queue Length: Number of Region Split operations waiting to be processed in the RegionServer. A higher number indicates split operations being slow.

JVM - GC COUNTS
• GC Count /s: Number of Java garbage collections per second.
• GC Count ParNew /s: Number of Java ParNew (YoungGen) garbage collections per second.
• GC Count CMS /s: Number of Java CMS garbage collections per second.

JVM - GC TIMES
• GC Times /s: Total time spent in Java garbage collections per second.
• GC Times ParNew /s: Total time spent in Java ParNew (YoungGen) garbage collections per second.
• GC Times CMS /s: Total time spent in Java CMS garbage collections per second.

LOCALITY
• Percent Files Local: Percentage of files served from the local DataNode for the RegionServer.

9.1.3.1.3. AMS HBase - Misc

The AMS HBase - Misc dashboards display miscellaneous metrics related to the HBase cluster. You can use these metrics for tasks like debugging authentication and authorization issues and exceptions raised by RegionServers.

REGIONS IN TRANSITION
• Master - Regions in Transition: Number of regions in transition in the cluster.
• Master - Regions in Transition Longer Than Threshold Time: Number of regions in transition that stay in the transition state for longer than 1 minute in the cluster.
• Regions in Transition Oldest Age: Maximum time that a region stayed in the transition state.

NUM THREADS - RUNNABLE
• Master Num Threads - Runnable: Number of runnable threads in the Master.
• RegionServer Num Threads - Runnable: Number of runnable threads in the RegionServer.

NUM THREADS - BLOCKED
• Master Num Threads - Blocked: Number of threads in the Blocked state in the Master.
• RegionServer Num Threads - Blocked: Number of threads in the Blocked state in the RegionServer.

NUM THREADS - WAITING
• Master Num Threads - Waiting: Number of threads in the Waiting state in the Master.
• RegionServer Num Threads - Waiting: Number of threads in the Waiting state in the RegionServer.

NUM THREADS - TIMED WAITING
• Master Num Threads - Timed Waiting: Number of threads in the Timed-Waiting state in the Master.
• RegionServer Num Threads - Timed Waiting: Number of threads in the Timed-Waiting state in the RegionServer.

NUM THREADS - NEW
• Master Num Threads - New: Number of threads in the New state in the Master.
• RegionServer Num Threads - New: Number of threads in the New state in the RegionServer.

NUM THREADS - TERMINATED
• Master Num Threads - Terminated: Number of threads in the Terminated state in the Master.
• RegionServer Num Threads - Terminated: Number of threads in the Terminated state in the RegionServer.

RPC AUTHENTICATION
• RegionServer RPC Authentication Successes /s: Number of successful RPC authentications per second in the RegionServer.
• RegionServer RPC Authentication Failures /s: Number of failed RPC authentications per second in the RegionServer.

RPC AUTHORIZATION
• RegionServer RPC Authorization Successes /s: Number of successful RPC authorizations per second in the RegionServer.
• RegionServer RPC Authorization Failures /s: Number of failed RPC authorizations per second in the RegionServer.

EXCEPTIONS
• Master Exceptions /s: Number of exceptions in the Master.
• RegionServer Exceptions /s: Number of exceptions in the RegionServer.

9.1.3.2. Ambari Dashboards

The following Grafana dashboards are available for Ambari:

• Ambari Server Database [139]

• Ambari Server JVM [139]

• Ambari Server Top N [140]

9.1.3.2.1. Ambari Server Database

Metrics that show operating status for the Ambari server database.

TOTAL READ ALL QUERY
• Total Read All Query Counter (Rate): Total ReadAllQuery operations performed.
• Total Read All Query Timer (Rate): Total time spent on ReadAllQuery.

TOTAL CACHE HITS & MISSES
• Total Cache Hits (Rate): Total cache hits on Ambari Server with respect to the EclipseLink cache.
• Total Cache Misses (Rate): Total cache misses on Ambari Server with respect to the EclipseLink cache.

QUERY
• Query Stages Timings: Average time spent on every query sub-stage by Ambari Server.
• Query Types Avg. Timings: Average time spent on every query type by Ambari Server.

HOST ROLE COMMAND ENTITY
• Counter.ReadAllQuery.HostRoleCommandEntity (Rate): Rate (number of operations per second) at which the ReadAllQuery operation on HostRoleCommandEntity is performed.
• Timer.ReadAllQuery.HostRoleCommandEntity (Rate): Rate at which time is spent on ReadAllQuery operations on HostRoleCommandEntity.
• ReadAllQuery.HostRoleCommandEntity (Timer / Counter): Average time taken for a ReadAllQuery operation on HostRoleCommandEntity.

9.1.3.2.2. Ambari Server JVM

Metrics to see status for the Ambari Server Java virtual machine.

JVM - MEMORY PRESSURE
• Heap Usage: Used, max, or committed on-heap memory for Ambari Server.
• Off-Heap Usage: Used, max, or committed off-heap memory for Ambari Server.

JVM GC COUNT
• GC Count Par New /s: Number of Java ParNew (YoungGen) garbage collections per second.
• GC Time Par New /s: Total time spent in Java ParNew (YoungGen) garbage collections per second.
• GC Count CMS /s: Number of Java CMS garbage collections per second.
• GC Time CMS /s: Total time spent in Java CMS garbage collections per second.

JVM THREAD COUNT
• Thread Count: Number of active, daemon, deadlocked, blocked, and runnable threads.

9.1.3.2.3. Ambari Server Top N

Metrics to see top performing users and operations for Ambari.

READ ALL QUERY
• Top ReadAllQuery Counters: Top N Ambari Server entities by number of ReadAllQuery operations performed.
• Top ReadAllQuery Timers: Top N Ambari Server entities by time spent on ReadAllQuery operations.

CACHE MISSES
• Cache Misses: Top N Ambari Server entities by number of cache misses.

9.1.3.3. Druid Dashboards

The following Grafana dashboards are available for Druid:

• Druid - Home [140]

• Druid - Ingestion [141]

• Druid - Query [141]

9.1.3.3.1. Druid - Home

Metrics that show operating status for Druid.

DRUID BROKER
• JVM Heap: JVM heap used by the Druid Broker node.
• JVM GCM Time: Time spent by the Druid Broker node in JVM garbage collection.

DRUID HISTORICAL
• JVM Heap: JVM heap used by the Druid Historical node.
• JVM GCM Time: Time spent by the Druid Historical node in JVM garbage collection.

DRUID COORDINATOR
• JVM Heap: JVM heap used by the Druid Coordinator node.
• JVM GCM Time: Time spent by the Druid Coordinator node in JVM garbage collection.

DRUID OVERLORD
• JVM Heap: JVM heap used by the Druid Overlord node.
• JVM GCM Time: Time spent by the Druid Overlord node in JVM garbage collection.

DRUID MIDDLEMANAGER
• JVM Heap: JVM heap used by the Druid MiddleManager node.
• JVM GCM Time: Time spent by the Druid MiddleManager node in JVM garbage collection.

9.1.3.3.2. Druid - Ingestion

Metrics to see status for Druid data ingestion rates.

INGESTION METRICS
• Ingested Events: Number of events ingested on real-time nodes.
• Events Thrown Away: Number of events rejected because they are outside the windowPeriod.
• Unparseable Events: Number of events rejected because they did not parse.

INTERMEDIATE PERSISTS METRICS
• Persisted Rows: Number of Druid rows persisted on disk.
• Average Persist Time: Average time taken to persist intermediate segments to disk.
• Intermediate Persist Count: Number of times that intermediate segments were persisted.

SEGMENT SIZE METRICS
• Ave Segment Size: Average size of added Druid segments.
• Total Segment Size: Total size of added Druid segments.

9.1.3.3.3. Druid - Query

Metrics to see status of Druid queries.

QUERY TIME METRICS
• Broker Query Time: Average time taken by the Druid Broker node to process queries.
• Historical Query Time: Average time taken by Druid Historical nodes to process queries.
• Realtime Query Time: Average time taken by Druid real-time nodes to process queries.

SEGMENT SCAN METRICS
• Historical Segment Scan Time: Average time taken by Druid Historical nodes to scan individual segments.
• Realtime Segment Scan Time: Average time taken by Druid real-time nodes to scan individual segments.
• Historical Query Wait Time: Average time spent waiting for a segment to be scanned on a Historical node.
• Realtime Query Wait Time: Average time spent waiting for a segment to be scanned on a real-time node.
• Pending Historical Segment Scans: Average number of pending segment scans on Historical nodes.
• Pending Realtime Segment Scans: Average number of pending segment scans on real-time nodes.

9.1.3.4. HDFS Dashboards

The following Grafana dashboards are available for Hadoop Distributed File System (HDFS) components:

• HDFS - Home [142]

• HDFS - NameNodes [142]


• HDFS - DataNodes [143]

• HDFS - Top-N [144]

• HDFS - Users [145]

9.1.3.4.1. HDFS - Home

The HDFS - Home dashboard displays metrics that show operating status for HDFS.

Note

In a NameNode HA setup, metrics are collected from and displayed for both the active and the standby NameNode.

NUMBER OF FILES UNDER CONSTRUCTION & RPC CLIENT CONNECTIONS
• Number of Files Under Construction: Number of HDFS files that are still being written.
• RPC Client Connections: Number of open RPC connections from clients on NameNode(s).

TOTAL FILE OPERATIONS & CAPACITY USED
• Total File Operations: Total number of operations on HDFS files, including file creation/deletion/rename/truncation, directory/file/block information retrieval, and snapshot-related operations.
• Capacity Used: "CapacityTotalGB" shows total HDFS storage capacity, in GB. "CapacityUsedGB" indicates total used HDFS storage capacity, in GB.

RPC CLIENT PORT SLOW CALLS & HDFS TOTAL LOAD
• RPC Client Port Slow Calls: Number of slow RPC requests on the NameNode. A "slow" RPC request is one that takes more time to complete than 99.7% of other requests.
• HDFS Total Load: Total number of connections on all the DataNodes sending/receiving data.

ADD BLOCK STATUS
• Add Block Time: The average time (in ms) serving addBlock RPC requests on NameNode(s).
• Add Block Num Ops: The rate of addBlock RPC requests on NameNode(s).
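The values behind these widgets can also be spot-checked outside Grafana through the Ambari REST API; this is a sketch, where MyCluster, the server address, and the admin credentials are placeholders, and the fields filter shown is one common FSNamesystem metrics path:

# Fetch NameNode FSNamesystem metrics (capacity, files, load) from Ambari.
curl -u admin:admin \
  'http://ambari.server:8080/api/v1/clusters/MyCluster/services/HDFS/components/NAMENODE?fields=metrics/dfs/FSNamesystem'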

9.1.3.4.2. HDFS - NameNodes

Metrics to see status for the NameNodes.

RPC CLIENT QUEUE TIME
• RPC Client Port Queue Time: Average time that an RPC request (on the RPC port facing the HDFS clients) waits in the queue.
• RPC Client Port Queue Num Ops: Total number of RPC requests in the client port queue.

RPC CLIENT PORT PROCESSING TIME
• RPC Client Port Processing Time: Average RPC request processing time in milliseconds, on the client port.
• RPC Client Port Processing Num Ops: Total number of active RPC requests through the client port.

GC COUNT & GC TIME
• GC Count: Shows the JVM garbage collection rate on the NameNode.
• GC Time: Shows the garbage collection time in milliseconds.

GC PAR NEW
• GC Count Par New: The number of times young generation garbage collection happened.
• GC Time Par New: Indicates the duration of young generation garbage collection.

GC EXTRA SLEEP & WARNING THRESHOLD EXCEEDED
• GC Extra Sleep Time: Indicates total garbage collection extra sleep time.
• GC Warning Threshold Exceeded Count: Indicates the number of times that the garbage collection warning threshold is exceeded.

RPC CLIENT PORT QUEUE & BACKOFF
• RPC Client Port Queue Length: Indicates the current length of the RPC call queue.
• RPC Client Port Backoff: Indicates the number of client backoff requests.

RPC SERVICE PORT QUEUE & NUM OPS
• RPC Service Port Queue Time: Average time an RPC request waits in the queue, in milliseconds. These requests are on the RPC port facing the HDFS services, including DataNodes and the other NameNode.
• RPC Service Port Queue Num Ops: Total number of RPC requests waiting in the queue. These requests are on the RPC port facing the HDFS services, including DataNodes and the other NameNode.

RPC SERVICE PORT PROCESSING TIME & NUM OPS
• RPC Service Port Processing Time: Average RPC request processing time in milliseconds, for the service port.
• RPC Service Port Processing Num Ops: Number of RPC requests processed for the service port.

RPC SERVICE PORT CALL QUEUE LENGTH & SLOW CALLS
• RPC Service Port Call Queue Length: The current length of the RPC call queue.
• RPC Service Port Slow Calls: The number of slow RPC requests, for the service port.

TRANSACTIONS SINCE LAST EDIT & CHECKPOINT
• Transactions Since Last Edit Roll: Total number of transactions since the last editlog segment roll.
• Transactions Since Last Checkpoint: Total number of transactions since the last editlog segment checkpoint.

LOCK QUEUE LENGTH & EXPIRED HEARTBEATS
• Lock Queue Length: Shows the length of the wait queue for the FSNameSystemLock.
• Expired Heartbeats: Indicates the number of times expired heartbeats are detected on the NameNode.

THREADS BLOCKED / WAITING
• Threads Blocked: Indicates the number of threads in a BLOCKED state, which means they are waiting for a lock.
• Threads Waiting: Indicates the number of threads in a WAITING state, which means they are waiting for another thread to perform an action.

9.1.3.4.3. HDFS - DataNodes

Metrics to see status for the DataNodes.

BLOCKS WRITTEN / READ
• Blocks Written: The rate or number of blocks written to a DataNode.
• Blocks Read: The rate or number of blocks read from a DataNode.

FSYNCH TIME / NUM OPS
• Fsynch Time: Average fsync time.
• Fsynch Num Ops: Total number of fsync operations.

DATA PACKETS BLOCKED / NUM OPS
• Data Packet Blocked Time: Indicates the average waiting time of transferring a data packet on a DataNode.
• Data Packet Blocked Num Ops: Indicates the number of data packets transferred on a DataNode.

PACKET TRANSFER BLOCKED / NUM OPS
• Packet Transfer Time: Average transfer time of sending data packets on a DataNode.
• Packet Transfer Num Ops: Indicates the number of data packets blocked on a DataNode.

NETWORK ERRORS / GC COUNT
• Network Errors: Rate of network errors on the JVM.
• GC Count: Garbage collection count on a DataNode.

GC TIME / GC TIME PARNEW
• GC Time: JVM garbage collection time on a DataNode.
• GC Time ParNew: Young generation (ParNew) garbage collection time on a DataNode.

9.1.3.4.4. HDFS - Top-N

Metrics that show

• Which users perform most HDFS operations on the cluster

• Which HDFS operations run most often on the cluster.

TOP N - Operations Count
• Top N Total Operations Count, 1 min sliding window: Total operation count per operation for all users, shown for 1-minute intervals.
• Top N Total Operations Count, 5 min sliding window: Total operation count per operation for all users, shown for 5-minute intervals.
• Top N Total Operations Count, 25 min sliding window: Total operation count per operation for all users, shown for 25-minute intervals.

TOP N - Total Operations Count By User
• Top N Total Operations Count by User, 1 min sliding window: Total operation count per user, shown for 1-minute intervals.
• Top N Total Operations Count by User, 5 min sliding window: Total operation count per user, shown for 5-minute intervals.
• Top N Total Operations Count by User, 25 min sliding window: Total operation count per user, shown for 25-minute intervals.

TOP N - Operations by User
• TOP N - Operations by User, 1 min sliding window: The drilled-down User x Operation metrics against the TotalCount, shown for 1-minute intervals.
• TOP N - Operations by User, 5 min sliding window: The drilled-down User x Operation metrics against the TotalCount, shown for 5-minute intervals.
• TOP N - Operations by User, 25 min sliding window: The drilled-down User x Operation metrics against the TotalCount, shown for 25-minute intervals.

9.1.3.4.5. HDFS - Users

Metrics to see status for Users.

Namenode Rpc Caller Volume
• Namenode Rpc Caller Volume: Number of RPC calls made by the top (10) users.

Namenode Rpc Caller Priority
• Namenode Rpc Caller Priority: Priority assignment for incoming calls from the top (10) users.

9.1.3.5. YARN Dashboards

The following Grafana dashboards are available for YARN:

• YARN - Home [145]

• YARN - Applications [145]

• YARN - MR JobHistory Server [146]


• YARN - NodeManagers [146]

• YARN - Queues [147]

• YARN - ResourceManager [147]

9.1.3.5.1. YARN - Home

Metrics to see the overall status for the YARN cluster.

• Nodes: The number of (active, unhealthy, lost) nodes in the cluster.
• Apps: The number of (running, pending, completed, failed) apps in the cluster.
• Cluster Memory Available: Total available memory in the cluster.

9.1.3.5.2. YARN - Applications

Metrics to see status of Applications on the YARN Cluster.

• Applications By Running Time: Number of apps by running time, in 4 categories by default (< 1 hour, 1-5 hours, 5-24 hours, > 24 hours).
• Apps Running vs Pending: The number of running apps vs. the number of pending apps in the cluster.
• Apps Submitted vs Completed: The number of submitted apps vs. the number of completed apps in the cluster.
• Avg AM Launch Delay: The average time taken from allocating an AM container to launching an AM container.
• Avg AM Register Delay: The average time taken from when the RM launches an AM container to when the AM registers back with the RM.

9.1.3.5.3. YARN - MR JobHistory Server

Metrics to see status of the Job History Server.

JVM METRICS
• GC Count: Accumulated GC count over time.
• GC Time: Accumulated GC time over time.
• Heap Mem Usage: Current heap memory usage.
• NonHeap Mem Usage: Current non-heap memory usage.

9.1.3.5.4. YARN - NodeManagers

Metrics to see status of YARN NodeManagers on the YARN cluster.

NUM CONTAINERS
• Containers Running: Current number of running containers.
• Containers Failed: Accumulated number of failed containers.
• Containers Killed: Accumulated number of killed containers.
• Containers Completed: Accumulated number of completed containers.

MEMORY UTILIZATION
• Memory Available: Available memory for allocating containers on this node.
• Used Memory: Memory used by containers on this node.

DISK UTILIZATION
• Disk Utilization for Good Log Dirs: Disk utilization percentage across all good log directories.
• Disk Utilization for Good Local Dirs: Disk utilization percentage across all good local directories.
• Bad Log Dirs: Number of bad log directories.
• Bad Local Dirs: Number of bad local directories.

AVE CONTAINER LAUNCH DELAY
• Ave Container Launch Delay: Average time taken for an NM to launch a container.

RPC METRICS
• RPC Avg Processing Time: Average time for processing an RPC call.
• RPC Avg Queue Time: Average time for queuing an RPC call.
• RPC Call Queue Length: The length of the RPC call queue.
• RPC Slow Calls: Number of slow RPC calls.

JVM METRICS
• Heap Mem Usage: Current heap memory usage.
• NonHeap Mem Usage: Current non-heap memory usage.
• GC Count: Accumulated GC count over time.
• GC Time: Accumulated GC time over time.

LOG4J METRICS
• LOG ERROR: Number of ERROR logs.
• LOG FATAL: Number of FATAL logs.

9.1.3.5.5. YARN - Queues

Metrics to see status of Queues on the YARN cluster.

NUM APPS
• Apps Running: Current number of running applications.
• Apps Pending: Current number of pending applications.
• Apps Completed: Accumulated number of completed applications over time.
• Apps Failed: Accumulated number of failed applications over time.
• Apps Killed: Accumulated number of killed applications over time.
• Apps Submitted: Accumulated number of submitted applications over time.

NUM CONTAINERS
• Containers Running: Current number of running containers.
• Containers Pending: Current number of pending containers.
• Containers Reserved: Current number of reserved containers.
• Total Containers Allocated: Accumulated number of containers allocated over time.
• Total Node Local Containers Allocated: Accumulated number of node-local containers allocated over time.
• Total Rack Local Containers Allocated: Accumulated number of rack-local containers allocated over time.
• Total OffSwitch Containers Allocated: Accumulated number of off-switch containers allocated over time.

MEMORY UTILIZATION
• Allocated Memory: Current amount of memory allocated for containers.
• Pending Memory: Current amount of memory asked for by applications for allocating containers.
• Available Memory: Current amount of memory available for allocating containers.
• Reserved Memory: Current amount of memory reserved for containers.
• Memory Used by AM: Current amount of memory used by AM containers.

AM CONTAINER ALLOCATION DELAY
• Ave AM Container Allocation Delay: Average time taken to allocate an AM container after the container is requested.

9.1.3.5.6. YARN - ResourceManager

Metrics to see status of ResourceManagers on the YARN cluster.

RPC STATS
• RPC Avg Processing / Queue Time: Average time for processing/queuing an RPC call.
• RPC Call Queue Length: The length of the RPC call queue.
• RPC Slow Calls: Number of slow RPC calls.

MEMORY USAGE
• Heap Mem Usage: Current heap memory usage.
• NonHeap Mem Usage: Current non-heap memory usage.

GC STATS
• GC Count: Accumulated GC count over time.
• GC Time: Accumulated GC time over time.

LOG ERRORS
• Log Error / Fatal: Number of ERROR/FATAL logs.

AUTHORIZATION & AUTHENTICATION FAILURES
• RPC Authorization Failures: Number of authorization failures.
• RPC Authentication Failures: Number of authentication failures.

9.1.3.5.7. YARN - TimelineServer

Metrics to see the overall status for TimelineServer.

DATA READS
• Timeline Entity Data Reads: Accumulated number of read operations.
• Timeline Entity Data Read Time: Average time for reading a timeline entity.

DATA WRITES
• Timeline Entity Data Writes: Accumulated number of write operations.
• Timeline Entity Data Write Time: Average time for writing a timeline entity.

JVM METRICS
• GC Count: Accumulated GC count over time.
• GC Time: Accumulated GC time over time.
• Heap Usage: Current heap memory usage.
• NonHeap Usage: Current non-heap memory usage.

9.1.3.6. Hive Dashboards

The following Grafana dashboards are available for Hive:

• Hive - Home [148]

• Hive - HiveMetaStore [149]

• Hive - HiveServer2 [149]

9.1.3.6.1. Hive - Home

Metrics that show the overall status for Hive service.

WAREHOUSE SIZE - AT A GLANCE
• DB count at startup: Number of databases present at the last warehouse service startup time.
• Table count at startup: Number of tables present at the last warehouse service startup time.
• Partition count at startup: Number of partitions present at the last warehouse service startup time.

WAREHOUSE SIZE - REALTIME GROWTH
• #tables created (ongoing): Number of tables created since the last warehouse service startup.
• #partitions created (ongoing): Number of partitions created since the last warehouse service startup.

MEMORY PRESSURE
• HiveMetaStore Memory - Max: Heap memory usage by Hive MetaStores. If applicable, indicates max usage across multiple instances.
• HiveServer2 Memory - Max: Heap memory usage by HiveServer2. If applicable, indicates max usage across multiple instances.
• HiveMetaStore Offheap Memory - Max: Non-heap memory usage by Hive MetaStores. If applicable, indicates max usage across multiple instances.
• HiveServer2 Offheap Memory - Max: Non-heap memory usage by HiveServer2. If applicable, indicates max usage across multiple instances.
• HiveMetaStore app stop times (due to GC stops): Total time spent in application pauses caused by garbage collection across Hive MetaStores.
• HiveServer2 app stop times (due to GC stops): Total time spent in application pauses caused by garbage collection across HiveServer2.

METASTORE - CALL TIMES
• API call times - Health Check roundtrip (get_all_databases): Time taken to process a low-cost call made by health checks to all metastores.
• API call times - Moderate size call (get_partitions_by_names): Time taken to process a moderate-cost call made by queries, exports, and so on, to all metastores. Data for this metric may not be available in a less active warehouse.

9.1.3.6.2. Hive - HiveMetaStore

Metrics that show operating status for HiveMetaStore hosts. Select a HiveMetaStore and a host to view relevant metrics.

API TIMES
• API call times - Health Check roundtrip (get_all_databases): Time taken to process a low-cost call made by health checks to this metastore.
• API call times - Moderate size call (get_partitions_by_names): Time taken to process a moderate-cost call made by queries, exports, and so on, to this metastore. Data for this metric may not be available in a less active warehouse.

MEMORY PRESSURE
• App Stop times (due to GC): Time spent in application pauses caused by garbage collection.
• Heap Usage: Current heap memory usage.
• Off-Heap Usage: Current non-heap memory usage.

9.1.3.6.3. Hive - HiveServer2

Metrics that show operating status for HiveServer2 hosts. Select a HiveServer2 and a host to view relevant metrics.

API TIMES
• API call times - Health Check roundtrip (get_all_databases): Time taken to process a low-cost call made by health checks to the metastore embedded in this HiveServer2. Data for this metric may not be available if HiveServer2 is not running in embedded-metastore mode.
• API call times - Moderate size call (get_partitions_by_names): Time taken to process a moderate-cost call made by queries, exports, and so on, to the metastore embedded in this HiveServer2. Data for this metric may not be available in a less active warehouse, or if HiveServer2 is not running in embedded-metastore mode.

MEMORY PRESSURE
• App Stop times (due to GC): Time spent in application pauses caused by garbage collection.
• Heap Usage: Current heap memory usage.
• Off-Heap Usage: Current non-heap memory usage.

THREAD STATES
• Active operation count: Current number of active operations in HiveServer2 and their running states.
• Completed operation states: Number of completed operations on HiveServer2 since the last restart. Indicates whether they completed as expected or encountered errors.

9.1.3.7. Hive LLAP Dashboards

The following Grafana dashboards are available for Apache Hive LLAP. The LLAP Heat map dashboard and the LLAP Overview dashboard enable you to quickly see the hotspots among the LLAP daemons. If you find an issue and want to navigate to more specific information for a specific system, use the LLAP Daemon dashboard.

Note that all Hive LLAP dashboards show the state of the cluster and are useful for looking at cluster information from the previous hour or day. The dashboards do not show real-time results.

• Hive LLAP - Heatmap [150]

• Hive LLAP - Overview [151]

• Hive LLAP - Daemon [153]

9.1.3.7.1. Hive LLAP - Heatmap

The heat map dashboard shows all the nodes that are running LLAP daemons and includes a percentage summary for available executors and cache. This dashboard enables you to identify the hotspots in the cluster in terms of executors and cache.

The values in the table are color coded based on thresholds: if the value is more than 50%, the color is green; between 20% and 50%, yellow; and less than 20%, red.

Heat maps
• Remaining Cache Capacity: Shows the percentage of cache capacity remaining across the nodes. For example, if the grid is green, the cache is being underutilized. If the grid is red, there is high utilization of the cache.
• Remaining Cache Capacity: Same as above (Remaining Cache Capacity), but shows the cache hit ratio.
• Executor Free Slots: Shows the percentage of executor free slots that are available on each node.


9.1.3.7.2. Hive LLAP - Overview

The overview dashboard shows aggregated information across all of the nodes: for example, the total cache memory from all the nodes. This dashboard enables you to see that your cluster is configured and running correctly. For example, you might have configured 10 nodes but see only 8 nodes running.

If you find an issue by viewing this dashboard, you can open the LLAP Daemon dashboard to see which node is having the problem.

Overview
• Total Executor Threads: Shows the total number of executors across all nodes.
• Total Executor Memory: Shows the total amount of memory for executors across all nodes.
• Total Cache Memory: Shows the total amount of memory for cache across all nodes.
• Total JVM Memory: Shows the total amount of max Java Virtual Machine (JVM) memory across all nodes.

Cache Metrics Across all nodes
• Total Cache Usage: Shows the total amount of cache usage (Total, Remaining, and Used) across all nodes.
• Average Cache Hit Rate: As the data is released from the cache, the curve should increase. For example, the first query should run at 0, the second at 80-90 seconds, and then the third 10% faster. If, instead, it decreases, there might be a problem in the cluster.
• Average Cache Read Requests: Shows how many requests are being made for the cache and how many queries you are able to run that make use of the cache. If it says 0, for example, your cache might not be working properly and this grid might reveal a configuration issue.

Executor Metrics Across All nodes
• Total Executor Requests: Shows the total number of task requests that were handled, succeeded, failed, killed, evicted, and rejected across all nodes.

  Handled: Total requests across all sub-groups

  Succeeded: Total requests that were processed. For example, if you have 8-core machines, the number of total executor requests would be 8

  Failed: Did not complete successfully because, for example, you ran out of memory

  Rejected: If all task priorities are the same, but there are still not enough slots to fulfill the request, the system will reject some tasks

  Evicted: Lower priority requests are evicted if the slots are filled by higher priority requests

• Total Execution Slots: Shows the total execution slots, the number of free or available slots, and the number of slots occupied in the wait queue across all nodes. Ideally, the threads available (blue) result should be the same as the threads that are occupied in the queue result.
• Time to Kill Pre-empted Task (300s interval): Shows the time that it took to kill a query due to pre-emption in percentile (50th, 90th, 99th) latencies in 300 second intervals.
• Max Time To Kill Task (due to preemption): Shows the maximum time taken to kill a task due to pre-emption. This grid and the one above show you if you are wasting a lot of time killing queries. Time lost while a task is waiting to be killed is time lost in the cluster. If your max time to kill is high, you might want to disable this feature.
• Pre-emption Time Lost (300s interval): Shows the time lost due to pre-emption in percentile (50th, 90th, 99th) latencies in 300 second intervals.
• Max Time Lost In Cluster (due to pre-emption): Shows the maximum time lost due to pre-emption. If your max time to kill is high, you might want to disable this feature.

IO Elevator Metrics Across All Nodes
• Column Decoding Time (30s interval): Shows the percentile (50th, 90th, 99th) latencies for the time it takes to decode a column chunk (convert an encoded column chunk to column vector batches for processing) in 30 second intervals. The cache comes from the IO Elevator. It loads data from HDFS to the cache, and then from the cache to the executor. This metric shows how well the threads are performing and is useful to see that the threads are running.
• Max Column Decoding Time: Shows the maximum time taken to decode a column chunk (convert an encoded column chunk to column vector batches for processing).

JVM Metrics across all nodes
• Average JVM Heap Usage: Shows the average amount of Java Virtual Machine (JVM) heap memory used across all nodes. If the heap usage keeps increasing, you might run out of memory and the task failure count would also increase.
• Average JVM Non-Heap Usage: Shows the average amount of JVM non-heap memory used across all nodes.
• Max GcTotalExtraSleepTime: Shows the maximum garbage collection extra sleep time in milliseconds across all nodes. Garbage collection extra sleep time measures when the garbage collection monitoring is delayed (for example, the thread does not wake up after 500 milliseconds).
• Max GcTimeMillis: Shows the total maximum GC time in milliseconds across all nodes.
• Total JVM Threads: Shows the total number of JVM threads that are in a NEW, RUNNABLE, WAITING, TIMED_WAITING, and TERMINATED state across all nodes.

JVM Metrics
• Total JVM Heap Used: Shows the total amount of Java Virtual Machine (JVM) heap memory used in the daemon. If the heap usage keeps increasing, you might run out of memory and the task failure count would also increase.
• Total JVM Non-Heap Used: Shows the total amount of JVM non-heap memory used in the LLAP daemon. If the non-heap memory is over-allocated, you might run out of memory and the task failure count would also increase.
• Max GcTotalExtraSleepTime: Shows the maximum garbage collection extra sleep time in milliseconds in the LLAP daemon. Garbage collection extra sleep time measures when the garbage collection monitoring is delayed (for example, the thread does not wake up after 500 milliseconds).
• Max GcTimeMillis: Shows the total maximum GC time in milliseconds in the LLAP daemon.
• Max JVM Threads Runnable: Shows the maximum number of Java Virtual Machine (JVM) threads that are in RUNNABLE state.
• Max JVM Threads Blocked: Shows the maximum number of JVM threads that are in BLOCKED state. If you are seeing spikes in the threads blocked, you might have a problem with your LLAP daemon.
• Max JVM Threads Waiting: Shows the maximum number of JVM threads that are in WAITING state.
• Max JVM Threads Timed Waiting: Shows the maximum number of JVM threads that are in TIMED_WAITING state.

9.1.3.7.3. Hive LLAP - Daemon

Metrics that show operating status for Hive LLAP Daemons.

Row: Executor Metrics
• Total Requests Submitted: Shows the total number of task requests handled by the daemon.
• Total Requests Succeeded: Shows the total number of successful task requests handled by the daemon.
• Total Requests Failed: Shows the total number of failed task requests handled by the daemon.
• Total Requests Killed: Shows the total number of killed task requests handled by the daemon.
• Total Requests Evicted From Wait Queue: Shows the total number of task requests handled by the daemon that were evicted from the wait queue. Tasks are evicted if all of the executor threads are in use by higher priority tasks.
• Total Requests Rejected: Shows the total number of task requests handled by the daemon that were rejected by the task executor service. Tasks are rejected if all of the executor threads are in use and the wait queue is full of tasks that are not eligible for eviction.
• Available Execution Slots: Shows the total number of free slots that are available for execution, including free executor threads and free slots in the wait queue.
• 95th Percentile Pre-emption Time Lost (300s interval): Shows the 95th percentile latencies for time lost due to pre-emption in 300 second intervals.
• Max Pre-emption Time Lost: Shows the maximum time lost due to pre-emption.
• 95th Percentile Time to Kill Pre-empted Task (300s interval): Shows the 95th percentile latencies for time taken to kill tasks due to pre-emption in 300 second intervals.
• Max Time To Kill Pre-empted Task: Shows the maximum time taken to kill a task due to pre-emption.


Row: Cache Metrics
• Total Cache Used: Shows the total amount of cache usage (Total, Remaining, and Used) in the LLAP daemon cache.
• Heap Usage: Shows the amount of memory remaining in the LLAP daemon cache.
• Average Cache Hit Rate: As the data is released from the cache, the curve should increase. For example, the first query should run at 0, the second at 80-90 seconds, and then the third 10% faster. If, instead, it decreases, there might be a problem in the LLAP daemon.
• Total Cache Read Requests: Shows the total number of read requests received by the LLAP daemon cache.

Row: THREAD STATES
• 95th Percentile Column Decoding Time (30s interval): Shows the 95th percentile latencies for the time it takes to decode the column chunk (convert encoded column chunk to column vector batches for processing) in 30 second intervals. The cache comes from IO Elevator. It loads data from HDFS to the cache, and then from the cache to the executor. This metric shows how well the threads are performing and is useful to see that the threads are running.
• Max Column Decoding Time: Shows the maximum time taken to decode the column chunk (convert encoded column chunk to column vector batches for processing).

9.1.3.8. HBase Dashboards

Monitoring an HBase cluster is essential for maintaining a high-performance and stable system. The following Grafana dashboards are available for HBase:

• HBase - Home [154]

• HBase - RegionServers [155]

• HBase - Misc [160]

• HBase - Tables [161]

• HBase - Users [163]

Important

Ambari disables per-region, per-table, and per-user metrics for HBase by default. See Enabling Individual Region, Table, and User Metrics for HBase if you want the Ambari Metrics System to display the more granular metrics of HBase system performance on the individual region, table, or user level.

9.1.3.8.1. HBase - Home

The HBase - Home dashboards display basic statistics about an HBase cluster. These dashboards provide insight to the overall status for the HBase cluster.

Row: REGIONSERVERS / REGIONS
• Num RegionServers: Total number of RegionServers in the cluster.
• Num Dead RegionServers: Total number of RegionServers that are dead in the cluster.
• Num Regions: Total number of regions in the cluster.
• Avg Num Regions per RegionServer: Average number of regions per RegionServer.

Row: NUM REGIONS/STORES
• Num Regions / Stores - Total: Total number of regions and stores (column families) in the cluster.
• Store File Size / Count - Total: Total data file size and number of store files.

Row: NUM REQUESTS
• Num Requests - Total: Total number of requests (read, write, and RPCs) in the cluster.
• Num Request - Breakdown - Total: Total number of get, put, mutate, etc. requests in the cluster.

Row: REGIONSERVER MEMORY
• RegionServer Memory - Average: Average used, max, or committed on-heap and offheap memory for RegionServers.
• RegionServer Offheap Memory - Average: Average used, free, or committed on-heap and offheap memory for RegionServers.

Row: MEMORY - MEMSTORE BLOCKCACHE
• Memstore - BlockCache - Average: Average blockcache and memstore sizes for RegionServers.

Row: BLOCKCACHE
• Num Blocks in BlockCache - Total: Total number of (hfile) blocks in the blockcaches across all RegionServers.
• BlockCache Hit/Miss/s Total: Total number of blockcache hits, misses, and evictions across all RegionServers.
• BlockCache Hit Percent - Average: Average blockcache hit percentage across all RegionServers.

Row: OPERATION LATENCIES - GET/MUTATE
• Get Latencies - Average: Average min, median, max, 75th, 95th, and 99th percentile latencies for the Get operation across all RegionServers.
• Mutate Latencies - Average: Average min, median, max, 75th, 95th, and 99th percentile latencies for the Mutate operation across all RegionServers.

Row: OPERATION LATENCIES - DELETE/INCREMENT
• Delete Latencies - Average: Average min, median, max, 75th, 95th, and 99th percentile latencies for the Delete operation across all RegionServers.
• Increment Latencies - Average: Average min, median, max, 75th, 95th, and 99th percentile latencies for the Increment operation across all RegionServers.

Row: OPERATION LATENCIES - APPEND/REPLAY
• Append Latencies - Average: Average min, median, max, 75th, 95th, and 99th percentile latencies for the Append operation across all RegionServers.
• Replay Latencies - Average: Average min, median, max, 75th, 95th, and 99th percentile latencies for the Replay operation across all RegionServers.

Row: REGIONSERVER RPC
• RegionServer RPC - Average: Average number of RPCs, active handler threads, and open connections across all RegionServers.
• RegionServer RPC Queues - Average: Average number of calls in different RPC scheduling queues and the size of all requests in the RPC queue across all RegionServers.
• RegionServer RPC Throughput - Average: Average sent and received bytes from the RPC across all RegionServers.

9.1.3.8.2. HBase - RegionServers

The HBase - RegionServers dashboards display metrics for RegionServers in the monitored HBase cluster, including some performance-related data. These dashboards help you view basic I/O data and compare load among RegionServers.

Row: NUM REGIONS
• Num Regions: Number of regions in the RegionServer.

Row: STORE FILES
• Store File Size: Total size of the store files (data files) in the RegionServer.
• Store File Count: Total number of store files in the RegionServer.

Row: NUM REQUESTS
• Num Total Requests /s: Total number of requests (both read and write) per second in the RegionServer.
• Num Write Requests /s: Total number of write requests per second in the RegionServer.
• Num Read Requests /s: Total number of read requests per second in the RegionServer.

Row: NUM REQUESTS - GET / SCAN
• Num Get Requests /s: Total number of Get requests per second in the RegionServer.
• Num Scan Next Requests /s: Total number of Scan requests per second in the RegionServer.

Row: NUM REQUESTS - MUTATE / DELETE
• Num Mutate Requests /s: Total number of Mutate requests per second in the RegionServer.
• Num Delete Requests /s: Total number of Delete requests per second in the RegionServer.

Row: NUM REQUESTS - APPEND / INCREMENT
• Num Append Requests /s: Total number of Append requests per second in the RegionServer.
• Num Increment Requests /s: Total number of Increment requests per second in the RegionServer.
• Num Replay Requests /s: Total number of Replay requests per second in the RegionServer.

Row: MEMORY
• RegionServer Memory Used: Heap memory used by the RegionServer.
• RegionServer Offheap Memory Used: Offheap memory used by the RegionServer.

Row: MEMSTORE
• Memstore Size: Total Memstore memory size of the RegionServer.

Row: BLOCKCACHE - OVERVIEW
• BlockCache - Size: Total BlockCache size of the RegionServer.
• BlockCache - Free Size: Total free space in the BlockCache of the RegionServer.
• Num Blocks in Cache: Total number of hfile blocks in the BlockCache of the RegionServer.

Row: BLOCKCACHE - HITS/MISSES
• Num BlockCache Hits /s: Number of BlockCache hits per second in the RegionServer.
• Num BlockCache Misses /s: Number of BlockCache misses per second in the RegionServer.
• Num BlockCache Evictions /s: Number of BlockCache evictions per second in the RegionServer.
• BlockCache Caching Hit Percent: Percentage of BlockCache hits per second for requests that requested cache blocks in the RegionServer.
• BlockCache Hit Percent: Percentage of BlockCache hits per second in the RegionServer.

Row: OPERATION LATENCIES - GET
• Get Latencies - Mean: Mean latency for the Get operation in the RegionServer.
• Get Latencies - Median: Median latency for the Get operation in the RegionServer.
• Get Latencies - 75th Percentile: 75th percentile latency for the Get operation in the RegionServer.
• Get Latencies - 95th Percentile: 95th percentile latency for the Get operation in the RegionServer.
• Get Latencies - 99th Percentile: 99th percentile latency for the Get operation in the RegionServer.

• Get Latencies - Max: Max latency for the Get operation in the RegionServer.

Row: OPERATION LATENCIES - SCAN NEXT
• Scan Next Latencies - Mean: Mean latency for the Scan operation in the RegionServer.
• Scan Next Latencies - Median: Median latency for the Scan operation in the RegionServer.
• Scan Next Latencies - 75th Percentile: 75th percentile latency for the Scan operation in the RegionServer.
• Scan Next Latencies - 95th Percentile: 95th percentile latency for the Scan operation in the RegionServer.
• Scan Next Latencies - 99th Percentile: 99th percentile latency for the Scan operation in the RegionServer.
• Scan Next Latencies - Max: Max latency for the Scan operation in the RegionServer.

Row: OPERATION LATENCIES - MUTATE
• Mutate Latencies - Mean: Mean latency for the Mutate operation in the RegionServer.
• Mutate Latencies - Median: Median latency for the Mutate operation in the RegionServer.
• Mutate Latencies - 75th Percentile: 75th percentile latency for the Mutate operation in the RegionServer.
• Mutate Latencies - 95th Percentile: 95th percentile latency for the Mutate operation in the RegionServer.
• Mutate Latencies - 99th Percentile: 99th percentile latency for the Mutate operation in the RegionServer.
• Mutate Latencies - Max: Max latency for the Mutate operation in the RegionServer.

Row: OPERATION LATENCIES - DELETE
• Delete Latencies - Mean: Mean latency for the Delete operation in the RegionServer.
• Delete Latencies - Median: Median latency for the Delete operation in the RegionServer.
• Delete Latencies - 75th Percentile: 75th percentile latency for the Delete operation in the RegionServer.
• Delete Latencies - 95th Percentile: 95th percentile latency for the Delete operation in the RegionServer.
• Delete Latencies - 99th Percentile: 99th percentile latency for the Delete operation in the RegionServer.
• Delete Latencies - Max: Max latency for the Delete operation in the RegionServer.

Row: OPERATION LATENCIES - INCREMENT
• Increment Latencies - Mean: Mean latency for the Increment operation in the RegionServer.
• Increment Latencies - Median: Median latency for the Increment operation in the RegionServer.
• Increment Latencies - 75th Percentile: 75th percentile latency for the Increment operation in the RegionServer.
• Increment Latencies - 95th Percentile: 95th percentile latency for the Increment operation in the RegionServer.
• Increment Latencies - 99th Percentile: 99th percentile latency for the Increment operation in the RegionServer.

• Increment Latencies - Max: Max latency for the Increment operation in the RegionServer.

Row: OPERATION LATENCIES - APPEND
• Append Latencies - Mean: Mean latency for the Append operation in the RegionServer.
• Append Latencies - Median: Median latency for the Append operation in the RegionServer.
• Append Latencies - 75th Percentile: 75th percentile latency for the Append operation in the RegionServer.
• Append Latencies - 95th Percentile: 95th percentile latency for the Append operation in the RegionServer.
• Append Latencies - 99th Percentile: 99th percentile latency for the Append operation in the RegionServer.
• Append Latencies - Max: Max latency for the Append operation in the RegionServer.

Row: OPERATION LATENCIES - REPLAY
• Replay Latencies - Mean: Mean latency for the Replay operation in the RegionServer.
• Replay Latencies - Median: Median latency for the Replay operation in the RegionServer.
• Replay Latencies - 75th Percentile: 75th percentile latency for the Replay operation in the RegionServer.
• Replay Latencies - 95th Percentile: 95th percentile latency for the Replay operation in the RegionServer.
• Replay Latencies - 99th Percentile: 99th percentile latency for the Replay operation in the RegionServer.
• Replay Latencies - Max: Max latency for the Replay operation in the RegionServer.

Row: RPC - OVERVIEW
• Num RPC /s: Number of RPCs per second in the RegionServer.
• Num Active Handler Threads: Number of active RPC handler threads (to process requests) in the RegionServer.
• Num Connections: Number of connections to the RegionServer.

Row: RPC - QUEUES
• Num RPC Calls in General Queue: Number of RPC calls in the general processing queue in the RegionServer.
• Num RPC Calls in Priority Queue: Number of RPC calls in the high priority (for system tables) processing queue in the RegionServer.
• Num RPC Calls in Replication Queue: Number of RPC calls in the replication processing queue in the RegionServer.
• RPC - Total Call Queue Size: Total data size of all RPC calls in the RPC queues in the RegionServer.

Row: RPC - CALL QUEUED TIMES
• RPC - Call Queued Time - Mean: Mean latency for RPC calls to stay in the RPC queue in the RegionServer.
• RPC - Call Queued Time - Median: Median latency for RPC calls to stay in the RPC queue in the RegionServer.
• RPC - Call Queued Time - 75th Percentile: 75th percentile latency for RPC calls to stay in the RPC queue in the RegionServer.
• RPC - Call Queued Time - 95th Percentile: 95th percentile latency for RPC calls to stay in the RPC queue in the RegionServer.
• RPC - Call Queued Time - 99th Percentile: 99th percentile latency for RPC calls to stay in the RPC queue in the RegionServer.
• RPC - Call Queued Time - Max: Max latency for RPC calls to stay in the RPC queue in the RegionServer.

Row: RPC - CALL PROCESS TIMES
• RPC - Call Process Time - Mean: Mean latency for RPC calls to be processed in the RegionServer.
• RPC - Call Process Time - Median: Median latency for RPC calls to be processed in the RegionServer.
• RPC - Call Process Time - 75th Percentile: 75th percentile latency for RPC calls to be processed in the RegionServer.
• RPC - Call Process Time - 95th Percentile: 95th percentile latency for RPC calls to be processed in the RegionServer.
• RPC - Call Process Time - 99th Percentile: 99th percentile latency for RPC calls to be processed in the RegionServer.
• RPC - Call Process Time - Max: Max latency for RPC calls to be processed in the RegionServer.

Row: RPC - THROUGHPUT
• RPC - Received bytes /s: Received bytes from the RPC in the RegionServer.
• RPC - Sent bytes /s: Sent bytes from the RPC in the RegionServer.

Row: WAL - FILES
• Num WAL - Files: Number of Write-Ahead-Log files in the RegionServer.
• Total WAL File Size: Total file size of Write-Ahead-Logs in the RegionServer.

Row: WAL - THROUGHPUT
• WAL - Num Appends /s: Number of append operations per second to the filesystem in the RegionServer.
• WAL - Num Sync /s: Number of sync operations per second to the filesystem in the RegionServer.

Row: WAL - SYNC LATENCIES
• WAL - Sync Latencies - Mean: Mean latency for the Write-Ahead-Log sync operation to the filesystem in the RegionServer.
• WAL - Sync Latencies - Median: Median latency for the Write-Ahead-Log sync operation to the filesystem in the RegionServer.
• WAL - Sync Latencies - 75th Percentile: 75th percentile latency for the Write-Ahead-Log sync operation to the filesystem in the RegionServer.
• WAL - Sync Latencies - 95th Percentile: 95th percentile latency for the Write-Ahead-Log sync operation to the filesystem in the RegionServer.
• WAL - Sync Latencies - 99th Percentile: 99th percentile latency for the Write-Ahead-Log sync operation to the filesystem in the RegionServer.
• WAL - Sync Latencies - Max: Max latency for the Write-Ahead-Log sync operation to the filesystem in the RegionServer.

Row: WAL - APPEND LATENCIES
• WAL - Append Latencies - Mean: Mean latency for the Write-Ahead-Log append operation to the filesystem in the RegionServer.
• WAL - Append Latencies - Median: Median latency for the Write-Ahead-Log append operation to the filesystem in the RegionServer.
• WAL - Append Latencies - 75th Percentile: 75th percentile latency for the Write-Ahead-Log append operation to the filesystem in the RegionServer.
• WAL - Append Latencies - 95th Percentile: 95th percentile latency for the Write-Ahead-Log append operation to the filesystem in the RegionServer.
• WAL - Append Latencies - 99th Percentile: 99th percentile latency for the Write-Ahead-Log append operation to the filesystem in the RegionServer.
• WAL - Append Latencies - Max: Max latency for the Write-Ahead-Log append operation to the filesystem in the RegionServer.

Row: WAL - APPEND SIZES
• WAL - Append Sizes - Mean: Mean data size for the Write-Ahead-Log append operation to the filesystem in the RegionServer.
• WAL - Append Sizes - Median: Median data size for the Write-Ahead-Log append operation to the filesystem in the RegionServer.
• WAL - Append Sizes - 75th Percentile: 75th percentile data size for the Write-Ahead-Log append operation to the filesystem in the RegionServer.
• WAL - Append Sizes - 95th Percentile: 95th percentile data size for the Write-Ahead-Log append operation to the filesystem in the RegionServer.
• WAL - Append Sizes - 99th Percentile: 99th percentile data size for the Write-Ahead-Log append operation to the filesystem in the RegionServer.
• WAL - Append Sizes - Max: Max data size for the Write-Ahead-Log append operation to the filesystem in the RegionServer.

Row: SLOW OPERATIONS
• WAL Num Slow Append /s: Number of append operations per second to the filesystem that took more than 1 second in the RegionServer.
• Num Slow Gets /s: Number of Get requests per second that took more than 1 second in the RegionServer.
• Num Slow Puts /s: Number of Put requests per second that took more than 1 second in the RegionServer.
• Num Slow Deletes /s: Number of Delete requests per second that took more than 1 second in the RegionServer.

Row: FLUSH/COMPACTION QUEUES
• Flush Queue Length: Number of Flush operations waiting to be processed in the RegionServer. A higher number indicates flush operations being slow.
• Compaction Queue Length: Number of Compaction operations waiting to be processed in the RegionServer. A higher number indicates compaction operations being slow.
• Split Queue Length: Number of Region Split operations waiting to be processed in the RegionServer. A higher number indicates split operations being slow.

Row: JVM - GC COUNTS
• GC Count /s: Number of Java Garbage Collections per second.
• GC Count ParNew /s: Number of Java ParNew (YoungGen) Garbage Collections per second.
• GC Count CMS /s: Number of Java CMS Garbage Collections per second.

Row: JVM - GC TIMES
• GC Times /s: Total time spent in Java Garbage Collections per second.
• GC Times ParNew /s: Total time spent in Java ParNew (YoungGen) Garbage Collections per second.
• GC Times CMS /s: Total time spent in Java CMS Garbage Collections per second.

Row: LOCALITY
• Percent Files Local: Percentage of files served from the local DataNode for the RegionServer.

9.1.3.8.3. HBase - Misc

The HBase - Misc dashboards display miscellaneous metrics related to the HBase cluster. You can use these metrics for tasks like debugging authentication and authorization issues and exceptions raised by RegionServers.

Row: REGIONS IN TRANSITION
• Master - Regions in Transition: Number of regions in transition in the cluster.
• Master - Regions in Transition Longer Than Threshold Time: Number of regions in transition that are in transition state for longer than 1 minute in the cluster.
• Regions in Transition Oldest Age: Maximum time that a region stayed in transition state.

Row: NUM THREADS - RUNNABLE
• Master Num Threads - Runnable: Number of threads in the Master.
• RegionServer Num Threads - Runnable: Number of threads in the RegionServer.

Row: NUM THREADS - BLOCKED
• Master Num Threads - Blocked: Number of threads in the Blocked state in the Master.
• RegionServer Num Threads - Blocked: Number of threads in the Blocked state in the RegionServer.

Row: NUM THREADS - WAITING
• Master Num Threads - Waiting: Number of threads in the Waiting state in the Master.
• RegionServer Num Threads - Waiting: Number of threads in the Waiting state in the RegionServer.

Row: NUM THREADS - TIMED WAITING
• Master Num Threads - Timed Waiting: Number of threads in the Timed-Waiting state in the Master.
• RegionServer Num Threads - Timed Waiting: Number of threads in the Timed-Waiting state in the RegionServer.

Row: NUM THREADS - NEW
• Master Num Threads - New: Number of threads in the New state in the Master.
• RegionServer Num Threads - New: Number of threads in the New state in the RegionServer.

Row: NUM THREADS - TERMINATED
• Master Num Threads - Terminated: Number of threads in the Terminated state in the Master.
• RegionServer Num Threads - Terminated: Number of threads in the Terminated state in the RegionServer.

Row: RPC AUTHENTICATION
• RegionServer RPC Authentication Successes /s: Number of successful RPC authentications per second in the RegionServer.
• RegionServer RPC Authentication Failures /s: Number of failed RPC authentications per second in the RegionServer.

Row: RPC AUTHORIZATION
• RegionServer RPC Authorization Successes /s: Number of successful RPC authorizations per second in the RegionServer.
• RegionServer RPC Authorization Failures /s: Number of failed RPC authorizations per second in the RegionServer.

Row: EXCEPTIONS
• Master Exceptions /s: Number of exceptions in the Master.
• RegionServer Exceptions /s: Number of exceptions in the RegionServer.

9.1.3.8.4. HBase - Tables

HBase - Tables metrics reflect data on the table level. The dashboards and data help you compare load distribution and resource use among tables in a cluster at different times.


Row: NUM REGIONS/STORES
• Num Regions: Number of regions for the table(s).
• Num Stores: Number of stores for the table(s).

Row: TABLE SIZE
• Table Size: Total size of the data (store files and MemStore) for the table(s).
• Average Region Size: Average size of the region for the table(s). Average Region Size is calculated from the average of average region sizes reported by each RegionServer (may not be the true average).

Row: MEMSTORE SIZE
• MemStore Size: Total MemStore size of the table(s).

Row: STORE FILES
• Store File Size: Total size of the store files (data files) for the table(s).
• Num Store Files: Total number of store files for the table(s).

Row: STORE FILE AGE
• Max Store File Age: Maximum age of store files for the table(s). As compactions rewrite data, store files are also rewritten. Max Store File Age is calculated from the maximum of all maximum store file ages reported by each RegionServer.
• Min Store File Age: Minimum age of store files for the table(s). As compactions rewrite data, store files are also rewritten. Min Store File Age is calculated from the minimum of all minimum store file ages reported by each RegionServer.
• Average Store File Age: Average age of store files for the table(s). As compactions rewrite data, store files are also rewritten. Average Store File Age is calculated from the average of average store file ages reported by each RegionServer.
• Num Reference Files - Total on All: Total number of reference files for the table(s).

Row: NUM TOTAL REQUESTS
• Num Total Requests /s on Tables: Total number of requests (both read and write) per second for the table(s).

Row: NUM READ REQUESTS
• Num Read Requests /s: Total number of read requests per second for the table(s).

Row: NUM WRITE REQUESTS
• Num Write Requests /s: Total number of write requests per second for the table(s).

Row: NUM FLUSHES
• Num Flushes /s: Total number of flushes per second for the table(s).

Row: FLUSHED BYTES
• Flushed MemStore Bytes: Total number of flushed MemStore bytes for the table(s).
• Flushed Output Bytes: Total number of flushed output bytes for the table(s).

Row: FLUSH TIME HISTOGRAM
• Flush Time Mean: Mean latency for the Flush operation for the table(s).
• Flush Time Median: Median latency for the Flush operation for the table(s).
• Flush Time 95th Percentile: 95th percentile latency for the Flush operation for the table(s).
• Flush Time Max: Maximum latency for the Flush operation for the table(s).

Row: FLUSH MEMSTORE SIZE HISTOGRAM
• Flush MemStore Size Mean: Mean size of the MemStore for the Flush operation for the table(s).
• Flush MemStore Size Median: Median size of the MemStore for the Flush operation for the table(s).
• Flush MemStore Size 95th Percentile: 95th percentile size of the MemStore for the Flush operation for the table(s).
• Flush MemStore Size Max: Max size of the MemStore for the Flush operation for the table(s).

Row: FLUSH OUTPUT SIZE HISTOGRAM
• Flush Output Size Mean: Mean size of the output file for the Flush operation for the table(s).
• Flush Output Size Median: Median size of the output file for the Flush operation for the table(s).
• Flush Output Size 95th Percentile: 95th percentile size of the output file for the Flush operation for the table(s).
• Flush Output Size Max: Max size of the output file for the Flush operation for the table(s).

9.1.3.8.5. HBase - Users

The HBase - Users dashboards display metrics and detailed data on a per-user basis across the cluster. You can click the second drop-down arrow in the upper-left corner to select a single user, a group of users, or all users, and you can change your user selection at any time.

Row: NUM REQUESTS - GET/SCAN
• Num Get Requests /s: Total number of Get requests per second for the user(s).
• Num Scan Next Requests /s: Total number of Scan requests per second for the user(s).

Row: NUM REQUESTS - MUTATE/DELETE
• Num Mutate Requests /s: Total number of Mutate requests per second for the user(s).
• Num Delete Requests /s: Total number of Delete requests per second for the user(s).

Row: NUM REQUESTS - APPEND/INCREMENT
• Num Append Requests /s: Total number of Append requests per second for the user(s).
• Num Increment Requests /s: Total number of Increment requests per second for the user(s).

9.1.3.9. Kafka Dashboards

The following Grafana dashboards are available for Kafka:

• Kafka - Home [163]

• Kafka - Hosts [164]

• Kafka - Topics [164]

9.1.3.9.1. Kafka - Home

Metrics that show overall status for the Kafka cluster.

Row: BYTES IN & OUT / MESSAGES IN
• Bytes In & Bytes Out /sec: Rate at which bytes are produced into the Kafka cluster and the rate at which bytes are being consumed from the Kafka cluster.
• Messages In /sec: Number of messages produced into the Kafka cluster.

Row: CONTROLLER/LEADER COUNT & REPLICA MAXLAG
• Active Controller Count: Number of active controllers in the Kafka cluster. This should always equal one.
• Replica MaxLag: Shows the lag of each replica from the leader.
• Leader Count: Number of partitions for which a particular host is the leader.

Row: UNDER REPLICATED PARTITIONS & OFFLINE PARTITIONS COUNT
• Under Replicated Partitions: Indicates if any partitions in the cluster are under-replicated.
• Offline Partitions Count: Indicates if any partitions are offline (which means that no leaders or replicas are available for producing or consuming).

Row: PRODUCER & CONSUMER REQUESTS
• Producer Req /sec: Rate at which producer requests are made to the Kafka cluster.
• Consumer Req /sec: Rate at which consumer requests are made from the Kafka cluster.

Row: LEADER ELECTION AND UNCLEAN LEADER ELECTIONS
• Leader Election Rate: Rate at which leader election is happening in the Kafka cluster.
• Unclean Leader Elections: Indicates if there are any unclean leader elections. An unclean leader election indicates that a replica which is not part of the ISR is elected as a leader.

Row: ISR SHRINKS / ISR EXPANDS
• IsrShrinksPerSec: If the broker goes down, the ISR shrinks. In such a case, this metric indicates if any of the partitions are not part of the ISR.
• IsrExpandsPerSec: Once the broker comes back up and catches up with the leader, this metric indicates if any partitions rejoined the ISR.

Row: REPLICA FETCHER MANAGER
• ReplicaFetcherManager MaxLag: The maximum lag in messages between the follower and leader replicas.

9.1.3.9.2. Kafka - Hosts

Metrics that show operating status for Kafka cluster on a per broker level.

Use the drop-down menus to customize your results:

• Kafka broker

• Host

• Whether to view the largest (top) or the smallest (bottom) values

• Number of values that you want to view

• Aggregator to use: average, max value, or the sum of values

Row: BYTES IN & OUT / MESSAGES IN / UNDER REPLICATED PARTITIONS
• Bytes In & Bytes Out /sec: Rate at which bytes are produced into the Kafka broker and the rate at which bytes are being consumed from the Kafka broker.
• Messages In /sec: Number of messages produced into the Kafka broker.
• Under Replicated Partitions: Number of under-replicated partitions in the Kafka broker.

Row: PRODUCER & CONSUMER REQUESTS
• Producer Req /sec: Rate at which producer requests are made to the Kafka broker.
• Consumer Req /sec: Rate at which consumer requests are made from the Kafka broker.

Row: REPLICA MANAGER PARTITION/LEADER/FETCHER MANAGER MAX LAG
• Replica Manager Partition Count: Number of topic partitions being replicated for the Kafka broker.
• Replica Manager Leader Count: Number of topic partitions for which the Kafka broker is the leader.
• Replica Fetcher Manager MaxLag clientId Replica: Shows the lag in replicating topic partitions.

Row: ISR SHRINKS / ISR EXPANDS
• IsrShrinks /sec: Indicates if any replicas failed to be in the ISR for the host.
• IsrExpands /sec: Indicates if any replica has caught up with the leader and re-joined the ISR for the host.

9.1.3.9.3. Kafka - Topics

Metrics related to Kafka cluster on a per topic level. Select a topic (by default, all topics are selected) to view the metrics for that topic.


Row: MESSAGES IN/OUT & BYTES IN/OUT
• MessagesInPerSec: Rate at which messages are being produced into the topic.
• MessagesOutPerSec: Rate at which messages are being consumed from the topic.

Row: TOTAL FETCH REQUESTS
• TotalFetchRequestsPerSec: Number of consumer requests coming for the topic.

Row: TOTAL PRODUCE REQUESTS /SEC
• TotalProduceRequestsPerSec: Number of producer requests being sent to the topic.

Row: FETCHER LAG METRICS CONSUMER LAG
• FetcherLagMetrics ConsumerLag: Shows the replica fetcher lag for the topic.

9.1.3.10. Storm Dashboards

The following Grafana dashboards are available for Storm:

• Storm - Home [165]

• Storm - Topology [165]

• Storm - Components [166]

9.1.3.10.1. Storm - Home

Metrics that show the operating status for Storm.

Row: (unnamed)
• Topologies: Number of topologies in the cluster.
• Supervisors: Number of supervisors in the cluster.
• Total Executors: Total number of executors running for all topologies in the cluster.
• Total Tasks: Total number of tasks for all topologies in the cluster.

Row: (unnamed)
• Free Slots: Number of free slots for all supervisors in the cluster.
• Used Slots: Number of used slots for all supervisors in the cluster.
• Total Slots: Total number of slots for all supervisors in the cluster. Should be more than 0.

9.1.3.10.2. Storm - Topology

Metrics that show the overall operating status for Storm topologies. Select a topology (by default, all topologies are selected) to view metrics for that topology.

Row: RECORDS
• All Tasks Input/Output: Input Records is the number of input messages executed on all tasks, and Output Records is the number of messages emitted on all tasks.
• All Tasks Acked Tuples: Number of messages acked (completed) on all tasks.
• All Tasks Failed Tuples: Number of messages failed on all tasks.

Row: LATENCY / QUEUE
• All Spouts Latency: Average latency on all spout tasks.
• All Tasks Queue: Receive Queue Population is the total number of tuples waiting in the receive queue, and Send Queue Population is the total number of tuples waiting in the send queue.

Row: MEMORY USAGE
• All workers memory usage on heap: Used bytes on heap for all workers in the topology.
• All workers memory usage on non-heap: Used bytes on non-heap for all workers in the topology.

Row: GC
• All workers GC count: PSScavenge count is the number of occurrences for the parallel scavenge collector, and PSMarkSweep count is the number of occurrences for the parallel scavenge mark and sweep collector.
• All workers GC time: PSScavenge timeMs is the sum of the time the parallel scavenge collector takes (in milliseconds), and PSMarkSweep timeMs is the sum of the time the parallel scavenge mark and sweep collector takes (in milliseconds).

Note that GC metrics are provided based on the worker GC setting, so these metrics are only available for the default GC option for worker.childopts. If you use another GC option for worker, you need to copy the dashboard and update the metric name manually.

9.1.3.10.3. Storm - Components

Metrics that show operating status for Storm topologies on a per component level. Select a topology and a component to view related metrics.

Row: RECORDS
• Input/Output: Input Records is the number of messages executed on the selected component, and Output Records is the number of messages emitted on the selected component.
• Acked Tuples: Number of messages acked (completed) on the selected component.
• Failed Tuples: Number of messages failed on the selected component.

Row: LATENCY / QUEUE
• Latency: Complete Latency is the average complete latency on the selected component (for Spout), and Process Latency is the average process latency on the selected component (for Bolt).
• Queue: Receive Queue Population is the total number of tuples waiting in receive queues on the selected component, and Send Queue Population is the total number of tuples waiting in send queues on the selected component.

9.1.3.11. System Dashboards

The following Grafana dashboards are available for System:

• System - Home [166]

• System - Servers [167]

9.1.3.11.1. System - Home

Metrics to see the overall status of the cluster.

Row: OVERVIEW - AVERAGES
• Logical CPU Count Per Server: Average number of CPUs (including hyperthreading) aggregated for selected hosts.
• Total Memory Per Server: Total system memory available per server aggregated for selected hosts.
• Total Disk Space Per Server: Total disk space per server aggregated for selected hosts.

Row: OVERVIEW - TOTALS
• Logical CPU Count Total: Total number of CPUs (including hyperthreading) aggregated for selected hosts.
• Total Memory: Total system memory available per server aggregated for selected hosts.
• Total Disk Space: Total disk space per server aggregated for selected hosts.

Row: CPU
• CPU Utilization - Average: CPU utilization aggregated for selected hosts.

Row: SYSTEM LOAD
• System Load - Average: Load average (1 min, 5 min, and 15 min) aggregated for selected hosts.

Row: MEMORY
• Memory - Average: Average system memory utilization aggregated for selected hosts.
• Memory - Total: Total system memory available aggregated for selected hosts.

Row: DISK UTILIZATION
• Disk Utilization - Average: Average disk usage aggregated for selected hosts.
• Disk Utilization - Total: Total disk available for selected hosts.

Row: DISK IO
• Disk IO - Average (upper chart): Disk read/write counts (iops) co-related with bytes aggregated for selected hosts.
• Disk IO - Average (lower chart): Average individual read/write statistics as MBps aggregated for selected hosts.
• Disk IO - Total: Sum of read/write bytes/sec aggregated for selected hosts.

Row: NETWORK IO
• Network IO - Average: Average network statistics as MBps aggregated for selected hosts.
• Network IO - Total: Sum of network packets as MBps aggregated for selected hosts.

Row: NETWORK PACKETS
• Network Packets - Average: Average of network packets as KBps aggregated for selected hosts.

Row: SWAP/NUM PROCESSES
• Swap Space - Average: Average swap space statistics aggregated for selected hosts.
• Num Processes - Average: Average number of processes aggregated for selected hosts.

Note

• Average implies sum/count for values reported by all hosts in the cluster. Example: In a 30 second window, if 98 out of 100 hosts reported 1 or more value, it is the SUM(Avg value from each host + Interpolated value for 2 missing hosts)/100.

• Sum/Total implies the sum of all values in a timeslice (30 seconds) from all hosts in the cluster. The same interpolation rule applies.
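Written out, the first bullet's example corresponds to the following (a sketch of the interpolation rule; the symbols v_h and \hat{v}_h are introduced here for illustration and are not source notation):

Average = ( \sum_{h=1}^{98} v_h + \sum_{h=99}^{100} \hat{v}_h ) / 100

where v_h is the average value reported by host h in the 30-second window and \hat{v}_h is the interpolated value for a host that did not report.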

9.1.3.11.2. System - Servers

Metrics to see the system status per host on the server.

Row: CPU - USER/SYSTEM
• CPU Utilization - User: CPU utilization per user for selected hosts.
• CPU Utilization - System: CPU utilization per system for selected hosts.

Row: CPU - NICE/IDLE
• CPU Utilization - Nice: CPU nice (Unix) time spent for selected hosts.
• CPU Utilization - Idle: CPU idle time spent for selected hosts.

Row: CPU - IOWAIT/INTR
• CPU Utilization - iowait: CPU IO wait time for selected hosts.
• CPU Utilization - Hardware Interrupt: CPU IO interrupt execute time for selected hosts.

Row: CPU - SOFTINTR/STEAL
• CPU Utilization - Software Interrupt: CPU time spent processing soft irqs for selected hosts.
• CPU Utilization - Steal (VM): CPU time spent processing steal time (virtual cpu wait) for selected hosts.

Row: SYSTEM LOAD - 1 MINUTE
• System Load Average - 1 Minute: 1 minute load average for selected hosts.

Row: SYSTEM LOAD - 5 MINUTE
• System Load Average - 5 Minute: 5 minute load average for selected hosts.

Row: SYSTEM LOAD - 15 MINUTE
• System Load Average - 15 Minute: 15 minute load average for selected hosts.

Row: MEMORY - TOTAL/USED
• Memory - Total: Total memory in GB for selected hosts.
• Memory - Used: Used memory in GB for selected hosts.

Row: MEMORY - FREE/CACHED
• Memory - Free: Total free memory in GB for selected hosts.
• Memory - Cached: Total cached memory in GB for selected hosts.

Row: MEMORY - BUFFERED/SHARED
• Memory - Buffered: Total buffered memory in GB for selected hosts.
• Memory - Shared: Total shared memory in GB for selected hosts.

Row: DISK UTILIZATION
• Disk Used: Disk space used in GB for selected hosts.
• Disk Free: Disk space available in GB for selected hosts.

Row: DISK IO
• Read Bytes: IOPS as read MBps for selected hosts.
• Write Bytes: IOPS as write MBps for selected hosts.

Row: DISK IOPS
• Read Count: IOPS as read count for selected hosts.
• Write Count: IOPS as write count for selected hosts.

Row: NETWORK IO
• Network Bytes Received: Network utilization as byte/sec received for selected hosts.
• Network Bytes Sent: Network utilization as byte/sec sent for selected hosts.

Row: NETWORK PACKETS
• Network Packets Received: Network utilization as packets received for selected hosts.
• Network Packets Sent: Network utilization as packets sent for selected hosts.

Row: SWAP
• Swap Space - Total: Total swap space available for selected hosts.
• Swap Space - Free: Total free swap space for selected hosts.

Row: NUM PROCESSES
• Num Processes - Total: Count of processes and total running processes for selected hosts.
• Num Processes - Runnable: Count of processes and total running processes for selected hosts.

9.1.3.12. NiFi Dashboard

The following Grafana dashboard is available for NiFi:

• NiFi-Home [169]


9.1.3.12.1. NiFi-Home

You can use the following metrics to assess the general health of your NiFi cluster.

For all metrics available in the NiFi-Home dashboard, the single value you see is the average of the information submitted by each node in your NiFi cluster.

Row: JVM Info
• JVM Heap Usage: Displays the amount of memory being used by the JVM process. For NiFi, the default configuration is 512 MB.
• JVM File Descriptor Usage: Shows the number of connections to the operating system. You can monitor this metric to ensure that your JVM file descriptors, or connections, are opening and closing as tasks complete.
• JVM Uptime: Displays how long a Java process has been running. You can use this metric to monitor Java process longevity, and any unexpected restarts.

Row: Thread Info
• Active Threads: NiFi has two user-configurable thread pools:

• Maximum timer driven thread count (default 10)

• Maximum event driven thread count (default 5)

This metric displays the number of active threads from these two pools.
• Thread Count: Displays the total number of threads for the JVM process that is running NiFi. This value is larger than the two pools above, because NiFi uses more than just the timer and event driven threads.
• Daemon Thread Count: Displays the number of daemon threads that are running. A daemon thread is a thread that does not prevent the JVM from exiting when the program finishes, even if the thread is still running.

Row: FlowFile Info
• FlowFiles Received: Displays the number of FlowFiles received into NiFi from an external system in the last 5 minutes.
• FlowFiles Sent: Displays the number of FlowFiles sent from NiFi to an external system in the last 5 minutes.
• FlowFiles Queued: Displays the number of FlowFiles queued in a NiFi processor connection.

Row: Byte Info
• Bytes Received: Displays the number of bytes of FlowFile data received into NiFi from an external system, in the last 5 minutes.
• Bytes Sent: Displays the number of bytes of FlowFile data sent from NiFi to an external system, in the last 5 minutes.
• Bytes Queued: Displays the number of bytes of FlowFile data queued in a NiFi processor connection.

9.1.4. AMS Performance Tuning

To set up the Ambari Metrics System in your environment, review and customize the following Metrics Collector configuration options.

• Customizing the Metrics Collector Mode [170]

• Customizing TTL Settings [171]

• Customizing Memory Settings [172]


• Customizing Cluster-Environment-Specific Settings [172]

• Moving the Metrics Collector [173]

• (Optional) Enabling Individual Region, Table, and User Metrics for HBase [174]

9.1.4.1. Customizing the Metrics Collector Mode

Metrics Collector is built using Hadoop technologies such as Apache HBase, Apache Phoenix, and the Application Timeline Server (ATS). The Collector can store metrics data on the local file system, referred to as embedded mode, or use an external HDFS, referred to as distributed mode. By default, the Collector runs in embedded mode. In embedded mode, the Collector captures and writes metrics to the local file system on the host where the Collector is running.

Important

When running in embedded mode, you should confirm that hbase.rootdir and hbase.tmp.dir point to adequately sized and lightly used partitions. Check that the directory configurations in Ambari Metrics > Configs > Advanced > ams-hbase-site use a sufficiently sized and not heavily utilized partition, such as:

file:///grid/0/var/lib/ambari-metrics-collector/hbase

You should also confirm that the TTL settings are appropriate.
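For example, a quick way to check the partition backing that directory (the path is the example above; substitute your configured hbase.rootdir value):

df -h /grid/0/var/lib/ambari-metrics-collector/hbase
du -sh /grid/0/var/lib/ambari-metrics-collector/hbase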

When the Collector is configured for distributed mode, it writes metrics to HDFS, and the components run in distributed processes, which helps to manage CPU and memory consumption.

To switch the Metrics Collector from embedded mode to distributed mode:

Steps

1. In Ambari Web, browse to Services > Ambari Metrics > Configs.

2. Change the values of listed properties to the values shown in the following table:

• General: Metrics Service operation mode (timeline.metrics.service.operation.mode). Designates whether to run in distributed or embedded mode. Value: distributed
• Advanced ams-hbase-site: hbase.cluster.distributed. Indicates AMS will run in distributed mode. Value: true
• Advanced ams-hbase-site: hbase.rootdir. The HDFS directory location where metrics will be stored. Value: hdfs://$NAMENODE_FQDN:8020/apps/ams/metrics

3. Using Ambari Web > Hosts > Components, restart the Metrics Collector.

If your cluster is configured for a highly available NameNode, set the hbase.rootdir value to use the HDFS name service instead of the NameNode host name:

hdfs://hdfsnameservice/apps/ams/metrics
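After the restart, you can spot-check that the Collector is writing metrics to HDFS; a minimal sketch, assuming the /apps/ams/metrics location configured above:

su - hdfs -c 'hdfs dfs -ls /apps/ams/metrics'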


Optionally, you can migrate existing data from the local store to HDFS prior to switching to distributed mode:

Steps

1. Create an HDFS directory for the ams user:

su - hdfs -c 'hdfs dfs -mkdir -p /apps/ams/metrics'

2. Stop Metrics Collector.

3. Copy the metric data from the AMS local directory to an HDFS directory. This is the value of hbase.rootdir in Advanced ams-hbase-site used when running in embedded mode. For example:

su - hdfs -c 'hdfs dfs -copyFromLocal /var/lib/ambari-metrics-collector/hbase/* /apps/ams/metrics'

su - hdfs -c 'hdfs dfs -chown -R ams:hadoop /apps/ams/metrics'

4. Switch to distributed mode.

5. Restart the Metrics Collector.

If you are working with Apache HBase cluster metrics and want to display the more granular metrics of HBase cluster performance on the individual region, table, or user level, see Enabling Individual Region, Table, and User Metrics for HBase.

More Information

Customizing Cluster-Environment-Specific Settings [172]

Customizing TTL Settings [171]

Enabling Individual Region, Table, and User Metrics for HBase

9.1.4.2. Customizing TTL Settings

AMS enables you to configure Time To Live (TTL) for aggregated metrics by navigating to Ambari Metrics > Configs > Advanced ams-site. Each property name is self-explanatory and controls the amount of time to keep metrics, in seconds, before they are purged.

For example, assume that you are running a single-node sandbox and want to ensure that no values are stored for more than seven days, to reduce local disk space consumption. In this case, you can set to 604800 (seven days, in seconds) any property ending in .ttl that has a value greater than 604800.

You likely want to do this for properties such as timeline.metrics.cluster.aggregator.daily.ttl, which controls the daily aggregation TTL and is set by default to two years. Two other properties that consume a lot of disk space are

• timeline.metrics.cluster.aggregator.minute.ttl, which controls minute-level aggregated metrics TTL, and

• timeline.metrics.host.aggregator.ttl, which controls host-based precision metrics TTL.
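For example, a seven-day sandbox configuration might override these properties in Advanced ams-site as follows (a sketch; any other property ending in .ttl with a larger value can be reduced the same way):

timeline.metrics.cluster.aggregator.daily.ttl=604800
timeline.metrics.cluster.aggregator.minute.ttl=604800
timeline.metrics.host.aggregator.ttl=604800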


If you are working in an environment prior to Apache Ambari 2.1.2, you should make these settings during installation; otherwise, you must use the HBase shell by running the following command from the Collector host:

/usr/lib/ams-hbase/bin/hbase --config /etc/ams-hbase/conf shell

After you are connected, you must update each of the following tables with your TTL value. For example:

hbase(main):000:0> alter 'METRIC_RECORD_DAILY', { NAME => '0', TTL => 604800}

Map this TTL property to this HBase table:
• timeline.metrics.cluster.aggregator.daily.ttl: METRIC_AGGREGATE_DAILY
• timeline.metrics.cluster.aggregator.hourly.ttl: METRIC_AGGREGATE_HOURLY
• timeline.metrics.cluster.aggregator.minute.ttl: METRIC_AGGREGATE
• timeline.metrics.host.aggregator.daily.ttl: METRIC_RECORD_DAILY
• timeline.metrics.host.aggregator.hourly.ttl: METRIC_RECORD_HOURLY
• timeline.metrics.host.aggregator.minute.ttl: METRIC_RECORD_MINUTE
• timeline.metrics.host.aggregator.ttl: METRIC_RECORD
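For example, a minimal sketch that applies a seven-day TTL (604800 seconds) to every table in the mapping above by piping alter commands into the shell invocation from this section; adjust the values to match your retention goals:

/usr/lib/ams-hbase/bin/hbase --config /etc/ams-hbase/conf shell <<'EOF'
alter 'METRIC_AGGREGATE_DAILY', { NAME => '0', TTL => 604800 }
alter 'METRIC_AGGREGATE_HOURLY', { NAME => '0', TTL => 604800 }
alter 'METRIC_AGGREGATE', { NAME => '0', TTL => 604800 }
alter 'METRIC_RECORD_DAILY', { NAME => '0', TTL => 604800 }
alter 'METRIC_RECORD_HOURLY', { NAME => '0', TTL => 604800 }
alter 'METRIC_RECORD_MINUTE', { NAME => '0', TTL => 604800 }
alter 'METRIC_RECORD', { NAME => '0', TTL => 604800 }
exit
EOF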

9.1.4.3. Customizing Memory Settings

Because AMS uses multiple components (such as Apache HBase and Apache Phoenix) for metrics storage and query, multiple tunable properties are available to you for tuning memory use:

• Advanced ams-env: metrics_collector_heapsize. Heap size configuration for the Collector.
• Advanced ams-hbase-env: hbase_regionserver_heapsize. Heap size configuration for the single AMS HBase RegionServer.
• Advanced ams-hbase-env: hbase_master_heapsize. Heap size configuration for the single AMS HBase Master.
• Advanced ams-hbase-env: regionserver_xmn_size. Maximum value for the young generation heap size for the single AMS HBase RegionServer.
• Advanced ams-hbase-env: hbase_master_xmn_size. Maximum value for the young generation heap size for the single AMS HBase Master.

9.1.4.4. Customizing Cluster-Environment-Specific Settings

The Metrics Collector mode, TTL settings, memory settings, and disk space requirements for AMS depend on the number of nodes in the cluster. The following table lists specific recommendations and tuning guidelines for each.

Cluster environment, host count, disk space, Collector mode, TTL, and memory settings:

• Single-Node Sandbox (1 host, 2GB disk): embedded mode; reduce TTLs to 7 days. Memory settings: metrics_collector_heap_size=1024, hbase_regionserver_heapsize=512, hbase_master_heapsize=512, hbase_master_xmn_size=128
• PoC (1-5 hosts, 5GB disk): embedded mode; reduce TTLs to 30 days. Memory settings: metrics_collector_heap_size=1024, hbase_regionserver_heapsize=512, hbase_master_heapsize=512, hbase_master_xmn_size=128
• Pre-Production (5-20 hosts, 20GB disk): embedded mode; reduce TTLs to 3 months. Memory settings: metrics_collector_heap_size=1024, hbase_regionserver_heapsize=1024, hbase_master_heapsize=512, hbase_master_xmn_size=128
• Production (20-50 hosts, 50GB disk): embedded mode; TTL n.a. Memory settings: metrics_collector_heap_size=1024, hbase_regionserver_heapsize=1024, hbase_master_heapsize=512, hbase_master_xmn_size=128
• Production (50-200 hosts, 100GB disk): embedded mode; TTL n.a. Memory settings: metrics_collector_heap_size=2048, hbase_regionserver_heapsize=2048, hbase_master_heapsize=2048, hbase_master_xmn_size=256
• Production (200-400 hosts, 200GB disk): embedded mode; TTL n.a. Memory settings: metrics_collector_heap_size=2048, hbase_regionserver_heapsize=2048, hbase_master_heapsize=2048, hbase_master_xmn_size=512
• Production (400-800 hosts, 200GB disk): distributed mode; TTL n.a. Memory settings: metrics_collector_heap_size=8192, hbase_regionserver_heapsize=12288, hbase_master_heapsize=1024, hbase_master_xmn_size=1024, regionserver_xmn_size=1024
• Production (800+ hosts, 500GB disk): distributed mode; TTL n.a. Memory settings: metrics_collector_heap_size=12288, hbase_regionserver_heapsize=16384, hbase_master_heapsize=16384, hbase_master_xmn_size=2048, regionserver_xmn_size=1024

9.1.4.5. Moving the Metrics Collector

Use this procedure to move the Ambari Metrics Collector to a new host:

1. In Ambari Web , stop the Ambari Metrics service.

2. Execute the following API call to delete the current Metric Collector component:

curl -u admin:admin -H "X-Requested-By:ambari" -i -X DELETE http://ambari.server:8080/api/v1/clusters/cluster.name/hosts/metrics.collector.hostname/host_components/METRICS_COLLECTOR

3. Execute the following API call to add Metrics Collector to a new host:


curl -u admin:admin -H "X-Requested-By:ambari" -i -X POST http://ambari.server:8080/api/v1/clusters/cluster.name/hosts/metrics.collector.hostname/host_components/METRICS_COLLECTOR

4. In Ambari Web, go to the page of the host on which you installed the new Metrics Collector and click Install the Metrics Collector.

5. In Ambari Web, start the Ambari Metrics service.

Note

When using Ambari 2.5 or later, restarting all services is not required after moving the Ambari Metrics Collector.

9.1.4.6. (Optional) Enabling Individual Region, Table, and User Metrics for HBase

Other than HBase RegionServer metrics, Ambari disables per-region, per-table, and per-user metrics by default, because these metrics can be numerous and therefore cause performance issues.

If you want Ambari to collect these metrics, you can re-enable them; however, you should first test this option and confirm that your AMS performance is acceptable.

1. On the Ambari Server, browse to the following location:

/var/lib/ambari-server/resources/common-services/HBASE/0.96.0.2.0/package/templates

2. Edit the following template files:

hadoop-metrics2-hbase.properties-GANGLIA-MASTER.j2

hadoop-metrics2-hbase.properties-GANGLIA-RS.j2

3. Either comment out or remove the following lines:

*.source.filter.class=org.apache.hadoop.metrics2.filter.RegexFilter

hbase.*.source.filter.exclude=.*(Regions|Users|Tables).*

4. Save the template files and restart Ambari Server for the changes to take effect.

Important

If you upgrade Ambari to a newer version, you must re-apply this change to the template files.
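Steps 2 through 4 can also be scripted; a sketch, assuming the template directory from step 1 (the sed patterns are illustrative, and the files are backed up first):

cd /var/lib/ambari-server/resources/common-services/HBASE/0.96.0.2.0/package/templates
for f in hadoop-metrics2-hbase.properties-GANGLIA-MASTER.j2 \
         hadoop-metrics2-hbase.properties-GANGLIA-RS.j2; do
  cp "$f" "$f.bak"
  # Comment out the RegexFilter class and the Regions/Users/Tables exclude line
  sed -i -e 's/^\*\.source\.filter\.class=/#&/' \
         -e 's/^hbase\.\*\.source\.filter\.exclude=/#&/' "$f"
done
ambari-server restart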

9.1.5. AMS High Availability

Ambari installs the Ambari Metrics System (AMS) into the cluster with a single Metrics Collector component by default. The Collector is a daemon that runs on a specific host in the cluster and receives data from the registered publishers, the Monitors and Sinks.


Depending on your needs, you might require AMS to have two Collectors to cover a High Availability scenario. This section describes the steps to enable AMS High Availability.

Prerequisite

You must deploy AMS in distributed (not embedded) mode.

To provide AMS High Availability:

Steps

1. In Ambari Web, browse to the host where you would like to install another collector.

2. On the Host page, choose +Add.

3. Select Metrics Collector from the list.

Ambari installs the new Metrics Collector and configures Ambari Metrics for HA.

The new Collector will be installed in a “stopped” state.

4. In Ambari Web, start the new Collector component.


Note

If you attempt to add a second Collector to the cluster without first switching AMS to distributed mode, the collector will install but will not be able to be started.

Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/AMBARI_METRICS/0.1.0/package/scripts/metrics_collector.py", line 150, in <module>
    AmsCollector().execute()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 313, in execute
    method(env)
  File "/var/lib/ambari-agent/cache/common-services/AMBARI_METRICS/0.1.0/package/scripts/metrics_collector.py", line 48, in start
    self.configure(env, action = 'start') # for security
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 116, in locking_configure
    original_configure(obj, *args, **kw)
  File "/var/lib/ambari-agent/cache/common-services/AMBARI_METRICS/0.1.0/package/scripts/metrics_collector.py", line 42, in configure
    raise Fail("AMS in embedded mode cannot have more than 1 instance. Delete all but 1 instances or switch to Distributed mode ")
resource_management.core.exceptions.Fail: AMS in embedded mode cannot have more than 1 instance. Delete all but 1 instances or switch to Distributed mode

Workaround: Delete the newly added Collector, enable distributed mode, then re-add the Collector.
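Enabling distributed mode is described in Customizing the Metrics Collector Mode. As a minimal sketch, the switch amounts to changing the AMS operation mode in ams-site; this assumes the other distributed-mode prerequisites, such as HDFS-backed storage, are already in place:

# Ambari Web > Services > Ambari Metrics > Configs > Advanced ams-site
timeline.metrics.service.operation.mode=distributed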

More Information

AMS Architecture [125]

Customizing the Metrics Collector Mode [170]

9.1.6. AMS Security

The following sections describe tasks to perform when setting up security for the Ambari Metrics System.

• Changing the Grafana Admin Password [176]

• Set Up HTTPS for AMS [177]

• Set Up HTTPS for Grafana [180]

9.1.6.1. Changing the Grafana Admin Password

If you need to change the Grafana Admin password after you initially install Ambari, you have to change the password directly in Grafana, and then make the same change in the Ambari Metrics configuration:


Steps

1. In Ambari Web, browse to Services > Ambari Metrics, select Quick Links, and then choose Grafana.

The Grafana UI opens in read-only mode.

2. Click Sign In, in the left column.

3. Log in as admin, using the current password.

4. Click the admin label in the left column to view the admin profile, and then click Change password.

5. Enter the current password, enter and confirm the new password, and click Change Password.

6. Return to Ambari Web > Services > Ambari Metrics and browse to the Configs tab.

7. In the General section, update and confirm the Grafana Admin Password with the new password.

8. Save the configuration and restart the services, as prompted.

9.1.6.2. Set Up HTTPS for AMS

If you want to limit access to AMS to HTTPS connections, you must provide a certificate. While it is possible to use a self-signed certificate for initial trials, it is not suitable for production environments. After you get your certificate, you must run a special setup command.

Steps

1. Create your own CA certificate.

openssl req -new -x509 -keyout ca.key -out ca.crt -days 365

2. Import CA certificate into the truststore.

# keytool -keystore //truststore.jks -alias CARoot -import -file ca.crt -storepass bigdata

3. Check truststore.

# keytool -keystore //truststore.jks -list
Enter keystore password:

Keystore type: JKS
Keystore provider: SUN

Your keystore contains 2 entries

caroot, Feb 22, 2016, trustedCertEntry,
Certificate fingerprint (SHA1): AD:EE:A5:BC:A8:FA:61:2F:4D:B3:53:3D:29:23:58:AB:2E:B1:82:AF


You should see trustedCertEntry for CA.

4. Generate certificate for AMS Collector and store private key in keystore.

# keytool -genkey -alias c6401.ambari.apache.org -keyalg RSA -keysize 1024 -dname "CN=c6401.ambari.apache.org,OU=IT,O=Apache,L=US,ST=US,C=US" -keypass bigdata -keystore //keystore.jks -storepass bigdata

Note

If you use an alias different from the default hostname (c6401.ambari.apache.org), then, in step 13, set the ssl.client.truststore.alias config to use that alias.

5. Create certificate request for AMS collector certificate.

keytool -keystore //keystore.jks -alias c6401.ambari.apache.org -certreq -file c6401.ambari.apache.org.csr -storepass bigdata

6. Sign the certificate request with the CA certificate.

openssl x509 -req -CA ca.crt -CAkey ca.key -in c6401.ambari.apache.org.csr -out c6401.ambari.apache.org_signed.crt -days 365 -CAcreateserial -passin pass:bigdata

7. Import CA certificate into the keystore.

keytool -keystore //keystore.jks -alias CARoot -import -file ca.crt -storepass bigdata

8. Import signed certificate into the keystore.

keytool -keystore //keystore.jks -alias c6401.ambari.apache.org -import -file c6401.ambari.apache.org_signed.crt -storepass bigdata

9. Check keystore.

# keytool -keystore /tmp/keystore.jks -list
Enter keystore password:

Keystore type: JKS
Keystore provider: SUN

Your keystore contains 2 entries

caroot, Feb 22, 2016, trustedCertEntry,
Certificate fingerprint (SHA1): AD:EE:A5:BC:A8:FA:61:2F:4D:B3:53:3D:29:23:58:AB:2E:B1:82:AF
c6401.ambari.apache.org, Feb 22, 2016, PrivateKeyEntry,
Certificate fingerprint (SHA1): A2:F9:BE:56:7A:7A:8B:4C:5E:A6:63:60:B7:70:50:43:34:14:EE:AF

You should see a PrivateKeyEntry for the AMS Collector hostname and a trustedCertEntry for the CA.


10. Copy //truststore.jks to //truststore.jks on all nodes, and set appropriate access permissions.

11. Copy //keystore.jks to the AMS Collector node ONLY, to //keystore.jks, and set appropriate access permissions. Recommended: set the owner to the ams user and the access permissions to 400.

12. In Ambari Web, update the following AMS configs, in Advanced:

• ams-site/timeline.metrics.service.http.policy=HTTPS_ONLY

• ams-ssl-server/ssl.server.keystore.keypassword=bigdata

• ams-ssl-server/ssl.server.keystore.location=//keystore.jks

• ams-ssl-server/ssl.server.keystore.password=bigdata

• ams-ssl-server/ssl.server.keystore.type=jks

• ams-ssl-server/ssl.server.truststore.location=//truststore.jks

• ams-ssl-server/ssl.server.truststore.password=bigdata

• ams-ssl-server/ssl.server.truststore.reload.interval=10000

• ams-ssl-server/ssl.server.truststore.type=jks

• ams-ssl-client/ssl.client.truststore.location=//truststore.jks

• ams-ssl-client/ssl.client.truststore.password=bigdata

• ams-ssl-client/ssl.client.truststore.type=jks

13. In Ambari Web, add the following AMS config property, using Custom ams-ssl-client > Add Property:

[metrics_collector_hostname_fqdn].ssl.client.truststore.alias=

14. Restart services with stale configs.

15. Configure the Ambari Server to use the truststore.

# ambari-server setup-security
Using python /usr/bin/python
Security setup options...
===========================================================================
Choose one of the following options:
  [1] Enable HTTPS for Ambari server.
  [2] Encrypt passwords stored in ambari.properties file.
  [3] Setup Ambari kerberos JAAS configuration.
  [4] Setup truststore.
  [5] Import certificate to truststore.
===========================================================================
Enter choice, (1-5): 4
Do you want to configure a truststore [y/n] (y)?
TrustStore type [jks/jceks/pkcs12] (jks): jks
Path to TrustStore file ://keystore.jks
Password for TrustStore:
Re-enter password:
Ambari Server 'setup-security' completed successfully.

16. Configure the Ambari Server to use HTTPS instead of HTTP in requests to the AMS Collector by adding server.timeline.metrics.https.enabled=true to the ambari.properties file.

# echo "server.timeline.metrics.https.enabled=true" >> /etc/ambari-server/ conf/ambari.properties

17. Restart the Ambari Server.
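After the restart, you can optionally confirm that the Collector now answers over HTTPS. This is a sketch only; it assumes the default Collector port of 6188 and uses a placeholder hostname:

# Hypothetical check; -k skips CA validation, useful with a self-signed certificate
curl -k "https://<metrics_collector_host>:6188/ws/v1/timeline/metrics/metadata"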

9.1.6.3. Set Up HTTPS for Grafana

If you want to limit access to Grafana to HTTPS connections, you must provide a certificate. While it is possible to use a self-signed certificate for initial trials, it is not suitable for production environments. After you get your certificate, you must run a special setup command.

Steps

1. Log on to the host with Grafana.

2. Browse to the Grafana configuration directory:

cd /etc/ambari-metrics-grafana/conf/

3. Locate your certificate.

If you want to create a temporary self-signed certificate, you can use this as an example:

openssl genrsa -out ams-grafana.key 2048
openssl req -new -key ams-grafana.key -out ams-grafana.csr
openssl x509 -req -days 365 -in ams-grafana.csr -signkey ams-grafana.key -out ams-grafana.crt

4. Set the certificate and key file ownership and permissions so that they are accessible to Grafana:

chown ams:hadoop ams-grafana.crt
chown ams:hadoop ams-grafana.key
chmod 400 ams-grafana.crt
chmod 400 ams-grafana.key

For a non-root Ambari user, use

chmod 444 ams-grafana.crt

to enable the agent user to read the file.

5. In Ambari Web, browse to Services > Ambari Metrics > Configs.

6. Update the following properties in the Advanced ams-grafana-ini section (a sketch of the resulting configuration follows these steps):

• protocol: https

• cert_file: /etc/ambari-metrics-grafana/conf/ams-grafana.crt

• cert_key: /etc/ambari-metrics-grafana/conf/ams-grafana.key

7. Save the configuration and restart the services as prompted.
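For reference, here is a sketch of how these values might appear in the Grafana configuration rendered by Ambari; the section name and layout are assumed for illustration, since Ambari generates the actual file:

[server]
protocol = https
cert_file = /etc/ambari-metrics-grafana/conf/ams-grafana.crt
cert_key = /etc/ambari-metrics-grafana/conf/ams-grafana.key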

9.2. Ambari Log Search (Technical Preview)

The following sections describe the Technical Preview release of Ambari Log Search, which you should use only in non-production clusters with fewer than 150 nodes.

• Log Search Architecture [181]

• Installing Log Search [182]

• Using Log Search [182]

9.2.1. Log Search Architecture

Ambari Log Search enables you to search for logs generated by Ambari-managed HDP components. Ambari Log Search relies on the Ambari Infra service to provide indexing services. Two components compose the Log Search solution:

• Log Feeder [181]

• Log Search Server [182]

9.2.1.1. Log Feeder

The Log Feeder component parses component logs. A Log Feeder is deployed to every node in the cluster and interacts with all component logs on that host. When started, the Log Feeder begins to parse all known component logs and sends them to the Apache Solr instances (managed by the Ambari Infra service) to be indexed.

By default, the Log Feeder captures only FATAL, ERROR, and WARN logs. You can temporarily or permanently add other log levels, either through the Log Search UI filter settings (for temporary log-level capture) or through the Log Search configuration control in Ambari.

9.2.1.2. Log Search Server

The Log Search Server hosts the Log Search UI web application, providing the API that is used by Ambari and the Log Search UI to access the indexed component logs. After logging in as a local or LDAP user, you can use the Log Search UI to visualize, explore, and search indexed component logs.

9.2.2. Installing Log Search

Log Search is a built-in service in Ambari 2.4 and later. You can add it during a new installation by using the +Add Service menu. The Log Feeders are automatically installed on all nodes in the cluster; you manually place the Log Search Server, optionally on the same server as the Ambari Server.

9.2.3. Using Log Search

Using Ambari Log Search includes the following activities:

• Accessing Log Search [182]

• Using Log Search to Troubleshoot [184]

• Viewing Service Logs [184]

• Viewing Access Logs [185]

9.2.3.1. Accessing Log Search

After Log Search is installed, you can use any of three ways to search the indexed logs:

• Ambari Background Ops Log Search Link [183]

• Host Detail Logs Tab [183]

• Log Search UI [183]

Note

Single Sign On (SSO) between Ambari and Log Search is not currently available.


9.2.3.1.1. Ambari Background Ops Log Search Link

When you perform lifecycle operations such as starting or stopping services, it is critical that you have access to logs that can help you recover from potential failures. These logs are now available in Background Ops. Background Ops also links to the Host Detail Logs tab, which lists all the log files that have been indexed and can be viewed for a specific host:

More Information

Background Ops

9.2.3.1.2. Host Detail Logs Tab

A Logs tab is added to each host detail page, containing a list of indexed, viewable log files, organized by service, component, and type. You can open and search each of these files by using a link to the Log Search UI:

9.2.3.1.3. Log Search UI

The Log Search UI is a purpose-built web application used to search HDP component logs. The UI is focused on helping operators quickly access and search logs from a single location. Logs can be filtered by log level, time, and component, and they can be searched by keyword. Helpful tools, such as histograms that show the number of logs by level for a time period, are available, as well as controls to help rewind and fast-forward search sessions, contextual click to include or exclude terms in log viewing, and multi-tab displays for troubleshooting multi-component and multi-host issues.

The Log Search UI is available from the Quick Links menu of the Log Search Service within Ambari Web.

To see a guided tour of Log Search UI features, choose Take a Tour from the Log Search UI main menu. Click Next to view each topic in the guided tour series.

9.2.3.2. Using Log Search to Troubleshoot

To find logs related to a specific problem, use the Troubleshooting tab in the UI to select the service, components, and time frame related to the problem you are troubleshooting. For example, if you select HDFS, the UI automatically searches for HDFS-related components. You can select a time frame of yesterday or last week, or you can specify a custom value. Each of these specifications filters the results to match your interests. When you are ready to view the matching logs, click Go to Logs:

9.2.3.3. Viewing Service Logs

The Service Logs tab enables you to search across all component logs for specific keywords and to filter for specific log levels, components, and time ranges. The UI is organized so that you can quickly see how many logs were captured for each log level across the entire cluster, search for keywords, include and exclude components, and match logs to your search query:


9.2.3.4. Viewing Access Logs

When troubleshooting HDFS-related issues, you might find it helpful to search for and spot trends in HDFS access by users. The Access Logs tab enables you to view HDFS Audit log entries for a specific time frame, to see aggregated usage data showing the top ten HDFS users by file system resources accessed, as well as the top ten file system resources accessed across all users. This can help you find anomalies or hot and cold data sets.

9.3. Ambari Infra

Many services in HDP depend on core services to index data. For example, Apache Atlas uses indexing services for tagging, lineage, and free-text search, and Apache Ranger uses indexing for audit data. The role of Ambari Infra is to provide these common shared services for stack components.

Currently, the Ambari Infra Service has only one component: the Infra Solr Instance. The Infra Solr Instance is a fully managed Apache Solr installation. By default, a single-node SolrCloud installation is deployed when the Ambari Infra Service is chosen for installation; however, you should install multiple Infra Solr Instances so that you have distributed indexing and search for Atlas, Ranger, and LogSearch (Technical Preview).

To install multiple Infra Solr Instances, you simply add them to existing cluster hosts through Ambari’s +Add Service capability. The number of Infra Solr Instances you deploy depends on the number of nodes in the cluster and the services deployed.


Because one Ambari Infra Solr Instance is used by multiple HDP components, you should be careful when restarting the service, to avoid disrupting those dependent services. In HDP 2.5 and later, Atlas, Ranger, and Log Search (Technical Preview) depend on the Ambari Infra service.

Note

Infra Solr Instance is intended for use only by HDP components; use by third-party components or applications is not supported.

9.3.1. Archiving & Purging Data

Large clusters produce many log entries, and Ambari Infra provides a convenient utility for archiving and purging logs that are no longer required.

This utility is called the Solr Data Manager. The Solr Data Manager is a Python program, available at /usr/bin/infra-solr-data-manager, that allows you to quickly archive, delete, or save data from a Solr collection using the following options.

9.3.1.1. Command Line Options

Operation Modes

-m MODE, --mode=MODE archive | delete | save

The mode to use depends on your intent: Archive stores data in the desired storage medium and then removes the data after it has been stored; Delete removes the data from Solr without saving it; and Save is just like Archive, except that the data is not deleted after it has been stored.


Connecting to Solr

-s SOLR_URL, --solr-url=SOLR_URL

The URL to use to connect to the specific Solr Cloud instance.

For example:

http://c6401.ambari.apache.org:8886/solr.

-c COLLECTION, --collection=COLLECTION

The name of the Solr collection. For example: ‘hadoop_logs’

-k SOLR_KEYTAB, --solr-keytab=SOLR_KEYTAB

The keytab file to use when operating against a kerberized Solr instance.

-n SOLR_PRINCIPAL, --solr-principal=SOLR_PRINCIPAL

The principal name to use when operating against a kerberized Solr instance.



Record Schema

-i ID_FIELD, --id-field=ID_FIELD

The name of the field in the Solr schema to use as the unique identifier for each record.

-f FILTER_FIELD, --filter-field=FILTER_FIELD

The name of the field in the Solr schema to filter on. For example: 'logtime'

-o DATE_FORMAT, --date-format=DATE_FORMAT

The custom date format to use with the -d DAYS option to match log entries that are older than a certain number of days.

-e END

Based on the filter field and date format, this argument configures the date to use as the end of the date range. If you use '2018-08-29T12:00:00.000Z', then any records with a filter-field value before that date will be saved, deleted, or archived, depending on the mode.

-d DAYS, --days=DAYS

Based on the filter field and date format, this argument configures the number of days before today to use as the end of the range. If you use '30', then any records with a filter-field value older than 30 days will be saved, deleted, or archived, depending on the mode.

-q ADDITIONAL_FILTER, --additional-filter=ADDITIONAL_FILTER

Any additional filter criteria to use to match records in the collection.


Extracting Records

-r READ_BLOCK_SIZE, --read-block-size=READ_BLOCK_SIZE

The number of records to read at a time from Solr. For example: ‘10’ to read 10 records at a time.

-w WRITE_BLOCK_SIZE, --write-block-size=WRITE_BLOCK_SIZE

The number of records to write per output file. For example: ‘100’ to write 100 records per file.

-j NAME, --name=NAME

Additional name to add to the final filename created in save or archive mode.

--json-file


The default output format is one valid JSON document per record, delimited by a newline. This option instead writes a single valid JSON document containing all of the records.

-z COMPRESSION, --compression=COMPRESSION none | tar.gz | tar.bz2 | zip | gz

Depending on how the output files will be analyzed, you can choose the optimal compression and file format for the output files. Gzip compression is used by default.


Writing Data to HDFS

-a HDFS_KEYTAB, --hdfs-keytab=HDFS_KEYTAB

The keytab file to use when writing data to a kerberized HDFS instance.

-l HDFS_PRINCIPAL, --hdfs-principal=HDFS_PRINCIPAL

The principal name to use when writing data to a kerberized HDFS instance.

-u HDFS_USER, --hdfs-user=HDFS_USER

The user to connect to HDFS as.

-p HDFS_PATH, --hdfs-path=HDFS_PATH

The path in HDFS to write data to in save or archive mode.


Writing Data to S3

-t KEY_FILE_PATH, --key-file-path=KEY_FILE_PATH

The path to the file on the local file system that contains the AWS Access and Secret Keys. The file should contain the keys in this format: ,

-b BUCKET, --bucket=BUCKET

The name of the bucket that data should be uploaded to in save or archive mode.

-y KEY_PREFIX, --key-prefix=KEY_PREFIX

The key prefix allows you to create a logical grouping of the objects in an S3 bucket. The prefix value is similar to a directory name, enabling you to store data in the same directory in a bucket. For example, if your Amazon S3 bucket name is logs, you set the prefix to hadoop/, and the file on your storage device is hadoop_logs_-_2017-10-28T01_25_40.693Z.json.gz, then the file would be identified by this URL: http://s3.amazonaws.com/logs/hadoop/hadoop_logs_-_2017-10-28T01_25_40.693Z.json.gz

-g, --ignore-unfinished-uploading

To deal with connectivity issues, uploading extracted data can be retried. If you do not wish to resume uploads, use the -g flag to disable this behavior.
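To illustrate how the S3 options combine with the others, here is a sketch of an archive run that uploads to a bucket; the key file path, bucket name, and key prefix are hypothetical:

infra-solr-data-manager -m archive -s http://c6401.ambari.apache.org:8886/solr \
  -c hadoop_logs -f logtime -d 1 -r 10 -w 100 \
  -t /root/aws_keys.txt -b logs -y hadoop/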



Writing Data Locally

-x LOCAL_PATH, --local-path=LOCAL_PATH

The path on the local file system to write data to in save or archive mode.


Examples

Deleting Indexed Data

In delete mode (-m delete), the program deletes data from the Solr collection. This mode uses the filter field (-f FILTER_FIELD) option to control which data is removed from the index.

The command below deletes log entries from the hadoop_logs collection that were created before August 29, 2017. The -f option specifies the field in the Solr collection to use as a filter field, and the -e option denotes the end of the range of values to remove.

infra-solr-data-manager -m delete -s http://c6401.ambari.apache.org:8886/solr -c hadoop_logs -f logtime -e 2017-08-29T12:00:00.000Z

Archiving Indexed Data

In archive mode, the program fetches data from the Solr collection and writes it out to HDFS or S3, then deletes the data.

The program fetches records from Solr and creates a file once the write block size is reached, or once no more matching records are found in Solr. The program keeps track of its progress by fetching the records ordered by the filter field and the id field, and it always saves their last values. Once the file is written, it is compressed using the configured compression type.

After the compressed file is created, the program creates a command file containing instructions for the next steps. In case of any interruption or error, during the next run for the same collection the program first executes the saved command file, so that all the data remains consistent. If the error is due to invalid configuration and failures persist, you can use the -g option to ignore the saved command file. The program supports writing data to HDFS, S3, or local disk.

The command below archives data from the Solr collection hadoop_logs, accessible at http://c6401.ambari.apache.org:8886/solr, based on the field logtime. It extracts everything older than 1 day, reads 10 documents at once, writes 100 documents into a file, and copies the resulting files into the local directory /tmp.

infra-solr-data-manager -m archive -s http://c6401.ambari.apache.org:8886/solr -c hadoop_logs -f logtime -d 1 -r 10 -w 100 -x /tmp -v


Saving Indexed Data

Saving is similar to archiving, except that the data is not deleted from Solr after the files are created and uploaded. Save mode is recommended for testing that the data is written as expected before running the program in archive mode with the same parameters.

The example below saves the last 3 days of HDFS audit logs to the HDFS path "/" as the hdfs user, fetching data from a kerberized Solr:

infra-solr-data-manager -m save -s http://c6401.ambari.apache.org:8886/solr -c audit_logs -f logtime -d 3 -r 10 -w 100 -q type:\”hdfs_audit\” -j hdfs_audit -k /etc/security/keytabs/ambari-infra-solr.service.keytab -n infra-solr/[email protected] -u hdfs -p /

Analyzing Archived Data With Hive

Once data has been archived or saved to HDFS, Hive tables can be used to quickly access and analyze the stored data. Only line-delimited JSON files can be analyzed with Hive. Line-delimited JSON files are created by default unless the --json-file argument is passed; data saved or archived using --json-file cannot be analyzed with Hive. In the following examples, the hive-json-serde.jar is used to process the stored JSON data. Prior to creating the included tables, the jar must be added in the Hive shell:

ADD JAR /hive-json-serde.jar

Here are some example table schemas for various log types. Using external tables is recommended, as it has the advantage of keeping the archives in HDFS. First, ensure that a directory is created to store the archived or saved line-delimited logs:

hadoop fs -mkdir

Hadoop Logs

CREATE EXTERNAL TABLE hadoop_logs (
  logtime string,
  level string,
  thread_name string,
  logger_name string,
  file string,
  line_number int,
  method string,
  log_message string,
  cluster string,
  type string,
  path string,
  logfile_line_number int,
  host string,
  ip string,
  id string,
  event_md5 string,
  message_md5 string,
  seq_num int
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '';
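Once the table is defined, the archived logs can be queried with ordinary HiveQL. As a sketch, this counts entries by log level; the query is illustrative only and assumes the hadoop_logs table above:

# Count archived log entries by level from the Hive CLI
hive -e "SELECT level, COUNT(*) AS cnt FROM hadoop_logs GROUP BY level ORDER BY cnt DESC;"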

Audit Logs

Because audit logs have a slightly different field set, we suggest archiving them separately, using --additional-filter, and we offer separate schemas for HDFS, Ambari, and Ranger audit logs.

HDFS Audit Logs

CREATE EXTERNAL TABLE audit_logs_hdfs (
  evtTime string,
  level string,
  logger_name string,
  log_message string,
  resource string,
  result int,
  action string,
  cliType string,
  req_caller_id string,
  ugi string,
  reqUser string,
  proxyUsers array<string>,
  authType string,
  proxyAuthType string,
  dst string,
  perm string,
  cluster string,
  type string,
  path string,
  logfile_line_number int,
  host string,
  ip string,
  cliIP string,
  id string,
  event_md5 string,
  message_md5 string,
  seq_num int
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '';

Ambari Audit Logs

CREATE EXTERNAL TABLE audit_logs_ambari (
  evtTime string,
  log_message string,
  resource string,
  result int,
  action string,
  reason string,
  ws_base_url string,
  ws_command string,
  ws_component string,
  ws_details string,
  ws_display_name string,
  ws_os string,
  ws_repo_id string,
  ws_repo_version string,
  ws_repositories string,
  ws_request_id string,
  ws_roles string,
  ws_stack string,
  ws_stack_version string,
  ws_version_note string,
  ws_version_number string,
  ws_status string,
  ws_result_status string,
  cliType string,
  reqUser string,
  task_id int,
  cluster string,
  type string,
  path string,
  logfile_line_number int,
  host string,
  cliIP string,
  id string,
  event_md5 string,
  message_md5 string,
  seq_num int
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '';

Ranger Audit Logs

CREATE EXTERNAL TABLE audit_logs_ranger (
  evtTime string,
  access string,
  enforcer string,
  resource string,
  result int,
  action string,
  reason string,
  resType string,
  reqUser string,
  cluster string,
  cliIP string,
  id string,
  seq_num int
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '';

9.3.2. Performance Tuning for Ambari Infra

When using Ambari Infra to index and store Ranger audit logs, you should properly tune Solr to handle the number of audit records stored per day. The following sections describe recommendations for tuning your operating system and Solr, based on how you use Ambari Infra and Ranger in your environment.

9.3.2.1. Operating System Tuning

Solr clients use many network connections when indexing and searching; to avoid exhausting the available network connections, the following sysctl parameters are recommended:


net.ipv4.tcp_max_tw_buckets = 1440000
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1

These settings can be made permanent by placing them in /etc/sysctl.d/net.conf, or they can be set at runtime using the following sysctl commands:

sysctl -w net.ipv4.tcp_max_tw_buckets=1440000
sysctl -w net.ipv4.tcp_tw_recycle=1
sysctl -w net.ipv4.tcp_tw_reuse=1
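As a convenience, here is a sketch of writing the recommended values to the file mentioned above and applying them without a reboot:

# Write the recommended values to a sysctl drop-in file
cat > /etc/sysctl.d/net.conf <<'EOF'
net.ipv4.tcp_max_tw_buckets = 1440000
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
EOF

# Load the settings from the new file
sysctl -p /etc/sysctl.d/net.conf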

Additionally, the number of user processes for the solr user should be increased to avoid exceptions related to creating new native threads. You can do this by creating a new file named /etc/security/limits.d/infra-solr.conf with the following contents:

infra-solr - nproc 6000

9.3.2.2. JVM - GC Settings

The heap sizing and garbage collection settings are very important for production Solr instances that index many Ranger audit logs. For production deployments, we suggest setting the Infra Solr Minimum Heap Size and the Infra Solr Maximum Heap Size to 12 GB. These settings can be found and applied by following the steps below:

Steps

1. In Ambari Web, browse to Services > Ambari Infra > Configs.

2. In the Settings tab you will see two sliders controlling the Infra Solr Heap Size.

3. Set the Infra Solr Minimum Heap Size to 12GB or 12,288MB.

4. Set the Infra Solr Maximum Heap Size to 12GB or 12,288MB.

5. Click Save to save the configuration and then restart the affected services as prompted by Ambari.

Using the G1 Garbage Collector is also recommended for production deployments. To use the G1 Garbage Collector with the Ambari Infra Solr Instance, follow the steps below:

Steps

1. In Ambari Web, browse to Services > Ambari Infra > Configs.

2. In the Advanced tab, expand the section for Advanced infra-solr-env.

3. In the infra-solr-env template, locate the multi-line GC_TUNE environment variable definition, and replace it with the following content:

GC_TUNE="-XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=4m -XX:MaxGCPauseMillis=250 -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages

193 Hortonworks Data Platform May 17, 2018

-XX:+AggressiveOpts"

The value used for -XX:G1HeapRegionSize is based on the recommended 12GB Solr maximum heap. If you choose to use a different heap size for the Solr server, consult the following table for recommendations:

Heap Size     G1HeapRegionSize
< 4GB         1MB
4-8GB         2MB
8-16GB        4MB
16-32GB       8MB
32-64GB       16MB
> 64GB        32MB

9.3.2.3. Environment-Specific Tuning Parameters

Each of the recommendations below depends on the number of audit records that are indexed per day. To quickly determine how many audit records are indexed per day, use the command examples below.

Using an HTTP client such as curl, execute the following command:

curl -g "http://:8886/solr/ranger_audits/select?q= (evtTime:[NOW-7DAYS+TO+*])&wt=json&indent=true&rows=0"

You should receive a message similar to the following:

{ "responseHeader":{ "status":0, "QTime":1, "params":{ "q":"evtTime:[NOW-7DAYS TO *]", "indent":"true", "rows":"0", "wt":"json"}}, "response":{"numFound":306,"start":0,"docs":[] }}

Take the numFound element of the response and divide it by 7 to get the average number of audit records being indexed per day. You can also replace the '7DAYS' in the curl request with a broader time range, if necessary, using the following keywords:

• 1MONTHS

• 7DAYS

Just ensure that you divide by the appropriate number of days if you change the event-time query. The average number of records per day is used to identify which of the recommendations below apply to your environment.
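For example, here is a quick shell computation of the daily average; the numFound value of 306 is taken from the sample response above:

NUM_FOUND=306                   # "numFound" from the Solr response
DAYS=7                          # match the time range used in the query
echo $(( NUM_FOUND / DAYS ))    # average audit records indexed per day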

Less Than 50 Million Audit Records Per Day

Based on the Solr REST API call, if your average number of documents per day is less than 50 million records per day, the following recommendations apply. Each recommendation takes into consideration the time to live, or TTL, which controls how long a document is kept in the index until it is removed. The default TTL is 90 days, but some customers choose to be more aggressive and remove documents from the index after 30 days. Because of this, recommendations for both common TTL settings are specified.

These recommendations assume that you are using our recommendation of a 12GB heap per Solr server instance. In each situation, we provide recommendations both for co-locating Solr with other master services and for using dedicated Solr servers. Testing has shown that Solr performance requires different server counts depending on whether Solr is co-located or on dedicated servers. Based on our testing with Ranger, Solr shard sizes should be around 25GB for best overall performance; however, Solr shard sizes can go up to 50GB without a significant performance impact.

This configuration is our best recommendation for just getting started with Ranger and Ambari Infra, so the only recommendation is using the default TTL of 90 days.

Default Time To Live (TTL) 90 days:

• Estimated total index size: ~150 GB to 450 GB

• Total number of primary/leader shards: 6

• Total number of shards including 1 replica each: 12

• Total number of co-located Solr nodes: ~3 nodes, up to 2 shards per node

(does not include replicas)

• Total number of dedicated Solr nodes: ~1 node, up to 12 shards per node

(does not include replicas)

50 - 100 Million Audit Records Per Day

50 to 100 million records ~ 5 - 10 GB data per day.

Default Time To Live (TTL) 90 days:

• Estimated total index size: ~ 450 - 900 GB for 90 days

• Total number of primary/leader shards: 18-36

• Total number of shards including 1 replica each: 36-72

• Total number of co-located Solr nodes: ~9-18 nodes, up to 2 shards per node

(does not include replicas)

• Total number of dedicated Solr nodes: ~3-6 nodes, up to 12 shards per node

(does not include replicas)

Custom Time To Live (TTL) 30 days:

• Estimated total index size: 150 - 300 GB for 30 days

• Total number of primary/leader shards: 6-12

• Total number of shards including 1 replica each: 12-24

• Total number of co-located Solr nodes: ~3-6 nodes, up to 2 shards per node

(does not include replicas)

• Total number of dedicated Solr nodes: ~1-2 nodes, up to 12 shards per node

(does not include replicas)

100 - 200 Million Audit Records Per Day

100 to 200 million records ~ 10 - 20 GB data per day.

Default Time To Live (TTL) 90 days:

• Estimated total index size: ~ 900 - 1800 GB for 90 days

• Total number of primary/leader shards: 36-72

• Total number of shards including 1 replica each: 72-144

• Total number of co-located Solr nodes: ~18-36 nodes, up to 2 shards per node

(does not include replicas)

• Total number of dedicated Solr nodes: ~3-6 nodes, up to 12 shards per node

(does not include replicas)

Custom Time To Live (TTL) 30 days:

• Estimated total index size: 300 - 600 GB for 30 days

• Total number of primary/leader shards: 12-24

• Total number of shards including 1 replica each: 24-48

• Total number of co-located Solr nodes: ~6-12 nodes, up to 2 shards per node

(does not include replicas)

• Total number of dedicated Solr nodes: ~1-3 nodes, up to 12 shards per node

(does not include replicas)

If you choose to use at least 1 replica for high availability, then increase the number of nodes accordingly. If high availability is a requirement, consider using no fewer than 3 Solr nodes in any configuration.

As illustrated in these examples, a lower TTL requires fewer resources. If your compliance objectives call for longer data retention, you can use the Solr Data Manager to archive data into long-term storage (HDFS or S3) and provide Hive tables that allow you to easily query that data. With this strategy, hot data can be stored in Solr for rapid access through the Ranger UI, and cold data can be archived to HDFS or S3, with access provided through Hive.

More Information

Archiving and Purging Data

9.3.2.4. Adding New Shards

If, after reviewing the recommendations above, you need to add additional shards to your existing deployment, the following Solr documentation will help you understand how to accomplish that task: https://archive.apache.org/dist/lucene/solr/ref-guide/apache-solr-ref-guide-5.5.pdf

9.3.2.5. Out of Memory Exceptions

When using Ambari Infra with Ranger audit, if you see many instances of Solr exiting with Java "Out Of Memory" exceptions, you can update the Ranger audit schema to use less heap memory by enabling DocValues. This change requires a re-index of data and is disruptive, but it helps tremendously with heap memory consumption. Refer to this HCC article for instructions on making this change: https://community.hortonworks.com/articles/156933/restore-backup-ranger-audits-to-newly-collection.html
