White Paper

Dell EMC PowerScale OneFS Cluster Composition, Quorum, and Group State

Abstract
This paper explores cluster quorum, group state, and the group management protocol on a Dell EMC PowerScale cluster.

April 2021

Revisions

| Version | Date | Comment |
| --- | --- | --- |
| 1.0 | November 2017 | Updated for OneFS 8.1.1 |
| 2.0 | February 2019 | Updated for OneFS 8.1.3 |
| 3.0 | April 2019 | Updated for OneFS 8.2 |
| 4.0 | August 2019 | Updated for OneFS 8.2.1 |
| 5.0 | December 2019 | Updated for OneFS 8.2.2 |
| 6.0 | June 2020 | Updated for OneFS 9.0 |
| 7.0 | October 2020 | Updated for OneFS 9.1 |
| 8.0 | April 2021 | Updated for OneFS 9.2 |

Acknowledgements
This paper was produced by the following:
Author: Nick Trimbee

The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose. Use, copying, and distribution of any software described in this publication requires an applicable software license.

Copyright © Dell Inc. or its subsidiaries. All Rights Reserved. Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners.

Table of Contents
Intended Audience
Cluster Composition, Quorum and Group State
Cluster Composition and Group State
Understanding and Analyzing Group Membership
Interpreting Group Changes
Constructing an event timeline
Group management considerations
Summary

Intended Audience
This paper explores the topics of cluster composition, quorum, and the group management protocol on a Dell EMC PowerScale cluster. It also offers techniques and recommendations to help detect and debug changes in cluster composition. This paper is not intended to provide a comprehensive background to the OneFS architecture; refer to the OneFS Technical Overview white paper for further details on the OneFS architecture. The target audience for this white paper is anyone managing a OneFS-powered clustered storage environment. It is assumed that the reader has an understanding and working knowledge of the OneFS components, architecture, commands, and features.
More information on OneFS commands and feature configuration is available in the OneFS Administration Guide.

Cluster Composition, Quorum and Group State

In order for a cluster to properly function and accept data writes, a quorum, or majority, of nodes must be active and responding. OneFS uses this notion of a quorum to prevent “split-brain” conditions that could be introduced if the cluster were to temporarily split into two clusters. OneFS clustering is based on the CAP theorem, which states that it is impossible for a distributed data store to simultaneously provide more than two of the following three guarantees: consistency, availability, and partition tolerance. OneFS does not compromise on consistency; it uses a simple quorum to prevent partitioning, or ‘split-brain’ conditions, that can be introduced if the cluster should temporarily divide into two clusters. Within the CAP trade-off, consistency is the guarantee that OneFS always preserves.

The quorum rule guarantees that, regardless of how many nodes fail or come back online, if a write takes place it can be made consistent with any previous writes that have ever taken place. As such, cluster quorum dictates the number of nodes required in order to support a given data protection level. For an erasure-coding (FEC) based protection level of N+M, the cluster must contain at least 2M+1 nodes. For example, a minimum of seven nodes is required for a +3n protection level; this allows for the simultaneous loss of three nodes while still maintaining a quorum of four nodes, so the cluster remains fully operational.

If a cluster does drop below quorum, the file system is automatically placed into a protected, read-only state, denying writes but still allowing read access to the available data. In this state, the cluster will not accept write requests from any protocol, regardless of any particular node pool membership issues. In instances where a protection level is set too high for OneFS to achieve using FEC, the default behavior is to protect that data using mirroring instead, which has a negative impact on space utilization.

Because OneFS does not compromise on consistency, a mechanism is required to manage a cluster’s transient state and quorum. As such, the primary role of the OneFS Group Management Protocol (GMP) is to help create and maintain a group of synchronized nodes. A group is a given set of nodes which have synchronized state, and a cluster may form multiple groups as connection state changes. Quorum is a property of the GMP group, which helps enforce consistency across node disconnects and other transient events. Having a consistent view of the cluster state is crucial, since initiators need to know which nodes and drives are available to write to.
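To make the arithmetic above concrete, the following short Python sketch (illustrative only, not part of OneFS or its tooling; the function names are hypothetical) models the majority-quorum check and the 2M+1 minimum node count for an N+M FEC protection level.

```python
def has_quorum(total_nodes: int, responding_nodes: int) -> bool:
    """A cluster has quorum when more than half of its nodes are up and responding."""
    return responding_nodes > total_nodes // 2


def min_nodes_for_fec(m: int) -> int:
    """Minimum cluster size for an N+M FEC protection level: 2M + 1 nodes,
    so the cluster can lose M nodes and still retain a quorum of M + 1."""
    return 2 * m + 1


if __name__ == "__main__":
    # A +3n protection level requires at least 2*3 + 1 = 7 nodes ...
    assert min_nodes_for_fec(3) == 7
    # ... and a 7-node cluster that loses 3 nodes still has quorum (4 > 3),
    # so it remains fully operational and continues to accept writes.
    assert has_quorum(7, 4)
    # A 6-node cluster split 3/3 has no majority on either side, so neither
    # half can accept writes -- this is how quorum prevents split-brain.
    assert not has_quorum(6, 3)
```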
Cluster Composition and Group State

One of the most significant impacts to a cluster’s workflow, particularly at scale, is the effect of group changes resulting from the addition, removal, or rebooting of a node, or other hardware failure or transience. The ability to understand a cluster’s group state and group changes is an invaluable tool when administering and managing large clusters: it allows you to determine the current health of a cluster, as well as reconstruct the cluster’s history when troubleshooting issues that involve cluster stability, network health, and so on.

The primary role of the OneFS Group Management Protocol (GMP) is to help create and maintain a group of synchronized nodes. A group is a given set of nodes which have synchronized state, and a cluster may form multiple groups as connection state changes. GMP distributes a variety of state information about nodes and drives, from identifiers to usage statistics. The most fundamental of these is the composition of the cluster, or ‘static aspect’ of the group, which is managed by the isi_boot_d daemon and stored in the array.xml file. Similarly, the state of a node’s drives is stored in the drives.xml file, along with a flag indicating whether each drive is an SSD. Whereas GMP manages node states directly, drive states are actually managed by the ‘drv’ module and broadcast via GMP. A significant difference between nodes and drives is that for nodes the static aspect is distributed to every node in the array.xml file, whereas drive state is only stored locally on a node.

A group change operation, based on GMP, is a coherent way of changing the cluster-wide shared state. Merge is the group change operation for the addition of nodes. Merges affect cluster availability because file system operations must be paused for the duration of the operation. The array.xml information is needed by every node in order to define the cluster and allow nodes to form connections. In contrast, drives.xml is only stored locally on a node; if a node goes down, other nodes have no way to obtain the drive configuration of that node. Drive information may be cached by GMP, but it is not available if that cache is cleared.

Conversely, the ‘dynamic aspect’ refers to the state of nodes and drives, which may change. These states indicate to the various file system modules the health of nodes and their drives, and whether components can be used for particular operations. For example, a soft-failed node or drive should not be used for new allocations. These components can be in one of seven states:

| Component State | Description |
| --- | --- |
| UP | Component is responding |
| DOWN | Component is not responding |
| DEAD | Component is not allowed to come back to the UP state and should be removed |
| STALLED | Drive is responding slowly |
| GONE | Component has been removed |
| Soft-failed | Component is in the process of being removed |
| Read-only | This state only applies to nodes |

Figure 1: OneFS Group Management – Component States

A node or drive may go from ‘down, soft-failed’ to ‘up, soft-failed’ and back. These flags are persistently stored in the array.xml file for nodes and the drives.xml file