SUSE Enterprise Storage 6 Deployment Guide
by Tomáš Bažant, Alexandra Settle, Liam Proven, and Sven Seeberg

Publication Date: 07/04/2021

SUSE LLC 1800 South Novell Place Provo, UT 84606 USA https://documentation.suse.com

Copyright © 2021 SUSE LLC

Copyright © 2016, Red Hat, Inc. and contributors.

The text of and illustrations in this document are licensed under a Creative Commons Attribution-ShareAlike 4.0 International license ("CC-BY-SA"). An explanation of CC-BY-SA is available at http://creativecommons.org/licenses/by-sa/4.0/legalcode . In accordance with CC-BY-SA, if you distribute this document or an adaptation of it, you must provide the URL for the original version.

Red Hat, Red Hat Enterprise Linux, the Shadowman logo, JBoss, MetaMatrix, Fedora, the Infinity Logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States and other countries. Linux® is the registered trademark of Linus Torvalds in the United States and other countries. Java® is a registered trademark of Oracle and/or its affiliates. XFS® is a trademark of Silicon Graphics International Corp. or its subsidiaries in the United States and/or other countries. All other trademarks are the property of their respective owners. For SUSE trademarks, see http://www.suse.com/company/legal/ . All other third-party trademarks are the property of their respective owners. Trademark symbols (®, ™ etc.) denote trademarks of SUSE and its affiliates. Asterisks (*) denote third-party trademarks.

All information found in this book has been compiled with utmost attention to detail. However, this does not guarantee complete accuracy. Neither SUSE LLC, its affiliates, the authors, nor the translators shall be held liable for possible errors or the consequences thereof.

Contents

About This Guide x 1 Available Documentation x

2 Feedback xi

3 Documentation Conventions xi

4 About the Making of This Manual xii

5 Contributors xii

I SUSE ENTERPRISE STORAGE 1 1 SUSE Enterprise Storage 6 and Ceph 2 1.1 Ceph Features 2

1.2 Core Components 3 RADOS 3 • CRUSH 4 • Ceph Nodes and Daemons 5

1.3 Storage Structure 6 Pool 6 • Placement Group 7 • Example 7

1.4 BlueStore 8

1.5 Additional Information 10

2 Hardware Requirements and Recommendations 11

2.1 Network Overview 11 Network Recommendations 12

2.2 Multiple Architecture Configurations 14

2.3 Hardware Configuration 15 Minimum Cluster Configuration 15 • Recommended Production Cluster Configuration 17

2.4 Object Storage Nodes 18 Minimum Requirements 18 • Minimum Disk Size 19 • Recommended Size for the BlueStore's WAL and DB Device 19 • Using SSD for OSD Journals 19 • Maximum Recommended Number of Disks 20

2.5 Monitor Nodes 20

2.6 Object Gateway Nodes 21

2.7 Metadata Server Nodes 21

2.8 Admin Node 21

2.9 iSCSI Nodes 22

2.10 SUSE Enterprise Storage 6 and Other SUSE Products 22 SUSE Manager 22

2.11 Naming Limitations 22

2.12 OSD and Monitor Sharing One Server 22

3 Admin Node HA Setup 24

3.1 Outline of the HA Cluster for Admin Node 24

3.2 Building a HA Cluster with Admin Node 25

4 User Privileges and Command Prompts 27

4.1 Salt/DeepSea Related Commands 27

4.2 Ceph Related Commands 27

4.3 General Linux Commands 28

4.4 Additional Information 28

II CLUSTER DEPLOYMENT AND UPGRADE 29 5 Deploying with DeepSea/Salt 30 5.1 Read the Release Notes 30

5.2 Introduction to DeepSea 31 Organization and Important Locations 32 • Targeting the Minions 33

5.3 Cluster Deployment 35

5.4 DeepSea CLI 45 DeepSea CLI: Monitor Mode 46 • DeepSea CLI: Stand-alone Mode 46

5.5 Configuration and Customization 48 The policy.cfg File 48 • DriveGroups 53 • Adjusting ceph.conf with Custom Settings 63

6 Upgrading from Previous Releases 64

6.1 General Considerations 64

6.2 Steps to Take before Upgrading the First Node 65 Read the Release Notes 65 • Verify Your Password 65 • Verify the Previous Upgrade 65 • Upgrade Old RBD Kernel Clients 67 • Adjust AppArmor 67 • Verify MDS Names 67 • Consolidate Scrub-related Configuration 68 • Back Up Cluster Data 69 • Migrate from ntpd to chronyd 69 • Patch Cluster Prior to Upgrade 71 • Verify the Current Environment 73 • Check the Cluster's State 74 • Migrate OSDs to BlueStore 75

6.3 Order in Which Nodes Must Be Upgraded 77

6.4 Oine Upgrade of CTDB Clusters 77

6.5 Per-Node Upgrade Instructions 78 Manual Node Upgrade Using the Installer DVD 79 • Node Upgrade Using the SUSE Distribution Migration System 81

6.6 Upgrade the Admin Node 83

6.7 Upgrade Ceph Monitor/Ceph Manager Nodes 84

6.8 Upgrade Metadata Servers 84

6.9 Upgrade Ceph OSDs 86

6.10 Upgrade Gateway Nodes 89

6.11 Steps to Take after the Last Node Has Been Upgraded 91 Update Ceph Monitor Setting 91 • Disable Insecure Clients 92 • Enable the Telemetry Module 92

6.12 Update policy.cfg and Deploy Ceph Dashboard Using DeepSea 93

6.13 Migration from Profile-based Deployments to DriveGroups 95 Analyze the Current Layout 95 • Create DriveGroups Matching the Current Layout 96 • OSD Deployment 97 • More Complex Setups 97

7 Customizing the Default Configuration 98

7.1 Using Customized Configuration Files 98 Disabling a Deployment Step 98 • Replacing a Deployment Step 99 • Modifying a Deployment Step 100 • Modifying a Deployment Stage 101 • Updates and Reboots during Stage 0 103

7.2 Modifying Discovered Configuration 104 Enabling IPv6 for Ceph Cluster Deployment 106

III INSTALLATION OF ADDITIONAL SERVICES 108 8 Installation of Services to Access your Data 109

9 Ceph Object Gateway 110

9.1 Object Gateway Manual Installation 110 Object Gateway Configuration 111

10 Installation of iSCSI Gateway 117

10.1 iSCSI Storage 117 The Linux Kernel iSCSI Target 118 • iSCSI Initiators 118

10.2 General Information about ceph-iscsi 119

10.3 Deployment Considerations 120

10.4 Installation and Configuration 121 Deploy the iSCSI Gateway to a Ceph Cluster 121 • Create RBD Images 121 • Export RBD Images via iSCSI 122 • Authentication and Access Control 123 • Advanced Settings 125

10.5 Exporting RADOS Block Device Images Using tcmu-runner 128

11 Installation of CephFS 130

11.1 Supported CephFS Scenarios and Guidance 130

11.2 Ceph Metadata Server 131 Adding and Removing a Metadata Server 131 • Configuring a Metadata Server 131

11.3 CephFS 137 Creating CephFS 137 • MDS Cluster Size 138 • MDS Cluster and Updates 139 • File Layouts 140

12 Installation of NFS Ganesha 145

12.1 Preparation 145 General Information 145 • Summary of Requirements 146

12.2 Example Installation 146

12.3 Active-Active Configuration 147 Prerequisites 147 • Configure NFS Ganesha 148 • Populate the Cluster Grace Database 149 • Restart NFS Ganesha Services 150 • Conclusion 150

12.4 More Information 150

IV CLUSTER DEPLOYMENT ON TOP OF SUSE CAAS PLATFORM 4 (TECHNOLOGY PREVIEW) 151 13 SUSE Enterprise Storage 6 on Top of SUSE CaaS Platform 4 Kubernetes Cluster 152 13.1 Considerations 152

13.2 Prerequisites 152

13.3 Get Rook Manifests 153

13.4 Installation 153 Configuration 153 • Create the Rook Operator 155 • Create the Ceph Cluster 155

13.5 Using Rook as Storage for Kubernetes Workload 156

13.6 Uninstalling Rook 157 A Ceph Maintenance Updates Based on Upstream 'Nautilus' Point Releases 158

Glossary 170

B Documentation Updates 173

B.1 Maintenance update of SUSE Enterprise Storage 6 documentation 173

B.2 June 2019 (Release of SUSE Enterprise Storage 6) 174

About This Guide

SUSE Enterprise Storage 6 is an extension to SUSE Linux Enterprise Server 15 SP1. It combines the capabilities of the Ceph (http://ceph.com/) storage project with the enterprise engineering and support of SUSE. SUSE Enterprise Storage 6 provides IT organizations with the ability to deploy a distributed storage architecture that can support a number of use cases using commodity hardware platforms. This guide helps you understand the concepts of SUSE Enterprise Storage 6, with the main focus on managing and administrating the Ceph infrastructure. It also demonstrates how to use Ceph with other related solutions, such as OpenStack or KVM. Many chapters in this manual contain links to additional documentation resources. These include additional documentation that is available on the system as well as documentation available on the Internet. For an overview of the documentation available for your product and the latest documentation updates, refer to https://documentation.suse.com .

1 Available Documentation

The following manuals are available for this product:

Book “Administration Guide” The guide describes various administration tasks that are typically performed after the installation. The guide also introduces steps to integrate Ceph with solutions such as libvirt or KVM, and ways to access objects stored in the cluster via iSCSI and RADOS gateways.

Deployment Guide Guides you through the installation steps of the Ceph cluster and all services related to Ceph. The guide also illustrates a basic Ceph cluster structure and provides you with related terminology.

HTML versions of the product manuals can be found in the installed system under /usr/share/doc/manual . Find the latest documentation updates at https://documentation.suse.com where you can download the manuals for your product in multiple formats.

2 Feedback

Several feedback channels are available:

Bugs and Enhancement Requests For services and support options available for your product, refer to http://www.suse.com/support/ . To report bugs for a product component, log in to the Novell Customer Center from http://www.suse.com/support/ and select My Support Service Request.

User Comments We want to hear your comments and suggestions for this manual and the other documentation included with this product. If you have questions, suggestions, or corrections, contact [email protected], or you can also click the Report Documentation Bug link beside each chapter or section heading.

Mail For feedback on the documentation of this product, you can also send a mail to doc- [email protected] . Make sure to include the document title, the product version, and the publication date of the documentation. To report errors or suggest enhancements, provide a concise description of the problem and refer to the respective section number and page (or URL).

3 Documentation Conventions

The following typographical conventions are used in this manual:

/etc/passwd : directory names and file names

placeholder : replace placeholder with the actual value

PATH : the environment variable

ls , --help : commands, options, and parameters

user : users or groups

Alt , Alt – F1 : a key to press or a key combination; keys are shown in uppercase as on a keyboard

File, File Save As: menu items, buttons

Dancing Penguins (Chapter Penguins, ↑Another Manual): This is a reference to a chapter in another manual.

4 About the Making of This Manual

This book is written in GeekoDoc, a subset of DocBook (see http://www.docbook.org ). The XML source files were validated by xmllint , processed by xsltproc , and converted into XSL-FO using a customized version of Norman Walsh's stylesheets. The final PDF can be formatted through FOP from Apache or through XEP from RenderX. The authoring and publishing tools used to produce this manual are available in the package daps . The DocBook Authoring and Publishing Suite (DAPS) is developed as open source software. For more information, see http://daps.sf.net/ .

5 Ceph Contributors

The Ceph project and its documentation is a result of the work of hundreds of contributors and organizations. See https://ceph.com/contributors/ for more details.

I SUSE Enterprise Storage

1 SUSE Enterprise Storage 6 and Ceph 2

2 Hardware Requirements and Recommendations 11

3 Admin Node HA Setup 24

4 User Privileges and Command Prompts 27

1 SUSE Enterprise Storage 6 and Ceph

SUSE Enterprise Storage 6 is a distributed storage system designed for scalability, reliability, and performance which is based on the Ceph technology. A Ceph cluster can be run on commodity servers in a common network like Ethernet. The cluster scales up well to thousands of servers (later on referred to as nodes) and into the petabyte range. As opposed to conventional systems which have allocation tables to store and fetch data, Ceph uses a deterministic algorithm to allocate storage for data and has no centralized information structure. Ceph assumes that in storage clusters the addition or removal of hardware is the rule, not the exception. The Ceph cluster automates management tasks such as data distribution and redistribution, data replication, failure detection, and recovery. Ceph is both self-healing and self-managing, which results in a reduction of administrative and budget overhead. This chapter provides a high-level overview of SUSE Enterprise Storage 6 and briefly describes the most important components.

Tip Since SUSE Enterprise Storage 5, the only cluster deployment method is DeepSea. Refer to Chapter 5, Deploying with DeepSea/Salt for details about the deployment process.

1.1 Ceph Features

The Ceph environment has the following features:

Scalability Ceph can scale to thousands of nodes and manage storage in the range of petabytes.

Commodity Hardware No special hardware is required to run a Ceph cluster. For details, see Chapter 2, Hardware Requirements and Recommendations.

Self-managing The Ceph cluster is self-managing. When nodes are added, removed or fail, the cluster automatically redistributes the data. It is also aware of overloaded disks.

No Single Point of Failure

No node in a cluster stores important information alone. The number of redundancies can be configured.

Open Source Software Ceph is an open source software solution and independent of specific hardware or vendors.

1.2 Core Components

To make full use of Ceph's power, it is necessary to understand some of the basic components and concepts. This section introduces some parts of Ceph that are often referenced in other chapters.

1.2.1 RADOS

The basic component of Ceph is called RADOS (Reliable Autonomic Distributed Object Store). It is responsible for managing the data stored in the cluster. Data in Ceph is usually stored as objects. Each object consists of an identifier and the data. RADOS provides the following access methods to the stored objects that cover many use cases:

Object Gateway Object Gateway is an HTTP REST gateway for the RADOS object store. It enables direct access to objects stored in the Ceph cluster.

RADOS Block Device RADOS Block Devices (RBD) can be accessed like any other block device. These can be used for example in combination with libvirt for virtualization purposes.

CephFS The Ceph File System (CephFS) is a POSIX-compliant file system. librados librados is a library that can be used with many programming languages to create an application capable of directly interacting with the storage cluster. librados is used by Object Gateway and RBD, while CephFS directly interfaces with RADOS (see Figure 1.1, “Interfaces to the Ceph Object Store”).

FIGURE 1.1: INTERFACES TO THE CEPH OBJECT STORE (virtual machines, applications, clients, and hosts access RADOS through the RADOSGW, RBD, CephFS, and librados interfaces)

1.2.2 CRUSH

At the core of a Ceph cluster is the CRUSH algorithm. CRUSH is the acronym for Controlled Replication Under Scalable Hashing. CRUSH is a function that handles the storage allocation and needs comparably few parameters. That means only a small amount of information is necessary to calculate the storage position of an object. The parameters are a current map of the cluster including the health state, some administrator-defined placement rules, and the name of the object that needs to be stored or retrieved. With this information, all nodes in the Ceph cluster are able to calculate where an object and its replicas are stored. This makes writing or reading data very efficient. CRUSH tries to evenly distribute data over all nodes in the cluster. The CRUSH map contains all storage nodes and administrator-defined placement rules for storing objects in the cluster. It defines a hierarchical structure that usually corresponds to the physical structure of the cluster. For example, the data-containing disks are in hosts, hosts are in racks, racks in rows, and rows in data centers. This structure can be used to define failure domains. Ceph then ensures that replications are stored on different branches of a specific failure domain. If the failure domain is set to rack, replications of objects are distributed over different racks. This can mitigate outages caused by a failed switch in a rack. If one power distribution unit supplies a row of racks, the failure domain can be set to row. When the power distribution unit fails, the replicated data is still available on other rows.
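As an illustration of failure domains (the rule and pool names here are hypothetical examples, not part of any default configuration), a replication rule that uses rack as the failure domain can be created and assigned to a pool as follows:

cephadm@adm > ceph osd crush rule create-replicated rack_rule default rack
cephadm@adm > ceph osd pool set POOL_NAME crush_rule rack_rule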

1.2.3 Ceph Nodes and Daemons

In Ceph, nodes are servers working for the cluster. They can run several different types of daemons. We recommend running only one type of daemon on each node, except for Ceph Manager daemons, which can be co-located with Ceph Monitors. Each cluster requires at least Ceph Monitor, Ceph Manager, and Ceph OSD daemons:

Admin Node Admin Node is a Ceph cluster node where the Salt master service is running. The Admin Node is a central point of the Ceph cluster because it manages the rest of the cluster nodes by querying and instructing their Salt minion services.

Ceph Monitor Ceph Monitor (often abbreviated as MON) nodes maintain information about the cluster health state, a map of all nodes, and data distribution rules (see Section 1.2.2, “CRUSH”). If failures or conflicts occur, the Ceph Monitor nodes in the cluster decide by majority which information is correct. To form a qualified majority, it is recommended to have an odd number of Ceph Monitor nodes, and at least three of them. If more than one site is used, the Ceph Monitor nodes should be distributed over an odd number of sites. The number of Ceph Monitor nodes per site should be such that more than 50% of the Ceph Monitor nodes remain functional if one site fails.

Ceph Manager The Ceph Manager collects the state information from the whole cluster. The Ceph Manager daemon runs alongside the monitor daemons. It provides additional monitoring, and interfaces with external monitoring and management systems. It includes other services as well, for example the Ceph Dashboard Web UI. The Ceph Dashboard Web UI runs on the same node as the Ceph Manager. The Ceph Manager requires no additional configuration, beyond ensuring it is running. You can deploy it as a separate role using DeepSea.

Ceph OSD A Ceph OSD is a daemon handling Object Storage Devices, which are physical or logical storage units (hard disks or partitions). Object Storage Devices can be physical disks/partitions or logical volumes. The daemon additionally takes care of data replication and rebalancing in case of added or removed nodes. Ceph OSD daemons communicate with monitor daemons and provide them with the state of the other OSD daemons.

To use CephFS, Object Gateway, NFS Ganesha, or iSCSI Gateway, additional nodes are required:

Metadata Server (MDS) The Metadata Servers store metadata for the CephFS. By using an MDS you can execute basic file system commands such as ls without overloading the cluster.

Object Gateway The Object Gateway is an HTTP REST gateway for the RADOS object store. It is compatible with OpenStack Swift and Amazon S3 and has its own user management.

NFS Ganesha NFS Ganesha provides NFS access to either the Object Gateway or the CephFS. It runs in user space instead of kernel space and directly interacts with the Object Gateway or CephFS. iSCSI Gateway iSCSI is a storage network protocol that allows clients to send SCSI commands to SCSI storage devices (targets) on remote servers.

Samba Gateway The Samba Gateway provides Samba (SMB) access to data stored on CephFS.

1.3 Storage Structure

1.3.1 Pool

Objects that are stored in a Ceph cluster are put into pools. Pools represent logical partitions of the cluster to the outside world. For each pool a set of rules can be defined, for example, how many replications of each object must exist. The standard configuration of pools is called a replicated pool. Pools usually contain objects but can also be configured to act similarly to a RAID 5. In this configuration, objects are stored in chunks along with additional coding chunks. The coding chunks contain the redundant information. The number of data and coding chunks can be defined by the administrator. In this configuration, pools are referred to as erasure coded pools.
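As a simple sketch (pool names, placement group counts, and the erasure code profile below are examples only, not recommendations), a replicated pool and an erasure coded pool can be created like this:

cephadm@adm > ceph osd pool create replicated_pool 128 128 replicated
cephadm@adm > ceph osd erasure-code-profile set example_profile k=4 m=2
cephadm@adm > ceph osd pool create ec_pool 128 128 erasure example_profile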

1.3.2 Placement Group

Placement Groups (PGs) are used for the distribution of data within a pool. When creating a pool, a certain number of placement groups is set. The placement groups are used internally to group objects and are an important factor for the performance of a Ceph cluster. The PG for an object is determined by the object's name.
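You can inspect this mapping for any object: the following command (pool and object names are placeholders) shows which placement group and which OSDs a given object name would map to:

cephadm@adm > ceph osd map POOL_NAME OBJECT_NAME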

1.3.3 Example

This section provides a simplied example of how Ceph manages data (see Figure 1.2, “Small Scale Ceph Example”). This example does not represent a recommended conguration for a Ceph cluster. The hardware setup consists of three storage nodes or Ceph OSDs ( Host 1 , Host 2 , Host 3 ). Each node has three hard disks which are used as OSDs ( osd.1 to osd.9 ). The Ceph Monitor nodes are neglected in this example.

Note: Dierence between Ceph OSD and OSD While Ceph OSD or Ceph OSD daemon refers to a daemon that is run on a node, the word OSD refers to the that the daemon interacts with.

The cluster has two pools, Pool A and Pool B . While Pool A replicates objects only two times, resilience for Pool B is more important and it has three replications for each object. When an application puts an object into a pool, for example via the REST API, a Placement Group ( PG1 to PG4 ) is selected based on the pool and the object name. The CRUSH algorithm then calculates on which OSDs the object is stored, based on the Placement Group that contains the object. In this example the failure domain is set to host. This ensures that replications of objects are stored on different hosts. Depending on the replication level set for a pool, the object is stored on two or three OSDs that are used by the Placement Group. An application that writes an object only interacts with one Ceph OSD, the primary Ceph OSD. The primary Ceph OSD takes care of replication and confirms the completion of the write process after all other OSDs have stored the object. If osd.5 fails, all objects in PG1 are still available on osd.1 . As soon as the cluster recognizes that an OSD has failed, another OSD takes over. In this example osd.4 is used as a replacement for osd.5 . The objects stored on osd.1 are then replicated to osd.4 to restore the replication level.

FIGURE 1.2: SMALL SCALE CEPH EXAMPLE (Pool A with two replications and Pool B with three replications are mapped via Placement Groups PG1 to PG4 and the CRUSH algorithm onto OSDs osd.1 to osd.9 on Host 1, Host 2, and Host 3; after osd.5 fails, osd.4 acts as its substitute)

If a new node with new OSDs is added to the cluster, the cluster map is going to change. The CRUSH function then returns different locations for objects. Objects that receive new locations will be relocated. This process results in a balanced usage of all OSDs.

1.4 BlueStore

BlueStore is the default storage back-end for Ceph since SUSE Enterprise Storage 5. It has better performance than FileStore, full data check-summing, and built-in compression. BlueStore manages either one, two, or three storage devices. In the simplest case, BlueStore consumes a single primary storage device. The storage device is normally partitioned into two parts:

1. A small partition named BlueFS that implements file system-like functionalities required by RocksDB.

2. The rest of the device is normally a large partition that is managed directly by BlueStore and contains all of the actual data. This primary device is normally identified by a block symbolic link in the data directory.

It is also possible to deploy BlueStore across two additional devices: A WAL device can be used for BlueStore's internal journal or write-ahead log. It is identified by the block.wal symbolic link in the data directory. It is only useful to use a separate WAL device if the device is faster than the primary device or the DB device, for example when:

The WAL device is an NVMe, and the DB device is an SSD, and the data device is either SSD or HDD.

Both the WAL and DB devices are separate SSDs, and the data device is an SSD or HDD.

A DB device can be used for storing BlueStore’s internal metadata. BlueStore (or rather, the embedded RocksDB) will put as much metadata as it can on the DB device to improve performance. Again, it is only helpful to provision a shared DB device if it is faster than the primary device.

Tip: Plan for the DB Size Plan thoroughly to ensure sufficient size of the DB device. If the DB device fills up, metadata will spill over to the primary device, which badly degrades the OSD's performance. You can check whether a WAL/DB partition is getting full and spilling over with the ceph daemon osd.ID perf dump command. The slow_used_bytes value shows the amount of data being spilled out:

cephadm@adm > ceph daemon osd.ID perf dump | jq '.bluefs'
"db_total_bytes": 1073741824,
"db_used_bytes": 33554432,
"wal_total_bytes": 0,
"wal_used_bytes": 0,
"slow_total_bytes": 554432,
"slow_used_bytes": 554432,

1.5 Additional Information

Ceph as a community project has its own extensive online documentation. For topics not found in this manual, refer to http://docs.ceph.com/docs/master/ .

The original publication CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data by S.A. Weil, S.A. Brandt, E.L. Miller, and C. Maltzahn provides helpful insight into the inner workings of Ceph. Especially when deploying large scale clusters, it is recommended reading. The publication can be found at http://www.ssrc.ucsc.edu/papers/weil-sc06.pdf .

SUSE Enterprise Storage can be used with non-SUSE OpenStack distributions. The Ceph clients need to be at a level that is compatible with SUSE Enterprise Storage.

Note SUSE supports the server component of the Ceph deployment and the client is supported by the OpenStack distribution vendor.

2 Hardware Requirements and Recommendations

The hardware requirements of Ceph are heavily dependent on the IO workload. The following hardware requirements and recommendations should be considered as a starting point for detailed planning. In general, the recommendations given in this section are on a per-process basis. If several processes are located on the same machine, the CPU, RAM, disk and network requirements need to be added up.

2.1 Network Overview

Ceph has several logical networks:

A trusted internal network, the back-end network, called the cluster network .

A public client network, called the public network .

Client networks for gateways; these are optional.

The trusted internal network is the back-end network between the OSD nodes for replication, re-balancing, and recovery. Ideally, this network provides twice the bandwidth of the public network with default 3-way replication, since the primary OSD sends two copies to other OSDs via this network. The public network is between clients and gateways on one side, and monitors, managers, MDS nodes, and OSD nodes on the other. It is also used by monitors, managers, and MDS nodes to talk with OSD nodes.

FIGURE 2.1: NETWORK OVERVIEW

2.1.1 Network Recommendations

For the Ceph network environment, we recommend two 25 GbE (or faster) network interfaces bonded using 802.3ad (LACP). The use of two network interfaces provides link aggregation and fault tolerance. The bond should then be used to provide two VLAN interfaces, one for the public network and the second for the cluster network. Details on bonding the interfaces can be found in https://documentation.suse.com/sles/15-SP1/html/SLES-all/cha-network.html#sec-network-iface-bonding . Fault tolerance can be enhanced by isolating the components into failure domains. To improve fault tolerance of the network, bonding one interface from each of two separate Network Interface Cards (NICs) offers protection against failure of a single NIC. Similarly, creating a bond across two switches protects against failure of a switch. We recommend consulting with the network equipment vendor in order to architect the level of fault tolerance required. A sketch of such a bonding and VLAN configuration follows.
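The following is a minimal sketch of such a setup using wicked configuration files on SUSE Linux Enterprise; the interface names, VLAN IDs, and IP addresses are assumptions that need to be adapted to your environment. In /etc/sysconfig/network/ifcfg-bond0 , set

STARTMODE='auto'
BOOTPROTO='none'
BONDING_MASTER='yes'
BONDING_SLAVE_0='eth0'
BONDING_SLAVE_1='eth1'
BONDING_MODULE_OPTS='mode=802.3ad miimon=100'

In /etc/sysconfig/network/ifcfg-vlan10 (public network), and analogously in ifcfg-vlan20 for the cluster network, set

STARTMODE='auto'
BOOTPROTO='static'
ETHERDEVICE='bond0'
VLAN_ID='10'
IPADDR='192.168.100.11/24'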

Important: Administration Network not Supported Additional administration network setup—that enables for example separating SSH, Salt, or DNS networking—is neither tested nor supported.

Tip: Nodes Configured via DHCP If your storage nodes are configured via DHCP, the default timeouts may not be sufficient for the network to be configured correctly before the various Ceph daemons start. If this happens, the Ceph MONs and OSDs will not start correctly (running systemctl status ceph\* will result in "unable to bind" errors). To avoid this issue, we recommend increasing the DHCP client timeout to at least 30 seconds on each node in your storage cluster. This can be done by changing the following settings on each node: In /etc/sysconfig/network/dhcp , set

DHCLIENT_WAIT_AT_BOOT="30"

In /etc/sysconfig/network/config , set

WAIT_FOR_INTERFACES="60"

2.1.1.1 Adding a Private Network to a Running Cluster

If you do not specify a cluster network during Ceph deployment, it assumes a single public network environment. While Ceph operates fine with a public network, its performance and security improves when you set a second private cluster network. To support two networks, each Ceph node needs to have at least two network cards. You need to apply the following changes to each Ceph node. It is relatively quick to do for a small cluster, but can be very time consuming if you have a cluster consisting of hundreds or thousands of nodes.

1. Stop Ceph related services on each cluster node. Add a line to /etc/ceph/ceph.conf to define the cluster network, for example:

cluster network = 10.0.0.0/24

If you need to specifically assign static IP addresses or override cluster network settings, you can do so with the optional cluster addr setting.

2. Check that the private cluster network works as expected on the OS level.

3. Start Ceph related services on each cluster node.

root # systemctl start ceph.target
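After this change, the network-related part of /etc/ceph/ceph.conf may look similar to the following (the subnets are examples only):

[global]
public network = 172.16.1.0/24
cluster network = 10.0.0.0/24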

2.1.1.2 Monitor Nodes on Different Subnets

If the monitor nodes are on multiple subnets, for example they are located in different rooms and served by different switches, you need to adjust the ceph.conf file accordingly. For example, if the nodes have IP addresses 192.168.123.12, 1.2.3.4, and 242.12.33.12, add the following lines to their global section:

[global] [...] mon host = 192.168.123.12, 1.2.3.4, 242.12.33.12 mon initial members = MON1, MON2, MON3 [...]

Additionally, if you need to specify a per-monitor public address or network, you need to add a [mon.X] section for each monitor:

[mon.MON1] public network = 192.168.123.0/24

[mon.MON2] public network = 1.2.3.0/24

[mon.MON3] public network = 242.12.33.12/0

2.2 Multiple Architecture Configurations

SUSE Enterprise Storage supports both x86 and Arm architectures. When considering each architecture, it is important to note that from a cores-per-OSD, frequency, and RAM perspective, there is no real difference between CPU architectures for sizing. As with smaller x86 processors (non-server), lower-performance Arm-based cores may not provide an optimal experience, especially when used for erasure coded pools.

Note Throughout the documentation, SYSTEM-ARCH is used in place of x86 or Arm.

2.3 Hardware Configuration

For the best product experience, we recommend starting with the recommended cluster configuration. For a test cluster or a cluster with lower performance requirements, we document a minimal supported cluster configuration.

2.3.1 Minimum Cluster Configuration

A minimal product cluster conguration consists of:

At least four physical nodes (OSD nodes) with co-location of services

Dual-10 Gb Ethernet as a bonded network

A separate Admin Node (can be virtualized on an external node)

A detailed conguration is:

Separate Admin Node with 4 GB RAM, four cores, 1 TB storage capacity. This is typically the Salt master node. Ceph services and gateways, such as Ceph Monitor, Metadata Server, Ceph OSD, Object Gateway, or NFS Ganesha are not supported on the Admin Node as it needs to orchestrate the cluster update and upgrade processes independently.

At least four physical OSD nodes, with eight OSD disks each, see Section 2.4.1, “Minimum Requirements” for requirements. The total capacity of the cluster should be sized so that even with one node unavailable, the total used capacity (including redundancy) does not exceed 80%.

Three Ceph Monitor instances. Monitors need to be run from SSD/NVMe storage, not HDDs, for latency reasons.

Monitors, Metadata Server, and gateways can be co-located on the OSD nodes, see Section 2.12, “OSD and Monitor Sharing One Server” for monitor co-location. If you co-locate services, the memory and CPU requirements need to be added up.

iSCSI Gateway, Object Gateway, and Metadata Server require at least incremental 4 GB RAM and four cores.

If you are using CephFS, S3/Swift, or iSCSI, at least two instances of the respective roles (Metadata Server, Object Gateway, iSCSI) are required for redundancy and availability.

The nodes are to be dedicated to SUSE Enterprise Storage and must not be used for any other physical, containerized, or virtualized workload.

If any of the gateways (iSCSI, Object Gateway, NFS Ganesha, Metadata Server, ...) are deployed within VMs, these VMs must not be hosted on the physical machines serving other cluster roles. (This is unnecessary, as they are supported as collocated services.)

When deploying services as VMs on hypervisors outside the core physical cluster, failure domains must be respected to ensure redundancy. For example, do not deploy multiple roles of the same type on the same hypervisor, such as multiple MON or MDS instances.

When deploying inside VMs, it is particularly crucial to ensure that the nodes have strong network connectivity and working time synchronization.

The hypervisor nodes must be adequately sized to avoid interference by other workloads consuming CPU, RAM, network, and storage resources.

FIGURE 2.2: MINIMUM CLUSTER CONFIGURATION

2.3.2 Recommended Production Cluster Configuration

Once you grow your cluster, we recommend relocating monitors, Metadata Servers, and gateways to separate nodes to ensure better fault tolerance.

Seven Object Storage Nodes

No single node exceeds ~15% of total storage.

The total capacity of the cluster should be sized so that even with one node unavailable, the total used capacity (including redundancy) does not exceed 80%.

25 Gb Ethernet or better, bonded for internal cluster and external public network each.

56+ OSDs per storage cluster.

See Section 2.4.1, “Minimum Requirements” for further recommendation.

Dedicated physical infrastructure nodes.

Three Ceph Monitor nodes: 4 GB RAM, 4 core processor, RAID 1 SSDs for disk. See Section 2.5, “Monitor Nodes” for further recommendation.

Object Gateway nodes: 32 GB RAM, 8 core processor, RAID 1 SSDs for disk. See Section 2.6, “Object Gateway Nodes” for further recommendation.

iSCSI Gateway nodes: 16 GB RAM, 6-8 core processor, RAID 1 SSDs for disk. See Section 2.9, “iSCSI Nodes” for further recommendation.

Metadata Server nodes (one active/one hot standby): 32 GB RAM, 8 core processor, RAID 1 SSDs for disk. See Section 2.7, “Metadata Server Nodes” for further recommendation.

One SES Admin Node: 4 GB RAM, 4 core processor, RAID 1 SSDs for disk.

2.4 Object Storage Nodes

2.4.1 Minimum Requirements

The following CPU recommendations account for devices independent of usage by Ceph:

1x 2GHz CPU Thread per spinner.

2x 2GHz CPU Thread per SSD.

4x 2GHz CPU Thread per NVMe.

Separate 10 GbE networks (public/client and internal), required 4x 10 GbE, recommended 2x 25 GbE.

Total RAM required = number of OSDs x (1 GB + osd_memory_target ) + 16 GB. The default for osd_memory_target is 4 GB; for example, a node with 16 OSDs and the default osd_memory_target needs approximately 16 x (1 GB + 4 GB) + 16 GB = 96 GB of RAM. Refer to Book “Administration Guide”, Chapter 25 “Ceph Cluster Configuration”, Section 25.2.1 “Automatic Cache Sizing” for more details on osd_memory_target .

OSD disks in JBOD congurations or or individual RAID-0 congurations.

OSD journal can reside on OSD disk.

OSD disks should be exclusively used by SUSE Enterprise Storage.

Dedicated disk and SSD for the operating system, preferably in a RAID 1 configuration.

Allocate at least an additional 4 GB of RAM if this OSD host will host part of a cache pool used for cache tiering.

Ceph Monitors, gateway and Metadata Servers can reside on Object Storage Nodes.

For disk performance reasons, OSD nodes are bare metal nodes. No other workloads should run on an OSD node unless it is a minimal setup of Ceph Monitors and Ceph Managers.

SSDs for Journal with 6:1 ratio SSD journal to OSD.

2.4.2 Minimum Disk Size

There are two types of disk space needed to run an OSD: the space for the disk journal (for FileStore) or WAL/DB device (for BlueStore), and the primary space for the stored data. The minimum (and default) value for the journal/WAL/DB is 6 GB. The minimum space for data is 5 GB, as partitions smaller than 5 GB are automatically assigned a weight of 0. So although the minimum disk space for an OSD is 11 GB, we do not recommend a disk smaller than 20 GB, even for testing purposes.

2.4.3 Recommended Size for the BlueStore's WAL and DB Device

Tip: More Information Refer to Section 1.4, “BlueStore” for more information on BlueStore.

We recommend reserving 4 GB for the WAL device. While the minimal DB size is 64 GB for RBD-only workloads, the recommended DB size for Object Gateway and CephFS workloads is 2% of the main device capacity (but at least 196 GB).

If you intend to put the WAL and DB device on the same disk, then we recommend using a single partition for both devices, rather than having a separate partition for each. This allows Ceph to use the DB device for the WAL operation as well. Management of the disk space is therefore more effective, as Ceph uses the DB partition for the WAL only if there is a need for it. Another advantage is that the probability that the WAL partition gets full is very small, and when it is not used fully, its space is not wasted but used for DB operation. To share the DB device with the WAL, do not specify the WAL device, and specify only the DB device. Find more information about specifying an OSD layout in Section 5.5.2, “DriveGroups”.
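As a sketch of such a layout (the group name and device filters are illustrative; see Section 5.5.2, “DriveGroups” for the authoritative syntax and file location), a DriveGroups specification that keeps data on rotational disks and places the DB (and therefore also the WAL) on solid-state disks could look as follows:

drive_group_default:
  target: '*'
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0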

2.4.4 Using SSD for OSD Journals

Solid-state drives (SSD) have no moving parts. This reduces random access time and read latency while accelerating data throughput. Because their price per MB is significantly higher than the price of spinning hard disks, SSDs are only suitable for smaller storage.

OSDs may see a significant performance improvement by storing their journal on an SSD and the object data on a separate hard disk.

Tip: Sharing an SSD for Multiple Journals As journal data occupies relatively little space, you can mount several journal directories to a single SSD disk. Keep in mind that with each shared journal, the performance of the SSD disk degrades. We do not recommend sharing more than six journals on the same SSD disk and 12 on NVMe disks.

2.4.5 Maximum Recommended Number of Disks

You can have as many disks in one server as it allows. There are a few things to consider when planning the number of disks per server:

Network bandwidth. The more disks you have in a server, the more data must be transferred via the network card(s) for the disk write operations.

Memory. RAM above 2 GB is used for the BlueStore cache. With the default osd_memory_target of 4 GB, the system has a reasonable starting cache size for spinning media. If using SSD or NVMe, consider increasing the cache size and RAM allocation per OSD to maximize performance (see the example after this list).

Fault tolerance. If the complete server fails, the more disks it has, the more OSDs the cluster temporarily loses. Moreover, to keep the replication rules running, you need to copy all the data from the failed server among the other nodes in the cluster.
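For example, to raise the per-OSD memory target for SSD or NVMe backed OSDs as mentioned in the memory consideration above, you could set the following in ceph.conf (the 8 GB value is an assumption for illustration and must be reflected in the RAM sizing formula from Section 2.4.1):

[osd]
osd_memory_target = 8589934592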

2.5 Monitor Nodes

At least three Ceph Monitor nodes are required. The number of monitors should always be odd (1+2n).

4 GB of RAM.

Processor with four logical cores.

An SSD or other sufficiently fast storage type is highly recommended for monitors, specifically for the /var/lib/ceph path on each monitor node, as quorum may be unstable with high disk latencies. Two disks in a RAID 1 configuration are recommended for redundancy. It is recommended that separate disks or at least separate disk partitions are used for the monitor processes to protect the monitor's available disk space from things like log file creep.

There must only be one monitor process per node.

Mixing OSD, monitor, or Object Gateway nodes is only supported if sufficient hardware resources are available. That means that the requirements for all services need to be added up.

Two network interfaces bonded to multiple switches.

2.6 Object Gateway Nodes

Object Gateway nodes should have six to eight CPU cores and 32 GB of RAM (64 GB recommended). When other processes are co-located on the same machine, their requirements need to be added up.

2.7 Metadata Server Nodes

Proper sizing of the Metadata Server nodes depends on the specific use case. Generally, the more open files the Metadata Server is to handle, the more CPU and RAM it needs. The following are the minimum requirements:

3 GB of RAM for each Metadata Server daemon.

Bonded network interface.

2.5 GHz CPU with at least 2 cores.

2.8 Admin Node

At least 4 GB of RAM and a quad-core CPU are required. This includes running the Salt master on the Admin Node. For large clusters with hundreds of nodes, 6 GB of RAM is suggested.

2.9 iSCSI Nodes iSCSI nodes should have six to eight CPU cores and 16 GB of RAM.

2.10 SUSE Enterprise Storage 6 and Other SUSE Products

This section contains important information about integrating SUSE Enterprise Storage 6 with other SUSE products.

2.10.1 SUSE Manager

SUSE Manager and SUSE Enterprise Storage are not integrated, therefore SUSE Manager cannot currently manage a SUSE Enterprise Storage cluster.

2.11 Naming Limitations

Ceph does not generally support non-ASCII characters in configuration files, pool names, user names, and so forth. When configuring a Ceph cluster we recommend using only simple alphanumeric characters (A-Z, a-z, 0-9) and minimal punctuation ('.', '-', '_') in all Ceph object/configuration names.

2.12 OSD and Monitor Sharing One Server

Although it is technically possible to run Ceph OSDs and Monitors on the same server in test environments, we strongly recommend having a separate server for each monitor node in production. The main reason is performance—the more OSDs the cluster has, the more I/O operations the monitor nodes need to perform. And when one server is shared between a monitor node and OSD(s), the OSD I/O operations are a limiting factor for the monitor node. Another consideration is whether to share disks between an OSD, a monitor node, and the operating system on the server. The answer is simple: if possible, dedicate a separate disk to OSD, and a separate server to a monitor node.

Although Ceph supports directory-based OSDs, an OSD should always have a dedicated disk other than the operating system one.

Tip If it is really necessary to run OSD and monitor node on the same server, run the monitor on a separate disk by mounting the disk to the /var/lib/ceph/mon directory for slightly better performance.

3 Admin Node HA Setup

The Admin Node is a Ceph cluster node where the Salt master service runs. The Admin Node is a central point of the Ceph cluster, because it manages the rest of the cluster nodes by querying and instructing their Salt minion services. It usually includes other services as well, for example the Grafana dashboard backed by the Prometheus monitoring toolkit. In case of Admin Node failure, you usually need to provide new working hardware for the node and restore the complete cluster configuration stack from a recent backup. Such a method is time consuming and causes cluster outage. To prevent cluster downtime caused by an Admin Node failure, we recommend making use of a High Availability (HA) cluster for the Ceph Admin Node.

3.1 Outline of the HA Cluster for Admin Node

The idea of an HA cluster is that in case of one cluster node failing, the other node automatically takes over its role, including the virtualized Admin Node. This way, other Ceph cluster nodes do not notice that the Admin Node failed. The minimal HA solution for the Admin Node requires the following hardware:

Two bare metal servers able to run SUSE Linux Enterprise with the High Availability extension and virtualize the Admin Node.

Two or more redundant network communication paths, for example via Network Device Bonding.

Shared storage to host the disk image(s) of the Admin Node virtual machine. The shared storage needs to be accessible from both servers. It can be, for example, an NFS export, a Samba share, or iSCSI target.

Find more details on the cluster requirements at https://documentation.suse.com/sle-ha/15-SP1/single-html/SLE-HA-install-quick/#sec-ha-inst-quick-req .

FIGURE 3.1: 2-NODE HA CLUSTER FOR ADMIN NODE

3.2 Building a HA Cluster with Admin Node

The following procedure summarizes the most important steps of building the HA cluster for virtualizing the Admin Node. For details, refer to the indicated links.

1. Set up a basic 2-node HA cluster with shared storage as described in https://documentation.suse.com/sle-ha/15-SP1/single-html/SLE-HA-install-quick/#art-sleha-install-quick .

2. On both cluster nodes, install all packages required for running the KVM hypervisor and the libvirt toolkit as described in https://documentation.suse.com/sles/15-SP1/single-html/SLES-virtualization/#sec-vt-installation-kvm .

3. On the rst cluster node, create a new KVM virtual machine (VM) making use of libvirt as described in https://documentation.suse.com/sles/15-SP1/single-html/SLES-virtualization/ #sec-libvirt-inst-virt-install . Use the precongured shared storage to store the disk images of the VM.

4. After the VM setup is complete, export its configuration to an XML file on the shared storage. Use the following syntax:

root # virsh dumpxml VM_NAME > /path/to/shared/vm_name.xml

5. Create a resource for the Admin Node VM. Refer to https://documentation.suse.com/sle-ha/15-SP1/single-html/SLE-HA-guide/#cha-conf-hawk2 for general information on creating HA resources. Detailed information on creating resources for a KVM virtual machine is described in http://www.linux-ha.org/wiki/VirtualDomain_%28resource_agent%29 . A sketch of such a resource definition is shown after this procedure.

6. On the newly-created VM guest, deploy the Admin Node including the additional services you need there. Follow the relevant steps in Section 5.3, “Cluster Deployment”. At the same time, deploy the remaining Ceph cluster nodes on the non-HA cluster servers.
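The following is a minimal sketch of such a VirtualDomain resource definition using the crm shell; the resource name, XML path, and operation values are placeholders and need to be adapted to your setup:

root # crm configure primitive admin-node-vm ocf:heartbeat:VirtualDomain \
  params config="/path/to/shared/vm_name.xml" \
  op monitor interval=30s timeout=60s \
  meta allow-migrate=true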

4 User Privileges and Command Prompts

As a Ceph cluster administrator, you will be configuring and adjusting the cluster behavior by running specific commands. There are several types of commands you will need:

4.1 Salt/DeepSea Related Commands

These commands help you to deploy or upgrade the Ceph cluster, run commands on several (or all) cluster nodes at the same time, or assist you when adding or removing cluster nodes. The most frequently used are salt , salt-run , and deepsea . You need to run Salt commands on the Salt master node (refer to Section 5.2, “Introduction to DeepSea” for details) as root . These commands are introduced with the following prompt:

root@master #

For example:

root@master # salt '*.example.net' test.ping

4.2 Ceph Related Commands

These are lower level commands to configure and fine-tune all aspects of the cluster and its gateways on the command line, for example ceph , rbd , radosgw-admin , or crushtool . To run Ceph related commands, you need to have read access to a Ceph key. The key's capabilities then define your privileges within the Ceph environment. One option is to run Ceph commands as root (or via sudo ) and use the unrestricted default keyring 'ceph.client.admin.key'. A safer and recommended option is to create a more restrictive individual key for each administrator user and put it in a directory where the users can read it, for example:

~/.ceph/ceph.client.USERNAME.keyring

Tip: Path to Ceph Keys To use a custom admin user and keyring, you need to specify the user name and path to the key each time you run the ceph command using the -n client.USER_NAME and --keyring PATH/TO/KEYRING options.

To avoid this, include these options in the CEPH_ARGS variable in the individual users' ~/.bashrc files.
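For example, a line like the following in ~/.bashrc (the user name and keyring path are placeholders) sets the options for all subsequent ceph invocations:

export CEPH_ARGS="-n client.USERNAME --keyring ~/.ceph/ceph.client.USERNAME.keyring"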

Although you can run Ceph related commands on any cluster node, we recommend running them on the Admin Node. This documentation uses the cephadm user to run the commands, therefore they are introduced with the following prompt:

cephadm@adm >

For example:

cephadm@adm > ceph auth list

Tip: Commands for Specific Nodes If the documentation instructs you to run a command on a cluster node with a specific role, it will be addressed by the prompt. For example:

cephadm@mon >

4.3 General Linux Commands

Linux commands not related to Ceph or DeepSea, such as mount , cat , or openssl , are introduced either with the cephadm@adm > or root # prompts, depending on which privileges the related command requires.

4.4 Additional Information

For more information on Ceph key management, refer to Book “Administration Guide”, Chapter 19 “Authentication with cephx”, Section 19.2 “Key Management”.

II Cluster Deployment and Upgrade

5 Deploying with DeepSea/Salt 30

6 Upgrading from Previous Releases 64

7 Customizing the Default Configuration 98

5 Deploying with DeepSea/Salt

Salt along with DeepSea is a stack of components that help you deploy and manage server infrastructure. It is very scalable, fast, and relatively easy to get running. Read the following considerations before you start deploying the cluster with Salt:

Salt minions are the nodes controlled by a dedicated node called Salt master. Salt minions have roles, for example Ceph OSD, Ceph Monitor, Ceph Manager, Object Gateway, iSCSI Gateway, or NFS Ganesha.

A Salt master runs its own Salt minion. It is required for running privileged tasks—for example creating, authorizing, and copying keys to minions—so that remote minions never need to run privileged tasks.

Tip: Sharing Multiple Roles per Server You will get the best performance from your Ceph cluster when each role is deployed on a separate node. But real deployments sometimes require sharing one node for multiple roles. To avoid trouble with performance and the upgrade procedure, do not deploy the Ceph OSD, Metadata Server, or Ceph Monitor role to the Admin Node.

Salt minions need to correctly resolve the Salt master's host name over the network. By default, they look for the salt host name, but you can specify any other network-reachable host name in the /etc/salt/minion file, see Section 5.3, “Cluster Deployment”.

5.1 Read the Release Notes

In the release notes you can nd additional information on changes since the previous release of SUSE Enterprise Storage. Check the release notes to see whether:

your hardware needs special considerations.

any used software packages have changed significantly.

special precautions are necessary for your installation.

The release notes also provide information that could not make it into the manual on time. They also contain notes about known issues.

After having installed the package release-notes-ses , find the release notes locally in the directory /usr/share/doc/release-notes or online at https://www.suse.com/releasenotes/ .

5.2 Introduction to DeepSea

The goal of DeepSea is to save the administrator time and confidently perform complex operations on a Ceph cluster. Ceph is a very configurable software solution. It increases both the freedom and responsibility of system administrators. The minimal Ceph setup is good for demonstration purposes, but does not show interesting features of Ceph that you can see with a big number of nodes. DeepSea collects and stores data about individual servers, such as addresses and device names. For a distributed storage system such as Ceph, there can be hundreds of such items to collect and store. Collecting the information and entering the data manually into a configuration management tool is exhausting and error prone. The steps necessary to prepare the servers, collect the configuration, and configure and deploy Ceph are mostly the same. However, this does not address managing the separate functions. For day to day operations, the ability to trivially add hardware to a given function and remove it gracefully is a requirement. DeepSea addresses these observations with the following strategy: DeepSea consolidates the administrator's decisions in a single file. The decisions include cluster assignment, role assignment, and profile assignment. And DeepSea collects each set of tasks into a simple goal. Each goal is a stage (an example of running the stages follows the list below):

DEEPSEA STAGES DESCRIPTION

Stage 0—the preparation— during this stage, all required updates are applied and your system may be rebooted.

Important: Re-run Stage 0 after the Admin Node Reboot If the Admin Node reboots during stage 0 to load the new kernel version, you need to run stage 0 again, otherwise minions will not be targeted.

Stage 1—the discovery—here all hardware in your cluster is being detected and necessary information for the Ceph configuration is being collected. For details about configuration, refer to Section 5.5, “Configuration and Customization”.

Stage 2—the conguration —you need to prepare conguration data in a particular format.

Stage 3—the deployment—creates a basic Ceph cluster with mandatory Ceph services. See Section 1.2.3, “Ceph Nodes and Daemons” for their list.

Stage 4—the services—additional features of Ceph like iSCSI, Object Gateway and CephFS can be installed in this stage. Each is optional.

Stage 5—the removal stage. This stage is not mandatory and during the initial setup it is usually not needed. In this stage the roles of minions and also the cluster configuration are removed. You need to run this stage when you need to remove a storage node from your cluster. For details refer to Book “Administration Guide”, Chapter 2 “Salt Cluster Administration”, Section 2.3 “Removing and Reinstalling Cluster Nodes”.
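Each stage is started from the Salt master with the salt-run command. As mentioned above, an example invocation of the first two stages looks like this (see Section 5.3, “Cluster Deployment” for the complete procedure):

root@master # salt-run state.orch ceph.stage.0
root@master # salt-run state.orch ceph.stage.1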

5.2.1 Organization and Important Locations

Salt has several standard locations and several naming conventions used on your master node:

/srv/pillar The directory stores configuration data for your cluster minions. Pillar is an interface for providing global configuration values to all your cluster minions.

/srv/salt/ The directory stores Salt state files (also called sls files). State files are formatted descriptions of the states in which the cluster should be.

/srv/module/runners The directory stores Python scripts known as runners. Runners are executed on the master node.

/srv/salt/_modules The directory stores Python scripts that are called modules. The modules are applied to all minions in your cluster.

/srv/pillar/ceph

The directory is used by DeepSea. Collected configuration data are stored here.

/srv/salt/ceph A directory used by DeepSea. It stores sls files that can be in different formats, but each subdirectory contains sls files. Each subdirectory contains only one type of sls file. For example, /srv/salt/ceph/stage contains orchestration files that are executed by salt-run state.orchestrate .

5.2.2 Targeting the Minions

DeepSea commands are executed via the Salt infrastructure. When using the salt command, you need to specify a set of Salt minions that the command will affect. We describe the set of minions as a target for the salt command. The following sections describe possible methods to target the minions.

5.2.2.1 Matching the Minion Name

You can target a minion or a group of minions by matching their names. A minion's name is usually the short host name of the node where the minion runs. This is a general Salt targeting method, not related to DeepSea. You can use globbing, regular expressions, or lists to limit the range of minion names. The general syntax follows:

root@master # salt target example.module

Tip: Ceph-only Cluster If all Salt minions in your environment belong to your Ceph cluster, you can safely substitute target with '*' to include all registered minions.

Match all minions in the example.net domain (assuming the minion names are identical to their "full" host names):

root@master # salt '*.example.net' test.ping

Match the 'web1' to 'web5' minions:

root@master # salt 'web[1-5]' test.ping

Match both 'web1-prod' and 'web1-devel' minions using a regular expression:

root@master # salt -E 'web1-(prod|devel)' test.ping

Match a simple list of minions:

root@master # salt -L 'web1,web2,web3' test.ping

Match all minions in the cluster:

root@master # salt '*' test.ping

5.2.2.2 Targeting with a DeepSea Grain

In a heterogeneous Salt-managed environment where SUSE Enterprise Storage 6 is deployed on a subset of nodes alongside other cluster solutions, you need to mark the relevant minions by applying a 'deepsea' grain to them before running DeepSea stage 0. This way, you can easily target DeepSea minions in environments where matching by the minion name is problematic. To apply the 'deepsea' grain to a group of minions, run:

root@master # salt target grains.append deepsea default

To remove the 'deepsea' grain from a group of minions, run:

root@master # salt target grains.delval deepsea destructive=True

After applying the 'deepsea' grain to the relevant minions, you can target them as follows:

root@master # salt -G 'deepsea:*' test.ping

The following command is an equivalent:

root@master # salt -C 'G@deepsea:*' test.ping

5.2.2.3 Set the deepsea_minions Option

Setting the target of the deepsea_minions option is a requirement for DeepSea deployments. DeepSea uses it to instruct minions during the execution of stages (refer to DeepSea Stages Description for details).

To set or change the deepsea_minions option, edit the /srv/pillar/ceph/deepsea_minions.sls file on the Salt master and add or replace the following line:

deepsea_minions: target

Tip: deepsea_minions Target As the target for the deepsea_minions option, you can use any targeting method: both Matching the Minion Name and Targeting with a DeepSea Grain. Match all Salt minions in the cluster:

deepsea_minions: '*'

Match all minions with the 'deepsea' grain:

deepsea_minions: 'G@deepsea:*'

5.2.2.4 For More Information

You can use more advanced ways to target minions using the Salt infrastructure. The deepsea_minions manual page gives you more details about DeepSea targeting ( man 7 deepsea_minions ).

5.3 Cluster Deployment

The cluster deployment process has several phases. First, you need to prepare all nodes of the cluster by configuring Salt and then deploy and configure Ceph.

Tip: Deploying Monitor Nodes without Defining OSD Profiles If you need to skip defining storage roles for OSDs as described in Section 5.5.1.2, “Role Assignment” and deploy Ceph Monitor nodes first, you can do so by setting the DEV_ENV variable. This allows deploying monitors without the presence of the role-storage/ directory, as well as deploying a Ceph cluster with at least one storage, monitor, and manager role.

To set the environment variable, either enable it globally by setting it in the /srv/pillar/ceph/stack/global.yml file, or set it for the current shell session only:

root@master # export DEV_ENV=true

As an example, /srv/pillar/ceph/stack/global.yml can be created with the following contents:

DEV_ENV: True

The following procedure describes the cluster preparation in detail.

1. Install and register SUSE Linux Enterprise Server 15 SP1 together with the SUSE Enterprise Storage 6 extension on each node of the cluster.

2. Verify that proper products are installed and registered by listing existing software repositories. Run zypper lr -E and compare the output with the following list:

SLE-Product-SLES15-SP1-Pool
SLE-Product-SLES15-SP1-Updates
SLE-Module-Server-Applications15-SP1-Pool
SLE-Module-Server-Applications15-SP1-Updates
SLE-Module-Basesystem15-SP1-Pool
SLE-Module-Basesystem15-SP1-Updates
SUSE-Enterprise-Storage-6-Pool
SUSE-Enterprise-Storage-6-Updates
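If repositories from this list are missing, the node may not be registered correctly. As a minimal sketch, registration can be done with SUSEConnect; the registration codes and the exact product identifier shown here are assumptions and may differ in your environment:

root # SUSEConnect --regcode YOUR_SLES_REGCODE
root # SUSEConnect --product ses/6/x86_64 --regcode YOUR_SES_REGCODE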

3. Configure network settings including proper DNS name resolution on each node. The Salt master and all the Salt minions need to resolve each other by their host names. For more information on configuring a network, see https://documentation.suse.com/sles/15-SP1/single-html/SLES-admin/#sec-network-yast . For more information on configuring a DNS server, see https://documentation.suse.com/sles/15-SP1/single-html/SLES-admin/#cha-dns .

Important If cluster nodes are configured for multiple networks, DeepSea will use the network to which their host names (or FQDNs) resolve. Consider the following example /etc/hosts :

192.168.100.1 ses1.example.com ses1

172.16.100.1 ses1clus.cluster.lan ses1clus

In the above example, the ses1 minion will resolve to the 192.168.100.x network and DeepSea will use this network as the public network. If the desired public network is 172.16.100.x , then the host name should be changed to ses1clus .
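To confirm which address a minion's host name resolves to, and therefore which network DeepSea will treat as the public network, a quick check on the node can help. This is a sketch assuming the example /etc/hosts shown above:

root # getent hosts ses1
192.168.100.1   ses1.example.com ses1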

4. Install the salt-master and salt-minion packages on the Salt master node:

root@master # zypper in salt-master salt-minion

Check that the salt-master service is enabled and started, and enable and start it if needed:

root@master # systemctl enable salt-master.service
root@master # systemctl start salt-master.service

5. If you intend to use a firewall, verify that the Salt master node has ports 4505 and 4506 open to all Salt minion nodes. If the ports are closed, you can open them using the yast2 firewall command by allowing the SaltStack service.
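As an alternative to YaST, the ports can also be opened directly with firewalld. This is a sketch assuming firewalld is the active firewall on the Salt master:

root # firewall-cmd --permanent --add-port=4505-4506/tcp
root # firewall-cmd --reload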

Warning: DeepSea Stages Fail with Firewall DeepSea deployment stages fail when the firewall is active (even if configured). To pass the stages correctly, you need to either turn the firewall off by running

root # systemctl stop firewalld.service

or set the FAIL_ON_WARNING option to 'False' in /srv/pillar/ceph/stack/global.yml :

FAIL_ON_WARNING: False

6. Install the package salt-minion on all minion nodes.

root@minion > zypper in salt-minion

Make sure that the fully qualified domain name of each node can be resolved to the public network IP address by all other nodes.

7. Configure all minions (including the master minion) to connect to the master. If your Salt master is not reachable by the host name salt , edit the file /etc/salt/minion or create a new file /etc/salt/minion.d/master.conf with the following content:

master: host_name_of_salt_master

If you performed any changes to the configuration files mentioned above, restart the Salt service on all Salt minions:

root@minion > systemctl restart salt-minion.service

8. Check that the salt-minion service is enabled and started on all nodes. Enable and start it if needed:

root # systemctl enable salt-minion.service
root # systemctl start salt-minion.service

9. Verify each Salt minion's fingerprint and accept all Salt keys on the Salt master if the fingerprints match.

Note If the Salt minion fingerprint comes back empty, make sure the Salt minion has a Salt master configuration and it can communicate with the Salt master.

View each minion's fingerprint:

root@master # salt-call --local key.finger local: 3f:a3:2f:3f:b4:d3:d9:24:49:ca:6b:2c:e1:6c:3f:c3:83:37:f0:aa:87:42:e8:ff...

After gathering fingerprints of all the Salt minions, list fingerprints of all unaccepted minion keys on the Salt master:

root@master # salt-key -F [...] Unaccepted Keys: minion1: 3f:a3:2f:3f:b4:d3:d9:24:49:ca:6b:2c:e1:6c:3f:c3:83:37:f0:aa:87:42:e8:ff...

If the minions' fingerprints match, accept them:

root@master # salt-key --accept-all

10. Verify that the keys have been accepted:

root@master # salt-key --list-all

11. By default, DeepSea uses the Admin Node as the time server for other cluster nodes. Therefore, if the Admin Node is not virtualized, select one or more time servers or pools, and synchronize the local time against them. Verify that the time synchronization service is enabled on each system start-up. Find more information on setting up time synchronization in https://documentation.suse.com/sles/15-SP1/html/SLES-all/cha-ntp.html#sec-ntp-yast . If the Admin Node is a virtual machine, provide better time sources for the cluster nodes by overriding the default NTP client configuration:

1. Edit /srv/pillar/ceph/stack/global.yml on the Salt master node and add the following line:

time_server: CUSTOM_NTP_SERVER

To add multiple time servers, the format is as follows:

time_server:
  - CUSTOM_NTP_SERVER1
  - CUSTOM_NTP_SERVER2
  - CUSTOM_NTP_SERVER3
  [...]

2. Refresh the Salt pillar:

root@master # salt '*' saltutil.refresh_pillar

3. Verify the changed value:

root@master # salt '*' pillar.items

4. Apply the new setting:

root@master # salt '*' state.apply ceph.time

12. Prior to deploying SUSE Enterprise Storage 6, manually zap all the disks. Remember to replace 'X' with the correct disk letter:

a. Stop all processes that are using the specific disk.

b. Verify whether any partition on the disk is mounted, and unmount if needed.

c. If the disk is managed by LVM, deactivate and delete the whole LVM infrastructure (see the sketch after this list). Refer to https://documentation.suse.com/sles/15-SP1/html/SLES-all/cha-lvm.html for more details.

d. If the disk is part of MD RAID, deactivate the RAID. Refer to https://documentation.suse.com/sles/15-SP1/single-html/SLES-storage/#part-software- for more details.

e. Tip: Rebooting the Server If you get error messages such as 'partition in use' or 'kernel cannot be updated with the new partition table' during the following steps, reboot the server.

Wipe data and partitions on the disk:

cephadm@adm > ceph-volume lvm zap /dev/sdX --destroy

f. Verify that the drive is empty (with no GPT structures) using:

root # parted -s /dev/sdX print free

or

root # dd if=/dev/sdX bs=512 count=34 | hexdump -C
root # dd if=/dev/sdX bs=512 count=33 \
  skip=$((`blockdev --getsz /dev/sdX` - 33)) | hexdump -C
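For steps c and d above, the following sketch shows typical commands for tearing down LVM and MD RAID on a disk. The volume group, logical volume, partition, and MD device names are placeholders and must be replaced with the actual names on your system:

root # lvremove -f /dev/VG_NAME/LV_NAME
root # vgremove -f VG_NAME
root # pvremove /dev/sdX1
root # mdadm --stop /dev/md0
root # mdadm --zero-superblock /dev/sdX1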

13. Optionally, if you need to preconfigure the cluster's network settings before the deepsea package is installed, create /srv/pillar/ceph/stack/ceph/cluster.yml manually and set the cluster_network: and public_network: options. Note that the file will not be overwritten after you install deepsea . Then, run:

chown -R salt:salt /srv/pillar/ceph/stack

Tip: Enabling IPv6 If you need to enable IPv6 network addressing, refer to Section 7.2.1, “Enabling IPv6 for Ceph Cluster Deployment”.
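A minimal sketch of such a manually created /srv/pillar/ceph/stack/ceph/cluster.yml follows; the network ranges are example values only and must match your environment:

cluster_network: 172.16.100.0/24
public_network: 192.168.100.0/24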

14. Install DeepSea on the Salt master node:

root@master # zypper in deepsea

15. The value of the master_minion parameter is dynamically derived from the /etc/salt/minion_id file on the Salt master. If you need to override the discovered value, edit the file /srv/pillar/ceph/stack/global.yml and set a relevant value:

master_minion: MASTER_MINION_NAME

If your Salt master is reachable via more than one host name, use the Salt minion name for the storage cluster as returned by the salt-key -L command. If you used the default host name for your Salt master—salt—in the ses domain, then the file looks as follows:

master_minion: salt.ses

Now you deploy and configure Ceph. Unless specified otherwise, all steps are mandatory.

Note: Salt Command Conventions There are two possible ways to run salt-run state.orch —one is with 'stage.STAGE_NUMBER', the other is with the name of the stage. Both notations have the same impact and which one you use is entirely your preference.

PROCEDURE 5.1: RUNNING DEPLOYMENT STAGES

1. Ensure the Salt minions belonging to the Ceph cluster are correctly targeted through the deepsea_minions option in /srv/pillar/ceph/deepsea_minions.sls . Refer to Section 5.2.2.3, “Set the deepsea_minions Option” for more information.

2. By default, DeepSea deploys Ceph clusters with tuned profiles active on Ceph Monitor, Ceph Manager, and Ceph OSD nodes. In some cases, you may need to deploy without tuned profiles. To do so, put the following lines in /srv/pillar/ceph/stack/global.yml before running DeepSea stages:

alternative_defaults:
  tuned_mgr_init: default-off
  tuned_mon_init: default-off
  tuned_osd_init: default-off

3. Optional: create sub-volumes for /var/lib/ceph/ . This step needs to be executed before DeepSea stage.0. To migrate existing directories or for more details, see Book “Administration Guide”, Chapter 33 “Hints and Tips”, Section 33.6 “Btrfs Subvolume for /var/lib/ceph on Ceph Monitor Nodes”. Apply the following commands to each of the Salt minions:

root@master # salt 'MONITOR_NODES' saltutil.sync_all
root@master # salt 'MONITOR_NODES' state.apply ceph.subvolume

Note The ceph.subvolume state creates /var/lib/ceph as a @/var/lib/ceph Btrfs subvolume.

The new subvolume is now mounted and /etc/fstab is updated.

4. Prepare your cluster. Refer to DeepSea Stages Description for more details.

root@master # salt-run state.orch ceph.stage.0

or

root@master # salt-run state.orch ceph.stage.prep

Note: Run or Monitor Stages using DeepSea CLI Using the DeepSea CLI, you can follow the stage execution progress in real-time, either by running the DeepSea CLI in the monitoring mode, or by running the stage directly through DeepSea CLI. For details refer to Section 5.4, “DeepSea CLI”.

5. The discovery stage collects data from all minions and creates configuration fragments that are stored in the directory /srv/pillar/ceph/proposals . The data are stored in the YAML format in *.sls or *.yml files. Run the following command to trigger the discovery stage:

root@master # salt-run state.orch ceph.stage.1

or

root@master # salt-run state.orch ceph.stage.discovery

6. After the previous command finishes successfully, create a policy.cfg file in /srv/pillar/ceph/proposals . For details refer to Section 5.5.1, “The policy.cfg File”.

Tip If you need to change the cluster's network setting, edit /srv/pillar/ceph/stack/ceph/cluster.yml and adjust the lines starting with cluster_network: and public_network: .

7. The configuration stage parses the policy.cfg file and merges the included files into their final form. Cluster and role related content are placed in /srv/pillar/ceph/cluster , while Ceph specific content is placed in /srv/pillar/ceph/stack/default . Run the following command to trigger the configuration stage:

root@master # salt-run state.orch ceph.stage.2

or

root@master # salt-run state.orch ceph.stage.configure

The configuration step may take several seconds. After the command finishes, you can view the pillar data for the specified minions (for example, named ceph_minion1 , ceph_minion2 , etc.) by running:

root@master # salt 'ceph_minion*' pillar.items

Tip: Modifying OSD's Layout If you want to modify the default OSD layout and change the drive groups configuration, follow the procedure described in Section 5.5.2, “DriveGroups”.

Note: Overwriting Defaults As soon as the command finishes, you can view the default configuration and change it to suit your needs. For details refer to Chapter 7, Customizing the Default Configuration.

8. Now you run the deployment stage. In this stage, the pillar is validated, and the Ceph Monitor and Ceph OSD daemons are started:

root@master # salt-run state.orch ceph.stage.3

or

root@master # salt-run state.orch ceph.stage.deploy

The command may take several minutes. If it fails, you need to fix the issue and run the previous stages again. After the command succeeds, run the following to check the status:

cephadm@adm > ceph -s

9. The last step of the Ceph cluster deployment is the services stage. Here you instantiate any of the currently supported services: iSCSI Gateway, CephFS, Object Gateway, and NFS Ganesha. In this stage, the necessary pools and authorization keyrings are created, and the services are started. To start the stage, run the following:

root@master # salt-run state.orch ceph.stage.4

or

root@master # salt-run state.orch ceph.stage.services

Depending on the setup, the command may run for several minutes.

10. Disable insecure clients. Since Nautilus v14.2.20, a new health warning was introduced that informs you that insecure clients are allowed to join the cluster. This warning is on by default. The Ceph Dashboard will show the cluster in the HEALTH_WARN status and verifying the cluster status on the command line informs you as follows:

cephadm@adm > ceph status
  cluster:
    id:     3fe8b35a-689f-4970-819d-0e6b11f6707c
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim
[...]

This warning means that the Ceph Monitors are still allowing old, unpatched clients to connect to the cluster. This ensures existing clients can still connect while the cluster is being upgraded, but warns you that there is a problem that needs to be addressed. When the cluster and all clients are upgraded to the latest version of Ceph, disallow unpatched clients by running the following command:

cephadm@adm > ceph config set mon auth_allow_insecure_global_id_reclaim false

11. Before you continue, we strongly recommend enabling the Ceph telemetry module. See Book “Administration Guide”, Chapter 21 “Ceph Manager Modules”, Section 21.2 “Telemetry Module” for information and instructions.

5.4 DeepSea CLI

DeepSea also provides a command line interface (CLI) tool that allows the user to monitor or run stages while visualizing the execution progress in real-time. Verify that the deepsea-cli package is installed before you run the deepsea executable. Two modes are supported for visualizing a stage's execution progress:

DEEPSEA CLI MODES

Monitoring mode: visualizes the execution progress of a DeepSea stage triggered by the salt-run command issued in another terminal session.

Stand-alone mode: runs a DeepSea stage while providing real-time visualization of its component steps as they are executed.

Important: DeepSea CLI Commands The DeepSea CLI commands can only be run on the Salt master node with root privileges.

5.4.1 DeepSea CLI: Monitor Mode

The progress monitor provides a detailed, real-time visualization of what is happening during execution of stages using salt-run state.orch commands in other terminal sessions.

Tip: Start Monitor in a New Terminal Session You need to start the monitor in a new terminal window before running any salt-run state.orch so that the monitor can detect the start of the stage's execution.

If you start the monitor after issuing the salt-run state.orch command, then no execution progress will be shown. You can start the monitor mode by running the following command:

root@master # deepsea monitor

For more information about the available command line options of the deepsea monitor command, check its manual page:

root@master # man deepsea-monitor

5.4.2 DeepSea CLI: Stand-alone Mode

In the stand-alone mode, DeepSea CLI can be used to run a DeepSea stage, showing its execution in real-time. The command to run a DeepSea stage from the DeepSea CLI has the following form:

root@master # deepsea stage run stage-name
where stage-name corresponds to the way Salt orchestration state files are referenced. For example, stage deploy, which corresponds to the directory located in /srv/salt/ceph/stage/deploy , is referenced as ceph.stage.deploy.

This command is an alternative to the Salt-based commands for running DeepSea stages (or any DeepSea orchestration state file). The command deepsea stage run ceph.stage.0 is equivalent to salt-run state.orch ceph.stage.0 . For more information about the available command line options accepted by the deepsea stage run command, check its manual page:

root@master # man deepsea-stage run

The following figure shows an example of the output of the DeepSea CLI when running Stage 2:

FIGURE 5.1: DEEPSEA CLI STAGE EXECUTION PROGRESS OUTPUT

5.4.2.1 DeepSea CLI stage run Alias

For advanced users of Salt, we also support an alias for running a DeepSea stage that takes the Salt command used to run a stage, for example, salt-run state.orch stage-name , as a command of the DeepSea CLI. Example:

root@master # deepsea salt-run state.orch stage-name

5.5 Configuration and Customization

5.5.1 The policy.cfg File

The /srv/pillar/ceph/proposals/policy.cfg configuration file is used to determine roles of individual cluster nodes. For example, which nodes act as Ceph OSDs or Ceph Monitors. Edit policy.cfg in order to reflect your desired cluster setup. The order of the sections is arbitrary, but the content of included lines overwrites matching keys from the content of previous lines.

Tip: Examples of policy.cfg You can find several examples of complete policy files in the /usr/share/doc/packages/deepsea/examples/ directory.

5.5.1.1 Cluster Assignment

In the cluster section you select minions for your cluster. You can select all minions, or you can blacklist or whitelist minions. Examples for a cluster called ceph follow. To include all minions, add the following line:

cluster-ceph/cluster/*.sls

To whitelist a particular minion:

cluster-ceph/cluster/abc.domain.sls
or a group of minions—you can use shell glob matching:

cluster-ceph/cluster/mon*.sls

To blacklist minions, set them to unassigned :

cluster-unassigned/cluster/client*.sls

5.5.1.2 Role Assignment

This section provides you with details on assigning 'roles' to your cluster nodes. A 'role' in this context means the service you need to run on the node, such as Ceph Monitor, Object Gateway, or iSCSI Gateway. No role is assigned automatically, only roles added to policy.cfg will be deployed. The assignment follows this pattern:

role-ROLE_NAME/PATH/FILES_TO_INCLUDE

Where the items have the following meaning and values:

ROLE_NAME is any of the following: 'master', 'admin', 'mon', 'mgr', 'storage', 'mds', 'igw', 'rgw', 'ganesha', 'grafana', or 'prometheus'.

PATH is a relative directory path to .sls or .yml files. In case of .sls files, it usually is cluster , while .yml files are located at stack/default/ceph/minions .

FILES_TO_INCLUDE are the Salt state files or YAML configuration files. They normally consist of Salt minions' host names, for example ses5min2.yml . Shell globbing can be used for more specific matching.

An example for each role follows:

master - the node has admin keyrings to all Ceph clusters. Currently, only a single Ceph cluster is supported. As the master role is mandatory, always add a similar line to the following:

role-master/cluster/master*.sls

admin - the minion will have an admin keyring. You define the role as follows:

role-admin/cluster/abc*.sls

mon - the minion will provide the monitor service to the Ceph cluster. This role requires addresses of the assigned minions. From SUSE Enterprise Storage 5, the public addresses are calculated dynamically and are no longer needed in the Salt pillar.

role-mon/cluster/mon*.sls

The example assigns the monitor role to a group of minions.

mgr - the Ceph manager daemon which collects all the state information from the whole cluster. Deploy it on all minions where you plan to deploy the Ceph monitor role.

role-mgr/cluster/mgr*.sls

storage - use this role to specify storage nodes.

role-storage/cluster/data*.sls

mds - the minion will provide the metadata service to support CephFS.

role-mds/cluster/mds*.sls

igw - the minion will act as an iSCSI Gateway. This role requires addresses of the assigned minions, thus you need to also include the files from the stack directory:

role-igw/cluster/*.sls

rgw - the minion will act as an Object Gateway:

role-rgw/cluster/rgw*.sls

ganesha - the minion will act as an NFS Ganesha server. The 'ganesha' role requires either an 'rgw' or 'mds' role in cluster, otherwise the validation will fail in Stage 3.

role-ganesha/cluster/ganesha*.sls

To successfully install NFS Ganesha, additional configuration is required. If you want to use NFS Ganesha, read Chapter 12, Installation of NFS Ganesha before executing stages 2 and 4. However, it is possible to install NFS Ganesha later. In some cases it can be useful to define custom roles for NFS Ganesha nodes. For details, see Book “Administration Guide”, Chapter 30 “NFS Ganesha: Export Ceph Data via NFS”, Section 30.3 “Custom NFS Ganesha Roles”.

grafana, prometheus - this node adds Grafana charts based on Prometheus alerting to the Ceph Dashboard. Refer to Book “Administration Guide” for its detailed description.

role-grafana/cluster/grafana*.sls

role-prometheus/cluster/prometheus*.sls

Note: Multiple Roles of Cluster Nodes You can assign several roles to a single node. For example, you can assign the 'mds' roles to the monitor nodes:

role-mds/cluster/mon[1,2]*.sls

5.5.1.3 Common Configuration

The common configuration section includes configuration files generated during the discovery (Stage 1). These configuration files store parameters like fsid or public_network . To include the required Ceph common configuration, add the following lines:

config/stack/default/global.yml
config/stack/default/ceph/cluster.yml

5.5.1.4 Item Filtering

Sometimes it is not practical to include all files from a given directory with *.sls globbing. The policy.cfg file parser understands the following filters:

Warning: Advanced Techniques This section describes filtering techniques for advanced users. When not used correctly, filtering can cause problems, for example, if your node numbering changes.

slice=[start:end]
Use the slice filter to include only items start through end-1. Note that items in the given directory are sorted alphanumerically. The following line includes the third to fifth files from the role-mon/cluster/ subdirectory:

role-mon/cluster/*.sls slice[3:6]

re=regexp

Use the regular expression filter to include only items matching the given expressions. For example:

role-mon/cluster/mon*.sls re=.*1[135]\.subdomainX\.sls$

5.5.1.5 Example policy.cfg File

Following is an example of a basic policy.cfg file:

## Cluster Assignment
cluster-ceph/cluster/*.sls 1

## Roles
# ADMIN
role-master/cluster/examplesesadmin.sls 2
role-admin/cluster/sesclient*.sls 3

# MON
role-mon/cluster/ses-example-[123].sls 4

# MGR
role-mgr/cluster/ses-example-[123].sls 5

# STORAGE
role-storage/cluster/ses-example-[5678].sls 6

# MDS
role-mds/cluster/ses-example-4.sls 7

# IGW
role-igw/cluster/ses-example-4.sls 8

# RGW
role-rgw/cluster/ses-example-4.sls 9

# COMMON
config/stack/default/global.yml 10
config/stack/default/ceph/cluster.yml 11

1 Indicates that all minions are included in the Ceph cluster. If you have minions you do not want to include in the Ceph cluster, use:

cluster-unassigned/cluster/*.sls
cluster-ceph/cluster/ses-example-*.sls

The first line marks all minions as unassigned. The second line overrides minions matching 'ses-example-*.sls', and assigns them to the Ceph cluster.
2 The minion called 'examplesesadmin' has the 'master' role. This, by the way, means it will get admin keys to the cluster.
3 All minions matching 'sesclient*' will get admin keys as well.
4 All minions matching 'ses-example-[123]' (presumably three minions: ses-example-1, ses-example-2, and ses-example-3) will be set up as MON nodes.
5 All minions matching 'ses-example-[123]' (all MON nodes in the example) will be set up as MGR nodes.
6 All minions matching 'ses-example-[5678]' will be set up as storage nodes.
7 Minion 'ses-example-4' will have the MDS role.
8 Minion 'ses-example-4' will have the IGW role.
9 Minion 'ses-example-4' will have the RGW role.
10 Means that we accept the default values for common configuration parameters such as fsid and public_network .
11 Means that we accept the default values for common configuration parameters such as fsid and public_network .

5.5.2 DriveGroups

DriveGroups specify the layouts of OSDs in the Ceph cluster. They are defined in a single file /srv/salt/ceph/configuration/files/drive_groups.yml . An administrator should manually specify a group of OSDs that are interrelated (hybrid OSDs that are deployed on solid state and spinners) or share the same deployment options (identical, for example same object store, same encryption option, stand-alone OSDs). To avoid explicitly listing devices, DriveGroups use a list of filter items that correspond to a few selected fields of ceph-volume 's inventory reports. In the simplest case this could be the 'rotational' flag (all solid-state drives are to be db_devices, all rotating ones data devices) or something more involved such as 'model' strings, or sizes. DeepSea will provide code that translates these DriveGroups into actual device lists for inspection by the user.

Note The filters use an OR gate to match against the drives.

Following is a simple procedure that demonstrates the basic workflow when configuring DriveGroups:

1. Inspect your disks' properties as seen by the ceph-volume command. Only these properties are accepted by DriveGroups:

root@master # salt-run disks.details

2. Open the /srv/salt/ceph/configuration/files/drive_groups.yml YAML file and adjust it to your needs. Refer to Section 5.5.2.1, “Specification”. Remember to use spaces instead of tabs. Find more advanced examples in Section 5.5.2.4, “Examples”. The following example includes all drives available to Ceph as OSDs:

default_drive_group_name:
  target: '*'
  data_devices:
    all: true

3. Verify new layouts:

root@master # salt-run disks.list

This runner returns you a structure of matching disks based on your DriveGroups. If you are not happy with the result, repeat the previous step.

Tip: Detailed Report In addition to the disks.list runner, there is a disks.report runner that prints out a detailed report of what will happen in the next DeepSea stage 3 invocation.

root@master # salt-run disks.report

4. Deploy OSDs. On the next DeepSea stage 3 invocation, the OSD disks will be deployed according to your DriveGroups specification.

5.5.2.1 Specification

/srv/salt/ceph/configuration/files/drive_groups.yml can take one of two basic forms, depending on whether BlueStore or FileStore is to be used. For BlueStore setups, drive_groups.yml can be as follows:

drive_group_default_name:
  target: '*'
  data_devices:
    drive_spec: DEVICE_SPECIFICATION
  db_devices:
    drive_spec: DEVICE_SPECIFICATION
  wal_devices:
    drive_spec: DEVICE_SPECIFICATION
  block_wal_size: '5G'  # (optional, unit suffixes permitted)
  block_db_size: '5G'   # (optional, unit suffixes permitted)
  osds_per_device: 1    # number of osd daemons per device
  format:               # 'bluestore' or 'filestore' (defaults to 'bluestore')
  encryption:           # 'True' or 'False' (defaults to 'False')

For FileStore setups, drive_groups.yml can be as follows:

drive_group_default_name:
  target: '*'
  data_devices:
    drive_spec: DEVICE_SPECIFICATION
  journal_devices:
    drive_spec: DEVICE_SPECIFICATION
  format: filestore
  encryption: True

Note If you are unsure if your OSD is encrypted, see Book “Administration Guide”, Chapter 2 “Salt Cluster Administration”, Section 2.5 “Verify an Encrypted OSD”.

5.5.2.2 Matching Disk Devices

You can describe the specification using the following filters:

By a disk model:

model: DISK_MODEL_STRING

By a disk vendor:

vendor: DISK_VENDOR_STRING

Tip: Lowercase Vendor String Always lowercase the DISK_VENDOR_STRING .

Whether a disk is rotational or not. SSDs and NVME drives are not rotational.

rotational: 0

Deploy a node using all available drives for OSDs:

data_devices: all: true

Additionally, by limiting the number of matching disks:

limit: 10

5.5.2.3 Filtering Devices by Size

You can filter disk devices by their size—either by an exact size, or a size range. The size: parameter accepts arguments in the following form:

'10G' - Includes disks of an exact size.

'10G:40G' - Includes disks whose size is within the range.

':10G' - Includes disks less than or equal to 10 GB in size.

'40G:' - Includes disks equal to or greater than 40 GB in size.

EXAMPLE 5.1: MATCHING BY DISK SIZE

drive_group_default:

  target: '*'
  data_devices:
    size: '40TB:'
  db_devices:
    size: ':2TB'

Note: Quotes Required When using the ':' delimiter, you need to enclose the size in quotes, otherwise the ':' sign will be interpreted as a new configuration hash.

Tip: Unit Shortcuts Instead of (G)igabytes, you can specify the sizes in (M)egabytes or (T)erabytes as well.

5.5.2.4 Examples

This section includes examples of different OSD setups.

EXAMPLE 5.2: SIMPLE SETUP

This example describes two nodes with the same setup:

20 HDDs

Vendor: Intel

Model: SSD-123-foo

Size: 4 TB

2 SSDs

Vendor: Micron

Model: MC-55-44-ZX

Size: 512 GB

The corresponding drive_groups.yml file will be as follows:

drive_group_default:

  target: '*'
  data_devices:
    model: SSD-123-foo
  db_devices:
    model: MC-55-44-ZX

Such a configuration is simple and valid. The problem is that an administrator may add disks from different vendors in the future, and these will not be included. You can improve it by reducing the filters on core properties of the drives:

drive_group_default:
  target: '*'
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0

In the previous example, we are declaring all rotating devices as 'data devices', and all non-rotating devices will be used as 'shared devices' (wal, db). If you know that drives with more than 2 TB will always be the slower data devices, you can filter by size:

drive_group_default:
  target: '*'
  data_devices:
    size: '2TB:'
  db_devices:
    size: ':2TB'

EXAMPLE 5.3: ADVANCED SETUP

This example describes two distinct setups: 20 HDDs should share 2 SSDs, while 10 SSDs should share 2 NVMes.

20 HDDs

Vendor: Intel

Model: SSD-123-foo

Size: 4 TB

12 SSDs

Vendor: Micron

Model: MC-55-44-ZX

Size: 512 GB

2 NVMes

Vendor: Samsung

Model: NVME-QQQQ-987

Size: 256 GB

Such a setup can be defined with two layouts as follows:

drive_group:
  target: '*'
  data_devices:
    rotational: 0
  db_devices:
    model: MC-55-44-ZX

drive_group_default:
  target: '*'
  data_devices:
    model: MC-55-44-ZX
  db_devices:
    vendor: samsung
    size: 256GB

Note that any drive of the size 256 GB and any drive from Samsung will match as a DB device with this example.

EXAMPLE 5.4: ADVANCED SETUP WITH NON-UNIFORM NODES

The previous examples assumed that all nodes have the same drives. However, that is not always the case:
Nodes 1-5:

20 HDDs

Vendor: Intel

Model: SSD-123-foo

Size: 4 TB

2 SSDs

Vendor: Micron

Model: MC-55-44-ZX

Size: 512 GB

Nodes 6-10:

5 NVMes

Vendor: Intel

Model: SSD-123-foo

Size: 4 TB

20 SSDs

Vendor: Micron

Model: MC-55-44-ZX

Size: 512 GB

You can use the 'target' key in the layout to target specific nodes. Salt target notation helps to keep things simple:

drive_group_node_one_to_five:
  target: 'node[1-5]'
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0

followed by

drive_group_the_rest:
  target: 'node[6-10]'
  data_devices:
    model: MC-55-44-ZX
  db_devices:
    model: SSD-123-foo

EXAMPLE 5.5: EXPERT SETUP

All previous cases assumed that the WALs and DBs use the same device. It is however possible to deploy the WAL on a dedicated device as well:

20 HDDs

Vendor: Intel

Model: SSD-123-foo

Size: 4 TB

2 SSDs

Vendor: Micron

Model: MC-55-44-ZX

Size: 512 GB

2 NVMes

Vendor: Samsung

Model: NVME-QQQQ-987

Size: 256 GB

drive_group_default:
  target: '*'
  data_devices:
    model: MC-55-44-ZX
  db_devices:
    model: SSD-123-foo
  wal_devices:
    model: NVME-QQQQ-987

EXAMPLE 5.6: COMPLEX (AND UNLIKELY) SETUP

In the following setup, we are trying to define:

20 HDDs backed by 1 NVMe

2 HDDs backed by 1 SSD(db) and 1 NVMe(wal)

8 SSDs backed by 1 NVMe

2 SSDs stand-alone (encrypted)

1 HDD is spare and should not be deployed

The summary of used drives follows:

23 HDDs

Vendor: Intel

Model: SSD-123-foo

Size: 4 TB

10 SSDs

Vendor: Micron

Model: MC-55-44-ZX

Size: 512 GB

1 NVMe

Vendor: Samsung

Model: NVME-QQQQ-987

Size: 256 GB

The DriveGroups denition will be the following:

drive_group_hdd_nvme:
  target: '*'
  data_devices:
    rotational: 0
  db_devices:
    model: NVME-QQQQ-987

drive_group_hdd_ssd_nvme:
  target: '*'
  data_devices:
    rotational: 0
  db_devices:
    model: MC-55-44-ZX

  wal_devices:
    model: NVME-QQQQ-987

drive_group_ssd_nvme:
  target: '*'
  data_devices:
    model: SSD-123-foo
  db_devices:
    model: NVME-QQQQ-987

drive_group_ssd_standalone_encrypted:
  target: '*'
  data_devices:
    model: SSD-123-foo
  encryption: True

One HDD will remain as the file is being parsed from top to bottom.

5.5.3 Adjusting ceph.conf with Custom Settings

If you need to put custom settings into the ceph.conf configuration file, see Book “Administration Guide”, Chapter 2 “Salt Cluster Administration”, Section 2.14 “Adjusting ceph.conf with Custom Settings” for more details.
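As a brief illustration of the mechanism described in the referenced chapter, custom settings typically live in a drop-in file under /srv/salt/ceph/configuration/files/ceph.conf.d/ and are merged into the generated ceph.conf; the file name global.conf and the option shown here are assumptions for illustration only, not a definitive recipe:

# /srv/salt/ceph/configuration/files/ceph.conf.d/global.conf (hypothetical example)
mon pg warn max object skew = 20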

6 Upgrading from Previous Releases

This chapter introduces steps to upgrade SUSE Enterprise Storage 5.5 to version 6. Note that version 5.5 is basically 5 with all latest patches applied.

Note: Upgrade from Older Releases Not Supported Upgrading from SUSE Enterprise Storage versions older than 5.5 is not supported. You first need to upgrade to the latest version of SUSE Enterprise Storage 5.5 and then follow the steps in this chapter.

6.1 General Considerations

If openATTIC is located on the Admin Node, it will be unavailable after you upgrade the node. The new Ceph Dashboard will not be available until you deploy it by using DeepSea.

The cluster upgrade may take a long time—approximately the time it takes to upgrade one machine multiplied by the number of cluster nodes.

A single node cannot be upgraded while running the previous SUSE Linux Enterprise Server release, but needs to be rebooted into the new version's installer. Therefore the services that the node provides will be unavailable for some time. The core cluster services will still be available—for example, if one MON is down during the upgrade, there are still at least two active MONs. Unfortunately, single-instance services, such as a single iSCSI Gateway, will be unavailable.

6.2 Steps to Take before Upgrading the First Node

6.2.1 Read the Release Notes

In the SES 6 release notes, you can find additional information on changes since the previous release of SUSE Enterprise Storage. Check the SES 6 release notes online to see whether:

Your hardware needs special considerations.

Any used software packages have changed significantly.

Special precautions are necessary for your installation.

You can find SES 6 release notes online at https://www.suse.com/releasenotes/ .

6.2.2 Verify Your Password

Your password must be changed to meet SUSE Enterprise Storage 6 requirements. Ensure you change the username and password on all initiators as well. For more information on changing your password, see Section 10.4.4.3, “CHAP Authentication”.

6.2.3 Verify the Previous Upgrade

In case you previously upgraded from version 4, verify that the upgrade to version 5 was completed successfully:

Check for the existence of the file

/srv/salt/ceph/configuration/files/ceph.conf.import

It is created by the import process during the upgrade from SES 4 to 5. Also, the configuration_init: default-import option is set in the file /srv/pillar/ceph/proposals/config/stack/default/ceph/cluster.yml . If configuration_init is still set to default-import , the cluster is using ceph.conf.import as its configuration file and not DeepSea's default ceph.conf which is compiled from files in /srv/salt/ceph/configuration/files/ceph.conf.d/

Therefore you need to inspect ceph.conf.import for any custom configuration, and possibly move the configuration to one of the files in

/srv/salt/ceph/configuration/files/ceph.conf.d/

Then remove the configuration_init: default-import line from /srv/pillar/ceph/proposals/config/stack/default/ceph/cluster.yml .
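A quick way to check whether the option is still present, assuming the default file location mentioned above:

root@master # grep configuration_init /srv/pillar/ceph/proposals/config/stack/default/ceph/cluster.yml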

Warning: Default DeepSea Configuration If you do not merge the configuration from ceph.conf.import and remove the configuration_init: default-import option, any default configuration settings we ship as part of DeepSea (stored in /srv/salt/ceph/configuration/files/ceph.conf.j2 ) will not be applied to the cluster.

Run the salt-run upgrade.check command to verify that the cluster uses the new bucket type straw2 , and that the Admin Node is not a storage node. The default is straw2 for any newly created buckets.

Important The new straw2 bucket type fixes several limitations in the original straw bucket type. The previous straw buckets would change some mappings that should not have changed when a weight was adjusted. straw2 achieves the original goal of only changing mappings to or from the bucket item whose weight has changed. Changing a bucket type from straw to straw2 results in a small amount of data movement, depending on how much the bucket item weights vary from each other. When the weights are all the same, no data will move. When an item's weight varies significantly there will be more movement. To migrate, execute:

cephadm@adm > ceph osd getcrushmap -o backup-crushmap cephadm@adm > ceph osd crush set-all-straw-buckets-to-straw2

If there are problems, you can revert this change with:

cephadm@adm > ceph osd setcrushmap -i backup-crushmap

Moving to straw2 buckets unlocks a few recent features, such as the crush-compat balancer mode that was added in Ceph Luminous (SES 5).

Check that the Ceph 'jewel' profile is used:

cephadm@adm > ceph osd crush dump | grep profile

6.2.4 Upgrade Old RBD Kernel Clients

In case old RBD kernel clients (older than SUSE Linux Enterprise Server 12 SP3) are being used, refer to Book “Administration Guide”, Chapter 23 “RADOS Block Device”, Section 23.9 “Mapping RBD Using Old Kernel Clients”. We recommend upgrading old RBD kernel clients if possible.

6.2.5 Adjust AppArmor

If you used AppArmor in either 'complain' or 'enforce' mode, you need to set a Salt pillar variable before upgrading. Because SUSE Linux Enterprise Server 15 SP1 ships with AppArmor by default, AppArmor management was integrated into DeepSea stage 0. The default behavior in SUSE Enterprise Storage 6 is to remove AppArmor and related profiles. If you want to retain the behavior configured in SUSE Enterprise Storage 5.5, verify that one of the following lines is present in the /srv/pillar/ceph/stack/global.yml file before starting the upgrade:

apparmor_init: default-enforce

or

apparmor_init: default-complain

6.2.6 Verify MDS Names

From SUSE Enterprise Storage 6, MDS names are no longer allowed to begin with a digit, and such names will cause MDS daemons to refuse to start. You can check whether your daemons have such names either by running the ceph fs status command, or by restarting an MDS and checking its logs for the following message:

deprecation warning: MDS id '1mon1' is invalid and will be forbidden in

a future version. MDS names may not start with a numeric digit.

If you see the above message, the MDS names must be migrated before attempting to upgrade to SUSE Enterprise Storage 6. DeepSea provides an orchestration to automate such a migration. MDS names starting with a digit will have 'mds.' prepended:

root@master # salt-run state.orch ceph.mds.migrate-numerical-names

Tip: Custom Configuration Bound to MDS Names If you have configuration settings that are bound to MDS names and your MDS daemons have names starting with a digit, verify that your configuration settings apply to the new names as well (with the 'mds.' prefix). Consider the following example section in the /etc/ceph/ceph.conf file:

[mds.123-my-mds]
# config setting specific to an MDS with a name starting with a digit
mds cache memory limit = 1073741824
mds standby for name = 456-another-mds

The ceph.mds.migrate-numerical-names orchestrator will change the MDS daemon name '123-my-mds' to 'mds.123-my-mds'. You need to adjust the configuration to reflect the new name:

[mds.mds.123-my-mds]
# config setting specific to the new MDS name
mds cache memory limit = 1073741824
mds standby for name = mds.456-another-mds

This will add MDS daemons with the new names before removing the old MDS daemons. The number of MDS daemons will double for a short time. Clients will be able to access CephFS only after a short pause for failover to happen. Therefore plan the migration for a time when you expect little or no CephFS load.

6.2.7 Consolidate Scrub-related Configuration

The osd_scrub_max_interval and osd_deep_scrub_interval settings are used by both OSD and MON daemons. OSDs use these settings to decide when to run scrub, and MONs use them to decide if a warning about scrub not running in time (running too long) should be shown.

Therefore, if non-default settings are used, they should be visible to both OSD and MON daemons (that is, defined either in both [osd] and [mon] sections, or in the [global] section), otherwise the monitors may give false alarms. In SES 5.5 the monitor warnings are disabled by default and the issue may not be noticed if the settings are overridden in the [osd] section only. But when the monitors are upgraded to SES 6, they will start to complain, because in this version the warnings are enabled by default. So if you define non-default scrub settings in your configuration only in the [osd] section, it is desirable to move them to the [global] section before upgrading to SES 6 to avoid false alarms about scrub not running in time.
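For example, a ceph.conf fragment like the following would make the scrub settings visible to both OSDs and MONs; the interval values shown (in seconds) are placeholders only:

[global]
osd scrub max interval = 604800
osd deep scrub interval = 604800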

6.2.8 Back Up Cluster Data

Although creating backups of a cluster's configuration and data is not mandatory, we strongly recommend backing up important configuration files and cluster data. Refer to Book “Administration Guide”, Chapter 3 “Backing Up Cluster Configuration and Data” for more details.

6.2.9 Migrate from ntpd to chronyd

SUSE Linux Enterprise Server 15 SP1 no longer uses ntpd to synchronize the local host time. Instead, chronyd is used. You need to migrate the time synchronization daemon on each cluster node. You can migrate to chronyd either before upgrading the cluster, or upgrade the cluster and migrate to chronyd afterward.

Warning Before you continue, review your current ntpd settings and determine if you want to keep using the same time server. Keep in mind that the default behavior will convert to using chronyd . If you want to manually maintain the chronyd configuration, follow the instructions below and ensure you disable ntpd time configuration. See Procedure 7.1, “Disabling Time Synchronization” for more information.

PROCEDURE 6.1: MIGRATE TO chronyd BEFORE THE CLUSTER UPGRADE

1. Install the chrony package:

root@minion > zypper install chrony

2. Edit the chronyd configuration file /etc/chrony.conf and add NTP sources from the current ntpd configuration in /etc/ntp.conf (see the sketch after this procedure).

Tip: More Details on chronyd Configuration Refer to https://documentation.suse.com/sles/15-SP1/html/SLES-all/cha-ntp.html to find more details about how to include time sources in the chronyd configuration.

3. Disable and stop the ntpd service:

root@minion > systemctl disable ntpd.service && systemctl stop ntpd.service

4. Start and enable the chronyd service:

root@minion > systemctl start chronyd.service && systemctl enable chronyd.service

5. Verify the status of chronyd :

root@minion > chronyc tracking
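Step 2 above asks you to add NTP sources to /etc/chrony.conf. A minimal sketch of such lines follows; the server names are hypothetical and should be replaced with the sources from your existing /etc/ntp.conf:

server ntp1.example.com iburst
server ntp2.example.com iburst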

PROCEDURE 6.2: MIGRATE TO chronyd AFTER THE CLUSTER UPGRADE

1. During cluster upgrade, add the following software repositories:

SLE-Module-Legacy15-SP1-Pool

SLE-Module-Legacy15-SP1-Updates

2. Upgrade the cluster to version 6.

3. Edit the chronyd configuration file /etc/chrony.conf and add NTP sources from the current ntpd configuration in /etc/ntp.conf .

Tip: More Details on chronyd Configuration Refer to https://documentation.suse.com/sles/15-SP1/html/SLES-all/cha-ntp.html to find more details about how to include time sources in the chronyd configuration.

4. Disable and stop the ntpd service:

root@minion > systemctl disable ntpd.service && systemctl stop ntpd.service

5. Start and enable the chronyd service:

root@minion > systemctl start chronyd.service && systemctl enable chronyd.service

6. Migrate from ntpd to chronyd .

7. Verify the status of chronyd :

root@minion > chronyc tracking

8. Remove the legacy software repositories that you added to keep ntpd in the system during the upgrade process.

6.2.10 Patch Cluster Prior to Upgrade

Apply the latest patches to all cluster nodes prior to upgrade.

6.2.10.1 Required Software Repositories

Check that required repositories are configured on each host of the cluster. To list all available repositories, run:

root@minion > zypper lr

Important: Remove SUSE Enterprise Storage 5.5 LTSS Repositories Upgrades will fail if LTSS repositories are configured in SUSE Enterprise Storage 5.5. Find their IDs and remove them from the system. For example:

root # zypper lr
[...]
12 | SUSE_Linux_Enterprise_Server_LTSS_12_SP3_x86_64:SLES12-SP3-LTSS-Debuginfo-Updates
13 | SUSE_Linux_Enterprise_Server_LTSS_12_SP3_x86_64:SLES12-SP3-LTSS-Updates
[...]

root # zypper rr 12 13

Tip: Upgrade Without Using SCC, SMT, or RMT If your nodes are not subscribed to one of the supported software channel providers that handle automatic channel adjustment—such as SMT, RMT, or SCC—you may need to enable additional software modules and channels.

SUSE Enterprise Storage 5.5 requires:

SLES12-SP3-Installer-Updates

SLES12-SP3-Pool

SLES12-SP3-Updates

SUSE-Enterprise-Storage-5-Pool

SUSE-Enterprise-Storage-5-Updates

NFS/SMB Gateway on SLE-HA on SUSE Linux Enterprise Server 12 SP3 requires:

SLE-HA12-SP3-Pool

SLE-HA12-SP3-Updates

6.2.10.2 Repository Staging Systems

If you are using one of the repository staging systems—SMT or RMT—create a new frozen patch level for the current and the new SUSE Enterprise Storage version. Find more information in:

https://documentation.suse.com/sles/12-SP5/single-html/SLES-smt/#book-smt ,

https://documentation.suse.com/sles/15-SP1/single-html/SLES-rmt/#book-rmt .

https://documentation.suse.com/suma/3.2/ ,

72 Patch Cluster Prior to Upgrade SES 6 6.2.10.3 Patch the Whole Cluster to the Latest Patches

1. Apply the latest patches of SUSE Enterprise Storage 5.5 and SUSE Linux Enterprise Server 12 SP3 to each Ceph cluster node. Verify that correct software repositories are connected to each cluster node (see Section 6.2.10.1, “Required Software Repositories”) and run DeepSea stage 0:

root@master # salt-run state.orch ceph.stage.0

2. After stage 0 completes, verify that each cluster node's status includes 'HEALTH_OK'. If not, resolve the problem before any possible reboots in the next steps.

3. Run zypper ps to check for processes that may still be running with outdated libraries or binaries, and reboot if there are any.

4. Verify that the running kernel is the latest available, and reboot if not. Check outputs of the following commands:

cephadm@adm > uname -a
cephadm@adm > rpm -qa kernel-default

5. Verify that the ceph package is version 12.2.12 or newer. Verify that the deepsea package is version 0.8.9 or newer.
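As a sketch of how to check the installed versions, you can query the ceph package on all minions from the Salt master and the deepsea package on the master itself:

root@master # salt '*' pkg.version ceph
root@master # rpm -q deepsea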

6. If you previously used any of the bluestore_cache settings, they are no longer effective from Ceph version 12.2.10 on. The new setting bluestore_cache_autotune , which is set to 'true' by default, disables manual cache sizing. To turn on the old behavior, you need to set bluestore_cache_autotune=false . Refer to Book “Administration Guide”, Chapter 25 “Ceph Cluster Configuration”, Section 25.2.1 “Automatic Cache Sizing” for details.

6.2.11 Verify the Current Environment

If the system has obvious problems, fix them before starting the upgrade. Upgrading never fixes existing system problems.

Check cluster performance. You can use commands such as rados bench , ceph tell osd.* bench , or iperf3 .

Verify access to gateways (such as iSCSI Gateway or Object Gateway) and RADOS Block Device.

Document specific parts of the system setup, such as network setup, partitioning, or installation details.

Use supportconfig to collect important system information and save it outside the cluster nodes. Find more information in https://documentation.suse.com/sles/12-SP5/single-html/SLES-admin/#sec-admsupport-supportconfig .

Ensure there is enough free disk space on each cluster node. Check free disk space with df -h . When needed, free up additional disk space by removing unneeded files/directories or removing obsolete OS snapshots. If there is not enough free disk space, do not continue with the upgrade until you have freed enough disk space.

6.2.12 Check the Cluster's State

Check the cluster's health before starting the upgrade procedure. Do not start the upgrade unless each cluster node reports 'HEALTH_OK'.

Verify that all services are running:

Salt master and Salt minion daemons.

Ceph Monitor and Ceph Manager daemons.

Metadata Server daemons.

Ceph OSD daemons.

Object Gateway daemons.

iSCSI Gateway daemons.

The following commands provide details of the cluster state and specific configuration:

ceph -s
Prints a brief summary of Ceph cluster health, running services, data usage, and I/O statistics. Verify that it reports 'HEALTH_OK' before starting the upgrade.

ceph health detail
Prints details if Ceph cluster health is not OK.

ceph versions
Prints versions of running Ceph daemons.

ceph df
Prints total and free disk space on the cluster. Do not start the upgrade if the cluster's free disk space is less than 25% of the total disk space.

salt '*' cephprocesses.check results=true
Prints running Ceph processes and their PIDs sorted by Salt minions.

ceph osd dump | grep ^flags
Verify that the 'recovery_deletes' and 'purged_snapdirs' flags are present. If not, you can force a scrub on all placement groups by running the following command. Be aware that this forced scrub may possibly have a negative impact on your Ceph clients' performance.

cephadm@adm > ceph pg dump pgs_brief | cut -d " " -f 1 | xargs -n1 ceph pg scrub

6.2.13 Migrate OSDs to BlueStore

OSD BlueStore is a new back-end for the OSD daemons. It is the default option since SUSE Enterprise Storage 5. Compared to FileStore, which stores objects as files in an XFS file system, BlueStore can deliver increased performance because it stores objects directly on the underlying block device. BlueStore also enables other features, such as built-in compression and EC overwrites, that are unavailable with FileStore.

Specifically for BlueStore, an OSD has a 'wal' (Write Ahead Log) device and a 'db' (RocksDB database) device. The RocksDB database holds the metadata for a BlueStore OSD. These two devices will reside on the same device as an OSD by default, but either can be placed on different, for example faster, media.

In SUSE Enterprise Storage 5, both FileStore and BlueStore are supported and it is possible for FileStore and BlueStore OSDs to co-exist in a single cluster. During the SUSE Enterprise Storage upgrade procedure, FileStore OSDs are not automatically converted to BlueStore.

Warning Migration to BlueStore needs to be completed on all OSD nodes before the cluster upgrade because FileStore OSDs are not supported in SES 6.

Before converting to BlueStore, the OSDs need to be running SUSE Enterprise Storage 5. The conversion is a slow process as all data gets re-written twice. Though the migration process can take a long time to complete, there is no cluster outage and all clients can continue accessing the cluster during this period. However, do expect lower performance for the duration of the migration. This is caused by rebalancing and backfilling of cluster data. Use the following procedure to migrate FileStore OSDs to BlueStore:

Tip: Turn O Safety Measures Salt commands needed for running the migration are blocked by safety measures. In order to turn these precautions o, run the following command:

root@master # salt-run disengage.safety

Rebuild the nodes before continuing:

root@master # salt-run rebuild.node TARGET

You can also choose to rebuild each node individually. For example:

root@master # salt-run rebuild.node data1.ceph

The rebuild.node always removes and recreates all OSDs on the node.

Important If one OSD fails to convert, re-running the rebuild destroys the already-converted BlueStore OSDs. Instead of re-running the rebuild, you can run:

root@master # salt-run disks.deploy TARGET

After the migration to BlueStore, the object count will remain the same and disk usage will be nearly the same.

6.3 Order in Which Nodes Must Be Upgraded

Certain types of daemons depend upon others. For example, Ceph Object Gateways depend upon Ceph MON and OSD daemons. We recommend upgrading in this order:

1. Admin Node

2. Ceph Monitors/Ceph Managers

3. Metadata Servers

4. Ceph OSDs

5. Object Gateways

6. iSCSI Gateways

7. NFS Ganesha

8. Samba Gateways

6.4 Oine Upgrade of CTDB Clusters

CTDB provides a clustered database used by Samba Gateways. The CTDB protocol does not support clusters of nodes communicating with different protocol versions. Therefore, CTDB nodes need to be taken offline prior to performing a SUSE Enterprise Storage upgrade. CTDB refuses to start if it is running alongside an incompatible version. For example, if you start a SUSE Enterprise Storage 6 CTDB version while SUSE Enterprise Storage 5.5 CTDB versions are running, it will fail. To take CTDB offline, stop the SLE-HA cloned CTDB resource. For example:

root@master # crm resource stop cl-ctdb

This will stop the resource across all gateway nodes (assigned to the cloned resource). Verify all the services are stopped by running the following command:

root@master # crm status
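Optionally, you can also confirm the installed CTDB package version on each gateway node before and after the package upgrade. For example, assuming the Samba Gateway minions match the 'smb*' target (adjust the target to your naming scheme):

root@master # salt 'smb*' cmd.run 'rpm -q ctdb'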

Note Ensure CTDB is taken offline prior to the SUSE Enterprise Storage 5.5 to SUSE Enterprise Storage 6 upgrade of the CTDB and Samba Gateway packages. SLE-HA may also specify requirements for the upgrade of the underlying pacemaker/Linux-HA cluster; these should be tracked separately.

The SLE-HA cloned CTDB resource can be restarted once the new packages have been installed on all Samba Gateway nodes and the underlying pacemaker/Linux-HA cluster is up. To restart the CTDB resource run the following command:

root@master # crm resource start cl-ctdb

6.5 Per-Node Upgrade Instructions

To ensure the core cluster services are available during the upgrade, you need to upgrade the cluster nodes sequentially, one by one. There are two ways you can perform the upgrade of a node: either using the installer DVD or using the distribution migration system. After upgrading each node, we recommend running rpmconfigcheck to check for any updated configuration files that have been edited locally. If the command returns a list of file names with a suffix .rpmnew , .rpmorig , or .rpmsave , compare these files against the current configuration files to ensure that no local changes have been lost. If necessary, update the affected files. For more information on working with .rpmnew , .rpmorig , and .rpmsave files, refer to https://documentation.suse.com/sles/15-SP1/single-html/SLES-admin/#sec-rpm-packages-manage .
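For example, the following commands print the configuration files that need reviewing on an upgraded node; the diff invocation uses a hypothetical file name only to illustrate comparing such a pair of files:

root@minion > rpmconfigcheck
root@minion > find /etc -name '*.rpmnew' -o -name '*.rpmorig' -o -name '*.rpmsave'
root@minion > diff -u /etc/example.conf /etc/example.conf.rpmnew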

Tip: Orphaned Packages After a node is upgraded, a number of packages will be in an 'orphaned' state without a parent repository. This happens because python3-related packages do not make the python2 packages obsolete. Find more information about listing orphaned packages in https://documentation.suse.com/sles/12-SP5/single-html/SLES-admin/#sec-zypper-softup-orphaned .
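For example, to list the orphaned packages on an upgraded node, you can run:

root@minion > zypper packages --orphaned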

6.5.1 Manual Node Upgrade Using the Installer DVD

1. Reboot the node from the SUSE Linux Enterprise Server 15 SP1 installer DVD/image.

2. On the YaST command line, add the option YAST_ACTIVATE_LUKS=0 . This option ensures that the system does not ask for a password for encrypted disks.

Warning This must not be enabled by default as it would break full-disk encryption on the system disk or part of the system disk. This parameter only works if it is provided by the installer. If not provided, you will be prompted for an encryption password for each individual disk partition. This is only supported since the 3rd Quarterly Update of SLES 15 SP1. You need SLE-15-SP1-Installer-DVD-*-QU3-DVD1.iso media or newer.

3. Select Upgrade from the boot menu.

4. On the Select the Migration Target screen, verify that 'SUSE Linux Enterprise Server 15 SP1' is selected and activate the Manually Adjust the Repositories for Migration check box.

FIGURE 6.1: SELECT THE MIGRATION TARGET

5. Select the following modules to install:

SUSE Enterprise Storage 6 x86_64

Basesystem Module 15 SP1 x86_64

Desktop Applications Module 15 SP1 x86_64

Legacy Module 15 SP1 x86_64

Server Applications Module 15 SP1 x86_64

6. On the Previously Used Repositories screen, verify that the correct repositories are selected. If the system is not registered with SCC/SMT, you need to add the repositories manually. SUSE Enterprise Storage 6 requires:

SLE-Module-Basesystem15-SP1-Pool

SLE-Module-Basesystem15-SP1-Updates

SLE-Module-Server-Applications15-SP1-Pool

SLE-Module-Server-Applications15-SP1-Updates

SLE-Module-Desktop-Applications15-SP1-Pool

SLE-Module-Desktop-Applications15-SP1-Updates

SLE-Product-SLES15-SP1-Pool

SLE-Product-SLES15-SP1-Updates

SLE15-SP1-Installer-Updates

SUSE-Enterprise-Storage-6-Pool

SUSE-Enterprise-Storage-6-Updates

If you intend to migrate ntpd to chronyd after SES migration (refer to Section 6.2.9, “Migrate from ntpd to chronyd”), include the following repositories:

SLE-Module-Legacy15-SP1-Pool

SLE-Module-Legacy15-SP1-Updates

NFS/SMB Gateway on SLE-HA on SUSE Linux Enterprise Server 15 SP1 requires:

SLE-Product-HA15-SP1-Pool

SLE-Product-HA15-SP1-Updates

7. Review the Installation Settings and start the installation procedure by clicking Update.

6.5.2 Node Upgrade Using the SUSE Distribution Migration System

The Distribution Migration System (DMS) provides an upgrade path for an installed SUSE Linux Enterprise system from one major version to another. The following procedure utilizes DMS to upgrade SUSE Enterprise Storage 5.5 to version 6, including the underlying SUSE Linux Enterprise Server 12 SP3 to SUSE Linux Enterprise Server 15 SP1 migration. Refer to https://documentation.suse.com/suse-distribution-migration-system/1.0/single-html/distribution-migration-system/ to find both general and detailed information about DMS.

6.5.2.1 Before You Begin

Before starting the upgrade process, check whether the sles-ltss-release or sles-ltss-release-POOL packages are installed on any node of the cluster:

root@minion > rpm -q sles-ltss-release
root@minion > rpm -q sles-ltss-release-POOL

If either or both are installed, remove them:

root@minion > zypper rm -y sles-ltss-release sles-ltss-release-POOL

Important This must be done on all nodes of the cluster before proceeding.

Note Ensure you also follow the Section 6.2.12, “Check the Cluster's State” guidelines. The upgrade must not be started until all nodes are fully patched. See Section 6.2.10.3, “Patch the Whole Cluster to the Latest Patches” for more information.

6.5.2.2 Upgrading Nodes

1. Install the migration RPM packages. They adjust the GRUB boot loader to automatically trigger the upgrade on next reboot. Install the SLES15-SES-Migration and suse-migration-sle15-activation packages:

root@minion > zypper install SLES15-SES-Migration suse-migration-sle15-activation

2. a. If the node being upgraded is registered with a repository staging system such as SCC, SMT, RMT, or SUSE Manager, create the /etc/sle-migration-service.yml with the following content:

use_zypper_migration: true
preserve:
  rules:
    - /etc/udev/rules.d/70-persistent-net.rules

b. If the node being upgraded is not registered with a repository staging system such as SCC, SMT, RMT, or SUSE Manager, perform the following changes:

i. Create the /etc/sle-migration-service.yml with the following content:

use_zypper_migration: false
preserve:
  rules:
    - /etc/udev/rules.d/70-persistent-net.rules

ii. Disable or remove the SLE 12 SP3 and SES 5 repositories, and add the SLE 15 SP1 and SES 6 repositories. Find the list of related repositories in Section 6.2.10.1, “Required Software Repositories”.

3. Reboot to start the upgrade. While the upgrade is running, you can log in to the upgraded node via ssh as the migration user using the existing SSH key from the host system, as described in https://documentation.suse.com/suse-distribution-migration-system/1.0/single-html/distribution-migration-system/ . For SUSE Enterprise Storage, if you have physical access or direct console access to the machine, you can also log in as root on the system console using the password sesupgrade . The node will reboot automatically after the upgrade.

Tip: Upgrade Failure If the upgrade fails, inspect /var/log/distro_migration.log . Fix the problem, re-install the migration RPM packages, and reboot the node.

6.6 Upgrade the Admin Node

The following commands will still work, although Salt minions are running old versions of Ceph and Salt: salt '*' test.ping and ceph status

After the upgrade of the Admin Node, openATTIC will no longer be installed.

If the Admin Node hosted SMT, complete its migration to RMT (refer to https://documentation.suse.com/sles/15-SP1/single-html/SLES-rmt/#cha-rmt-migrate ).

Use the procedure described in Section 6.5, “Per-Node Upgrade Instructions”.

Tip: Status of Cluster Nodes After the Admin Node is upgraded, you can run the salt-run upgrade.status command to view useful information about cluster nodes. The command lists the Ceph and OS versions of all nodes, and recommends the order in which to upgrade any nodes that are still running old versions.

root@master # salt-run upgrade.status
The newest installed software versions are:
 ceph: ceph version 14.2.1-468-g994fd9e0cc (994fd9e0ccc50c2f3a55a3b7a3d4e0ba74786d50) nautilus (stable)
 os: SUSE Linux Enterprise Server 15 SP1

Nodes running these software versions:
 admin.ceph (assigned roles: master)
 mon2.ceph (assigned roles: admin, mon, mgr)

Nodes running older software versions must be upgraded in the following order:
 1: mon1.ceph (assigned roles: admin, mon, mgr)
 2: mon3.ceph (assigned roles: admin, mon, mgr)
 3: data1.ceph (assigned roles: storage)
[...]

6.7 Upgrade Ceph Monitor/Ceph Manager Nodes

If your cluster does not use MDS roles, upgrade MON/MGR nodes one by one.

If your cluster uses MDS roles, and MON/MGR and MDS roles are co-located, you need to shrink the MDS cluster and then upgrade the co-located nodes. Refer to Section 6.8, “Upgrade Metadata Servers” for more details.

If your cluster uses MDS roles and they run on dedicated servers, upgrade all MON/MGR nodes one by one, then shrink the MDS cluster and upgrade it. Refer to Section 6.8, “Upgrade Metadata Servers” for more details.

Note: Ceph Monitor Upgrade Due to a limitation in the Ceph Monitor design, once two MONs have been upgraded to SUSE Enterprise Storage 6 and have formed a quorum, the third MON (while still on SUSE Enterprise Storage 5.5) will not rejoin the MON cluster if it is restarted for any reason, including a node reboot. Therefore, when two MONs have been upgraded, it is best to upgrade the rest as soon as possible.

Use the procedure described in Section 6.5, “Per-Node Upgrade Instructions”.

6.8 Upgrade Metadata Servers

You need to shrink the Metadata Server (MDS) cluster. Because of incompatible features between the SUSE Enterprise Storage 5.5 and 6 versions, the older MDS daemons will shut down as soon as they see a single SES 6 level MDS join the cluster. Therefore it is necessary to shrink the MDS cluster to a single active MDS (and no standbys) for the duration of the MDS node upgrades. As soon as the second node is upgraded, you can extend the MDS cluster again.

Tip On a heavily loaded MDS cluster, you may need to reduce the load (for example by stopping clients) so that a single active MDS is able to handle the workload.

1. Note the current value of the max_mds option:

cephadm@adm > ceph fs get cephfs | grep max_mds

2. Shrink the MDS cluster if you have more than one active MDS daemon, that is, if max_mds is > 1. To shrink the MDS cluster, run

cephadm@adm > ceph fs set FS_NAME max_mds 1

where FS_NAME is the name of your CephFS instance ('cephfs' by default).

3. Find the node hosting one of the standby MDS daemons. Consult the output of the ceph fs status command and start the upgrade of the MDS cluster on this node.

cephadm@adm > ceph fs status
cephfs - 2 clients
======
+------+--------+--------+---------------+-------+-------+
| Rank | State  |  MDS   |    Activity   |  dns  |  inos |
+------+--------+--------+---------------+-------+-------+
|  0   | active | mon1-6 | Reqs:    0 /s |   13  |   16  |
+------+--------+--------+---------------+-------+-------+
+-----------------+----------+-------+-------+
|       Pool      |   type   |  used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata | 2688k | 96.8G |
|   cephfs_data   |   data   |    0  | 96.8G |
+-----------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
|    mon3-6   |
|    mon2-6   |
+-------------+

In this example, you need to start the upgrade procedure either on node 'mon3-6' or 'mon2-6'.

4. Upgrade the node with the standby MDS daemon. After the upgraded MDS node starts, the outdated MDS daemons will shut down automatically. At this point, clients may experience a short downtime of the CephFS service. Use the procedure described in Section 6.5, “Per-Node Upgrade Instructions”.

5. Upgrade the remaining MDS nodes.

6. Reset max_mds to the desired configuration:

cephadm@adm > ceph fs set FS_NAME max_mds ACTIVE_MDS_COUNT

6.9 Upgrade Ceph OSDs

For each storage node, follow these steps:

1. Identify which OSD daemons are running on a particular node:

cephadm@adm > ceph osd tree

2. Set the noout ag for each OSD daemon on the node that is being upgraded:

cephadm@adm > ceph osd add-noout osd.OSD_ID

For example:

cephadm@adm > for i in $(ceph osd ls-tree OSD_NODE_NAME);do echo "osd: $i"; ceph osd add-noout osd.$i; done

Verify with:

cephadm@adm > ceph health detail | grep noout

or

cephadm@adm > ceph -s
  cluster:
    id:     44442296-033b-3275-a803-345337dc53da
    health: HEALTH_WARN
            6 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set

3. Create /etc/ceph/osd/*.json les for all existing OSDs by running the following command on the node that is going to be upgraded:

cephadm@osd > ceph-volume simple scan --force

4. Upgrade the OSD node. Use the procedure described in Section 6.5, “Per-Node Upgrade Instructions”.

5. Activate all OSDs found in the system:

cephadm@osd > ceph-volume simple activate --all

Tip: Activating Data Partitions Individually If you want to activate data partitions individually, you need to find the correct ceph-volume command for each partition to activate it. Replace X1 with the partition's correct letter/number:

cephadm@osd > ceph-volume simple scan /dev/sdX1

For example:

cephadm@osd > ceph-volume simple scan /dev/vdb1
[...]
--> OSD 8 got scanned and metadata persisted to file: /etc/ceph/osd/8-d7bd2685-5b92-4074-8161-30d146cd0290.json
--> To take over management of this scanned OSD, and disable ceph-disk and udev, run:
-->     ceph-volume simple activate 8 d7bd2685-5b92-4074-8161-30d146cd0290

The last line of the output contains the command to activate the partition:

cephadm@osd > ceph-volume simple activate 8 d7bd2685-5b92-4074-8161-30d146cd0290
[...]
--> All ceph-disk units have been disabled to prevent OSDs getting triggered by UDEV events
[...]
Running command: /bin/systemctl start ceph-osd@8
--> Successfully activated OSD 8 with FSID d7bd2685-5b92-4074-8161-30d146cd0290

6. Verify that the OSD node will start properly after the reboot.
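For example, you can check that the ceph-osd services on the node are active and that no OSDs are reported down (OSD_ID is a placeholder for one of the node's OSDs):

cephadm@osd > sudo systemctl status ceph-osd@OSD_ID
cephadm@adm > ceph osd tree | grep -w down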

7. Address the 'Legacy BlueStore stats reporting detected on XX OSD(s)' message:

cephadm@adm > ceph -s
  cluster:
    id:     44442296-033b-3275-a803-345337dc53da
    health: HEALTH_WARN
            Legacy BlueStore stats reporting detected on 6 OSD(s)

The warning is normal when upgrading Ceph to 14.2.2. You can disable it by setting:

bluestore_warn_on_legacy_statfs = false
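For example, you can set the option cluster-wide with the following command:

cephadm@adm > ceph config set global bluestore_warn_on_legacy_statfs false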

The proper fix is to run the following command on all OSDs while they are stopped:

cephadm@osd > ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-XXX

Following is a helper script that runs the ceph-bluestore-tool repair for all OSDs on the NODE_NAME node:

cephadm@adm > OSDNODE=OSD_NODE_NAME;\
 for OSD in $(ceph osd ls-tree $OSDNODE);\
 do echo "osd=" $OSD;\
   salt $OSDNODE* cmd.run "systemctl stop ceph-osd@$OSD";\
   salt $OSDNODE* cmd.run "ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-$OSD";\
   salt $OSDNODE* cmd.run "systemctl start ceph-osd@$OSD";\
 done

8. Unset the 'noout' ag for each OSD daemon on the node that is upgraded:

cephadm@adm > ceph osd rm-noout osd.OSD_ID

For example:

cephadm@adm > for i in $(ceph osd ls-tree OSD_NODE_NAME);do echo "osd: $i"; ceph osd rm-noout osd.$i; done

Verify with:

cephadm@adm > ceph health detail | grep noout

Note:

cephadm@adm > ceph -s
  cluster:
    id:     44442296-033b-3275-a803-345337dc53da
    health: HEALTH_WARN
            Legacy BlueStore stats reporting detected on 6 OSD(s)

9. Verify the cluster status. It will be similar to the following output:

cephadm@adm > ceph status
  cluster:
    id:     e0d53d64-6812-3dfe-8b72-fd454a6dcf12
    health: HEALTH_WARN
            3 monitors have not enabled msgr2

  services:
    mon: 3 daemons, quorum mon1,mon2,mon3 (age 2h)
    mgr: mon2(active, since 22m), standbys: mon1, mon3
    osd: 30 osds: 30 up, 30 in

  data:
    pools:   1 pools, 1024 pgs
    objects: 0 objects, 0 B
    usage:   31 GiB used, 566 GiB / 597 GiB avail
    pgs:     1024 active+clean

10. Once the last OSD node has been upgraded, issue the following command:

cephadm@adm > ceph osd require-osd-release nautilus

This disallows pre-SUSE Enterprise Storage 6 and Nautilus OSDs and enables all new SUSE Enterprise Storage 6 and Nautilus-only OSD functionality.

11. Enable the new v2 network protocol by issuing the following command:

cephadm@adm > ceph mon enable-msgr2

This instructs all monitors that bind to the old default port for the legacy v1 Messenger protocol (6789) to also bind to the new v2 protocol port (3300). To see if all monitors have been updated, run:

cephadm@adm > ceph mon dump

Verify that each monitor has both a v2: and v1: address listed.

12. Verify that all OSD nodes were rebooted and that OSDs started automatically after the reboot.

6.10 Upgrade Gateway Nodes

Upgrade gateway nodes in the following order:

1. Object Gateways

If the Object Gateways are fronted by a load balancer, then a rolling upgrade of the Object Gateways should be possible without an outage.

Validate that the Object Gateway daemons are running after each upgrade, and test with S3/Swift client.

Use the procedure described in Section 6.5, “Per-Node Upgrade Instructions”.

2. iSCSI Gateways

Important: Package Dependency Conflict During package dependency resolution, you need to resolve a dependency conflict caused by a patterns-ses-ceph_iscsi version mismatch.

FIGURE 6.2: DEPENDENCY CONFLICT RESOLUTION

From the four presented solutions, choose deinstalling the patterns-ses-ceph_iscsi pattern. This way you will keep the required lrbd package installed.

If iSCSI initiators are configured with multipath, then a rolling upgrade of the iSCSI Gateways should be possible without an outage.

Validate that the lrbd daemon is running after each upgrade, and test with initiator.

Use the procedure described in Section 6.5, “Per-Node Upgrade Instructions”.

3. NFS Ganesha. Use the procedure described in Section 6.5, “Per-Node Upgrade Instructions”.

4. Samba Gateways. Use the procedure described in Section 6.5, “Per-Node Upgrade Instructions”.

6.11 Steps to Take after the Last Node Has Been Upgraded

6.11.1 Update Ceph Monitor Setting

For each host that has been upgraded — OSD, MON, MGR, MDS, and Gateway nodes, as well as client hosts — update your ceph.conf file so that it either specifies no monitor port (if you are running the monitors on the default ports) or references both the v2 and v1 addresses and ports explicitly.
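For example, a ceph.conf that lists both protocols explicitly might contain a mon_host line similar to the following sketch. The addresses are examples only; 3300 is the v2 port and 6789 the v1 port:

mon_host = [v2:172.16.21.11:3300,v1:172.16.21.11:6789],[v2:172.16.21.12:3300,v1:172.16.21.12:6789],[v2:172.16.21.13:3300,v1:172.16.21.13:6789]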

Note Things will still work if only the v1 IP and port are listed, but each CLI instantiation or daemon will need to reconnect after learning that the monitors also speak the v2 protocol. This slows things down and prevents a full transition to the v2 protocol.

6.11.2 Disable Insecure Clients

Since Nautilus v14.2.20, a new health warning was introduced that informs you that insecure clients are allowed to join the cluster. This warning is on by default. The Ceph Dashboard will show the cluster in the HEALTH_WARN status and verifying the cluster status on the command line informs you as follows:

cephadm@adm > ceph status
  cluster:
    id:     3fe8b35a-689f-4970-819d-0e6b11f6707c
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim
[...]

This warning means that the Ceph Monitors are still allowing old, unpatched clients to connect to the cluster. This ensures existing clients can still connect while the cluster is being upgraded, but warns you that there is a problem that needs to be addressed. When the cluster and all clients are upgraded to the latest version of Ceph, disallow unpatched clients by running the following command:

cephadm@adm > ceph config set mon auth_allow_insecure_global_id_reclaim false
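To check the current value and verify that the warning is gone afterward, you can for example run:

cephadm@adm > ceph config dump | grep auth_allow_insecure_global_id_reclaim
cephadm@adm > ceph health detail | grep global_id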

6.11.3 Enable the Telemetry Module

Finally, consider enabling the Telemetry module to send anonymized usage statistics and crash information to the upstream Ceph developers. To see what would be reported (without actually sending any information to anyone):

cephadm@adm > ceph mgr module enable telemetry
cephadm@adm > ceph telemetry show

If you are comfortable with the high-level cluster metadata that will be reported, you can opt in to automatically report it:

cephadm@adm > ceph telemetry on

6.12 Update policy.cfg and Deploy Ceph Dashboard Using DeepSea

On the Admin Node, edit /srv/pillar/ceph/proposals/policy.cfg and apply the following changes:

Important: No New Services During the cluster upgrade, do not add new services to the policy.cfg file. Change the cluster architecture only after the upgrade is completed.

1. Remove role-openattic .

2. Add role-prometheus and role-grafana to the node that had Prometheus and Grafana installed, usually the Admin Node.

3. The role profile-PROFILE_NAME is now ignored. Add the new corresponding role-storage line. For example, for the existing

profile-default/cluster/*.sls

add

role-storage/cluster/*.sls

4. Synchronize all Salt modules:

root@master # salt '*' saltutil.sync_all

5. Update the Salt pillar by running DeepSea stage 1 and stage 2:

root@master # salt-run state.orch ceph.stage.1
root@master # salt-run state.orch ceph.stage.2

6. Clean up openATTIC:

root@master # salt OA_MINION state.apply ceph.rescind.openattic
root@master # salt OA_MINION state.apply ceph.remove.openattic

7. Unset the restart_igw grain to prevent stage 0 from restarting iSCSI Gateway, which is not installed yet:

root@master # salt '*' grains.delkey restart_igw

8. Finally, run through DeepSea stages 0-4:

root@master # salt-run state.orch ceph.stage.0
root@master # salt-run state.orch ceph.stage.1
root@master # salt-run state.orch ceph.stage.2
root@master # salt-run state.orch ceph.stage.3
root@master # salt-run state.orch ceph.stage.4

Tip: 'subvolume missing' Errors during Stage 3 DeepSea stage 3 may fail with an error similar to the following:

subvolume : ['/var/lib/ceph subvolume missing on 4510-2', \ '/var/lib/ceph subvolume missing on 4510-1', \ [...] 'See /srv/salt/ceph/subvolume/README.md']

In this case, you need to edit /srv/pillar/ceph/stack/global.yml and add the following line:

subvolume_init: disabled

Then refresh the Salt pillar and re-run DeepSea stage.3:

root@master # salt '*' saltutil.refresh_pillar
root@master # salt-run state.orch ceph.stage.3

After DeepSea successfully nished stage.3, the Ceph Dashboard will be running. Refer to Book “Administration Guide” for a detailed overview of Ceph Dashboard features. To list nodes running dashboard, run:

cephadm@adm > ceph mgr services | grep dashboard

To list admin credentials, run:

root@master # salt-call grains.get dashboard_creds

9. Sequentially restart the Object Gateway services to use the 'beast' Web server instead of the outdated 'civetweb':

root@master # salt-run state.orch ceph.restart.rgw.force

10. Before you continue, we strongly recommend enabling the Ceph telemetry module. For more information, see Book “Administration Guide”, Chapter 21 “Ceph Manager Modules”, Section 21.2 “Telemetry Module” for information and instructions.

6.13 Migration from Profile-based Deployments to DriveGroups

In SUSE Enterprise Storage 5.5, DeepSea oered so called 'proles' to describe the layout of your OSDs. Starting with SUSE Enterprise Storage 6, we moved to a dierent approach called DriveGroups (nd more details in Section 5.5.2, “DriveGroups”).

Note Migrating to the new approach is not immediately mandatory. Destructive operations, such as salt-run osd.remove , salt-run osd.replace , or salt-run osd.purge are still available. However, adding new OSDs will require your action.

Because of the dierent approach of these implementations, we do not oer an automated migration path. However, we oer a variety of tools—Salt runners—to make the migration as simple as possible.

6.13.1 Analyze the Current Layout

To view information about the currently deployed OSDs, use the following command:

root@master # salt-run disks.discover

Alternatively, you can inspect the content of the files in the /srv/pillar/ceph/proposals/profile-*/ directories. They have a structure similar to the following:

ceph:
  storage:
    osds:
      /dev/disk/by-id/scsi-drive_name:
        format: bluestore
      /dev/disk/by-id/scsi-drive_name2:
        format: bluestore

6.13.2 Create DriveGroups Matching the Current Layout

Refer to Section 5.5.2.1, “Specification” for more details on the DriveGroups specification. The difference between a fresh deployment and an upgrade scenario is that the drives to be migrated are already 'used'. Because

root@master # salt-run disks.list

looks for unused disks only, use

root@master # salt-run disks.list include_unavailable=True

Adjust DriveGroups until you match your current setup. For a more visual representation of what will be happening, use the following command. Note that it has no output if there are no free disks:

root@master # salt-run disks.report bypass_pillar=True
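For example, a minimal DriveGroups specification that simply consumes all available data disks on all storage nodes could look like the following sketch. The file location is the DeepSea default described in Section 5.5.2, “DriveGroups”; the group name and the target are assumptions that you need to adapt to your layout:

root@master # cat > /srv/salt/ceph/configuration/files/drive_groups.yml << 'EOF'
drive_group_default:
  target: '*'
  data_devices:
    all: true
EOF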

If you veried that your DriveGroups are properly congured and want to apply the new approach, remove the les from the /srv/pillar/ceph/proposals/profile-PROFILE_NAME/ directory, remove the corresponding profile-PROFILE_NAME/cluster/*.sls lines from the /srv/pillar/ceph/proposals/policy.cfg le, and run DeepSea stage 2 to refresh the Salt pillar.

root@master # salt-run state.orch ceph.stage.2

Verify the result by running the following commands:

root@master # salt target_node pillar.get ceph:storage root@master # salt-run disks.report

Warning: Incorrect DriveGroups Configuration If your DriveGroups are not properly configured and there are spare disks in your setup, they will be deployed in the way you specified them. We recommend running:

root@master # salt-run disks.report

6.13.3 OSD Deployment

As of the Ceph Mimic release, the ceph-disk tool is deprecated, and as of the Ceph Nautilus release (SES 6) it is no longer shipped upstream. ceph-disk is still supported in SUSE Enterprise Storage 6. Any pre-deployed ceph-disk OSDs will continue to function normally. However, when a disk breaks, there is no migration path: the disk will need to be re-deployed. For completeness, consider migrating the OSDs on the whole node. There are two paths for SUSE Enterprise Storage 6 users:

Keep OSDs deployed with ceph-disk : The ceph-volume simple command provides a way to take over the management while disabling the ceph-disk triggers.

Re-deploy existing OSDs with ceph-volume . For more information on replacing your OSDs, see Book “Administration Guide”, Chapter 2 “Salt Cluster Administration”, Section 2.8 “Replacing an OSD Disk”.

Tip: Migrate to LVM Format Whenever a single legacy OSD needs to be replaced on a node, all OSDs that share devices with it need to be migrated to the LVM-based format.

6.13.4 More Complex Setups

If you have a more sophisticated setup than just stand-alone OSDs, for example dedicated WAL/DBs or encrypted OSDs, the migration can only happen when all OSDs assigned to that WAL/DB device are removed. This is because the ceph-volume command creates Logical Volumes on disks before deployment, which prevents the user from mixing partition-based deployments with LV-based deployments. In such cases, it is best to manually remove all OSDs that are assigned to a WAL/DB device and re-deploy them using the DriveGroups approach.
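A possible sequence, shown here only as a sketch, is to remove the affected OSDs with the runners mentioned above and then re-deploy them from the DriveGroups specification (OSD_ID and TARGET are placeholders):

root@master # salt-run osd.remove OSD_ID
root@master # salt-run disks.deploy TARGET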

7 Customizing the Default Configuration

You can change the default cluster configuration generated in Stage 2 (refer to DeepSea Stages Description). For example, you may need to change network settings, or software that is installed on the Admin Node by default. You can perform the former by modifying the pillar updated after Stage 2, while the latter is usually done by creating a custom sls file and adding it to the pillar. Details are described in the following sections.

7.1 Using Customized Configuration Files

This section lists several tasks that require adding or changing your own sls files. Such a procedure is typically used when you need to change the default deployment process.

Tip: Prefix Custom .sls Files Your custom .sls files belong to the same subdirectory as DeepSea's .sls files. To prevent overwriting your .sls files with the possibly newly added ones from the DeepSea package, prefix their name with the custom- string.

7.1.1 Disabling a Deployment Step

If you address a specic task outside of the DeepSea deployment process and therefore need to skip it, create a 'no-operation' le following this example:

PROCEDURE 7.1: DISABLING TIME SYNCHRONIZATION

1. Create /srv/salt/ceph/time/disabled.sls with the following content and save it:

disable time setting:
  test.nop

2. Edit /srv/pillar/ceph/stack/global.yml , add the following line, and save it:

time_init: disabled

3. Verify by refreshing the pillar and running the step:

root@master # salt target saltutil.pillar_refresh

root@master # salt 'admin.ceph' state.apply ceph.time
admin.ceph:
  Name: disable time setting - Function: test.nop - Result: Clean

Summary for admin.ceph
------------
Succeeded: 1
Failed:    0
------------
Total states run: 1

Note: Unique ID The task ID 'disable time setting' may be any message unique within an sls le. Prevent ID collisions by specifying unique descriptions.

7.1.2 Replacing a Deployment Step

If you need to replace the default behavior of a specific step with a custom one, create a custom sls file with replacement content. By default, /srv/salt/ceph/pool/default.sls creates an rbd image called 'demo'. In our example, we do not want this image to be created, but we need two images: 'archive1' and 'archive2'.

PROCEDURE 7.2: REPLACING THE DEMO RBD IMAGE WITH TWO CUSTOM RBD IMAGES

1. Create /srv/salt/ceph/pool/custom.sls with the following content and save it:

wait:
  module.run:
    - name: wait.out
    - kwargs:
        'status': "HEALTH_ERR" 1
    - fire_event: True

archive1:
  cmd.run:
    - name: "rbd -p rbd create archive1 --size=1024" 2
    - unless: "rbd -p rbd ls | grep -q archive1$"
    - fire_event: True

archive2:
  cmd.run:
    - name: "rbd -p rbd create archive2 --size=768"
    - unless: "rbd -p rbd ls | grep -q archive2$"
    - fire_event: True

1 The wait module will pause until the Ceph cluster does not have a status of HEALTH_ERR . In fresh installations, a Ceph cluster may have this status until a sufficient number of OSDs become available and the creation of pools has completed.

2 The rbd command is not idempotent. If the same creation command is re-run after the image exists, the Salt state will fail. The unless statement prevents this.

2. To call the newly created custom le instead of the default, you need to edit /srv/pillar/ ceph/stack/ceph/cluster.yml , add the following line, and save it:

pool_init: custom

3. Verify by refreshing the pillar and running the step:

root@master # salt target saltutil.pillar_refresh root@master # salt 'admin.ceph' state.apply ceph.pool

Note: Authorization The creation of pools or images requires sufficient authorization. The admin.ceph minion has an admin keyring.

Tip: Alternative Way Another option is to change the variable in /srv/pillar/ceph/stack/ceph/roles/master.yml instead. Using this file will reduce the clutter of pillar data for other minions.

7.1.3 Modifying a Deployment Step

Sometimes you may need a specic step to do some additional tasks. We do not recommend modifying the related state le as it may complicate a future upgrade. Instead, create a separate le to carry out the additional tasks identical to what was described in Section 7.1.2, “Replacing a Deployment Step”. Name the new sls le descriptively. For example, if you need to create two rbd images in addition to the demo image, name the le archive.sls .

100 Modifying a Deployment Step SES 6 PROCEDURE 7.3: CREATING TWO ADDITIONAL RBD IMAGES

1. Create /srv/salt/ceph/pool/custom.sls with the following content and save it:

include:
  - .archive
  - .default

Tip: Include Precedence In this example, Salt will create the archive images and then create the demo image. The order does not matter in this example. To change the order, reverse the lines after the include: directive. You can add the include line directly to archive.sls and all the images will get created as well. However, regardless of where the include line is placed, Salt processes the steps in the included file first. Although this behavior can be overridden with requires and order statements, a separate file that includes the others guarantees the order and reduces the chances of confusion.

2. Edit /srv/pillar/ceph/stack/ceph/cluster.yml , add the following line, and save it:

pool_init: custom

3. Verify by refreshing the pillar and running the step:

root@master # salt target saltutil.pillar_refresh root@master # salt 'admin.ceph' state.apply ceph.pool

7.1.4 Modifying a Deployment Stage

If you need to add a completely separate deployment step, create three new files: an sls file that performs the command, an orchestration file, and a custom file which aligns the new step with the original deployment steps. For example, if you need to run logrotate on all minions as part of the preparation stage: First create an sls file and include the logrotate command.

PROCEDURE 7.4: RUNNING logrotate ON ALL SALT MINIONS

1. Create a directory such as /srv/salt/ceph/logrotate .

2. Create /srv/salt/ceph/logrotate/init.sls with the following content and save it:

rotate logs:
  cmd.run:
    - name: "/usr/sbin/logrotate /etc/logrotate.conf"

3. Verify that the command works on a minion:

root@master # salt 'admin.ceph' state.apply ceph.logrotate

Because the orchestration le needs to run before all other preparation steps, add it to the Prep stage 0:

1. Create /srv/salt/ceph/stage/prep/logrotate.sls with the following content and save it:

logrotate:
  salt.state:
    - tgt: '*'
    - sls: ceph.logrotate

2. Verify that the orchestration le works:

root@master # salt-run state.orch ceph.stage.prep.logrotate

The last le is the custom one which includes the additional step with the original steps:

1. Create /srv/salt/ceph/stage/prep/custom.sls with the following content and save it:

include:
  - .logrotate
  - .master
  - .minion

2. Override the default behavior. Edit /srv/pillar/ceph/stack/global.yml , add the following line, and save the file:

stage_prep: custom

3. Verify that Stage 0 works:

root@master # salt-run state.orch ceph.stage.0

Note: Why global.yml? The global.yml file is chosen over cluster.yml because during the prep stage, no minion belongs to the Ceph cluster yet, and therefore no minion has access to any settings in cluster.yml .

7.1.5 Updates and Reboots during Stage 0

During stage 0 (refer to DeepSea Stages Description for more information on DeepSea stages), the Salt master and Salt minions may optionally reboot because newly updated packages, for example kernel , require rebooting the system. The default behavior is to install available new updates and not reboot the nodes, even in the case of kernel updates.

You can change the default update/reboot behavior of DeepSea stage 0 by adding or changing the stage_prep_master and stage_prep_minion options in the /srv/pillar/ceph/stack/global.yml file. stage_prep_master sets the behavior of the Salt master, and stage_prep_minion sets the behavior of all minions. All available parameters are:

default
Install updates without rebooting.

default-update-reboot
Install updates and reboot after updating.

default-no-update-reboot
Reboot without installing updates.

default-no-update-no-reboot
Do not install updates or reboot.

For example, to prevent the cluster nodes from installing updates and rebooting, edit /srv/pillar/ceph/stack/global.yml and add the following lines:

stage_prep_master: default-no-update-no-reboot
stage_prep_minion: default-no-update-no-reboot

Tip: Values and Corresponding Files The values of stage_prep_master correspond to file names located in /srv/salt/ceph/stage/0/master , while values of stage_prep_minion correspond to files in /srv/salt/ceph/stage/0/minion :

root@master # ls -l /srv/salt/ceph/stage/0/master
default-no-update-no-reboot.sls
default-no-update-reboot.sls
default-update-reboot.sls
[...]

root@master # ls -l /srv/salt/ceph/stage/0/minion
default-no-update-no-reboot.sls
default-no-update-reboot.sls
default-update-reboot.sls
[...]

7.2 Modifying Discovered Configuration

After you completed Stage 2, you may want to change the discovered configuration. To view the current settings, run:

root@master # salt target pillar.items

The output of the default conguration for a single minion is usually similar to the following:

----------
available_roles:
    - admin
    - mon
    - storage
    - mds
    - igw
    - rgw
    - client-cephfs
    - client-radosgw
    - client-iscsi
    - mds-nfs
    - rgw-nfs
    - master
cluster:
    ceph
cluster_network:
    172.16.22.0/24
fsid:
    e08ec63c-8268-3f04-bcdb-614921e94342
master_minion:
    admin.ceph
mon_host:
    - 172.16.21.13
    - 172.16.21.11
    - 172.16.21.12
mon_initial_members:
    - mon3
    - mon1
    - mon2
public_address:
    172.16.21.11
public_network:
    172.16.21.0/24
roles:
    - admin
    - mon
    - mds
time_server:
    admin.ceph
time_service:
    ntp

The above-mentioned settings are distributed across several configuration files. The directory structure with these files is defined in the /srv/pillar/ceph/stack/stack.cfg file. The following files usually describe your cluster:

/srv/pillar/ceph/stack/global.yml - the file affects all minions in the Salt cluster.

/srv/pillar/ceph/stack/ceph/cluster.yml - the file affects all minions in the Ceph cluster called ceph .

/srv/pillar/ceph/stack/ceph/roles/role.yml - affects all minions that are assigned the specific role in the ceph cluster.

/srv/pillar/ceph/stack/ceph/minions/MINION_ID.yml - affects the individual minion.

Note: Overwriting Directories with Default Values There is a parallel directory tree that stores the default configuration setup in /srv/pillar/ceph/stack/default . Do not change values here, as they are overwritten.

The typical procedure for changing the collected configuration is the following:

1. Find the location of the conguration item you need to change. For example, if you need to change cluster related setting such as cluster network, edit the le /srv/pillar/ceph/ stack/ceph/cluster.yml .

2. Save the le.

3. Verify the changes by running:

root@master # salt target saltutil.pillar_refresh

and then

root@master # salt target pillar.items

7.2.1 Enabling IPv6 for Ceph Cluster Deployment

Since IPv4 network addressing is prevalent, you need to enable IPv6 as a customization. DeepSea has no auto-discovery of IPv6 addressing. To configure IPv6, set the public_network and cluster_network variables in the /srv/pillar/ceph/stack/global.yml file to valid IPv6 subnets. For example:

public_network: fd00:10::/64
cluster_network: fd00:11::/64

Then run DeepSea stage 2 and verify that the network information matches the setting. Stage 3 will generate the ceph.conf with the necessary flags.

Important: No Support for Dual Stack Ceph does not support dual stack; running Ceph simultaneously on IPv4 and IPv6 is not possible. DeepSea validation will reject a mismatch between public_network and cluster_network , or within either variable. The following example will fail the validation.

public_network: "192.168.10.0/24 fd00:10::/64"

Tip: Avoid Using fe80::/10 link-local Addresses Avoid using fe80::/10 link-local addresses. All network interfaces have an assigned fe80 address and require an interface qualifier for proper routing. Either assign IPv6 addresses allocated to your site or consider using fd00::/8 . These are part of ULA and not globally routable.

III Installation of Additional Services

8 Installation of Services to Access your Data

9 Ceph Object Gateway

10 Installation of iSCSI Gateway

11 Installation of CephFS

12 Installation of NFS Ganesha

8 Installation of Services to Access your Data

After you deploy your SUSE Enterprise Storage 6 cluster, you may need to install additional software for accessing your data, such as the Object Gateway or the iSCSI Gateway, or you can deploy a clustered file system on top of the Ceph cluster. This chapter mainly focuses on manual installation. If you have a cluster deployed using Salt, refer to Chapter 5, Deploying with DeepSea/Salt for a procedure on installing particular gateways or the CephFS.

9 Ceph Object Gateway

Ceph Object Gateway is an object storage interface built on top of librados to provide applications with a RESTful gateway to Ceph clusters. It supports two interfaces:

S3-compatible: Provides object storage functionality with an interface that is compatible with a large subset of the Amazon S3 RESTful API.

Swift-compatible: Provides object storage functionality with an interface that is compatible with a large subset of the OpenStack Swift API.

The Object Gateway daemon uses 'Beast' HTTP front-end by default. It uses the Boost.Beast library for HTTP parsing and the Boost.Asio library for asynchronous network I/O operations. Because Object Gateway provides interfaces compatible with OpenStack Swift and Amazon S3, the Object Gateway has its own user management. Object Gateway can store data in the same cluster that is used to store data from CephFS clients or RADOS Block Device clients. The S3 and Swift APIs share a common name space, so you may write data with one API and retrieve it with the other.

Important: Object Gateway Deployed by DeepSea Object Gateway is installed as a DeepSea role, therefore you do not need to install it manually. To install the Object Gateway during the cluster deployment, see Section 5.3, “Cluster Deployment”. To add a new node with Object Gateway to the cluster, see Book “Administration Guide”, Chapter 2 “Salt Cluster Administration”, Section 2.2 “Adding New Roles to Nodes”.

9.1 Object Gateway Manual Installation

1. Install Object Gateway on a node that is not using port 80. The following command installs all required components:

cephadm@ogw > sudo zypper ref && sudo zypper in ceph-radosgw

2. If the Apache server from the previous Object Gateway instance is running, stop it and disable the relevant service:

cephadm@ogw > sudo systemctl stop apache2.service
cephadm@ogw > sudo systemctl disable apache2.service

3. Edit /etc/ceph/ceph.conf and add the following lines:

[client.rgw.gateway_host] rgw frontends = "beast port=80"

Tip If you want to congure Object Gateway/Beast for use with SSL encryption, modify the line accordingly:

rgw frontends = beast ssl_port=7480 ssl_certificate=PATH_TO_CERTIFICATE.PEM

4. Restart the Object Gateway service.

cephadm@ogw > sudo systemctl restart [email protected]_host

9.1.1 Object Gateway Configuration

Several steps are required to congure an Object Gateway.

9.1.1.1 Basic Configuration

Conguring a Ceph Object Gateway requires a running Ceph Storage Cluster. The Ceph Object Gateway is a client of the Ceph Storage Cluster. As a Ceph Storage Cluster client, it requires:

A host name for the gateway instance, for example gateway .

A storage cluster user name with appropriate permissions and a keyring.

Pools to store its data.

A data directory for the gateway instance.

An instance entry in the Ceph conguration le.

Each instance must have a user name and key to communicate with a Ceph storage cluster. In the following steps, we use a monitor node to create a bootstrap keyring, then create the Object Gateway instance user keyring based on the bootstrap one. Then, we create a client user name and key. Next, we add the key to the Ceph Storage Cluster. Finally, we distribute the keyring to the node containing the gateway instance.

1. Create a keyring for the gateway:

cephadm@adm > ceph-authtool --create-keyring /etc/ceph/ceph.client.rgw.keyring
cephadm@adm > sudo chmod +r /etc/ceph/ceph.client.rgw.keyring

2. Generate a Ceph Object Gateway user name and key for each instance. As an example, we will use the name gateway after client.radosgw :

cephadm@adm > ceph-authtool /etc/ceph/ceph.client.rgw.keyring \
  -n client.rgw.gateway --gen-key

3. Add capabilities to the key:

cephadm@adm > ceph-authtool -n client.rgw.gateway --cap osd 'allow rwx' \
  --cap mon 'allow rwx' /etc/ceph/ceph.client.rgw.keyring

4. Once you have created a keyring and key to enable the Ceph Object Gateway with access to the Ceph Storage Cluster, add the key to your Ceph Storage Cluster. For example:

cephadm@adm > ceph -k /etc/ceph/ceph.client.admin.keyring auth add client.rgw.gateway \
  -i /etc/ceph/ceph.client.rgw.keyring

5. Distribute the keyring to the node with the gateway instance:

cephadm@adm > scp /etc/ceph/ceph.client.rgw.keyring ceph@HOST_NAME:/home/ceph
cephadm@adm > ssh ceph@HOST_NAME
cephadm@ogw > mv ceph.client.rgw.keyring /etc/ceph/ceph.client.rgw.keyring

Tip: Use Bootstrap Keyring An alternative way is to create the Object Gateway bootstrap keyring, and then create the Object Gateway keyring from it:

1. Create an Object Gateway bootstrap keyring on one of the monitor nodes:

cephadm@mon > ceph \
  auth get-or-create client.bootstrap-rgw mon 'allow profile bootstrap-rgw' \
  --connect-timeout=25 \
  --cluster=ceph \
  --name mon. \
  --keyring=/var/lib/ceph/mon/ceph-NODE_HOST/keyring \
  -o /var/lib/ceph/bootstrap-rgw/keyring

2. Create the /var/lib/ceph/radosgw/ceph-RGW_NAME directory for storing the bootstrap keyring:

cephadm@mon > mkdir \
  /var/lib/ceph/radosgw/ceph-RGW_NAME

3. Create an Object Gateway keyring from the newly created bootstrap keyring:

cephadm@mon > ceph \
  auth get-or-create client.rgw.RGW_NAME osd 'allow rwx' mon 'allow rw' \
  --connect-timeout=25 \
  --cluster=ceph \
  --name client.bootstrap-rgw \
  --keyring=/var/lib/ceph/bootstrap-rgw/keyring \
  -o /var/lib/ceph/radosgw/ceph-RGW_NAME/keyring

4. Copy the Object Gateway keyring to the Object Gateway host:

cephadm@mon > scp \
  /var/lib/ceph/radosgw/ceph-RGW_NAME/keyring \
  RGW_HOST:/var/lib/ceph/radosgw/ceph-RGW_NAME/keyring

9.1.1.2 Create Pools (Optional)

Ceph Object Gateways require Ceph Storage Cluster pools to store specific gateway data. If the user you created has proper permissions, the gateway will create the pools automatically. However, ensure that you have set an appropriate default number of placement groups per pool in the Ceph configuration file. The pool names follow the ZONE_NAME.POOL_NAME syntax. When configuring a gateway with the default region and zone, the default zone name is 'default' as in our example:

.rgw.root
default.rgw.control
default.rgw.meta
default.rgw.log
default.rgw.buckets.index
default.rgw.buckets.data

To create the pools manually, see Book “Administration Guide”, Chapter 22 “Managing Storage Pools”, Section 22.2.2 “Create a Pool”.
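For example, one of the replicated pools could be created manually as follows; the placement group count of 64 is an illustrative value, not a recommendation:

cephadm@adm > ceph osd pool create default.rgw.buckets.index 64 64 replicated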

Important: Object Gateway and Erasure-Coded Pools Only the default.rgw.buckets.data pool can be erasure coded. All other pools need to be replicated, otherwise the gateway is not accessible.

9.1.1.3 Adding Gateway Configuration to Ceph

Add the Ceph Object Gateway conguration to the Ceph Conguration le. The Ceph Object Gateway conguration requires you to identify the Ceph Object Gateway instance. Then, specify the host name where you installed the Ceph Object Gateway daemon, a keyring (for use with cephx), and optionally a log le. For example:

[client.rgw.INSTANCE_NAME]
host = HOST_NAME
keyring = /etc/ceph/ceph.client.rgw.keyring

Tip: Object Gateway Log File To override the default Object Gateway log file, include the following:

log file = /var/log/radosgw/client.rgw.INSTANCE_NAME.log

The [client.rgw.*] portion of the gateway instance identifies this portion of the Ceph configuration file as configuring a Ceph Storage Cluster client where the client type is a Ceph Object Gateway (radosgw). The instance name follows. For example:

[client.rgw.gateway]
host = ceph-gateway
keyring = /etc/ceph/ceph.client.rgw.keyring

Note The HOST_NAME must be your machine host name, excluding the domain name.

Then turn o print continue . If you have it set to true, you may encounter problems with PUT operations:

rgw print continue = false

To use a Ceph Object Gateway with subdomain S3 calls (for example http://bucketname.hostname ), you must add the Ceph Object Gateway DNS name under the [client.rgw.gateway] section of the Ceph configuration file:

[client.rgw.gateway]
...
rgw dns name = HOST_NAME

You should also consider installing a DNS server such as Dnsmasq on your client machine(s) when using the http://BUCKET_NAME.HOST_NAME syntax. The dnsmasq.conf file should include the following settings:

address=/HOST_NAME/HOST_IP_ADDRESS
listen-address=CLIENT_LOOPBACK_IP

Then, add the CLIENT_LOOPBACK_IP IP address as the first DNS server on the client machine(s).

9.1.1.4 Create Data Directory

Deployment scripts may not create the default Ceph Object Gateway data directory. Create data directories for each instance of a radosgw daemon if not already done. The host variables in the Ceph configuration file determine which host runs each instance of a radosgw daemon. The typical form specifies the radosgw daemon, the cluster name, and the daemon ID.

root # mkdir -p /var/lib/ceph/radosgw/CLUSTER_ID

Using the example ceph.conf settings above, you would execute the following:

root # mkdir -p /var/lib/ceph/radosgw/ceph-radosgw.gateway

9.1.1.5 Restart Services and Start the Gateway

To ensure that all components have reloaded their configurations, we recommend restarting your Ceph Storage Cluster service. Then, start up the radosgw service. For more information, see Book “Administration Guide”, Chapter 15 “Introduction” and Book “Administration Guide”, Chapter 26 “Ceph Object Gateway”, Section 26.3 “Operating the Object Gateway Service”. When the service is up and running, you can make an anonymous GET request to see if the gateway returns a response. A simple HTTP request to the domain name should return the following:

<?xml version="1.0" encoding="UTF-8"?>
<ListAllMyBucketsResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Owner>
    <ID>anonymous</ID>
    <DisplayName/>
  </Owner>
  <Buckets/>
</ListAllMyBucketsResult>
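For example, assuming the gateway host is reachable as HOST_NAME on port 80, such an anonymous request can be sent with curl:

cephadm@ogw > curl -i http://HOST_NAME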

10 Installation of iSCSI Gateway

iSCSI is a storage area network (SAN) protocol that allows clients (called initiators) to send SCSI commands to SCSI storage devices (targets) on remote servers. SUSE Enterprise Storage 6 includes a facility that opens Ceph storage management to heterogeneous clients, such as Microsoft Windows* and VMware* vSphere, through the iSCSI protocol. Multipath iSCSI access enables availability and scalability for these clients, and the standardized iSCSI protocol also provides an additional layer of security isolation between clients and the SUSE Enterprise Storage 6 cluster.

The configuration facility is named ceph-iscsi . Using ceph-iscsi , Ceph storage administrators can define thin-provisioned, replicated, highly-available volumes supporting read-only snapshots, read-write clones, and automatic resizing with Ceph RADOS Block Device (RBD). Administrators can then export volumes either via a single ceph-iscsi gateway host, or via multiple gateway hosts supporting multipath failover. Linux, Microsoft Windows, and VMware hosts can connect to volumes using the iSCSI protocol, which makes them available like any other SCSI block device. This means SUSE Enterprise Storage 6 customers can effectively run a complete block-storage infrastructure subsystem on Ceph that provides all the features and benefits of a conventional SAN, enabling future growth.

This chapter introduces detailed information to set up a Ceph cluster infrastructure together with an iSCSI gateway so that the client hosts can use remotely stored data as local storage devices using the iSCSI protocol.

10.1 iSCSI Block Storage

iSCSI is an implementation of the Small Computer System Interface (SCSI) command set using the Internet Protocol (IP), specified in RFC 3720. iSCSI is implemented as a service where a client (the initiator) talks to a server (the target) via a session on TCP port 3260. An iSCSI target's IP address and port are called an iSCSI portal, where a target can be exposed through one or more portals. The combination of a target and one or more portals is called the target portal group (TPG).

The underlying data link layer protocol for iSCSI is commonly Ethernet. More specifically, modern iSCSI infrastructures use 10 Gigabit Ethernet or faster networks for optimal throughput. 10 Gigabit Ethernet connectivity between the iSCSI gateway and the back-end Ceph cluster is strongly recommended.

10.1.1 The Linux Kernel iSCSI Target

The Linux kernel iSCSI target was originally named LIO for linux-iscsi.org, the project's original domain and Web site. For some time, no fewer than four competing iSCSI target implementations were available for the Linux platform, but LIO ultimately prevailed as the single iSCSI reference target. The mainline kernel code for LIO uses the simple, but somewhat ambiguous name "target", distinguishing between "target core" and a variety of front-end and back-end target modules.

The most commonly used front-end module is arguably iSCSI. However, LIO also supports Fibre Channel (FC), Fibre Channel over Ethernet (FCoE) and several other front-end protocols. At this time, only the iSCSI protocol is supported by SUSE Enterprise Storage.

The most frequently used target back-end module is one that is capable of simply re-exporting any available block device on the target host. This module is named iblock. However, LIO also has an RBD-specific back-end module supporting parallelized multipath I/O access to RBD images.

10.1.2 iSCSI Initiators

This section introduces brief information on iSCSI initiators used on Linux, Microsoft Windows, and VMware platforms.

10.1.2.1 Linux

The standard initiator for the Linux platform is open-iscsi . open-iscsi launches a daemon, iscsid , which the user can then use to discover iSCSI targets on any given portal, log in to targets, and map iSCSI volumes. iscsid communicates with the SCSI mid layer to create in-kernel block devices that the kernel can then treat like any other SCSI block device on the system. The open-iscsi initiator can be deployed in conjunction with the Device Mapper Multipath ( dm-multipath ) facility to provide a highly available iSCSI block device.

10.1.2.2 Microsoft Windows and Hyper-V

The default iSCSI initiator for the Microsoft Windows operating system is the Microsoft iSCSI initiator. The iSCSI service can be configured via a graphical user interface (GUI), and supports multipath I/O for high availability.

10.1.2.3 VMware

The default iSCSI initiator for VMware vSphere and ESX is the VMware ESX software iSCSI initiator, vmkiscsi . When enabled, it can be configured either from the vSphere client, or using the vmkiscsi-tool command. You can then format storage volumes connected through the vSphere iSCSI storage adapter with VMFS, and use them like any other VM storage device. The VMware initiator also supports multipath I/O for high availability.

10.2 General Information about ceph-iscsi

ceph-iscsi combines the benefits of RADOS Block Devices with the ubiquitous versatility of iSCSI. By employing ceph-iscsi on an iSCSI target host (known as the iSCSI Gateway), any application that needs to make use of block storage can benefit from Ceph, even if it does not speak any Ceph client protocol. Instead, users can use iSCSI or any other target front-end protocol to connect to an LIO target, which translates all target I/O to RBD storage operations.

FIGURE 10.1: CEPH CLUSTER WITH A SINGLE ISCSI GATEWAY

ceph-iscsi is inherently highly available and supports multipath operations. Thus, downstream initiator hosts can use multiple iSCSI gateways for both high availability and scalability. When communicating with an iSCSI configuration with more than one gateway, initiators may load-balance iSCSI requests across multiple gateways. In the event of a gateway failing, being temporarily unreachable, or being disabled for maintenance, I/O will transparently continue via another gateway.

FIGURE 10.2: CEPH CLUSTER WITH MULTIPLE ISCSI GATEWAYS

10.3 Deployment Considerations

A minimum conguration of SUSE Enterprise Storage 6 with ceph-iscsi consists of the following components:

A Ceph storage cluster. The Ceph cluster consists of a minimum of four physical servers hosting at least eight object storage daemons (OSDs) each. In such a configuration, three OSD nodes also double as a monitor (MON) host.

An iSCSI target server running the LIO iSCSI target, configured via ceph-iscsi .

An iSCSI initiator host, running open-iscsi (Linux), the Microsoft iSCSI Initiator (Microsoft Windows), or any other compatible iSCSI initiator implementation.

A recommended production configuration of SUSE Enterprise Storage 6 with ceph-iscsi consists of:

A Ceph storage cluster. A production Ceph cluster consists of any number of (typically more than 10) OSD nodes, each typically running 10-12 object storage daemons (OSDs), with no fewer than three dedicated MON hosts.

Several iSCSI target servers running the LIO iSCSI target, configured via ceph-iscsi . For iSCSI fail-over and load-balancing, these servers must run a kernel supporting the target_core_rbd module. Update packages are available from the SUSE Linux Enterprise Server maintenance channel.

Any number of iSCSI initiator hosts, running open-iscsi (Linux), the Microsoft iSCSI Initiator (Microsoft Windows), or any other compatible iSCSI initiator implementation.

10.4 Installation and Configuration

This section describes steps to install and configure an iSCSI Gateway on top of SUSE Enterprise Storage.

10.4.1 Deploy the iSCSI Gateway to a Ceph Cluster

You can deploy the iSCSI Gateway either during the Ceph cluster deployment process, or add it to an existing cluster using DeepSea. To include the iSCSI Gateway during the cluster deployment process, refer to Section 5.5.1.2, “Role Assignment”. To add the iSCSI Gateway to an existing cluster, refer to Book “Administration Guide”, Chapter 2 “Salt Cluster Administration”, Section 2.2 “Adding New Roles to Nodes”.

10.4.2 Create RBD Images

RBD images are created in the Ceph store and subsequently exported to iSCSI. We recommend that you use a dedicated RADOS pool for this purpose. You can create a volume from any host that is able to connect to your storage cluster using the Ceph rbd command line utility. This requires the client to have at least a minimal ceph.conf configuration file, and appropriate CephX authentication credentials.

To create a new volume for subsequent export via iSCSI, use the rbd create command, specifying the volume size in megabytes. For example, in order to create a 100 GB volume named 'testvol' in the pool named 'iscsi-images', run:

cephadm@adm > rbd --pool iscsi-images create --size=102400 'testvol'

10.4.3 Export RBD Images via iSCSI

To export RBD images via iSCSI, you can use either the Ceph Dashboard Web interface or the ceph-iscsi gwcli utility. In this section we will focus on gwcli only, demonstrating how to create an iSCSI target that exports an RBD image using the command line.

Note Only the following RBD image features are supported: layering , striping (v2) , exclusive-lock , fast-diff , and data-pool . RBD images with any other feature enabled cannot be exported.
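If you want to avoid unsupported features from the start, you can restrict the feature set at image creation time. The following is a sketch only; --image-feature is a standard rbd create option, and 'testvol2' is a hypothetical image name:

cephadm@adm > rbd --pool iscsi-images create --size=102400 --image-feature layering,exclusive-lock testvol2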

As root , start the iSCSI gateway command line interface:

root # gwcli

Go to iscsi-targets and create a target with the name iqn.2003-01.org.linux-iscsi.iscsi.SYSTEM-ARCH:testvol :

gwcli > /> cd /iscsi-targets
gwcli > /iscsi-targets> create iqn.2003-01.org.linux-iscsi.iscsi.SYSTEM-ARCH:testvol

Create the iSCSI gateways by specifying the gateway name and IP address:

gwcli > /iscsi-targets> cd iqn.2003-01.org.linux-iscsi.iscsi.SYSTEM-ARCH:testvol/gateways
gwcli > /iscsi-target...tvol/gateways> create iscsi1 192.168.124.104
gwcli > /iscsi-target...tvol/gateways> create iscsi2 192.168.124.105

Tip Use the help command to show the list of available commands in the current configuration node.

Add the RBD image with the name 'testvol' in the pool 'iscsi-images':

gwcli > /iscsi-target...tvol/gateways> cd /disks
gwcli > /disks> attach iscsi-images/testvol

Map the RBD image to the target:

gwcli > /disks> cd /iscsi-targets/iqn.2003-01.org.linux-iscsi.iscsi.SYSTEM-ARCH:testvol/disks
gwcli > /iscsi-target...testvol/disks> add iscsi-images/testvol

Note You can use lower level tools, such as targetcli , to query the local configuration, but not to modify it.

Tip You can use the ls command to review the configuration. Some configuration nodes also support the info command, which can be used to display more detailed information.

Note that, by default, ACL authentication is enabled so this target is not accessible yet. Check Section 10.4.4, “Authentication and Access Control” for more information about authentication and access control.

10.4.4 Authentication and Access Control

iSCSI authentication is flexible and covers many authentication possibilities.

10.4.4.1 No Authentication

'No authentication' means that any initiator will be able to access any LUNs on the corresponding target. You can enable 'No authentication' by disabling the ACL authentication:

gwcli > /> cd /iscsi-targets/iqn.2003-01.org.linux-iscsi.iscsi.SYSTEM-ARCH:testvol/hosts
gwcli > /iscsi-target...testvol/hosts> auth disable_acl

10.4.4.2 ACL Authentication

When using initiator name based ACL authentication, only the defined initiators are allowed to connect. You can define an initiator as follows:

gwcli > /> cd /iscsi-targets/iqn.2003-01.org.linux-iscsi.iscsi.SYSTEM-ARCH:testvol/hosts
gwcli > /iscsi-target...testvol/hosts> create iqn.1996-04.de.suse:01:e6ca28cc9f20

Dened initiators will be able to connect, but will only have access to the RBD images that were explicitly added to the initiator:

gwcli > /iscsi-target...:e6ca28cc9f20> disk add rbd/testvol

10.4.4.3 CHAP Authentication

In addition to the ACL, you can enable the CHAP authentication by specifying a user name and password for each initiator:

gwcli > /> cd /iscsi-targets/iqn.2003-01.org.linux-iscsi.iscsi.SYSTEM-ARCH:testvol/hosts/iqn.1996-04.de.suse:01:e6ca28cc9f20
gwcli > /iscsi-target...:e6ca28cc9f20> auth username=common12 password=pass12345678

Note User names must have a length of 8 to 64 characters and can contain alphanumeric characters, '.', '@', '-', '_' or ':'. Passwords must have a length of 12 to 16 characters and can contain alphanumeric characters, '@', '-', '_' or '/'.

Optionally, you can also enable the CHAP mutual authentication by specifying the mutual_username and mutual_password parameters in the auth command.
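For example, a mutual CHAP setup might be configured as follows; the mutual credentials shown are placeholders chosen only to satisfy the length rules above:

gwcli > /iscsi-target...:e6ca28cc9f20> auth username=common12 password=pass12345678 mutual_username=mutual12 mutual_password=mutualpass123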

10.4.4.4 Discovery and Mutual Authentication

Discovery authentication is independent of the previous authentication methods. It is optional, requires separate credentials for browsing targets, and can be configured as follows:

gwcli > /> cd /iscsi-targets
gwcli > /iscsi-targets> discovery_auth username=du123456 password=dp1234567890

Note User names must have a length of 8 to 64 characters and can only contain letters, '.', '@', '-', '_' or ':'. Passwords must have a length of 12 to 16 characters and can only contain letters, '@', '-', '_' or '/'.

Optionally, you can also specify the mutual_username and mutual_password parameters in the discovery_auth command. Discovery authentication can be disabled by using the following command:

gwcli > /iscsi-targets> discovery_auth nochap

10.4.5 Advanced Settings

ceph-iscsi can be configured with advanced parameters which are subsequently passed on to the LIO I/O target. The parameters are divided up into 'target' and 'disk' parameters.

Warning Unless otherwise noted, changing these parameters from the default setting is not recommended.

10.4.5.1 Target Settings

You can view the value of these settings by using the info command:

gwcli > /> cd /iscsi-targets/iqn.2003-01.org.linux-iscsi.iscsi.SYSTEM-ARCH:testvol
gwcli > /iscsi-target...i.SYSTEM-ARCH:testvol> info

And change a setting using the reconfigure command:

gwcli > /iscsi-target...i.SYSTEM-ARCH:testvol> reconfigure login_timeout 20

The available 'target' settings are:

default_cmdsn_depth
Default CmdSN (Command Sequence Number) depth. Limits the number of requests that an iSCSI initiator can have outstanding at any moment.

default_erl
Default error recovery level.

login_timeout
Login timeout value in seconds.

netif_timeout
NIC failure timeout in seconds.

prod_mode_write_protect
If set to 1, prevents writes to LUNs.

10.4.5.2 Disk Settings

You can view the value of these settings by using the info command:

gwcli > /> cd /disks/rbd/testvol
gwcli > /disks/rbd/testvol> info

And change a setting using the reconfigure command:

gwcli > /disks/rbd/testvol> reconfigure rbd/testvol emulate_pr 0

The available 'disk' settings are:

block_size
Block size of the underlying device.

emulate_3pc
If set to 1, enables Third Party Copy.

emulate_caw
If set to 1, enables Compare and Write.

emulate_dpo
If set to 1, turns on Disable Page Out.

emulate_fua_read
If set to 1, enables Force Unit Access read.

emulate_fua_write
If set to 1, enables Force Unit Access write.

emulate_model_alias
If set to 1, uses the back-end device name for the model alias.

emulate_pr
If set to 0, support for SCSI Reservations, including Persistent Group Reservations, is disabled. While disabled, the SES iSCSI Gateway can ignore reservation state, resulting in improved request latency.

Tip Setting backstore_emulate_pr to 0 is recommended if iSCSI initiators do not require SCSI Reservation support.

emulate_rest_reord
If set to 0, the Queue Algorithm Modifier has Restricted Reordering.

emulate_tas
If set to 1, enables Task Aborted Status.

emulate_tpu
If set to 1, enables Thin Provisioning Unmap.

emulate_tpws
If set to 1, enables Thin Provisioning Write Same.

emulate_ua_intlck_ctrl
If set to 1, enables Unit Attention Interlock.

emulate_write_cache
If set to 1, turns on Write Cache Enable.

enforce_pr_isids
If set to 1, enforces persistent reservation ISIDs.

is_nonrot
If set to 1, the backstore is a non-rotational device.

max_unmap_block_desc_count
Maximum number of block descriptors for UNMAP.

max_unmap_lba_count
Maximum number of LBAs for UNMAP.

max_write_same_len
Maximum length for WRITE_SAME.

optimal_sectors
Optimal request size in sectors.

pi_prot_type
DIF protection type.

queue_depth
Queue depth.

unmap_granularity
UNMAP granularity.

unmap_granularity_alignment
UNMAP granularity alignment.

force_pr_aptpl
When enabled, LIO will always write out the persistent reservation state to persistent storage, regardless of whether or not the client has requested it via aptpl=1 . This has no effect with the kernel RBD back-end for LIO, which always persists PR state. Ideally, the target_core_rbd option should force it to '1' and throw an error if someone tries to disable it via configfs.

unmap_zeroes_data
Affects whether LIO will advertise LBPRZ to SCSI initiators, indicating that zeros will be read back from a region following UNMAP or WRITE SAME with an unmap bit.

10.5 Exporting RADOS Block Device Images Using tcmu-runner

ceph-iscsi supports both rbd (kernel-based) and user:rbd (tcmu-runner) backstores, making all the management transparent and independent of the backstore.

Warning: Technology Preview tcmu-runner based iSCSI Gateway deployments are currently a technology preview.

Unlike kernel-based iSCSI Gateway deployments, tcmu-runner based iSCSI Gateways do not offer support for multipath I/O or SCSI Persistent Reservations. To export a RADOS Block Device image using tcmu-runner , all you need to do is specify the user:rbd backstore when attaching the disk:

gwcli > /disks> attach rbd/testvol backstore=user:rbd

Note When using tcmu-runner , the exported RBD image must have the exclusive-lock feature enabled.

11 Installation of CephFS

The Ceph le system (CephFS) is a POSIX-compliant le system that uses a Ceph storage cluster to store its data. CephFS uses the same cluster system as Ceph block devices, Ceph object storage with its S3 and Swift APIs, or native bindings ( librados ). To use CephFS, you need to have a running Ceph storage cluster, and at least one running Ceph metadata server.

11.1 Supported CephFS Scenarios and Guidance

With SUSE Enterprise Storage 6, SUSE introduces official support for many scenarios in which the scale-out and distributed component CephFS is used. This entry describes hard limits and provides guidance for the suggested use cases. A supported CephFS deployment must meet these requirements:

Clients are SUSE Linux Enterprise Server 12 SP3 or newer, or SUSE Linux Enterprise Server 15 or newer, using the cephfs kernel module driver. The FUSE module is not supported.

CephFS quotas are supported in SUSE Enterprise Storage 6 and can be set on any subdirectory of the Ceph file system. The quota restricts either the number of bytes or files stored beneath the specified point in the directory hierarchy. For more information, see Book “Administration Guide”, Chapter 28 “Clustered File System”, Section 28.6 “Setting CephFS Quotas”.
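For orientation, quotas are set via virtual extended attributes on a directory of a mounted CephFS. The following sketch assumes CephFS is mounted at /mnt/cephfs and limits a hypothetical directory to 100 GiB and 10,000 files:

root # setfattr -n ceph.quota.max_bytes -v 107374182400 /mnt/cephfs/quotadir
root # setfattr -n ceph.quota.max_files -v 10000 /mnt/cephfs/quotadir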

CephFS supports le layout changes as documented in Section 11.3.4, “File Layouts”. However, while the le system is mounted by any client, new data pools may not be added to an existing CephFS le system ( ceph mds add_data_pool ). They may only be added while the le system is unmounted.

A minimum of one Metadata Server. SUSE recommends deploying several nodes with the MDS role. By default, additional MDS daemons start as standby daemons, acting as backups for the active MDS. Multiple active MDS daemons are also supported (refer to Section 11.3.2, “MDS Cluster Size”).

11.2 Ceph Metadata Server

The Ceph metadata server (MDS) stores metadata for CephFS. Ceph block devices and Ceph object storage do not use MDS. MDSs make it possible for POSIX file system users to execute basic commands—such as ls or find —without placing an enormous burden on the Ceph storage cluster.

11.2.1 Adding and Removing a Metadata Server

You can deploy MDS either during the initial cluster deployment process as described in Section 5.3, “Cluster Deployment”, or add it to an already deployed cluster as described in Book “Administration Guide”, Chapter 2 “Salt Cluster Administration”, Section 2.1 “Adding New Cluster Nodes”. After you deploy your MDS, allow the Ceph OSD/MDS service in the firewall setting of the server where MDS is deployed: start yast , navigate to Security and Users > Firewall > Allowed Services, and in the Service to Allow drop-down menu select Ceph OSD/MDS. If the Ceph MDS node is not allowed full traffic, mounting of a file system fails, even though other operations may work properly. You can also remove a metadata server from your cluster; refer to the Book “Administration Guide” for details.
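If the MDS node is managed with firewalld rather than through YaST, an equivalent change can likely be made from the command line; this sketch assumes the predefined 'ceph' firewalld service, which covers the OSD/MDS port range, is available on the system:

root # firewall-cmd --permanent --add-service=ceph
root # firewall-cmd --reload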

11.2.2 Configuring a Metadata Server

You can ne-tune the MDS behavior by inserting relevant options in the ceph.conf conguration le.

METADATA SERVER SETTINGS

mon force standby active
If set to 'true' (default), monitors force standby-replay to be active. Set under [mon] or [global] sections.

mds cache memory limit
The soft memory limit (in bytes) that the MDS will enforce for its cache. Administrators should use this instead of the old mds cache size setting. Defaults to 1 GB.

mds cache reservation
The cache reservation (memory or inodes) for the MDS cache to maintain. When the MDS begins touching its reservation, it will recall client state until its cache size shrinks to restore the reservation. Defaults to 0.05.

mds cache size
The number of inodes to cache. A value of 0 (default) indicates an unlimited number. It is recommended to use mds cache memory limit to limit the amount of memory the MDS cache uses.

mds cache mid
The insertion point for new items in the cache LRU (from the top). Default is 0.7.

mds dir commit ratio
The fraction of directory that is dirty before Ceph commits using a full update instead of partial update. Default is 0.5.

mds dir max commit size
The maximum size of a directory update before Ceph breaks it into smaller transactions. Default is 90 MB.

mds decay halflife
The half-life of MDS cache temperature. Default is 5.

mds beacon interval
The frequency in seconds of beacon messages sent to the monitor. Default is 4.

mds beacon grace
The interval without beacons before Ceph declares an MDS laggy and possibly replaces it. Default is 15.

mds blacklist interval
The blacklist duration for failed MDSs in the OSD map. This setting controls how long failed MDS daemons will stay in the OSD map blacklist. It has no effect on how long something is blacklisted when the administrator blacklists it manually. For example, the ceph osd blacklist add command will still use the default blacklist time. Default is 24 * 60.

mds reconnect timeout
The interval in seconds to wait for clients to reconnect during MDS restart. Default is 45.

mds tick interval
How frequently the MDS performs internal periodic tasks. Default is 5.

mds dirstat min interval
The minimum interval in seconds to try to avoid propagating recursive stats up the tree. Default is 1.

mds scatter nudge interval
How quickly dirstat changes propagate up. Default is 5.

mds client prealloc inos
The number of inode numbers to preallocate per client session. Default is 1000.

mds early reply
Determines whether the MDS should allow clients to see request results before they commit to the journal. Default is 'true'.

mds use tmap
Use trivial map for directory updates. Default is 'true'.

mds default dir hash
The function to use for hashing files across directory fragments. Default is 2 (that is 'rjenkins').

mds log skip corrupt events
Determines whether the MDS should try to skip corrupt journal events during journal replay. Default is 'false'.

mds log max events
The maximum events in the journal before we initiate trimming. Set to -1 (default) to disable limits.

mds log max segments
The maximum number of segments (objects) in the journal before we initiate trimming. Set to -1 to disable limits. Default is 30.

mds log max expiring
The maximum number of segments to expire in parallel. Default is 20.

mds log eopen size
The maximum number of inodes in an EOpen event. Default is 100.

mds bal sample interval
Determines how frequently to sample directory temperature for fragmentation decisions. Default is 3.

mds bal replicate threshold
The maximum temperature before Ceph attempts to replicate metadata to other nodes. Default is 8000.

mds bal unreplicate threshold
The minimum temperature before Ceph stops replicating metadata to other nodes. Default is 0.

mds bal split size
The maximum directory size before the MDS will split a directory fragment into smaller bits. Default is 10000.

mds bal split rd
The maximum directory read temperature before Ceph splits a directory fragment. Default is 25000.

mds bal split wr
The maximum directory write temperature before Ceph splits a directory fragment. Default is 10000.

mds bal split bits
The number of bits by which to split a directory fragment. Default is 3.

mds bal merge size
The minimum directory size before Ceph tries to merge adjacent directory fragments. Default is 50.

mds bal interval
The frequency in seconds of workload exchanges between MDSs. Default is 10.

mds bal fragment interval
The delay in seconds between a fragment being capable of splitting or merging, and execution of the fragmentation change. Default is 5.

mds bal fragment fast factor
The ratio by which fragments may exceed the split size before a split is executed immediately, skipping the fragment interval. Default is 1.5.

mds bal fragment size max
The maximum size of a fragment before any new entries are rejected with ENOSPC. Default is 100000.

mds bal idle threshold
The minimum temperature before Ceph migrates a subtree back to its parent. Default is 0.

mds bal mode
The method for calculating MDS load:

0 = Hybrid.

1 = Request rate and latency.

2 = CPU load.

Default is 0.

mds bal min rebalance
The minimum subtree temperature before Ceph migrates. Default is 0.1.

mds bal min start
The minimum subtree temperature before Ceph searches a subtree. Default is 0.2.

mds bal need min
The minimum fraction of target subtree size to accept. Default is 0.8.

mds bal need max
The maximum fraction of target subtree size to accept. Default is 1.2.

mds bal midchunk
Ceph will migrate any subtree that is larger than this fraction of the target subtree size. Default is 0.3.

mds bal minchunk
Ceph will ignore any subtree that is smaller than this fraction of the target subtree size. Default is 0.001.

mds bal target removal min
The minimum number of balancer iterations before Ceph removes an old MDS target from the MDS map. Default is 5.

mds bal target removal max
The maximum number of balancer iterations before Ceph removes an old MDS target from the MDS map. Default is 10.

mds replay interval
The journal poll interval when in standby-replay mode ('hot standby'). Default is 1.

mds shutdown check
The interval for polling the cache during MDS shutdown. Default is 0.

mds thrash fragments
Ceph will randomly fragment or merge directories. Default is 0.

mds dump cache on map
Ceph will dump the MDS cache contents to a file on each MDS map. Default is 'false'.

mds dump cache after rejoin
Ceph will dump MDS cache contents to a file after rejoining the cache during recovery. Default is 'false'.

mds standby for name
An MDS daemon will standby for another MDS daemon of the name specified in this setting.

mds standby for rank
An MDS daemon will standby for an MDS daemon of this rank. Default is -1.

mds standby replay
Determines whether a Ceph MDS daemon should poll and replay the log of an active MDS ('hot standby'). Default is 'false'.

mds min caps per client
Set the minimum number of capabilities a client may hold. Default is 100.

mds max ratio caps per client
Set the maximum ratio of current caps that may be recalled during MDS cache pressure. Default is 0.8.

METADATA SERVER JOURNALER SETTINGS

journaler write head interval
How frequently to update the journal head object. Default is 15.

journaler prefetch periods
How many stripe periods to read ahead on journal replay. Default is 10.

journal prezero periods
How many stripe periods to zero ahead of write position. Default 10.

journaler batch interval
Maximum additional latency in seconds we incur artificially. Default is 0.001.

journaler batch max
Maximum number of bytes by which we will delay flushing. Default is 0.

11.3 CephFS

When you have a healthy Ceph storage cluster with at least one Ceph metadata server, you can create and mount your Ceph file system. Ensure that your client has network connectivity and a proper authentication keyring.

11.3.1 Creating CephFS

A CephFS requires at least two RADOS pools: one for data and one for metadata. When configuring these pools, you might consider:

Using a higher replication level for the metadata pool, as any data loss in this pool can render the whole file system inaccessible.

Using lower-latency storage such as SSDs for the metadata pool, as this will improve the observed latency of file system operations on clients.

When assigning a role-mds in the policy.cfg , the required pools are automatically created. You can manually create the pools cephfs_data and cephfs_metadata for manual performance tuning before setting up the Metadata Server. DeepSea will not create these pools if they already exist. For more information on managing pools, see Book “Administration Guide”, Chapter 22 “Managing Storage Pools”. To create the two required pools—for example, 'cephfs_data' and 'cephfs_metadata'—with default settings for use with CephFS, run the following commands:

cephadm@adm > ceph osd pool create cephfs_data pg_num cephadm@adm > ceph osd pool create cephfs_metadata pg_num

It is possible to use EC pools instead of replicated pools. We recommend using EC pools only for low performance requirements and infrequent random access, for example cold storage, backups, and archiving. CephFS on EC pools requires BlueStore to be enabled and the pool must have the allow_ec_overwrites option set. This option can be set by running ceph osd pool set ec_pool allow_ec_overwrites true .
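For example, a hypothetical erasure coded data pool could be created and prepared for CephFS as follows; the pool name and PG counts are placeholders only:

cephadm@adm > ceph osd pool create cephfs_data_ec 64 64 erasure
cephadm@adm > ceph osd pool set cephfs_data_ec allow_ec_overwrites true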

Erasure coding adds significant overhead to file system operations, especially small updates. This overhead is inherent to using erasure coding as a fault tolerance mechanism. This penalty is the trade-off for significantly reduced storage space overhead. When the pools are created, you may enable the file system with the ceph fs new command:

cephadm@adm > ceph fs new fs_name metadata_pool_name data_pool_name

For example:

cephadm@adm > ceph fs new cephfs cephfs_metadata cephfs_data

You can check that the file system was created by listing all available CephFSs:

cephadm@adm > ceph fs ls
name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]

When the le system has been created, your MDS will be able to enter an active state. For example, in a single MDS system:

cephadm@adm > ceph mds stat
e5: 1/1/1 up
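At this point you can verify the file system from a client. A minimal kernel mount might look like the following, assuming a placeholder monitor address MON_HOST and the admin key stored in a local secret file; see the Administration Guide for the full mounting procedure:

root # mount -t ceph MON_HOST:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret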

Tip: More Topics You can find more information on specific tasks—for example mounting, unmounting, and advanced CephFS setup—in Book “Administration Guide”, Chapter 28 “Clustered File System”.

11.3.2 MDS Cluster Size

A CephFS instance can be served by multiple active MDS daemons. All active MDS daemons that are assigned to a CephFS instance will distribute the file system's directory tree between themselves, and thus spread the load of concurrent clients. In order to add an active MDS daemon to a CephFS instance, a spare standby is needed. Either start an additional daemon or use an existing standby instance. The following command will display the current number of active and passive MDS daemons.

cephadm@adm > ceph mds stat

The following command sets the number of active MDSs to two in a file system instance.

cephadm@adm > ceph fs set fs_name max_mds 2

In order to shrink the MDS cluster prior to an update, two steps are necessary. First, set max_mds so that only one instance remains:

cephadm@adm > ceph fs set fs_name max_mds 1

and after that, explicitly deactivate the other active MDS daemons:

cephadm@adm > ceph mds deactivate fs_name:rank

where rank is the number of an active MDS daemon of a file system instance, ranging from 0 to max_mds -1. We recommend leaving at least one MDS as a standby daemon.

11.3.3 MDS Cluster and Updates

During Ceph updates, the feature flags on a file system instance may change (usually by adding new features). Incompatible daemons (such as the older versions) are not able to function with an incompatible feature set and will refuse to start. This means that updating and restarting one daemon can cause all other not yet updated daemons to stop and refuse to start. For this reason, we recommend shrinking the active MDS cluster to size one and stopping all standby daemons before updating Ceph. The manual steps for this update procedure are as follows:

1. Update the Ceph related packages using zypper .

2. Shrink the active MDS cluster as described above to one instance and stop all standby MDS daemons using their systemd units on all other nodes:

cephadm@mds > systemctl stop ceph-mds\*.service ceph-mds.target

3. Only then restart the single remaining MDS daemon, causing it to restart using the updated binary.

cephadm@mds > systemctl restart ceph-mds\*.service ceph-mds.target

4. Restart all other MDS daemons and reset the desired max_mds setting.

cephadm@mds > systemctl start ceph-mds.target

If you use DeepSea, it will follow this procedure in case the ceph package was updated during stages 0 and 4. It is possible to perform this procedure while clients have the CephFS instance mounted and I/O is ongoing. Note however that there will be a very brief I/O pause while the active MDS restarts. Clients will recover automatically.

It is good practice to reduce the I/O load as much as possible before updating an MDS cluster. An idle MDS cluster will go through this update procedure quicker. Conversely, on a heavily loaded cluster with multiple MDS daemons it is essential to reduce the load in advance to prevent a single MDS daemon from being overwhelmed by ongoing I/O.

11.3.4 File Layouts

The layout of a le controls how its contents are mapped to Ceph RADOS objects. You can read and write a le’s layout using virtual extended attributes or xattrs for shortly. The name of the layout xattrs depends on whether a le is a regular le or a directory. Regular les’ layout xattrs are called ceph.file.layout , while directories’ layout xattrs are called ceph.dir.layout . Where examples refer to ceph.file.layout , substitute the .dir. part as appropriate when dealing with directories.

11.3.4.1 Layout Fields

The following attribute fields are recognized:

pool
ID or name of a RADOS pool in which a file’s data objects will be stored.

pool_namespace
RADOS namespace within a data pool to which the objects will be written. It is empty by default, meaning the default namespace.

stripe_unit
The size in bytes of a block of data used in the RAID 0 distribution of a file. All stripe units for a file have equal size. The last stripe unit is typically incomplete: it represents the data at the end of the file as well as the unused 'space' beyond it up to the end of the fixed stripe unit size.

stripe_count
The number of consecutive stripe units that constitute a RAID 0 'stripe' of file data.

object_size
The size in bytes of RADOS objects into which the file data is chunked.

Tip: Object Sizes RADOS enforces a configurable limit on object sizes. If you increase CephFS object sizes beyond that limit, then writes may not succeed. The OSD setting is osd_max_object_size , which is 128 MB by default. Very large RADOS objects may prevent smooth operation of the cluster, so increasing the object size limit past the default is not recommended.

11.3.4.2 Reading Layout with getfattr

Use the getfattr command to read the layout information of an example file named file as a single string:

root # touch file
root # getfattr -n ceph.file.layout file
# file: file
ceph.file.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304"

Read individual layout fields:

root # getfattr -n ceph.file.layout.pool file
# file: file
ceph.file.layout.pool="cephfs_data"
root # getfattr -n ceph.file.layout.stripe_unit file
# file: file
ceph.file.layout.stripe_unit="4194304"

Tip: Pool ID or Name When reading layouts, the pool will usually be indicated by name. However, in rare cases when pools have only just been created, the ID may be output instead.

Directories do not have an explicit layout until it is customized. Attempts to read the layout will fail if it has never been modified: this indicates that the layout of the next ancestor directory with an explicit layout will be used.

root # mkdir dir
root # getfattr -n ceph.dir.layout dir
dir: ceph.dir.layout: No such attribute
root # setfattr -n ceph.dir.layout.stripe_count -v 2 dir

root # getfattr -n ceph.dir.layout dir
# file: dir
ceph.dir.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 pool=cephfs_data"

11.3.4.3 Writing Layouts with setfattr

Use the setfattr command to modify the layout fields of an example file named file :

cephadm@adm > ceph osd lspools
0 rbd
1 cephfs_data
2 cephfs_metadata
root # setfattr -n ceph.file.layout.stripe_unit -v 1048576 file
root # setfattr -n ceph.file.layout.stripe_count -v 8 file
# Setting pool by ID:
root # setfattr -n ceph.file.layout.pool -v 1 file
# Setting pool by name:
root # setfattr -n ceph.file.layout.pool -v cephfs_data file

Note: Empty File When the layout fields of a file are modified using setfattr , the file needs to be empty, otherwise an error will occur.

11.3.4.4 Clearing Layouts

If you want to remove an explicit layout from an example directory mydir and revert back to inheriting the layout of its ancestor, run the following:

root # setfattr -x ceph.dir.layout mydir

Similarly, if you have set the 'pool_namespace' attribute and wish to modify the layout to use the default namespace instead, run:

# Create a directory and set a namespace on it
root # mkdir mydir
root # setfattr -n ceph.dir.layout.pool_namespace -v foons mydir
root # getfattr -n ceph.dir.layout mydir
ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 \
pool=cephfs_data_a pool_namespace=foons"

# Clear the namespace from the directory's layout
root # setfattr -x ceph.dir.layout.pool_namespace mydir
root # getfattr -n ceph.dir.layout mydir
ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 \
pool=cephfs_data_a"

11.3.4.5 Inheritance of Layouts

Files inherit the layout of their parent directory at creation time. However, subsequent changes to the parent directory’s layout do not affect children:

root # getfattr -n ceph.dir.layout dir
# file: dir
ceph.dir.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 \
pool=cephfs_data"

# file1 inherits its parent's layout
root # touch dir/file1
root # getfattr -n ceph.file.layout dir/file1
# file: dir/file1
ceph.file.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 \
pool=cephfs_data"

# update the layout of the directory before creating a second file
root # setfattr -n ceph.dir.layout.stripe_count -v 4 dir
root # touch dir/file2

# file1's layout is unchanged
root # getfattr -n ceph.file.layout dir/file1
# file: dir/file1
ceph.file.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 \
pool=cephfs_data"

# ...while file2 has the parent directory's new layout
root # getfattr -n ceph.file.layout dir/file2
# file: dir/file2
ceph.file.layout="stripe_unit=4194304 stripe_count=4 object_size=4194304 \
pool=cephfs_data"

Files created as descendants of the directory also inherit its layout if the intermediate directories do not have layouts set:

root # getfattr -n ceph.dir.layout dir
# file: dir
ceph.dir.layout="stripe_unit=4194304 stripe_count=4 object_size=4194304 \
pool=cephfs_data"
root # mkdir dir/childdir
root # getfattr -n ceph.dir.layout dir/childdir
dir/childdir: ceph.dir.layout: No such attribute
root # touch dir/childdir/grandchild
root # getfattr -n ceph.file.layout dir/childdir/grandchild
# file: dir/childdir/grandchild
ceph.file.layout="stripe_unit=4194304 stripe_count=4 object_size=4194304 \
pool=cephfs_data"

11.3.4.6 Adding a Data Pool to the Metadata Server

Before you can use a pool with CephFS, you need to add it to the Metadata Server:

cephadm@adm > ceph fs add_data_pool cephfs cephfs_data_ssd
cephadm@adm > ceph fs ls  # Pool should now show up
.... data pools: [cephfs_data cephfs_data_ssd ]

Tip: cephx Keys Make sure that your cephx keys allow the client to access this new pool.

You can then update the layout on a directory in CephFS to use the pool you added:

root # mkdir /mnt/cephfs/myssddir
root # setfattr -n ceph.dir.layout.pool -v cephfs_data_ssd /mnt/cephfs/myssddir

All new les created within that directory will now inherit its layout and place their data in your newly added pool. You may notice that the number of objects in your primary data pool continues to increase, even if les are being created in the pool you newly added. This is normal: the le data is stored in the pool specied by the layout, but a small amount of metadata is kept in the primary data pool for all les.

12 Installation of NFS Ganesha

NFS Ganesha provides NFS access to either the Object Gateway or the CephFS. In SUSE Enterprise Storage 6, NFS versions 3 and 4 are supported. NFS Ganesha runs in the user space instead of the kernel space and directly interacts with the Object Gateway or CephFS.

Warning: Cross Protocol Access Native CephFS and NFS clients are not restricted by file locks obtained via Samba, and vice versa. Applications that rely on cross protocol file locking may experience data corruption if CephFS backed Samba share paths are accessed via other means.

12.1 Preparation

12.1.1 General Information

To successfully deploy NFS Ganesha, you need to add a role-ganesha to your /srv/pillar/ceph/proposals/policy.cfg . For details, see Section 5.5.1, “The policy.cfg File”. NFS Ganesha also needs either a role-rgw or a role-mds present in the policy.cfg . Although it is possible to install and run the NFS Ganesha server on an already existing Ceph node, we recommend running it on a dedicated host with access to the Ceph cluster. The client hosts are typically not part of the cluster, but they need to have network access to the NFS Ganesha server. To enable the NFS Ganesha server at any point after the initial installation, add the role-ganesha to the policy.cfg and re-run at least DeepSea stages 2 and 4. For details, see Section 5.3, “Cluster Deployment”. NFS Ganesha is configured via the file /etc/ganesha/ganesha.conf that exists on the NFS Ganesha node. However, this file is overwritten each time DeepSea stage 4 is executed. Therefore we recommend editing the template used by Salt, which is the file /srv/salt/ceph/ganesha/files/ganesha.conf.j2 on the Salt master. For details about the configuration file, see Book “Administration Guide”, Chapter 30 “NFS Ganesha: Export Ceph Data via NFS”, Section 30.2 “Configuration”.
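For orientation, an export block in ganesha.conf (or in the Salt template) typically has the following shape. This is a generic NFS Ganesha CephFS example, not the exact content DeepSea generates; the paths and IDs are placeholders:

EXPORT
{
    Export_Id = 1;
    Path = "/";
    Pseudo = "/cephfs";
    Access_Type = RW;
    Squash = No_Root_Squash;
    FSAL
    {
        Name = CEPH;
    }
}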

12.1.2 Summary of Requirements

The following requirements need to be met before DeepSea stages 2 and 4 can be executed to install NFS Ganesha:

At least one node needs to be assigned the role-ganesha .

You can dene only one role-ganesha per minion.

NFS Ganesha needs either an Object Gateway or CephFS to work.

The kernel based NFS needs to be disabled on minions with the role-ganesha role.

12.2 Example Installation

This procedure provides an example installation that uses both the Object Gateway and CephFS File System Abstraction Layers (FSAL) of NFS Ganesha.

1. If you have not done so, execute DeepSea stages 0 and 1 before continuing with this procedure.

root@master # salt-run state.orch ceph.stage.0
root@master # salt-run state.orch ceph.stage.1

2. After having executed stage 1 of DeepSea, edit the /srv/pillar/ceph/proposals/policy.cfg and add the line

role-ganesha/cluster/NODENAME

Replace NODENAME with the name of a node in your cluster. Also make sure that a role-mds and a role-rgw are assigned.

3. Execute at least stages 2 and 4 of DeepSea. Running stage 3 in between is recommended.

root@master # salt-run state.orch ceph.stage.2
root@master # salt-run state.orch ceph.stage.3 # optional but recommended
root@master # salt-run state.orch ceph.stage.4

4. Verify that NFS Ganesha is working by checking that the NFS Ganesha service is running on the minion node:

root@master # salt -I roles:ganesha service.status nfs-ganesha

MINION_ID: True

12.3 Active-Active Configuration

This section provides an example of a simple active-active NFS Ganesha setup. The aim is to deploy two NFS Ganesha servers layered on top of the same existing CephFS. The servers will be two Ceph cluster nodes with separate addresses. The clients need to be distributed between them manually. “Failover” in this configuration means manually unmounting and remounting the other server on the client.

12.3.1 Prerequisites

For our example conguration, you need the following:

A running Ceph cluster. See Section 5.3, “Cluster Deployment” for details on deploying and configuring a Ceph cluster by using DeepSea.

At least one congured CephFS. See Chapter 11, Installation of CephFS for more details on deploying and conguring CephFS.

Two Ceph cluster nodes with NFS Ganesha deployed. See Chapter 12, Installation of NFS Ganesha for more details on deploying NFS Ganesha.

Tip: Use Dedicated Servers Although NFS Ganesha nodes can share resources with other Ceph related services, we recommend using dedicated servers to improve performance.

After you deploy the NFS Ganesha nodes, verify that the cluster is operational and the default CephFS pools are there:

cephadm@adm > rados lspools
cephfs_data
cephfs_metadata

12.3.2 Configure NFS Ganesha

Check that both NFS Ganesha nodes have the file /etc/ganesha/ganesha.conf installed. Add the following blocks, if they do not exist yet, to the configuration file in order to enable RADOS as the recovery backend of NFS Ganesha.

NFS_CORE_PARAM
{
    Enable_NLM = false;
    Enable_RQUOTA = false;
    Protocols = 4;
}
NFSv4
{
    RecoveryBackend = rados_cluster;
    Minor_Versions = 1,2;
}
CACHEINODE
{
    Dir_Chunk = 0;
    NParts = 1;
    Cache_Size = 1;
}
RADOS_KV
{
    pool = "rados_pool";
    namespace = "pool_namespace";
    nodeid = "fqdn";
    UserId = "cephx_user_id";
    Ceph_Conf = "path_to_ceph.conf";
}

You can nd out the values for rados_pool and pool_namespace by checking the already existing line in the conguration of the form:

%url rados://rados_pool/pool_namespace/...

The value of the nodeid option corresponds to the FQDN of the machine, and the values of the UserId and Ceph_Conf options can be found in the already existing RADOS_URLS block. Because legacy versions of NFS prevent us from lifting the grace period early and therefore prolong a server restart, we disable options for NFS prior to version 4.2. We also disable most of the NFS Ganesha caching as Ceph libraries do aggressive caching already. The 'rados_cluster' recovery back-end stores its information in RADOS objects. Although it is not a lot of data, we want it highly available. We use the CephFS metadata pool for this purpose, and declare a new 'ganesha' namespace in it to keep it distinct from CephFS objects.

Note: Cluster Node IDs Most of the configuration is identical between the two hosts; however, the nodeid option in the 'RADOS_KV' block needs to be a unique string for each node. By default, NFS Ganesha sets nodeid to the host name of the node. If you need to use different fixed values other than host names, you can for example set nodeid = 'a' on one node and nodeid = 'b' on the other one.

12.3.3 Populate the Cluster Grace Database

We need to verify that all of the nodes in the cluster know about each other. This is done via a RADOS object that is shared between the hosts. NFS Ganesha uses this object to communicate the current state with regard to a grace period. The nfs-ganesha-rados-grace package contains a command line tool for querying and manipulating this database. If the package is not installed on at least one of the nodes, install it with

root # zypper install nfs-ganesha-rados-grace

We will use the command to create the DB and add both nodeid s. In our example, the two NFS Ganesha nodes are named ses6min1.example.com and ses6min2.example.com. On one of the NFS Ganesha hosts, run

cephadm@adm > ganesha-rados-grace -p cephfs_metadata -n ganesha add ses6min1.example.com
cephadm@adm > ganesha-rados-grace -p cephfs_metadata -n ganesha add ses6min2.example.com
cephadm@adm > ganesha-rados-grace -p cephfs_metadata -n ganesha
cur=1 rec=0
======================================================
ses6min1.example.com    E
ses6min2.example.com    E

This creates the grace database and adds both 'ses6min1.example.com' and 'ses6min2.example.com' to it. The last command dumps the current state. Newly added hosts are always considered to be enforcing the grace period, so they both have the 'E' flag set. The 'cur' and 'rec' values show the current and recovery epochs, which is how we keep track of what hosts are allowed to perform recovery and when.

12.3.4 Restart NFS Ganesha Services

On both NFS Ganesha nodes, restart the related services:

root # systemctl restart nfs-ganesha.service

After the services are restarted, check the grace database:

cephadm@adm > ganesha-rados-grace -p cephfs_metadata -n ganesha
cur=3 rec=0
======================================================
ses6min1.example.com
ses6min2.example.com

Note: Cleared the 'E' Flag Note that both nodes have cleared their 'E' flags, indicating that they are no longer enforcing the grace period and are now in normal operation mode.

12.3.5 Conclusion

After you complete all the preceding steps, you can mount the exported NFS from either of the two NFS Ganesha servers, and perform normal NFS operations against them. Our example conguration assumes that if one of the two NFS Ganesha servers goes down, you will restart it manually within 5 minutes. After 5 minutes, the Metadata Server may cancel the session that the NFS Ganesha client held and all of the state associated with it. If the session’s capabilities get cancelled before the rest of the cluster goes into the grace period, the server’s clients may not be able to recover all of their state.
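For example, a client could mount the export from the first gateway as follows; the mount point is a placeholder and NFSv4.1 is assumed, matching the configuration above:

root # mount -t nfs -o nfsvers=4.1 ses6min1.example.com:/ /mnt/ganesha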

12.4 More Information

More information can be found in Book “Administration Guide”, Chapter 30 “NFS Ganesha: Export Ceph Data via NFS”.

IV Cluster Deployment on Top of SUSE CaaS Platform 4 (Technology Preview)

13 SUSE Enterprise Storage 6 on Top of SUSE CaaS Platform 4 Kubernetes Cluster

Warning: Technology Preview Running a containerized Ceph cluster on SUSE CaaS Platform is a technology preview. Do not deploy on a production Kubernetes cluster. This is not a supported version.

This chapter describes how to deploy containerized SUSE Enterprise Storage 6 on top of SUSE CaaS Platform 4 Kubernetes cluster.

13.1 Considerations

Before you start deploying, consider the following points:

To run Ceph in Kubernetes, SUSE Enterprise Storage 6 uses an upstream project called Rook (https://rook.io/ ).

Depending on the conguration, Rook may consume all unused disks on all nodes in a Kubernetes cluster.

The setup requires privileged containers.

13.2 Prerequisites

The minimum requirements and prerequisites to deploy SUSE Enterprise Storage 6 on top of SUSE CaaS Platform 4 Kubernetes cluster are as follows:

A running SUSE CaaS Platform 4 cluster. You need to have an account with a SUSE CaaS Platform subscription. You can activate a 60-day free evaluation here: https://www.suse.com/products/caas-platform/download/MkpwEt3Ub98~/?campaign_name=Eval:_CaaSP_4 .

At least three SUSE CaaS Platform worker nodes, with at least one additional disk attached to each worker node as storage for the OSD. We recommend four SUSE CaaS Platform worker nodes.

At least one OSD per worker node, with a minimum disk size of 5 GB.

Access to SUSE Enterprise Storage 6. You can get a trial subscription from here: https://www.suse.com/products/suse-enterprise-storage/download/ .

Access to a workstation that has access to the SUSE CaaS Platform cluster via kubectl . We recommend using the SUSE CaaS Platform master node as the workstation.

Ensure that the SUSE-Enterprise-Storage-6-Pool and SUSE-Enterprise-Storage-6-Updates repositories are configured on the management node to install the rook-k8s-yaml RPM package.

13.3 Get Rook Manifests

The Rook orchestrator uses configuration files in YAML format called manifests. The manifests you need are included in the rook-k8s-yaml RPM package. You can find this package in the SUSE Enterprise Storage 6 repository. Install it by running the following:

root # zypper install rook-k8s-yaml

13.4 Installation

Rook-Ceph includes two main components: the 'operator' which is run by Kubernetes and allows creation of Ceph clusters, and the Ceph 'cluster' itself which is created and partially managed by the operator.

13.4.1 Configuration

13.4.1.1 Global Configuration

The manifests used in this setup install all Rook and Ceph components in the 'rook-ceph' namespace. If you need to change it, adapt all references to the namespace in the Kubernetes manifests accordingly. Depending on which features of Rook you intend to use, alter the 'Pod Security Policy' configuration in common.yaml to limit Rook's security requirements. Follow the comments in the manifest file.

13.4.1.2 Operator Configuration

The manifest operator.yaml configures the Rook operator. Normally, you do not need to change it. Find more information following the comments in the manifest file.

13.4.1.3 Ceph Cluster Configuration

The manifest cluster.yaml is responsible for configuring the actual Ceph cluster which will run in Kubernetes. Find a detailed description of all available options in the upstream Rook documentation at https://rook.io/docs/rook/v1.0/ceph-cluster-crd.html . By default, Rook is configured to use all nodes that are not tainted with node-role.kubernetes.io/master:NoSchedule and will obey configured placement settings (see https://rook.io/docs/rook/v1.0/ceph-cluster-crd.html#placement-configuration-settings ). The following example disables such behavior and only uses the nodes explicitly listed in the nodes section:

storage:
  useAllNodes: false
  nodes:
  - name: caasp4-worker-0
  - name: caasp4-worker-1
  - name: caasp4-worker-2

Note By default, Rook is configured to use all free and empty disks on each node for use as Ceph storage.

13.4.1.4 Documentation

The Rook-Ceph upstream documentation at https://rook.github.io/docs/rook/master/ceph-storage.html contains more detailed information about configuring more advanced deployments. Use it as a reference for understanding the basics of Rook-Ceph before doing more advanced configurations.

Find more details about the SUSE CaaS Platform product at https://documentation.suse.com/suse-caasp/4.0/ .

13.4.2 Create the Rook Operator

Install the Rook-Ceph common components, CSI roles, and the Rook-Ceph operator by executing the following command on the SUSE CaaS Platform master node:

root # kubectl apply -f common.yaml -f operator.yaml

common.yaml will create the 'rook-ceph' namespace, Ceph Custom Resource Definitions (CRDs) (see https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/ ) to make Kubernetes aware of Ceph Objects (for example, 'CephCluster'), and the RBAC roles and Pod Security Policies (see https://kubernetes.io/docs/concepts/policy/pod-security-policy/ ) which are necessary for allowing Rook to manage the cluster-specific resources.

Tip: hostNetwork and hostPorts Usage Allowing the usage of hostNetwork is required when using hostNetwork: true in the Cluster Resource Definition. Allowing the usage of hostPorts in the PodSecurityPolicy is also required.

Verify the installation by running kubectl get pods -n rook-ceph on SUSE CaaS Platform master node, for example:

root # kubectl get pods -n rook-ceph
NAME                               READY   STATUS    RESTARTS   AGE
rook-ceph-agent-57c9j              1/1     Running   0          22h
rook-ceph-agent-b9j4x              1/1     Running   0          22h
rook-ceph-operator-cf6fb96-lhbj7   1/1     Running   0          22h
rook-discover-mb8gv                1/1     Running   0          22h
rook-discover-tztz4                1/1     Running   0          22h

13.4.3 Create the Ceph Cluster

After you modify cluster.yaml according to your needs, you can create the Ceph cluster. Run the following command on the SUSE CaaS Platform master node:

root # kubectl apply -f cluster.yaml

Watch the 'rook-ceph' namespace to see the Ceph cluster being created. You will see as many Ceph Monitors as configured in the cluster.yaml manifest (default is 3), one Ceph Manager, and as many Ceph OSDs as you have free disks.

Tip: Temporary OSD Pods While bootstrapping the Ceph cluster, you will see some pods with the name rook-ceph-osd-prepare-NODE-NAME run for a while and then terminate with the status 'Completed'. As their name implies, these pods provision Ceph OSDs. They are left without being deleted so that you can inspect their logs after their termination. For example:

root # kubectl get pods --namespace rook-ceph
NAME                                          READY   STATUS      RESTARTS   AGE
rook-ceph-agent-57c9j                         1/1     Running     0          22h
rook-ceph-agent-b9j4x                         1/1     Running     0          22h
rook-ceph-mgr-a-6d48564b84-k7dft              1/1     Running     0          22h
rook-ceph-mon-a-cc44b479-5qvdb                1/1     Running     0          22h
rook-ceph-mon-b-6c6565ff48-gm9wz              1/1     Running     0          22h
rook-ceph-operator-cf6fb96-lhbj7              1/1     Running     0          22h
rook-ceph-osd-0-57bf997cbd-4wspg              1/1     Running     0          22h
rook-ceph-osd-1-54cf468bf8-z8jhp              1/1     Running     0          22h
rook-ceph-osd-prepare-caasp4-worker-0-f2tmw   0/2     Completed   0          9m35s
rook-ceph-osd-prepare-caasp4-worker-1-qsfhz   0/2     Completed   0          9m33s
rook-ceph-tools-76c7d559b6-64rkw              1/1     Running     0          22h
rook-discover-mb8gv                           1/1     Running     0          22h
rook-discover-tztz4                           1/1     Running     0          22h

13.5 Using Rook as Storage for Kubernetes Workload

Rook allows you to use three different types of storage:

Object Storage Object storage exposes an S3 API to the storage cluster for applications to put and get data. Refer to https://rook.io/docs/rook/v1.0/ceph-object.html for a detailed description.

Shared File System A shared file system can be mounted with read/write permission from multiple pods. This is useful for applications that are clustered using a shared file system. Refer to https://rook.io/docs/rook/v1.0/ceph-filesystem.html for a detailed description.

Block Storage Block storage allows you to mount storage to a single pod. Refer to https://rook.io/docs/rook/v1.0/ceph-block.html for a detailed description.

13.6 Uninstalling Rook

To uninstall Rook, follow these steps:

1. Delete any Kubernetes applications that are consuming Rook storage.

2. Delete all object, file, and/or block storage artifacts that you created by following Section 13.5, “Using Rook as Storage for Kubernetes Workload”.

3. Delete the Ceph cluster, operator, and related resources:

root # kubectl delete -f cluster.yaml
root # kubectl delete -f operator.yaml
root # kubectl delete -f common.yaml

4. Delete the data on hosts:

root # rm -rf /var/lib/rook

5. If necessary, wipe the disks that were used by Rook. Refer to https://rook.io/docs/rook/master/ceph-teardown.html for more details.

A Ceph Maintenance Updates Based on Upstream 'Nautilus' Point Releases

Several key packages in SUSE Enterprise Storage 6 are based on the Nautilus release series of Ceph. When the Ceph project (https://github.com/ceph/ceph ) publishes new point releases in the Nautilus series, SUSE Enterprise Storage 6 is updated to ensure that the product benefits from the latest upstream bugfixes and feature backports. This chapter contains summaries of notable changes contained in each upstream point release that has been—or is planned to be—included in the product.

Nautilus 14.2.20 Point Release

This release includes a security fix that ensures the global_id value (a numeric value that should be unique for every authenticated client or daemon in the cluster) is reclaimed after a network disconnect or ticket renewal in a secure fashion. Two new health alerts may appear during the upgrade indicating that there are clients or daemons that are not yet patched with the appropriate fix. To temporarily mute the health alerts around insecure clients for the duration of the upgrade, you may want to run:

cephadm@adm > ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM 1h
cephadm@adm > ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED 1h
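
To identify which clients or daemons are still unpatched before proceeding, you can inspect the details of the health alerts:

cephadm@adm > ceph health detail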

When all clients are updated, enable the new secure behavior, not allowing old insecure clients to join the cluster:

cephadm@adm > ceph config set mon auth_allow_insecure_global_id_reclaim false

For more details, refer to https://docs.ceph.com/en/latest/security/CVE-2021-20288/ .

Nautilus 14.2.18 Point Release

This release xes a regression introduced in 14.2.17 in which the manager module tries to use a couple of Python modules that do not exist in some environments.

This release fixes issues with loading the dashboard and volumes manager modules in some environments.

Nautilus 14.2.17 Point Release

This release includes the following fixes:

$pid expansion in configuration paths such as admin_socket will now properly expand to the daemon PID for commands like ceph-mds or ceph-osd . Previously, only ceph-fuse and rbd-nbd expanded $pid with the actual daemon PID.
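
For example, a purely illustrative ceph.conf snippet that embeds the daemon PID in the admin socket path could look like this:

[osd]
admin_socket = /var/run/ceph/$cluster-$name.$pid.asok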

RADOS: PG removal has been optimized.

RADOS: Memory allocations are tracked in finer detail in BlueStore and displayed as a part of the dump_mempools command.

CephFS: clients that acquire capabilities too quickly are throttled to prevent instability. See the new config option mds_session_cap_acquisition_throttle to control this behavior.
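
For example, to adjust the throttle (the value shown here is purely illustrative; check the option's default in your release before changing it), you could run:

cephadm@adm > ceph config set mds mds_session_cap_acquisition_throttle 500000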

Nautilus 14.2.16 Point Release

This release xes a security aw in CephFS.

CVE-2020-27781: OpenStack Manila's use of the ceph_volume_client.py library allowed tenant access to any Ceph credential's secret.

Nautilus 14.2.15 Point Release

This release xes a ceph-volume regression introduced in v14.2.13 and includes few other xes.

ceph-volume: Fixes lvm batch --auto , which breaks backward compatibility when using only non-rotational devices (SSD and/or NVMe).

BlueStore: Fixes a bug in collection_list_legacy which made PGs inconsistent during scrub when running OSDs older than 14.2.12 together with newer ones.

MGR: the progress module can now be turned on or off, using the commands ceph progress on and ceph progress off .

Nautilus 14.2.14 Point Release

This release fixes a security flaw affecting Messenger V2 for Octopus and Nautilus, among other fixes across components.

CVE-2020-25660: Fixed a regression in Messenger V2 that made it vulnerable to replay attacks.

Nautilus 14.2.13 Point Release

This release xes a regression introduced in v14.2.12, and a few ceph-volume amd RGW xes.

Fixed a regression that caused breakage in clusters that referred to ceph-mon hosts using DNS names instead of IP addresses in the mon_host parameter in ceph.conf .

ceph-volume: the lvm batch subcommand received a major rewrite.

Nautilus 14.2.12 Point Release

In addition to bug fixes, this major upstream release brought a number of notable changes:

The ceph df command now lists the number of PGs in each pool.

MONs now have a config option mon_osd_warn_num_repaired , 10 by default. If any OSD has repaired more than this many I/O errors in stored data, an OSD_TOO_MANY_REPAIRS health warning is generated. To allow clearing of the warning, a new command ceph tell osd.SERVICE_ID clear_shards_repaired COUNT has been added. By default, it sets the repair count to 0. If you want to be warned again when additional repairs are performed, provide a COUNT value that matches mon_osd_warn_num_repaired . This command will be replaced in future releases by the health mute/unmute feature.
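
For example, to raise the warning threshold and then reset the repair counter of a single OSD (the OSD ID and threshold are illustrative):

cephadm@adm > ceph config set mon mon_osd_warn_num_repaired 20
cephadm@adm > ceph tell osd.0 clear_shards_repaired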

It is now possible to specify the initial MON to contact for Ceph tools and daemons using the mon_host_override config option or the --mon-host-override IP command-line switch. This should generally only be used for debugging and only affects initial communication with Ceph's MON cluster.
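
For example, to query the cluster status while contacting a specific MON first (the IP address is illustrative):

cephadm@adm > ceph -s --mon-host-override 192.168.100.1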

Fixed an issue with osdmaps not being trimmed in a healthy cluster.

Nautilus 14.2.11 Point Release

In addition to bug fixes, this major upstream release brought a number of notable changes:

RGW: The radosgw-admin sub-commands dealing with orphans – radosgw-admin orphans find , radosgw-admin orphans finish , and radosgw-admin orphans list-jobs – have been deprecated. They have not been actively maintained, and because they store intermediate results on the cluster, they could fill a nearly-full cluster. They have been replaced by a new tool, currently considered experimental, rgw-orphan-list .

Now, when the noscrub and/or nodeep-scrub flags are set globally or per pool, scheduled scrubs of the disabled type are aborted. User-initiated scrubs are not interrupted.
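
For example, you can disable deep scrubbing cluster-wide, or for a single pool only (the pool name is a placeholder):

cephadm@adm > ceph osd set nodeep-scrub
cephadm@adm > ceph osd pool set POOL_NAME nodeep-scrub 1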

Fixed a ceph-osd crash in committed OSD maps when there is a failure to encode the first incremental map.

Nautilus 14.2.10 Point Release

This upstream release patched one security flaw:

CVE-2020-10753: rgw: sanitize newlines in S3 CORSConfiguration's ExposeHeader

In addition to this security fix, this major upstream release brought a number of notable changes:

The pool parameter target_size_ratio , used by the PG autoscaler, has changed meaning. It is now normalized across pools, rather than specifying an absolute ratio. If you have set target size ratios on any pools, you may want to set these pools to autoscale warn mode to avoid data movement during the upgrade:

cephadm@adm > ceph osd pool set POOL_NAME pg_autoscale_mode warn

The behaviour of the -o argument to the RADOS tool has been reverted to its original behaviour of indicating an output file. This makes it consistent with other tools. Specifying the object size is now accomplished by using an uppercase O ( -O ).

The format of MDSs in ceph fs dump has changed.

Ceph will issue a health warning if a RADOS pool’s size is set to 1 or, in other words, the pool is configured with no redundancy. This can be fixed by setting the pool size to the minimum recommended value with:

cephadm@adm > ceph osd pool set POOL_NAME size NUM_REPLICAS

The warning can be silenced with:

cephadm@adm > ceph config set global mon_warn_on_pool_no_redundancy false

RGW: bucket listing performance on sharded bucket indexes has been notably improved by heuristically – and significantly, in many cases – reducing the number of entries requested from each bucket index shard.

Nautilus 14.2.9 Point Release

This upstream release patched two security flaws:

CVE-2020-1759: Fixed nonce reuse in msgr V2 secure mode

CVE-2020-1760: Fixed XSS due to RGW GetObject header-splitting

In SES 6, these flaws were patched in Ceph version 14.2.5.389+gb0f23ac248.

Nautilus 14.2.8 Point Release

In addition to bug fixes, this major upstream release brought a number of notable changes:

The default value of bluestore_min_alloc_size_ssd has been changed to 4K to improve performance across all workloads.

The following OSD memory config options related to BlueStore cache autotuning can now be configured during runtime:

osd_memory_base (default: 768 MB)
osd_memory_cache_min (default: 128 MB)
osd_memory_expected_fragmentation (default: 0.15)
osd_memory_target (default: 4 GB)

You can set the above options by running:

cephadm@adm > ceph config set osd OPTION VALUE

The Ceph Manager now accepts profile rbd and profile rbd-read-only user capabilities. You can use these capabilities to provide users access to MGR-based RBD functionality such as rbd perf image iostat and rbd perf image iotop .
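
For example, the following sketch creates a user that is restricted to the RBD profiles; the user name and pool are illustrative:

cephadm@adm > ceph auth get-or-create client.rbd-monitor mon 'profile rbd' mgr 'profile rbd' osd 'profile rbd pool=rbd'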

The conguration value osd_calc_pg_upmaps_max_stddev used for upmap balancing has been removed. Instead, use the Ceph Manager balancer conguration option upmap_max_deviation which now is an integer number of PGs of deviation from the target PGs per OSD. You can set it with a following command:

cephadm@adm > ceph config set mgr mgr/balancer/upmap_max_deviation 2

The default upmap_max_deviation is 5. There are situations where CRUSH rules do not allow a pool to ever have completely balanced PGs, for example, if CRUSH requires one replica on each of three racks, but there are fewer OSDs in one of the racks. In those cases, the configuration value can be increased.

CephFS: forward scrub with multiple active Metadata Servers is now rejected. Scrub is currently only permitted on a file system with a single rank. Reduce the number of ranks to one via ceph fs set FS_NAME max_mds 1 .
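
For example, to reduce a file system to a single rank and verify the result (the file system name 'cephfs' is illustrative):

cephadm@adm > ceph fs set cephfs max_mds 1
cephadm@adm > ceph fs status cephfs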

Ceph will now issue a health warning if a RADOS pool has a pg_num value that is not a power of two. This can be fixed by adjusting the pool to an adjacent power of two:

cephadm@adm > ceph osd pool set POOL_NAME pg_num NEW_PG_NUM

Alternatively, you can silence the warning with:

cephadm@adm > ceph config set global mon_warn_on_pool_pg_num_not_power_of_two false

Nautilus 14.2.7 Point Release

This upstream release patched two security flaws:

CVE-2020-1699: a path traversal flaw in Ceph Dashboard that could allow for potential information disclosure.

CVE-2020-1700: a flaw in the RGW beast front-end that could lead to denial of service from an unauthenticated client.

In SES 6, these flaws were patched in Ceph version 14.2.5.382+g8881d33957b.

Nautilus 14.2.6 Point Release

This release xed a Ceph Manager bug that caused MGRs becoming unresponsive on larger clusters. SES users were never exposed to the bug.

Nautilus 14.2.5 Point Release

Health warnings are now issued if daemons have recently crashed. Ceph has been collecting crash reports since the initial Nautilus release, but the health alerts are new. To view new crashes (or all crashes, if you have just upgraded), run:

cephadm@adm > ceph crash ls-new

To acknowledge a particular crash (or all crashes) and silence the health warning, run:

cephadm@adm > ceph crash archive CRASH-ID
cephadm@adm > ceph crash archive-all

pg_num must be a power of two, otherwise HEALTH_WARN is reported. Ceph will now issue a health warning if a RADOS pool has a pg_num value that is not a power of two. You can fix this by adjusting the pool to a nearby power of two:

cephadm@adm > ceph osd pool set POOL-NAME pg_num NEW-PG-NUM

Alternatively, you can silence the warning with:

cephadm@adm > ceph config set global mon_warn_on_pool_pg_num_not_power_of_two false

Pool size needs to be greater than 1, otherwise HEALTH_WARN is reported. Ceph will issue a health warning if a RADOS pool’s size is set to 1, that is, if the pool is configured with no redundancy. Ceph will stop issuing the warning if the pool size is set to the minimum recommended value:

cephadm@adm > ceph osd pool set POOL-NAME size NUM-REPLICAS

You can silence the warning with:

cephadm@adm > ceph config set global mon_warn_on_pool_no_redundancy false

Health warning is reported if the average OSD heartbeat ping time exceeds the threshold. A health warning is now generated if the average OSD heartbeat ping time exceeds a configurable threshold for any of the computed intervals. The OSD computes 1-minute, 5-minute, and 15-minute intervals with average, minimum, and maximum values. A new configuration option, mon_warn_on_slow_ping_ratio , specifies a percentage of osd_heartbeat_grace to determine the threshold. A value of zero disables the warning. A new configuration option, mon_warn_on_slow_ping_time , specified in milliseconds, overrides the computed value and causes a warning when OSD heartbeat pings take longer than the specified amount. A new command, ceph daemon mgr.MGR-NUMBER dump_osd_network THRESHOLD , lists all connections with a ping time longer than the specified threshold or the value determined by the configuration options, for the average of any of the three intervals. A new command, ceph daemon osd.# dump_osd_network THRESHOLD , does the same as the previous one, but only includes heartbeats initiated by the specified OSD.

Changes in the telemetry MGR module. A new 'device' channel (enabled by default) will report anonymized hard disk and SSD health metrics to telemetry.ceph.com in order to build and improve device failure prediction algorithms. Telemetry reports information about CephFS file systems, including:

How many MDS daemons (in total and per file system).

Which features are (or have been) enabled.

How many data pools.

Approximate le system age (year and the month of creation).

How many les, bytes, and snapshots.

How much metadata is being cached.

Other miscellaneous information:

Which Ceph release the monitors are running.

Whether msgr v1 or v2 addresses are used for the monitors.

Whether IPv4 or IPv6 addresses are used for the monitors.

Whether RADOS cache tiering is enabled (and the mode).

Whether pools are replicated or erasure coded, and which erasure code profile plug-in and parameters are in use.

How many hosts are in the cluster, and how many hosts have each type of daemon.

Whether a separate OSD cluster network is being used.

How many RBD pools and images are in the cluster, and how many pools have RBD mirroring enabled.

How many RGW daemons, zones, and zonegroups are present and which RGW frontends are in use.

Aggregate stats about the CRUSH Map, such as which algorithms are used, how big buckets are, how many rules are defined, and what tunables are in use.

If you had telemetry enabled before 14.2.5, you will need to re-opt-in with:

cephadm@adm > ceph telemetry on

If you are not comfortable sharing device metrics, you can disable that channel first before re-opting-in:

cephadm@adm > ceph config set mgr mgr/telemetry/channel_device false
cephadm@adm > ceph telemetry on

You can first view exactly what information will be reported with:

cephadm@adm > ceph telemetry show          # see everything
cephadm@adm > ceph telemetry show device   # just the device info
cephadm@adm > ceph telemetry show basic    # basic cluster info

New OSD daemon command dump_recovery_reservations . It reveals the recovery locks held ( in_progress ) and waiting in priority queues. Usage:

cephadm@adm > ceph daemon osd.ID dump_recovery_reservations

New OSD daemon command dump_scrub_reservations . It reveals the scrub reservations that are held for local (primary) and remote (replica) PGs. Usage:

cephadm@adm > ceph daemon osd.ID dump_scrub_reservations

RGW now supports the S3 Object Lock set of APIs, allowing for a WORM model for storing objects. Six new APIs have been added: PUT/GET bucket object lock, PUT/GET object retention, and PUT/GET object legal hold.

RGW now supports List Objects V2, as specified at https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html .

Nautilus 14.2.4 Point Release

This point release xes a serious regression that found its way into the 14.2.3 point release. This regression did not aect SUSE Enterprise Storage customers because we did not ship a version based on 14.2.3.

Nautilus 14.2.3 Point Release

Fixed a denial of service vulnerability where an unauthenticated client of Ceph Object Gateway could trigger a crash from an uncaught exception.

Nautilus-based librbd clients can now open images on Jewel clusters.

The Object Gateway num_rados_handles option has been removed. If you were using a value of num_rados_handles greater than 1, multiply your current objecter_inflight_ops and objecter_inflight_op_bytes parameters by the old num_rados_handles value to get the same throttle behavior.
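
As a worked example, assume the gateway previously ran with num_rados_handles set to 4 and the defaults objecter_inflight_ops = 1024 and objecter_inflight_op_bytes = 104857600 (these figures are illustrative). Multiplying both by four gives 4096 and 419430400. One possible way to apply the new values is via the centralized configuration; alternatively, set them in the gateway's section of ceph.conf :

cephadm@adm > ceph config set client.rgw objecter_inflight_ops 4096
cephadm@adm > ceph config set client.rgw objecter_inflight_op_bytes 419430400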

The secure mode of Messenger v2 protocol is no longer experimental with this release. This mode is now the preferred mode of connection for monitors.

osd_deep_scrub_large_omap_object_key_threshold has been lowered to detect an object with a large number of omap keys more easily.

The Ceph Dashboard now supports silencing Prometheus notifications.

Nautilus 14.2.2 Point Release

The no{up,down,in,out} related commands have been revamped. There are now two ways to set the no{up,down,in,out} flags: the old command

ceph osd [un]set FLAG

which sets cluster-wide ags; and the new command

ceph osd [un]set-group FLAGS WHO

which sets ags in batch at the granularity of any crush node or device class.
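
For example, to prevent two specific OSDs from being marked out (the OSD IDs are illustrative) and to remove the flag again later:

cephadm@adm > ceph osd set-group noout osd.0 osd.1
cephadm@adm > ceph osd unset-group noout osd.0 osd.1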

radosgw-admin introduces two subcommands that allow managing expire-stale objects that might be left behind after a bucket reshard in earlier versions of Object Gateway. Expire-stale objects are expired objects that should have been automatically erased but still exist and need to be listed and removed manually. One subcommand lists such objects and the other deletes them.

Earlier Nautilus releases (14.2.1 and 14.2.0) have an issue where deploying a single new Nautilus BlueStore OSD on an upgraded cluster (that is, one that was originally deployed pre-Nautilus) breaks the pool utilization statistics reported by ceph df . Until all OSDs have been reprovisioned or updated (via ceph-bluestore-tool repair ), the pool statistics will show values that are lower than the true value. This is resolved in 14.2.2, such that

the cluster only switches to using the more accurate per-pool stats after all OSDs are 14.2.2 or later, are BlueStore, and have been updated via the repair function if they were created prior to Nautilus.

The default value for mon_crush_min_required_version has been changed from firefly to hammer , which means the cluster will issue a health warning if your CRUSH tunables are older than Hammer. There is generally a small (but non-zero) amount of data that will be re-balanced after making the switch to Hammer tunables. If possible, we recommend that you set the oldest allowed client to hammer or later. To display what the current oldest allowed client is, run:

cephadm@adm > ceph osd dump | grep min_compat_client

If the current value is older than hammer , run the following command to determine whether it is safe to make this change by verifying that there are no clients older than Hammer currently connected to the cluster:

cephadm@adm > ceph features
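
If the output confirms that no pre-Hammer clients are connected, you can then raise the oldest allowed client:

cephadm@adm > ceph osd set-require-min-compat-client hammer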

The newer straw2 CRUSH bucket type was introduced in Hammer. Verifying that all clients are Hammer or newer allows new features only supported for straw2 buckets to be used, including the crush-compat mode for the Balancer (Book “Administration Guide”, Chapter 21 “Ceph Manager Modules”, Section 21.1 “Balancer”).

Find detailed information about the patch at https://download.suse.com/Download?buildid=D38A7mekBz4~

Nautilus 14.2.1 Point Release

This was the rst point release following the original Nautilus release (14.2.0). The original ('General Availability' or 'GA') version of SUSE Enterprise Storage 6 was based on this point release.

Glossary

General

Admin node The node from which you run the cluster deployment commands (in SUSE Enterprise Storage 6, the DeepSea/Salt commands) to deploy Ceph on OSD nodes.

Bucket A point that aggregates other nodes into a hierarchy of physical locations.

Important: Do Not Confuse with S3 Buckets S3 buckets or containers are a different concept, referring to folders for storing objects.

CRUSH, CRUSH Map Controlled Replication Under Scalable Hashing: An algorithm that determines how to store and retrieve data by computing locations. CRUSH requires a map of the cluster to pseudo-randomly store and retrieve data in OSDs with a uniform distribution of data across the cluster.

Monitor node, MON A cluster node that maintains maps of cluster state, including the monitor map and the OSD map.

Node Any single machine or server in a Ceph cluster.

OSD Depending on context, Object Storage Device or Object Storage Daemon. The ceph-osd daemon is the component of Ceph that is responsible for storing objects on a local file system and providing access to them over the network.

OSD node A cluster node that stores data, handles data replication, recovery, backfilling, and rebalancing, and provides some monitoring information to Ceph monitors by checking other Ceph OSD daemons.

PG Placement Group: a sub-division of a pool, used for performance tuning.

Pool Logical partitions for storing objects such as disk images.

Routing tree A term given to any diagram that shows the various routes a receiver can run.

Rule Set Rules to determine data placement for a pool.

Ceph Specific Terms

Alertmanager A single binary which handles alerts sent by the Prometheus server and notifies the end user.

Ceph Storage Cluster The core set of storage software which stores the user’s data. Such a set consists of Ceph monitors and OSDs. AKA “Ceph Object Store”.

Grafana Database analytics and monitoring solution.

Prometheus Systems monitoring and alerting toolkit.

Object Gateway Specific Terms

archive module Module that enables creating an Object Gateway zone for keeping the history of S3 object versions.

Object Gateway The S3/Swift gateway component for Ceph Object Store.

B Documentation Updates

This chapter lists content changes for this document since the release of the latest maintenance update of SUSE Enterprise Storage 5. You can find changes related to cluster deployment that apply to previous versions in https://documentation.suse.com/ses/5.5/single-html/ses-deployment/#ap-deploy-docupdate . The document was updated on the following dates:

Section B.1, “Maintenance update of SUSE Enterprise Storage 6 documentation”

Section B.2, “June 2019 (Release of SUSE Enterprise Storage 6)”

B.1 Maintenance update of SUSE Enterprise Storage 6 documentation

Added a list of new features for Ceph 14.2.5 in the 'Ceph Maintenance Updates Based on Upstream 'Nautilus' Point Releases' appendix.

Suggested running rpmconfigcheck to prevent losing local changes in Section 6.5, “Per-Node Upgrade Instructions” (https://jira.suse.com/browse/SES-348 ).

Added Book “Tuning Guide”, Chapter 8 “Improving Performance with LVM cache” (https://jira.suse.com/browse/SES-269 ).

Added Chapter 13, SUSE Enterprise Storage 6 on Top of SUSE CaaS Platform 4 Kubernetes Cluster (https://jira.suse.com/browse/SES-720 ).

Added a tip on monitoring cluster nodes' status during upgrade in Section 6.6, “Upgrade the Admin Node” (https://bugzilla.suse.com/show_bug.cgi?id=1154568 ).

Synchronized the network recommendations and made them more specific in Section 2.1.1, “Network Recommendations” (https://bugzilla.suse.com/show_bug.cgi?id=1156631 ).

Added Section 6.5.2, “Node Upgrade Using the SUSE Distribution Migration System” (https://bugzilla.suse.com/show_bug.cgi?id=1154438 ).

Made the upgrade chapter sequential, Chapter 6, Upgrading from Previous Releases (https://bugzilla.suse.com/show_bug.cgi?id=1144709 ).

Added changelog entry for Ceph 14.2.4 (https://bugzilla.suse.com/show_bug.cgi?id=1151881 ).

Unied the pool name 'cephfs_metadata' in examples in Chapter 12, Installation of NFS Ganesha (https://bugzilla.suse.com/show_bug.cgi?id=1148548 ).

Updated Section 5.5.2.1, “Specification” to include more realistic values (https://bugzilla.suse.com/show_bug.cgi?id=1148216 ).

Added two new repositories for 'Module-Desktop', because our customers mostly use the GUI, in Section 6.5.1, “Manual Node Upgrade Using the Installer DVD” (https://bugzilla.suse.com/show_bug.cgi?id=1144897 ).

Noted that deepsea-cli is not a dependency of deepsea in Section 5.4, “DeepSea CLI” (https://bugzilla.suse.com/show_bug.cgi?id=1143602 ).

Added a hint to migrate ntpd to chronyd in Section 6.2.9, “Migrate from ntpd to chronyd” (https://bugzilla.suse.com/show_bug.cgi?id=1135185 ).

Added Book “Administration Guide”, Chapter 2 “Salt Cluster Administration”, Section 2.16 “Deactivating Tuned Profiles” (https://bugzilla.suse.com/show_bug.cgi?id=1130430 ).

Suggested considering migration of a whole OSD node in Section 6.13.3, “OSD Deployment” (https://bugzilla.suse.com/show_bug.cgi?id=1138691 ).

Added a point about migrating MDS names in Section 6.2.6, “Verify MDS Names” (https://bugzilla.suse.com/show_bug.cgi?id=1138804 ).

B.2 June 2019 (Release of SUSE Enterprise Storage 6)

GENERAL UPDATES

Added Section 5.5.2, “DriveGroups” (jsc#SES-548).

Rewrote Chapter 6, Upgrading from Previous Releases (jsc#SES-88).

Added Section 7.2.1, “Enabling IPv6 for Ceph Cluster Deployment” (jsc#SES-409).

Made BlueStore the default storage back-end (Fate#325658).

Removed all references to external online documentation, replaced with the relevant content (Fate#320121).

BUGFIXES

Added information about AppArmor during upgrade in Section 6.2.5, “Adjust AppArmor” (https://bugzilla.suse.com/show_bug.cgi?id=1137945 ).

Added a tip about orphaned packages in Section 6.5, “Per-Node Upgrade Instructions” (https://bugzilla.suse.com/show_bug.cgi?id=1136624 ).

Updated profile-* with role-storage in Tip: Deploying Monitor Nodes without Defining OSD Profiles (https://bugzilla.suse.com/show_bug.cgi?id=1138181 ).

Added Section 6.13, “Migration from Profile-based Deployments to DriveGroups” (https://bugzilla.suse.com/show_bug.cgi?id=1135340 ).

Added Section 6.8, “Upgrade Metadata Servers” (https://bugzilla.suse.com/show_bug.cgi?id=1135064 ).

Noted that the MDS cluster needs to be shrunk in Section 6.8, “Upgrade Metadata Servers” (https://bugzilla.suse.com/show_bug.cgi?id=1134826 ).

Changed conguration le to /srv/pillar/ceph/stack/global.yml (https:// bugzilla.suse.com/show_bug.cgi?id=1129191 ).

Updated various parts of Book “Administration Guide”, Chapter 29 “Exporting Ceph Data via Samba” (https://bugzilla.suse.com/show_bug.cgi?id=1101478 ).

Noted that master_minion.sls is gone in Section 5.3, “Cluster Deployment” (https://bugzilla.suse.com/show_bug.cgi?id=1090921 ).

Mentioned the deepsea-cli package in Section 5.4, “DeepSea CLI” (https://bugzilla.suse.com/show_bug.cgi?id=1087454 ).
