Implementation Guide

Service Provider Data Center For Disclosure under NDA Only

Scalable Cloud-based Services for Flexible, Robust, Cost-Effective Storage

Discover how to improve storage utilization, reduce cost and improve scalability with a Ceph*-based storage-as-a-service (STaaS) solution optimized for Intel® technology

Introduction

Storage-as-a-service (STaaS) uses software-defined storage (SDS) to abstract storage software from the storage hardware. By providing a shared pool of storage capacity that can be used across service offerings, SDS eliminates storage silos and helps improve utilization ratios. Intelligent, automated orchestration reduces operating costs and can speed provisioning from several weeks to a few minutes.

Ceph* is an open source STaaS solution that supports object, block and file storage. Using an open source solution can help you lower costs and avoid vendor lock-in. It also enables you to deploy new technology quickly. Ceph is well supported by the open source community, and commercial distributions are also available. Intel, the community and independent software vendors (ISVs) have worked closely to develop reference architectures and best practices for deploying a Ceph-based STaaS platform that is optimized to run on Intel® Xeon® processors and take advantage of other Intel® technologies, such as the Intel® SSD Data Center Family for NVMe* (Non-Volatile Memory Express*) solid state drives (SSDs), Intel® Ethernet products, and software optimizations such as the Intel® Intelligent Storage Acceleration Library (Intel® ISA-L) and Intel® Cache Acceleration Software (Intel® CAS).

This implementation guide provides key learnings and configuration insights to integrate these technologies with optimal business value. If you are responsible for technology decisions, you will learn how to implement a STaaS solution using Ceph, along with tips for optimizing performance with Intel technologies and best practices for deploying Ceph.



Table of Contents
• Introduction
• Solution Overview
  - Ceph Software
  - Node Types
  - Intel® Technologies
• System Requirements
  - Software Requirements
  - Minimum Hardware Requirements
• Installation and Configuration
  - Get Ceph
  - Install Ceph
  - Deploy Storage Clusters
  - Deploy Ceph Clients
  - Configuration Considerations
• Ceph Operation and Utilization
  - Ceph Topologies
  - Using Intel CAS
  - Using Intel ISA-L
  - Integrating with OpenStack
  - Orchestration
• Ceph-Based STaaS Use Cases
• Validation
• Ceph Best Practices
  - Planning
  - Nodes
  - Journaling
  - Network
• Summary
• References
• Appendix A: Ceph Tuning Details
• Solutions Proven by Your Peers

Solution Overview

A Ceph-based STaaS deployment consists of the Ceph software, several types of nodes (servers), and Intel software optimization products.

Ceph Software

Ceph's foundation is the Reliable Autonomic Distributed Object Store* (RADOS*), which provides your applications with object, block and file storage in a single unified storage cluster—making Ceph flexible, highly reliable and easy for you to manage. Each of your applications can use the object, block or file system interfaces to the same RADOS cluster simultaneously, which means your Ceph storage system serves as a flexible foundation for all of your data storage needs. You can use Ceph for free because it is open source, and deploy it on economical industry-standard hardware. Or you can opt for a commercially supported Ceph distribution if you prefer.

The various storage access modes use different components of Ceph:
• Object storage uses the Ceph Object Gateway daemon, radosgw* (RGW*).
• File storage (CephFS*) can use a Ceph filesystem kernel driver or the user-space FUSE* client.
• Block storage uses RADOS Block Devices (RBDs*).
• All storage is ultimately stored by Ceph Object Storage Daemons (Ceph OSDs).
• All data is stored as "objects," which are randomly distributed across the cluster by the CRUSH* (Controlled Replication Under Scalable Hashing*) algorithm.

To efficiently compute information about object placement and location, Ceph uses the CRUSH algorithm instead of a central lookup table. CRUSH enables Ceph performance to scale linearly by ensuring that data is always retrieved directly from the primary OSD where it is stored—avoiding bottlenecks created by centralized metadata lookups.

Node Types

As you build your Ceph cluster using the guidelines in this document, you will be working with several types of "nodes" (sometimes referred to as "hosts"). A node is simply any single machine or server in a Ceph system.
• Storage nodes (sometimes called "OSD nodes" or simply "OSDs") are where the actual data is stored.
• Monitor nodes track the health and configuration of the Ceph cluster by maintaining copies of the cluster maps.
• RGW nodes serve as HTTP proxies for object storage workloads.
• Metadata nodes map the directories and filenames from CephFS to objects stored within RADOS clusters.
• Client nodes request data.

See Figure 1 for an overview of how all these fit together into a Ceph cluster.
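The contrast between computed placement and a central lookup table can be sketched in a few lines of Python. This is an illustrative stand-in for CRUSH, not the real algorithm (CRUSH uses hierarchical buckets, weights and failure-domain rules), but it shows the key property: any client can independently compute an object's placement group (PG) and its ordered set of OSDs, with no central directory.

```python
import hashlib

def pg_for_object(name: str, pg_count: int) -> int:
    """Map an object name to a placement group by hashing (stable, no lookup table)."""
    digest = hashlib.md5(name.encode()).hexdigest()
    return int(digest, 16) % pg_count

def osds_for_pg(pg: int, osds: list, replicas: int) -> list:
    """Rank all OSDs by a per-(pg, osd) hash and take the top `replicas`.

    Mimics CRUSH's deterministic pseudo-random selection: every client that
    runs this computation gets the same primary and replica OSDs for a PG.
    """
    scored = sorted(
        osds,
        key=lambda osd: hashlib.md5(f"{pg}:{osd}".encode()).hexdigest(),
    )
    return scored[:replicas]

# Any client computes the same placement independently:
pg = pg_for_object("my-object", pg_count=128)
primary, *secondaries = osds_for_pg(pg, osds=list(range(12)), replicas=3)
```

Because placement is a pure function of the object name and the cluster map, adding a client never adds load to a metadata service, which is why this style of placement scales linearly.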

Intel® Technologies
Several Intel technologies, both hardware and software, contribute to the reliability and performance of a Ceph-based STaaS solution. See the "References" section for links.
• Intel® Xeon® processors and the Intel® Xeon® Scalable processor family provide the compute power needed to process vast amounts of data.
• Intel® SATA-based SSDs, Intel® NVMe*-based SSDs and Intel® Optane™ SSDs provide performance, stability, efficiency and low power consumption.


[Figure 1 shows the software elements of a Ceph cluster: clients (App, Host/VM) access storage through RadosGW*, RBD* or CephFS*, all built on librados and RADOS; monitors and metadata servers manage the cluster, and data lands on cluster nodes running OSDs, with objects mapped to placement groups (PGs) within pools.]

Figure 1 . Software Elements of a Ceph Cluster

• Intel® Cache Acceleration Software (Intel® CAS) improves CephFS performance—by as much as 550 percent for reads and 400 percent for writes, compared to running CephFS without Intel CAS.¹
• Intel® Intelligent Storage Acceleration Library (Intel® ISA-L) enables Ceph to more easily use features of Intel Xeon processors for tasks such as data protection, data integrity and data security. Intel ISA-L enables Ceph to offer erasure coding without negatively affecting performance.
• Intel® Ethernet Network Adapters, Controllers and Accessories give the data center the agility to deliver services efficiently and cost-effectively. Worldwide availability, exhaustive testing for compatibility, and 35 years of innovation make Intel Ethernet products an excellent choice when building your Ceph clusters.

System Requirements

Software Requirements
• Ceph software . The current version of Ceph is v12.2.4 Luminous RC. See Ceph Releases.
• Compatible OS . Intel recommends you deploy Ceph on one of the three major Linux* distributions: SUSE Enterprise Storage 4*, Ubuntu*, or CentOS*/Red Hat Enterprise Linux* (RHEL*). And you should deploy Ceph on the latest releases. The choice of a specific distribution is often determined by your existing install base or a preference for interfaces like sysvinit, upstart, or systemd. For full details, visit Ceph Dependencies.
• OpenStack* . Although optional, OpenStack integrates well with Ceph and extends its capabilities. See "Integrating with OpenStack" later in this document.

Minimum Hardware Requirements²
Ceph was designed to run on industry-standard hardware, which makes building and maintaining PB-scale data clusters economically feasible. When planning your cluster hardware, you will need to balance a number of considerations, including failure domains and potential performance issues. Hardware planning should include distributing Ceph daemons and other processes that use Ceph across many hosts. Generally, we recommend running Ceph daemons of a specific type on a host configured for that type of daemon. We recommend using other hosts for processes that use your data cluster (such as OpenStack or CloudStack*).


The hardware requirements of Ceph depend on the I/O workload. Use the following hardware recommendations as a starting point only. The recommendations in this section are on a per-process basis; if several processes are co-located on the same server, each process's CPU, RAM, disk and network requirements should be totaled.

CPU Requirements: Ceph metadata servers (MDS) dynamically redistribute their load, which is CPU-intensive. Therefore, your MDS should have significant processing power (for example, quad-core processors or better). Ceph OSDs run the RADOS service, calculate data placement with CRUSH, replicate data and maintain their own copy of the cluster map. Therefore, storage nodes should have a reasonable amount of processing power (for example, dual-core processors). Ceph monitors simply maintain a master copy of the cluster map, so they are not CPU-intensive. You must also consider whether the host machine will run CPU-intensive processes in addition to Ceph daemons. For example, if your hosts will run computing VMs (such as OpenStack Nova*), you will need to ensure that these other processes leave sufficient processing power for Ceph daemons. We recommend running additional CPU-intensive processes on separate hosts.

Memory: MDS and Ceph monitors must be capable of serving their data quickly, so they should have plenty of RAM (for example, 1 GB of RAM per daemon instance). OSDs do not require as much RAM for regular operations (500 MB of RAM per daemon instance); however, during recovery they need significantly more RAM (as much as 1 GB per 1 TB of storage per daemon). Generally, more RAM is better.

Data Storage: Plan your data storage configuration carefully. There are significant cost and performance tradeoffs to consider. Simultaneous OS operations and simultaneous requests for read and write operations from multiple daemons against a single drive can slow performance considerably. Storage nodes should have plenty of disk drive space for object data. We recommend a minimum drive size of 1 TB. Tables 1 and 2 provide some guidance on server chassis and HDD/SSD choices.

I/O Controllers: Disk controllers also have a significant impact on write throughput. Carefully consider your selection of disk controllers to ensure that they do not create a performance bottleneck.

Networks: We recommend a minimum of two dual-port NICs to account for a public (front-side) network and a private cluster (back-side) network. While a 10 GbE network may suffice for some small implementations, we recommend 25 GbE.

Other Hardware Considerations: Storage nodes should be bare metal, not virtualized, for disk performance reasons. Also, you should have a dedicated drive for the operating system (OS).
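Because co-located daemons' requirements must be totaled, a small helper can sanity-check a host design. The per-daemon RAM figures below are the guideline numbers from this section; the helper itself is only an illustrative sketch.

```python
# Guideline per-daemon RAM requirements from this section, in GB.
DAEMON_RAM_GB = {"mds": 1.0, "mon": 1.0, "osd": 0.5}

def host_ram_needed(daemons, osd_tb_per_daemon=0.0, recovery=False):
    """Total RAM (GB) for a host running the given list of daemon types.

    During recovery, each OSD may need up to 1 GB per 1 TB of storage it
    serves, on top of its regular requirement.
    """
    total = sum(DAEMON_RAM_GB[d] for d in daemons)
    if recovery:
        total += daemons.count("osd") * osd_tb_per_daemon * 1.0
    return total

# A host with 12 OSDs, each serving a 4 TB drive:
normal = host_ram_needed(["osd"] * 12)                      # 6.0 GB
worst = host_ram_needed(["osd"] * 12, 4.0, recovery=True)   # 54.0 GB
```

The gap between the normal and recovery figures is why "generally, more RAM is better": a node sized only for steady state can thrash exactly when the cluster is rebuilding.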

Table 1 . Server Chassis Guidance

Optimization Criterion | Small (250 TB-1 PB) | Medium (1-2 PB) | Large (2+ PB)
IOPS | 2 to 4 PCIe*/NVMe* slots or 8-24x 2.5" drive bays | Not typical | Not typical
Throughput | 24 to 36 3.5" bays | 12 to 16 3.5" bays | 24 to 36 3.5" bays
Capacity | — | — | 60 to 90 3.5" bays


Table 2 . Chassis Options for HDDs and SSDs

Priority | Typical Use Case | Configuration Details
IOPS | High performance | All-flash: NVMe-based Intel® SSDs with Intel® Optane™ technology journal/cache
Throughput | Performance | All-flash: U.2, SATA-based Intel® SSDs or 10K SAS HDDs with PCIe* journal/cache
Capacity | Capacity/Value | 7.2K SATA HDDs with NVMe-based Intel SSD journal/cache

Installation and Configuration
The steps to deploy your Ceph-based STaaS solution are fairly simple:
1. Get Ceph
2. Install Ceph
3. Deploy storage clusters
4. Deploy clients
Exactly how you accomplish these steps depends on the distribution of Ceph you are using (open source, Red Hat or SUSE). This implementation guide does not attempt to replicate the installation instructions for each distribution—instead, we provide several links you can use to access the materials each community has already published. If you run into problems, there is a robust user community that can provide suggestions; for example, Ceph's community channels are listed at https://ceph.com/irc/.

Get Ceph
Open Source: While you can pull the latest Ceph source code from the GitHub* repository, we believe that for the majority of users it is better to obtain a stable community production release. You can see a list of all active releases at Ceph Releases.
Red Hat Ceph Storage: You can get started downloading Red Hat's Ceph distribution here.
SUSE Enterprise Storage: SUSE's Ceph distribution is available here.

Install Ceph
Once you have the Ceph software, installing it is easy. Install packages on each Ceph node in your cluster. You may use ceph-deploy* to install Ceph for your storage cluster, or use distribution-specific package management tools (such as yum*, apt-get* and so on).
• To get complete install instructions for an open source install, see Ceph's Quick Install Guide and ceph-deploy instructions, and/or join the Ceph community.
• For Red Hat Ceph Storage, see Red Hat's Install Guide. You can also refer to the following Red Hat resources:
  - Red Hat's library of reference architectures
  - Red Hat customer portal and product documentation (general) and v1.3-specific documentation
• For SUSE Enterprise Storage, see SUSE's Ceph Install Guide. SUSE also provides an extensive library of Ceph resources.
Note: The ceph-helm* project enables you to deploy Ceph in a Kubernetes* environment. See Installation (Kubernetes + Helm) for more information.


Deploy Storage Clusters
If you are using the open source distribution, once you have completed Ceph installation, you can begin deploying a Ceph storage cluster using ceph-deploy. For details, see Storage Cluster Quick Start. You can use the Ceph librados* APIs or OpenStack APIs.

Deploy Ceph Clients
Complete client install instructions are available for Ceph Object Storage, Ceph Block Devices and CephFS. If you are using the open source distribution, see the following links for details:
• Object Storage
• Block Storage
• File Storage

Configuration Considerations
When configuring your Ceph nodes and clusters, it is important to consider your workload needs. Are they most sensitive to IOPS, throughput, latency or capacity? Are they mostly small-object or large-object? How many objects are there? Or are you using block storage instead? Tables 3 and 4 provide some scenarios and recommendations for RGW and RBD workloads, respectively. Table 5 provides some guidance on setting memory, placement group (PG) and journal values. Appendix A provides detailed tuning information. For additional guidance on architecting your Ceph cluster, see "Red Hat Ceph Storage on servers with Intel processors and SSDs."

Table 3 . Optimization Recommendations for RGW (Object) Workloads

Scenario Recommendations

Optimizing for Small Object Workloads (IOPS)
• Higher read ops can be achieved by adding more RGW hosts, until cluster disk saturation.
• Bucket index on flash media helps reduce read and write latencies and increases ops.
• Higher write ops can be achieved by adding more storage nodes, until more RGW hosts are needed.
• Use Intel® Cache Acceleration Software (Intel® CAS) to cache data for increased throughput and reduced latency.

Optimizing for Large Object Workloads (Throughput in MBps)
• Higher read throughput can be achieved by adding more RGW hosts, until cluster disk saturation.
• Higher write throughput can be achieved by adding more storage nodes, together with multiple RGWs.
• Increasing rgw_max_chunk_size to 4M helps reduce write I/O requests to disks.
• Use Intel CAS to cache data for even higher throughput and lower latency.

Optimizing for High Object Density (>100 Million Objects)
• Use Intel CAS to cache filesystem metadata (and, optionally, data) to significantly reduce seek latency and increase throughput.

Optimizing the Number of RGW Nodes (see https://www.redhat.com/cms/managed-files/st-ceph-storage-qct-object-storage-reference-architecture-f7901-201706-v2-en.pdf for more details)
• For a 100 percent large object (32 MB) write workload: 1 x RGW host with 10 GbE NIC for every 100 OSDs (HDDs with journals on flash) in erasure code (EC) pool configuration.
• For a 100 percent small object (64 KB) write workload: 1 x RGW host with 10 GbE NIC for every 50 OSDs (HDDs with journals on flash).
• EC pool and bucket index on flash.
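Several of the recommendations above correspond to ceph.conf settings. The fragment below is a sketch only: the values are the starting points discussed in this guide rather than defaults, and section names such as [client.rgw] should be adapted to your gateway instance names and verified against your Ceph release's documentation.

```ini
[global]
# Journal sizing per Table 5: 2 * expected bandwidth * filestore max sync interval
osd journal size = 10240

[osd]
filestore max sync interval = 10

[client.rgw]
# Larger chunks reduce write I/O requests to disk for large-object workloads
rgw max chunk size = 4194304
```

After editing, push the configuration to each node and restart the affected daemons; a misapplied setting here affects every OSD or gateway that reads it.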


Table 4 . All-Flash Recommendations for RBD (Block) Workloads

Scenario Recommendations

Optimizing for IOPS
• Ceph RBD (block) pools
• OSDs on SSDs: 4x Intel® SSD DC P4500 Series per server; dual-socket Intel® Xeon® Gold 6152 processor
• Two OSDs per NVMe-based Intel SSD; up to six NVMe-based Intel SSDs or 12 OSDs per server
• Data protection: replication (2x on SSD-based OSDs) with regular backups to the object storage pool
• All-NVMe cluster: Intel SSD DC P4800X Series as the journal/metadata/caching drive for highest IOPS and lowest tail latency

Optimizing for Throughput
• Ceph RBD (block) or Ceph RGW (object) pools
• OSDs on HDDs:
  - Good: Write journals on Intel SSD DC S4600 Series 480 GB drives, with a ratio of 4-5 HDDs to each SSD
  - Better: Write journals on Intel SSD DC P4600 Series 1 TB NVMe-based drives, with a ratio of 12-18 HDDs to each SSD
  - Best: Write journals and Intel Cache Acceleration Software (Intel CAS) on an Intel SSD DC P4600 Series 2 TB
• One CPU core-GHz per OSD. For example:
  - 12 HDD-based OSDs per server: dual-socket Intel Xeon Silver 4110 processor (8 cores per socket @ 2.1 GHz)
  - 36 HDD-based OSDs per server: dual-socket Intel Xeon Gold 6138 processor (20 cores per socket @ 2.0 GHz)
  - 60 HDD-based OSDs per server: dual-socket Intel Xeon Gold 6130 processor (16 cores * 2.1 GHz * 2 sockets)
• Data protection: replication (read-intensive or mixed read/write) or erasure-coded (write-intensive; more CPU cores recommended)
• High-bandwidth networking: greater than 10 GbE for servers with more than 12-16 drives

Optimizing for Capacity
• Ceph RGW (object) pools
• OSDs on HDDs, with write journals co-located on HDDs in a separate partition
• One CPU core-GHz per OSD. See the throughput-optimized section above for examples; Intel Xeon E3 processor-based servers would suffice.
• Data protection: erasure coding


Table 5 . Ceph Sizing Formulas

Setting | Formula
Memory | 16 GB minimum: (2 GB * #OSD) + 10 GB (that is, 1 GB per 1 TB of storage)
PG Sum | (#OSD * 100) / max_rep_count
PG/Pool | ((#OSD * 100) / max_rep_count) / NB_POOLS
journal_size | 2 * (expected BW-throughput * filestore_max_sync_interval)
num_journal | SSD_seq_write_speed / HDD_seq_write_speed

PG = placement group. PG calculator: https://ceph.com/pgcalc/
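The formulas in Table 5 translate directly into a small calculator. This is a transcription of the table's formulas (with the memory rule generalized to 1 GB per TB of storage), not official Ceph tooling:

```python
def memory_gb(num_osds: int, storage_tb: float) -> float:
    """16 GB minimum; otherwise 2 GB per OSD plus 1 GB per TB of storage."""
    return max(16, 2 * num_osds + storage_tb)

def pg_sum(num_osds: int, max_rep_count: int) -> int:
    """Total placement groups for the cluster."""
    return num_osds * 100 // max_rep_count

def pg_per_pool(num_osds: int, max_rep_count: int, num_pools: int) -> int:
    """Placement groups per pool."""
    return pg_sum(num_osds, max_rep_count) // num_pools

def journal_size_mb(expected_bw_mbps: float, sync_interval_s: float) -> float:
    """Journal size: 2 * expected throughput * filestore max sync interval."""
    return 2 * expected_bw_mbps * sync_interval_s

def num_journals_per_ssd(ssd_seq_write_mbps: int, hdd_seq_write_mbps: int) -> int:
    """How many HDD journals one SSD can absorb, by sequential-write ratio."""
    return ssd_seq_write_mbps // hdd_seq_write_mbps

# Example: 60 OSDs, 3x replication, 4 pools
total_pgs = pg_sum(60, 3)            # 2000
pgs_each = pg_per_pool(60, 3, 4)     # 500
```

For production clusters, cross-check the PG numbers against the PG calculator linked above, which also rounds to powers of two.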

Ceph Operation and Utilization
In this section, we discuss the different topologies for block, object and file storage, as well as provide details about adjacent technologies such as Intel CAS, Intel ISA-L, OpenStack and orchestration frameworks.

Ceph Topologies
Whether you are using Ceph's object, block or file storage capabilities, the actual Ceph cluster is based on the same standard spine/leaf Ethernet infrastructure shown in Figure 2. For high availability, the physical topology uses redundant connections to dual top-of-rack (TOR) switches.

[Figure 2 shows the network architecture: a spine/leaf LAN fabric with dual top-of-rack (TOR) switches per rack, monitor/proxy nodes, an admin/control node, racks of storage nodes connected via bonded interfaces, and the MDS on a dedicated server.]

Figure 2 . Standard physical Clos architecture with redundant switches


Figures 3 and 4 illustrate the logical topologies for CephFS/RBD (file and block) and RGW (object) storage. Note that RBD clients can be either bare metal or virtualized. For the RBD topology, the clients connect directly to the Ceph Monitor nodes and the storage cluster; for the RGW topology, load balancers serve as proxies for the S3 clients. As you can see in these diagrams, the cluster never changes—only the way in which you access the cluster.

[Figure 3 shows the CephFS and RBD cluster topology: CephFS client nodes (kernel driver, FUSE or library/librados) and RBD client nodes (QEMU*/librbd) connect over the client/monitor network to the storage nodes, each running multiple OSDs; a separate replication network carries inter-OSD traffic, with journals and cache on dedicated devices.]

Figure 3 . CephFS and block storage (RBD) topology

[Figure 4 shows the object storage cluster topology: S3 client nodes connect through RGW load balancers, which proxy requests over the client/monitor network to the storage nodes running OSDs; as in Figure 3, a separate replication network carries inter-OSD traffic, with journals and cache on dedicated devices.]

Figure 4 . Object storage (RGW) topology


Using Intel® CAS
Intel CAS (Figure 5) increases storage performance via intelligent caching and is designed to work with high-performance Intel® SSDs. Unlike inefficient conventional caching techniques that rely solely on data temperature, Intel CAS employs a classification-based solution that intelligently prioritizes I/O. This unique capability allows you to further optimize your storage performance based on I/O types (for example, data versus metadata), size and additional parameters. The advantage of this approach is that logical block-level storage volumes can be configured with multiple performance requirements in mind. For example, the filesystem journal could receive a different class of service than regular file data, allowing workloads to be better tuned for specific applications. Efficiencies can be increased incrementally as higher-performing NVMe-based Intel SSDs are used for data caching.

To use Intel CAS, first install it, then use the configuration utility or manual configuration to tune it to your needs. See the "Intel® Cache Acceleration Software for Linux* v3.5.1" administration guide for details; in particular, Chapter 14 discusses installing and using Intel CAS in a Ceph cluster. You can download a 120-day free trial of the software and the Admin Guide.

[Figure 5 shows Intel CAS in a Ceph cluster: clients reach the Ceph Object Gateway and RADOS monitors over the public network; on each storage node, Intel CAS provides journal and cache acceleration in front of the OSDs, with replication traffic on the cluster network.]

Figure 5 . Intel® Cache Acceleration Software (Intel® CAS) improves caching performance on your Ceph cluster

Using Intel® ISA-L
Intel ISA-L provides tools to maximize storage throughput, security and resilience, as well as to minimize disk space usage. Intel ISA-L is highly optimized for Intel® architecture and is natively designed for SDS applications. It can run on multiple generations of Intel® processors. To use Intel ISA-L, first create an ISA profile (see Ceph's ISA erasure code plugin page for details). Then use that profile when you create an ISA erasure-coded pool, as described at Ceph's Pools page. Another resource you may find helpful is the Ceph Erasure Coding Introduction from Intel.
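Erasure coding, which Intel ISA-L accelerates, stores k data chunks plus m coding chunks so the data survives the loss of up to m chunks while consuming far less space than full replication. The smallest case (k=2, m=1) reduces to XOR parity; the toy Python sketch below illustrates only the recovery idea, not ISA-L's SIMD-optimized Reed-Solomon implementation.

```python
def encode(a: bytes, b: bytes) -> bytes:
    """Parity chunk: byte-wise XOR of the two equal-length data chunks (k=2, m=1)."""
    return bytes(x ^ y for x, y in zip(a, b))

def recover(survivor: bytes, parity: bytes) -> bytes:
    """Rebuild the lost data chunk from the surviving chunk and the parity.

    Works because XOR is its own inverse: a ^ b ^ b == a.
    """
    return bytes(x ^ y for x, y in zip(survivor, parity))

a, b = b"hello wo", b"rld!!!!!"
p = encode(a, b)
assert recover(b, p) == a   # lose chunk a, rebuild it from b and parity
assert recover(a, p) == b   # lose chunk b, rebuild it from a and parity
```

A 2+1 scheme stores 1.5x the data instead of 3x for triple replication, which is the storage-efficiency argument for erasure-coded pools; the cost is extra CPU on every write and recovery, which is exactly what ISA-L offloads.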

Integrating with OpenStack
While Ceph does not require OpenStack (or vice versa), the two software-defined solutions complement each other because they both support all three storage types: object, block and file. And because they are both open source projects, they often benefit from integration and cross-project development. Figure 6 shows the integration points between OpenStack and Ceph. You can read more about how to integrate the two systems in this OpenStack Superuser article.


[Figure 6 shows the OpenStack integration points with Ceph: OpenStack storage services consume Ceph through its standard interfaces, with block services backed by RBD and file services backed by CephFS.]

Figure 6 . OpenStack and Ceph complement each other's object, block and file storage capabilities

Orchestration
Ceph supports most major orchestration and deployment frameworks:
• ceph-ansible*
• ceph-salt*
• ceph-chef*
• Ceph Juju*
• ceph-puppet*
• crowbar-ceph*
If you're already using a provisioning tool, such as Chef* or Puppet*, you can continue to use it with Ceph. If you don't already have such a tool, we recommend Ansible.

Ceph-Based STaaS Use Cases
As mentioned previously, Ceph supports three storage access methods:
• Block . The application or virtual machine (VM) accesses storage in blocks of a fixed size.
• Object . Typical of most new applications—data puts and gets are of any size.
• File . Individual data records of fixed size are read and written to a named file.
Whatever the storage access method, workloads can be categorized in a couple of ways. The first is by tiering: cold storage is not at all latency-sensitive, while hot storage requires very low latency. Table 6 shows typical tiering use cases. You can find out more on Ceph's Use Cases and Reference Architectures page.


Table 6 . Mapping Storage Tiers to Use Cases

Storage Tier | Example Use Cases
Cold | Backup; disaster recovery; archival and logs
Warm | Video surveillance; primary storage; media sync and video sharing
Hot | Collaboration and file sharing; databases; content delivery networks (CDNs) and Web hosting

Another way to categorize workloads is by optimization criteria—IOPS, throughput or capacity. Depending on the use case and optimization criteria, we recommend different cluster storage configurations, as described in Table 7.

Table 7 . Cluster Configuration Recommendations per Optimization Criteria

Optimization Recommended Example Use Properties Criteria Configuration Cases

• Lowest cost per IOPS • Highest IOPS • Typically block storage MySQL* on IOPS • Meets minimum fault domain • 3x replication on HDDs or OpenStack recommendation (single 2x replication on Intel® SSD clouds server is less than or equal to DC Series 10 percent of the cluster)

• Lowest cost per given unit of throughput • Highest throughput • Block or object storage • Highest throughput per BTU • 3x replication Streaming Throughput • Highest throughput per watt media • Active performance storage • Meets minimum fault domain for video, audio and images recommendation (single server is less than or equal to 10 percent of the cluster)

• Lowest cost per TB

• Lowest BTU per TB • Typically object storage Video, audio • Lowest watt per TB • Erasure coding common for and image Capacity • Meets minimum fault domain maximizing usable capacity object archive repositories recommendation (single • Object archive server is less than or equal to 15 percent of the cluster)


Validation
To minimize the risk of introducing SDS, it's a good idea to run a pilot project using a few servers so you can evaluate the architecture and see how easily you can operate and support it. The pilot project also presents an opportunity to gather data on operating costs that you can use for modeling your business case, and performance data you can use for infrastructure planning, including consolidation ratios from deduplication. During the pilot, the team will have an opportunity to acquire skills in managing SDS and will be able to assess the likely learning curve. You can start by introducing SDS for particular compute environments and gradually expand its reach in your infrastructure by scaling it up and shifting other workloads into the new infrastructure over time.

Ceph Best Practices
As Intel has worked with cloud service providers (CSPs) on their Ceph-based STaaS implementations, we have gathered a number of best practices. (See also "Minimum Hardware Requirements" earlier.) Here are our top three tips, followed by more detail on specific topics.
• Pick an architecture that someone else has implemented and benchmarked with a workload similar to yours, and/or conduct a small proof of concept with your configuration to prove your architecture.
• Either have an ISV install your Ceph cluster, or, if you do it yourself, read "Red Hat Ceph Storage on servers with Intel processors and SSDs." Be ready to seek ISV or open source community help.
• Gear up your production slowly. Most of the Ceph learning curve is in knowing how to respond to often-cryptic error messages.

Planning
As a rule of thumb, a knowledgeable person can implement a good-sized Ceph cluster in less than a week. But it may take several weeks to decide on the best configuration for your situation and needs. Plan on spending some time studying reference architectures and talking to ISVs or people in the open source community.

When planning your hardware needs, balance cost reduction against resiliency. If you place too many responsibilities into too few failure domains, you increase the risk of downtime. On the other hand, isolating every potential failure domain can drive up costs. (A failure domain is anything whose failure prevents access to one or more OSDs. Examples include a stopped daemon on a host, an HDD failure, an OS crash, a malfunctioning NIC, a failed power supply, a network outage or a power outage.)

Plan ahead for scaling—poorly managed cluster growth can result in lower client I/O performance. Follow published reference architectures, such as "Build an Open Storage-as-a-Service Platform with Ceph*."

After you have implemented Ceph, it may take up to six months for a small team to learn all the intricacies of Ceph and how to respond when performance does not meet expectations. But you don't have to do it alone—you can take advantage of a substantial amount of information available from the Ceph community, as well as recorded sessions from the OpenStack Summit and the Red Hat Ceph Days.

Other tips include:
• Bolster your staff's Linux skills and streamline operational practices.
• Be prepared to test and tune; the default values are rarely optimal for any given configuration.
• Avoid the temptation to pull the most recent Ceph code from the open source repository. It is far better to use a stable community production release or a commercial distribution from Red Hat or SUSE.

Nodes
For monitor nodes, log rotation is a good practice that can help prevent available disk space from being blindly consumed. This is especially relevant if verbose debugging output is set on the monitors, since they will generate a large amount of logging information. In most situations, monitors should be run on distinct nodes or on virtual machines (VMs) that reside on physically separate machines to prevent a single point of failure.


For storage, no node should represent more than 10 percent of the total raw storage; this dictates the size and type of storage node. In general, hosting multiple OSD processes on a single NVMe-based Intel SSD is recommended in order to fully utilize the drive's available I/O bandwidth; four OSD processes per drive is the recommended number. Run the OS, OSD data and OSD journals on separate drives.
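The 10 percent guideline above can be checked mechanically when sizing a cluster. The sketch below is illustrative only; the node names and capacities are hypothetical, not from this guide:

```python
# Sketch of the "no node holds more than 10 percent of total raw storage"
# guideline. Node names and per-node capacities are hypothetical.
def oversized_nodes(capacities_tb, limit=0.10):
    """Return the nodes whose share of total raw capacity exceeds `limit`."""
    total = sum(capacities_tb.values())
    return [node for node, tb in capacities_tb.items() if tb / total > limit]

raw_tb = {f"node{i:02d}": 32 for i in range(1, 12)}  # 11 nodes x 32 TB each
raw_tb["node12"] = 64                                # one denser node
print(oversized_nodes(raw_tb))  # ['node12'] -> 64/416, about 15% of raw capacity
```

Note that the rule implicitly favors scale-out: a cluster of fewer than ten equally sized nodes cannot satisfy it, since each node necessarily holds more than 10 percent.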

Journaling
Write journals should be located on a separate SSD. The recommended journal-to-HDD ratio is four or five HDDs per journal SSD. Use the sizing recommendations from Table 4.
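Translating that ratio into drive counts is a simple ceiling division. The helper below is an illustrative sketch (the function name and drive counts are assumptions, not from this guide):

```python
import math

def journal_ssds_needed(hdd_count, hdds_per_journal=5):
    """SSD journals needed at the guide's four-or-five-HDDs-per-journal ratio."""
    return math.ceil(hdd_count / hdds_per_journal)

print(journal_ssds_needed(20))     # 4 journal SSDs at a 5:1 ratio
print(journal_ssds_needed(20, 4))  # 5 journal SSDs at a 4:1 ratio
```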

Network
As mentioned earlier, using separate dual-port NICs for the public (client) network and the private cluster (replication) network provides optimal results. While a 10 GbE network may suffice for some small implementations, we recommend 25 GbE. Poor network configuration can nullify even the best storage configuration. We also recommend using jumbo frames, especially on the replication network.
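As an illustrative sketch of enabling and verifying jumbo frames on a Linux host (the interface name and peer address below are placeholders, not from this guide):

```shell
# Raise the MTU to 9000 on the cluster (replication) interface.
# "ens785f1" and the peer IP are placeholders for your environment.
ip link set dev ens785f1 mtu 9000

# Verify jumbo frames end to end: 8972 bytes of payload = 9000-byte MTU
# minus 28 bytes of IP + ICMP headers; -M do forbids fragmentation.
ping -M do -s 8972 -c 3 192.168.xxx.2
```

Make the MTU change persistent through your distribution's network configuration, and set it consistently on every host and switch port on the replication network; a single mismatched MTU can cause hard-to-diagnose stalls.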

Summary
You are facing spiraling data volumes, inefficient storage appliances and increasing demand from customers for cost-effective unified storage. Software-defined STaaS solutions can help solve these challenges by abstracting the storage software from the storage hardware. A STaaS solution based on Ceph supports object, block and file storage in a single package. This implementation guide shows how to combine Ceph with Intel technologies such as NVMe-based Intel SSDs, Intel Ethernet Products, Intel Xeon processors and software optimization through Intel ISA-L and Intel CAS. Using these guidelines, you can build a Ceph-based STaaS solution that provides the performance you need to successfully compete in the fast-growing STaaS market.

References

Red Hat:
• Red Hat Ceph Storage home page
• Red Hat's Install Guide
• Product Documentation for Red Hat Ceph Storage
• Red Hat Ceph Documentation Library (specifically for v1.3)
• Red Hat Architecture Guide
• Library of Ceph and * reference architectures
• Red Hat Ceph Storage: Scalable Object Storage on QCT Servers

SUSE:
• SUSE Enterprise Storage 5 home page
• SUSE's Ceph Install Guide
• Resource Library: SUSE Enterprise Storage

Ceph Open Source Community:
• Ceph Quick Install Guide
• Ceph-deploy Installation Guide
• Storage Cluster Quick Start
• Ceph Object Gateway Quick Start
• Ceph Block Device Quick Start
• CephFS Quick Start


• Ceph Community
• Ceph Email Lists
• Ceph Use Cases and Reference Architectures
• Ceph Operating System Recommendations
• Installation (Kubernetes + Helm)

Intel:
• Red Hat Ceph Storage on servers with Intel processors and SSDs configuration guide
• Intel® Cloud Insider Program
• Intel® Storage Builders Program
• Intel® Xeon® Scalable processors
• Intel® Solid State Drives
• Intel® Ethernet products
• Intel® Cache Acceleration Software (Intel® CAS)
• Intel® Intelligent Storage Acceleration Library (Intel® ISA-L)

Appendix A: Ceph Tuning Details
When we benchmarked our high-performance reference architecture for Ceph, we used the tunings shown in the following four code snippets. You can use these same best-practice tuning recommendations to achieve similar results.

Code Snippet 1: Ceph Benchmark Tool (CBT) yaml File (Cluster Section)

cluster:
  user: "root"
  head: "node01"
  clients: ['client01', 'client02', 'client03', 'client04', 'client05', 'client06']
  osds: ["node01", "node02", "node03", "node04", "node05", "node06"]
  mons:
    node01:
      a: "192.168.xxx.xxx:6789"
  mgrs:
    client01:
      a: '192.168.xxx.xxx:6789'
  osds_per_node: 16
  fs: bluefs
  mkfs_opts: '-f -i size=2048'
  mount_opts: '-o inode64,noatime,logbsize=256k'
  conf_file: '/etc/ceph/ceph.conf'
  use_existing: True
  rebuild_every_test: False
  clusterid: "ceph"
  iterations: 1
  tmp_dir: "/tmp/cbt"
  pool_profiles:
    2rep:
      pg_size: 8192
      pgp_size: 8192
      replication: 2


Code Snippet 2: Ceph Benchmark Tool (CBT) yaml File (Benchmarks Section)

benchmarks:
  librbdfio:
    time: 300
    ramp: 100
    vol_size: 307200
    mode: ['randrw']
    rwmixread: [0, 70, 100]
    op_size: [4096]
    procs_per_volume: [1]
    volumes_per_client: [10]
    use_existing_volumes: True
    iodepth: [1, 2, 4, 8, 16, 32, 64, 128]
    osd_ra: [128]
    norandommap: True
    cmd_path: '/usr/local/bin/fio'
    pool_profiles: '2rep'
    log_avg_msec: 250
    numjobs: 1
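With the cluster and benchmarks sections above combined into one yaml file, a CBT run can be launched as sketched below (the file and archive-directory names are placeholders, not from this guide):

```shell
# Fetch the Ceph Benchmark Tool and run the combined yaml file.
# "runbook.yaml" and the results directory are placeholder names.
git clone https://github.com/ceph/cbt.git
cd cbt
./cbt.py --archive=/tmp/cbt-results ../runbook.yaml
```

CBT drives fio against the cluster described in the yaml file and stores per-iteration results under the archive directory for later comparison.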

Code Snippet 3: ceph.conf file (1)

[global]
osd objectstore = bluestore
public_network = 192.168.xxx.0/24
cluster_network = 192.168.xxx.0/24
rbd readahead disable after bytes = 0
rbd readahead max bytes = 4194304
bluestore default buffered read = false
mon_allow_pool_delete = true
auth client required = none
auth cluster required = none
auth service required = none
filestore xattr use omap = true
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0


Code Snippet 4: ceph.conf file (2)

perf = true
mutex_perf_counter = true
throttler_perf_counter = false
rbd cache = false
rbd_cache_writethrough_until_flush = false
rbd_op_threads = 2
osd scrub load threshold = 0.01
osd scrub min interval = 137438953472
osd scrub max interval = 137438953472
osd deep scrub interval = 137438953472
osd max scrubs = 16
log file = /var/log/ceph/$name.log
log to syslog = false
mon compact on trim = false
osd pg bits = 8
osd pgp bits = 8

[mon]
mon_max_pool_pg_num = 166496
mon_osd_max_split_count = 10000
mon_pg_warn_max_per_osd = 10000
mon pg warn max object skew = 100000
mon pg warn min per osd = 0
mon pg warn max per osd = 32768
osd_crush_chooseleaf_type = 0

[osd]
osd_op_num_shards = 8
osd_op_num_threads_per_shard = 2
objecter_inflight_ops = 102400
ms_dispatch_throttle_bytes = 1048576000
objecter_inflight_op_bytes = 1048576000

Solutions Proven by Your Peers
A STaaS platform, powered by Intel® technology, provides high performance and easy manageability. This and other solutions are based on real-world experience gathered from customers who have successfully tested, piloted, and/or deployed the solutions in specific use cases. The solutions architects and technology experts for this solution implementation guide include:
• Daniel Ferber, Solutions Architect, Intel Sales & Marketing Group
• Tushar Gohad, Principal Engineer, Intel Data Center Group
• Orlando Moreno, Performance Engineer, Intel Data Center Group
• Karl Veitmeier, Solutions Architect, Intel Sales & Marketing Group
Intel Solutions Architects are technology experts who work with the world's largest and most successful companies to design business solutions that solve pressing business challenges.

Find the solution that is right for your organization. Contact your Intel representative or visit intel.com/CSP


Solution Provided By:

1 Red Hat, 2017, "Red Hat Ceph Storage: Scalable Object Storage on QCT Servers." https://www.redhat.com/cms/managed-files/st-ceph-storage-qct-object-storage-reference-architecture-f7901-201706-v2-en.pdf
2 http://docs.ceph.com/docs/jewel/start/hardware-recommendations/

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks

Performance results are based on testing as of August 2017 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure.

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer, or learn more at intel.com.

Copyright © 2018 Intel Corporation. All rights reserved. Intel, Xeon, Optane and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others. 0918/JOPD/CAT/PDF 337763-001EN
