Bring Ceph to the Enterprise: Set up a 50TB mobile cluster in 30 minutes

Alex Lau (劉俊賢) Software Consultant [email protected]

How to access Ceph storage? Introduction to iSCSI

[Diagram: SUSE Enterprise Storage 3 overview: management node and monitor nodes; object, block and file storage access via the RADOS gateway RESTful API and iSCSI (heterogeneous OS access); remote cluster replication; data encrypted at rest]

SES3 is the first commercially available iSCSI access to Ceph storage. It allows clients to access Ceph remotely over the TCP/IP protocol. SES3 provides an iSCSI target driver on top of RBD (RADOS Block Device), so any iSCSI initiator can access SES3 over the network.

iSCSI Architecture Technical Background

Protocol:
‒ Block storage access over TCP/IP
‒ Initiators: the clients that access an iSCSI target over TCP/IP
‒ Targets: the systems that provide access to a local block device
SCSI and iSCSI:
‒ iSCSI encapsulates SCSI commands and responses
‒ Each iSCSI TCP packet carries a SCSI command
Remote access:
‒ An iSCSI initiator can access a remote block device like a local disk
‒ Attach and format it with XFS, Btrfs, etc.
‒ Booting directly from an iSCSI target is supported
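From the initiator side this is a standard iSCSI setup. A minimal open-iscsi sketch follows; the gateway address, IQN and device name are placeholders, not values from a real SES3 deployment:

# Discover targets exported by the SES3 iSCSI gateway
iscsiadm -m discovery -t sendtargets -p 192.168.100.10
# Log in to the discovered target
iscsiadm -m node -T iqn.2016-06.org.example:demo -p 192.168.100.10 --login
# The RBD-backed LUN now appears as a local SCSI disk, e.g. /dev/sdb
mkfs.xfs /dev/sdb && mount /dev/sdb /mnt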

Before iSCSI RBD support …

[Diagram: OSD1–OSD4 behind the public network; a target system maps an RBD block device and exports it through LIO as iSCSI to an initiator system]

Before iSCSI support, what’s wrong? Missing features

LIO over RBD:
‒ No support for “atomic compare and write”
‒ No support for “persistent group reservations”
iSCSI:
‒ Active/active multipath (MPIO) is not supported
‒ Supporting all of these in the block layer requires a different approach

Benefits of the iSCSI LIO gateway for RBD

Multi-platform access to Ceph:
‒ Clients do not need to be part of the cluster, just like with radosgw
Standard iSCSI interface:
‒ Most operating systems support iSCSI
‒ The open-iscsi initiator is available in most distributions
LIO (Linux IO Target):
‒ In-kernel target implementation
Flexible configuration:
‒ The targetcli utility is available alongside lrbd

Configuring the RBD iSCSI gateway: Introduction to lrbd

Easy setup:
‒ Bundled with the iSCSI packages since SES 2.0
‒ Multi-node configuration support with targetcli
Technical background:
‒ JSON configuration format (sketched below)
‒ Targets, portals, pools, auth
‒ Configuration state is stored in the Ceph cluster
Related links:
‒ https://github.com/swiftgist/lrbd
‒ https://github.com/swiftgist/lrbd/wiki
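As an illustration only, an lrbd configuration is built from the four sections listed above. The field names below are a rough sketch from memory, and the IQN, hostname, portal address, pool and image are invented; treat the lrbd wiki as the authoritative reference.

# Sketch of an lrbd JSON configuration; check the lrbd wiki for the exact schema.
cat > /tmp/lrbd-demo.json <<'EOF'
{
  "auth": [
    { "target": "iqn.2016-06.org.example:demo", "authentication": "none" }
  ],
  "targets": [
    { "target": "iqn.2016-06.org.example:demo",
      "hosts": [ { "host": "igw1", "portal": "portal-east" } ] }
  ],
  "portals": [
    { "name": "portal-east", "addresses": [ "192.168.100.10" ] }
  ],
  "pools": [
    { "pool": "rbd",
      "gateways": [
        { "target": "iqn.2016-06.org.example:demo",
          "tpg": [ { "image": "demo-image" } ] }
      ] }
  ]
}
EOF
# lrbd keeps this state in the Ceph cluster and programs LIO on the gateways;
# typically `lrbd -e` edits the stored configuration and `lrbd` with no
# arguments applies it (see the wiki links above for the exact usage).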

iSCSI Gateway Optimizations

Efficient handling of certain SCSI operations:
‒ Offload RBD image IO to the OSDs
‒ Avoid locking on the iSCSI gateway nodes
Compare and Write:
‒ New cmpext OSD operation to handle RBD data comparison
‒ Dispatched as a compound cmpext+write OSD request
Write Same:
‒ New writesame OSD operation to expand duplicate data at the OSD
Reservations:
‒ State stored as an RBD image extended attribute
‒ Updated using a compound cmpxattr+setxattr OSD request

Multiple Path Support with iSCSI on RBD

[Diagram: a single RBD image on OSD1–OSD4, reached over the cluster network; two iSCSI gateways, each using the RBD kernel module, export the same image over the public network to an iSCSI initiator with multiple paths]
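On the initiator, active/active multipath across the two gateways can then be exercised roughly as follows; the portal addresses and IQN are placeholders, and multipath-tools (dm-multipath) must be configured for the exported LUN:

# Log in to the same target through both gateway portals
iscsiadm -m discovery -t sendtargets -p 192.168.100.10
iscsiadm -m discovery -t sendtargets -p 192.168.100.11
iscsiadm -m node -T iqn.2016-06.org.example:demo --login   # logs in to every known portal
# With multipathd running, both sessions collapse into a single multipath device
multipath -ll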

How to manage storage growth and the costs of Ceph?


‒ Easily scale and manage data growth
‒ Control storage growth and manage costs
‒ Support today’s investment and adapt to the future

Introduction to openATTIC

Easily scale and manage data storage

SUSE Enterprise Storage Management Vision

Open source:
‒ An alternative to proprietary storage management systems
Enterprise:
‒ Works as expected with traditional unified storage interfaces, e.g. NAS and SAN
SDS support:
‒ Provides initial Ceph setup, management and monitoring to ease complicated scale-out scenarios

It will be available in the next SES release, or download it now at
https://build.opensuse.org/package/show/filesystems:openATTIC/openattic

openATTIC Features: Existing capability

Modern Web UI
RESTful API
Software Defined / Unified Storage:
‒ LVM, XFS, ZFS, ext3/4
‒ NAS (NFS, CIFS, HTTP)
‒ SAN (iSCSI, Fiber Channel)
Volume Mirroring:
‒ DRBD
Monitoring:
‒ Nagios / Icinga built-in
Ceph Management (WIP)

openATTIC Architecture: Technical Detail

Backend:
‒ Python (Django)
‒ Django REST Framework
‒ Nagios / Icinga & PNP4Nagios
‒ Linux tools: LVM, LIO, DRBD
‒ Ceph API: librados, librbd
Web Frontend:
‒ AngularJS
‒ Bootstrap
‒ REST API
Automated Test Suites:
‒ Python unit tests
‒ Gatling (RESTful API)
‒ Protractor / Jasmine (Web UI tests)

openATTIC Architecture: High-Level Overview

[Diagram: Web UI, shell and REST clients reach openATTIC over HTTP and the RESTful API; Django uses PostgreSQL (and NoDB), talks to the Linux OS client tools via DBUS, and talks to Ceph via librados/librbd]

openATTIC Development: Current status

- Create and map RBDs as block devices (volumes)
- Pool management Web UI (table view)
- OSD management Web UI (table view)
- RBD management Web UI (add/delete, table view)
- Monitor cluster health and performance
- Support for managing Ceph with Salt integration (WIP)
- Role management of node, monitor, storage, CephFS, iSCSI, radosgw

[Screenshots: Volume Management, Pool Listing, OSD Listing, RBD Listing]

openATTIC Ceph Roadmap: the future is in your hands

- Ceph cluster status dashboard incl. performance graphs
- Extended pool management
- OSD monitoring/management
- RBD management/monitoring
- CephFS management
- RGW management (users, buckets, keys)
- Deployment and remote configuration of Ceph nodes (via Salt)
- Public roadmap on the openATTIC wiki to solicit community feedback: http://bit.ly/28PCTWf

How does Ceph control storage costs?


Control storage growth and manage costs

Minimal recommendation

OSD storage node:
‒ 2GB RAM per OSD
‒ 1.5GHz CPU core per OSD
‒ 10GbE public and backend networks
‒ 4GB RAM for a cache tier
MON monitor node:
‒ 3 MONs minimum
‒ 2GB RAM per node
‒ SSD for the system OS
‒ MON and OSD should not be virtualized
‒ Bonded 10GbE

SUSE Storage Pricing

[Diagram: storage pricing tiers, from high-end disk array, mid-range array, fully featured NAS device, mid-range NAS and entry-level disk array down to SUSE Enterprise Storage on JBOD storage]

Use storage with multiple tiers

[Diagram: a SUSE Enterprise Storage cluster with a write tier and a read tier, each consisting of a hot pool, a normal tier and a cold pool]
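One way to realize such hot and cold tiers in Ceph is cache tiering in front of a backing pool. A rough sketch follows; the pool names, PG counts and the mapping of pools onto SSD or HDD OSDs via CRUSH rules are example choices, not taken from the slide:

ceph osd pool create cold-pool 128 128           # backing pool, e.g. mapped to HDD OSDs via CRUSH
ceph osd pool create hot-pool 128 128            # cache pool, e.g. mapped to SSD OSDs via CRUSH
ceph osd tier add cold-pool hot-pool             # attach hot-pool as a tier of cold-pool
ceph osd tier cache-mode hot-pool writeback      # absorb writes in the hot tier
ceph osd tier set-overlay cold-pool hot-pool     # route client IO through the cache tier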

Applications that write quickly:
• e.g. video recording
• e.g. lots of IoT data
Applications that read quickly:
• e.g. video streaming
• e.g. big data analysis

How to create multiple price points?

PCIe SSD: $1000 = 1000GB at 2000MB/s rw; 4 PCIe drives = $4000 = 8000MB/s rw, 4TB of storage, 400,000 IOPS, $4 per GB

SSD: $250 = 1000GB at 500MB/s rw; 16 drives = $4000 = 8000MB/s rw, 16TB of storage, 100,000 IOPS, $1 per GB

HDD: $250 = 8000GB at 150MB/s rw; 16 drives = $4000 = 2400MB/s rw, 128TB of storage, 2,000 IOPS, $0.1 per GB

How does erasure coding (EC) reduce storage costs?

[Diagram: a replication pool (three copies of the data) next to an erasure coded pool (data plus parity) in an SES Ceph cluster]

Replication pool: multiple copies of the stored data
• 300% cost of the data size
• Low latency, faster recovery
Erasure coded pool: a single copy with parity
• 150% cost of the data size
• The data/parity ratio is a trade-off against CPU cost
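For comparison, creating an erasure coded pool next to a replicated one takes only a few commands. The profile name, k/m values and PG counts below are examples (k=5, m=2 mirrors the EC 5+2 layout used in the next slide):

ceph osd erasure-code-profile set ec-5-2 k=5 m=2     # 5 data + 2 parity chunks, ~140% raw cost
ceph osd pool create ec-pool 128 128 erasure ec-5-2  # erasure coded pool using that profile
ceph osd pool create rep-pool 128 128 replicated     # replicated pool (size 3 by default, ~300% raw cost)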

Public Cloud Setup

H270-H70: $40,000
- 48 cores * 8: 384 cores
- 32GB * 32: 1TB memory
- 1TB * 16: 16TB SSD
- 40GbE * 8
1,000 customers running $5 web hosting = $5,000 per month; 8 months = $40,000

R120-T30: $5,700 * 7
- 48 cores * 7: 336 cores
- 8 * 16GB * 7: 896GB memory
- 1TB * 2 * 7: 14TB SSD
- 8TB * 6 * 7: 336TB HDD (EC 5+2 gives about 250TB usable)
- 40GbE * 7
- 10GbE * 14
2,500 customers at 100GB of storage for $2 = $5,000 per month; 8 months = $40,000

For developers?

[Diagram: three nodes, each with 3 * 6TB HDDs (3 * $220 = $660) and a 512GB SSD ($150); OSD1–OSD12 spread across them; MON1–MON3 on $300 machines; dual 1G networking]

Pros and Cons of this mobile cluster

Price:
‒ Around $3,200, versus expensive laptops
Size:
‒ 50TB at about 20kg is mobile enough to demo a usable cluster
‒ Real HDDs are better for presenting a storage solution
Benchmark:
‒ Apart from networking capability, all features and requirements of a Ceph cluster are met
Features:
‒ A great fit for developers and testers to perform software-based tests that a VM cannot do

How does the DevOps story fit? Introducing Salt

Support today’s investment and adapt to the future

Salt-enabled Ceph: Existing capability

sesceph:
‒ Python API library that helps deploy and manage Ceph
‒ Already upstream in Salt, available in the next release
‒ https://github.com/oms4suse/sesceph
python-ceph-cfg:
‒ Python Salt module that uses sesceph to deploy
‒ https://github.com/oms4suse/python-ceph-cfg

Both libraries already ship with SES 3.0.

Why Salt? Existing capability

Product setup:
‒ SUSE OpenStack Cloud, SUSE Manager and SUSE Enterprise Storage all come with Salt enabled
Parallel execution:
‒ E.g. compared to ceph-deploy when preparing OSDs (see the example after this list)
Customized Python modules:
‒ Continuous development on the Python API is easy to manage
Flexible configuration:
‒ Jinja2 + YAML by default (stateconf)
‒ pydsl if you prefer Python directly; json, pyobjects, etc.
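Parallel execution here is simply Salt's normal targeting model, fanning one command out to many minions at once (the minion names below are placeholders):

salt '*' test.ping                                      # check that all minions respond
salt '*' cmd.run 'lsblk'                                # inspect the disks on every node in one shot
salt -L 'osd1,osd2,osd3' cmd.run 'hdparm -t /dev/sda'   # or target an explicit list of nodes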

Create a cluster with a single stage file:
https://github.com/AvengerMoJo/Ceph-Saltstack/blob/master/stages/ses/ceph/ceph_create.sls

This is a showcase of how simple it is to create a cluster with a single stage file. It is easy to customize and create your own.

Quick deployment example

Git repo for fast deployment and benchmarking:
- https://github.com/AvengerMoJo/Ceph-Saltstack
Demo recording:
- https://asciinema.org/a/4hmdsrksn0fd8fgpssdgqsjdb

1) Salt setup
2) Git clone and copy the modules into the Salt _modules directory
3) saltutil.sync_all to push them to all minion nodes
4) ntp_update on all nodes
5) Create new MONs and create keys
6) Clean disk partitions and prepare OSDs
7) Update the crushmap
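A rough sketch of what those steps look like on the Salt master; the repository layout, module path and state name used here are assumptions, so follow the Ceph-Saltstack README and the demo recording for the exact commands:

# Step 2: fetch the repo and copy the custom modules into the Salt file root (paths assumed)
git clone https://github.com/AvengerMoJo/Ceph-Saltstack.git
cp -r Ceph-Saltstack/_modules /srv/salt/_modules
# Step 3: push the modules to all minion nodes
salt '*' saltutil.sync_all
# Steps 4-7: run the stage file shown earlier (ntp, MON/key creation, OSD prep, crushmap);
# depending on how the SLS is written this is an orchestration run or a state apply
salt-run state.orchestrate ses.ceph.ceph_create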

Reduce storage costs and management with SUSE Enterprise Storage

‒ Control costs
‒ Adapt quickly
‒ Manage less

Scale storage from terabytes to hundreds of petabytes without downtime, at 100% uptime, across business operations, social media, mobile data and customer data.