Reliable Replicated File Systems with GlusterFS

John Sellens

[email protected]

@jsellens

USENIX LISA 28, November 14, 2014

Notes PDF at http://www.syonex.com/notes/

Contents

Preamble and Introduction

Setting Up GlusterFS Servers

Mounting on Clients

Managing, Monitoring, Fixing

Wrap Up


Preamble and Introduction


Overview

• Network Attached Storage is handy to have in many cases – And sometimes we have limited budgets

• GlusterFS provides a scalable NAS system – On “normal” systems and hardware

• An introduction to GlusterFS and its uses

• And how to implement and maintain a GlusterFS file service


Notes:

• http://www.gluster.org/

• We’re not going to cover everything in this Mini Tutorial session – But it should get you started – In time for mid-afternoon break!

• Both USENIX and I will very much appreciate your feedback — please fill out the evaluation form

Solving a Problem

• Needed to replace a small but reliable network file service – Expanding the existing service wasn’t going to work

• Wanted something comprehensive but comprehensible

• Needed Posix filesystem semantics, and NFS

• Wanted something that would let me sleep at night

• GlusterFS seemed a good fit – Supported by RedHat, NFS, CIFS, . . . – User space, on top of regular filesystem


Notes:

• I have a small hosting infrastructure that I like to implement reliably

• Red Hat Storage Server is a supported GlusterFS implementation

Alternatives I Was Less Enthused About

• Block – DRBD, HAST – Not transparent – hard to look inside and confirm consistency – Hard to expand, limited to two server nodes

• Object stores – Hadoop, etc. – No need for shared block devices for KVMs, etc. – Not always Posix and NFS

• Others – MooseFS, etc. – Some needed separate meta-data server(s) – Some had single master servers


Notes:

• I was running HAST on FreeBSD, and tried (and failed) to expand it – Partly due to old hardware I was using

Why I Like GlusterFS

• Can run on just two servers – all functions on both

• Sits on top of a standard filesystem (ext3, xfs) – Files in GlusterFS volumes are visible as normal files – So if everything fails very badly, I can likely copy the files out – Easy to compare replicated copies of files for consistency

• Fits nicely with CentOS which I tend to use – NFS server support means that my existing FreeBSD boxes would work “just fine”


Notes:

• I like to be both simple-minded and paranoid – So being able to check and copy if need be was appealing

Hardware – Don’t Use Your Old Junk

• I have some old 32-bit machines – Bad, bad idea

• These days, code doesn’t seem to be tested well on 32 bit

• GlusterFS inodes (or equivalent) are 64 bits – Which doesn’t sit well with 32 bit NFS clients

• In theory 32 bit should work, in practice it’s at least annoying

• 2⁶ (64-bit) Yes! but 2⁵ (32-bit) No!


Notes:

• This is not just GlusterFS related

• My old 32 bit FreeBSD HAST systems started misbehaving when I tried to update and expand

Setting Up GlusterFS Servers


Set Up Some Servers

• Ordinary servers with ordinary storage – All the “normal” speed/reliability questions – I’ll suggest CentOS 7 (or 6)

• Leave unallocated space to use for GlusterFS

• Separate storage network? – Traffic and security

• Dedicated servers for storage? – Likely want storage servers to be static and dedicated


Notes:

• Since RedHat does the development, it’s pretty likely that GlusterFS will work well on CentOS – Should work on Fedora and Debian as well, if you’re that way inclined

• GlusterFS 3.6 likely to have FreeBSD and MacOS support (I hope) https://forums.freebsd.org/viewtopic.php?t=46923

• And of course, it should go without saying, but make sure NTP and DNS and networking are working properly.

RAID on the Servers?

• GlusterFS hardware failures “should be” non-disruptive

• RAID should provide better I/O performance – Especially hardware RAID with cache

• Re-building/silvering an entire server for a disk failure is boring – Overall storage performance will suffer in the meantime – A second failure might be a big problem

• Small general purpose deployment? – Use good servers and suitable RAID

• Other situations may suit non-RAID – Lots of servers, more than 2 replicas, etc.


Notes:

• Configuration management should mean that a server rebuild is “easy” – Your mileage may vary

• Remember that a failed disk means lots of I/O and time to repair, and you’re vulnerable to other failures while rebuilding

Networks and Security

• GlusterFS has limited security and access controls – Assumption: all servers and networks are friendly

• A separate storage network may be prudent – glusterfs mounts need to reach gluster peer addresses – NFS mounts by default are available on all interfaces

• Generally you want to isolate GlusterFS traffic if you can – Firewalls, subnets, iptables, . . .
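• For example, a hedged iptables sketch that limits Gluster traffic to a storage subnet – the 10.0.0.0/24 subnet is a placeholder, and the port ranges assume GlusterFS 3.4+ defaults (24007-24008 for the daemons, 49152 and up for bricks):
  # allow Gluster daemon and brick ports from the storage network only
  iptables -A INPUT -s 10.0.0.0/24 -p tcp --dport 24007:24008 -j ACCEPT
  iptables -A INPUT -s 10.0.0.0/24 -p tcp --dport 49152:49251 -j ACCEPT
  iptables -A INPUT -p tcp --dport 24007:24008 -j DROP
  iptables -A INPUT -p tcp --dport 49152:49251 -j DROP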


Notes:

• I have very limited experience trying to contain GlusterFS

• If you’re using only glusterfs mounts an isolated network would be useful – For performance and “containment”

IPs and Addressing

• Generally you will want fixed and floating addresses

• GlusterFS peers need to talk to each other

• glusterfs mounts need to find one peer then talk to the others – First peer provides details of the volumes and peers

• NFS and CIFS mounts want floating service addresses – Active/passive mounts need just one – Active/active mounts need more

• CTDB is recommended for IP address manipulation


Notes:

• With two servers, I have 6 addresses total – Management addresses – Storage network peer addresses – Floating addresses that are normally one per server

• More on CTDB later, in the IP Addresses and CTDB section

Installing GlusterFS

• Use the standard gluster.org repositories – See notes

• Install with
  yum install glusterfs-server
  service glusterd start
  chkconfig glusterd on

• or apt-get install glusterfs-server

• Current version is 3.6.1


Notes:

• Versions – use 3.5.x – I seemed to have less reliable/stable behaviour with 3.4

• Everything is under the download link at http://download.gluster.org/pub/gluster/glusterfs/LATEST/

• CentOS:
  wget -P /etc/yum.repos.d \
    http://download.gluster.org/pub/gluster/glusterfs/LATEST/CentOS/glusterfs-epel.repo

• Debian – see http://download.gluster.org/pub/gluster/glusterfs/3.5/LATEST/Debian/wheezy/README

A Little Terminology

• A set of GlusterFS servers is a Trusted Storage Pool – Members of a pool are peers of each other

• A GlusterFS filesystem is a Volume

• Volumes are composed of storage Bricks

• Volumes can be three types, and most combinations – Distributed – different files are on different bricks – Striped – (very large) files are split across bricks – Replicated – two or more copies on different bricks

• Distributed Replicated – more servers than replicas

• A Sub-Volume is a replica set within a Volume


Notes:

• Distributed provides no redundancy – Though you might have RAID disks on servers – But you’re still in trouble if a server goes down

Set Up the Peers

• All servers in a pool need to know each other
  node1# gluster peer probe node2

• Doesn’t hurt to do this (I think it’s optional)
  node2# gluster peer probe node1

• And make sure they are talking:
  node1# gluster peer status
  – That only lists the other peer(s)

• List the servers in a pool
  node1# gluster pool list


Set Us Up the Brick

• A brick is just a directory in an OS filesystem

• One brick per filesystem – Disk storage dedicated to a volume – /data/gluster/volname/brickN/brick

• Could have multiple bricks in a filesystem – Disk storage shared between volumes – /data/gluster/disk1/volname/brickN

• Don’t want a brick to be a filesystem mount point – Big problems if underlying storage not mounted

• Multiple volumes? Use the latter for better utilization
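• For illustration, a minimal sketch of carving a brick out of a hypothetical spare partition – /dev/sdb1 and the paths are assumptions, and -i size=512 is the inode size commonly suggested for the extended attributes Gluster uses:
  node1# mkfs.xfs -i size=512 /dev/sdb1
  node1# mkdir -p /data/gluster/vol1
  node1# echo '/dev/sdb1 /data/gluster/vol1 xfs defaults 0 0' >> /etc/fstab
  node1# mount /data/gluster/vol1
  # the brick is a subdirectory, not the mount point itself
  node1# mkdir -p /data/gluster/vol1/brick1/brick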


Notes:

• XFS is the suggested filesystem to use

• A suggested naming convention for bricks: http://www.gluster.org/community/documentation/index.php/HowTos:Brick_naming_conventions

• With disk mount points, and multiple bricks per OS filesystem, one GlusterFS volume can use up space and “fill up” other volumes

• With multiple bricks per OS filesystem, it’s harder to know which gluster volume is using up space – df shows the same for all volumes

• Depends on your use case – One big volume or multiple volumes for different purposes – Will volumes shrink, or only grow? – Is it convenient to have multiple OS disk partitions?

Sizing Up a Brick

• How big should a brick (partition) be?

• One brick using all space on a server is easy to create – But harder to move or replace if needed

• Consider using bricks of manageable size e.g. 500GB, 1TB – Will likely be easier to migrate/replace if needed – Of course, if you have a lot of storage, a zillion bricks might be difficult

• Keep more space free than is on any one server?


Notes:

• I think there are some subtleties here that aren’t quite so obvious

• And might be worth a thought or two before you commit yourself to a storage layout that will be hard to change

Create a Volume

• Volume creation is straightforward
  node1# gluster volume create vol1 replica 2 \
    node1:/data/glusterfs/disk1/vol1/brick1 \
    node2:/data/glusterfs/disk1/vol1/brick1 \
    node1:/data/glusterfs/disk2/vol1/brick2 \
    node2:/data/glusterfs/disk2/vol1/brick2
  node1# gluster volume start vol1
  node1# gluster volume info vol1
  node1# mount -t glusterfs localhost:/vol1 /mnt
  node1# showmount -e node2

• Replicas are across the first two bricks, and next two

• Name things sensibly now, save your brain later


Notes:

• Each brick will now have a .glusterfs directory

• Adding files or directories to the volume causes them to show up in the bricks of one of the replicated pairs

• You can look, but do not touch – Only change a volume through a mount – Never by modifying a brick directly

• Likely best to stick with the built-in NFS server

• You can set options on a volume with
  gluster volume set volname option value

• If you’re silly (like me) and have 32 bit NFS clients:
  gluster volume set volname nfs.enable-ino32 on

IP Addresses and CTDB

• CTDB is a clustered TDB database built for Samba

• Includes IP address failover

• Set up CTDB on each node – /etc/ctdb/nodes

• Manage public IPs – /etc/ctdb/public_addresses

• Needs a shared private directory for locks, etc.

• Starts/stops Samba

• Active/active with DNS round robin
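• A minimal sketch of the two files for a hypothetical two-node pool – all addresses and the interface name are placeholders:
  # /etc/ctdb/nodes – fixed storage-network address of each node, one per line
  10.0.0.1
  10.0.0.2
  # /etc/ctdb/public_addresses – floating addresses CTDB moves between nodes
  192.168.1.101/24 eth0
  192.168.1.102/24 eth0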


Notes:

• Setup is fairly easy – follow these pages
  http://www.gluster.org/community/documentation/index.php/CTDB
  http://wiki.samba.org/index.php/CTDB_Setup
  http://ctdb.samba.org/

Mounting on Clients


Native Mount or NFS?

• Many small files, mostly read? – e.g. a web server? – Use NFS client

• Write heavy load? – Use native gluster client

• Client not Linux? – Use NFS client – Or CIFS if Windows client


Notes:

• http://www.gluster.org/documentation/Technical_FAQ/

Gluster Native Mount

• Install glusterfs-fuse or glusterfs-client
  client# mount -t glusterfs ghost:/vol1 /mnt

• Use a public/floating IP/hostname for the mount

• Gluster client gets volume info

• Then uses the peer names used when adding bricks – So a gluster client must have access to the storage network

• Client handles failover if nodes disappear


Notes:

• mount.glusterfs(8) does not mention all the mount options

• In particular, the option backupvolfile-server=node2 might be useful, if you don’t use public/floating IPs
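• For example, a hedged /etc/fstab sketch using that option – hostnames and mount point are placeholders:
  ghost:/vol1 /mnt/vol1 glusterfs defaults,_netdev,backupvolfile-server=node2 0 0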

NFS Mount

• Like any other NFS mount
  client# mount glusterhost:/vol1 /mnt

• Use a public/floating IP/hostname for the mount

• NFS talks to that IP/hostname – So an NFS client need not have access to the storage network

• NFS must use TCP, not UDP

• Failover should be handled by CTDB IP switch – But for a planned outage you may want to move clients and adjust mounts in advance
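• Since the built-in NFS server speaks NFSv3 over TCP, it may help to be explicit on clients that default to other settings – a sketch, with placeholder names:
  client# mount -t nfs -o vers=3,proto=tcp glusterhost:/vol1 /mnt
  # or the equivalent /etc/fstab entry
  glusterhost:/vol1 /mnt/vol1 nfs vers=3,proto=tcp,_netdev 0 0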


CIFS Mounts

• Similar to NFS mounts – Use public/floating IP’s name

• Need to configure Samba as appropriate on the servers
  clustering = yes
  idmap backend = tdb2
  private dir = /gluster/shared/lock

• CTDB will start/stop Samba
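• To actually export a volume, a minimal share stanza might look like this – the share name and the server-side glusterfs mount point are assumptions:
  [vol1]
      path = /mnt/vol1
      read only = no
      guest ok = no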


Managing, Monitoring, Fixing


Ongoing Management

• When all is going well, there’s not much to do

• Monitor filespace usage and other normal things

• Gluster monitoring – check for – Processes running – All bricks connected – Free space – Volume heal info

• Lots of logs in /var/log/glusterfs

• Note well: GlusterFS, like RAID, is not a backup
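• As a rough illustration (not the check_glusterfs plugin mentioned in the notes), a minimal shell sketch that flags a dead glusterd or unsynced entries – the volume name is an assumption:
  #!/bin/sh
  VOL=vol1
  # is the management daemon running?
  pidof glusterd > /dev/null || echo "CRITICAL: glusterd not running"
  # sum the per-brick "Number of entries:" lines from heal info
  UNSYNCED=$(gluster volume heal $VOL info 2>/dev/null |
      awk '/Number of entries:/ { n += $4 } END { print n + 0 }')
  [ "$UNSYNCED" -gt 0 ] && echo "WARNING: $UNSYNCED unsynced entries on $VOL"
  exit 0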


Notes:

• I use check_glusterfs by Mark Ruys, [email protected] http://exchange.nagios.org/directory/Plugins/System-Metrics/File-System/GlusterFS-checks/details

• I run it as root via SNMP

• Unsynced entries (from heal info) are normally 0, but when busy there can be transitory unsynced entries – My gluster volumes are not heavy write – You may see more unsynced

Command Line Stuff

• The gluster command is the primary tool
  node1# gluster volume info vol1
  node1# gluster volume log rotate vol1
  node1# gluster volume status vol1
  node1# gluster volume heal vol1 info
  node1# gluster help

• The volume heal subcommands provide info on consistency – And can trigger a heal action


Adding More Space

• Expanding the underlying filesystem provides more space – But likely want to keep things consistent across servers

• And of course you can add bricks
  node1# gluster volume add-brick vol1 \
    node1:/path/brick2 node2:/path/brick2
  node1# gluster volume rebalance vol1 start

• Note that you must add bricks in multiples of the replica count – Each new pair is a replica pair, just like for create

• Increase replica count by setting new count and adding enough bricks
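• For example, a sketch of going from replica 2 to replica 3 for the earlier vol1, adding one brick per existing replica pair on a hypothetical third node:
  node1# gluster volume add-brick vol1 replica 3 \
    node3:/data/glusterfs/disk1/vol1/brick1 \
    node3:/data/glusterfs/disk2/vol1/brick2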


Notes:

• If you have a replica with bricks of different sizes, you may be wasting space

• You don’t have to add-brick on a particular node, any server that knows about the volume should likely work fine – I’m just a creature of habit

• But you can’t reduce the replica count ... – At least, I don’t think you can reduce the replica count

• A rebalance could be useful if file deletions have left bricks (sub-volumes) unbalanced

Removing Space

• Remove bricks with start, status, commit
  node1# gluster volume remove-brick vol1 \
    node1:/path/brick1 node2:/path/brick1 start

• Replace start with status for progress

• When complete, run commit

• For replicated volumes, you have to remove all the bricks of a sub-volume at the same time
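• The follow-up steps reuse the same brick list – a sketch matching the start example above:
  node1# gluster volume remove-brick vol1 \
    node1:/path/brick1 node2:/path/brick1 status
  node1# gluster volume remove-brick vol1 \
    node1:/path/brick1 node2:/path/brick1 commit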


Notes:

• This of course is never needed, because space needs never decrease

Replacing or Moving a Brick

• Move a brick with replace-brick
  node1# gluster volume replace-brick vol1 \
    node1:/path/brick1 node2:/path/brick1 start

• Start, status, commit like remove-brick

• If you’re adding a third server to a pool with replicas – Should be able to shuffle bricks to the desired result – Or, if there’s extra space, add and remove bricks

• If a brick is dead, you may need commit force – With RAID, this is less of a problem . . .
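• If the source brick is already dead, the forced variant looks roughly like this – the node3 destination is hypothetical:
  node1# gluster volume replace-brick vol1 \
    node1:/path/brick1 node3:/path/brick1 commit force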


Notes:

• The Red Hat manual suggests that this is much more complicated

• This is a nice description of adding a third server http://joejulian.name/blog/how-to-expand-glusterfs-replicated-clusters-by-one-server/

Taking a Node Out of Service

• In theory it should be simple
  node1# ctdb disable
  node1# service glusterd stop

• In practice, you might want to manually move NFS clients first

• Clients with native gluster mounts should be “just fine”

• On restart, volumes should “self-heal”


Notes:

• I’m paranoid about time for an NFS client to notice a new server

Split Brain Problems

• With multiple servers (more than 2), useful to set
  node1# gluster volume set all \
    cluster.server-quorum-ratio 51%
  node1# gluster volume set VOLNAME \
    cluster.server-quorum-type server

• With two nodes, could add a 3rd “dummy” node with no storage

• If heal info reports unsync’d entries
  node1# gluster volume heal VOLNAME

• Sometimes a client-side “stat” of affected file can fix things – Or a copy and move back
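• Some heal subcommands that may help when diagnosing – a sketch, all run on a server:
  node1# gluster volume heal VOLNAME info
  node1# gluster volume heal VOLNAME info split-brain
  node1# gluster volume heal VOLNAME full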


Notes:

• Default quorum ratio is more than 50% – Or so the docs seem to say

• The Red Hat Storage Administration Guide has a nice discussion – And lots of details on recovery

• Fixing split brain: https://github.com/gluster/glusterfs/blob/master/doc/debugging/split-brain.md

• Remember: do not modify bricks directly!

Wrap Up


We Haven’t Talked About

• GlusterFS has many features and options

• Snapshots

• Geo-Replication

• Object storage – OpenStack Storage (Swift)

• Quotas


Notes:

• We’ve tried to hit the key areas to get started with Gluster

• We didn’t cover everything

• Hopefully you’ve learned some of the more interesting aspects

• And can apply them in your own implementations

Where to Get Gluster Help

• gluster.org web site has a lot of links – Mailing lists, IRC, . . .

• Quick Start Guide

• Red Hat Storage documentation is pretty good

• HowTo page

• GlusterFS Administrator Guide


Notes:

• GlusterFS documentation is currently a bit disjointed

• http://www.gluster.org/

• http://www.gluster.org/documentation/quickstart/index.html

• Administrator Guide is currently a link to a github repository of markdown files

• https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3/

• http://www.gluster.org/documentation/howto/HowTo/

And Finally!

• Please take the time to fill out the tutorial evaluations – The tutorial evaluations help USENIX offer the best possible tutorial programs – Comments, suggestions, criticisms gratefully accepted – All evaluations are carefully reviewed, by USENIX and by the presenter (me!)

• Feel free to contact me directly if you have any unanswered questions, either now, or later: [email protected]

• Questions? Comments?

• Thank you for attending!


Notes:

• Thank you for taking this tutorial, and I hope that it was (and will be) informative and useful for you.

• I would be very interested in your feedback, positive or negative, and suggestions for additional things to include in future versions of this tutorial, on the comment form, here at the conference, or later by email.