Introduction to Gluster

Man-Suen Chan

Overview

Architecture
Volumes
Features
Design of AOPP Cluster
Performance
Data management
Puppet
Issues
Recommendations

Introduction to Gluster architecture

All nodes are the same; there are no metadata servers. In this respect it is like Isilon. Each node contributes storage, bandwidth and processing to the cluster, so as you add more nodes performance increases linearly in a scale-out fashion. Different data layouts are possible: the entire cluster can be one namespace, or it can be divided into volumes. All network connections are used, enabling bandwidth to be scaled. Adding and removing nodes can be done online.

Nodes contribute “bricks” to the cluster, which are basically disk partitions. There is no need for these to be identical in hardware, but it is better if they are the same size. They can have any file system on them (the default is XFS). The nodes are set up on RHEL (or clone) servers with Gluster installed. More or less the same software is sold by Red Hat as “Red Hat Storage”, which is very expensive but has support; Red Hat supports XFS over LVM volumes. Alternatively you can set up Gluster for free (no support). Clients access the cluster via NFS, the GlusterFS native client or object storage.
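In our setup Puppet drives the cluster membership (see the Puppet section later), but for orientation, forming the trusted pool by hand looks roughly like the sketch below; the host names follow the atmgluster naming that appears later in the Puppet data.

# on an existing node, add the other nodes to the trusted pool
gluster peer probe atmgluster02.atm.ox.ac.uk
gluster peer probe atmgluster03.atm.ox.ac.uk

# check that every node sees the same pool
gluster peer status
gluster pool list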

Volumes

Volumes are the basic building blocks of a Gluster cluster, each consisting of several bricks. One cluster can have several volumes. There are several types, for example:

Distributed. Files are distributed over several bricks as whole files, with one copy; you rely on the underlying RAID for redundancy.
Replicated. Files are distributed over several bricks as whole files and in multiple copies. The number of bricks needs to be a multiple of the replica count; Gluster itself provides the redundancy.
Disperse (new feature). Files are dispersed with parity (like RAID 5); however, they are no longer present as whole files.
There are other types (striped, distributed-replicated, etc.); illustrative commands for creating these volume types are sketched after the feature list below.

Features

Gluster has many advanced features, including:

Support for InfiniBand
Quotas
Snapshots
Geo-replication: you can have another cluster in a different location and replicate to it (asynchronously) – in theory; when I tested it before it didn't work very well.
I think some additional features are only available on Red Hat Storage.
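These are not the AOPP commands (our volumes are created via Puppet, shown later); this is just a hedged sketch of what creating each volume type and enabling a couple of the features looks like by hand. Volume names, brick paths, counts and the quota limit are illustrative only.

# distributed volume: whole files spread across bricks, single copy
gluster volume create dist-vol transport tcp \
  atmgluster01:/data1/dist atmgluster02:/data2/dist atmgluster03:/data3/dist

# replicated volume: brick count must be a multiple of the replica count
gluster volume create repl-vol replica 2 \
  atmgluster01:/data1/repl atmgluster02:/data2/repl

# dispersed volume: data plus parity, RAID 5-like; files are no longer stored whole
gluster volume create disp-vol disperse 3 redundancy 1 \
  atmgluster01:/data1/disp atmgluster02:/data2/disp atmgluster03:/data3/disp

gluster volume start dist-vol

# features: directory quotas and snapshots (snapshots need thin-provisioned LVM bricks)
gluster volume quota dist-vol enable
gluster volume quota dist-vol limit-usage /projects 10TB
gluster snapshot create nightly dist-vol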

Design of AOPP cluster

Nodes are Dell PE730XD servers with 12 x 6 TB data disks and separate (additional) disks for the OS; five nodes in RAID 6 give 300 TB usable. They are installed with SL7 and the free version of Gluster (Ubuntu would also be possible but is likely to be less well supported; Ubuntu on the client side is fine). We use distributed volumes, which should be robust against the risk of data loss although somewhat less so against loss of availability: whole-file distribution means the risk of catastrophic data loss due to (Gluster) file system corruption is low, and most performance slowdowns are due to replication. We will set up two volumes to allow some separation between different groups, with one brick per node and quotas at group and project level. The setup is complex because of the need to accommodate legacy systems, so the nodes are multi-homed: main storage and cluster communication are on the InfiniBand network, but some client communication goes over Ethernet.

Performance (1)

Fio test: sequential read of a 10 GB file with a 32k block size.
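The exact fio job is not reproduced here, so the following is only a sketch of a job matching that description; the file size and block size come from the slide, while the target directory, I/O engine and direct-I/O flag are assumptions.

fio --name=seqread --rw=read --bs=32k --size=10g \
    --directory=/mnt/gluster/fiotest --ioengine=libaio --direct=1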

System          Bandwidth       IOPS    Time         Latency
Simple 1 GbE    114417 KB/s      3575   91645 msec   2234.84 usec  (matin)
Simple IB       470192 KB/s     14693   22301 msec    542.14 usec  (stier)
Isilon 1 GbE    113864 KB/s      3558   92090 msec   2246.15 usec
Gluster 1 GbE   113931 KB/s      3560   92036 msec   2244 usec
Gluster 10 GbE  363912 KB/s     11372   28814 msec    701 usec
Gluster IB      621415 KB/s     19419   16874 msec    411 usec

Performance (2)

Fio test: writing 1000 small files of 100 KB each.
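Again, only a sketch of a small-file write job matching that description; the file count and size come from the slide, the remaining parameters are assumed.

fio --name=smallfiles --rw=write --nrfiles=1000 --filesize=100k --bs=100k \
    --directory=/mnt/gluster/fiotest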

System          Bandwidth       IOPS    Time         Latency
Simple 1 GbE    1730.6 KB/s       432   57786 msec   18449.49 usec  (matin)
Simple IB       65317 KB/s      16329    1531 msec     464.13 usec  (stier)
Isilon 1 GbE    17425 KB/s       4356    5739 msec    1811.24 usec
Gluster 1 GbE   40833 KB/s      10208    2449 msec     751 usec
Gluster 10 GbE  45851 KB/s      11462    2181 msec     672 usec
Gluster IB      50125 KB/s      12531    1995 msec     612 usec

Data management

Projects are described using the Dublin Core XML data format and controlled by quotas (a quota sketch follows the example record below). An example project record, mapped onto the Dublin Core elements:

Title: Modelling Jupiter's atmospheric spin-up using the MITgcm
Creator: Roland Young
Subject: Jupiter; GCM; Moist convection; MITgcm
Description: Simulations using the MITgcm studying jet formation in Jupiter's atmosphere under passive and active cloud conditions (moist convection). Also includes test runs of the Jupiter MITgcm, and analysis of these simulations. Data not yet published.
Contributor: Roland Young; Peter Read
Date: 11 May 2016
Type: GCM output
Format: NetCDF; MITgcm custom binaries; IDL .sav
Identifier: PLR001_YOUNG_JUPSPIN
Source: MITgcm
Language: English
Relation: N/A
Coverage: Jupiter
Rights: For internal AOPP use only.
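A hedged sketch of the quota side, assuming a volume named volume01 (as in the Puppet data later) and an arbitrary 10 TB limit for the project directory named after the identifier above:

# enable quotas on the volume (once), then cap the project directory
gluster volume quota volume01 enable
gluster volume quota volume01 limit-usage /PLR001_YOUNG_JUPSPIN 10TB

# review current limits and usage
gluster volume quota volume01 list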

Puppet-gluster (1)

[root@cplxconfig2 manifests]# cat glusternode.pp
class profile::glusternode(
  $hosts      = {},
  $bricks     = {},
  $volumes    = {},
  $properties = {},
) {
  # class using gluster::server
  include gluster::params

  class { 'gluster::server': }

  create_resources(gluster::host, $hosts)
  create_resources(gluster::brick, $bricks)
  create_resources(gluster::volume, $volumes)
  create_resources(gluster::volume::property, $properties)
}

Puppet-gluster (2)

---
gluster::server::shorewall: false
gluster::server::infiniband: true
gluster::server::vip: '192.168.0.50'
gluster::server::vrrp: true

profile::glusternode::hosts:
  atmgluster01.atm.ox.ac.uk:
    ip: '192.168.0.1'
    uuid: 'c6bcc598-53ab-41dd-ad9a-532e2215df5e'

profile::glusternode::volumes:
  volume01:
    bricks:
      - 'atmgluster01.atm.ox.ac.uk:/data1'
      - 'atmgluster02.atm.ox.ac.uk:/data2'
      - 'atmgluster03.atm.ox.ac.uk:/data3'
    transport: 'tcp,rdma'
    again: false
    start: true
...

…
profile::glusternode::bricks:
  atmgluster01.atm.ox.ac.uk:/data1:
    dev: '/dev/sdb'
    fsuuid: '02f2e01a-5728-4263-be30-0b4850fc5cdc'
    lvm: false
    xfs_inode64: true
    areyousure: false
    again: false
…

profile::glusternode::properties:
  volume01#auth.allow:
    value:
      - '163.1.242.*'
      - '192.168.0.*'
  volume01#nfs.rpc-auth-allow:
    value:
      - '163.1.242.*'
      - '192.168.0.*'

Issues

All client access is currently via NFS; the GlusterFS client seems to have slower performance. “InfiniBand” access is currently IP over InfiniBand, and RDMA does not seem to work very well. On the other hand, the current setup performs well. You can't seem to add NFS mount options, as you then get sec=null, so you need to mount with no options. Likewise, although you can set up root squash, you then have no access at all as root. Sometimes different nodes have a different view of the cluster, or Puppet runs don't complete, but overall the whole system seems quite stable.
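When nodes appear to disagree, or a Puppet run fails part way, the first checks are along these lines; the volume name matches the Puppet data above, the client mount point is illustrative.

# compare the view from each node
gluster peer status
gluster volume status volume01
gluster volume info volume01

# client side: plain NFS mount with no extra options (adding options gave sec=null)
mount -t nfs atmgluster01.atm.ox.ac.uk:/volume01 /mnt/volume01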

Recommendations (1)

Pros: Gluster is relatively easy to set up, stable, and works well. Performance is good, probably at least as good as Isilon at a fraction of the price. It has the features of advanced file systems, so it is realistic as a production facility, although you will mainly need to use the command line to configure them. It is easy to expand in future (a sketch of the expansion steps is given below), and the hardware will not need to be exactly the same. Puppet has already been set up; NB a separate role needs to be set up for each cluster, but the manifest for each node in the cluster contains the information on the entire cluster.
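For illustration, expanding by one node would look roughly like this; the new host name and brick path are hypothetical, and in practice Puppet would carry most of this.

# join the new node to the pool and add its brick to an existing volume
gluster peer probe atmgluster06.atm.ox.ac.uk
gluster volume add-brick volume01 atmgluster06.atm.ox.ac.uk:/data6

# spread existing data across the enlarged volume
gluster volume rebalance volume01 start
gluster volume rebalance volume01 status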

Recommendations (2)

Cautions: Gluster does not seem to be widely used for HPC. I have carefully evaluated it with respect to the areas I have built systems for (bioinformatics, AOPP) and am satisfied it is suitable, but it may be less suitable for systems with rapid turnover of millions of files (i.e. fast scratch spaces); in fact Isilon is also recommended for mixed workloads rather than fast scratch spaces. So this will need to be carefully evaluated when moving into new areas. It may be marginally more expensive than systems that comprise metadata servers and data nodes, as with a pure scale-out system each node needs to have reasonable performance, and typically you do not use additional attached storage enclosures.