Cheap Clustering with OCFS2

Mark Fasheh, Oracle
August 14, 2006

What is OCFS2

● General purpose cluster file system
  – Shared disk model
  – Symmetric architecture
  – Almost POSIX compliant
    ● fcntl(2) locking
    ● Shared writable mmap
● Cluster stack
  – Small, suitable only for a file system

Why use OCFS2?

● Versus NFS
  – Fewer points of failure
  – Data consistency
  – OCFS2 nodes have direct disk access
    ● Higher performance
● Widely distributed, supported
  – In Linux kernel
  – Novell SLES9, SLES10
  – Oracle support for RAC customers

OCFS2 Uses

● File serving
  – FTP
  – NFS
● Web serving (Apache)
● Xen image migration
● Oracle Database

Why do we need “cheap” clusters?

● Shared disk hardware can be expensive – as a rough example:
  ● Switches: $3,000 - $20,000
  ● Cards: $500 - $2,000
  ● Cables, GBIC – hundreds of dollars
  ● Disk(s): the sky's the limit
● Networks are getting faster and faster
  – Gigabit PCI card: $6
● Some want to prototype larger systems
  – Performance not necessarily critical

Hardware

● Cheap commodity hardware is easy to find:
  – Refurbished from name brands (Dell, HP, IBM, etc)
  – Large hardware stores (Fry's Electronics, etc)
  – Online – Ebay, Amazon, Newegg, etc
● Impressive performance
  – Dual core CPUs running at 2GHz and up
  – Gigabit network
  – SATA, SATA II

Hardware Examples - CPU

● 2.66GHz, dual core CPU with motherboard: $129
  – Built-in video, network

Hardware Examples - RAM

● 1GB DDR2: $70

Hardware Examples - Disk

● 100GB SATA: $50

Hardware Examples - Network

● Gigabit network card: $6
  – Rather than buying a switch, buy two cards and direct connect

Hardware Examples - Case

● 400 Watt Case: $70

Hardware Examples - Total

● Total hardware cost per node: $326
  – 3 node cluster for less than $1,000!
  – One machine exports its disk via the network
    ● Dedicated gigabit network for the storage
    ● At $50 each, simple to buy an extra, dedicated disk
    ● Generally, this node cannot mount the shared disk
● Spend slightly more for nicer hardware
  – PCI-Express Gigabit: $30
  – Athlon X2 3800+, MB (SATA II, DDR2): $180

Shared Disk via iSCSI

● SCSI over TCP/IP
  – Can be routed
  – Support for authentication, many enterprise features
● iSCSI Enterprise Target (IETD)
  – iSCSI “server”
  – Can run on any disks, regular files
  – Kernel / user space components
● Open-iSCSI Initiator
  – iSCSI “client”
  – Kernel / user space components

Trivial ISCSI Target Config.

● Name the target
  – iqn.YYYY-MM.com.example:disk.name
● Create a “Target” stanza in /etc/ietd.conf
  – Lun definitions describe the disks to export
  – fileio type for normal disks
  – Special nullio type for testing

Target iqn.2006-08.com.example:lab.exports
    Lun 0 Path=/dev/sdX,Type=fileio
    Lun 1 Sectors=10000,Type=nullio

Trivial ISCSI Initiator Config.

● Recent releases have a DB driven config
  – Use the “iscsiadm” program to manipulate it
  – “rm -f /var/db//*” to start fresh
  – 3 steps
    ● Add discovery address
    ● Log into target
    ● When done, log out of target

$ iscsiadm -m discovery --type sendtargets --portal examplehost
[cbb01c] 192.168.1.6:3260,1 iqn.2006-08.com.example:lab.exports

$ iscsiadm -m node --record cbb01c --login
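Once logged in, the exported LUN shows up on the initiator as an ordinary SCSI disk; the device name (e.g. /dev/sdb) depends on the system. A quick, purely illustrative way to confirm it appeared:

$ dmesg | tail          # look for the newly attached SCSI disk
$ cat /proc/partitions  # the new device should be listed here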

$ iscsiadm -m node --record cbb01c --logout

Shared Disk via SLES10

● Easiest option
  – No downloading – all packages included
  – Very simple setup using YAST2
    ● Simple to use GUI configuration utility
    ● Text mode available
● Supported by Novell/Suse
● OCFS2 also integrated with Linux-HA software
● Demo on Wednesday
  – Visit the Oracle booth for details

Shared Disk via AoE

● ATA over Ethernet
  – Very simple standard – 6 page spec!
  – Lightweight client
    ● Less CPU overhead than iSCSI
  – Very easy to set up – auto configuration via Ethernet broadcast
  – Not routable, no authentication
    ● Targets and clients must be on the same Ethernet network
● Disks addressed by “shelf” and “slot” numbers

AoE Target Configuration

● “Virtual Blade” (vblade) software available for Linux, FreeBSD
  – Very small, user space daemon
  – Buffered I/O against a device or file
    ● Useful only for prototyping
    ● O_DIRECT patches available
  – Stock performance is not very high
● Very simple command – vbladed (see the sketch below)
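As a rough sketch of what exporting a disk with vbladed looks like (the shelf/slot numbers, interface name and device path below are illustrative only):

# Export /dev/sdX as AoE shelf 0, slot 0 over eth1 (all values are examples)
$ vbladed 0 0 eth1 /dev/sdX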

AoE Client Configuration

● Single kernel module load required
  – Automatically finds blades
  – Optional load time option, aoe_iflist
    ● List of interfaces to listen on
● Aoetools package
  – Programs to get AoE status, bind interfaces, create devices, etc (see the sketch below)
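A minimal, illustrative client-side sequence (the interface name is an example):

# Load the driver, restricting it to the storage interface
$ modprobe aoe aoe_iflist="eth1"
# Rediscover targets and show what was found
$ aoe-discover
$ aoe-stat

Discovered disks typically appear as /dev/etherd/e<shelf>.<slot>, so a blade exported as shelf 0, slot 0 would show up as /dev/etherd/e0.0.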

OCFS2

● 1.2 tree
  – Shipped with SLES9/SLES10
  – RPMS for other distributions available online
  – Builds against many kernels
  – Feature freeze, bug fix only
● 1.3 tree
  – Active development tree
  – Included in Linux kernel
  – Bug fixes and features go to -mm first

OCFS2 Tools

● Standard set of file system utilities
  – mkfs.ocfs2, mount.ocfs2, fsck.ocfs2, etc (a usage sketch follows below)
  – Cluster aware
  – o2cb to start/stop/configure the cluster
  – Work with both OCFS2 trees
● Ocfs2console GUI configuration utility
  – Can create the entire cluster configuration
  – Can distribute the configuration to all nodes
● RPMS for non-SLES distributions available online
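As a minimal sketch of the usual workflow once the cluster is configured and started (next slide); the device path, label and mount point are illustrative:

# Format the shared device with 4 node slots, then mount it
$ mkfs.ocfs2 -L shared-vol -N 4 /dev/sdX1
$ mount -t ocfs2 /dev/sdX1 /ocfs2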

OCFS2 Configuration

● Major goal for OCFS2 was simple config
  – /etc/ocfs2/cluster.conf
    ● Single file, identical on all nodes
  – Only step before mounting is to start the cluster
    ● Can configure to start at boot

$ /etc/init.d/o2cb online
Loading module "configfs": OK
Mounting configfs filesystem at /sys/kernel/config: OK
Loading module "ocfs2_nodemanager": OK
Loading module "ocfs2_dlm": OK
Loading module "ocfs2_dlmfs": OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Starting O2CB cluster ocfs2: OK

Sample cluster.conf:

node:
    ip_port = 7777
    ip_address = 192.168.1.7
    number = 0
    name = keevan
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 192.168.1.2
    number = 1
    name = opaka
    cluster = ocfs2

cluster:
    node_count = 2
    name = ocfs2

OCFS2 Tuning - Heartbeat

● Default heartbeat timeout tuned very low for our purposes
  – May result in node reboots for lower performance clusters
  – Timeout must be the same on all nodes
  – Increase the O2CB_HEARTBEAT_THRESHOLD value in /etc/sysconfig/o2cb (see the example below)
    ● OCFS2 Tools 1.2.3 release will add this to the configuration script
    ● SLES10 users can use Linux-HA instead
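A minimal sketch of the change; the value 31 is only an example (the threshold is counted in roughly 2-second heartbeat iterations, so 31 corresponds to about a minute):

# /etc/sysconfig/o2cb (excerpt); example value only, keep it identical on all nodes
O2CB_HEARTBEAT_THRESHOLD=31

The O2CB cluster stack has to be restarted for a new threshold to take effect.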

OCFS2 Tuning – mkfs.ocfs2

● OCFS2 uses cluster and block sizes
  – Clusters for data, range from 4K-1M
    ● Use the -C option
  – Blocks for meta data, range from 0.5K-4K
    ● Use the -b option
● More meta data updates -> larger journal
  – -J size= to pick a different size
● mkfs.ocfs2 -T filesystem-type
  – -T mail option for meta data heavy workloads
  – -T datafiles for file systems with very large files (a combined example follows below)
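A hedged example of how these options might be combined (the sizes, type and device below are illustrative only, not recommendations):

# 4K blocks, 32K clusters, mail-type tuning, 64MB journal; example values only
$ mkfs.ocfs2 -b 4K -C 32K -T mail -J size=64M /dev/sdX1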

OCFS2 Tuning - Practices

● No indexed directories yet
  – Keep directory sizes small to medium
● Reduce resource contention
  – Read only access is not a problem
  – Try to keep writes local to a node
    ● Each node has its own directory
    ● Each node has its own logfile
  – Spread things out by using multiple file systems
    ● Allows you to fine tune mkfs options depending on each file system's target usage

References

● http://oss.oracle.com/projects/ocfs2/

● http://oss.oracle.com/projects/ocfs2-tools/

● http://www.novell.com/linux/storage_foundation/

● http://iscsitarget.sf.net/

● http://www.open-iscsi.org/

● http://aoetools.sf.net/

● http://www.coraid.com/

● http://www.frys-electronics-ads.com/

● http://www.cdw.com/