Cheap Clustering with OCFS2
Mark Fasheh, Oracle – August 14, 2006
What is OCFS2?
● General purpose cluster file system
  – Shared disk model
  – Symmetric architecture
  – Almost POSIX compliant
    ● fcntl(2) locking
    ● Shared writable mmap
● Cluster stack
  – Small, suitable only for a file system
Why use OCFS2?
● Versus NFS
  – Fewer points of failure
  – Data consistency
  – OCFS2 nodes have direct disk access
    ● Higher performance
● Widely distributed, supported
  – In Linux kernel
  – Novell SLES9, SLES10
  – Oracle support for RAC customers
OCFS2 Uses
● File serving
  – FTP
  – NFS
● Web serving (Apache)
● Xen image migration
● Oracle Database
Why do we need “cheap” clusters?
● Shared disk hardware can be expensive
  – Fibre Channel as a rough example
    ● Switches: $3,000 - $20,000
    ● Cards: $500 - $2,000
    ● Cables, GBIC – hundreds of dollars
    ● Disk(s): the sky's the limit
● Networks are getting faster and faster
  – Gigabit PCI card: $6
● Some want to prototype larger systems
  – Performance not necessarily critical
Hardware
● Cheap commodity hardware is easy to find:
  – Refurbished from name brands (Dell, HP, IBM, etc.)
  – Large hardware stores (Fry's Electronics, etc.)
  – Online: eBay, Amazon, Newegg, etc.
● Impressive performance
  – Dual core CPUs running at 2GHz and up
  – Gigabit network
  – SATA, SATA II
Hardware Examples - CPU
● 2.66GHz, dual core CPU with motherboard: $129
  – Built-in video, network
Hardware Examples - RAM
● 1GB DDR2: $70
Hardware Examples - Disk
● 100GB SATA: $50
Hardware Examples - Network
● Gigabit network card: $6
  – Can direct connect two nodes rather than buy a switch – buy two!
Hardware Examples - Case
● 400 Watt Case: $70
Hardware Examples - Total
● Total hardware cost per node: $326
  – 3 node cluster for less than $1,000!
  – One machine exports disk via network
    ● Dedicated gigabit network for the storage
    ● At $50 each, simple to buy an extra, dedicated disk
    ● Generally, this node cannot mount the shared disk
● Spend slightly more for nicer hardware
  – PCI-Express gigabit: $30
  – Athlon X2 3800+ with motherboard (SATA II, DDR2): $180
Shared Disk via iSCSI
● SCSI over TCP/IP
  – Can be routed
  – Support for authentication, many enterprise features
● iSCSI Enterprise Target (IETD)
  – iSCSI “server”
  – Can run on any disks, regular files
  – Kernel / user space components
● Open-iSCSI Initiator
  – iSCSI “client”
  – Kernel / user space components
Trivial iSCSI Target Config.
● Name the target
  – iqn.YYYY-MM.com.example:disk.name
● Create “Target” stanza in /etc/ietd.conf
  – Lun definitions describe disks to export
  – fileio type for normal disks
  – Special nullio type for testing

Target iqn.2006-08.com.example:lab.exports
        Lun 0 Path=/dev/sdX,Type=fileio
        Lun 1 Sectors=10000,Type=nullio
Trivial iSCSI Initiator Config.
● Recent releases have a DB-driven config.
  – Use the “iscsiadm” program to manipulate it
  – “rm -f /var/db/iscsi/*” to start fresh
  – 3 steps
    ● Add discovery address
    ● Log into target
    ● When done, log out of target
$ iscsiadm -m discovery --type sendtargets --portal examplehost
[cbb01c] 192.168.1.6:3260,1 iqn.2006-08.com.example:lab.exports
$ iscsiadm -m node --record cbb01c --login
$ iscsiadm -m node --record cbb01c --logout
Shared Disk via SLES10
● Easiest option
  – No downloading – all packages included
  – Very simple setup using YaST2
    ● Simple-to-use GUI configuration utility
    ● Text mode available
● Supported by Novell/SUSE
● OCFS2 also integrated with Linux-HA software
● Demo on Wednesday
  – Visit the Oracle booth for details
Shared Disk via AoE
● ATA over Ethernet
  – Very simple standard – 6 page spec!
  – Lightweight client
    ● Less CPU overhead than iSCSI
  – Very easy to set up – auto configuration via Ethernet broadcast
  – Not routable, no authentication
    ● Targets and clients must be on the same Ethernet network
● Disks addressed by “shelf” and “slot” numbers
AoE Target Configuration
● “Virtual blade” (vblade) software available for Linux, FreeBSD
  – Very small, user space daemon
  – Buffered I/O against a device or file
    ● Useful only for prototyping
    ● O_DIRECT patches available
  – Stock performance is not very high
● Very simple command – vbladed
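A vbladed invocation can be sketched as follows; the shelf/slot numbers, interface, and device here are placeholders, so check the vbladed man page for the exact syntax of your version:

```
# Hypothetical example: export /dev/sdb as AoE shelf 0, slot 1 on eth0.
# The daemon then serves AoE requests in the background.
vbladed 0 1 eth0 /dev/sdb
```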
AoE Client Configuration
● Single kernel module load required
  – Automatically finds blades
  – Optional load-time option, aoe_iflist
    ● List of interfaces to listen on
● Aoetools package
  – Programs to get AoE status, bind interfaces, create devices, etc.
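A minimal client-side session might look like this; the interface name is a placeholder, and the e0.1 device name assumes a blade at shelf 0, slot 1 under the addressing scheme described above:

```
# Load the aoe driver, optionally restricting it to one interface
modprobe aoe aoe_iflist="eth0"
# Broadcast a discovery request, then list what was found (aoetools)
aoe-discover
aoe-stat
# A blade at shelf 0, slot 1 would appear as /dev/etherd/e0.1
```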
OCFS2
● 1.2 tree
  – Shipped with SLES9/SLES10
  – RPMs for other distributions available online
  – Builds against many kernels
  – Feature freeze, bug fixes only
● 1.3 tree
  – Active development tree
  – Included in Linux kernel
  – Bug fixes and features go to -mm first
OCFS2 Tools
● Standard set of file system utilities
  – mkfs.ocfs2, mount.ocfs2, fsck.ocfs2, etc.
  – Cluster aware
  – o2cb to start/stop/configure cluster
  – Work with both OCFS2 trees
● ocfs2console GUI configuration utility
  – Can create entire cluster configuration
  – Can distribute configuration to all nodes
● RPMs for non-SLES distributions available online
OCFS2 Configuration
● Major goal for OCFS2 was simple config.
  – /etc/ocfs2/cluster.conf
    ● Single file, identical on all nodes
  – Only step before mounting is to start the cluster
    ● Can configure to start at boot

$ /etc/init.d/o2cb online
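As a sketch, a two-node /etc/ocfs2/cluster.conf might look like the fragment below; the node names, IP addresses, and the cluster name "cheap" are invented for illustration, and each node name should match that machine's hostname:

```
node:
        ip_port = 7777
        ip_address = 192.168.1.10
        number = 0
        name = node0
        cluster = cheap

node:
        ip_port = 7777
        ip_address = 192.168.1.11
        number = 1
        name = node1
        cluster = cheap

cluster:
        node_count = 2
        name = cheap
```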
OCFS2 Tuning - Heartbeat
● Default heartbeat timeout tuned very low for our purposes
  – May result in node reboots on lower performance clusters
  – Timeout must be the same on all nodes
  – Increase the O2CB_HEARTBEAT_THRESHOLD value in /etc/sysconfig/o2cb
    ● OCFS2 Tools 1.2.3 release will add this to the configuration script
    ● SLES10 users can use Linux-HA instead
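The change itself is a single line in /etc/sysconfig/o2cb; the value 31 below is only an example, and assuming the threshold counts two-second heartbeat intervals it corresponds to roughly a one-minute timeout:

```
# /etc/sysconfig/o2cb (fragment)
# Example value only; must be identical on every node in the cluster.
O2CB_HEARTBEAT_THRESHOLD=31
```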
OCFS2 Tuning – mkfs.ocfs2
● OCFS2 uses cluster and block sizes
  – Clusters for data, range from 4K-1M
    ● Use -C
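The sizing options above can be sketched in a single mkfs invocation; the block and cluster sizes, slot count, label, and device here are illustrative placeholders, not recommendations:

```
# Sketch: 4K blocks, 32K clusters, 4 node slots; label and device are
# hypothetical – pick values to match your workload and hardware.
mkfs.ocfs2 -b 4K -C 32K -N 4 -L webfiles /dev/sdX
```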
● No indexed directories yet
  – Keep directory sizes small to medium
● Reduce resource contention
  – Read-only access is not a problem
  – Try to keep writes local to a node
    ● Each node has its own directory
    ● Each node has its own logfile
  – Spread things out by using multiple file systems
    ● Allows you to fine tune mkfs options depending on file system target usage
References
● http://oss.oracle.com/projects/ocfs2/
● http://oss.oracle.com/projects/ocfs2-tools/
● http://www.novell.com/linux/storage_foundation/
● http://iscsitarget.sf.net/
● http://www.open-iscsi.org/
● http://aoetools.sf.net/
● http://www.coraid.com/
● http://www.frys-electronics-ads.com/
● http://www.cdw.com/