Cheap Clustering with OCFS2
Mark Fasheh, Oracle
August 14, 2006

What is OCFS2?
● General purpose cluster file system
  – Shared disk model
  – Symmetric architecture
  – Almost POSIX compliant
    ● fcntl(2) locking
    ● Shared writable mmap
● Cluster stack
  – Small, suitable only for a file system

Why use OCFS2?
● Versus NFS
  – Fewer points of failure
  – Data consistency
  – OCFS2 nodes have direct disk access
    ● Higher performance
● Widely distributed, supported
  – In the Linux kernel
  – Novell SLES9, SLES10
  – Oracle support for RAC customers

OCFS2 Uses
● File serving
  – FTP
  – NFS
● Web serving (Apache)
● Xen image migration
● Oracle Database

Why do we need “cheap” clusters?
● Shared disk hardware can be expensive
  – Fibre Channel as a rough example
    ● Switches: $3,000 - $20,000
    ● Cards: $500 - $2,000
    ● Cables, GBICs: hundreds of dollars
    ● Disk(s): the sky's the limit
● Networks are getting faster and faster
  – Gigabit PCI card: $6
● Some want to prototype larger systems
  – Performance not necessarily critical

Hardware
● Cheap commodity hardware is easy to find:
  – Refurbished from name brands (Dell, HP, IBM, etc.)
  – Large hardware stores (Fry's Electronics, etc.)
  – Online: eBay, Amazon, Newegg, etc.
● Impressive performance
  – Dual core CPUs running at 2GHz and up
  – Gigabit network
  – SATA, SATA II

Hardware Examples - CPU
● 2.66GHz, dual core w/MB: $129
  – Built-in video, network

Hardware Examples - RAM
● 1GB DDR2: $70

Hardware Examples - Disk
● 100GB SATA: $50

Hardware Examples - Network
● Gigabit network card: $6
  – Can direct connect rather than buy a switch – buy two!

Hardware Examples - Case
● 400 Watt case: $70

Hardware Examples - Total
● Total hardware cost per node: $326
  – 3 node cluster for less than $1,000!
  – One machine exports its disk via the network
    ● Dedicated gigabit network for the storage
    ● At $50 each, simple to buy an extra, dedicated disk
    ● Generally, this node cannot mount the shared disk
● Spend slightly more for nicer hardware
  – PCI-Express Gigabit: $30
  – Athlon X2 3800+, MB (SATA II, DDR2): $180

Shared Disk via iSCSI
● SCSI over TCP/IP
  – Can be routed
  – Support for authentication and many enterprise features
● iSCSI Enterprise Target (IETD)
  – iSCSI “server”
  – Can run on top of any disks or regular files
  – Kernel / user space components
● Open-iSCSI Initiator
  – iSCSI “client”
  – Kernel / user space components

Trivial iSCSI Target Config.
● Name the target
  – iqn.YYYY-MM.com.example:disk.name
● Create a “Target” stanza in /etc/ietd.conf
  – Lun definitions describe disks to export
  – fileio type for normal disks
  – Special nullio type for testing

  Target iqn.2006-08.com.example:lab.exports
      Lun 0 Path=/dev/sdX,Type=fileio
      Lun 1 Sectors=10000,Type=nullio

Trivial iSCSI Initiator Config.
● Recent releases have a DB driven config.
  – Use the “iscsiadm” program to manipulate it
  – “rm -f /var/db/iscsi/*” to start fresh
  – 3 steps (combined in the sketch after this slide)
    ● Add discovery address
    ● Log into target
    ● When done, log out of target

  $ iscsiadm -m discovery --type sendtargets --portal examplehost
  [cbb01c] 192.168.1.6:3260,1 iqn.2006-08.com.example:lab.exports
  $ iscsiadm -m node --record cbb01c --login
  $ iscsiadm -m node --record cbb01c --logout
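To tie the three initiator steps together on a client node, here is a minimal shell sketch. It simply replays the commands from the slide above; the portal name examplehost and the record ID cbb01c are the slide's example values and will differ on a real setup, so read the ID off your own discovery output before logging in.

  #!/bin/sh
  # Step 1: add the discovery address; this prints one record per target,
  # e.g. "[cbb01c] 192.168.1.6:3260,1 iqn.2006-08.com.example:lab.exports".
  iscsiadm -m discovery --type sendtargets --portal examplehost

  # Step 2: log into the target using the record ID from step 1. The
  # exported LUNs then show up as local /dev/sd* devices.
  iscsiadm -m node --record cbb01c --login

  # ... partition, mkfs.ocfs2 and mount the new device here ...

  # Step 3: when done, log out of the target again.
  iscsiadm -m node --record cbb01c --logout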
Shared Disk via SLES10
● Easiest option
  – No downloading – all packages included
  – Very simple setup using YaST2
    ● Simple to use GUI configuration utility
    ● Text mode available
● Supported by Novell/SUSE
● OCFS2 also integrated with the Linux-HA software
● Demo on Wednesday
  – Visit the Oracle booth for details

Shared Disk via AoE
● ATA over Ethernet
  – Very simple standard – 6 page spec!
  – Lightweight client
    ● Less CPU overhead than iSCSI
  – Very easy to set up – auto configuration via Ethernet broadcast
  – Not routable, no authentication
    ● Targets and clients must be on the same Ethernet network
● Disks addressed by “shelf” and “slot” numbers

AoE Target Configuration
● “Virtual Blade” (vblade) software available for Linux, FreeBSD
  – Very small user space daemon
  – Buffered I/O against a device or file
    ● Useful only for prototyping
    ● O_DIRECT patches available – stock performance is not very high
● Very simple command (see the sketch after the client configuration slide)
  – vbladed <shelf> <slot> <ethn> <device>

AoE Client Configuration
● Single kernel module load required
  – Automatically finds blades
  – Optional load time option, aoe_iflist
    ● List of interfaces to listen on
● Aoetools package
  – Programs to get AoE status, bind interfaces, create devices, etc.
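As a concrete illustration of the target and client halves above, a minimal sketch follows. The interface eth1, the shelf/slot numbers 0/1 and the device /dev/sdb are placeholder assumptions; aoe-stat is one of the status programs in the aoetools package, and the resulting device name follows the driver's usual /dev/etherd/e<shelf>.<slot> convention.

  # On the exporting machine: serve /dev/sdb as shelf 0, slot 1 on eth1,
  # using the vbladed command shown on the target configuration slide.
  vbladed 0 1 eth1 /dev/sdb

  # On each client node: load the AoE driver, optionally limited to the
  # dedicated storage interface via the aoe_iflist option, then check
  # which blades were discovered.
  modprobe aoe aoe_iflist=eth1
  aoe-stat        # the exported disk should appear as /dev/etherd/e0.1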
OCFS2
● 1.2 tree
  – Shipped with SLES9/SLES10
  – RPMs for other distributions available online
  – Builds against many kernels
  – Feature freeze, bug fix only
● 1.3 tree
  – Active development tree
  – Included in the Linux kernel
  – Bug fixes and features go to -mm first

OCFS2 Tools
● Standard set of file system utilities
  – mkfs.ocfs2, mount.ocfs2, fsck.ocfs2, etc.
  – Cluster aware
  – o2cb to start/stop/configure the cluster
  – Work with both OCFS2 trees
● ocfs2console GUI configuration utility
  – Can create the entire cluster configuration
  – Can distribute the configuration to all nodes
● RPMs for non-SLES distributions available online

OCFS2 Configuration
● A major goal for OCFS2 was simple configuration
  – /etc/ocfs2/cluster.conf
    ● Single file, identical on all nodes
  – Only step before mounting is to start the cluster
    ● Can be configured to start at boot

  $ /etc/init.d/o2cb online <cluster name>
  Loading module "configfs": OK
  Mounting configfs filesystem at /sys/kernel/config: OK
  Loading module "ocfs2_nodemanager": OK
  Loading module "ocfs2_dlm": OK
  Loading module "ocfs2_dlmfs": OK
  Mounting ocfs2_dlmfs filesystem at /dlm: OK
  Starting O2CB cluster ocfs2: OK

Sample cluster.conf

  node:
      ip_port = 7777
      ip_address = 192.168.1.7
      number = 0
      name = keevan
      cluster = ocfs2

  node:
      ip_port = 7777
      ip_address = 192.168.1.2
      number = 1
      name = opaka
      cluster = ocfs2

  cluster:
      node_count = 2
      name = ocfs2

OCFS2 Tuning - Heartbeat
● Default heartbeat timeout is tuned very low for our purposes
  – May result in node reboots on lower performance clusters
  – Timeout must be the same on all nodes
  – Increase the O2CB_HEARTBEAT_THRESHOLD value in /etc/sysconfig/o2cb
● The OCFS2 Tools 1.2.3 release will add this to the configuration script
● SLES10 users can use Linux-HA instead

OCFS2 Tuning - mkfs.ocfs2
● OCFS2 uses cluster and block sizes
  – Clusters for data, range from 4K to 1M
    ● Use the -C <clustersize> option
  – Blocks for meta data, range from 512 bytes to 4K
    ● Use the -b <blocksize> option
● More meta data updates -> larger journal
  – -Jsize=<journalsize> to pick a different size
● mkfs.ocfs2 -T <filesystem-type>
  – -Tmail for meta data heavy workloads
  – -Tdatafiles for file systems with very large files

OCFS2 Tuning - Practices
● No indexed directories yet
  – Keep directory sizes small to medium
● Reduce resource contention
  – Read-only access is not a problem
  – Try to keep writes local to a node
    ● Each node has its own directory
    ● Each node has its own logfile
● Spread things out by using multiple file systems
  – Allows you to fine tune mkfs options depending on each file system's target usage
● A combined end-to-end sketch follows the references

References
● http://oss.oracle.com/projects/ocfs2/
● http://oss.oracle.com/projects/ocfs2-tools/
● http://www.novell.com/linux/storage_foundation/
● http://iscsitarget.sf.net/
● http://www.open-iscsi.org/
● http://aoetools.sf.net/
● http://www.coraid.com/
● http://www.frys-electronics-ads.com/
● http://www.cdw.com/
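Finally, a minimal end-to-end sketch that puts the pieces together, offered under stated assumptions: two nodes matching the sample cluster.conf, a shared device that appears as /dev/sdb on both (via iSCSI or AoE), the default cluster name ocfs2, and mkfs.ocfs2 options (-T, -N, -L) chosen purely for illustration.

  # On every node: put an identical /etc/ocfs2/cluster.conf in place,
  # then bring the O2CB cluster online (this can also be enabled at boot).
  /etc/init.d/o2cb online ocfs2

  # On one node only: format the shared device. -T mail is the slides'
  # suggestion for meta data heavy workloads; -N caps how many nodes may
  # mount the volume at once and -L gives it a label.
  mkfs.ocfs2 -T mail -N 2 -L cheapcluster /dev/sdb

  # On every node: mount the shared file system.
  mkdir -p /mnt/shared
  mount -t ocfs2 /dev/sdb /mnt/shared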