Cheap Clustering with OCFS2

Mark Fasheh, Oracle
August 14, 2006

What is OCFS2
● General purpose cluster file system
  – Shared disk model
  – Symmetric architecture
  – Almost POSIX compliant
    ● fcntl(2) locking
    ● Shared writable mmap
● Cluster stack
  – Small, suitable only for a file system

Why use OCFS2?
● Versus NFS
  – Fewer points of failure
  – Data consistency
  – OCFS2 nodes have direct disk access
    ● Higher performance
● Widely distributed, supported
  – In Linux kernel
  – Novell SLES9, SLES10
  – Oracle support for RAC customers

OCFS2 Uses
● File serving
  – FTP
  – NFS
● Web serving (Apache)
● Xen image migration
● Oracle Database

Why do we need “cheap” clusters?
● Shared disk hardware can be expensive
  – Fibre Channel as a rough example
    ● Switches: $3,000 - $20,000
    ● Cards: $500 - $2,000
    ● Cables, GBICs: hundreds of dollars
    ● Disk(s): the sky's the limit
● Networks are getting faster and faster
  – Gigabit PCI card: $6
● Some want to prototype larger systems
  – Performance not necessarily critical

Hardware
● Cheap commodity hardware is easy to find:
  – Refurbished from name brands (Dell, HP, IBM, etc.)
  – Large hardware stores (Fry's Electronics, etc.)
  – Online: eBay, Amazon, Newegg, etc.
● Impressive performance
  – Dual core CPUs running at 2GHz and up
  – Gigabit network
  – SATA, SATA II

Hardware Examples - CPU
● 2.66GHz dual core w/ motherboard: $129
  – Built-in video, network

Hardware Examples - RAM
● 1GB DDR2: $70

Hardware Examples - Disk
● 100GB SATA: $50

Hardware Examples - Network
● Gigabit network card: $6
  – Can direct connect rather than buy a switch; buy two!

Hardware Examples - Case
● 400 Watt case: $70

Hardware Examples - Total
● Total hardware cost per node: $326
  – 3 node cluster for less than $1,000!
  – One machine exports its disk via the network
    ● Dedicated gigabit network for the storage
    ● At $50 each, simple to buy an extra, dedicated disk
    ● Generally, this node cannot mount the shared disk
● Spend slightly more for nicer hardware
  – PCI-Express gigabit: $30
  – Athlon X2 3800+, MB (SATA II, DDR2): $180

Shared Disk via iSCSI
● SCSI over TCP/IP
  – Can be routed
  – Support for authentication, many enterprise features
● iSCSI Enterprise Target (IETD)
  – iSCSI “server”
  – Can run on any disks, regular files
  – Kernel / user space components
● Open-iSCSI Initiator
  – iSCSI “client”
  – Kernel / user space components

Trivial iSCSI Target Config.
● Name the target
  – iqn.YYYY-MM.com.example:disk.name
● Create a “Target” stanza in /etc/ietd.conf
  – Lun definitions describe disks to export
  – fileio type for normal disks
  – Special nullio type for testing

    Target iqn.2006-08.com.example:lab.exports
        Lun 0 Path=/dev/sdX,Type=fileio
        Lun 1 Sectors=10000,Type=nullio

Trivial iSCSI Initiator Config.
● Recent releases have a DB-driven config.
  – Use the “iscsiadm” program to manipulate it
  – “rm -f /var/db/iscsi/*” to start fresh
  – 3 steps
    ● Add discovery address
    ● Log into target
    ● When done, log out of target

    $ iscsiadm -m discovery --type sendtargets --portal examplehost
    [cbb01c] 192.168.1.6:3260,1 iqn.2006-08.com.example:lab.exports
    $ iscsiadm -m node --record cbb01c --login
    $ iscsiadm -m node --record cbb01c --logout

Shared Disk via SLES10
● Easiest option
  – No downloading: all packages included
  – Very simple setup using YaST2
    ● Simple to use, GUI configuration utility
    ● Text mode available
● Supported by Novell/SUSE
● OCFS2 also integrated with Linux-HA software
● Demo on Wednesday
  – Visit the Oracle booth for details

Shared Disk via AoE
● ATA over Ethernet
  – Very simple standard: 6 page spec!
  – Lightweight client
    ● Less CPU overhead than iSCSI
  – Very easy to set up: auto configuration via Ethernet broadcast
  – Not routable, no authentication
    ● Targets and clients must be on the same Ethernet network
● Disks addressed by “shelf” and “slot” numbers

AoE Target Configuration
● “Virtual Blade” (vblade) software available for Linux, FreeBSD
  – Very small, user space daemon
  – Buffered I/O against a device or file
    ● Useful only for prototyping
    ● O_DIRECT patches available
  – Stock performance is not very high
● Very simple command
  – vbladed <shelf> <slot> <ethn> <device>

AoE Client Configuration
● Single kernel module load required
  – Automatically finds blades
  – Optional load time option, aoe_iflist
    ● List of interfaces to listen on
● Aoetools package
  – Programs to get AoE status, bind interfaces, create devices, etc.
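
Putting the AoE target and client sections above together, here is a minimal sketch for a two-machine setup. The shelf/slot numbers, the eth0 interface, the /dev/sdb disk, and the /dev/etherd/e0.0 device path are illustrative assumptions rather than values from the slides; aoe-discover and aoe-stat are the usual status utilities shipped in the aoetools package.

    # On the exporting node: publish /dev/sdb as shelf 0, slot 0 over eth0
    $ vbladed 0 0 eth0 /dev/sdb

    # On each client node: load the aoe module, optionally limited to eth0
    $ modprobe aoe aoe_iflist=eth0

    # Check what the module found (aoetools package)
    $ aoe-discover
    $ aoe-stat

    # The exported disk should then appear as an AoE block device,
    # typically /dev/etherd/e0.0 for shelf 0, slot 0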

OCFS2
● 1.2 tree
  – Shipped with SLES9/SLES10
  – RPMs for other distributions available online
  – Builds against many kernels
  – Feature freeze, bug fix only
● 1.3 tree
  – Active development tree
  – Included in Linux kernel
  – Bug fixes and features go to -mm first

OCFS2 Tools
● Standard set of file system utilities
  – mkfs.ocfs2, mount.ocfs2, fsck.ocfs2, etc.
  – Cluster aware
  – o2cb to start/stop/configure cluster
  – Work with both OCFS2 trees
● ocfs2console GUI configuration utility
  – Can create entire cluster configuration
  – Can distribute configuration to all nodes
● RPMs for non-SLES distributions available online

OCFS2 Configuration
● Major goal for OCFS2 was simple configuration
  – /etc/ocfs2/cluster.conf
    ● Single file, identical on all nodes
  – Only step before mounting is to start the cluster
    ● Can configure to start at boot

    $ /etc/init.d/o2cb online <cluster name>
    Loading module "configfs": OK
    Mounting configfs filesystem at /sys/kernel/config: OK
    Loading module "ocfs2_nodemanager": OK
    Loading module "ocfs2_dlm": OK
    Loading module "ocfs2_dlmfs": OK
    Mounting ocfs2_dlmfs filesystem at /dlm: OK
    Starting O2CB cluster ocfs2: OK

Sample cluster.conf

    node:
        ip_port = 7777
        ip_address = 192.168.1.7
        number = 0
        name = keevan
        cluster = ocfs2

    node:
        ip_port = 7777
        ip_address = 192.168.1.2
        number = 1
        name = opaka
        cluster = ocfs2

    cluster:
        node_count = 2
        name = ocfs2

OCFS2 Tuning - Heartbeat
● Default heartbeat timeout tuned very low for our purposes
  – May result in node reboots for lower performance clusters
  – Timeout must be the same on all nodes
  – Increase the O2CB_HEARTBEAT_THRESHOLD value in /etc/sysconfig/o2cb (a sample entry appears after the mkfs notes below)
● OCFS2 Tools 1.2.3 release will add this to the configuration script
● SLES10 users can use Linux-HA instead

OCFS2 Tuning - mkfs.ocfs2
● OCFS2 uses cluster and block sizes
  – Clusters for data, range from 4K to 1M
    ● Use the -C <clustersize> option
  – Blocks for metadata, range from 512 bytes to 4K
    ● Use the -b <blocksize> option
● More metadata updates -> larger journal
  – -J size=<journalsize> to pick a different size
● mkfs.ocfs2 -T <filesystem-type>
  – -T mail for metadata-heavy workloads
  – -T datafiles for file systems with very large files
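
As a rough illustration of the options above, a format command for a metadata-heavy workload might look like the following. The device name, the 32K cluster size, and the 64M journal size are arbitrary examples, not recommendations from the talk, and they assume mkfs.ocfs2 accepts the K/M size suffixes used on the slides.

    # Example only: 4K blocks, 32K clusters, 64M journal, mail-type tuning
    $ mkfs.ocfs2 -b 4K -C 32K -J size=64M -T mail /dev/sdX

    # Example only: larger clusters for a file system holding a few very large files
    $ mkfs.ocfs2 -b 4K -C 1M -T datafiles /dev/sdX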
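
For the heartbeat timeout covered just before the mkfs options, the threshold lives in /etc/sysconfig/o2cb. The value 61 below is purely illustrative; the slides only say to raise it and to keep it identical on every node. The offline/online restart assumes the o2cb init script actions shown elsewhere in this talk.

    # /etc/sysconfig/o2cb (must match on every node)
    # Illustrative value only; raise the default to tolerate slower hardware
    O2CB_HEARTBEAT_THRESHOLD=61

    # Restart the cluster stack so the new threshold takes effect
    $ /etc/init.d/o2cb offline <cluster name>
    $ /etc/init.d/o2cb online <cluster name>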

OCFS2 Tuning - Practices
● No indexed directories yet
  – Keep directory sizes small to medium
● Reduce resource contention
  – Read-only access is not a problem
  – Try to keep writes local to a node
    ● Each node has its own directory
    ● Each node has its own logfile
● Spread things out by using multiple file systems
  – Allows you to fine tune mkfs options depending on file system target usage

References
● http://oss.oracle.com/projects/ocfs2/
● http://oss.oracle.com/projects/ocfs2-tools/
● http://www.novell.com/linux/storage_foundation/
● http://iscsitarget.sf.net/
● http://www.open-iscsi.org/
● http://aoetools.sf.net/
● http://www.coraid.com/
● http://www.frys-electronics-ads.com/
● http://www.cdw.com/
