L79 Linux Filesystems
Klaus Bergmann, Linux Architecture & Performance, IBM Lab Boeblingen
zSeries Expo, November 10-14, 2003 | Hilton, Las Vegas, NV

Trademarks

The following are trademarks of the International Business Machines Corporation in the United States and/or other countries: Enterprise Storage Server, ESCON*, FICON, FICON Express, HiperSockets, IBM*, IBM logo*, IBM eServer, Netfinity*, S/390*, VM/ESA*, WebSphere*, z/VM, zSeries. (* Registered trademarks of IBM Corporation.)

The following are trademarks or registered trademarks of other companies: Intel is a trademark of the Intel Corporation in the United States and other countries. Java and all Java-related trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc., in the United States and other countries. Lotus, Notes, and Domino are trademarks or registered trademarks of Lotus Development Corporation. Linux is a registered trademark of Linus Torvalds. Microsoft, Windows and Windows NT are registered trademarks of Microsoft Corporation. Penguin (Tux) compliments of Larry Ewing. SET and Secure Electronic Transaction are trademarks owned by SET Secure Electronic Transaction LLC. UNIX is a registered trademark of The Open Group in the United States and other countries. All other products may be trademarks or registered trademarks of their respective companies.

Agenda

- Journaling file systems
- Measurement setup
- Measurement results
  - LPAR vs. VM
  - 31-bit vs. 64-bit
  - single disk and LVM
  - DASD statistics
  - CPU load and CP overhead
  - journaling options
- Outlook

Problems of Non-Journaling File Systems

- Data and metadata are written directly and in arbitrary order.
- No algorithm ensures data integrity.
- After a crash, the complete structure of the file system has to be checked to ensure integrity.
- File system check times depend on the size of the file system.
→ risk of data loss
→ long and costly system outages

Advantages of Journaling

- Data integrity is ensured in case of a system crash.
- Only the journal has to be replayed to recover a consistent file system structure.
- File system check time depends on the size of the journal.
→ much higher data integrity
→ much shorter system outages
- But there is a cost...

Journaling File Systems in SuSE SLES8

- ext3 v0.9.18
- jfs 1.0.24
- reiserfs 3.6.2
- for reference: ext2 v0.5 (non-journaling)

ext3

- developed by Andrew Morton and others
- based on ext2, extended by journaling features
- supports full data journaling
- resizing possible (only while unmounted)
- http://www.zipworld.com.au/~akpm/linux/ext3/

jfs

- developed by the IBM Austin Lab
- ported from OS/2 Warp Server
- metadata journaling only
- maximum file system size: 4 PB
- http://www.ibm.com/developerworks/oss/jfs/index.html

reiserfs

- developed by a group around Hans Reiser
- SUSE's default choice
- metadata journaling only
- disk space optimization algorithm
- online enlargement of the file system
- http://www.namesys.com/
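All three journaling file systems ship with SLES8 and are created with their ordinary user-space tools. A minimal sketch, assuming a hypothetical DASD partition /dev/dasdb1:

    # ext3: an ext2 layout plus a journal (-j adds the journal)
    mke2fs -j /dev/dasdb1

    # jfs (metadata journaling only)
    mkfs.jfs /dev/dasdb1

    # reiserfs (metadata journaling only)
    mkreiserfs /dev/dasdb1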
Measurement Setup

Hardware:
- 2064-216 (z900): 1.09 ns cycle time (917 MHz), 2 x 16 MB L2 cache (shared), 64 GB memory, 6 FICON channels
- 2105-F20 (Shark): 384 MB NVS, 16 GB cache, 128 x 36 GB disks at 10,000 RPM, FICON (1 Gbps)

Software:
- SUSE SLES8
- dbench

Benchmark configuration:
- dbench with 128 MB main memory
- 1, 2 and 4 CPUs
- LPAR and z/VM 4.3
- 31-bit and 64-bit
- single 3390 model 3 disk
- 6-pack of 3390-3 disks using striped LVM
- attached via 6 FICON channels
- running 8 and 16 processes

Dbench File I/O

- emulation of the NetBench benchmark, which rates Windows file servers
- large set of mixed file operations; workload for each process: create, write, read, append, delete
- scaling for Linux with 1, 2 and 4 PUs
- scaling for 8 and 16 clients (processes) running simultaneously
- clients are forced to do I/O while memory fills up with data

Measurement Results: LPAR and VM

[Chart: single disk, LPAR and VM, 31-bit, 4 CPUs. Throughput in MB/s (0-70) for ext2, ext3, jfs and reiserfs, each with 8 and 16 processes under LPAR and under VM.]

Measurement Results: 31-bit and 64-bit

[Chart: single disk, VM, 4 CPUs. Throughput in MB/s (0-65) for ext2, ext3, jfs and reiserfs, each with 8 and 16 processes in 31-bit and 64-bit mode.]

/proc/dasd/statistics: Example

    root@g73vm1:~# cat /proc/dasd/statistics
    56881 dasd I/O requests with 5270816 sectors(512B each)
    __<4 ___8 __16 __32 __64 _128 _256 _512 __1k __2k __4k __8k _16k _32k _64k 128k
    _256 _512 __1M __2M __4M __8M _16M _32M _64M 128M 256M 512M __1G __2G __4G _>4G
    Histogram of sizes (512B secs)
    0 0 1039 4799 8102 36557 4475 292 195 1422 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    Histogram of I/O times (microseconds)
    0 0 0 0 0 0 0 0 2 8 109 3244 25570 17480 7666 1248
    1390 153 11 0 0 0 0 0 0 0 0 0 0 0 0 0
    Histogram of I/O times per sector
    0 0 0 0 176 4141 24084 15639 9506 2513 601 173 41 7 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    Histogram of I/O time till ssch
    5 1 2 0 0 0 0 0 2 4 301 11527 25339 12278 5156 1759
    383 118 6 0 0 0 0 0 0 0 0 0 0 0 0 0
    Histogram of I/O time between ssch and irq
    0 0 0 0 0 0 0 0 2584 23896 18720 5307 2325 2725 1217 62
    23 21 1 0 0 0 0 0 0 0 0 0 0 0 0 0
    Histogram of I/O time between ssch and irq per sector
    0 0 0 21722 26243 3939 2184 1798 774 159 47 12 3 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    Histogram of I/O time between irq and end
    7 0 43393 11341 457 179 1494 3 3 1 2 1 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    # of req in chanq at enqueuing (1..32)
    8 3 4 5 56861 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

[Chart: ext2, 8 processes. Histogram of I/O times (microseconds).]
[Chart: ext2, 16 processes. Histogram of I/O times (microseconds).]
[Chart: ext3, 8 processes. Histogram of I/O times (microseconds).]
[Chart: ext3, 16 processes. Histogram of I/O times (microseconds).]
[Chart: ext3, 16 processes. Histogram of I/O time before SSCH (IOSQ).]
[Chart: ext3, 16 processes. Histogram of I/O time between SSCH and IRQ.]
[Chart: ext3, 16 processes. Number of requests in the subchannel queue at enqueuing (1 to 5).]
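The histograms above come from the DASD driver's statistics interface in /proc. A minimal collection sketch, assuming the driver's set on / set off / reset control keywords; the dbench invocation is one example workload:

    # switch DASD statistics gathering on
    echo set on > /proc/dasd/statistics
    # zero the counters directly before the measurement
    echo reset > /proc/dasd/statistics
    # run the workload, e.g. dbench with 16 client processes
    dbench 16
    # read the request-size and timing histograms
    cat /proc/dasd/statistics
    # switch gathering off again
    echo set off > /proc/dasd/statistics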
Logical Volume Manager (LVM)

- Linux software RAID with RAID levels 0, 1, 4 and 5
- excellent performance
- excellent flexibility (resizing, adding/removing disks)
- available in SLES7, SLES8 and Red Hat RHEL 3
- on zSeries, supports multipath and PAV (under z/VM)
- http://www.sistina.com/products_lvm.htm

LVM System Structure

[Diagram: a (journaled) file system sits on top of the Logical Volume Manager, which drives physical disks directly or through a RAID adapter in front of a RAID array.]

Improving Disk Performance with LVM

[Diagram: a striped datastream spread across three physical volumes.]

- With LVM and striping, parallelism across disks is achieved (a setup sketch appears after the Questions slide).

LVM Results

[Chart: single disk vs. LVM comparison, z/VM, 31-bit, 2 CPUs. Throughput in MB/s (0-130) for ext2, ext3, jfs and reiserfs, each with 8 and 16 processes on LVM and on a single disk.]

Filesystem Options

The following charts compare tuning options per file system; the corresponding ext3 invocations are sketched after the Questions slide.

[Chart: single disk, ext3 optimizations, z/VM, 31-bit, 4 CPUs. Throughput in MB/s for 8 and 16 processes with default, ordered, writeback, full data journal, external journal, and 100M/200M/400M journal settings.]
[Chart: single disk, reiserfs optimizations, z/VM, 31-bit, 4 CPUs. Throughput in MB/s for 8 and 16 processes with default, no hashed relocation, no border allocation, and no unhashed relocation settings.]
[Chart: single disk, jfs optimizations, z/VM, 31-bit, 4 CPUs. Throughput in MB/s for 8 and 16 processes with default, external journal, 100 MB journal, and 200 MB journal settings.]

CPU Load

[Chart: LPAR, 1 CPU, 8 processes, single disk. CPU load in percent, split into idle, system and user time, for ext2, ext3, reiserfs and jfs in 31-bit and 64-bit mode.]

VM Overhead

[Chart: CP and guest CPU consumption for ext2, ext3, reiserfs and jfs, 31-bit, each on a single disk and on LVM.]

Recovery Times

[Chart: recovery times.]

Outlook on Kernel 2.6

[Chart: journaling file systems on kernel 2.4.19 vs. kernel 2.6.0-test5. Throughput in MB/s (0-60) for ext3, jfs and reiserfs.]

- preliminary results
- jfs was not compiled into the 2.6 kernel

Summary

- Journaling file systems increase data integrity significantly.
- Journaling file systems dramatically reduce system outage times.
- The performance cost is at least 30%.
- reiserfs is slightly faster than ext3, but needs much more CPU.
- Journaling file systems profit from LVM.
- jfs has the fastest recovery times.
- Kernel 2.6 will bring more improvements (increased throughput, reduced CPU load, iostat for ECKD).

Questions?
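Backup: the striped volume from the LVM measurements could be built with the standard LVM tools along the following lines. A minimal sketch, assuming six hypothetical DASD partitions /dev/dasdb1 through /dev/dasdg1 and freely chosen volume group name, stripe size and volume size:

    # register the six DASD partitions as physical volumes
    pvcreate /dev/dasdb1 /dev/dasdc1 /dev/dasdd1 /dev/dasde1 /dev/dasdf1 /dev/dasdg1

    # group them into one volume group
    vgcreate vg_bench /dev/dasdb1 /dev/dasdc1 /dev/dasdd1 /dev/dasde1 /dev/dasdf1 /dev/dasdg1

    # logical volume striped across all six disks
    # (-i: number of stripes, -I: stripe size in KB, -L: volume size)
    lvcreate -i 6 -I 64 -L 12G -n lv_bench vg_bench

    # create and mount a journaling file system on the striped volume
    mke2fs -j /dev/vg_bench/lv_bench
    mount -t ext3 /dev/vg_bench/lv_bench /mnt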
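Backup: the ext3 variants in the filesystem-options chart map onto standard mke2fs and mount options. A minimal sketch, assuming hypothetical devices /dev/dasdb1 (data) and /dev/dasdc1 (external journal); each mount line shows one alternative journaling mode:

    # the journaling mode is selected at mount time
    mount -t ext3 -o data=ordered   /dev/dasdb1 /mnt   # default: metadata plus ordered data writes
    mount -t ext3 -o data=writeback /dev/dasdb1 /mnt   # metadata journaling only
    mount -t ext3 -o data=journal   /dev/dasdb1 /mnt   # full data journaling

    # the journal size is selected when the file system is made (-J size= in MB)
    mke2fs -j -J size=400 /dev/dasdb1

    # external journal: format a journal device, then bind the file system to it
    mke2fs -O journal_dev /dev/dasdc1
    mke2fs -j -J device=/dev/dasdc1 /dev/dasdb1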