Performance Improvement of Btrfs
Miao Xie
Comparison between Btrfs and Ext3/4 Issue analysis (We have investigated) Small file sequential read Large file random write (Direct I/O and fsync) File creation/deletion Future work
2 Comparison between Btrfs and Ext3/4
Performance test environment Hardware • CPU : Xeon(TM) X5260 3.33G X 2 ( 4 cores ) • Memory : 4GB • Disk : 20GB Software • OS : RHEL6(x86_64) • Kernel : 2.6.38 • Glibc : 2.12 • Btrfs-progs : 0.9 • Sysbench: 0.4.12
3 Comparison between Btrfs and Ext3/4
73 cases in total 72 file I/O cases, mix the following conditions: • Small file / Large file • Write / Read • Random / Sequential • Sync / Async / Direct I/O • Single-thread / Multi-thread • Different block size (1Kb, 4Kb, 32Kb) * File creation/deletion • Measure the speed of empty file creation/deletion
* Block size (bs): read or write BYTES bytes at a time.
4 Small file random read performance
6000
5000
) 4000 s / b K
: t i
n 3000 EXT3 U (
d EXT4 e e BTRFS p 2000 S
O I
1000
0 bs = 1K bs = 4K bs = 1K bs = 4K bs = 1K bs = 4K bs = 1K bs = 4K
1 Thread 8 Threads 1 Thread 8 Threads
DirectI/O General Read
5 Small file random write performance
3000
2500 )
s 2000 / b K
: t i
n 1500 EXT3 U (
d EXT4 e e
p BTRFS
S 1000
O I
500
0 bs = 1K bs = 4K bs = 1K bs = 4K bs = 1K bs = 4K bs = 1K bs = 4K
1 Thread 8 Threads 1 Thread 8 Threads
DirectI/O Write (fsync)
Write (fsync): write data into the file, and do fsync every 100 requests
6 Small file sequential read performance
60.00
50.00
) 40.00 s / b M
: t i
n 30.00 EXT3 U (
d EXT4 e
e BTRFS p 20.00 S
O I
10.00
0.00 bs = 1K bs = 4K bs = 1K bs = 4K bs = 1K bs = 4K bs = 1K bs = 4K
1 Thread 8 Threads 1 Thread 8 Threads
DirectI/O General Read
7 Small file sequential write performance
6000
5000 ) s
/ 4000 b K
: t i
n 3000
U EXT3 (
d EXT4 e e BTRFS p 2000 S
O I
1000
0 bs = 1K bs = 4K bs = 1K bs = 4K bs = 1K bs = 4K bs = 1K bs = 4K
1 Thread 8 Threads 1 Thread 8 Threads
DirectI/O Write (fsync)
Write (fsync): write data into the file, and do fsync every 100 requests
8 Large file random read performance
16.00
14.00
12.00 ) s /
b 10.00 M
: t i
n 8.00
U EXT3 (
d EXT4 e
e 6.00
p BTRFS S
O I 4.00
2.00
0.00 bs = 4K bs = 32K bs = 4K bs = 32K bs = 4K bs = 32K bs = 4K bs = 32K
1 Thread 8 Threads 1 Thread 8 Threads
DirectIO General Read
9 Large file sequential read performance
90.00
80.00
70.00
) 60.00 s / b M
: 50.00 t i n
U EXT3
( 40.00
d EXT4 e e 30.00 p BTRFS S
O I 20.00
10.00
0.00 bs = 4K bs = 32K bs = 4K bs = 32K bs = 4K bs = 32K bs = 4K bs = 32K
1 Thread 8 Threads 1 Thread 8 Threads
DirectIO General Read
10 Large file random write performance (1/2)
14.00
12.00
10.00 ) s / b
M 8.00
: t i
n EXT3 U ( 6.00 EXT4 d e
e BTRFS p S 4.00 O I
2.00
0.00 bs = 4K bs = 32K bs = 4K bs = 32K bs = 4K bs = 32K bs = 4K bs = 32K
1 Thread 8 Threads 1 Thread 8 Threads
DirectIO Write (fsync)
11 Large file random write performance (2/2)
40.00
35.00
30.00 ) s /
b 25.00 M
: t i
n 20.00
U EXT3 (
d EXT4 e 15.00 e
p BTRFS S
O 10.00 I
5.00
0.00 bs = 4K bs = 32K bs = 4K bs = 32K
1 Thread 8 Threads
General Write
12 Large file sequential write performance (1/2)
8.00
7.00
6.00 ) s
/ 5.00 b M
: t i
n 4.00 EXT3 U (
d EXT4 e 3.00 e BTRFS p S
O I 2.00
1.00
0.00 bs = 4K bs = 32K bs = 4K bs = 32K
1 Thread 8 Threads
DirectIO
13 Large file sequential write performance (2/2)
90.00
80.00
70.00
60.00 ) s / b M
50.00 : t i n EXT3 U (
40.00
d EXT4 e e
p 30.00 BTRFS S
O I 20.00
10.00
0.00 bs = 4K bs = 32K bs = 4K bs = 32K bs = 4K bs = 32K bs = 4K bs = 32K
1 Thread 8 Threads 1 Thread 8 Threads
Write (fsync) General Write
14 File creation/deletion performance
Create/delete lots of empty files to measure the speed of file creation and deletion.
140000
120000
100000 )
c 80000 e s / s
e Ext3 l i f
:
t 60000 Ext4 i n
U Btrfs ( 40000
20000
0 Creation Deletion
15 Comparison between Btrfs and Ext3/4
The performance of Btrfs is quite poor in the following cases (> 20% lower than Ext3/4) Small file random read (Not inline file) Small file sequential read Small file random/sequential write Large file random write (Direct I/O and fsync) Large file random write (general write, bs = 4Kb) File creation and deletion
16 Agenda
Comparison between Btrfs and Ext3/4 Issue analysis (We have investigated) Small file sequential read Large file random write (Direct I/O and fsync) File creation/deletion Future work
17 Small file sequential read
Reasons Metadata fragment -> The file extent reading latency -> The delay of file data reading • Btrfs must read file extent before reading file data (no matter the small file is inlined or not), but the disk has to reposition the reading offset frequently because of the fragment, and the readahead function can’t work well. So …
Fs/file tree
Disk
18 Small file sequential read
Reason verification Do small file sequential read after defragment
30.00
25.00 ) s /
b 20.00 M
: t i n
U 15.00 (
No Defrag d
e After Defrag e
p 10.00 S
O I
5.00
0.00 bs = 1K bs = 4K bs = 1K bs = 4K bs = 1K bs = 4K bs = 1K bs = 4K
1 Thread 8 Threads 1 Thread 8 Threads
DirectI/O General Read
19 Small file sequential read
Solution Pre-allocation for b+ tree: Introduce free space clusters for each node in the tree, then we can allocate contiguous free space from the parent node’s cluster to store the sibling leaves closely (The patch of this solution is still under test, hasn’t be posted)
Fs/file tree
Cluster
Cluster Cluster
Disk
20 Small file sequential read
Improvement result
60.00
50.00 ) s / b
M 40.00
: t i n
U EXT3 (
30.00
d EXT4 e
e BTRFS p
S 20.00
BTRFS + Patch O I
10.00
0.00 bs = 1K bs = 4K bs = 1K bs = 4K bs = 1K bs = 4K bs = 1K bs = 4K
1 Thread 8 Threads 1 Thread 8 Threads
DirectI/O General Read Further Improvement Introduce the auto defragment for metadata Apply the new metadata readahead API written by Arne 21 Agenda
Comparison between Btrfs and Ext3/4 Issue analysis (We have investigated) Small file sequential read Large file random write (Direct I/O and fsync) File creation/deletion Future work
22 Large file random write (Direct IO and fsync) Background – What is tree logging? Tree logging is a special write ahead log of dirty metadata. Purpose: Reduce the write requests of the metadata when fsyncs and O_SYNCs happen.
Implementation: Copy the changed items into a special tree (log tree, one per fs/file tree), and then write that tree to disk. After a crash, Btrfs recover the fs/file tree by that tree.
23 Large file random write (Direct IO and fsync) Reasons Log lots of unchanged metadata (Ex. Csum, File extent)
Application
File
Extent 1 Extent 2 Extent 3 … Extent N
Change the relative Csum tree Checksums
Log all the csum data of this file Log tree
Extent1 Extent2 Extent3 ExtentN Csum Csum Csum Csum
Disk Write to disk
Checksum of the file’s extent Checksum of the file’s extent that be changed
The extent that be changed 24 Large file random write (Direct IO and fsync)
Reason verification Do large file random write test after closing tree log function (mount with -o notreelog)
50.00
45.00
40.00
) 35.00 s / b M
30.00 : t i n
U 25.00 (
d BTRFS e
e 20.00
p BTRFS(no treelog) S
O 15.00 I
10.00
5.00
0.00 bs = 4K bs = 32K bs = 4K bs = 32K bs = 4K bs = 32K bs = 4K bs = 32K
1 Thread 8 Threads 1 Thread 8 Threads
DirectIO Write (fsync)
25 Large file random write (Direct IO and fsync) Solution Don’t log unchanged metadata: Introduce sub- transaction id to filter the unchanged metadata (v2.6.41)
Application File
Extent 1 Extent 2 Extent 3 … Extent N
Change the relative Csum tree Checksums
Log all the csum data of this file Log tree
Extent1 Extent2 Extent3 ExtentN Csum Csum Csum Csum
Disk Write to disk
Checksum of the file’s extent Checksum of the file’s extent that be changed
The extent that be changed 26 Large file random write (Direct IO and fsync) Solution Don’t log unchanged metadata: Introduce sub- transaction id to filter the unchanged metadata (v2.6.41)
Application File
Extent 1 Extent 2 Extent 3 … Extent N
Change the relative Csum tree Checksums
Log all the csum data of this file Log tree
Extent1 Extent2 Extent3 ExtentN Csum Csum Csum Csum
Disk Write to disk
Checksum of the file’s extent Checksum of the file’s extent that be changed
The extent that be changed 27 Large file random write (Direct IO and fsync) Solution Don’t log unchanged metadata: Introduce sub- transaction id to filter the unchanged metadata (v2.6.41)
Application File
Extent 1 Extent 2 Extent 3 … Extent N
Change the relative Csum tree Checksums
Log the changed csum data of this file Log tree
Extent1 Extent2 Extent3 ExtentN Csum Csum Csum Csum
Disk Write to disk
Checksum of the file’s extent Checksum of the file’s extent that be changed
The extent that be changed 28 Large file random write (Direct IO and fsync)
Improvement result
50.00
45.00
40.00
) 35.00 s / b M
30.00 : t i
n 25.00 EXT3 U (
d EXT4
e 20.00 e BTRFS p S 15.00 BTRFS + Patch O I 10.00
5.00
0.00 bs = 4K bs = 32K bs = 4K bs = 32K bs = 4K bs = 32K bs = 4K bs = 32K
1 Thread 8 Threads 1 Thread 8 Threads
DirectIO Write (fsync)
29 Agenda
Comparison between Btrfs and Ext3/4 Issue analysis (We have investigated) Small file sequential read Large file random write (Direct I/O and fsync) File creation/deletion Future work
30 File creation/deletion
Reasons Btrfs does more metadata insertion and deletion. Btrfs Ext4(Not Sure)
inode inode name back reference ACL File Creation ACL directory entry directory item directory name index
inode inode inode back reference ACL ACL directory entry File Deletion directory item directory name index logged directory item logged directory name index Btrfs must search the b+ tree to look up the place, where the inode will be stored, when updating inode. (Time complexity: O(log(n)), But Ext3/4 is O(1)) Searching nodes/leaves in the rb-tree spends lots of time
31 File creation/deletion
Solution Batch operation -- Insert/delete a batch of the directory name indexes (v2.6.40)
Delay operation -- Delay to update the inode information in the b+ tree (v2.6.40)
Using radix tree instead of rb-tree (v2.6.37)
32 File creation/deletion
Improvement result Create/delete lots of empty files to measure the speed of file creation and deletion.
140000
120000
100000 ) c e s / 80000
s Ext3 e l i
f Ext4
: t i 60000 Btrfs n U
( Btrfs + Patch 40000
20000
0 Creation Deletion
33 Agenda
Comparison between Btrfs and Ext3/4 Issue analysis (We have investigated) Small file sequential read Large file random write (Direct I/O and fsync) File creation/deletion Future work
34 Future work
Improve small file sequential read performance further Improve small file random read performance (Not inline file) Improve small file random/sequential write performance Do other benchmarks and improve bad cases
35 Q/A
36