Performance Improvement of

Miao Xie Li Zefan Agenda

 Comparison between Btrfs and Ext3/4  Issue analysis (We have investigated)  Small file sequential read  Large file random (Direct I/O and fsync)  File creation/deletion  Future work

2 Comparison between Btrfs and Ext3/4

 Performance test environment  Hardware • CPU : Xeon(TM) X5260 3.33G X 2 ( 4 cores ) • Memory : 4GB • Disk : 20GB  Software • OS : RHEL6(x86_64) • Kernel : 2.6.38 • Glibc : 2.12 • Btrfs-progs : 0.9 • Sysbench: 0.4.12

3 Comparison between Btrfs and Ext3/4

 73 cases in total  72 file I/O cases, mix the following conditions: • Small file / Large file • Write / Read • Random / Sequential • Sync / Async / Direct I/O • Single-thread / Multi-thread • Different block size (1Kb, 4Kb, 32Kb) *  File creation/deletion • Measure the speed of empty file creation/deletion

* Block size (bs): read or write BYTES bytes at a time.

4 Small file random read performance

6000

5000

) 4000 s / b K

: t i

n 3000 EXT3 U (

d e e BTRFS p 2000 S

O I

1000

0 bs = 1K bs = 4K bs = 1K bs = 4K bs = 1K bs = 4K bs = 1K bs = 4K

1 Thread 8 Threads 1 Thread 8 Threads

DirectI/O General Read

5 Small file random write performance

3000

2500 )

s 2000 / b K

: t i

n 1500 EXT3 U (

d EXT4 e e

p BTRFS

S 1000

O I

500

0 bs = 1K bs = 4K bs = 1K bs = 4K bs = 1K bs = 4K bs = 1K bs = 4K

1 Thread 8 Threads 1 Thread 8 Threads

DirectI/O Write (fsync)

Write (fsync): write data into the file, and do fsync every 100 requests

6 Small file sequential read performance

60.00

50.00

) 40.00 s / b M

: t i

n 30.00 EXT3 U (

d EXT4 e

e BTRFS p 20.00 S

O I

10.00

0.00 bs = 1K bs = 4K bs = 1K bs = 4K bs = 1K bs = 4K bs = 1K bs = 4K

1 Thread 8 Threads 1 Thread 8 Threads

DirectI/O General Read

7 Small file sequential write performance

6000

5000 ) s

/ 4000 b K

: t i

n 3000

U EXT3 (

d EXT4 e e BTRFS p 2000 S

O I

1000

0 bs = 1K bs = 4K bs = 1K bs = 4K bs = 1K bs = 4K bs = 1K bs = 4K

1 Thread 8 Threads 1 Thread 8 Threads

DirectI/O Write (fsync)

Write (fsync): write data into the file, and do fsync every 100 requests

8 Large file random read performance

16.00

14.00

12.00 ) s /

b 10.00 M

: t i

n 8.00

U EXT3 (

d EXT4 e

e 6.00

p BTRFS S

O I 4.00

2.00

0.00 bs = 4K bs = 32K bs = 4K bs = 32K bs = 4K bs = 32K bs = 4K bs = 32K

1 Thread 8 Threads 1 Thread 8 Threads

DirectIO General Read

9 Large file sequential read performance

90.00

80.00

70.00

) 60.00 s / b M

: 50.00 t i n

U EXT3

( 40.00

d EXT4 e e 30.00 p BTRFS S

O I 20.00

10.00

0.00 bs = 4K bs = 32K bs = 4K bs = 32K bs = 4K bs = 32K bs = 4K bs = 32K

1 Thread 8 Threads 1 Thread 8 Threads

DirectIO General Read

10 Large file random write performance (1/2)

14.00

12.00

10.00 ) s / b

M 8.00

: t i

n EXT3 U ( 6.00 EXT4 d e

e BTRFS p S 4.00 O I

2.00

0.00 bs = 4K bs = 32K bs = 4K bs = 32K bs = 4K bs = 32K bs = 4K bs = 32K

1 Thread 8 Threads 1 Thread 8 Threads

DirectIO Write (fsync)

11 Large file random write performance (2/2)

40.00

35.00

30.00 ) s /

b 25.00 M

: t i

n 20.00

U EXT3 (

d EXT4 e 15.00 e

p BTRFS S

O 10.00 I

5.00

0.00 bs = 4K bs = 32K bs = 4K bs = 32K

1 Thread 8 Threads

General Write

12 Large file sequential write performance (1/2)

8.00

7.00

6.00 ) s

/ 5.00 b M

: t i

n 4.00 EXT3 U (

d EXT4 e 3.00 e BTRFS p S

O I 2.00

1.00

0.00 bs = 4K bs = 32K bs = 4K bs = 32K

1 Thread 8 Threads

DirectIO

13 Large file sequential write performance (2/2)

90.00

80.00

70.00

60.00 ) s / b M

50.00 : t i n EXT3 U (

40.00

d EXT4 e e

p 30.00 BTRFS S

O I 20.00

10.00

0.00 bs = 4K bs = 32K bs = 4K bs = 32K bs = 4K bs = 32K bs = 4K bs = 32K

1 Thread 8 Threads 1 Thread 8 Threads

Write (fsync) General Write

14 File creation/deletion performance

 Create/delete lots of empty files to measure the speed of file creation and deletion.

140000

120000

100000 )

c 80000 e s / s

e Ext3 l i f

:

t 60000 Ext4 i n

U Btrfs ( 40000

20000

0 Creation Deletion

15 Comparison between Btrfs and Ext3/4

 The performance of Btrfs is quite poor in the following cases (> 20% lower than Ext3/4)  Small file random read (Not inline file)  Small file sequential read  Small file random/sequential write  Large file random write (Direct I/O and fsync)  Large file random write (general write, bs = 4Kb)  File creation and deletion

16 Agenda

 Comparison between Btrfs and Ext3/4  Issue analysis (We have investigated)  Small file sequential read  Large file random write (Direct I/O and fsync)  File creation/deletion  Future work

17 Small file sequential read

 Reasons  Metadata fragment -> The file extent reading latency -> The delay of file data reading • Btrfs must read file extent before reading file data (no matter the small file is inlined or not), but the disk has to reposition the reading offset frequently because of the fragment, and the readahead function can’t work well. So …

Fs/file tree

Disk

18 Small file sequential read

 Reason verification  Do small file sequential read after defragment

30.00

25.00 ) s /

b 20.00 M

: t i n

U 15.00 (

No Defrag d

e After Defrag e

p 10.00 S

O I

5.00

0.00 bs = 1K bs = 4K bs = 1K bs = 4K bs = 1K bs = 4K bs = 1K bs = 4K

1 Thread 8 Threads 1 Thread 8 Threads

DirectI/O General Read

19 Small file sequential read

 Solution  Pre-allocation for b+ tree: Introduce free space clusters for each node in the tree, then we can allocate contiguous free space from the parent node’s cluster to store the sibling leaves closely (The patch of this solution is still under test, hasn’t be posted)

Fs/file tree

Cluster

Cluster Cluster

Disk

20 Small file sequential read

 Improvement result

60.00

50.00 ) s / b

M 40.00

: t i n

U EXT3 (

30.00

d EXT4 e

e BTRFS p

S 20.00

BTRFS + Patch O I

10.00

0.00 bs = 1K bs = 4K bs = 1K bs = 4K bs = 1K bs = 4K bs = 1K bs = 4K

1 Thread 8 Threads 1 Thread 8 Threads

DirectI/O General Read  Further Improvement  Introduce the auto defragment for metadata  Apply the new metadata readahead API written by Arne 21 Agenda

 Comparison between Btrfs and Ext3/4  Issue analysis (We have investigated)  Small file sequential read  Large file random write (Direct I/O and fsync)  File creation/deletion  Future work

22 Large file random write (Direct IO and fsync)  Background – What is tree logging? Tree logging is a special write ahead log of dirty metadata.  Purpose: Reduce the write requests of the metadata when fsyncs and O_SYNCs happen.

 Implementation: Copy the changed items into a special tree (log tree, one per fs/file tree), and then write that tree to disk. After a crash, Btrfs recover the fs/file tree by that tree.

23 Large file random write (Direct IO and fsync)  Reasons  Log lots of unchanged metadata (Ex. Csum, File extent)

Application

File

Extent 1 Extent 2 Extent 3 … Extent N

Change the relative Csum tree Checksums

Log all the csum data of this file Log tree

Extent1 Extent2 Extent3 ExtentN Csum Csum Csum Csum

Disk Write to disk

Checksum of the file’s extent Checksum of the file’s extent that be changed

The extent that be changed 24 Large file random write (Direct IO and fsync)

 Reason verification  Do large file random write test after closing tree log function (mount with -o notreelog)

50.00

45.00

40.00

) 35.00 s / b M

30.00 : t i n

U 25.00 (

d BTRFS e

e 20.00

p BTRFS(no treelog) S

O 15.00 I

10.00

5.00

0.00 bs = 4K bs = 32K bs = 4K bs = 32K bs = 4K bs = 32K bs = 4K bs = 32K

1 Thread 8 Threads 1 Thread 8 Threads

DirectIO Write (fsync)

25 Large file random write (Direct IO and fsync)  Solution  Don’t log unchanged metadata: Introduce sub- transaction id to filter the unchanged metadata (v2.6.41)

Application File

Extent 1 Extent 2 Extent 3 … Extent N

Change the relative Csum tree Checksums

Log all the csum data of this file Log tree

Extent1 Extent2 Extent3 ExtentN Csum Csum Csum Csum

Disk Write to disk

Checksum of the file’s extent Checksum of the file’s extent that be changed

The extent that be changed 26 Large file random write (Direct IO and fsync)  Solution  Don’t log unchanged metadata: Introduce sub- transaction id to filter the unchanged metadata (v2.6.41)

Application File

Extent 1 Extent 2 Extent 3 … Extent N

Change the relative Csum tree Checksums

Log all the csum data of this file Log tree

Extent1 Extent2 Extent3 ExtentN Csum Csum Csum Csum

Disk Write to disk

Checksum of the file’s extent Checksum of the file’s extent that be changed

The extent that be changed 27 Large file random write (Direct IO and fsync)  Solution  Don’t log unchanged metadata: Introduce sub- transaction id to filter the unchanged metadata (v2.6.41)

Application File

Extent 1 Extent 2 Extent 3 … Extent N

Change the relative Csum tree Checksums

Log the changed csum data of this file Log tree

Extent1 Extent2 Extent3 ExtentN Csum Csum Csum Csum

Disk Write to disk

Checksum of the file’s extent Checksum of the file’s extent that be changed

The extent that be changed 28 Large file random write (Direct IO and fsync)

 Improvement result

50.00

45.00

40.00

) 35.00 s / b M

30.00 : t i

n 25.00 EXT3 U (

d EXT4

e 20.00 e BTRFS p S 15.00 BTRFS + Patch O I 10.00

5.00

0.00 bs = 4K bs = 32K bs = 4K bs = 32K bs = 4K bs = 32K bs = 4K bs = 32K

1 Thread 8 Threads 1 Thread 8 Threads

DirectIO Write (fsync)

29 Agenda

 Comparison between Btrfs and Ext3/4  Issue analysis (We have investigated)  Small file sequential read  Large file random write (Direct I/O and fsync)  File creation/deletion  Future work

30 File creation/deletion

 Reasons  Btrfs does more metadata insertion and deletion. Btrfs Ext4(Not Sure)

inode inode name back reference ACL File Creation ACL entry directory item directory name index

inode inode inode back reference ACL ACL directory entry File Deletion directory item directory name index logged directory item logged directory name index  Btrfs must search the b+ tree to look up the place, where the inode will be stored, when updating inode. (Time complexity: O(log(n)), But Ext3/4 is O(1))  Searching nodes/leaves in the rb-tree spends lots of time

31 File creation/deletion

 Solution  Batch operation -- Insert/delete a batch of the directory name indexes (v2.6.40)

 Delay operation -- Delay to update the inode information in the b+ tree (v2.6.40)

 Using radix tree instead of rb-tree (v2.6.37)

32 File creation/deletion

 Improvement result  Create/delete lots of empty files to measure the speed of file creation and deletion.

140000

120000

100000 ) c e s / 80000

s Ext3 e l i

f Ext4

: t i 60000 Btrfs n U

( Btrfs + Patch 40000

20000

0 Creation Deletion

33 Agenda

 Comparison between Btrfs and Ext3/4  Issue analysis (We have investigated)  Small file sequential read  Large file random write (Direct I/O and fsync)  File creation/deletion  Future work

34 Future work

 Improve small file sequential read performance further  Improve small file random read performance (Not inline file)  Improve small file random/sequential write performance  Do other benchmarks and improve bad cases

35 Q/A

36