Developing a Simple Filesystem
I. File Concepts
II. File IO / Standard IO Library
III. Filesystem-based Concepts
IV. Kernel Concepts
V. Developing Filesystem
VI. Other Filesystem
2008. 10. 31 LG Electronics Software Center 심재훈
1 File concepts
File Type
. Regular / Directories / Symbolic Links / Hard Links / Named Pipes / Special Files
File Descriptors
. handle through which the file can be subsequently accessed
Basic File Properties
. file type / access permission
. link count
. owner and group of file
. size / date / name
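The basic file properties listed above can be read from user space with lstat(). The following is a minimal sketch, not part of the original slides; the helper names file_type_name() and print_file_properties() are made up for this illustration.

```c
#define _XOPEN_SOURCE 700   /* ask for POSIX.1-2008 declarations such as lstat() */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: one-word name for the file type encoded in st_mode. */
const char *file_type_name(mode_t mode)
{
    if (S_ISREG(mode))  return "regular";
    if (S_ISDIR(mode))  return "directory";
    if (S_ISLNK(mode))  return "symlink";
    if (S_ISFIFO(mode)) return "fifo";
    if (S_ISCHR(mode))  return "char-device";
    if (S_ISBLK(mode))  return "block-device";
    if (S_ISSOCK(mode)) return "socket";
    return "unknown";
}

/* Hypothetical helper: print the basic properties of `path`.
 * Returns 0 on success, -1 if lstat() fails. */
int print_file_properties(const char *path)
{
    struct stat st;

    if (lstat(path, &st) != 0)          /* lstat: do not follow symlinks */
        return -1;
    printf("name : %s\n", path);
    printf("type : %s\n", file_type_name(st.st_mode));
    printf("perm : %04o\n", (unsigned)(st.st_mode & 07777));
    printf("nlink: %lu\n", (unsigned long)st.st_nlink);
    printf("owner: uid %lu, gid %lu\n",
           (unsigned long)st.st_uid, (unsigned long)st.st_gid);
    printf("size : %lld bytes\n", (long long)st.st_size);
    printf("mtime: %lld\n", (long long)st.st_mtime);
    return 0;
}
```

Calling print_file_properties("/") prints the properties of the root directory; the exact values depend on the system.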
2 File concepts
Basic File Properties (cont'd)
3 File concepts
# VFS Inode structure
4 File concepts
[Figure: directory entries, inodes and link counts for KELP_seminar/]
KELP_seminar/ (inode 3333, nlink 3) holds these {name, inode number} entries
. {".", 3333} / {"..", 2}
. {"directory", 12668259}
. {"file", 8463735}
. {"hard_link", 8463735}
. {"symbolic_link", 8463736}
. {"symbolic_link2", 8463740}

name              inode number   nlink   contents
directory/        12668259       2       {".", 12668259} / {"..", 3333}
file, hard_link   8463735        2       regular file
symbolic_link     8463736        1       "file"
symbolic_link2    8463740        1       "./file"
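The link counts in the figure can be reproduced with link(), symlink() and lstat(). The sketch below is an illustration only: struct link_demo and run_link_demo() are hypothetical names, and the demo assumes /tmp is writable and supports hard links.

```c
#define _XOPEN_SOURCE 700   /* mkdtemp(), lstat(), symlink() */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical result record for the demo. */
struct link_demo {
    int  same_inode;         /* 1 if "file" and "hard_link" share one inode   */
    long file_nlink;         /* link count of that inode after link()         */
    int  symlink_own_inode;  /* 1 if "symbolic_link" has its own inode number */
    long symlink_nlink;      /* link count of the symbolic link itself        */
};

/* Create file, hard_link and symbolic_link in a temporary directory,
 * record their inode numbers and link counts, then clean up.
 * Returns 0 on success, -1 on any failure. */
int run_link_demo(struct link_demo *out)
{
    char dir[] = "/tmp/linkdemoXXXXXX";
    char file[64], hard[64], soft[64];
    struct stat sf, sh, ss;
    int fd, rc = -1;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(file, sizeof file, "%s/file", dir);
    snprintf(hard, sizeof hard, "%s/hard_link", dir);
    snprintf(soft, sizeof soft, "%s/symbolic_link", dir);

    fd = open(file, O_CREAT | O_WRONLY, 0644);
    if (fd < 0)
        goto out;
    close(fd);
    if (link(file, hard) != 0)          /* second name for the same inode */
        goto out;
    if (symlink("file", soft) != 0)     /* new inode whose content is a path */
        goto out;
    if (stat(file, &sf) != 0 || stat(hard, &sh) != 0 || lstat(soft, &ss) != 0)
        goto out;

    out->same_inode = (sf.st_ino == sh.st_ino);
    out->file_nlink = (long)sf.st_nlink;
    out->symlink_own_inode = (ss.st_ino != sf.st_ino);
    out->symlink_nlink = (long)ss.st_nlink;
    rc = 0;
out:
    unlink(soft);                       /* failures during cleanup are ignored */
    unlink(hard);
    unlink(file);
    rmdir(dir);
    return rc;
}
```

As in the figure, the hard link adds a directory entry to an existing inode and raises its nlink, while the symbolic link gets an inode of its own.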
5 File IO
basic file operations
. open() / creat() / close() / lseek() / read() / write()
. truncate() / unlink()
seeking and IO combined
. pread() / pwrite()
. read-write loop run 1 million times: 35 sec with lseek() + read() / write() → 25 sec with pread() / pwrite()
vectored IO
. readv() / writev()
. use when the data from a single read needs to be placed in different areas of memory
asynchronous IO
. aio_read() / aio_write() / aio_return() / aio_error() / aio_cancel()
file and record locking
. fcntl() : F_GETLK / F_SETLK / F_SETLKW
memory mapped files, direct IO
sparse file
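A short sketch of "seeking and IO combined": pwrite() and pread() take the file offset as an argument, so no separate lseek() call is needed and the descriptor's own offset is left untouched. The function name positional_io_demo() and the temporary file path are assumptions made for this example.

```c
#define _XOPEN_SOURCE 700   /* pread(), pwrite(), mkstemp() */
#include <stdlib.h>
#include <unistd.h>

/* Write `len` bytes of `msg` at offset `off` of a temporary file, read them
 * back into `back` from the same offset, and check that the descriptor's
 * current offset never moved. Returns 0 on success, -1 otherwise. */
int positional_io_demo(const char *msg, off_t off, char *back, size_t len)
{
    char path[] = "/tmp/preadXXXXXX";
    int fd = mkstemp(path);
    int ok;

    if (fd < 0)
        return -1;
    unlink(path);                 /* the file goes away when fd is closed */

    ok = pwrite(fd, msg, len, off) == (ssize_t)len
      && pread(fd, back, len, off) == (ssize_t)len
      && lseek(fd, 0, SEEK_CUR) == 0;   /* offset is still at the start */

    close(fd);
    return ok ? 0 : -1;
}
```

Writing at a non-zero offset of an empty file, as positional_io_demo("hello", 4096, buf, 5) does, also leaves a hole in front of the data, which is how a sparse file comes about.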
6 File IO
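## Vectored IO ##
    struct iovec uiop[3] = { {addr1, 512}, {addr2, 512}, {addr3, 1024} };
    readv(fd, uiop, 3);
[Figure: one read starting at the file offset is scattered into addr1, addr2 and addr3 in the user address space]

## Sparse File ##
[Figure: along the file offset axis, up to the file size, part of the file is a not allocated area]

## Memory Mapping IO ##
    addr = mmap(NULL, MAPSZ, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    memcpy(addr, buf, size);
[Figure: the user address space is mapped onto the page cache pages that back the file offset range]

## Direct IO ##
    fd = open(filename, O_WRONLY | O_DIRECT, mode);
    write(fd, buf, size);
[Figure: data moves between the user address space and the file offset without going through the page cache]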
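The vectored IO picture can be tried out with a small scatter read. This is a sketch under assumptions: scatter_read_demo() is a hypothetical name, and the buffers are 4 bytes each instead of the 512 / 512 / 1024 bytes used on the slide.

```c
#define _XOPEN_SOURCE 700   /* readv(), mkstemp() */
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

/* Write 12 bytes to a temporary file, then fill three separate 4-byte
 * buffers with a single readv() call. Returns the number of bytes read,
 * or -1 on error. */
ssize_t scatter_read_demo(char a[4], char b[4], char c[4])
{
    char path[] = "/tmp/readvXXXXXX";
    struct iovec iov[3] = {
        { .iov_base = a, .iov_len = 4 },
        { .iov_base = b, .iov_len = 4 },
        { .iov_base = c, .iov_len = 4 },
    };
    ssize_t n = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                                  /* removed on close */
    if (write(fd, "AAAABBBBCCCC", 12) == 12 && lseek(fd, 0, SEEK_SET) == 0)
        n = readv(fd, iov, 3);                     /* buffers fill in array order */
    close(fd);
    return n;
}
```

The kernel fills the buffers in the order they appear in the iovec array, so one system call replaces three read() calls.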
7 Filesystem concepts
Properties
. has root directory (/) and lost+found directory (most disk-based FS)
. each file and directory is identified uniquely by inode
. normally root inode (2), lost+found (3)
. self-contained : no dependencies between filesystems
. clean / dirty state
Disk, Partition, Volumes
Creating filesystem : mkfs
Repairing filesystem : fsck
journaling / log-structured FS
User / Group quotas
8 kernel concepts
VFS objects
on-disk layout
file descriptors / file table
inode cache / page cache
pathname resolution
opening file / reading file / closing files
9 Linux Filesystem Structure Overview
[Figure: Linux filesystem layers, top to bottom]
User space
. User application
(system call)
Kernel space
. Virtual FileSystem
. Ext3 / FAT / JFFS / Yaffs, and NFS / SMB over the network
. buffer / page cache
. io interface / FTL
. block device driver / Flash driver
storage device
10 VFS Object
Virtual Filesystem Object[2]
superblock object
. stores information concerning a mounted filesystem.
. corresponds to the filesystem control block stored on disk.
inode object
. stores general information about a specific file.
. corresponds to the file control block stored on disk.
file object
. stores information about the interaction between an open file and a process.
dentry object
. stores information about the linking of a directory entry with the corresponding file.
. recently used dentry objects are kept in the dentry cache.
11 VFS objects
Interaction between process and VFS objects[2]
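[Figure: Process 1, Process 2 and Process 3 each reach the file through fd → File object → (f_dentry) dentry object → (d_inode) Inode object → (i_sb) Superblock object. File objects can share a dentry object, and dentry objects are kept in the dentry cache. The Inode object is tied to the disk file through the page cache.]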
12 VFS Object File object associated with process
[Figure: per-process structures that lead to a file object]
struct task_struct
. fs → struct fs_struct : count / root / pwd / rootmnt / pwdmnt / altroot
. files → struct files_struct : fdt / fdtab (max_fds, fd, open_fds) / next_fd / fd_array[0] ... fd_array[N]
each fd_array[] slot → struct file : f_flags / f_mode / f_dentry / f_pos / f_count / f_op
f_dentry → dentry object
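Because f_pos lives in struct file and not in the descriptor table slot, two descriptors that refer to the same file object share one offset, while a second open() of the same path gets its own. The sketch below shows this from user space; offset_sharing_demo() is a hypothetical name and the demo assumes /tmp is writable.

```c
#define _XOPEN_SOURCE 700   /* mkstemp() */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Read 5 bytes through one descriptor, then report the current offset seen
 * through a dup()ed descriptor (same file object) and through a descriptor
 * from an independent open() (separate file object).
 * Returns 0 on success, -1 on error. */
int offset_sharing_demo(off_t *dup_off, off_t *open_off)
{
    char path[] = "/tmp/fileobjXXXXXX";
    char buf[5];
    int fd, fd_dup = -1, fd_open = -1, rc = -1;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    if (write(fd, "0123456789", 10) != 10 || lseek(fd, 0, SEEK_SET) != 0)
        goto out;

    fd_dup = dup(fd);                 /* new fd slot, same file object */
    fd_open = open(path, O_RDONLY);   /* new fd slot, new file object  */
    if (fd_dup < 0 || fd_open < 0)
        goto out;

    if (read(fd, buf, sizeof buf) != (ssize_t)sizeof buf)
        goto out;
    *dup_off = lseek(fd_dup, 0, SEEK_CUR);    /* moved together with fd */
    *open_off = lseek(fd_open, 0, SEEK_CUR);  /* unaffected by the read */
    rc = 0;
out:
    if (fd_dup >= 0)
        close(fd_dup);
    if (fd_open >= 0)
        close(fd_open);
    close(fd);
    unlink(path);
    return rc;
}
```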
13 VFS Object dentry
[Figure: dentry objects linked through dentry_hashtable[], the parent / child lists, the owning inode and the superblock]
struct dentry
. d_hash : chain in dentry_hashtable[]
. d_parent / d_subdirs / d_u.d_child : links to parent, children and siblings
. d_lru
. d_alias : linked from i_dentry of struct inode
. d_inode : struct inode
. d_name / d_op
. d_sb : struct super_block (superblock)
14 VFS Object inode
[Figure: inode objects linked through inode_hashtable[] and the inode_in_use, inode_unused and sb->s_dirty lists]
struct inode
. i_hash : chain in inode_hashtable[]
. i_list : inode_in_use / inode_unused / sb->s_dirty
. i_dentry : dentry
. i_sb : struct super_block
. i_op / i_fop
. i_mapping : page cache
. i_private : specific inode structure
on the volume : SB and dinode
15 VFS Object superblock & file_system_type
[Figure: super_block objects and file_system_type objects]
super_blocks list (sb_lock)
. struct super_block : s_list / s_instances / s_dirty / s_inodes / s_dirt / s_root / s_type / s_fs_info
. s_inodes, s_dirty : lists of inodes (i_hash / i_list / i_dentry / i_sb)
. s_fs_info : specific superblock structure, read from the SB on the volume
file_systems list (file_systems_lock) : ext3 → jffs → proc
. struct file_system_type : name / fs_flags / get_sb() / kill_sb() / next / fs_supers
16 VFS Object
Virtual Filesystem Operations
struct super_operations
. alloc_inode / destroy_inode / read_inode / write_inode / delete_inode
. statfs / put_super
struct inode_operations
. create / link / unlink / lookup / mkdir / rmdir / setattr / truncate
struct file_operations
. open / llseek / read / write / mmap / ioctl
struct address_space_operations
. writepage / readpage / direct_IO / releasepage
struct dentry_operations
. d_revalidate / d_hash / d_compare
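Each of these operation structures is a table of function pointers that a filesystem fills in and the VFS calls through. The user-space toy below illustrates only that dispatch pattern; every toy_* and ram_* name is invented for the example and none of it is the kernel's definition.

```c
#include <stddef.h>
#include <string.h>

struct toy_file;

/* Toy counterpart of an operations table: the filesystem supplies it. */
struct toy_file_operations {
    long (*read)(struct toy_file *f, char *buf, size_t len);
    long (*write)(struct toy_file *f, const char *buf, size_t len);
};

/* Toy open file: operations pointer, current position and an in-memory body. */
struct toy_file {
    const struct toy_file_operations *f_op;
    size_t f_pos;
    size_t size;
    char data[64];
};

static long ram_read(struct toy_file *f, char *buf, size_t len)
{
    size_t left = f->f_pos < f->size ? f->size - f->f_pos : 0;

    if (len > left)
        len = left;
    memcpy(buf, f->data + f->f_pos, len);
    f->f_pos += len;
    return (long)len;
}

static long ram_write(struct toy_file *f, const char *buf, size_t len)
{
    if (f->f_pos >= sizeof f->data)
        return 0;
    if (len > sizeof f->data - f->f_pos)
        len = sizeof f->data - f->f_pos;    /* clamp to the toy capacity */
    memcpy(f->data + f->f_pos, buf, len);
    f->f_pos += len;
    if (f->f_pos > f->size)
        f->size = f->f_pos;
    return (long)len;
}

/* A read-write table and a read-only table (write left NULL). */
const struct toy_file_operations ram_fops = { .read = ram_read, .write = ram_write };
const struct toy_file_operations ram_ro_fops = { .read = ram_read };

/* Toy "VFS" entry points: they only know the table, not the filesystem. */
long toy_vfs_read(struct toy_file *f, char *buf, size_t len)
{
    if (!f->f_op || !f->f_op->read)
        return -1;
    return f->f_op->read(f, buf, len);
}

long toy_vfs_write(struct toy_file *f, const char *buf, size_t len)
{
    if (!f->f_op || !f->f_op->write)
        return -1;                          /* operation not supported */
    return f->f_op->write(f, buf, len);
}
```

Leaving a pointer NULL is how the toy read-only table declines an operation; the generic layer checks for it before calling.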
17 VFS Object Main Structure for file access
[Figure: structures followed during file access and the operation table attached to each]
super_blocks → struct super_block (s_list, s_op)
struct task_struct → files → struct files_struct → fd → struct file (f_flags, f_mode, f_dentry, f_pos, f_reada, f_op)
struct file → f_dentry → struct dentry (d_sb, d_inode, d_op) → struct inode (i_sb, i_op, i_mapping) → struct address_space (a_ops)

s_op (super_block)
. alloc_inode / destroy_inode
i_op (inode)
. create / lookup / link / unlink / symlink / mkdir / rmdir / mknod / rename
. readlink / follow_link / truncate / truncate_range / setattr
f_op (file)
. llseek / read / write / aio_read / aio_write / readdir / poll / ioctl / mmap
. open / release / flush / fsync / fasync / lock / flock
a_ops (address_space)
. writepage / readpage / sync_page / writepages / readpages / set_page_dirty
. prepare_write / commit_write / bmap / direct_IO / get_xip_page / invalidatepage / releasepage
d_op (dentry)
. d_revalidate / d_hash / d_delete / d_release / d_compare
18 Reading file example
fd = open(filename, flag)
read(fd, buf, 512);

[Figure: path of the read() from user mode into kernel mode]
. user mode : buf
. kernel mode : struct task_struct → struct file → struct dentry (pathname lookup) → struct inode → struct address_space → page cache lookup → find block → I/O
. on disk : block 0 superblock / inodes (inode for filename) / data (0th block)

kernel version   cache holding the 0th block   I/O
kernel 2.6       page cache (radix tree)       using bio
kernel 2.4       page cache (hashtable)        using bh
kernel 2.2       buffer cache                  using bh
19 Developing filesystem
design filesystem
module init / exit
mount / umount
directory lookup / pathname resolution
inode manipulation
. alloc / write / delete
file creating / link management
create / removing directories
filesystem status
20 Filesystem Analysis
Basic Categories
Filesystem / Filename / Metadata / Contents / Application [10]

[Figure: the five categories on a journaling filesystem]
. Filesystem Category : layout & size information (superblock, resource group)
. File Name Category : file1.txt, file2.txt (directory structure (inode))
. Metadata Category : times & address (file structure (inode))
. Content Category : content data #1, content data #2
. Application Category : quota data, journal (journaling, allocation / deallocation, recovery)
21 UXFS Layout
[Figure: on-disk layout]
block 0        : Super Block
block 8 ~ 39   : Inodes (for each inode)
block 40       : root
block 41       : lost+found
following      : Data Blocks

    #define UX_DIRECT_BLOCKS 16
    #define UX_MAXFILES      32
    #define UX_MAXBLOCKS     470

    struct ux_superblock {
        __u32 s_magic;
        __u32 s_mod;
        __u32 s_nifree;
        __u32 s_inode[UX_MAXFILES];
        __u32 s_nbfree;
        __u32 s_block[UX_MAXBLOCKS];
    };

    struct ux_inode {
        __u32 i_mode;
        __u32 i_nlink;
        __u32 i_atime;
        __u32 i_mtime;
        __u32 i_ctime;
        __u32 i_uid;
        __u32 i_gid;
        __u32 i_size;
        __u32 i_blocks;
        __u32 i_addr[UX_DIRECT_BLOCKS];
    };

    struct ux_dirent {
        __u32 d_ino;
        char  d_name[28];
    };

root directory entries
. d_ino = 2, d_name = "."
. d_ino = 2, d_name = ".."
. d_ino = 3, d_name = "lost+found"
. d_ino = 4, d_name = "fred"
. d_ino = 0, d_name = ""
22 UXFS Layout
Design Detail
supports only 512-byte block (UX_BSIZE)
fixed number of blocks (UX_MAXBLOCKS)
. 470 blocks
superblock is stored in block 0
there are only 32 inodes (UX_MAXFILES)
. in practice 28 are usable (excluding 0, 1, root inode : 2 and lost+found : 3)
. has 9 direct pointers → limits the file size to (9 * 512)
first data block is the 42nd
. 40th block stores root directory entries
. 41st block stores lost+found directory entries
directory entries are fixed in size (32 bytes)
. max filename size is 28 bytes
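The numbers in the design detail follow from a few constants, which the sketch below recomputes. It is an illustration with assumptions: UX_INODE_BLOCK and UX_NAMELEN are names chosen here, the "one inode per block starting at block 8" mapping is inferred from the "block 8 ~ 39" layout, and the number of direct pointers is passed in as a parameter instead of being hard-coded.

```c
#include <stddef.h>
#include <stdint.h>

#define UX_BSIZE        512   /* block size from the design detail        */
#define UX_MAXFILES     32    /* number of inodes                         */
#define UX_NAMELEN      28    /* assumed name for the d_name[] length     */
#define UX_INODE_BLOCK  8     /* assumed name: first block of inode area  */

/* Directory entry as shown on the layout slide. */
struct ux_dirent {
    uint32_t d_ino;
    char     d_name[UX_NAMELEN];
};

/* Fixed-size entries that fit in one directory block. */
size_t ux_dirents_per_block(void)
{
    return UX_BSIZE / sizeof(struct ux_dirent);
}

/* Block that holds inode `ino`, assuming one inode per block from block 8. */
uint32_t ux_inode_block(uint32_t ino)
{
    return UX_INODE_BLOCK + ino;
}

/* Largest file reachable through `direct_blocks` direct pointers. */
size_t ux_max_file_size(unsigned direct_blocks)
{
    return (size_t)direct_blocks * UX_BSIZE;
}

/* Inodes left for ordinary files: 0, 1, root (2) and lost+found (3) are taken. */
unsigned ux_usable_inodes(void)
{
    return UX_MAXFILES - 4;
}
```

With these assumptions a directory block holds 16 entries, the root inode (2) sits in block 10, and ux_max_file_size(9) gives the (9 * 512) limit quoted above.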
23 Filesystem Registration
register_filesystem()

struct file_system_type
. name / fs_flags / get_sb() / kill_sb() / next / fs_supers / s_lock_key / s_umount_key

    static struct file_system_type uxfs_fs_type = {
        .owner    = THIS_MODULE,
        .name     = "uxfs",
        .get_sb   = ux_get_sb,
        .kill_sb  = kill_block_super,
        .fs_flags = FS_REQUIRES_DEV,
    };
    ...
    register_filesystem(&uxfs_fs_type);

[Figure: file_systems (file_systems_lock) heads a list of struct file_system_type linked through next, e.g. ext3 → jffs → proc; fs_supers of each type heads the list of its struct super_block, linked through s_instances]
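The list in the figure can be modelled in user space as a singly linked list keyed by name. This toy is only loosely modelled on the kernel behaviour (a duplicate name is refused with -EBUSY); all toy_* names are invented for the illustration and there is no locking.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Toy counterpart of struct file_system_type: just a name and a link. */
struct toy_fs_type {
    const char *name;
    struct toy_fs_type *next;
};

static struct toy_fs_type *toy_file_systems;    /* head of the list */

/* Return the link that points at the entry called `name`,
 * or the terminating NULL link if there is no such entry. */
static struct toy_fs_type **toy_find(const char *name)
{
    struct toy_fs_type **p;

    for (p = &toy_file_systems; *p; p = &(*p)->next)
        if (strcmp((*p)->name, name) == 0)
            break;
    return p;
}

/* Append `fs` to the list; refuse a name that is already registered. */
int toy_register_filesystem(struct toy_fs_type *fs)
{
    struct toy_fs_type **p = toy_find(fs->name);

    if (*p)
        return -EBUSY;
    fs->next = NULL;
    *p = fs;
    return 0;
}

/* Unlink `fs` from the list; fail if it was never registered. */
int toy_unregister_filesystem(struct toy_fs_type *fs)
{
    struct toy_fs_type **p;

    for (p = &toy_file_systems; *p; p = &(*p)->next) {
        if (*p == fs) {
            *p = fs->next;
            fs->next = NULL;
            return 0;
        }
    }
    return -EINVAL;
}

/* What a mount by type name would do first: look the type up by name. */
struct toy_fs_type *toy_get_fs_type(const char *name)
{
    return *toy_find(name);
}
```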
24 Filesystem mount
# mount -t uxfs /dev/sdd1 /mnt/testdir

[Figure: mount sequence, user mode → kernel mode]
pathname lookup : /mnt/testdir
. get dentry about mount point
. find dentry using hash
. find directory entry on disk, lookup : ux_lookup()
  ino = ux_find_entry()
  inode = iget()
  d_add(dentry, inode)
find filesystem type
pathname lookup : /dev/sdd1
get superblock
. type->get_sb() → ux_get_sb() → sget()
fill superblock & get root inode
. ux_superblock is read from the SB on the volume
. root inode and root dentry (s_root) are set up
25 File Creation
# touch /mnt/testdir/sample

[Figure: file creation sequence]
pathname lookup : path_lookup_create()
. /mnt/testdir : get dentry about parent (parent dentry)
find directory entry for *sample*
. ino = ux_find_entry()
create inode : ux_create()
. allocate new dentry for *sample*
. allocate new vfs inode : inode = new_inode()
. allocate new inode number : ino = ux_ialloc() (read superblock & disk inode for allocating new inode num)
. add parent directory entry : ux_diradd()
. fill new inode
link new dentry & inode
. d_instantiate()

parent directory entries in the new buffer
. d_ino = 20, d_name = "."
. d_ino = 15, d_name = ".."
. d_ino = XX, d_name = "sample"
on the volume : SB / inode / data block
26 Other Filesystem
FAT Filesystem
Ext3
JFS
Flash Filesystem
Advanced Filesystem
. Ext4 / Btrfs
. POHMELFS
. AXFS / LogFS / UBIFS
27 FAT layout[3]
[Figure: FAT volume layout]
Reserved Area (Boot Record, FileSystem Information (FSInfo)) | FAT #1 | FAT #2 | Root Directory | Cluster ... Cluster
. FAT entries link the clusters of a file into a chain
. FSInfo : cluster count, next free cluster
. cluster count field : 4byte
. filesize field : 4byte
. total sector32 : 4byte

# when cluster size is 4K
- max FAT volume size (cluster count uses 28 bits * cluster size)
  2^(32-4) * 2^12 = 2^40 = 1T
- max size actually possible (sector size : 512byte)
  2^32 * 2^9 = 2^41 = 2T
- max file size
  2^32 = 4G
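The three limits above are plain powers of two and can be recomputed directly. A small sketch follows; the function names are made up for the example, and the shift arguments correspond to the 4K cluster (2^12) and 512-byte sector (2^9) used on the slide.

```c
#include <stdint.h>

/* Volume size when 28 bits are available for the cluster number:
 * 2^28 clusters * 2^cluster_shift bytes per cluster. */
uint64_t fat32_max_volume_by_clusters(unsigned cluster_shift)
{
    return (UINT64_C(1) << 28) << cluster_shift;
}

/* Volume size allowed by a 4-byte total sector count:
 * 2^32 sectors * 2^sector_shift bytes per sector. */
uint64_t fat32_max_volume_by_sectors(unsigned sector_shift)
{
    return (UINT64_C(1) << 32) << sector_shift;
}

/* File size limit implied by a 4-byte filesize field, rounded up to 2^32
 * as on the slide (the largest storable value is one less). */
uint64_t fat32_max_file_size(void)
{
    return UINT64_C(1) << 32;
}
```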
28 FAT directory
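[Figure: Reserved Area | FAT #1 | FAT #2 | Root Directory | clusters 2 3 5 6 7 8 9 ...; the FAT entries chain each start cluster to the following clusters of the file]

Root Directory entries
name     type        start cluster
b.txt    file        0x06
var      directory   0x07

entries of directory var (cluster 0x07)
name     type        start cluster
messag   file        0x09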
29 Ext3 layout[3]
[Figure: ext3 volume layout]
Boot Sector | Block Group 0 | Block Group 1 | Block Group 2 | ... | Block Group N

each Block Group
. Super Block
. Group Descriptor : n blocks
. Block Bitmap : 1 block
. Inode Bitmap : 1 block
. Inode Table : n blocks
. Block ... Block (data)

# when block size is 4K
- number of blocks one block bitmap can describe
  2^12 (4K) * 2^3 (8bit) = 2^15
- max filesystem size (blocks per group * number of groups * block size)
  2^15 * 2^16 (number of groups) * 2^12 (4K) = 2^43 = 8T
30 Ext3 inode
[Figure: ext3 inode, 128 bytes]
. Inode Header : offset 0 ~ 40
. 12 direct pointers → block
. indirect pointer → block of 1024 pointers → block
. double indirect pointer → indirect pointer blocks → block
. triple indirect pointer → double indirect pointer blocks → indirect pointer blocks → block

# when block size is 4K
- max file size
  (12 * 2^12) + (2^10 * 2^12) + (2^10 * 2^10 * 2^12) + (2^10 * 2^10 * 2^10 * 2^12) ≒ 2^42 ≒ 4T
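The sum above can be evaluated exactly to see how close it is to 2^42. The sketch below assumes 4-byte block pointers (so a 4K block holds 1024 of them) and counts only what the pointer tree can address; ext3_max_file_bytes() is a name invented for this example.

```c
#include <stdint.h>

/* Bytes addressable through 12 direct pointers plus one single, one double
 * and one triple indirect pointer, for a block size of 2^block_shift bytes. */
uint64_t ext3_max_file_bytes(unsigned block_shift)
{
    uint64_t block = UINT64_C(1) << block_shift;
    uint64_t ptrs = block / 4;              /* pointers per indirect block */
    uint64_t blocks = 12                    /* direct          */
                    + ptrs                  /* single indirect */
                    + ptrs * ptrs           /* double indirect */
                    + ptrs * ptrs * ptrs;   /* triple indirect */

    return blocks * block;
}
```

For block_shift = 12 the triple indirect term alone is 2^30 blocks, or 2^42 bytes; the other three terms add only about 4 GiB on top, which is why the slide rounds the total to 4T.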
31 Ext3 directory
ext3_dir_entry_2
. uint32 inode number
. uint16 rec_len
. uint8 name_len
. uint8 file_type
. char name[255]

[Figure: the directory inode (Inode Header, direct / indirect / double indirect / triple indirect pointers) points to blocks that hold the entries below]
inode   rec_len   name_len   file_type   name
11      16        8          REG_FILE    "."
10      16        8          REG_FILE    ".."
25      32        8          REG_FILE    "doc"
131     16        16         REG_FILE    "message.1"
205     16        8          REG_FILE    "tmp"
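Since every entry records its own rec_len, a directory block is searched by hopping from entry to entry. The sketch below is a simplified user-space model: the toy_* names are invented, fields are stored in host byte order (on disk they are little-endian), and each rec_len is just the 8-byte header plus the name rounded up to 4 bytes, whereas a real last entry would stretch to the end of the block.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fixed 8-byte header that precedes each name (toy version). */
struct toy_dirent_hdr {
    uint32_t inode;
    uint16_t rec_len;
    uint8_t  name_len;
    uint8_t  file_type;
};

/* Store an entry for `name` at offset `off` of `block`.
 * Returns the offset at which the next entry would start. */
size_t toy_dir_add(uint8_t *block, size_t off, uint32_t ino, const char *name)
{
    struct toy_dirent_hdr h;
    size_t len = strlen(name);

    h.inode = ino;
    h.rec_len = (uint16_t)((sizeof h + len + 3) & ~(size_t)3);
    h.name_len = (uint8_t)len;
    h.file_type = 1;                        /* 1 = regular file */
    memcpy(block + off, &h, sizeof h);
    memcpy(block + off + sizeof h, name, len);
    return off + h.rec_len;
}

/* Walk the first `used` bytes of `block` entry by entry.
 * Returns the inode number stored for `name`, or 0 if it is not found. */
uint32_t toy_dir_lookup(const uint8_t *block, size_t used, const char *name)
{
    size_t off = 0, len = strlen(name);

    while (off + sizeof(struct toy_dirent_hdr) <= used) {
        struct toy_dirent_hdr h;

        memcpy(&h, block + off, sizeof h);
        if (h.rec_len < sizeof h)
            break;                          /* corrupt entry: stop walking */
        if (h.inode != 0 && h.name_len == len &&
            off + sizeof h + len <= used &&
            memcmp(block + off + sizeof h, name, len) == 0)
            return h.inode;
        off += h.rec_len;                   /* hop to the next entry */
    }
    return 0;
}
```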
32 JFS Layout[4]
33 JFS inode structure
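[Figure: JFS inode]
. common inode field
. imap-related field
. (unused)
. union field : B+ tree header (H) and xtree entry array

[Figure: B+ tree rooted in the inode; root entries H 0 100 200 300, next level H 0 50 / H 100 150 / H 200 250 / H 300 350 400, leaf level 0 25 50 75 100 125 150 175 200 225 250 275 300 330 350 375 400 425]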
34 Flash Filesystem
Flash memory features
bad block management
. initial bad block / run-time bad block
out of place updates
. in-place updates are not possible; a write is only possible after an erase
. wear-leveling is required
lifetime of erase blocks
. the number of erases is limited for each block (wear-out limit)
large erase block
. garbage collection is required

[Figure: an erase block is made of pages in clean / dirty / valid state; the figure marks the erase block that can be erased]
35 Problems of current Flash Filesystem
Problem
. Slow mount time
. Heavy memory consumption : many filesystem structures are organized in memory
Reason
. at mount time the disk must be scanned and the related structures built in memory

[Figure: the scan builds the index tree and file tree in memory, next to the vfs caches (dentry cache, inode cache, page cache)]
36 Flash Filesystem Features
Log-structured FS features
. write : append
. robust to power failure
. garbage collection
. read : scan

[Figure: log contents over time as writes are appended: a b c → a b c a` c` d → a b c a` c` d a`` → a b c a` c` d a`` d` → a b c a` c` d a`` d` e; garbage collection drops the superseded versions, and a scan of the log finds the latest version of each item]
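The append / scan / garbage collection cycle can be modelled with an array used as a log. This is a toy sketch, not how any particular flash filesystem stores data: items are single letters 'a' to 'z', the log lives in memory, and all names are invented for the example.

```c
#include <stddef.h>

#define LOG_CAP 32

/* Toy log: record i is (key[i], value[i]); records are only ever appended. */
struct toy_log {
    char   key[LOG_CAP];
    int    value[LOG_CAP];
    size_t len;
};

/* Append a new version of item `key`; older versions stay in the log.
 * Returns 0 on success, -1 if the log is full or the key is out of range. */
int log_append(struct toy_log *log, char key, int value)
{
    if (log->len == LOG_CAP || key < 'a' || key > 'z')
        return -1;
    log->key[log->len] = key;
    log->value[log->len] = value;
    log->len++;
    return 0;
}

/* "Mount": scan the whole log and remember, for every item, the position of
 * its newest record. index[k] is -1 if item 'a' + k never appears. */
void log_scan(const struct toy_log *log, int index[26])
{
    for (int k = 0; k < 26; k++)
        index[k] = -1;
    for (size_t i = 0; i < log->len; i++)
        index[log->key[i] - 'a'] = (int)i;   /* later records win */
}

/* Number of obsolete records that garbage collection could reclaim. */
size_t log_garbage(const struct toy_log *log, const int index[26])
{
    size_t live = 0;

    for (int k = 0; k < 26; k++)
        if (index[k] >= 0)
            live++;
    return log->len - live;
}
```

The scan has to visit every record before the newest version of any item is known, which is the cost the figure attaches to reading and mounting.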
37 Advanced Flash Filesystem
Requirements
. low memory usage
. fast mount time
. wear leveling
. robust to failure

Types of advanced flash filesystem
approach : using tree structure
. does not need to scan the whole device when the fs mounts
. wandering tree problem
logfs
ubifs (has replaced the jffs3 project)
38 Flash Filesystem
Wandering tree problem[6]
[Figure: file tree inode → nth ind → 2nd ind → 1st ind → data. When one data block is updated out of place, the old data block becomes invalid and a new one is written; the 1st ind that pointed to it must then be rewritten as well (blocks are marked valid / invalid / new).]
39 Flash Filesystem
Wandering tree problem(cont)
[Figure: the update keeps moving up the file tree: after the new data block and the new 1st ind, new copies of the 2nd ind, the nth ind and finally the inode are written, so the old inode and the new inode each head their own path down to the data.]
40 Advanced Flash Filesystem(UBIFS)
Volume managing layer
. bad block management
. wear-leveling

UBI layer[7]
. mapping logical eraseblocks (LEB) to physical eraseblocks (PEB)
. separates the flash properties from the filesystem
. supports write-back

[Figure: layer stack]
UBIFS layer : read/write
UBI layer   : LEB0 LEB1 LEB2 LEB3 LEB4
MTD layer   : PEB0 PEB1 PEB2 PEB3 PEB4
Flash       : flash memory
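The LEB to PEB mapping can be illustrated with a small table. The toy below is not UBI's algorithm: it simply remaps a logical eraseblock, on every rewrite, to the free physical eraseblock with the fewest erases, then erases and frees the old one. All toy_* names and the 5-block size are assumptions for the example.

```c
#define TOY_NPEB 5   /* five eraseblocks, as in the figure */

/* Toy volume: LEB -> PEB map plus per-PEB state. */
struct toy_ubi {
    int      leb_to_peb[TOY_NPEB];   /* -1 = LEB not mapped yet       */
    int      peb_in_use[TOY_NPEB];   /* 1 = PEB currently holds a LEB */
    unsigned erase_count[TOY_NPEB];  /* wear of each PEB              */
};

void toy_ubi_init(struct toy_ubi *u)
{
    for (int i = 0; i < TOY_NPEB; i++) {
        u->leb_to_peb[i] = -1;
        u->peb_in_use[i] = 0;
        u->erase_count[i] = 0;
    }
}

/* Rewrite the contents of `leb` out of place.
 * Returns the PEB now backing the LEB, or -1 if none is free. */
int toy_ubi_rewrite(struct toy_ubi *u, int leb)
{
    int best = -1, old;

    if (leb < 0 || leb >= TOY_NPEB)
        return -1;
    for (int p = 0; p < TOY_NPEB; p++)      /* least-worn free PEB */
        if (!u->peb_in_use[p] &&
            (best < 0 || u->erase_count[p] < u->erase_count[best]))
            best = p;
    if (best < 0)
        return -1;

    old = u->leb_to_peb[leb];
    u->peb_in_use[best] = 1;
    u->leb_to_peb[leb] = best;
    if (old >= 0) {                         /* old copy is now obsolete */
        u->erase_count[old]++;
        u->peb_in_use[old] = 0;
    }
    return best;
}
```

The layer above always addresses the same LEB number, while the physical location and the wear are spread across the PEBs underneath.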
41 Advanced Filesystem
Ext4
. introduces extent allocation to ext3
. adds design for handling large files despite the limits of the ext3 structure
. 16 TiB a file (4K blocks) / 1 EiB a volume / 255 bytes a filename
Btrfs[9]
. a filesystem being developed by Oracle with the goal of replacing ext3
. supports on-line fsck, writable snapshot, sub-volume
. can be upgraded from ext3fs to btrfs
. 16 EiB a file / 16 EiB a volume / 255 bytes a filename
crfs (coherent remote filesystem)
. based on btrfs : Oracle
pohmelfs (parallel optimized host message exchange layered filesystem)
. handles as much as possible locally, with minimal server interaction
. 10x faster than synchronous nfs, 3x faster than asynchronous nfs
42 Advanced Flash Filesystem(AXFS)
Advanced XIP Filesystem
read-only flash filesystem
current common rootfs option
. XIP-modified cramfs : saves RAM, but requires extra flash memory
. squashfs : saves flash memory, but requires extra RAM
allows each page to be XIP or not
the kernel profile (/proc/axfs/volume0) is used as input to the image builder
. makes it possible to build a small and fast image
supports kernel 2.6.26 and later