Developing a Simple Filesystem

I. File Concepts
II. File IO / Standard IO Library
III. Filesystem-based Concepts
IV. Kernel Concepts
V. Developing Filesystem
VI. Other Filesystem

2008. 10. 31 LG Electronics Software Center 심재훈

1 File concepts

- File Type
  - Regular / Directories / Symbolic Links / Hard Links / Named Pipes
  - Special Files
- File Descriptors
  - handle through which the file can be subsequently accessed
- Basic File Properties
  - file type / access permission
  - link count
  - owner and group of file
  - size / date / name

2 File concepts

- Basic File Properties (cont'd)

3 File concepts

# VFS Inode structure

4 File concepts

[Figure: directory entries and inodes under KELP_seminar/ (inode 3333, nlink 3)]

KELP_seminar/ directory entries:
  {".", 3333}
  {"..", 2}
  {"directory", 12668259}
  {"file", 8463735}
  {"hard_link", 8463735}
  {"symbolic_link", 8463736}
  {"symbolic_link2", 8463740}

name              inode number   nlink   contents
directory/        12668259       2       {".", 12668259}, {"..", 3333}
file, hard_link   8463735        2       regular file (two names, one inode)
symbolic_link     8463736        1       "file"
symbolic_link2    8463740        1       "./file"

5 File IO

- basic file operations
  - open() / creat() / close() / lseek() / read() / write()
    - truncate() / unlink()
- seeking and IO combined
    - pread() / pwrite()
    - read-write loop run 1 million times: 35 sec → 25 sec
- vectored IO
    - readv() / writev()
    - use when the data from a single read needs to be placed in different areas of memory
- asynchronous IO
    - aio_read() / aio_write() / aio_return() / aio_error() / aio_cancel()
- file and record locking
    - fcntl() : F_GETLK / F_SETLK / F_SETLKW
- memory mapped files, direct IO
- sparse file
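
As a minimal sketch of "seeking and IO combined", the function below writes and reads at an explicit offset with pwrite() and pread() and checks that the file offset never moves. The function name, the scratch path passed by the caller, and the error convention are assumptions made for illustration.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write msg at offset off, read it back into out
 * (which must hold strlen(msg) + 1 bytes) without calling lseek() to
 * position the file. Returns 0 on success, -1 on error or short transfer. */
int positional_roundtrip(const char *path, off_t off,
                         const char *msg, char *out)
{
    assert(path != NULL && msg != NULL && out != NULL);

    size_t len = strlen(msg);
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    int rc = -1;
    if (pwrite(fd, msg, len, off) == (ssize_t)len &&
        pread(fd, out, len, off) == (ssize_t)len &&
        lseek(fd, 0, SEEK_CUR) == 0) {  /* offset is still at the start */
        out[len] = '\0';
        rc = 0;
    }

    close(fd);
    unlink(path);   /* scratch file, remove it again */
    return rc;
}
```

Each pread()/pwrite() replaces an lseek() plus read()/write() pair, so a loop issues one system call per transfer instead of two.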

6 File IO

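## Vectored IO ##

    struct iovec uiop[3] = { {addr1, 512}, {addr2, 512}, {addr3, 1024} };
    readv(fd, uiop, 3);

[Figure: a single readv() scatters consecutive file data into three buffers (addr1, addr2, addr3) in the user address space.]

## Sparse File ##

[Figure: file offset axis up to the file size, with a "not allocated area" (hole) inside the file.]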
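
The vectored read pictured on this slide can be sketched as follows. The buffer sizes (512, 512, 1024) come from the slide; the function name, the scratch file, and the test pattern written into it are assumptions for illustration.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: fill a scratch file with 2048 bytes (byte i holds
 * i % 251), then scatter it into three buffers with ONE readv() call.
 * a1 and a2 must hold 512 bytes, a3 1024 bytes.
 * Returns the number of bytes read, or -1 on error. */
ssize_t scatter_read(const char *path, char *a1, char *a2, char *a3)
{
    assert(path != NULL && a1 != NULL && a2 != NULL && a3 != NULL);

    unsigned char data[2048];
    for (size_t i = 0; i < sizeof data; i++)
        data[i] = (unsigned char)(i % 251);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    ssize_t n = -1;
    if (write(fd, data, sizeof data) == (ssize_t)sizeof data &&
        lseek(fd, 0, SEEK_SET) == 0) {
        struct iovec uiop[3] = {
            { .iov_base = a1, .iov_len = 512 },
            { .iov_base = a2, .iov_len = 512 },
            { .iov_base = a3, .iov_len = 1024 },
        };
        n = readv(fd, uiop, 3);   /* buffers are filled in array order */
    }

    close(fd);
    unlink(path);
    return n;
}
```

File bytes 0..511 land in the first buffer, 512..1023 in the second and 1024..2047 in the third.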

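## Memory Mapping IO ##

    addr = mmap(NULL, MAPSZ, PROT_READ, MAP_SHARED, fd, 0);
    memcpy(buf, addr, size);    /* a PROT_READ mapping can only be read */

[Figure: a region of the user address space is mapped onto the page cache pages that back the file at the given file offset.]

## Direct IO ##

    fd = open(filename, O_WRONLY | O_DIRECT, mode);
    write(fd, buf, size);

[Figure: the user buffer is transferred directly to the file offset on disk, bypassing the page cache.]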
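
A minimal sketch of memory-mapped reading, assuming a small scratch file created by the helper itself. The function name and the path are hypothetical; mmap(), munmap() and the flags are the standard POSIX ones.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: store msg in a scratch file, map the file read-only
 * and copy the mapped bytes into out (strlen(msg) + 1 bytes).
 * Returns 0 on success, -1 on error. */
int mmap_read_back(const char *path, const char *msg, char *out)
{
    assert(path != NULL && msg != NULL && out != NULL);

    size_t len = strlen(msg);
    assert(len > 0);                 /* mmap() rejects a zero length */

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    int rc = -1;
    if (write(fd, msg, len) == (ssize_t)len) {
        void *addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (addr != MAP_FAILED) {
            memcpy(out, addr, len);  /* plain memory access, no read() */
            out[len] = '\0';
            munmap(addr, len);
            rc = 0;
        }
    }

    close(fd);
    unlink(path);
    return rc;
}
```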

7 Filesystem concepts

- Properties
  - has root directory (/) and lost+found directory (most disk-based FS)
  - each file and directory is identified uniquely by inode
    - normally root inode (2), lost+found (3)
  - self-contained
    - no dependencies between filesystems
  - clean / dirty state
- Disk, Partition, Volumes
- Creating filesystem : mkfs
- Repairing filesystem : fsck
- journaling / log-structured FS
- User / Group quotas

8 kernel concepts

- VFS objects
- on-disk layout
- file descriptors / file table
- inode cache / page cache
- pathname resolution
- opening file / reading file / closing files

9 Linux Filesystem Structure Overview

[Figure: Linux filesystem stack]

User space:    user application
                   | system call
Kernel space:  Virtual FileSystem
               filesystems: NFS / SMB (network), FAT, JFFS, Yaffs
               buffer / page cache
               io interface / FTL
               block device driver / Flash driver
Hardware:      storage device

10 VFS Object

- Virtual Filesystem Object[2]
  - superblock object
    - stores information concerning a mounted filesystem
    - corresponds to the filesystem control block stored on disk
  - inode object
    - stores general information about a specific file
    - corresponds to the file control block stored on disk
  - file object
    - stores information about the interaction between an open file and a process
  - dentry object
    - stores information about the linking of a directory entry with the corresponding file
    - recently used dentries are kept in the dentry cache

11 VFS objects

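- Interaction between process and VFS objects[2]

[Figure: Processes 1, 2 and 3 each hold an fd that refers to a file object. Each file object points to a dentry object through f_dentry; dentry objects live in the dentry cache. The dentry object points to the inode object through d_inode, and the inode object points to the superblock object through i_sb. File data sits in the page cache, backed by the disk file.]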

12 VFS Object File object associated with process

[Figure: file objects associated with a process]

struct task_struct:   fs → struct fs_struct, files → struct files_struct
struct fs_struct:     count, root, pwd, rootmnt, pwdmnt, altroot
struct files_struct:  fdt, fdtab (max_fds, fd, open_fds), next_fd, fd_array[0] ... fd_array[N]
struct file:          f_flags, f_mode, f_dentry → dentry object, f_pos, f_count, f_op

Each fd_array[] slot points to a struct file.

13 VFS Object dentry

[Figure: struct dentry linkage]

struct dentry fields shown: d_hash (chains in dentry_hashtable[]), d_parent, d_subdirs, d_u.d_child, d_lru, d_alias (linked from the inode's i_dentry), d_inode → struct inode, d_name, d_op, d_sb → struct super_block

14 VFS Object inode

[Figure: struct inode linkage]

struct inode fields shown: i_hash (chains in inode_hashtable[]), i_list (inode_in_use / inode_unused lists, or sb->s_dirty), i_dentry → dentry, i_sb → struct super_block, i_op, i_fop, i_mapping → page cache, i_private → filesystem-specific inode structure

On the volume: SB (superblock) and dinode (on-disk inode).

15 VFS Object superblock & file_system_type

[Figure: superblock and file_system_type lists]

super_blocks (protected by sb_lock) links every struct super_block through s_list.
struct super_block fields shown: s_list, s_instance, s_dirty, s_inodes (inodes linked through i_list), s_dirt, s_root, s_type, s_fs_info → filesystem-specific superblock structure (read from the SB on the volume)

file_systems (protected by file_systems_lock) links every struct file_system_type (e.g. ext3, proc) through next.
struct file_system_type fields shown: name, fs_supers, get_sb(), kill_sb(), next, fs_flags

16 VFS Object

- Virtual Filesystem Operations
  - struct super_operations
    - alloc_inode / destroy_inode / read_inode / write_inode / delete_inode
    - statfs / put_super
  - struct inode_operations
    - create / link / unlink / lookup / mkdir / rmdir / setattr / truncate
  - struct file_operations
    - open / llseek / read / write / mmap / ioctl
  - struct address_space_operations
    - writepage / readpage / direct_IO / releasepage
  - struct dentry_operations
    - d_revalidate / d_hash / d_compare
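
The operation tables listed above are structs of function pointers that the generic layer calls through. The sketch below is a user-space model of that dispatch, not kernel code: every demo_* and ram_* name is hypothetical, and the 64-byte in-memory "file" merely stands in for a real filesystem.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct demo_file;

/* Per-filesystem method table, modelled on struct file_operations. */
struct demo_file_operations {
    long (*read)(struct demo_file *f, char *buf, size_t len);
    long (*write)(struct demo_file *f, const char *buf, size_t len);
};

struct demo_file {
    const struct demo_file_operations *f_op; /* chosen at open time */
    size_t f_pos;                            /* current file offset */
    size_t size;                             /* bytes stored in data[] */
    char data[64];                           /* stand-in for file contents */
};

static long ram_read(struct demo_file *f, char *buf, size_t len)
{
    if (f->f_pos >= f->size)
        return 0;                            /* end of file */
    if (len > f->size - f->f_pos)
        len = f->size - f->f_pos;
    memcpy(buf, f->data + f->f_pos, len);
    f->f_pos += len;
    return (long)len;
}

static long ram_write(struct demo_file *f, const char *buf, size_t len)
{
    if (f->f_pos > sizeof f->data)
        return -1;
    if (len > sizeof f->data - f->f_pos)
        len = sizeof f->data - f->f_pos;     /* short write when full */
    memcpy(f->data + f->f_pos, buf, len);
    f->f_pos += len;
    if (f->f_pos > f->size)
        f->size = f->f_pos;
    return (long)len;
}

static const struct demo_file_operations ram_fops = {
    .read  = ram_read,
    .write = ram_write,
};

/* "open": bind the file object to one filesystem's method table. */
void demo_open_ram(struct demo_file *f)
{
    assert(f != NULL);
    memset(f, 0, sizeof *f);
    f->f_op = &ram_fops;
}

/* Generic layer: knows nothing about the filesystem behind f_op. */
long demo_read(struct demo_file *f, char *buf, size_t len)
{
    if (f->f_op == NULL || f->f_op->read == NULL)
        return -1;
    return f->f_op->read(f, buf, len);
}

long demo_write(struct demo_file *f, const char *buf, size_t len)
{
    if (f->f_op == NULL || f->f_op->write == NULL)
        return -1;
    return f->f_op->write(f, buf, len);
}
```

A filesystem that leaves a method unset simply does not support that operation, which is why the generic functions check the pointer before calling it.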

17 VFS Object Main Structure for file access

[Figure: main structures for file access and their operation tables]

struct task_struct → files (struct files_struct) → fd → struct file
struct file:           f_flags, f_mode, f_dentry, f_pos, f_reada, f_op
struct dentry:         d_sb, d_inode, d_op
struct inode:          i_sb, i_op, i_mapping
struct address_space:  a_ops
struct super_block:    s_list (on super_blocks), s_op

s_op:   alloc_inode / destroy_inode
f_op:   llseek, read / write, aio_read / aio_write, readdir, poll, ioctl, mmap, open / release, flush / fsync / fasync, lock / flock
i_op:   create, lookup, link / unlink, symlink, mkdir / rmdir, mknod, rename, readlink, follow_link, truncate, truncate_range, setattr
a_ops:  writepage / readpage, sync_page, writepages / readpages, set_page_dirty, prepare_write, commit_write, bmap, direct_IO, get_xip_page, invalidatepage, releasepage
d_op:   d_revalidate, d_hash, d_delete, d_release, d_compare

18 Reading file example

    fd = open(filename, flag);
    read(fd, buf, 512);

[Figure: read path in kernel 2.6, 2.4 and 2.2. User mode: buf. Kernel mode: struct task_struct → struct file → struct dentry (pathname lookup) → struct inode → struct address_space → find the block (0th block) in the cache, otherwise do disk I/O.]

kernel   cache used for the 0th block   disk I/O
2.6      page cache (radix tree)        using bio
2.4      page cache (hashtable)         using bh
2.2      buffer cache                   using bh

On disk: block 0, superblock, inodes (inode for filename), data (0th block).

19 Developing filesystem

- design filesystem
- module init / exit
- mount / umount
- directory lookup / pathname resolution
- inode manipulation: alloc / write / delete
- file creating / link management
- create / removing directories
- filesystem status

20 Filesystem Analysis

- Basic Categories
  - Filesystem / Filename / Metadata / Contents / Application [10]
- Journaling Filesystem

[Figure: filesystem data categories]

Filesystem category:    layout & size information, superblock, resource group
File name category:     file1.txt, file2.txt (directory structure, inode)
Metadata category:      times & address (file structure, inode)
Content category:       content data #1, content data #2
Application category:   quota data, journal
Journaling:             journal / allocation / deallocation / recovery

21 UXFS Layout

[Figure: UXFS disk layout]

block 0        super block
block 8 ~ 39   inodes (one block for each inode)
block 40       root directory
block 41       lost+found directory
block 42 ~     data blocks

    #define UX_DIRECT_BLOCKS 16
    #define UX_MAXFILES      32
    #define UX_MAXBLOCKS     470

    struct ux_superblock {
            __u32 s_magic;
            __u32 s_mod;
            __u32 s_nifree;
            __u32 s_inode[UX_MAXFILES];
            __u32 s_nbfree;
            __u32 s_block[UX_MAXBLOCKS];
    };

    struct ux_inode {
            __u32 i_mode;
            __u32 i_nlink;
            __u32 i_atime;
            __u32 i_mtime;
            __u32 i_ctime;
            __u32 i_uid;
            __u32 i_gid;
            __u32 i_size;
            __u32 i_blocks;
            __u32 i_addr[UX_DIRECT_BLOCKS];
    };

    struct ux_dirent {
            __u32 d_ino;
            char  d_name[28];
    };

Root directory entries (block 40):
    d_ino = 2, d_name = "."
    d_ino = 2, d_name = ".."
    d_ino = 3, d_name = "lost+found"
    d_ino = 4, d_name = "fred"
    d_ino = 0, d_name = ""
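
To see how these structures relate to the 512-byte block, the sketch below restates them in user-space C and computes their sizes. It assumes `__u32` can be replaced by `uint32_t`, that the compiler adds no padding, and that the block size is the 512-byte UX_BSIZE; the two helper functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UX_BSIZE         512   /* assumed block size */
#define UX_DIRECT_BLOCKS 16
#define UX_MAXFILES      32
#define UX_MAXBLOCKS     470

struct ux_superblock {
    uint32_t s_magic;
    uint32_t s_mod;
    uint32_t s_nifree;
    uint32_t s_inode[UX_MAXFILES];
    uint32_t s_nbfree;
    uint32_t s_block[UX_MAXBLOCKS];
};

struct ux_inode {
    uint32_t i_mode;
    uint32_t i_nlink;
    uint32_t i_atime;
    uint32_t i_mtime;
    uint32_t i_ctime;
    uint32_t i_uid;
    uint32_t i_gid;
    uint32_t i_size;
    uint32_t i_blocks;
    uint32_t i_addr[UX_DIRECT_BLOCKS];
};

struct ux_dirent {
    uint32_t d_ino;
    char     d_name[28];
};

/* Number of UX_BSIZE blocks needed to hold `bytes` bytes (rounded up). */
size_t ux_blocks_needed(size_t bytes)
{
    assert(bytes > 0);
    return (bytes + UX_BSIZE - 1) / UX_BSIZE;
}

/* Number of fixed-size directory entries that fit in one block. */
size_t ux_dirents_per_block(void)
{
    return UX_BSIZE / sizeof(struct ux_dirent);
}
```

Under these assumptions a directory entry is 32 bytes (16 entries per block), an inode is 100 bytes and fits in its own block, and the superblock as declared comes to 2024 bytes, i.e. four 512-byte blocks.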

22 UXFS Layout

- Design Detail
  - supports only 512-byte blocks (UX_BSIZE)
  - fixed number of blocks (UX_MAXBLOCKS)
    - 470 blocks
  - superblock is stored in block 0
  - there are only 32 inodes (UX_MAXFILES)
    - in practice 28 are usable (inodes 0 and 1, root inode 2 and lost+found inode 3 are excluded)
    - has 9 direct pointers → limits the file size to (9 * 512) bytes
  - first data block is block 42
    - block 40 stores the root directory entries
    - block 41 stores the lost+found directory entries
  - directory entries are fixed in size (32 bytes)
    - max filename size is 28 bytes
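
The limits above follow from simple arithmetic, sketched below. UX_INODE_BLOCK (the first inode block, 8) and the one-inode-per-block mapping are read off the layout figure and are assumptions here, as are the helper names. The number of direct pointers is a parameter because the design notes say 9 while the struct listing defines UX_DIRECT_BLOCKS as 16.

```c
#include <assert.h>

#define UX_BSIZE        512
#define UX_MAXFILES     32
#define UX_INODE_BLOCK  8    /* assumed: inodes occupy blocks 8 ~ 39 */

/* Block that holds on-disk inode number ino (one inode per block). */
unsigned ux_inode_block(unsigned ino)
{
    assert(ino < UX_MAXFILES);
    return UX_INODE_BLOCK + ino;
}

/* Largest file reachable through `direct` direct block pointers. */
unsigned long ux_max_file_size(unsigned direct)
{
    return (unsigned long)direct * UX_BSIZE;
}

/* Inodes left for user files: 0, 1, root (2) and lost+found (3) are taken. */
unsigned ux_usable_inodes(void)
{
    return UX_MAXFILES - 4;
}
```

With 9 direct pointers the limit is 4608 bytes; with 16 it would be 8192 bytes.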

23 Filesystem Registration

- register_filesystem()

    static struct file_system_type uxfs_fs_type = {
            .owner    = THIS_MODULE,
            .name     = "uxfs",
            .get_sb   = ux_get_sb,
            .kill_sb  = kill_block_super,
            .fs_flags = FS_REQUIRES_DEV,
    };
    ...
    register_filesystem(&uxfs_fs_type);

struct file_system_type fields: name, fs_flags, get_sb(), kill_sb(), next, fs_supers, s_lock_key, s_umount_key

[Figure: file_systems (protected by file_systems_lock) heads a singly linked list of struct file_system_type entries (ext3 → jffs → proc) chained through next. Each entry carries name, fs_flags, get_sb(), kill_sb() and fs_supers; fs_supers links the struct super_block instances of that type through s_instances.]
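
The linked list in the figure can be modelled in user space as below. This is only a sketch of the list handling: the demo_* names are hypothetical, there is no locking, and the real struct file_system_type has more fields.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Reduced model of struct file_system_type: a name and the list link. */
struct demo_fs_type {
    const char *name;
    struct demo_fs_type *next;
};

/* Head of the list, standing in for file_systems. */
static struct demo_fs_type *demo_file_systems;

/* Return the slot holding the entry called name, or the tail slot
 * (which contains NULL) when no such entry is registered. */
static struct demo_fs_type **demo_find_filesystem(const char *name)
{
    struct demo_fs_type **p;

    for (p = &demo_file_systems; *p != NULL; p = &(*p)->next)
        if (strcmp((*p)->name, name) == 0)
            break;
    return p;
}

/* Append fs to the list; refuse duplicates by name. */
int demo_register_filesystem(struct demo_fs_type *fs)
{
    assert(fs != NULL && fs->name != NULL);

    if (fs->next != NULL)
        return -EBUSY;               /* already linked somewhere */

    struct demo_fs_type **p = demo_find_filesystem(fs->name);
    if (*p != NULL)
        return -EBUSY;               /* name already registered */
    *p = fs;
    return 0;
}

/* Unlink fs from the list. */
int demo_unregister_filesystem(struct demo_fs_type *fs)
{
    struct demo_fs_type **p;

    for (p = &demo_file_systems; *p != NULL; p = &(*p)->next) {
        if (*p == fs) {
            *p = fs->next;
            fs->next = NULL;
            return 0;
        }
    }
    return -EINVAL;
}
```

Mounting with `-t uxfs` then amounts to walking this list by name to find the entry whose get_sb() should be called.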

24 Filesystem mount

# mount -t uxfs /dev/sdd1 /mnt/testdir

[Figure: mount flow, user mode → kernel mode]

1. pathname lookup of /mnt/testdir: get the dentry of the mount point
   - find the dentry using the hash
   - otherwise find the directory entry on disk (lookup: ino = ux_find_entry() in ux_lookup(), inode = iget(), d_add(dentry, inode))
2. find the filesystem type and call type->get_sb() (ux_get_sb())
3. pathname lookup of /dev/sdd1
4. get the superblock: sget() (fields shown: s_list, s_instance, s_dirty, s_inodes, s_dirt, s_root, s_type, s_fs_info)
5. fill the superblock from ux_superblock (SB on the volume) and get the root inode
6. root inode → root dentry (s_root)

25 File Creation

# touch /mnt/testdir/sample

[Figure: file creation flow]

1. pathname lookup: path_lookup_create() on /mnt/testdir gets the dentry of the parent directory
2. find directory entry for *sample*: ino = ux_find_entry(); a new dentry is allocated for *sample*
3. create inode: ux_create()
   - allocate new VFS inode: inode = new_inode()
   - allocate new inode number: ino = ux_ialloc() (reads the superblock and disk inode to allocate the new inode number)
   - add parent directory entry: ux_diradd()
   - fill the new VFS inode
4. link new dentry & inode: d_instantiate()

Parent directory data block on the volume:
    d_ino = 20, d_name = "."
    d_ino = 15, d_name = ".."
    d_ino = XX, d_name = "sample"
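
The directory steps in this flow (look the name up, then add an entry to the parent) can be sketched in user space on a single 512-byte directory block. The demo_* functions are hypothetical simplifications, not the module's ux_find_entry() and ux_diradd(); a slot with d_ino == 0 is treated as free, following the empty entry shown in the UXFS layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define UX_BSIZE   512
#define UX_NAMELEN 28

struct ux_dirent {
    uint32_t d_ino;
    char     d_name[UX_NAMELEN];
};

#define UX_DIRS_PER_BLOCK (UX_BSIZE / sizeof(struct ux_dirent))

/* Return the inode number stored under name, or 0 if it is not present. */
uint32_t demo_find_entry(const struct ux_dirent *blk, const char *name)
{
    assert(blk != NULL && name != NULL);

    for (size_t i = 0; i < UX_DIRS_PER_BLOCK; i++)
        if (blk[i].d_ino != 0 &&
            strncmp(blk[i].d_name, name, UX_NAMELEN) == 0)
            return blk[i].d_ino;
    return 0;
}

/* Put (name, ino) into the first free slot.
 * Returns the slot index, or -1 if the name is too long or the block is full. */
int demo_diradd(struct ux_dirent *blk, const char *name, uint32_t ino)
{
    assert(blk != NULL && name != NULL && ino != 0);

    size_t len = strlen(name);
    if (len > UX_NAMELEN)
        return -1;

    for (size_t i = 0; i < UX_DIRS_PER_BLOCK; i++) {
        if (blk[i].d_ino == 0) {
            memset(blk[i].d_name, 0, UX_NAMELEN);
            memcpy(blk[i].d_name, name, len);
            blk[i].d_ino = ino;
            return (int)i;
        }
    }
    return -1;
}
```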

26 Other Filesystem

- FAT Filesystem
- Ext3
- JFS
- Flash Filesystem
- Advanced Filesystem / Btrfs
- POHMELFS
- AXFS / LogFS / UBIFS

27 FAT layout[3]

[Figure: FAT volume layout]

Reserved Area (Boot Record, FileSystem Information (FSInfo)) | FAT #1 | FAT #2 | Root Directory | Cluster | Cluster | ...

FSInfo: cluster count, next free cluster
Field sizes: FAT entry 4 bytes, filesize 4 bytes, total sector32 4 bytes
Each FAT entry holds the number of the next cluster in the chain; F marks the end of a chain.

# cluster size 4K
- maximum FAT volume size (cluster count uses 28 bits * cluster size)
  2^(32-4) * 2^12 = 2^40 = 1T
- maximum size actually possible (sector size : 512 bytes)
  2^32 * 2^9 = 2^41 = 2T
- maximum file size
  2^32 = 4G

28 FAT directory

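[Figure: FAT directory example. Reserved Area | FAT #1 | FAT #2 | Root Directory (cluster 2) | clusters 3, 5, 6, 7, 8, 9, ...]

Root directory:
name     type        start cluster
b.txt    file        0x06
var      directory   0x07

Directory var (cluster 7):
name     type        start cluster
messag   file        0x09

FAT entries shown in the figure: 8 F F 3 5 F F (cluster labels: 2 3 8 9 10 11)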
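
Reading a file on FAT means following its cluster chain through the table, starting at the start cluster stored in the directory entry. A minimal sketch for 32-bit FAT entries follows; the function name and the sample table used to exercise it are made up for illustration and are not the values from the figure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_FAT_MASK 0x0FFFFFFFu  /* only the low 28 bits are the cluster */
#define DEMO_FAT_EOC  0x0FFFFFF8u  /* values from here up end a chain */

/* Follow the chain that begins at cluster `start`, storing at most `max`
 * cluster numbers in out[]. Returns the number of clusters stored.
 * The walk stops at an end-of-chain mark, at an out-of-range cluster,
 * or when out[] is full (which also bounds a corrupted, cyclic chain). */
size_t demo_fat_chain(const uint32_t *fat, size_t nfat, uint32_t start,
                      uint32_t *out, size_t max)
{
    assert(fat != NULL && out != NULL);

    size_t n = 0;
    uint32_t c = start;

    while (c >= 2 && c < nfat && n < max) {   /* clusters 0 and 1 are reserved */
        out[n++] = c;
        uint32_t next = fat[c] & DEMO_FAT_MASK;
        if (next >= DEMO_FAT_EOC)
            break;
        c = next;
    }
    return n;
}
```

The 28-bit mask is the reason the volume size calculation for FAT uses 2^(32-4) clusters rather than 2^32.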

29 Ext3 layout[3]

[Figure: Ext3 layout]

Boot Sector | Block Group 0 | Block Group 1 | Block Group 2 | ... | Block Group N

Each block group:
Super Block | Group Descriptor (n blocks) | Block Bitmap (1 block) | Inode Bitmap (1 block) | Inode Table (n blocks) | Block ... Block

# when block size is 4K
- number of blocks one block bitmap can represent
  2^12 (4K) * 2^3 (8 bits) = 2^15
- maximum filesystem size (blocks per group * number of groups * block size)
  2^15 * 2^16 (number of groups) * 2^12 (4K) = 2^43 = 8T

30 Ext3 inode

[Figure: Ext3 inode block pointers. The 128-byte inode starts with the inode header at offset 0; from offset 40 it holds 12 direct pointers, one indirect pointer, one double indirect pointer and one triple indirect pointer. Each indirect pointer block holds 1024 pointers to blocks or to further indirect pointer blocks.]

# when block size is 4K
- maximum file size
  (12 * 2^12) + (2^10 * 2^12) + (2^10 * 2^10 * 2^12) + (2^10 * 2^10 * 2^10 * 2^12) ≒ 2^42 ≒ 4T
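
The maximum file size formula can be checked with a few lines of C. This sketch covers the block-pointer arithmetic only, assuming 4-byte block pointers and 12 direct pointers as in the figure; the function name is hypothetical, and other limits of a real implementation are not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes addressable through 12 direct pointers plus single, double and
 * triple indirect blocks, for a given block size in bytes. */
uint64_t demo_max_file_bytes(uint32_t block_size)
{
    assert(block_size >= 1024 && block_size % 4 == 0);

    uint64_t ptrs = block_size / 4;          /* pointers per indirect block */
    uint64_t blocks = 12                     /* direct */
                    + ptrs                   /* single indirect */
                    + ptrs * ptrs            /* double indirect */
                    + ptrs * ptrs * ptrs;    /* triple indirect */
    return blocks * block_size;
}
```

For a 4K block this gives 2^42 + 2^32 + 2^22 + 12 * 2^12 bytes, which the triple indirect term dominates, hence the approximation 2^42 = 4T.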

31 Ext3 directory

ext3_dir_entry2

    uint32 inode number
    uint16 rec_len
    uint8  name_len
    uint8  file_type
    char   name[255]

[Figure: a directory inode (inode header, direct pointers, indirect pointer, double indirect pointer, triple indirect pointer) points to a data block that holds the directory entries.]

Entries shown in the figure (values as they appear on the slide):
    11   16   8   REG_FILE   "."
    10   16   8   REG_FILE   ".."
    25   32   8   REG_FILE   "doc"
    131  16   16  REG_FILE   "message.1"
    205  16   8   REG_FILE   "tmp"

32 JFS Layout[4]

33 JFS inode structure

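[Figure: JFS inode. The inode consists of the common inode fields, imap-related fields, an unused area, and a union that holds the B+ tree header (H) and the xtree entry array.]

B+ tree in the figure:
    root (in inode):  H | 0 | 100 | 200 | 300
    internal nodes:   H 0 50 | H 100 150 | H 200 250 | H 300 350 400
    leaves:           0 25 50 75 100 125 150 175 200 225 250 275 300 330 350 375 400 425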

34 Flash Filesystem

- Flash memory features
  - bad block management
    - initial bad block / run-time bad block
  - out of place updates
    - in-place updates are not possible; a write is only possible after an erase
    - wear-leveling is required
  - lifetime of erase blocks
    - each block has a limited number of erases (wear-out limit)
  - large erase block
    - garbage collection is required

[Figure: erase blocks made up of pages in clean, dirty or valid state; the figure marks the block that can be erased.]

35 Problems of current Flash Filesystem

- Problem
  - Slow mount time
  - Heavy memory consumption
    - many filesystem structures are organized in memory
- Reason
  - at mount time, the disk must be scanned and the related structures built in memory

[Figure: the flash is scanned at mount time to build the index tree and file tree in memory, next to the VFS caches (dentry cache, inode cache, page cache).]

36 Flash Filesystem Features

- Log-structured FS features
  - write append
  - robust to power failure
  - garbage collection

[Figure: the log over time. Writes are appended: a b c, then the updates a` c` and a new node d, then a``, d` and e. A read requires a scan of the log to find the newest version of each node. Garbage collection reclaims the space held by obsolete versions and keeps only the valid nodes.]
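
The scan and garbage collection shown in the figure can be modelled with a small array acting as the log. This is a user-space sketch under simplifying assumptions: each node is identified by one lowercase letter, a later record always supersedes an earlier one with the same id, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_IDS 26   /* node ids 'a' .. 'z' */

struct demo_node {
    char id;              /* which node this record belongs to */
    unsigned version;     /* 0 = a, 1 = a`, 2 = a``, ... */
};

/* Mount-time scan: read the WHOLE log and remember, for every id, the
 * index of its newest record (-1 if the id never appears).
 * Returns the number of distinct live nodes. */
size_t demo_scan(const struct demo_node *log, size_t n, long latest[DEMO_IDS])
{
    assert(log != NULL || n == 0);

    size_t live = 0;
    for (int k = 0; k < DEMO_IDS; k++)
        latest[k] = -1;

    for (size_t i = 0; i < n; i++) {
        int k = log[i].id - 'a';
        if (k < 0 || k >= DEMO_IDS)
            continue;                 /* ignore records with unknown ids */
        if (latest[k] < 0)
            live++;
        latest[k] = (long)i;          /* later record supersedes earlier */
    }
    return live;
}

/* Garbage collection: copy only the newest record of each node to out[],
 * which must have room for n records. Returns the number copied. */
size_t demo_gc(const struct demo_node *log, size_t n, struct demo_node *out)
{
    long latest[DEMO_IDS];
    size_t m = 0;

    demo_scan(log, n, latest);
    for (size_t i = 0; i < n; i++) {
        int k = log[i].id - 'a';
        if (k >= 0 && k < DEMO_IDS && latest[k] == (long)i)
            out[m++] = log[i];
    }
    return m;
}
```

The scan touches every record in the log, which is why mount time and memory use grow with the size of the device.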

37 Advanced Flash Filesystem

- Requirements
  - low memory usage
  - fast mount time
  - wear leveling
  - robust to failure

- Types of advanced flash filesystem
  - approach : using a tree structure
    - no need to scan the whole device when the filesystem mounts
    - wandering tree problem
  - (has replaced the jffs3 project)

38 Flash Filesystem

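- Wandering tree problem[6]

[Figure: file tree with the inode at the root, nth / 2nd / 1st level indirect nodes below it and data at the leaves; nodes are marked valid, invalid or new. Because updates are written out of place, writing new data invalidates the 1st indirect node that pointed to the old data, so a new 1st indirect node is written; that in turn invalidates the 2nd indirect node above it.]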

39 Flash Filesystem

- Wandering tree problem (cont)

[Figure: continuation. The update keeps propagating up the file tree: new 2nd indirect and nth indirect nodes are written, and finally a new inode is written next to the old one.]

40 Advanced Flash Filesystem(UBIFS)

- Volume managing layer
  - bad block management
  - wear-leveling

- UBI layer[7]
  - maps logical eraseblocks (LEB) to physical eraseblocks (PEB)
  - separates the flash properties from the filesystem
  - supports write-back

[Figure: layering]

UBIFS layer   read/write
UBI layer     LEB0 LEB1 LEB2 LEB3 LEB4
MTD layer     PEB0 PEB1 PEB2 PEB3 PEB4
Flash         flash memory
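
The LEB to PEB mapping can be illustrated with a table and per-block erase counters. The sketch below is a toy user-space model, not UBI's actual algorithm or API: all demo_* names are hypothetical, and the policy (rewrite a LEB into the free PEB with the lowest erase count, then erase the old PEB) is an assumed, simplified form of wear-leveling.

```c
#include <assert.h>

#define DEMO_PEBS     5
#define DEMO_UNMAPPED (-1)

struct demo_ubi {
    int      leb_to_peb[DEMO_PEBS];   /* logical → physical eraseblock */
    int      peb_used[DEMO_PEBS];     /* 1 while a LEB is mapped to it */
    unsigned erase_count[DEMO_PEBS];  /* wear of each physical block */
};

void demo_ubi_init(struct demo_ubi *u)
{
    assert(u != 0);
    for (int i = 0; i < DEMO_PEBS; i++) {
        u->leb_to_peb[i] = DEMO_UNMAPPED;
        u->peb_used[i] = 0;
        u->erase_count[i] = 0;
    }
}

/* Rewrite a whole LEB out of place: choose the least-worn free PEB,
 * remap the LEB to it, then erase and free the PEB it used before.
 * Returns the new PEB number, or -1 if no PEB is free. */
int demo_leb_change(struct demo_ubi *u, int leb)
{
    assert(u != 0 && leb >= 0 && leb < DEMO_PEBS);

    int best = -1;
    for (int p = 0; p < DEMO_PEBS; p++)
        if (!u->peb_used[p] &&
            (best < 0 || u->erase_count[p] < u->erase_count[best]))
            best = p;
    if (best < 0)
        return -1;

    int old = u->leb_to_peb[leb];
    u->peb_used[best] = 1;
    u->leb_to_peb[leb] = best;
    if (old != DEMO_UNMAPPED) {
        u->erase_count[old]++;        /* the old copy is erased */
        u->peb_used[old] = 0;
    }
    return best;
}
```

The filesystem above keeps addressing LEB0 while the data moves between physical blocks, which is the separation of flash properties from the filesystem that the slide mentions.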

41 Advanced Filesystem

- Ext4
  - introduces the allocation concept to ext3
  - adds design for handling large files despite the limits of the ext3 structure
  - 1 EiB a file / 1 EiB a volume / 255 bytes
- Btrfs [9]
  - a filesystem being developed by Oracle with the goal of replacing ext3
  - supports on-line fsck, writable snapshot, sub-volume
  - can be upgraded from ext3fs to btrfs
  - 16 EiB a file / 16 EiB a volume / 255 bytes
  - crfs (coherent remote filesystem) - btrfs : Oracle
- pohmelfs
  - (parallel optimized host message exchange layered filesystem)
    - handles as much as possible locally, with minimal server interaction
    - 10x faster than NFS synchronous, 3x faster than NFS asynchronous

42 Advanced Flash Filesystem(AXFS)

- Advanced XIP Filesystem
  - read-only flash filesystem
  - current common rootfs option
    - XIP-modified : saves RAM, but requires extra flash memory
    - : saves flash memory, but requires extra RAM
  - allows each page to be XIP or not
  - uses the kernel profile (/proc/axfs/volume0) as input to the image builder
    - makes it possible to build a small and fast image
  - supports kernel 2.6.26 and later
