FAST: Quick Application Launch on Solid-State Drives

Yongsoo Joo1, Junhee Ryu2, Sangsoo Park1, and Kang G. Shin1,3

1Ewha Womans University, Korea 2Seoul National University, Korea 3University of Michigan, USA Application Launch Delay

! Elapsed time between two events ! A user clicks the icon ! The application becomes responsible

! Important for interactive applications ! Critically affects user satisfaction

2 Application Launch Performance

! Moore’s law not applicable ! Faster CPU and larger main memory not helpful ! HDD seek and rotational latencies do not improve well

(MIPS) (Gbit/s) (Mbit/s) (ms) 100000! 250! 1200! 15! seek 10000! 200! 1000! 12! 1000! 800! 150! 9! rotational 100! 600! 100! 6! 10! 400! 1! 50! 200! 3! Average seek time! Average rotational latency! 0.1! 0! 0! 0! 1970! 1980! 1990! 2000! 2010! 1980! 1990! 2000! 2010! 1990! 2000! 2010! 1990! 2000! 2010! (a)CPU CPU performance performance (b) PeakDRAM bandwidth throughput of DRAMs (c)HDD Peak throughput bandwidth of HDDsHDD (d)access Disk access latency latency Exponential improvement Linear improvement

3 Application Launch Performance

! Application launch breakdown

>'%6.$?$/'0@ 5**B@?0C@2'$?4 D?$?@$2?0E3*2@ -?$*0A# $/'0?-@-?$*0A# $/%*

1456'$

1/2*3'(

+,'-.$/'0

)'$*%

!"#$"%&'(

78 978 :78 ;78 <78 =778

4 SW-Level Optimization

! Many SW-level schemes deployed in OSes ! Application defragment, Superfetch, readahead, BootCache, etc.

! Sorted prefetch (ex: Windows prefetch) ! Obtain the set of accessed blocks for each application ! Monitor I/O requests during an application launch ! Pause the target application upon detection of its launch ! Prefetch the predetermined set of blocks in their LBA order ! Reduce the total seek distance of the disk head ! Resume the launch after the prefetch completes

5 SW-Level Optimization

! How sorted prefetch works position

HDD track Time Launch Launch start completion

Prefetcher CPU execution computation Improvement (typ: 40%) position

HDD track Launch Launch Launch Time detection resumption completion (x-axis not in scale)

6 Flash-based SSD

! The single most effective way to eliminate disk head positioning delay ! Acrobat reader: 4.0s -> 0.8s (84% reduction) ! Matlab: 16.0s -> 5.1s (68% reduction) ! Characteristics ! Consist of multiple NAND flash chips ! No mechanical moving part ! Uniform access latency (a few 100 microseconds) ! Prices now affordable ! 80 GB MLC SSD: less than 200$ now

7 Motivation

! Question: Are we satisfied with the app launch on SSD? ! Yes for lightweight applications (e.g., less than 1 sec) ! No for heavy applications (e.g., more than 5 sec) ! Far from ultimate user satisfaction ! Faster application launch is always good (at least, not bad)

! Needs increase for launch optimization on SSDs ! Applications are getting HEAVIER ! More blocks to be read ! SSD random read performance improves slowly ! Bounded by the single chip performance

8 HDD-Aware Optimizers on SSD

! Question: Will traditional HDD optimizers work for SSDs? ! Consensus: they will not be effective on SSDs ! Rationale: they mostly optimize disk head movement ! No disk head in SSDs ! Often recommended not to use on SSDs

! ! HDD-aware optimizers disabled upon detection of SSD ! Windows prefetch, Application defragmentation, Superfetch, Readyboost, etc.

9 Sorted Prefetch on SSDs

! No benefit from LBA sorting

! Uniform seek latency of SSD 32 24 Average QD: 0.3 ! Launch performance still improves16 8 ! Increased effective queue depth (0.3->3.4,0 app: Eclipse) 0 1 2 3 4 5 ! Observed 7% launch time reduction: better(a) Cold thanstart (no nothing! ) 32 32 32 24 AQueueverage QD: depth: 0.3 0.3 24 24 Average QD: 3.4 32 16 16 16 24 Average QD: 0.3 8 8 8 16 0 0 0 Queue depth Queue depth 8 (sec) 0 0.1 0.2 0.3 0.4 0.5 0.6 0 0 1 2 3 4 5 0 1 2 3 4 5 (sec) 0 (a)1 Cold2 start (no3 prefetcher)4 5 (b) Baseline prefetcher (c) Baseline prefetcher (zoomed in) 32 (a) Cold start (no prefetcher) 32 32 32 A24verage QD: 3.4 24 Average QD: 30.6 3224 32 24 16 16 2416 24 Average16 QD: 3.4 Queue8 depth: 3.4 8 16 8 16 8 0 0 0 0 8 8 0 0.1 0.2 0.3 0 0 1 2 3 4 5 0 0 0.10 0.21 0.32 0.4 3 0.5 40.6 5 Queue depth (d) Two-phase prefetcher (e) Two-phase prefetcher (zoomed in) 0 1 (b) Baseline2 3 prefetcher4 5 0 0.1(c) Baseline0.2 0.3 prefetcher0.4 0.5 (zoomed0.6 (sec in)) 32 (b) Baseline prefetcher (c)32 Baseline prefetcher (zoomed in) 10 3224 32 24 Average QD: 30.6 2416 24 A16verage QD: 30.6 16 8 16 8 8 0 8 0 0 0 0 1 2 3 4 5 0 0.1 0.2 0.3 0 1 2 3 4 5 0 0.1 0.2 0.3 (d) Two-phase prefetcher (e) Two-phase prefetcher (zoomed in) (d) Two-phase prefetcher (e) Two-phase prefetcher (zoomed in) FAST: Fast Application STarter

! Overlap CPU computation with SSD accesses

Application s1 c1 s2 c2 s3 c3 s4 c4 0 tlaunch Time (a) Cold start scenario

Application c1 c2 c3 c4 0 tlaunch Time (b) Warm start scenario

Application c1 c2 c3 c4 Time Prefetcher s1 s2 s3 s4 0 tlaunch Time (c) Proposed prefetching ( t cpu > t ssd )

11 Application Launch Sequence

! Deterministic block requests over repeated launches ! Raw block request traces

b1 b2 b3 b4 b5

b1 b2 b3 b4 b5 ...

b1 b2 b3 b4 b5

! Application launch sequence

b1 b2 b3 b4 b5

Block requests irrelevant Unrelated to application launch to the application launch

12 What to Do

! Application launch sequence profiling ! Using blktrace tool

! Prefetcher generation ! Replay block requests according to the application launch sequence

! Prefetcher execution ! Simultaneously with the original application ! By wrapping the system call exec() ! LD_PRELOAD

13 Prefetcher Generation

! Example application launch sequence ! AB->C->D ! Block-level I/O: (start LBA, size) ! (5, 2)->(1, 1)->(7, 1) <- obtainable from blktrace ! File-level I/O: (filename, offset, size) ! (“b.so”, 2, 2)->(“a.conf”, 1, 1)->(“c.lib”, 0, 1)

"a.conf" "b.so" "c.lib" File offset 0 1 2 0 1 2 3 0 1 2 C A B D LBA 0 1 2 3 4 5 6 7 8 9 "/dev/sda" Accessed block

14 Prefetcher Generation

! Example application launch sequence ! AB->C->D ! Block-level I/O: (start LBA, size) ! (5, 2)->(1, 1)->(7, 1) <- obtainable from blktrace ! File-level I/O: (filename, offset, size) ! (“b.so”, 2, 2)->(“a.conf”, 1, 1)->(“c.lib”, 0, 1)

"a.conf" "b.so" "c.lib" File offset 0 1 2 0 1 2 3 0 1 2 C A B D LBA 0 1 2 3 4 5 6 7 8 9 "/dev/sda" Accessed block

15 Prefetcher Generation

! Block-level I/O replay

int main(void) { ! fd = open("/dev/sda",O_RDONLY|O_LARGEFILE); ! posix_fadvise(fd,5*512,2*512,POSIX_FADV_WILLNEED); ! posix_fadvise(fd,1*512,1*512,POSIX_FADV_WILLNEED); ! posix_fadvise(fd,7*512,1*512,POSIX_FADV_WILLNEED); ! return 0; } LBA size

"a.conf" "b.so" "c.lib" File offset 0 1 2 0 1 2 3 0 1 2 C A B D LBA 0 1 2 3 4 5 6 7 8 9 "/dev/sda" Accessed block

16 Page Structure

Page cache

inode /dev/sda a.conf b.so c.lib

cached A B blocks C D

17 Page Cache Structure

Page cache

inode /dev/sda a.conf b.so c.lib

cached A B blocks Miss! Miss! Miss! C D

18 Page Cache Structure

Page cache

inode /dev/sda a.conf b.so c.lib

cached A B blocks C A B D C D

What we need to construct

19 Prefetcher Generation

! File-level I/O replay

int main(void) { ! fd1 = open("b.so", O_RDONLY); ! posix_fadvise(fd1,2*512,2*512,POSIX_FADV_WILLNEED); ! fd2 = open("a.conf",O_RDONLY); ! posix_fadvise(fd2,1*512,1*512,POSIX_FADV_WILLNEED); ! fd3 = open("c.lib", O_RDONLY); ! posix_fadvise(fd3,0*512,1*512,POSIX_FADV_WILLNEED); ! return 0; } file name file offset size "a.conf" "b.so" "c.lib" File offset 0 1 2 0 1 2 3 0 1 2 C A B D LBA 0 1 2 3 4 5 6 7 8 9 "/dev/sda" Accessed block

20 Block-to-File Level I/O Conversion

! LBA-to-inode mapping ! Not supported by EXT file system

(5,2) (“b.so”, 2,2) (1,1) (“a.conf”,1,1) (7,1) (“c.lib”, 0,1)

"a.conf" "b.so" "c.lib" File offset 0 1 2 0 1 2 3 0 1 2 C A B D LBA 0 1 2 3 4 5 6 7 8 9 "/dev/sda" Accessed block

21 Block-to-File Level I/O Conversion

! Inode-to-LBA map for a single file ! Easy to build

! LBA-to-inode map for the entire file system ! Millions of files in a file system ! Frequently changed ! Only a few 100s of files used by a single application

! Our approach: build a partial map for each application ! Determine the set of files used for the launch ! Monitoring system calls using filename as their argument

22 Application Prefetcher

! Automatically generated application prefetcher for Gimp

int main(void) { ... readlink("/etc/fonts/conf.d/90-ttf-arphic-uming-embolden.conf", linkbuf, 256); int fd423; fd423 = open("/etc/fonts/conf.d/90-ttf-arphic-uming-embolden.conf", O_RDONLY); posix_fadvise(fd423, 0, 4096, POSIX_FADV_WILLNEED); posix_fadvise(fd351, 286720, 114688, POSIX_FADV_WILLNEED); int fd424; fd424 = open("/usr/share/fontconfig/conf.avail/90-ttf-arphic-uming-embolden.conf", O_RDONLY); posix_fadvise(fd424, 0, 4096, POSIX_FADV_WILLNEED); int fd425; fd425 = open("/root/.gnupg/trustdb.gpg", O_RDONLY); posix_fadvise(fd425, 0, 4096, POSIX_FADV_WILLNEED); dirp = opendir("/var/cache/"); if(dirp)while(readdir(dirp)); ... return 0; }

23 CPU and SSD Usage

Cold CPU start SSD Warm CPU start SSD (a) (b) Reduction: FAST CPU 24% SSD Sorted CPU prefetch SSD 0 1 2 3 twarm tFAST 4 tsorted tcold 5 Low CPU usage Eclipse (sec)

Cold CPU start SSD Warm CPU start SSD (c)

FAST CPU SSD Sorted CPU prefetch SSD 0 twarm tFAST tsorted tcold 1 (sec) Firefox 24 CPU and SSD Usage

Cold CPU start SSD Warm CPU start SSD (a) (b) Reduction: FAST CPU 24% SSD Sorted CPU prefetch SSD 0 1 2 3 twarm tFAST 4 tsorted tcold 5 Low CPU usage Eclipse (sec)

Cold CPU start SSD Warm CPU start SSD (c) Reduction: FAST CPU 37% SSD Sorted CPU prefetch SSD 0 twarm tFAST tsorted tcold 1 (sec) Firefox 25 CPU and SSD Usage

Cold CPU start SSD Warm CPU start SSD (a) (b) Reduction: FAST CPU 24% SSD Sorted CPU prefetch SSD 0 1 2 3 twarm tFAST 4 tsorted tcold 5 Low CPU usage Eclipse (sec)

Cold CPU start SSD Warm CPU start SSD (c) Reduction: FAST CPU 37% SSD Sorted CPU prefetch SSD 0 twarm tFAST tsorted tcold 1 (sec) Firefox 26 Measured Application Launch Time

! Launch time reduction ! Warm start: 37% (upper bound) ! Proposed: 28% (min: 16%, max: 46%) ! Sorted prefetch: 7% (min: -5%, max: 21%)

TQNONP 1.6s 0.8s 1.9s 4.8s 2.1s 1.1s 0.9s 2.3s 2.6s 5.6s 1.8s 1.6s 1.2s 2.7s 5.1s 0.9s 1.9s 1.0s 1.0s 3.7s 2.6s 6.6s 93% TNNONP 72% 63% SNONP 27% RNONP tUDA,cold 2NONP 8DVHtsorted t7!8H QNONP FAST K!VCtwarm NONP tEV37ssd 4 ) - ) % % % 6 # # # + $ 5 + # & 5 # 2 ' 3 B - # / / ) & % % & # # " $ & - $ - 5 - & . ( WDXY,t

% # - $ bound " 9 8 ; 4 ; / 4 + & 5 - 9 - # - + G & 5 M ' ( 1 ) # 9 6 - & # 0 . # ? % ( F % " 4 8 6 = - 5 J ( % % : = K - ? 3 0 / % # # ' D " " # $ / & 8 # - % @ # ? 7 7 ( C # + 4 / ! * 3 : # - / < ) + / A # / ! B + ( . & L - > 5 = ? & ' $ > I # D E & # H + % " , > ! (Normalized to the cold start time.)

27 Measured Application Launch Time

! Launch time reduction ! Warm start: 37% (upper bound) ! Proposed: 28% (min: 16%, max: 46%) ! Sorted prefetch: 7% (min: -5%, max: 21%)

-)'( 100% -''( 93% ,'( 72% 63% ./01 +'( 2/$3#1 *'( 4!25 )'( 6%$7 '( !"#$%&# 28 Applicability on Smartphones

! Similarity to PCs with a SSD ! Running various applications ! Application launch performance does matter ! NAND Flash-based storage ! The same performance characteristic as SSDs ! Slightly modified OSes and file systems designed for PCs ! Easy to port

29 Applicability on Smartphones

! Further benefits ! More frequent launches of applications ! Limited main memory capacity ! Cold start scenario occurs more often ! Slower CPU and flash storage speed ! Relatively longer application launch time

30 Applicability on Smartphones

! Measured cold & warm start time on iPhone 4 ! Average cold start time: 6.1 seconds ! Warm start time: 63% of cold start time

#' 23456780/8 #, 90/:6780/8 6.1s 3.7s '

, Launch time (sec) $ % ' # & ( ) * + # , $ % & . # " " " " " " " " " # # # # 1 " " " " " " " " " " " " " " 0 / " " " " " ! ! ! ! ! ! ! ! ! . ! ! ! ! ! - !

31 Conclusion & Future Work

! Introduced an application prefetcher designed for SSDs ! Our ultimate goal ! Warm start performance in the cold start scenario ! Further improving FAST by exploiting the SSD parallelism ! See our poster!

CPU c1 c2 c3 c4 c5 32 1 SSD m1 d1 d2 d3 m4 d4 d5 Time 24 Average QD: 0.3

QD 2 16 8 (a) No prefetcher 0 Prefetcher 0 1 2 3 4 5 (a) Cold start (no prefetcher) CPU execution c1 c2 c3 c4 c5 1 SSD m4 d4 d2 Time 32 32 24 24 Average QD: 3.4

QD 2 d5 d1 3 m1 d3 16 16 8 8 (b) Baseline prefetcher 0 0 Launch completion 0 0.1 0.2 0.3 0.4 0.5 0.6 Prefetcher 0 1 2 3 4 5 (b) Baseline prefetcher CPU execution c1 c2 c3 c4 c5 (c) Baseline prefetcher (zoomed in) 1 SSD m4 d4 Time 32 32 24 24 Average QD: 30.6

QD 2 m1 d5 3 d2 16 16 4 d1 8 8 5 d3 0 0 First Second 0 1 2 3 4 5 0 0.1 0.2 0.3 phase phase (c) Two-phase prefetcher (d) Two-phase prefetcher (e) Two-phase prefetcher (zoomed in)

32 Backup Slides Applicability on HDDs

! FAST works as well on HDDs, but ... ! Application launch on HDDs: I/O bound ! Little room for overlapping CPU time and HDD access time ! Launch time reduction: 15% ! Sorted prefetch performs better ! Launch time reduction: 40%

100% 5//0 85% 4/0 60% 3/0 2/0 15% 1/0 /0 !"#$ %&'( '")*+$ ,-). Normalized application launch time on HDD 34 Determinism on Multi-Core

! We observed determinism even on multi-core CPUs ! Only one core is active during the most time periods ! SSD is mostly idle when two or more cores are active

CPU core 5 CPU core 4 CPU core 3 CPU core 2

CPUSSD core 1

SSD

35

Core 5 Core 4 Core 3 Core 2 Core 1

Core 5 Core 4 Core 3 Core 2 Core 1

Core 5 Core 4 Core 3 Core 2 Core 1 Why not Capturing File I/O?

! Why not simply capture all the file-level I/Os and replay them? ! Ex) Capture all read() calls using strace ! That’s possible, but the problem is... ! The number of read() calls are extremely large ! Only few of them will cause page fault, generating a block I/O ! Replaying all the captured read() calls are inefficient ! In terms of both prefetcher size and function call overhead ! Not easy to determine which of them will actually cause page faults ! May be more complicated than our approach (block-level to file-level I/O conversion)

36