arXiv:1705.04627v1 [cs.AR] 12 May 2017

Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disks

Myoungsoo Jung (1) and Mahmut T. Kandemir (2)
(1) Computer Architecture and Memory Systems Laboratory, Yonsei University
(2) Department of EECS, The Pennsylvania State University
[email protected], [email protected]

(This paper is published at the 20th IEEE International Symposium on High Performance Computer Architecture and is presented to ensure timely dissemination of scholarly and technical work. The work was mostly done when the first author was at the University of Texas at Dallas and Pennsylvania State University, and this version includes new data and material regarding garbage collection.)

Abstract

Resource utilization is one of the emerging problems in many-chip SSDs. In this paper, we propose Sprinkler, a novel device-level SSD controller, which targets maximizing resource utilization and achieving high performance without additional NAND flash chips. Specifically, Sprinkler relaxes parallelism dependency by scheduling I/O requests based on the internal resource layout rather than the order imposed by the device-level queue. In addition, Sprinkler improves flash-level parallelism and reduces the number of transactions (i.e., improves transactional-locality) by over-committing flash memory requests to specific internal resources. Our extensive experimental evaluation using a cycle-accurate, large-scale SSD simulation framework shows that a many-chip SSD equipped with our Sprinkler provides at least 56.6% shorter latency and 1.8 ~ 2.2 times better throughput than the state-of-the-art SSD controllers. Further, it improves overall resource utilization under different I/O request patterns by 68.8% and provides, on average, 80.2% more flash-level parallelism by reducing half of the flash memory requests at runtime.

1 Introduction

Flash-based memory cards and embedded SSDs have become the dominant storage technology in mobile devices, and large-scale SSDs are being rapidly deployed in laptops, data-intensive enterprise systems, and high performance computing workstations. Further, SSDs have begun to employ high speed interfaces like PCI Express – a 16GB/sec data rate – in an attempt to avoid conventional storage interface overheads and exploit the advantages brought by NAND flash [14, 31, 2, 3].

However, since individual NAND flash device bandwidths are still around 40~400MB/sec, modern SSDs are undergoing severe architectural changes. Specifically, hundreds to thousands of NAND flash memory chips are being interconnected to form a single storage, and multiple I/O channels and cores are being integrated with these chips. In parallel, new NAND flash technologies are being developed to extract the maximum amount of data access parallelism. A single NAND flash chip consists of multiple dies, each of which accommodates multiple planes. Thanks to this many-chip architecture, SSDs can easily scale up their performance by introducing more and more internal resources. As expected, the performance characteristics of modern SSDs vary based on how efficiently these hundreds or thousands of flash dies and planes are utilized.

In order to manage these resources well, different architectural approaches have been explored [4, 5]. From a system viewpoint, techniques such as ganging [1] and superblocking [7] split an I/O request into multiple flash memory requests and scatter them across different internal SSD resources. In comparison, other schemes interleave incoming requests across multiple flash dies [16, 36, 13] and planes [7] at a flash level. Various page allocation schemes [1] have also been investigated, which can determine a physical data layout that takes advantage of internal parallelism [17]. All these prior proposals parallelize data accesses across abundant internal SSD resources based on incoming I/O requests, and therefore can potentially improve SSD performance.

However, we found that, unlike the common expectation, the performance of many-chip SSDs is unfortunately not significantly improved as the amount of internal resources increases, which means that employing more and more flash chips is not a promising solution. In fact, our cycle-level simulation data reveal that the read bandwidth of a state-of-the-art many-chip SSD stagnates (Figure 1a), while the internal resource utilization sharply goes down and the memory-level idleness keeps growing (Figure 1b), as we increase the number of flash dies from thirty two to thousands.

We believe that there are two reasons why a many-chip SSD architecture suffers from utilization and idleness related problems as the number of chips increases. The first reason is the parallelism dependency exhibited by an I/O access pattern. Unlike the main memory systems, the I/O request sizes in an SSD vary from a few bytes to KBs (or MBs), and their data offsets can vary significantly, which in turn introduces unbalanced chip utilization and low parallelism at a system level. For example, some requests do not span all internal resources, and some other requests keep pending due to chip-level conflicts, which in turn influences the number of resources that can be allocated at a given time. Therefore, the degree of internal parallelism that can be enjoyed depends highly on the incoming I/O access patterns, referred to as parallelism dependency in this work. Another reason behind the utilization and idleness problems of emerging many-chip SSDs is low flash-level transactional-locality. The transactional-locality in this work corresponds to the ability of the references that form a flash transaction to exhibit high flash-level parallelism. Since multiple dies and planes are connected to shared voltage drivers and a single multiplexed interface, flash-level resources need to be managed carefully in order to exploit the maximum amount of parallelism. Specifically, highly parallel data accesses across internal flash resources can only be achieved when incoming requests span all of them (spatial) and all types of request transactions can be identified within a very short time period like a few cycles (temporal). As a result, low flash-level transactional-locality may introduce poor data access concurrency and extra idleness.

In this paper, we propose Sprinkler, a novel device-level SSD controller, which targets maximizing resource utilization and achieving high performance without additional NAND flash chips. Specifically, Sprinkler relaxes the parallelism dependency by scheduling I/O requests based on the internal resource layout rather than the order imposed by the device-level queue (which is the current norm [22, 16, 27, 20]). In addition, Sprinkler improves flash-level parallelism and reduces the number of transactions by over-committing flash memory requests to specific internal resources. To the best of our knowledge, this is the first paper that suggests to exploit the internal resource layout and over-commit flash memory requests in order to maximize resource utilization and parallelism, thereby improving many-chip SSD performance. Our main contributions can be summarized as follows:

• Resource-driven I/O scheduling. Unlike conventional SSD controllers, which schedule flash memory requests based on the "order of incoming I/O requests", Sprinkler schedules them based on "available physical flash resources" in a fine-grain, out-of-order fashion. This method, called Resource-driven I/O Scheduling (RIOS), "decouples" parallelism dependency from the I/O request access patterns, timings and sizes, improving overall resource utilization by about 68.8% under a wide variety of realistic workloads.

• Flash memory request over-commitment. In order to increase flash-level transactional-locality, Sprinkler over-commits flash memory requests to devices. This Flash-level parallelism Aware Request Over-commitment (FARO) maximizes the opportunities for building a flash transaction at runtime, parallelizing multiple memory requests with respect to the constraints imposed by the underlying flash technology. FARO provides, on average, 80.2% more flash-level parallelism and reduces approximately 50% of the flash transactions.

• Reducing idleness in many-chip SSDs. We identify two different types of idleness in an SSD: inter-chip idleness and intra-chip idleness. Sprinkler reduces inter-chip and intra-chip idleness by 46.1% and 23.5%, compared to a conventional virtual address based scheduler [22, 16] and even a state-of-the-art physical address based scheduler [27, 20], respectively. As a result, Sprinkler (RIOS and FARO together) provides at least 1.8 times better system throughput and 78% lower I/O latency.

Figure 1: Many-chip SSD performance, chip utilization, and memory-level idleness sensitivity to varying numbers of flash chips and data transfer sizes. Each curve corresponds to a different data transfer size. (a) Performance stagnation. (b) Utilization and idleness.

2 Many-Chip Solid State Disks

Modern SSDs employ hundreds to thousands of flash chips by interconnecting multiple control units and I/O channels, in an attempt to fill the performance gap between high speed interfaces (16GB/sec) and the flash medium (40~400MB/sec). Figure 2 pictorially illustrates a recent many-chip SSD architecture by Marvell [35]. In this architecture, I/O services are carried out with a multi-phase approach to take advantage of the different internal resources and exploit the overlap between different levels of parallelism, I/O, and computation.

Figure 2: A many-chip SSD architecture (host-level queue, DMA engine, cores, and NAND flash controllers on the host side; channels connecting arrays of flash chips on the storage side, where each chip exposes multiple dies, planes, and data/cache registers).

2.1 System-Level Resources

Resources. As shown in Figure 2, there exist four main shared resources: 1) Non-Volatile Memory Host Controller (NVMHC) is a control logic responsible for the communication between the host and the SSD internal components. Since NVMHC is the only component that is aware of the host interface protocols related to queueing, handshaking, and data packet handling, device-level I/O schedulers are usually implemented in NVMHC. 2) Core is a microprocessor, which is dedicated to translating virtual addresses, compatible with host file systems, to flash memory physical addresses. To manage this address translation, an embedded software module, called the flash translation layer (FTL), is implemented in the core. 3) Flash Controller builds a flash transaction from multiple memory requests and executes it. 4) Channel is a data path between the flash controller and the flash medium. Multiple flash chips, like an array, are connected to a channel and share the same path for data transfer between the SSD internal buffer and the flash chip internal registers.

I/O Service Routine. As shown in Figure 3, at the beginning of the process of servicing an I/O, NVMHC enqueues host request information, called tags, in its own device-level queue and schedules the tags. Once a tag is chosen to be serviced, NVMHC parses the associated I/O request based on the underlying protocol (e.g., NVM Express [14]). NVMHC then builds a memory request whose size is the same as the atomic flash I/O unit size, and initiates the corresponding data movement between the host and the SSD. This activity is referred to as memory request composition. Since, unlike the main memory systems, the length of an I/O request can vary significantly, ranging from a few bytes to an MB, an I/O request is typically split into multiple memory requests. In the next step, NVMHC sends these memory requests to the core processor in the SSD for further I/O processing. This memory request commitment needs to be performed in a timely fashion so that the multiple phases related to memory request composition, queuing and commitment can be pipelined. The FTL translates the virtual addresses of memory requests coming from NVMHC into physical addresses, and scatters them over the available flash controllers. Finally, a flash controller builds a flash transaction by coalescing multiple memory requests. The amount of parallelism exhibited by a flash transaction varies depending on the transactional-locality as well as the limitations of the underlying flash technology, as explained below.

Figure 3: The operation phases of an I/O service routine and the different internal resources in a many-chip SSD pipeline. In this routine, an I/O request is served by multiple phases: queuing, I/O request parsing, memory request composition and data movement initiation, address translation, memory request commitment, flash transaction decision, and execution sequence handling at the flash controllers.

Striping & Pipelining (SLP). While serving an I/O request, the many associated memory requests can be scattered across multiple internal resources. First, after the corresponding physical addresses have been determined by the FTL, data accesses can be parallelized over multiple flash controllers and channels, and this process is called channel striping. Further, each flash controller can pipeline a series of I/O commands, control commands and data movements, associated with the transaction, across multiple flash chips within a channel, and this process is referred to as channel pipelining. Even though channel striping and channel pipelining aim to improve internal parallelism, the amount of SLP brought by them depends highly on the incoming I/O access pattern. As shown in Figure 1b, poor chip utilization, caused mainly by parallelism dependency, is one of the main reasons for the performance stagnation on emerging many-chip SSDs.
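As a concrete illustration of memory request composition and SLP, the following minimal sketch (our illustration, not the authors' code; PAGE_SIZE, the MemoryRequest fields, and the round-robin placement policy are assumptions for exposition) splits one host I/O request into page-sized memory requests and scatters consecutive pages over channels (striping) and over the chips sharing each channel (pipelining):

    # A minimal sketch of memory request composition plus channel striping
    # and pipelining. All names are illustrative, not from the paper.
    from dataclasses import dataclass

    PAGE_SIZE = 2048        # atomic flash I/O unit (2KB pages, as in Section 5)
    NUM_CHANNELS = 4        # toy configuration
    CHIPS_PER_CHANNEL = 2

    @dataclass
    class MemoryRequest:
        io_tag: int         # host-level I/O request this belongs to
        page_no: int        # page-granular offset within the request
        channel: int        # assigned by channel striping
        chip: int           # assigned by channel pipelining

    def compose_memory_requests(io_tag, offset, length):
        """Split one host I/O request into page-sized memory requests:
        consecutive pages rotate over channels, then over the chips
        sharing each channel."""
        first = offset // PAGE_SIZE
        last = (offset + length - 1) // PAGE_SIZE
        requests = []
        for i, page in enumerate(range(first, last + 1)):
            channel = i % NUM_CHANNELS
            chip = (i // NUM_CHANNELS) % CHIPS_PER_CHANNEL
            requests.append(MemoryRequest(io_tag, page, channel, chip))
        return requests

    # A 9KB request becomes five memory requests spread over four channels.
    for mr in compose_memory_requests(io_tag=1, offset=0, length=9 * 1024):
        print(mr)

Note how the achievable SLP in this sketch is entirely a function of the request's length and offset, which is exactly the parallelism dependency discussed above.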
2.2 Flash-Level Technologies

Resources. State-of-the-art SSDs also employ the following flash-level technologies: 1) Flash Chip consists of multiple dies and planes, and exposes them through a small number of I/O pins (e.g., 8~16) and a CE (chip enable) pin to the system-level resources. This flash interface reduces the I/O connection complexity and communication noise, but it also introduces a set of command sequences and data movements for handling flash transactions. 2) Die is a memory island, connected to a single multiplexed bus fibre through the flash interface and the CE. Note that the memory cells in different dies can operate independently. 3) Plane is the memory array in a die, sharing the wordline and voltage drivers for accessing specific flash memory cells.

Flash-Level Parallelism (FLP). Even though multiple dies are squeezed into a single flash interface, a set of bus activities for a transaction (e.g., flash commands, data movements, control signals) can be interlaced, and the multiple dies can work independently without any circuit-level modification. Consequently, multiple memory requests can be interleaved across dies via die interleaving, which in turn improves chip throughput and response time for a transaction n times, where n is the number of flash dies. Plane sharing activates multiple planes in a die through the shared wordline access, thereby improving throughput by m times, m being the number of planes. Lastly, die interleaving and plane sharing can be combined, which can improve transaction performance by approximately n * m times.

Flash Transaction and Parallelism Dependency. Unlike DRAM, NAND flashes have a wide spectrum of operation sets, commands and execution sequences – most flash memory vendors offer at least ten flash operations, each of which typically has a different execution sequence. In this context, a flash transaction is a series of activities that the flash controller has to manage in executing a flash operation. It is composed of a set of commands (i.e., flash, control, and delimiter commands) and data movements (i.e., contents, addresses, status information). In addition, during the execution stage, all memory requests in the transaction are required to follow an appropriate timing sequence that the flash makers define, and each flash transaction has its own timing sequence. Therefore, as shown in the handling transactions part of Figure 3, the type of a transaction should be decided within a short period of time before entering the execution sequence. Due to this, the parallelism of each transaction potentially has a dependency on the I/O access pattern, length and arrival timing of incoming memory requests, which makes it difficult to achieve high levels of FLP. We refer to this as "parallelism dependency" and demonstrate specific examples exhibiting low FLP below.

Challenge. First, in cases where multiple memory requests arrive at a flash controller in a time interval which is longer than the transaction type decision time of the controller, they will be served by separate transactions even though they could have been serviced, from a flash-level perspective, as a single transaction. This poor flash-level temporal transactional-locality can potentially contribute to low FLP. On the other hand, since only one flash transaction can occupy the shared interface, bus and flash medium at a time, once the transaction type is determined and the corresponding memory requests are initiated, other memory requests heading to the same chip should be stalled until the shared flash resources are free. Lastly, to take advantage of plane sharing, the addresses of the memory requests in a transaction should indicate the same page and die offset in the flash chip, but different block addresses (or plane addresses). As a result, low flash-level spatial transactional-locality can also introduce low FLP and high intra-chip idleness.
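The two locality conditions above can be made concrete with a small sketch. The decision window length, the field names, and the exact plane-sharing test below are our assumptions for illustration; real controllers enforce vendor-specific timing sequences:

    # Sketch of the coalescing conditions for one chip: requests must arrive
    # within the transaction-type decision window (temporal locality) and,
    # for plane sharing, target one page offset on distinct planes (spatial
    # locality). Illustrative assumptions, not a vendor specification.
    from dataclasses import dataclass

    DECISION_WINDOW = 4  # cycles; assumed transaction-type decision time

    @dataclass
    class MemRequest:
        arrival_cycle: int
        die: int
        plane: int
        page: int  # page offset inside the die

    def coalescible(reqs):
        """True if all requests can form one die-interleaving plus
        plane-sharing transaction on a single chip."""
        cycles = [r.arrival_cycle for r in reqs]
        if max(cycles) - min(cycles) > DECISION_WINDOW:
            return False              # poor temporal transactional-locality
        if len({r.page for r in reqs}) > 1:
            return False              # plane sharing needs one page offset
        for die in {r.die for r in reqs}:
            planes = [r.plane for r in reqs if r.die == die]
            if len(planes) != len(set(planes)):
                return False          # two requests on one plane collide
        return True

    reqs = [MemRequest(0, die=0, plane=0, page=7),
            MemRequest(1, die=0, plane=1, page=7),
            MemRequest(2, die=1, plane=0, page=7),
            MemRequest(3, die=1, plane=1, page=7)]
    # Two dies x two planes serviced as one transaction: roughly n*m faster
    # than four serialized single-plane operations.
    print(coalescible(reqs))  # True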
3 I/O Scheduling in Modern Controllers

The state-of-the-art I/O scheduling schemes in NVMHCs can be broadly classified as virtually address schedulers (VAS) [22, 30, 16] and physically address schedulers (PAS) [27, 20], in terms of the address type of the memory requests that they operate on. In this section, we briefly explain these two schedulers and their drawbacks.

Virtually Address Scheduler (VAS). This type of scheduler decides the order of I/O requests in the device-level queue, but builds and commits memory requests relying only on the virtual addresses of the I/O requests, before translation by the underlying FTL. Consequently, VAS can suffer from collisions among I/O requests, which leads to low levels of SLP and FLP. Figure 4a illustrates this problem (Footnote 1). In this example, there exist five I/O requests arriving back-to-back and containing a total of nineteen individual memory requests. Initially, VAS composes four memory requests belonging to I/O (request) #1 and stripes/pipelines them across four different chips (C1~C4). However, to commit I/O #2 next, VAS has to wait for the completion of the previously-committed request, which makes five chips (C4~C8) idle. The reason behind this inter-chip idleness is the request collisions between I/O #1 and I/O #2 in three different chips (C0~C2), as VAS schedules the I/Os with no idea about the underlying physical addresses. Similarly, I/Os #3, #4, and #5 would be stalled in the queue due to the request collisions with the previously-committed requests, which makes twenty chips idle.

Figure 4b plots a microscopic view of C3 in this example (Footnote 2). Since VAS has no other request commitments at the beginning of the I/O process, the first transaction is built by only considering I/O #1. When C3 is serving the flash transaction, it makes the ready/busy signal (R/B) true, which means that the chip is not available to serve anything else. Even though VAS is ready to commit the memory request associated with I/O #2 as the next step, it has to wait until R/B becomes false. Similarly, since VAS has no knowledge about the underlying physical layout, it further commits three memory requests heading to C3 (associated with 3 different I/Os: #2, #4 and #5) in tandem without any transactional-locality consideration. Consequently, they are built as four different "flash transactions" at the chip level, including one stalled time frame introduced by I/O #3. One can see from this example that we have a large scope for improving utilization by reducing the number of idle chips, if VAS could reorder the I/O requests by being aware of "physical addresses".

Footnote 1: For all the system-level diagrams (Figures 4a, 5a, and 7), we show snapshots with respect to the device-level queue (e.g., native command queue), memory request commitments, and the corresponding physical layout by following the controller visit order (from left to right). For each snapshot, an uncolored box on each channel indicates idle chips, and memory request numbers on the resource layout refer to the memory request commitment order.

Footnote 2: For all flash-level view diagrams (Figures 4b, 5b, and 8), we show the bus and timing diagram associated with each system-level view's snapshot (from left to right). To make better comparisons, we also illustrate the corresponding I/O request level latency below them. Note that all the schedulers we discuss can only submit a flash transaction during the R/B = false periods and have the same type of out-of-order executable device-level queue (NCQ).
Figure 4: Operation of the virtual address scheduler (VAS). VAS exhibits low resource utilization and long (I/O request) pending times at the flash level. (a) System-level view. (b) Flash-level view (VAS service time for chip 4 (C3)).
Figure 5: Operation of the physical address scheduler (PAS). PAS exhibits better resource utilization and higher FLP than VAS, but it still suffers from parallelism dependency and low flash-level transactional-locality. (a) System-level view. (b) Flash-level view (PAS service time for memory requests).

Physically Address Scheduler (PAS). PAS schedules the I/O requests by being aware of the physical addresses exposed by a hardware-assisted preprocessor [27] or a software-based address translation unit [20]. Thanks to this physical address space exposure, PAS can reorder I/O requests in an attempt to address the request collision problem and execute flash transactions in an out-of-order fashion. Specifically, from a system-level viewpoint, PAS can serve multiple I/O requests by grouping them without a major memory request collision, which in turn reduces the number of idle chips and improves SLP. Figure 5a plots this scheduling scenario. PAS swaps I/O #3 with I/O #2 since the latter can be simultaneously executed with I/O #1 without any shared resource conflict. Due to the high degree of SLP at the beginning of the commitment process, the stalled time frame shown in Figure 4b can be eliminated, and I/Os #3, #4 and #5 save multiple execution cycles, which impacts both system latency and throughput. Since this I/O reordering scheme based on physical addresses can partially relax parallelism dependency, there are only fifteen idle chips in order to complete all the I/O requests in the queue, which corresponds to a 40% reduction in inter-chip idleness compared to VAS (Figure 4a).
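The behavioral difference between the two schedulers can be sketched as follows. This toy model is our own (with an idealized drain-and-retry busy model rather than the simulator's timing); it contrasts in-order, address-blind commitment with PAS-style coarse-grain reordering:

    # Toy comparison of in-order (VAS-style) vs. physically-aware
    # out-of-order (PAS-style) commitment. Chip sets and the busy model
    # are illustrative assumptions.
    def commit(ios, reorder):
        """ios: list of (io_id, set_of_target_chips). Returns commit order.
        In-order stalls on any chip conflict; reordering skips conflicting
        I/Os and serves a later, conflict-free one first."""
        busy, order, pending = set(), [], list(ios)
        while pending:
            for idx, (io, chips) in enumerate(pending):
                if not (chips & busy):
                    busy |= chips          # chips now serving this I/O
                    order.append(io)
                    pending.pop(idx)
                    break
                if not reorder:            # VAS: head of queue blocks all
                    busy = set()           # wait for in-flight I/Os to drain
                    break
            else:
                busy = set()               # PAS: nothing fits; drain, retry
        return order

    ios = [(1, {0, 1, 2, 3}), (2, {0, 1, 2}), (3, {4, 5}), (4, {4}), (5, {5})]
    print(commit(ios, reorder=False))  # [1, 2, 3, 4, 5]: #2 stalls behind #1
    print(commit(ios, reorder=True))   # [1, 3, 2, 4, 5]: #3 swapped ahead of #2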
However, PAS still has two downsides. First, it composes memory requests and commits them based on the "I/O request arrival order", and these arrival patterns can vary (in terms of the length and data offset of I/O requests). Second, PAS cannot capitalize on flash-level spatial transactional-locality even if one has lots of enqueued I/O requests, because it does not take into account the physical layout of the underlying resources, chips, and microarchitecture configurations. In addition, the memory request commitments performed by PAS, clueless about flash-level temporal transactional-locality, can introduce poor FLP and long intra-chip idleness. Figure 5b plots this problem for C3 from a flash-level perspective. PAS serves multiple I/O requests in parallel, and shortens their latencies as shown in the bottom of the figure. However, I/O #4 and I/O #5 still need to be stalled due to the resource collision imposed by the host-level request information and boundary limit, and therefore, each memory request is assigned to a different transaction, which in turn introduces long latencies for I/O #4 and I/O #5. This example provides two insights. First, the total number of chips is relatively fewer than the total number of memory requests coming from different I/O requests, which means that many flash chips can be activated at any given time if parallelism dependency could be relaxed.
Second, there exist (at any given period of time) many requests heading to the same chip but to different internal resources, which implies that multiple memory requests can be built into an FLP transaction if we could change their commitment order.

Improvement Potential. Strong parallelism dependency and low flash-level spatial/temporal transactional-locality introduce poor internal resource utilization that prevents a many-chip SSD architecture from realizing its potential. To quantify the potential gains when these two main challenges are removed, we simulated the same SSD platform used to collect the data presented in Figure 1, under sixteen publicly-available workloads [28, 33]. Figure 6 plots the chip utilization exhibited by a state-of-the-art SSD controller under three different scenarios: 1) a typical scenario where these two challenges are present, indicated by the black portion in each bar; 2) an improved scenario where resource conflicts are addressed, captured by the gray portion in each bar; and 3) a scenario where the parallelism dependency is fully relaxed and high transactional-locality is guaranteed, which is indicated by the shaded portion in each bar. It can be observed from this plot that internal resources are badly underutilized under the first (typical) scenario. Specifically, we observe an average chip utilization of 17% for the typical case scenario (VAS) and 24% for the improved scenario (PAS). However, in cases where the two challenges mentioned above are eliminated, all observed resource utilizations are over 40%, irrespective of the workload access pattern. Specifically, relaxing parallelism dependency and achieving high transactional-locality improve resource utilization by 3x and 2x, respectively, compared to the typical scenario and the improved scenario, and the chip utilization reaches in this case 55%, on average. Overall, these results clearly underline the importance of addressing parallelism dependency and transactional-locality in SSDs. Next, we present our proposed scheduling strategy that addresses these challenges.

Figure 6: Resource utilization and improvement potential under various workloads. Relaxing parallelism dependency and achieving high transactional-locality improve resource utilization by 3x and 2x compared to VAS and PAS, respectively.

4 Sprinkler

To maximize resource utilization and reduce idleness, we propose Sprinkler, a novel scheduling strategy composed of two components: 1) RIOS and 2) FARO. Unlike VAS and PAS, which consider the I/O request order in the device-level queue and build memory requests based on host-level information, RIOS schedules and builds memory requests based on the "internal resource layout" to fully relax the parallelism dependency. The relaxed parallelism dependency through RIOS leads to the activation of as many system-level and flash-level resources as possible, with minimal impact of the length and data offset of the incoming I/O requests. Further, FARO over-commits flash memory requests with the goal of achieving high temporal and spatial transactional-locality, which can in turn significantly improve FLP and reduce intra-chip idleness. Specifically, FARO supplies many memory requests to the underlying flash controllers, which can increase the opportunities for building a high-FLP transaction at a chip level. In addition, the over-committed memory requests by FARO satisfy the appropriate timing sequences, bringing higher temporal transactional-locality as well (more on this later). Due to this improved transactional-locality, under FARO, all memory requests targeting the same chip but different dies and planes can be incarnated as a "single" flash transaction.

4.1 Resource-Driven I/O Scheduling

One of the insights behind Sprinkler is that the stalled memory requests in the queue can be immediately served, if the scheduler could compose the requests beyond the boundary of host-level I/O requests and commit them regardless of the order of the I/O requests. Motivated by this, Sprinkler does not consider the I/O request order in the queue and does not build flash transactions based on incoming host information. Instead, it schedules memory requests based on the physical resource information, composed by the following three processes (sketched in the code below). (i) Securing tags without actual data movement until there is no more room to enqueue or no further back-to-back I/Os from the host. In this step, the scheduler identifies the resources targeted by the I/Os and logically categorizes the requests per physical chip without any memory request composition. (ii) Initiating data movement, composing memory requests and committing them per flash chip, not per I/O, by traversing all flash chips in the SSD. During this process, the necessary data movements between the host and the SSD are performed in an out-of-order fashion. Further, the computation in step (i) can be overlapped with the data movement in step (ii). (iii) Keep continuing with step (ii) until the queue secures an available room. We refer to this scheduling strategy as resource-driven I/O scheduling (RIOS), which can be viewed as a type of fine-grain out-of-order execution strategy. This parallelism dependency relaxation allows RIOS to maximize the number of active flash chips at any given time, irrespective of the I/O access pattern observed.
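A minimal sketch of the three RIOS processes follows; the helper names (preprocess, commit) and the per-chip binning structure are illustrative assumptions rather than Sprinkler's actual implementation:

    # Condensed sketch of RIOS: (i) secure tags and bin them per physical
    # chip, (ii) compose/commit per chip rather than per I/O, (iii) the
    # caller repeats while the queue refills. Names are illustrative.
    from collections import defaultdict

    def rios_schedule(host_tags, preprocess, commit):
        """host_tags: iterable of tags; preprocess(tag) -> chip ids the
        tag's memory requests target; commit(chip, tags) issues them."""
        phy_layout = defaultdict(list)
        for tag in host_tags:              # (i) secure tags, no data movement
            for chip in preprocess(tag):
                phy_layout[chip].append(tag)
        for chip in sorted(phy_layout):    # (ii) visit chips, not queue slots
            commit(chip, phy_layout[chip]) # data moves out of order, per chip
                                           # (iii) caller loops as queue refills

    # Two I/Os that would collide on chip 0 under arrival-order scheduling
    # are drained together in a single visit to chip 0.
    rios_schedule([1, 2], preprocess=lambda t: [0, t],
                  commit=lambda c, ts: print(f"chip {c}: tags {ts}"))

Note that sorted() above is a simplification; the actual visit order matters for channel contention and is discussed next.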
Figure 7: Operations of RIOS. Sprinkler executes memory requests in a fine-grain out-of-order fashion through its resource-driven I/O scheduling (RIOS) strategy. In this example, Sprinkler can eliminate one and two more snapshots (indicated by dashed lines) on the visit order time-line compared to Figures 4a and 5a.
One potential problem with RIOS is that, if it commits memory requests by visiting flash chips in an arbitrary fashion, it can introduce undesirable "system level" resource contention. For instance, if RIOS visits each flash chip in a channel-first fashion (e.g., C0, C3, C6 in the example of Figure 5a), the bus activities of each memory request, such as flash commands, control commands and data movements, require channel bus arbitration, which leads to I/O serialization to some extent.
To avoid this, RIOS visits the flash chips that have the same offset in each channel, across different channels. It then increases the chip offset, and continues to visit the corresponding flash chips for memory request composition and commitment until all flash chips in the SSD are visited. This traversal order allows RIOS to scatter memory requests across the system, taking advantage of channel striping and channel pipelining, so that their computation activities can be overlapped with I/O and their memory requests can be fully parallelized without any major resource conflicts.

Figure 7 illustrates how differently RIOS schedules the five I/Os shown, compared to VAS (Figure 4a) and PAS (Figure 5a). From the beginning of the I/O scheduling, RIOS composes six memory requests associated with I/O #1 and I/O #2, and commits them to C0, C1, and C2, whose chip offset is zero. While the data movements corresponding to I/O #1 and I/O #2 are being performed, RIOS in parallel composes eight memory requests and commits them to C3, C4 and C5 by increasing the chip offset. Lastly, RIOS schedules four memory requests related to I/O #4 and I/O #5 to the remaining chips, namely, C6, C7 and C8. Since all these individual steps can be pipelined (they do not have the same chip offsets), RIOS significantly reduces inter-chip idleness by relaxing the parallelism dependency and presents opportunities for building transactions with a high degree of FLP.
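The traversal described above can be sketched as follows, assuming chips are indexed by (channel, chip offset); this is our rendering of the visit order, not the authors' code:

    # Sketch of the RIOS chip traversal: sweep all channels at one chip
    # offset before advancing the offset, so that consecutive commitments
    # land on different channel buses and avoid bus arbitration.
    def traversal_order(num_channels=3, chips_per_channel=3):
        """Yield chips as (channel, chip_offset) pairs."""
        for chip_offset in range(chips_per_channel):
            for channel in range(num_channels):
                yield (channel, chip_offset)

    # Consecutive visits alternate channel buses: (0,0),(1,0),(2,0),(0,1),...
    # With C0..C2 at offset 0 (one per channel), C3..C5 at offset 1, and so
    # on, this matches the C0,C1,C2 -> C3,C4,C5 -> C6,C7,C8 order of
    # Figure 7, whereas a channel-first walk (C0,C3,C6,...) would serialize
    # on one channel's bus.
    print(list(traversal_order()))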
4.2 FLP-Aware Memory Request Over-commitment

The scheduling strategy adopted by NVMHC can assist flash controllers to build high-FLP transactions by being aware of the underlying flash microarchitecture characteristics. In order to improve the opportunities for building high-FLP transactions at runtime, Sprinkler also over-commits memory requests for each chip; this is referred to as FLP-aware memory request over-commitment (FARO). The idea behind FARO is that, if the scheduler is able to early-commit multiple memory requests targeting different flash internal resources in a chip, it can have more flexibility in building a flash transaction with high parallelism at the very beginning of the flash command handling. Motivated by this, our proposed FARO brings as many requests as possible to flash controllers as early as possible, allowing them to coalesce multiple memory requests into a single flash transaction with better flash-level spatial and temporal transactional-locality.

Figure 8 illustrates how much the over-committed memory requests (by FARO) can shorten the latencies of different I/O requests, compared to VAS (Figure 4b) and PAS (Figure 5b). With FARO, four over-committed memory requests associated with four different I/Os (#1, #2, #4 and #5) present high transactional-locality at the time of building a flash transaction, and therefore, they can be built (using die interleaving and plane sharing FLP) as a single transaction, which represents the highest FLP as far as C3 is concerned. Consequently, for the four memory requests shown, the system-level resources experience only four bus activities and a single flash memory cell activity. In this example, FARO saves more than half of the I/O execution latencies for I/O requests #3, #4, and #5 in the queue. The number of cycles saved by FARO is shown at the bottom of the figure.

Figure 8: FARO service timing diagram. Four memory requests can be served as a die-interleaving-with-multiplane transaction [1, 16], which only takes four bus activities and a cell activity from a system-level viewpoint. Note that this flash transaction composition is not a part of our scheduling method, which means that VAS and PAS could also achieve this transaction if they could address parallelism dependency and appropriately schedule the requests by being aware of the underlying physical resource layout.

One potential concern with FARO is that it might increase, in certain cases, the flash-level resource contention if it over-commits the memory requests without any preference. To address this potential problem, our implementation of FARO considers overlap depth and connectivity among multiple memory requests in an attempt to control the over-commitment priority dynamically. Overlap depth is the number of memory requests targeting different planes and dies in the same flash chip. Connectivity, on the other hand, is the maximum number of memory requests that belong to the same I/O request. While the overlap depth is a metric oriented towards improving FLP, the connectivity is a metric that targets improving I/O latency.
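A sketch of the two metrics and the resulting priority rule is shown below; representing pending requests as (I/O id, die, plane) tuples and reading overlap depth as the count of distinct (die, plane) targets are our interpretations for illustration:

    # Sketch of FARO's priority metrics: overlap depth counts requests
    # hitting distinct (die, plane) pairs in one chip; connectivity counts
    # how many of a chip's pending requests share one I/O request.
    from collections import Counter

    def overlap_depth(reqs):
        """reqs: list of (io_id, die, plane) pending for one chip."""
        return len({(die, plane) for _, die, plane in reqs})

    def connectivity(reqs):
        return max(Counter(io for io, _, _ in reqs).values())

    def faro_pick(per_chip):
        """Over-commit the chip with the highest overlap depth first,
        breaking ties with the highest connectivity."""
        return max(per_chip, key=lambda c: (overlap_depth(per_chip[c]),
                                            connectivity(per_chip[c])))

    per_chip = {
        "C1": [(3, 0, 0), (3, 0, 1), (3, 1, 0), (3, 1, 1)],  # depth 4, conn 4
        "C0": [(3, 0, 0), (3, 0, 1), (1, 0, 0), (5, 1, 0)],  # depth 3, conn 2
    }
    print(faro_pick(per_chip))  # C1: its four requests form one transaction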
Sprinkler gives the highest priority to the request with the highest overlap depth. In cases where there exist multiple requests having the same depth value, FARO over-commits the memory requests that have the highest connectivity among them. This dynamic priority control employed by FARO alleviates the potential resource contention that could be caused by over-committed memory requests. Figure 9 explains how connectivity and overlap depth are used by FARO. In this example, the five I/Os have thirteen memory requests in total. Especially, the length (the number of memory requests) of I/O #3 exceeds the total number of chips, which introduces a connectivity value of more than two in several chips (e.g., C1, C2, and C3). There exist twelve memory requests having an overlap depth of 4, targeting chips C1, C2, and C3. Since these memory requests head toward two different dies and planes, they can be coalesced as a single memory transaction (indicated by 'b', 'c' and 'd' in the figure), and FARO commits them first, instead of the memory requests belonging to I/O #5 or I/O #1. In contrast, the memory requests heading to C0 can be captured by two transactions, 'a' and 'g'. Since 'a' is composed of memory requests related to the same I/O #3, FARO over-commits the memory requests associated with 'a', instead of requests #1 and #5, in an attempt to improve the latency of I/O #3 as the second option.

Figure 9: FLP-Aware Request Over-commitment (FARO). FARO increases transactional-locality by controlling the over-commitment priority dynamically. Note that all the chips (on the chip layout) are involved in I/O executions, and many of them have transactions consisting of multiple memory requests coming from different I/O requests across multiple queue entries.

4.3 Handling Live Data Migration

One problem with the physical address schedulers implemented in NVMHC is the handling of live data migrations, which is the process of reading valid data pages, writing them into new locations, and updating the mapping information regarding physical addresses. It is critical to note that live data migrations can change physical addresses during an I/O service. Depending on the underlying flash firmware strategy, the reason why live data migration is invoked at runtime can be different, but usually the migration is performed because of 1) garbage collection, 2) wear-leveling, or 3) bad block replacement, and the corresponding migration activities are similar in each case [18, 12]. To address the migration problem, we introduce a readdressing callback strategy, which is a conventional callback routine that updates the physical data layout information in the upper-level I/O scheduler. Since Sprinkler exploits the internal resource layout rather than the physical address of memory requests, our readdressing callback is invoked only if the live data have been migrated between different flash internal resources.
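A minimal sketch of such a readdressing callback appears below; the layout table shape and the tuple-based resource naming are illustrative assumptions:

    # Sketch of the readdressing callback: the FTL invokes it after a live
    # data migration, and the scheduler's layout bins are patched only when
    # the page actually moved between internal resources.
    phy_layout = {("chip0", "die0", "plane0"): ["tag7"],
                  ("chip1", "die0", "plane0"): []}

    def readdressing_callback(tag, old_loc, new_loc):
        """Called by the FTL after garbage collection, wear-leveling, or
        bad block replacement moves a valid page."""
        if old_loc == new_loc:
            return                  # same resource: layout info still valid
        phy_layout[old_loc].remove(tag)
        phy_layout.setdefault(new_loc, []).append(tag)

    readdressing_callback("tag7", ("chip0", "die0", "plane0"),
                                  ("chip1", "die0", "plane0"))
    print(phy_layout)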
4.4 Implementation Details

Algorithm. Algorithm 1 describes how our scheduler sprinkles memory requests across multiple internal resources. Sprinkler first identifies the physical layout of the memory requests in terms of the indexes of chip, die and plane. It keeps queuing the incoming I/O requests and identifying their physical layouts until there is no parallel I/O request left or the queue has no more room. In the main scheduling part, Sprinkler selectively initiates the data movements of memory requests targeting the currently-visited flash chip. In this step, Sprinkler visits flash chips following the traversal order explained in Section 4.1. It then checks which memory request has a high overlap depth and a high connectivity in order to identify the memory requests to over-commit. During this scheduling step, Sprinkler keeps watching whether there are new I/O request arrivals and whether the queue is full, to maximize the fine-grain out-of-order execution potential as well as to relax the parallelism dependency.

    /* Resource-Driven I/O Scheduling (RIOS) */
    while queue.full() != true || host.next_request(&tag) != null do
        /* queuing incoming I/O requests based on physical layout */
        tuple chip(chip_idx, die_idx, plane_idx) := core.preprocess(tag)
        phy_layout[chip_idx].insert(chip, tag)
        /* try to enqueue tags as many as possible */
        while host.next_request(&tag) = null || queue.full() = true do
            for i := 0 to num_chip/num_channel do      /* striping */
                for j := 0 to num_channel do           /* pipelining */
                    /* computation can be overlapped with flash I/O time */
                    chip := phy_layout.get_chip(i * num_channel + j)
                    /* FLP-Aware Request Over-commitment (FARO) */
                    tag := get_highest_overlap_depth(chip)
                    if tag = null then
                        /* there are tags which have the same overlap depth */
                        tag := get_tag_considering_connectivity(chip)
                    mem_vector := build_memory_request(tag)
                    commit_mem_requests(mem_vector)

Algorithm 1: Sprinkle(host, queue, core) in NVMHC. Note that the necessary data movement initiations occur selectively, and the composed memory requests are scheduled based on the physical layout information.

Complexity. The memory space requirements to record the required physical layout information (resource identification) are negligible. Specifically, four bytes are sufficient to store each piece of information, including the chip, die and plane indices, and we observed that a total of 256KB is sufficient to cover all the workloads tested in Section 5. In addition, the computation in each iteration can be overlapped with the data movement between the host and the SSD. Note that any other scheduler implemented in NVMHC would have similar computation and space complexities to compose/commit requests.

The Order of Output Data. NVMHC maintains an eight-byte memory request bitmap, which covers a 128KB~1MB block size per queue entry. Each bit of this bitmap indicates an issued memory request. When the flash controller makes an upcall to inform of a flash transaction completion, it clears the bits corresponding to the memory requests associated with the transaction. The DMA engine brings back the data from the beginning of the I/O request offset to the host using multiple payloads in an in-order fashion. Note that this I/O completion process and the bitmaps are required regardless of the type of the scheduling strategy implemented in NVMHC.

Hazard Control. Sprinkler only schedules I/O requests using tags, which means that the actual data for writes sit in the host-side buffer during the scheduling activity. Therefore, in the read-after-write and write-after-write cases, a host-side logical block adapter (or the operating system) can simply return or overwrite the data using its own buffer. Because of this, in general, the standard storage-interface protocols do not specify any rules on data integrity and consistency management for request reordering. Nevertheless, we manage the write-after-read case by serving the read memory requests first (only if the target contents of the reads are the same as the contents of the writes at the plane level) when FARO considers overlap depth and connectivity, and we serve I/Os without any reordering if there exists a force-unit-access command request, used by an OS to manage the integrity and consistency of the storage system.
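The completion bitmap described under "The Order of Output Data" can be sketched as follows; the class shape and method names are our assumptions (the hardware keeps an eight-byte bitmap per queue entry):

    # Sketch of per-queue-entry completion tracking: one bit per issued
    # memory request; a completion upcall clears the bits of its
    # transaction, and the DMA engine returns data in order once a prefix
    # of the request is done.
    class QueueEntry:
        def __init__(self, num_mem_requests):
            self.num = num_mem_requests
            # bit i set = request i issued but not yet completed
            self.issued = (1 << num_mem_requests) - 1

        def complete_transaction(self, mem_request_ids):
            # upcall from the flash controller
            for i in mem_request_ids:
                self.issued &= ~(1 << i)

        def in_order_ready(self):
            """Leading memory requests whose data can be DMA'd back."""
            n = 0
            while n < self.num and (self.issued >> n) % 2 == 0:
                n += 1
            return n

    entry = QueueEntry(5)
    entry.complete_transaction([0, 1, 3])  # one transaction covered 0, 1, 3
    print(entry.in_order_ready())          # 2: requests 0-1 return in order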
5 Evaluation

5.1 Configuration

SSD and NAND flash. We evaluated Sprinkler using a cycle-accurate SSD simulator with a hardware-validated multiple flash simulation model [19] that allows us to vary the number of flash chips, ranging from 64 flash chips (8 channels) to 1024 flash chips (32 channels). Each channel works on ONFI 2.x, which is the most popular flash interface specification in the SSD industry. Note that, considering the slow write speed of the flash memory array and the power/cost of devices, many vendors such as Micron, Fusion-IO and OCZ employ ONFI 2.x, instead of a 400MHz flash interface, even for high-performance PCIe SSDs (e.g., Micron P320H, Fusion-IO ioFX series, and OCZ RevoDrive series). For the flash microarchitecture configuration, each flash chip employs two dies and four planes, and each die consists of 8,192 blocks. 128 pages are put together as a single block, and the unit size of each page is 2KB. Our simulation framework can capture the intrinsic write (programming) latency variation [10, 8, 19, 9] of Multi Level Cell (MLC) NAND flash memory, varying from 200 us (fast page) to 2200 us (slow page) [25], based on the address of the page being accessed. In addition, the read latency is configured to be 20 us. Note that these values represent a state-of-the-art SSD and NAND flash package.

Table 1: Classifying our traces in terms of data transfer sizes, the number of I/O instructions, and the randomness of the issued reads and writes. The last column gives the transactional locality, obtained by statically analyzing the traces based on the chip layout we simulated, in cases where there is no limit on the device-level queue size.

    Workload | Transfer size (MB) | I/O instructions | Randomness (%) | Transactional
             |  Read    | Write   |  Read  | Write   | Read  | Write  | locality
    cfs0     |    3607  |   1692  |    406 |    135  | 92.79 | 86.59  | Low
    cfs1     |    2955  |   1773  |    385 |    130  | 94.01 | 86.12  | Medium
    cfs2     |    2904  |   1845  |    384 |    135  | 94.28 | 85.95  | Low
    cfs3     |    3143  |   1649  |    387 |    132  | 93.97 | 86.7   | High
    cfs4     |    3600  |   1660  |    401 |    132  | 92.6  | 86.59  | High
    hm0      |   10445  |  21471  |   1417 |   2575  | 94.2  | 92.84  | Medium
    hm1      |    8670  |    567  |    580 |     28  | 98.29 | 98.59  | Medium
    msnfs0   |    1971  |  30519  |     41 |   1467  | 99.79 | 87.23  | Low
    msnfs1   |   17661  |  17722  |    121 |   2100  | 88.8  | 66.71  | Low
    msnfs2   |   92772  |  24835  |   9624 |   3003  | 98.13 | 99.97  | High
    msnfs3   |       5  |   2387  |      1 |      5  | 22.52 | 64.79  | High
    proj0    |    9407  | 151274  |    527 |   3697  | 92.05 | 79.31  | Medium
    proj1    |  786810  |   2496  |   2496 |  21142  | 82.34 | 96.88  | Medium
    proj2    | 1065308  | 176879  |  25641 |   3624  | 78.74 | 93.93  | Low
    proj3    |   19123  |   2754  |   2128 |    116  | 75.01 | 88.37  | Medium
    proj4    |  150604  |   1058  |   6369 |     95  | 84.39 | 95.52  | Medium
Schedulers. We evaluate five different I/O schedulers:
• VAS – virtual address scheduler, using FIFO.
• PAS – physical address scheduler, using extra flash queues.
• SPK1 – Sprinkler, using only FARO.
• SPK2 – Sprinkler, using only RIOS.
• SPK3 – Sprinkler, using both FARO and RIOS.
For all the schedulers tested, we introduce a standard command queue, which allows storage devices to execute I/Os in an out-of-order fashion [34]. In this evaluation, PAS is implemented to support system-level, coarse-grain, out-of-order execution [27], which means that it can skip the busy flash chips and commit the other memory requests to idle chips.

Firmware. We implemented a pure page-level address mapping FTL and a garbage collection strategy similar to the one employed in [1]. Flash controllers manage flash transactions following the Open NAND Flash Interface specification [29].

Traces. We employ data center workloads [28] from public trace repositories [33]. Our traces consist of a corporate mail file server (cfs), a hardware monitor (hm), an MSN file storage server (msnfs), and a project directory service (proj). The important characteristics of our traces are given in Table 1.

5.2 System Performance

Figure 10: VAS, PAS, SPK1, SPK2, and SPK3 performance comparison. (a) I/O bandwidth. (b) IOPS (I/O operations per second). (c) Average I/O latency. (d) Queue stall time.

Bandwidth. Figure 10a plots the I/O bandwidth for the five different I/O schedulers we implemented in NVMHC. One can see from these results that Sprinkler generates better throughput values than VAS and PAS. Specifically, SPK3 boosts the I/O bandwidth by at least 2.2 times, compared to VAS. For all the workloads we tested, the throughput improvement brought by SPK3 over VAS ranges between 42 MB/sec and 300 MB/sec. Further, as compared to PAS, SPK3 provides 1.8 times better throughput. In this case, the improvement ranges between 38 MB/sec and 200 MB/sec. Our performance improvements are more pronounced when the I/O instruction addresses exhibit higher (potential) transactional-locality (e.g., cfs3, cfs4, msnfs2~3). We also see that SPK3 improves throughput regardless of how read- or write-intensive a workload is. As compared to PAS, SPK1 occasionally hurts performance. This is because, even though FARO is capable of increasing FLP, it cannot always secure enough memory requests to achieve high FLP without RIOS's help. In contrast, SPK2 always outperforms VAS and PAS, and exhibits better performance than SPK1 in most cases. This is because RIOS can build and commit memory requests irrespective of the host-level information by employing a fine-grain, out-of-order execution strategy.

IOPS. Figure 10b gives the IOPS values achieved by the different I/O schedulers. For the cfs1, msnfs0~1 and proj0~1 workloads, whose access patterns include mostly small random requests, the achieved bandwidth improvement is not dramatic, compared to other workloads. In these workloads, SPK2 and SPK3 improve over VAS and PAS by about 2x. In contrast, proj2 consists of large I/O requests, which have low transactional-locality. As a result, most schedulers provide low IOPS. Unlike the performance observed under other workloads, SPK1 outperforms SPK2 in this case. Even though the address accesses across requests exhibit low transactional-locality, the memory requests that belong to a given I/O request have sequentiality to some extent. As a result, in proj2, SPK1 can improve FLP without any help from RIOS. Further, by combining RIOS and FARO, SPK3 shows better performance than any other scheduler in all workloads.
Even with proj2, SPK3 generates about 2x better IOPS than PAS.

Latency and Queue Stall Time. Figures 10c and 10d plot the average SSD device-level latency and the device-level queue stall time, respectively. In our evaluation, the device-level latency means the response time per I/O request, not per memory request or transaction, and the queue stall time is normalized to that of VAS. SPK3 successfully reduces the device-level latency from 59.1% to 92.3%, as compared to VAS, for all the workloads tested. SPK1 provides worse latency than PAS under certain workloads such as cfs3, proj0, and proj1, since FARO by itself cannot secure enough memory requests and still has the parallelism dependency problem with such workloads. One can also observe that all SPK schedulers significantly reduce the queue stall time. In particular, the queue stall time experienced with SPK3 is about 86% less than that of VAS. The shorter queue stall time at the device level increases the opportunities for the host-level modules to parallelize I/O accesses and reduce the number of blocking I/O requests.
5.3 Device-Level Idleness

Figure 11: Idleness analysis. (a) Inter-chip idleness. (b) Intra-chip idleness.
Inter-chip Idleness. Figure 11a shows the inter-chip idleness values under different scheduling strategies. Even though PAS and SPK1 take advantage of the physical address space exposure, they cannot reduce inter-chip idleness significantly. The main reason behind this is the fact that they cannot fully parallelize the data accesses of incoming I/O requests due to parallelism dependency. In contrast, SPK2 activates as many flash chips as possible by relaxing parallelism dependency, which in turn significantly reduces the inter-chip idleness. Specifically, compared to VAS, SPK3 improves inter-chip idleness by about 46.1%, on average.

Intra-chip Idleness. Intra-chip idleness (Figure 11b) paints a different picture, compared to the inter-chip idleness. Even though SPK1 suffers from parallelism dependency, it is in a better position to compose high-FLP transactions than SPK2. Thus, SPK1 reduces intra-chip idleness much more than SPK2. SPK2 is able to reduce intra-chip idleness slightly, although it does not build transactions with high FLP in mind. This is because the relaxed parallelism dependency allows flash controllers to take more advantage of die interleaving than PAS or VAS. SPK3, employing both FARO and RIOS, performs worse than SPK1 because it introduces, in some cases, more system-level contention across the multiple memory requests composed by RIOS. It should be noted, however, that when both intra-chip and inter-chip idleness are considered, SPK3 outperforms the remaining schedulers tested.

5.4 Time Series Analysis

Figure 12: Time series analysis for PAS and Sprinkler (SPK3). SPK3 generates 80% and 64% shorter device-level latency than VAS and PAS, respectively. (a) VAS vs. PAS. (b) VAS vs. SPK3.

We now compare the device-level latencies experienced by PAS and SPK3 against VAS, using three thousand I/O instructions from the beginning of msnfs1 (one of our traces). Figure 12 clearly shows the superiority of SPK3 over PAS. As shown in Figure 12a, PAS successfully reduces the latency of I/O requests and provides much more stable performance, compared to VAS. This is because PAS schedules memory requests using physical addresses, and the extra queues employed for each flash chip allow coarse-grain out-of-order execution. However, the latency improvement brought by PAS is limited because 1) it does not exploit FLP and 2) the memory request composition and commitment when using PAS are not free from the parallelism dependency problem. In contrast, SPK3 provides much shorter latency (which is even better than PAS) since the over-committed memory requests are composed using fewer flash transactions, and these memory requests are served in parallel by multiple flash chips at the same time, thanks to the relaxed parallelism dependency.

5.5 Execution Time Breakdown

Figure 13 gives a breakdown of the total execution time into bus activate, bus contention, cell activate, and idle time components. As shown in Figure 13a, PAS wastes large amounts of time, clearly indicating that it suffers from the low resource utilization problem, as it does not take into account the flash microarchitecture characteristics. In contrast, SPK3 increases the memory cell active time by maximizing FLP as well as relaxing the parallelism dependency, as shown in Figure 13b.

11 request (the number of memory requests associated an I/O)

Bus operation Bus contention Bus operation Bus contention

Memory operation System idle Memory operation System idle spans all chips, but not all the flash internals (512KB, 1MB, 100 100 2MB in 64 chips, 256 chips and 128 chips, respectively),the 80 80 utilization of VAS drops to some extent because it simply

60 60 strips an I/O request across multiple resources in a round-

40 40 robin fashion without taking into account the underlying

20 20 flash microarchitecture. However, as the data size keeps

0 0 Execution Breakdown (%) Breakdown Execution Execution Breakdown (%) Breakdown Execution

cfs0cfs1cfs2cfs3cfs4 cfs0cfs1cfs2cfs3cfs4 hm0hm1 hm0hm1 continuing to increase, the chip utilization increases again. proj0proj1proj2proj3proj4 proj0proj1proj2proj3proj4

mds0mds1 msnfs0msnfs1msnfs2msnfs3 msnfs0msnfs1msnfs2msnfs3 This is because a larger I/O size covers more flash internals, (a) PAS (b) SPK3 and this hleps to improve FLP. Figure 13: Execution time breakdown. SPK3 eliminates Unlike VAS, our SPKs exhibit different utilization char- system level idleness by 40.5% (50.7%), compared to PAS (VAS). acteristics. As shown in the figure, SPK1 can improve the chip utilization by 16% only if incoming I/O request sizes are large. This is because in this case it can secure enough as shown in Figure 13b. Also, the bus contention time in- memory requests that belong to a single I/O request to com- creases in SPK3 as a result of increasing the amount of flash pose a high-FLP transaction. However, it does not work memory cell activities in workloads whose fraction of reads well when request sizes are small, due to strong parallelism is larger than that of writes, such as cfs3, msnfs2, and proj2. dependency. In contrast, SPK2 shows better chip utiliza- However, we still have spare time, which can be utilized to tion only when the data sizes of the incoming requests are execute I/O instructions, in all the workloads tested. small. Even tough SPK2 can capitalize on internal resource concurrency by relaxing parallelism dependency, it suffers 5.6 Parallelism Analysis from the system-level resource contention. Lastly, SPK3 Figure 14 decomposes parallelism by four different lev- shows excellent and sustainable chip utilization. In this els; NON-PAL captures the I/O requests that are served by case, FARO consumes as many memory requests as possi- only SLP concurrency oriented strategies, such as chan- ble which are generated by RIOS. Specifically, overall chip nel stripping and pipelining; PAL1 represents the impact of utilizations with 64, 256, and 1024 chips are 71.2%, 61.5%, plane sharing when combined with the SLP concurrency and 44.9%, respectively, while the corresponding utilization strategies; PAL2 indicates the impact of die interleaving values for VAS are 37%, 21.2%, and 13.9%, in that order. combined with the SLP concurrency schemes; and finally, PAL3 captures the impact when die interleaving and plane 5.8 Memory Transaction Reduction Rate sharing are combined with SLP optimization strategies so As shown in Figure 16, the over-commitment strategy that I/O requests are fully served with the highest degree of employed by FARO has a great impact on reducing the parallelism (4x higher performance than NON-PAL). While number of transactions at a chip level. In contrast, the trans- VAS serves I/O requests with only PAL1, contributing to action reduction success of SPK2 is not very high, and gets 1% ˜3% of the total execution, PAS improves parallelism by even worse as the number of flash chips increases. This is exploiting the other levels of parallelism, as shown in Figure because SPK2 parallelizes data access among system-level 14a. However, there is no PAL3 in PAS, and it still expe- resources, which leads to low transactional-locality. There- riences low FLP due to the parallelism dependency prob- fore, the flash controller could not secure enough memory lem. SPK1 provides the best way of achieving high FLP, requests to coalesce them into a single transaction. SPK3 but it has a lower system-level chip utilization (see Section generates better data reduction rate (50.2% on average) than 5.7). 
The degree of parallelism in SPK2 is better than that SPK2 and enjoys better SLP than SPK1 by employing both of PAS; however, like PAS, SPK2 does not achieve high FARO and RIOS. levels of FLP. Lastly, the degree of parallelism obtained by 5.9 Migration and Re-addressing Call- SPK3 is lower than that of SPK1, but it enjoys the benefits back Impacts of both SPK1 and SPK2 and makes parallelism more bal- anced between SLP and FLP, thereby achieving high levels Since garbage collection (GC) is one of the most time of parallelism as well as high chip utilization. consuming tasks, and is one of the most frequently- occurring activities during live data migration, we next 5.7 Resource Utilization Analysis stressed VAS, PAS, and SPK3 by artificially introducing a Figure 15 plots the chip utilization results when varying very high number of GCs. In this experiment, VAS and the transfer sizes from 4KB to 4MB, and varying the num- PAS have no readdressing callback. We prepared pristine- ber of flash chips from 64 to 1024. In general, the chip state SSDs for non-GC evaluation and fragmented SSDs for utilization of VAS keeps increasing as the transfer size in- GC evaluation, which were filled by 95% with 1 MB ran- creases. However, in some cases where the length of the dom writes (just before the GC begins). As shown in Figure
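To illustrate this mechanism, the following is a minimal sketch (Python; the FTL hook, class, and field names are hypothetical illustrations, not Sprinkler's actual interface) of how a readdressing callback could patch stale physical addresses in the pending queue after a GC migration, so the scheduler can re-spread and re-coalesce the surviving memory requests.

```python
class StubFTL:
    """Minimal stand-in for an FTL that reports GC page migrations.
    The on_migration/run_gc hook is an assumed, illustrative API."""
    def __init__(self):
        self._callbacks = []

    def on_migration(self, callback):
        self._callbacks.append(callback)

    def run_gc(self, remap):
        # remap: {old_physical_page: new_physical_page} moved by this GC
        for callback in self._callbacks:
            callback(remap)


class ReaddressingScheduler:
    """Sketch: keep queued page-level memory requests valid across GCs."""
    def __init__(self, ftl):
        self.pending = []                 # (request_id, physical_page)
        ftl.on_migration(self.readdress)  # register the readdressing callback

    def submit(self, request_id, physical_page):
        self.pending.append((request_id, physical_page))

    def readdress(self, remap):
        # Patch stale physical addresses so stalled requests can be
        # re-spread over chips and coalesced into fresh transactions.
        self.pending = [(rid, remap.get(addr, addr))
                        for rid, addr in self.pending]


ftl = StubFTL()
sched = ReaddressingScheduler(ftl)
sched.submit("io-0", 0x100)
sched.submit("io-1", 0x101)
ftl.run_gc({0x100: 0x800})   # GC migrates page 0x100 to 0x800
print(sched.pending)         # [('io-0', 2048), ('io-1', 257)]
```

Note that without such a hook (as in VAS and PAS), every migrated page leaves a stale physical address behind, which is why those schedulers cannot recover scheduling opportunities after a GC.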

Figure 14: Flash-level parallelism breakdown (FLP breakdown, %, per workload; legend: NON-PAL, PAL1, PAL2, PAL3; panels (a) PAS, (b) SPK1, (c) SPK2, (d) SPK3). While SPK1 (FARO only) maximizes FLP, SPK3 (FARO+RIOS) carefully balances parallelism between SLP and FLP.
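To make the four levels in Figure 14 concrete, here is a minimal sketch (illustrative Python; the transaction fields are our own hypothetical abstraction, not Sprinkler's data structures) that bins a flash transaction into NON-PAL/PAL1/PAL2/PAL3 based on whether it exploits plane sharing, die interleaving, or both on top of the SLP-level strategies.

```python
from dataclasses import dataclass

@dataclass
class FlashTransaction:
    """A chip-level flash transaction (fields are illustrative assumptions)."""
    dies_used: int     # dies activated via die interleaving
    planes_used: int   # planes activated via plane sharing

def flp_level(txn: FlashTransaction) -> str:
    """Classify a transaction into the categories plotted in Figure 14."""
    interleaved = txn.dies_used > 1      # die interleaving in use
    plane_shared = txn.planes_used > 1   # plane sharing in use
    if interleaved and plane_shared:
        return "PAL3"   # highest FLP: both, on top of the SLP strategies
    if interleaved:
        return "PAL2"   # die interleaving only
    if plane_shared:
        return "PAL1"   # plane sharing only
    return "NON-PAL"    # served purely by SLP (channel striping/pipelining)

# A transaction spanning 2 dies x 2 planes reaches the PAL3 category.
print(flp_level(FlashTransaction(dies_used=2, planes_used=2)))  # PAL3
```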

Figure 15: Chip utilization analysis (flash-level utilization, %, vs. transfer size from 4KB to 4MB; panels (a) 64, (b) 256, and (c) 1024 flash chips). SPK3 outperforms the other schedulers, irrespective of data transfer size and SSD internal configuration.
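As a back-of-the-envelope check on the VAS dips discussed in Section 5.7, the sketch below uses a hypothetical chip geometry (4KB pages, 2 dies x 2 planes per chip, so one chip can absorb 16KB at its highest FLP; the simulated configuration may differ) to show why a round-robin-striped request can span every chip yet still leave flash internals idle.

```python
def vas_internal_coverage(request_kb, chips, page_kb=4, dies=2, planes=2):
    """Average fraction of each chip's flash internals that a round-robin
    striped request exercises (assumes the stripe wraps all chips)."""
    full_flp_kb = page_kb * dies * planes   # 16KB keeps one chip fully busy
    per_chip_kb = request_kb / chips        # stripe share landing on each chip
    return min(per_chip_kb / full_flp_kb, 1.0)

for size_kb in (512, 1024, 2048, 4096):
    print(f"{size_kb}KB over 64 chips -> "
          f"{vas_internal_coverage(size_kb, chips=64):.0%} of internals")
# 512KB -> 50%: every chip is touched, but only half of each chip's
# dies/planes can be activated, matching the utilization dip; from 1MB
# upward the stripe covers all the internals again (100%).
```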

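Before turning to Figure 16, the transaction-reduction effect of FARO's over-commitment (Section 5.8) can be pictured as grouping the memory requests over-committed to one chip by the (die, plane) pair they target, and letting a single die-interleaved, multi-plane transaction carry one page per pair. This is a simplified model under assumed structures, not the controller's actual logic.

```python
from collections import defaultdict

def coalesce_chip_queue(chip_queue):
    """chip_queue: list of (die, plane, page) memory requests over-committed
    to a single flash chip. In this simplified model, one transaction may
    carry at most one outstanding page per (die, plane) pair, so the number
    of transactions needed equals the depth of the most loaded pair."""
    groups = defaultdict(list)
    for die, plane, page in chip_queue:
        groups[(die, plane)].append(page)
    n_transactions = max((len(p) for p in groups.values()), default=0)
    return n_transactions

queue = [(0, 0, 10), (0, 1, 11), (1, 0, 12), (1, 1, 13)]
print(coalesce_chip_queue(queue))   # 4 memory requests -> 1 transaction
```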
Figure 16: Flash transaction reduction rates (number of transactions, x10K, vs. transfer size; panels (a) 64 and (b) 1024 flash chips). SPK3 reduces the number of flash transactions by about 50.2% compared to VAS.

Figure 17: Garbage collection and readdressing impact (bandwidth, KB/sec, vs. transfer size, with and without GC, for VAS, PAS, and SPK3; panels (a) 64 and (b) 256 flash chips). SPK3 generates about 2x better performance than VAS (and PAS as well) by efficiently updating the physical layout via the readdressing callback.

6 Related Work and Discussion

DRAM controller. Balancing timing constraints, fairness, and different dimensions of physical parallelism has long been a problem addressed by DRAM-based memory controllers [24, 23, 32, 26, 6, 11, 37, 15]. SSDs, however, have sharp differences in device characteristics, form factor (which leads to diverse architecture configurations), asymmetric latencies, and diverse commands and interface protocols. Consequently, these differences lead to separate research directions and needs for request scheduling and memory controller design.

Parallelism. Prior studies recognize the need to exploit parallelism in flash-based SSDs [4, 5] and propose concurrency-centric methods [7, 1, 21]. For exploiting different levels of internal parallelism, different page allocation strategies [36, 13, 16] have also been investigated. Even though all these studies demonstrate significant performance improvements and better parallelism over more serial alternatives, concurrency methods and page allocation strategies are typically fixed at SSD design time, and thus they are not in a position to take advantage of parallelism through I/O request scheduling.

Request scheduling. In [30, 16], the authors attempt to uncover the specific resource contention that occurs in SSDs and the areas where parallelism is far below optimal, and present a dynamic request rescheduling scheme to improve performance. However, they fail to account for the fact that the addresses of the requests they "reschedule" are virtual.

Out-of-order execution. Ozone [27] and PAQ [20] dynamically schedule I/O requests based on physical addresses. Ozone does not wait for requests that create conflicts at the system level, and serves other requests in an out-of-order fashion. However, Ozone does not consider the flash microarchitecture and the corresponding FLP. Further, it requires a hardware-assisted preprocessor, a postprocessor, extra queues, and a reservation station to exploit physical addresses. In contrast, PAQ is a software-driven dynamic scheduler that avoids resource conflicts and improves parallelism. PAQ executes multiple transactions in an out-of-order fashion by being aware of SLP and plane-level parallelism. Note that, even though PAQ considers resource conflicts, it is limited to handling only read requests.

Native command queue (NCQ). Recall that, unlike DRAM memory requests, an SSD I/O request consists of multiple page-level memory requests whose size varies based on host-side application characteristics. Consequently, even though a device-level queue (such as a native command queue or a tagged command queue) allows VAS/PAS to reorder incoming I/O requests in an out-of-order fashion, it is difficult to construct a transaction with high FLP by coalescing memory requests that are scattered across multiple I/O requests.

At a high level, all the studies mentioned in this section ignore the internal resource utilization and the architectural challenges exhibited by state-of-the-art many-chip SSDs.

7 Conclusions

In this paper, we propose Sprinkler, a novel device-level SSD controller, which targets maximizing resource utilization and achieving high performance. Specifically, Sprinkler relaxes parallelism dependency by scheduling I/O requests based on the internal resource layout, instead of the order imposed by the device-level queue. Our extensive experimental evaluation using a cycle-accurate SSD simulation model shows that a many-chip SSD equipped with Sprinkler provides at least 56.6% shorter latency and 1.8 ∼ 2.2 times better throughput than the modern SSD controllers.

References

[1] N. Agrawal, V. Prabhakaran, T. Wobber, J. D. Davis, M. Manasse, and R. Panigrahy. Design tradeoffs for SSD performance. In USENIX ATC, 2008.
[2] A. M. Caulfield, J. Coburn, T. Mollov, A. De, A. Akel, J. He, A. Jagatheesan, R. K. Gupta, A. Snavely, and S. Swanson. Understanding the impact of emerging non-volatile memories on high-performance, IO-intensive computing. In SC, 2010.
[3] A. M. Caulfield, A. De, J. Coburn, T. I. Mollov, R. K. Gupta, and S. Swanson. Moneta: A high-performance storage array architecture for next-generation, non-volatile memories. In MICRO, 2010.
[4] A. M. Caulfield, L. M. Grupp, and S. Swanson. Gordon: Using flash memory to build fast, power-efficient clusters for data-intensive applications. In ASPLOS, 2009.
[5] F. Chen, R. Lee, and X. Zhang. Essential roles of exploiting internal parallelism of flash memory based solid state drives in high-speed data processing. In HPCA, 2011.
[6] J. W. Davidson and S. Jinturkar. Memory access coalescing: A technique for eliminating redundant memory accesses. SIGPLAN Not., 1994.
[7] C. Dirik and B. Jacob. The performance of PC solid-state disks (SSDs) as a function of bandwidth, concurrency, device architecture, and system organization. In ISCA, 2009.
[8] L. M. Grupp, A. M. Caulfield, J. Coburn, S. Swanson, E. Yaakobi, P. H. Siegel, and J. K. Wolf. Characterizing flash memory: Anomalies, observations, and applications. In MICRO, 2009.
[9] L. M. Grupp, J. D. Davis, and S. Swanson. The bleak future of NAND flash memory. In FAST, 2012.
[10] L. M. Grupp, J. D. Davis, and S. Swanson. The harey tortoise: Managing heterogeneous write performance in SSDs. In USENIX ATC, 2013.
[11] S. I. Hong, S. A. McKee, M. H. Salinas, R. H. Klenke, J. H. Aylor, and W. A. Wulf. Access order and effective bandwidth for streams on a Direct Rambus memory. In HPCA, 1999.
[12] X.-Y. Hu, E. Eleftheriou, R. Haas, I. Iliadis, and R. Pletka. Write amplification analysis in flash-based solid state drives. In SYSTOR, 2009.
[13] Y. Hu, H. Jiang, D. Feng, L. Tian, H. Luo, and S. Zhang. Performance impact and interplay of SSD parallelism through advanced commands, allocation strategy and data granularity. In ICS, 2011.
[14] A. Huffman. NVM Express 1.0. 2011.
[15] I. Hur and C. Lin. Adaptive history-based memory schedulers for modern processors. 2006.
[16] M. Jung and M. Kandemir. An evaluation of different page allocation strategies on high-speed SSDs. In HotStorage, 2012.
[17] M. Jung and M. Kandemir. Revisiting widely held SSD expectations and rethinking system-level implications. In SIGMETRICS, 2013.
[18] M. Jung, R. Prabhakar, and M. T. Kandemir. Taking garbage collection overheads off the critical path in SSDs. In Middleware, 2012.
[19] M. Jung, E. Wilson, D. Donofrio, J. Shalf, and M. Kandemir. NANDFlashSim: Intrinsic latency variation aware NAND flash memory system modeling and simulation at microarchitecture level. In MSST, 2012.

[20] M. Jung, E. H. Wilson, III, and M. Kandemir. Physically addressed queueing (PAQ): Improving parallelism in solid state disks. In ISCA, 2012.
[21] J.-U. Kang, J.-S. Kim, C. Park, H. Park, and J. Lee. A multi-channel architecture for high performance NAND flash-based storage system. JSA, 2007.
[22] J. Kim, Y. Oh, E. Kim, J. Choi, D. Lee, and S. H. Noh. Disk schedulers for solid state drives. In EMSOFT, 2009.
[23] Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter. ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers. In HPCA, 2010.
[24] Y. Kim, M. Papamichael, O. Mutlu, and M. Harchol-Balter. Thread cluster memory scheduling: Exploiting differences in memory access behavior. In MICRO, 2010.
[25] Micron, Inc. NAND flash MLC datasheet, MT29F8G08MAAWC, MT29F16G08QASWC. http://www.micron.com/.
[26] O. Mutlu and T. Moscibroda. Stall-time fair memory access scheduling for chip multiprocessors. In MICRO, 2007.
[27] E. H. Nam, B. Kim, H. Eom, and S.-L. Min. Ozone (O3): An out-of-order flash memory controller architecture. TOC, 2011.
[28] D. Narayanan, E. Thereska, A. Donnelly, S. Elnikety, and A. Rowstron. Migrating server storage to SSDs: Analysis of tradeoffs. In EuroSys, 2009.
[29] ONFI Working Group. Open NAND flash interface. http://onfi.org/.
[30] C. Park, E. Seo, J.-Y. Shin, S. Maeng, and J. Lee. Exploiting internal parallelism of flash-based SSDs. IEEE CAL, 2010.
[31] PCI-SIG. PCI Express base 3.0 specification. 2012.
[32] N. Rafique, W.-T. Lim, and M. Thottethodi. Effective management of DRAM bandwidth in multicore processors. In PACT, 2007.
[33] SNIA IOTTA Repository. http://iotta.snia.org/.
[34] SATA-IO. Serial ATA Revision 3.1. 2011.
[35] Shawn Kang. Native PCIe SSD controllers. 2012.
[36] J.-Y. Shin, Z.-L. Xia, N.-Y. Xu, R. Gao, X.-F. Cai, S. Maeng, and F.-H. Hsu. FTL design exploration in reconfigurable high-performance SSD for server applications. In ICS, 2009.
[37] W. K. Zuravleff and T. Robinson. Controller for a synchronous DRAM that maximizes throughput by allowing memory requests and commands to be issued out of order. U.S. Patent No. 5,630,096, 1997.
