arXiv:1705.04627v1 [cs.AR] 12 May 2017

Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disks

Myoungsoo Jung¹ and Mahmut T. Kandemir²
¹Computer Architecture and Memory Systems Laboratory, Yonsei University
²Department of EECS, The Pennsylvania State University

[This paper is published at the 20th IEEE International Symposium on High Performance Computer Architecture and is presented to ensure timely dissemination of scholarly and technical work. The work was mostly done when the first author was at the University of Texas at Dallas and Pennsylvania State University. This material includes new data (regarding garbage collection).]

Abstract

Resource utilization is one of the emerging problems in many-chip SSDs. In this paper, we propose Sprinkler, a novel device-level SSD controller, which targets maximizing resource utilization and achieving high performance without additional NAND flash chips. Specifically, Sprinkler relaxes parallelism dependency by scheduling I/O requests based on the internal resource layout rather than the order imposed by the device-level queue. In addition, Sprinkler improves flash-level parallelism and reduces the number of transactions (i.e., it improves transactional-locality) by over-committing flash memory requests to specific internal resources. Our extensive experimental evaluation using a cycle-accurate, large-scale SSD simulation framework shows that a many-chip SSD equipped with our Sprinkler provides at least 56.6% shorter latency and 1.8~2.2 times better throughput than state-of-the-art SSD controllers. Further, it improves overall resource utilization under different I/O request patterns by 68.8% and provides, on average, 80.2% more flash-level parallelism by reducing half of the flash memory requests at runtime.

1 Introduction

Flash-based memory cards and embedded SSDs have become the dominant storage technology in mobile devices, and large-scale SSDs are being rapidly deployed in laptops, workstations, and high performance computing enterprises. Further, SSDs have begun to employ high speed interfaces like PCI Express – 16GB/sec in data rate – in an attempt to avoid conventional storage interface overheads and exploit the advantages brought by NAND flash [14, 31, 2, 3]. However, since individual NAND flash device bandwidths are still around 40∼400MB/sec, modern SSDs are undergoing severe architectural changes. Specifically, hundreds to thousands of NAND flash memory chips are being interconnected to form a single storage device, and multiple I/O channels and cores are being integrated with these flash chips. In parallel, new NAND flash technologies are being developed to extract the maximum amount of data access parallelism. A single NAND flash chip consists of multiple dies, each of which accommodates multiple planes. Thanks to this many-chip architecture, SSDs can easily scale up their performance by introducing more and more internal resources. As expected, the performance characteristics of modern SSDs vary based on how efficiently their hundreds or thousands of flash dies and planes are utilized.

In order to manage these resources well, different architectural approaches have been explored [4, 5]. From a system viewpoint, techniques such as ganging [1] and superblocking [7] split an I/O request into multiple flash memory requests and scatter them across different internal SSD resources. In comparison, various page allocation schemes [16, 36, 13] interleave incoming requests across multiple dies [1] and planes [7] at a flash level, and schemes that determine a physical data layout able to take advantage of internal parallelism, parallelizing data accesses across abundant internal SSD resources, have also been investigated [17]. All these prior proposals can potentially improve SSD performance based on incoming I/O requests.

However, we found that, unlike the common expectation, the performance of many-chip SSDs is unfortunately not significantly improved as the amount of internal resources increases, which means that employing more and more flash chips is not a promising solution. In fact, our cycle-level simulation data reveal that the read bandwidth of a state-of-the-art many-chip SSD stagnates (Figure 1a), internal resource utilization sharply goes down, and the memory-level idleness keeps growing (Figure 1b), as we increase the number of flash dies from thirty two to thousands.
[Figure 1: Many-chip SSD performance, chip utilization, and memory-level idleness sensitivity to varying numbers of flash chips and data transfer sizes. Each curve corresponds to a different data transfer size (2KB to 32768KB). (a) Performance stagnation: bandwidth (MB/sec, 16.4 to 1048.6) versus the number of flash dies. (b) Utilization and idleness: percentage of chip utilization and memory-level idleness (0 to 100%) versus the number of flash dies.]

[Figure 2: A many-chip SSD architecture. The host side (host driver, host-level queue, DDR DMA engine, controller core) connects over a PCI Express data bus to the storage side. The SSD internals (system-level resources) comprise the NVMHC, NAND flash controllers, and multiplexed fibres, each shared by multiple NAND flash chips. The NAND flash chip internals (flash-level resources) comprise multiple dies (e.g., Die 0 through Die 3) behind a shared flash interface, each die containing planes with data/cache registers, blocks, a memory array, and peripherals.]
We believe that there are two reasons why a many-chip SSD architecture suffers from such utilization and idleness related problems as the number of chips increases. The first reason is the parallelism dependency exhibited by an I/O access pattern. Unlike main memory systems, SSD request sizes vary from a few bytes to KBs (or MBs), and request data offsets can vary significantly, which in turn introduces unbalanced chip utilization and low parallelism at a system level. For example, some requests do not span all internal resources, and some other requests keep pending due to chip-level conflicts, which in turn influences the number of resources that can be allocated at a given time. Therefore, the degree of internal parallelism that can be enjoyed depends highly on incoming I/O access patterns, referred to as parallelism dependency in this work. Another reason behind the utilization and idleness problems of emerging many-chip SSDs is low flash-level transactional-locality. The transactional-locality in this work corresponds to the ability of the references that form a flash transaction to exhibit high flash-level parallelism. Since multiple dies and planes are connected to shared voltage drivers and a single multiplexed interface, flash-level resources need to be managed carefully in order to exploit the maximum amount of parallelism. Specifically, highly parallel data accesses across internal flash resources can only be achieved when incoming requests span all of them (spatial) and all types of request transactions can be identified within a very short time period like a few cycles (temporal). As a result, low flash-level transactional-locality may introduce poor data access concurrency and extra idleness.

In this paper, we propose Sprinkler, a novel device-level SSD controller, which targets maximizing resource utilization and achieving high performance without additional NAND flash chips. Specifically, Sprinkler relaxes the parallelism dependency by scheduling I/O requests based on the internal resource layout rather than the order imposed by the device-level queue (which is the current norm [22, 16, 27, 16]). In addition, Sprinkler improves flash-level parallelism and reduces the number of transactions by over-committing flash memory requests to specific internal resources. To the best of our knowledge, this is the first paper that suggests exploiting internal resource layout and over-committing flash memory requests in order to maximize resource utilization and parallelism, thereby improving many-chip SSD performance. Our main contributions can be summarized as follows:

• Resource-driven I/O scheduling. Unlike conventional SSD controllers, which schedule flash memory requests based on the "order of incoming I/O requests", Sprinkler schedules them based on "available physical flash resources" in a fine-grain, out-of-order fashion (see the sketch after this list). This method, called Resource-driven I/O Scheduling (RIOS), "decouples" parallelism dependency from the I/O request access patterns, timings, and sizes, improving overall resource utilization by about 68.8% under a wide variety of realistic workloads.

• Flash memory request over-commitment. In order to increase flash-level transactional-locality, Sprinkler over-commits flash memory requests to devices. This Flash-level parallelism Aware Request Over-commitment (FARO) maximizes the opportunities for building a flash transaction at runtime, parallelizing multiple memory requests with respect to the constraints imposed by the underlying flash technology. FARO provides, on average, 80.2% more flash-level parallelism and reduces approximately 50% of flash transactions.

• Reducing idleness in many-chip SSDs. We identify two different types of idleness in an SSD: inter-chip idleness and intra-chip idleness. Sprinkler reduces inter-chip and intra-chip idleness by 46.1% and 23.5%, compared to a conventional virtual address based scheduler [22, 16] and even a state-of-the-art physical address based scheduler [27, 20], respectively. As a result, Sprinkler (RIOS and FARO together) provides at least 1.8 times better system throughput and 78% lower I/O latency.
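To make the contrast behind RIOS concrete, the following minimal Python sketch (our illustration, not the authors' implementation; the request tuples, the chip_busy vector, and both selection loops are assumptions) contrasts committing memory requests strictly in device-level queue order with committing them driven by idle physical resources:

    from collections import deque

    def queue_order_commit(pending, chip_busy):
        # Conventional pass: visit requests strictly in device-level queue
        # order and stop at the first request whose target chip is busy
        # (head-of-line blocking).
        committed = []
        for req in list(pending):
            _, chip = req
            if chip_busy[chip]:
                break
            committed.append(req)
            chip_busy[chip] = True
            pending.remove(req)
        return committed

    def resource_driven_commit(pending, chip_busy):
        # RIOS-like pass (a sketch of the stated idea only): iterate over
        # idle chips and pull any pending request targeting them,
        # out of order.
        committed = []
        for chip, busy in enumerate(chip_busy):
            if busy:
                continue
            for req in list(pending):
                if req[1] == chip:
                    committed.append(req)
                    chip_busy[chip] = True
                    pending.remove(req)
                    break
        return committed

    # Five requests as (arrival order, target chip); chip 0 is already busy.
    pending = [(0, 0), (1, 0), (2, 2), (3, 3), (4, 1)]
    print(queue_order_commit(deque(pending), [True, False, False, False]))
    # []  -> the busy chip at the head of the queue stalls everything
    print(resource_driven_commit(deque(pending), [True, False, False, False]))
    # [(4, 1), (2, 2), (3, 3)] -> idle chips 1-3 are filled immediately

Under the same arrival pattern, the queue-order pass commits nothing while three chips sit idle, whereas the resource-driven pass keeps every idle chip busy; this is the utilization gap the contributions above quantify.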
2 Many-Chip Solid State Disks

Modern SSDs employ hundreds to thousands of flash chips by interconnecting multiple I/O channels and control units in an attempt to fill the performance gap between high speed interfaces (16GB/sec) and the flash medium (40∼400MB/sec). Figure 2 pictorially illustrates a recent many-chip SSD architecture by Marvell [35]. In this architecture, I/O services are dealt with by a multi-phase approach carried out with different internal resources, to take advantage of the overlap between computation and I/O and to exploit different levels of parallelism.

2.1 System-Level Technologies

Resources. As shown in Figure 2, there exist four main shared resources:
1) Non-Volatile Memory Host Controller (NVMHC) is a control logic responsible for communication between the host and SSD internal components. Since NVMHC is the only component that is aware of the host interface protocols related to queueing, handshaking, and data packet handling, device-level I/O schedulers are usually implemented in NVMHC.
2) Core is a microprocessor dedicated to translating virtual addresses, compatible with host file systems, to flash memory physical block addresses. To manage this address translation, an embedded software module, called the flash translation layer (FTL), is implemented in the core.
3) Flash Controller builds a flash transaction from multiple memory requests, and executes it.
4) Channel is a data path between the flash controller and the flash medium. Multiple flash chips are connected to a channel like an array, and share the same path for data transfer between the SSD internal buffer and the flash chip internal registers.

[Figure 3: The operation phases of an I/O service routine and different internal resources in a many-chip SSD pipeline. Phases: queuing (arrivals into the device-level queue), I/O request parsing, memory request composition, data movement initiation, address translation in the NVMHC core (FTL, memory request: virtual address to physical address), memory request commitment, data movement, flash transaction decision and handling transactions at the flash controllers, and the execution sequence (interleaving and sharing, FLP).]
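As a minimal preview of the service routine detailed next (our own sketch; the phase names follow Figure 3, and the linear flow is a simplification of the pipelined reality), a request can be thought of as moving through fixed phases owned by different resources:

    from enum import Enum, auto

    class Phase(Enum):
        # Phase names follow Figure 3.
        QUEUING = auto()               # NVMHC enqueues the host tag
        PARSING = auto()               # tag parsed per the host protocol
        COMPOSITION = auto()           # I/O request -> memory requests
        DATA_MOVEMENT = auto()         # host <-> SSD buffer transfer
        ADDRESS_TRANSLATION = auto()   # FTL: virtual -> physical address
        COMMITMENT = auto()            # memory requests sent to controllers
        TRANSACTION_DECISION = auto()  # controller picks a transaction type
        EXECUTION = auto()             # timing sequence on the flash chip

    SERVICE_ROUTINE = list(Phase)

Because different requests occupy different phases at the same time, these phases can overlap across the NVMHC, the core, and the flash controllers, which is the pipelining that Figure 3 depicts.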
I/O Service Routine. As shown in Figure 3, at the beginning of the process of servicing an I/O, NVMHC enqueues host request information, called tags, in its own device-level queue and schedules the tags. Once a tag is chosen to be serviced, NVMHC parses the associated I/O request based on the underlying protocol (e.g., NVM Express [14]). NVMHC then builds a memory request, whose data size is the same as the atomic flash I/O unit size, and initiates the corresponding data movement between the host and the SSD. This activity is referred to as memory request composition. Since, unlike in main memory systems, the length of an I/O request can vary significantly, ranging from several bytes to an MB, an I/O request is typically split into multiple memory requests. In the next step, NVMHC sends these memory requests to the core processor in the SSD for further I/O processing. This memory request commitment needs to be performed in a timely fashion so that the multiple phases related to memory request composition, queuing, and commitment can be pipelined. The FTL translates the virtual addresses of the memory requests coming from NVMHC into physical addresses, and scatters them over available flash controllers. Finally, a flash controller builds a flash transaction from multiple memory requests by coalescing them, and executes it. In this routine, an I/O request is served by multiple memory requests, and the amount of parallelism exhibited by a flash transaction varies depending on the chip technology as well as the transactional-locality, subject to the limitations of the underlying flash technology, as explained below.

Striping & Pipelining (SLP). While serving an I/O request, the many associated memory requests can be scattered across multiple internal resources. First, after the corresponding physical addresses have been determined by the FTL, the memory accesses can be parallelized over multiple flash controllers and channels, and this process is called channel striping. Further, each flash controller can pipeline a series of I/O commands, control commands, and data movements, associated with the transaction, across multiple flash chips within a channel, and this process is referred to as channel pipelining. Even though channel striping and channel pipelining aim to improve internal parallelism, the amount of system-level parallelism (SLP) brought by them depends highly on the incoming I/O access pattern. As shown in Figure 1b, poor chip utilization, caused mainly by parallelism dependency, is one of the main reasons for performance stagnation on emerging many-chip SSDs.
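The composition and striping steps above lend themselves to a short sketch. The following Python fragment (ours; the 4KB atomic unit, the channel count, and the round-robin assignment are assumptions, since the real placement is decided by the FTL's allocation policy) splits one I/O request into atomic-unit memory requests and stripes them over channels:

    ATOMIC_UNIT = 4096   # assumed atomic flash I/O unit (e.g., one page)
    NUM_CHANNELS = 4     # assumed channel count

    def compose(offset, length):
        # Memory request composition: split one I/O request into
        # atomic-unit memory requests (here, identified by unit index).
        first = offset // ATOMIC_UNIT
        last = (offset + length - 1) // ATOMIC_UNIT
        return list(range(first, last + 1))

    def stripe(memory_requests):
        # Channel striping: spread memory requests across channels.
        # Plain round-robin is only an illustrative stand-in.
        lanes = {ch: [] for ch in range(NUM_CHANNELS)}
        for i, req in enumerate(memory_requests):
            lanes[i % NUM_CHANNELS].append(req)
        return lanes

    # A 20KB request at byte offset 8192 becomes five memory requests;
    # channel 0 receives two of them, which the controller can overlap
    # within the channel (channel pipelining).
    print(stripe(compose(8192, 20480)))
    # {0: [2, 6], 1: [3], 2: [4], 3: [5]}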
2.2 Flash-Level Technologies

Resources. State-of-the-art SSDs also employ the following flash-level technologies:
1) Flash Chip consists of multiple dies and planes, and exposes them through a small number of I/O pins (e.g., 8~16) and a CE (chip enable) pin to the system-level resources. This flash interface reduces the I/O connection complexity and communication noise, but it also introduces a set of command sequences and data movements for handling flash transactions.
2) Die is a memory island, connected to a single multiplexed bus fibre through the flash interface and the CE. Note that the memory cells in different dies can operate independently.
3) Plane is the memory array in a die, sharing the wordline and voltage drivers for accessing specific flash memory cells.

Flash-Level Parallelism (FLP). Even though multiple dies are squeezed into a single flash interface, a set of bus activities for a transaction (e.g., flash commands, data movements, control signals) can be interlaced, and the multiple dies can work independently without any circuit-level modification. Consequently, multiple memory requests can be interleaved across dies via die interleaving, which in turn improves chip throughput and response time for a transaction n times, where n is the number of flash dies. Plane sharing activates multiple planes in a die through the shared wordline access, thereby improving throughput by m times, m being the number of planes. Lastly, die interleaving and plane sharing can be combined, which can improve transaction performance by approximately n∗m times.

Flash Transaction and Parallelism Dependency. Unlike DRAM, NAND flashes have a wide spectrum of operation sets, commands, and execution sequences – most flash memory vendors offer at least ten flash operations, each of which typically has a different execution sequence. In this context, a flash transaction is a series of activities that the flash controller has to manage in executing a flash operation. It is composed of a set of commands (i.e., flash, control, and delimiter commands) and data movements (i.e., contents, addresses, and status information). In addition, during the execution stage, all memory requests in the transaction are required to follow an appropriate timing sequence that flash makers define, and each flash transaction has its own timing sequence. Therefore, as shown in the handling transactions part of Figure 3, the type of a transaction should be decided within a short period of time before entering the execution sequence. Due to this, the parallelism of each transaction potentially has a dependency on the I/O access pattern, length, and arrival timing of incoming memory requests, which makes it difficult to achieve high levels of FLP. We refer to this as "parallelism dependency" and demonstrate specific examples exhibiting low FLP below.

Challenge. First, in cases where multiple memory requests arrive at a flash controller in a time interval which is longer than the transaction type decision time of the controller, they will be served by separate transactions even though they could have been serviced, from a flash-level perspective, as a single transaction. This poor flash-level temporal transactional-locality can potentially contribute to low FLP. On the other hand, since only one flash transaction can occupy the shared interface, bus, and flash medium at a time, once the transaction type is determined and the corresponding memory requests are initiated, other memory requests heading to the same chip must be stalled until the shared flash resources are free. Lastly, to take advantage of plane sharing, the addresses of the memory requests in a transaction should indicate the same page and die offset in the flash chip, but different block addresses (or plane addresses). As a result, low flash-level spatial transactional-locality can also introduce low FLP and high intra-chip idleness.
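The plane-sharing constraint just described, together with the ideal n∗m combination of die interleaving and plane sharing, can be captured in a few lines. In this sketch (ours; the address fields are a simplification, and real parts impose further block-address restrictions), two memory requests may join one multi-plane transaction only under the stated spatial condition:

    from dataclasses import dataclass

    @dataclass
    class MemReq:
        die: int
        plane: int
        block: int
        page: int  # page offset within the block

    def can_plane_share(a, b):
        # Same die and same page offset, but different planes (different
        # block or plane addresses), as required for plane sharing.
        return a.die == b.die and a.page == b.page and a.plane != b.plane

    def ideal_flp_speedup(num_dies, num_planes):
        # Upper bound when die interleaving (n) and plane sharing (m)
        # combine: approximately n * m.
        return num_dies * num_planes

    r1 = MemReq(die=0, plane=0, block=10, page=3)
    r2 = MemReq(die=0, plane=1, block=97, page=3)
    r3 = MemReq(die=0, plane=1, block=97, page=5)
    print(can_plane_share(r1, r2))   # True  -> one multi-plane transaction
    print(can_plane_share(r1, r3))   # False -> spatial locality violated
    print(ideal_flp_speedup(4, 2))   # 8x ideal transaction improvement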
This type of sched- sharing activates multiple planes in a die through the shared uler decides the order of I/O requests in the device-level wordline access, thereby improving throughput by m times, queue, but builds and commits memory requests relaying m being the number of planes. Lastly, die interleaving and only on the virtual addresses of the I/O requests, provided plane sharing can be combined, which can improve transac- by the underlying FTL. Consequently, VAS can suffer from ∗ tion performance by approximately n m times. collisions among I/O requests, which leads to low levels Flash Transaction and Parallelism Dependency. Unlike of SLP and FLP. Figure 4a illustrates this problem1. In DRAM, NAND flashes have an wide spectrum of opera- this example, there exist five I/O requests arriving back-to- tion sets, commands and execution sequences – most flash back and containing a total of nineteen individual memory memory vendors at least offer ten flash operations, each of requests. Initially, VAS composes four memory requests which typically has a different execution sequence. In this context, a flash transaction is a series activities that the flash 1For all the system-level diagrams (Figures 4a, 5a, and 7), we show snapshots with respect to the device-level queue (e.g., native command controller has to manage in executing a flash operation. It is queue), memory request commitments, and the corresponding physical composed of a set of commands (i.e., flash, control, delim- layout by following the controller visit order (from left to right). For each snapshot, an uncolored box on each channel indicates idle chips, and mem- iter commands), data movements (i.e., contents, addresses, ory request numbers on the resource layout refer to the memory request status information). In addition, during the execution stage, commitment order.
[Figure 4 (partial): system-level view of memory request commitment under VAS. Recoverable labels: visiting order; device-level queue; memory requests; memory request numbers.]
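To see why ordering on virtual addresses can collide physically, consider the following sketch (ours; the FTL mapping and the queue contents are invented for illustration). VAS fixes the commitment order before translation, so back-to-back queue entries can land on the same chip and serialize while other chips idle:

    from collections import Counter

    # Invented FTL mapping for illustration: virtual page -> physical chip.
    ftl_map = {0x100: 0, 0x200: 0, 0x300: 1, 0x400: 0}

    def vas_commit(virtual_queue, num_chips=4):
        # VAS picks the order using virtual addresses only; the chip-level
        # collisions below are invisible to it until commitment time.
        chips = [ftl_map[v] for v in virtual_queue]
        load = Counter(chips)
        utilization = len(load) / num_chips  # chips that do any work
        worst_depth = max(load.values())     # serialized requests, worst chip
        return utilization, worst_depth

    print(vas_commit([0x100, 0x200, 0x300, 0x400]))
    # (0.5, 3) -> half the chips never work, and chip 0 serializes three
    # requests: low SLP despite four available chips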