arXiv:2006.08975v1 [cs.AR] 16 Jun 2020

ZnG: Architecting GPU Multi-Processors with New Flash for Scalable Data Analysis

Jie Zhang and Myoungsoo Jung
Computer Architecture and Memory Systems Laboratory,
Korea Advanced Institute of Science and Technology (KAIST)
http://camelab.org

Abstract—We propose ZnG, a new GPU-SSD integrated architecture, which can maximize the memory capacity in a GPU and address the performance penalties imposed by an SSD. Specifically, ZnG replaces all GPU internal DRAMs with an ultra-low-latency SSD to maximize the GPU memory capacity. ZnG further removes the performance bottleneck of the SSD by replacing its flash channels with a high-throughput flash network and integrating the SSD firmware in the GPU's MMU to reap the benefits of hardware accelerations. Although the flash arrays within the SSD can deliver high accumulated bandwidth, only a small fraction of such bandwidth can be utilized by the GPU's memory requests due to mismatches of their access granularity. To address this, ZnG employs a large L2 cache and flash registers to buffer the memory requests. Our evaluation results indicate that ZnG can achieve 7.5× higher performance than prior work.

Index Terms—data movement, GPU, SSD, heterogeneous system, MMU, L2 cache, Z-NAND, DRAM

I. INTRODUCTION

Over the past few years, graphics processing units (GPUs) have become prevailing in accelerating large-scale data-intensive applications such as graph analysis and bigdata computing [1]–[5]. To reap the benefits brought by the massive cores of GPUs, large-scale applications are decomposed into multiple GPU kernels, each containing ten or hundred thousands of threads. These threads can be simultaneously executed by the massive GPU cores, which drives the computing parallelism, referred to as thread-level parallelism (TLP), to exceed CPUs' performance by up to 100 times. While GPUs exhibit such high TLP, the on-board memory capacity of GPUs is much less than that of the host-side main memory, which in practice cannot accommodate all the data sets of large-scale applications. This is because GPUs, as a peripheral device, have limited on-board space to deploy an enough number of memory packages [6], while the capacity of a single memory package is difficult to expand due to the DRAM cell disturbances and retention time violations [7]–[9].

To meet the requirement of such a large memory capacity, a GPU vendor utilizes the underlying NVMe SSD as a swap disk of the GPU on-board DRAM and leverages the memory management unit (MMU) in GPUs to realize memory virtualization [10]. For example, if a data block requested by a GPU core misses in the GPU memory, the GPU's MMU raises a page fault exception and the GPU informs the host to service the page fault. As both the GPU and the NVMe SSD are peripheral devices, servicing the page fault unfortunately introduces severe data movement overhead.
Specifically, the host first needs to load the target page from the NVMe SSD to the host-side main memory and then copy the same data from the host memory to the GPU memory. The data copies across different computing domains and the bandwidth constraints of various hardware interfaces (i.e., PCIe) significantly increase the latency of servicing page faults, which in turn degrades the overall performance of many applications at the user level.

To reduce the data movement overhead, a prior study [11], referred to as HybridGPU, proposes to directly replace a GPU's on-board DRAM packages with Z-NAND flash packages, as shown in Figure 1a. Z-NAND, a new type of flash media, achieves 64 times higher capacity than DRAM while reducing the access latency of conventional flash from hundreds of micro-seconds to a few micro-seconds [12]. However, HybridGPU faces several challenges to service the GPU memory requests with Z-NAND directly: 1) the minimum access granularity of Z-NAND is a page, which is not compatible with the small memory requests generated by the GPU; 2) the Z-NAND programming (write) latency is still much longer than its read latency, and Z-NAND forbids in-place updates; 3) it requires the assistance of SSD firmware to manage the address mapping. To address these challenges, HybridGPU employs a customized SSD controller to execute the SSD firmware and a small internal DRAM buffer to hide the relatively long Z-NAND read/write latency. While this approach can eliminate the data movement overhead by placing Z-NAND close to the GPU, there is a huge performance disparity between this approach and the traditional GPU memory subsystem.
Fig. 1: An integrated HybridGPU architecture and the performance analysis. (a) HybridGPU design. (b) Accumulated bandwidth (GB/s) of the flash channel, flash read, flash write, SSD engine, and GDDR5.
To figure out the main reason behind this performance disparity, we analyze the bandwidths of the traditional GPU memory subsystem and of each component in HybridGPU. The results are shown in Figure 1b. The maximum bandwidth of HybridGPU's internal DRAM buffer is 96% lower than that of the traditional GPU memory subsystem. This is because the state-of-the-art GPUs employ six memory controllers to communicate with a dozen DRAM packages via a 384-bit data bus [13], while the DRAM buffer is a single package connected to a 32-bit data bus [11]. It is possible to increase the number of DRAM packages in HybridGPU by reducing the number of its Z-NAND packages. However, this solution is undesirable as it reduces the total memory capacity of the GPU. In addition, the I/O bandwidth of the flash channels and the data processing bandwidth of the SSD controller are much lower than those of the traditional GPU memory subsystem, which can also become a performance bottleneck. This is because the bus structure of the flash channels constrains itself from scaling up to a higher frequency, while the single SSD controller has limited computing power to translate the addresses of the memory requests arriving from all the GPU cores.

We propose a new GPU-SSD architecture, ZnG, to maximize the memory capacity in the GPU and to address the performance penalties imposed by the SSD integrated in the GPU. ZnG replaces all the GPU on-board memory modules with multiple Z-NAND flash packages and directly exposes the Z-NAND flash packages to the GPU L2 cache to reap the benefits of the accumulated flash bandwidth. Specifically, to prevent the single SSD controller from blocking the services of memory requests, ZnG removes HybridGPU's request dispatcher. Instead, ZnG directly attaches the underlying flash controllers to the GPU interconnect network to directly accept memory requests from the GPU L2 caches. Since the SSD controller also has limited computing power to perform address translation for the massive number of memory requests, ZnG offloads the functionality of address translation to the GPU internal MMU and the flash address decoders (to hide the computation overhead). To satisfy the accumulated bandwidth of the Z-NAND flash arrays, ZnG also increases the bandwidth of the flash channels. Lastly, although replacing all GPU DRAM modules with Z-NAND packages can maximize the accumulated flash bandwidth, such bandwidth can be underutilized without a DRAM buffer, as Z-NAND has to fetch a whole 4KB flash page to serve a 128B data block. ZnG increases the GPU L2 cache size to buffer the pages fetched from Z-NAND, which can serve the read requests, while it leverages Z-NAND internal cache registers to accommodate the write requests. Our evaluation results indicate that ZnG can improve the overall GPU performance by 7.5×, compared to HybridGPU. The contributions of this work can be summarized as follows:

• New GPU-SSD architecture to remove the performance bottleneck. HybridGPU's request dispatcher can be a bottleneck as it interacts with both the underlying flash controllers and all L2 cache banks. To address this, ZnG directly attaches multiple flash controllers to the GPU interconnect network, such that the memory requests generated by the GPU L2 caches can be served across different flash controllers in an interleaved manner. On the other hand, compared to the GPU interconnect network, a flash channel has a limited number of electrical lanes and runs at a relatively low clock frequency, and therefore its bandwidth is much lower than the accumulated bandwidth of the Z-NAND arrays. ZnG addresses this by changing the flash network from a bus to a mesh structure.

• Hardware implementation for zero-overhead address translation. The large-scale data analysis applications are typically read-intensive. By leveraging this characteristic of the target user applications, ZnG splits the flash address translation into two parts. First, a read-only mapping table is integrated in the GPU internal MMU and cached by the GPU TLB. All memory requests can directly get their physical addresses when the MMU looks up the mapping table to translate their virtual addresses. Second, when there is a memory write, the data and the updated address mapping information are simultaneously recorded in the flash address decoder and the flash arrays. A future access to the written data will then be remapped by the flash address decoder.

• L2 cache and flash internal register design for maximum flash bandwidth. As Z-NAND is directly connected to the GPU L2 cache via the flash controllers, ZnG increases the L2 cache capacity with an emerging non-volatile memory, STT-MRAM, to buffer a larger number of pages fetched from Z-NAND. ZnG further improves the space utilization of the GPU L2 caches by predicting the spatial locality of the fetched pages. As STT-MRAM suffers from a long write latency, ZnG constructs the L2 cache as a read-only cache. To accommodate the write requests, ZnG increases the number of registers in Z-NAND flash and configures all the registers within the same flash package as a fully-associative cache.

Fig. 2: GPU internal architecture.

II. BACKGROUND

A. GPU internal architecture

Figure 2 shows a representative GPU architecture, in which streaming multiprocessors (SMs), shared L2 caches, memory controllers and a memory management unit are connected via a GPU internal network. Within SMs, multiple arithmetic logic units (ALUs) are employed to execute a group of 32 threads, called a warp, in a lock step [14]. During the execution, the instructions of the 32 threads are fetched and decoded first. The warp scheduler then schedules an available warp and issues the decoded instructions of this warp. Based on the instruction types, the arithmetic instructions are executed in the ALUs, while the load/store instructions access the on-chip memory via the LD/ST units. The on-chip memory is comprised of the L1D cache and the shared memory. Before accessing the L1D cache, the memory requests generated by the thirty-two threads in a warp are issued to the coalescing unit. This logic unit examines the addresses of the memory requests and merges them into fewer but larger memory requests to improve the utilization of the L1D cache bandwidth. Note that the state-of-the-art GPUs integrate a translation lookaside buffer (TLB)/memory management unit (MMU) to support memory virtualization [15]. Before accessing the L1D cache [13], the memory requests need their virtual addresses to be translated to logical addresses via the TLB and MMU [16]. Afterwards, the requests can be served from the L1D cache if they hit; otherwise, the misses are tracked by the miss status holding registers (MSHR) and forwarded to the L2 cache [17]. Since NVIDIA, AMD and ARM have not documented their GPU memory virtualization designs publicly, we adopt the virtual memory design from academia in this work [18]. As shown in Figure 2, the highly-threaded MMU is a resource shared by all SMs, and it includes a page table walker, a page walk buffer, a page fault handler and a page walk cache. The page table walker includes a set of hardware state machines to walk the page table. For each L1D cache and TLB miss, a hardware state machine allocates a page walk buffer entry, which records the page walk state. It then issues the memory requests of the corresponding page table walk and waits for the request completion. To improve the throughput of address translation, the page table walker employs 32 hardware state machines. Since a page table walk costs hundreds of cycles of memory accesses [19], the GPU MMU also employs a page walk cache to reduce the number of memory accesses. In addition, as the GPU has a limited memory capacity to accommodate the working sets of multiple applications, the GPU MMU employs a page fault handling logic. Specifically, the hardware records the page fault information in CPU-side registers and sets an interrupt to the CPU. The CPU then executes the page fault handling programs in response to the interrupt and copies the target data to the GPU memory.

To achieve high data access bandwidth, GPUs usually employ a shared L2 cache and high-performance off-chip memory. The shared L2 cache exhibits a larger space than the L1D cache, and it is partitioned into multiple tiles to support parallel data accesses. Once the memory requests miss in the L2 cache, they are forwarded to the memory controllers. To process the massive memory requests, GPUs typically employ multiple memory controllers (i.e., 6∼8 [13], [17]), each connecting to a set of GPU DRAMs to deliver a high bandwidth.
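To make the coalescing step concrete, the sketch below groups the 32 per-thread byte addresses of a warp into the minimum set of aligned segments, which is how a coalescer reduces the number of L1D cache accesses. This is a minimal illustration only: the 128-byte segment size and the helper names are our own assumptions, not details taken from the paper.

#include <cstdint>
#include <set>
#include <vector>

// Illustrative sketch only: coalesce the 32 per-thread addresses of a warp
// into unique, aligned memory segments. The 128-byte segment size is an
// assumption used for illustration; real coalescing hardware differs.
constexpr uint64_t kSegmentBytes = 128;

std::vector<uint64_t> CoalesceWarpAddresses(const std::vector<uint64_t>& thread_addrs) {
  std::set<uint64_t> segments;  // ordered set of unique segment base addresses
  for (uint64_t addr : thread_addrs) {
    segments.insert(addr & ~(kSegmentBytes - 1));  // align down to segment base
  }
  // Each unique segment becomes one (larger) memory request to the L1D cache.
  return std::vector<uint64_t>(segments.begin(), segments.end());
}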
Fig. 3: Density and power consumption analysis. (a) Memory package capacity (GB) of GDDR5, DDR4, LPDDR4, and Z-NAND. (b) Power consumption (W/GB) of GDDR5, DDR4, LPDDR4, and Z-NAND.
B. Z-NAND

While GPU DRAM provides promising throughput, it has two drawbacks: low capacity and high power consumption. Figures 3a and 3b compare GPU DRAM (GDDR5), desktop DRAM (DDR4), mobile DRAM (LPDDR4) and the new NAND flash (Z-NAND) in terms of memory density and power consumption (watt per GB), respectively. One can observe from the figures that GPU DRAM spends much higher power and has much lower memory density compared to the other memory types. Even though the industry successfully improves the power efficiency of DRAM (i.e., LPDDR4), it is unable to increase the DRAM capacity, which turns out to be 64× lower in memory density than Z-NAND.

Compared to DRAM, Z-NAND exhibits the highest memory density and provides the best power efficiency. In addition, Z-NAND provides a promising performance in terms of both latency and throughput. Figure 4a shows the architectural details of Z-NAND. Z-NAND increases its memory density by piling flash cells over 48 vertically-stacked layers. Z-NAND further reduces its flash access latency by reorganizing the low-level cell technology and micro-architecture. Specifically, Z-NAND leverages single-level cell (SLC) technologies to shorten the flash-level read/program latency and to significantly increase its program/erase (P/E) cycles. Thus, the read and program operations of Z-NAND take 3us and 100us, which are 17× and 6× faster than those of the state-of-the-art TLC V-NAND flash, respectively [12]. Z-NAND's P/E cycles (i.e., 100,000) are also higher than those of V-NAND flash (i.e., 3,000∼10,000) by 14× overall. The flash backbone of an SSD that employs Z-NAND can deliver a scalable throughput by fully leveraging its internal parallelism, including the multiple channels, packages, dies and planes.

Unlike GPU DRAM, which is byte-addressable, Z-NAND should serve the reads/writes on a page basis. Thus, Z-NAND only allows writing data at a page granularity and forbids overwrites due to its page-to-page interference/disturbance. In addition, Z-NAND follows an in-order programming rule and an erase-before-write rule [12]; a whole flash block needs to be erased before re-programming its flash cells. Because of this erase-before-write rule, modern SSDs employ an embedded CPU and an internal DRAM, referred to as the SSD engine, to execute SSD firmware. The main role of the SSD firmware is to translate the logical addresses of incoming I/O requests to the physical addresses of Z-NAND flash (cf., FTL). When the number of clean flash blocks is under a threshold, the SSD firmware performs a clean-up process, called garbage collection (GC). Specifically, it migrates multiple valid pages from fully programmed block(s) to clean block(s), erases the target block(s), and updates the mapping information in the internal DRAM.
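To make the role of the FTL and garbage collection concrete, the following sketch shows a simplified page-level mapping lookup and a GC trigger. The table layout, the thresholds, and the function names are illustrative assumptions and are far simpler than a production FTL.

#include <cstdint>
#include <unordered_map>

// Simplified, illustrative page-level FTL. Real SSD firmware also handles
// wear levelling, bad blocks, power failure, and much larger mapping tables.
struct PhysicalPageAddr { uint32_t block; uint16_t page; };

class SimpleFtl {
 public:
  // Translate a logical page number (LPN) to its physical location.
  bool Lookup(uint64_t lpn, PhysicalPageAddr* out) const {
    auto it = map_.find(lpn);
    if (it == map_.end()) return false;
    *out = it->second;
    return true;
  }

  // A write never updates a flash page in place: it is redirected to a free
  // page and the mapping entry is updated (the old page becomes invalid).
  void Write(uint64_t lpn, PhysicalPageAddr new_loc) {
    map_[lpn] = new_loc;
    if (--free_pages_ < kGcThreshold) RunGarbageCollection();
  }

 private:
  void RunGarbageCollection() {
    // Migrate valid pages out of victim blocks, erase them, and reclaim space.
    // Details omitted here; see the GC description above.
    free_pages_ += kReclaimedPerGc;
  }

  static constexpr uint64_t kGcThreshold = 1024;    // assumed threshold
  static constexpr uint64_t kReclaimedPerGc = 4096; // assumed reclaim amount
  std::unordered_map<uint64_t, PhysicalPageAddr> map_;
  uint64_t free_pages_ = 1u << 20;
};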
Fig. 4: SSD internal architecture, GPU-SSD architecture and the throughput of different memory media. (a) SSD internal. (b) GPU-SSD architecture. (c) Performance (bandwidth, GB/s). (d) Latency breakdown.
C. Prior work and motivation

To address the aforementioned challenges of GPU DRAM, prior work attempts to increase the GPU memory with an SSD [11], [20]. Figures 4b and 1a respectively show the overviews of two typical solutions: a GPU-SSD system [20] and a HybridGPU system [11]. The GPU-SSD system employs a discrete GPU and SSD as peripheral devices, which are connected to the CPU via a PCIe interface. When page faults occur in the GPU due to the limited memory space, the CPU serves the page faults by accessing data from the underlying SSD and moving the data to the GPU memory through the GPU software framework. In addition, the page faults require redundant data copies on the host side due to the user/privilege mode switches. This, unfortunately, wastes host CPU cycles and reduces the data access bandwidth. In contrast, HybridGPU directly integrates the SSD into the GPU, which can eliminate the CPU intervention and avoid the redundant data copies. While HybridGPU exhibits much better performance than the GPU-SSD system, the memory bandwidth of its embedded SSD module is not comparable to that of GPU memory. Figure 4c compares the maximum data access throughput of GPU DRAM, desktop DRAM, mobile DRAM, GPU-SSD and HybridGPU systems. For the GPU-SSD and HybridGPU systems, data are assumed to reside in the SSD. As shown in the figure, GPU DRAM outperforms the GPU-SSD and HybridGPU systems by 80 times and 40 times, respectively. This in turn makes the SSD the performance bottleneck in such systems when executing the applications with large-scale data sets.

III. HIGH LEVEL VIEW

A. Key observations

FTL and GPU-SSD interconnection. Figure 4d analyzes multiple components that contribute to the memory access latency and compares the latency breakdown between the traditional GPU memory subsystem and HybridGPU. The SSD engine, which performs the FTL, accounts for 67% of the total memory access latency. This is because the SSD engine in commercial SSDs has limited computing power to process the memory requests, which are simultaneously generated by the massive GPU cores. As a result, a large number of memory requests can be stalled by the SSD engine. In addition to the SSD engine issue, the network latency (between the L2 cache and the SSD engine) is significant, compared to the traditional GPU memory subsystem. This is because the SSD controller only employs 2∼5 low-power embedded processor cores [21], which are unable to send/receive many memory requests in parallel. Thus, the SSD controller also becomes a performance bottleneck in serving memory requests.

Z-NAND access granularity. We perform a simulation-based study to analyze the impact of replacing all GPU on-board DRAMs with Z-NAND flash packages. For the sake of brevity, we assume there is no performance penalty introduced by the SSD controller. In this study, we configure a GTX580-like GPU model [17] by using a GPU simulation framework, MacSim [22]. The important configuration details are listed in Table I. Figure 5a shows the performance degradation, imposed by direct Z-NAND flash accesses, under the execution of 12 real GPU workloads [23]–[25]. One can observe from this figure that the performance degradation can be as high as 28×. This is because the memory access size in the GPU is 128B, which is much smaller than the minimum access granularity of Z-NAND flash (4KB). Therefore, 97% of the flash access bandwidth is underutilized when serving the requests.

Workload characteristics. We evaluate the impact of direct Z-NAND accesses on large-scale data analysis applications. Figure 5b shows the number of memory requests that repeatedly access the same pages. Each Z-NAND page, on average, is repeatedly read 42 times across all the workloads. This observation indicates that it is still beneficial to buffer the re-accessed data for future fast accesses. In addition to the read operations, we also collect the number of write requests that target the same Z-NAND pages, called write redundancy in this work. The results are shown in Figure 5c. Each Z-NAND page, on average, receives 65 write operations. Serving such write requests directly in flash pages can dramatically shorten the lifetime of Z-NAND. Thus, it is essential to have a buffer to accommodate the write requests.
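As a quick sanity check on the access-granularity observation above, the 97% figure follows directly from the mismatch between the request size and the page size (our own arithmetic, assuming one 128B request is served per 4KB page read):

\text{utilized fraction} = \frac{128\,\mathrm{B}}{4\,\mathrm{KB}} = \frac{128}{4096} \approx 3.1\%, \qquad \text{wasted fraction} \approx 1 - 3.1\% \approx 97\%.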
Fig. 5: Performance of SSD components, performance degradation, read re-accesses and write redundancy. (a) Performance degradation. (b) Read re-accesses. (c) Write redundancy. (d) Access breakdown.
Fig. 6: High-level view of ZnG and the execution flow of memory accesses. (a) High-level view. (b) Execution flow of memory accesses.

B. The high-level view of ZnG

Putting Z-NAND flash close to GPU. Figure 6a shows an architectural overview of the proposed ZnG. Compared to HybridGPU, ZnG removes the components of the request dispatcher, the SSD controller and the data buffer (which are placed between the GPU L2 cache and the Z-NAND flash). Instead, the underlying flash network is directly attached to the GPU interconnect network through the flash controllers. While the flash controllers manage the I/O transactions of the underlying Z-NAND, ZnG integrates a request dispatcher in each flash controller to interact with the GPU interconnect network in sending/receiving the request packets. There exist two root causes why ZnG does not directly attach the Z-NAND packages to the GPU interconnect network. First, Z-NAND packages leverage the Open NAND Flash Interface (ONFI) [26] for the I/O communication, whose frequency and hardware (electrical lane) configurations are different from those of the GPU interconnect network. Second, since the bandwidth capacity of the GPU interconnect network much exceeds the total bandwidth brought by all the underlying Z-NAND packages, directly attaching the Z-NAND packages to the GPU interconnect network can significantly underutilize the network resources. Thus, we employ a mesh structure as the flash network, which can meet the bandwidth requirement of the Z-NAND packages by increasing the frequency and link widths, rather than leveraging the existing GPU interconnect network.

Zero-overhead FTL. As the SSD controller is removed from ZnG, we offload the FTL to other hardware components. The GPU's MMU can be a good candidate component to implement the FTL, as all memory requests leverage the MMU to translate their virtual addresses to memory logical addresses. We can achieve a zero-overhead FTL if the MMU directly translates the virtual address of each memory request to a flash physical address. However, the MMU does not have sufficient space to accommodate all the mapping information of the FTL. An alternative solution is to revise the internal row decoder of each Z-NAND plane to remap a request's address to a wordline of the Z-NAND flash array, which is inspired by [27]. While this approach can eliminate the FTL overhead, reading a page requires searching the row decoders of all Z-NAND planes, which introduces a huge Z-NAND access overhead. To address these challenges, ZnG combines the two approaches. Our key observation is that a wide spectrum of the data analysis workloads is read-intensive, which generates only a few write requests to Z-NAND. Thus, we split the FTL's mapping table into a read-only block mapping table and a log page mapping table. To reduce the mapping table size, the block mapping table records the mapping information of a flash block rather than a page. This design in turn reduces the mapping table size to 80 KB, which can be placed in the MMU. While read requests can leverage the read-only block mapping table to find their flash physical addresses, this mapping table cannot remap incoming write requests to new Z-NAND pages. To address this, we implement a log page mapping table in the flash row decoder. The MMU calculates the flash block address of a write request based on the block mapping table, and we then forward the write request to the target flash block. The flash row decoder remaps the write request to a new page location in the flash block. Once all the free space of Z-NAND is used up, we allocate a GPU helper thread to reclaim flash block(s) by performing garbage collection, which is inspired by [28].

Putting it all together. To make a memory request correctly access the corresponding Z-NAND, the memory requests generated by the GPU SMs firstly leverage the TLB/MMU to translate their logical addresses to flash physical addresses (1) (cf. Figure 6a). The GPU L1 and L2 caches are indexed by the flash physical address. If a memory request misses in the L2 cache (2), the L2 cache sends the memory request to one of the flash controllers (3). The flash controller decodes the physical address to find the target flash plane and converts the memory request into a sequence of flash commands (4). The target Z-NAND serves such a memory request by using the row decoder to activate the wordline and sensing the data from the flash array.
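A minimal sketch of the split translation path described above, assuming a read-only block mapping table held in the MMU and a per-block log page mapping held in the flash row decoder; the block geometry, field names, and helper types are our own illustrative assumptions rather than the authors' hardware design.

#include <cstdint>
#include <optional>
#include <unordered_map>

// Illustrative sketch of ZnG-style split address translation (not the authors' RTL).
struct FlashAddr { uint32_t block; uint16_t page; uint16_t offset; };

constexpr uint32_t kPagesPerBlock = 256;   // assumed block geometry
constexpr uint32_t kPageBytes     = 4096;

// Read-only block mapping table (conceptually resident in the MMU / TLB):
// maps a virtual block number to a physical data block number.
std::unordered_map<uint64_t, uint32_t> block_map;

// Per physical block, a small log page mapping (conceptually kept in the
// flash row decoder): page index -> remapped page in the log block, if written.
std::unordered_map<uint32_t, std::unordered_map<uint16_t, uint16_t>> log_page_map;

std::optional<FlashAddr> Translate(uint64_t vaddr) {
  const uint64_t vbn  = vaddr / (uint64_t{kPagesPerBlock} * kPageBytes);
  const uint16_t page = (vaddr / kPageBytes) % kPagesPerBlock;
  const uint16_t off  = vaddr % kPageBytes;

  auto blk = block_map.find(vbn);
  if (blk == block_map.end()) return std::nullopt;   // unmapped -> fault

  // Reads of never-written pages resolve purely through the block table.
  // Written pages are redirected by the (row-decoder resident) log mapping.
  uint16_t phys_page = page;
  auto& lpm = log_page_map[blk->second];
  if (auto it = lpm.find(page); it != lpm.end()) phys_page = it->second;

  return FlashAddr{blk->second, phys_page, off};
}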
Fig. 7: Zero-overhead FTL design. (a) Mapping tables. (b) Row decoder modification.

C. The optimizations for high throughput

As explained in Section III-A, simply removing the internal DRAM buffer imposes a huge performance degradation in a GPU-SSD system. To address the degradation, ZnG assigns the GPU L2 cache and the Z-NAND internal registers as read and write buffers, respectively, to accommodate the read and write requests. Figure 6b shows the high-level view of our proposed design. Specifically, we increase the L2 cache capacity by replacing SRAM with a non-volatile memory, in particular STT-MRAM, to accommodate as many read requests as possible. While STT-MRAM can increase the L2 cache capacity by 4 times, its long write latency makes it infeasible to accommodate the write requests [29]. We then increase the number of registers in Z-NAND to shelter the write requests.

Read prefetch. The L2 cache can better serve the memory requests if it can accurately prefetch the target data blocks from the underlying Z-NAND. We propose a predictor to speculate on the spatial locality of the access pattern generated by user applications. If the user applications access continuous data blocks, the predictor informs the L2 cache to prefetch data blocks. As the limited size of the L2 cache cannot accommodate all prefetched data blocks, we also propose an access monitor to dynamically adjust the data size of each prefetch operation.

Construct fully-associative flash registers. A Z-NAND plane only allows a few flash registers (i.e., 8) due to space constraints. The limited number of flash registers may not be sufficient to accommodate all write requests, depending on the workload execution behaviors. To address this, we propose to group the flash registers across all flash planes in the same Z-NAND flash package to serve as a fully-associative cache, such that memory requests can be placed anywhere within the flash registers. However, this requires all the flash registers to connect to both the I/O ports and all the flash arrays in a Z-NAND package, which introduces a high interconnection cost. To address this issue, we simplify the interconnection by connecting all the flash registers with a single bus, and only the buses are connected to the I/O ports and flash planes. We also propose a thrashing checker to monitor whether there is cache thrashing in the limited flash registers. If so, ZnG pins a small L2 cache space to place the excessive dirty pages.

IV. IMPLEMENTATION

A. Zero-overhead FTL design

Overall implementation of zero-overhead FTL. Figure 7a shows an overview of the FTL implementation across the GPU's MMU/TLB, the shared memory, and the flash row decoder. Our MMU implementation adopts a two-level page table based on a real GPU MMU implementation [15]. The page table works as a data block mapping table (DBMT), whose entries store a virtual block number (VBN), a logical block number (LBN), a physical log block number (PLBN) and a physical data block number (PDBN). Among these, the VBN is the data block address of the user applications in the virtual address space. The LBN is a global memory address, while the PDBN and PLBN are Z-NAND's flash addresses. The physical data block sequentially stores the read-only flash pages. If memory requests access read-only data, the requests can locate the positions of the target data within the PDBN by referring to their virtual addresses as an index. On the other hand, the write requests are served by physical log blocks. We employ a log page mapping table (LPMT) for each physical log block to record such information. If memory requests access a modified data block, they need to refer to the LPMT to find the physical location of the target data. Each LPMT is stored in the row decoder of the corresponding log block. We will explain the detailed implementation of the LPMT shortly. The TLB is employed to accelerate the flash address translation. It buffers the entries of the DBMT which are frequently inquired by the GPU kernels. Note that the physical log blocks come from the SSD's over-provisioned space and therefore, the blocks are not accounted in the address space of Z-NAND [30]. Considering the limited SSD over-provisioned space, we group multiple physical data blocks to share a physical log block. We also create a log block mapping table (LBMT) (in the shared memory) to record such mapping information (cf. Figure 7a).

While the MMU can perform the address translation, it does not support other essential functionalities of the FTL such as the wear-levelling algorithm and garbage collection. We further implement these functions in a GPU helper thread. Specifically, when all the flash pages in a physical log block have been used up, the GPU helper thread performs garbage collection, which merges the pages of all the physical data blocks and the physical log blocks. It then selects empty physical data blocks based on the wear-levelling algorithm to store the merged pages. The GPU helper thread lastly updates the corresponding information in the LBMT and the DBMT.

Overall implementation of LPMT in row decoders. Figure 7b shows the implementation details of a programmable row decoder in Z-NAND. The flash controller converts each memory request to the corresponding physical log block number, physical data block number and page index, and sends them to the row decoder. To serve a read request, the programmable decoder looks up the LPMT for the target data. If the target data hits in the LPMT, the programmable decoder activates the corresponding wordline based on the page mapping information of the LPMT. Otherwise, the row decoder activates the wordline based on the page index and the data block number. On the other hand, to serve a write request, a write operation is performed by selecting a free page in the physical log block to program the new data, and the new mapping information is recorded in the LPMT. As only in-order programming is allowed in Z-NAND, we can use a register to track the next available free page number in the physical log block.
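To illustrate how a DBMT entry and the LBMT-mediated log-block sharing described earlier in this subsection might fit together, here is a compact data-structure sketch. The field widths, container choices, and function names are our own assumptions for illustration, not the authors' hardware layout.

#include <cstdint>
#include <unordered_map>

// Illustrative layout of one data block mapping table (DBMT) entry.
// Field widths are assumptions; the real tables are sized to fit the MMU/TLB.
struct DbmtEntry {
  uint64_t vbn;    // virtual block number (application address space)
  uint64_t lbn;    // logical block number (global memory address)
  uint32_t pdbn;   // physical data block number (read-only pages in Z-NAND)
  uint32_t plbn;   // physical log block number (holds remapped written pages)
};

// Log block mapping table (LBMT): several physical data blocks share one
// physical log block taken from the SSD's over-provisioned space.
using Lbmt = std::unordered_map<uint32_t /*pdbn*/, uint32_t /*plbn*/>;

// Resolve the physical blocks that a read or write to `vbn` should use.
// Returns false if the DBMT has no entry (i.e., the block is unmapped).
bool ResolveBlocks(const std::unordered_map<uint64_t, DbmtEntry>& dbmt,
                   const Lbmt& lbmt, uint64_t vbn,
                   uint32_t* pdbn_out, uint32_t* plbn_out) {
  auto it = dbmt.find(vbn);
  if (it == dbmt.end()) return false;
  *pdbn_out = it->second.pdbn;
  auto shared = lbmt.find(it->second.pdbn);
  *plbn_out = (shared != lbmt.end()) ? shared->second : it->second.plbn;
  return true;
}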
The programmable decoder contains as many wordlines as the Z-NAND flash array. Each wordline of the programmable decoder connects to 2N flash cells and 4N bitlines (A0∼AN, B0∼BN, A′0∼A′N, B′0∼B′N), where N is the physical address length. The page mapping information of the LPMT is programmed into the flash cells of the programmable decoder by activating the corresponding wordlines and bitlines. The detailed steps of a write operation in the programmable decoder are as follows: 1) the programmable decoder activates the wordline corresponding to the free page; 2) the bit values of the page index, such as "1" and "0", are converted to a high and a low voltage, respectively. Such voltage is applied to B0∼BN, while the inverse voltage is applied to B′0∼B′N. Thus, the voltages on B and B′ program the flash cells; 3) for the other rows, the "protect" gates are enabled to drive high voltages to the drain selectors, which avoids program disturbance. The programmable decoder operates as a content addressable memory (CAM) to search whether the target data exists in a physical log block. Specifically, the search procedure includes two phases, which are shown in Figure 7b. In phase 1, the clock signal disables the gates, and the wordlines are charged with a high voltage. In phase 2, the voltages converted from the page index are applied to A0∼AN and A′0∼A′N. If the page index matches the values stored in any row, the gates between A and A′ are enabled to discharge the corresponding wordline of the programmable decoder (which activates a row of the flash array). Otherwise, the wordlines keep the high voltage to disable the row selection.

Fig. 8: Dynamic read prefetch design and S/W I/O permutation design. (a) The design of the dynamic read prefetch module. (b) Asymmetric Z-NAND writes. (c) Software I/O permutation.

B. Read optimization

Figure 8a shows an overview of the proposed dynamic read prefetch to improve the L2 cache utilization. Our design includes three components: 1) a predictor to speculate data locality; 2) an extension of the L2 cache's tag array to track the status of data accesses; and 3) an access monitor to dynamically adjust the granularity of data prefetch.

Predictor table. We design the predictor in the L2 cache to record the access history of the read requests and to speculate the memory access pattern based on the program counter (PC) address of each thread, which is inspired by [11]. The key insight behind this design is that the memory requests generated from the LD/ST instructions of the same PC address exhibit the same access patterns. The implementation of our predictor requires extending the memory requests with an extra field to record the PC address and warp ID. Our predictor also contains a predictor table, whose entries are indexed by the PC addresses. We allocate 512 entries in the predictor table by default, which is the same as [11]. Each entry of the predictor table contains a few fields for different warps to store the logical page number that they are accessing. Fundamentally, we track the accesses of five representative warps. Each entry also includes a 4-bit counter to store the number of re-accesses to the recorded pages. For example, if warp 0 generates a memory request based on PC address 0 and the request targets the same page as what is recorded in the predictor table, the counter increases by one. Otherwise, if the request accesses a page different from the page number recorded in the predictor table, the counter decreases by one, and the new page number is filled in the corresponding field of the predictor table. When there is a cache miss, a cutoff test of the read prefetch checks the predictor table by referring to the PC address of the memory request. If the counter value is higher than a threshold (i.e., 12), we perform a read prefetch.

Extension of L2 cache's tag array. We extend each entry of the L2 cache's tag array with the fields of an accessed bit and a prefetch bit. These two fields are used to check whether the prefetched data were evicted early due to the limited L2 cache space. Specifically, we use the prefetch bit to identify whether the data stored in the cache line were filled by a prefetch, while the accessed bit records whether a cache line has been accessed by a warp. When a cache line is evicted, the prefetch and accessed bits are checked. If the cache line was filled by a prefetch but has not been accessed by a warp, this indicates that a read prefetch may introduce L2 cache thrashing.
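The per-PC predictor bookkeeping described above can be summarized in a few lines of C++. The 512-entry sizing, the five tracked warps, the 4-bit counter, and the threshold of 12 follow the text, while the structure layout, saturating-counter arithmetic, and function names are our own illustrative choices.

#include <array>
#include <cstdint>

// Illustrative model of one predictor-table entry (indexed by PC address).
// Five tracked warps and a 4-bit saturating re-access counter, as in the text.
struct PredictorEntry {
  std::array<uint64_t, 5> last_page{};  // last logical page per tracked warp
  uint8_t counter = 0;                  // 4-bit counter, saturates at 15
};

constexpr int kNumEntries = 512;        // default table size
constexpr uint8_t kPrefetchThreshold = 12;

std::array<PredictorEntry, kNumEntries> predictor_table;

// Update the table on a read request (warp_slot in [0, 5)) and report whether
// the cutoff test would allow a prefetch on a subsequent cache miss.
bool OnReadRequest(uint64_t pc, int warp_slot, uint64_t logical_page) {
  PredictorEntry& e = predictor_table[pc % kNumEntries];
  if (e.last_page[warp_slot] == logical_page) {
    if (e.counter < 15) ++e.counter;             // same page re-accessed
  } else {
    if (e.counter > 0) --e.counter;              // different page: decay
    e.last_page[warp_slot] = logical_page;       // remember the new page
  }
  return e.counter > kPrefetchThreshold;         // cutoff test
}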
Access monitor. To avoid early eviction of the prefetched data and improve the utilization of the L2 cache, we propose an access monitor to dynamically adjust the access granularity of