DMA Cache: An On-Chip Buffer for Improving I/O Performance

Dan Tang1,2 , Yungang Bao1, Weiwu Hu1, Mingyu Chen1

1 Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China 2 Graduate School of Chinese Academy of Sciences, Beijing, China {tangdan,hww}@ict.ac.cn, {baoyg,cmy}@ncic.ac.cn

Abstract For decades, the CPU and I/O devices have exchanged data through the direct memory access (DMA) technique. However, as the bandwidth requirements of I/O devices grow rapidly, memory data moving operations have become critical for the DMA scheme and are now a bottleneck for I/O performance. In this paper, we propose a DMA cache technique which adopts a dedicated cache (called the DMA cache) for buffering I/O data. This technique enables the CPU and I/O devices to exchange data between the last level cache (LLC) and the DMA cache. The DMA cache technique significantly reduces unnecessary memory data moving operations and thereby improves I/O performance. We have evaluated the DMA cache technique by using memory reference traces of three applications (file copy, TPC-H and SPECweb2005) collected on a real machine. Experimental results show that the DMA cache technique improves performance by an average of 11.8% over all benchmarks for a 256KB DMA cache with sequential prefetching (up to 20.7% for a file-copy application) when using the snooping-cache scheme as baseline, and it performs best among all schemes we have evaluated. Moreover, we have observed an interesting phenomenon: the previously proposed shared-cache scheme, when compared with the DMA cache, also has a high cache hit rate but may decrease performance (by up to 8.8% for TPC-H) because of cache interference between accesses from the CPU and the DMA engine.

1. INTRODUCTION
I/O accesses are essential on contemporary computer systems, whether we load binary files from disks to memory or download files from the network. Moreover, many commercial workloads are I/O intensive. I/O bus and I/O device technologies have improved dramatically in the past decade. New I/O technologies, such as PCI-Express 2.0 [3], AMD's HyperTransport 3.0 [1] and Intel's CSI (QPI) [13], can provide bus bandwidths of over 20GB/s, which is very close to that of a DDR2/3 DRAM memory system. I/O device performance has also increased significantly, so that a RAID system can easily provide a bandwidth of 600MB/s and a 10Gb Ethernet interface can offer a bandwidth of 1.25GB/s. The DMA technique is used to relieve the processor from I/O processing by providing dedicated channels for the CPU and I/O devices to exchange I/O data. However, as the throughput of I/O devices grows rapidly, memory data moving operations have become critical for the DMA scheme and are now a performance bottleneck for I/O operations. We have investigated the DMA process and found that the CPU and the DMA engine have an explicit producer-consumer relationship and that memory only plays the role of a transient place for storing I/O data. If we replace memory with a high speed buffer (e.g., a cache) as this transient place, I/O performance may be improved significantly.
There are several proposals from about a decade ago for improving message passing performance in CC-NUMA systems, such as switch cache [19][12] and cache injection techniques [17][15][16]. In the past several years, a few researchers have realized the potential of improving I/O performance in this way. Iyer [11] proposed a chipset cache scheme to improve the performance of web servers, Bohrer et al. [18] proposed a cache injection technique, and Huggahalli et al. [10] proposed a direct cache access (DCA) scheme. Both of the latter techniques enable the DMA engine and the processor to exchange data within the processor cache. Leon et al. [14] show that cache injection can provide significant performance benefits but that blind cache injection can be harmful to application performance. In fact, these recent proposals can be classified as shared-cache schemes, which may suffer from interference between the processor and I/O devices.
In this paper, we propose a DMA cache scheme which enables the CPU and I/O devices to exchange data between the last level cache (LLC) and a DMA cache. The DMA cache scheme improves I/O performance significantly because it reduces both unnecessary memory data moving operations and shared-cache interference between accesses from the processor and the DMA engine. Furthermore, the DMA cache benefits from adopting a straightforward prefetching technique because it only serves I/O data, which has a regular access pattern. We have evaluated the DMA cache technique by using memory reference traces of three applications (file copy, TPC-H and SPECweb2005) collected on a real machine. Experimental results show that the DMA cache technique improves performance by an average of 11.8% over all benchmarks for a 256KB DMA cache with sequential prefetching when using the snooping-cache scheme as baseline, and the improvement is up to 20.7% for a file-copy application. The DMA cache scheme performs best among all schemes we have evaluated. Moreover, we have observed an interesting phenomenon: previously proposed shared-cache schemes (e.g., DCA [10]), when compared with the DMA cache, also have a high cache hit rate but may decrease performance (by up to 8.8% for TPC-H) because of cache interference between accesses from the CPU and the DMA engine.
This paper makes the following contributions:
• We reconsider the DMA mechanism and reveal its essence: the CPU and the DMA engine have an explicit producer-consumer relationship, and memory only plays the role of a transient place for I/O data transferred between the processor and I/O devices. Unnecessary and expensive memory copy operations can be eliminated if we replace memory with a high speed buffer for DMA.
• Based on these observations, we propose a DMA cache scheme which enables the CPU and I/O devices to exchange data between the last level cache (LLC) and a DMA cache directly.
• We have evaluated the DMA cache technique and several previously proposed schemes, such as the snooping-cache and shared-cache schemes. We find that the DMA cache technique performs best among all schemes.
• We have observed that the previously proposed shared-cache scheme suffers from cache interference between the CPU and the DMA engine. This phenomenon may decrease performance (by up to 8.8% for TPC-H). Moreover, it would be aggravated in multicore systems.
The remainder of the paper is organized as follows. In Section 2, we introduce the DMA mechanism and reconsider the role of memory in the DMA process. Section 3 presents the DMA cache scheme and several design issues. Section 4 describes the experimental setup, including benchmarks, trace collection and evaluation methodology. Section 5 compares the DMA cache's performance with that of several other schemes. In Section 6, we discuss several other issues about the DMA cache scheme. We summarize the paper in Section 7.

2. THE INTERACTION OF PROCESSOR, MEMORY AND I/O DEVICE IN DMA SCHEME
In this section, we introduce the DMA mechanism and reconsider the role of memory in the DMA process.
2.1. The Process of the DMA Scheme
Figure 1 and Figure 2 depict the interaction of processor, memory and I/O device in the DMA scheme. As shown in the figures, the interaction requires three data structures: the DMA buffer, the descriptor and the APP buffer. DMA buffers are usually scattered in the address space and their sizes vary. DMA descriptors are used to manage these discrete DMA buffers. Each descriptor includes a pointer to the DMA buffer's start address and a field recording the DMA buffer's size. Each descriptor also includes some fields for status information, such as the DMA buffer's owner (which could be the processor or the DMA engine).
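To make the descriptor layout concrete, the following is a minimal C sketch of what such a descriptor might contain. The structure and field names are illustrative assumptions, not taken from any particular device driver.

#include <stdint.h>

/* Hypothetical DMA descriptor layout, for illustration only.
 * Real drivers (e.g., for a NIC or an IDE controller) define their
 * own hardware-specific descriptor formats. */
struct dma_descriptor {
    uint64_t buf_addr;   /* physical start address of the DMA buffer */
    uint32_t buf_size;   /* size of the DMA buffer in bytes          */
    uint32_t status;     /* status bits, including the owner flag    */
};

#define DESC_OWNER_DMA   (1u << 0)  /* buffer currently owned by the DMA engine */
#define DESC_OWNER_CPU   (0u << 0)  /* buffer currently owned by the processor  */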

Fig. 1. The Processes of DMA Receiving Mechanism
Fig. 2. The Processes of DMA Transmitting Mechanism

Figure 1 illustrates the anatomy of the DMA receiving mechanism. The detailed steps are as follows:
Step 1: The processor side (i.e., the device driver) creates a descriptor for a DMA buffer.
Step 2: The processor side allocates a DMA buffer in memory and initializes the descriptor with the DMA buffer's start address, size and status information.
Step 3: The processor side informs the DMA engine of the descriptor's start address.
Step 4: The DMA engine reads the descriptor's content from memory.
Step 5: With the DMA buffer's start address and size information extracted from the descriptor, the DMA engine receives data from the I/O device and writes the data to the DMA buffer in memory.
Step 6: The owner status of the descriptor is modified to be the DMA engine after all I/O data is stored in the DMA buffer.
Step 7: After all I/O data is received, the DMA engine sends an interrupt to the processor to indicate completion of data receiving.
Step 8: The processor side handles the interrupt raised by the DMA engine and copies the received I/O data from the DMA buffer to the APP buffer. Then it frees the DMA buffer and modifies the owner status of the descriptor to be the processor. After all of this is done, it goes back to Step 2.
Figure 2 illustrates the anatomy of the DMA transmitting mechanism. The detailed steps are as follows:
Step 1: The processor side (i.e., the device driver) creates a descriptor for a DMA buffer.
Step 2: The processor side allocates a DMA buffer in memory and initializes the descriptor with the DMA buffer's start address, size and status information.
Step 3: The processor side copies the data to be transmitted from the APP buffer to the DMA buffer.
Step 4: The processor side informs the DMA engine of the descriptor's start address.
Step 5: The DMA engine reads the descriptor's content from memory.
Step 6: With the DMA buffer's start address and size information extracted from the descriptor, the DMA engine reads data from the DMA buffer and sends the data to the I/O device.
Step 7: The DMA engine modifies the owner bit of the descriptor to be the DMA engine after all data has been transmitted successfully.
Step 8: The DMA engine sends an interrupt to the processor to indicate completion of the data transmission. The processor side frees the DMA buffer and modifies the status information of the descriptor.
2.2. Memory: A Transient Place for I/O Data
From the DMA receiving and transmitting mechanisms described above, we can see that the essence of the DMA scheme is that the processor and the I/O device have an explicit producer-consumer relationship. For instance, the DMA engine is the producer and the processor is the consumer in the DMA receiving mechanism, while the roles are switched in the DMA transmitting mechanism. Moreover, once data is produced, it is used once and only once. We can also see that memory only plays the role of a transient place for storing I/O data transferred between the processor and the I/O device. In the DMA receiving process, the data source is the DMA engine (or the I/O device), which writes the source data (I/O data) to the DMA buffer (see Step 5 in Figure 1). To provide I/O data to user applications, the device driver performs an additional memory copy operation which copies the I/O data from the DMA buffer to the eventual destination, the APP buffer (see Step 8 in Figure 1). Once the I/O data is copied to the eventual destination (the APP buffer), the DMA buffer is freed. The device driver will allocate a new DMA buffer for a new DMA receiving operation; however, the start address of the new DMA buffer is usually different from that of the previously freed one. The DMA transmitting mechanism also requires an additional memory copy operation, in Step 3 of Figure 2.
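As a concrete illustration of the receive path, the following C sketch shows the driver-side portion of these steps (Steps 1-3 and Step 8), using the hypothetical dma_descriptor structure sketched in Section 2.1. The helper functions (dma_alloc_buffer, dma_free_buffer, virt_to_phys, phys_to_virt, write_device_register) and the register index DESC_ADDR_REG are placeholders, not the API of any real driver; the point is that Step 8 performs exactly the extra memory copy discussed above.

#include <string.h>
#include <stdint.h>

/* Placeholder helpers; real drivers use OS- and device-specific APIs. */
extern void    *dma_alloc_buffer(uint32_t size);
extern void     dma_free_buffer(void *buf);
extern uint64_t virt_to_phys(void *addr);
extern void    *phys_to_virt(uint64_t addr);
extern void     write_device_register(int reg, uint64_t value);

#define DESC_ADDR_REG 0      /* hypothetical device register index */
#define DMA_BUF_SIZE  4096   /* hypothetical DMA buffer size       */

/* Steps 1-3: allocate a DMA buffer, fill in the descriptor and hand its
 * physical address to the DMA engine. Steps 4-7 happen in hardware. */
void setup_receive(struct dma_descriptor *desc)
{
    void *dma_buf  = dma_alloc_buffer(DMA_BUF_SIZE);
    desc->buf_addr = virt_to_phys(dma_buf);
    desc->buf_size = DMA_BUF_SIZE;
    desc->status   = DESC_OWNER_CPU;
    write_device_register(DESC_ADDR_REG, virt_to_phys(desc));
}

/* Step 8: the interrupt handler copies the I/O data from the DMA buffer
 * to the APP buffer and recycles the DMA buffer. This memcpy is the
 * extra memory copy that the DMA cache scheme aims to serve from
 * on-chip storage instead of DRAM. */
void rx_interrupt_handler(struct dma_descriptor *desc, void *app_buf)
{
    void *dma_buf = phys_to_virt(desc->buf_addr);
    memcpy(app_buf, dma_buf, desc->buf_size);
    dma_free_buffer(dma_buf);
    desc->status = DESC_OWNER_CPU;
}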
Based on these findings, we can conclude that I/O performance should improve significantly if the overhead of these additional memory copy operations can be reduced.
2.3. Related Work
Snooping-Cache Scheme: Almost all prevalent processors adopt the snooping-cache scheme for DMA. When the DMA engine receives data from an I/O device and writes the data to the DMA buffer, it also sends snoop requests to the data cache to invalidate the cache blocks whose physical addresses fall within the region of the DMA buffer. The snooping-cache scheme has several advantages: (1) it requires almost no modification of the data cache, and (2) it is easy to solve the data coherency problem (Figure 4 illustrates a typical cache state diagram of the snooping-cache scheme for I/O data coherency). Its shortcoming is that it requires additional memory copy operations, which degrade I/O performance.
Shared-Cache Scheme: Some studies have focused on reducing the overhead of the additional memory copy operations for I/O data transfer. The key idea of these proposals is to adopt a shared cache through which the DMA engine and the processor exchange I/O data. Iyer [11] proposed a chipset cache scheme to improve the performance of web servers. The chipset cache handles all of the processor's last level cache misses as well as the DMA memory references; in effect, it plays the role of a last level cache shared by the processor and the DMA engine. Huggahalli et al. [10] proposed a direct cache access (DCA) scheme that enables the DMA engine and the processor to exchange data within the processor's cache. Their study focuses on network traffic and shows that the DCA scheme has poor performance for applications with intensive disk I/O traffic (e.g., TPC-C). Other studies proposed cache injection techniques [17][15][16][18], which use new instructions and modifications of the processor cache so that I/O devices can inject I/O data into the processor cache directly. The cache injection techniques can also be classified as shared-cache schemes. Although the shared-cache scheme may be beneficial in reducing the overhead of additional memory copy operations, it can cause cache interference between the processor and the DMA engine. Leon et al. [14] show that blind cache injection can be harmful to application performance. Our evaluations show that this interference can degrade I/O performance significantly when the size of I/O data increases from 1KB to over 100KB. This may be the main reason that DCA [10] exhibits good performance only for network I/O intensive applications rather than disk I/O intensive applications (more details are discussed in Section 5). Moreover, the interference within the shared cache would be aggravated in multicore systems.
Other Schemes: It should be noted that the M5 simulator [8] adopts an I/O cache in full-system (FS) mode. However, the I/O cache in M5 exists to make device accesses work in coherent space. There are also software efforts for improving I/O performance, such as the Zero-Copy technique [9]. The Zero-Copy technique avoids one memory copy from the kernel buffer to the user buffer. However, even with the Zero-Copy technique, I/O data still has to be written to memory first.

3. THE DMA CACHE SCHEME
Based on the observations mentioned above, we propose a DMA cache scheme to improve I/O performance by reducing memory copy operations.
3.1. The Principle of the DMA Cache

Fig. 3. Organization of DMA Cache System

The DMA cache is a dedicated cache for buffering I/O data. Like the shared-cache scheme, it is advantageous in reducing the overhead of unnecessary memory copy operations. Unlike the shared-cache scheme, the DMA cache technique is able to avoid interference between the processor and the DMA engine. Furthermore, the DMA cache benefits from adopting a straightforward prefetching technique because it only serves I/O data, which has a very regular reference pattern. Thus, the CPU and I/O devices can exchange data between the DMA cache and the last level cache (LLC) to improve I/O performance. Figure 3 illustrates the organization of the DMA cache scheme. In this scheme, the DMA cache is a dedicated cache placed at the same level as the processor's last level cache (LLC) in the memory hierarchy. A data path exists between the processor cache and the DMA cache for data exchange and coherence maintenance. In this paper, we mainly investigate the DMA cache system on a uniprocessor platform.
3.2. Data Coherency Between DMA Cache and Processor Cache
Usually, adding a new cache raises a data coherency problem. The DMA cache scheme solves this problem as follows. Both the processor cache and the DMA cache are accessed upon each memory reference (whether it is issued by the CPU or the DMA engine). Figure 4 and Figure 5 depict the state diagrams of the processor cache and the DMA cache in the DMA cache scheme, respectively. As shown in the figures, the state diagram of the processor cache is the same as that of a typical uniprocessor with respect to I/O data coherency. The state diagram of the DMA cache is similar to that of the processor cache except for the driving sources. In this way, the DMA cache scheme ensures that a given block never resides in both caches at the same time.

Fig. 4. CPU Cache State Diagram
Fig. 5. DMA Cache State Diagram

3.3. Data Path in the DMA Cache Scheme
Since the cache coherency mechanism described above ensures that data can reside in only one cache at any time, there are only three cases for any reference: 1) hit in the processor cache and miss in the DMA cache; 2) miss in the processor cache and hit in the DMA cache; 3) miss in both caches. For case (3), the source of the reference determines which cache handles the miss. For instance, if a DMA reference misses in both caches, the DMA cache is responsible for the miss. For cases (1) and (2), a cache block migration operation is required. For instance, in case (1), where a DMA reference misses in the DMA cache and hits in the processor cache, the hit block is migrated from the processor cache to the DMA cache and then invalidated in the processor cache. It should be noted that we adopt the write-allocate and write-back policies for both caches in this paper. Thus, dirty blocks in either cache are not written back to memory until they are selected for replacement. However, other write policies (e.g., write-through) could also be adopted for the DMA cache, and we discuss this in Section 6.
3.4. The Prefetching Mechanism of the DMA Cache
I/O data has a very regular access pattern, being linear within a DMA buffer. Thus, even though the DMA cache adopts a straightforward sequential prefetching technique, it is able to improve performance significantly. We adopt a sequential prefetching scheme [20] with a prefetching degree of four cache blocks for the DMA cache. For case (3) described in the previous subsection, i.e., a DMA request that misses in both the DMA cache and the processor cache, the sequential prefetcher fetches four blocks into the DMA cache from either memory or the processor cache (the latter three prefetched blocks may reside in memory, the processor cache or the DMA cache). Prefetched blocks that hit in the processor cache are migrated to the DMA cache. The DMA cache selects some blocks (at most four) for replacement (some prefetched blocks may already reside in the DMA cache). If the replaced blocks are dirty, the DMA cache performs write-back operations to update the dirty blocks in memory.
3.5. Replacement Policy of the DMA Cache
We use an LRU-like replacement policy in which DMA cache blocks are selected for replacement with the following priority: 1) first, invalid blocks are selected; 2) second, the least recently used clean block is selected; 3) finally, the least recently used dirty block is selected. However, as described in Section 2.2, the essence of the DMA scheme is that the processor and the I/O device have an explicit producer-consumer relationship: once data is produced, it is used once and only once. This implies that blocks in the DMA cache will be reused at most once. While other replacement policies may be more efficient for the DMA cache, a more detailed study of replacement policies is beyond the scope of this paper.
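To make the replacement priority concrete, the following is a minimal, self-contained C sketch of victim selection within one set of the DMA cache. It assumes per-line valid/dirty bits and an LRU age counter; the data layout is an illustrative assumption, since the actual design is implemented in synthesizable RTL.

#include <stddef.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    bool     dirty;
    uint32_t lru_age;   /* larger value means less recently used */
} cache_line;

/* Pick a victim way within one set of the DMA cache, following the
 * priority of Section 3.5: 1) any invalid line, 2) the least recently
 * used clean line, 3) the least recently used dirty line. */
size_t select_victim(const cache_line set[], size_t ways)
{
    /* Priority 1: an invalid line needs no write-back at all. */
    for (size_t w = 0; w < ways; w++)
        if (!set[w].valid)
            return w;

    /* Priority 2: the least recently used clean line. */
    size_t victim = 0;
    bool found_clean = false;
    for (size_t w = 0; w < ways; w++) {
        if (!set[w].dirty &&
            (!found_clean || set[w].lru_age > set[victim].lru_age)) {
            victim = w;
            found_clean = true;
        }
    }
    if (found_clean)
        return victim;

    /* Priority 3: fall back to the least recently used dirty line,
     * which will require a write-back to DRAM. */
    victim = 0;
    for (size_t w = 1; w < ways; w++)
        if (set[w].lru_age > set[victim].lru_age)
            victim = w;
    return victim;
}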

4. EXPERIMENTAL SETUP
This section presents the experimental setup of our evaluations: benchmarks, trace collection and evaluation methodology.
4.1. Benchmarks
The benchmarks are described as follows:
• File Copy: A real application that copies one 400MB file on an Ext3 file system on a Linux platform. The file-copy benchmark consists of nearly identical numbers of disk read and write operations.
• TPC-H [5]: A decision support benchmark. It consists of a suite of queries and concurrent data modifications. The queries and data reside in a database, which in our case is Oracle 11g with 10GB of data. The TPC-H benchmark is a disk-read intensive application.
• SPECweb2005 [4]: A SPEC benchmark for evaluating the performance of World Wide Web servers. It is used for measuring a system's ability to act as a web server. In SPECweb2005, the web server machine performs network I/O and disk I/O operations simultaneously.
4.2. Trace Collection
We run all the benchmarks on a real machine, configured with an AMD Opteron 2.2GHz processor and 2GB of dual-channel DDR memory. We use the HMTT system [7], a tool plugged into an idle DIMM slot to snoop memory reference addresses, to collect memory reference traces of these benchmarks. To distinguish whether a memory reference is issued by the DMA engine or the processor, we have modified the device drivers of the hard disk controller and the network interface card (NIC) on the Linux platform. When the modified drivers allocate and release DMA buffers (see details in Section 2.1), they record the start address, size and owner information of each DMA buffer. Meanwhile, they send synchronization tags to the HMTT system. When the HMTT system receives these synchronization tags, it injects tags (i.e., DMA BEGIN TAG and DMA END TAG) into the physical memory trace to indicate that the memory references between the two tags that fall within the DMA buffer's address region are DMA memory references initiated by the DMA engine. The status information of DMA requests, such as start address, size and owner, is stored in a reserved region of physical memory and is dumped into a file after memory trace collection is completed. Thus, there is no interference from additional I/O accesses. We can then differentiate DMA memory references from processor memory references by merging the physical memory trace with the status information of the DMA requests (a sketch of this classification is given at the end of this section). For the file-copy benchmark, we copy one 400MB file and collect 42M memory references. For the TPC-H and SPECweb2005 benchmarks, we select a typical phase of 40M memory references. Since one reference is 64 bytes, a memory trace of 40M references corresponds to accessing 2.6GB of memory data. We believe this is sufficient for studying the behavior of I/O accesses.
4.3. Evaluation Methodology
We adopt an FPGA platform to emulate the cache system. Our cycle-accurate experimental system consists of a last level cache, a DDR2 controller from a commercial processor (referred to as Lxxxx for blind review in this paper), a DDR2 DIMM model from Micron Technology [2] and our proposed DMA cache. All emulated caches work at a frequency of 2GHz and the DDR2 controller works at 333MHz. The whole experimental system is implemented in synthesizable RTL code. We use an FPGA based RTL emulation accelerator, the Xtreme system [6] from Cadence, to accelerate the above-mentioned experimental system. More detailed emulation parameters are listed in Table 1.
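Returning to the trace-merging step described in Section 4.2, the following C sketch shows how a merged trace record might be classified as a DMA or CPU reference: a reference counts as DMA traffic if it falls between a DMA BEGIN/END tag pair and lies inside one of the recorded DMA buffer regions. The structure layouts are invented for illustration; the real HMTT tooling defines its own record formats.

#include <stdbool.h>
#include <stdint.h>

struct trace_record {
    uint64_t phys_addr;          /* 64-byte-aligned physical address        */
    bool     inside_dma_window;  /* between DMA BEGIN TAG and DMA END TAG   */
};

struct dma_buffer_info {
    uint64_t start;              /* DMA buffer start address (from driver)  */
    uint64_t size;               /* DMA buffer size in bytes (from driver)  */
};

/* A reference is attributed to the DMA engine only if it occurred inside
 * a tagged DMA window and targets one of the recorded DMA buffers. */
bool is_dma_reference(const struct trace_record *r,
                      const struct dma_buffer_info *bufs, int nbufs)
{
    if (!r->inside_dma_window)
        return false;
    for (int i = 0; i < nbufs; i++)
        if (r->phys_addr >= bufs[i].start &&
            r->phys_addr < bufs[i].start + bufs[i].size)
            return true;
    return false;
}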

TABLE 1. EMULATION PARAMETERS

Parameter        CPU Cache                  DMA Cache
Size             2MB                        256KB / 128KB / 64KB / 32KB
Cache Line       64 Bytes                   64 Bytes
Way              8-Way                      4-Way
Write Policy     Write Allocate,            Write Allocate,
                 Write Back                 Write Back
Replace Policy   Random                     1st: select invalid blocks;
                                            2nd: select least used clean block;
                                            3rd: select least used dirty block
Access Time      20 Cycles                  20 Cycles

We have evaluated four schemes: (1) snooping-cache, (2) shared-cache, (3) DMA cache without prefetching and (4) DMA cache with prefetching. To further study the impact of DMA cache capacity, we have also evaluated DMA caches with capacities of 256KB, 128KB, 64KB and 32KB.

5. EXPERIMENTAL RESULTS
We present our experimental results in this section.
5.1. Trace Characterization
Table 2 shows the percentage of DMA memory references in the various benchmarks. From Table 2 we can see that the file-copy benchmark has nearly the same percentage of DMA read references (15.4%) and DMA write references (15.6%), and the two kinds of DMA memory references together account for about 31%. For the TPC-H benchmark, the percentage of all DMA memory references is about 20%. The percentage of DMA write references (19.9%) is about 200 times that of DMA read references (0.1%) because the dominant I/O operation in TPC-H is transferring data from disk to memory (i.e., DMA write requests). For SPECweb2005, the percentage of DMA memory references is only 1.0%. Because the size of network I/O requests (each comprising a number of DMA memory references) is quite small, the processor spends most of its time handling these small requests rather than moving bulk I/O data.

TABLE 2. PERCENTAGE OF MEMORY REFERENCE TYPE

Reference Type   File Copy   TPC-H    SPECweb2005
CPU Read         45%         60%      75%
CPU Write        24%         20%      24%
DMA Read         15.4%       0.1%     0.76%
DMA Write        15.6%       19.9%    0.23%

Table 3 and Figure 6 depict the average size of DMA requests and the cumulative distributions of DMA request sizes for the three benchmarks, respectively (one DMA request comprises a number of 64-byte DMA memory references). For the file-copy and TPC-H benchmarks, all DMA write requests are smaller than 256KB and the percentage of requests of size 128KB is about 76%. The average sizes of DMA write requests are about 110KB and 121KB for file-copy and TPC-H respectively. For SPECweb2005, all DMA requests issued by the NIC are smaller than 1.5KB because the maximum transmission unit (MTU) of a Gigabit Ethernet frame is only 1518 bytes. The DMA requests issued by the IDE controller for SPECweb2005 are also very small, about 10KB on average. Next, we will show that these characteristics have a significant impact on the performance of the various schemes.

TABLE 3. AVERAGE SIZE OF VARIOUS TYPES OF DMA REQUESTS

Benchmark      Request Type     Percentage   Average Size
File Copy      IDE DMA Read     49.9%        393KB
               IDE DMA Write    50.1%        110KB
TPC-H          IDE DMA Read     0.5%         19KB
               IDE DMA Write    99.5%        121KB
SPECweb2005    IDE DMA Read     24.4%        10KB
               IDE DMA Write    1.7%         7KB
               NIC DMA Read     52%          0.3KB
               NIC DMA Write    21.9%        0.16KB

Fig. 6. Cumulative Distribution of DMA Request Size

5.2. Performance Analysis
For the sake of comparing the various schemes thoroughly and clearly in such a cycle-accurate experimental system with commercial components (e.g., the DDR2 controller), we define a few metrics (listed in Table 4) for evaluating our experiments.

TABLE 4. METRICS DESCRIPTION

Metric                      Description
Normalized Speedup          Uses total emulated cycles of the snooping-cache scheme as baseline
DMA Read Memory             Cycles consumed for accessing DDR2 DRAM in the DMA read process
DMA Write Memory            Cycles consumed for accessing DDR2 DRAM in the DMA write process
DMA Read Cache              Cycles consumed for accessing cache in the DMA read process
DMA Write Cache             Cycles consumed for accessing cache in the DMA write process
CPU Read Memory             Cycles consumed for accessing DDR2 DRAM in the CPU read process
CPU Write Memory            Cycles consumed for accessing DDR2 DRAM in the CPU write process
CPU Read Cache              Cycles consumed for accessing cache in the CPU read process
CPU Write Cache             Cycles consumed for accessing cache in the CPU write process
DMA Write & CPU Read Hit    A cache block is written by a DMA request and then read by a CPU request,
                            i.e., the DMA engine is the producer and the CPU is the consumer
CPU Write & DMA Read Hit    A cache block is written by a CPU request and then read by a DMA request,
                            i.e., the CPU is the producer and the DMA engine is the consumer

Figure 7 illustrates the normalized speedup of the various schemes (the baseline is the total cycle count of the snooping-cache scheme). We can see that the DMA cache scheme with a size of 256KB and prefetching outperforms all other schemes. The performance of the file-copy benchmark is improved by 20.7% due to its large portion of DMA memory references (31%). For the TPC-H benchmark, which has 20% DMA memory references, the speedup is about 8.0%. Although SPECweb2005 has only about 1% DMA memory references, its performance is improved by 6.5%.

Fig. 7. Normalized Speedup

Since we have found that the CPU and the DMA engine have a producer-consumer relationship, we investigate the cache hit rates of the shared-cache scheme and the DMA cache scheme in the following two cases.
• Case 1: DMA engine is producer and CPU is consumer
Figure 8 illustrates the hit rates of the ”DMA Write & CPU Read Hit” metric (see Table 4) for the various schemes. For the DMA cache scheme, it should be noted that the DMA engine only writes I/O data into the DMA cache. Interestingly, Figure 7 shows that the shared-cache scheme decreases performance for the file-copy (-6.3%) and TPC-H (-8.8%) benchmarks, even though it exhibits a higher ”DMA Write & CPU Read Hit Rate” than the DMA cache scheme with a size of 128KB.

Fig. 8. DMA Write & CPU Read Hit Rate

We further investigate the reason for this paradox by comparing the shared-cache scheme with the snooping-cache scheme. According to Step 5 of the DMA receiving mechanism in Figure 1, the DMA engine receives data from the I/O device and writes the data to the DMA buffer in memory. In the snooping-cache scheme, all data is written into memory directly. In this situation, accessing DDR2 DRAM achieves quite good performance because there are few row buffer conflicts [21]: the I/O data is written into a contiguous address region. In the shared-cache scheme, however, the I/O data is injected into the cache, incurring cache block replacement. If the replaced blocks are dirty, the cache has to (1) write these blocks back to memory first, (2) refill the blocks targeted by the DMA request into the cache, and finally (3) update the refilled blocks with the DMA-written data. Experimental results show that replaced dirty blocks account for about 70% in the file-copy and TPC-H benchmarks (see Figure 11). Unfortunately, because the replaced dirty blocks are usually non-contiguous, the DDR DRAM may exhibit poor performance due to many row buffer conflicts. Furthermore, the shared-cache replacement has a side effect: the contiguous I/O data may also be written back to memory non-contiguously, because it may later be selected for replacement at random. Moreover, the shared-cache scheme may cause cache thrashing. In Figure 9, which illustrates the breakdown of the total emulated cycles of the various schemes, the portion of the ”DMA Write Memory” metric (see Table 4) in the shared-cache scheme accounts for about 20.3%. In summary, the negative impact of the high portion of ”DMA Write Memory” cycles overwhelms the positive impact of the high ”DMA Write & CPU Read Hit Rate”.

Fig. 9. Breakdown of Normalized Total Cycles

Although the DMA caches are much smaller than the shared cache, they can also exhibit high hit rates. For instance, from Figure 8, the DMA cache with a size of 256KB exhibits an even higher ”DMA Write & CPU Read Hit Rate” than the shared-cache scheme. This is because the DMA cache has no interference between processor requests and DMA requests.
• Case 2: CPU is producer and DMA engine is consumer
Figure 10 illustrates the ”CPU Write & DMA Read Hit” rate (see Table 4) for the various schemes. It should be noted that the CPU always writes data into the processor cache in all schemes. From the figure, the DMA cache scheme exhibits the highest ”CPU Write & DMA Read Hit Rate”, but the value is very small: only 0.2%, 2.1% and 7.7% for file copy, TPC-H and SPECweb2005, respectively. Since the CPU writes data into the processor cache, this implies that the reuse behavior of the processor cache is hard to exploit in this direction. Thus, our proposed DMA cache scheme cannot gain more benefit in this case unless an alternative mechanism is adopted in which the CPU writes I/O data directly into the DMA cache. However, such a mechanism would complicate the cache controller.

Fig. 10. CPU Write & DMA Read Hit Rate

5.3. Performance of Various DMA Cache Configurations
In this subsection, we present the performance evaluation of DMA caches with various configurations, i.e., different sizes (256KB, 128KB, 64KB and 32KB) and with or without prefetching. Figure 7 and Figure 9 illustrate the normalized speedup and the breakdown of normalized emulated cycles of the DMA caches with these configurations. From these two figures, the DMA caches smaller than 128KB (i.e., 64KB and 32KB) exhibit even worse performance than the baseline (i.e., the snooping-cache scheme) for the file-copy and TPC-H benchmarks. From Figure 8, the ”DMA Write & CPU Read Hit Rates” of the 32KB and 64KB DMA caches decrease sharply for file-copy and TPC-H, which feature large DMA request sizes (an average of more than 110KB, see Table 3). Figure 6 also shows that over 80% of DMA requests are larger than 64KB. This characteristic leads to frequent replacement of dirty blocks in the small DMA caches. Figure 11 shows that replaced dirty blocks account for over 90% and 60% for file-copy and TPC-H, respectively. However, as mentioned before, these replaced dirty blocks are non-contiguous, leading to poor DDR memory performance due to many row buffer conflicts. Because these replaced blocks have not yet been referenced by the CPU, the ”DMA Write & CPU Read Hit Rates” also decrease, causing the portion of ”CPU Read Memory” cycles to increase significantly (see Figure 9). In SPECweb2005, this phenomenon disappears because the average size of its DMA requests is less than 10KB (see Table 3 and Figure 6).

Fig. 11. The Percentage of DMA Writes Causing Dirty Block Replacement

Next, we analyze the impact of the prefetching technique on the DMA cache's performance. As shown in Figure 7, the DMA caches with prefetching almost always outperform those without prefetching in the same configurations. In fact, I/O data has a very regular access pattern, being linear within a DMA buffer. Thus, even though the DMA cache adopts a straightforward sequential prefetching technique [20], it improves performance significantly. For the file-copy benchmark, the 256KB DMA cache with prefetching outperforms the one without prefetching by about 8.3%. Figure 9 shows that the portion of ”DMA Read Memory” cycles in the 256KB DMA cache with prefetching is reduced to only one third of that of the DMA cache without prefetching. For TPC-H and SPECweb2005, the prefetching effect is not obvious. TPC-H is an exception in that its 64KB and 32KB DMA caches with prefetching exhibit worse performance than the corresponding DMA caches without prefetching. The possible reasons are that: (1) the percentage of DMA read requests is very small, only 0.5% (see Table 3), and (2) when the DMA cache is small, DMA write requests replace the prefetched blocks, leading to many useless prefetching operations. Figure 12 illustrates the prefetching effect. From the figure, we can see that the straightforward sequential prefetching has high accuracy: the percentage of prefetched blocks referenced by DMA read requests is over 80% for the 256KB and 128KB DMA cache schemes. The prefetching accuracy for the file-copy benchmark is impressively high, nearly reaching 100%.

Fig. 12. The Percentage of Valid Prefetched Blocks

6. DISCUSSION
This paper reveals the essence of the DMA mechanism and proposes the DMA cache scheme. Although the experimental results show that the DMA cache scheme is effective, there are still several implementation issues to consider when adopting a DMA cache.
The power issue should be considered when implementing a DMA cache, because our protocol requires the DMA cache to be accessed for every snoop request. However, the coherence protocol could be improved so that the DMA cache is snooped only when it has, or may have, the data. Moreover, the DMA cache could be power managed so that it is powered off when not in use.
The replacement policy could also be improved. As we have emphasized, the CPU and the DMA engine have an explicit producer-consumer relationship, and once data is produced it is used once and only once. Thus, an elaborate replacement policy could further improve I/O performance. For instance, as shown in Figure 9, the portion of ”DMA Write Memory” cycles could be reduced significantly by adopting a write-through policy, i.e., writing the I/O data into memory and into the DMA cache simultaneously, with the cached copy kept in the clean rather than the dirty state.
The DMA cache capacity is a sensitive parameter that depends on the application. As discussed in Section 5, if the DMA cache capacity is smaller than the average size of DMA requests, it decreases the system's performance. On our experimental machine (AMD Opteron, Linux platform), the DMA request size statistics show average sizes of about 110KB for disk and 1.5KB for network. Therefore, a DMA cache of 128KB or 256KB should be sufficient for most contemporary computer systems.
We collected memory traces from a single-core processor machine and found that the shared-cache scheme decreases I/O performance because of interference between the CPU and the DMA engine. This interference would become worse for I/O intensive applications on multicore systems.

7. CONCLUSIONS AND FUTURE WORK
In this paper, we reconsider the DMA mechanism and reveal its essence: the CPU and the DMA engine have an explicit producer-consumer relationship, and memory only plays the role of a transient place for I/O data. Thus, I/O performance can be improved by reducing unnecessary memory access operations. We propose the DMA cache technique, which enables the CPU and I/O devices to exchange data between the last level cache (LLC) and a DMA cache. We collected memory traces of real applications from a real machine and evaluated the DMA cache scheme as well as previously proposed schemes on an FPGA emulation platform. Experimental results show that the DMA cache scheme outperforms all other schemes. Since the essence of the DMA mechanism has been revealed, there are several other possible ways to improve I/O performance, such as improving the shared-cache scheme to achieve the same effect as the DMA cache scheme, which will be explored elsewhere.

REFERENCES
[1] HyperTransport. http://www.hypertransport.org.
[2] Micron Technology Inc. http://www.micron.com/.
[3] PCI-Express 2.0. http://www.pcisig.com/home.
[4] SPECweb2005. http://www.spec.org/web2005.
[5] The TPC Benchmark H (TPC-H). http://www.tpc.org/tpch/default.asp.
[6] Xtreme system. http://www.cadence.com/products/fv/xtreme series/Pages/default.aspx.
[7] Y. Bao, M. Chen, Y. Ruan, L. Liu, J. Fan, Q. Yuan, B. Song, and J. Xu. HMTT: a platform independent full-system memory trace monitoring system. In SIGMETRICS '08: Proceedings of the 2008 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 229-240, New York, NY, USA, 2008.
[8] N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt. The M5 simulator: Modeling networked systems. IEEE Micro, 26(4):52-60, 2006.
[9] U. Drepper. The need for asynchronous, zero-copy network I/O. Red Hat Inc. white paper, 2006.
[10] R. Huggahalli, R. Iyer, and S. Tetrick. Direct cache access for high bandwidth network I/O. In ISCA '05: Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 50-59, Washington, DC, USA, 2005.
[11] R. Iyer. Performance implications of chipset caches in web servers. In ISPASS '03: Proceedings of the 2003 IEEE International Symposium on Performance Analysis of Systems and Software, pages 176-185, Washington, DC, USA, 2003.
[12] R. Iyer and L. N. Bhuyan. Switch cache: A framework for improving the remote memory access latency of CC-NUMA multiprocessors. In HPCA '99: Proceedings of the 5th International Symposium on High Performance Computer Architecture, page 152, Washington, DC, USA, 1999. IEEE Computer Society.
[13] D. Kanter. The common system interface: Intel's future interconnect. Technical report.
[14] E. A. Leon, K. B. Ferreira, and A. B. Maccabe. Reducing the impact of the memory wall for I/O using cache injection. In HOTI '07: Proceedings of the 15th Annual IEEE Symposium on High-Performance Interconnects, pages 143-150, Washington, DC, USA, 2007. IEEE Computer Society.
[15] A. Milenkovic and V. Milutinovic. Cache injection on bus based multiprocessors. In SRDS '98: Proceedings of the 17th IEEE Symposium on Reliable Distributed Systems, page 341, Washington, DC, USA, 1998. IEEE Computer Society.
[16] A. Milenkovic and V. M. Milutinovic. Cache injection: A novel technique for tolerating memory latency in bus-based SMPs. In Euro-Par '00: Proceedings of the 6th International Euro-Par Conference on Parallel Processing, pages 558-566, London, UK, 2000. Springer-Verlag.
[17] V. M. Milutinovic and G. A. Sheaffer. The cache injection/cofetch architecture: initial performance evaluation. In MASCOTS '97: Proceedings of the Fifth International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, pages 63-64, Haifa, 1997. IEEE Computer Society.
[18] P. Bohrer, R. Rajamony, and H. Shafi. Method and apparatus for accelerating input/output processing using cache injections. US Patent No. US 6,711,650 B1, 2004.
[19] M. Swanson, R. Kuramkote, L. Stoller, and T. Tateyama. Message passing support in the Avalanche widget. Technical Report UUCS-96-002, 1996.
[20] S. P. Vanderwiel and D. J. Lilja. Data prefetch mechanisms. ACM Computing Surveys, 32(2):174-199, 2000.
[21] Z. Zhang, Z. Zhu, and X. Zhang. A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality. In MICRO 33: Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, pages 32-41, New York, NY, USA, 2000. ACM.