DMA Cache: An On-Chip Buffer for Improving I/O Performance

Dan Tang1,2, Yungang Bao1, Weiwu Hu1, Mingyu Chen1
1 Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
2 Graduate School of Chinese Academy of Sciences, Beijing, China
{tangdan,hww}@ict.ac.cn, {baoyg,cmy}@ncic.ac.cn

Abstract

For decades, the CPU and I/O devices have exchanged data using the direct memory access (DMA) technique. However, as the bandwidth requirements of I/O devices grow rapidly, memory data moving operations have become critical to the DMA scheme and a bottleneck for I/O performance. In this paper, we propose a DMA cache technique that adopts a dedicated cache (called the DMA cache) for buffering I/O data. This technique enables the CPU and I/O devices to exchange data between the last level cache (LLC) and a DMA cache. The DMA cache technique significantly reduces unnecessary memory data moving operations and thereby improves I/O performance. We have evaluated the DMA cache technique using memory reference traces of three applications (file copy, TPC-H, and SPECweb2005) collected on a real machine. Experimental results show that the DMA cache technique improves performance by an average of 11.8% across all benchmarks for a 256KB DMA cache with sequential prefetching (up to 20.7% for a file-copy application) when using snooping-cache as the baseline, and that it performs best among all the schemes we evaluated. Moreover, we have observed an interesting phenomenon: the previously proposed shared-cache scheme, although it also achieves a high cache hit rate, may decrease performance compared with the DMA cache (by up to 8.8% for TPC-H) because of cache interference between accesses from the CPU and the DMA engine.

1. INTRODUCTION

I/O accesses are essential on contemporary computer systems, whether we load binary files from disk into memory or download files over the network. Moreover, many commercial workloads are I/O intensive. I/O bus and I/O device technologies have improved dramatically in the past decade. New I/O bus technologies, such as PCI-Express 2.0 [3], AMD's HyperTransport 3.0 [1], and Intel's CSI (QPI) [13], can provide bus bandwidths of over 20GB/s, which is very close to that of a DDR2/3 DRAM memory system. I/O device performance has also increased significantly: a RAID system can easily provide a bandwidth of 600MB/s, and 10Gb Ethernet offers a bandwidth of 1.25GB/s.

The DMA technique relieves the processor of I/O processing by providing dedicated channels through which the CPU and I/O devices exchange I/O data. However, as the throughput of I/O devices grows rapidly, memory data moving operations have become critical to the DMA scheme and a performance bottleneck for I/O operations. We have investigated the DMA process and found that the CPU and the DMA engine have an explicit producer-consumer relationship, with memory playing only the role of a transient staging area for I/O data. If we replace memory with a high-speed buffer (e.g., a cache) as this transient staging area, I/O performance may improve significantly.

There were several proposals for improving message passing performance in CC-NUMA systems about a decade ago, such as switch cache [19][12] and cache injection techniques [17][15][16]. In the past several years, a few researchers have recognized this potential for improving I/O performance.
Iyer [11] proposed a chipset cache scheme to improve the performance of web servers, Bohrer et al. [18] proposed a cache injection technique, and Huggahalli et al. [10] proposed a direct cache access (DCA) scheme. These techniques enable the DMA engine and the processor to exchange data within the processor cache. Edgar et al. [14] showed that cache injection can provide significant performance benefits, but that blind cache injection can be harmful to application performance. In fact, these recent proposals can all be classified as shared-cache schemes, which may suffer from interference between the processor and I/O devices.

In this paper, we propose a DMA cache scheme that enables the CPU and I/O devices to exchange data between the last level cache (LLC) and a DMA cache. The DMA cache scheme improves I/O performance significantly because it both reduces unnecessary memory data moving operations and largely eliminates the shared-cache interference between accesses from the processor and the DMA engine. Furthermore, the DMA cache benefits from a straightforward prefetching technique because it serves only I/O data, which has a regular access pattern.

We have evaluated the DMA cache technique using memory reference traces of three applications (file copy, TPC-H, and SPECweb2005) collected on a real machine. Experimental results show that the DMA cache technique improves performance by an average of 11.8% across all benchmarks for a 256KB DMA cache with sequential prefetching when using snooping-cache as the baseline, with an improvement of up to 20.7% for a file-copy application. The DMA cache scheme performs best among all the schemes we evaluated. Moreover, we have observed an interesting phenomenon: previously proposed shared-cache schemes (e.g., DCA [10]), although they also achieve a high cache hit rate, may decrease performance compared with the DMA cache (by up to 8.8% for TPC-H) because of cache interference between accesses from the CPU and the DMA engine.

This paper makes the following contributions:
• We reconsider the DMA mechanism and reveal its essence: the CPU and the DMA engine have an explicit producer-consumer relationship, and memory plays only the role of a transient staging area for I/O data transferred between the processor and I/O devices. Unnecessary, expensive memory moving operations can be eliminated if we replace memory with a high-speed buffer for DMA.
• Based on these observations, we propose a DMA cache scheme that enables the CPU and I/O devices to exchange data directly between the last level cache (LLC) and a DMA cache.
• We have evaluated the DMA cache technique against several previously proposed schemes, such as snooping-cache and shared-cache, and find that the DMA cache technique performs best among all of them.
• We have observed that the previously proposed shared-cache scheme suffers from cache interference between the CPU and the DMA engine. This interference may decrease performance (by up to 8.8% for TPC-H) and would be aggravated in multicore systems.

The remainder of the paper is organized as follows. In Section 2, we introduce the DMA mechanism and reconsider the role of memory in the DMA process. Section 3 presents the DMA cache scheme and several design issues. Section 4 describes the experimental setup, including benchmarks, trace collection, and the evaluation methodology. Section 5 compares the DMA cache's performance with that of several other schemes. In Section 6, we discuss several other issues concerning the DMA cache scheme. We summarize the paper in Section 7.
2. THE INTERACTION OF PROCESSOR, MEMORY AND I/O DEVICE IN THE DMA SCHEME

In this section, we introduce the DMA mechanism and reconsider the role of memory in the DMA process.

2.1. The Process of the DMA Scheme

Figure 1 and Figure 2 depict the interaction of the processor, memory, and an I/O device in the DMA scheme. As shown in the figures, the interaction requires three data structures: the DMA buffer, the descriptor, and the APP buffer. DMA buffers may start at arbitrary physical addresses and vary in size; DMA descriptors are used to manage these discrete DMA buffers. Each descriptor includes a pointer to the DMA buffer's start address and a field recording the DMA buffer's size, as well as status fields such as the buffer's owner (either the processor or the DMA engine). A C sketch of such a descriptor is given at the end of this subsection.

Fig. 1. The Process of the DMA Receiving Mechanism
Fig. 2. The Process of the DMA Transmitting Mechanism

Figure 1 illustrates the anatomy of the DMA receiving mechanism. The detailed steps are as follows (a driver-side sketch appears after the step lists below):
Step 1: The processor side (i.e., the device driver) creates a descriptor for a DMA buffer.
Step 2: The processor side allocates a DMA buffer in memory and initializes the descriptor with the DMA buffer's start address, size, and status information.
Step 3: The processor side informs the DMA engine of the descriptor's start address.
Step 4: The DMA engine reads the descriptor's contents from memory.
Step 5: Using the DMA buffer's start address and size extracted from the descriptor, the DMA engine receives data from the I/O device and writes it to the DMA buffer in memory.
Step 6: The owner status of the descriptor is changed to the DMA engine once all I/O data has been stored in the DMA buffer.
Step 7: After all I/O data has been received, the DMA engine sends an interrupt to the processor to signal the completion of data receiving.
Step 8: The processor side handles the interrupt raised by the DMA engine and copies the received I/O data from the DMA buffer to the APP buffer. It then frees each DMA buffer and changes the owner status of the descriptor back to the processor. Once this is done, the process returns to Step 2.

Figure 2 illustrates the anatomy of the DMA transmitting mechanism. The detailed steps are as follows:
Step 1: The processor side (i.e., the device driver) creates a descriptor for a DMA buffer.
Step 2: The processor side allocates a DMA buffer in memory and initializes the descriptor with the DMA buffer's start address, size, and status information.
Step 3: The processor side copies the data to be transmitted from the APP buffer to the DMA buffer.
Step 4: The processor side informs the DMA engine of the descriptor's start address.
Step 5: The DMA engine reads the descriptor's contents from memory.
Step 6: Using the DMA buffer's start address and size extracted from the descriptor, the DMA engine reads the data from the DMA buffer and sends it to the I/O device.
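To make the descriptor layout concrete, the following is a minimal C sketch of a descriptor as described above. The field names, field widths, and the chaining pointer are illustrative assumptions on our part; real drivers use device-specific layouts.

#include <stdint.h>

/* Who currently owns the DMA buffer described by a descriptor. */
enum dma_owner {
    OWNER_PROCESSOR  = 0,  /* the device driver may access the buffer      */
    OWNER_DMA_ENGINE = 1   /* the DMA engine is reading/writing the buffer */
};

/* One DMA descriptor: where the buffer starts, how large it is,
 * and who owns it. */
struct dma_descriptor {
    uint64_t buf_addr;           /* physical start address of the DMA buffer */
    uint32_t buf_size;           /* size of the DMA buffer in bytes          */
    uint32_t owner;              /* an enum dma_owner value                  */
    struct dma_descriptor *next; /* next descriptor in the list (if chained) */
};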
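Steps 1-8 of the receiving mechanism can be summarized from the driver's point of view as in the following sketch, which reuses struct dma_descriptor from above. Here notify_dma_engine() and wait_for_interrupt() are hypothetical stand-ins for device-specific register writes and interrupt handling, and a real driver would also map the buffer for DMA rather than use a plain malloc() pointer.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical device-specific operations. */
void notify_dma_engine(struct dma_descriptor *d); /* Step 3: pass the descriptor's address */
void wait_for_interrupt(void);                    /* Step 7: block until completion        */

/* Steps 1-2: create a descriptor and bind a freshly allocated
 * DMA buffer to it, initially owned by the processor. */
struct dma_descriptor *setup_dma_buffer(size_t size)
{
    struct dma_descriptor *d = calloc(1, sizeof(*d)); /* Step 1 */
    void *buf = malloc(size);                         /* Step 2 */
    d->buf_addr = (uint64_t)(uintptr_t)buf;
    d->buf_size = (uint32_t)size;
    d->owner    = OWNER_PROCESSOR;
    return d;
}

/* Steps 3-8: hand the descriptor to the engine (which performs
 * Steps 4-6 in hardware), wait for the completion interrupt,
 * then copy the received data into the APP buffer. */
void receive_once(struct dma_descriptor *d, void *app_buf, size_t app_len)
{
    notify_dma_engine(d);                               /* Step 3 */
    wait_for_interrupt();                               /* Step 7 */

    size_t n = d->buf_size < app_len ? d->buf_size : app_len;
    memcpy(app_buf, (void *)(uintptr_t)d->buf_addr, n); /* Step 8: DMA buffer -> APP buffer */

    /* Free the DMA buffer and hand the descriptor back to the
     * processor; the driver then returns to Step 2. */
    free((void *)(uintptr_t)d->buf_addr);
    d->owner = OWNER_PROCESSOR;
}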
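The transmitting mechanism differs only in that the driver fills the DMA buffer before handing it to the engine. A matching sketch, under the same assumptions and reusing the helpers above:

/* Steps 1-6 of transmission: allocate and describe a DMA buffer,
 * copy the outgoing data into it, then let the engine fetch the
 * descriptor and the data (Steps 5-6 in hardware). */
void transmit_once(const void *app_buf, size_t len)
{
    struct dma_descriptor *d = setup_dma_buffer(len);     /* Steps 1-2 */
    memcpy((void *)(uintptr_t)d->buf_addr, app_buf, len); /* Step 3: APP buffer -> DMA buffer */
    d->owner = OWNER_DMA_ENGINE; /* the buffer now belongs to the engine   */
    notify_dma_engine(d);        /* Step 4; the engine performs Steps 5-6 */
}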