
The Effect of High Bandwidth and Low Latency Memory on Computers

Michael Kirzinger

Abstract—Adding high bandwidth and/or low latency memory to a computer results in improved performance. This performance gain comes from widening the bottleneck caused by the increasing gap in speed between the processor and main system memory. This paper investigates how large a gain can be achieved by widening the bus between the CPU and memory, increasing memory speed, and reducing memory latency. The latencies of real modules are used in the simulations to see what is possible with available high performance memory modules.

Index Terms—DDR, DDR2, GDDR3, high bandwidth, low latency, memory

I. INTRODUCTION

One of the main bottlenecks in computer systems today is the speed of the memory available to the processor. This bottleneck can be further divided into two categories: a bottleneck caused by memory latency, and a bottleneck caused by memory bandwidth.

This paper investigates the effects of increasing the memory bandwidth available to the processor. This is done by increasing the bus width between the memory and the processor and by increasing the speed at which the memory runs. It also compares increasing the bus width against increasing the data rate at a fixed bandwidth.

The effect of reduced latency is also simulated. This looks at values beyond what is currently available on the market.

II. SIMULATOR

A slightly modified version of the SimpleScalar tool set was used for all the simulations (a unified diff against SimpleScalar 3v0d is available at http://www.ece.ualberta.ca/~mjk3/ece510/burst_length.diff). One modification was made to the memory system: a new parameter was added to set the minimum and maximum burst length of the memory. This adds one of the differences between DDR and DDR2 (and GDDR3) to the simulation.

III. SIMULATION PARAMETERS

The processor used in the simulation is similar to a socket 754 AMD Athlon 64. This has a 64KB, 2-way set associative L1 data cache[1]. The cache lines are 64 bytes wide and a least recently used replacement policy is used. The L1 instruction cache is the same. The processors are available with either a 512KB or a 1MB L2 cache; a 512KB L2 cache was used for the simulations. A clock speed of 2400 MHz was used.

A second cache configuration, consisting of 32KB L1 instruction and data caches with a 64KB unified L2 cache, was also used to stress the memory. This was necessary since there was insufficient time to run simulations on very large data sets. Using a smaller cache should have a similar effect to using a large data set, that effect being cache misses caused by insufficient capacity in the cache to hold all the data.

The RAM parameters in SimpleScalar were based upon the CAS and tRCD latencies of Samsung DDR[2], DDR2[3], and GDDR3[4] memory modules. The available burst lengths for DDR are 2, 4, and 8[2]. For DDR2, they are 4 and 8[3]. GDDR3 has burst lengths of 4 and 8[4]. A power of 2 cache line size will always be used so that the line size is always a power of 2 ratio to the burst length.

To convert from the CAS and tRCD latencies and the data rate to the SimpleScalar parameters -mem:lat <first_chunk> <inter_chunk>, equations (1) and (2) were used.

    first_chunk = (CPU_Clock × (CAS + tRCD)) / Memory_Clock    (1)

    inter_chunk = CPU_Clock / Memory_Data_Rate                 (2)

For DDR memory, the memory data rate is twice the memory clock. All values for first_chunk and inter_chunk have been rounded up.

A summary of the modules used and their corresponding first and inter chunk values is shown in Table 1.

    Module        CAS   tRCD   first_chunk   inter_chunk
    DDR 266       2     3      91            10
    DDR 333       2.5   3      80            8
    DDR 400       3     3      72            6
    DDR2 533      4     4      72            5
    DDR2 667      5     5      73            4
    DDR2 800      5     5      60            3
    GDDR3 1000    7     8      72            3
    GDDR3 1200    9     10     76            2
    GDDR3 1400    10    10     69            2
    GDDR3 1600    11    12     69            2
    GDDR3 1800    11    12     62            2

Table 1: Latencies of various memory modules used
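As an illustration of equations (1) and (2), the short Python sketch below converts a module's CAS and tRCD latencies and clock figures into the two -mem:lat values. The helper name mem_lat is ours, and nominal rather than exact JEDEC clock frequencies are assumed, so a result can differ from Table 1 by a cycle for some modules.

    # Sketch of the conversion in equations (1) and (2); mem_lat is a
    # hypothetical helper, not part of the SimpleScalar patch.
    import math

    CPU_CLOCK_MHZ = 2400  # simulated CPU clock (Section III)

    def mem_lat(cas, trcd, mem_clock_mhz, data_rate_mts):
        """Return (first_chunk, inter_chunk) in CPU cycles, rounded up."""
        first_chunk = math.ceil(CPU_CLOCK_MHZ * (cas + trcd) / mem_clock_mhz)  # eq. (1)
        inter_chunk = math.ceil(CPU_CLOCK_MHZ / data_rate_mts)                 # eq. (2)
        return first_chunk, inter_chunk

    # Compare with Table 1:
    print(mem_lat(3, 3, 200, 400))      # DDR 400    -> (72, 6)
    print(mem_lat(5, 5, 400, 800))      # DDR2 800   -> (60, 3)
    print(mem_lat(11, 12, 800, 1600))   # GDDR3 1600 -> (69, 2)

The resulting pair is what would be passed to the simulator as -mem:lat first_chunk inter_chunk.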

There were five different programs used to test the effects of varying the memory parameters:

1) matrix: Multiplies two 100x100 matrices of double precision floating point numbers.
2) sort: Performs a quicksort of 40000 double precision floating point numbers.
3) fft: Performs a complex discrete fast Fourier transform on a 65536 element double precision floating point array, using a general purpose FFT package[5].
4) filter: Runs an 8192 sample double precision signal through a 200 tap finite impulse response filter (a sketch of this kind of kernel is shown after the list).
5) alphaBlend: Performs an alpha blend of two 800x600 24 bit bitmaps.
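For concreteness, a minimal Python sketch of the kind of kernel the filter program exercises is shown below. The actual benchmarks are separate compiled programs run under SimpleScalar, so the names and data here are illustrative only.

    # Illustrative direct-form FIR filter: an 8192-sample double precision
    # signal through a 200-tap filter, as in the "filter" benchmark.
    import random

    SIGNAL_LEN = 8192
    NUM_TAPS = 200

    signal = [random.random() for _ in range(SIGNAL_LEN)]
    taps = [random.random() for _ in range(NUM_TAPS)]

    # Each output sample is the dot product of the taps with a sliding
    # window of the input; the streaming access pattern is what makes
    # this kind of kernel sensitive to memory bandwidth.
    output = []
    for n in range(SIGNAL_LEN):
        acc = 0.0
        for k in range(min(NUM_TAPS, n + 1)):
            acc += taps[k] * signal[n - k]
        output.append(acc)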

IV. SIMULATION 1: EFFECT OF BANDWIDTH

The five test programs were run through the simulator for each memory module on a 64, 128, 256, and 512 bit memory bus. The results for the small cache, sorted by bandwidth, are shown in Figure 1. The results for the large cache show the same trend, but the values are closer together because most of the data fits in the cache. Performance is relative to the 128 bit DDR 400. The fft program stresses the memory the most, with an 18.3% spread in performance. Increasing bandwidth produced an increase in performance in most cases. The small differences in performance between memory configurations with similar bandwidths are due to differences in latency.

The 256 bit DDR2 and GDDR3 configurations perform poorly. This is a result of the cache line being only twice the size of the bus width: the memory is unable to do a burst read of 2, so it requires two expensive reads. Once the bus reaches 512 bits, the cache line can be filled in one read. In general, there is only a small improvement from moving to a wider bus when the wider bus does not allow the full burst length to be used.

With the memory system being utilized more heavily, the performance gains of a higher bandwidth system become more pronounced. Once again, the 256 bit DDR2/GDDR3 configurations suffer in performance for the same reason as before.

Figure 1: Performance of various memory modules in the small cache processor simulation (relative performance versus the 128 bit DDR 400 for each module at 64, 128, 256, and 512 bit bus widths).

V. SIMULATION 2: EFFECT OF CACHE LINE SIZE ON EQUAL BANDWIDTH MEMORY CONFIGURATIONS

The cache line size will affect the performance of the memory, as seen in the 256 bit bus case in the first simulation. Larger cache lines make better use of the memory's speed, but they can also harm performance for applications that do not access data in sequential order.

For this simulation, a constant bandwidth of 12.8 GB per second was used. The memory configurations with this bandwidth are 256 bit DDR 400, 128 bit DDR2 800, and 64 bit GDDR3 1600. These are all at or near the top of their memory types for clock rate and overall latency.

The results are all relative to the 128 bit DDR2 800 with a 64 byte L2 cache line size. The results for the small cache processor are shown in Figure 2. The large cache simulation data shows the same results, except that the differences in performance are smaller.

The wider bus performs better than the faster memory running on a narrower bus. The slower memory ends up sending more data per burst read, and the extra first chunk access times incurred by the faster memory cause it to perform worse than the slower memory on a wider bus. Increasing the cache line size improves performance until the point where it starts taking multiple bursts to fill the cache line.

Once again, the fft is the most demanding of all the test programs. It shows poor performance with large cache lines because of the non-sequential order in which its data is accessed. The 64 bit GDDR3 1600 (4096 byte cache line size) performs at 5.3% of the 128 bit DDR2 800 (64 byte cache line size) in the fft program. Overall, the relatively lower latency of the DDR2 800 makes it the best performer for the smaller line sizes more commonly found in processor caches.

Figure 2: Relative performance of modules with a bandwidth of 12.8 GB per second with varying cache line size (relative performance of the DDR, DDR2, and GDDR3 configurations versus cache line size).
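The interaction between bus width, burst length, and cache line size in Simulations 1 and 2 can be sketched with a simple cost model. This is an assumption about how the first and inter chunk latencies compose (one first_chunk per burst read, one inter_chunk per remaining bus transfer), not the paper's patched simulator; line_fill_cycles is a hypothetical helper, and the model ignores the minimum burst length constraint that penalizes the 256 bit DDR2 and GDDR3 configurations above.

    # Approximate cycles to fill one cache line, assuming each burst read
    # pays first_chunk once and each additional bus transfer pays inter_chunk.
    import math

    def line_fill_cycles(line_bytes, bus_bytes, max_burst, first_chunk, inter_chunk):
        chunks = math.ceil(line_bytes / bus_bytes)   # bus transfers per line
        bursts = math.ceil(chunks / max_burst)       # separate burst reads needed
        return bursts * first_chunk + (chunks - bursts) * inter_chunk

    # The 12.8 GB per second configurations of Simulation 2 filling a 64 byte
    # line (first/inter chunk values from Table 1, maximum burst length 8):
    print(line_fill_cycles(64, 32, 8, 72, 6))    # 256 bit DDR 400:   2 chunks -> 78 cycles
    print(line_fill_cycles(64, 16, 8, 60, 3))    # 128 bit DDR2 800:  4 chunks -> 69 cycles
    print(line_fill_cycles(64,  8, 8, 69, 2))    # 64 bit GDDR3 1600: 8 chunks -> 83 cycles
    print(line_fill_cycles(4096, 8, 8, 69, 2))   # 64 bit GDDR3 1600, 4096 byte line -> 5312 cycles

Under this rough model the low latency DDR2 800 wins at the common 64 byte line size, the wide DDR 400 beats the narrow GDDR3 1600, and very large lines on a narrow bus pay the first_chunk penalty repeatedly, consistent with the trends described above.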


VI. SIMULATION 3: EFFECT OF LATENCY

The final simulation looked at the effect of latency. It takes a 128 bit DDR2 800 module and varies its CAS and tRAS latencies from 1 to 8.

Once again, the results are similar for the simulations with the large and small caches. The results of the small cache simulation are shown in Figure 3. They show that reducing latency increases overall performance. The performance gain can be quite large, as seen in the results of the fft program, and the gains are larger than those seen by increasing bandwidth.

Figure 3: The effect of CAS/tRAS latency using 128 bit DDR2 800 memory modules (relative performance of matrix, sort, fft, filter, and alphaBlend versus CAS/tRAS latency from 1 to 8 cycles).

VII. CONCLUSIONS

Performance can be increased by decreasing latency and/or by increasing bandwidth. Lowering latency is a more effective method of improving performance than increasing bandwidth, but it is not a simple task: memory with latency as low as some of the values examined in simulation 3 does not exist. However, all of the bandwidth configurations examined, apart from the 512 bit memory bus, are already in wide use in a variety of specialized applications, such as video accelerators.

For larger bus widths, the cache line size should be increased. The 256 bit bus showed its best performance with a 256 byte cache line size. The ideal ratio of bus width to cache line size is 1 bit on the bus to 1 byte in the cache line, because this takes full advantage of the burst read length of 8 of the DDR RAM. The 64 and 128 bit buses currently found on most computer interfaces are well matched to the 64 byte cache line size found in desktop processors such as the AMD Athlon 64. To move to a wider bus, the caches in current processors will need larger cache lines.

In conclusion, improving the memory latency and increasing the bandwidth can have significant benefits for specific applications, and provide a general speedup for all applications.

VIII. FUTURE WORK

Future work would involve enhancing the simulator to model all aspects of DDR and DDR2 memory. As well, instead of using the timings of actual modules, it might shed more light on what is going on to use timings that vary only bandwidth or only latency rather than both at the same time.

REFERENCES

[1] "sandpile.org – AA-64 implementation – AMD K8," http://www.sandpile.org/impl/k8.htm
[2] Samsung, "512Mb C-die DDR SDRAM Specification," http://www.samsung.com/Products/Semiconductor/DDR_DDR2/DDRSDRAM/Component/512Mbit/K4H510438C/ds_k4h51xx38c_fbga_rev11.pdf
[3] Samsung, "DDR2 Unbuffered SDRAM MODULE," http://www.samsung.com/Products/Semiconductor/DDR_DDR2/DDR2SDRAM/Module/UnbufferedDIMM/M378T3253FZ3/ds_ddr2_256mb_f_die_base_udimm_rev13.pdf
[4] Samsung, "512Mbit GDDR3 SDRAM," http://www.samsung.com/Products/Semiconductor/GraphicsMemory/GDDR3SDRAM/512Mbit/K4J52324QC/ds_k4j52324qc_rev12.pdf
[5] Takuya Ooura, "FFT Package 1-dim/2-dim," http://momonga.t.u-tokyo.ac.jp/~ooura/fft.html