Simulating Scratchpad Memories for General Purpose Processors

Francis B. Moreira, Marco A. Z. Alves, Philippe O. A. Navaux
UFRGS - Universidade Federal do Rio Grande do Sul
Informatics Institute - Parallel and Distributed Processing Group
Av. Bento Gonçalves, 9500 - Campus do Vale - Bloco IV, Bairro Agronomia, Porto Alegre, Brazil
{fbmoreira, mazalves, navaux}@inf.ufrgs.br

Abstract

Scratchpad memories are simple arrays of bytes controlled to optimize access to data structures. Regularly used within single-purpose processors, such as mobile phones and other embedded systems, due to their lower energy requirements, a scratchpad memory can also be an interesting choice for general-purpose systems if it can compete with caches using a least-recently-used eviction policy. This paper discusses implementing and simulating a simple scratchpad memory, presenting results that show the validity of the scratchpad simulation method and the potential gain that could be achieved by using scratchpads.

1. Introduction

Over the past few decades, processors evolved from mono-cycle architectures to pipelined architectures, which brought with them the need for faster memories. With the aggressive performance improvements of out-of-order execution and superscalar architectures, the only memory solution able to keep up to some extent was the cache, and the more processors evolve, the larger the amount of cache needed to keep feeding data at processor throughput. Unfortunately, memory efficiency stayed behind and became a bottleneck of our systems. Hence the search for better memory systems and, in this paper, the scratchpad memory.

A scratchpad memory is a simple array of bytes, small and fast. A thorough description can be found in [1]. It is widely used in single-purpose systems, such as embedded processors (Motorola MPC500, IBM's PowerPC 405/440), since there is no need for a cache to swap in the data of multiple processes being alternated on the processor. It gives the programmer the responsibility of deciding what should be put in the fast-access memory in order to obtain the most efficient execution. Optionally, it can be controlled by the compiler [2], the OS, or the hardware.

This paper aims to present and validate a method to simulate a scratchpad on Simics [3], in order to allow simulation of scratchpad usage on multiple simulated architectures. The future objective is to analyze how a scratchpad can help multi-core systems. Section 2 describes Simics and what is changed in its cache module to simulate a scratchpad. Section 3 describes the conditions used to test this change. Section 4 presents the output of the chosen evaluation method and observations about it. Section 5 discusses scratchpad efficiency and plans for the future.

2. Simulation

Simics was chosen due to its potential for cycle-accurate, full-system simulation. Simics provides a module named g-cache, which implements simple caches whose attributes can easily be set to describe them. This module can also be copied into the workspace directory and manipulated there, allowing the user to rewrite it in order to change the g-cache. Creation of new modules is also allowed and is thoroughly explained in [4].

Writing an entire new device in Simics would be unnecessarily complex, so a simpler approach was used: simulating just the presence of the scratchpad in the system. Each memory device in Simics has a timing model field, which identifies another object (possibly, and usually, also a memory) that listens to this device's transactions. With this design, when using a g-cache, you first describe it, then connect it to an id splitter, which in turn is connected as the physical memory's timing model. The splitter then listens to the memory operations passed to the main memory and forwards them to the L1 caches (executions to the L1 instruction cache, reads and writes to the L1 data cache); if penalties for hits and misses are defined, the g-caches return the stall time (in cycles) as the result of processing the memory operation. The system that called the function then stalls the processor for the returned value.

The scratchpad is defined in the literature as a simple memory store, an array of bytes, controlled purely by the application. It has no real policy other than to keep whatever it is ordered to store, returning an index indicating at which address of the scratchpad the inserted memory block begins, and to remove whatever it is ordered to remove. Observe that the scratchpad uses a namespace disjoint from the backing store when returning these indexes for different memory blocks. Under the assumption that a scratchpad returns a defined, constant access time (2 cycles for an L1-sized scratchpad, 6 cycles for an L2-sized scratchpad), and that the application is profiled in order to choose which virtual addresses constitute the memory block that will go into the scratchpad, the code in the file "gc-specialize.c", which contains the operate(memory operation) function (common to every memory device in Simics), can be changed. The operate function calls either handle_write or handle_read.
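The storage discipline described above — store a block, receive a scratchpad-local base index; remove a block on request — can be sketched as a minimal allocator. This is an illustrative sketch only, not Simics code; the names spm_store and spm_remove, the block table, and the bump-pointer strategy are assumptions for the example (only the 4096-byte capacity comes from the evaluated configurations).

```c
#include <stddef.h>

#define SPM_SIZE   4096   /* bytes, as in the evaluated configurations */
#define SPM_BLOCKS 16     /* hypothetical limit on tracked blocks */

/* One tracked block: base index inside the scratchpad and its length. */
struct spm_block { size_t base; size_t len; int used; };

static struct spm_block spm_table[SPM_BLOCKS];
static size_t spm_used = 0;   /* sum of the lengths currently stored */

/* Store a block of `len` bytes; return its scratchpad-local base index,
   or -1 if the summed lengths would exceed the scratchpad size.
   The returned index lives in a namespace disjoint from main memory. */
long spm_store(size_t len)
{
    if (spm_used + len > SPM_SIZE)
        return -1;
    for (int i = 0; i < SPM_BLOCKS; i++) {
        if (!spm_table[i].used) {
            spm_table[i].base = spm_used;
            spm_table[i].len  = len;
            spm_table[i].used = 1;
            spm_used += len;
            return (long)spm_table[i].base;
        }
    }
    return -1;
}

/* Forget the block starting at `base`; a real allocator would also
   reclaim the space, which this bump-style sketch does not do. */
void spm_remove(long base)
{
    for (int i = 0; i < SPM_BLOCKS; i++)
        if (spm_table[i].used && (long)spm_table[i].base == base)
            spm_table[i].used = 0;
}
```

With a 1024-float entry vector (4096 bytes), spm_store(1024 * 4) returns index 0 and fills the scratchpad, so a further request is refused — matching the rule that the sum of the stored blocks must not exceed the scratchpad size.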
In the case of a two-level cache design, if the L1 cache detects a miss, it calls the L2 cache function (the L2 listens to both L1 caches) and adds the return value of the L2 cache function to its own, returning the accumulated value of all levels accessed to Simics as the number of cycles to stall the processor. In the usual Simics model, the L2 cache has a staller as its timing model. The staller simply returns 78 cycles to the L2 cache function, as it is defined to stall for 78 cycles, thus simulating an access to the main memory in terms of stalled cycles.

The basic idea of the hierarchy is shown in figure 1: the id splitter divides the addresses listened from the main memory into instructions and data, the L1 instruction cache listens to the instruction interface of the id splitter, the L1 data cache listens to the data interface of the id splitter, the L2 cache listens to both L1 caches, and a staller returns 78 cycles to the L2 cache whenever a miss occurs in the L2 cache. In this paper's proposed model there is also the abstraction of a controller working as the Simics code in the data cache, as said controller was changed to simulate a scratchpad; it is drawn at the right side of the cache to illustrate its place in the simulation.

Figure 1. The memory hierarchy as viewed by Simics, with Simics abstractions shown in ellipses.

As an example, here is the altered handle_write function (due to the large size of the actual function, its functionality is summarized in comments):

static cycles_t handle_write(generic_cache_t *gc,
                             generic_transaction_t *mem_op,
                             conf_object_t *space,
                             map_list_t *map)
{
        int penalty = 0;
        /* update statistics */
        ...
        /* look for a matching line */
        /* check if we got a hit */
        /* if positive: */
        /* broadcast a hit depending on the cache status
           and the MESI coherence model */
        ...
        penalty += mesi_send(gc, mem_op);
        ...
        /* if negative: */
        /* update statistics about the miss */
        /* update profilers */
        ...
        /* process write-back / write-through and
           write-allocate policies */
        ...
        /* count the penalty of L2's operate(),
           then add it to this cache's own */
        penalty += gc->penalty.write_next;
        if (gc->timing_model)
                penalty += gc->timing_ifc->operate(
                        gc->timing_model, space, map, mem_op);
        ...
        /* update STC statistics if the STC is on */

        /* CODE ADDED TO THE FILE */
        if ((mem_op->logical_address > gc->mem_base) &&
            (mem_op->logical_address < gc->offset)) {
                INC_STAT(gc, spm_write);
                if (gc->name_dc)
                        return 2;
                else
                        return 6;
        }
        /* CODE ADDED TO THE FILE */

        return penalty;
}

Pay special attention to the penalty variable. It accumulates the penalty and is usually returned as the stall time; the only test added checks whether the memory address falls in the described block, and if it does, the selected penalty value is returned instead. This way we simulate, according to the name passed to the cache (in this case dc, short for data cache), the presence of an L1 scratchpad if the object was instantiated as dc, or an L2 scratchpad otherwise.

Observe that two fields were created in this code: mem_base and offset. These fields are not originally within g-cache; they were created merely to simulate the scratchpad. In the code above only one comparison is made when, actually, several blocks of memory need to be stored. What is needed is a list of base addresses and their respective offsets, together with a check that the sum of the offsets is not larger than the available scratchpad size, as shown in figure 2, which depicts the basic idea of a scratchpad filled with some blocks of memory.

Figure 2. The scratchpad.

Two further fields were created within the Simics g-cache: spm_read and spm_write (as shown in the code). These help us keep track of how many reads and writes are assigned to the scratchpad.

3. Evaluation Methodology

The target program used to test the validity of the scratchpad simulation is a floating-point vector-matrix multiplication. We store the entry vector so that it occupies exactly N * 4 (the size of a float) bytes, where N is the vector size. With the program implemented, magic breakpoints are inserted; these are instructions that trigger callback functions in Simics when executed but do not influence the total execution time. With these, the user can read Simics variables to get the current cycle count before and after the main loop's execution, and thus determine how many cycles the execution took and how much gain or loss is achieved with different implementations.

Magic breakpoints also bring the option of using a tracer, a Simics object that logs every instruction executed until it is stopped. This is a low-level way to observe at which logical addresses the vectors are allocated. The alternative is to simply use printf("%d", vector), which prints the base address of the allocated vector, and was the method used for this paper.

4. Results and Analysis

To validate the scratchpad-simulating g-cache module, a test was made to show its functionality: we executed the vector-matrix multiplication with a 1024-element vector and a 1024x1024 matrix as entries, with the regular g-cache module and with the altered g-cache module, starting from the same state. As Simics is deterministic, the same results are obtained whenever the simulation starts from the same checkpoint. The configuration used for the tests is shown in table 1.

OS                Ubuntu 6.06.2 LTS - SMP
Processing cores  IPC (without cache): 1.0
                  Pipeline: not modeled
                  Number of cores: 1
                  Core model: UltraSparc II - V9
                  Clock frequency: 2 GHz
                  Feature size: 45 nm
Main memory       Size: 1 GB
                  Access latency: 78 cycles
                  Feature size: 65 nm
Caches            L1 instruction: 16 KB, 2-way assoc.
                  L1 data: 16 KB, 4-way assoc.
                  L2: 512 KB, 8-way assoc.

Table 1. Fixed simulation parameters.

To simulate the scratchpad, the same configuration was used, but with the altered g-cache module, resulting in table 2. The table shows the average of ten executions for each configuration, every round of ten executions beginning at the same checkpoint, which yields a clear comparison. The speedup here is negligible; the important observation is that the scratchpad simulation is valid and does not lose in terms of speed when compared to the cache.

The read and write counts obtained from the g-cache statistics for the first run were 1048576 reads and 1024 writes. This is correct: the algorithm stores one value in each result-vector position and, for each of the resulting vector positions, it has to read an entry-vector value, hence 1024 (entry vector) * 1024 (resulting vector) = 1048576 reads.
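The access counts follow directly from the kernel's loop structure. A self-contained sketch of the same vector-matrix multiplication (the function name and counters are hypothetical; N = 1024 as in the evaluation) counts one entry-vector read per multiply and one write per result element:

```c
#define N 1024

static float vec[N];      /* entry vector: N * 4 bytes, the block stored in the SPM */
static float mat[N][N];
static float out[N];

/* Count the entry-vector reads and result-vector writes performed by the
   N x N vector-matrix multiplication. The values stay zero, since only
   the access pattern matters here. */
void count_accesses(long *reads, long *writes)
{
    *reads = 0;
    *writes = 0;
    for (int i = 0; i < N; i++) {
        float acc = 0.0f;
        for (int j = 0; j < N; j++) {
            acc += mat[i][j] * vec[j];   /* one entry-vector read per product */
            (*reads)++;
        }
        out[i] = acc;                    /* one write per result position */
        (*writes)++;
    }
}
```

Running this yields 1024 * 1024 = 1048576 reads and 1024 writes, matching the g-cache statistics reported above.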
Configuration                     1            2            3
SPM size                          4096 bytes   4096 bytes   4096 bytes
SPM penalty (cycles)              1            2            2
L1 cache size                     16378 bytes  16378 bytes  8192 bytes
Mean execution, cache only        228560297    228560297    228560297
Mean execution, cache with SPM    226833021    227888454    228306817

Table 2. Table of results - average of 10 executions.

Observe that in the test described as configuration 1 the scratchpad penalty is set to 1 cycle, in order to obtain good visibility of the results. With the penalty set to 2 we obtain configuration 2, where the scratchpad still competes well, as it helps the processor avoid some misses. We could also cut off some data cache memory; otherwise we would merely be increasing the processor die area for minimal gain. This does not mean we would reclaim enough space for the scratchpad, but we cut half of the L1 data cache to observe the difference in configuration 3, and the scratchpad has shown itself still able to compete, which shows how efficient it can be for scientific or engineering styles of application.
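The "negligible" speedup can be quantified directly from Table 2: configuration 1 gives 228560297 / 226833021, roughly 1.0076, i.e. under a 1% improvement, and the other configurations are smaller still. A small sketch (the function name is hypothetical) makes the computation explicit:

```c
/* Mean execution cycles from Table 2. */
static const double cache_only = 228560297.0;
static const double with_spm[3] = { 226833021.0, 227888454.0, 228306817.0 };

/* Speedup of configuration c (0-based) over the cache-only baseline. */
double spm_speedup(int c)
{
    return cache_only / with_spm[c];
}
```

The three speedups come out to roughly 1.008, 1.003, and 1.001, supporting the claim that the scratchpad matches the cache in speed while the expected benefit lies elsewhere (energy).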

5. Conclusions

The potential in researching scratchpads becomes obvious with the presented results, which fully support scratchpad allocation. Using the scratchpad together with the L2 thus has great potential to reap larger benefits. The overheads mentioned in the results are a trade-off, as scratchpads help us save energy and minimize the number of pipeline stalls during execution. As a conclusion, we can observe that our simulation was valid and that scratchpads might indeed offer gains, given the similar execution time combined with great energy savings.

The major remaining challenge is that we have not yet factored in how much time and effort the programmer would spend deciding which memory blocks should be inserted in the scratchpad for better execution, although the programming should not be complex with a simple allocation function (very similar to malloc).

In the near future, we expect to test different combinations of scratchpad and cache for the vector-matrix multiplication, following the same steps as [5], in order to obtain greater gains in performance. After that, the long-term plan is to test standard benchmarks for multi-core systems and see whether the scratchpad can help us make hardware more efficient, evaluating techniques to automatically define the data to be placed on the scratchpad.

References

[1] B. Jacob, S. Ng, and D. Wang. Memory Systems: Cache, DRAM, Disk. Morgan Kaufmann, 2007.
[2] L. Li, L. Gao, and J. Xue. Memory coloring: a compiler approach for scratchpad memory management. In 14th International Conference on Parallel Architectures and Compilation Techniques (PACT 2005), pages 329-338, 2005.
[3] P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. Computer, 35(2):50-58, Feb. 2002.
[4] Simics Programming Guide, 1998.
[5] A. Yanamandra, B. Cover, P. Raghavan, M. Irwin, and M. Kandemir. Evaluating the role of scratchpad memories in chip multiprocessors for sparse matrix computations. In IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2008), pages 1-10, 2008.