Simulating Scratchpad Memories for General Purpose Processors

Francis B. Moreira, Marco A. Z. Alves, Philippe O. A. Navaux
UFRGS - Universidade Federal do Rio Grande do Sul
Informatics Institute - Parallel and Distributed Processing Group
Av. Bento Gonçalves, 9500 - Campus do Vale - Bloco IV, Bairro Agronomia, Porto Alegre, Brazil
{fbmoreira, mazalves, navaux}@inf.ufrgs.br

Abstract

Scratchpad memories are simple arrays of bytes controlled to optimize access to data structures. Regularly used within single-purpose processors, such as mobile phones and other embedded systems, due to their lower energy requirements, a scratchpad memory can also be an interesting choice for general-purpose systems if it can compete with caches using a least-recently-used eviction policy. This paper discusses implementing and simulating a simple scratchpad memory, presenting results that show the validity of the scratchpad simulation method and the potential gain that could be achieved by using scratchpads.

1. Introduction

Over the past few decades, processors evolved from mono-cycle architectures to pipelined architectures, which brought with them the need for faster memories. With the aggressive performance improvements of out-of-order execution and superscalar architectures, the only memory solution able to keep up to some extent was the cache, and the more processors evolve, the larger the amount of cache needed to keep feeding data at processor throughput. Unfortunately, memory efficiency stayed behind and became a bottleneck of our systems. Hence the search for better memory systems and, in this paper, the scratchpad memory.

A scratchpad memory is a simple array of bytes, small and fast. A thorough description can be found in [1]. It is widely used in single-purpose systems, such as embedded processors (Motorola MPC500, IBM's PowerPC 405/440), since there is no need for a cache to swap in the data of multiple processes being alternated on the processor. It gives the programmer the responsibility of deciding what should be put in the fast-access memory in order to obtain the most efficient execution. Optionally, it can be controlled by the compiler [2], the OS, or the hardware.

This paper aims to present and validate a method to simulate a scratchpad on Simics [3], in order to allow simulation of scratchpad usage on multiple simulated architectures. The future objective is to analyze how a scratchpad can help multi-core systems. Section 2 describes Simics and what is changed in its cache module to simulate a scratchpad. Section 3 describes the conditions used to test this change. Section 4 presents the output of the chosen evaluation method and observations about it. Section 5 discusses scratchpad efficiency and plans for the future.

2. Simulation

Simics was chosen due to its potential for cycle-accurate, full-system simulation. Simics provides a module named g-cache, which implements simple caches whose attributes can easily be set to describe them. This module can also be copied into the workspace directory and manipulated there, allowing the user to rewrite it in order to change the g-cache. Creation of new modules is also allowed and is thoroughly explained in [4].

Writing an entire new device in Simics would be unnecessarily complex, so a simpler approach was used: simulating just the presence of the scratchpad in the system. Each memory device in Simics has a timing model field, which identifies another object (possibly, and usually, also a memory) that listens to this device's transactions. With this design, when using a g-cache, you first describe it, then connect it to an id splitter, which in turn is connected as the physical memory's timing model. The splitter then listens to the memory operations passed to the main memory and forwards them to the L1 caches (executions to the L1 instruction cache, reads and writes to the L1 data cache); if penalties for hits and misses are defined, the g-caches return the stall time (in cycles) as the result of processing the memory operation. The system that called the function then stalls the processor for the returned value.

The scratchpad is defined in the literature as a simple memory store, an array of bytes, controlled purely by the application. It has no real policy other than to keep whatever it is ordered to store, returning an index indicating at which address of the scratchpad the inserted memory block begins, and to remove whatever it is ordered to remove. Observe that the scratchpad uses a namespace disjoint from the backing store when returning these indexes for different memory blocks. Under the assumption that a scratchpad returns a defined, constant access time (2 cycles for an L1-sized scratchpad, 6 cycles for an L2-sized scratchpad), and that the application is profiled in order to choose which virtual addresses constitute the memory block that will go into the scratchpad, the code in the file "gc-specialize.c", which contains the operate(memory operation) function (common to every memory device in Simics), can be changed. The operate function calls either handle_write or handle_read.
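The storage discipline described above — store a block, receive a scratchpad-local base index; remove a block on request — can be sketched as a minimal allocator. This is an illustrative sketch only, not Simics code; the names spm_store and spm_remove, the block table, and the bump-pointer strategy are assumptions for the example (only the 4096-byte capacity comes from the evaluated configurations).

```c
#include <stddef.h>

#define SPM_SIZE   4096   /* bytes, as in the evaluated configurations */
#define SPM_BLOCKS 16     /* hypothetical limit on tracked blocks */

/* One tracked block: base index inside the scratchpad and its length. */
struct spm_block { size_t base; size_t len; int used; };

static struct spm_block spm_table[SPM_BLOCKS];
static size_t spm_used = 0;   /* sum of the lengths currently stored */

/* Store a block of `len` bytes; return its scratchpad-local base index,
   or -1 if the summed lengths would exceed the scratchpad size.
   The returned index lives in a namespace disjoint from main memory. */
long spm_store(size_t len)
{
    if (spm_used + len > SPM_SIZE)
        return -1;
    for (int i = 0; i < SPM_BLOCKS; i++) {
        if (!spm_table[i].used) {
            spm_table[i].base = spm_used;
            spm_table[i].len  = len;
            spm_table[i].used = 1;
            spm_used += len;
            return (long)spm_table[i].base;
        }
    }
    return -1;
}

/* Forget the block starting at `base`; a real allocator would also
   reclaim the space, which this bump-style sketch does not do. */
void spm_remove(long base)
{
    for (int i = 0; i < SPM_BLOCKS; i++)
        if (spm_table[i].used && (long)spm_table[i].base == base)
            spm_table[i].used = 0;
}
```

With a 1024-float entry vector (4096 bytes), spm_store(1024 * 4) returns index 0 and fills the scratchpad, so a further request is refused — matching the rule that the sum of the stored blocks must not exceed the scratchpad size.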
In the case of a two-level cache design, if the L1 cache detects a miss, it calls the L2 cache function (the L2 listens to both L1 caches) and adds the return value of the L2 cache function to its own, returning the accumulated value of all levels accessed to Simics as the number of cycles to stall the processor. In the usual Simics model, the L2 cache has a staller as its timing model. The staller simply returns 78 cycles to the L2 cache function, as it is defined to stall for 78 cycles, thus simulating an access to the main memory in terms of stalled cycles.

The basic idea of the hierarchy is shown in figure 1: the id splitter divides the addresses listened from the main memory into instructions and data, the L1 instruction cache listens to the instruction interface of the id splitter, the L1 data cache listens to the data interface of the id splitter, the L2 cache listens to both L1 caches, and a staller returns 78 cycles to the L2 cache whenever a miss occurs in the L2 cache. In this paper's proposed model there is also the abstraction of a controller working as the Simics code in the data cache, as said controller was changed to simulate a scratchpad; it is drawn at the right side of the cache to illustrate its place in the simulation.

Figure 1. The memory hierarchy as viewed by Simics, with Simics abstractions shown in ellipses.

As an example, here is the altered handle_write function (due to the large size of the actual function, its functionality is summarized in comments):

static cycles_t handle_write(generic_cache_t *gc,
                             generic_transaction_t *mem_op,
                             conf_object_t *space,
                             map_list_t *map)
{
        int penalty = 0;
        /* update statistics */
        ...
        /* look for a matching line */
        /* check if we got a hit */
        /* if positive: */
        /* broadcast a hit depending on the cache status
           and the MESI coherence model */
        ...
        penalty += mesi_send(gc, mem_op);
        ...
        /* if negative: */
        /* update statistics about the miss */
        /* update profilers */
        ...
        /* process write-back / write-through and
           write-allocate policies */
        ...
        /* count the penalty of L2's operate(),
           then add it to this cache's own */
        penalty += gc->penalty.write_next;
        if (gc->timing_model)
                penalty += gc->timing_ifc->operate(
                        gc->timing_model, space, map, mem_op);
        ...
        /* update STC statistics if the STC is on */

        /* CODE ADDED TO THE FILE */
        if ((mem_op->logical_address > gc->mem_base) &&
            (mem_op->logical_address < gc->offset)) {
                INC_STAT(gc, spm_write);
                if (gc->name_dc)
                        return 2;
                else
                        return 6;
        }
        /* CODE ADDED TO THE FILE */

        return penalty;
}

Pay special attention to the penalty variable. It accumulates the penalty and is usually returned as the stall time; the only test added checks whether the memory address falls in the described block, and if it does, the selected penalty value is returned instead. This way we simulate, according to the name passed to the cache (in this case dc, short for data cache), the presence of an L1 scratchpad if the object was instantiated as dc, or an L2 scratchpad otherwise.

Observe that two fields were created in this code: mem_base and offset. These fields are not originally within g-cache; they were created merely to simulate the scratchpad. In the code above only one comparison is made when, actually, several blocks of memory need to be stored. What is needed is a list of base addresses and their respective offsets, together with a check that the sum of the offsets is not larger than the available scratchpad size, as shown in figure 2, which depicts the basic idea of a scratchpad filled with some blocks of memory.

Figure 2. The scratchpad.

Two further fields were created within the Simics g-cache: spm_read and spm_write (as shown in the code). These help us keep track of how many reads and writes are assigned to the scratchpad.

3. Evaluation Methodology

The target program used to test the validity of the scratchpad simulation is a floating-point vector-matrix multiplication. We store the entry vector so that it occupies exactly N * 4 (the size of a float) bytes, where N is the vector size. With the program implemented, magic breakpoints are inserted; these are instructions that trigger callback functions in Simics when executed but do not influence the total execution time. With these, the user can read Simics variables to get the current cycle count before and after the main loop's execution, and thus determine how many cycles the execution took and how much gain or loss is achieved with different implementations.

Magic breakpoints also bring the option of using a tracer, a Simics object that logs every instruction executed until it is stopped. This is a low-level way to observe at which logical addresses the vectors are allocated. The alternative is to simply use printf("%d", vector), which prints the base address of the allocated vector, and was the method used for this paper.

4. Results and Analysis

To validate the scratchpad-simulating g-cache module, a test was made to show its functionality: we executed the vector-matrix multiplication with a 1024-element vector and a 1024x1024 matrix as entries, with the regular g-cache module and with the altered g-cache module, starting from the same state. As Simics is deterministic, the same results are obtained whenever the simulation starts from the same checkpoint. The configuration used for the tests is shown in table 1.

OS                Ubuntu 6.06.2 LTS - SMP
Processing cores  IPC (without cache): 1.0
                  Pipeline: not modeled
                  Number of cores: 1
                  Core model: UltraSparc II - V9
                  Clock frequency: 2 GHz
                  Feature size: 45 nm
Main memory       Size: 1 GB
                  Access latency: 78 cycles
                  Feature size: 65 nm
Caches            L1 instruction: 16 KB, 2-way assoc.
                  L1 data: 16 KB, 4-way assoc.
                  L2: 512 KB, 8-way assoc.

Table 1. Fixed simulation parameters.

To simulate the scratchpad, the same configuration was used, but with the altered g-cache module, resulting in table 2. The table shows the average of ten executions for each configuration, every round of ten executions beginning at the same checkpoint, which yields a clear comparison. The speedup here is negligible; the important observation is that the scratchpad simulation is valid and does not lose in terms of speed when compared to the cache.

The read and write counts obtained from the g-cache statistics for the first run were 1048576 reads and 1024 writes. This is correct: the algorithm stores one value in each result-vector position and, for each of the resulting vector positions, it has to read an entry-vector value, hence 1024 (entry vector) * 1024 (resulting vector) = 1048576 reads.
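The access counts follow directly from the kernel's loop structure. A self-contained sketch of the same vector-matrix multiplication (the function name and counters are hypothetical; N = 1024 as in the evaluation) counts one entry-vector read per multiply and one write per result element:

```c
#define N 1024

static float vec[N];      /* entry vector: N * 4 bytes, the block stored in the SPM */
static float mat[N][N];
static float out[N];

/* Count the entry-vector reads and result-vector writes performed by the
   N x N vector-matrix multiplication. The values stay zero, since only
   the access pattern matters here. */
void count_accesses(long *reads, long *writes)
{
    *reads = 0;
    *writes = 0;
    for (int i = 0; i < N; i++) {
        float acc = 0.0f;
        for (int j = 0; j < N; j++) {
            acc += mat[i][j] * vec[j];   /* one entry-vector read per product */
            (*reads)++;
        }
        out[i] = acc;                    /* one write per result position */
        (*writes)++;
    }
}
```

Running this yields 1024 * 1024 = 1048576 reads and 1024 writes, matching the g-cache statistics reported above.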
Configuration                     1            2            3
SPM size                          4096 bytes   4096 bytes   4096 bytes
SPM penalty (cycles)              1            2            2
L1 cache size                     16378 bytes  16378 bytes  8192 bytes
Mean execution, cache only        228560297    228560297    228560297
Mean execution, cache with SPM    226833021    227888454    228306817

Table 2. Table of results - average of 10 executions.

Observe that in the test described as configuration 1 the scratchpad penalty is set to 1 cycle, in order to obtain good visibility of the results. With the penalty set to 2 we obtain configuration 2, where the scratchpad still competes well, as it helps the processor avoid some misses. We could also cut off some data cache memory; otherwise we would merely be increasing the processor die area for minimal gain. This does not mean we would reclaim enough space for the scratchpad, but we cut half of the L1 data cache to observe the difference in configuration 3, and the scratchpad has shown itself still able to compete, which shows how efficient it can be for scientific or engineering styles of application.
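The "negligible" speedup can be quantified directly from Table 2: configuration 1 gives 228560297 / 226833021, roughly 1.0076, i.e. under a 1% improvement, and the other configurations are smaller still. A small sketch (the function name is hypothetical) makes the computation explicit:

```c
/* Mean execution cycles from Table 2. */
static const double cache_only = 228560297.0;
static const double with_spm[3] = { 226833021.0, 227888454.0, 228306817.0 };

/* Speedup of configuration c (0-based) over the cache-only baseline. */
double spm_speedup(int c)
{
    return cache_only / with_spm[c];
}
```

The three speedups come out to roughly 1.008, 1.003, and 1.001, supporting the claim that the scratchpad matches the cache in speed while the expected benefit lies elsewhere (energy).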

5. Conclusions

The potential in researching scratchpads becomes obvious with the presented results, which fully support scratchpad allocation. Using the scratchpad together with the L2 thus has great potential to reap larger benefits. The overheads mentioned in the results are a trade-off, as scratchpads help us save energy and minimize the number of pipeline stalls during execution. As a conclusion, we can observe that our simulation was valid and that scratchpads might indeed offer gains, given the similar execution time combined with great energy savings.

The major remaining challenge is that we have not yet factored in how much time and effort the programmer would spend deciding which memory blocks should be inserted in the scratchpad for better execution, although the programming should not be complex with a simple allocation function (very similar to malloc).

In the near future, we expect to test different combinations of scratchpad and cache for the vector-matrix multiplication, following the same steps as [5], in order to obtain greater gains in performance. After that, the long-term plan is to test standard benchmarks for multi-core systems and see whether the scratchpad can help us make hardware more efficient, evaluating techniques to automatically define the data to be placed on the scratchpad.

References

[1] B. Jacob, S. Ng, and D. Wang. Memory Systems: Cache, DRAM, Disk. Morgan Kaufmann, 2007.
[2] L. Li, L. Gao, and J. Xue. Memory coloring: a compiler approach for scratchpad memory management. In 14th International Conference on Parallel Architectures and Compilation Techniques (PACT 2005), pages 329-338, 2005.
[3] P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. Computer, 35(2):50-58, Feb. 2002.
[4] Simics Programming Guide, 1998.
[5] A. Yanamandra, B. Cover, P. Raghavan, M. Irwin, and M. Kandemir. Evaluating the role of scratchpad memories in chip multiprocessors for sparse matrix computations. In IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2008), pages 1-10, 2008.