Tuning SoCs using the Dynamic Critical Path

Hari Kannan†, John D. Davis‡, Mihai Budiu‡, Girish Venkataramani*
†Stanford University; ‡Microsoft Research, Silicon Valley; *Mathworks, Inc.

MSR Technical Report MSR-TR-2009-44

1. Abstract designs, spanning designs from complex Ship We propose using the Global Dynamic Multi-processors to highly integrated Critical Path to diagnose system-wide embedded systems. The SoC building blocks bottlenecks using representative benchmarks -- Intellectual Property (IP) blocks -- used by a to direct embedded SoC optimizations. We manufacturer may come from a variety of provide real-world experience of internal and external sources. Regardless of implementing the global critical path (GCP) the SoC IP block source, the internal operation analysis framework on a Globally- of modules and associated corner cases may Asynchronous Locally-Synchronous (GALS) not be well understood or transparent to the SoC built around the LEON3 CPU. We SoC designers. Furthermore, with the high perform our analysis at the RTL level and degree of integration among IP blocks, and extend our evaluation to abstract RTL models. with the increasing amount of concurrent We use the power-delay product as the execution, understanding the interactions example cost function for optimization; we between various modules or blocks has can adjust the power-delay by tuning the become very difficult. SoC designers are frequency of the clock domains of each SoC forced to make educated guesses about the IP block. We show that the GCP optimization way the different modules impact each other’s framework can accommodate other cost performance. This is further complicated by functions as well, while effectively directing third party vendors of IP blocks that do not SoC optimization efforts. Our case studies provide source code access for their modules. demonstrate that the GCP algorithm can All of these factors make performance converge quickly to solutions even in the very analysis of SoCs or Multi-Processor SoCs large (exponential) search spaces describing (MPSoCs) extremely difficult. 
permissible SoC configurations, with no Recent work has established dynamic designer intervention (for instance, we find the critical path (or global critical path, GCP) solution of a 6-dimensional space with 19000 analysis B. Fields, S. Rubin et al., “Focusing configurations in 11 steps). Even though our processor policies via critical-path prediction”, initial implementation relies on manual source in the Proceedings of the 28th International code instrumentation, we only add 1% extra Symposium on Computer Architecture, lines of code to the design. This represents Goteborg, Sweden, Jun 2001G. annotating less than 0.2% of the ports of the Venkataramani, M. Budiu et al., “Critical overall Multi-processor SoC design. Path: A Tool for System-Level Timing Analysis”, in the Proceedings of the 44th 2. INTRODUCTION Design Automation Conference, San Diego, Silicon process technology scaling has CA, Jun 2007 as a powerful tool for enabled very high degrees of integration understanding and optimizing the dynamic resulting in complex System-on-a-Chip (SoC) performance profile of highly concurrent hardware/software systems. The GCP introduced by using approximations. If these provides valuable insight into the control-path errors are small, the use of approximate behavior of complete systems, and helps models (which can be simulated much faster) identify bottlenecks. The GCP tracks the is justified. chaining of transitions of the key control Alternative SoC optimization techniques signals and identifies the modules or IP blocks based on numerical design optimization, such that contribute significantly to the end-to-end as simulated annealing F. Ghenassia (ed), computation delay. “Transaction-Level Modeling with SystemC: In this paper, we propose using the GCP to TLM Concepts and Applications for identify and remove system-wide bottlenecks Embedded Systems”, Springer, 2005, or in MPSoCs. Using this knowledge, designers evolutionary algorithms T. Eeckelaert, T. 
can better direct their optimizations: to boost McConaghy, and G. Gielen, “Efficient the performance of underperforming modules, Multiobjective Synthesis of Analog Circuits lower power consumption, reduce excessive using Hierarchical Pareto-Optimal resources, etc. In the absence of such a tool, Performance”, in the Proceedings of the designers are often forced to simulate many Design, Test and Automation in Europe combinations of the various configurations J. Conference, Munich, Germany. March 2005 D. Davis, J. Laudon, and K. Olukotun, require significant simulation time, especially “Maximizing CMP Throughput with for large designs. This problem is exacerbated Mediocre Cores”, in the Proceedings of the by the lack of intuition related to unfamiliar or 14th International Conference on Parallel misunderstood IP blocks, and by the Architectures and Compilation Techniques, extremely large search space, exponential in St. Louis, MO, Sep 2005, in order to arrive at the number of IP blocks. Knowledge of the an optimal design. The overall system GCP allows designers to perform a directed architecture that we propose is depicted in search, and to reach optimal configurations Figure 1. quickly, significantly speeding up RTL simulation is commonly used in development time. The GCP framework can conjunction with software simulation to verify be used in conjunction with a variety of cost and validate large system design and for initial functions to guide the SoC optimization. software development J. D. Davis, C. Fu, and The GCP framework is ideal for SoC and J. Laudon, “The RASE (Rapid, Accurate MPSoC designs because such systems tend to Simulation Environment) for Chip be designed for a narrow range of Multiprocessors”, in Computer Architecture applications. As a result, the software running News, Vol. 33, Nov 2005. We start our on these systems is well defined and not investigation of the GCP analysis at the RTL general purpose. 
The dynamic critical path level, which provides a completely accurate analysis is only effective if the benchmarks or system evaluation or ground truth. We also applications used to drive the GCP framework investigate the impact of reduced simulation are representative of the actual workload. fidelity, by approximating RTL modules with We evaluate the effectiveness of the GCP black-boxes, and we evaluate the impact of technique using a system based on the open- the approximations on the accuracy of the source LEON3 SoC design H. Kannan, M. GCP computation. While there has been a lot Dalton, and C. Kozyrakis, “Decoupling of work on using the GCP at higher Dynamic Information Flow Tracking with a abstraction levels (e.g. software simulation, Dedicated Coprocessor”, in the Proceedings of network protocols), we are not aware of any the 39th International Conference on study that attempted to quantify the errors Dependable Systems and Networks, Estoril, Portugal, Jun 2009. We have modified the optimizations to remove system level RTL of this design to log control signal bottlenecks. We prove the utility of the transitions, which are used to compute the GCP for automatically directing GCP. Additionally, we have added different optimizations to find optimal SoC modules (in separate clock domains) to the configurations (our search of 19200 SoC design in order to emulate a more configurations converges in at most 11 complex SoC composed from a variety of IP steps). blocks.  We discuss how the GCP can be used to Our experiments show that the GCP guide the parameter space search for provides good feedback to designers by various cost functions; these functions correctly identifying system-wide bottlenecks. incorporate trade-offs between circuit Because we apply the critical path analysis to performance and other resources (power, the RTL design, we have the flexibility of area, design complexity, etc.). 
examining the critical path at a variety of  We share real-world experience of levels: within the modules, at the module incorporating the GCP at the RTL level in interfaces, or higher. Some designers operate a SoC framework consisting of both on abstracted views of the design such as blocking and non-blocking modules, electronic-system-level (ESL) models, or interacting concurrently. transaction-level models F. Ghenassia (ed),  We use a bottom-up approach and “Transaction-Level Modeling with SystemC: investigate the trade-off between increasing TLM Concepts and Applications for the level of abstraction and GCP accuracy. Embedded Systems”, Springer, 2005. These We develop the use of GCP for a mixed-IP designs however are written concurrently with block design approach that incorporates the actual hardware specification, and are not both fully defined IP blocks and black derived from the underlying RTL. Divergence boxes (IP blocks without source). from the actual design and the imperfect To the best of our knowledge, this is the modeling of critical transitions can decrease first work that uses GCP at the RTL level for the fidelity of the results computed using these an entire SoC design that consists of other GCP techniques using higher abstraction synchronous hardware components in multiple levels. clock domains. Using the GCP also helps the designers to The rest of the paper is organized as efficiently explore the search space for follows: Section 2 provides background on configuration parameters, arriving at optimal GCP and discusses related work. In Section 3, or near-optimal1 configurations much faster we discuss specific issues when implementing than exhaustive searches. Using a power-delay the GCP tool for an SoC. Section 4 provides product as the exemplar cost function, our details about our evaluation system while algorithm efficiently discovers the optimal Section 5 provides the evaluation. 
Section 6 combination of parameters for the IP blocks concludes. that constitute the SoC design. The specific contributions of this work are: 3. GLOBAL CRITICAL PATH DEFINITION  We advocate the use of the GCP as a tool to guide designers and direct their The formal definition of the Critical Path in operations research is “the longest path in a 1 We have verified optimality on small designs by exhaustive weighted acyclic graph.” An informal notion simulation; optimality cannot be ascertained for designs with very large configuration spaces, since a search for the optimum is of critical path has been used for a long time intractable. at various levels of system views, including asynchronous circuits S. M. Burns. is called “dynamic”). Performance Analysis and Optimization of Asynchronous Circuits. PhD thesis, California Figure 1: The Global Critical path is the longest chain of events Institute of Technology, 1991, modeled as in the timed graph. Petri nets A. Xie, S. Kim et al, “Bounding Modeling a hardware circuit as a graph, the average time separations of events in nodes in the graph are functional units and the stochastic timed Petri nets with choice”, in the edges are signals, shown in the rounded box in Proceedings of the 5th International Figure 1. To define the GCP, we have to Symposium on Advanced Research in consider an execution of the circuit, for a Asynchronous Circuits and Systems, Apr particular input; then we “unroll” the 1999 and synchronous circuits B. Fields, S. execution of the circuit. The unrolled circuit Rubin et al., “Focusing processor policies via (called a timed graph) contains a replica of the critical-path prediction”, in the Proceedings of entire circuit for each relevant time moment; the 28th International Symposium on an example is shown in Figure 1. The edges of Computer Architecture, Goteborg, Sweden, the timed graph are signal transitions2: an Jun 2001R. Nagarajan, X. 
Chen et al., edge between (f1, t1) and (f2, t3) represents a “Critical Path Analysis of the TRIPS signal leaving functional unit f1 at time t1 and Architecture”, in the Proceedings of the reaching f2 at time t3. Edges from a functional International Symposium on Performance unit to itself such as (f2, t1) to (f2, t2) Analysis of Systems and Software. Austin, represent computation delay. The timed graph TX, Apr 2006, as well as software modules , is an acyclic graph (for all edges, the end time network protocols P. Barford and M. Crovella, is larger than the start time); the longest chain “Critical Path Analysis of TCP transactions”, of events in the timed graph is the GCP. in the Proceedings of IEEE Transactions on Normally only control signals need to be Networking, 2001 and multi-tier web services. considered as parts of the GCP, because data A formal definition of the critical path can be signals transitions do not influence the timing found in G. Venkataramani, M. Budiu et al., of outputs3. “Critical Path: A Tool for System-Level Timing Analysis”, in the Proceedings of the 4. APPLYING GCP TO SOCS 44th Design Automation Conference, San GCP is easy to compute for asynchronous Diego, CA, Jun 2007. The critical path is also circuits because all signal transitions are related to critical cycles in pipelined explicit. Applying GCP to synchronous processors E. Borch, E. Tune et al., “Loose circuits presents many challenges that we loops sink Chips”, in the Proceedings of the address in this section. In particular, we Eighth International Symposium on High- discuss how GCP can be applied in practice Performance Computer Architecture, Boston, for analyzing SoC designs with the added MA, Feb 2002. complexity of multiple clock domains. 
The GCP should not be confused with the 3.1 Computing the GCP traditional notion of static critical path in The key idea for computing the GCP over synchronous circuits, which is defined to be all the modules is to track dependencies the longest of the possible signal propagation delays between two clocked latches. The 2 The edges of the timed graph are not the signals themselves, but dynamic GCP is more related to the concept the signal transitions. of instructions per cycle (IPC) for processors, since it is dependent on a particular workload 3 In some cases data signals do influence the control transitions, running on the system (which is why the path and these cases can be accommodated by our methodology. between control signals. We rely on an correct dependencies between input and algorithm proposed in B. Fields, S. Rubin et output control signals, and (3) it must model al., “Focusing processor policies via critical- transaction interleaving in the correct order path prediction”, in the Proceedings of the (e.g., the arrival of two input signals should 28th International Symposium on Computer not be swapped). Architecture, Goteborg, Sweden, Jun 2001 for We choose to compute the GCP at the RTL computing the GCP. For each module, we level because we regard it as the closest track the input and dependent output approximation4 to the actual hardware where transitions. Whenever an output signal makes no fidelity is lost. The GCP can be applied at a transition (i.e., the module produces a new other layers of abstraction of the system such output value), we must be able to attribute it to as transaction-level models (TLMs), if they a previous input transition, which triggered the accurately represent the hardware. TLMs are a computation. We only consider the last higher abstraction used in the design of arrival input that caused this output (an output integrated circuits F. Ghenassia (ed), may depend on multiple inputs). 
Even if we “Transaction-Level Modeling with SystemC: track these dependencies at runtime for each TLM Concepts and Applications for module in isolation, we can construct the GCP Embedded Systems”, Springer, 2005. by stitching the local transitions, starting with Currently, however, their use is mainly in the last transition of the system, and going RTL validation. These models are usually not back to the last arrival input which caused that derived from the RTL specification, but are transition. Recursively, this last arrival input hand-written to verify the RTL F. Ghenassia becomes the last transition, and the algorithm (ed), “Transaction-Level Modeling with is repeated until the start state is reached. This SystemC: TLM Concepts and Applications for chain of edges is the GCP. Embedded Systems”, Springer, 2005. The The GCP is usually a large data structure, so final goal of this research (in progress) is to we represent the GCP compactly as an edge understand how to build high-level models histogram: for each signal of the circuit we which do not lose precision in the computation count how many times its transition appears of the GCP, and how to use these models for on the critical path. A signal with a high count system optimization. is more critical than one with a low count. 3.3 GCP for Synchronous Circuits 3.2 GCP Accuracy Unfortunately, applying this methodology to As noted in Section 2, the GCP can be synchronous RTL-level circuits is not entirely computed at various levels of the system, from straightforward. Surprisingly, the GCP is very actual hardware to high-level simulations. We easy to build for handshake-based are interested in understanding the loss of asynchronous circuits, because all signal fidelity that can occur by using approximate transitions are explicit – and the critical path RTL models of the hardware. The GCP is composed of signal transitions. 
In clocked computed using the lowest levels is the circuits some signal transitions are implicit. ground truth GCP; the GCP computed using This section details some of the problems that abstracted models is just an approximation. we faced and the solutions we employed. Given our definition of the GCP, there are Dependencies in FSMs: For a complex three requirements for a model to produce an digital system, even in the presence of full accurate estimate of the GCP: (1) it must model all concurrent hardware blocks, (2) for 4 RTL simulation cannot account for non-determinism and is an each hardware block, it must model the approximation of the real hardware. RTL description, it is not always obvious what producer cannot compute. Thus the lack of the input--output dependencies are. When an transition on the stall indicates the absence FSM transitions to a state that outputs a signal, of the same unique resource. No changes of it is unclear which of the previous inputs the signal values are observed in either case, caused the output. but the meanings are quite different. We solve this problem by tracking Designer knowledge is required to solve this backwards dependencies through state problem. transitions. If the FSM contains no ε- Asynchronous-like handshaking: transitions (state transitions that are not Synchronous systems such as pipelined triggered by inputs), then the previous input is processors do not have explicit request and the cause. If there are ε-transitions, we move acknowledge signals between backward through these transitions until we communicating modules. (Instead, a see a transition caused by an external input. synchronous processor pipeline usually has Don’t cares in control logic: Another “stall” signals.) 
The asynchronous ack signal issue is related to some control signals being (which is really the logical equivalent of the computed using combinational logic; in such complement of the stall) can be the last cases inputs that generate control signals arrival input for a module, so modeling it is may actually be don’t cares. We ignore the very important. For this purpose, we “don’t care” issue in our current augment the synchronous circuit with implementation, and assume true “virtual” ack signals, going from consumers dependences in such cases. Resolving don’t to producers. The virtual ack signal is cares is the subject of future work. logically And-ed with the actual stall signal. Concurrent events: Multiple input control In our implementation, if all the inputs of a signals of a single module can transition in module are available, we break ties by the same cycle and multiple choices for the assuming that the virtual ack signal is the last arrival input are possible (this issue does last arrival input. not occur in asynchronous circuits); if a Pure sources and sinks: Consider the waveform is available, the last signal to register file in a simple pipelined processor stabilize can be chosen. Simulation ties can (i.e., in the absence of register renaming). be broken randomly, or by selecting control One control input to the register file is a signals in a round-robin manner. write enable. This kind of register file never Implicit signal transitions: In has a reason to stall a write request. Thus, in synchronous systems a signal may not the circuit graph, it is a pure sink (there are change its values between two clock cycles, only incoming control signals for the write but it may still imply a pair of logical port). A symmetric situation occurs with a transitions (down and up). Consider a pair of pure source (e.g., a DMA module controlled modules using a common clock, in a by the Ethernet interface). 
Computing the producer-consumer relation, connected with GCP requires the circuit graph to be strongly a pair of signals for handshaking: ready connected5. Adding virtual ack signals to (producer => consumer) and stall (consumer pure sources and sinks solves this problem, => producer). In normal operation, the ready making the graph strongly connected. signal (an input to the consumer) is set every clock cycle; this however indicates the 5 Sinks that cannot stall can never be reached by going backwards availability of a new resource (a data item) over a control edge, so they can never be on the critical path. Sources every cycle. The stall signal is an input to can cause the path construction algorithm to get “stuck”, since they the producer; as long as the stall is set, the have no in-edges. Signals with fanout: A signal such as a signals in the module’s interface. Based on the pipeline stall has a large fanout. Such a transitions of these signals, the GCP analysis signal should be treated as multiple provides hints about the module as a whole independent point-to-point signals that being on the system’s critical path. Paths happen to have the same value and transition internal to the module are obfuscated from the at the same time. The reason is that the stall GCP analysis. This solution reduces signal may be the last arrival input for some instrumentation effort and simultaneously pipeline stages, but not for others. allows the use of third-party netlists. We Modules with multiple outputs: If a investigate whether the lack of knowledge of module computes multiple outputs, it should internal structure of a module can cause be treated by the GCP-building algorithm as incorrect computations of the GCP in Section multiple modules, each with a single output. 5.3. The reason is that each output may have Future work includes investigating whether distinct dependencies. 
Examples include the use of split-transactions in SoCs (which caches, which interface with both the requires inter-chip protocols to use transaction pipeline and with the bus. tags in requests and responses), can be used to 3.4 GCP in an MPSoC infer input-to-output control signal Interestingly, the GCP provides greater dependencies without requiring detailed insight when analyzing systems with a high models of module internals. degree of concurrency: these designs have 5. EVALUATION SYSTEM complex interactions that are hard to In this section, we describe the system we understand. To make an analogy, this is used in our experiments. In order to keep our similar to diagnosing the impact of cache evaluation tractable, we started with a simple misses on performance; if several overlapping and well-understood system which models an misses are being serviced at each moment, it is SoC composed of up to 6 modules that can be hard to tell which misses cause the overall independently optimized, each of them in a slowdown. GCP is a very effective tool for separate clock domain. Our system is built diagnosing problems in MPSoCs composed of around GRLIB, the Gaisler Research IP multiple concurrent IP blocks or cores, since Library that consists of SoC components the GCP diagnoses the actual delays that interacting with the LEON3 SPARC V8 impact end-to-end performance. processor, a 32-bit open-source synthesizable Our methodology for computing the GCP CPU H. Kannan, M. Dalton, and C. requires low-level instrumentation of all the Kozyrakis, “Decoupling Dynamic Information control signals of all modules involved. It may Flow Tracking with a Dedicated not be possible to instrument internal control Coprocessor”, in the Proceedings of the 39th signals of third-party modules incorporated in International Conference on Dependable SoCs because source code may not be Systems and Networks, Estoril, Portugal, Jun available or may be encrypted. Additionally, 2009. 
LEON3 uses a single-issue, 7-stage modifying the modules to log critical signals, pipeline: Fetch, Decode, Register Access, manually or automatically, often requires a Execute, Memory, Exception and Writeback. thorough understanding of the module’s The processor has separate instruction and behavior. A solution to this problem is to data caches. The data cache follows a write- create an abstraction of the module instances through, and no-write-allocate-on miss-policy. and treat modules as black boxes. The The LEON3 communicates with DRAM, and designer needs to only identify the control other IP cores devices via a shared AMBA 2009,for security purposes. To explore the system bus. impact of a large configuration space, we 4.1 VHDL-level Instrumentation added support for multiple clock domains We modified the VHDL source code of the (CD). Each component, including the bus, is LEON3 design to log the transitions of control in a separate CD – i.e., the frequency of each signals. The LEON3 processor was originally CD can be adjusted independently. This was implemented using a single VHDL process, accomplished by adding asynchronous queues which required all stages of the pipeline to be between the various modules and the system updated simultaneously. In order to segregate bus, not shown in Error: Reference source not the control signals at the granularity of found. pipeline stages, we split the process construct SoCs contain third party IP blocks for which into seven VHDL processes, one per pipeline designers do not have access to source code. stage. This allowed us to track control signals We emulate this case by treating the that originated within the pipeline and affected coprocessor and DMA engine as black boxes. other stages. 
Along the lines of the discussion We restrict logging control signal transitions in Section 3.2, we added request (req) and for these IP blocks to just the interfaces acknowledge (ack) signals between adjacent provided by these blocks, thus reducing pipeline stages6. When a pipeline stage is instrumentation effort, but potentially ready to send data to the succeeding stage, it sacrificing fidelity. asserts the req signal (same as the write enable 6. EXPERIMENTAL EVALUATION signal of the latch register). The ack signal is We perform cycle-accurate behavioral asserted when the following stage is ready to simulation of the design’s RTL using operate on the data. Overall, we annotated less ModelSim 6.3 (structural simulation of the than 0.2% of the signals in the SoC. Our system can be used as well; according to the annotated code increased the system’s line discussion in Section 3.2, it should produce count by 1%. identical results). Logging all control signals Our system under test was designed to in our system did not increase the simulation mirror the composition of a contemporary time. small, embedded MPSoC. The system is SoC designers impose design performance composed of two processors, (one of which constraints that can be specified by cost has an attached coprocessor), a DMA engine, functions such as power-delay, area-delay, etc. a DRAM interface and a shared system bus. Cost functions typically include factors such Figure 3 shows a high-level overview of our as performance coupled with chip power, area, system architecture. The coprocessor is a 4- or other metrics. For the purposes of this stage pipeline that performs Dynamic evaluation, we define our cost function to be Information Flow Tracking (DIFT) on the the power-delay product (PD), summed over instruction stream executed by the main all the components in the SoC7: processor H. Kannan, M. Dalton, and C. 
2 PD = Power x Delay = ∑(CiVi fi) x Kozyrakis, “Decoupling Dynamic Information (Execution Time)8 Flow Tracking with a Dedicated Coprocessor”, in the Proceedings of the 39th International Conference on Dependable 7 In section 5.1 we discuss how alternative cost metrics can be Systems and Networks, Estoril, Portugal, Jun accommodated.

8 C is the capacitance, V is the voltage and f the frequency of each 6 These signals do not change the functionality of the pipeline. system component i. We report normalized power-delay results Computing these results required a large with respect to the initial configuration. In all number of simulations (more than 130) even of our experiments, we execute a small when exploring just three degrees of freedom. synthetic benchmark on the processors. The We used the exhaustive search as the “ground main processor executes an integer truth” for finding the optimal. The GCP-based benchmark, while the second processor directed search uses a much reduced number executes an I/O benchmark. The two of simulation points in the search space while processors run concurrently, and compete with improving the optimization criterion, PD. This each other for resources, such as the shared directed search is completely automatic, and system bus. The coprocessor inspects the does not require any human intervention. The instruction stream committed by the main directed search converges very rapidly when processor, and checks for security flaws. the optimization algorithm makes monotonic While our benchmarks are small (hundreds of moves in the cost function space. thousands of cycles), our methodology can be easily extrapolated to more complex workloads. 5.1 Search Space Exploration In order to assess the effectiveness of the GCP method for quickly discovering high- quality configurations, we first performed an exhaustive search of the parameter space for 3 independent parameters: the clock frequencies of the second CPU, the coprocessor, and DRAM. (The clock frequency of the main CPU is held constant; frequencies are changed in increments of 5MHz). We constrain system Figure 2: Complete search space for 4-module system when varying the frequency of the 2nd CPU, coprocessor and DRAM. 
performance to be above a minimum The four surfaces correspond to the four legal values of the 2 nd threshold; an execution longer than the processor’s frequency. The colored arrows show the directed search followed by using the GCP from 4 initial random points. threshold is unacceptable and not shown in the surfaces in Figure 4. This makes the search The thick arrows in Figure Evaluation space have an irregular shape. The colors in System show the results of this directed search Figure 4 show the Power-Delay (PD) values overlaid on the exhaustive search space. We for all possible legal combinations, where red show four searches, starting from four random is high PD (bad) and blue is low PD (good). points (different colors), which rapidly The search proceeds by choosing one of two converge to the optimal configuration. In our kinds of moves: (1) increase system experiments, the longest search took only 5 performance, by speeding up a module on the simulation steps. critical path, or (2) decrease system power, by These results are applicable for other slowing down a module outside of the critical optimization functions that combine system path. Note that, while we only modify clock performance (delay) with other metrics (e.g., frequencies of components in these area, design time, reliability, etc). The experiments, we could choose other moves algorithm requires a set of parameters that can which impact the cost function, such as be changed for each module and knowledge of capacitance, voltage, even arbiter priorities their impact on the optimization metrics. The and cache sizes. current algorithm always improves the performance of modules on the critical path, and decreases the cost of modules outside of order to maintain consistency. With an the path. More sophisticated algorithms can be abstracted view, we merely see all memory formulated and used in this framework. 
5.2 Higher Dimension Search Space

While it may be possible to perform an exhaustive search of all the allowable configurations for a small number of components, this approach quickly becomes intractable for a larger number of modules. By making all 6 IP blocks in our system configurable (ten possible configurations for the main CPU, four for the DMA engine, and three for the system bus), the size of the search space grows from 160 to 19200. For such a large space, we cannot exhaustively compute the optimal configuration. This issue is even more acute for real systems, which can have tens or hundreds of degrees of freedom.

In Figure 3, we show the results of the directed search for the large search space, which converges to a minimal PD configuration in just 11 steps. This is the longest run out of multiple simulations performed.

Figure 3: Directed search in a 6-dimensional space.

5.3 Abstracting Away Module Information

Using our SoC infrastructure, we obtained the critical path when the main CPU was treated as a black box, and compared it with the path obtained with knowledge of the internal CPU structure. In the black-box case, the critical path analysis merely observed the transitions on the inputs to, and outputs from, the CPU. We found that both analyses ranked the same edges in the histogram as critical. There was a slight difference of 3% in the number of transitions seen between the abstracted and non-abstracted cases.

On further investigation, we found that this difference was due to the non-blocking stores issued by the main processor that hit in its data cache. LEON3 has a write-through data cache that follows a no-allocate-on-miss policy. All stores must be written to main memory in order to maintain consistency. With an abstracted view, we merely see all memory requests from the processor, but not the context they are issued under. Thus, even though the processor does not stall (waiting for DRAM to reply for such stores), the GCP algorithm places these stores on the critical path (in other words, as discussed in Section 3.2, we are missing some of the dependencies between inputs and outputs for the black-box module). In the non-abstracted view, these stores are not considered critical because the processor does not stall. The difference in the critical path is proportional to the percentage of non-blocking requests. Modules that have few non-blocking requests, or that allow the algorithm to infer the dependent input-output pairs, will provide accurate critical path results in the abstracted view.

Thus, even when approximated, the critical path analysis can still provide useful hints for optimizing systems with black-box IP blocks. Whether the technique is viable depends both on the design’s characteristics and on the designer’s tolerance for loss of fidelity. It shows promise for abstracting low-level detail in IP blocks, resulting in less logging overhead and compatibility with closed-source IP blocks.

7. CONCLUSIONS

We presented the case for using dynamic global critical path analysis for diagnosing and optimizing performance problems in SoC and MPSoC systems where the designer may not understand complex system interactions. Adding instrumentation to source code to track the GCP required knowledge of the system and fluency in VHDL. However, we demonstrated that abstracted modules (black-box IP blocks) with user-supplied context can provide close approximations to the GCP. By using publicly available IP blocks, we have developed an MPSoC and optimized it for power-delay using the GCP framework.

Our model MPSoC consisted of GALS components. We showed that a directed search algorithm based on the GCP provides optimal configurations in a few steps (11 out of 19200 possibilities). We successfully applied this technique to SoC designs with 3-6 degrees of freedom. Our implementation required manual VHDL instrumentation of less than 0.2% of the module signals, or about 1% more lines of code for instrumentation, and added no measurable overhead to the simulation time. With abstracted RTL modules, the GCP analysis differed by only 3% from the complete, low-level GCP analysis. The overall GCP ranking of module criticality was unchanged when using the abstracted or black-box RTL modules, and the PD-optimal design search results were the same.

In future work, we would like to automatically infer control signals from the HDL and generate the resulting instrumentation code to reduce designer effort. We would also like to investigate generation of accurate abstract SoC models, which will speed up simulation for real, commercial MPSoCs or SoCs. Finally, we would like to apply the GCP to larger SoC and MPSoC systems.

REFERENCES

[1] P. Barford and M. Crovella, “Critical Path Analysis of TCP transactions”, in the Proceedings of IEEE Transactions on Networking, 2001
[2] E. Borch, E. Tune et al., “Loose loops sink Chips”, in the Proceedings of the Eighth International Symposium on High-Performance Computer Architecture, Boston, MA, Feb 2002
[3] S. M. Burns, “Performance Analysis and Optimization of Asynchronous Circuits”, PhD thesis, California Institute of Technology, 1991
[4] J. D. Davis, J. Laudon, and K. Olukotun, “Maximizing CMP Throughput with Mediocre Cores”, in the Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, St. Louis, MO, Sep 2005
[5] J. D. Davis, C. Fu, and J. Laudon, “The RASE (Rapid, Accurate Simulation Environment) for Chip Multiprocessors”, in Computer Architecture News, Vol. 33, Nov 2005
[6] T. Eeckelaert, T. McConaghy, and G. Gielen, “Efficient Multiobjective Synthesis of Analog Circuits using Hierarchical Pareto-Optimal Performance”, in the Proceedings of the Design, Test and Automation in Europe Conference, Munich, Germany, Mar 2005
[7] B. Fields, S. Rubin et al., “Focusing processor policies via critical-path prediction”, in the Proceedings of the 28th International Symposium on Computer Architecture, Goteborg, Sweden, Jun 2001
[8] F. Ghenassia (ed), “Transaction-Level Modeling with SystemC: TLM Concepts and Applications for Embedded Systems”, Springer, 2005
[9] H. Kannan, M. Dalton, and C. Kozyrakis, “Decoupling Dynamic Information Flow Tracking with a Dedicated Coprocessor”, in the Proceedings of the 39th International Conference on Dependable Systems and Networks, Estoril, Portugal, Jun 2009
[10] LEON3 SPARC Processor. http://www.gaisler.com
[11] R. Nagarajan, X. Chen et al., “Critical Path Analysis of the TRIPS Architecture”, in the Proceedings of the International Symposium on Performance Analysis of Systems and Software, Austin, TX, Apr 2006
[12] A. Saidi, N. Binkert et al., “Full-system critical path analysis”, in the Proceedings of the International Symposium on Performance Analysis of Systems and Software, Austin, TX, Apr 2008
[13] G. Venkataramani, M. Budiu et al., “Critical Path: A Tool for System-Level Timing Analysis”, in the Proceedings of the 44th Design Automation Conference, San Diego, CA, Jun 2007
[14] A. Xie, S. Kim et al., “Bounding average time separations of events in stochastic timed Petri nets with choice”, in the Proceedings of the 5th International Symposium on Advanced Research in Asynchronous Circuits and Systems, Apr 1999