
SoC drawer: The resource view

Resource allocation can determine system architecture

Sam Siewert ( [email protected] ), Adjunct Professor, University of Colorado

Summary: A system-on-a-chip (SoC) can provide a single-chip solution, lower power usage, better performance, more frugal use of board real estate, simpler integration, and lower part counts. Compared to multichip solutions, the SoC has huge advantages, but mistakes in sizing on-chip resources require spinning the ASIC and result in high cost. This article introduces approaches for SoC design from a resource perspective. The SoC design concept has appeal in a broad range of applications, from supercomputing to embedded systems.

Date: 04 Oct 2005 Level: Intermediate

This article is the first in the SoC drawer series. The series aims to provide the system architect with a starting point and some tips to make system-on-a-chip (SoC) design easier. The goal for an SoC is typically a single-chip solution; therefore, properly sizing memory, I/O, and processing (CPU) resources from the outset is critical. By comparison, multichip solutions often include approaches for resource-sizing risk mitigation. For example, memory controllers for external memory devices can support a range of parts and sizes. Resources can also be added to a multichip solution after the fact. Resource enhancement always has an associated cost, but for SoCs, the cost is higher. This first article provides an overview of design from a resource perspective; subsequent articles will drill down and focus on specific methods to support this resource approach. The SoC drawer series is intended to arm the architect with tools and methods to get resource sizing right.

The emergence of SoC design and SoC-based architectures

What is an SoC? For the purposes of this series, I consider an ASIC to be in this category if it includes CPU, memory, I/O, and an interconnection between the three. As Wikipedia notes, "System-on-a-chip ... is an idea of integrating all components of a system into a single chip." The SoC has been talked about, marketed, and accepted in the new millennium, especially for embedded applications. More recently, with announcements of high-performance SoC designs such as the Cell chip, and the use of these chips in consumer products including the Sony PlayStation 3 (PS3) and the Microsoft® Xbox 360, it has become clear that SoC designs will have broad impact. (For more information, see the press on the Cell architecture this year in the Resources section below.)

The initial use of SoC design has been focused on reducing part count for embedded applications and tightening the integration of processing, memory, and I/O resources, with lower overall power consumption and a smaller footprint. Along with IBM, Sony, and Toshiba's unveiling of the Cell chip for the PS3, described in detail at ISSCC this past February, other major announcements have shown that SoC design has become a pervasive underpinning of many architecture roadmaps for the future. For example, IBM has also prototyped a blade server using Cell chips, though Cell experts like Arnd Bergmann say that Cell-like architectures will remain in the embedded and supercomputing application domains and won't likely show up on home or office desktops. (See Resources for an interview with Bergmann.) What is really exciting about SoC architecture is that supercomputing and embedded computing may become the cutting edge of computer architecture. For supercomputing this is nothing new, but embedded systems have often followed rather than led architecture.

SoC architecture brings embedded and supercomputing closer together

The IBM Blue Gene®/W and Blue Gene/L recently put the U.S. back in the lead in supercomputing, ahead of the Japanese Earth Simulator, built by NEC and heralded as the fastest in 2002. Each Blue Gene node is best described as an SoC, given the integration of processing, memory, and interconnection networks within a single ASIC. Similarly, by the definition of an SoC, the recently unveiled Cell chip embedded in the PS3 could even be considered a multi-SoC ASIC. Each Cell Synergistic Processing Element (SPE) in fact integrates 256KB of load/store memory, processing, direct memory access (DMA), a memory management unit (MMU), and a bus interface, with 8 SPEs in all integrated with the Power Architecture™ technology-based Processing Element (PPE) in a single ASIC.

Any architect involved in embedded or supercomputing is most likely already working with SoC designs or will be soon. While SoC design might not presently be well suited to general-purpose computing (GPC), the roadmap for the future of GPC also holds multiprocessor designs on a single chip, and therefore includes aspects of SoC design. Given the reconfigurability, broad range of I/O devices, and whims of the GPC market, it's unlikely that general-purpose computers themselves will be designed as SoCs, but clearly they will contain chips and subsystems that are SoCs. It's not too risky to predict that most ASICs will be SoCs at some point, and that the SoC is a natural stage in the evolution of higher and higher integration. The degree of SoC-ness in a design is based upon its ability to stand alone and provide services without requiring support from external chipsets or external memory devices.

Figure 1. System and software view: The resource cube

Figure 1 depicts the challenge facing firmware and software engineers implementing software services on an SoC (or any system, for that matter). The origin and volume inside the green subspace defines a resource-rich situation where problems are easily solved with cycles, bandwidth, and megabytes. The red subspace within the resource cube defines a resource-constrained situation where significant effort will be required to tune the system in order to meet the timing and resource requirements for services. The resource cube does not include additional dimensions (resources) such as power, pin count, layout space, or cost. It only portrays the firmware/software view, given that trade-offs have already been made in the hardware resource space to define these three basic software resources.


Research on SoC architectures has led to interesting emergent software and hardware management concepts like dynamic clock scaling, where software can modulate CPU clock rate and power consumption based upon current computational needs. Likewise, hardware might modulate the CPU clock to control heating and notify the software layer that the clock rate has been reduced to handle the overheating. SoC design might be expanding the hardware and software interface, but the hardware design decisions defining this space are still not a bad place to begin high-level architectural definition.
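
As a minimal sketch of the second mechanism, the C fragment in Listing 1 shows how a thermal interrupt handler might reduce the clock rate and raise a flag for the software layer. The register names, addresses, temperature limit, and clock-halving policy are all invented assumptions for illustration; a real SoC's thermal and clock-control interfaces will differ.

Listing 1. A hypothetical thermal-throttle handler

#include <stdint.h>

/* Hypothetical memory-mapped registers -- the addresses and layouts
 * are illustrative only and will differ for any real SoC. */
#define TEMP_SENSOR_REG  ((volatile uint32_t *)0xF0001000u) /* die temp, deg C */
#define CLOCK_DIV_REG    ((volatile uint32_t *)0xF0001004u) /* core clock divider */
#define THERMAL_IRQ_ACK  ((volatile uint32_t *)0xF0001008u) /* interrupt ack */

#define TEMP_LIMIT_C 85u

volatile uint32_t clock_reduced; /* flag polled by the software scheduler */

/* Invoked from the thermal interrupt: halve the core clock (by doubling
 * the divider) and record the event so the software layer can adapt its
 * service schedule to the reduced processing capacity. */
void thermal_isr(void)
{
    if (*TEMP_SENSOR_REG > TEMP_LIMIT_C) {
        *CLOCK_DIV_REG *= 2u;   /* reduce clock rate to shed heat */
        clock_reduced = 1u;     /* notify software of reduced capacity */
    }
    *THERMAL_IRQ_ACK = 1u;      /* acknowledge the interrupt */
}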

This introductory article looks at some basic hardware and software decisions that affect processing, memory, and I/O. Future articles drill into individual aspects of the decision-making and consider hardware and software trade-offs for a given SoC architectural design decision. In general, all computers provide services, which can range from embedded services like digital control, to supercomputing services like sequencing DNA. Processing is perhaps the most carefully analyzed resource in most systems.

Processing resources and scheduling

How compute nodes or CPUs are scheduled depends upon the system architecture and service requirements. Figure 2 provides a taxonomy of scheduling methods for processing resources. It's not fully exhaustive, but is fairly complete.

Figure 2. CPU scheduling taxonomy

SoC processors can host resource-management services

Since the 1970s, real-time systems have included the concept of an admission policy for services (threads), where a new thread's service requirements are analyzed relative to existing services to determine if the new service will cause the existing services to miss deadlines. Traditionally, this has been done offline, but a dedicated processing resource could execute a rate monotonic feasibility test online in an SoC design. Dynamic service admission is a difficult problem to solve on a single CPU or non-SMT (simultaneous multithreading) processor, since the test itself interferes with running services.


Some of the methods in this CPU scheduling taxonomy are mostly of historical interest, such as mainframe batch policies like Shortest Job Next (SJN). Methods such as asymmetric off-loading and dynamic Least Laxity First (LLF) or Earliest Deadline First (EDF) are, however, of great interest for modern SoC services in media applications, including video, audio, and game engines. Describing them all is far beyond the scope of this article, but this taxonomy provides a context for future articles discussing methods currently applied to SoC design.

As already noted, SoC design tends to blur traditional hardware and software interface lines, so an SoC architect might want to consider hardware-supported scheduling for a policy such as EDF. In EDF, the thread with the earliest deadline is executed until a thread enters the system with an even earlier deadline. This policy is often used for soft real-time services like real-time rendering. Recently, simultaneous multithreading (SMT) has emerged to provide hardware support for multiple threads of execution. Understanding scheduling policies and mechanisms is critical for the SoC architect.
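
As a sketch of the dispatch logic an EDF policy implies (whether realized in software or in a hardware state machine), the C function in Listing 2 picks the ready thread with the earliest absolute deadline; the thread representation is an assumption for illustration only.

Listing 2. A minimal EDF dispatch decision

#include <stddef.h>
#include <stdint.h>

/* Assumed thread representation: an absolute deadline in timer ticks
 * and a ready flag. A real kernel would carry much more state. */
struct thread {
    uint64_t deadline;  /* absolute deadline in timer ticks */
    int      ready;     /* nonzero if runnable */
};

/* Return the index of the ready thread with the earliest deadline,
 * or -1 if no thread is ready. */
int edf_pick_next(const struct thread tbl[], size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].ready &&
            (best < 0 || tbl[i].deadline < tbl[(size_t)best].deadline))
            best = (int)i;
    }
    return best;
}

Running this selection on every scheduling event, including the arrival of a new thread, yields the preemptive behavior described above: a newly arrived thread with an earlier deadline wins the next dispatch.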

The simple CPU resource utilization equations below are a starting point for analysis of processing resources. For systems that have requirements to provide services without real-time deadlines, called best effort, Equation 1 provides an estimate of processing demands for a set of periodic service requests. The non-real-time utility bound for scheduling simply states that the work queued to a system over time must be less than full utility; otherwise, the work queue will grow. Equation 2 provides the same estimate of processing demands and a basic feasibility test for a set of services that determines if sufficient processing margin exists for these services to safely meet their required completion deadlines.

The real-time utility bound (Equation 2) was first presented by Liu and Layland (see Resources) and is based upon the observation that most real-time systems provide a set of periodic services. In the basic test that Equation 2 provides (the rate monotonic least upper bound), each service deadline is assumed to be equal to its release period, so service instances don't overlap in time.

Equation 1. Non-real-time scheduling utilization bound

Sum(Ci / Ti) < 1.0 for threads i = 1 to n

Equation 2. Real-time scheduling sufficient feasibility bound

Sum(Ci / Ti) < n * (2^(1/n) - 1) for threads i = 1 to n
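
Both bounds are easy to compute. Listing 3 is a direct C sketch of Equations 1 and 2; the service parameters in main() are invented for the example, and note that the rate monotonic bound is sufficient but not necessary, so a service set that fails it may still be schedulable.

Listing 3. Utilization-based feasibility tests from Equations 1 and 2

#include <math.h>
#include <stdio.h>

/* Ci is the worst-case execution time and Ti the release period of
 * service i, in the same time units. */

/* Equation 1: best-effort bound -- total utilization must stay below
 * 1.0, or the work queue grows without bound. */
int best_effort_feasible(const double C[], const double T[], int n)
{
    double u = 0.0;
    for (int i = 0; i < n; i++)
        u += C[i] / T[i];
    return u < 1.0;
}

/* Equation 2: rate monotonic least upper bound -- a sufficient (not
 * necessary) test for meeting all deadlines under RM scheduling. */
int rm_lub_feasible(const double C[], const double T[], int n)
{
    double u = 0.0;
    for (int i = 0; i < n; i++)
        u += C[i] / T[i];
    return u < n * (pow(2.0, 1.0 / n) - 1.0);
}

int main(void)
{
    /* Example: three periodic services (C and T in milliseconds). */
    double C[] = {1.0, 2.0, 4.0};
    double T[] = {10.0, 20.0, 40.0};
    printf("utilization ok: %d, RM LUB ok: %d\n",
           best_effort_feasible(C, T, 3), rm_lub_feasible(C, T, 3));
    return 0;
}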

Much more precise, though more complex, real-time feasibility tests have been derived since the introduction of the simple upper bound in Equation 2 (see Resources for additional reading). In general, scheduling processing resources requires decision logic for the next thread to execute (dispatch), a policy for that decision (for example, priority or EDF), and a test for feasibility. Feasibility analysis determines whether sufficient resources exist to keep up with a workload or meet workload service deadlines. SoC architectures often include multiple processors, symmetric or asymmetric processing, and SMT. Scheduling mechanisms for dispatch, policy, and feasibility are an important aspect of the design.

I/O interconnections


Since SoCs include processing, memory, and I/O on-chip by definition, and most often include multiple processors, the I/O interconnection on-chip is fundamental to the design. Figure 3 provides an overview (taxonomy) of the many different schemes for interconnecting processing elements, memory, and I/O devices for any system, including SoCs. Each interconnection architecture has advantages and disadvantages based upon cost, complexity of implementation, complexity of usage by firmware and software, and performance. Since an SoC typically includes all resources needed on a single chip and often includes multiple processor cores (the IBM Cell architecture works this way, for instance), the interconnection scheme is critical to SoC design.

Figure 3. Interconnection network taxonomy

Static interconnection topologies

• Point-to-point: One-to-one connection of nodes; N nodes, N-1 connections, N-1 hops to any node worst case
• Ring: One-to-one connection of nodes plus a first-to-last link; N nodes, N connections, N/2 hops to any node worst case
• Hub: One central node connected to N-1 nodes; N nodes, N-1 connections, two hops from any node to another
• Tree: One root node connected to M sub-nodes; N nodes, N-1 connections, log(N) or fewer hops from the root to any node
• Square mesh: Each node connected to its four nearest neighbors (fewer at edges and corners); N nodes, 2*N - 2*square-root(N) connections, 2*(square-root(N) - 1) hops worst case
• Fully connected: Each node connected to all others; N nodes, N*(N-1)/2 connections, one hop in all cases
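
The link and hop counts in this list translate directly into code. The sketch below simply tabulates them for a candidate node count, which can serve as a quick sanity check when comparing candidate SoC interconnect topologies; the 16-node example in main() is an invented illustration.

Listing 4. Comparing static topology costs

#include <math.h>
#include <stdio.h>

/* Tabulate connection counts and worst-case hop counts from the
 * static topology list above, as functions of the node count N. */
static void topo_costs(double n)
{
    printf("N = %.0f\n", n);
    printf("  point-to-point:  %6.0f links, %4.0f hops worst case\n",
           n - 1, n - 1);
    printf("  ring:            %6.0f links, %4.0f hops worst case\n",
           n, floor(n / 2));
    printf("  hub:             %6.0f links, %4.0f hops worst case\n",
           n - 1, 2.0);
    printf("  square mesh:     %6.0f links, %4.0f hops worst case\n",
           2 * n - 2 * sqrt(n), 2 * (sqrt(n) - 1));
    printf("  fully connected: %6.0f links, %4.0f hop all cases\n",
           n * (n - 1) / 2, 1.0);
}

int main(void)
{
    topo_costs(16.0);   /* for example, a 16-node on-chip network */
    return 0;
}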

Dynamic interconnection networks

• Bus: Arbitrated transactions for read and write (split-transaction read and posted-write optimizations)
• Blocking switch: Some active circuits prevent others from becoming active
• Non-blocking switch: All circuits may be active simultaneously

SoC memory

A modern GPC memory hierarchy can be complex: it might include registers, L1 instruction and L1 data caches, L2 and L3 unified caches, on-chip SRAM, off-chip DDR SDRAM, and virtual memory backed by random access storage. By comparison, an SoC design must fit the entire memory system on-chip. So, most often, an SoC will include L1 instruction and data caches with tightly coupled, fast-access SRAM. The SoC design might be complicated by the inclusion of multiple processors, especially for message passing or data sharing.

SoC cache designs typically simplify the GPC cache hierarchy to reduce levels and, in some cases, eliminate cache entirely. For multiprocessor (MP) SoCs, cache coherency is an issue that can be solved with hardware protocols or software-managed caches. If a cache allows DMA directly into it, it is said to be a push cache, a feature that can greatly accelerate store-and-forward designs. The following list includes key architectural decisions that must be made regarding an SoC cache design:

• Will each processor incorporate a traditional GPC-style cache hierarchy?
• How will cache coherency be maintained for DMA I/O interfaces?
• If multiple processors will use cache, how will coherency be maintained between processors sharing memory?
• How will cache be implemented and maintained?
• Will you use a traditional GPC set-associative hardware design, or perhaps a simpler direct-mapped cache? Or will you take a software management approach, managing data in fast-access memory buffers?

Carefully consider the list of cache design options below from a hardware and software viewpoint:

• Hierarchical Harvard architecture: Code and data are stored and cached separately, as opposed to a unified cache that caches both code and data.
• Cache coherency: If data can be DMA'd (transferred through DMA) or imported to memory under cache, the cached copy must be invalidated before it is read; if data can be exported, it must first be flushed from cache (see the sketch after this list).
• MP cache coherency: MP designs that share data will have cache coherency issues that can be solved by protocols, such as MOESI, implemented in hardware or software.
• Push cache: A cache that can be DMA'd into and out of.
• Direct-mapped cache: A cache where every line/set has only one destination when loaded from main memory.
• N-way set associative cache: A cache where every line/set has N possible destinations when loaded from main memory, with the line to be replaced chosen according to a policy such as Least Recently Used (LRU).
• Software cache: Software implements the traditional cache miss, hit, flush, invalidate, and load operations using a low-latency memory.
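
The cache coherency rules for DMA in the list above reduce to a simple discipline in software-managed designs: flush before export, invalidate before the processor reads imported data. Listing 5 illustrates this; the cache-maintenance and DMA function names are placeholder assumptions, since every SoC's hardware layer names these primitives differently.

Listing 5. Software-managed coherency around DMA transfers

#include <stddef.h>

/* Hypothetical cache-maintenance and DMA primitives -- placeholders
 * for whatever the target SoC's hardware layer actually provides. */
extern void cache_flush_range(void *addr, size_t len);       /* write back */
extern void cache_invalidate_range(void *addr, size_t len);  /* discard */
extern void dma_to_device(const void *src, size_t len);
extern void dma_from_device(void *dst, size_t len);

/* Export a buffer: write dirty cache lines back to memory first, so
 * the DMA engine reads current data rather than stale memory contents. */
void send_buffer(void *buf, size_t len)
{
    cache_flush_range(buf, len);
    dma_to_device(buf, len);
}

/* Import a buffer: invalidate the cached copy once the DMA completes,
 * so the processor's next read fetches fresh data from memory. */
void receive_buffer(void *buf, size_t len)
{
    dma_from_device(buf, len);
    cache_invalidate_range(buf, len);
}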

Due to the complexity of hierarchical cache designs, SoCs often include simpler approaches to speed memory access, including dual- or multiported memory, which allows for DMA and multiprocessor access to shared memory. In general, memory access latency is the most likely cause of significant inefficiency in SoC processing, so designs most often incorporate L1 caches or tightly coupled on-chip SRAM carefully designed to minimize latency. Complicating an SoC design with a multilevel cache, such as a unified L2, might not be worth the cost compared to the inclusion of fast-access memory scratch pads and multiported memory buffers. The following list provides an overview of some of the memory design decisions the SoC architect must consider:

• Dual- or multiported memory: Blocks can be DMA'd into or out of while a processor simultaneously accesses different blocks (see the double-buffering sketch after this list)
• Content addressable memory (CAM): The equivalent of an associative array, where a data value presented to the CAM returns its storage address(es)
• Tightly coupled memory: Memory with very low latency access for a given processor
• Symmetric multiprocessing (SMP): Global shared memory with common access time for all processors in an MP architecture
• Non-uniform memory access (NUMA): Memory banks with faster access for some processor nodes compared to others in an MP architecture
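
One common use of dual-ported memory is the double-buffering (ping-pong) pattern: a DMA engine fills one block through one port while the processor works on another block through the second port, hiding transfer latency behind computation. Listing 6 sketches the pattern; the DMA driver functions are hypothetical placeholders for illustration.

Listing 6. Double buffering in dual-ported memory

#include <stdint.h>
#include <stddef.h>

#define BLOCK_WORDS 256

/* Hypothetical primitives: start a DMA fill of a block in dual-ported
 * memory and wait for its completion -- placeholders for real drivers. */
extern void dma_start_fill(uint32_t *dst, size_t words);
extern void dma_wait(void);
extern void process_block(const uint32_t *src, size_t words);

/* Ping-pong buffering: the DMA engine fills one block through one port
 * while the processor consumes the other block through the second port. */
void stream_blocks(uint32_t bufs[2][BLOCK_WORDS], int nblocks)
{
    if (nblocks <= 0)
        return;
    int fill = 0;                       /* block the DMA is filling */
    dma_start_fill(bufs[fill], BLOCK_WORDS);
    for (int i = 0; i < nblocks; i++) {
        dma_wait();                     /* block 'fill' is now ready */
        int work = fill;                /* processor takes this block */
        fill ^= 1;                      /* DMA moves to the other block */
        if (i + 1 < nblocks)
            dma_start_fill(bufs[fill], BLOCK_WORDS);
        process_block(bufs[work], BLOCK_WORDS);
    }
}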

Future directions for SoC discussion

Future discussion in this series will expand upon the resource view presented here with specific design examples and discussion of design concepts, including:

• Design methods
• Processing resource analysis, policies, and management
• I/O resource analysis, policies, and management
• Memory resource analysis, policies, and management
• Hardware and software resource trade-offs
• SoC
• SoC EDA and verification
• SoC design case studies, including Cell, Blue Gene, and others

Perhaps SoC design is really not that much different from system design in general; however, the risks and rewards related to a single-chip solution are both greater. This series therefore explores disciplined analysis and design for both hardware and software, since the cost of improperly sizing resources in an SoC may be much higher than a similar error in multichip designs. Future articles in this series will examine single resources and trade-offs between hardware and software complexity and cost based upon design decisions that size SoC resources. Right-sizing resources in SoC design is critical for these newly emerging single-chip architectures.

Resources

Learn

• The Wikipedia definition of a system-on-a-chip is helpful to understand what this emergent design concept really means.

• This IBM Research Blue Gene project Web page provides an excellent overview of Blue Gene architecture and news.

• Arnd Bergmann describes the Cell architecture in an interview on developerWorks, "Arnd Bergmann on Cell" (June 2005).

• Writing code for the Cell chip is not quite like writing code for a GPC. The Cell compiler helps abstract the specifics, though, so efficient code can be generated from high-level languages like C.

• Chapter eight of Highly Parallel Computing, G.S. Almasi and A. Gottlieb (Benjamin/Cummings Publishing, 1989) provides a good overview of interconnection networks. Likewise, Optical Networks: A Practical Perspective, R. Ramaswami and K. Sivarajan (Morgan Kaufmann Publishers, 2002) provides a nice review of blocking and non-blocking switches.

• The IBM POWER5™ includes an SMT engine, allowing multiple threads of execution to execute more efficiently on each processor.

• Thread scheduling policies specialized for soft real-time, including EDF (Earliest Deadline First) and LLF (Least Laxity First), are examples of custom SoC processor scheduling that might be considered for continuous media applications like gaming systems. For digital control and hard real-time SoC applications, the hard real-time rate monotonic policy is safer. Most GPCs provide a form of RR (Round Robin) timesliced scheduling for the fairness and responsiveness expected by users. SoC designers might want to consider hardware support for more specialized scheduling and thread control.

• SoCs incorporating multiple processors will often be asymmetric, whereby processors are dedicated to specific services once and for all; often the NUMA memory model is used to speed up each processor, since interaction between processors is only occasional. For GPCs with multiple processors, the level of memory sharing and synchronization needed is hard to predict, so most often multiprocessor GPCs are based on an SMP architecture.

• Many SoCs have a flatter memory hierarchy with TCM (Tightly Coupled Memory) that is on-chip, with low latency access so that the memory and processor core speeds are matched or closely matched. Off-chip memory or high-latency memory will slow down a processor significantly without cache. Many SoCs include L1 (Level 1, single-cycle access) instruction and data caches on the order of 16 to 128KB. When multiple processors are incorporated, each with its own cache, the coherency of data shared between cores through global on-chip memory becomes an issue. The SoC designer should consider a cache coherency protocol such as MOESI, MOSI, MESI, or MSI.

• The Wikipedia definition of rate monotonic scheduling is a good place to start to gain an understanding of real-time processing resource analysis, but the serious designer should consult the original paper and more current and precise rate monotonic analysis methods. Liu and Layland's original paper, "Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment" (Journal of the Association for Computing Machinery, 1973), is one of the most frequently cited original works on real-time processor scheduling. The book Meeting Deadlines in Hard Real-Time Systems: The Rate Monotonic Approach, L. Briand and D. Roy (IEEE Computer Society Press, 1997), provides a more comprehensive overview of current rate monotonic methods of analysis.

Get products and technologies

• The Cell chip will be embedded in the PlayStation 3 gaming system. Competing systems, including the new Microsoft Xbox 360 with its 3.2GHz PowerPC®-based chip and the Nintendo Revolution, will also be Power Architecture technology-based.

• In general, SoCs can be designed as custom ASICs using soft cores, such as the Tensilica Xtensa Configurable SoC, or using a combination of hard cores, such as the PowerPC 405 found in the Xilinx Virtex-II Pro Reconfigurable SoC, and custom FPGA surrounding logic. Tensilica-configurable SoCs include instruction set extensibility with TIE and VLIW FLIX. Definition of new VLIWs (Very Long Instruction Words) is one way to accelerate common firmware computations.

About the author


Dr. Sam Siewert is an embedded system design and firmware engineer who has worked in the aerospace, telecommunications, and storage industries. He also teaches at the University of Colorado at Boulder part-time in the Embedded Systems Certification Program, which he co-founded. His research interests include autonomic computing, firmware/hardware co-design, ASIC/SoC architecture, and embedded real-time systems.

Trademarks
