RESEARCH CHALLENGES FOR ON-CHIP INTERCONNECTION NETWORKS

John D. Owens, University of California, Davis
William J. Dally, Stanford University
Ron Ho, Sun Microsystems
D.N. (Jay) Jayasimha, Intel Corporation
Stephen W. Keckler, University of Texas at Austin
Li-Shiuan Peh, Princeton University

ON-CHIP INTERCONNECTION NETWORKS ARE RAPIDLY BECOMING A KEY ENABLING TECHNOLOGY FOR COMMODITY MULTICORE PROCESSORS AND SOCS COMMON IN CONSUMER EMBEDDED SYSTEMS. LAST YEAR, THE NATIONAL SCIENCE FOUNDATION INITIATED A WORKSHOP THAT ADDRESSED UPCOMING RESEARCH ISSUES IN OCIN TECHNOLOGY, DESIGN, AND IMPLEMENTATION AND SET A DIRECTION FOR RESEARCHERS IN THE FIELD.

VLSI technology's increased capability is yielding a more powerful, more capable, and more flexible computing system on a single processor die. The microprocessor industry is moving from single-core to multicore and eventually to many-core architectures, containing tens to hundreds of identical cores arranged as chip multiprocessors (CMPs).1 Another equally important direction is toward systems on a chip (SoCs), composed of many types of processors on a single chip. Microprocessor vendors are also pursuing mixed approaches that combine multiple identical cores with different cores, such as the AMD Fusion processors combining multiple CPU cores and a graphics core.

Whether homogeneous, heterogeneous, or hybrid, cores must be connected in a high-performance, flexible, scalable, design-friendly manner. The emerging technology that targets such connections is called an on-chip interconnection network (OCIN), also known as a network on chip (NoC), whose philosophy has been summarized as "route packets, not wires."2 Connecting components through an on-chip network has several advantages over dedicated wiring, potentially delivering high-bandwidth, low-latency, low-power communication over a flexible, modular medium. OCINs combine performance with design modularity, allowing the integration of many design elements on a single die.

Although the benefits of OCINs are substantial, reaching their full potential presents numerous research challenges. In 2006, the National Science Foundation initiated a workshop to identify these challenges and to chart a course to solve them. The conclusions we present here are the work of all the attendees of the workshop, held last December at Stanford University. All the presentation slides, posters, and videos of the workshop talks are available online at http://www.ece.ucdavis.edu/~ocin06/program.html.

Published by the IEEE Computer Society. 0272-1732/07/$20.00 © 2007 IEEE

We found that three issues stand out as particularly critical challenges for OCINs: power, latency, and CAD compatibility. First, the power of OCINs implemented with current techniques is too high (by a factor of 10) to meet the expected needs of future CMPs. Fortunately, a combination of circuit and architecture techniques has the potential to reduce power to acceptable levels. Second, the latency of these networks is too large, leading to performance degradation when they are used to access on-chip memory. Research efforts to develop speculative microarchitectures that reduce latency through a router to a single clock, circuit techniques that increase signal velocity on channels, and network architectures that reduce the number of hops might overcome this problem. Third, many on-chip network circuit and architecture techniques are incompatible with modern design flows and CAD tools, making them unsuitable for use in SoCs. Research to provide library encapsulation of network components might provide compatibility.

The workshop identified five broad research areas and the key issues in each area:

• OCIN technology and circuits. How will technology (such as the CMOS roadmap from the International Technology Roadmap for Semiconductors) and circuit design affect on-chip network design?
• OCIN microarchitecture. What microarchitecture is needed for on-chip routers and network interfaces to meet latency, area, and power constraints?
• OCIN system architecture. What system architecture (topology, routing, flow control, interfaces) is best suited for on-chip networks?
• CAD and design tools for OCINs. What CAD tools are needed to design on-chip networks and systems using on-chip networks?
• Evaluation and driving applications for OCINs. How should on-chip networks be evaluated? What will be the dominant workloads for OCINs in five to 10 years?

About the workshop

The 2006 Workshop on On- and Off-Chip Interconnection Networks for Multicore Systems, held at Stanford University on 6 and 7 December 2006, brought together about 50 of the leading researchers from academia and industry studying on-chip interconnection networks (OCINs). The NSF-initiated workshop featured invited presentations, poster presentations, and working groups. The 15 invited presentations gave a technology forecast, surveyed applications, captured the current state of the art, and identified gaps in it. The posters covered related topics for which time did not allow a plenary presentation. Each of the five working groups met for a total of four hours to assess one aspect of OCIN technology, to perform a gap analysis, and to develop a research agenda for that aspect of on-chip networks. Each working group then presented a briefing on its findings.

We greatly appreciate the dedication and energy of the workshop participants in defining the research agenda we present in this article. The technology working group included Dave Albonesi, Cornell University; Keren Bergman, Columbia University; Nathan Binkert, HP Labs; Shekhar Borkar, Intel; Chung-Kuan Cheng, UC San Diego; Danny Cohen, Sun Labs; Jo Ebergen, Sun Labs; and Ron Ho, Sun Labs. The system architectures working group members included Jose Duato, Polytechnic University of Valencia; Partha Kundu, Intel; Manolis Katevenis, University of Crete; Chita Das, Penn State; Sudhakar Yalamanchili, Georgia Tech; John Lockwood, Washington University; and Ani Vaidya, Intel. The microarchitectures working group included Luca Carloni, Columbia University; Steve Keckler, University of Texas at Austin; Robert Mullins, Cambridge University; Vijay Narayanan, Penn State; Steve Reinhardt, Reservoir Labs; and Michael Taylor, UC San Diego. The design tools working group included Luca Benini, University of Bologna; Mark Hummel, AMD; Olav Lysne, Simula Lab, Norway; Li-Shiuan Peh, Princeton; Li Shang, Queens University, Canada; and Mithuna Thottethodi, Purdue. The evaluation working group included Rajeev Balasubramaniam, University of Utah; Angelos Bilas, University of Crete; D.N. (Jay) Jayasimha, Intel; Rich Oehler, AMD; D.K. Panda, Ohio State University; Darshan Patra, Intel; Fabrizio Petrini, Pacific National Labs; and Drew Wingard, Sonics.

The generous support of the National Science Foundation (through the Computer Architecture Research and Computer Systems Research programs) and the University of California Discovery Program made the workshop possible. Bill Dally and John Owens chaired the workshop, Timothy Pinkston and Jan Rabaey provided suggestions for workshop direction, and Jane Klickman provided expert logistic and administrative support.

Technology-driving applications

At the workshop, we considered two representative technology-driving applications for on-chip networks.

Applications for CMP systems

Large-scale, enterprise-class systems assembled as CMP-style machines require a high-performance network to attain the throughput important to their applications. For these machines, users will be willing to spend on power to achieve performance, at least to reasonable levels, such as to the air-cooled limit for chips. Cost will be important because it will determine how many racks can be purchased for a data center, but it will not be the overriding

SEPTEMBER–OCTOBER 2007 97


factor. With the emergence of graphics-based applications targeted to the end user, even desktop systems will have general- and special-purpose computing cores and other platform elements integrated on a die. These designs, which require an appropriate on-die interconnect, push technology limits with their need for high bandwidth and low latency under power and area constraints. Representative applications of OCINs on CMPs include

• Data centers, including transaction-processing systems and Web servers. CMPs address the need for further server consolidation, assuming memory bandwidth doesn't limit performance.
• High-performance computation. This not only encompasses traditional scientific applications but has expanded to real-time simulation, financial tasks, and bioinformatics.
• Recognition/mining/synthesis. Recognition tasks include facial recognition and other computer vision tasks; data mining includes text, image, or speech search. Mined or other data is synthesized to create new models.
• Medical and health. Examples are MRI and CT image processing.
• Desktop computers. Applications include computationally demanding media and gaming applications such as video and graphics.

Applications for embedded systems

The second driving application is embedded systems. For example, handheld personal electronic systems, of the same type as today's highly integrated cellphone-camcorder-MP3 devices, will require routing networks between elements of their SoC designs. Most CMP applications are also suitable for embedded applications, although perhaps at a smaller scale. The embedded space also includes portable applications that demand computing power coupled with efficiency. Next-generation portable applications include civilian devices such as firefighter communication devices that include real-time monitoring, local weather prediction, and video feedback to a central control location. Communication devices for soldiers will have similar computation, storage, and communication requirements. Other possible applications include real-time medical communication devices, handheld gaming devices, and PDAs.

The primary driver in these systems is cost, followed by active power dissipation (about 200 mW is necessary for a reasonable battery life). Although performance is important for these systems, perhaps more important is their ability to easily connect diverse IP blocks from different designers or vendors into a single system, motivating improved design styles and simple system integration.

The design and performance goals of high-performance systems differ from those of embedded systems. The research community should acknowledge these differences by pursuing research that addresses broad problems across many program domains, as well as more specific research in only one domain.

Technology and circuits

The most important technology constraint for on-chip networks is power consumption. A clear gap exists between today's technologies and what future on-chip networks will be using, not only for communication channels but also for memories used for network buffering.

Other constraints include design productivity and cost, reflecting the problems of using exotic or innovative technologies that require the development of CAD and vendor ecosystems. Still other constraints are reliability and fault tolerance, which are harder to quantify. The latter constraints are even more pressing for dynamically reconfigured routing networks because workload dependencies can make routing paths highly variable, not easily repeatable, and difficult to debug. The technology working group at the Stanford workshop focused on power for enterprise-class CMP machines and personal handheld devices, using some basic "back-of-the-envelope" analysis. For a performance-oriented CMP server, the group first set bandwidth and latency targets required for a typical application and then considered

whether the resulting energy costs would be feasible. For a battery-operated handheld device, the group first set total power dissipation and then calculated how much bandwidth that would support. Both calculations showed clear technology gaps and thus research directions of interest.

Enterprise-class CMP systems

The technology working group imagined a next-generation CMP of the year 2015. Figure 1 shows this design. In a 22-nm technology, a reasonably optimistic design point might integrate 256 cores on a 400-mm² die, in a 16 × 16 grid. A mesh routing grid for the 256 cores incorporates 480 total links (15 horizontal core-to-core links in each row or column and 32 total rows and columns), each 1.25 mm long. The chip, running at 0.7 V, could run at 7 GHz, about 25 gate delays per clock, on par with modern cores. Optimistic wire technology projections estimate a latency using repeaters of 100 ps/mm and a power cost of 0.25 mW/Gbps/mm.3

Figure 1. A CMP machine in 2015: 256 cores in a 16 × 16 grid.

One potential application for such a CMP is data mining. For this application class, we begin with a representative bisection bandwidth requirement of 2 Tbytes/s.4 With 16 links spanning the chip's bisector, this implies a 1-terabit per second (Tbps) bandwidth requirement per individual link. At a base clock of 7 GHz, achieving 1 Tbps requires each link to have 145 bits, or perhaps 72 bits using double-data-rate circuits. The repeated latency of each 1.25-mm link is 125 ps, which enables a single clock cycle per link hop. Long-distance transfers would benefit from multithreaded cores, so that communication across the entire chip would not stall total forward progress.

Such a chip would devote 20 percent of a 150-W power budget to an on-chip interconnect network consisting of three components: channels (wires), buffers, and switch. In our hypothetical design, we budget 10 W for each component. We consider the first two components in more detail.

Network channel power. We calculate network channel power at peak throughput, assuming every single link is fully active at its peak bandwidth. At 1 Tbps and 480 total links, this results in 480 Tbps over 1.25-mm links, or 150 W at 0.25 mW/Gbps/mm. This exceeds our allocated 10-W budget for the network channels by more than an order of magnitude. Although the assumption of full bandwidth in every link is unrealistic, many systems do exhibit relatively high utilization in short bursts. Even with lowered total activity factors of 25 percent, our network channel power is still unacceptably high.

Memory power. Each node requires a router with buffering. Bandwidth requirements for data mining call for 145-bit buses, which would attach to five two-way ports per router (bidirectional ports to the north, east, west, and south, and to the local core). To allow flow control and error checking such as cyclic redundancy checks, the buffers should store at least four flits deep; designers often use an additional multiplicative safety margin factor of eight to ensure that local storage never limits channel utilization. This leads to more than 45 Kbits per router of local storage, or more than 10 Mbits of chipwide storage.

For single-cycle access, the access latency of each 45-Kbit memory block should be under 150 ps. A basic low-power, six-transistor (6T) SRAM cell, plus its amortized portion of the decoder and sense-amp peripheral circuits, would require about 0.16 µm². Assuming that about 15 percent


of the area would represent switched capacitance, each cell has a load of 2 fF. Over 10 Mbits of memory, this leads to 20 nF, or about 70 W of total power at 100 percent activity. Again, this vastly exceeds the proposed 10-W budget. Using faster but more power-hungry register files would exacerbate the power budget problem.

Personal electronics device

The personal electronics device driver is of great interest not only to the home consumer who uses a cell phone daily but also to professional users who need high bandwidth, a moderate amount of computing, and some storage in a highly integrated handheld device. The technology working group expected a more specialized OCIN for this design space, and thus considered a tighter constraint of 5 percent total power to be network power. At 5 percent of a 200-mW limit (driven by battery life), network channels can consume only 5 mW each. Assuming a 50-mm² chip, link lengths must be around 7 mm. With an expected power consumption of 0.25 mW/Gbps/mm, this hypothetical system can sustain a total on-chip bandwidth of only 2.8 Gbps. This is remarkably small for future systems; it only slightly exceeds the appetite of a pair of HDTV video feeds and is almost certainly inadequate for tomorrow's computing requirements.

Research agenda

The dominant thread across both application scenarios is power, for communicating data across channels as well as for storage and switching in the network routers. In addition, memory-scaling trends underscore the difficulty of distributing a large, reliable, and fast memory across an on-chip network. Four fruitful research areas can help mitigate these difficulties:

• Reducing power by reducing the voltage swing on wires. In addition to fundamental circuit design research, equally important is developing a CAD ecosystem and design infrastructure for low-voltage signaling. ASIC design flows mandate a drop-in replacement for standard signaling to achieve market acceptance in the design community.
• Integrating multiple chips in a 3D (or at least 2.5D) stack. Breaking apart a wide, single, monolithic chip into a stack of many smaller chips can make total routes significantly shorter, saving total latency as well as total power.
• Using photonics on chips. Optics has achieved traction in chip-to-chip communication paths but not yet in on-chip environments because of integration difficulties and the costs of translating between optical and electrical domains. However, given the potentially low power and extremely low latency of optic connections (15 to 20 times faster than repeated wires), optics on chips is an intriguing area of open research for building routing networks.
• Reoptimizing basic technology parameters such as the metal buildup in modern processes. On-chip routing networks, with a preponderance of long wires and a relative dearth of transistors (at least compared with modern microprocessors), might benefit from trading off dense, higher-capacitance lower metal layers for lower-capacitance, coarser upper metal layers. Similar trade-offs might emerge as we reexamine underlying technologies specifically for routing networks.

Microarchitectures and system architectures

Having identified the fundamental circuit and technology issues, we turned to higher-level design issues: OCIN microarchitectures and system architectures. The individual presentations and the group discussion made clear that the best network microarchitecture depends strongly on an application's bandwidth and latency needs. A survey of several recent prototypes and products confirms that even for today's technologies, OCIN design varies widely, including

• high-bandwidth mesh networks connecting dozens of components,5,6
• ring and star networks for modest bandwidth communication between nearby IP blocks,7 and

• shared-bus and crossbar architectures for SoC applications.8

Some workshop attendees made cases for simple networks (networks with highly concentrated, lower-bandwidth links such as a bus or segmented bus) for applications with limited bandwidth demand. The increase in on-chip wire density might even extend the range for which such networks are feasible. However, other attendees claimed that scaling to higher bandwidths requires routed networks with less-concentrated links rather than highly concentrated bus-oriented networks. The difference in opinions among the attendees shows that further work is necessary to determine optimal network designs for applications of varying bandwidth demands.

Workshop members did agree that latency and power are the two most critical cross-cutting design challenges for OCIN architectures. They also discussed several other important research directions, including programmability, managing reliability and variability, and scaling on-chip networks to new technologies.

Latency

Minimizing latency in on-chip networks is critical to approaching the characteristics of traditional chip-level bus interconnects, which have typically been small in scale and low in latency. Low-latency networks make the system designer's and the programmer's jobs easier because low overhead reduces the need to avoid communication and encourages efforts in exploiting concurrency.

Network interfaces. Efficient, lightweight interfaces are critical for overall network latency reduction because the transmission time on wires and in routers in today's networks is often dominated by software overheads into and out of the networks. We see a need for thin network abstractions that expose hardware mechanisms for use by application-level programmers. These networks should be tightly coupled to the computation or storage elements attached to them, but they should also be general purpose to provide portability and utility across various uses. Virtualizing the network interface is a promising approach for providing atomicity and security, but such interfaces must not unduly add to latency. Research on remote queues,9 automatic method invocation on message arrival,10 and integrated microarchitectural networks6 has previously appeared, but more work on both the hardware and the software sides of network interfaces is needed.

Routers. Innovations in router architecture and microarchitecture are needed to reduce OCIN latencies while maintaining reasonable area and power budgets. Reducing the number of pipeline stages in the router is critical, as is congestion control with bounded or limited router buffering. Recent work in speculative router architectures pushing router pipelines to a single stage is promising, but more research is needed in speculative microarchitectures to improve accuracy and efficiency.11 Another promising research area is flow-control algorithms and microarchitectures that identify and accelerate critical traffic without substantially affecting the latency of less critical traffic. Research on better network and interface support for out-of-order message delivery to further the aims of adaptive routing is also promising. Improved efficiency and performance might be accessible to networks that exploit some form of static or stable information from the application. Potential examples are circuit-switched networks or a hybrid packet- and circuit-switched network, if circuit configuration time can remain small.

Exploiting wire density. The abundance of on-chip wires changes the trade-offs in OCIN network design. As mentioned, increased wire density can extend the viability of concentrated networks (such as bus-like networks) by allowing more links between network endpoints. Increased wire density can also open opportunities for innovations in OCIN topologies supported by higher-degree routers. Finally, wire densities will likely reduce the importance of virtual channels, because physical channels might no longer be the critical network resource. Such shifts in relative technology costs demand examination and innovation in OCINs.
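Latency and power, the two cross-cutting challenges named above, can both be checked against the workshop's 2015 design point with a few lines of arithmetic. The following Python sketch reproduces the back-of-the-envelope figures quoted in the technology discussion. The function and constant names are illustrative, and the buffer power formula (activity × C × V² × f) is an assumption: the article reports the result but does not state the formula.

```python
# Figures quoted from the article's 2015 CMP and handheld design points.
GRID = 16                          # cores per side of the 16 x 16 mesh
LINK_LENGTH_MM = 1.25
CLOCK_GHZ = 7.0
SUPPLY_V = 0.7
WIRE_DELAY_PS_PER_MM = 100.0
WIRE_POWER_MW_PER_GBPS_MM = 0.25
BISECTION_TBYTES_PER_S = 2.0


def mesh_links(n: int) -> int:
    """Links in an n x n mesh: (n - 1) per row or column, 2n rows and columns."""
    return (n - 1) * 2 * n


def link_bandwidth_gbps() -> float:
    """Per-link bandwidth when GRID links span the chip's bisector."""
    return BISECTION_TBYTES_PER_S * 8 * 1000 / GRID


def hop_fits_in_one_cycle() -> bool:
    """Compare the repeated-wire delay of one link with the clock period."""
    wire_delay_ps = LINK_LENGTH_MM * WIRE_DELAY_PS_PER_MM
    clock_period_ps = 1000.0 / CLOCK_GHZ
    return wire_delay_ps < clock_period_ps


def channel_power_w(activity: float = 1.0) -> float:
    """Total wire power with every link carrying `activity` of peak bandwidth."""
    total_gbps = mesh_links(GRID) * link_bandwidth_gbps() * activity
    return total_gbps * LINK_LENGTH_MM * WIRE_POWER_MW_PER_GBPS_MM / 1000.0


def buffer_bits_per_router(width_bits=145, ports=5, directions=2,
                           flits=4, margin=8) -> int:
    """Storage per router: bus width x two-way ports x flit depth x margin."""
    return width_bits * ports * directions * flits * margin


def buffer_power_w(total_bits=10e6, cell_load_f=2e-15, activity=1.0) -> float:
    """Assumed dynamic power model: activity * C * V^2 * f."""
    capacitance_f = total_bits * cell_load_f
    return activity * capacitance_f * SUPPLY_V ** 2 * CLOCK_GHZ * 1e9


def handheld_bandwidth_gbps(channel_budget_mw=5.0, link_length_mm=7.0) -> float:
    """Bandwidth a handheld channel budget can sustain over one link length."""
    return channel_budget_mw / (link_length_mm * WIRE_POWER_MW_PER_GBPS_MM)


if __name__ == "__main__":
    print("links:", mesh_links(GRID))
    print("per-link Gbps:", link_bandwidth_gbps())
    print("single-cycle hop:", hop_fits_in_one_cycle())
    print("channel W (100%):", channel_power_w())
    print("channel W (25%):", channel_power_w(0.25))
    print("buffer bits/router:", buffer_bits_per_router())
    print("buffer W:", buffer_power_w())
    print("handheld Gbps:", handheld_bandwidth_gbps())
```

Under these inputs the mesh has 480 links and the channels draw 150 W at full activity and 37.5 W at 25 percent activity, both well above the 10-W channel budget. Dividing 1 Tbps by 7 GHz gives roughly 143 bits per link, which the article rounds to 145.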


Power

Power has become a major concern in system design and must be budgeted and traded off among different system parts, including the communication infrastructure. As described earlier, not all systems using on-chip networks will operate at the same power-performance point. Promising areas of research on power techniques for various deployment domains include

• power-efficient designs that limit router complexity and unnecessary work,
• adaptive power management that lets networked systems shift power between computation and communication on the basis of the application or application phase, and
• dynamic voltage and frequency modulation in the network.

Programmability

For an effective concurrent SoC or multicore system, a programmer needs a fast and robust on-chip network transport, fast and easy-to-use network interfaces, and predictable network performance.

Modeling and measurement. In effect, today's networks are black boxes to programmers, who find it very difficult to reason about network bottlenecks when writing and optimizing their programs. To solve this problem, we recommend research into network modeling and measurement techniques for use by application programmers. Network modeling means developing cost models for network latency under different traffic patterns and workloads to enable programmers to predict how an application will perform. The community should not be surprised if sacrificing peak network performance for a greater degree of predictability is desirable. Measurement means network hardware, such as performance counters, and tools that can synthesize the measurements into feedback that helps programmers understand how an application uses the network. Many tools have been developed over the past decade to help programmers understand program performance on uniprocessors; it is time to embark on the creation of such tools for OCINs.

Network robustness. Network robustness includes low-overhead support for deadlock avoidance, mechanisms for quality of service for traffic of different priorities, and network-based tolerance of unexpected failures. One promising mechanism for handling unusual network events in a lightweight fashion is network-driven exceptions that can be handled in software by general- or special-purpose processing elements. Network microarchitectures should be scalable across generations of systems, and a related challenge is interfacing on-chip networks to off-chip, board-level, rack-level, and systemwide networks. Unifying the protocols across these different transport layers can make the protocols easier to build and easier for programmers to reason about.

Network services. Incorporating more intelligence into the network and its protocols can ease programmer burden and simplify system design. Recently, researchers have discussed incorporating support for cache coherence in the network layer.12 Other possible research areas include security and encryption services. Whether breaking down abstraction barriers between the transport layer and the memory layer is viable, and what other opportunities exist for creating high-level network-based services, remain open questions.

Reliability and variability

With shrinking transistor and wire dimensions, reliability and variability have become significant challenges for IC designers. Past research has examined methods of providing network reliability. Now on-chip networks will need new lightweight mechanisms for link-level and end-to-end service guarantees. One example is self-monitoring links and switches that detect failures and intelligently reconfigure themselves. Both high-performance and embedded systems will require power-, latency-, and area-efficient, error-tolerant designs to provide useful on-chip network infrastructure.
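To make the idea of a self-monitoring link concrete, here is a minimal software sketch in Python. It is not a mechanism proposed at the workshop: the `MonitoredLink` class, the `flip_bit` helper, and the failure threshold are all hypothetical, and a real link would do this in hardware. The sketch appends a CRC-32 to each flit, checks it on receipt, and marks the link as disabled after a run of consecutive failures so that a router could route around it.

```python
import zlib


class MonitoredLink:
    """Hypothetical link wrapper: CRC-checks flits and flags repeated failures."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.disabled = False

    @staticmethod
    def encode(payload: bytes) -> bytes:
        """Append a 4-byte CRC-32 of the payload to form a flit."""
        return payload + zlib.crc32(payload).to_bytes(4, "big")

    def receive(self, flit: bytes):
        """Return the payload if the CRC matches, otherwise None.

        A None result stands in for whatever recovery the network uses,
        such as retransmission or rerouting.
        """
        payload, received_crc = flit[:-4], int.from_bytes(flit[-4:], "big")
        if zlib.crc32(payload) == received_crc:
            self.consecutive_failures = 0
            return payload
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.disabled = True  # signal that the link should be bypassed
        return None


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted, to emulate a link fault."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    link = MonitoredLink(failure_threshold=2)
    flit = MonitoredLink.encode(b"flit payload")
    print(link.receive(flit))
    print(link.receive(flip_bit(flit, 5)), link.disabled)
    print(link.receive(flip_bit(flit, 5)), link.disabled)
```

Counting consecutive failures rather than total failures is one simple way to separate a transient upset from a persistent fault; other policies, such as a windowed error rate, would fit the same interface.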

Fabrication variation. Fabrication process variability, either on dies or across wafers, can prevent a single static design from achieving high performance and low power for all fabricated devices. Postfabrication network tuning is a promising way to tolerate fabrication faults as well as speed variations of different network elements. Some form of network self-test, along with configuration (perhaps in the same way on-chip memories employ redundant rows), might prove useful. Another method might be to exploit elasticity in the network links to tolerate variations in router speeds, perhaps using self-timed or asynchronous circuits and microarchitectures.

Traffic variation. Another form of variability arises from the different types of traffic delivered by different applications or different phases of the same application. Applications differ in message length, message type (data, synchronization, and so forth), message patterns (regular streams, unstructured, and so forth), and message injection rates (steady or bursty). Again, the abundance of on-chip wires provides an opportunity to specialize or replicate networks to improve latency or efficiency across multiple types of loads. Identifying the proper set of on-chip communication primitives and designing networks that implement them will be a valuable line of inquiry.

Technology scaling

Network design has always been subject to technology constraints, such as package pin bandwidth. Although wire count constraints are less important on chip, smaller feature sizes affect the relative cost of communication and computation. Faster computation relative to wire flight time motivates more intelligent routing algorithms designed to minimize message hop count and network congestion. Combined with the likelihood of large numbers of on-chip networked elements, this trend indicates a need for research into technology-driven and scalable router, switch, and link designs. As emerging technologies, such as 3D die integration, on-chip optical communication, and any of many possible postsilicon technologies become viable, new opportunities and constraints will further drive the need for innovation in interconnection networks. We must make early investments in characterizing changing and emerging technologies from the perspective of on-chip networks as well as new network designs motivated by such shifts in technology.

OCIN design tools

The desire for flexible, high-performance OCINs compatible with modern chip design approaches motivates new approaches to the design tools that will create them. The design tools working group identified seven key research challenges in the development of CAD tools targeting multicore processor chips and SoCs. Figure 2 shows an overview of these research challenges.

Figure 2. Overview of CAD challenges in on-chip network design and how the subcomponents interact and form the envisioned next-generation CAD tools for on-chip networks.

1. Interface of network synthesis with system-level constraints and design. As chips move toward multicore design in future technologies, system-level constraints become increasingly complex, and requirements become more multifaceted. It is essential for OCIN synthesis tools to interface effectively with these constraints and requirements. The foremost challenge is the accurate characterization and modeling of system traffic, such as that imposed by a shared-memory SoC or a platform-specific chip.
2. Hybrid custom and synthesized tool flow. General-purpose processors typically lead the embedded market with aggressive, innovative microarchitectures and custom designs. It is therefore critical for design tools to leverage these high-performance designs within the existing tool flow for easy adoption into mass-market embedded devices. Can we construct specialized libraries for networks, and how can we integrate them into the entire CAD tool flow? This is particularly important for facilitating fast transfer of research into products benefiting the mass market.
3. Design validation. A critical hurdle in deploying on-chip networks is validating their operation. The key questions are how can we ensure designs are robust in the face of process variations and tight cost budgets, and how can we factor validation cost into the CAD design tool chain?
4. Impact of CMOS scaling and new interconnect technologies. For design tools to be effective as CMOS scales, we need new timing, area, power, thermal, and reliability models for future CMOS processes, circuits, and architectures. New interconnect technologies must meet this need to ease adoption. Models and libraries should be available with proposals of new interconnects. This modeling infrastructure should also be extensible to ensure integration of new technologies and interconnects.
5. Design tool chain with end-user feedback. As network scale and complexity increase, new design tools must provide feedback to help designers. For instance, feedback of network characteristics would allow designers to quickly iterate their designs. Research in this domain can potentially leverage design tool feedback research in other network domains such as the Internet, although there are clearly substantial differences in the OCIN domain's requirements.
6. Dynamic, reconfigurable network tools. Not only must general-purpose multicore chips support a wide variety of traffic and applications, OCINs in SoC platforms also must increasingly support a wide variety of applications to facilitate fast time to market. So dynamic reconfigurable network tools will be very useful, allowing soft router cores that can be configured on the fly to match different application profiles, similar to just-in-time software compilation.
7. Beyond simulation. Today's network design tools rely heavily on network simulation to drive power and performance estimates. For future large-scale networks and systems, however, simulation will no longer be tenable because of their complexity. Thus, we see a need for research into analytical methods, such as formal methods and queuing
104 IEEE MICRO

analysis-based tools for estimating network power-performance. Although researchers can leverage prior work, the key distinct features of on-chip networks (such as physical constraints and link-level flow control) motivate new analysis approaches as well.

The seven design tool challenges will critically affect both the embedded-SoC and the general-purpose computing markets. Overcoming these challenges will enable complex, correct network designs that would otherwise be impossible and facilitate the adoption of on-chip networks.

Evaluation and driving applications for OCINs

The evaluation working group began by identifying the applications and workloads (described earlier) most likely to drive interconnect requirements and then characterized those workloads in terms of the architecture and programming model. From that characterization, we studied the network requirements and pinpointed a research agenda to address them.

Architectural characterization and programming models

How do the driving applications affect an OCIN? These applications have diverse access patterns. For example, one pattern is heavily cacheable traffic (read-only and read-write sharing), which places a significant performance burden on the on-chip interconnect. Another pattern is streaming traffic from DRAM or I/O, which places the primary burden on external interfaces (mainly because of pinout limitations) and a secondary burden on the on-chip network. A second difficulty is the traffic’s bursty nature and the additional pressure that places on congestion management mechanisms.

With increasing integration, we expect that single-chip devices will have a diverse set of data producers and consumers attached to the OCIN. These might include specialized engines such as shader, texture, and fixed-function units.13 Packetization at cache-line granularity would be inefficient for a subset of traffic generated by such units. Hence, the interconnect not only must efficiently support diverse traffic patterns but also must possibly meet quality-of-service guarantees or even soft real-time constraints. SoC architectures in which a large number of diverse IP blocks is the rule rather than the exception exacerbate these needs.14

Supporting multiple cores on a single chip also reveals new management problems. With server consolidation workloads, a single CMP must be dynamically partitioned into several systems. But it must also support performance isolation (one partition’s traffic shouldn’t affect another partition’s performance) and fault isolation (a partition reset shouldn’t force reset of another partition). In addition, security concerns require that different system parts running separate applications be effectively isolated. Interestingly, many of these seemingly diverse scenarios have a commonality from the on-chip network’s perspective: Because the network is shared, all these scenarios require some form of network isolation, either virtual or physical.

We also forecast a need to support synchronization or communication primitives in the network for coherence-style traffic (for example, to efficiently broadcast and collect invalidations at the home nodes) and message-passing traffic (for example, to broadcast data). In the first scenario, with hundreds of processing elements, even directory-based systems wouldn’t scale without such interconnect support.

Because of these diverse application requirements and the equally diverse programming styles that will create software for these processors, we expect CMPs to support both coherent shared-memory and message-passing programming modes. This motivates efficient support in the interconnect for cache-sized line transfers and variable-length message transfers.

Network requirements and evaluation metrics

On the basis of the architectural characterization, we recommend four areas of emphasis in OCIN design and implementation:

• Efficient data transfer support at various granularities for coherent and
message-passing paradigms and for different types of specialized engines.
• Support for partitioning, including quality-of-service guarantees (perhaps through separate virtual channels or completely partitioned subnetworks), performance isolation (so that partitions don’t share routing paths in the OCIN), and isolation for security (through partitioned subnetworks).
• Clean, efficient common network interfaces to support multiple programming models.15
• Possible support for synchronization and communication primitives such as multicast and barriers.

A further research challenge is to define evaluation metrics, such as latency and bandwidth, under the constraints of chip area, power, energy, and heat dissipation. Another need is to standardize the evaluation metrics so that architectural implementations can be unambiguously compared.

Beyond the design issues mentioned earlier, we also pose the following research questions:

• With the need for dynamic partitioning16 and possibly fault tolerance resulting from process variability or the need for reliability, network topology doesn’t remain static. Dynamic partitioning thus creates subnetworks with different topologies than the static one. What support is needed at the hardware and system software levels to support such dynamic reconfiguration?
• How can we develop analytical models to predict the real-time guarantees of the architecture being designed? SoC OCIN designs have a particular need for these models.
• How can we monitor network performance under constraints to study the effectiveness of networkwide policies? For instance, once network utilization has crossed a threshold, how does a particular class of traffic behave?
• We recognize that realistic full-system simulation, especially execution-based simulation, will not be possible given the current set of tools and methodologies. Many groups in academia and industry are resorting to emulation through the use of FPGAs to overcome the simulation speed problem. Is that sufficient? A concerted effort across multiple research disciplines in computer engineering is necessary for a realistic study of CMP and SoC workloads.
• How can we compare different systems under similar workloads? The community must develop a suite of workloads and benchmarks for such a comparison (such as the SPEC suite used by the CPU community17). The suite should specify the mix of workloads to run concurrently and should provide common evaluation criteria for comparison. This requires a cooperative effort by groups in academia and industry interested in CMP and SoC architectures. There has been initial activity in this direction in the SoC community18 and a call to action in the CMP community.19

OCINs are a critical technology that will enable the success of future CMPs and SoCs for embedded applications. To make sure that this technology is in place when needed, we recommend a staged research program to carry out the following key tasks:

Develop low-power circuits and architectures. To close the power gap, research should develop optimized circuits for OCIN components: channels, buffers, and switches, as well as architectures targeted for low power. This research can reduce power consumption by an order of magnitude, allowing it to fit in the expected power envelopes for future CMPs and SoCs. This work will set the constraints and provide optimized building blocks for architecture and microarchitecture efforts.

Develop low-latency network and router architectures. Architecture research must address the primary issues of power and latency, as well as critical issues such as congestion control. This work should
address network-level architecture (topology, routing, and flow control), as well as router microarchitecture. It should reduce the delay of routers (possibly to one cycle) and reduce the number of hops required by a typical message. Circuit research to reduce channel latency can also help close the latency gap. This work will enable OCINs to match the latency of dedicated wiring.

Encapsulate OCIN components. To make OCIN technology accessible to SoC designers, research on design methods must encapsulate the OCIN components and architectures in libraries and generators that are compatible with standard CAD flows, for example, as parameterized hard macros. Tools that automatically synthesize OCINs from these macros (as well as from blocks of standard logic) are also needed. This research will remove one of the largest roadblocks to adoption of on-chip networks in SoCs.

Develop prototype OCINs. The research community should design, construct, and evaluate optimized prototypes, which can expose unanticipated problems, provide a baseline for future research, and serve as a testbed for new OCIN components. This work will also serve as a proof of concept for OCINs, reducing their perceived risk and facilitating transfer of this technology to industry.

Develop standard benchmarks and evaluation methods. To keep OCIN research focused on real problems, the community should develop standard benchmarks and evaluation methods. Standard benchmarks allow direct comparison of research results and facilitate information exchange between researchers.

If our recommended research course is successful, OCINs are likely to realize their potential to provide high-bandwidth, low-latency, low-power interconnect for CMPs and SoCs. OCINs will provide a key technology needed for the large-scale CMPs expected to dominate computing in the near future. Without this research, OCINs won’t meet the needs of many next-generation CMP applications (leading to a serious on-chip bandwidth issue for future computers), and optimized OCINs won’t be usable in SoCs because of CAD tool and design flow incompatibilities.

References

1. “Special Session: Thousand-Core Chips,” 44th Design Automation Conf., 2007, http://www2.dac.com/data2/44th/44acceptedpapers.nsf/websessions/42.
2. W.J. Dally and B. Towles, “Route Packets, Not Wires: On-Chip Interconnection Networks,” Proc. 38th Conf. Design Automation (DAC 01), ACM Press, 2001, pp. 684-689.
3. R. Ho, K. Mai, and M. Horowitz, “Managing Wire Scaling: A Circuit Perspective,” Proc. IEEE Int’l Interconnect Technology Conf., IEEE Press, 2003, pp. 177-179.
4. K. Bergman et al., “Optical On-Chip Networks for High-Performance, Energy-Efficient Multi-Core Architectures,” poster session, Workshop On- and Off-Chip Interconnection Networks for Multicore Systems, Dec. 2006, http://www.ece.ucdavis.edu/~ocin06/posters.html.
5. Intel’s Teraflops Research Chip, 2007, http://techresearch.intel.com/articles/Tera-Scale/1449.htm.
6. P. Gratz et al., “Implementation and Evaluation of a Dynamically Routed Processor Operand Network,” Proc. 1st Int’l Symp. Networks-on-Chip (NOCS 07), 2007, pp. 7-17.
7. “STMicroelectronics Unveils Innovative Network-on-Chip Technology for New System-on-Chip Interconnect Paradigm,” press release, Dec. 2005, http://www.st.com/stonline/press/news/year2005/t1741t.htm.
8. “Sonics Defines SoC Interconnect Choices,” press release, June 2006, http://findarticles.com/p/articles/mi_m0EIN/is_2006_June_26/ai_n16499393.
9. E.A. Brewer et al., “Remote Queues: Exposing Message Queues for Optimization and Atomicity,” Proc. 7th Ann. ACM Symp. Parallel Algorithms and Architectures (SPAA 95), ACM Press, 1995, pp. 42-53.
10. W.J. Dally et al., “The Message-Driven Processor: A Multicomputer Processing Node with Efficient Mechanisms,” IEEE Micro, vol. 12, no. 2, Mar.-Apr. 1992, pp. 23-39.
11. R. Mullins, A. West, and S. Moore, “Low-Latency Virtual-Channel Routers for On-Chip Networks,” Proc. 31st Ann. Int’l Symp. Computer Architecture (ISCA 04), IEEE CS Press, 2004, pp. 188-197.
12. N. Eisley, L.-S. Peh, and L. Shang, “In-Network Cache Coherence,” Proc. 39th Ann. IEEE/ACM Int’l Symp. Microarchitecture (Micro 06), IEEE CS Press, 2006, pp. 321-332.
13. J. Held, J. Bautista, and S. Koehl, “From a Few Cores to Many: A Tera-Scale Computing Research Overview,” 2006, http://download.intel.com/research/platform/terascale/terascale_overview_paper.pdf.
14. L. Benini and G. De Micheli, “Networks on Chips: A New SoC Paradigm,” Computer, vol. 35, no. 1, Jan. 2002, pp. 70-78.
15. M. Katevenis, “Towards Light-Weight Intra-CMP Network Interfaces,” Workshop on On- and Off-Chip Interconnection Networks for Multicore Systems, Dec. 2006, http://www.ece.ucdavis.edu/~ocin06/program.html.
16. J. Duato et al., “Part I: A Theory for Deadlock-Free Dynamic Reconfiguration of Interconnection Networks,” IEEE Trans. Parallel and Distributed Systems, vol. 16, no. 5, May 2005, pp. 412-427.
17. J.L. Henning, “SPEC CPU2006 Benchmark Descriptions,” SIGARCH Computer Architecture News, vol. 34, no. 4, Sept. 2006, pp. 1-17.
18. C. Grecu et al., “An Initiative towards Open Network-on-Chip Benchmarks,” 2007, http://www.ocpip.org/socket/whitepapers/NoC-Benchmarks-WhitePaper-15.pdf.
19. J. Rattner, “Cool Codes for Hot Chips,” Hot Chips 18, 2006, http://www.hotchips.org/archives/hc18/2_Mon/HC18.Keynote%20One/HC18.Keynote1.pdf.

John D. Owens is an assistant professor of electrical and computer engineering at the University of California, Davis. His research interests include GPU computing and, more broadly, commodity parallel hardware and programming models. Owens has a PhD in electrical engineering from Stanford University.

William J. Dally is the Willard R. and Inez Kerr Bell Professor of Engineering and the chairman of the Department of Computer Science at Stanford University. His research interests include high-speed signaling, computer architecture, network architecture, and programming systems. Dally has a PhD in computer science from California Institute of Technology.

Ron Ho is a distinguished engineer at Sun Microsystems. His research interests include off-chip and on-chip communication technologies. Ho has a PhD in electrical engineering from Stanford University.

D.N. (Jay) Jayasimha is a principal engineer in the Corporate Technology Group at Intel Corporation. His research interests include multiprocessor architectures, interconnection networks, and performance analysis. Jayasimha has a PhD from the University of Illinois at Urbana-Champaign.

Stephen W. Keckler is an associate professor of computer sciences and electrical and computer engineering at the University of Texas at Austin. His research interests include computer architecture, interconnection networks, and parallel processor architectures. Keckler has a PhD in electrical engineering and computer science from the Massachusetts Institute of Technology.

Li-Shiuan Peh is a guest editor of this special issue. Her biography appears on page 5.

Direct questions and comments about this article to John D. Owens, Dept. of Electrical and Computer Engineering, University of California, Davis, One Shields Ave., Davis, CA 95616; [email protected].
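
As a rough illustration of the kind of analytical, queuing-based estimate that the "Beyond simulation" challenge calls for, the following sketch combines a zero-load latency term (hop count times per-hop delay, plus serialization) with a simple contention term. Everything in it is an assumption made for illustration and is not taken from the workshop: the k x k mesh topology, uniform random traffic, the treatment of each hop as an independent M/M/1 queue, and all function and parameter names (`mean_hops_mesh`, `zero_load_latency`, `mm1_wait`, `estimate_latency`) are hypothetical. A real tool would need to model link-level flow control and physical constraints, which this sketch ignores.

```python
from itertools import product


def mean_hops_mesh(k: int) -> float:
    """Average minimal (Manhattan) hop count between distinct nodes of a
    k x k mesh, assuming uniform random traffic (an illustrative assumption)."""
    if k < 2:
        raise ValueError("mesh needs at least 2 nodes per dimension")
    nodes = list(product(range(k), repeat=2))
    total = sum(abs(ax - bx) + abs(ay - by)
                for (ax, ay) in nodes
                for (bx, by) in nodes)
    ordered_pairs = len(nodes) * (len(nodes) - 1)  # exclude self-pairs
    return total / ordered_pairs


def zero_load_latency(hops: float, router_cycles: float,
                      link_cycles: float, flits: int) -> float:
    """Latency in cycles with no contention: per-hop router and link delay
    for the head flit, plus serialization of the remaining flits."""
    return hops * (router_cycles + link_cycles) + (flits - 1)


def mm1_wait(utilization: float, service_cycles: float) -> float:
    """Mean waiting time in queue for an M/M/1 server:
    rho / (1 - rho) * mean service time. Assumes Poisson arrivals and
    exponential service, which real on-chip traffic may not satisfy."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1.0 - utilization) * service_cycles


def estimate_latency(k: int, router_cycles: float, link_cycles: float,
                     flits: int, utilization: float) -> float:
    """Zero-load latency plus an independent M/M/1 wait at every hop.
    Service time per packet is taken to be its length in flits (assumption)."""
    hops = mean_hops_mesh(k)
    base = zero_load_latency(hops, router_cycles, link_cycles, flits)
    contention = hops * mm1_wait(utilization, service_cycles=flits)
    return base + contention


if __name__ == "__main__":
    # Hypothetical design point: 8 x 8 mesh, one-cycle routers and links,
    # 5-flit packets, swept over channel utilization.
    for rho in (0.0, 0.3, 0.6, 0.9):
        print(rho, round(estimate_latency(8, 1, 1, 5, rho), 2))
```

Even a model this crude shows the two levers named in the low-latency research task: shrinking `router_cycles` toward one cycle and lowering the average hop count both reduce the zero-load term, while the contention term grows sharply as utilization approaches saturation.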
