Computer Engineering 2007 Mekelweg 4, 2628 CD Delft The Netherlands http://ce.et.tudelft.nl/

MSc THESIS

Communication-centric Debugging of Systems on Chip using Networks on Chip

Siddharth Umrani

Abstract

Rapid technology scaling, i.e. shrinking feature sizes, means that a large number of components can be integrated on a single Integrated Circuit (IC). This increased complexity translates into an increase in design effort and also potentially more design errors. Changes are therefore required in system-on-chip development which reduce both design effort and design errors. To reduce design effort, a modular design methodology is used which promotes reuse of already designed IP cores rather than the design of IP cores themselves. The complexity of such a chip thus resides in the communication between these cores rather than in the computation taking place within them. The shrinking feature size also introduces Deep Sub-Micron (DSM) effects in on-chip interconnect wires. Networks on chip have since evolved as a promising new type of interconnect with the potential to alleviate these shortcomings. Effective debug aids the fast and accurate detection of the majority of the errors that may be present in the design, thus reducing the number of iterations in the design cycle (and effectively the time to market). Traditional debug is core-based, where each of the IP cores in a SoC is the locus of debug actions. Communication-centric debug has been proposed as a complementary debug solution that uses the interconnect to debug the chip. Combining these debug strategies can help speed up accurate error localization during debug, with significant possible gains in reducing time to market. This thesis report presents a debug infrastructure that facilitates Communication-Centric Debug of Systems on Chip using Networks on Chip.

Faculty of Electrical Engineering, Mathematics and Computer Science

Communication-centric Debugging of Systems on Chip using Networks on Chip

A Debug Infrastructure

THESIS

submitted in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

in

COMPUTER ENGINEERING

by

Siddharth Umrani
born in Thane, India

Computer Engineering
Department of Electrical Engineering
Faculty of Electrical Engineering, Mathematics and Computer Science
Delft University of Technology

Communication-centric Debugging of Systems on Chip using Networks on Chip

by Siddharth Umrani

Abstract

Rapid technology scaling, i.e. shrinking feature sizes, means that a large number of components can be integrated on a single Integrated Circuit (IC). This increased complexity translates into an increase in design effort and also potentially more design errors. Changes are therefore required in system-on-chip development which reduce both design effort and design errors. To reduce design effort, a modular design methodology is used which promotes reuse of already designed IP cores rather than the design of IP cores themselves. The complexity of such a chip thus resides in the communication between these cores rather than in the computation taking place within them. The shrinking feature size also introduces Deep Sub-Micron (DSM) effects in on-chip interconnect wires. Networks on chip have since evolved as a promising new type of interconnect with the potential to alleviate these shortcomings. Effective debug aids the fast and accurate detection of the majority of the errors that may be present in the design, thus reducing the number of iterations in the design cycle (and effectively the time to market). Traditional debug is core-based, where each of the IP cores in a SoC is the locus of debug actions. Communication-centric debug has been proposed as a complementary debug solution that uses the interconnect to debug the chip. Combining these debug strategies can help speed up accurate error localization during debug, with significant possible gains in reducing time to market. This thesis report presents a debug infrastructure that facilitates Communication-Centric Debug of Systems on Chip using Networks on Chip.

Laboratory : Computer Engineering Codenumber : CE-MS-2007-11

Committee Members :

Advisor: Kees Goossens, CE, TU Delft and NXP Semiconductors

Advisor: Georgi Gaydadjiev, CE, TU Delft

Member: Zaid Al-Ars, CE, TU Delft

Member: René van Leuken, CAS, TU Delft

Dedicated to my parents and my brother Aditya

Contents

List of Figures ix

Acknowledgements xi

1 Introduction 1
  1.1 Motivation 1
  1.2 Goals 1
  1.3 Previous Work 1
  1.4 Organization of Report 2

2 Network-on-chip (NoC) 3
  2.1 Introduction 3
  2.2 Interconnect Terminology 3
  2.3 Timeline of Interactions 10
  2.4 Æthereal NoC 12
  2.5 Network Interface 13

3 Debug 17
  3.1 Introduction 17
  3.2 Debug Flow 19
  3.3 Debug Granularity 20

4 Communication Centric Debug 25
  4.1 Introduction 25
  4.2 Design choices 25
  4.3 Debug Strategy for SoCs 26
  4.4 Locus of communication-centric debug control 27
  4.5 DTL Protocol 29
  4.6 Debug Control Actions 29
  4.7 Example 35

5 Debug Hardware Infrastructure 39
  5.1 Overview 39
  5.2 Monitors 39
  5.3 Event Distribution Interconnect (EDI) 41
  5.4 Test Point Registers (TPRs) 46
  5.5 Network Interface Shell (NI Shell) 54
  5.6 Test Access Port (TAP) 62
  5.7 Debug Flow Automation 64

6 Debug Software Infrastructure 65
  6.1 User programming via the TAP 65
  6.2 Use of Debug Infrastructure 66
  6.3 Debug Flow 68

7 Results 71
  7.1 Programming the TPRs 71
  7.2 EDI stop pulse distribution 72
  7.3 Debug Control Actions in the shells 73
  7.4 Area Cost and Speed 75

8 Conclusions 79
  8.1 Conclusions 79
  8.2 Future Work 79

Bibliography 85

A Constraints on External Stop Pulse 87

B List of Acronyms 89

List of Figures

2.1 IP and its port 4
2.2 Transactions (Read and Write) 4
2.3 Messages and Elements 5
2.4 Signal Handshake 6
2.5 Signal Groups and Signals 7
2.6 Connection and Channels 8
2.7 Communication - (a) Narrowcast (b) Multi-initiator 9
2.8 Hierarchies 9
2.9 Master and Slave IPs communicating with NoC as interconnect 10
2.10 Timeline of Interactions (MNI - Master Network Interface, SNI - Slave Network Interface) 11
2.11 Æthereal NoC 13
2.12 Æthereal connection 14
2.13 Æthereal Network Interface 15
2.14 Connections / channels in Æthereal 15
2.15 Visible granularities of Interactions 16

3.1 Digital design flow (Source: [29]) 17
3.2 Real-time debug approach. In this scenario, internal signals are observed in real-time via external on-chip pins 19
3.3 Scan-based debug approach. In this scenario, every time the chip reaches a quiescent state, the functional clocks can be stopped and the internal state read out 20
3.4 Traditional scan-based debug flow (Source: [29]) 21
3.5 Proposed scan-based debug flow 21
3.6 (a) Granularity of internal NoC control. (b) Granularity of control between IP and NoC 22

4.1 (a) Computation-centric debug (b) Communication-centric debug (Source: [37]) 26
4.2 Debug flow using Communication-centric debug 27
4.3 Locus of communication-centric debug control 28
4.4 Debug control action interfaces (MNI - Master Network Interface, SNI - Slave Network Interface) 29
4.5 DTL Signals (Source: [30]) 30
4.6 Timeline for a Stop (MNI - Master Network Interface, SNI - Slave Network Interface) 32
4.7 Timeline for a Continue (MNI - Master Network Interface, SNI - Slave Network Interface) 33
4.8 Example illustrating the various debug actions over an IP-NoC interface 34
4.9 Example SoC showing connections setup 36

5.1 The Debug Infrastructure 40
5.2 Monitor Interface, where the monitor stop is connected to the EDI, link data to the router link which is to be monitored and monitor config to the monitor config TPR which specifies the breakpoint condition 41
5.3 Breakpoint Generation logic inside a Monitor 41
5.4 Monitor gate-level waveforms for breakpoint hit 42
5.5 Standing wave creation in the EDI 43
5.6 Sub-sampling of a breakpoint hit pulse 44
5.7 Stop Module Interfaces, where N is the number of neighboring devices (other Stop Modules and NIs) 44
5.8 Stop Module FSM, where stop in is the logical OR of all N neighbouring input stop signals and stop out the output signal to all N neighbouring devices 45
5.9 Stop Module waveforms for monitor stop 46
5.10 Stop Module waveforms for external user stop through TAP 46
5.11 Programming of the Monitor Config TPR 47
5.12 The internal structure of the NI-Shell TPR, which is imperative to know during programming in order to be able to programme the right value for the desired control 48
5.13 Explains the function of Stop Enable field in the NI-Shell TPR 49
5.14 Behaviour when Stop Condition field is de-asserted in the NI-Shell TPR 50
5.15 Behaviour when Stop Condition field is asserted in the NI-Shell TPR 51
5.16 Behaviour when Stop Granularity field is de-asserted in the NI-Shell TPR 52
5.17 Behaviour when Stop Granularity field is asserted in the NI-Shell TPR 53
5.18 Continue operation 54
5.19 Explains the function of Continue field in the NI-Shell TPR 55
5.20 NI Shell FSM (Mirror State transitions) 56
5.21 Narrowcast Shell (in the FIFO shown the channel IDs of unfinished read requests are buffered) 57
5.22 Narrowcast Shell FSM (Request channels) - 'FSM 1' in Figure 5.21 58
5.23 Narrowcast Shell FSM (Response channels) - 'FSM 2' in Figure 5.21 59
5.24 Multiconnection Shell 60
5.25 Multiconnection Shell FSM - 'FSM' in Figure 5.24 61
5.26 TAP and its associated infrastructure 63

6.1 Setup for performing control actions via the IEEE 1149.1 TAP 65
6.2 Interesting SoC debug points (MNI - Master Network Interface, SNI - Slave Network Interface) 68

7.1 Programming of the Monitor Config TPR 71
7.2 Programming of the NI Shell TPR 72
7.3 Stop Module gate-level waveforms for monitor stop 73
7.4 Stop Module gate-level waveforms for external user stop through TAP 73
7.5 Waveform for debug flow in a MNI 74

7.6 Request Stop in a MNI 75
7.7 Request Stop / Single-step / Continue in a MNI 76
7.8 Response channel stop in a MNI 77
7.9 Response Stop / Single-step / Continue in a MNI 78
7.10 Example SoC used during simulation and synthesis 78

8.1 Example registers that can be polled to decide on NoC quiescent state [32] 80
8.2 Shows the scan-chain concatenation order for a stop-module network 81
8.3 High-level back annotation from statedumps 82

A.1 Timing diagrams showing minimum duration of external stop pulse 87

Acknowledgements

This report concludes my Thesis project as part of my Master's degree education in Computer Engineering at the faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, The Netherlands. The project, titled "Communication-Centric Debugging of Systems on Chip using Networks on Chip: A Debug Infrastructure", was carried out from October 2006 till July 2007 at the SoC Architectures and Infrastructures department of NXP Semiconductors, Eindhoven, The Netherlands.

I would like to thank Egbert Bol and Georgi Gaydadjiev, who were instrumental in helping me with the financial assistance that allowed me to pursue this study and devote all my energies towards it.

I am grateful to my supervisors:

• Kees Goossens (NXP Semiconductors, SoC Architectures and Infrastructure)
• Bart Vermeulen (NXP Semiconductors, SoC Architectures and Infrastructure) and
• Georgi Gaydadjiev (Delft University of Technology, Computer Engineering Laboratory)

for providing me with the opportunity of working on this project. Their continued guidance and support throughout the project duration contributed to its success. The meetings at NXP Semiconductors with Bart and Kees always provided me with ever-broadening horizons in the quest for the problem solution. I would like to especially thank them both, as this valuable experience has been highly rewarding for me personally. Besides, I would also like to thank Andreas Hanson and Martijn Coenen for helping me with their expertise in the Æthereal network-on-chip and its automated design flow. Heartfelt thanks also to all my colleagues at NXP Semiconductors.

I would like to thank all the colleagues and friends that I have made during these two years: Shiva Krishna, Andres Garcia, Benny Fallica, Patrick van Wijnen, Mitas Nikos, Ali Karimi, Catalin Ciobanu, Bogdan Spinean, Arnoud van der Heijden, among others. All those enjoyable / stressful moments we shared, whether studying late at night, drinking out in the bars, the barbecues or playing near the faculty parking, will stay with me throughout my life.

Last but not least, my sincere gratitude to my family, without whose continued support and encouragement this project and my master study would not have existed. I can never thank you enough for this. This work is dedicated to you, my small way of thanking you all.

Siddharth Umrani
Delft, The Netherlands
August 25, 2007

1 Introduction

1.1 Motivation

Rapid technology scaling, i.e. shrinking feature sizes, means that a large number of components can be integrated on a single Integrated Circuit (IC). This increased complexity translates into an increase in design effort and also potentially more design errors. Changes are therefore required in system-on-chip development which reduce both design effort and design errors. To reduce design effort, a modular design methodology is used which promotes reuse of already designed IP cores rather than the design of IP cores themselves. The complexity of such a chip thus resides in the communication between these cores rather than in the computation taking place within them. The shrinking feature size also introduces Deep Sub-Micron (DSM) effects in on-chip interconnect wires. Networks-on-chip have since evolved as a promising new type of interconnect with the potential to alleviate these shortcomings [12, 20, 40]. Effective debug aids the fast and accurate detection of the majority of the errors that may be present in the design, thus reducing the number of iterations in the design cycle (and effectively the time to market). Traditional debug is core-based, where each of the IP cores in a SoC is the locus of debug actions. Communication-centric debug [17, 37] has been proposed as a complementary debug solution that uses the interconnect to debug the chip. Combining these debug strategies can help speed up accurate error localization during debug, with significant possible gains in reducing time to market.

1.2 Goals

The principal objective of this project is to implement a debug infrastructure that facilitates Communication-centric debug. Philips' network-on-chip solution Æthereal was chosen as the interconnect on which the debug infrastructure is based. The goals of the project are:

• Define how Communication-centric debug is performed.
• Implement a debug infrastructure in order to achieve it.
• Integrate this infrastructure with the Æthereal design flow.
• Demonstrate the results, i.e. the implementation of the infrastructure, by simulations.

1.3 Previous Work

Monitoring services for networks-on-chip have already been proposed in [8, 9, 10]. A good overview of SoC debug can be found in [22]. With regard to debug, a scan-based approach

is used in [21], as is also done in our strategy. Present solutions for system-on-chip debug are core-based, e.g. ARM's CoreSight [27], DAFCA's Flexible Silicon Debug Infrastructure [11] and Philips' Core-based Scan Architecture for Silicon Debug [29]. Ours, in contrast, is a communication-centric debug approach.

1.4 Organization of Report

This thesis report is organized as follows. In Chapter 2 we define the terminology used to describe communication taking place over the interconnect. We then describe the Network-on-Chip (NoC) and some important components of the Æthereal NoC. Chapter 3 motivates the need for debug in the SoC design flow and discusses the debug flows that are currently used. The concept of Communication-centric debug is detailed in Chapter 4. We present a debug strategy for the SoC along with a proposal of how the various debug actions are performed in communication-centric debug. Our implemented debug infrastructure is explained in Chapters 5 (hardware) and 6 (software). Experimental results are given in Chapter 7. Finally, the report ends with conclusions and directions for future work (Chapter 8).

2 Network-on-chip (NoC)

2.1 Introduction

The shrinking feature size means that a larger number of components can be integrated onto a single chip. This translates into the integration of a greater number of IP cores on a single chip. The present-day design methodology for the increasingly complex System-on-Chip (SoC) is a modular one which promotes reuse of already designed IP cores rather than the design of IP cores themselves. Thus the complexity of such a chip resides in the communication between these cores rather than in the computation taking place within them. The shrinking feature size also allows on-chip interconnect wires to be routed ever closer to each other. But this causes two parallel routed wires to form a capacitive element, introducing crosstalk, interference, etc., otherwise known as DSM effects. Networks on chip have since evolved as a promising new type of interconnect with the potential to alleviate these shortcomings [12, 20, 40].

From a functional point of view, traditional interconnects have been serial arbitration-based [34], but with the evolution of SoCs with multiple IP cores and the ever-increasing demand for more on-chip communication bandwidth, parallel arbitration-based interconnects [2, 30] were developed. These interconnects, however, did not scale well enough to keep up with the exponential rise in the demand for on-chip communication bandwidth. Further research and development therefore led to the design of concurrent interconnects like the multi-layer bus [3] and the Network-on-Chip [19, 12]. These interconnects allow concurrent communication between the various IP cores in the SoC, yet are scalable. They represent the most complex of interconnects both in terms of control (as there is no single point of control for the communication over the interconnect) and complexity (since the number of elements involved in the interconnect itself is quite large). NXP Semiconductors has developed its own Network-on-Chip solution, Æthereal [16]. In Section 2.4 we detail some of the important architectural components of the Æthereal NoC and their functionality, but first we define the terminology that is used for describing interaction over an interconnect.

2.2 Interconnect Terminology

In this section we define certain terms which are key to understanding the communication over an interconnect.

IP and its ports

Among two communicating IP blocks, the IP initiating the communication is known as the Master IP, while the other, responding IP is the Slave IP. As shown in Figure 2.1,

every IP involved in communication does so via its port, known as the IP port. For ease of illustration, we do not explicitly show the ports in further diagrams, but they are assumed to be present.

Figure 2.1: IP and its port

Transaction

A Master IP core communicates with other IP cores in a SoC by way of read and write operations. We define that an external read or write operation executed in an IP processor core takes place as a transaction over the interconnect. As shown in Figure 2.2, a write transaction is composed of a write request followed by the write data (and an optional write acknowledgement). A read transaction consists of the read request that is sent to the Slave IP and the read data sent in response by the Slave IP core back to the Master IP.

Figure 2.2: Transactions (Read and Write)

Message

We define a message as a uni-directional communication in a transaction. From Figure 2.3 we can see that every transaction consists of one or more messages. In case of a write transaction, the request and data together form the request message; if a write acknowledgement is required, this acknowledgement is the second (response) message. A read transaction is composed of two messages: the request (from the Master IP to the Slave IP) and the response, i.e. the read data (from the Slave IP to the Master IP).

Figure 2.3: Messages and Elements

Protocol, Signal groups and Signals

Direct communication between two IPs takes place in a language which is understood by both of them. This is known as the communication protocol. In the majority of on-chip communication protocols, interactions are initiated and responded to via handshaking between the communicating IPs. Figure 2.4 shows such a handshake between two IPs. The IP which wants to initiate communication signals this by asserting the valid signal. The receiving IP acknowledges its acceptance by asserting the accept signal. This means that the target is also ready for communication. Only when both the valid and the accept signals are high are the two IPs considered to be communicating with each other.

Figure 2.4: Signal Handshake (t1: the Initiator IP asserts valid, meaning it wants to start communication; t2: the Target IP sees valid asserted; t3: the Target IP asserts accept, meaning the data is transferred; t4: the Initiator IP sees accept and may deassert the value on data from t4+1 onwards)
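Because an element transfer is defined entirely by this handshake, its semantics can be captured in executable form. The following is a minimal cycle-based sketch with an illustrative trace (the signal values are patterned on the t1-t4 sequence of Figure 2.4, not taken from any RTL in this thesis):

```python
# Minimal cycle-based model of a valid-accept handshake. Illustrative only:
# the traces below are hypothetical, patterned on Figure 2.4.

def handshake_completions(valid, accept):
    """Yield the cycles in which an element transfer completes, i.e. the
    cycles where both valid and accept are sampled high."""
    for cycle, (v, a) in enumerate(zip(valid, accept)):
        if v and a:
            yield cycle

# Initiator asserts valid at t=1 and holds it; target accepts at t=3.
valid  = [0, 1, 1, 1, 0]
accept = [0, 0, 0, 1, 0]

print(list(handshake_completions(valid, accept)))  # -> [3]: one element at t=3
```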

The valid and accept signals, together with certain other signals, together perform a specific function. They are known as a signal group. For example, the command signal group is used by the master IP to initiate a new transaction with a slave IP in the chip. It consists of valid, accept and data signals along with some other protocol-specific ones. All the signals in the command signal group together signal the transaction initiation. Only when both the valid and accept signals are asserted is the value on the data lines considered valid and taken as the slave's address. Subsequently a different signal group is used for the actual transfer of data. In case of a write transaction, the data is sent from the initiator (master IP) of a transaction to the target (slave IP) over a signal group known as write. For a read transaction, the data is sent as a response from the target of a transaction (slave IP) to the initiator (master IP) over a signal group known as read. A signal group used for data transfer mainly consists of valid, accept and data signals, among others. Figure 2.5 shows a few signal groups and the signals they contain. The exact names of the signals and the signal groups may change for each protocol and are specific to it. So also are the number of signal groups and the exact signals which form each group.

Figure 2.5: Signal Groups and Signals

Element

Further to the previous observable granularities, there is one other granularity which is observable independent of the underlying interconnect or the communication protocol used. This is the element. An element is a single valid-accept handshake. In the write transaction shown in Figure 2.3, a message from the Master to the Slave IP consists of multiple elements (viz. a command element and multiple data elements). Like a message, every element transfer is also a uni-directional communication.

Connection and Channels

In an interconnect, the interactions for a read / write transaction take place either via a connection (a connection-oriented interconnect) that is set up between the Master IP core and the Slave IP core, as shown in Figure 2.6, or without one (a connection-less interconnect). In a connection-oriented interconnect (e.g. NoCs like Mango [6], Nostrum [26], Æthereal [16], FAUST [4]), the ordering of all communication entering the interconnect is preserved when it leaves the interconnect, whereas for a connection-less interconnect (e.g. NoCs like [18, 5, 7]) this may not be the case. In short, providing QoS guarantees and ordering is easier in connection-oriented interconnects than in connection-less ones. As shown in Figure 2.6, a simple connection consists of two channels, viz. Request and Response, and every channel is uni-directional.

Figure 2.6: Connection and Channels

SoCs today have multiple IP cores, and each IP core may be required to communicate with multiple other cores. As depicted in Figure 2.7(a), a single Master IP may communicate with multiple Slave IPs. In a connection-oriented interconnect this is done by setting up a pair of channels for each master-slave pair and is known as a narrowcast connection [32]. Conversely, multiple Master IP cores may communicate with a single Slave IP core. This involves multiple (simple) connections being set up by multiple masters to the same slave (Figure 2.7(b)).

Terminology Hierarchies

Figure 2.8 shows the compositional hierarchy of the various terms defined in this section. An IP communicates via its port and can have multiple ports. A port of an IP can have one or more connections established through it. For every connection, the master IP can initiate multiple transactions. In case of a simple connection all transactions are with the same slave IP, whereas for a narrowcast connection they may be with different IPs.

Figure 2.7: Communication - (a) Narrowcast (b) Multi-initiator

Figure 2.8: Hierarchies

Every transaction is composed of one or more messages, and a message in turn consists of one or more elements. A simple connection is made up of two channels, viz. request and response, while a narrowcast connection has 2 channels (1 request and 1 response) per master-slave pair (2N channels in total, where N is the number of master-slave pairs). An IP port has a protocol associated with it, using which it can communicate with other IPs that understand the same protocol. Protocols are composed of signal groups. These signal groups implement a handshake using valid / accept signals. A single valid / accept handshake corresponds to an element transfer. Hence an element can be a command which is sent to initiate a transfer, or a data value.

2.3 Timeline of Interactions

Figure 2.9: Master and Slave IPs communicating with NoC as interconnect.

Figure 2.9 shows a master and a slave IP communicating, with the NoC as interconnect. The master IP communicates with the master network interface (MNI). The network then routes the data to the slave network interface (SNI), which in turn communicates with the slave IP. Figure 2.10 shows the timeline of a write transaction (and its messages / elements) and how the valid and accept signals accomplish the completion of the transaction for the topology of Figure 2.9. The first two sets of traces are the handshakes that take place over the request channel (REQ (1)) from master IP to MNI, and the remaining two correspond to handshakes over the request channel (REQ (2)) from SNI to slave IP. Every element transfer is essentially a valid-accept handshake. Only when both are asserted is an element transfer said to be complete. A message transfer on an interface is complete when all elements constituting that message have been transferred. Hence elements and messages are defined on each of the four interfaces (1-4) of Figure 2.9. A transaction, on the other hand, is defined end-to-end.

Figure 2.10: Timeline of Interactions (MNI - Master Network Interface, SNI - Slave Network Interface)

Shown in Figure 2.10 is a write transaction and how the command and write data are transferred from the master to the slave IP. At 1, the master IP signals to the MNI that it wants to initiate a transaction by asserting the cmd valid signal and puts the address of the target on its cmd data lines. This is the start of the transaction, the request message and the command element transfer. The MNI sees this and, when ready, signals its acceptance of the command by asserting the cmd accept signal (at 2). So when both master IP and MNI see the cmd valid and cmd accept signals as being asserted, the command message transfer is said to be complete (at 3 in the timeline). This is one signal group (command) of the communication protocol. The data signal group is used for the transfer of data elements. In our example, the master IP has asserted the wr valid signal of the write signal group even before the command message transfer is complete. But the data transfer starts only after the MNI asserts the wr accept signal. This takes place as follows. At 4 the MNI signals the acceptance of data by asserting the wr accept signal of the write signal group. Only then is the transfer of the first data element complete. The master IP then puts the second data element (if any) on the wr data lines at 5. The data elements are transferred between 4 and 7. The first data element transfer starts at 1 and completes at 5, whereas the second data element transfer starts at 5 and completes at 6. Each of these intervals is the life-cycle of the respective element. The point 7 signals the end of the transfer of all data elements and of the request message over the master IP - MNI interface. Then at 8 we see that the slave network interface (SNI) starts the transfer of the command message to the slave IP (cmd valid goes high). The slave IP, when ready, accepts this command by asserting the cmd accept signal (point 9). Between 11 and 14 the transfer of the write data takes place between the SNI and the slave IP. Hence on this SNI-slave IP interface, 8-10 is when the command element transfer takes place and the request message is transferred between 8 and 14. The point 14 also signals the completion of the write transaction between the master and the slave IP.
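The element and message bookkeeping in this walkthrough condenses to a simple rule: a message is complete exactly when the last of its elements has completed a handshake. The sketch below is illustrative; the element names and completion cycles are read off the Figure 2.10 walkthrough above.

```python
# Message completion on one interface: a message is done when all of its
# elements (command + data) have completed their valid-accept handshakes.
# Names and cycle numbers follow the Figure 2.10 walkthrough.

def message_completion_cycle(element_handshakes):
    """element_handshakes maps element name -> cycle of its handshake."""
    return max(element_handshakes.values())

# Request message on the master IP - MNI interface (REQ (1)):
request_message = {
    "command": 3,  # command element handshake completes at 3
    "wr1": 5,      # first write-data element completes at 5
    "wr2": 6,      # second write-data element completes at 6
    "wr3": 7,      # third write-data element completes at 7
}

print(message_completion_cycle(request_message))  # -> 7: request message done
```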

2.4 Æthereal NoC

Æthereal [33, 16, 14] is a connection-oriented NoC wherein the connections can be classified depending on the services they provide. Resources are reserved for Guaranteed Services (GSs), which include real-time and streaming traffic. Thus Æthereal can provide guarantees on throughput, latency and jitter for such Guaranteed Throughput (GT) traffic [15, 14]. To prevent resource under-utilisation, Best Effort Services (BESs) are also provided by means of Best-Effort connections. On these connections, data is sent whenever free resources are available, since slots are not reserved [15].

The basic infrastructure of the Æthereal NoC (Figure 2.11) consists of Routers and Network Interfaces. The network interface is the component of the network which communicates directly with the IP cores. Different IPs may use different communication protocols. Internally, on the other hand, the NoC routes the data in the form of flits (for the format, refer to [14]). Hence the network interface is where the conversion between these two different protocols takes place. The routers only perform the function of forwarding the data through the network from the source to the destination.

A connection (Figure 2.12) as defined for Æthereal is set up between ports of two or more Network Interfaces (NIs). The communication is initiated by the Master Network Interface Port (MNIP) and the receiving end is called the Slave Network Interface Port (SNIP). Further, each connection consists of two channels, viz. the request and response channels. The communication from the MNIP to the SNIP takes place over the request channel, and that back from the SNIP over the response channel.

Figure 2.11: Æthereal NoC.

Hence a transaction is on a connection, whereas a message is sent over a channel. All connections and channels are virtual and are configured over physical links connecting the various internal components (routers, NIs) of the NoC. Multiple connections can be set up between a master-slave IP pair with a single port at each end. These connections could, e.g., provide different types of service (GT or BE).

2.5 Network Interface

Figure 2.13 shows a Network Interface (NI) of the NoC. On the one hand the network interface communicates with the IP Core, and on the other with the Router. The communication with the IP Core takes place between the IP Port and the Network Interface Port (NI Port) in the IP protocol format. The network interface is composed of two major modules, viz. the Network interface Shell (NiS) and the Network interface Kernel (NiK). A network interface has one network interface shell (NiS) per network interface port (NI Port) and only one network interface kernel (NiK). The communication between the NiS and the NiK takes place by way of messages.

Figure 2.12: Æthereal connection. (MNIP - Master Network Interface Port; SNIP - Slave Network Interface Port; NI - Network Interface)

In the NiS it is the protocol adapters that perform the conversion between the IP protocol signals and this message format. The NiK then does the conversion between these messages and the Æthereal packet format. The Æthereal packets are then sent to the connected router. Every NiK has one port (Router port) over which it sends the Æthereal packets to the router, and one or more NI Kernel ports which are used for communication with the network interface shells.

As previously explained in Section 2.4, a simple Æthereal connection is set up between two network interface ports. In case of a narrowcast connection, a narrowcast adapter is used. Consider Figure 2.14: IP Core 1 communicates with IP Cores 3 and 4. This results in a narrowcast connection being set up in the NoC as shown (Connection 1). A narrowcast adapter is used in NiS 1 for this narrowcast connection. The narrowcast adapter converts the IP protocol signals into messages which are then routed to the correct destination depending on the target address. On the other hand, both IP Core 1 and 2 communicate with IP Core 4. Thus two connections (Connection 1 and Connection 2) are set up in the NoC, one corresponding to each master-slave IP pair. This necessitates the use of a multi-initiator adapter in the network interface shell connected to IP Core 4. The messages over the two connections are converted into the IP protocol format and sent to IP Core 4. The multi-initiator adapter serializes the transactions sent to IP Core 4. In our example, every NI port has only one connection set up from / to it, but in general multiple connections can be set up.

In the NiK there is the notion of channels. For every master-slave IP pair whose communication takes place through a particular kernel, there are two channels (request and response). In Figure 2.14, the NiK in network interface 1 has 6 channels: 4 for the narrowcast connection (Connection 1) and 2 for the simple connection (Connection 2). Similarly, the NiK in network interface 2 also has 6 channels. Finally, in Figure 2.15 we show which granularities of an interaction are visible at various components / interfaces. This is vital to understanding our debug infrastructure and how the various debug actions are performed.
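This channel bookkeeping can be double-checked with a small calculation: every master-slave pair whose traffic passes through a kernel contributes one request and one response channel. The sketch below reproduces the six channels of the kernel in network interface 1; the data structure itself is illustrative, mirroring the layout of Figure 2.14.

```python
# Channels in a NI kernel: two (request + response) per master-slave pair.
# The connection layout mirrors Figure 2.14; the structure is illustrative.

connections = {
    "Connection 1": {"master": "IP Core 1",
                     "slaves": ["IP Core 3", "IP Core 4"]},  # narrowcast
    "Connection 2": {"master": "IP Core 2",
                     "slaves": ["IP Core 4"]},               # simple
}

def kernel_channel_count(connections):
    # Each (master, slave) pair needs 1 request + 1 response channel.
    pairs = sum(len(c["slaves"]) for c in connections.values())
    return 2 * pairs

print(kernel_channel_count(connections))  # -> 6: 4 (narrowcast) + 2 (simple)
```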

Figure 2.13: Æthereal Network Interface.

Figure 2.14: Connections / channels in Æthereal.

Figure 2.15: Visible granularities of Interactions

3 Debug

3.1 Introduction

With the increasing complexity of present-day Integrated Circuits (ICs), errors in the design stages are unavoidable. Building an error-free design may thus require multiple iterations. This adversely affects the total design time for an IC. Also, decreasing product life-cycles make it imperative to minimize time-to-market. Figure 3.1 shows the possible errors (left column) in the different design phases (middle column) and which verification techniques (right column) are used to locate them. Despite all the verification at the design and manufacturing stages, some errors remain undetected. Debug is then used for the localization of these errors. Effective debug can thus enable fast and accurate detection of the majority of the errors that may be present.

Figure 3.1: Digital design flow (Source: [29]).

Despite all the existing pre-silicon verification and test methods, more than 60% of the designs contain errors in their first-silicon prototype [29]. This high percentage highlights the fact that existing methodologies are not efficient enough to locate design

and manufacturing errors in the prototype. The following reasons are cited in [29]:

• The pre-silicon verification methods are applied to a model of the IC. This model may not completely or accurately represent its actual physical behavior.

• If an accurate model is indeed made, then the computational costs involved hinder the exhaustive verification using the available methods.

Hence, in order to minimize the time-to-market, locating these undetected design and manufacturing errors in first silicon becomes important. Design-for-Debug (DfD) has been proposed as an effective means to achieve this [13, 36]. The debugging of a chip can be compared to manufacturing tests, but there are some major differences, outlined below, which further emphasize the importance of debug. In the testing environment, the test engineer applies pre-defined test patterns through an Automated Test Equipment (ATE). The advantage of this methodology is that it is a lot easier to create deterministic behavior than on an application board. Simulating the chip behavior and recording the responses is easier, but the responses are obtained when the chip is in test mode and not in functional mode. This is not the best scenario for finding / reproducing errors, because some functional errors may not be visible in test mode. In contrast, debugging of the chip is done in functional mode when it is part of the application board, in its operating environment, where the probability of occurrence of errors is highest. Three IC requirements are listed in [36] for an efficient, structured debug methodology, viz.

1. Access to the functional pins of the chips.

2. Access to the internal signals and memories of the chip.

3. Controlled execution of the chip.

For effective debugging, controllability and internal observability are vital. DfD modules as part of the structured debug methodology [36] provide the on-chip debug infrastructure for these. The observability can be real-time (by way of on-chip pins) or scan-based (the state of the internal registers, flip-flops, etc. is scanned out).

• In real-time observability (Figure 3.2), internal signals are captured through external pins or in an on-chip trace memory. Examples are Philips' SPY method [38] and DAFCA's Logic Debug Module [11]. Although this methodology gives the most accurate and up-to-date view of the chip state, it scales poorly. Since we propose to observe the network behavior, which may involve complex interactions, the number of observable signals may be quite large. This requires either a large number of chip pins, which is costly in terms of silicon area (multiplexers and trace memories), or a significant effort in selecting the appropriate signals which best represent the internal state of the chip.

• On the other hand, a scan-based approach (Figure 3.3) provides more internal observability and also allows the debug engineer to control the functional behavior

of the chip. This gives him greater flexibility, which can speed up error localization. The downside is that each time the state is scanned out, only a snapshot of the state is obtained. Hence multiple snapshots are required in order to understand the functional behavior of the chip; this can be time-consuming, and it may be difficult to recreate / read out the state at the exact moment of sampling.

Figure 3.2: Real-time debug approach. In this scenario, internal signals are observed in real-time via external on-chip pins.

Considering the pros and cons of both real-time and scan-based debug, the greater scalability and control, coupled with the re-use factor (the manufacturing test scan-chains can be re-used for the debug scan of internal state), make scan-based debug more attractive. Hence we have chosen to follow a scan-based debug strategy in our proposed debug infrastructure. This satisfies the second IC requirement for an efficient and structured debug methodology. The third IC requirement is the functionality for the debug engineer to control the execution of the chip. In debug this is done by way of debug control actions like stop, single-step and continue. Traditional core-based debug does this for instructions being executed on the IP cores. For our Communication-centric debug strategy, we implement these debug control actions for the communication taking place over the interconnect (explained later in Section 4.6).

3.2 Debug Flow

A scan-based debug flow is shown in Figure 3.4. Note that the resetting of the chip is only a functional reset.

Figure 3.3: Scan-based debug approach. In this scenario, every time the chip reaches a quiescent state, the functional clocks can be stopped and the internal state read out.

The breakpoint is programmed and then the user waits until a breakpoint hit takes place. He can then read out the internal state, such as flip-flop values and memory content.

We now present our proposed debug flow, shown in Figure 3.5, which is a modified version of the scan-based debug flow. Instead of only programming the breakpoint (as in the scan-based debug flow), in our proposed debug flow the user also programs the debug control actions. Through these debug actions a controlled execution of the chip is possible. Before reading out any of the internal component values, it has to be made sure that the chip is in a quiescent state (i.e. there are no more ongoing interactions in the chip). Only then can the functional clocks be safely stopped (without affecting / altering any functional behavior) and the debug clocks switched on in order to scan out the internal state. The internal state, such as flip-flop values and memory content, is read out. The user can then program more debug actions (if he wants to debug further) and repeat the cycle; otherwise debugging is complete.
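In essence, the proposed flow is the loop sketched below. Every function is a placeholder stub standing in for the TAP-level operations detailed in Chapters 5 and 6; none of the names is an actual API of the implemented infrastructure.

```python
# High-level sketch of the proposed scan-based debug flow (Figure 3.5).
# All functions are illustrative stubs for TAP-level operations.

def functional_reset():
    print("functional reset")

def program_breakpoints_and_debug_actions():
    print("program breakpoint + debug control actions (via TPRs)")

def wait_until_breakpoint_hit():
    print("breakpoint hit")

def wait_until_quiescent():
    print("NoC quiescent: no ongoing interactions left")

def switch_clock(to):
    print("switch to " + to + " clock")

def scan_out_internal_state():
    print("scan out flip-flop values and memory content")
    return {}

def diagnosis_complete(state):
    return True  # stub: a real session iterates until the error is localized

functional_reset()
while True:
    program_breakpoints_and_debug_actions()
    wait_until_breakpoint_hit()
    wait_until_quiescent()       # only now can the functional clocks stop safely
    switch_clock(to="debug")
    state = scan_out_internal_state()
    switch_clock(to="functional")
    if diagnosis_complete(state):
        break                    # done; otherwise program more actions and repeat
```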

3.3 Debug Granularity

In this section we explore the various visible granularity levels of communication and their usefulness in the bigger picture of SoC debug. For communication-centric debug, the debugging of the SoC is done by controlling the interactions between the various IP cores.


Figure 3.4: Traditional scan-based debug flow (Source: [29]).


Figure 3.5: Proposed scan-based debug flow

Interactions are visible at different granularities at various interfaces; [17] gives a detailed description of these. At the interface between the NoC and the IP cores, the interaction can be viewed at the following granularities: cycle, instruction, element, message, transaction. From the viewpoint of the IP core, a cycle or instruction-level granularity is most relevant for useful debug. At the network side of the interface, on the other hand, interactions at clock, element, message, transaction and other levels can be observed. Figure 3.6(b) shows these various granularities of control between the IP and the NoC. Within the network itself, the interactions can be observed at the various granularities shown in Figure 3.6(a), which are visible at different components of the network.

Figure 3.6: (a) Granularity of internal NoC control. (b) Granularity of control between IP and NoC.

Further, in Section 4.6 we explain which of these granularities are useful with respect to the locus of our debug control.

4 Communication Centric Debug

4.1 Introduction

With the increasing complexity of present-day Systems-on-Chip and the drive to integrate ever more components to keep up with Moore's law, building first-time error-free designs is difficult. As already explained in Section 3.1, debug of silicon is necessary in order to accurately identify the undetected design and manufacturing errors. Furthermore, the early detection of design errors reduces the number of re-spins required and hence reduces the time-to-market. The increased number of IP cores on a chip means that the complexity of such a chip resides in the communication between these cores rather than in the computation taking place within them. Thus, in order to debug such systems more effectively and help quick localization of errors, a Communication-centric debug strategy has been proposed [17, 37] which complements the traditional core-based (Computation-centric) strategy [22, 23, 25] that monitors and debugs the individual IPs.

Figure 4.1(a) shows the Computation-centric debug strategy. The monitors are attached to the IP cores. Breakpoints are programmed in these monitors, which generate an event on a breakpoint condition hit. The debug control then exercises control over the functional execution of the IP blocks. The IP blocks are stopped, their internal state is inspected, and then the execution is continued. This process is repeated until the errors have been located. Figure 4.1(b) illustrates the complementary Communication-centric debug strategy. Here the interconnect is the debug focus. The monitors trigger and generate events on a breakpoint condition hit. The debug control then controls the functional behaviour of the interconnect. The interconnect can be stopped, its state observed, and then the interactions continued. Thus, instead of observing the independent behaviour of the various IP blocks in the SoC, communication-centric debug allows the user to observe the different states of the IPs together in one place by way of the interactions between the IP blocks. The interconnect is the locus of these interactions and this is where we enforce debug control. Furthermore, as shown in Figure 4.1(b), the IP cores can still be monitored, and debug control enforced, if required.

4.2 Design choices

In this section we elaborate on why certain design choices with respect to the debug technique and the interconnect were made. In the proposed Communication-centric debug strategy, the interconnect is at the heart of the debug actions performed. We choose to use a Network-on-chip as our interconnect for the following reasons:

• NoCs are commonly considered to be a promising new type of interconnect. They are a scalable solution both for issues related to SoC interconnect in deep sub-micron technologies and for concurrency in interactions between SoC IP blocks.


Figure 4.1: (a) Computation-centric debug (b) Communication-centric debug (Source: [37]).


• A NoC-based solution for effective and efficient debug communication control can be more readily ported to a single or multi-layered bus system than the other way around.

• A NoC poses the maximum complexity with respect to parallelism, latency and scheduling. Hence choosing a NoC helps magnify any problems related to the actual debug control mechanisms for the interconnect.

NXP Semiconductors has developed a Network-on-Chip called Æthereal, which was taken as the interconnect on which we developed the debug infrastructure. A NoC is a complex interconnect with many internal registers, so the amount of internal state data to observe is quite large. In real-time debug, the number of observable signals would be too large, and hence this approach does not scale. Scan-based debug, on the other hand, offers scalability and re-use of the manufacturing test scan-chains. The only concern is the time taken to scan out the internal state using the IEEE 1149.1 TAP. The test clock runs at 10 MHz; hence scanning out around 38000 registers in an example NoC takes approximately 4 milliseconds. This speed is acceptable, considering that it is the user who observes this state and then takes an action depending on his diagnosis, which comparatively takes much longer. Considering the above arguments, scan-based debug was chosen.
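The 4 ms figure follows from shifting out one bit per test clock cycle, assuming (for this back-of-the-envelope estimate) one scan flip-flop per register bit and ignoring TAP protocol overhead:

```latex
t_{\text{scan}} \approx \frac{38000\ \text{bits}}{10\ \text{MHz}} = 3.8\ \text{ms} \approx 4\ \text{ms}
```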

4.3 Debug Strategy for SoCs

In this section we present a debug strategy for the localization of errors using Communication-centric debug. Figure 4.2 shows a flow diagram of this. The user starts by observing the interactions over the interconnect (in our case the NoC). With the debug architecture we have developed and the various granularities of observability (Section 2.2), he can localize the cause of the error as either the NoC itself or one or more

IPs. From there on, he can debug the NoC itself or the IP using their built-in debug infrastructure (traditional core-based debug). The big advantage of using the NoC as the starting point for debug is that the level of examination can be raised from bits and cycles to elements / messages / transactions. This makes it easier for the user to interpret what is going on inside the IC and to correlate the states of the various IPs through their interactions. The higher abstraction levels also allow for comparison of the simulation results at a level (the transaction) that is consistent for both hardware and software.

Figure 4.2: Debug flow using Communication-centric debug

4.4 Locus of communication-centric debug control

The interconnect is at the heart of the communication-centric strategy's debug control actions. The errors are located by observing and controlling the interactions between the various IPs over the interconnect. As shown in Figure 4.3, the communication between the IPs and the interconnect occurs between the IP port and the network interface port. The network interface port is connected to the Network Interface Shell. Hence it is in the network interface shell that we implement our debug control intelligence. Furthermore, the communication between the IPs and the network interface shell takes place in the IP protocol suite. Our debug control is implemented by gating the appropriate control signals of these protocol suites, which as a result enforces the required debug control over the interactions. IPs use various protocol suites like DTL [30], AXI [24] and OCP [28]. For our simulations and results we have used IPs that communicate using the DTL protocol. In the following section we give a brief introduction to the protocol communication and the various signal groups.

Figure 4.3: Locus of communication-centric debug control. (NIS - Network Interface Shell; this is where the debug control intelligence is implemented. The marked IP-NoC interfaces are those over which we enforce debug control actions; the communication over these interfaces takes place in the IP protocol suite, the control signals of which are gated while enforcing debug control.)

Since we implement the debug control functionality at the network boundaries (in the network interface shells), we implement this control at the granularities visible in the network interface shell. The network interface shell is the network's window to its connected IP cores and vice-versa. A globally consistent view of the SoC is obtained at transaction level [17]; hence control at transaction level between a master-slave IP pair follows naturally. For a transaction, there are four interfaces over which debug control actions can be performed using the interconnect. These are shown in Figure 4.4 by numbers 1 - 4. At each of these interfaces, control at message or element level is possible. In Section 5.4 we interpret these granularities in terms of the programming of the proposed debug infrastructure, and we show the implementation results in Chapter 7.

Figure 4.4: Debug control action interfaces (MNI-Master Network Interface, SNI-Slave Network Interface).

4.5 DTL Protocol

DTL is an on-chip communication protocol developed by Philips. In our setup the IP cores communicate with the NoC using the DTL protocol, and hence DTL Protocol Adapters are used in the network interface shells. In DTL, communication is always initiated by a DTL Initiator (the master) towards a DTL Target. As shown in Figure 4.5, there are various signal groups. Important among these are the command, write and read groups. The command group is used to initiate a communication, while the write and read groups are used to transfer the write and read data respectively. DTL is a handshake-based protocol, and each of the signal groups has its own independent handshake signals. The initiator, when wanting to initiate a communication, asserts the valid signal of the command signal group, and the target responds with an accept when it is ready. Only then are the values of the various other signals assumed to be valid. In the setup of Figure 4.4, the Master IP and the network interface communicating with the Slave IP (SNI) act as DTL Initiators, whereas the Slave IP and the network interface connected to the Master IP (MNI) are DTL Targets. [30] gives a detailed description of the various data transfer modes and the timing diagrams involving DTL communication.

4.6 Debug Control Actions

With reference to the proposed novel Communication-centric debug strategy [17, 37], we provide the following debug control actions to the user:

• Stop

• Continue

• Single-step

• Scan in/out internal data

Figure 4.5: DTL Signals (Source: [30]).

The NoC is the interconnect in our chip, and during communication-centric SoC debug the debug control actions are performed on the communication that takes place between the NoC and the IP cores. In Section 6.2 we show how the various debug actions can be programmed in our debug infrastructure with the available options, and that they suffice to enable effective SoC debug (Section 7).

Stop
For a stop on an IP-NoC interface, contrary to stopping as implemented in traditional IP core debug where the functional clocks are gated, we gate the valid-accept handshake of the protocol. Figure 4.6 shows how a stop on the REQ (2) interface is accomplished by gating a valid-accept handshake, for the topology of Figure 4.4. On the first interface, between the master and the MNI (REQ (1)), there is no stop, and as shown in Figure 4.6 both command and write data elements are transferred. But on the interface between the SNI and the slave IP (REQ (2)) a stop is obtained. This is done by gating the cmd valid signal from the SNI to the slave IP, as shown in Figure 4.6. In this case the slave IP is ready to accept (cmd accept is asserted), but since there is no cmd valid from the SNI, no command transfer takes place. Since we want to stop the interaction on interface REQ (2), the wr valid signal is also gated. This ensures that no elements are sent on the REQ (2) interface and a message-level stop is achieved. On every IP-NoC interface such valid-accept gating can be done at the message or element level. Also, a stop on only interface REQ (1) in Figure 4.4 imposes transaction-level stopping. To sum up, stop can be achieved on each of the four interfaces (1 – 4) in Figure 4.4, allowing for a transaction / message / element-level stop.

Continue
Another important debug functionality is the ability to continue a stopped SoC. This is complementary to the stop functionality, and together they give the debugger control over the functional execution of the chip. The chip is functionally continued by undoing the gating of the control signals (which was done to stop the chip), and this is achievable at each IP-NoC interface. In Figure 4.7 we show the timeline for a continue after a stop has been achieved. When the SNI asserts the cmd valid signal to the slave IP on the REQ (2) interface in Figure 4.4, a continue is achieved. The write data is also transferred (wr valid is asserted). In this way the entire message is transferred from the SNI to the slave IP, and at this point we can say that the write transaction that was started by the master IP is complete.

Single-Step
Single-stepping can be viewed as the combination of the above two functionalities. A single-step operation is equivalent to issuing a continue with an implicit stop. Traditional single-stepping is at clock-cycle granularity. Our debug infrastructure allows single-stepping at message or element-level granularity for each of the interfaces between the NoC and the IP cores ((1) – (4) in Figure 4.4). A transaction-level single-step can be achieved at interface (1) in Figure 4.4. Single-stepping is achieved by undoing the gating of the valid-accept handshake and then gating it again. A single-step can be achieved independently for each interface between the NoC and the IP cores.
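All three actions can be expressed as operations on the same valid-accept gate. The sketch below is a simplified behavioural model, not the RTL of the infrastructure: DebugGate stands in for the gating logic in the network interface shell, and the stop granularity is reduced to a single handshake.

class DebugGate:
    """Stop / continue / single-step as gating of a valid signal."""
    def __init__(self):
        self.gated = False        # True: a debug stop is enforced
        self.step_pending = False

    def stop(self):
        self.gated = True

    def cont(self):
        self.gated = False

    def single_step(self):
        # A single-step is a continue with an implicit stop: open the
        # gate and close it again after exactly one transfer.
        self.gated = False
        self.step_pending = True

    def forward_valid(self, valid_in, accept):
        valid_out = valid_in and not self.gated
        if valid_out and accept and self.step_pending:
            self.gated = True          # re-arm the stop after one transfer
            self.step_pending = False
        return valid_out

gate = DebugGate()
gate.stop()
assert gate.forward_valid(True, True) is False  # stopped: no transfer
gate.single_step()
assert gate.forward_valid(True, True) is True   # exactly one transfer
assert gate.forward_valid(True, True) is False  # implicitly stopped again
gate.cont()
assert gate.forward_valid(True, True) is True   # running normally again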

Scan out internal data
With any debug methodology, lack of internal observability is a key issue. In the scan-based methodology, the internal state of all the registers, flip-flops and memories is dumped into either an on-chip memory or an external memory, from where it can be accessed by the debug software and the bits reconstructed back to the desired level of abstraction (register level, transaction level, application level, processor instruction level, etc.). In order to scan out the data a dedicated interconnect can be used. Alternatively, the NoC itself can be reused, or the manufacturing-test scan chains which are accessible via the IEEE 1149.1 TAP [35]. Scan chains are already used during testing for manufacturing defects [1]. Hence the IEEE 1149.1 scan-based manufacturing test infrastructure (which consists of the scan chains, the TAP and its controller) is reused during debug for scanning out the internal state of the NoC, instead of reusing the NoC or a dedicated interconnect [21, 31, 39, 37].

Figure 4.6: Timeline for a Stop (MNI - Master Network Interface, SNI - Slave Network Interface). No cmd valid or wr valid signal is sent by the SNI to the slave IP, even though the slave is ready (cmd accept and wr accept are high); thus a stop is achieved.

Figure 4.7: Timeline for a Continue (MNI - Master Network Interface, SNI - Slave Network Interface)


Figure 4.8: Example illustrating the various debug actions over an IP-NoC interface

In Figure 4.8 we show the various debug actions at message-level granularity on an IP-NoC interface. On the occurrence of a stop at the interface, the messages still to be transferred to the network remain in the IP, and those in transition end up in the network. In the example shown, message M6 was in transition between the IP core and the network edge when a stop signal was distributed and received by all edges of the network. Hence the following message, M7, stays in the IP core as illustrated. On a message-level single-step action, the next message, here M7, is sent over from the IP to the network before another stop is enforced on the interface. A later continue action restarts the normal functional behaviour of the chip. In between these actions, when the chip is in a quiescent state, i.e. there are no more interactions taking place over and in the interconnect (NoC), the internal state of the NoC can be scanned out via the TAP port. In this way various debug actions can be performed centred on the communication infrastructure at various granularities. In Section 7 we show how these actions are enforced during debug at some of the useful granularity levels using our debug infrastructure.

4.7 Example

In this section we illustrate what the debug actions translate to in terms of interactions on the interface between the IP and the NoC, and the various granularities of debug control for the SoC. In a NoC, we show the communication between connected master-slave IPs using the following scenarios:

1. Scenario 1: A simple connection (1 Master – 1 Slave)

2. Scenario 2: A narrowcast connection (1 Master – ≥ 1 Slave)

3. Scenario 3: Multi-initiator communication (≥ 1 Master – 1 Slave)

Any other scenario is a combination of the above scenarios. For every connection present in the above communication scenarios, a transaction / message / element-level granularity of debug control can be enforced. In the topology shown in Figure 4.9, two masters communicate with two slaves by setting up connections through the NoC as shown. Connection 1 is a narrowcast connection, while connection 3 is an example of a simple connection. Master IPs 1 and 2 communicate with Slave IP 2 by setting up connections 1 and 2 respectively to network interface 2 (NI 2). This is the scenario of multi-initiator communication.

Scenario 1
For a simple connection (connection 3), the user can exercise control over interfaces 5, 6, 7 and 8. For transaction-level debug, the valid-accept handshakes on the request interface (5) are controlled. Consider a stop: no further transactions from master IP 2 are accepted by the network interface (NI 4) (done by gating the accept signals to the IP), and all unfinished transactions (those already accepted by NI 4) are allowed to complete, i.e. all write transactions are completed and for all read transactions the read data returns to master IP 2. Only then do we say that a stop is complete. On the same interface (5), a message / element-level stop can also be achieved. During such a stop, an ongoing message / element transfer on that interface is completed and the next message / element is stopped from entering the network. This is done by gating the accept signal of the appropriate signal groups.

[Figure 4.9 shows the example SoC: Master IP Core 1 connects to NI 1 over its request (1) and response (4) interfaces, Slave IP Core 2 to NI 2 over request (2) and response (3), Slave IP Core 1 to NI 3 over request (6) and response (7), and Master IP Core 2 to NI 4 over request (5) and response (8). Connection 1 (narrowcast), connection 2 and connection 3 are routed through the network.]

Figure 4.9: Example SoC showing the connections set up through the NoC

Further, on each of the remaining interfaces (6 – 8) too, a message / element-level stop can be achieved. This is done by gating the accept signal of the appropriate signal groups from the NIs on interface 7, and the valid signal of the appropriate signal groups from the NIs on interfaces 6 and 8. From a SoC point of view:

• A transaction-level stop is possible only on one specific interface (the request interface from the Master IP).

• Any combination (as thought useful by the user) of message / element-level stops for each of the four interfaces can be achieved independently.

From the NoC point of view:

• For a simple connection - a transaction-level stop can be obtained.

• For each channel of a simple connection - a message / element-level stop is possible on each end of the channel.

Additionally, continue and single-step debug actions are also possible at the desired granularity.

Scenario 2
This scenario shows a narrowcast connection (connection 1) between a single master (master IP 1) and two slaves (slave IPs 1 and 2). In our debug infrastructure, a transaction-level debug control action is achieved for every master-slave IP pair in a narrowcast connection. A transaction-level stop is done by gating the accept signals from the network interface (NI 1) to master IP 1 (over interface (1) in Figure 4.9). For example, the user can stop transactions only for slave IP 2. This is done by gating the accept signals from the network interface (NI 1) to master IP 1 when a transaction with slave IP 2 as its destination address arrives. But since the transactions over interface (1) are serialized, a stop on the first transaction for slave IP 2 also blocks all transactions after it (even those for slave IP 1). All unfinished transactions (those already accepted by NI 1) for both slave IPs (1 and 2) are allowed to complete. Furthermore, a message / element-level stop is possible on each of the interfaces 1, 2, 3, 4, 6, 7 independently (again by gating the appropriate valid / accept signal). From a SoC point of view:

• A transaction-level stop is possible in a narrowcast connection for every master-slave IP pair. But this stop is not independent, i.e. a stop for one pair implies a stop for all other pairs as well, since all pairs share one common interface (the request interface from the Master IP) on which this stop is achieved.

• Any combination (as thought useful by the user) of message / element-level stop for each of the six interfaces can be achieved independently.

From the NoC point of view:

• For a narrowcast connection - a transaction-level stop is possible for each master-slave IP pair of that connection.

• For each channel of a narrowcast connection - a message / element level stop is possible on each end of the channel.

Additionally, continue and single-step debug actions are also possible at the desired granularity.

Scenario 3
This scenario shows multiple master IPs communicating with a single slave IP. Each master IP sets up a simple connection to the slave (connections 1 and 2). In our debug infrastructure, transaction-level debug control is possible for each of these connections. For a transaction-level stop, no transactions from master IP 1 to the network interface (NI 1) (over interface (1) in Figure 4.9) or from master IP 2 to the network interface (NI 4) (over interface (5) in Figure 4.9) are accepted by the respective network interfaces. This is done by gating the accept signals from the network interfaces. All unfinished transactions (those already accepted by NI 1 and NI 4) are completed. This is like stopping two separate connections at the same time. But the two connections involved in the above stop can also be stopped independently of each other, i.e. a transaction-level stop on only one connection while the other continues normally. A message / element-level stop is possible on each of the interfaces 1, 2, 3, 4, 5, 8 independently (again by gating the appropriate valid / accept signal). Suppose a message / element-level stop is achieved over interface (3) for messages meant for master IP 1. As the messages are serialized, a stop on the first message / element for master IP 1 also blocks all further messages for master IP 2 from being accepted by the network interface (NI 2). From a SoC point of view:

• A transaction-level stop is possible for every master-slave IP pair. This is independent, i.e. a stop for one pair does not imply a stop for all other pairs.

• Any combination (as thought useful by the user) of message / element-level stop for the six interfaces can be achieved independently.

From the NoC point of view:

• For a multi-initiator communication - a transaction-level stop is possible for each master-slave IP pair independently of the others. This is because each pair has a separate connection.

• For each channel in a multi-initiator communication - a message / element-level stop is possible on each end of the channel.

Additionally, continue and single-step debug actions are also achieved at the desired granularity. This scenario is a case of multiple simple connections with the same destination.

5 Debug Hardware Infrastructure

5.1 Overview

In order to implement the ideas of the previous chapters we introduce the following debug infrastructure:

• Monitors with their Breakpoint Generators
• Event Distribution Interconnect (EDI)
• Test Point Registers (TPRs)
• Network Interface Shell (NI Shell)
• Debug Control Interconnect (DCI)
• Debug Data Interconnect (DDI)

Figure 5.1 shows a block-level view of our debug infrastructure with the designed debug components (the dotted modules) and their location and interaction with the other components of the SoC. Since the proposed debug methodology is communication-centric, most of the debug components are naturally located in the communication infrastructure (in our case the NoC). The TPRs can all be programmed via the IEEE 1149.1 TAP. A Monitor Config TPR is instantiated for every Monitor present, which allows the user to program a different breakpoint condition in each of the Monitors. The Breakpoint Generator (BP Gen) is the hardware inside each Monitor that generates a breakpoint hit pulse, which is fed to its attached Stop Module. The Stop Modules are instantiated per router in the network and follow the topology of the routers. The stop modules with their distribution network form the Event Distribution Interconnect (EDI), which distributes a generated breakpoint pulse to all the network components (routers, kernels and shells). It is in the network interface shells (specifically in their finite state machines (FSMs)) that the debug action decision is made. Further, the test infrastructure consisting of the Test Access Port (TAP) and its controller, used to program the TPRs and to give an external stop, is known as the Debug Control Interconnect (DCI), while the Debug Data Interconnect (DDI), which consists of the TAP, its controller and the inserted scan chains, is used to scan out the internal state of the NoC. In the following sections the architecture, functionality and properties of each of these debug components are detailed.

5.2 Monitors

In [10] a method for the automatic insertion of monitors into the Æthereal design flow has been proposed. This gives 100% channel observability and can monitor each of the router links at four levels of abstraction, viz. physical raw, logical connection-based, transaction-based and transaction event-based. In our infrastructure, we use a very simplified version of the monitors in [8], which are automatically generated per router in the Æthereal design flow. They can be attached to any one of the router's links and monitor the raw data over these links. When the breakpoint condition (which is programmed in their associated Monitor Config Test Point Register (TPR)) is met, the monitor generates an active-high pulse which stays high as long as the breakpoint condition remains true.



Figure 5.1: The Debug Infrastructure

Figure 5.2 shows the interface of the monitor used in our debug infrastructure. The monitor has clk and rst n as standard inputs. The link data input is connected to the router link that is to be monitored. This, together with the data on the monitor config input (which is connected to the Monitor Config TPR), is used to determine a breakpoint hit and produce a pulse on the output pin monitor stop, which is connected to the EDI that distributes the event. The internal structure of the monitor can be visualized as shown in Figure 5.3. A comparator compares the monitor config and link data values, and outputs a '1' on monitor stop as long as the condition remains true.


Figure 5.2: Monitor interface, where monitor stop is connected to the EDI, link data to the router link which is to be monitored, and monitor config to the Monitor Config TPR which specifies the breakpoint condition.


Figure 5.3: Breakpoint Generation logic inside a Monitor

Figure 5.4 shows gate-level traces for a monitor. Here, on a breakpoint condition (when the link data is equal to the specified monitor config value), a pulse is generated on monitor stop, which in this case is one clock cycle wide. The pulse is generated one clock cycle after the specified data is seen on the router link, since the output signal is generated from the internally registered value. In our example, a value of 'h10000015B has been programmed as the breakpoint condition (the value on monitor config). When the link data value matches the programmed one, a breakpoint hit pulse is generated on monitor stop.
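A behavioural sketch of this breakpoint generator follows, assuming (as the traces suggest) that the comparison is performed on an internally registered copy of the link data, so the pulse appears one cycle after the matching value is seen on the link.

class Monitor:
    """Registered compare of link_data against monitor_config."""
    def __init__(self, monitor_config):
        self.monitor_config = monitor_config
        self.link_data_r = None  # internally registered link data

    def clock_edge(self, link_data):
        # monitor_stop is computed on the registered value, hence the
        # one-cycle delay between the match and the breakpoint hit pulse.
        monitor_stop = (self.link_data_r == self.monitor_config)
        self.link_data_r = link_data
        return monitor_stop

mon = Monitor(monitor_config=0x10000015B)
trace = [0x20000015A, 0x10000015B, 0x000000106, 0x100800001]
print([mon.clock_edge(d) for d in trace])
# [False, False, True, False]: the pulse follows the match by one cycle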

5.3 Event Distribution Interconnect (EDI)

The Event Distribution Interconnect (EDI) is used to distribute events from the event generators (e.g. monitors, TAP controller) to the various components of the SoC which need to respond to such events. The distribution of an event should take place as fast as possible (ideally with a single-cycle delay) for the response to be immediate. This is required so that the responding components do so as close in time as possible to the actual event triggering them.


Figure 5.4: Monitor gate-level waveforms for a breakpoint hit

But such an implementation suffers from scalability problems. Our implemented EDI broadcasts events through stop modules, of which there is one per monitor. The EDI distributes events at the functional frequency of the interconnect, and the worst-case delay (in number of cycles) is equal to the maximum depth of the stop module network (a delay of 1 cycle per stop module). The stop module network has the same topology as the interconnect elements which are monitored (in our case the routers of the NoC). This preserves scalability and also prevents complex layout and routing constraints in silicon. Below we explain our EDI implementation and its properties in detail, with the Æthereal NoC as the interconnect of the SoC. In our debug infrastructure, we use the EDI to distribute stop signals to all the network interface shells in order to stop them functionally. This signal is locally (at every NI shell) interpreted as a stop at message or element level, or simply ignored. The stop signal is an active-high pulse which is generated by one of the monitors on detection of a breakpoint hit, or when an external pulse is given through the TAP. It is then broadcast by the network of stop modules to ensure quick distribution to the network edges. A finite state machine (FSM) in the stop modules ensures that the stop signal wave travels only in one direction, and that the occurrence of multiple concurrent events does not create a standing wave, as explained later in this section.
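The worst-case delay statement can be illustrated with a small sketch: if the stop modules mirror the router topology and every hop costs one cycle, the cycles needed for a pulse to reach each module equal its graph distance from the originating module. The 2x2 mesh below is a made-up example, not one of the simulated SoCs.

from collections import deque

def edi_delay(adjacency, origin):
    """Breadth-first wave: cycles for the stop pulse to reach each stop
    module, at one cycle per stop-module hop."""
    delay = {origin: 0}
    frontier = deque([origin])
    while frontier:
        sm = frontier.popleft()
        for neighbour in adjacency[sm]:
            if neighbour not in delay:
                delay[neighbour] = delay[sm] + 1
                frontier.append(neighbour)
    return delay

mesh = {"SM1": ["SM2", "SM3"], "SM2": ["SM1", "SM4"],
        "SM3": ["SM1", "SM4"], "SM4": ["SM2", "SM3"]}
print(edi_delay(mesh, "SM1"))
# {'SM1': 0, 'SM2': 1, 'SM3': 1, 'SM4': 2}: worst case = network depth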

EDI Properties:

• The stop modules are connected to the output links of all of their neighbours. Each stop module, on detecting an incoming pulse, sends out a pulse on each of its outgoing interfaces one clock cycle later. But the stop modules are required to ignore some of the incoming pulses in order to prevent a standing wave. Consider the topology shown in Figure 5.5. On a breakpoint hit (cycle 1), the attached stop module (SM 1) broadcasts this on all its outgoing interfaces in the next cycle (2). The connected stop module (SM 2) will see this incoming pulse and then, in the next cycle (3), broadcast a pulse on all its outgoing interfaces. Further, in cycle 4, SM 1 would respond to an incoming pulse (broadcast in cycle 3 by SM 2) by broadcasting a pulse again in cycle 5. Thus the two stop modules would keep feeding each other in a loop, even though the actual breakpoint hit condition may have gone. This is the creation of a standing wave. To prevent this, the stop modules were designed to cancel out this wave: each stop module, after responding to an incoming pulse, ignores any pulse received in the following cycle (as explained below with the stop module FSM).


Figure 5.5: Standing wave creation in the EDI (NI - Network Interface, SM - Stop Module)

• If the condition for a breakpoint hit stays true for multiple clock cycles, then a train of pulses is generated, though this train represents one event (Figure 5.6). The generation of a train is a consequence of the way the stop modules are designed to prevent the creation of a standing wave. For every 3 clock cycles that the breakpoint hit remains high (the 3 clock cycles follow from the stop module FSM design and implementation), a stop module generates one active-high pulse which is one clock cycle long. Hence a breakpoint hit pulse which remains high for more than 3 clock cycles is sub-sampled, resulting in the generation of a train of pulses; these multiple pulses correspond to a single event.

Figure 5.7 shows a generic stop module. The stop modules, besides having clk and rst n signal inputs, are connected to the monitors via a monitor stop input. The monitors use this to signal the occurrence of an event. Every stop module receives an incoming pulse either via monitor stop or from one of its neighbouring stop modules via the input stop signals (stop in 0...N). The stop module then broadcasts this stop event to all its neighbouring modules (other stop modules and NIs) via the output stop signals (stop out 0...N). A stop module can also receive a user stop signal, given via the TAP through the jtag stop signal. But only one stop module is connected to the TAP; the remaining stop modules have their jtag stop signal tied low.


Figure 5.6: Sub-sampling of a breakpoint hit pulse


Figure 5.7: Stop Module interfaces, where N is the number of neighbouring devices (other Stop Modules and NIs)

The stop module FSM ensures that:

1. The distribution of the stop signal behaves like a wave which travels only in one direction.

2. Multiple breakpoint hits in time and / or place (if separated by three clock cycles) are distributed as separate pulses, and the debug component in the NI shells (the FSM) has the intelligence to interpret them as a stop signal or ignore them, depending on whether a previous stop and / or continue signal has arrived.

Figure 5.8 shows the stop module FSM. The stop module responds to an incoming stop signal only in states '00' and '11'. State '00' is also the reset state of the stop module. After reset the stop module is in state '00', where it detects an incoming stop signal. On the next clock cycle it transitions to state '01' and sends a signal to all its neighbours, signalling the detected stop. In the next clock cycle it transitions to state '10' unconditionally. This state '10' ensures that the stop wave distribution continues only in one direction and cancels out any response wave due to the broadcast.

Further the stop module FSM transitions to state ’11’ in the next clock cycle. Here it can again respond to an incoming stop signal.

States:
00 - Waiting for stop signal
01 - Send out stop signal
10 - Do nothing
11 - Detect multiple cycle / event breakpoints

Transitions (condition / action):
a - !(monitor_stop OR jtag_stop OR stop_in) / stop_out <= '0'
b - (monitor_stop OR jtag_stop OR stop_in) / stop_out <= '0'
c - next clock cycle / stop_out <= '1'
d - next clock cycle / stop_out <= '0'
e - (monitor_stop OR jtag_stop OR stop_in) / stop_out <= '0'
f - !(monitor_stop OR jtag_stop OR stop_in) / stop_out <= '0'

Figure 5.8: Stop Module FSM, where stop in is the logical OR of all N neighbouring input stop signals and stop out the output signal to all N neighbouring devices.
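A direct behavioural transcription of this FSM (states and outputs as in Figure 5.8; the Python coding is a sketch, not the synthesized RTL) also reproduces the sub-sampling of a long stop input described earlier.

class StopModule:
    """The four-state stop module FSM of Figure 5.8."""
    def __init__(self):
        self.state = 0b00  # reset state: waiting for a stop signal

    def clock_edge(self, stop_in):
        """stop_in = monitor_stop OR jtag_stop OR any neighbour stop input;
        returns the value driven on all stop_out signals this cycle."""
        if self.state == 0b00:            # waiting for a stop signal
            self.state = 0b01 if stop_in else 0b00
            return False
        if self.state == 0b01:            # send out the stop pulse
            self.state = 0b10
            return True
        if self.state == 0b10:            # do nothing for one cycle,
            self.state = 0b11             # cancelling any response wave
            return False
        # state 0b11: detect multiple-cycle / multiple-event breakpoints
        self.state = 0b01 if stop_in else 0b00
        return False

sm = StopModule()
print([sm.clock_edge(True) for _ in range(7)])
# [False, True, False, False, True, False, False]:
# a stop input held high yields one output pulse every three cycles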

Figure 5.9 shows a trace for the stop module shown in Figure 5.1. The stop module receives a stop signal (a pulse on monitor stop) from its associated monitor, which generates this signal on a breakpoint hit. The stop signal stays high for one clock cycle, and the stop module generates a pulse on each of its outputs (a pulse on each stop out), which is distributed by the EDI. The connected components then respond depending on the state they are in. If the component is a stop module, it either responds by further broadcasting a stop pulse or ignores the incoming pulse. The network interface shells receiving the stop pulse have the intelligence to interpret it in the right way.


Figure 5.9: Stop Module waveforms for a monitor stop

Figure 5.10 shows the stop module behaviour when a stop signal is given by the user through the TAP. Since the pulse is given externally by the user, the incoming stop signal (jtag stop) may stay high for multiple clock cycles. In this case a train of stop pulses is generated, which are distributed to the network interface shells. The train is due to the fact that the FSM of the stop module generates a pulse every time it detects an active-high input pulse (either on monitor stop or on jtag stop) separated by three clock cycles. Here it is not two separate breakpoint hits but a single one which remains high for multiple clock cycles. The minimum duration for which the user has to assert the external stop is two functional clock cycles, but there is no constraint as such on the maximum time. In Appendix A we explain how these constraints have been calculated. As in the previous case, the connected network interface shells / stop modules have the intelligence to interpret the pulses in the right way.


Figure 5.10: Stop Module waveforms for external user stop through TAP

5.4 Test Point Registers (TPRs)

The Test Point Registers (TPRs) provide the programmability of the debug infrastructure. By programming the TPRs, the user (the debugger) can program various breakpoints and also control the debug environment (stopping, single-stepping and continuing) during communication-centric debug. TPR programming is a very potent tool: it controls the underlying debug architecture, which in turn controls the SoC's functional behaviour in its target environment. In our debug infrastructure two TPR types are present, viz.

1. Monitor Config TPR

2. Network Interface Shell TPR (NI-Shell TPR)

Below we delve into the exact structure of these TPRs and their functions.


Monitor Config TPR

The Monitor Config TPR is used to program the breakpoint generation hardware (the monitor) with the breakpoint condition. The programming is done via the IEEE 1149.1 test access port (TAP), which is already present and used for manufacturing test. The monitor config port (the output port) of the Monitor Config TPR is connected to the Monitor, which uses it as explained in Section 5.2. Figure 5.11 shows the Monitor Config TPR programming as done through the IEEE 1149.1 TAP. When tpr hold goes low, the value on tdi of the TAP starts shifting into the TPR (tpr enable is high) via tpr tdi. This is the start of the TPR programming (Marker 1 in the figure). The shifting of the value takes place synchronously to the debug clock (tck). As soon as the shifting phase is complete, tpr hold goes high, which indicates that the shifting phase is over. More importantly, the value will remain stable as long as tpr hold is high. The value is then programmed when both tpr hold and tpr update are high (Marker 2 in the figure); this is the update phase of the programming. This is reflected by the change in the value of monitor config at precisely this point in time. The value is then seen by the monitor, which runs on the NoC functional clock. The separation between the shifting and update phases allows for a safe crossover between clock domains, and means that the Monitor Config TPR can be programmed while the NoC is functionally running, without causing glitches or false breakpoint triggers. A more detailed description of programming via the IEEE 1149.1 TAP can be found in [35]. This is how the actual programming takes place in hardware; in Section 6.1 we explain how the user programs this TPR.
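The two-phase programming can be sketched as follows: the shift register is clocked in the tck domain, and only the update copies the shifted value in one step to the functional side, so the monitor never observes a half-shifted value. The model is illustrative; the signal names follow the text above.

class TPR:
    """Shift / update phases of TPR programming via the TAP."""
    def __init__(self, width):
        self.width = width
        self.shift_reg = 0  # tck domain (shifting phase, tpr_hold low)
        self.value = 0      # value visible in the functional clock domain

    def shift_bit(self, tpr_tdi):
        # One tck cycle of the shifting phase.
        mask = (1 << self.width) - 1
        self.shift_reg = ((self.shift_reg << 1) | (tpr_tdi & 1)) & mask

    def update(self):
        # Update phase (tpr_hold and tpr_update high): the stable shifted
        # value is programmed and becomes visible to the monitor.
        self.value = self.shift_reg

tpr = TPR(width=33)
for bit in f"{0x100000357:033b}":  # breakpoint condition as in Fig. 5.11
    tpr.shift_bit(int(bit))
assert tpr.value == 0              # monitor still sees the old value
tpr.update()
assert tpr.value == 0x100000357    # programmed in one atomic step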

Figure 5.11: Programming of the Monitor Config TPR

NI-Shell TPR
The NI-Shell TPR is a data register which provides the user with all the debug control over the interconnect interactions (in our case the NoC). Every network interface shell has an NI-Shell TPR associated with it. In each of these NI-Shell TPRs, every channel of the network interface has one bit associated with it in every field of the NI-Shell TPR (Figure 5.12). By programming the various NI-Shell TPRs the user can achieve transaction / message / element-level debugging, programming operations like stop, single-step and continue at a per-channel granularity. Although the decisions for the debug actions are taken in the NI shell FSM (explained in Section 5.5), the values programmed in the NI-Shell TPR dictate them. The NI-Shell TPRs are programmed in the same way as the Monitor Config TPR, i.e. through the IEEE 1149.1 test access port (TAP). In Section 6.1 we explain the user programming of this TPR. The structure of an NI-Shell TPR can be visualized as shown in Figure 5.12; it consists of five main fields, described below. A behavioural sketch of how the NI shell combines these fields follows the field descriptions.

In Figure 5.12 the field order is: Stop Enable | Stop Granularity | Stop Condition | Continue | IP Stop. With N = the number of request channels = the number of response channels, each of the first four fields (stop_enable[1:2N], stop_granularity[1:2N], stop_condition[1:2N], continue[1:2N]) is 2N bits wide, one half for the request channels and the other for the response channels, occupying bit positions 1:2N, 2N+1:4N, 4N+1:6N and 6N+1:8N respectively; ip_stop is a single bit at position 8N+1.

Figure 5.12: The internal structure of the NI-Shell TPR, which must be known in order to program the right value for the desired control.
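Reading Figure 5.12, the layout can be captured in a small helper. This is a sketch under the reconstruction above (N request and N response channels, four 2N-bit fields plus ip_stop); the exact bit ordering within each field is illustrative.

def nishell_tpr_fields(word, n):
    """Split a programmed NI-Shell TPR word into its five fields; bit 0
    here is the first stop_enable bit."""
    mask = (1 << 2 * n) - 1
    return {
        "stop_enable":      (word >> 0 * 2 * n) & mask,
        "stop_granularity": (word >> 1 * 2 * n) & mask,
        "stop_condition":   (word >> 2 * 2 * n) & mask,
        "continue":         (word >> 3 * 2 * n) & mask,
        "ip_stop":          (word >> 4 * 2 * n) & 1,
    }

# Example for N = 2: enable a stop on one channel, at element-level
# granularity, as an unconditional (software-programmed) stop.
n = 2
word = (0b0001 << 0) | (0b0001 << 2 * n) | (0b0001 << 4 * n)
print(nishell_tpr_fields(word, n))
# {'stop_enable': 1, 'stop_granularity': 1, 'stop_condition': 1,
#  'continue': 0, 'ip_stop': 0}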

1. Stop Enable: This field dictates whether or not interactions / data on a particular channel are stopped. It is as wide as the total number (request + response) of channels in its associated NI shell; one bit controls the stop behaviour of each channel.

• A '0' means that the communication on the channel does not stop, even when a stop hit occurs for that particular channel (i.e. a stop pulse is received from the EDI (stop r), or a software stop has been programmed (stop condition[i], as explained under the 'Stop Condition' field below)). See scenarios 'A' and 'C' in Figure 5.13. In scenario 'A' there is no stop enabled for the channel (stop enable[i] is low) and no stop hit (stop is low), hence communication is not stopped (accept is still high). In scenario 'C', even though there is a stop hit (stop is high), communication continues (accept is still high) because stop is not enabled (stop enable[i] is low).

• A '1', on the other hand, enables stop for the channel that the bit corresponds to, i.e. the interactions / data on that particular channel can be stopped; depicted by scenarios 'B' and 'D' in Figure 5.13. In scenario 'B' stop has been enabled (stop enable[i] is high), but a stop does not occur (accept is still high) because a stop hit has not occurred (stop is low). In scenario 'D' a stop occurs (accept goes low), since stop has been enabled (stop enable[i] is high) and a stop hit has occurred (stop is high).

Scenarios in Figure 5.13 (stop = stop_r OR stop_condition[i]):
A - No stop occurs.
B - No stop occurs even though stop enable is asserted, because no stop hit occurred.
C - No stop occurs even though a stop hit has occurred, because stop enable is not asserted.
D - Stop occurs only when both stop enable is high and a stop hit has occurred.

Figure 5.13: The function of the Stop Enable field in the NI-Shell TPR

2. Stop Condition: Provided the stop has been enabled (stop enable[i] is high) for the channel, the channel will stop either in response to a stop pulse from the EDI (stop r) or even in the absence of such a pulse. This depends on the value programmed in this field (stop condition[i]). Like the Stop Enable field, this is also as wide as the number of channels present in its associated shell and reserves one bit for each channel.

• A '0' means the channel stops only after a pulse from the EDI has been received. See scenarios 'A' and 'B' in Figure 5.14. In scenario 'A' no stop occurs (accept is still high) because no pulse is received from the EDI (stop r is low). In scenario 'B', on the other hand, a stop takes place (accept goes low) as a stop pulse is received from the EDI (stop r is high).

• A '1' means the channel will be stopped unconditionally. So the first element to occur on that channel after the programming of the Stop Condition field (stop condition[i] is high) will be stopped, irrespective of whether or not a stop pulse arrived from the EDI (stop r), provided only that stop is enabled for that channel (stop enable is high). See scenario 'A' in Figure 5.15. Here a stop occurs (accept is low) since an unconditional stop has been programmed (stop condition[i] is high), despite the absence of a stop pulse from the EDI (stop r is low).

This field gives the user the flexibility to either wait for a stop pulse from the EDI (i.e. a breakpoint hit or an external stop) before the stop happens, or to program a channel to be stopped unconditionally (a software-programmed stop). There are two reasons for providing this field:

• In case of a really long transaction, the user can stop the NoC by programming this field, without waiting for the transaction to complete.
• A single-step consists of a continue followed by an implicit unconditional stop. This field is used to achieve the implicit stop, as explained in Section 6.2.

Scenarios in Figure 5.14 (stop = stop_r OR stop_condition[i]):
A - No stop occurs when there is no EDI pulse (stop_r is low).
B - Stop occurs only after a pulse from the EDI is received (stop_r is high).
Together these illustrate that when the Stop Condition field is de-asserted, a stop occurs only after a pulse is received from the EDI.

Figure 5.14: Behaviour when the Stop Condition field is de-asserted in the NI-Shell TPR

Scenario in Figure 5.15 (stop = stop_r OR stop_condition[i]):
A - Stop occurs even though there is no EDI pulse, illustrating that when the Stop Condition field is asserted, a stop can occur even without a pulse from the EDI.

Figure 5.15: Behaviour when the Stop Condition field is asserted in the NI-Shell TPR

3. Stop Granularity: In addition to the functionality of stopping channels, our debug infrastructure provides the user with the option of programming the granularity of the stop; in other words, at what granularity the ongoing interaction should be interrupted and stopped.

• A '0' allows the ongoing interaction to complete at the message level. This means that the entire ongoing message is accepted before a stop occurs. See scenario 'B' in Figure 5.16. A stop hit occurs at 'A' (stop goes high), but a stop occurs (accept goes low) only after the ongoing message transfer is complete, and the next message is not accepted ('B'). This is because the stop granularity is message-level (stop granularity[i] is low).

• A '1' can be programmed for a more urgent stop. This means that a stop occurs at a much lower granularity (element level). See scenario 'B' in Figure 5.17. A stop hit occurs at 'A' (stop goes high), and since the stop granularity is element-level (stop granularity[i] is high), a stop occurs immediately (accept goes low) at 'B'.

Scenarios in Figure 5.16 (stop = stop_r OR stop_condition[i]):
A - Stop hit occurs.
B - Stop occurs only after the ongoing message transfer is complete, illustrating that when the Stop Granularity field is de-asserted, a stop occurs only after the ongoing message transfer has completed.

Figure 5.16: Behaviour when the Stop Granularity field is de-asserted in the NI-Shell TPR

4. Continue: Besides the ability to stop interactions, the counter-ability to continue stopped interactions is equally important in a debug scenario. Together they give the user the power to observe the functional behaviour of the SoC in a controlled fashion during debug. The Continue field also has one bit reserved per channel. The Continue field is interpreted differently from the three fields before. In the previous cases, a '0' or a '1' written in the TPR is treated as the value itself and registered as the same value inside the shell as well (specifically in the FSM). In the case of continue, writing a '1' in the TPR causes an active-high signal (continue ni[i]) to be fed to the shell. On continuing, the shell then resets the signal value through the set-reset logic. This high pulse on the continue ni[i] signal is interpreted as a single continue pulse for that channel. With reference to Figure 5.18, when a '1' is programmed in the continue field for a particular channel (continue[i]), the set-reset logic (Set-Reset Logic (1)) outputs a '1' to the NI shell (signal continue ni[i] is high). The shell FSM then responds to this and continues the channel when all functional conditions for a continue are true. It also sends an active-high reset pulse (signal continue reset[i]) to the set-reset logic. This pulse is one clock cycle wide and resets the output of the set-reset logic (signal continue ni[i]) to '0'. As a result, every time a user wants to continue a particular channel, he has to program a '1' at the appropriate bit in the continue field. Also, when a continue takes place, the registered value which indicates that a stop hit has occurred (signal stop) is reset.

Scenarios in Figure 5.17 (stop = stop_r OR stop_condition[i]):
A - Stop hit occurs.
B - Stop occurs after completing the current element, without the ongoing message transfer completing, illustrating that when the Stop Granularity field is asserted, a stop occurs immediately, even before the ongoing message transfer is complete.

Figure 5.17: Behaviour when Stop Granularity field is asserted in the NI-Shell TPR

This is done so that the continued channel stops again only if either another pulse is received from the EDI (stop r) or an unconditional stop has been programmed (stop condition[i] is asserted). Scenarios 'A' and 'B' in Figure 5.19 explain the continue behaviour. At 'A', no continue happens (accept is still low) as no continue pulse has been received (continue ni[i] is low). But as soon as a continue pulse is received (continue ni[i] is high) and the functional conditions are valid (valid is high and data is available on data), a continue takes place (accept goes high). A reset pulse is also sent out (continue reset[i] goes high), as depicted in scenario 'B'. This pulse (continue reset[i]) also resets the continue (continue ni[i] goes low).

Figure 5.18: Continue operation

5. IP Stop: Every NI-Shell TPR also has one IP Stop bit, which enables the NI shell to gate the clock domains in the connected IP core. This also functionally stops the IP core and allows us to stop all the components of the SoC (the interconnect and the IPs) in states that are much closer in time to each other. Otherwise, stopping only the interconnect without the IPs means that the state of the IPs will have advanced far ahead, as they have continued their internal operations.


• A '1' allows the IP core connected to the NI shell associated with this NI-Shell TPR to be stopped functionally.

• A '0', on the other hand, does not stop the clock domains of the connected IP core. This is also the value to program when a continue action is desired for a core that was previously stopped.
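Putting the five fields together, the per-channel stop / continue decision described above can be summarized in one behavioural sketch. This is a simplified software model of the shell's decision logic, not the RTL: the set-reset handling of the continue bit is collapsed into method calls, and the IP Stop bit is omitted.

class ChannelDebugCtl:
    """Stop / continue decision for one channel i of an NI shell."""
    def __init__(self, stop_enable, stop_condition, stop_granularity):
        self.stop_enable = stop_enable            # TPR: stop allowed
        self.stop_condition = stop_condition      # TPR: unconditional stop
        self.stop_granularity = stop_granularity  # TPR: 1 = element-level
        self.stop_hit = False                     # registered EDI pulse
        self.continue_ni = False                  # set-reset logic output

    def edi_pulse(self):
        self.stop_hit = True       # stop_r is OR'ed into the stop hit

    def program_continue(self):
        self.continue_ni = True    # a '1' written in the Continue field

    def accept(self, first_element_of_message):
        """Would the shell let this element's handshake complete?"""
        stop = self.stop_hit or bool(self.stop_condition)
        if self.continue_ni:
            # Continue: reset the continue pulse and the stop hit, so the
            # channel stops again only on a new EDI pulse or an
            # unconditional (software-programmed) stop.
            self.continue_ni = False
            self.stop_hit = False
            return True
        if self.stop_enable and stop:
            # Element-level granularity refuses immediately; message-level
            # granularity refuses only the first element of the next
            # message, letting the ongoing message complete.
            return not (self.stop_granularity or first_element_of_message)
        return True

ch = ChannelDebugCtl(stop_enable=1, stop_condition=0, stop_granularity=0)
ch.edi_pulse()
print(ch.accept(first_element_of_message=False))  # True: message completes
print(ch.accept(first_element_of_message=True))   # False: next message stopped
ch.program_continue()
print(ch.accept(first_element_of_message=True))   # True: channel continued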

5.5 Network Interface Shell (NI Shell)

The Network Interface Shell (NI Shell) is one of the most important components of the debug infrastructure. It is the NI shell which actually implements the debug control action programmed by the user. As seen in Figure 5.1, both the NI-Shell TPR and the EDI (the stop module) connect to the NI shell. The stop module feeds into the network interface shell and signals every pulse on the EDI to it. In our debug infrastructure, a pulse on the EDI means a stop (either via a breakpoint hit or given by the user via the TAP). But because of the broadcast nature of the EDI (it only broadcasts every pulse it receives and does not take decisions), the use of the EDI can be extended in the future. The NI-Shell TPRs (as explained in Section 5.4) are programmed by the user with the desired debug control action.

Scenarios in Figure 5.19 (stop = stop_r OR stop_condition[i]):
A - No continue without a continue pulse.
B - Continue occurs only after a continue pulse has been received and the functional conditions are true.
Together these illustrate that a continue can take place only when a continue pulse is given.

Figure 5.19: The function of the Continue field in the NI-Shell TPR

The network interface shell makes a decision depending on the values fed by these two components and implements the debug control. It is in the Finite State Machine (FSM) that this decision is made. The FSM cyclically steps through various functional states. For every decision-making functional state there is a corresponding mirror state (Figure 5.20). When a stop decision is made, the FSM steps into the mirror state (which is a pause state). This is the transition from state A to A'. During this transition, the control signals on the NI-IP interface are gated, and hence no further communication can occur with the IP as long as the FSM stays in the mirror state (A'). In order for the FSM to come out of its pause state, the user has to program a continue action (detailed in Section 5.4). This is the transition from state A' to B. A continue action forces the FSM to transition back into a functional state so that the communication on the NI-IP interface can proceed as in normal functional mode. Besides a continue action being programmed by the user, the behavioural conditions for a transition to the functional state should also be satisfied (i.e. the conditions for the transition on edge 1).

Legend for Figure 5.20:
A, B, C - Functional states of the FSM
A' - Mirror state
1, 2, 3 - Conditions for functional transitions
s - Condition for the transition to the mirror state (stop action programmed and stop conditions satisfied)
1' - Condition for the transition out of the mirror state (continue action programmed and the conditions for transition on edge 1 satisfied)

Figure 5.20: NI Shell FSM (Mirror State transitions)
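The mirror-state mechanism can be sketched generically: every stoppable functional transition gets a paused twin, entered on a stop decision and left again only on a continue whose functional conditions also hold. A toy model, not the generated shell RTL:

class ShellFSM:
    """Functional state A with its mirror (pause) state A' (Figure 5.20)."""
    def __init__(self):
        self.state = "A"

    def step(self, cond_1, stop, continue_ni):
        if self.state == "A":
            if stop:                   # transition s: A -> A'
                self.state = "A'"      # NI-IP control signals are gated
            elif cond_1:               # functional transition 1: A -> B
                self.state = "B"
        elif self.state == "A'":
            # Transition 1': leave the mirror state only when a continue
            # is programmed AND the functional condition of edge 1 holds.
            if continue_ni and cond_1:
                self.state = "B"
        return self.state

fsm = ShellFSM()
print(fsm.step(cond_1=True, stop=True, continue_ni=False))   # A'
print(fsm.step(cond_1=True, stop=False, continue_ni=False))  # A' (paused)
print(fsm.step(cond_1=True, stop=False, continue_ni=True))   # B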

Below we detail the FSMs in a Narrowcast Shell and a Multiconnection Shell. A narrowcast shell is the most generic case of a shell in a master network interface (MNI), while a multiconnection shell is the corresponding case for a slave network interface (SNI).

Narrowcast Shell
Figure 5.21 shows a block diagram view of the master network interface of Figure 4.4. On the request channel there is a finite state machine (FSM 1) which decides whether to accept a command / write data element offered by the master IP, depending on the values programmed (for request channels) in its associated NI-Shell TPR and the presence / absence of a pulse on the EDI. Another finite state machine (FSM 2) is present on the response channel. This decides whether to send a read data element to the master IP which requested it, depending on the values programmed (for response channels) in its associated NI-Shell TPR and the presence / absence of a pulse on the EDI. The FIFO in the shell is used to buffer the channel IDs of the requests sent by the master IP for which a response is expected. This ensures that responses are sent back to the master IP in the same order in which the requests were sent. The FSM for the request channels of a narrowcast shell is shown in Figure 5.22. It consists of functional states and mirror states.

Figure 5.21: Narrowcast Shell (the FIFO shown buffers the channel IDs of unfinished read requests)

The following are the states corresponding to those in Figure 5.22:
A: waiting for a command message from the master IP
B: received command valid from the master IP
C: command accept is given to the master IP
D: request command message is a read (read data is expected as a response from the slave)
E: request command message is a write (hence write data must follow). If there are multiple data elements to be transferred, the FSM stays in this state until the last data element transfer is complete.
B': stop mirror state (for a command element)
E': stop mirror state (for a write data element)

The transitions:
f1 - new command element, i.e. cmd valid is high
f2 - new command element transfer is complete, i.e. cmd accept is high
f3 - new command is a read request
f4 - message transfer complete (command element only, since it is a read command)
f5 - new command is a write request
f6 - write data element transfer complete (wr valid and wr accept are high) but the request message has multiple write data elements
f7 - write data element transfer complete (wr valid and wr accept are high) and it was the last data element of the request message
s1 - new command element for a channel to be stopped AND stop has arrived, i.e. (cmd valid is high, stop enable[i] = 1) AND (stop = 1), where stop = (stop r OR stop condition[i])
s2 - new write data element for a channel to be stopped AND stop granularity = element AND stop has arrived, i.e. (data valid is high, stop enable[i] = 1) AND (stop granularity = 1) AND (stop = 1)
c1 - continue pulse is sent for the stopped channel AND f2, i.e. (continue ni[i] = 1) AND f2
c2 - continue pulse is sent for the stopped channel AND f6, i.e. (continue ni[i] = 1) AND f6
c3 - continue pulse is sent for the stopped channel AND f7, i.e. (continue ni[i] = 1) AND f7

Figure 5.22: Narrowcast Shell FSM (Request channels) - ’FSM 1’ in Figure 5.21

The FSM for response channels of a narrowcast shell is shown in Figure 5.23. It consists of some functional states and some mirror states.

The transitions:
f1 - the next channel ID has been read from the FIFO (valid-accept handshake with the FIFO complete)
f2 - read data element transfer complete (rd valid and rd accept are high) but the response message has multiple read data elements
f3 - read data element transfer complete (rd valid and rd accept are high) and the element was the last data element of the response message
s1 - new read data element for a channel to be stopped AND stop has arrived AND (stop granularity = element OR first element of the message), i.e. (stop enable[i] = 1) AND (stop = 1) AND (stop granularity = 1 OR blk size = max blk size), where stop = (stop r OR stop condition[i]), blk size = the number of elements of the message still to be transferred, and max blk size = the total number of elements in the message
c1 - continue pulse is sent for the stopped channel AND f2, i.e. (continue ni[i] = 1) AND f2
c2 - continue pulse is sent for the stopped channel AND f3, i.e. (continue ni[i] = 1) AND f3

Figure 5.23: Narrowcast Shell FSM (Response channels) - 'FSM 2' in Figure 5.21

Multiconnection Shell
Figure 5.24 shows a block diagram view of the slave network interface of Figure 4.4. There is just one finite state machine (FSM) in this shell. On an incoming message from the NI kernel (NiK), the FSM decides whether to offer it as a command / write data element to the slave IP, depending on the values programmed (for request channels) in its associated NI-Shell TPR and the presence / absence of a pulse on the EDI. If the command message is a read request, then the FSM waits until it receives the read data elements from the slave IP; it then decides whether to accept them, again depending on the values programmed (for response channels) in its associated NI-Shell TPR and the presence / absence of a pulse on the EDI. As there is only one FSM, the requests which will initiate a response from the slave IP cannot be buffered.

Figure 5.24: Multiconnection Shell

The FSM of a multiconnection shell is shown in Figure 5.25. It consists of functional states and mirror states.

The following are the states corresponding to those in Figure 5.25: A: waiting for message from NiK to send to slave IP B: send command valid to the slave IP C: received command valid from slave IP and request is a write Since request command message is a write hence write data must follow. If there are multiple data elements to be transferred then FSM stays in this state till last data element transfer is complete. D: received command valid from slave IP and request is a read Since request command message is a read hence wait for read data from the slave IP. If there are multiple data elements to be transferred then FSM stays in this state till last data element transfer is complete. B’: Stop mirror state (for a command element) C’: Stop mirror state (for a write data element) D’: Stop mirror state (for a read data element) The transitions: f1 - new message received from the NiK and new command element sent to slave IP (cmd valid is high) f2 - IP responds by accepting it (cmd accept is high) and the command sent is a write 5.5. NETWORK INTERFACE SHELL (NI SHELL) 61

Figure 5.25: Multiconnection Shell FSM - ’FSM’ in Figure 5.24

request. f3 - IP responds by accepting it (cmd accept is high) and the command sent is a read request. f4 - write data element transfer complete (wr valid and wr accept are high) but request message has multiple write data elements f5 - write data element transfer complete (wr valid and wr accept are high) and was the last data element of request message f6 - write data element transfer complete (rd valid and rd accept are high) but request message has multiple write data elements f7 - write data element transfer complete (rd valid and rd accept are high) and was the last data element of request message s1 - new command element for channel to be stopped AND stop has arrived i.e. (cmd valid is high, stop enable [i] = 1) AND (stop = 1) where stop = (stop r OR stop condition[i]) s2 - new write data element for channel to be stopped AND stop granularity = element AND stop has arrived i.e. (data valid is high, stop enable [i] = 1) AND (stop granularity = 1) AND (stop = 1) s3 - new write data element for channel to be stopped AND stop granularity = element AND stop has arrived i.e. (data valid is high, stop enable [i] = 1) AND (stop granularity 62 CHAPTER 5. DEBUG HARDWARE INFRASTRUCTURE

c1 - continue pulse is sent for the stopped channel AND f2, i.e. (continue ni[i] = 1) AND f2
c2 - continue pulse is sent for the stopped channel AND f3, i.e. (continue ni[i] = 1) AND f3
c3 - continue pulse is sent for the stopped channel AND f4, i.e. (continue ni[i] = 1) AND f4
c4 - continue pulse is sent for the stopped channel AND f5, i.e. (continue ni[i] = 1) AND f5
c5 - continue pulse is sent for the stopped channel AND f6, i.e. (continue ni[i] = 1) AND f6
c6 - continue pulse is sent for the stopped channel AND f7, i.e. (continue ni[i] = 1) AND f7
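Putting the states and transitions together, the following Python fragment is a behavioural sketch of this FSM. The class name and event encoding are hypothetical, and the transition table is inferred from the legend above (s1 is assumed to leave state A, where a command element is pending; the continue transitions leave the mirror states):

    # A behavioural sketch (Python, not RTL) of the multiconnection shell FSM
    # of Figure 5.25.
    class MulticonnectionShellFSM:
        TRANSITIONS = {
            ("A", "f1"): "B",    # new command element offered to the slave IP
            ("B", "f2"): "C",    # write command accepted
            ("B", "f3"): "D",    # read command accepted
            ("C", "f4"): "C",    # write element done, more elements follow
            ("C", "f5"): "A",    # last write element done
            ("D", "f6"): "D",    # read element done, more elements follow
            ("D", "f7"): "A",    # last read element done
            ("A", "s1"): "B'",   # stop on a command element
            ("C", "s2"): "C'",   # element-level stop on write data
            ("D", "s3"): "D'",   # element-level stop on read data
            ("B'", "c1"): "C",   # continue AND f2 (write command)
            ("B'", "c2"): "D",   # continue AND f3 (read command)
            ("C'", "c3"): "C",   # continue AND f4
            ("C'", "c4"): "A",   # continue AND f5
            ("D'", "c5"): "D",   # continue AND f6
            ("D'", "c6"): "A",   # continue AND f7
        }

        def __init__(self):
            self.state = "A"

        def step(self, event):
            # stay put on events that do not apply in the current state
            self.state = self.TRANSITIONS.get((self.state, event), self.state)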

5.6 Test Access Port (TAP)

The IEEE 1149.1 TAP is not a part of the debug infrastructure that has been designed, but it nevertheless plays a vital role in allowing the user to fully exploit the debug control options provided. The TAP together with the TAP controller (Figure 5.26) is used as the Debug Control Interconnect (DCI), while the Debug Data Interconnect (DDI) consists of the TAP and its controller together with the manufacturing scan chains. The DCI is used for programming the TPRs or for providing an external stop pulse via the attached stop module; when the internal state of the interconnect is being read out, the TAP together with the scan chains acts as the DDI. The TAP is the only window for the user to observe / program the various internal components of the SoC. Moreover, the reuse of the TAP and the manufacturing scan chains means that the actual area cost of the debug architecture is limited to components like the Monitors, TPRs, Stop Modules, and the additional logic in the Network Interface shells.

We now illustrate in further detail the connectivity of the TAP to the various debug components in the SoC (Figure 5.26). The TAP controller is essentially the component which converts the TAP signals (tck, trst n, tms, tdi, tdo) into the appropriate chip-internal signals (tck, tdi, jtag stop, tpr tdo, tcb tdo, dbg so, etc.) so that the correct debug actions are performed according to the TAP instruction given by the user. How the user achieves the actions described below using the TAP instructions is explained later in Section 6.1.

• Programming a TPR: All the TPRs in our debug infrastructure form a single chain which starts (signal tdi) from the TAP controller and also ends (signal tpr tdo) at the TAP controller, as shown in Figure 5.26. When programming a particular TPR in the chain, the value on the tdi signal of the TAP is shifted through until it is in the correct TPR (the shifting phase). Then this value is actually programmed in the update phase, as previously explained in Section 5.4 (a small software model of this chain programming is sketched after this list).

Figure 5.26: TAP and its associated infrastructure

With the clock domain crossing taken care of (also explained in Section 5.4), the TPRs can be programmed both when the NoC is running in functional mode and when it is in debug mode. If a Monitor Config TPR is programmed, the monitor will then generate an event trigger when the programmed breakpoint condition is met. In case of an NI Shell TPR, the various fields are interpreted as explained in Section 5.4 and the appropriate action is taken in the NI Shell FSMs.

• Giving an external stop pulse: The user can give an external stop pulse which the TAP controller feeds, over the jtag stop signal, to the attached stop module; the stop module then distributes it through the EDI. This external stop pulse is given when the NoC is in functional mode. There are certain constraints on the duration of this external stop pulse, as stated in Section 5.3.

• Switching from / to debug and functional clocks: The TAP controller is also connected to a Test Control Block (TCB) and the Clock Control Slice (CCS), as shown in Figure 5.26. The CCS is the module that provides the clock for the NoC. It takes as input both the clock from the clock generator and the debug clock (tck). Depending on the value programmed in the TCB, either one of these clocks is fed to the NoC.

For detailed information on the different programmable values of the TCB, how they are programmed, and the resultant behaviour, please refer to Section 7 in [29].

• Scanning out internal state of NoC: The DDI is used to scan out the internal state of the NoC. This action takes place in debug mode, i.e. the NoC is fed the debug clock (tck) by the CCS instead of its functional clock. In this mode, all the internal scan chains are concatenated into one long shift register. Since the NoC is in debug mode, the TAP’s tck signal is applied to the functional flip-flops in the NoC. This causes the shift register to shift out its state on the tdo output pin of the TAP on subsequent tck clock cycles.
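As referenced in the first bullet above, a small software model of the two-phase (shift, then update) programming of the single TPR chain might look as follows. The class, its methods, and the TPR widths are made up, and the bit and chain ordering are simplified for illustration:

    # A software model (a sketch, not the hardware) of the single TPR chain
    # on the DCI.
    class TprChain:
        def __init__(self, widths):
            self.widths = widths                # TPR widths, tdi side first
            self.shift = [0] * sum(widths)      # shift-register content
            self.held = list(self.shift)        # programmed (updated) values

        def shift_in(self, bits):
            # shifting phase: bits enter at tdi and push content towards tpr_tdo
            for b in bits:
                self.shift = [b] + self.shift[:-1]

        def update(self):
            # update phase: every TPR latches its shifted content at once
            self.held = list(self.shift)

        def program(self, index, value_bits):
            # pad so the value comes to rest in TPR `index`, then commit
            pad = sum(self.widths[:index])
            self.shift_in(list(value_bits) + [0] * pad)
            self.update()

    chain = TprChain([34, 8, 8])                # e.g. one monitor TPR, two shell TPRs
    chain.program(1, [1, 0, 0, 0, 0, 0, 0, 1])  # land a value in the second TPR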

5.7 Debug Flow Automation

The generation and instantiation of all the debug components, and their connection with the other SoC components, has been integrated into the automated Æthereal design flow. The NI-Shell TPRs are instantiated with a width that depends on the number of channels in each NI shell. A Stop Module is generated for every Router that is instantiated in the network, and one of the Stop Modules is connected to the TAP. Likewise, a Monitor Config TPR is generated for every monitor that is specified in the network.
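The following Python fragment sketches the flavour of this generation step; the function name and the input format are assumptions for illustration, with 8 control bits per channel as discussed in Section 7.4:

    # A sketch of the debug-hardware generation step (hypothetical names).
    BITS_PER_CHANNEL = 8

    def generate_debug_hardware(shells, monitors, routers):
        """shells: {shell_name: channel_count}; monitors, routers: name lists."""
        ni_shell_tprs = {name: ch * BITS_PER_CHANNEL for name, ch in shells.items()}
        monitor_tprs = [f"monitor_config_{m}" for m in monitors]   # one per monitor
        stop_modules = [f"stop_module_{r}" for r in routers]       # one per router
        # the first stop module is the one hooked up to the TAP controller
        return ni_shell_tprs, monitor_tprs, stop_modules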

6 Debug Software Infrastructure

6.1 User programming via the TAP

Figure 6.1: Setup for performing control actions via the IEEE 1149.1 TAP

In our setup, we use the Philips tool Incide to perform all our actions via the TAP. Figure 6.1 shows this setup. Incide runs on the PC which is hooked up to the TAP of the SoC. The user specifies the desired actions in a Tcl script (in the form of TAP instructions), which is invoked by Incide and as a result programs the appropriate hardware inside the SoC. Numerous instructions are available through which the user can specify control actions. The various actions that the user can perform via the IEEE 1149.1 TAP are described below (a sketch of a typical instruction sequence follows the list).

• Reset: In order to repeat a debug session, the user should be able to control the reset of the chip via the debugger software. The DBG RESET instruction achieves this. When this instruction is given, the TAP generates a reset signal which is combined with the functional reset of the chip through some additional logic, thus giving the debug user control over the functional reset of the chip.

• Programming the breakpoint: The breakpoint is programmed by programming the Monitor Config TPR. The user specifies the breakpoint condition (in our case the link value) which is written to this register. This is achieved through the PROGRAM TPR instruction, with the TPR name and the value to be programmed as arguments.

• Programming the debug control actions: The various debug control actions (stop, single-step and continue) are programmed by programming the NI-Shell TPR, again using the PROGRAM TPR instruction. As in the previous case, the user specifies the NI-Shell TPR name and the value to be programmed in it.

• Giving an external stop pulse: The user can also give a stop pulse externally via the TAP. He can then stop the SoC without programming a breakpoint and waiting for a breakpoint hit before a stop is initiated in the NoC. Using the JTAG STOP instruction, the user gives a stop pulse which is fed by the TAP controller to the Stop Module connected to it. This stop pulse is one debug clock (tck) cycle wide. The EDI then distributes this stop pulse to all connected components of the NoC.

• Switching from / to debug and functional clocks: The TAP controller is also connected to the Test Control Block (TCB) and the Clock Control Slice (CCS). The CCS feeds the clock to the various NoC functional components. Through proper programming of the TCB, the user can choose which clock is fed to the NoC (the functional clock, which comes from the clock generators, or the debug clock, which comes from the TAP). [39] gives a detailed description of the exact nature of programming the TCB. The PROGRAM TCB instruction is used to program the TCB, with the value to be programmed as the argument.

• Scanning out the internal state of the NoC: The scanning out of the internal state of the NoC (flip-flop and memory content) is done by re-using the scan chains that are inserted for manufacturing tests. For this, the functional clocks are first switched off, so that functional communication is not upset; activating the scan chains while the functional clock is running may cause glitches in the NoC data, which would corrupt the system state. After switching off the functional clock, the debug clock (tck) is fed to the NoC by the CCS. The DBG SCAN instruction is then used, which results in scanning out the internal data.
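In practice these actions are driven from a Tcl script executed by Incide; the Python fragment below is only a sketch of a typical instruction sequence, with a hypothetical tap.instruction(name, ...) transport standing in for the real tooling:

    # Sketch of a typical debug session, assuming a hypothetical `tap` wrapper.
    def breakpoint_stop_and_dump(tap, monitor_tpr, breakpoint_value,
                                 shell_tpr, shell_value, tcb_debug_value):
        tap.instruction("DBG_RESET")                                   # repeatable session
        tap.instruction("PROGRAM_TPR", monitor_tpr, breakpoint_value)  # breakpoint
        tap.instruction("PROGRAM_TPR", shell_tpr, shell_value)         # stop/step/continue bits
        # ... run until the breakpoint hits, or force a stop by hand:
        tap.instruction("JTAG_STOP")                                   # one-tck-wide stop pulse
        tap.instruction("PROGRAM_TCB", tcb_debug_value)                # functional -> debug clock
        return tap.instruction("DBG_SCAN")                             # shift out internal state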

6.2 Use of Debug Infrastructure

We control the SoC functional behavior by controlling the interaction of the IP cores with the communication infrastructure; hence only the communication at the boundary between the network and the IP cores needs to be controlled. The Network Interfaces

(more specifically the Network Interface shells) form this interface between the network and the IP cores. The NI Shells are built such that they have the necessary intelligence to control the interaction in the presence / absence of an EDI pulse (stop r) according to the actions (stop, single-step, continue) which have been programmed by the user. The user can program these actions via the NI-Shell TPR which is instantiated for every NI Shell. The programming of the NI Shell TPR is done by the user via the DCI using the IEEE 1149.1 TAP. We will further detail the exact nature of this programming and the granularity of control this imposes.

• Program a breakpoint: This is done in the Monitor Config TPR. The breakpoint condition will be programmed (monitor config) by the user via the DCI using the IEEE 1149.1 TAP.

• Stop: In order for a stop to occur, first and foremost the Stop Enable field in the NI-Shell TPR must be set (stop enable[i] = ’1’) for that particular channel. But just enabling a stop is not sufficient. A stop occurs after a stop pulse from the EDI is received, or unconditionally for that channel, depending on the Stop Condition field (stop condition[i]). So if the Stop Condition is set (stop condition[i] = ’1’) then a stop for the channel will occur even in the absence of an EDI stop pulse; otherwise (stop condition[i] = ’0’) a stop will take place after a stop pulse is received from the EDI. The granularity of the stop depends on the Stop Granularity field (stop granularity[i]). If this is not set (= ’0’) then a message-level stop will occur (i.e. the ongoing message transfer, if any, will be allowed to complete); if it is set (= ’1’) an element-level stop occurs, meaning that the ongoing element transfer, if any, is allowed to complete before a stop occurs.

• Continue: A channel is continued (or transitioned out of its stop (mirror) state into functional state) by setting the corresponding Continue field (continue =’1’). When the NI-Shell detects that a continue is programmed, it transitions the FSM out of the stop state, into functional mode.

• Single-step: As explained previously in Section 4.6, a single-step is a continue action followed by an implicit stop (see the sketch after this list). This is achieved as follows. A continue happens on setting the appropriate bit of the Continue field (continue[i] = ’1’). If at that moment the Stop Condition field is also set (stop condition[i] = ’1’), an unconditional stop follows this continue action, which in effect is a single-step. As in a stopping scenario, the granularity of a single-step depends on the Stop Granularity field (stop granularity[i]). For an element-level single-step (i.e. allow one more element to be transferred before stopping again) this field is set (stop granularity[i] = ’1’); otherwise a message-level single-step (i.e. allow one more message to be transferred before stopping again) occurs.

• Scanning out internal state of the NoC: Since we use a scan-based debug approach, the internal flip-flop and memory content of the NoC is scanned out in order to observe the state of the NoC. This is done using the DDI. After the NoC is in a quiescent state (there is no more communication taking place inside the

NoC), the functional clocks are switched off by programming the TCB through the TAP. Then the internal state of the NoC is scanned out via the DDI which runs at debug clock frequency.
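As referenced above, the per-channel semantics of the Stop Enable, Stop Condition, Stop Granularity, and Continue fields condense to a few lines of logic. The sketch below is illustrative only; the container type and helper names are made up:

    # Illustrative condensation of the per-channel control field semantics.
    from dataclasses import dataclass

    @dataclass
    class ChannelCtrl:
        stop_enable: bool
        stop_condition: bool    # set: stop unconditionally (with continue: single-step)
        stop_granularity: bool  # clear: message-level, set: element-level
        cont: bool              # continue bit

    def must_stop(ctrl, edi_pulse, at_boundary):
        """at_boundary: the current element (granularity set) or message
        (granularity clear) transfer has just completed."""
        return ctrl.stop_enable and (edi_pulse or ctrl.stop_condition) and at_boundary

    def is_single_step(ctrl):
        # a continue with an armed unconditional stop steps one message/element
        return ctrl.cont and ctrl.stop_condition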

Figure 6.2: Interesting SoC debug points (MNI-Master Network Interface, SNI-Slave Network Interface).

The interfaces numbered 1–4 in Figure 6.2 denote the key points of debug control in our infrastructure for SoC debug. With each port of a network interface an NI-Shell TPR is associated, which defines the debug actions (stop, continue and single-step) for each of them independently. If transaction-level debug is required, then only the interaction at the request (REQ) interface between the Master IP and the MNI (denoted by 1) needs to be controlled. This gives a master view of the traffic over the entire SoC. In case of multiple masters, the request interfaces of all masters should be controlled. Moreover, a message-level as well as an element-level debug view is possible at each of the interfaces. A very fine granularity (per-channel) of control is available to the user due to the presence of control bits per channel in the NI-Shell TPRs. Though the present infrastructure only allows controllability of the interactions of the NoC with its external components (the IP cores), this is sufficient for initial debug of the SoC.

6.3 Debug Flow

1. SoC Running (Initial Condition)

2. Programming of TPRs through the TAP controller. (user)

• NI-Shell TPR - to program the various debug actions
• Monitor Config TPR - to program the breakpoint hardware inside the monitor.

3. Update TPRs. (user)

4. Reset the chip (functional reset) - optional. (user)

5. Breakpoint condition occurs (detected by the monitor) / pulse given through the TAP controller, and the EDI communicates this to all NIs by pulses. (hardware)

6. NI Shells detect the pulse (for stopping in this case) and take action depending on the state they are in (FSM state) and the values programmed in the Stop Enable and Stop Granularity fields of the NI-Shell TPRs. (hardware)

7. NI Shells which have stop enabled, transition to stop (mirror) state. (hardware)

8. Detect / interpret whether or not the NoC has stopped all communication i.e. it is in a quiescent state. (user)

9. While the NoC has not stopped, do Steps 10 to 15.

10. Reprogramming of the TPRs. (user)

• Set Stop Enable (=1 for all channels) and Stop Granularity (=1 for all channels, fastest possible stop) fields in all NI-Shell TPRs.

11. Update TPRs. (user)

12. Give a pulse through the TAP controller to the stop module. (user)

13. This pulse is distributed by the EDI to all the NI Shells. (hardware)

14. NI shells detect the pulse and those that haven’t stopped transition to stop (mirror) state while those which have stopped do not react. (hardware)

15. Detect / interpret whether or not the NoC has stopped all communication. (user)

16. If all the NoC communication has stopped (i.e. the NoC is in a quiescent state) then the internal state of the NoC can be scanned out via the IEEE 1149.1 TAP. (user) This is done as follows:

• First, switch the functional clocks to debug clock. This is done by program- ming the Test Control Block (TCB) using the TAP controller. • Then the internal state is scanned out through the TAP.

17. Reprogramming the TPRs (user)

• Program a 1 in the Continue field for those channels which the user wants to continue functionally, and by programming the Stop Condition field a normal continue or a single-step can be enforced. • Also Stop Enable and Stop Granularity fields for various channels can be reprogrammed according to the action that is desired.

Note: when the Continue bits for the various channels are programmed, the user has to make sure not to stall the network and to ensure that the desired continue action can actually complete.

18. Update TPRs. (user)

19. NI Shells detect the change in the various NI-Shell TPRs and depending on the value that has been programmed in the Continue field, transition out of the stop state (provided the data for transfer does not stall it further) and resume further operation. (hardware)

20. Continue action (hardware)

• If a single-step has been programmed (Stop Condition = 1) then, depending on the Stop Granularity, the NI Shells transition through the FSM and return to the stop state. Go to Step 15.
• If a normal continue was programmed (Stop Condition = 0) then
  * If Stop Condition = 0 for all the channels (and a Continue was programmed for all channels) then go to Step 1.
  * Else go to Step 5.
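Collapsed into driver form, steps 5 to 16 of this flow amount to the loop below. This is only a sketch: the tap helper is the hypothetical wrapper from Section 6.1, and noc_is_quiescent() stands for the user's (currently manual) quiescence check:

    # Sketch of the stop-and-dump portion of the debug flow (steps 5-16).
    def run_to_statedump(tap, noc_is_quiescent, all_shell_tprs,
                         stop_all_value, tcb_debug_value):
        # steps 5-8: a breakpoint (or manual pulse) has fired; check quiescence
        while not noc_is_quiescent():                    # step 9
            for tpr in all_shell_tprs:                   # steps 10-11: enable stop
                tap.instruction("PROGRAM_TPR", tpr, stop_all_value)  # element granularity
            tap.instruction("JTAG_STOP")                 # steps 12-14: EDI pulse
        tap.instruction("PROGRAM_TCB", tcb_debug_value)  # step 16: switch clocks
        return tap.instruction("DBG_SCAN")               # scan out internal state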

7 Results

In this chapter we present the gate-level simulation traces and synthesis results obtained with the designed debug infrastructure.

7.1 Programming the TPRs

In this section we show the gate-level traces of programming for the Monitor Config TPR as done via the IEEE 1149.1 TAP. Figure 7.1 below shows how this is done.

Figure 7.1: Programming of the Monitor Config TPR

When tpr hold goes low, the value on tdi of the TAP starts shifting into the TPR (tpr enable is high) via tpr tdi. This is the start of the TPR programming (Marker 1 in the figure). The shifting of the value takes place synchronously to the debug clock (tck). As soon as the shifting phase is complete, tpr hold goes high, which indicates that the shifting phase is over. More importantly, the value will remain stable as long as tpr hold is high. The value is then programmed when both tpr hold and tpr update are high (Marker 2 in the figure); this is the update phase of the programming. It is reflected by the change in the value of monitor config at precisely this point in time. The value is then seen by the monitor, which runs on the NoC functional clock. The separation between the shifting and the update phases allows for this safe crossover

between clock domains, and means that the Monitor Config TPR can be programmed while the NoC is functionally running, without causing glitches or false breakpoint triggers. A more detailed description of programming via the IEEE 1149.1 TAP can be found in [35]. This is how the actual programming takes place in hardware. The user can enforce this by using the PROGRAM TPR instruction as explained in Section 6.1. Further, in Figure 7.2 we show the programming of an NI Shell TPR.

Figure 7.2: Programming of the NI Shell TPR

7.2 EDI stop pulse distribution

Figure 7.3 shows the traces for a stop module when it receives a stop pulse from the monitor, while Figure 7.4 shows the waveforms for a stop module when an external stop is given to it via the TAP. A stop module reacts to an incoming stop signal (monitor stop is high) only when it is in state ’00’ (state r); it then transitions to state ’01’, in the next clock cycle to state ’10’, and then outputs a signal one clock pulse wide on the output ports stop out0..N.

Figure 7.3: Stop Module gate-level waveforms for monitor stop (state r: 00, 01, 10, 11, 00).

Figure 7.4: Stop Module gate-level waveforms for an external user stop through the TAP (state r cycles through 01, 10, 11 while jtag stop remains asserted).
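Read off these traces, the stop module behaviour can be sketched as the small state function below (Python, not RTL; the exact encoding and the cycle in which the output is driven are inferred from the waveforms of Figures 7.3 and 7.4 and the description above):

    # Behavioural sketch of the stop module FSM: idle in 00; a stop request
    # walks it through 01 and 10 (one output pulse) to 11; a still-asserted
    # stop input re-enters 01, giving one EDI pulse every three cycles.
    def stop_module_cycle(state, stop_in):
        """One clock cycle; returns (next_state, pulse_on_stop_out)."""
        if state == "00":
            return ("01", False) if stop_in else ("00", False)
        if state == "01":
            return ("10", True)      # drive stop_out0..N for one cycle
        if state == "10":
            return ("11", False)
        # state "11": restart the pulse train if the stop input is still high
        return ("01", False) if stop_in else ("00", False)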

7.3 Debug Control Actions in the shells

We will show debug control at various granularities in a Master Network Interface (MNI). Figure 7.5 shows an overall picture of the debug flow as seen in gate-level simulations of its shell. The NI Shell TPR is initially programmed (stop is enabled for some channel). Then, after a functional reset, the NoC restarts communication. This functional reset is done in order to ensure that a breakpoint condition that has been programmed is not missed. When the breakpoint hit occurs, the EDI distributes this to all the network interfaces in the network. On arrival of a stop pulse from the EDI, the communication through the shell shown in Figure 7.5 stops due to the gating of the valid / accept handshakes. The shell is then in a quiescent state. Now the user can reprogram the TPRs, read out internal state via the TAP, or do a combination of these steps. In the waveform shown in Figure 7.5, a continue is programmed and hence, as can be seen, the shell resumes its functional behavior.

Figure 7.5: Waveform for debug flow in a MNI

In Figure 7.6, we show a stop on a request channel from master to MNI from the MNI's point of view. The stop has been enabled by programming the appropriate bit in the NI Shell TPR. On arrival of a stop pulse from the EDI, the accept signals for the


command and write signal groups are gated (i.e. no more accepts are sent to the master IP). But the communication on the response interface continues. Hence a stop does not occur immediately; when it does depends on the number of commands that have already been accepted and are pending, and also on how the interfaces between the SNI and the slave IP have been programmed.

Figure 7.7 is the most complex scenario and shows almost all the programmable debug actions. First, stop is enabled for one of the request channels. Then, on arrival of an EDI stop pulse, the shell stops. Next, the stop condition field is set to enforce an unconditional stop. But the shell is already stopped, so only after a continue pulse is given does the shell step ahead by one communication unit and then stop again. This is a single-step. Single-stepping at message-level and element-level granularity is obtained depending on the value programmed in the stop granularity field, as shown in Figure 7.7. Finally, when the stop condition field is reset, the functional behavior continues normally once another continue pulse is given. This will continue until another EDI pulse is seen or the stop condition field is programmed again.

In Figure 7.8 we show how only the response channel between the MNI and the master IP is stopped while the request channel continues normally. It can also be seen that, even though the stop from the EDI arrives while a response message is being sent, the stop

does not occur immediately, since the stop granularity is set to message level (stop granularity = ’0’).

Figure 7.6: Request Stop in a MNI

Finally in Figure 7.9 we show actions similar to those shown in Figure 7.7 but this time on the response channel.

All the above described scenarios are also implemented at the interface between the SNI and the slave IP. The shells there (in the SNI) also have the same intelligence and are able to perform these debug control actions.

7.4 Area Cost and Speed

The example SoC shown in Figure 7.10 below was used during simulations and synthesis to obtain the various waveforms shown in the previous sections of this chapter. It consists of 4 IP cores (2 masters and 2 slaves) which communicate over the connections shown in the figure. We synthesized the SoC with the NoC running at 125 MHz and at 250 MHz, both with and without the debug infrastructure that has been designed and developed.

Figure 7.7: Request Stop / Single-step / Continue in a MNI

The table below shows the actual post-synthesis running speed and the area numbers for these cases.

Target Speed and Debug for SoC | Running Speed of NoC (MHz) | Area (no. of blocks)
125 MHz without debug          | 142.186                    | 991893.44 (core area)
125 MHz with debug             | 135.943                    | 1005559.62 (network area)
250 MHz without debug          | 250.013                    | 1009201.50 (core area)
250 MHz with debug             | 250.000                    | 1033722.94 (network area)

Since the level at which the synthesis takes place differs for the SoC with and without debug, we cannot give an actual percentage increase in area when the debug infrastructure is included: for the SoC without debug the core is synthesized, whereas for the one with debug hardware inserted the network is synthesized. Even though we cannot comment on the percentage increase in area of the NoC / SoC when the designed debug infrastructure is added, we can certainly comment on the location and complexity of the area cost. Most of the additional area cost is associated with the NI-Shell TPRs and the additional states and logic that have been added in the shells. For every channel in the network, we add 8 bits in the NI-Shell TPR plus another 8 registers in the shells. Hence, for the TPRs and the shells, the area cost increases linearly with the number of network interfaces.

Figure 7.8: Response channel stop in a MNI

Additional area cost is also due to the EDI, which consists of the stop modules. The stop modules have the same topology as the router network, with one stop module associated with each router. Hence the area for the stop modules also increases linearly with the number of routers in the network. So overall, for the entire network, the increase in area is linear in the network size.
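This linear scaling is simple enough to express as a back-of-the-envelope model; the per-stop-module constant below is an assumption for illustration, not a synthesis number:

    # Back-of-the-envelope model of the linear area scaling argued above
    # (8 TPR bits plus 8 shell registers per channel; one stop module per router).
    def added_debug_registers(channels_per_shell, num_routers,
                              regs_per_stop_module=4):
        tpr_bits = 8 * sum(channels_per_shell)        # NI-Shell TPR bits
        shell_regs = 8 * sum(channels_per_shell)      # extra state in the shells
        stop_regs = regs_per_stop_module * num_routers
        return tpr_bits + shell_regs + stop_regs

    # e.g. four shells with 2 channels each and 4 routers: 64 + 64 + 16 = 144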

Figure 7.9: Response Stop / Single-step / Continue in a MNI

Figure 7.10: Example SoC used during simulation and synthesis.

8 Conclusions

8.1 Conclusions

The ever increasing complexity of present-day Integrated Circuits (ICs) means that errors in the first design iteration are unavoidable. Building an error-free design may thus require multiple design iterations. Effective debug can aid in reducing the number of iterations (and the time to market) with fast and accurate detection of the majority of the errors that may be present. Additionally, shrinking feature sizes mean that a greater number of IP cores can be integrated on a single IC, effectively shifting the complexity of the IC from the IP cores to the interconnect. Communication-centric debug has been proposed as a debug strategy where the interconnect of the SoC is used to debug the IC. In this strategy, raising the abstraction level from clock cycles to a higher level (like transactions) allows for a consistent view in both hardware and software. This makes it easier to interpret the actual functional behaviour of the IC and thus might help locate errors faster. A debug infrastructure has been built in order to facilitate communication-centric debug. This infrastructure allows the user to both control and observe the functional behaviour of the chip. By reusing some of the manufacturing test infrastructure, we have tried to limit the increase in SoC area. The generation of the debug hardware components that have been designed is integrated with the Æthereal design flow. Finally, we have shown by simulation how this infrastructure is used to actually perform communication-centric debug. In this thesis we have also proposed a debug flow for SoCs which combines communication-centric debug with traditional core-based debug. Further, we propose a few thoughts for future work, one of which is necessary for the complete debug flow to be implemented, while the others merely facilitate a richer user experience.

8.2 Future Work

The present debug infrastructure allows for Communication-centric debug of the SoC. But in order to have a comprehensive debug setup a few additional features can be incorporated.

• After a breakpoint hit, a stop pulse is distributed throughout the network to stop the communication taking place in the SoC. Depending on the actions programmed in their corresponding TPRs, not all interfaces of the NoC may stop their interactions after a stop has been detected, and even those that do stop may not all stop at the same instant in time. Hence a polling mechanism is required which can poll the network and determine whether or not all interactions in the network have stopped, i.e. whether it is in a quiescent state. In this polling stage, NoC internal registers which reflect end-to-end flow control for the connections can be polled. NI internal registers like credits / remote buffer space (Figure 8.1) can be read out during polling and a decision made based on these values. Only then can we safely switch from the functional clock to the debug clock in order to obtain a statedump.

Figure 8.1: Example registers that can be polled to decide on NoC quiescent state. [32]

• The existing monitors can only observe raw link data. Though this is useful at a very low debug abstraction level (like the clock level), a higher abstraction level of programming is desired, especially when the debug itself is at a higher abstraction level. Monitors which allow breakpoints to be set on transactions of the various inter-IP communications would allow for this higher abstraction level of programming of breakpoints.

• The debug infrastructure that has been designed allows only for NoC-external debug (between the NoC and the IP interfaces). This is useful for SoC debug, and the various IPs will have their own debug architectures to debug them stand-alone. But as yet there is no explicit debug infrastructure to debug the NoC itself; one needs to be developed. This can be achieved by way of extensions to the present infrastructure. For example, the routers and network kernels can also be designed to respond to pulses from the EDI. With built-in intelligence like that of the NI Shells, stop, single-step and continue functionalities can be incorporated in these components. TPRs similar to the NI-Shell TPR can help the user program the debug control actions for each of the routers and network interfaces.

• Although the TPRs and stop modules are instantiated per NI-Shell / Monitor and per router respectively, it is notable that their concatenation into a single scan chain may not always be optimal with respect to the routing length of the wires. For example, Figure 8.2 shows a network of four stop modules, where the numbering of the stop modules indicates their instantiation order. Presently, during concatenation into a single scan chain, the instantiation order along the red line (numbering order 1-2-3-4) would be followed. Instead it would be more efficient to follow the shortest path along the topology, which would minimize the routing length; in our case the orange line (numbering order 1-2-4-3). The same also holds true for the concatenation of all NI-Shell TPRs. A more topology-aware algorithm may be implemented in the future to iron out this inefficiency.

Figure 8.2: Scan-chain concatenation order for a four stop-module network.

• The statedump file presently obtained can be back-annotated to the abstraction of the various internal registers. It would be interesting to investigate back-annotation to the level of messages and transactions. Here the debugger would, for example, get a view of where certain transactions / messages / elements are in the network. Figure 8.3 shows what this could possibly look like. The network has two connections, viz. 1 and 2. A back-annotation would then tell the debugger in which component each of the messages of a connection is. In our example, message 1 of connection 1 (M 11) is in network interface 2 (NI 2). The second message of the same connection (M 12) is partly in router 2 (R 2) and the rest in network interface 1 (NI 1). The third message (M 13) is in network interface 1 (NI 1).

Figure 8.3: High-level back-annotation from statedumps (Connection 1: Master IP Core 1 to Slave IP Core 2; Connection 2: Master IP Core 2 to Slave IP Core 2; M 12 = 2nd message of Connection 1; NI = Network Interface; R = Router).

Bibliography

[1] M. Abramovici, M.A. Breuer, and A.D. Friedman, Digital Systems Testing and Testable Design, 1990.

[2] ARM, AMBA specification. rev. 2.0, 1999.

[3] ARM, Multi-layer AHB overview, 2001.

[4] Edith Beigne, Fabien Clermidy, Pascal Vivet, Alain Clouard, and Marc Renaudin, An asynchronous NOC architecture providing low latency service and its multi-level design framework, Proc. Int’l Symposium on Asynchronous Circuits and Systems (ASYNC), 2005.

[5] Davide Bertozzi and Luca Benini, Xpipes: A network-on-chip architecture for gigascale systems-on-chip, IEEE Circuits and Systems Magazine (2004), 18–31.

[6] Tobias Bjerregaard, The MANGO clockless network-on-chip: Concepts and implementation, Ph.D. thesis, Informatics and Mathematical Modelling, Technical University of Denmark (DTU), 2006.

[7] Evgeny Bolotin, Israel Cidon, Ran Ginosar, and Avinoam Kolodny, QNoC: QoS architecture and design process for Network on Chip, Journal of Systems Architecture 50 (2004), no. 2–3, 105–128, Special issue on Networks on Chip.

[8] Călin Ciordaş, Twan Basten, Andrei Rădulescu, Kees Goossens, and Jef van Meerbergen, An Event-Based Network-on-Chip Monitoring Service, Proc. of the High-Level Design Validation and Test Workshop (HLDVT), November 2004, pp. 149–154.

[9] Călin Ciordaş, Kees Goossens, Twan Basten, Andrei Rădulescu, and Andre Boon, Transaction Monitoring in Networks on Chip: The On-Chip Run-Time Perspective, Proc. of the IEEE Symposium on Industrial Embedded Systems (IES), October 2006.

[10] Călin Ciordaş, Andreas Hansson, Kees Goossens, and Twan Basten, A Monitoring-aware NoC Design Flow, Proc. of the EUROMICRO Symposium on Digital System Design (DSD), August 2006.

[11] DAFCA, DAFCA In-Silicon Debug: A Practical Example, June 2005.

[12] William J. Dally and Brian Towles, Route Packets, Not Wires: On-Chip Interconnection Networks, Proc. of the 38th Design Automation Conference (DAC), June 2001.

[13] Wilco de Boer and Bart Vermeulen, Silicon Debug: Avoid Needless Respins, Proc. Electronics Manufacturing Technology Symposium, July 2004, pp. 277–281.


[14] John Dielissen, Andrei Rădulescu, Kees Goossens, and Edwin Rijpkema, Concepts and Implementation of the Philips Network-on-Chip, IP-Based SOC Design, November 2003.

[15] Kees Goossens, John Dielissen, and Andrei Rădulescu, The Æthereal network on chip: Concepts, architectures, and implementations, IEEE Design and Test of Computers 22 (2005), no. 5, 21–31.

[16] Kees Goossens, John Dielissen, and Andrei Rădulescu, The Æthereal Network on Chip: Concepts, Architectures, and Implementations, IEEE Design and Test of Computers 22 (2005), no. 5, 414–421.

[17] Kees Goossens, Bart Vermeulen, Remco van Steeden, and Martijn Bennebroek, Transaction-based communication-centric debug, Proc. Int'l Symposium on Networks on Chip (NOCS), May 2007, pp. 195–206.

[18] Pierre Guerrier and Alain Greiner, A Generic Architecture for On-Chip Packet-Switched Interconnections, Proc. Design, Automation and Test in Europe Conference and Exhibition (DATE), 2000, pp. 250–256.

[19] A. Hemani, A. Jantsch, S. Kumar, A. Postula, J. Öberg, M. Millberg, and D. Lindqvist, Network on chip: An architecture for billion transistor era, Proc. of the IEEE NorChip Conference, November 2000.

[20] Jörg Henkel, Wayne Wolf, and Srimat T. Chakradhar, On-chip networks: A scalable, communication-centric embedded system design paradigm, Proc. of the 17th International Conference on VLSI Design (VLSID), 2004, pp. 845–851.

[21] Kalon Holdbrook, Sunil Joshi, Samir Mitra, Joe Petolino, Renu Raman, and Michelle Wong, microSPARC™: A Case Study of Scan-Based Debug, ITC, 1994, pp. 70–75.

[22] A. Hopkins and K. McDonald-Maier, Debug support for complex systems on-chip: A review, IEE Proceedings Computer and Digital Techniques 153, no. 4.

[23] R. Leatherman and N. Stollon, An Embedded Debugging Architecture for SoCs, IEEE Potentials 24, no. 1.

[24] ARM Limited, AMBA AXI Protocol Specification. Version 1.0, March 2004.

[25] K.D. Maier, On-Chip Debug Support for Embedded Systems-on-Chip, Proc. Int’l Symposium on Circuits and Systems (ISCAS), 2003, pp. 565–568.

[26] Mikael Millberg, Ernald Nilsson, Rikard Thid, and Axel Jantsch, Guaranteed Bandwidth Using Looped Containers in Temporally Disjoint Networks within the Nostrum Network on Chip, Proc. Design, Automation and Test in Europe Conference and Exhibition (DATE), 2004.

[27] William Orme, Debug IP for SoC Debug, December 2005.

[28] OCP International Partnership, Open Core Protocol Specification. Version 2.0, September 2003.

[29] Philips Semiconductors, CoReUse 4.1: Core-based Scan Architecture for Silicon Debug. Version 1.4, February 2003.

[30] Philips Semiconductors, CoReUse 4.2: Device Transaction Level (DTL) Protocol Specification, Version 2.4, February 2005.

[31] G.J. Rootselaar and B. Vermeulen, Silicon Debug: Scan Chains Alone Are Not Enough, Proceedings IEEE International Test Conference (ITC) (Atlantic City, NJ, USA), September 1999, pp. 892–902.

[32] Andrei Rădulescu, John Dielissen, Kees Goossens, Edwin Rijpkema, and Paul Wielage, An efficient on-chip network interface offering guaranteed services, shared-memory abstraction, and flexible network programming, Proc. Design, Automation and Test in Europe Conference and Exhibition (DATE) (Washington, DC, USA), vol. 2, IEEE Computer Society, February 2004, pp. 878–883.

[33] Andrei Rădulescu and Kees Goossens, Æthereal Services, July 2003.

[34] Philips Semiconductors, The i2c-bus specification, January 2000.

[35] IEEE Computer Society, IEEE Standard Test Access Port and Boundary-Scan Architecture, IEEE Std 1149.1-2001, 2001.

[36] Bart Vermeulen and Sandeep Kumar Goel, Design for Debug: Catching Design Errors in Digital Chips, IEEE Des. Test 19 (2002), no. 3, 37–45.

[37] Bart Vermeulen, Kees Goossens, Remco van Steeden, and Martijn Bennebroek, Communication-centric SOC debug using transactions, Proc. European Test Symposium (ETS), May 2007.

[38] Bart Vermeulen, Steven Oostdijk, and Frank Bouwman, Test and debug strategy of the PNX8525 Nexperia™ digital video platform system chip, ITC, 2001, pp. 121–130.

[39] Bart Vermeulen, Tom Waayers, and Sandeep Kumar Goel, Core-Based Scan Architecture for Silicon Debug, ITC, 2002, pp. 638–647.

[40] Paul Wielage and Kees Goossens, Networks on Silicon: Blessing or Nightmare?, Proc. of the EUROMICRO Symposium on Digital System Design (DSD) (Dortmund, Germany), September 2002.

A Constraints on External Stop Pulse

Present-day SoCs have multiple clock domains and hence there are bound to be clock-domain crossings. These crossings have to be taken into account, and the designer has to make sure that there are no timing violations. As a result, certain constraints are often imposed to ensure correct functional behaviour. One such clock-domain crossing takes place in the stop modules discussed in Section 5.3. In our case the stop modules operate at the functional clock frequency of the NoC, while the external stop which is given through the IEEE 1149.1 TAP is at the debug clock (tck). To ensure a safe clock domain crossing in the stop modules, certain constraints are imposed on the duration of the external stop pulse given by the user. The minimum duration of the stop pulse is two functional clock cycles of the clock on which the stop module operates. This is obtained as follows.

Figure A.1: Timing diagrams showing the minimum duration of the external stop pulse.

In Figure A.1 the first waveform is the functional clock of the stop module. At times t1 and t2 the external stop pulse is sampled (i.e. at every rising edge of the stop module clock).

• In the first scenario ’a’ the external stop pulse is already high when it is sampled at time t1. Hence the external stop pulse will be sampled correctly.

• In scenario ’b’ the external stop pulse has not yet reached a value which is considered high at time t1. Later this pulse goes low before it can be sampled a second time (at t2); hence this pulse will be missed.

• In the third scenario ’c’, at the first sampling time t1 the external pulse value is still low. But since it stays high for at least two functional clock cycles of the stop module, this pulse is detected high on the next rising edge (at t2). In this way the pulse will not be lost.

There is no strict constraint on the maximum duration of the external pulse. But it is interesting to note that, for every 3 clock cycles of the stop module functional clock, the stop module generates one pulse on the EDI, for the reasons previously explained in Section 5.3.
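A toy simulation of this sampling argument might look as follows. It is illustrative only: an ideal sampler sees the pulse exactly at the rising edges, whereas the two-cycle minimum additionally budgets for the marginal signal levels of scenario ’b’, which this ideal model can only approximate with a sub-period pulse:

    # Toy model: a pulse is detected only if some rising edge (t = 0, T, 2T, ...)
    # falls inside it. Any pulse of at least two periods always contains an edge.
    import math

    def pulse_detected(pulse_start, pulse_len, period=1.0):
        """True if a rising edge falls inside [pulse_start, pulse_start + pulse_len)."""
        first_edge = math.ceil(pulse_start / period) * period
        return first_edge < pulse_start + pulse_len

    assert not pulse_detected(0.1, 0.8)   # short pulse between two edges: missed
    assert pulse_detected(0.1, 2.0)       # >= two periods: always caught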

B List of Acronyms

ATE - Automated Test Equipment
BE - Best Effort
BES - Best Effort Service
BP-TPR - BreakPoint TPR
DCI - Debug Control Interconnect
DDI - Debug Data Interconnect
DfD - Design-for-Debug
DSM - Deep Sub-Micron
DTL - Device Transaction Level
E2EFC - End-to-End Flow Control
FIFO - First-In, First-Out
GL - Gate Level
GS - Guaranteed Service
GT - Guaranteed Throughput
IEEE - Institute of Electrical and Electronics Engineers
IP - Intellectual Property
JTAG - Joint Test Action Group
MNI - Master Network Interface
MNIP - Master NIP
NI - Network Interface
NIP - Network Interface Port
NiK - Network interface Kernel
NiS - Network interface Shell
NoC - Network-on-Chip
OCP - Open Core Protocol
R - Router
RTL - Register Transfer Level
SNI - Slave Network Interface
SNIP - Slave NIP
SoC - System-on-Chip
TAP - Test Access Port
TCB - Test Control Block
TLM - Transaction Level Model
TPR - Test Point Register
VHDL - VHSIC Hardware Description Language
