Research Collection

Doctoral Thesis

Programmable intellectual property modules for system design by reuse

Author(s): Röwer, Thomas

Publication Date: 2000

Permanent Link: https://doi.org/10.3929/ethz-a-004039264

Rights / License: In Copyright - Non-Commercial Use Permitted


ETH Library Diss. ETH No. 13905

Programmable Intellectual Property Modules for System Design by Reuse

A dissertation submitted to the SWISS FEDERAL INSTITUTE OF TECHNOLOGY ZURICH

for the degree of Doctor of Technical Sciences

presented by THOMAS RÖWER, Dipl. Ing., born August 1st, 1969, citizen of Germany

accepted on the recommendation of Prof. Dr. W. Fichtner, examiner, and Prof. Dr. L. Thiele, co-examiner

2000

Acknowledgments

I would like to thank my supervisor, Prof. Dr. Wolfgang Fichtner, for his overall support and for his faith in me and my work. I would also like to thank Prof. Dr. Lothar Thiele for reading and commenting on my thesis.

Special thanks to Hubert Kaeslin and Norbert Felber for their encouragement and support during my research work as well as for their proofreading of and commitment to my thesis. Furthermore, I want to thank the secretaries for taking care of the administration and Christoph Wicki for his excellent supervision of hard- and software.

Hanspeter "Die Granate" Mathys was a great help in taking the chip photos included in this thesis. The support of Mathias Brändli and Robert Reutemann on the chip design process was of great value.

I acknowledge the financial support of KTI, a commission for technology and innovation supported by the Swiss government, and of Siemens Schweiz AG.

Several students worked with me during my thesis and provided very helpful discussions on different topics of this thesis. Peter Lüthi did a great job in implementing the register-based processor.

Very special thanks go to Markus Thalmann and Manfred Stadler, my co-workers on the ITRASYS project. Many ideas presented in my thesis originate from discussions with the two of them.

I also want to express my gratitude to all my colleagues at the Integrated Systems Laboratory who contributed to the good working environment; especially Michael Oberle made the IIS a fun place to work.

I owe very special gratefulness to my parents who made this all possible, in the first place.

I want to particularly thank my family: my wife, Susanne, for her unlimited patience and understanding, and our son, Alexander, for sleeping quite a lot at night.

Contents

Acknowledgments iii

Abstract ix

Zusammenfassung xi

1 Introduction 1

1.1 Goals of this work 3

1.2 Structure of the thesis 3

2 Hardware/software co-design 5

2.1 Principles 5

2.2 System-level hardware/software co-design 6

2.3 An example implementation 8

2.3.1 Description of the system 10

2.3.2 Hardware/software partitioning 11

2.3.3 Task scheduling and interrupt handling 12

2.3.4 Application-specific instruction set processor 13


2.3.5 Area efficiency 15

2.4 Discussion and Outlook 19

3 System design by reuse 23

3.1 History of IC design 23

3.2 System design and IP reuse 25

3.2.1 System design flow 27

3.2.2 IP module design 29

Hard IP modules 29

Soft IP modules 30

IP design flow 31

3.2.3 Functional verification 33

3.2.4 Test 34

4 The case for programmable intellectual property modules 37

4.1 Present Situation 38

4.2 Concept 42

4.3 System design advantages 43

4.3.1 Flexible hard-IPs 45

4.3.2 Adaptable interfaces 45

4.3.3 Block test 47

4.4 Evaluation 48

5 The embedded processor 51

5.1 Requirements 52

5.2 The stack processor 53

5.2.1 Architecture 53

5.2.2 Qualitative adaptation of the processor core 56

5.2.3 Numerical parametrization of the processor core 57

5.2.4 System interfaces 60

5.2.5 Implementation 60

5.3 The register-based processor 61

5.3.1 Architecture 62

5.3.2 Qualitative adaptation of the processor core ... 66

5.3.3 Numerical parameterization of the processor core 66

5.3.4 System interfaces 67

5.3.5 Implementation 69

5.4 Comparison 70

5.4.1 Compiler support 70

5.4.2 Interrupt capabilities 71

5.4.3 Parametrization 71

5.4.4 Area efficiency 71

6 Functional verification of PIP modules 77

6.1 Behavioral model 78

6.2 Expected responses generation 82

6.3 Functional verification of the customized RTL model 85

6.4 Test bench and real-time constraints 85

7 A PIP module design example 87

7.1 Architecture 87

7.2 Adaptable Interfaces 91

7.3 Built-in self-test 95

7.4 Parametrization of the PIP module 97

7.5 Implementation Results 100

8 Conclusions 105

A Parameter set of the PIP module 107

B Parameter set of the register-based processor IP module 113

Bibliography 117

Curriculum Vitae 131

Abstract

The design of integrated circuits (ICs) is currently undergoing a paradigm shift from "application-specific" to "reusable". Increasing system complexity asks for a new approach to IC design, because starting a design from scratch in each and every new project leads to severe violations of time-to-market requirements. Reusing large portions of previous projects for a new system may overcome the imposed efficiency problems.

Hardware/software co-design is one of the most prominent methodologies used to improve design efficiency and quality. However, system design by reuse is problematic in combination with system-level hardware/software co-design. The tight coupling between a system processor and several reusable hardware blocks does not fit well into a system design flow based on several independent intellectual property (IP) blocks.

This thesis proposes the concept of programmable intellectual property (PIP) modules to solve this problem. The key innovation is to include a processor in every major IP module. PIP modules offer the possibility to integrate highly reusable IP modules that have superior properties because the embedded processor can be used for several system design tasks. Some of these are flexibility in the implementation of standards and protocols, system debugging after silicon production, and the ability to postpone decisions thanks to the high configurability. PIP modules fit perfectly into a system design flow based on system-level functional partitioning and IP-level hardware/software partitioning.

The most important building block of PIP modules is obviously


the embedded processor. A detailed investigation of processor concepts showed that a register-based architecture using shadow registers for fast interrupt context switching is best suited for PIP modules.

A convenient functional verification flow is crucial for successful IP modules. A simulation-based flow suitable for highly parametrized PIP modules has been established. It uses a configuration-independent behavioral model to check the correctness of the synthesizable RTL model.

An experimental implementation of an STM-1/STS3 block demonstrates additional advantages of the PIP concept, like flexible hard-IPs, software-driven BIST, and adaptable interface protocols.

Zusammenfassung

The design of integrated circuits (ICs) is currently undergoing a profound change from "application-specific" to "reusable". The constantly growing complexity of integrated systems calls for a new approach to IC design. The ever shorter product cycles demanded by the market can no longer be achieved if every new project has to start its design from scratch. If large parts of previous projects can be carried over into a new IC, many efficiency problems can be solved.

The most frequently applied method for increasing efficiency and quality in IC design is the concurrent development of circuit parts and program code, called hardware/software co-design. It is, however, very problematic to combine this method with a system design process based on the reuse of large parts of previous circuits. The very tight coupling between the system processor and several reusable circuit blocks does not fit well into a design process based on IP modules (IP: intellectual property).

To solve this problem, this doctoral thesis proposes the principle of programmable, reusable building blocks (PIP). The most important innovation of this concept is to integrate a processor into every major IP module. PIP modules have outstanding properties, enabled by the integration of the processor, and offer many advantages in system design. For example, high flexibility is achieved in the implementation of protocols and standards, possible errors can still be corrected after the IC has been produced, and decisions can be postponed until late in the design process. PIP modules fit perfectly into a design process based on partitioning the functionality at the system level and dividing it into hardware and software at the module level.

The most important building block of PIP modules is obviously the embedded processor. A detailed investigation of different processor concepts showed that a register-based architecture, augmented with shadow registers for fast context switching, is best suited.

Functional verification is extremely important for the success of IP modules. For highly parametrizable PIP modules, a simulation-based approach was established in this thesis. It uses a configuration-independent behavioral model to check the correctness of the RTL model.

The implementation of an STM-1/STS3 block revealed further important advantages of PIP modules. The new concept enables the realization of flexible hard-IP modules, a BIST realized in software, and adaptable interface protocols.

Chapter 1

Introduction

In 1969 Gordon Moore formulated his famous law, which states that the complexity of integrated circuits doubles every 18 months. Moore proposed this law because the trend had been valid for the preceding years and because it would enable the semiconductor industry to maintain its growth by constantly pushing into new application areas.

During the 1980s and 90s, production technology was the major threat to Moore's law. The improvement to ever smaller feature sizes was threatened by several problems that demanded intensive research and development to overcome. Figure 1.1 shows the impressive growth of the number of transistors per chip for memory chips and different processors, respectively. Today it seems certain that production technology will improve at the pace of Moore's law for at least another ten or fifteen years, as foreseen by the ITRS Roadmap [SIA99]. The ability to integrate more and more functionality on an IC has presented a constant challenge to circuit designers. In the near future, increasing design complexity at the scheduled rate will be limited by the ability of circuit designers to handle the massive complexity [SIA99].

Currently, ICs are designed using hardware description languages (HDLs). HDLs allow the designer to describe the functionality of the circuit under design in software at the register transfer level (RTL). The resulting HDL code is then synthesized into gates that in turn need to be assembled on the silicon die to form the final IC.

Figure 1.1: Increasing complexity of integrated circuits according to Moore's Law

Software tools that help to automate the design flow are available. Two approaches are taken to improve design efficiency and to better handle complexity.

Firstly, the ASIC functionality is distributed between hardware and software. Processors or controllers that run software code are placed on one die with custom hardware to jointly implement the desired functionality. This enables the development of the software-controlled functionality after the hardware design is completed or even while the chip is already in production, which significantly reduces time-to-market [GS98].

The other approach is the use of intellectual property (IP) modules to assemble the system [GZ97]. IPs are relatively large functional blocks, designed to be reused in as many different systems as possible. The increased design effort for the first implementation pays off in future designs. IP modules come mainly in two different forms. Soft IPs consist of HDL code that can be synthesized and then integrated into the system. This solution offers very high flexibility because the HDL code is amenable to parametrization, but it is relatively unpredictable in terms of the physical properties of the resulting circuit. Hard-IPs are reused as routed blocks that can simply be placed and connected to the target system. They are very inflexible and technology dependent, but need a much smaller design effort and offer good predictability.

1.1 Goals of this work

In this thesis a new paradigm called "Programmable Intellectual Property" (PIP) modules is proposed. It efficiently combines hardware/software co-design and system design by IP reuse. The major goal of this work is to prove the feasibility of the new paradigm. Additional benefits of PIP modules that will be highlighted in the thesis are:

• Highly flexible hard-IPs

• Software driven logic BIST

• Adaptable interface protocols

A very important part of PIP modules is the embedded processor. The requirements for this processor will be formulated and two example implementations shall be evaluated with respect to their suitability for PIP modules.

1.2 Structure of the thesis

Chapters 2 and 3 provide an introduction to hardware/software co-design and system design by IP reuse, respectively. These two chapters present the state of the art in these fields and give an outlook on future trends.

Chapter 4 introduces the new paradigm of programmable intellectual property modules. It presents the background of the design principle, its benefits, and its drawbacks. In chapter 5 the general requirements for the embedded processor are discussed and two possible processor architectures are investigated in detail. The functional verification flow developed for PIP modules is described in chapter 6. Chapter 7 finally describes an example implementation of a PIP module. The work is concluded in chapter 8.

Chapter 2

Hardware/software co-design

2.1 Principles

In hardware/software co-design the functionality of an application-specific integrated circuit is split into hardware and software running on an embedded processor. This partitioning has three major advantages over "classic" ASICs, which implement all functionality in hardware.

1. The time needed to develop the circuit can be reduced by a co-design approach. Hardware and software can be developed in parallel, which reduces the overall time to complete the project. However, this is only true if no time must be spent on processor development and on creating a software development environment for the embedded processor.

2. The flexibility of the final product is increased by the possibility to download different software versions.

3. As the implementation is split into two parts - hardware and software - it becomes easier to handle the increasing amount of functionality.


On the other hand, a new breed of problems arises. The most obvious new task is optimal partitioning between hardware and software. It has an enormous impact on system flexibility, size, and development time. Next, the hardware-software interface must be designed carefully to obtain seamless integration of both parts inside a single chip. The communication between the different blocks of the chip and the embedded processor, as well as task scheduling, are major parts of this interface. Verification and test issues must also be adapted to the hardware/software design flow.

2.2 System-level hardware/software co-design

System-level hardware/software co-design is characterized by an embedded processor accompanied by application-specific hardware on one chip. The ASIC hardware is used for performance-critical tasks like high-bandwidth data processing or demanding signal processing. The embedded processor takes over control tasks that do not demand ultimate performance.

The main advantages of this approach have been stated in the previous section. However, they do not come for free. A whole new class of problems must be solved by the ASIC designer. The basic complication is the need to find a clever way to embed the software content into an ASIC.

Obviously the designer must partition the system functionality into two parts: one that will be implemented in hardware and another that becomes software running on the embedded processor. This partitioning is usually accomplished manually by experienced designers. Automated partitioning at the system level has been intensively investigated for some time [GM92] [BdML+97] [GVNG94]. The Vulcan system [GdM96] starts with all functions allocated to ASICs and moves selected functions to the processor to reduce implementation costs. The COSYMA system [HEHB91] uses the opposite approach: it starts with all functions implemented in software running on the embedded processor and moves selected functions to hardware to meet the performance goals.
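Either direction reduces to a greedy loop over candidate functions. The sketch below follows the COSYMA-style all-software starting point under a deliberately simplified sequential-execution model; the task names, cycle counts, and deadline are invented for illustration and are not taken from any of the cited systems.

```python
# Hypothetical sketch of COSYMA-style partitioning: start with every function
# in software and greedily move functions to hardware until a performance
# goal (here a simple cycle deadline) is met.  All numbers are invented.

def partition(tasks, deadline):
    """tasks: list of (name, sw_cycles, hw_cycles). Returns (hw_set, total)."""
    hw = set()

    def total_time():
        # sequential model: a task costs its hardware cycle count once moved
        return sum(hw_c if name in hw else sw_c for name, sw_c, hw_c in tasks)

    # try the task with the biggest software-to-hardware speedup first
    for name, sw_c, hw_c in sorted(tasks, key=lambda t: t[1] - t[2], reverse=True):
        if total_time() <= deadline:
            break
        hw.add(name)
    return hw, total_time()

tasks = [("filter", 50, 5), ("framer", 30, 4), ("crc", 20, 3), ("mgmt", 10, 9)]
hw, cycles = partition(tasks, deadline=40)
print(sorted(hw), cycles)  # ['filter', 'framer'] 39
```

Real tools additionally weigh silicon cost and communication overhead, which is exactly why the search space explodes and, as noted below, must be limited.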

Recent publications clearly state that the search space must be significantly limited to successfully allocate tasks to hardware and software [BG99] [MBR99]. This can be done by using hierarchical approaches [GQXHG99], by limiting the search space to specific application fields [SKK+00], or by intensive user-tool interaction. Fully automatic hardware/software partitioning at the system level is currently not feasible.

The task can be further complicated by real-time constraints, in which case timing becomes part of the system functionality. Defining system functionality and timing constraints in an executable format at the system level is far too complex to be used efficiently. At the RT level, all timing information is captured by a single clock signal, and real-time constraints can be expressed as multiples of the clock cycle time. This is not possible at the system level, which increases the modeling complexity significantly.

Under real-time constraints the hardware-software interface is of special importance. Different tasks must be scheduled for execution on the embedded processor. The best fixed-priority scheme for periodic tasks is rate-monotonic scheduling [YW96]. For dynamically changing task structures the earliest deadline first (EDF) strategy promises better results [STZ+99]. Rate-monotonic scheduling assigns a fixed priority to each task and schedules the tasks according to their priority. The highest priority is given to the task with the shortest period. The scheme guarantees schedulability up to a processor utilization of 69%. For higher utilization it can no longer be guaranteed that all deadlines are met under worst-case conditions. This worst-case assumption is very pessimistic; practical implementations show that 88% is a more realistic upper bound [SRS94]. The utilization does not include the time consumed by processor task switches, cache misses, or pipeline stalls.
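The 69% figure is the limit of the classic sufficient utilization test for rate-monotonic scheduling, n(2^(1/n) - 1), which tends to ln 2 ≈ 0.693 as the number of tasks grows. A minimal sketch of the test (the task parameters are invented examples):

```python
# Sufficient (not necessary) schedulability test for rate-monotonic
# scheduling: n tasks are guaranteed to meet all deadlines if their total
# utilization stays below n * (2**(1/n) - 1).

def rm_bound(n):
    return n * (2 ** (1.0 / n) - 1)

def rm_schedulable(tasks):
    """tasks: list of (execution_time, period) pairs in the same time unit."""
    utilization = sum(c / t for c, t in tasks)
    return utilization <= rm_bound(len(tasks))

# e.g. 1 ms of work every 4 ms plus 1 ms every 6 ms: utilization ~0.417
print(round(rm_bound(2), 3))             # 0.828
print(rm_schedulable([(1, 4), (1, 6)]))  # True
```

Tasks that fail this test may still be schedulable in practice, which is why measured systems reach the higher 88% utilization quoted above.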

The embedded processor is the key element of a hardware/software co-design system. The architecture of the processor must be chosen with respect to the hardware/software partitioning, the hardware/software interface, and the performance demanded by the software tasks.

The global system-specific decisions are hard for automatic tools and are generally made manually by experienced designers. The rest of this section describes a practical approach to the design of a hardware/software co-design system. The solutions found are quite system-specific and not intended to offer general insights. However, the details of the system design process described will prove valuable in the following chapters.

2.3 An example implementation

Microelectronics has enabled the rapid growth of the telecommunication market. The development has led to ever increasing data rates, unprecedented new services, and heterogeneous networks. Today, decreasing time to market and ever shorter product lifetimes pose severe problems for telecommunication chip designers [GvWHf96].

Figure 2.1: SDH network incorporating add-drop multiplexers that form the interface between SDH and PDH

In this environment companies are pushed towards hardware/software co-design, intellectual property reuse, and systems on a chip (SOC). Although telecommunication is a well suited field for IP reuse because standards define the functionality, it is not very common today, due to the discipline required and the high initial investments it takes. Hardware/software co-design, on the other hand, has been intensively discussed in academia for some time [GvPL+96] [BdML+97] and is starting to be included in advanced industrial projects.

Figure 2.2: Block diagram of the add-drop system showing the data path hardware blocks and the embedded processor.

2.3.1 Description of the system

A highly integrated add-drop multiplexer (ADM) for Synchronous Digital Hierarchy (SDH) networks serves as an example here. The ADM is embedded into the STM-1 hierarchy and serves in telecommunication networks as an interface between SDH and the Plesiochronous Digital Hierarchy (PDH) (see Figure 2.1). A detailed description of the ADM-1 system is given in [TSFF99] and [Tha00].

The system provides two 155 Mbit/s STM-1 ports, 63 2 Mbit/s PDH interfaces, and one 34 Mbit/s PDH input/output. To reduce system size and costs, all data processing between these ports is performed in one single ASIC. Its data path is shown in Figure 2.2.

After initial frame synchronization in the STM-1 block, the data stream is serial/parallel converted. The overhead is separated from the payload and forwarded to the embedded processor. The system bus bandwidth is maintained although the overhead is removed from the data stream; the resulting extra bandwidth is used to account for data rate variations throughout the ASIC by marking used and unused blocks on the bus. The STM-1 bus interface adds the data to the system bus at the incoming clock speed. No data rate adaptation is performed at this point, although the system bus may have a slightly different clock frequency than the STM-1 block.

The overhead of the virtual containers VC-4, VC-3, and VC-12 is also processed in software and the data stream is fed to the combined buffer-coupling matrix unit. In the buffer, the frequency adaptation for the whole data path is performed. Both Sub-Network-Connection-Protection (SNCP) and Multiplex-Section-Protection (MSP) are realized by the coupling matrix unit. The bundling of SNCP and MSP into one coupling matrix and the compaction of all frequency adaptations into one large buffer help to simplify the data path and reduce the gate count. The C-12 and C-3 containers are finally transmitted to the PDH outputs [Int96].

In the up-path, from the PDH inputs to the SDH output, the different units basically do the opposite of their counterparts in the down-path. The data are multiplexed into the appropriate containers and the overhead is, where necessary, requested from the processor. The STM-1 signal is eventually output at system clock frequency.

In addition to the overhead management, the embedded processor also forms the interface to the network management. It receives and handles the configurations for the whole system and delivers fault and performance data to the network management interface for further processing.

2.3.2 Hardware/software partitioning

For a long time, several tasks of a broadband telecommunication network element have been realized in software, especially the fault and performance data processing with long integration times and the interfacing to the network management. These tasks are executed by a standard processor mounted on the system board that can be programmed using standard high-level languages. All tasks that are not directly related to the ASIC hardware and that do not need fast reaction are realized in this processor.

Our task was to find the optimal partitioning between hardware and software inside the ADM-1 ASIC. To do so, we analyzed the basic building blocks of telecommunication systems. Typical systems consist of one or more high-bandwidth signal paths and, apart from frequency adaptation blocks, of relatively slow control and supervision loops. The signal paths can be divided into several sections with quite different data rates: in our ADM-1, data rates vary from 155 Mbit/s to 2 Mbit/s.

The data stream contains the payload and the overhead, which is inserted into the data stream for network control and supervision purposes. The overhead has to be analyzed, and countermeasures for possible transmission errors need to be taken. The management interface allows configuration of the complete system and acquisition of short-term fault and performance data for further processing.

Using one or several digital signal processors (DSPs) seemed inadequate to us for several reasons. The high data rate of the incoming data stream would require a very fast and specialized DSP. The 126 2 Mbit/s channels, on the other hand, would initiate frequent and inefficient task switches. Although this solution offers the highest flexibility, it would be large and power-consuming.

The chosen partitioning was motivated by the idea to have flexibility wherever needed and, at the same time, high performance wherever required. As the payload consumes the major part of the bandwidth, we decided to handle it in hardware, which offers high performance and reliability.

The overhead, on the other hand, is analyzed in software. Overhead processing changes frequently because the standards get updated or new functionality is added. Some overhead bytes, for example, are already reserved for future use but their functionality is not yet specified. Other overhead bytes are used for different purposes in different countries. Here the additional flexibility of software processing helps to react quickly to changes, or to change features late in the design cycle.

2.3.3 Task scheduling and interrupt handling

The scheduling of the data processing tasks according to their priority is the task of the interrupt handler. In our ADM-1 most interrupts occur periodically, but due to the asynchronicity of the different hierarchical SDH levels the sequence of the interrupts changes heavily. The implemented rate-monotonic interrupt scheduling guarantees that high-priority tasks will always suspend the execution of lower-priority tasks. The high number of program interruptions poses no problem to a stack architecture processor, since no context has to be saved explicitly. The current context is pushed down the stack when an interrupt is launched and returns to the top of the stack after the execution of this interrupt is finished.

The different interrupt sources of the ADM-1 data path are scattered over the whole ASIC. Simple synchronous data transfers at processor clock speed are not possible due to the long distances on the very large silicon die. This is mainly due to the fact that the ASIC was integrated on a 12x12 mm2 sea-of-gates master (see Table 2.3). The transfer of one data byte from a data path hardware block to the processor would take more than one processor clock cycle. Therefore, a special hardware interface was required to avoid idle cycles of the processor.

In the write direction, a FIFO forms the interface between the fast processor and the data path blocks running at lower clock frequencies (Figure 2.3). In the read direction, the interrupt handler first collects data from the data path into a cache RAM and then forwards the interrupt to the processor. The processor can therefore pick up the data from the cache at its own clock speed.

Figure 2.3: Interrupt handler of the ADM-1 ASIC

For economic reasons, a single processor should be used for all software tasks, while software complexity should nonetheless be kept low. The major problem of the one-processor approach is the complex scheduling of the many jobs with different frequencies. To solve it, all software tasks are handled interrupt-driven, and rate-monotonic scheduling, which gives the highest interrupt priority to the highest-frequency task, was applied. This approach yields more than one million interrupts per second in the ADM-1 system. The interrupt frequencies of the 17 sources vary from 1 kHz to 72 kHz.
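Rate-monotonic priority assignment for such interrupt sources reduces to sorting by period. The sketch below illustrates this; the source names and frequencies are invented, since the text only states the 1 kHz to 72 kHz range of the 17 sources.

```python
# Rate-monotonic priority assignment: the highest-frequency (shortest-period)
# interrupt source gets the highest priority.  Source names and frequencies
# are hypothetical examples, not the ADM-1's actual 17 sources.

sources_hz = {"vc12_oh": 72_000, "vc3_oh": 8_000, "stm1_oh": 2_000, "mgmt": 1_000}

# sort by descending frequency == ascending period; priority 0 is highest
by_priority = sorted(sources_hz, key=sources_hz.get, reverse=True)
priority = {name: p for p, name in enumerate(by_priority)}

print(priority["vc12_oh"])  # 0 (shortest period, pre-empts everything else)
print(priority["mgmt"])     # 3 (longest period, runs only when idle)
```

Because the assignment depends only on the fixed periods, it can be computed once at design time and wired into the interrupt controller.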

2.3.4 Application-specific instruction set processor

To deal with such high interrupt loads we implemented an application-specific stack processor. Stack architectures avoid the time-consuming save and restore operations that are typical for register-based architectures.

Since SDH traffic and overhead data are organized on a one-byte basis, and the ADM-1 data path is 8 bit wide as well, an 8-bit stack architecture is adequate for the processor.

The processor operates with a three-stage pipeline: instruction fetch, instruction decode, and execute & write back. The instructions are 16 bits wide. The memory is organized as separate data and program sections (Harvard architecture). The first pipeline stage fetches instructions from the on-chip instruction RAM. During stage two, instructions are decoded, control signals are generated, and target addresses are calculated. In stage three, instructions are executed, data registers are modified, and a data-memory access is performed. Since new addresses are computed in stage two, the ALU operation can use the same cycle as a memory read or write operation. After a jump or conditional branch, the next instruction is always executed (delayed branch) to prevent a pipeline stall.
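Delayed-branch semantics can be illustrated with a tiny interpreter. This is a toy instruction set invented for illustration, not the processor's actual ISA: the instruction in the delay slot directly after a branch executes in any case, so the already-fetched instruction never has to be discarded.

```python
# Toy illustration of delayed-branch semantics (hypothetical mini-ISA):
# the delay-slot instruction after "br" executes before control transfers.

def run(program):
    acc, pc, out = 0, 0, []
    while pc < len(program):
        op, arg = program[pc]
        next_pc = pc + 1
        if op == "add":
            acc += arg
        elif op == "out":
            out.append(acc)
        elif op == "br":
            # execute the delay-slot instruction before redirecting the pc
            # (assumes a branch is never the last instruction)
            slot_op, slot_arg = program[pc + 1]
            if slot_op == "add":
                acc += slot_arg
            elif slot_op == "out":
                out.append(acc)
            next_pc = arg
        pc = next_pc
    return out

prog = [("add", 1), ("br", 4), ("add", 10), ("add", 100), ("out", 0)]
print(run(prog))  # [11]: the delay-slot "add 10" ran, "add 100" was skipped
```

In real code the compiler or programmer fills the delay slot with a useful instruction (or a no-op), trading one slot of scheduling freedom for a pipeline that never flushes on branches.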

Four LIFO hardware stacks are implemented: two 8-bit wide stacks to store data and condition codes and two 16-bit wide stacks for memory and return addresses (Figure 2.4). Two stacks of equal width share one physical memory, in which one of the stacks grows top down while the other grows bottom up. This results in a dynamic partitioning of the memory. Both stack RAMs can be accessed concurrently in the same processor cycle. Since the return stack and the condition code stack are used exclusively for interrupt handling, the two stacks sharing one RAM are never accessed simultaneously and therefore no structural hazards occur. The four top elements of the data and address stacks are implemented such that they can be accessed directly from the ALU. This reduces the data memory traffic and thereby enhances performance.
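The shared-memory scheme can be sketched behaviorally in software (the actual design is two RAM-based hardware stacks; the class name and memory size here are invented): one stack grows upward from the bottom of the array, the other downward from the top, so the boundary between them moves dynamically.

```python
# Behavioral sketch of two LIFO stacks sharing one physical memory,
# growing toward each other from opposite ends.

class DualStack:
    def __init__(self, size):
        self.mem = [None] * size
        self.top_a = -1      # stack A grows up from index 0
        self.top_b = size    # stack B grows down from index size-1

    def push_a(self, value):
        assert self.top_a + 1 < self.top_b, "memory full"
        self.top_a += 1
        self.mem[self.top_a] = value

    def push_b(self, value):
        assert self.top_b - 1 > self.top_a, "memory full"
        self.top_b -= 1
        self.mem[self.top_b] = value

    def pop_a(self):
        value = self.mem[self.top_a]
        self.top_a -= 1
        return value

    def pop_b(self):
        value = self.mem[self.top_b]
        self.top_b += 1
        return value

s = DualStack(8)
s.push_a(1); s.push_a(2); s.push_b(9)
print(s.pop_a(), s.pop_b())  # 2 9
```

The benefit mirrors the hardware case: neither stack needs a fixed maximum depth as long as their combined depth fits the shared memory.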

The ALU is designed to process 8-bit data and provides common logic, arithmetic, and shift functions. Depending on the instruction, the top element or the two top elements are consumed from the data stack. The result of the ALU operation can be stored into any of the top four data stack registers. To provide efficient block access operations, a second 16-bit ALU with reduced functionality is available for address processing.

(figure: the ALU is fed directly by the top four address-stack elements A1–A4 and data-stack elements D1–D4; the return-address stack connects alongside)

Figure 2.4: Architecture of the application-specific instruction set processor

2.3.5 Area efficiency

To show another benefit of our hardware/software co-design approach, the gate count of a pure hardware implementation is compared to that of the corresponding co-design solution.

The chosen hardware/software partitioning asks for the implementation of the overhead processing in software. The overhead is separated from the payload by hardware blocks and then processed by the embedded processor. In case of a pure hardware implementation, the processor is not needed, but additional hardware blocks must be implemented for the analysis of the overhead bytes and the subsequent filter processing. To determine the number of extra gates required by these hardware blocks, the VC-12 block was synthesized with and without the filter processing functionality [Int96]. The difference between the two versions was 14'092 gate equivalents (GE) (see Table 2.1).

System block              Required space [GE]
VC-12 (filter logic)             14'092
VC-3 (filter logic)         app. 14'000
VC-4 (filter logic)         app. 28'100
STM-1 (filter logic)        app. 14'000
VC-4/STM-1 (storage)             11'424
TOTAL                            81'616

Table 2.1: Area consumption of the overhead processing hardware needed in case of a pure hardware implementation of the add-drop multiplexer system.

The VC-12 block enables very efficient implementation of the different filters due to the many time-multiplexed data channels in this hierarchical level. In the ADM-1 system, 126 VC-12 data streams are processed (see Figure 2.2). All configurations, intermediate results, and variables need to be stored for each channel separately. As the different channels are time-multiplexed, they have a fixed sequence in the data stream. The overhead processing data is stored in a RAM in the same order as the channels appear in the data stream. Therefore, the RAM can be read in a regular, rotating way, which makes the access fast and very gate-efficient. The filter processing increases the VC-12 RAM size the same way the processor data RAM will be increased in case of the hardware/software co-design solution. Consequently, the difference in size originates only from the filter logic to be implemented. This yields the mentioned 14'092 GE.
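The rotating RAM access scheme can be sketched as follows. Only the pointer discipline mirrors the text; the per-channel "filter" update is a deliberately trivial placeholder, and all names are invented for illustration.

```python
# Sketch of rotating per-channel state access: because the 126 VC-12
# channels arrive in a fixed, repeating order, each incoming byte's state
# is fetched with a simple wrap-around counter instead of address logic.

NUM_CHANNELS = 126   # VC-12 streams in the ADM-1 system

class RotatingState:
    def __init__(self, n):
        self.state = [0] * n   # one state word per channel, in stream order
        self.ptr = 0           # rotating pointer, advances once per byte

    def step(self, byte):
        # read-modify-write the current channel's state, then rotate;
        # the real filter computation is replaced by a toy accumulation
        self.state[self.ptr] = (self.state[self.ptr] + byte) & 0xFF
        self.ptr = (self.ptr + 1) % len(self.state)

rs = RotatingState(NUM_CHANNELS)
for b in range(2 * NUM_CHANNELS):   # two full channel rotations
    rs.step(1)
assert all(s == 2 for s in rs.state) and rs.ptr == 0
```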

The VC-3 hardware block is organized the same way as the VC-12 block, although it handles only 6 channels. It also uses the convenient rotating RAM access principle. As the VC-12 and the VC-3 overheads are very similar, a comparable number of additional gates is required for the implementation of the filter processing. The filter processing itself has to be implemented individually for every hierarchical level because the different levels are asynchronous and cannot be synchronized. Only the different channels in one hierarchy can reuse the filter hardware, thanks to the time-multiplexed ordering.

The two VC-4 and the two STM-1 containers all ask for separate hardware implementation of their overhead processing since they originate from different optical inputs and do not follow any fixed timing relation. The VC-3 and VC-4 overheads are exactly the same, so again approximately 14'000 GE per VC-4 can be saved here. On the STM-1 level the fault processing has only about half the extent of the other levels.

For the VC-4 and the STM-1 blocks it is inefficient to store the configurations in RAMs, as these RAMs would be very small. Consequently, registers are used on these levels. At this point the multiplex direction of the data path (right side of Figure 2.2) has to be taken into account. The overhead has to be inserted into the data stream, and in our solution the required data is delivered by the embedded processor. Without the processor, however, this data would also require extra registers on the top two levels. Altogether, this consumes an additional 11'424 gate equivalents compared to the RAM storage in the processor's data RAM¹.

Part             Required space [GE]
ASIP                      15'326
Program RAM               26'880
Data RAM                   9'792
TOTAL                     51'998

Table 2.2: Gate equivalents needed for the ASIP implementation and for the program and data RAMs

In software, on the other hand, 630 instructions are required to implement four subroutines that perform the overhead processing on at least three of the four hierarchical levels. Additionally, 448 instructions are needed for the individual tasks on the four levels. Also, 1224 bytes were reserved for the storage of variables and intermediate results apart

¹One bit of RAM is counted as one gate equivalent, which is adequate for SRAMs.

from configurations. RAM space and the 15'326 gate equivalents of the processor itself add up to a total of 51'998 GE (Table 2.2). This is a 36% reduction compared to the hardware implementation.
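The totals of Tables 2.1 and 2.2 and the 36% figure can be re-derived from the numbers above, using the thesis's convention of one gate equivalent per RAM bit. The program RAM capacity of 1'680 16-bit words is inferred here from 26'880 GE / 16 bits and is not stated in the text.

```python
# Re-deriving the totals of Tables 2.1 and 2.2 (1 RAM bit = 1 GE).

hardware_only = {                       # Table 2.1
    "VC-12 filter": 14092, "VC-3 filter": 14000, "VC-4 filter": 28100,
    "STM-1 filter": 14000, "VC-4/STM-1 storage": 11424,
}
codesign = {                            # Table 2.2
    "ASIP core": 15326,
    "program RAM": 1680 * 16,           # inferred capacity: 1'680 16-bit words
    "data RAM": 1224 * 8,               # 1'224 bytes of variables = 9'792 GE
}
hw_total = sum(hardware_only.values())  # 81'616 GE
sw_total = sum(codesign.values())       # 51'998 GE
reduction = 1 - sw_total / hw_total     # about 0.36
print(hw_total, sw_total, round(100 * reduction))  # 81616 51998 36
```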

Figure 2.5: Microphotograph of the ADM-1 ASIC implementation

The reported partitioning implies that the major part of the system's functionality is fixed by the hardware blocks. Kind and number of inputs and outputs and the global architecture of the ASIC are determined by these hardware blocks. Software implementation of the overhead processing makes it possible to adapt the resulting ASIC to different specifications.

From this partitioning some advantages for the system design and the development of the hardware parts originate. The functionality of the hardware is reduced to fixed and well defined tasks. Different sections of the data path, as defined in the telecommunication standards [Int96], are implemented in separate blocks of hardware. These blocks also isolate the overhead from the data stream or insert it again where necessary. This way, reusable blocks are almost naturally defined. These Intellectual Property (IP) modules offer the demanded performance for the high-bandwidth data path and considerably increase the design efficiency in future projects. The interfaces between the IP modules are well defined, which is another important requirement for successful design reuse.

In addition, individual customers can be provided with the features they need for their special demands by downloading different software for each customer. During the system test phase special software can help to quickly verify the system or to replace built-in test circuitry.

The results of the system integration are illustrated in Figure 2.5 and Table 2.3. For economic reasons the ASIC was implemented in a sea-of-gates technology. In Figure 2.5 the different parts of the data path and the embedded processor are marked.

Technology            0.35 µm sea-of-gates
Metal layers          3
Core supply voltage   2.5 Volt
Pad supply voltage    3.3 Volt
Package               672-pin ball-grid array
Chip size             144 mm²

Table 2.3: Physical data of the ADM-1 ASIC implementation

2.4 Discussion and Outlook

Some conclusions can be drawn from this system design that will influence the way future systems are designed:

The practical approach taken in this project was to specify the system in a top-down manner, which means that first the global view of the system as a part of the telecommunication network was defined. Next, the architecture of the system was designed with special attention paid to the hardware/software issues and the advantages of a single-chip solution. Finally, the system was partitioned into blocks that were to be handled by individual designers.

System integration, on the other hand, was performed using a bottom-up approach. Each block was specified by the block designers, and then the different parts were assembled to build the complete system.

The combination of the top-down and the bottom-up approach is problematic because the interface between the two relies solely on human communication. As systems grow and the design teams grow with them, this approach becomes more and more error prone and time consuming. Also, the design data management becomes very challenging when ever-bigger design teams must work on one consistent database. Additionally, the system designers must have detailed knowledge of what the system-level trade-offs mean at the RT level.

These are the two major issues that lead to the increasing design efficiency gap:

1. Maintaining an overview of a large system, while still keeping the detailed implementation in mind, becomes impossible without automated support.

2. Designing the basic building blocks from scratch in every new project is far too time consuming.

An executable system description will be necessary for future systems to automate many decisions along the way from system design to silicon integration. To enable such an approach, a library of pre-designed IP modules must be available. This way, the interface between the top-down and the bottom-up approach can be automated and the time-consuming block design step can be skipped completely. A more detailed look at a possible system design flow will be provided in the next chapter.

Another important observation concerns the complexity of a system description that serves as the starting point for future system designs.

Describing the system's functionality, while neglecting all timing issues, seems to be quite simple. The number of system-level real-time constraints, on the other hand, is on the order of several tens of thousands for the telecommunication system described in this chapter. One has to keep in mind that this system is still relatively small compared to what we expect in forthcoming years.

Chapter 3

System design by reuse

Today the integrated circuit (IC) is the most important hardware component embedded in machines and business systems. This puts stringent requirements of complexity, time to market, and quality on the design that are becoming intractable for engineers who use conventional design methods and tools. As the technology continues to add capability almost without limit, solving the IC design problem in the "system-on-chip era" will require new system architectures, new tools and methods, and an infrastructure able to support the reuse of large parts of the design from chip to chip.

This chapter gives an overview of the historical advances in IC design methodology and details recent changes.

3.1 History of IC design

During the early days of IC design, graphical entry was used to lay out polygons of different materials on several layers (see Figure 3.1(a)). The combination of these polygons would then yield the IC properties. As the complexity started to increase, IC design was taken to a higher level of abstraction, the circuit level. Now circuit elements, like transistors (see Figure 3.1(b)), were connected and manually placed, still using graphical entry.

In the 1970s, metal-oxide-semiconductor (MOS) technology emerged as the most effective approach to achieve larger scales of integration. Very large scale integration (VLSI) became a reality, and the architectures and products that would leverage this technology were brought onto the market. The most important product introduced during this time was the microprocessor, which started its incredible success by coupling IC technology and software. The programmability of the processor was subsequently adopted in other chip designs. This combination, termed hardware/software co-design, is described in more detail in chapter 2. At the same time designers started to think at the "gate level" (Figure 3.1(c)).

Standard cell libraries emerged in the 1980s. Standard cells are logical blocks that have a regular structure, such as fixed height or power lines at fixed positions. The regularity of this type of structure led to tools that could automatically place and interconnect the standard cells in a design and, in turn, increase the design productivity. Because this type of design is most often optimized for random logic in specific applications, it is called an application-specific integrated circuit (ASIC).

By the end of the decade, designers again recognized that a more powerful technology was needed to capture the design intent. A new language called VHDL was created as a way to document designs and enable military projects to move design descriptions from manufacturer to manufacturer. Somewhat later the quite similar Verilog language was standardized after being used as a company-internal language by Cadence Design Systems. The use of these hardware description languages (HDL) generally increases design productivity by hiding some of the lower-level details. These languages allow the designer to conceptualize the design at a higher level of complexity, referred to as the register transfer level (RTL) (see Figure 3.1(d)). The use of these languages has been encouraged by the creation of standard cell libraries and the availability of powerful tools that can automatically synthesize detailed logic descriptions from the high-level HDL description.

With each step in this evolution the design efficiency increased dramatically and created an array of new applications for ICs. On the other hand, area efficiency, power efficiency, or performance decreased.

That means, with every step the IC drifted further away from the optimal implementation, but more functionality could be implemented in less time.

(a) Layout level: Geometrical assembly of polygons
(b) Circuit level: Assembly of individual transistors
(c) Logic level: Connect different logic gates like "AND" or "XOR"
(d) Register transfer level (RTL): HDL description of the desired functionality, for example:

    AddSub: process (A, B, Select)
    begin
        if Select = '1' then
            S <= A + B;
        else
            S <= A - B;
        end if;
    end process AddSub;

Figure 3.1: Four major improvements in IC design methodology

3.2 System design and IP reuse

This section focuses on the problem of coping with an exponentially growing number of transistors per chip. According to the International Technology Roadmap for Semiconductors [SIA99], Moore's Law will hold true for at least the next decade. Table 3.1 shows some characteristic data from the 1999 edition of the roadmap. It becomes immediately clear that the complexity of ICs will continue to explode in forthcoming years.

Year                      1999   2002   2005   2008   2011   2014
Technology [nm]            180    130    100     70     50     35
DRAM [Gbits/chip]         1.07   2.15   8.59   24.3   68.7    194
ASIC [Gtrans./chip]       0.16   0.43   1.06   2.62   6.49     16
Clock speed¹ [GHz]         0.5    0.7    0.9    1.2    1.5    1.8
Wiring levels                7      8      8      9      9     10
I/O pads² [10³]            0.7    1.3    1.9    2.3    2.7      3
Power dissipation³ [W]      90    130    160    170    174    183

Table 3.1: Technology forecast taken from the International Technology Roadmap for Semiconductors, 1999 Edition

When talking about system design, the metaphor of designing a "system on a chip" (SOC) is used throughout this thesis. Of course, a complete system includes such elements as packaging, electro-mechanical interfaces, and power sources, areas beyond the scope of this thesis. The metaphor, however, is useful to describe the circumstance that today's ICs are able to integrate the functionality of complex systems. Additionally, SOC describes the need to think about design from the perspective of a top-down architectural framework.

IC design at the system level concentrates on the design as a whole. Some areas of interest include overall system performance, cost, testability, assessment of competitiveness, and interaction of subsystems. System design provides a set of specifications, a design context for the subsystems, and a definition of the function of the system. This level is devoid of nearly all details of how a system is implemented. Instead, it focuses on the function of the system. Trade-offs can be studied quickly, but the limited amount of implementation detail makes it difficult to automatically convert this level into lower levels. Very few tools are available to support design at the system level, making the designer's experience and good heuristics essential for success.

¹Across-chip clock for high-performance ASICs.
²High-performance ASICs; the given number is for signal pads only, a comparable number of power supply pads must be added to get the total number of pads.
³High performance with heat-sink; for battery-driven chips power dissipation is about 70 times smaller.

Even for very large design projects, it is possible to put a large design team in place, use very disciplined project management, and get the job done. However, this approach does not realize the full potential of the technology. Smaller design teams can build successful SOCs if they can focus on the unique portions and reuse blocks that have been designed earlier. A few sub-circuits are already reused on a limited basis to implement RAMs, arithmetic units, or standard interfaces.

3.2.1 System design flow

Currently the major electronic design automation (EDA) vendors are offering tools that address part of the system design process [GS98], such as hardware-software co-verification. Major efforts are undertaken to improve the tool suite towards higher automation and to enable system-level synthesis. Automatic partitioning of functions between hardware and software, especially, is just in its infancy. Including reusable components in the system design process is also far from usual, because judging a component's exact functionality and performance is mostly based on the system designer's experience.

A unified approach to system-level design is not yet defined, although intense research is described in the literature [VvRBdM96] [GVNG94] [B+97] [BSV95]. The goal is to have an executable design specification as a starting point, do design space exploration and optimization based on constraints inherent to the specification, and finally co-synthesize the system with as little designer interaction as possible. Because covering the complete design space, even for just medium-size systems, exceeds current techniques by far, some papers [PWM97] [SRMB98] propose an approach based on high-level models to keep the exploration in close bounds. A prerequisite to this approach is that software and hardware models for data-flow oriented and reactive systems exist or can be generated from existing low-level descriptions.

Not all complexity issues arise solely from advances in semiconductor technology. IP issues associated with SOC products are injecting considerable complexity into the design process. SOCs generally contain IPs from various sources, which complicates the design by the fact that very few standards exist for integrating the same design in several monolithic devices. Recognizing the problems involved in creating

(figure: the system description is split into system behavior and system performance with real-time constraints; behavioral model simulation feeds the mapping to RTL IPs from the database; system simulation, a functional verification test-suite, and interface refinement lead into the implementation flow)

Figure 3.2: System design flow that incorporates predesigned IP modules

standards for IP exchange, the semiconductor and EDA industries have formed the Virtual Socket Interface Alliance (VSIA).

A coarse overview of a possible system design flow is given in Figure 3.2. Many details remain subject to intensive research and impose immense problems for automated tools. However, the overview is sufficient to derive constraints for the design of IP modules to be used in this flow. As already described in section 2.4, the capture of all timing constraints is very challenging, especially for systems with a significant amount of control functionality. This makes it necessary to split the system description into functionality (system behavior) and timing (system performance). The functional description is used to select the IPs to be integrated into the system. This leads to the first IP design constraint: An IP module must be accompanied by a behavioral model to enable an efficient selection process and system-level simulations.

Next, the system is assembled using the RTL IP models from the database, and a system simulation checks the imposed timing constraints. This will very likely be an iterative process between module selection and parameterization on the one hand, and functionality and performance on the other. Consequently: IP modules must be widely parametrizable to provide enough flexibility to fit into different systems.

Once the IPs are selected, the inter-IP interfaces must be finalized, and the system design process proceeds from logic synthesis to place, route, and eventually tape-out. IP interfaces must be independent of the IP core.

The complete design flow is accompanied by a functional verification test-suite that, starting from the behavioral level, iteratively checks the different steps in the design process. At least the first verification step, which checks the equivalence between the behavioral and the RTL model, will be simulation based, because formal methods are currently not capable of handling such diverse models. A parametrizable testbench must be provided that enables the described verification process.

3.2.2 IP module design

IP modules do not only consist of the reusable block itself but must be accompanied by documentation, an executable behavioral model, and a test bench for functional verification that includes stimuli and expected response vectors. IPs are known in three different forms: soft, firm, and hard modules. Hard IPs are reused as pre-designed, placed, and routed blocks. Soft IP modules, on the other hand, contain a synthesizable HDL description of the block. The intermediate form, firm IP, consists of a netlist representation of the design. This representation makes it very impractical for reuse, because it combines the unwanted properties of both other IP forms: it offers little parametrizability due to the pre-synthesized netlist format and still has to undergo the time-consuming place and route procedure.

Hard IP modules

Hard IPs are very popular today because reusing functional blocks as physically pre-designed macros comes closest to the idea of "plug-and-play" SOC design. Additionally, predecessors of hard IPs have been in use for quite some time in the form of memory blocks. More recent hard-IP macros implement processor cores, standard interfaces, or other functionally inflexible blocks. Thinking of a hard IP as a RAM in the classic design flow reveals the way hard IPs are used in today's system designs. The IP is fit into the floorplan as a fixed block. The surrounding blocks must be synthesized so as to obey the timing constraints imposed by the IP block. During the physical design steps, power consumption, pin placement, and the like have to be taken care of.

Another major advantage of hard cores is the fact that functional verification of the IP does not need to encumber the IP user. This greatly reduces the risk of using such a component. On the other hand, hard IPs are technology dependent. This means they are designed exclusively for just one technology, and migration to another technology is highly problematic. Also, the production test methodology is fixed before system integration of the component. This restricts the system designer in choosing a test strategy. More details on SOC test will be provided in section 3.2.4.

The flexibility offered by hard IPs is limited to a few quantitative customizations at most. Qualitative customizations that alter the functionality of a hard IP are very costly: all different functional units have to be implemented in the IP, and choosing one functionality leaves the unwanted parts lying idle on the chip. Also, the interfaces and the data transfer protocols are fixed, and considerable effort is necessary to adapt the other modules in the system to the hard IP.

Soft IP modules

Designing the functional internals of IP modules is more than just writing application-specific RTL code. Designing IP modules is "designing for the future". Designers need to be aware that the IP module will be used in several different environments, most of them unknown at design time. The IP module needs to be adaptable to changing system demands but should still offer the specified functionality and performance in any case. Soft IP modules offer the possibility to generously customize several features to adapt the module exactly to current system demands.

The two prime measures to offer high flexibility of IP modules are parameterization and changeable interfaces. Parameterization means that several properties can be changed by the IP user by means of parameters in the HDL code. However, writing parametrizable HDL code is challenging and time consuming. The effort to code a block in parametrized HDL is about twice that of writing application-specific code. Many seemingly simple problems, like vector ranges, maximum/minimum values, or comparison to a constant, become quite challenging for generously parametrized HDL code. The consequences for functional verification that result from writing parametrizable code will be covered in section 3.2.3.
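The "seemingly simple problems" boil down to deriving ranges and extreme values from the parameters so that they hold for any legal parameter value, not just the one the original design used. A small Python sketch of such derivations (the helper names are invented for illustration; in VHDL these would be generic-dependent expressions):

```python
# Bookkeeping a parametrizable block must get right for ANY width:
# vector ranges, extreme values, and address widths derived from parameters.

def msb(width):
    """Highest index of a vector declared (width-1 downto 0)."""
    return width - 1

def unsigned_max(width):
    """Largest unsigned value a 'width'-bit vector can hold."""
    return (1 << width) - 1

def signed_bounds(width):
    """(min, max) of a two's-complement 'width'-bit value."""
    return -(1 << (width - 1)), (1 << (width - 1)) - 1

def bits_for(values):
    """Smallest address width that can index 'values' items."""
    return max(1, (values - 1).bit_length())

for w in (1, 8, 13, 16):            # must hold for odd widths, too
    lo, hi = signed_bounds(w)
    assert lo < 0 <= hi and unsigned_max(w) == 2**w - 1

assert bits_for(126) == 7           # e.g. addressing 126 VC-12 channels
```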

IP design flow

Next, the constraints and requirements resulting from the system design flow described in the previous section will be applied to the design of IP modules. This results in the four steps shown in Figure 3.3.

The first step in an IP-based system design flow is obviously to select the blocks that shall be used in the system (see Figure 3.3(a)). A major decision here is the selection between "buy" and "make". Buying an IP from outside sources is the faster way but entails several uncertainties. The lack of standards intensifies this burden.

A simple and convenient way to check an IP's functionality is very important here. Executable behavioral models are very helpful for this step and provide a necessary complement to written documentation.

Apart from the functionality, many other properties of the module have to be checked before a decision can be made. The parametrizability must be such that the IP can be adapted to the system under design.

Numerical and functional parameters must be available for this purpose. Another important item to look at is the interfaces of the IP. To guarantee seamless data transfer between the different blocks of the system, every block must provide interfaces that fit the interfaces of its neighboring blocks.

To implement a successful design for test (DFT) strategy, all IPs in the system must fully support the strategy. Because of the size of SOCs, built-in self-test (BIST) is about the only way to enable an efficient production test. That means that only IPs that are equipped with BIST or can easily be incorporated into the desired test strategy can be used.

(a) Select modules: use behavioral models to evaluate functionality and select the needed IPs.
(b) Parameterize RTL models: adjust the IP modules to the desired parametrization.
(c) Functional verification: check the functionality of the parametrized IPs.
(d) System assembly: synthesize, place, and interconnect the IPs to compose the SOC, and implement design-for-testability circuitry.

Figure 3.3: Principal design steps to build a system on a chip from pre-designed blocks

The second system design step is the parametrization of the selected IPs. Parametrizability of IPs can be divided into three major classes:

1. Qualitative customization can be used to fine-tune the IP module's functionality. Examples are the instruction set of a processor or the pipeline depth.

2. Quantitative customization of parameters like word width or memory capacities. These are numerical settings that change the "size" of the IP.

3. Support of various system interfaces to exchange data between the IPs in a system.
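The three classes can be illustrated with a hypothetical soft FIFO IP. Everything in the sketch below (the `FifoConfig` fields, the `elaborate` helper, the interface names) is invented for illustration; it only shows how functional options, numeric sizes, and interface choice would appear side by side in one configuration record.

```python
# Hypothetical illustration of the three parametrization classes for a
# soft FIFO IP: qualitative options, quantitative sizes, interface choice.
from dataclasses import dataclass

@dataclass
class FifoConfig:
    word_width: int = 8           # quantitative: changes the "size" of the IP
    depth: int = 16               # quantitative
    with_parity: bool = False     # qualitative: alters the functionality
    interface: str = "handshake"  # interface class, e.g. handshake vs. strobe

def elaborate(cfg):
    """Stand-in for HDL elaboration: derive what the RTL generics
    would produce for this configuration."""
    assert cfg.interface in ("handshake", "strobe")
    data_bits = cfg.word_width + (1 if cfg.with_parity else 0)
    addr_bits = max(1, (cfg.depth - 1).bit_length())
    return {"ram_bits": data_bits * cfg.depth, "addr_bits": addr_bits}

small = elaborate(FifoConfig())
big = elaborate(FifoConfig(word_width=16, depth=128, with_parity=True))
assert small == {"ram_bits": 8 * 16, "addr_bits": 4}
assert big == {"ram_bits": 17 * 128, "addr_bits": 7}
```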

Functional verification of the parametrized modules is the third step in the design flow. This is a delicate step because it must be performed by the IP user. The goal is to verify the correctness of the IP using the customer-specific parameter set. Pre-verification of the module is impractical due to the large parameter space that would have to be covered.

Formal verification methods are not feasible for this step because they currently do not support comparison of behavioral and RTL models. However, some tools provide a generic language that can be used to describe the IP functionality and then check it against the parametrized RTL model. The major problem of simulation approaches is the availability of stimuli vector sets that can be used independently of the configuration.

Finally, the system must be assembled from the selected IPs. The EDA tools used for the following steps are already in place and only need incremental improvements, but no fundamental redesign. As mentioned before, a proper DFT strategy is essential for the final production test of the silicon chip.

3.2.3 Functional verification

Highly parameterizable soft IPs need to be functionally verified by the system designer, as stated above. Two different approaches are known: formal verification [Nor96] and simulation [SRT+99]. Formal methods are very popular for comparing two very similar design representations, like RTL code and a netlist, but are currently not capable of handling heterogeneous inputs like behavioral and RTL code. Simulation is the far more mature method today, but it bears several IP-specific problems.

The major problem of functional verification of intellectual properties by simulation is to find a way that enables the IP user to quickly verify the functionality of the IP using his own custom parameter set. A test bench that can be used to conduct the verification process for an arbitrary parametrization must be delivered along with the IP itself. That means that the test bench must be parametrizable in the same way as the IP to enable its use for every possible configuration of the IP. Another severe problem here is to find stimuli and expected responses that offer high functional coverage but are universally usable for all parameter sets.
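One way to make stimuli usable for all parameter sets is to generate them from the configuration instead of storing fixed vectors. The following Python sketch illustrates the idea only: `dut` and `reference_model` are invented stand-ins (here trivially equivalent bitwise inverters), not the thesis's actual models or tools.

```python
# Sketch of a parametrizable test bench: stimuli (corner cases plus
# random vectors) are derived from the width parameter, so the same
# test bench verifies any configuration of the IP.
import random

def reference_model(word, width):
    """Stand-in for the IP's golden behavioral model."""
    return (~word) & ((1 << width) - 1)

def dut(word, width):
    """Stand-in for the parametrized RTL under test
    (same function as the model, different structure)."""
    return word ^ ((1 << width) - 1)

def run_testbench(width, n_vectors=100, seed=0):
    rng = random.Random(seed)                      # reproducible stimuli
    corners = [0, 1, (1 << width) - 1]             # parameter-aware corners
    stimuli = corners + [rng.randrange(1 << width) for _ in range(n_vectors)]
    return all(dut(s, width) == reference_model(s, width) for s in stimuli)

assert all(run_testbench(w) for w in (4, 8, 16, 24))
```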

(figure: several scan chains inserted into the core; an LFSR feeds the scan chains with pseudo-random patterns, a multiple-input shift register compacts the responses into a signature, and a BIST controller driven by clock and reset/hold signals reports completion via a done flag)

Figure 3.4: Logic BIST architecture

3.2.4 Test

Testing of large SOCs becomes more and more impractical due to the increasing transistor-to-pin ratio. While the transistor count increases according to Moore's law, the number of ASIC pins increases at a far lower rate (see Table 3.1). This means that less access to the chip's internals is available for production testing. Consequently, more test vectors are needed to achieve the desired fault coverage, which makes the test very time consuming and calls for very fast and expensive automatic test equipment (ATE).

A widely used strategy to tackle this problem is built-in self-test (BIST). The key concept is to equip every core inside a SOC with additional circuitry that automatically tests it, and so to eliminate or at least reduce the system test effort [Zor98].

Figure 3.4 shows a widely used set-up for this test approach [Wol98] [RT99]. Scan circuitry is inserted into the device under test (DUT), much as for off-line testing. Additionally, a linear-feedback shift register (LFSR) is integrated to supply random test patterns. Another LFSR with parallel inputs (multiple-input shift register) is used to collect the responses from the scan chains. A signature with low aliasing probability is generated and can be checked after test completion. A test controller is required to control reset, start, and stop of the BIST process.
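The pattern-generation and response-compaction path of Figure 3.4 can be sketched in Python. The feedback polynomial, word width, and the toy `dut` below are arbitrary placeholders, not values from the thesis; the sketch only shows the mechanism of an LFSR feeding patterns and a MISR folding the responses into one signature.

```python
# Toy logic-BIST loop: a Fibonacci LFSR supplies pseudo-random patterns,
# a MISR compacts the DUT responses, and the final signature is compared
# against a golden value recorded from a known-good run.

WIDTH, TAPS = 8, (7, 5, 4, 3)       # hypothetical feedback polynomial

def lfsr(state, taps=TAPS, width=WIDTH):
    """One step of a Fibonacci LFSR: feedback is the XOR of the tap bits."""
    fb = 0
    for t in taps:
        fb ^= (state >> t) & 1
    return ((state << 1) | fb) & ((1 << width) - 1)

def misr(signature, response, taps=TAPS, width=WIDTH):
    """Multiple-input shift register: fold a parallel response word
    into the running signature."""
    return lfsr(signature, taps, width) ^ response

def dut(pattern):
    """Placeholder for the scan-tested core's response function."""
    return (pattern * 3 + 1) & 0xFF

def run_bist(n=100, fault_at=None):
    state, sig = 1, 0
    for i in range(n):
        state = lfsr(state)          # next pseudo-random pattern
        resp = dut(state)
        if i == fault_at:
            resp ^= 1                # inject a single-bit fault
        sig = misr(sig, resp)
    return sig

golden = run_bist()                  # recorded once from a good device
assert run_bist(fault_at=42) != golden   # the fault shows in the signature
```

Because the MISR update is linear and invertible (the top tap is part of the feedback), a single injected error can never cancel itself out, which is why the faulty signature is guaranteed to differ here.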

In [PDH99] the authors show that deterministic patterns are required after BIST execution in order to reach acceptable test coverage. Several publications propose the use of embedded processors for the application of the additional deterministic vectors [PMN99] [PR99]. The central system processor of a hardware/software system can be used to apply vectors to the different IPs in the system.

Chapter 4

The case for programmable intellectual property modules

From the previous chapters it is clear that various problems associated with the use of hardware/software co-design have been investigated for some time already and solutions are maturing. The use of IPs in SOC design, on the other hand, is just in its infancy. Many open problems still remain. Some of these problems are stated in the next section; solutions will be proposed in sections 4.2 and 4.3 of this chapter. In chapter 5 two processor architectures are compared to evaluate their suitability to serve as the processor embedded within programmable intellectual property modules. Chapter 6 introduces the functional verification flow necessary for programmable intellectual property modules. An implementation of a PIP module, described in chapter 7, proves the feasibility of the new design concept. Finally, chapter 8 discusses the advantages and disadvantages of the new paradigm.


4.1 Present Situation

When starting a new system design the designer must make decisions on how to translate the given functionality into a hardware architecture. Basically the result will lie between two extremes, a pure software implementation on a programmable general-purpose processor or a custom-logic solution. Chapter 2 describes the trade-offs that have to be made to decide between the two or a mixture. This decision includes not only the partitioning between hardware and software but also the system division into separate functional blocks.

For many systems the partitioning between hardware and software and the partitioning into different functional blocks interfere. Partitioning the system into functional units yields blocks that can be implemented as IPs. Hardware/software partitioning, on the other hand, may be such that it is advantageous to implement one part of a block's functionality in hardware and another part as software running on the system processor (see Figure 4.1(a)). To reuse this block in a future design, the system designer needs both the hardware and the software part of the IP. This restricts reuse of the block, because a system processor is always mandatory. No "stand-alone" use of the IP is possible.

Large systems with a large system processor do not fit well into the proposed system design flow (see Figure 3.2). With this set-up nearly all real-time constraints have to be checked on system level because most of them apply to the software running on the embedded processor. Additionally, scheduling and long interconnects for the bus connecting hardware blocks and system processor have to be taken into account.

For a solution that elegantly solves these problems see section 4.2.

Hard-IP modules are very popular today because they come closest to the idea of "plug-and-play" system design. On the other hand, the flexibility offered by hard IPs is limited to few quantitative customizations. Qualitative customizations that alter the functionality of a hard IP are very costly. All different functional units have to be implemented in the IP. Choosing one functionality leaves the unwanted parts lying idle on the chip (see Figure 4.3(a)). Also the interfaces and the data transfer protocol being used are fixed, and severe effort is necessary to adapt the other modules in the system to the hard IP. A possibility to design highly flexible hard-IP modules is described in section 4.3.1.

(a) Classic situation with restricted reusability due to conflicting functional and hardware/software partitioning
(b) System design using programmable intellectual property modules for increased reusability

Figure 4.1: Using hardware/software co-design in combination with reuse of intellectual property modules in system-on-a-chip designs

Interfaces between different IPs are intensively discussed. To seamlessly assemble several IPs into a SOC, data exchange between the blocks is a crucial topic. As the modules may run at different frequencies or use different interface protocols, it is quite difficult to find a common interface type. In many cases, glue logic must be added by the system designer to interface incompatible IPs.

A possible solution for this problem is to translate the IP proprietary interface protocol to a more widely used standard. IPs could be equipped with interface modules, each of which is optimized for a certain interface protocol (see Figure 4.4(a)). The interfaces can then be selected according to the current system requirements. To enable communication between several different modules inside SOCs, a whole range of interface types and protocols must be available for the different IPs. With this set-up, inter-module communication problems still occur if no compatible interface protocol is available for the different IPs.

The Virtual Socket Interface Alliance (VSIA) currently works on interface standards to address this problem [All]. To interface IPs running at different frequencies a globally asynchronous locally synchronous (GALS) approach is proposed in [MVK+99] [MVF00]. A new solution is proposed in section 4.3.2.

Testing of large SOCs becomes more and more impractical due to the increasing transistor to pin ratio. Also the test is very time consuming and requires very fast and expensive automatic test equipment (ATE). A widely used strategy to tackle this problem is built-in self-test (BIST). The key concept is to equip every core inside a SOC with BIST circuitry to eliminate or at least reduce the system test effort [Zor98]. A BIST implementation yielding high fault coverage can be found in section 4.3.3.

Highly parameterizable soft IPs have to be functionally verified by the system designer. This is necessary for two reasons. Firstly, the correct implementation of the module using the chosen parameter set, which is not known at IP design time, must be proved. Secondly, pre-verification for all possible parameter sets is impossible due to the large parameter space. Two different approaches are discussed: formal verification [Nor96] and simulation [SRT+99]. Simulation is the far more mature method today but it still bears several problems.

Functional verification of intellectual properties by simulation is very problematic as the IP designer has to find a way that enables the IP user to quickly verify the functionality of the IP using his own custom parameter set. The principal problem here is to find stimuli and expected responses that offer high functional coverage but are universally

usable. The functional verification flow proposed here is described in chapter 6.

Using a new paradigm called Programmable Intellectual Property (PIP) modules can help to tackle these problems. The remainder of this chapter will introduce the new concept and present the advantages associated with it.

(a) Classic: Small, highly specialized blocks that offer little reusability
(b) Current: Flexible system design due to hardware/software co-design with improved reusability
(c) Coming: Highly flexible programmable IPs (PIP) with maximum reusability, additionally offering several system design benefits

Figure 4.2: Evolution of ASIC design, from custom specific blocks to highly reusable programmable intellectual property modules

4.2 Concept

The fundamental innovation of our concept is to integrate a processor into each major IP module which enables hardware/software co-design inside the IP. This approach gives access to new solutions for the most significant problems of SOC design by IP reuse.

The use of an embedded processor opens the door to all the benefits that come with hardware/software co-design. The IP user gets the opportunity to implement customer-specific features the IP designer did not even know about when designing the IP. This enables very widespread use because it is easy to add new functionality to the IP.

The hardware overhead is kept extremely low because storage and performance requirements can be adjusted to meet the current demands.

Time-to-market can be reduced by a system/software co-design approach. System feature changes are possible late in the design cycle, even after having selected the IP components or having finished system design.

Figure 4.2 shows the evolution of ASIC design in recent years. It starts from highly specialized hardware blocks that offer limited reusability but high performance and small area and power consumption (Figure 4.2(a)). As the ASIC functionality grew more and more, designers started to utilize hardware/software co-design (Figure 4.2(b)). Functionally flexible systems containing hardware blocks with increased reusability resulted.

The benefits can even be increased by using programmable intellectual property modules (Figure 4.2(c)). The IPs used in hardware/software co-design ASICs are either pure hardware or pure program-controlled. Using the PIP paradigm enables the IP designer to utilize the advantages of intermediate solutions. Additionally, our concept of programmable intellectual property modules helps to solve many problems of SOC design. This is mainly due to the flexibility that is added to the IPs by using the embedded processor. Together the new possibilities will make system design by IP reuse a lot simpler.

4.3 System design advantages

As mentioned earlier, tools for automatic system design are not available at the moment. That means the step from idea and specification to implementation is based on experience and heuristics and has to be done manually by the system designer. This makes the step of selecting IP modules for the system under design uncertain. Because PIP modules are extremely flexible due to their software content, problems in the selection step can be corrected later in the system design cycle.

Automatic hardware/software co-design on system level is a much too complex task for today's tools. The divide-and-conquer approach of PIP modules significantly reduces the search space. This makes the step better suited for automatic tools and success becomes much more likely.

Relocating the HW/SW interface into the IP (see Figure 4.1(b)) simplifies many aspects of the system design flow shown in Figure 3.2. Hardware/software partitioning is not feasible for automatic tools at system level because of the very large search space. At IP level the partitioning task is within much tighter bounds and can very well be done automatically, or at least with little designer interaction. The top level timing constraints can very elegantly be checked at IP level. This is much easier and more precise than at system level because task scheduling and processor utilization are simpler to predict than for the far more complex system level.

Using a single system processor that is connected to several hardware blocks on system level yields severe routing problems. As technology scaling continues and the wire delay becomes more and more dominant, long interconnects become highly undesirable in large ASIC designs. PIP modules move much of this interconnect back to the local IP level, which offers much better technology scaling. The drawback of this is the need to place many small RAM blocks spread all over the chip area.

During the module selection step an exact fit of system specification on IP functionality is no longer required. Due to the software content of the IP, features can be adjusted to what is needed in the system under design. The system designer can combine the benefits of hardware/software co-design and IP reuse.

(a) Hard-IP module with all different functional blocks implemented in hardware and a multiplexer to select the currently desired functionality
(b) PIP module that can be functionally customized by downloading new software

Figure 4.3: Functional customization of hard-IP modules

In many cases it is not only the size of a system that makes system design so complicated but it is the heterogeneity. Even very large systems that consist mainly of a high performance data path, like an encryption system or a video compression system, are quite easy to handle. The problems arise when a large data path has a significant amount of control logic attached to it. In this case joint synthesis of data path and control poses severe problems that are hard to solve for large systems. The strict separation between data path and control simplifies the design of large systems significantly. This principle is also exploited in PIP modules by dividing the block into data path hardware and control software running on the embedded processor.

4.3.1 Flexible hard IPs

Hard IPs designed according to our concept become much more flexible. While the functionality can normally not be altered, the programmable device inside the IP opens the opportunity to download new software versions and thus to change the features. This enables functional customization by software download. Highly predictable but still functionally flexible modules result. The functionality can even be changed after implementation of the module into the SOC.

Additionally, the functional customization of PIP modules does not require extra hardware as opposed to the classic solution in Figure 4.3(a). So the area efficiency is much better than before.

4.3.2 Adaptable interfaces

(a) IP module with different interface modules that can be changed at synthesis time
(b) PIP module that offers flexible software driven interface protocols

Figure 4.4: System interfaces of IP modules

To seamlessly interface IP modules either many different interface types must be available for every IP or a common interface protocol must be used. Adaptable interfaces are a much more universal solution which, however, is hard to achieve in hardware. By handling the interface traffic in software, many different protocols can be implemented using the same hardware. If additionally the interface hardware is parametrizable, virtually any interface type is possible.
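As an illustration of this idea, the following sketch keeps the "hardware" (a fixed n-bit port with a FIFO) unchanged and realizes two different transfer protocols purely in software. Both toy protocols and all names are invented for this example:

```python
# The fixed, parametrizable interface hardware is modelled as a simple
# n-bit port with a FIFO; the protocol (framing, checksum) is pure
# software, so switching protocols is a software download, not a
# hardware change. Both toy protocols are invented for illustration.

class InterfacePort:
    def __init__(self, width=8):
        self.width = width
        self.fifo = []

    def put(self, word):
        assert 0 <= word < (1 << self.width)
        self.fifo.append(word)

    def get(self):
        return self.fifo.pop(0)

def send_raw(port, payload):
    """Protocol A: raw words, no framing."""
    for w in payload:
        port.put(w)

def send_framed(port, payload):
    """Protocol B: length prefix plus XOR checksum."""
    port.put(len(payload))
    chk = 0
    for w in payload:
        port.put(w)
        chk ^= w
    port.put(chk)

def recv_framed(port):
    n = port.get()
    data = [port.get() for _ in range(n)]
    chk = 0
    for w in data:
        chk ^= w
    assert port.get() == chk, "frame corrupted"
    return data

port = InterfacePort(width=8)          # same hardware for both protocols
send_framed(port, [0x12, 0x34, 0x56])
print(recv_framed(port))               # [18, 52, 86]
```

Changing from protocol A to protocol B touches only the software routines, never the port itself, which is the essence of the adaptable-interface argument above.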

(a) SOC with several IPs and a system processor that can be used to conduct the BIST of different modules

(b) PIP module that enables a local BIST conducted by the embedded processor

Figure 4.5: Software driven BIST of IP modules

Obviously, handling the interface traffic in software offers far less performance than a pure hardware implementation. A trade-off between flexibility and performance must be found. In case of a processor bus, for example, the available data rate is the key property and must not be compromised by a software protocol. In other cases bandwidth requirements are not so stringent and flexibility is important for widespread use.

A good way to think of this feature is a device driver for computer peripherals. If you attach a new device to a standard port you must install new protocol software to access the new hardware. The same procedure works for PIP modules as well. If the PIP module is attached to another block in the system, the appropriate interface protocol software enables data transfer between the two.

4.3.3 Block test

Several publications propose the use of embedded processors for the application of BIST vectors [PMN99] [PR99]. Figure 4.6 shows the architecture that is necessary to utilize the embedded processor for BIST purposes. The BIST controller can choose between primary inputs, LFSR, or the embedded processor to feed the scan chains. The MISR can be used for both the pseudo-random and the deterministic part of the test.
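The division of labour between the pseudo-random phase and the processor-applied deterministic top-up can be illustrated with a toy fault model; the fault behaviour and all numbers below are invented purely for illustration:

```python
# Toy illustration of the two-phase test: pseudo-random patterns detect
# the "easy" faults cheaply; a short deterministic list, applied by the
# embedded processor, catches the random-pattern-resistant remainder.
# The fault model and all numbers are invented for illustration.
import random

random.seed(0)
easy_faults = set(range(90))    # detectable by many different patterns
hard_faults = {90, 91, 92}      # each requires one specific pattern

def detects(pattern, fault):
    if fault in easy_faults:
        return pattern % 7 == fault % 7       # many patterns hit it
    return pattern == 1000 + fault            # exact pattern required

all_faults = easy_faults | hard_faults
detected = set()

for _ in range(500):                          # phase 1: LFSR patterns
    p = random.randrange(1 << 16)
    detected |= {f for f in all_faults if detects(p, f)}
after_random = len(detected)

for p in [1090, 1091, 1092]:                  # phase 2: deterministic top-up
    detected |= {f for f in all_faults if detects(p, f)}

print(after_random, "faults detected after the random phase")
print(len(detected), "of", len(all_faults), "after the deterministic top-up")
```

The random phase needs no vector storage at all, while the deterministic list stays short because it only has to cover the few remaining faults.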

[Figure: scan chains whose pattern source is selectable between primary inputs, LFSR, and embedded processor; BIST controller with Reset/Hold, Done, and Clock signals]

Figure 4.6: Logic BIST architecture from Figure 3.4 with the additional possibility to apply deterministic patterns from the embedded processor

Up to now, the central system processor has been used to apply vectors to the different IPs in the system. This separates the BIST from the IP and again necessitates the implementation of a system processor to use the BIST (see Figure 4.5(a)). Our concept makes it possible to utilize the processor embedded within the IP for the IP test, which closely integrates IP and BIST (see Figure 4.5(b)). Additionally, the system test time can be reduced by running several processor-driven BISTs simultaneously.

4.4 Evaluation

Given the different possibilities to implement a given functionality, a general-purpose processor clearly offers the best hardware reusability. However, the software that implements the actual functionality must also be reused or written by the system designer, and performance issues must be carefully evaluated. A pure hardware implementation offers good performance but only limited flexibility.

Both implementation types suffer badly from the fact that customer-specific features are difficult to implement. The PIP approach offers very good reusability and combines performance and flexibility (see Figure 4.7). If an IP module consists of a high performance data path and a significant control part, PIP modules present the best implementation choice.

The major benefits of the PIP paradigm are the excellent fit into the system design flow (see Figure 3.2) and the additional system design advantages detailed in the previous sections. System design by IP reuse becomes much more manageable and flexible when using PIP modules.

[Figure: performance versus flexibility plane with HW/SW partitioning between the extremes]

Figure 4.7: Trade-off between performance and flexibility in the implementation of a given algorithm

Chapter 5

The embedded processor

The most important building block of PIP modules is obviously the embedded processor. The processor not only implements a major part of the functionality, it is also involved in test and interfacing. It is important to have the same processor module in all PIP modules for hardware and software design efficiency reasons and to enable the reuse of compiler and assembler. All these reasons make the choice of the embedded processor crucial for the success of PIP modules.

This chapter gives a detailed evaluation of the processor architecture. The basic requirements for the processor embedded within PIP modules will be stated in section 5.1. The following two sections describe the implementation of two different processor concepts. In section 5.2 the stack architecture processor, which is used in the PIP implementation described in chapter 7, is introduced. Section 5.3 presents an alternative processor concept that is not based on the stack architecture, but is register based. Both processors have been realized as highly parametrizable IP modules. Finally, section 5.4 compares both processor concepts with respect to the requirements from section 5.1.


5.1 Requirements

This section lists the requirements for the PIP embedded processor. As the functionality of PIP modules can vary in a very wide range, there are many trade-offs that must be evaluated for every individual system. The goal of this section is to formulate requirements that enable the PIP designer to use the same processor IP in nearly every new PIP module, based on a hardware data path and control functions implemented in software. That means that the processor does not have to execute substantial data processing functionality.

1. Compiler support: For efficiency reasons it is indispensable to have a high-level language compiler available for the processor. Additional tools like a debugger, for example, are nice-to-have add-ons. The processor architecture must be suited for a compiler to generate highly efficient assembler code. The code must be efficient in terms of execution speed and code size. Both translate directly into silicon area because the code must be stored in an on-chip RAM and the processor must be fast enough to meet the imposed timing constraints.

2. Interrupt capability: Scheduling different data processing tasks is most flexible when the task execution is initiated by an interrupt launch. In this case no fixed timing relation between the tasks must exist and processor utilization can be divided according to the tasks' priorities. Therefore the embedded processor should be able to handle interrupt launches very efficiently. The interrupt load, however, will not be large provided IP modules are relatively small.

3. Parametrizability: To enable the use of the processor in many different IP modules it must be highly parametrizable. All three customization categories, qualitative customization, quantitative customization, and interfaces, must be available. Especially the instruction set and the instruction coding play an important role. While the instruction set mainly influences coding efficiency, the coding has significant impact on the size of the instruction RAM.

4. Area efficiency: As we saw in the previous chapter a major drawback of PIP modules is their large area. This area penalty is due to the fact that many small processors with individual RAMs are much less efficient than one large system processor. The area efficiency of the complete processor package, that means the core with all RAMs and data path interfaces, is an important factor for the choice of the processor architecture.

5.2 The stack processor

This section describes the stack architecture processor that was used in the PIP module implementation (see chapter 7). The first implementation of this processor was the system processor of the hardware/software co-design project described in section 2.3. The IP module implementation of the processor has been taped out as part of a previous IP design project.

5.2.1 Architecture

The most significant advantage of the stack architecture is the capability to do a context switch without having to save the current context to the data RAM. Stack processors simply move the current context down the stack with every new item being pushed on the stack. After interrupt completion the stack contains the context as before, because the items from the interrupting routine have been removed and the old context is back on top of stack.

The major drawback of the classic stack machine [Jr.89] is that it allows access only to the top stack element. This makes programming very inefficient and yields quite voluminous code. To cure this problem the enhanced stack architecture implemented here uses the solution shown in Figure 5.1. Read and write multiplexers allow random access to a configurable number (four in Figure 5.1) of registers on top of stack.

The stack is handled in the classic way which means that data items are moved down and up the stack by push and pop operations, respectively.
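A small behavioural model (an illustrative sketch, not the VHDL implementation) shows both properties: random access to the top k elements, and the interrupt behaviour described above, where an interrupt routine that pops everything it pushed leaves the old context on top of stack without any save to data RAM:

```python
# Behavioural model of the enhanced stack: push/pop move the whole
# stack, but the top k elements are also randomly readable and
# writable like registers. All names are invented for illustration.

class EnhancedStack:
    def __init__(self, k=4):
        self.k = k          # number of directly accessible registers
        self.items = []     # index 0 = top of stack

    def push(self, v):
        self.items.insert(0, v)

    def pop(self):
        return self.items.pop(0)

    def read(self, reg):
        assert 0 <= reg < self.k
        return self.items[reg]

    def write(self, reg, v):
        assert 0 <= reg < self.k
        self.items[reg] = v

s = EnhancedStack(k=4)
for v in [10, 20, 30, 40]:       # main-routine context
    s.push(v)
print(s.read(0), s.read(3))      # 40 10: random access on top of stack

s.push(99); s.push(98)           # interrupt routine pushes its own data
s.pop(); s.pop()                 # ...and pops it again on completion
print(s.read(0))                 # 40: old context back on top, no RAM save
```

For small k the model degenerates to the classic stack machine (only `read(0)` is available), which mirrors the parameter trade-off discussed later for κ.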

The enhanced stack architecture is a mixture between a stack and a register-based architecture. The programmer has direct access to a number of registers like in a register-based architecture. On the other

[Figure: data ALU with read and write multiplexers connected to the top four data stack registers D1 to D4]

Figure 5.1: Data ALU and data stack of the stack processor, offering direct read and write access to four registers on top of stack

hand, all registers are treated like stack registers and the register contents move up and down the stack.

Figure 5.2 shows the complete processor architecture. Additional features of the enhanced stack architecture processor are:

• Separate data and instruction memories (Harvard architecture) enable the access to both RAMs in the same processor cycle and so significantly increase the processor performance.

• A four-stage pipeline. In the first stage instructions are read from program memory (Instruction Fetch). After decoding the instruction (Instruction Decode) in stage two, it is executed and data registers are modified (Execute) in the third stage. In stage four data memory access (Write Back) is done.

[Figure: processor core with interrupt handling, program memory, data memory, and memory-mapped IO interfaces; return address, condition code, and data stacks with their RAM interfaces; parameters shown as Greek letters]

Figure 5.2: Enhanced stack architecture processor IP module with customizable parameters shown as Greek letters

• A simple branch prediction and delayed branch execution. This enables the execution of nearly one instruction per cycle. The branch prediction is based on the assumption that backward jumps are always taken while forward jumps are never taken.

• An additional address ALU with reduced functionality. This ALU enables very efficient block transfer operations. The address stack can also be configured to allow direct access to more than one register.

• Return addresses and condition code storage on separate stacks.

Having all data on stacks makes interrupt and subroutine calls very efficient. A single processor cycle of latency is needed after an interrupt launch to push the current program counter (PC) on the return address stack.
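The static prediction rule listed above, backward jumps taken and forward jumps not taken, can be sketched in a few lines; the addresses are invented examples:

```python
# Static branch prediction as described above: a branch whose target
# lies at or before the branch address (a loop) is predicted taken,
# a forward branch is predicted not taken. Addresses are invented.

def predict_taken(pc, target):
    return target <= pc    # backward (loop) -> taken, forward -> not taken

print(predict_taken(40, 20))   # True: backward loop branch
print(predict_taken(40, 60))   # False: forward skip

# For a loop that iterates ten times, the rule is right on every
# iteration except the final fall-through:
outcomes = [True] * 9 + [False]
correct = sum(predict_taken(40, 20) == taken for taken in outcomes)
print(correct, "of", len(outcomes))   # 9 of 10
```

Loops dominate typical control code, which is why this static rule, combined with delayed branch execution, approaches one instruction per cycle without any prediction hardware beyond an address comparison.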

The next three sections list the parametrization of the stack proces¬ sor according to the classes introduced in section 3.2.2.

5.2.2 Qualitative adaptation of the processor core

The most significant adaptations to the core's functionality can be made by altering the number of supported instructions μ and the instruction width ν. From a set of 66 instructions the IP user can assemble the subset required for the current application. The way this is done is described in Code-Sequence 5.1. The hardware associated with the unwanted instructions will be removed during logic synthesis.

constant INSTR_SET : std_logic_vector (MAX downto 0) := (
    0   => '1',   -- Instruction 1
    1   => '0',   -- Instruction 2
    2   => '0',   -- Instruction 3
    ...
    MAX => '1'    -- Instruction MAX+1
);

Code-Sequence 5.1: Parametrization of the instruction set: '1' means instruction will be synthesized; '0' means instruction is not used

The instruction width has to be adapted to hold this subset and to offer enough space to access the κ data and ε address registers. Wider instructions also allow larger operands in immediate instructions.

When the instruction width is changed, the instruction coding must also be altered. For this implementation the coding was done to ensure the densest possible opcode. Therefore, the instructions were not divided into fields of fixed length, but the available space for opcode, register addresses and immediate values varies from instruction to instruction.
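The width calculation behind this can be sketched as follows; the instruction names, opcode lengths, and field sizes are invented examples, not the actual coding of this processor:

```python
# Illustrative check that a chosen instruction width nu can hold each
# instruction of the selected subset. With variable-length coding every
# instruction has its own opcode length plus instruction-specific
# operand fields (register indices, immediates). All numbers invented.
from math import ceil, log2

def reg_bits(n_regs):
    """Bits needed to address one of n_regs directly accessible registers."""
    return ceil(log2(n_regs)) if n_regs > 1 else 0

def fits(nu, opcode_bits, operand_bits):
    return opcode_bits + operand_bits <= nu

kappa, epsilon = 4, 2   # directly accessible data / address registers

instructions = {        # name: (opcode bits, operand bits)
    "ADD":  (4, 2 * reg_bits(kappa)),    # short opcode, two register indices
    "LDI":  (4, reg_bits(kappa) + 8),    # register index + 8-bit immediate
    "LDA":  (6, reg_bits(epsilon)),      # one address-register index
    "TRAP": (10, 3),                     # rare instruction, longer opcode
}

nu = 16
for name, (opc, opr) in instructions.items():
    print(name, fits(nu, opc, opr))
```

Giving frequent instructions short opcodes and rare ones longer opcodes is what makes the dense, variable-field coding possible within a fixed word width ν.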

5.2.3 Numerical parametrization of the processor core

Three data stack parameters can be varied:

1. Stack width σ: This value determines the width of the processor data path. For SDH broadband telecommunication applications an 8 bit wide data path is ideal, because overhead data are byte-oriented. For other applications 16 or 32 bit wide data paths may offer better performance.

2. Number of directly accessible registers κ: This parameter indicates the number of registers on top of stack that can be randomly accessed. Directly accessible registers require extra selection logic, but give the compiler more flexibility in generating assembler code. For high values of κ the processor behaves more like a register-based architecture, for small values the behavior is close to the classic stack machine.

3. Stack depth λ: Determines the total number of stack elements (2^λ + κ). Applications with many interrupt levels require a large stack because all data remain on stack when a higher priority interrupt is launched. Also block transfer operations can be very efficiently programmed if a large data stack is available. All data can first be pushed on the stack and then stored to the final location, always reading the same stack position.

The carry and zero flags are saved on the condition code stack after an interrupt launch. The condition code stack depth θ equals the number of interrupt priority levels in the host system. The usual width ω of the condition code stack is 2 bit. If all instructions using the zero flag are eliminated from the customized instruction set, the stack width can be reduced to 1 bit.

The address stack can be customized in an analogous way:

1. Stack width δ: The address stack is used to store data memory addresses. Therefore the width of the address stack registers directly determines the data memory address range τ.

2. Number of directly accessible registers ε: Just like the data stack, the address stack can have more than one directly accessible register. This greatly simplifies block access using arrays or pointers.

3. Stack depth π: Again the stack depth depends on the number of interrupt levels in the current application and the number of registers used by the different routines.

[Figure: adaptable processor core inside the IP module, surrounded by changeable interfaces for the return address stack, address stack, program memory, condition code stack, data stack, and data memory; each interface can attach registers, asynchronous RAMs, or synchronous RAMs with or without registers]

Figure 5.3: Changeable interfaces that separate the processor core from the system environment

The return address stack stores the program counter after a routine is suspended due to an interrupt or a subroutine call. Therefore the program memory address width γ and the stack width α are the same. The return address stack depth β depends on the number of supported interrupt priorities and subroutine calls, and so has to be adapted to the application at hand.

An interrupt launch is always accompanied by the start address of the new routine. The width ξ of this address can be equal to γ if the interrupt routines are spread over the whole program memory, or smaller than γ if the interrupt routines are confined to a certain range. From a set of eight different traps, ψ can be selected.

Figure 5.4: Chip layout of the implemented stack processor intellectual property module

5.2.4 System interfaces

For the address stack, the data stack, the condition code stack, and the return address stack four different interface types are available (see Figure 5.3). They can be used to expand the respective stacks with:

1. Registers: If a memory contains only a few bits, registers use less silicon area than a RAM. This option will most likely be used for small stacks like the condition code stack, whereas registers are not well suited for large memories like the instruction memory.

2. Asynchronous RAMs

3. Fast synchronous RAMs: This interface type is designed to attach a RAM with a short access time. Data from the RAM is directly fed into the IP core logic.

4. Slow synchronous RAMs: For RAMs with a long access time, the interface contains additional flip-flops between the RAM and the core logic to enable short cycle times.

Memory access is critical for processor performance. Therefore, only asynchronous and fast synchronous RAM interfaces are available for the data and the program memory. We assume that the fastest available RAMs are used for data and program memory.
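The difference between the fast and the slow synchronous RAM interface can be illustrated with a small cycle-level model. The sketch below is purely illustrative (class and signal names are invented, not taken from the IP module): the slow interface's extra flip-flop stage shortens the critical path at the price of one additional cycle of read latency.

```python
# Hypothetical cycle-level sketch of the two synchronous RAM interface
# styles: the "fast" interface feeds RAM data straight into the core
# logic, the "slow" interface inserts an extra flip-flop stage.

class FastSyncInterface:
    """RAM output goes directly to the core: data visible after one clock."""
    def __init__(self, ram):
        self.ram = ram
        self.out = None
    def clock(self, addr):
        self.out = self.ram[addr]      # combinational path RAM -> core

class SlowSyncInterface:
    """Extra register between RAM and core: one more cycle of latency,
    but the RAM access no longer sits in the core's critical path."""
    def __init__(self, ram):
        self.ram = ram
        self.stage = None              # the additional flip-flop
        self.out = None
    def clock(self, addr):
        self.out = self.stage          # core sees last cycle's RAM data
        self.stage = self.ram[addr]

ram = {0: 0xAA, 1: 0xBB}
fast, slow = FastSyncInterface(ram), SlowSyncInterface(ram)
fast.clock(0); slow.clock(0)           # cycle 1
fast_first = fast.out                  # 0xAA already visible
slow.clock(1)                          # cycle 2: registered data arrives
slow_first = slow.out                  # 0xAA, one cycle later
```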

5.2.5 Implementation

For a test integration (see Fig. 5.4 and Table 5.2) we implemented the IP-Module using a 0.6 µm 3-metal-layer CMOS process. The settings of the parameters shown as Greek letters in Fig. 5.2 are given in Table 5.1.

For the test integration the program and data memory interfaces are realized to interface with asynchronous off-chip RAMs. The return address stack and the condition code stack are implemented as registers. Finally, the data and address stack RAMs are configured as fast

Param.  Description                                          Value

α       Return Address Stack width                           16 bit
β       Return Address Stack depth                           9
δ       Address Stack width                                  16 bit
ε       Number of directly accessible Address Stack regs.    2
π       Address Stack RAM addr. space 2^π                    5 bit
ω       Condition Code Stack width                           2 bit
θ       Condition Code Stack depth                           8
σ       Data Stack width                                     8 bit
κ       Number of directly accessible Data Stack registers   4
λ       Data Stack RAM address space 2^λ                     6 bit
ι       Number of supported instructions                     66
γ       Program Memory addr. space 2^γ                       16 bit
τ       Data Memory address space 2^τ                        16 bit
ψ       Number of available exceptions                       8
ξ       Interrupt Address range                              16 bit
ν       Instruction width                                    16 bit

Table 5.1: Parameter settings for the test integration

synchronous on-chip RAMs. We used flip-flop based "RAMs" from the Synopsys DesignWare library¹ in the test integration.

5.3 The register-based processor

Many IP-related properties of the register-based processor implementation are similar to the properties of the stack processor. In this section only the differences, like the basic architecture and the instruction coding, will be explained in detail, while the similarities are just shortly mentioned. Both architecture and instruction coding of the processor are similar to the DLX architecture described by Hennessy and Patterson in [HP96].

¹Those elements are rather storage arrays than RAMs, as they are not custom-designed macro blocks but composed of standard cells from the library.

Process                   0.6 µm CMOS
Metal layers              3
Supply voltage            5 V
Signal pads               90
Supply pads               30
Chip size incl. pads      4.5 × 4.1 mm²
Chip size excl. pads      2.9 × 3.3 mm²
Max. operating frequency  111 MHz

Table 5.2: Key values of the test integration

5.3.1 Architecture

The register-based architecture of the processor is shown in Figure 5.5. All general-purpose registers can be read and written by the data ALU.

The register contents will not be moved as in the stack processor. Generally, the register contents must be saved to the data RAM after an interrupt launch, which consumes several clock cycles that can not be used for "useful" operations. Therefore, the general purpose registers of the processor are implemented as top-of-stack registers. After an interrupt launch, precious processing time can be saved by just pushing the current register contents down one stack level. This way, the pipeline is never flushed and no no-operation (NOP) cycles are necessary after an interrupt launch.

The stack handling is completely hardware-controlled and can not be influenced by the programmer. After an interrupt launch the contents of all registers are pushed down one level, which immediately frees up the top level registers for the next interrupt routine. This behavior is completely hidden from the programmer and only invoked by an interrupt.
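As an illustration of this hardware-controlled stack behavior, the following Python sketch models a register file whose levels are pushed down on an interrupt and popped on return. All names and the zero-fill policy are our assumptions, not taken from the actual RTL.

```python
# Illustrative model (not the actual RTL) of general-purpose registers
# implemented as top-of-stack registers: an interrupt pushes all
# register contents down one stack level, a return pops them back up.

class StackedRegisterFile:
    def __init__(self, num_regs, depth):
        # level 0 is the only level visible to the programmer
        self.stack = [[0] * num_regs for _ in range(depth)]

    def read(self, idx):
        return self.stack[0][idx]

    def write(self, idx, value):
        self.stack[0][idx] = value

    def interrupt(self):                  # hardware-controlled push
        self.stack.insert(0, [0] * len(self.stack[0]))
        self.stack.pop()                  # bottom level lost on overflow

    def ret(self):                        # return from interrupt routine
        self.stack.pop(0)                 # discard the routine's registers
        self.stack.append([0] * len(self.stack[0]))

regs = StackedRegisterFile(num_regs=4, depth=3)
regs.write(0, 42)            # main routine uses R0
regs.interrupt()             # ISR gets fresh registers immediately
regs.write(0, 7)             # ISR's R0 does not disturb main's R0
isr_value = regs.read(0)
regs.ret()
main_value = regs.read(0)    # main resumes with R0 intact
```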

The maximum interrupt latency achieved by this architecture is two clock cycles. This occurs if a conditional branch instruction is in the instruction fetch pipeline stage. To save the correct program counter the branch must proceed to the execute stage, where the decision whether the branch is taken or not is made. Now the new interrupt that had to wait for the correct PC to be determined can be honored. After completion of an interrupt routine the second stack level is pushed up again and the present top level gets overwritten, as the routine that used

[Figure 5.5 omitted: data path of the register-based processor with registers D1 to D4 and the stack levels beneath them.]

Figure 5.5: Architecture of the register-based processor with the stacks hidden from the programmer behind the general purpose registers

the current top level is terminated. No "housekeeping" operations like in the stack architecture are necessary.

To efficiently code the instruction set, it is divided into three instruction types (see Table 5.3):

1. R-Type instructions: These instructions have three operands, two sources and one target. Arithmetic and logic operations fall into this class. For this class the opcode is extended by the function field to enable more efficient instruction coding.

2. I-Type instructions: For this instruction type one source operand

Type  Instruction Fields                                              Description
R     Opcode | Func. | Target Reg. | Source Reg. 1 | Source Reg. 2    Arith./Logic
I     Opcode | Target Reg. | Source Reg. | Immediate value            Immediate
J     Opcode | Target address                                        Branch/Jump

Table 5.3: The three basic instruction classes of the register-based processor

of the R-type instruction is substituted by an immediate value.

3. J-Type instructions: These instructions are used for jumps and branches. They contain only the opcode and the target address or offset.

The length of the instructions is determined by the I-Type instructions. The instruction length ν for a given parameterization can be determined by:

ν = 5 + 2 · ceil(log2 λ) + σ    (5.1)

Where λ is the number of data registers and σ is the width of the data registers (see Figure 5.6).
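Equation 5.1 can be checked against the test-integration parameters of Table 5.4 (12 data registers, 16 bit data width, 29 bit instructions). The helper below is a direct transcription of the formula; the function name is ours.

```python
from math import ceil, log2

def instruction_width(num_data_regs, data_reg_width):
    """Instruction length according to Eq. (5.1): 5 opcode bits plus
    two register address fields of ceil(log2(lambda)) bits each plus
    an immediate field of sigma (data register width) bits."""
    return 5 + 2 * ceil(log2(num_data_regs)) + data_reg_width

# Test-integration parameters from Table 5.4: lambda = 12, sigma = 16.
width = instruction_width(12, 16)   # -> 29 bit, as listed in Table 5.4
```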

The other features (see Figure 5.6) of the register-based processor are quite similar to the stack processor, which makes the two very easy to compare:

• Separate data and instruction memory (Harvard architecture).

• Read-after-write hazards are resolved in hardware. Being able to access the same register in consecutive order greatly simplifies code generation.

• A four-stage pipeline very similar to the stack architecture processor. In the first stage instructions are read from the program memory (Instruction Fetch). After decoding the instruction in stage two (Instruction Decode), it is executed in the third stage (Execute). In stage four, registers are updated and data memory access takes place (Write Back).

[Figure 5.6 omitted: block diagram of the register-based IP module with its parameters.]

Figure 5.6: Register-based IP module with customizable parameters shown as Greek letters

• An additional address ALU with reduced functionality for efficient block access operations. The address register banks have been implemented as top-of-stack registers.

• Return addresses and condition code storage on separate stacks.

• Parametrizable RISC-like instruction set of up to 40 instructions.

5.3.2 Qualitative adaptation of the processor core

Application-specific optimization of the core's functionality can be performed by selecting the number of supported instructions ι required for the current application from a total set of 40 instructions. The hardware associated with the unwanted instructions will be implicitly discarded during logic synthesis. As an example, if all arithmetic data memory address instructions are disabled and only immediate memory access is retained, the address ALU will be completely removed. Otherwise, with full parameterization, the address ALU will be inferred (shaded areas I or II in Figure 5.6).

5.3.3 Numerical parameterization of the processor core

From the perspective of an IP user, the most significant adaptations influencing the final chip area have to be done by choosing the right number of general purpose registers as well as the required stack depths.

The following parameters can be varied:

1. Data width σ of the processor data path and of all data registers.

2. Data memory address width δ defining the maximum addressable memory size 2^δ. The value for the address width δ can range up to twice the data width (δ ≤ 2σ).

3. Instruction memory address width α defining the maximum accessible instruction memory size 2^α.

4. Number of data registers λ: Randomly accessible top of data stack registers. There is no upper bound for this parameter, but excessive size will result in high area occupation.

5. Number of address registers π: Randomly accessible top of address stack registers.

6. Data and address stack depth κ: This stack depth defines the maximum number of simultaneously launched interrupt routines.

7. Return address stack depth β: The return address stack depth β depends on the number of supported interrupt priorities and subroutine calls and has therefore to be adapted to the application at hand. The total amount of return address stack registers is calculated as α · β.
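The sizing rule in item 7 (one register of instruction-memory-address width per stack level) can be checked numerically. The symbol mapping below (α = 11 bit address width, β = 20 levels) is our reading of the OCR-damaged test-integration values in Table 5.4.

```python
# Storage count for the return address stack: alpha * beta flip-flops,
# where alpha is the instruction memory address width and beta the
# return address stack depth. Values are assumed from Table 5.4.

def return_stack_bits(alpha, beta):
    return alpha * beta

bits = return_stack_bits(alpha=11, beta=20)   # 220 storage bits
```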

The expression

κ · (λ · σ + π · δ)    (5.2)

estimates the total number of data and address register bits. As the stack depth κ equals the number of supported interrupt levels, this register area grows linearly with the interrupt capability.

An interrupt launch is always accompanied by the start address of the corresponding interrupt service routine. The width of this address is equal to the instruction memory address width α. Therefore, the interrupt routines can be spread over the whole program memory. The carry flag is saved on the condition code stack during an interrupt. The condition code stack's depth equals the number of interrupt priority levels, its width is 1 bit.

In addition, the processor supports trap instructions with different priorities ε to allow for communication between processor and interrupt controller. When the processor executes a trap instruction, the corresponding priority will be passed to the external interrupt controller. This instruction can also be viewed as a software interrupt from the processor to the interrupt controller. Finally, there are four different runtime exceptions: Data and Address Stack Under- & Overflow and Return Address Stack Under- & Overflow.

5.3.4 System interfaces

The register-based processor IP-module is separated into core and system interfaces to permit easy adaptation to the target environment. The interface modules are designed according to the principle already discussed for the stack architecture processor.

Figure 5.7: Micro photograph of the implemented register-based processor chip

For the test integration the data RAM was an on-chip SRAM that allowed only very long cycle times. Therefore, an additional interface was developed that decouples the core and the data RAM clock frequencies.

A FIFO buffer of parametrizable depth is used to write data to the RAM. It allows burst writes from the processor core to the data memory without processor stalls. FIFO depth, memory read latency, and write latency are individually parametrizable. The data read access has

Param.  Description                         Value

σ       Data width                          16 bit
δ       Data memory address width           10 bit
α       Instruction memory address width    11 bit
λ       Number of data registers            12
π       Number of address registers         4
κ       Data / address stack depth          2
β       Return address stack depth          20
ε       Number of available traps           8
ι       Number of supported instructions    40
ν       Instruction width                   29 bit

Table 5.4: Parameter settings for the test integration of the register-based processor

priority over a write access. That means that the write process is stalled if a read occurs, even when the FIFO is not empty yet. Additionally, the FIFO is searched for the requested data.
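The described write FIFO with read priority and FIFO search can be sketched as follows. Class and method names are invented, and the model ignores the parametrizable read and write latencies.

```python
from collections import deque

# Sketch of the decoupling write FIFO: writes are buffered, a read has
# priority (stalling the drain of the FIFO), and the FIFO is searched
# so that reading a not-yet-retired address returns the buffered value.

class WriteFifo:
    def __init__(self, ram, depth):
        self.ram = ram
        self.fifo = deque(maxlen=depth)

    def write(self, addr, data):       # burst writes without core stalls
        self.fifo.append((addr, data))

    def drain_one(self):               # RAM side: retire one buffered write
        if self.fifo:
            addr, data = self.fifo.popleft()
            self.ram[addr] = data

    def read(self, addr):              # read priority: search FIFO first
        for a, d in reversed(self.fifo):   # newest pending write wins
            if a == addr:
                return d
        return self.ram.get(addr)

ram = {}
buf = WriteFifo(ram, depth=4)
buf.write(0x10, 99)                    # still sitting in the FIFO
forwarded = buf.read(0x10)             # 99, found by searching the FIFO
buf.drain_one()                        # write retires to the RAM
from_ram = buf.read(0x10)              # 99, now read from the RAM itself
```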

5.3.5 Implementation

The test implementation was done using the same technology as for the stack processor. For the data RAM an on-chip SRAM was used. As this RAM is much slower than the core logic, the decoupling interface was used to access the RAM. The instruction memory is off-chip and assumed to be asynchronous. All stacks are implemented using flip-flops as the storage elements.

Figure 5.7 shows the resulting ASIC. As expressed by equation 5.2, the area consumed by the data registers is quite large. In this configuration almost half the chip area is occupied by the data registers.

5.4 Comparison

Both PIP processor architectures try to enhance a well-known processor concept. The stack processor adds several directly accessible registers to cure the drawback of having only the top register accessible. This yields a mixture between stack and register-based processors. The second implementation goes the opposite direction: it starts as a regular register-based processor. This time, stacks are added for every directly accessible register to increase the interrupt capabilities by again combining stack and direct access.

5.4.1 Compiler support

One of the tasks of a compiler is to map variables to register locations. This register allocation is optimized by the compiler to result in as few RAM accesses as possible. For a stack architecture this is much harder than for a register-based architecture [HP96]. In a stack architecture the register content is not statically mapped to a register location. The contents get moved up and down the stack with almost every instruction. This means that the compiler must keep track of the complete stack to know where a variable from the program is currently located. Also, the compiler must make sure that everything that was pushed on the stack is removed again before the routine ends. This clean-up is done automatically, that is, controlled by hardware, in the register-based processor.

On the other hand, the stack architecture is very well suited for subroutine calls and excellently supports deeply nested loops. That means that short routines that make frequent use of subroutines favor the stack architecture. Large routines that utilize a lot of variables and only few procedures or functions will preferably use the register-based architecture. Additionally, register-based machines are far more common than stack processors today. That also means that programmers and compilers for register-based machines are in better supply than their stack counterparts.

In Table 5.9 the results of the comparison are summarized. The compiler support is much broader for the register-based (+) than for the stack (0) architecture.

5.4.2 Interrupt capabilities

The interrupt capability of the two concepts is quite balanced. Neither needs idle cycles or NOPs to perform a context switch. An interrupt will be delayed for a maximum of two cycles if a branch instruction is in the instruction fetch stage. The different rating in Table 5.9 is due to the fact that many interrupt levels lead to very high area consumption in the register-based processor, as indicated by equation 5.2.

5.4.3 Parametrization

Both processor IP modules have been parametrized in nearly every possible detail. No differences in parametrizability were found between the two architectures. The instruction coding used for the register-based processor makes instruction set changes fairly simple. This principle, however, is no generic property of the register-based architecture, but could also be used for the stack processor. In Table 5.9 both architectures get the same rating for parametrizability.

To prove that the high parametrizability really works and to test the area efficiency for different parameter sets, several test synthesis runs were made. The parameter sets used are shown in Tables 5.6 and 5.7. To get a fair comparison between the two architectures, the stack depths (see Table 5.6) were chosen to yield similar resources for both processors. The numerical settings (see Table 5.7) were varied in four steps between a small controller with a four bit wide data path and a full-blown 32 bit processor.

5.4.4 Area efficiency

Compared to a multiple-stack version, a single-stack implementation makes it much simpler to implement the lower part of the data stack as a RAM. For the 16x16 and the 32x32 implementation this possibility was used. The synthesis constraints were the same for all different synthesis runs. All processors synthesized were optimized to run at 12 ns cycle time and with the same setup, hold, load, and drive constraints at the in- and outputs. The results are given in Table 5.8 and in Figure 5.8.

Process                   0.6 µm CMOS
Metal layers              3
Supply voltage            5 V
Signal pads               85
Supply pads               35
Chip size incl. pads      4.6 × 4.6 mm²
Chip size excl. pads      3.7 × 3.7 mm²
Max. operating frequency  118 MHz

Table 5.5: Key values of the test integration

Parameter                        Value

Stack processor
  Data stack depth               #IRQLevel × #directDRegs
  Cond. code stack depth         #IRQLevel
  Address stack depth            #IRQLevel × #directARegs
  Return addr. stack depth       #IRQLevel × 2

Register-based processor
  Data register stack levels     #IRQLevel
  Cond. code stack levels        #IRQLevel
  Address register stack levels  #IRQLevel
  Return addr. stack levels      #IRQLevel × 2

Table 5.6: General parameter set of the synthesis runs for the stack processor and the register-based IP module respectively

For the 4x4 and 8x8 parameter sets the register-based architecture has slight advantages over the stack machine. This changes for the 16x16 and 32x32 parameter sets, when the stack machine uses a RAM to implement part of the data stack. Implementing part of the data stacks of the register-based architecture in a RAM would make sense for even bigger configurations or if more interrupt levels are required. A drawback of the RAM solution used for the stack architecture is that another quite small RAM block must be used, which increases the

²For the "16x16" configuration DStackDepth is 16 and a 48x16 bit RAM (0.26 mm²) was used. For the "32x32" configuration DStackDepth is 32 and a 128x32 bit RAM (0.67 mm²) was used.

Name   Address Register    Data Register      IRQLevel  PC width
       number³  width      number  width

4x4    1        8 bit      4       4 bit      2         8 bit
8x8    2        10 bit     8       8 bit      3         10 bit
16x16  4        12 bit     16      16 bit     4         14 bit
32x32  4        16 bit     32      32 bit     5         18 bit

Table 5.7: Numerical parameter set of the synthesis runs for the stack processor and the register-based IP module respectively

Name   combinational   sequential    total

4x4    0.72 / 0.48     0.37 / 0.38   1.09 / 0.86
8x8    1.25 / 0.92     0.76 / 0.82   2.01 / 1.74
16x16  2.53 / 2.65     1.62 / 2.19   4.15 / 4.84
32x32  7.10 / 8.14     3.99 / 9.17   11.09 / 17.32

Table 5.8: Area consumption (in mm²) of the two processor architectures - stack processor (left) and register-based (right)

problems during physical design.

In addition to the four different parameter sets described in Table 5.7, two different instruction sets were used for test synthesis runs. In the reduced instruction set for the RISC processor the signed arithmetic operations and the address ALU were discarded. Instruction variations were removed from the stack processor's instruction set. The exact assembly of the instruction sets is shown in Tables A.4 and A.5 for the stack architecture processor and in Table B.2 for the register-based processor. Both processors showed average area savings of 7.5% for the reduced instruction sets.

Table 5.9 shows the ratings for the area efficiency of the two architectures for the small and large configurations. It is obvious from Figure 5.8 that the large sequential area made up by the stack registers makes the register-based processor much less efficient than the stack processor using a data stack RAM. On the other hand, decreasing costs for chip area and sig-

³For the stack processor "number" indicates how many stack registers are directly accessible; for the register-based processor it is the number of general purpose registers.

Property              Rating
                      Stack    Register

Compiler support      0        +
Interrupt capability  +        0
Parametrizability     +        +
Area efficiency       0 / +    0 / −

Table 5.9: Comparison between the two processor concepts. Three different ratings are used (+, 0, −). The area is compared for minimal (left) and extensive (right) configurations.

nificantly increasing system complexity reduce the significance of this property.

As Table 5.9 indicates, the two concepts are quite equally suited for the PIP embedded processor. The widespread use of register-based processors and the better supply of programmers and compilers may promote the decision for the register-based concept.

[Figure 5.8 omitted: bar chart of combinational and sequential area (in mm²) for the 4x4, 8x8, 16x16, and 32x32 configurations.]

Figure 5.8: Various synthesis runs with different parameter sets - stack architecture (left) and register-based architecture (right)

Chapter 6

Functional verification of PIP modules

Functional verification of highly parametrizable IP-modules bears sev¬ eral problems. As described before, the IP-user can choose a param¬ eter set that exactly matches the current system requirements. After customizing the IP-Module, the functional correctness of the custom configuration must be verified.

The principal goal of functional IP verification is to prove the equiv¬ alence of specification and implementation. Application- or customer- specific scalability, however, introduces severe difficulties into the func¬ tional verification flow. The problems are even more severe for PIP modules as part of the functionality is implemented in software. De¬ signing a test-bench that scales according to the IP solves only part of the problem. The more severe problem is the generation of test vectors that work for every possible IP configuration. To enable the functional verification of the customized IP-module using one single test bench for all parameter sets, the following two step functional verification flow is proposed.

The basic idea of this flow is to generate the expected responses for the selected configuration of the IP from configuration-independent stimuli. In the first step the configuration-independent behavioral model is used to generate the test pattern (see left side of Figure 6.1).

The behavioral model is split into processor and data path, just like the RTL model. This enables the generation of test vectors even for features implemented as custom software using a single test bench. As the behavioral model is timing- and configuration-independent, the correct sequence of the data base entries captures the "pure" functionality, neglecting all timing information.

The second step (see right side of Figure 6.1) is the actual verification of the RT-level IP model. In this step the outputs of the RTL model are compared against the generated expected responses from the first step. The stimuli vectors used for the expected response generation in the first step are also the starting point for the second verification step.

Latency-independence is an important property of test benches for IP modules because distinct IP configurations may exhibit different latency. This feature becomes even more important for programmable modules, as interrupt scheduling and software runtime should not influence the verification results. The verification flow proposed in this chapter is completely latency- and configuration-independent.
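A minimal illustration of latency-independent checking: only the order and the values of the responses at each interface are compared, never the cycles in which they appear. The traces below are made-up examples, not data from the thesis.

```python
# Two IP configurations with different latency verify against the same
# expected-response database once the timing is stripped from the traces.

def strip_timing(trace):
    """trace: list of (cycle, value) tuples captured at one interface."""
    return [value for _cycle, value in trace]

expected = [0xA1, 0xB2, 0xC3]                   # from the behavioral model
fast_cfg = [(3, 0xA1), (4, 0xB2), (5, 0xC3)]    # low-latency configuration
slow_cfg = [(7, 0xA1), (9, 0xB2), (12, 0xC3)]   # high-latency configuration

ok = (strip_timing(fast_cfg) == expected and
      strip_timing(slow_cfg) == expected)
```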

6.1 Behavioral model

The behavioral model serves two purposes: it is used for the component selection step in system design (see Figure 3.2) and it serves as the "golden model" during functional verification. The language in which the behavioral model is written is of minor importance. In this work VHDL was used for compatibility reasons and to enable the use of the same simulator as for the RTL model.

The backbone of the behavioral model is a record that stores the current state of the data path during the simulation process. Every item that is stored in a flip-flop or in a RAM in the RTL model has an entry in this record. The declaration of the record is given in pseudo-VHDL in Code-Sequence 6.1.

To capture type and number of the data path the first two entries, "PathConf" and "PathNum", are used. Next the state of the frame

[Figure 6.1 omitted: STM-1/STS3 frames pass through bytes-to-triples and parallel-to-serial conversion; assembler source code passes through an assembler, a "converter" to generic format, and a binary code loader. Both feed the behavioral model with its embedded processor on one side and the RTL model with its embedded processor on the other.]

Figure 6.1: Two step functional verification by simulation

alignment must be stored. The "FramingState" entry can take one of five possible states, where "IF0...IF4" means that the current state is "In Frame" and the number gives the amount of current mis-detections of the framing word. If the alignment word is incorrect for the fifth time, the state changes to "Out Of Frame".
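The frame-alignment behavior described above can be sketched as a small state machine. The framing-word value below is only a placeholder, not taken from the thesis.

```python
# Toy version of the frame-alignment state machine: states IF0..IF4
# count consecutive mis-detections of the framing word; the fifth
# incorrect alignment word moves the model to OOF (Out Of Frame).

FRAMING_WORD = 0xF628          # illustrative placeholder value

class FrameAligner:
    def __init__(self):
        self.misses = 0        # IF0..IF4 encode this counter
        self.state = "IF0"

    def check(self, word):
        if self.state == "OOF":
            return self.state
        if word == FRAMING_WORD:
            self.misses = 0    # framing word found, counter cleared
        else:
            self.misses += 1
            if self.misses == 5:
                self.state = "OOF"
                return self.state
        self.state = f"IF{self.misses}"
        return self.state

aligner = FrameAligner()
for _ in range(4):
    aligner.check(0x0000)              # four mis-detections -> IF4
state_before = aligner.state
state_after = aligner.check(0x0000)    # fifth miss -> OOF
```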

Two different bit interleaved parity (BIP) check sums have to be calculated. The "Bip8" uses the complete frame, while the "Bip24"

type datapathStateType is record
    PathConf      : natural;
    PathNum       : natural;
    FramingState  : type D {IF0, ..., IF4, OOF};
    Bip8          : record type {currentBIP, stableBIP : natural};
    Bip24         : record type {currentBIP, stableBIP : natural};
    ActualOH      : array [8 downto 0] of byte;
    Position      : record type {LineNum, ColNum : natural};
    TripleNum     : natural;
    PointerState  : pointerStateType;
end record;

type pointerStateType is record
    Pointer       : array [2 downto 0] of natural;
    ExtrPointer   : array [2 downto 0] of natural;
    PointerStatus : array [2 downto 0] of {inc, dec, hold};
    Valid         : array [2 downto 0] of boolean;
    Start         : array [2 downto 0] of boolean;
    UseH3         : array [2 downto 0] of boolean;
    H3IsStart     : array [2 downto 0] of boolean;
end record;

Code-Sequence 6.1: Records used to capture the state of the data paths and the pointers, respectively

skips the first three overhead lines. For both check sums a copy of the completed calculation of the previous frame and a copy that is currently worked on for the current frame are part of the record. As the overhead is transferred to the embedded processor for each line, nine bytes must be available for this purpose.

The position of the current data item inside the frame is captured by "Line-" and "Column-Number". As the pointer uses triples to mark the beginning of the payload, the "Triple-Number" is also stored to simplify the pointer calculation. The state of the pointer state machine is captured in a separate record.

In this record each entry is an array to hold the values of all three potentially different pointers. The "Pointer" is the value that is currently used to determine the starting point of the payload. "ExternalPointer" holds the value that was extracted from the "H1" and "H2" bytes of the current frame. The indication whether a pointer operation is possible or not is determined in software and stored in the "PointerStatus".

The rest of the record consists of boolean identifiers that indicate whether a byte is a "valid" payload byte or the frame "start", or whether the bytes following the "H3" bytes should be used and identify the frame start.

The data path model itself is assembled from several procedures, each of which performs the desired operation and updates the current data path state. The data is input into the data path as triples, because this is the most convenient form for processing. The hardware/software partitioning is the same as in the RTL model. This enables the implementation of customer-specific features without changes to the behavioral model.

For the embedded processor a similar approach is used. The behavioral processor model consists of a separate procedure for every instruction. The procedures execute the desired functionality and update the processor state. This state is also captured in a record, as shown in Code-Sequence 6.2. The modular set-up enables easy maintainability and expandability by adding new procedures that implement new instructions.

The record holds the complete contents of all four stacks: the data stack (dStack), the address stack (aStack), the condition code stack (cStack), and the return address stack (rStack). The respective array holds the complete stack contents independent of the implementation as RAM or registers. A stack access operation can be to add data to the stack (push), to drop data from the stack (pop), or to leave the stack unchanged (idle). Additionally, a counter that determines the number of items of the respective stacks is used to issue over- or underflow exceptions. The rest of the record holds the data memory, which is treated very much the same way as the stacks, the program counter, and trap level and trigger.
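The stack entries of this record (contents array, push/pop/idle operation, item counter for over- and underflow exceptions) translate into a few lines of Python; the class name and exception types are our choice.

```python
# Minimal analogue of one stack entry in the processor-state record:
# an operation tag (push / pop / idle) plus an item counter that is
# used to raise over- and underflow exceptions.

class ModelStack:
    def __init__(self, depth):
        self.data, self.depth, self.count = [], depth, 0

    def step(self, operation, value=None):
        if operation == "push":
            if self.count == self.depth:
                raise OverflowError("stack overflow exception")
            self.data.append(value)
            self.count += 1
        elif operation == "pop":
            if self.count == 0:
                raise IndexError("stack underflow exception")
            self.count -= 1
            return self.data.pop()
        # "idle": stack unchanged

s = ModelStack(depth=2)
s.step("push", 5)
top = s.step("pop")            # returns 5
```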

The processor model works on an instruction-by-instruction basis. For each new instruction one big case statement selects the correspond-

type processorStateType is record
    dStack           : array of natural;
    dStackOperation  : type D {push, pop, idle};
    dStackCounter    : natural;
    aStack           : array of natural;
    aStackOperation  : type D {push, pop, idle};
    aStackCounter    : natural;
    cStack           : array of boolean;
    cStackOperation  : type D {push, pop, idle};
    cStackCounter    : natural;
    rStack           : array of natural;
    rStackOperation  : type D {push, pop, idle};
    rStackCounter    : natural;
    dMemoryData      : natural;
    dMemoryAddress   : natural;
    dMemoryOperation : type D {read, write, idle};
    pC               : natural;
    trapTrigger      : boolean;
    trapLevel        : natural;
end record;

Code-Sequence 6.2: Record used to capture the current state of the embedded processor

ing procedure, then the processor state is updated and finally all necessary interface accesses are performed.
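A toy version of this dispatch loop, with two invented opcodes, shows the structure of the behavioral processor model: one table lookup (the "case statement") per instruction, followed by a state update.

```python
# Sketch of the instruction-by-instruction behavioral model: a dict
# maps each opcode to a procedure that updates the processor state.
# Opcodes and procedures are invented for illustration only.

def op_push(state, operand):
    state["dStack"].append(operand)

def op_add(state, _operand):
    b, a = state["dStack"].pop(), state["dStack"].pop()
    state["dStack"].append(a + b)

DISPATCH = {"PUSH": op_push, "ADD": op_add}

def run(program):
    state = {"dStack": [], "pC": 0}
    while state["pC"] < len(program):
        opcode, operand = program[state["pC"]]
        DISPATCH[opcode](state, operand)   # the "case statement"
        state["pC"] += 1                   # update processor state
    return state

final = run([("PUSH", 2), ("PUSH", 3), ("ADD", None)])
```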

6.2 Expected responses generation

The generation of the expected responses uses assembler source code and STM-1/STS3 frames as stimuli vectors (left side of Fig. 6.1). A behavioral model of the module is used to generate the expected responses for the verification of the RTL model. The model is split into a data path part and the embedded processor. Both parts are modeled separately. This division is necessary because part of the functionality is implemented in software and may be changed before verification.

Therefore, a single unsplit model can not capture the complete functionality in advance. This approach enables the verification of the complete programmable module including the software content using just one reusable test bench.

[Figure 6.2 omitted: the test bench feeds frames and assembler code; a token circulates between test bench, data path, and the processor service routine, and responses are collected from both data path and processor.]

Figure 6.2: Token passing based operation of the test bench used for the functional verification of PIP modules

The data path model works on packets of three bytes, as these are the basic building blocks of the telecommunication frames. The overhead bytes are analyzed if necessary, e.g. for frame alignment, and then forwarded to the embedded processor. The payload is forwarded to the data path output after calculation of the standardized check sums.

The behavioral processor model works in an instruction-by-instruction mode. Stimuli data are taken from the assembler data base, converted into a generic opcode format, and stored into a VHDL record array. Every record element represents the properties of one instruction in a

generic, uncoded way (opcode, register address, immediate- and offset values).

Figure 6.2 shows the sequence of actions triggered by each stimuli data item. The test bench activates the data paths (Init) and the management interfaces, one after the other. The currently active data path reads a new stimulus vector (1) and executes the necessary data path functions. If a new overhead was received, the data path activates the processor (2) by launching an interrupt. The processor executes the according interrupt routine (3) and passes the token back (5) to the data path. During program execution all data and instruction memory accesses are stored (4) in separate data bases. The data path outputs the data (6) and eventually passes the token back (7) to the test bench.
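The token-passing sequence (1)-(7) can be sketched as plain function calls, the token being the return from each call. All names, stimulus fields, and the log format are illustrative, not from the thesis.

```python
# Simplified token-passing loop of the test bench in Figure 6.2: the
# bench hands the token to the data path, which may activate the
# processor via an interrupt before the token travels back.

def processor(overhead, log):
    log.append(("isr", overhead))          # (3) run interrupt routine
    return "token"                         # (5) pass token back

def datapath(stimulus, log):
    log.append(("datapath", stimulus))     # (1) read stimulus vector
    if stimulus.get("overhead") is not None:
        processor(stimulus["overhead"], log)   # (2) launch interrupt
    log.append(("output", stimulus["payload"]))  # (6) output data
    return "token"                         # (7) token back to the bench

log = []
for stim in [{"payload": 1, "overhead": None},
             {"payload": 2, "overhead": 0xB1}]:
    datapath(stim, log)                    # the test bench holds the token
```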

The generation of expected responses is handled independently for every data path and management interface. An interface contains control and data signals to allow data exchange with another IP module. Interface output data are stored as expected responses only after activation of the interface. In our model the data and program memory interfaces of the embedded processor are directly output to offer better fault coverage and easier debugging. To be latency-independent, a different data base for each interrupt level is used. This way it can be checked whether the interrupt routines produce the correct output independent of the interrupt scheduling.

The valid expected responses are saved in a separate file per interface in order to decouple the different interface activities. Thereby, significant data reduction and simulation speed-up are obtained without loss of information concerning functionality. The format of the expected responses is integer to provide a uniform data base for the various customizable settings and configurations of the IP module for the RTL simulation.

The outputs to the processor's data and program memory are stored in different databases for every interrupt level. This is necessary because of the token passing based operation of the behavioral model. So every interrupt routine is executed completely without any possibility for an interruption. In the RTL model the sequence of RAM access operations may be different from a global perspective. The accesses done by every individual interrupt routine, however, must be equal to the behavioral model. This way, storing the expected responses in separate data bases

makes them timing-independent and easy to compare to the RTL model outputs.

6.3 Functional verification of the customized RTL model

The second simulation run uses the same STM-1/STS3 frames and assembler code as the first run did. This time the STM-1/STS3 frames are parallel-to-serial converted and fed into the RTL model. Again, the embedded processor does the interrupt-driven overhead processing based on the assembler program already used in step 1.

The assembler data base is transferred into binary code by an assembler observing the mapping directives based on the selected instruction set. During download this binary code is translated into a virtual VHDL memory block. With this approach the assembler is part of the verification flow and therefore validated itself.

As opposed to the instruction/data-driven approach in the behavioral model, the synthesizable RTL description of the IP module is clock driven (right side of Fig. 6.1). Output vectors are generated once per clock cycle during the execution phase of the simulation. These vectors are compared against the expected responses from the behavioral simulation run independently for every interface and interrupt level.

6.4 Test bench and real-time constraints

The test bench used to execute the simulation of the two models is parametrizable the same way as the RTL model. So the test bench scales according to the RTL model and works for all possible parameter sets. The test bench provides the infrastructure to run the simulation, like the clock signals, file handling, token passing between test bench and behavioral model, and generation of test reports. The actual test vectors are kept in the various data bases.

Additionally, the test bench can be used to check real-time constraints. As an example the alarm indication signal (AIS) shall be

explained here. If a severe failure occurs during data processing, subsequent blocks must be warned that they will receive incorrect data. For this purpose the AIS signal is inserted into the data stream. The real-time constraint associated with AIS is that the warning signal must be inserted within a given amount of time.

In our case the behavioral model will insert AIS without delay because it has no notion of time and does all actions "immediately". When the test bench detects the AIS signal, it starts a timer to check whether the RTL model reacts on time. During this time-out unequal response vectors will be accepted as correct because the exact timing behavior is not specified.
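The timer-based acceptance window can be sketched as follows. This is an illustrative Python check, not the VHDL test bench; the timeout value and all names are assumptions.

```python
# Sketch of the AIS real-time check: after the behavioral model inserts AIS
# (at `ais_cycle`), mismatching RTL responses are accepted until the timeout
# expires. AIS_TIMEOUT is an assumed, illustrative cycle budget.

AIS_TIMEOUT = 8

def check_with_timeout(expected, actual, ais_cycle):
    """Compare response vectors cycle by cycle; mismatches inside the
    AIS grace window count as correct because the timing is unspecified."""
    for cycle, (e, a) in enumerate(zip(expected, actual)):
        in_grace = (ais_cycle is not None
                    and 0 <= cycle - ais_cycle < AIS_TIMEOUT)
        if e != a and not in_grace:
            return False
    return True
```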

Although this scheme works very robustly and precisely, it is not very convenient to use. The solution is not very elegant, because lots of timers in the test bench are needed to handle all real-time constraints. Additionally, the IP customer must edit the test bench to implement checks for specific constraints.

Chapter 7

PIP module design example

As an example for the design of programmable intellectual property modules, an STM-1/STS3 input block was designed. STM-1 and STS3 are the 155 Mbit/s hierarchies in the synchronous digital hierarchy (SDH) and synchronous optical network (SONET) standard, respectively. The goal was to design a highly parametrizable soft module that offers maximum flexibility for system integration and customer-specific features. Additionally, the module should be usable in both SDH and SONET networks without any changes to the hardware if implemented as a hard IP. By downloading appropriate software its application in SDH or SONET networks can be selected after system integration.

7.1 Architecture

Table 7.1 shows the frame structure used in SDH and SONET networks. The frames are approximately the same in both regimes; only the usage of several overhead bytes differs. One frame contains nine lines and 270 columns and is transmitted serially in 125 µs. The major difference to SONET is the organization of SDH payloads as packets of three bytes. These triples are the smallest payload unit for SDH


Overhead (columns 1-9)                Payload (columns 10-270)

A1  A1  A1  A2  A2  A2  J0  X   X
B1  X   X   E1  X   X   F1  X   X
D1  X   X   D2  X   X   D3  X   X
H1  H1  H1  H2  H2  H2  H3  H3  H3
B2  B2  B2  K1  X   X   K2  X   X     Payload
D4  X   X   D5  X   X   D6  X   X
D7  X   X   D8  X   X   D9  X   X
D10 X   X   D11 X   X   D12 X   X
S1  Z1  Z1  Z2  Z2  M1  E2  X   X

Table 7.1: 155Mbit/s frame structure: The overhead byte usage is given for SDH networks. "X" means the usage of the overhead byte is not standardized at all or varies depending on transmission media

networks. In SONET the three bytes belong to three different time-interleaved payload frames. This is because of historical differences of the plesiochronous telecommunication hierarchies in Europe and the USA.

The start of the payload is marked by a pointer located in the "H-bytes" in line four of the frame. In SDH only the first bytes "H1" and "H2" are used to find the beginning of the payload. These two bytes contain the actual 10-bit pointer and additional bits to operate a state machine that controls pointer actions.
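The bit layout of the two pointer bytes can be sketched as follows, a minimal Python illustration of the SDH/SONET pointer format; the function names are our own.

```python
# Sketch of the pointer extraction from the "H1"/"H2" bytes: the lower two
# bits of H1 concatenated with the eight bits of H2 form the 10-bit payload
# pointer; the top four bits of H1 carry the new-data flag.

def pointer_value(h1: int, h2: int) -> int:
    return ((h1 & 0x03) << 8) | h2         # 10-bit pointer value

def new_data_flag(h1: int) -> bool:
    return (h1 >> 4) == 0b1001             # "1001" signals a new pointer
```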

To adapt to frequency variations without losing the location of the payload start, two mechanisms work in parallel. The "H3" bytes can carry payload to increase the bandwidth, or the three payload bytes following the "H3" bytes can be left empty. In the first case the pointer is decremented, which means the frame starts one byte earlier; in the second case the pointer is incremented.

For SONET networks the pointer algorithm is exactly the same, except it works on individual bytes and not on triples. In this case three pointers ("H1", "H2" bytes) must be handled and three independent stuffing opportunities exist.

The framing word is common to both hierarchies and is located in

the "A1" and "A2" bytes of the frame. These six bytes carry a fixed bit pattern that is used to synchronize to the frame structure. The framing algorithm is identical in SDH and SONET networks and was therefore implemented in hardware. All further bytes do not influence the hardware implementation described here and are not discussed in detail.
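The framing search can be sketched as follows. This is an illustrative byte-aligned Python version; the actual hardware searches the serial bit stream, and the function name is our own.

```python
# Sketch of the framing search: lock onto the frame by locating the fixed
# A1 A1 A1 A2 A2 A2 pattern (A1 = 0xF6, A2 = 0x28 in SDH/SONET).

FRAMING = bytes([0xF6] * 3 + [0x28] * 3)

def find_frame_start(stream: bytes) -> int:
    """Return the index of the first byte after the framing word, or -1."""
    pos = stream.find(FRAMING)
    return -1 if pos < 0 else pos + len(FRAMING)
```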

An overview of the system architecture is given in Figure 7.1. One central idea of synchronous telecommunication networks is that the clock frequency is spread over the whole network by the data stream (see Figure 2.1). That means that the frequency of the incoming data stream must be extracted by each network element. For this work it is assumed that the regeneration of the clock signal is done outside the IP and so data and clock signals are available at the IP's inputs.

The input block of the IP uses the clock frequency extracted from the incoming data to synchronize to the telecommunication frame structure by searching the "A1/A2" pattern in the data stream. Once the frame structure is locked, the bit interleaved parity (BIP) check sum BIP8 is calculated, the data stream is descrambled and serial-to-parallel converted, and the bytes are fed into a FIFO. The FIFO serves to transfer the data from the input clock domain to the system clock domain. Frequency adaptations are executed only once during a frame by generating a fourth "A1" byte or forwarding just two instead of three. The framing word is not used in the rest of the data path anymore and so the adjustments have minimal impact on the system.
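The descrambling step can be sketched as follows, an illustrative Python model of the frame-synchronous SDH/SONET scrambler (generator polynomial 1 + x^6 + x^7); details such as the unscrambled first overhead row are omitted, and the function name is our own.

```python
# Sketch of the frame-synchronous descrambler: the 7-bit register restarts
# from all ones at every frame; xoring the resulting pseudo-random sequence
# onto the data both scrambles and descrambles it.

def descramble(data: bytes) -> bytes:
    state = 0x7F                           # 7-bit register, reset to all ones
    out = bytearray()
    for byte in data:
        keystream = 0
        for _ in range(8):                 # generate eight PRBS bits, MSB first
            keystream = (keystream << 1) | ((state >> 6) & 1)
            bit = ((state >> 6) ^ (state >> 5)) & 1    # taps x^7 and x^6
            state = ((state << 1) | bit) & 0x7F
        out.append(byte ^ keystream)
    return bytes(out)
```

Because the keystream depends only on the position within the frame, applying the function twice restores the original data.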

The BIP8 and BIP24 check sums are calculated in hardware. The BIP8 is computed using the complete scrambled frame, while the BIP24 is calculated after descrambling using the complete frame except the first three overhead lines. The values are compared to the "B1" and "B2" bytes of the overhead.
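The parity arithmetic behind these check sums can be sketched as follows; a minimal Python illustration with our own function names, ignoring the exact byte ranges each check sum covers.

```python
# Sketch of the bit interleaved parity checks: BIP-8 is even parity over all
# bytes (a byte-wise xor); BIP-24 keeps one parity byte per position of the
# three-byte groups.

from functools import reduce

def bip8(frame: bytes) -> int:
    return reduce(lambda acc, b: acc ^ b, frame, 0)

def bip24(frame: bytes):
    lanes = [0, 0, 0]
    for i, b in enumerate(frame):
        lanes[i % 3] ^= b                  # one parity lane per byte position
    return tuple(lanes)
```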

Three principal clock domains exist inside the PIP module. The input domain is split into subsections for each input stream, as every input potentially has a different data rate. All these different data streams are transferred to the system clock domain using a two-clock FIFO buffer of configurable depth. The system clock frequency is one eighth of the bit frequency because the data processing is byte based.

The embedded processor runs at four times the speed of the data path to ensure enough computing power and seamless data transfer between

(Figure: data and clock inputs feed framing and serial-to-parallel conversion, then a two-clock FIFO; overhead is separated from payload and routed via the interrupt controller to the embedded processor, payload continues through the data path; configurable management interfaces attach to the processor)

Figure 7.1: Block diagram of the implemented programmable intellectual property

data path and processor.

The next block extracts the overhead from the data stream and launches an interrupt for the embedded processor to initiate overhead processing in software. The rest of the data path processes the data according to the telecommunication standards. Signals generated by the processor control the data path.

Data exchange between data path and processor is done via memory-mapped registers. The address space of the processor is one bit wider than the data RAM. The additional address space is used to directly access data path registers. Dedicating the same storage size to data RAM and data path registers leaves enough room for overhead transfer and control signal feedback. To guarantee seamless data transfer between the memory-mapped registers and the processor in read and write direction, the registers must be clocked with the processor clock frequency.
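The widened address space can be sketched as follows; an illustrative Python decode with our own function name, using the 8-bit RAM address width that matches the 256 x 8 bit data RAM of Table 7.3.

```python
# Sketch of the address-space split: the extra most significant address bit
# selects between the data RAM and the memory-mapped data path registers.

DATA_RAM_ADDR_BITS = 8

def decode(addr: int):
    """Classify a processor data access as RAM or register access."""
    if addr >> DATA_RAM_ADDR_BITS:         # extra MSB set -> register file
        return ("register", addr & ((1 << DATA_RAM_ADDR_BITS) - 1))
    return ("ram", addr)
```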

The pointer machine is split between hardware and software to enable the use of the PIP module in STM-1 and STS3 environments. The pointer evaluation algorithm can be described as a hierarchical finite state machine. On the top level, state transitions are made based on the pointer values of several frames. On this level, checks concerning consistency and validity of the pointer are made. If the top-level machine is in a state where pointer adjustments are possible, the lower-level machine determines whether the pointer actually gets moved. The top-level FSM does state transitions only once per frame, while the lower-level machine must react in a matter of bytes after the pointer was evaluated.

The slower top-level FSM is implemented in software, while the actual pointer movement and the lower-level FSM are realized in hardware. For STS3 data three independent pointer machines must be implemented, while STM-1 frames use just one pointer. These machines generate the valid and frame start signals that accompany the outgoing data stream.

As shown in Figure 7.1, the interrupts from the data paths and the management interfaces are scheduled by the interrupt controller. This controller supports two different priorities, the higher one for the data paths and a lower priority for the interfaces. The controller forwards higher-priority interrupts or waits for the return from interrupt (RTI) signal from the processor to launch the next interrupt. To ensure that a long routine does not prevent overhead data of the other data path from being read by the processor, an additional mechanism was implemented. The data path routines can signal that all data for the current routine was read from the data path, and the interrupt controller can then launch an interrupt from the other data path.

7.2 Adaptable Interfaces

A configurable number of interfaces connect the processor to the network management. They can be used to download new control parameters or upload control and supervision results. These interfaces are directly connected to the processor to enable adaptable interface

        ...                 ; custom data processing
        push  #$15          ; load idle pattern on stack
snoop:  load  D2, A2        ; get data from bus interface
        cmp   D1, D2        ; bus idle?
        bneq  snoop         ; if not loop back
        push  #$8           ; load loop counter on stack
transfer:
        store D3, A3        ; transfer data to bus interface
        pop   D3            ; remove data from stack
        dec   D1            ; decrease loop counter
        beq   D1, transfer  ; loop back until all data transferred

Code-Sequence 7.1: Interface protocol that snoops the bus and sends data if the bus is idle

protocols. For the implemented application point-to-point transfer of small amounts of data is needed. Therefore very simple transfer protocols are sufficient. Incoming data is read by the processor after an interrupt launch from the interface. After new data was written to the output registers a "ready" signal is set. To explore the possibilities of the adaptable interfaces more sophisticated transfer protocols were implemented and simulated.

The interface protocol given in Code-Sequence 7.2 assumes a bus arbiter that gives the devices attached to the bus exclusive write access on demand. The principle of operation is that the device that wants to transmit data must request the bus and eventually gets write permission from the bus master. As it would consume unnecessary computing power to wait for the bus to become available, the protocol is implemented in two different interrupt routines.

The assembler routines given here are not complete, but give only the key code sequences. Especially removing all data from the stack before the end of an interrupt routine is not included. Also, there are many versions of the used instructions (see Table A.4) that are better suited for the application at hand, but not used here for clarity.

The first routine (IRQ1) does the data processing and generates the data item that shall be transmitted. The data processing is symbolized by the dots in Code-Sequence 7.2. The data can not be transmitted right away, but the bus must be requested first. To make the data available

for the transmit routine, they must be stored in the data RAM. This is necessary because other interrupts could move new data to the stack and the position of the transmit data would then be unknown to the transmit routine. After all (in this case eight) data items are stored, the routine terminates.

To actually transmit the data a second interrupt routine (IRQ2) is used. This routine is activated by the bus master that grants write access to the bus. All this routine has to do is to fetch the data from the RAM and send it to the bus interface. From this code-sequence it becomes clear that the interface registers can be viewed as part of the data RAM from the processor's point of view. This is due to the memory mapped implementation described before.

IRQ1:   ...                 ; custom data processing
        push  #$8           ; load loop counter on stack
loop1:  store D2, A1        ; store first data item in RAM
        inc   A1            ; move to next address
        dec   D1            ; decrease loop counter
        beq   D1, loop1     ; loop back until D1=0
        store #$10, A2      ; request bus by sending request indication to interface
        rti                 ; return from interrupt

IRQ2:   push  #$8           ; load loop counter on stack
loop2:  push  D2, A1        ; load first data item from RAM
        store D2, A2        ; write data to interface
        inc   A1            ; move to next address
        dec   D1            ; decrease loop counter
        beq   D1, loop2     ; loop back until D1=0

Code-Sequence 7.2: Interface protocol that requests the bus and transfers the data once it got bus access

The second implemented interface protocol is shown in Code-Sequence 7.1. It is based on the idea to snoop the bus until an idle pattern is found and then start transmitting.

This is realized using two loops: the first loop reads the bus input and compares it to the idle pattern. This loop keeps running until the

idle pattern is found and then terminates. The second loop transmits the data that had previously been placed on the data stack. Because data generation and data transmission are in the same interrupt routine, there is no need to save the data in the data RAM before transmission.

(Figure: the chip contains the data paths and the embedded processor; DataA/DataB inputs each have their own input clock, the data path and processor clocks are switchable, and scan-in/scan-out ports plus scan-enable signals connect the data path and processor scan chains to the chip pins)

Figure 7.2: Scan and clock signal distribution to enable seamless data transfer between processor and datapath and the software-driven BIST

The peculiarities of the stack architecture can very elegantly be used here. As the bottom of the stack gets moved up when an item is removed from the stack (pop command), the same stack location (D3) can be read in every pass through the loop.

7.3 Built-in self-test

To enable a software driven BIST of the data path part, we inserted separate scan chains into processor and data path. Stimuli can be fed into the data path chain, and responses can be collected, from outside the IP or from the embedded processor. The processor instruction RAM is completely accessible from the chip peripherals and test patterns can also be applied and collected from there.

nop                 ; set scan enable high to feed scan chains and read responses
store #$T1, A1      ; write first vector to scan chain
push  A2            ; get scan output
store #$T2, A1      ; write next vector to scan chain
push  A2            ; get scan output
store #$T2, A1      ; write next vector to scan chain
push  A2            ; get scan output
nop                 ; set scan enable low to generate new scan responses

Code-Sequence 7.3: Simple code sequence to implement a software driven BIST for the PIP module

The data RAM, on the other hand, is only accessible from the embedded processor itself. To test the data RAM a checkerboard pattern is written into the RAM and then read out again. This is done using a software routine that also checks the correctness of the response and outputs the result of the test.

For the implementation of the software-driven data path BIST all clock signals must be controllable to enable the inclusion of flip-flops from different clock domains into a single scan chain. One solution to achieve this goal is shown in Figure 7.2. The input data clocks are controllable directly from outside the chip. The processor clock must be fed into the data path to transfer data from processor to data path in the functional operation case. In case of BIST operation the processor clock input must be switched to the data path clock. This is done only if the data path is in scan mode and the processor is not. In this case

the "scan_in" and "scan_out" ports of the data path are connected to the processor and all data path flip-flops are clocked by the data path clock.

Code-Sequence 7.3 shows the principal assembler code sequence that executes a BIST. The patterns are fed into the scan chain and the output from the chain is collected back by the processor. After the test the vectors must be compared to the expected responses that were previously determined by simulation. The implemented PIP module uses this simple sequence to prove the feasibility of the software driven BIST. If the software driven BIST is used in combination with an LFSR-driven BIST, Code-Sequence 7.4 offers the possibility to achieve much higher coverage.

        push  #$255         ; load loop counter on stack
loop1:  push  D1, A1        ; load test vector from RAM
        store D1, A2        ; write vector to scan chain
        inc   A1            ; move to next address
        dec   D2            ; decrease loop counter
        beq   D2, loop1     ; loop back until D2=0
compare:
        push  #$128, D1     ; get MISR output
        cmp   #$165, D1     ; compare MISR to expected value

Code-Sequence 7.4: Code sequence to implement a software driven BIST using patterns that had previously been saved in the data RAM and the MISR to compress the result

Code-Sequence 7.4 assumes that the test patterns had been saved in the data RAM before the actual application. Additionally, the MISR can be used to compact the responses so that only one read access to the data path is necessary for the complete test. The MISR output is then compared to the value determined before the test by simulation ($165 in this example). Additionally, the availability of a processor offers the possibility to save the test vectors in a compressed form and decompress them before application. An example for this approach is given in [PR99].
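The compaction principle can be sketched as follows; an illustrative Python model in which the signature width and feedback taps are assumptions, not the hardware parameters of this chip.

```python
# Sketch of MISR response compaction: every scan-out word is folded into an
# LFSR-like signature register, so a single final compare (like $165 in the
# text) suffices for the whole test.

def misr_signature(responses, width=8, taps=0b10111000):
    sig = 0
    mask = (1 << width) - 1
    for word in responses:
        feedback = taps if (sig >> (width - 1)) & 1 else 0   # LFSR feedback
        sig = ((sig << 1) & mask) ^ feedback ^ (word & mask)  # shift + inject
    return sig
```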

7.4 Parametrization of the PIP module

The customizations of the PIP module given here are subdivided into three basic categories: qualitative customization, quantitative customization, and system interfaces.

Parameter              Description

Data path
Number of data paths   Number of 155 Mbit/s in- and outputs
Kind of data path      Defines whether a data path is used for SDH or SONET data
Data path interface    Voice or packet interface can be used
Payload columns        Must be 261 for synthesis but can be reduced to cut simulation times

Embedded processor
Instruction set        A subset of the 66 instructions can be selected as described in Code-Sequence 5.1
I-Memory size          Must be adapted to hold the complete assembler code
D-Memory size          Must be big enough for all constants and variables needed
Stack depths           All stacks must be large enough to prevent overflows

Management interfaces
Number of interfaces   The number of independent in- and outputs can be configured
Interface width        The width of all interfaces can be adjusted independently
Interface protocol     The interface protocol is defined by the interface software running on the embedded processor

Table 7.2: Major parameters of the PIP module implementation. The complete parameter set is given in Appendix A.

1. While the telecommunication standards leave virtually no room for qualitative customizations of the data path, the use of the overhead bytes is only partially fixed. The implemented data path separates the overhead from the payload and forwards all overhead bytes to the embedded processor. In addition, the data path implements the functions defined in the ITU standards [Int96]. The choice between STM-1 and STS3 is the only qualitative customization that can be realized for the data path.

On the other hand, there are principally unlimited qualitative customizations for the overhead processing by means of different software running on the IP-embedded processor. Several basic routines were developed, but the IP user is free to change them or to add as much new functionality as needed in the system under design.

2. The most significant numerical adaptation offered by the IP is the number of 155 Mbit/s inputs. The user is free to use as many inputs as he likes. The only limit to the number of STM-1/STS3 inputs is the performance of the embedded processor that has to do the overhead processing.

Number and port width of the management interfaces are also customizable. The interface traffic is handled by the embedded processor to enable different data transfer protocols.

3. While the 155 Mbit/s inputs are fixed by the telecommunication standards, the data output interface is available in two configurations.

(a) The telecommunication standards ask for a twelve bytes deep elastic storage of the STM-1/STS3 output signal. This buffer can be implemented at the output of our IP module.

The STM-1/STS3 payload in byte format and a frame start signal are output, while the overhead bytes are removed from the data stream. This solution strictly adopts the telecommunication standards. A different solution, that was also used in the add-drop multiplexer system described in section 2, is enabled by interface type (b).

STM-1c/STS3c standardize the use of continuous payload in STM-1 and STS3 containers, respectively. These standards are targeted at Asynchronous Transfer Mode (ATM) or Internet Protocol (IP) cells to form the payload. This interface is especially useful for such systems.

(b) The standards do not force the system designer to implement the 12 byte buffer directly at the output of the STM-1/STS3 module. As explained in section 2.3.1, grouping all frequency adaptations in one large buffer is superior to having many small buffers on the chip. Therefore, we also offer the possibility to output the complete data stream while marking the payload bytes as valid and overhead bytes as invalid.

The transfer protocol for the data path data is fixed by the telecommunication standards; the management interfaces can vary for different system environments. To achieve the highest flexibility, every interface of configurable width is equipped with additional in- and output signals. These signals can be used by the embedded processor to implement different custom-designed interface protocols as described in section 7.2.

An STM-1 data path can exclusively be used for SDH data streams because it only has one pointer machine implemented. If STS3 is configured, three pointer machines are available. Two of them can be disabled to use the hardware also for SDH environments.

The parametrization of the processor instruction set is described by Code-Sequence 5.1. Each instruction can be turned on or off by entering '1' or '0' in the appropriate vector position. In the VHDL code the signal assignments associated with the individual instructions are made depending on this entry. If the entry is '0' the assignment is never made and the associated hardware will not be synthesized.
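The selection mechanism can be sketched as follows; an illustrative Python analogue of the VHDL generate-style pruning, where the instruction list is a small assumed subset of the 66 instructions.

```python
# Sketch of the instruction-set parametrization: a '1'/'0' vector decides
# per instruction whether its decoder logic is generated.

INSTRUCTIONS = ["push", "pop", "store", "load", "inc", "dec", "beq", "rti"]

def enabled_instructions(enable_vector: str):
    """Keep only the instructions whose vector position is '1'."""
    return [ins for ins, bit in zip(INSTRUCTIONS, enable_vector)
            if bit == "1"]
```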

The modular approach of the PIP design is mirrored in the VHDL code (see Figure 7.3). Functional core and interfaces are coded separately to offer maximum independence of the two. The configuration package (ModuleCfgPkg.hdl) is the user interface of the PIP module. In this package all parameters of the IP are defined and can be changed. Many more parameters are derived from these settings. They are stored in a separate package to hide them from the IP user. The "ModulePkg.hdl" is used by almost all other files to spread the parameters over the complete implementation.

IRQCtrl.hdl, IQRegs.hdl, InstructionDecoder.hdl, Sequencer.hdl, DStackInterface.hdl, ALUData.hdl, AStackInterface.hdl, Framing.hdl, VoiceOut.hdl, ResetSleepCtrl.hdl, RStackInterface.hdl, SerToPar.hdl, PacketOut.hdl, DStack.hdl, CStackInterface.hdl, Pointer.hdl, AStack.hdl, OpCodeInterface.hdl, RStack.hdl, DMemInterface.hdl, DMem.hdl

Figure 7.3: Code organization showing the modular approach to ensure maximum flexibility and expandability

7.5 Implementation Results

A two-input STM-1/STS3 programmable IP module (see Figure 7.4) was implemented. One input is configured to handle STM-1 data, the other one is targeted at STS3 frames. The two data paths differ in the number of pointer machines implemented. The STM-1 version has only one pointer machine and can therefore exclusively be used for SDH frames. The STS3 data path has three pointer machines implemented in hardware. That means that this data path can be used as a hard IP for both STM-1 and STS3, with no modification to the hardware.

The management interfaces are implemented as two eight-bit wide inputs and outputs. Both inputs have a separate interrupt signal; both outputs are equipped with a ready signal. This offers the necessary flexibility for the implementation of the described interface protocols.

The embedded processor is implemented using the full instruction set and on-chip RAM for data and instruction memory. The data and address stacks are four and two entries deep, respectively, with flip-flop

Parameter         Setting
Datapaths         1 STM-1 + 1 STS3
Interfaces        2 8-bit inputs + 2 8-bit outputs
Instruction set   full
Instruction RAM   1024 x 16 bit
Data RAM          256 x 8 bit
Data stack        64 x 8 bit
Address stack     10 x 16 bit

Table 7.3: Chosen parameter set for the experimental implementation of the PIP module

based Synopsys DesignWare "RAMs" as extensions.

The two data inputs run at the individual clock rates extracted from their respective data streams (155.52 MHz). After frame detection and serial-to-parallel conversion, the resulting data bytes are synchronized to the system clock speed (19.44 MHz) via a FIFO. The embedded processor clock frequency is four times the byte frequency (77.76 MHz) to offer enough performance for extensive computations. A detailed list of physical parameters of the ASIC implementation is given in Table 7.4.

Independent scan chains have been implemented for the data path and the processor part of the chip. These chains can be operated from the primary in- and outputs, which allows a high fault coverage. Additionally, the production test of the data path part can be done by a software driven BIST. For this purpose the inserted scan chains can be fed and read from off-chip or by the embedded processor. The only additional hardware necessary for the BIST implementation are multiplexers to switch the processor clock and the processor's data IOs between functional and test operation.

The area consumption determined in section 2.3.5 for the two STM-1 input block was 52 kGE. Compared to this system-level hardware/software co-design solution the PIP implementation suffers a significant area penalty: it is 88 kGE in size. The major contribution to this large area overhead is made up by the embedded processor.

In this first PIP implementation, the stack architecture processor that was used in the ADM-1 project (see section 2.3) as the system

Process                 0.6 µm CMOS
Metal layers            3
Supply voltage          5 V
Signal pads             98
Supply pads             22
Chip size incl. pads    24 mm²
Chip size excl. pads    14.459 mm²
Datapath area (both)    3.730 mm²
Processor area          10.723 mm²
Data input frequency    155.52 MHz
Processor frequency     77.76 MHz
System frequency        19.44 MHz

Table 7.4: Implementation data of the 2-input STM-1/STS3 PIP module

processor was used as the PIP embedded processor. Moreover, the clock frequency of both implementations was the same, although the PIP implementation used a 0.6 µm technology and the ADM-1 system uses 0.35 µm. It is immediately obvious that this yields quite some area penalty.

The program memory was chosen to be 1024 x 16 bit large. This is too large if only the ITU specifications shall be implemented and leaves room for either additional customer-specific features or area savings. Additionally, the data stack RAM is flip-flop-based. Its size of 64 x 8 bit is also quite large. Although a large stack is very handy for assembler programming, the size could be reduced with little performance penalty.

Figure 7.4: Chip photograph of the implemented STM-1/STS3 programmable intellectual property module

Chapter 8

Conclusions

The concept of programmable intellectual property (PIP) modules was presented in this thesis. It extends the evolution of chip design from application-specific blocks over hardware/software co-design to reusable programmable IPs. The key idea of the new concept is to use an embedded processor inside IP modules.

The new paradigm helps to solve many problems of SOC design by IP reuse. The major benefits of PIP modules are:

1. PIP modules fit perfectly into the system design flow proposed in Section 3.2.1. They support the divide-and-conquer approach needed for future systems due to their large functional size. PIP modules enable the combination of system design by reuse with hardware/software co-design, which is known to be problematic in "classic" system designs.

2. The data transfer protocol of the interfaces that connect PIP modules to their neighboring blocks can be implemented in software running on the embedded processor. This approach offers high flexibility in cases where bandwidth is not the key interface requirement.
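A software-implemented interface protocol of this kind can be sketched behaviorally. The handshake below (a simple req/ack scheme, chosen here for illustration; the actual protocols, signal names, and the Port model are not taken from the thesis) shows why software suffices when bandwidth is modest: the protocol reduces to a polling loop.

```python
class Port:
    """Toy model of a neighboring block: presents 'req' and 'data',
    and advances to the next word when 'ack' is raised."""
    def __init__(self, words):
        self.words = list(words)
    @property
    def req(self):
        return bool(self.words)        # request active while data remains
    @property
    def data(self):
        return self.words[0]
    def set_ack(self, value):
        if value and self.words:
            self.words.pop(0)          # neighbor advances on acknowledge

def receive_all(port):
    """Software protocol loop as it could run on the embedded processor."""
    received = []
    while port.req:                    # poll the request line
        received.append(port.data)     # sample the data word
        port.set_ack(True)             # acknowledge the word
        port.set_ack(False)            # release the acknowledge
    return received
```

Changing the protocol then means changing this loop, i.e. a software download, rather than a hardware respin.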

3. Built-in self-test is the key method to enhance the testability of SOC designs. PIP modules offer the possibility to enhance the test coverage by running a software-driven BIST using deterministic patterns in addition to a random test.

4. The functionality of PIP modules can be altered by a software download. Thanks to this feature, even hard-IP modules become highly flexible when implemented according to the proposed paradigm. Soft PIP modules offer the additional possibility to customize the hardware prior to integration.

The two-step functional verification flow is applicable to every possible parameter set and software version running on the embedded processor. Starting from application-specific data and assembler source code, it uses a behavioral model to generate test vectors for the verification of the RTL model.

The parametrizable test bench checks whether the sequences of expected and actual responses of the RTL model are identical. This method verifies the functionality but not the timing constraints. To test whether the real-time constraints are met, a timer for every individual constraint must be implemented in the test bench.
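The two checks can be summarized in a short sketch (the data structures below are invented for illustration; the actual test bench is VHDL-based): a functional comparison of the response sequences, and one timer per real-time constraint.

```python
def responses_match(expected, actual):
    """Functional check: expected and actual response sequences
    must be identical, element by element."""
    return list(expected) == list(actual)

def deadlines_met(events, constraints):
    """Timing check with one timer per constraint.

    'events' maps an event name to the clock cycle it occurred in;
    each constraint is a triple (start_event, end_event, max_cycles)."""
    return all(events[end] - events[start] <= limit
               for start, end, limit in constraints)
```

The separation mirrors the point made above: the first function alone verifies functionality only; real-time behavior needs the explicit per-constraint timers of the second.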

Two processor architectures, a register-based and a stack architecture, have been implemented to evaluate their suitability as the processor embedded within PIP modules. Both architectures proved to be well suited for this purpose. The only major difference is the better programming support for the more common register-based architecture.

An STM-1/STS3 interface block has been designed in order to prove the usefulness of the PIP paradigm. The implementation also uncovered two drawbacks of the concept. PIP modules consume somewhat more area, which, however, will become less important as usable chip area keeps rising. Additionally, the program and data RAMs of the PIP module make the physical design steps more complex.

These disadvantages are by far overcompensated by the system design advantages and the high reusability of IPs designed according to the new paradigm.

Appendix A

Parameter set of the PIP module

The parameters given here are from the "Configuration Package". Many more parameters are derived from these. The derived parameters are the ones that are actually used in the VHDL code. The "Configuration Package" can be viewed as the user interface of the IP module.
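The relation between user-visible and derived parameters can be illustrated as follows. This Python sketch is only an analogy for the VHDL package mechanism; the derivation formulas and names below are examples, not the module's actual derived-parameter set.

```python
def derive(user):
    """Compute derived parameters from the user-visible ones.

    Only the user dictionary is edited by the integrator; the derived
    values are what the (here: hypothetical) implementation consumes."""
    derived = dict(user)
    # an address-range parameter N yields a memory of 2**N words
    derived["DMEM_WORDS"] = 2 ** user["DMEM_ADR_RNG"]
    # example FIFO threshold derived from the user-chosen depth
    derived["ALMOST_FULL"] = user["DEPTH"] - 4
    return derived
```

Keeping all derivations in one place is exactly what makes the "Configuration Package" usable as the user interface of the IP module: the integrator never touches the derived values directly.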


Parameter          Type      Value                    Description

Data Path Parameters

Input FIFO settings:
DEPTH              positive  8                        depth of the input FIFO in bytes
ALMOST_FULL        positive  Depth-4                  number of empty locations; maximum value
ALMOST_EMPTY       positive  Depth-S                  number of filled locations; maximum value
ERROR_MODE         positive  1                        dynamic error flags
RST_MODE           natural   0                        asynchronous reset affecting ctrl and RAM
SYNC_MODE          positive  2                        use two FFs to synchronize counters to other clock domain

Global configurations:
IN_SECS            positive  2                        number of input sections
IN_SEC_WIDTH       array     0=>1, 1=>1               both sections contain one register
OUT_SECS           positive  2                        number of output sections
OUT_SEC_WIDTH      array     0=>1, 1=>1               both sections contain one register
IRQ_ADR            array     0=>256, 1=>512, 2=>768   start addresses of the different interrupt routines
IRQ_LEVEL          natural   3                        number of different interrupt levels
DATA_PATHS         positive  2                        number of incoming STM-1 or STS-3 signals
PATH_CONF          array     0=>1, 1=>0               '0' if STM-1 frames are to be used; '1' if STS-3 is used
PAYLOAD_COLUMNS    positive  261                      number of payload columns

Table A.1: Complete parameter set of the PIP module including the stack architecture processor (Part I: Datapath)
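The FIFO flag parameters of Table A.1 can be given a behavioral reading. The sketch below assumes (this exact semantics is an assumption, not stated in the table) that ALMOST_FULL counts the remaining empty locations and ALMOST_EMPTY the filled locations at which the respective flag asserts:

```python
def fifo_flags(fill, depth, almost_full, almost_empty):
    """Flag outputs of a FIFO with 'fill' of 'depth' slots occupied.

    almost_full  : asserted once at most 'almost_full' slots are free
    almost_empty : asserted while at most 'almost_empty' slots are filled
    """
    return {
        "full":         fill == depth,
        "empty":        fill == 0,
        "almost_full":  depth - fill <= almost_full,
        "almost_empty": fill <= almost_empty,
    }
```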

¹Used if a RAM is used as the IP external memory.
²Used if registers are used as the IP external memory.
³Possible exceptions are: DStackOverflow, DStackUnderflow, AStackOverflow, AStackUnderflow, RStackOverflow, RStackUnderflow, CStackOverflow, CStackUnderflow.
†This instruction is not used in the reduced instruction set.

Parameter               Type      Value  Description

Embedded Processor Parameters

Stack configurations:
DSTACK_WIDTH            natural   8      data stack register width
DSTACK_REG_NUMBER       positive  4      number of total data stack registers
DSTACK_ACCESS_NUMBER    positive  2      2** => data stack direct access range
DSTACK_MEM_ADR_RNG¹     natural   6      2** => data stack memory address range
DSTACK_REG_ADR_RNG²     natural   4      number of data stack memory registers
ASTACK_WIDTH            natural   16     address stack register width
ASTACK_REG_NUMBER       positive  2      number of total address stack registers
ASTACK_ACCESS_NUMBER    positive  1      2** => address stack direct access range
ASTACK_MEM_ADR_RNG¹     natural   3      2** => address stack memory address range
ASTACK_REG_ADR_RNG²     natural   4      number of address stack memory registers
RSTACK_WIDTH            natural   16     return address stack register width
RSTACK_REG_NUMBER       positive  1      number of return address stack registers
RSTACK_MEM_ADR_RNG¹     natural   3      2** => return address stack memory address range
RSTACK_REG_ADR_RNG²     natural   8      number of return address stack memory registers
CSTACK_WIDTH            natural   2      condition code stack register width
CSTACK_REG_NUMBER       positive  1      number of condition code stack registers
CSTACK_MEM_ADR_RNG¹     natural   7      2** => condition code stack memory address range
CSTACK_REG_ADR_RNG²     natural   8      number of condition code stack memory registers

Table A.2: Complete parameter set of the PIP module including the stack architecture processor (Part II: The Embedded Processor)

Parameter            Type     Value  Description

Embedded Processor Parameters
OPCODE_WIDTH         natural  16     instruction width
OPCODE_MEM_ADR_RNG   natural  10     2** => instruction memory address range
OPCODE               boolean  66     from the 66 instructions a subset can be selected
DMEM_WIDTH           natural  8      data memory width
DMEM_ADR_RNG         natural  9      2** => data memory address range; 8 bit needed for the 256x8 on-chip RAM, one extra bit for memory-mapped locations

Trap and interrupt configurations:
TRAP_LEVEL           natural  2      2** => number of trap levels
EXC_NUMBER           natural  3      2** => number of different exceptions³
INTERRUPT_ADR_RNG    natural  10     2** => interrupt address range

Table A.3: Complete parameter set of the PIP module including the stack architecture processor (Part II: The Embedded Processor cont.)

Instruction             Operation

Data ALU Instructions
mcB #imm,dx,dy          if (dy[#imm] = 1) then dx <- dx + 1 else dx <- dx
cmp #imm,dx             dx - #imm
cmp dx,dy               dx - dy
cmpp dx,dy              dx - dy
add #imm,dx             dx <- dx + #imm
add test,dx,dy          dx + dy
add dx,dy               dx <- dx + dy
addp dx,dy              dx <- dx + dy
sub #imm,dx             dx <- dx - #imm
sub test,dx,dy          dx - dy
sub dx,dy               dx <- dx - dy
subp dx,dy              dx <- dx - dy
mem #imm,dx             dx <- (dx + 1) modulo #imm
mem test,dx,dy          (dx + 1) modulo dy
mem dx,dy               dx <- (dx + 1) modulo dy
and #imm,dx             dx <- dx and #imm
and test,dx,dy          dx and dy
and dx,dy               dx <- dx and dy
andp dx,dy              dx <- dx and dy
or #imm,dx              dx <- dx or #imm
or test,dx,dy           dx or dy
or dx,dy                dx <- dx or dy
orp dx,dy               dx <- dx or dy
not dx                  dx <- not dx
xor dx,dy               dx <- dx xor dy
xorp dx,dy              dx <- dx xor dy
shr dx                  dx <- dx >> 1, dx[MSB] <- 0
shl dx                  dx <- dx << 1, dx[LSB] <- 0
ror dx                  dx <- dx >> 1, dx[MSB] <- C
rol dx                  dx <- dx << 1, dx[LSB] <- C

Data Memory Instructions
pushD (ax)+             d0 <- mem[ax], ax <- ax + 1
pushD (ax)-             d0 <- mem[ax], ax <- ax - 1
pushD offset(ax)        d0 <- mem[ax + offset]
storeD (ax)+            mem[ax] <- d0, ax <- ax + 1
storeD (ax)-            mem[ax] <- d0, ax <- ax - 1
storeD address          mem[address] <- d0
storeD offset(ax)       mem[ax + offset] <- d0
writeD dx,(ay)+         mem[ay] <- dx, ay <- ay + 1
writeD dx,(ay)-         mem[ay] <- dx, ay <- ay - 1
writeD dx,offset(ay)    mem[ay + offset] <- dx

Table A.4: Instruction set of the embedded stack processor
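As an illustration of the Table A.4 semantics, a few of the data ALU instructions can be modeled behaviorally. The sketch below is a hypothetical model, not the processor's implementation; it acts on the four directly accessible 8-bit data stack registers d0..d3:

```python
def step(d, instr):
    """Execute one instruction on the register list d = [d0, d1, d2, d3]."""
    op, *args = instr
    if op == "add":                    # add dx,dy: dx <- dx + dy (8-bit wrap)
        x, y = args
        d[x] = (d[x] + d[y]) & 0xFF
    elif op == "mem":                  # mem dx,dy: dx <- (dx + 1) modulo dy
        x, y = args
        d[x] = (d[x] + 1) % d[y]
    elif op == "shr":                  # shr dx: dx <- dx >> 1, dx[MSB] <- 0
        (x,) = args
        d[x] >>= 1
    return d
```

The mem instruction is noteworthy: it implements a modulo counter in a single step, which is convenient for cyclic buffer addressing in the overhead-processing software.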

Instruction             Operation

Address ALU Instructions
addA #imm,ax            ax <- ax + #imm
addAp dx,ay             ay <- ay + dx
bra offset              PC <- PC + offset
jmpAp                   PC <- a0
jsr label               PC <- label
jsrA                    PC <- a0
bne offset              if (not Z) then PC <- PC + offset else PC <- PC + 1
beq offset              if (Z) then PC <- PC + offset else PC <- PC + 1
bge offset              if (C) then PC <- PC + offset else PC <- PC + 1
bcc offset              if (not C) then PC <- PC + offset else PC <- PC + 1
decjmp dx,offset        if (dx = 0) then PC <- PC + 1 else PC <- PC + offset; dx <- dx - 1
jmB #imm,dx,offset      if (dx[#imm] = 1) then PC <- PC + offset else PC <- PC + 1
nop
reset
rti                     PC <- r0, condition code <- c0
rts                     PC <- r0
sleep                   PC <- PC until interrupt

Table A.5: Instruction set of the embedded stack processor (cont.)

Appendix B

Parameter set of the register-based processor IP module


Parameter           Type     Value  Description

General Parameters
DATA_WIDTH          natural  16     data register width
ADR_WIDTH           natural  10     address register width; 2** => data memory address range

Register/Stack Parameters
DATA_STACK_NUM      natural  12     number of data registers/stacks
ADR_STACK_NUM       natural  4      number of address registers/stacks
STACK_DEPTH         natural  4      number of stack levels, defining data, address, and condition code stack depth
RSTACK_DEPTH        natural  20     return address stack depth

Memory Parameters
INSTR_MEM_ADR_RNG   natural  11     = PC_WIDTH; 2** => instruction memory address range
FIFO_DEPTH          natural  5      buffer size of the data memory interface
WR_WAIT_CYCLES      integer  2      clock cycles allowed for write operations, must be >= 2
RD_WAIT_CYCLES      integer  2      clock cycles allowed for read operations, must be >= 2

Miscellaneous Parameters
TRAP_LEVEL          natural  3      2** => number of trap levels
EXC_NUMBER¹         natural  2      2** => number of different exceptions
OPCODE              boolean  40     from the 40 instructions a subset can be selected

Table B.1: Complete parameter set of the register-based processor IP module

¹Possible exceptions are: StackOverflow, StackUnderflow, RStackOverflow, RStackUnderflow. The first two apply to the data, address, and condition code stacks.

†This instruction is not used in the reduced instruction set.

Instruction             Operation

R-Type Instructions
nop                     no operation
add Rx,Ry,Rz            Rx <- Ry + Rz
addu Rx,Ry,Rz           Rx <- Ry + Rz, C <- carry
adda Ax,Ay,Rz           Ax <- Ay + Rz, C <- carry
sub Rx,Ry,Rz            Rx <- Ry - Rz
subu Rx,Ry,Rz           Rx <- Ry - Rz, C <- carry
and Rx,Ry,Rz            Rx <- Ry and Rz
or Rx,Ry,Rz             Rx <- Ry or Rz
xor Rx,Ry,Rz            Rx <- Ry xor Rz
slt Rx,Ry               if (Rx < Ry) then C <- 1 else C <- 0
sltu Rx,Ry              if (Rx < Ry) then C <- 1 else C <- 0 (unsigned)
cmp Rx,Ry               if (Rx = Ry) then C <- 1 else C <- 0
ml Ax,Rx                Ax[LSB] <- Rx
mh Ax,Rx                Ax[MSB] <- Rx
lwr Rx,Ay,Rz†           Rx <- Mem[Ay + Rz]
swr Rx,Ay,Rz†           Mem[Ay + Rz] <- Rx

I-Type Instructions
addi Rx,Ry,#imm         Rx <- Ry + #imm
addiu Rx,Ry,#imm        Rx <- Ry + #imm, C <- carry
addai Ax,Ay,#imm        Ax <- Ay + #imm
andi Rx,Ry,#imm         Rx <- Ry and #imm
ori Rx,Ry,#imm          Rx <- Ry or #imm
xori Rx,Ry,#imm         Rx <- Ry xor #imm
sli Rx,#imm             Rx <- Rx << #imm, Rx[LSB] <- 0
sri Rx,#imm             Rx <- Rx >> #imm, Rx[MSB] <- 0
slti Rx,#imm            if (Rx < #imm) then C <- 1 else C <- 0
sltiu Rx,#imm           if (Rx < #imm) then C <- 1 else C <- 0 (unsigned)
cmpi Rx,#imm            if (Rx = #imm) then C <- 1 else C <- 0
mli Ax,#imm             Ax[LSB] <- #imm
mhi Ax,#imm             Ax[MSB] <- #imm
li Rx,#imm              Rx <- #imm
lwi Rx,Ay,#imm†         Rx <- Mem[Ay + #imm]
swi Rx,Ay,#imm†         Mem[Ay + #imm] <- Rx
btst Rx,#imm            if (Rx[#imm] = 1) then C <- 1 else C <- 0

J-Type Instructions
jal #imm                RA <- PC + 2, PC <- #imm
jmpi #imm               PC <- #imm
bc #imm                 if (CC = 1) then PC <- #imm

Table B.2: Instruction set of the register-based processor

List of Tables

2.1  Area consumption of the overhead processing hardware needed in case of a pure hardware implementation of the add-drop multiplexer system
2.2  Gate equivalents needed for the ASIP implementation and for the program and data RAMs
2.3  Physical data of the ADM-1 ASIC implementation
3.1  Technology forecast taken from the International Technology Roadmap for Semiconductors 1999 Edition
5.1  Parameter settings for the test integration
5.2  Key values of the test integration
5.3  The three basic instruction classes of the register-based processor
5.4  Parameter settings for the test integration of the register-based processor
5.5  Key values of the test integration
5.6  General parameter set of the synthesis runs for the stack processor and the register-based IP module respectively
5.7  Numerical parameter set of the synthesis runs for the stack processor and the register-based IP module respectively
5.8  Area consumption (in mm²) of the two processor architectures - stack processor (left) and register-based (right)
5.9  Comparison between the two processor concepts. Three different ratings are used (+, 0, -). The area is compared for minimal (left) and extensive (right) configurations.
7.1  155 Mbit/s frame structure: The overhead byte usage is given for SDH networks. "X" means the usage of the overhead byte is not standardized at all or varies depending on transmission media
7.2  Major parameters of the PIP module implementation. The complete parameter set is given in Appendix A.
7.3  Chosen parameter set for the experimental implementation of the PIP module
7.4  Implementation data of the 2-input STM-1/STS3 PIP module
A.1  Complete parameter set of the PIP module including the stack architecture processor (Part I: Datapath)
A.2  Complete parameter set of the PIP module including the stack architecture processor (Part II: The Embedded Processor)
A.3  Complete parameter set of the PIP module including the stack architecture processor (Part II: The Embedded Processor cont.)
A.4  Instruction set of the embedded stack processor
A.5  Instruction set of the embedded stack processor (cont.)
B.1  Complete parameter set of the register-based processor IP module
B.2  Instruction set of the register-based processor

List of Figures

1.1  Increasing complexity of integrated circuits according to Moore's Law
2.1  SDH network incorporating add-drop multiplexers that form the interface between SDH and PDH
2.2  Block diagram of the add-drop multiplexer system showing the data path hardware blocks and the embedded processor
2.3  Interrupt handler of the ADM-1 ASIC
2.4  Architecture of the application-specific instruction set processor
2.5  Microphotograph of the ADM-1 ASIC implementation
3.1  Four major improvements in IC design methodology
3.2  System design flow that incorporates predesigned IP modules
3.3  Principal design steps to build a system on a chip from pre-designed blocks
3.4  Logic BIST architecture
4.1  Using hardware/software co-design in combination with reuse of intellectual property modules in system-on-a-chip designs
4.2  Evolution of ASIC design from custom specific blocks to highly reusable programmable intellectual property modules
4.3  Functional customization of hard-IP modules
4.4  System interfaces of IP modules
4.5  Software driven BIST of IP modules
4.6  Logic BIST architecture from Figure 3.4 with the additional possibility to apply deterministic patterns from the embedded processor
4.7  Trade-off between performance and flexibility in the implementation of a given algorithm
5.1  Data ALU and data stack of the stack processor, offering direct read and write access to four registers on top of stack
5.2  Enhanced stack architecture processor IP module with customizable parameters shown as Greek letters
5.3  Changeable interfaces that separate the processor core from the system environment
5.4  Chip layout of the implemented stack processor intellectual property module
5.5  Architecture of the register-based processor, with stack registers hidden from the programmer behind the general purpose registers
5.6  Register-based IP module with customizable parameters shown as Greek letters
5.7  Microphotograph of the implemented register-based processor chip
5.8  Various synthesis runs with different parameter sets - stack architecture (left) and register-based architecture (right)
6.1  Two step functional verification by simulation
6.2  Token passing based operation of the test bench used for the functional verification of PIP modules
7.1  Block diagram of the implemented programmable intellectual property
7.2  Scan and clock signal distribution to enable seamless data transfer between processor and datapath and the software-driven BIST
7.3  Code organization showing the modular approach to ensure maximum flexibility and expandability
7.4  Chip photograph of the implemented STM-1/STS3 programmable intellectual property module

Curriculum Vitae

I was born in Gelsenkirchen, Germany, in 1969. I received my M.S. degree in electrical engineering from the Aachen Technical University (RWTH), Aachen, Germany, in 1995. In my diploma thesis I developed a new concept for an ultra small N-MOSFET. In 1996 I joined the Integrated Systems Laboratory (IIS) as a research and teaching assistant.

The first project I worked on at the IIS was the design of an add-drop multiplexer for the STM-1 hierarchy. In this project we worked closely together with Siemens Switzerland AG. After finishing this project I mainly worked on the topics covered in this thesis.

After finishing my Ph.D. at the Swiss Federal Institute of Technology (ETH) I will join IBM and work at Thomas J. Watson Research Center in Yorktown Heights, New York.


List of Publications

[1] John D. Holder and Thomas Röwer, "Spice heat transfer model for silicon crystal growth", in Proceedings of the Fifth Conference on Crystal Growth AACG/East, October 1994, Atlantic City, NJ, USA.

[2] J. Gondermann, T. Röwer, B. Hadam, T. Köster, J. Stein, B. Spangenberg, H. Roskos and H. Kurz, "Triangle-shaped nanoscale metal-oxide-semiconductor devices", in Journal of Vacuum Science and Technology B, vol. 14, no. 6, Nov./Dec. 1996, pp. 4042-4045.

[3] J. Gondermann, J. Schiepanski, B. Hadam, T. Köster, T. Röwer, J. Stein, B. Spangenberg, H. Roskos and H. Kurz, "New concept for ultra small N-MOSFETs", in Microelectronic Engineering Journal, vol. 35, no. 1-4, Feb. 1997, pp. 305-308.

[4] J. Gondermann, T. Röwer, B. Hadam, T. Köster, J. Stein, B. Spangenberg, H. Roskos and H. Kurz, "New Concept for Nanoscaled MOS-Devices", in International Conference on Electron, Ion and Photon Beam Technology and Nanofabrication, May 1996, Atlanta, GA, USA.

[5] M. Stadler, T. Röwer, M. Thalmann, N. Felber and W. Fichtner, "An Embedded Stack Microprocessor for SDH Telecommunication Applications", in Proceedings of the IEEE Custom Integrated Circuits Conference 1998, May 1998, Santa Clara, CA, USA, pp. 17-20.

[6] T. Röwer, M. Stadler, M. Thalmann, N. Felber and W. Fichtner, "An Efficient Hardware/Software Co-Design for Broadband Telecommunication Applications", in Proceedings of Globecom 1998, Nov. 1998, Sydney, Australia, pp. 1676-1681.

[7] M. Thalmann, M. Stadler, T. Röwer, N. Felber and W. Fichtner, "A Single-Chip Solution for an ADM-1/TMX-1 SDH Telecommunication Node Element", in Proceedings of the IEEE International ASIC/SOC Conference 1999, Sept. 1999, Washington DC, USA, pp. 147-151.

[8] M. Stadler, T. Röwer, M. Thalmann, N. Felber, H. Kaeslin and W. Fichtner, "Functional Verification of Intellectual Properties (IP): a Simulation-Based Solution for an Application Specific Instruction-Set Processor", in Proceedings of the International Test Conference, Sept. 1999, Atlantic City, NJ, USA, pp. 415-420.

[9] T. Röwer, M. Stadler, M. Thalmann, N. Felber, H. Kaeslin and W. Fichtner, "Intellectual Property Module of a Highly Parametrizable Embedded Stack Processor", in Proceedings of the IEEE International ASIC/SOC Conference 1999, Sept. 1999, Washington DC, USA, pp. 399-403.

[10] T. Röwer, M. Stadler, M. Thalmann, H. Kaeslin, N. Felber and W. Fichtner, "A New Paradigm for Very Flexible SONET/SDH IP-Modules", in Proceedings of the IEEE Custom Integrated Circuits Conference 2000, May 2000, Orlando, Florida, USA, pp. 533-536.

[11] P. Lüthi, T. Röwer, M. Stadler, D. Forrer, S. Moscibroda, H. Kaeslin, N. Felber, W. Fichtner, "A Parametrizable Hybrid Stack-Register Processor as Soft Intellectual Property Module", in Proceedings of the IEEE International ASIC/SOC Conference 2000, September 2000, Washington D.C., USA, pp. 87-91.