A Low-Energy Heterogeneous Reconfigurable DSP IC

1 INTRODUCTION

The advent of the third generation of wireless applications creates a need for digital signal processing platforms that simultaneously display high computational performance, ultra-low energy consumption, and a high degree of flexibility and adaptability. Flexibility and adaptability are a necessity in the presence of multiple and evolving standards, and help to increase quality-of-service in the presence of dynamically evolving channel conditions. (Re)configurable processors offer the advantage of combining flexibility and low energy [1][2] by providing a direct spatial mapping from algorithm to architecture, hence reducing the control overhead typically associated with instruction-set processors.

A low-power reconfigurable DSP architecture template (Pleiades), which encapsulates heterogeneous computing elements, has been proposed [2][3] to meet the requirements of flexibility, speed and energy efficiency at the same time (Figure 1). The Pleiades architecture style echoes the current trend in system-on-a-chip design, which includes a wide variety of macromodules such as core processors, DSPs, programmable logic, embedded memory, and custom modules [4]. The heterogeneous architecture style of Pleiades allows better algorithm-architecture matching, giving better power/performance than many heterogeneous reconfigurable processors that incorporate only a microprocessor and fine-grained FPGAs.

Figure 1. Energy and Flexibility Spectrum for Different Architectures (axes: flexibility versus energy efficiency; points: microprocessor, DSP, Pleiades, ASIC)

In this paper, we describe the design process and implementation results of an instance of the Pleiades architecture, Maia, targeting the speech coding domain. In Section 2, we give a description of the Pleiades architecture template and the model of reconfiguration. In Sections 3 and 4, the methodology used to map algorithms to an architecture is given and the implementation of the architecture is discussed. Section 5 reports the testing strategy for the design and the results of the final chip.

2 HETEROGENEOUS RECONFIGURABLE DSP

Reconfigurable architectures [5][6][7] have received significant attention in recent years in both general-purpose computing and embedded processing. Mixing a processor with fine-grained reconfigurable elements has been the main approach attempted by these systems. The Pleiades reconfigurable architecture achieves low energy consumption by providing a computational platform with mixed programming granularity (i.e. microprocessor, reconfigurable dataflow, FPGA) [8]. In this section, we explain our architecture in concept, and provide a description of the reconfiguration and computation models used in our design methodology.

2.1 Architecture Template

The Pleiades architecture (Figure 2) is composed of a programmable microprocessor and heterogeneous computing elements (referred to as satellites in the rest of the paper). The architecture template fixes the communication primitives between the microprocessor and the satellites, and between satellites. For each algorithm domain (communication, speech coding, video coding), an architecture instance can be created (with known satellite types and numbers).

Figure 2. Heterogeneous Architecture Template (a microprocessor and satellites SAT1 to SAT3 joined by a reconfigurable interconnect and a reconfiguration bus; each satellite wraps an ASIC or FPGA module with data, control, flag, handshake and clock signals and a configuration register)

To reduce overhead in terms of instruction fetch and global control, the architecture utilizes distributed control and configuration. To achieve distributed control, each satellite is equipped with an interface that enables it to exchange data streams with other satellites efficiently, without the help of a global controller. The communication mechanism between satellites is dataflow driven [9].

Figure 4. Model of Reconfiguration (architecture instance shown: 3 address generators, 3 memories, 1 MAC/MUL and 1 ALU; sets of interconnect connections plotted against time, with reconfiguration times t0, t1 and t2)

The control means available to the programmer are basic satellite configurations, which specify the kind of operation to be performed by a satellite, and configurations for the reconfigurable interconnect, which build a cluster of satellites.

2.2 Model of Computation and Reconfiguration

While multiple threads of an application can be run on an instance of the Pleiades architecture template, the compilation of a single thread down to the reconfigurable components is the core of the higher-level scheduling tools that can utilize multiple threads. Therefore, the design methodology described later in the paper aims to support a smooth transition from a single-thread algorithm to an optimized implementation on Pleiades. Figure 3 illustrates the flow of computation supported by this software methodology. As shown in the figure, a sequential thread is first initialized on the microprocessor. After the configuration code is executed on the processor, control is transferred to the reconfigurable satellites (the "split" point in Figure 3), and the computation returns to the processor after all satellite operations are finished (the "join" point). Multiple split points exist within a sequential thread, and the satellites and connections have to be reconfigured for each split point.

The main idea behind reconfigurable computing that is advocated by the Pleiades system is to build a computational engine through spatially-programmed connections of processing elements (satellites). The interconnect model that needs to support such a system is depicted in Figure 4. On the time axis, t0, t1 and t2 indicate the times of reconfiguration. The bars (C1, C2 etc.) between two reconfiguration times represent a set of inter-satellite connections that has to be realized simultaneously by the reconfigurable interconnect.

3 OVERVIEW OF THE ARCHITECTURE DESIGN METHODOLOGY

There are two key issues to be resolved in order to make the methodology practical for designers. Firstly, the Pleiades architecture combines two very distinct models of computation: control-driven computation on the general-purpose microprocessor and data-driven computing on the clusters of satellites. Therefore, the goal of the architectural exploration process is to partition the application over these two computing paradigms so that performance and energy dissipation constraints are met (during the compilation process). Secondly, optimizations related to reconfigurability have to be supported at both the architecture design and the compilation stage. Both of these issues require careful modeling of the algorithm and the underlying heterogeneous architectures.
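The split/join flow of Section 2.2 can be illustrated with a short sketch. It is not the Pleiades tool flow: the `Segment` type, the `run_thread` function and the connection names are hypothetical, and the only behavior taken from the text is that each split point requires the interconnect to be configured with the set of connections that must exist simultaneously until the join.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One piece of a sequential thread (illustrative type, not from the paper)."""
    name: str
    on_satellites: bool                   # True: data-driven kernel on a satellite cluster
    connections: frozenset = frozenset()  # links that must be realized at the same time


def run_thread(segments):
    """Walk a thread in order; return (number of reconfigurations, event log)."""
    log = []
    reconfigurations = 0
    for seg in segments:
        if not seg.on_satellites:
            # Control-driven code stays on the microprocessor.
            log.append(f"processor: {seg.name}")
            continue
        # Every split point needs its own satellite/interconnect configuration.
        reconfigurations += 1
        log.append(f"configure: {sorted(seg.connections)}")
        log.append(f"split -> satellites: {seg.name}")
        log.append("join -> processor")
    return reconfigurations, log


thread = [
    Segment("initialize", False),
    Segment("dot_product", True, frozenset({"C1", "C2"})),
    Segment("control_code", False),
    Segment("fir_filter", True, frozenset({"C3", "C4", "C5"})),
]
count, events = run_thread(thread)
```

With the example thread above, two split points produce two reconfigurations, which corresponds to two of the reconfiguration times on the time axis of Figure 4.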

Figure 3. Flow of Computation on Pleiades

The basic flow of the design exploration methodology [10] is presented in Figure 5. After the introduction of terminology, a short overview of the overall flow is given in Section 3.1.1. A more detailed description of this methodology and the tools developed can be found in [11].

Definition of Kernel - A computationally intensive part of the algorithm that often resides in nested loops.

3.1.1 Basic Methodology Flow

The methodology flow takes DSP or communication algorithms specified in a high-level language (e.g. C) as input. The initiation of the design process requires the establishment of a first-order baseline model of the algorithm complexity and bottlenecks. Such a model allows for the selection and execution of architecture-independent optimizations (stage 1). As architectural choices have yet to be made, this model assumes the presence of a "virtual architecture" with some generic operator costs attached to it. Optimizations at this stage only address either win-only situations or order-of-magnitude improvements, so absolute accuracy is not that important.

After a satisfactory algorithm formulation is obtained, the architectural mapping and partitioning process can be entered. To be meaningful, the partitioning process should be based on realistic bottom-up information regarding the cost of implementing functions and operations on the different architectural choices. Our design-exploration methodology relies extensively on the availability of power-delay models for all components in its architectural library (stage 2). The estimation methods employed in each of these models vary depending upon the type of the module and the desired accuracy. While the absolute accuracy of these characterizations is not crucial, it is important that bounds on the prediction accuracy are known. "Improvements" that fall within the noise level of the estimations are not accepted.

The architecture partitioning and mapping process is started by establishing an initial solution. Given the simplicity of a pure software implementation, we have adopted a "software-centric" approach that assumes that the whole algorithm is initially mapped onto the microprocessor (stage 3). This establishes how closely such a solution adheres to the design specifications and helps to identify the design bottlenecks. A rank ordering of the dominant compute kernels is established. In stage 4, dominant kernels are evaluated in order of importance and mapped to satellites for better power and performance [12]. If a hardware implementation is deemed worthwhile, a repartitioning of the design is established.

After all costly kernels are mapped to accelerators, a final partition of the algorithm across the different architectures is obtained (stage 5), and memory assignment and allocation is performed to minimize memory transfers. While the rest of the algorithm remains in the high-level language, the portions of the algorithm to be implemented by satellites are specified in an intermediate form that is capable of modeling the structure of the reconfigurable satellite operations (i.e. as a netlist). Based on this conceptual netlist, implementation optimizations (stage 6) [13] are invoked to choose a good reconfigurable interconnect architecture (during the architecture design path) and to generate efficient configuration and interface code (during the compilation and test vector generation path) [14].

Figure 5. The Software Methodology Flow (stage 1: algorithm refinement; stage 2: architecture characterization with power-delay-area models; stage 3: mapping to core and kernel ranking; stage 4: mapping to accelerators and exploration; stage 5: partitioning; stage 6: interconnect optimization and compilation/code generation)

At different phases of the design process, it is important to show the impact of particular design choices on the overall performance and energy of the application in order to give meaningful design guidance. A spreadsheet-like environment does precisely that, and it is utilized in our methodology.
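Stages 3 and 4 of the flow can be sketched in a few lines. This is a minimal illustration rather than the authors' tools: the function names, the `noise` parameter and all energy numbers are hypothetical, and it assumes that a kernel moves to satellites only when the estimated saving exceeds the known error bound of the power models.

```python
def rank_kernels(software_energy):
    """Stage 3: order kernels by their energy when run entirely on the microprocessor."""
    return sorted(software_energy, key=software_energy.get, reverse=True)


def map_kernels(software_energy, satellite_energy, noise=0.10):
    """Stage 4: visit kernels in rank order and move one to satellites only if
    the estimated relative saving exceeds the model error bound `noise`."""
    mapping = {}
    for kernel in rank_kernels(software_energy):
        sw = software_energy[kernel]
        hw = satellite_energy.get(kernel)  # None: no satellite estimate available
        worthwhile = hw is not None and (sw - hw) / sw > noise
        mapping[kernel] = "satellites" if worthwhile else "microprocessor"
    return mapping


# Hypothetical per-kernel energy estimates (arbitrary units), for illustration only.
software = {"dot_product": 10.0, "fir": 4.0, "program_control": 6.0, "iir": 1.0}
satellites = {"dot_product": 1.0, "fir": 0.5, "iir": 0.95}
partition = map_kernels(software, satellites)
```

In this example `iir` stays on the microprocessor because its estimated 5% saving is smaller than the assumed 10% error bound, while `program_control` stays there because no satellite implementation is considered for it.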

4 CHIP IMPLEMENTATION

Using the methodology described above, a prototype architecture, Maia, has been designed targeting the domain of voice processing (CELP based speech coders, i.e. 16kb/sec VSELP, LDC-CELP etc.) for wireless devices.

The most dominant kernels in this domain are vector computations (dot-products, FIR, IIR filters etc). After going through stage 1 to stage 5 of the methodology, the following specifications of the architecture are obtained. The Maia processor (Figure 6) combines a microprocessor core (embedded ARM8) with 21 satellite processors: two MACs, two ALUs, eight address generators, eight embedded memories (four 512×16-bit, four 1K×16-bit), and an embedded low-energy FPGA [15]. The FPGA is used for infrequent functions (theta function generation etc.) that do not justify custom ASIC implementations.

Figure 6. The Maia floorplan (showing universal and hierarchical switchboxes; only cross-mesh connections are shown)

Connections between satellites are accomplished through a 2-level hierarchical mesh-structured reconfigurable interconnect network. Through an interface control unit, the ARM8 configures the satellites and communicates data with satellites using I/O interface ports and direct memory reads/writes.

In the rest of this section, the architecture and circuit designs of each component are described. We first present the computational elements (microprocessor, ASIC and FPGA designs) in Section 4.1. The processor-satellite and inter-satellite interface designs are discussed in Section 4.2. The reconfigurable interconnect architecture is described in Section 4.3.

4.1 Component Description

4.1.1 Embedded microprocessor

The embedded ARM8 core is optimized for low-energy operation, and can operate under variable supply voltages [16]. The core is synthesized from VHDL with hand optimizations on the critical path to enhance performance and power.

4.1.2 Programmable ASIC elements

Since filter operations in the applications make frequent use of MAC and memory units, the cores of both the MAC and the memory are custom designed. The ALU and address generator are synthesized from VHDL.

Both the dual-stage pipelined MAC (including shift/round/saturate functions) and the ALU can be configured to handle a range of operations. The address generators and embedded memories are distributed to supply multiple parallel data streams to the computational elements. The address generator features a small local instruction memory, and can be programmed to support various types of addressing patterns and nested loops with loop counters and stride counters. It behaves as the local controller of data-flow kernels by initiating the data-flow threads, and by signaling the end of the data-flow threads to the ARM8.

4.1.3 Embedded FPGA

Commercial FPGAs are often notorious for their energy consumption, and most of them cannot be embedded in a system-on-a-chip. Therefore, we make use of an in-house low-energy embedded FPGA [15].

The embedded FPGA contains a 4×8 array of 5-input 3-output CLBs, optimized for arithmetic operations and data-flow control functions. Its energy efficiency has been measured to be 70 times higher than equivalent commercial solutions. This energy-efficient FPGA design is realized by combining the architectural and circuit-level modifications outlined below.

Logic block: The logic block is designed to improve the interconnect utilization, and hence the interconnect energy. It is made up of a cluster of 3-input look-up tables. It can be used to implement 5-input random logic or 2-bit arithmetic operations.

Low-swing circuit: The low-swing interconnect circuit improves the energy by a factor of 2 compared to a full-swing circuit. The logic blocks operate at 1.5 V while the low-swing signal lines have a 0.8 V swing.

Interconnect architecture: The interconnect is made up of 3 levels of connectivity. Each level is targeted at providing low-energy connections for specific path lengths. The Level0 structure is targeted at connections between nearest neighbors; each logic block can connect to 8 of its immediate neighbors. The Level1 structure is a traditional symmetric mesh architecture, and is good for intermediate-length wires. The Level2 structure is used for implementing connections that span a significant fraction of the chip. The connectivity of each of these structures has been optimized for energy efficiency using architecture evaluation tools.

Clock distribution: More than 80% of the clock energy is dissipated in the clock distribution network. Double-edge-triggered flip-flops are used to reduce the clock activity by a factor of 2, giving a proportional reduction in energy. The clock distribution network also uses the low-swing technique for energy reduction.

4.2 Communication Interface Description

4.2.1 Inter-satellite Communication Interface

The data-flow driven synchronization between the processing elements employs a 2-phase self-timed handshaking scheme with REQUEST and ACKNOWLEDGE signals (Figure 7a), implemented in a globally asynchronous, locally synchronous fashion. This approach not only reduces power consumption by ensuring that a module is only activated when data is ready, but also allows various modules to operate at different and dynamically varying rates. Data links combine 16-bit fixed-width data words with 2-bit control tokens that serve as tags for the different data structures (scalar, vector, or matrix) supported by the network (Figure 7b). Each module includes a network interface controller to coordinate the communication and synchronization.

Figure 7. Data-flow driven globally asynchronous locally synchronous communication protocol. (a) Globally asynchronous, locally synchronous signaling. (b) Control tokens differentiate and delineate data streams and data structures (scalar, vector, matrix); the legend distinguishes regular data from data associated with an end-of-vector token.

4.2.2 Communication Interface between the Microprocessor and Satellites

This interface control unit coordinates synchronization and communication between the synchronous ARM8 core and the asynchronous reconfigurable data-paths. Most importantly, it helps the core perform the reconfiguration of satellites by mapping all the configuration memories to the ARM8 memory space.

The interface logic controls the strobe generation for configuration reads/writes, handshakes, network reset, and start requests for the address generators and I/O ports. The acknowledge signals of the address generators and I/O ports are used to detect the end of a kernel, at which point the ARM8 core is interrupted. Interrupt mask registers and control registers are used to synchronize the ARM8 with the asynchronous satellite array.
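The address generation of Section 4.1.2 and the tagged data links of Section 4.2.1 can be illustrated together. This is a sketch under stated assumptions: the paper gives the field widths (16-bit data, 2-bit token) but not the token codes or the bit positions, so the constants, the placement of the token in the upper bits, and every function name below are hypothetical.

```python
from itertools import product

DATA_BITS = 16
# Hypothetical 2-bit token codes; the paper does not list the actual encoding.
TOKEN_DATA, TOKEN_END_OF_VECTOR, TOKEN_END_OF_MATRIX = 0b00, 0b01, 0b10


def addresses(base, counts, strides):
    """Yield addresses for nested loops; counts[0]/strides[0] is the outermost loop."""
    for idx in product(*(range(c) for c in counts)):
        yield base + sum(i * s for i, s in zip(idx, strides))


def pack(word, token):
    """Combine a 16-bit data word and a 2-bit token into one 18-bit link word."""
    return ((token & 0b11) << DATA_BITS) | (word & 0xFFFF)


def unpack(link_word):
    """Split an 18-bit link word back into (data word, token)."""
    return link_word & 0xFFFF, (link_word >> DATA_BITS) & 0b11


def stream_matrix(memory, base, rows, cols):
    """Read a row-major matrix and tag the last word of each row and of the matrix."""
    addrs = list(addresses(base, (rows, cols), (cols, 1)))
    out = []
    for n, addr in enumerate(addrs):
        last_in_row = (n + 1) % cols == 0
        last_overall = n + 1 == len(addrs)
        if last_overall:
            token = TOKEN_END_OF_MATRIX
        elif last_in_row:
            token = TOKEN_END_OF_VECTOR
        else:
            token = TOKEN_DATA
        out.append(pack(memory[addr], token))
    return out


# A 2x3 matrix stored at address 0 of a small example memory.
link_words = stream_matrix([10, 11, 12, 13, 14, 15], base=0, rows=2, cols=3)
```

A receiving satellite in this sketch would call `unpack` on each link word and use the token to decide when a vector or matrix is complete, which is the role the control tokens play in Figure 7b.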

The system supports two modes of operation: TEST and SYSTEM. As part of the test strategy, the TEST mode allows us to bypass the ARM8 processor and execute individual kernels through the interface. In the SYSTEM mode, instead of an on-chip cache for the embedded ARM8, an external SRAM (with zero bus turnaround) serves as the memory for the processor. In order to meet the 40 MHz performance required by the application, the off-chip memory is clocked twice as fast as the core. The interface is designed to meet this bandwidth.

4.3 Reconfigurable Interconnect Architecture

Keeping the energy of the reconfigurable interconnect network as low as possible while still meeting the flexibility requirement is crucial to the success of our heterogeneous reconfigurable architecture approach. This is realized by a combination of architecture and circuit optimizations.

4.3.1 Hierarchical Interconnect Network Architecture

An energy-efficient architecture must take advantage of the locality and regularity of computation. Exploiting locality by identifying naturally isolated clusters of operations can be used to guide hardware partitioning, resulting in the minimization of global busses and thus reducing the interconnect power. Although the underlying system is heterogeneous, DSP algorithms usually have inherently repetitive computation patterns. Partitioning the hardware so as to preserve such regularity leads to a simpler interconnect structure with reduced fan-ins and fan-outs. Especially for reconfigurable architectures, a more regular interconnect architecture achieves better routability and less reconfiguration overhead. There is a trade-off between flexibility and energy efficiency. For instance, the crossbar network has the most flexibility, but also the least energy efficiency. In stage 6 of the design methodology, crossbar, mesh and hierarchical mesh structures are evaluated, and a 2-level hierarchical mesh is chosen for this implementation.

The implemented hierarchical interconnect mesh network can provide the optimum energy efficiency with the right degree of flexibility within the application domain of interest. Several clusters of tightly connected modules are formed based on the communication locality. Each cluster has a local mesh with 2 buses per channel, and a universal switchbox at every intersection point (Figure 6). Global interconnections are supported by a second-level, larger-granularity mesh (implemented on the higher metal layers) with 2 buses per channel and hierarchical switchboxes located at the key connection points. The hierarchical switchbox (Figure 6) contains a universal switchbox for each mesh level, as well as a number of cross-level interconnect switches. This hierarchical network architecture requires only a limited number of buses to achieve sufficient connection flexibility for our target applications, and cuts the interconnect energy cost by a factor of 7 compared to a straightforward crossbar network implementation.

4.3.2 Low-swing Interconnect Interface Circuits

Communication energy is further reduced by employing a low-swing (0.4 V) pseudo-differential signaling scheme (Figure 8). The wire capacitance loads are also reduced by simplifying the switch network with NMOS-only switches. The circuit employs an NMOS-only push-pull driver with a very low voltage supply. The receiver is a clocked sense amplifier with low input offset and good sensitivity, followed by a static flip-flop. It contains two pairs of input transistors: the gates of P1 and P3 are connected to d, while the gates of P4 and P2 are biased at GND and REF respectively. Figure 8 shows the signaling waveforms. Based on our asynchronous clocking protocol, the clock signal is generated from the handshaking signals. The low-swing signaling reduces the interconnect energy by a factor of 3.4 compared to a full-swing CMOS implementation [17].

Figure 8. Pseudo-differential low-swing interconnect circuitry

5 RESULTS AND STATISTICS

Maia is a 210-pin chip that contains 1.2 million transistors and measures 5.2×6.7 mm2 in a 0.25 µm 6-metal CMOS technology. Figure 9 shows the die photo of the Maia chip and Table 1 summarizes the implementation statistics of the chip.

Table 1: Chip Characteristics

Technology                   0.25 µm 6-level metal CMOS
Main Supply Voltage          1 V
Additional Voltages          0.4 V, 1.5 V
Die Size                     5.2 mm x 6.7 mm
Transistor Count             1.2 million transistors
Average Cycle Speed          40 MHz
Average Power Dissipation    1.5 - 2 mW

Table 2: Performances of hardware modules

Hardware module         Pipeline speed (ns)   Energy per operation (pJ)   Area (mm2)
MAC                     24                    21                          0.25
ALU                     20                    8                           0.09
Memory (1K x 16)        14                    8                           0.32
Memory (512 x 16)       11                    7                           0.16
Address generator       20                    6                           0.12
Interconnect network    10                    1*                          NA
FPGA                    25                    18**                        2.76

*This number is the average energy consumption per connection.
**This number is the average energy consumption across various arithmetic functions.

Table 2 shows the performance of the different chip components (based on a per-block analysis) from PowerMill simulation.

Figure 10 (see the end of the paper, after the references) illustrates the signals that are available at the I/O pins. During the TEST mode, all satellites and the reconfigurable interconnect can be configured by writing to the Taddr and Tdata pins (to the ConfigAdd and ConfigData buses), and the result of the computation can be read on the Tdata and FIQ pins (from the ReadData and ACK buses). In addition, simple programs can also be fed to the ARM8 via the Tdata pins to test satellite configuration reading and writing. The current test set-up supports the test mode described above, and a board to verify the SYSTEM mode is being designed. The HP 16702A logic analysis system was used for generating the test vectors (derived from TimeMill simulations) for the TEST mode. Pattern acquisition was used for verifying the results of the computations after detecting the end of a kernel using an external interrupt signal.

The energy and performance of all kernels are tested in the TEST mode. Based on this information, the estimated energy dissipation of the processor when programmed for a VSELP voice coder (with 1.8 mW total power consumption) is presented in Table 3, including a breakdown of the energy over the major functions. Dominant kernels are directly mapped onto hardware satellites, and their run-time reconfiguration is performed by the ARM core. Therefore, the kernel energy presented in the table incorporates contributions from both the satellites and the ARM8 configuration. The program control part of the algorithm is completely mapped to software. The total energy efficiency is a factor of 8 better than the best reported in the literature [18].
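The relationship between the Table 3 energy figures and the quoted 1.8 mW can be checked with a few lines of arithmetic. The sketch below copies the Table 3 values; the variable and function names are illustrative only.

```python
# Energy per second of speech, in mJ, copied from Table 3.
kernels_mj = {
    "Dot product": 0.738,
    "FIR filter": 0.131,
    "IIR filter": 0.021,
    "Vector sum with scalar multiply": 0.042,
    "Compute code": 0.011,
    "Covariance matrix compute": 0.006,
}
program_control_mj = 0.838


def average_power_mw(energy_mj, seconds=1.0):
    """Energy in mJ divided by time in seconds gives average power in mW."""
    return energy_mj / seconds


kernel_total_mj = sum(kernels_mj.values())
total_mj = kernel_total_mj + program_control_mj
kernel_share = kernel_total_mj / total_mj

print(f"kernels: {kernel_total_mj:.3f} mJ, total: {total_mj:.3f} mJ")
print(f"average power: {average_power_mw(total_mj):.3f} mW")
print(f"kernel share of energy: {kernel_share:.1%}")
```

The rows add up to the 1.787 mJ total listed in Table 3, which over one second of speech is an average of about 1.8 mW. Roughly 53% of that energy is spent in the kernels mapped to satellites (including their ARM8 configuration) and the remainder in program control on the microprocessor.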

Figure 9. Maia die photo (labeled blocks: MEM, AGU, FPGA, ALU, MAC, interconnect network, interface, ARM8 core)

Table 3. VSELP energy breakdown

Functionality                          Energy consumption (mJ) for 1 sec of VSELP speech processing
Kernels:
  Dot product                          0.738
  FIR filter                           0.131
  IIR filter                           0.021
  Vector sum with scalar multiply      0.042
  Compute code                         0.011
  Covariance matrix compute            0.006
Program control                        0.838
Total                                  1.787

6 CONCLUSION

In this paper, Pleiades, a heterogeneous reconfigurable architecture template, is introduced and a design methodology to map algorithms to architectures is summarized. The details of the design and implementation of an instance of the Pleiades architecture are presented. The implementation echoes the current trend in system-on-a-chip design, which contains embedded components of various flexibility and reconfigurability (microprocessor, ASICs, FPGA). The heterogeneity and reconfigurability of the architecture prove to be very energy efficient when compared to state-of-the-art programmable processors.

7 ACKNOWLEDGEMENTS

We would like to acknowledge DARPA's support for the Pleiades project (DABT-63-96-C-0026). The authors would like to thank Seno Katsunori and Yuji Ichikawa for their early work on the Pleiades prototype and evaluation. We would also like to acknowledge the other members of the Maia design team.

8 REFERENCES

[1] M. Goel and N. R. Shanbhag, "Low-power equalizers for 51.84 Mb/s very high-speed digital subscriber loop [VDSL] modems", Proceedings of IEEE Workshop on Signal Processing Systems, Oct. 1998, Boston.
[2] A. Abnous and J. Rabaey, "Ultra-Low-Power Domain-Specific Multimedia Processors", Proceedings of the IEEE VLSI Signal Processing Workshop, San Francisco, California, USA, October 1996.
[3] A. Abnous et al., "Evaluation of a Low-Power Reconfigurable DSP Architecture", Proceedings of the Reconfigurable Architectures Workshop, Orlando, Florida, USA, March 1998.
[4] J. Borel, "Technologies for multimedia systems on a chip", 1997 IEEE International Solid-State Circuits Conference, pages 18-21.
[5] G. R. Goslin, "A Guide to Using Field Programmable Gate Arrays for Application-Specific Digital Signal Processing Performance", Proceedings of SPIE, vol. 2914, p321-331.
[6] J. Hauser and J. Wawrzynek, "GARP: A MIPS processor with a reconfigurable coprocessor", in J. Arnold and K. L. Pocek, editors, Proceedings of IEEE Workshop on FPGA for Custom Computing Machines, Napa, CA, April 1997.
[7] T. Garverick et al., NAPA1000, http://www.national.com/appinfo/milaero/napa1000
[8] J. M. Rabaey, "Reconfigurable Computing: the Solution to Low Power Programmable DSP", Proc. of 1997 ICASSP Conference, Munich, April 1997.
[9] M. Benes, "Design and Implementation of Communication and Switching Techniques for the Pleiades Family of Processors", Master's Thesis, UC Berkeley, 1999.
[10] M. Wan, D. Lidsky, Y. Ichikawa and J. Rabaey, "An Energy-Conscious Methodology for Early Exploration of Heterogeneous DSPs", Proceedings of CICC 1998.
[11] M. Wan, H. Zhang, V. George, M. Benes, A. Abnous and J. Rabaey, "Design Methodology of a Low-Energy Reconfigurable Single-Chip DSP System", Journal of VLSI Signal Processing 2000.
[12] M. Wan, H. Zhang, M. Benes and J. Rabaey, "A Low-Power Reconfigurable Data-Driven DSP System", Proceedings of the SiPS99.
[13] H. Zhang, M. Wan, V. George, J. Rabaey, "Interconnect Architecture Exploration for Low Energy Reconfigurable Single-Chip DSPs", Proceedings of the WVLSI, Orlando, FL, USA, April 1999.
[14] S. Li, M. Wan and J. Rabaey, "Configuration Code Generation and Optimizations for Heterogeneous Reconfigurable DSPs", Proceedings of SiPS, 1999.
[15] V. George, H. Zhang, J. Rabaey, "Low Energy FPGA Design", Proceedings of ISLPED 1999.
[16] T. Burd, T. Pering, A. Stratakos, R. Brodersen, "A Dynamic Voltage-Scaled Microprocessor System", Proceedings of ISSCC 2000.
[17] Hui Zhang et al., "Low-Swing Interconnect Interface Circuits", Proceedings of ISLPED 1997.
[18] Wai Lee et al., "A 1V DSP for Wireless Communication", Digest of Technical Papers of ISSCC 97.

Figure 10. Maia chip testing strategy (SYSTEM mode uses an off-chip SRAM, TEST mode uses a logic analyzer; I/O pins shown: Taddr<15:0>, Tdata<31:0>, Addr<31:0>, Dq<31:0>, Test, TRwn, TClk, FIQ and other controls; internal buses and widths: Wdata 32, Rdata 32, VAddress 32, ConfigAdd 16, ConfigData 32, ReadData 32, Strobe 22, ACKs 10)