A Reconfigurable Signal Processing IC with Embedded FPGA and Multi-Port Flash Memory

40.3 A Reconfigurable Signal Processing IC with embedded FPGA and Multi-Port Flash Memory M. Borgatti, L. Calì, G. De Sandre, B. Forêt, D. Iezzi, F. Lertora, G. Muzzi, M. Pasotti, M. Poles and P.L. Rolandi STMicroelectronics, Central R&D Agrate Brianza, Italy [email protected] 1. INTRODUCTION ABSTRACT Increasing complexity of system design and shorter time-to- A 1GOPS dynamically reconfigurable processing unit with market requirements are leading research towards the embedded Flash memory and SRAM-based FPGA targets image- investigation of hybrid systems including processors enhanced by voice processing and recognition applications. Code, data and programmable logic [1][2]. Embedded programmable logic allows FPGA bitstreams are stored in the embedded Flash memory and ASIC and ASSP vendors to broaden the appeal of their products. are independently accessible through 3 content-specific, 64-bit NRE reduction and shorter time-to-market are key to OEMs I/O ports with a peak read rate of 1.2GB/s. The system is looking for faster turnaround and lower risk design solutions and implemented in a 0.18um, 2PL-6ML CMOS Flash technology, technology. System integrators can also exploit hardware chip area is 70mm2. programmability for in-house product customization. In this paper we present a pragmatic approach to introduce Categories and Subject Descriptors flexibility in system-chip design and exploit embedded B.7.1 [INTEGRATED CIRCUITS]: Types and Design Styles – programmable silicon fabrics to enhance system performances. In Advanced technologies, Algorithms implemented in hardware, particular, enabling application-specific configurations to adapt Gate arrays, Input/output circuits, Memory technologies, the underlying hardware architecture to time-varying application Microprocessors and microcomputers, VLSI. demands can improve execution speed and reduce power C.1.3 [PROCESSOR ARCHITECTURES]: Other Architecture consumption compared to a general-purpose programmable Styles – Adaptable architectures, Heterogeneous (hybrid) solution. systems. This paper describes a dynamically reconfigurable processing B.3.1 [MEMORY STRUCTURES]: Semiconductor Memories. unit tightly connected to a Flash EEPROM memory subsystem. The reconfigurable processing unit targets image-voice processing C.3 [SPECIAL-PURPOSE AND APPLICATION-BASED and recognition application domains and is implemented by SYSTEMS]: Signal processing systems. joining a configurable and extensible processor core and an SRAM-based embedded FPGA. Application-specific HW units General Terms are added and dynamically modified by embedded FPGA Design, Performance, Algorithms. reconfiguration. By implementing application-specific vector processing instructions, the unit shows a peak computing power of 1GOPS. Efficient read-write-erase access to code, data and Keywords FPGA bitstreams is provided by a specific memory subsystem Application-specific integrated circuits (ASICs), digital signal based on a modular 8Mb, 4-bank Flash memory. It features 3 processors, field-programmable gate arrays (FPGAs), integrated content-specific I/O ports and delivers an aggregate peak read circuit design, multimedia computing, reconfigurable throughput of 1.2GB/s. architectures. The proposed system has been built using a set of state-of-the-art IP cores and system design methodology. Design flows for system exploration and implementation are also described. Permission to make digital or hard copies of all or part of this work for 2. SYSTEM ARCHITECTURE personal or classroom use is granted without fee provided that copies are The system architecture is illustrated in Figure 1. The not made or distributed for profit or commercial advantage and that functional purposes of the embedded FPGA are: i) extension of copies bear this notice and the full citation on the first page. To copy the processor datapath supporting a set of additional special- otherwise, or republish, to post on servers or to redistribute to lists, purpose C-callable microprocessor instructions; ii) bus-mapped requires prior specific permission and/or a fee. coprocessors (connected to the system bus through a master/slave DAC 2003, June 2-6, 2003, Anaheim, California, USA. Copyright 2003 ACM 1-58113-688-9/03/0006…$5.00. 691 32 bit 8KB D$ 8KB I$ External 48 KB External RAM Port PIF/AHB SRAM Memory 64 bit Interface 32 bit PIF BUS BRIDGE 32 bit Extensible External Processor ROM/FLASH Microprocessor Port INTERFACE (PIF) INTERFACE 64 bit AHB BUS Interrupt Manager DMA FP CP DP AHB/APB Bridge Master/Slave AHB Interface Flash Memory FPGA Interrupt Programming Instruction Interface Interface Extension Embedded BUS FPGA Instruction Dual Port 1KB Dual Extension Buffer Port Interface Interface Buffer General Purpose I/O Interface 64 bit APB BUS PC Parallel I2C BUS General Programmable General Purpose PC Parallel Port Port I2C Purpose General Purpose Registers Interface Master I/O Lines I/O Figure 1. System Architecture. interface); iii) flexible I/O (to connect external units or sensors (N+2)x4 128-bit crossbar connects the modular memory with the featuring application-specific communication protocols). four initiators (CP, DP, FP and uP) providing that three banks can be read in parallel at full speed. The memory space of the four Even though such different circuit purposes would require modules is arranged in three programmable user-defined different kinds of programmable logic for best implementation of partitions, each one devoted to a port. either arithmetic-dominated or control-dominated logic, we implemented a single programmable logic fabric to be shared Each 2Mb flash memory module has a 128-bit IO data bus among different purposes both in space (same configuration) and with 40ns access time, resulting in 400Mbyte/s, and a time (subsequent configurations). A single, high I/O count, fine- program/erase control unit. Simultaneous memory operations use grain e-FPGA operates as a datapath for the microprocessor the power management arbiter (PMA) for optimal scheduling. pipeline and as dedicated control logic for bus coprocessor and Available power and user-defined priorities are considered to I/O control interface. schedule conflicting resource requests in a single clock cycle. The memory system allows up to four simultaneous operations (with a FPGA reconfiguration is concurrent to software execution. A limit of one both for write and erase). local bus connects a dedicated 32-bit Flash memory port (FP) to the FPGA programming interface. A DMA channel handles the bitstream transfer while microprocessor fetches instructions and data from different Flash memory ports: 64-bit wide code port ADC DFT 2MbFlash 2MbFlash 2MbFlash 2MbFlash (CP) and data port (DP). To support streaming applications a 1kB module 0 module 1 module 2 module 3 PMA dual-port buffer is used to interface fast decoding hardware and Power slower software running on the processor. Block The memory sub-system architecture is shown in Figure 2. The modular memory (dotted line) includes charge pumps (Power 128 bit Memory Sub-System Crossbar Block), testability circuits (DFT), a power management arbiter 128 128 128 128 (PMA) and a customizable array of N independent 2Mb flash µP interface DP CP FP memory modules, depending on the storage requirements (N=4 in the current implementation). The modular memory features (N+2) 64 64 32 8 bit µP 128-bit target ports and implements a N-bank uniform memory. Data Port Code Port FPGA Port An 8-bit microprocessor (uP) is devoted to handle complex file-system functions (defrag, compression, virtual erase, etc.) not natively supported by DP, and assists for built-in self test. A Figure 2. Flash Memory Architecture. 692 Figure 3 depicts the memory hierarchy and parallelism across The overall performance improvements for the face the system. CP and DP are interfaced to the 64-bit, 800MB/s recognition tasks are shown in Table 1. Execution time is AHB system bus. At a system clock rate of 100MHz each I/O port compared for 32-bit RISC with basic DSP extensions (MAC, can independently operate at maximum speed. So, an aggregate zero-overhead loops, etc) and the same processor enhanced with peak read rate of 1.2GB/s can be sustained as it is limited by application-specific instructions. Measured speed-ups range from memory access time. In the current implementation the e-FPGA 1.8x to 10.6x (on the most-demanding task), with an overall reconfiguration takes 500us @ 100 MHz. 50MB/s average improvement of 8.5x. Notice that switching between algorithm throughput out of the available 400MB/s are currently sustained stages requires only one reconfiguration of the e-FPGA. by the e-FPGA configuration interface. Reconfiguration time is negligible. The speed-up factors take into account the possible multi-cycle clock penalty due to processor- 32 bit Microprocessor FPGA synchronization in case of instruction extensions slower Register File External Processor than the processor clock. Interface & AHB Bridge 64 bit 32 bit Energy efficiency figures are also depicted in Table 1. As the AHB Bus FPGA PI average power consumption of the system extended with the e- FPGA is slightly higher, the energy reduction for executing each 64 bit AHB Port 32 bit Port of the tasks on its specific HW configuration (power-delay 64 bit AHB CP 64 bit AHB DP AHB product improvement) results in an overall reduction of 6.7x. Interface Interface DMA Only one task showed slightly worse total execution energy, 64 bit Port 64 bit Port 32 bit Port DP CP FP though showing benefits on execution speed. Last column of 512 Bytes Page Buffer 2x64 bit + 1x32 bit Figure 5 reports the energy-delay improvement of each specific Flash Memory Port Interfaces HW configuration compared to the general-purpose counterpart. 6x4 128 bit Flash Memory Crossbar Energy required for e-FPGA reconfiguration is always negligible. Measurements show the best energy efficiency in the range of Flash Memory Controller Logic several MOPS/mW at 1.8V supply. It lies between conventional 4x16384x128 bit Memory Module ASIP/DSP and dedicated configurable hardware implementations [2]. Figure 3. Memory Hierarchy. Table 1.

A Reconfigurable Signal Processing IC with Embedded FPGA and Multi-Port Flash Memory

Hardware Architecture

Efficient Processing of Deep Neural Networks

Architectural Adaptation for Application-Specific Locality

Development of a Predictable Hardware Architecture Template and Integration Into an Automated System Design Flow

Computer Architectures an Overview

Integrated Circuit Technology for Wireless Communications: an Overview Babak Daneshrad Integrated Circuits and Systems Laboratory UCLA Electrical Engineering Dept

Hybrid Pipeline Hardware Architecture Based on Error Detection and Correction for AES

Extensible Hardware Architecture for Mobile Robots

Ultra-Low-Power Design and Implementation of Application-Speciﬁc Instruction-Set Processors for Ubiquitous Sensing and Computing

Foundations for the Study of Software Architecture

1. SYSTEM ARCHITECTURE 1.1 System Hardware Architecture

Method of Top-Level Design for Automated Test Systems