Low Power Image Processing: Analog Versus Digital Comparison Jacques-Olivier Klein, Lionel Lacassagne, Herve Mathias, Sebastien Moutault, Antoine Dupret

Low power image processing: analog versus digital comparison Jacques-Olivier Klein, Lionel Lacassagne, Herve Mathias, Sebastien Moutault, Antoine Dupret To cite this version: Jacques-Olivier Klein, Lionel Lacassagne, Herve Mathias, Sebastien Moutault, Antoine Dupret. Low power image processing: analog versus digital comparison. 2005, pp.111-115, 10.1109/CAMP.2005.33. hal-00015955 HAL Id: hal-00015955 https://hal.archives-ouvertes.fr/hal-00015955 Submitted on 15 Dec 2005 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Low power Image Processing: Analog versus Digital comparison Jacques-Olivier Klein, Lionel Lacassagne, Herve´ Mathias, Sebastien´ Moutault and Antoine Dupret Institut d’Electronique Fondamentale Bat.ˆ 220, Univ. Paris-Sud, France Email: [email protected] Abstract— In this paper, a programmable analog retina is SENSORS ARAM MUX3 A/D-PROC I/O presented and compared with state of the art MPU for embedded Y - D Q imaging applications. The comparison is based on the energy D requirement to implement the same image processing task. E C D Q Results showed that analog processing requires lower power O consumption than digital processing. In addition, the execution D D Q E time is shorter since the size of the retina is reasonably large. R D Q I. INTRODUCTION Smart sensors, vision chips [3]–[6] have potential to take X-DECODER an increasing part in navigation or surveillance systems: toys or industrial robots, car driving assistance. For this class Fig. 1. Array and decoder architecture. of applications, one has to provide vision systems which feature high processing capabilities, low cost, compactness and reduced power consumption. In a previous paper [10] we A. Architecture of rows introduced the architecture of the X-Cell, a universal analog Each row of the retina is organized around two mixed computation cell. Compared to its digital counterpart, lower analog-digital buses used to connect various functional units power consomption and reduced silicium area are expected. (see Figure 2). The functional units that can compose the row Such statement has to be proven with fairly quantitative study. of a vision chip are: the rows of photosensors, the row of Consequently, we propose a comparison between a vector of analog memory map, the set of analog registers, the Analog X-Cell dedicated to image processing called PARIS and a Processing Unit (X-Cell), the Boolean Processing Unit and similar digital architecture comprising SIMD units: PowerPC few special registers. These last are notably required for I/O G4 Altivec. This comparison is performed using a well- and global operators. In each processor, linear processings known algorithm, representative of image processing task: are handle by the Analog Processing Unit. Boolean units edge detector. We present a detailed implementation on both associated to the condition register allows to achieve different architectures and focus on the hot spots for an optimized operations according to locally stored values. Binary data implementation. Two benchmarks are provided, the first one stemming from a comparator are combined by the Boolean is about the execution time only to estimated the efficiency Processing Unit and can be written in a condition register. of general purpose processor as a challenger to dedicated Mixed registers will then be modified wherever this condition architectures, the second deals with the most embedded con- is true. Such architecture paves the way to numerous linear, straining criterion: power consumption. isotropic or not algorithms [8]. II. PARIS ARCHITECTURE In most vision chips, photodetectors form an array. With our programmable approach, photodetectors are as- BOOLEAN UNIT sociated to memory elements, them also organized in ar- SENSORS & MEMORY BUS ANALOG & DIGITAL ray. These arrays are bordered on one of their side by a MUX I / O PROCESSING BUS column of analog/digital processors (see Fig. 1). Operations are performed sequentially on columns while snapshot mode PHOTOSENSORS MEMORY-MAP Register Bank Analog Processing image acquisition is concurrently achieved. A decoder selects then the column reached by processors. Furthermore, each processor access to a set of rows by the way of a mux (MUX3). Finally, fully random addressing can be convenient for reading Fig. 2. Architecture of rows and writing images. B. Generic functional units 4) The resulting voltage on capacitors C3 and C4 is copied onto capacitors C5 and C6 by way of the V -BUS. It performs a copy in voltage mode. The configuration of switches allows to do or not a change of sign by reversal Global V-BUS of the target capacitor during the copy. VREF read D. Analog processing unit write The analog processor is constituted by a set of capaci- Local Q-BUS tors associated to switches allowing various configurations. It includes a set T of processing capacitors associated to registers-capacitors (cf. Fig. 5). To improve accuracy, each Fig. 3. Generic Functional Unit capacitor is an instance of a unitary capacitor Cu. Let define the weight of a set S of capacitors, the dimensionless quantity: Derived from [10], each functional unit is organized around 1 P × Ci, where Ci is the capacitance of the ith capacitor Cu i∈S one OTA, a set of capacitors associated to switches and of of S. Typicaly, four capacitors of weights 1, 1, 2 and 4 two buses: a global one, and a local one (see Figure 3). The compose the T set, allowing 3-bit processing. global bus, which is dedicated to inputs/outputs, is named V - More general operation of the analog processor, BUS. It is intended to distribute a value represented by a multiplication-accumulation can be decomposed into three voltage, therefore allowing to realize non-destructive copies. steps: Load, Distribute, Accumulate. For each of these 3 The voltage is forced by the output of one OTA or by the steps, a set of the implied capacitor (respectively L, B, A) is output of a digital cell. A voltage mode operating drastically considered. reduce its sensitivity to parasitic capacitors. The local Q- • During the first step (Load), the set L ⊂ T (of weight l) BUS, is intended to realize charge transfers and balancing. is charged by a positive or negative voltage-copied. The The charge transfer is used to perform accumulations while input voltage V n is copied (positively or negatively) in a division is based on charge balancing. The voltage of the i subset L ⊂ T of capacitors (of weight l). After loading, Q − BUS is set to V by the output of one OTA thanks REF the charge Q , stored in set L, is Q = ±l × V n × C . to a feedback loop. So, its parasitic capacitor keeps its charge L L i u • During the second step (Balancing), the charge Q is and thus has little impact during the transfer of charges [10]. L distributed on the set B ⊃ L (of weight b), so that each 1 C. Operating with switched capacitors capacitor belonging to B has a voltage VB = b × ±l × All the functional units are based on switched capacitors Vin. structures. Four different operations are used. They are illus- • Finally, the last step (Accumulation) consists in adding trated by an example on the scheme given figure 4. At the charges stored in a set of capacitors A ⊂ B, of weight a, on a register-capacitor CR of capacitance Cu. So: a×l VC (t + 1) = VC (t) ± × Vin. V-BUS Vout +/ - VOLTAGE COPY R R b a) : Hence, the realized operation is a fractional multiplication/accumulation (FMAC) with a coefficient a × l. VREF b Obviously, if B = L, step 2 can be omitted and the realized C5 operation is a integer multiplication/accumulation (IMAC) C C3 C4 V5 C6 0 V6 C1 C2 with a coefficient a. As a consequence, the MAC instruction VREF duration is 2 (IMAC) or 3 cycles (FMAC). Table I describes a Q1 Q2 Q3 Q4 +/ - VOLTAGE COPY Q-BUS (V-) Q3' Q4' subset of the X-Cell instructions. MR represents any mixt RESET Q -TRANSFERT Q -BALANCE register and BR the boolean register. LC stands for Local Condition. Fig. 4. Operation of the switched capacitors Instruction Description Cyc. IMAC MR,Vin ± a MR ← MR ± a × Vin 2 FMAC MR, al/b a×l Vin ± MR ← MR ± b × Vin 3 instant all the switches close, the charge of all the capacitors ASTR MR,Vout Vout ← MR 1 are modified: CLR MR MR ← 0 1 CMP MR Vout ← (MR > 0) 1 1) The capacitor C0, is shorten, thus reset. WHERE LC ← BR 1 2) The capacitors C1 and C2 are also emptied of their WHERN LC ← notBR 1 ENDWH LC ← T RUE charges, Q1 and Q2, which flow by way of the Q-BUS 1 <OP> MR BR ← BR op MR 1 to capactors C3 and C4. It is a cumulative transfer of BSTR MR MR ← BR 1 charges. TABLE I 3) The total charge Q1 +Q2 +Q3 +Q4 divides up between X-CELL INSTRUCTION SUBSET two parallel capacitors, C3 and C4, in proportion to their respective capacitance. It achieves charge balancing. algorithm like edge detection. The closest ”software” archi- R MUX BOP D Q 3:0 4-1 tecture are the General Purpose Processor with multimedia BOOLEAN SIMD extension (also called SWAR for SIMD Within A REGISTER (BR) VTHD Register). The most embedded GPP are the PowerPC Altivec and Intel Centrino.

Low Power Image Processing: Analog Versus Digital Comparison Jacques-Olivier Klein, Lionel Lacassagne, Herve Mathias, Sebastien Moutault, Antoine Dupret

Elementary Functions: Towards Automatically Generated, Efficient

Run-Time Adaptation to Heterogeneous Processing Units for Real-Time Stereo Vision

A Compiler Target Model for Line Associative Registers

High Performance Motion Detection: Some Trends

Low Overhead Memory Subsystem Design for a Multicore Parallel DSP Processor

FULLTEXT01.Pdf

Scalar Arithmetic Multiple Data: Customizable Precision for Deep Neural Networks

The Design of Scalar AES Instruction Set Extensions for RISC-V

An Instruction Set Extension to Support Software-Based Masking

ESA IP Core Extensions for LEON2FT

Software Component Template V4.2.0 R4.0 Rev 3

Perception-Motivated Parallel Algorithms for Haptics