Computer Architecture for Machine Percep
Total Page:16
File Type:pdf, Size:1020Kb
A Dedicated Image Processor Exploiting Both Spatial and Instruction-Level Parallelism* A.Broggi, M.Bertozzi, G.Conte F.Gregoretti, R.Passerone, C.Sansoit , L.M .Reyneri Dipartimento di Ingegneria dell’Informazione Dipartimento di Elettronica UniversitA di Parma Politecnico di Torino 1-43100 Parma, Italy 1-10129 Torino, Italy Abstract ing called HACRE (HAndwritten Character REcogni- This paper presents the PAPRICA-3 massively par- tion). The main task of the system was to perform allel SIMD system, designed as n hardware accelerator automatic recognition of the amount handwritten on for real-tim.e image processing tasks. It is composed of the legal and courtesy fields of a bank cheque, at a 0,lin,enr array of single- hit processsing elements, includ- rate of about ten cheques per second. A solution to ing a fairly complex pipelined controller, thus allowing the problem is to design a dedicated parallel processor the system to take advantage also of the intrinsic par- with an instruction set tailored to low-level image pro- allelism in a program. cessing tasks. To meet the cost and size constraints A program,ming en,ciironment has been developed to typical of embedded systems, the hardware accelerator ease the prototyping of applications: a code generator has to be composed of the smallest possible amount converts C++ programs into assembly, and code op- of components, with a single-chip unit as the most timization is performed directly at the assembly level, desirable solution. following a genetic approach. The heart of PAPRICA-3 consists of a linear ar- The effectiveness of the processor, as well as of the ray of processing elements (PES), each one devoted code optimizer, are discussed with the aid of an ap- to a single line pixel. At any time, all PES execute plication example for handwritten character recogni- the same instruction. The system operates on black tion. & white, grey-scale or color bitmap images, so that a pixel is defined as a multi-bit value carrying informa- 1 Introduction tion stored on different image bit-planes. A complete This paper describes PAPRICA-3, a real-time hard- image bit-plane can be seen in different ways, for ex- ware accelerator for low-level image processing appli- ample as an independent black & white image, as part cations. The system is designed to be interfaced to of one or more grey-scale or color images, or as an in- a conventional microprocessor to relieve it from the termediate result of operations performed on any of high computational load required by the first steps of the above objects. image processing algorithms (filtering, noise removal, Each PE can perform logical or morphological oper- contrast enhancement, features extraction, etc.). The ations on an input pixel: logical operations compute a system is able to capture a serial input data stream one-bit result depending on values stored on different coming from an external camera (or pair of stereo cam- image planes of the same pixel, while morphological eras), to process it in parallel on a line-by-line basis, operations evaluate a one-bit result based on the val- and to store the resulting images either on a dedicated ues stored in the neighboring pixels on a single image dual-port memory, accessible also from the conven- plane, using the mathematical morphology approach tional processor, or on a monitor. proposed in [16]. One of the design goals was to build a low-cost, The execution unit, which is common to the whole high-performance system, dedicated to embedded ap- processor array, controls program flow and can per- plications. PAPRICA-3 was initially formulated as the form tests, branches, conditional execution of program core of a high speed system for bank cheques process- segments, loops, and synchronization statements. To ‘This work was partially supported by the CNR and the speed up program execution, a fairly complicated Italian Ministry for University and Scientific Research 40%. pipeline scheme has been adopted to exploit as much 106 0-8186-7987-5197$10.00 0 1997 IEEE Authorized licensed use limited to: UNIVERSITA TRENTO. Downloaded on July 28, 2009 at 11:05 from IEEE Xplore. Restrictions apply. instruction level parallelism as possible. The program This paper presents the whole project which started is stored completely in an internal program memory about 3 years ago and involved two reserach units: (512 instructions wide) writable from the external pro- Parma University and Torino Politechnic. Next sec- cessor, so there is no need for internal cache memory. tion describes the hardware system and architecture; A Status Register File can be used to compute and section 3 presents the programming environment and store global parameters of the processed images, such details the code optimization process, describing also as statistics, degrees of matching between an input and its implementation on a parallel cluster of worksta- reference images, etc.. It is accessible to the external tions; section 4 describes the main application that processor and is aimed to increase execution speed of motivated the development of PAPRICA-3; and fi- a broad class of neural algorithms. nally section 5 concludes the paper with some remarks. Two versions of the PAPRICA-3 system are un- 2 Hardware System Description der development, having the same architecture: the first one is dedicated to the above mentioned hand- As shown in figure 1, the kernel of PAPRICA-3 is writing recognition application and in the following a linear array of Q identical 1-bit PES connected to will be also referenced with the name HACRE, while an image memory via a bidirectional Q-bit wide bus. the second one is of a general purpose aimed to build The image memory is organized in addressable words an image processing system based on a single PCI whose length matches that of the processor array; each bus PC expansion board and will be called generically word contains data relative to one binary pixel plane PAPRICA-3. Two different ASIC chips will therefore (also called layer) of one line of an image (Q bits wide), be built: and a single cycle is needed to load an entire line of data into the PE’s internal registers. When executing HACRE is a single-chip system. The chip is com- a program, the correspondence between the line num- posed of 64 PES, an input video interface ded- ber and the pixel plane of a given image and the abso- icated to a linear image scanner, a lkx64 bits lute line address is computed in hardware by means of internal image memory and an internal 256 ele- data structures, named Image Descriptors, stored in ments Status Register File. It interfaces to a 32 the Control Unit. An Image Descriptor is a memory bit microcontroller with a minimum amount of pointer built up of three parts: a base address, a line external logic. counter and a page counter. Line and page counters can be reset or incremented by a specified amount un- PAPRICA-3 is a modular system. Each chip is der program control. Two counters, called M and N, composed of 32 PES,a generic input/output video are not directly related to image memory addressing, interface and does not include any internal image but can be used to modify program flow: instructions memory and the Status Register File. Instead, are provided to preset, increment and test them. the chip has a fast interface to an external image Data are transferred into internal registers of each memory and a Status Register File, to achieve PE, processed and explicitly stored back into memory two goals: connecting several chips together, it is according to a RISC-like processing paradigm. possible to create a system with a varying num- Each PE processes one pixel of each line and is com- ber of PES; an external but fast image memory posed of a Register File and a 1-bit Execution Unit. will not decrease too much the performances com- The core of the instruction set is based on morpho- pared with an internal memory, but the amount of logical operators [16]: the result of an operation de- memory can be much higher, to cope with mem- pends, for each processor, on the value of pixels in a ory demanding applications. given neighborhood (reduced 5 x 5, as sketched by the grey squares in figure 1). Data from E, EE, W and A programming environment has been developed to WW directions (where EE and WW denote pixels two ease prototyping of applications on PAPRICA-3. Ap- bits apart in E and W directions) may be obtained plication programs are written in C++; a code genera- by direct connection with neighboring PES while all tor first converts C++ statements into assembly code; other directions correspond to data of previous lines subsequently, optimization is performed on the gener- (N, NE, NW, NN) or of subsequent lines (S, SE, SW, ated assembly code. The optimizer performs both de- SS). Operations on larger neighborhoods can be de- terministic optimizations following the standard well composed 121 in chains of basic operations. known techniques and stochastic techniques to im- To obtain the outlined neighborhood, a number of prove performance of the pipeline structure and take internal registers (16, at present) , called Morphological advantage of the intrinsic instructions parallelism. Registers (MOR), have a structure which is more com- 107 Authorized licensed use limited to: UNIVERSITA TRENTO. Downloaded on July 28, 2009 at 11:05 from IEEE Xplore. Restrictions apply. E fA N f % n-bitinputfrom serial camera / k lines of binary image plane i n-bit serial f/ output to monitor Figure 1: General architecture of the Processor Array plex than that of a simple memory cell, and are actu- Image processing algorithms consist of sequences ally composed of five 1-bit cells with a S+N shift reg- of low level steps, such as filters, convolutions, etc., ister connection.