A Dedicated Image Processor Exploiting Both Spatial and Instruction-Level Parallelism*

A. Broggi, M. Bertozzi, G. Conte, F. Gregoretti, R. Passerone, C. Sansoè, L. M. Reyneri

Dipartimento di Ingegneria dell'Informazione, Università di Parma, I-43100 Parma, Italy
Dipartimento di Elettronica, Politecnico di Torino, I-10129 Torino, Italy

Abstract

This paper presents the PAPRICA-3 massively parallel SIMD system, designed as a hardware accelerator for real-time image processing tasks. It is composed of a linear array of single-bit processing elements, including a fairly complex pipelined controller, thus allowing the system to take advantage also of the intrinsic parallelism in a program.

A programming environment has been developed to ease the prototyping of applications: a code generator converts C++ programs into assembly, and code optimization is performed directly at the assembly level, following a genetic approach.

The effectiveness of the processor, as well as of the code optimizer, is discussed with the aid of an application example for handwritten character recognition.

1 Introduction

This paper describes PAPRICA-3, a real-time hardware accelerator for low-level image processing applications. The system is designed to be interfaced to a conventional processor to relieve it from the high computational load required by the first steps of image processing algorithms (filtering, noise removal, contrast enhancement, features extraction, etc.). The system is able to capture a serial input data stream coming from an external camera (or pair of stereo cameras), to process it in parallel on a line-by-line basis, and to store the resulting images either on a dedicated dual-port memory, accessible also from the conventional processor, or on a monitor.

One of the design goals was to build a low-cost, high-performance system, dedicated to embedded applications. PAPRICA-3 was initially formulated as the core of a high speed system for bank cheques processing called HACRE (HAndwritten Character REcognition). The main task of the system was to perform automatic recognition of the amount handwritten on the legal and courtesy fields of a bank cheque, at a rate of about ten cheques per second. A solution to the problem is to design a dedicated parallel processor with an instruction set tailored to low-level image processing tasks. To meet the cost and size constraints typical of embedded systems, the hardware accelerator has to be composed of the smallest possible amount of components, with a single-chip unit as the most desirable solution.

The heart of PAPRICA-3 consists of a linear array of processing elements (PEs), each one devoted to a single line. At any time, all PEs execute the same instruction. The system operates on black & white, grey-scale or color bitmap images, so that a pixel is defined as a multi-bit value carrying information stored on different image bit-planes. A complete image bit-plane can be seen in different ways, for example as an independent black & white image, as part of one or more grey-scale or color images, or as an intermediate result of operations performed on any of the above objects.

Each PE can perform logical or morphological operations on an input pixel: logical operations compute a one-bit result depending on values stored on different image planes of the same pixel, while morphological operations evaluate a one-bit result based on the values stored in the neighboring pixels on a single image plane, using the mathematical morphology approach proposed in [16].

The controller, which is common to the whole processor array, controls program flow and can perform tests, branches, conditional execution of program segments, loops, and synchronization statements. To speed up program execution, a fairly complicated pipeline scheme has been adopted to exploit as much

*This work was partially supported by the CNR and the Italian Ministry for University and Scientific Research (40%).

0-8186-7987-5/97 $10.00 © 1997 IEEE

instruction level parallelism as possible. The program is stored completely in an internal program memory (512 instructions wide) writable from the external processor, so there is no need for an internal cache memory. A Status Register File can be used to compute and store global parameters of the processed images, such as statistics, degrees of matching between input and reference images, etc. It is accessible to the external processor and is aimed to increase the execution speed of a broad class of neural algorithms.

Two versions of the PAPRICA-3 system are under development, having the same architecture: the first one is dedicated to the above mentioned handwriting recognition application and in the following will also be referenced with the name HACRE, while the second one is of general purpose, aimed to build an image processing system based on a single PCI bus PC expansion board, and will be called generically PAPRICA-3. Two different ASIC chips will therefore be built:

- HACRE is a single-chip system. The chip is composed of 64 PEs, an input video interface dedicated to a linear image scanner, a 1K x 64 bits internal image memory and an internal 256 elements Status Register File. It interfaces to a 32 bit bus with a minimum amount of external logic.

- PAPRICA-3 is a modular system. Each chip is composed of 32 PEs and a generic input/output video interface, and does not include the internal image memory and the Status Register File. Instead, the chip has a fast interface to an external image memory and a Status Register File, to achieve two goals: connecting several chips together, it is possible to create a system with a varying number of PEs; an external but fast image memory will not decrease the performance too much compared with an internal memory, but the amount of memory can be much higher, to cope with memory demanding applications.

A programming environment has been developed to ease prototyping of applications on PAPRICA-3. Application programs are written in C++; a code generator first converts C++ statements into assembly code; subsequently, optimization is performed on the generated assembly code. The optimizer performs both deterministic optimizations, following standard well known techniques, and stochastic techniques to improve the performance of the pipeline structure and take advantage of the intrinsic instruction parallelism.

This paper presents the whole project, which started about 3 years ago and involved two research units: Parma University and Torino Politechnic. The next section describes the hardware system and architecture; section 3 presents the programming environment and details the code optimization process, describing also its implementation on a parallel cluster of workstations; section 4 describes the main application that motivated the development of PAPRICA-3; and finally section 5 concludes the paper with some remarks.

2 Hardware System Description

As shown in figure 1, the kernel of PAPRICA-3 is a linear array of Q identical 1-bit PEs connected to an image memory via a bidirectional Q-bit wide bus. The image memory is organized in addressable words whose length matches that of the processor array; each word contains data relative to one binary pixel plane (also called layer) of one line of an image (Q bits wide), and a single cycle is needed to load an entire line of data into the PE's internal registers. When executing a program, the correspondence between the line number and the pixel plane of a given image and the absolute line address is computed in hardware by means of data structures, named Image Descriptors, stored in the controller. An Image Descriptor is a memory pointer built up of three parts: a base address, a line counter and a page counter. Line and page counters can be reset or incremented by a specified amount under program control. Two counters, called M and N, are not directly related to image memory addressing, but can be used to modify program flow: instructions are provided to preset, increment and test them.

Data are transferred into internal registers of each PE, processed, and explicitly stored back into memory according to a RISC-like processing paradigm. Each PE processes one pixel of each line and is composed of a Register File and a 1-bit Execution Unit. The core of the instruction set is based on morphological operators [16]: the result of an operation depends, for each processor, on the value of pixels in a given neighborhood (a reduced 5 x 5, as sketched by the grey squares in figure 1). Data from E, EE, W and WW directions (where EE and WW denote pixels two bits apart in E and W directions) may be obtained by direct connection with neighboring PEs, while all other directions correspond to data of previous lines (N, NE, NW, NN) or of subsequent lines (S, SE, SW, SS). Operations on larger neighborhoods can be decomposed [2] in chains of basic operations.

To obtain the outlined neighborhood, a number of internal registers (16, at present), called Morphological Registers (MOR), have a structure which is more com-


plex than that of a simple memory cell, and are actually composed of five 1-bit cells with a S-to-N shift register connection. When a load operation from memory is performed, all data are shifted northwards by one position and the south-most position is taken by the new line from memory. In this way, data from a 5 x 5 neighborhood are available inside the array for each PE, at the expense of a two line latency. A second set of registers (48, at present), called Logical Registers (LOR), is only 1-bit wide. On LORs and on the central bit of MORs it is possible to perform a different portion of the instruction set, which comprises logical and algebraic operations.

Figure 1: General architecture of the Processor Array

An important characteristic of the system is the integration of a serial-to-parallel I/O device, called Video Interface (VIF), which can be connected to a conventional input imaging device (camera, linear array) and/or to a monitor. The interface is composed of two 8-bit, Q-stage shift registers which serially and asynchronously load/store a new line of the input/output image during the processing of the previous/next line. Two instructions enable the bidirectional transfer between the PE's internal registers and the VIF, ensuring also proper synchronization with the camera and the monitor.

Image processing algorithms consist of sequences of low level steps, such as filters, convolutions, etc., to be performed line by line over the whole image. This means that the same block of instructions has to be repeated many times, and instruction fetching from an external memory is an important source of overhead. Hence we chose to pre-load each block of instructions into an internal memory, named Writable Control Store (WCS, 512 words x 32 bits), and to fetch instructions from there. In this way it is possible to obtain the performance of a fast cache with a hit ratio close to 1, at a fraction of the cost and complexity.

Two inter-processor communication mechanisms are also available to exchange information among PEs which are not directly connected. The first one is the Status Evaluation Network (SEN), shown in figure 2, to which each processor sends the contents of one of its registers (selectable under program control) and which provides a Status Word divided into two subfields. The first one is composed of two global flags, named SET and RESET, which are true when the contents of the specified register are all 1s or all 0s, respectively. The second one is the COUNT field, which is set equal to the number of processing elements in which the con-


tent of the specified register is 1. This global information may be accumulated and stored into the Status Register File and used to conditionally modify the program flow. As an alternative, it may be reloaded into the PE's Register File for further processing.

Figure 2: The Status Evaluation Network

Figure 3: The Interprocessor Communication Network

The second communication mechanism is the Interprocessor Communication Network (ICN), shown in figure 3, which allows global and multiple communication among clusters of PEs. The topology of the communication network may be varied during program execution. In fact, each PE drives a switch that enables or disables the connection between pairs of adjacent processors. The PEs may thus be dynamically grouped into clusters, and each PE can broadcast a register value to the whole cluster within a single instruction. This feature can be very useful in algorithms involving seed-propagation techniques, and in the emulation of pyramidal (hierarchical) processing.

The Image Memory and the Status Register File are external to the processor chip in PAPRICA-3, in order to provide a higher degree of modularity and flexibility at system level, while HACRE provides an internal 1K x 64 bits Image Memory and 256 Status Registers. Besides being used to conditionally modify program flow, the Status Register File can be read by the external processor. This has been used in the HACRE system to implement a neural network, by computing the degree of matching between an image and a set of templates (weight matrices).

As mentioned earlier, PAPRICA-3 was conceived as a modular, high-performance, low-cost and small size image processing system. The first PAPRICA-3 board is currently under development. It will be a single PCI bus PC expansion board. It will incorporate: an array of 128 PEs (4 chips); a PCI 32 bit bus master controller implemented in FPGA; a 64K x 128 bits, 20 ns Image Memory; a Status Register File (an FPGA and a memory chip); and a simple PAL video camera interface (a single standard chip including an A/D converter, sync separators, and pixel and line clock generators) connected to the input VIF.

3 The Programming Environment

3.1 High Level Programming Language

The intrinsic complexity of Assembly language, especially when combined with the SIMD computational model, produces a low programming efficiency; a high level programming language, together with a software tool for the generation of its corresponding Assembly code, were needed to ease the development of software applications.

Different solutions have been considered, ranging from the development of a completely new ad-hoc language to the use of a code generator. Some considerations [8] led to the choice of a code generator based on the C++ object oriented language: its only drawback is the low level of optimization achievable in a single pass (generally compilers use two passes).

The main disadvantage of a compiler-based solution lays in the need for a complete definition of a new language, which must be able to handle not only parallel statements, but also some other sequential functions (such as serial operations, data I/O, input of data from the user, memory management, etc.) which are of basic importance although not dealing with the specific SIMD extension. Conversely, a code generator may inherit these capabilities from the chosen high-level language, and thus its implementation is considerably less complex than the former. Moreover, in this second case the efficiency in the development of applications is significantly increased, since the programmer does not have to learn a new language but only its parallel extensions. Moreover, the use of an object oriented language eases the programmer's task, since the operands acting on the new objects can be used intuitively by the application programmer thanks to the possibility of overloading known operators (such as '+', '-', '*', ...).
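The operator-overloading mechanism described above can be sketched in a few lines. The `Image` class, the LOAD/ADD mnemonics and the register naming below are illustrative assumptions, not the actual PAPRICA-3 code generator API:

```cpp
#include <string>
#include <vector>

// Hypothetical sketch of a C++-embedded code generator: an Image
// object does not hold pixels, it records the assembly-like
// instructions needed to compute itself (all names are invented).
struct Image {
    std::string reg;                         // register holding the result
    static std::vector<std::string> program; // emitted instruction list
    static int next_reg;

    static Image input(const std::string& name) {
        Image r{"R" + std::to_string(next_reg++)};
        program.push_back("LOAD  " + r.reg + ", " + name);
        return r;
    }
    friend Image operator+(const Image& a, const Image& b) {
        Image r{"R" + std::to_string(next_reg++)};
        program.push_back("ADD   " + r.reg + ", " + a.reg + ", " + b.reg);
        return r;
    }
};
std::vector<std::string> Image::program;
int Image::next_reg = 0;
```

Writing `Image c = a + b;` in an application program then emits the corresponding assembly-like instruction as a side effect, which is essentially how operator overloading lets an ordinary C++ program double as a code generator.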


As shown in figure 4, the complete sequence of operations required to build a PAPRICA-3 Assembly program is the following: first the extended C++ code is written and compiled; then the executable obtained is run and the PAPRICA-3 Assembly code is generated. The generation of Assembly code was preferred with respect to the generation of binary machine code, thus requiring a further processing step (the PAPRICA-3 Assembler) to get the final binary code. This choice allows a faster and more efficient debugging of the code-generation functions, as well as a simpler implementation of the following global optimization step, since in the Assembly code comments are also automatically included to ease program comprehension. Moreover, since PAPRICA-3 has been conceived as a real-time image processor, a code optimizer can be used to further improve the generated code. Finally, the PAPRICA-3 Assembler is used to produce the binary machine code.

3.2 Assembly Code Optimization

The optimization process is divided into two main steps: a deterministic one, performed on the initial assembly code, and a stochastic one, iterated on the result of the previous step. The first step is used to reduce both execution time and code size and exploits well known traditional optimization techniques [1]; on the other hand, the stochastic step tends to improve the efficiency of the Assembly code exploiting the characteristics of the pipeline structure, the specific instruction set, or, more generally, the intrinsic parallelism of each instruction [14]. This process is driven by the user's choice of the cost function: execution time, code size, or a given combination of both.

3.2.1 Deterministic Optimization

This first step is aimed at the application of rules that will certainly lead to shorter and faster versions of the program. The techniques used are: Invariant Code Motion, and Removal of Jumps.

3.2.2 Stochastic Optimization

Since in the optimization phase no data are available, data-dependent branches cannot be solved (no branch prediction is made here). For this reason, the code is divided into blocks, which are defined as code segments without data-dependent conditional control-flow constructs. Each block is optimized independently of the others: a first module of the optimizer tool splits the code into blocks, and sends them to independent processes in charge of the optimization, as shown in figure 5. Finally the optimized blocks are re-assembled and the final program is generated.

Each block is optimized following an iterative stochastic approach (as opposed to traditional [9] polynomial solutions [3, 6]): at the i-th iteration, N(i) new programs are generated by the consecutive application of R(i) rules (drawn randomly from a given set) to each program of the (i-1)-th generation. The ones performing better than a given threshold T(i) are kept and will form the i-th generation, while the others are discarded.

Since the population at the first iteration is composed of only the original block, at the beginning of the process (when i is small) the number of rules applied, R(i), is high, thus allowing a large spreading of the initial population. As soon as i becomes large, the number of rules applied to the same block decreases to 1, for a fine tuning of the solution (see figure 6.a). A linear slope of R(i) has been chosen to simplify the development of the tool, but more complex functions are under evaluation.

Figure 6: Qualitative behavior of (a) the number of rules applied consecutively to a single block, (b) the threshold T(i) as a function of the iteration number i

Figure 6.b shows the qualitative behavior of the threshold T(i), which, starting from a value twice as large as the cost featured by the initial program, decreases linearly with i (the slope of T(i) determines the convergence speed). This implies that at the beginning of the iterative process (i small), solutions with a cost higher than the initial one can be kept (this is of basic importance to escape from local minima), while in the following iterations the threshold limits the number of candidates and keeps only the best solutions. At each iteration the block used to generate the new population is also included in the evaluation process, in order to avoid jittering from the optimal solution that may have been detected. The number of iterations of the stochastic process and the slope of T(i) are a function of the length of the block to be optimized: in the first version its dependence is inversely proportional to the iteration number.

The performance of each new program is deter-




Figure 4: The complete sequence of operations required to build a PAPRICA-3 executable


mined using PiPE, a tool that allows determining the performance of a code segment (i.e., the number of clock cycles required for a complete run). Upon initialization, PiPE reads a description of the internal architecture of the processor, such as the pipeline structure, the structural, data, and control hazards [12], and the instruction set encoding. Then the list of instructions of a given block is sent to the PiPE tool, which computes and returns the total number of clock cycles required to run that segment of code.

Figure 5: Block diagram of the stochastic code optimization

Each rule can move, modify, add, or remove instructions. Every action is fully reversible, thus allowing a return to the originating state. The elementary rules used are: Instruction Scheduling, Instruction Alignment, Loop Fusion and Breaking, Loop Unrolling and Rolling, Instruction Fusion and Breaking, Multiple Assignment Removal, Common Subexpression Elimination, Registers Reallocation.

3.2.3 Parallel Implementation of the Code Optimizer

Since the major drawback of a stochastic approach to code optimization is the high computational power (in terms of both memory and CPU time), a parallel version of the optimizer has been developed. The development of the parallel version started with the analysis of the system requirements and performance:

- the CPU time spent in the evaluation of the performance of a given segment of assembly code is up to 90% of the overall CPU time;

- the larger the number of iterations, the better the quality of the result, but also the larger the amount of memory required.

Different approaches for the development of the parallel version of the optimizer were considered (code splitting and assembling is anyway a serial step which is performed by a single master processor). The approach followed in the development of the current parallel version (using standard MPI libraries) is based on the distribution of the performance evaluation step only: a master processor splits the code into blocks and generates the new blocks to be evaluated. The new blocks are queued and sent to different processors for the evaluation of their performance. When a processor completes the evaluation phase, it returns the number of clock cycles to the master node and signals


its availability to evaluate a new code segment. Due to the different computational power required by (i) the evaluation phase (90%) and (ii) the rest of the processing, this approach balances the load even in the case of a non-homogeneous cluster of nodes.

The rules that are currently operative in the first version of the tool, and that have been used in the examples described in this work, are: Invariant Code Motion, Removal of Jumps, Instruction Scheduling, Instruction Alignment, Loop Fusion and Breaking, Loop Unrolling and Rolling; while Instruction Fusion and Breaking, Multiple Assignment Removal, Common Subexpression Substitution, and Registers Reallocation are currently under test. The future integration of these rules will lead to a higher degree of optimization.

4 Applications and Performance

This section describes a practical application of PAPRICA-3 and compares its performance with and without pipeline and with commercial processors. The application proposed is part of an automatic handwriting recognizer for bank cheques called HACRE [10], which is made of two major blocks strictly interacting: an image preprocessor and neural recognizer, and a context analysis subsystem. The former is implemented on PAPRICA-3, while the latter, which uses the redundancy contained in the amount written twice to reduce the error probability, runs on a commercial PC but is not described here. Refer to [10] for a complete and more detailed description.

The proposed system and its component blocks are representative of the algorithms used in a wide range of applications of image processing, such as: handwriting recognition [7, 13], mailing address interpretation, document analysis, signature verification, banking documents processing, quality control, etc.

The problem of handwriting recognition is a strategic, although quite challenging one. Besides the classical problems encountered in reading machine-printed text, such as word and character segmentation and recognition, the domain of handwritten text recognition has to deal with other problems such as, for instance, the evident similarity of some characters with each other, the unlimited variety of writing styles and habits of different writers, and also the high variability of character shapes issued from the same writer over time. These are a few of the reasons why handwriting recognition requires very powerful computing systems, mainly to deal with real-time requirements.

The image preprocessor and neural recognizer are made of three cascaded subsystems: a mechanic-optical scanner, to acquire a bit-map image of the cheque; an image preprocessor for preliminary filtering, scaling, and thresholding of the image; and a neural subsystem, based on an ensemble of neural networks, which detects character centers and provides hypotheses of recognition for each detected character.

Figure 7: Block diagram of the image preprocessor and neural recognizer.

The second and third blocks are both executed on a number of PAPRICA-3 chips (two, in the prototype) to cope with the tight real-time requirements of the application (target speed is more than 10 cheques per second). As shown in fig. 7, they are made of the cascade of the blocks listed below.

1. the WINDOW EXTRACTOR acquires the input image from the SCANNER, at a resolution of 200 dpi, gray scale, 16 levels. The scanner is an 876-pixel linear CCD scanned mechanically over the image, at a speed of 2 m/s (equivalent to about 1,200 characters/s). Character recognition takes place in two smaller areas of known coordinates, containing the legal and the courtesy amounts, respectively. These areas, 128 x 500 and 128 x 1500 pixels in size, respectively, are extracted from the scanned image by means of an ad-hoc CCD controller. The system has to process about 2.8 Mpixel/s of data.

2. the FILTER block computes a simple low-pass filter with a 3 x 3 pixel kernel.

3. the BRIGHTNESS block compensates for the non-uniform detector sensitivity and paper color. A pixel-wise adaptive algorithm shifts the level of white to a pre-defined value and computes the contrast, while the image is scanned. White level and contrast are periodically damped, to reduce the influence of very bright and very black small spots.

4. the THRESHOLD block converts the gray-scale image, after compensation, into a B/W image; the image is compared against an adaptive threshold which is a function of both the white level and the contrast.

5. the BASELINE block detects the baseline of the handwritten text, which is a horizontal stripe intersecting the text in a known position, as shown


in fig. 8.g (left side). Unlike the other steps of preprocessing, the baseline cannot be detected only by means of local algorithms (namely, algorithms with limited neighborhood), as it is a global parameter of the entire image; it is therefore available only at the end of the image, as shown in fig. 8.g.

6. the THINNING block reduces the width of strokes to 1 pixel. Thinning is a morphological operator [16] which reduces line width, while preserving stroke connectivity. Thinning is required to properly identify character features (see next block).

7. the FEATURES block detects and extracts from the image a set of 12 useful stroke features, which are helpful for further character recognition. Features are not used alone, but together with the neural recognizer to improve recognition reliability, as described in [10].

8. the FEATURE REDUCTION block reduces the number of features, by removing both redundant and useless features. Examples of redundant features are evident in the vertical stroke of digit "9" shown in fig. 8.h.

9. the ZOOM block evaluates the vertical size of the manuscript and scales it in size to approximately 25-30 pixels. Scaling the size eases the task of the neural recognizer, as the number of weights required for proper recognition is reduced.

10. the COMPRESS block reduces image size further by a factor 2, approximately, by means of an ad-hoc topological transformation which does not preserve image shape, although it preserves its connectivity. At this point, after all preprocessing steps, the B/W image is ready for the following neural recognition steps. Each character is compressed into 252 bits, which are then reorganized into four adjacent processor memory words (4 x 64 bits), to optimize the performance of the neural detector.

11. the CENTERING DETECTOR takes care of detecting the location of character centers. Most handwriting recognizers [7, 13, 15] require an independent character segmenter which segments each word into individual characters. Unfortunately segmentation is a quite difficult task [7, 13], especially for handwritten texts, as there is no clear separation between consecutive characters. For this reason, we have decided to use an integrated segmentation and recognition (ISR) approach [10], in which character segmentation is tightly integrated with character recognition, and no preliminary segmentation is required. This approach has one main advantage: localization of character centers can be partially implemented by means of a neural network, therefore the system can self-learn to detect centers from a set of appropriate training examples (training set). This further reduces development time. In detail, the CENTERING DETECTOR is made of two independent blocks: a neural detector, a two-layer (252 x 30 x 3 x 1) WRBF+MLP network [15, 11], trained to detect character centers; and a feature-based detector, which attempts to locate character centers according to the distribution and occurrence frequency of character features. Fig. 8.i shows the results of the two center detectors. It is clear that each character center is localized at more than one position, but the context analysis subsystem [10] easily filters them.

Figure 8: Preprocessing steps of handwritten images: a) original image, 200 dpi, 16 gray levels; b) low-pass filtered image; c) brightness compensation; d) thresholded image; e) spot noise removal; f) thinning, after 6 steps; g) finding baseline (at the left side of the image); h) features detection (features are tagged by small crosses); i) centering detection: neural (solid lines) and feature-based (dashed lines).

12. the CHARACTER RECOGNIZER recognizes each individual character, using a neural network trained from an appropriate training set. At this stage


Table 10: Execution times of the preprocessing blocks. PAPRICA-3 figures are worst-case values for a 64-PE system; Pentium 90 MHz and Sparc 10 figures refer to morphological and (where available) ad-hoc implementations of the same steps.

                             PAPRICA-3 (worst case)    Pentium 90 MHz         Sparc 10
Image preprocessor           µs/line    ms/cheque      morphol.    ad-hoc     morphol.
                                                       (µs/line)   (µs/line)  (µs/line)
WINDOW EXTRACTOR + FILTER      1.38        2.76                      1,370
BRIGHTNESS                     2.95        5.90                      1,660
THRESHOLD                      0.91        1.82                        570
THINNING                       8.34       16.7           7,390                  7,850
BASELINE                       4.24        8.28          4,320                  3,890
FEATURES                       3.05        6.10          9,490                 10,430
ZOOM                           2.24        4.48            820         160        760
COMPRESS                      30.8        61.6          21,350                 24,950
OTHER (VARIOUS)                3.25        6.50          3,330                  5,020
TOTAL PREPROCESSING           57.2       114            51,200                 56,500
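The THINNING entry of table 10 benchmarks the classical morphological thinning used in the preprocessing chain: boundary pixels are peeled away until strokes are 1 pixel wide, while connectivity is preserved. On PAPRICA-3 this runs as SIMD assembly on the PE array; the sketch below is only an illustration in plain Python of one well-known thinning scheme (Zhang-Suen), and all function names are ours, not taken from the paper.

```python
# Illustrative sketch of morphological thinning (Zhang-Suen scheme),
# NOT the PAPRICA-3 implementation. Images are lists of lists of 0/1
# and are assumed to have a 1-pixel background (zero) border.

def neighbours(img, r, c):
    """8-neighbours of (r, c) in clockwise order p2..p9 (N, NE, E, SE, S, SW, W, NW)."""
    return [img[r-1][c], img[r-1][c+1], img[r][c+1], img[r+1][c+1],
            img[r+1][c], img[r+1][c-1], img[r][c-1], img[r-1][c-1]]

def thin(img):
    """Iteratively peel boundary pixels until strokes are 1 pixel wide."""
    img = [row[:] for row in img]            # work on a copy
    changed = True
    while changed:
        changed = False
        for step in (0, 1):                  # the two Zhang-Suen sub-iterations
            to_delete = []
            for r in range(1, len(img) - 1):
                for c in range(1, len(img[0]) - 1):
                    if img[r][c] == 0:
                        continue
                    p = neighbours(img, r, c)
                    b = sum(p)               # number of set neighbours
                    # a = number of 0 -> 1 transitions in the circular sequence p2..p9
                    a = sum((p[i] == 0) and (p[(i + 1) % 8] == 1) for i in range(8))
                    if 2 <= b <= 6 and a == 1:
                        if step == 0 and p[0]*p[2]*p[4] == 0 and p[2]*p[4]*p[6] == 0:
                            to_delete.append((r, c))
                        if step == 1 and p[0]*p[2]*p[6] == 0 and p[0]*p[4]*p[6] == 0:
                            to_delete.append((r, c))
            for r, c in to_delete:           # apply deletions after the full scan
                img[r][c] = 0
                changed = True
    return img
```

The a == 1 condition is what preserves stroke connectivity: a stroke that is already 1 pixel wide is returned unchanged, while thicker regions lose boundary pixels on each pass.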

of processing, context is not yet considered, as it is used by the context-based subsystem described in [10]. The neural network is a two-layer (252 x 75 x 1) WRBF network [15, 11], which is "triggered" for each character center detected by the CENTERING DETECTOR.
As shown in table 10, the CHARACTER RECOGNIZER is the slowest piece of code, due to the large number of synaptic weights involved. Fortunately it is computed at a relatively slow rate, namely every 15 lines on average, so its effect on the overall computing time is limited.

Table 10 lists the execution times of the various blocks, and compares the performance of HACRE with and without pipelining, with and without optimization, and with a Pentium 90 and a Sparcstation 10 running the same algorithms. Figures for PAPRICA-3 are given for a system with 64 PEs, with and without using the pipeline, considering a 15 ns cycle time plus 30 ns memory access time, and a 50 ns cycle time plus 50 ns memory access time, respectively.

All programs have also been tested on both a Pentium and a Sparc, using the same algorithms based on mathematical morphology, which are well suited to the specific problem of bitmap processing and character recognition. Some of them (FILTER, BRIGHTNESS, THRESHOLD, NEURAL RECOGNIZER) could be implemented more efficiently on a sequential computer using more traditional methods; these have been implemented on the Pentium and their performance is listed in table 10 for comparison.

The performance of PAPRICA-3 is two to three orders of magnitude better than that of the Pentium and the Sparc for almost all the programs considered; this difference reduces slightly when the system has to crunch numbers with many bits (typically, 32-bit fixed point or 64-bit floating point).

As the first eight programs are relatively short, they do not affect the performance of the whole application very much, but their performance is presented as a comparison between systems. Note that running the small programs individually exposes them to the pipeline fill-up overhead (the time it takes to reach a steady state), which somewhat penalizes the pipelined solution (up to 5-10% overhead). The real speedups of those pieces of code, when inserted in a larger program, are higher than those shown in the table.

5 Conclusions

Thanks to the specific solutions adopted, the PAPRICA-3 system fully exploits both the spatial parallelism of the processor array and the intrinsic parallelism found in the sequence of assembly instructions. The code generator allows the user to develop applications without worrying about the spatial parallelism of the machine, while the assembly code optimizer improves the final code by exploiting instruction-level parallelism. Due to the high computational power required for code optimization, the optimizer is useful only in the final optimization of programs whose execution time is a key parameter.

Recently, not only has the application to bank cheques been developed, but the use of PAPRICA-3 to speed up the real-time processing of images for the automatic guidance of a passenger car [5] has been investigated as well. Images are acquired by cameras installed on board ARGO, a Lancia Thema passenger car, and are processed in order to detect the lane position and to localize possible obstacles [4]. Within this framework, the PAPRICA-3 system is now being compared to general-purpose systems, in particular to a Pentium 200 MHz with MMX technology. From the preliminary results obtained in the first experiments on low-level image processing, the PAPRICA-3 system is 2 to 3 times faster than the Pentium processor. The experiments deal mainly with low-level processing of images, where both systems can take advantage of their internal structure; more precisely, the MMX Pentium is being programmed directly at assembly level since, up to now, public-domain C compilers (GCC ver. 2.7.2) do


not support MMX-based instructions.

References

[1] A. Aho, R. Sethi, and J. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, Reading, MA, 1986.

[2] G. Anelli, A. Broggi, and G. Destri. Decomposition of Arbitrarily Shaped Binary Morphological Structuring Elements using Genetic Algorithms. IEEE Transactions on PAMI, 1997. In press.

[3] S. J. Beaty, S. Colcord, and P. H. Sweany. Using Genetic Algorithms to Fine-Tune Instruction-Scheduling Heuristics. In Proceedings MPCS-96 - IEEE International Conference on Massively Parallel Computing Systems. IEEE Computer Society - EuroMicro, May 6-9 1996.

[4] M. Bertozzi and A. Broggi. GOLD: a Parallel Real-Time Stereo Vision System for Generic Obstacle and Lane Detection. IEEE Transactions on Image Processing, 1997. In press.

[5] M. Bertozzi and A. Broggi. Vision-based Vehicle Guidance. IEEE Computer, 30(7):49-55, July 1997.

[6] M. J. Bourke III, P. H. Sweany, and S. J. Beaty. Extending List Scheduling to Consider Execution Frequency. In Proceedings of the 29th Hawaii International Conference on System Sciences, pages 193-202. IEEE Computer Society, 1996.

[7] R. Bozinovic and S. Srihari. Off-line cursive script word recognition. IEEE Trans. on PAMI, 11:68-83, Jan. 1989.

[8] A. Broggi. The PAPRICA-3 Programming and Simulation Environment. In Proceedings MPCS-96 - IEEE International Conference on Massively Parallel Computing Systems, Ischia, Italy, May 6-9 1996. IEEE Computer Society - EuroMicro.

[9] D. J. DeWitt. A Machine Independent Approach to the Production of Optimal Horizontal Microcode. PhD thesis, Dept. of Computer and Communication Sciences, University of Michigan, Ann Arbor, 1976.

[10] F. Gregoretti, B. Lazzerini, A. Mariani, and L. Reyneri. Mixing neural networks and context-based techniques in a high-speed handwriting recognizer. Submitted to IEEE Trans. on Pattern Analysis and Machine Intelligence.

[11] S. Haykin. Neural Networks: A Comprehensive Foundation. Macmillan College Publishing Company, 1994.

[12] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Inc., 1990.

[13] F. Kunihito and W. Nobuaki. Handwritten alphanumeric character recognition by the neocognitron. IEEE Trans. on Neural Networks, 2(3):355-365.

[14] B. Ramakrishna Rau and J. A. Fisher. Instruction-Level Parallel Processing: History, Overview, and Perspective. The Journal of Supercomputing, pages 9-50, 1993.

[15] L. Reyneri. Weighted radial basis functions for improved pattern recognition and signal processing. Neural Processing Letters, May 1995.

[16] J. Serra. Image Analysis and Mathematical Morphology. Academic Press, London, 1982.

