Microprocessors and Microsystems 23 (1999) 35–41

A programmable image processor for real-time image processing applications

M.Y. Siyal a,1,*, M. Fathy b,2

a School of EEE, Nanyang Technological University, Nanyang Avenue, Singapore, Singapore 639798
b Department of Computer Engineering, Iran University of Science and Technology, Narmark, Tehran, Iran

Received 22 September 1998; received in revised form 25 October 1998; accepted 12 November 1998

* Corresponding author. Tel.: 0065 790 4464; fax: 0065 790 0415. E-mail address: [email protected] (M.Y. Siyal).
1 E-mail: [email protected] (M.Y. Siyal).
2 E-mail: [email protected] (M. Fathy).

Abstract

In this article we present an architectural design and analysis of a programmable image processor, named Snake. The processor was designed with a high degree of parallelism to speed up a range of image processing operations. The data parallelism found in array processors has been incorporated into the architecture of the proposed processor. The implementation of commonly used image processing algorithms and their performance evaluation are also discussed. The performance of Snake is also compared with other types of processor architectures. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: SIMD; RISC; CISC; Array processor; Parallel processing; Image processing; Real-time

1. Introduction

Low-level image processing operations usually involve simple and repetitive operations over the entire input images [1–5]. In real-time applications, these input images are required to be processed at a rate of 25 frames per second (in most countries). Hence, with every frame consisting of 256 K pixels (512 × 512 resolution), tens of megabytes of data must be processed every second. To provide such a high throughput rate, parallel implementation has to be investigated [8,9].

Area processing [2], including convolution and morphological operations, is present in nearly all feature extraction algorithms and therefore plays a vital role in image processing applications. However, because of their dependency on neighbouring pixels, these operations become computationally expensive [1,7,8], so developing suitable architectures to implement them efficiently is an important task for real-time image processing based applications. Theoretically, the highest throughput rate will be achieved when a perfect match is met between the processor architecture and the algorithms [2,5]. Keeping this in mind, the design of Snake was undertaken.

The processor was designed to implement various image area operations with a 3 × 3 mask. This is because area processing with a 3 × 3 mask is commonly used in many image processing applications [5,9]. This is particularly true for the image processing algorithms developed for road traffic analysis [8,9]. However, other area processing with larger masks can be easily realised through several 3 × 3 basic mask operations.

2. The architecture of Snake

The whole design reflects the idea of combining the parallelism found in mesh-connected array processors into an efficient pipeline processor, so that high concurrency could be achieved.

2.1. The overall system

Fig. 1 shows the block diagram of Snake. The processor contains four functional units that are connected to each other, forming an efficient four-stage pipeline. The four hardware units are listed below:

Fig. 1. Block diagram of the processor.

1. Address Generation Unit (AGU): This module consists of two sub-units and is responsible for data movement between the processor and the image memory.
2. PE-array Computation Unit: It consists of nine inter-connected processing elements, forming a full mesh-connected 3 × 3 PE-array.
3. Multi-input Arithmetic Logical Unit (MALU): Besides the execution of normal arithmetic and logical functions, several specially designed circuits are integrated. Therefore, this unit can handle up to ten inputs at a time to perform some 'group arithmetic and logical' operations within one cycle.
4. Processor Control Unit (PCU): It consists of several sub-units and is responsible for fetching and decoding instructions, and generating control signals for the correspondent components. The efforts of parallelism exploitation at machine instruction level are mainly made in this unit.

Data flow in Snake is based on a four-stage pipeline: firstly, a set of neighbouring pixels is read into the processor by the AGU. Then, the PE-array takes the data, computes it in parallel, and sends the processed results to the MALU. The MALU computes these partial results and gets the final result. Finally, the result is output by the AGU.

An instruction set, supporting both the Snake architecture and the characteristics of area image processing, was also developed. The instruction set is simple and designed with efficient pipelining and decoding in mind. Moreover, parallelism at instruction level was also exploited. That means some instructions take advantage of spatial parallelism by using multiple functional units to execute several operations concurrently.
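To make this data flow concrete, the sketch below models the four-stage per-pixel flow in plain Python. It is a functional model only, not a description of the hardware or its timing, and the function and parameter names are ours rather than the paper's.

```python
import numpy as np

def process_image(image, constants, pe_op, group_op):
    """Functional model of Snake's four-stage flow:
    AGU neighbourhood read -> PE-array (nine elementwise ops) ->
    MALU (one group operation) -> AGU result write."""
    padded = np.pad(image, 1, mode="edge")        # clamp borders for 3x3 reads
    out = np.zeros(image.shape, dtype=float)
    for x in range(image.shape[0]):
        for y in range(image.shape[1]):
            window = padded[x:x + 3, y:y + 3]     # AGU: nine neighbouring pixels
            partials = pe_op(window, constants)   # PE-array: nine partial results
            out[x, y] = group_op(partials)        # MALU: reduce nine values to one
    return out                                    # AGU: results written back

# Example instance (a trivial 3x3 mean filter, not an algorithm from the paper):
# process_image(img, np.full((3, 3), 1 / 9.0), np.multiply, np.sum)
```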


2.2. Address generation unit

As shown in Fig. 2, the AGU consists of two sub-units: one for address calculation and the other for interfacing. Each sub-unit is connected to the PCU, which handles program execution. The address calculation sub-unit can figure out the addresses for the pixel itself and its neighbouring pixels, while the interfacing sub-unit generates interface signals to control the data movement between the processor and the memory. The most distinctive operation is the 'neighbouring read', in which case nine neighbouring pixels in image memory are fetched by the AGU into the nine correspondent inputs of the PE-array.

Fig. 2. Block diagram of AGU.

During each read cycle, an 18-bit base-address, for three row-neighbouring pixels, is firstly put on the higher bits of a 24-bit address and data bus, while the three offset addresses (2 bits for each pixel) are put on the corresponding lower bits of the bus. Then, after the control signals have been sent, the contents of the three pixels can be obtained by the AGU. Finally, each obtained data item is output to a suitable input of the PE-array.
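A sketch of how such a read-cycle word could be packed is shown below; the 18 + 3 × 2 bit split comes from the text, while the field ordering within the low six bits and the function name are assumptions made for illustration.

```python
def pack_read_word(base_addr, offsets):
    """Pack one read-cycle word for the 24-bit address/data bus:
    an 18-bit base address in the high bits and three 2-bit pixel
    offsets in the low bits (18 + 3*2 = 24)."""
    assert 0 <= base_addr < (1 << 18), "base address must fit in 18 bits"
    assert len(offsets) == 3 and all(0 <= o < 4 for o in offsets), "need three 2-bit offsets"
    word = base_addr << 6                 # high 18 bits: row base address
    for i, off in enumerate(offsets):
        word |= off << (2 * i)            # low 6 bits: three 2-bit offsets (assumed order)
    return word                           # 24-bit bus word

# Example: pack_read_word(0x2ABCD, (0, 1, 2)) returns a 24-bit integer.
```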

2.3. PE-array computation unit

The mesh of 3 × 3 processing elements forms this computation module. As shown in Fig. 3, the nine PEs are fully inter-connected; therefore, they work synchronously, in the same way as a mesh-connected array processor works. Each PE in the array accesses its corresponding input from the AGU, along with inputs from its neighbouring PEs.

Fig. 3. The links in PE-array.

As shown in Fig. 4, every processing element (PE) consists of a Processing Arithmetic & Logical Unit (PALU) providing Boolean as well as arithmetic functions, instruction and constant registers, and a Logic Control Unit (LCU). Each processing element has a local memory, which is used to store instructions and data. Initially, both the instruction and constant registers in each PE are ready to execute their instructions. During the process, every PE in the array executes the operations defined in its own instruction register simultaneously. Note that the whole process in the array is well connected and is controlled by the PCU.

Fig. 4. Structure of a processing element.
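A processing element's registers can be pictured with a minimal Python sketch. The class below assumes only what is stated here and in Section 5.3 (a constant register, an instruction register, and an operation applied to the AGU input); the two-operation repertoire shown is illustrative, not the full PALU instruction set.

```python
class ProcessingElement:
    """Minimal model of one PE: a constant register, an instruction register,
    and an operation applied to the pixel delivered by the AGU."""

    def __init__(self, constant, instruction):
        self.constant_reg = constant      # loaded during the mapping step
        self.instr_reg = instruction      # e.g. "Add" or "Multiply" (see Section 5.3)

    def step(self, agu_input):
        """Apply the PE's programmed operation to its AGU input."""
        if self.instr_reg == "Add":
            return agu_input + self.constant_reg
        if self.instr_reg == "Multiply":
            return agu_input * self.constant_reg
        raise ValueError(f"unknown PE instruction: {self.instr_reg}")
```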

2.4. Multi-input arithmetic logical unit

As shown in Fig. 5, the MALU has ten data inputs: nine from the PE-array, carrying pre-processed data, and one from the PCU. The instructions that will be run are stored in the local memory. The instructions are loaded by the PCU in a serial way through the unit's LCU. Therefore, with every clock cycle, it can take more than one value from its inputs and produce the result. The LCU is responsible for the execution of the instructions that are stored in the local memory or sent by the PCU.

Fig. 5. Structure of MALU.

As mentioned earlier, apart from its conventional abilities for arithmetic and logical functions, some specially designed circuits are integrated to enhance its capability to handle these ten inputs and perform 'group operations'. The enhanced ability to do 'group operations' in the MALU allows many complex operations essential for image area processing to be executed very efficiently. For example, the enhanced MALU can select the maximum value from its nine PE-array inputs within one cycle. Therefore, the enhanced MALU can deal with a variety of image area processing operations, like convolution and morphological operations.
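The 'group operations' can be modelled as a single reduction over the PE-array outputs. The sketch below uses the operation names given later in Section 5.3 ('group maximum', 'group adding'); the tenth, PCU-supplied input is accepted but left unused, since the paper does not state how it enters the reduction, and collapsing the operation into one call mirrors the one-cycle claim rather than the circuitry.

```python
def malu_group_op(pe_results, instruction, pcu_value=None):
    """Reduce the nine PE partial results with one 'group' operation.
    pcu_value models the tenth MALU input; it is not used here."""
    if instruction == "group maximum":
        return max(pe_results)        # e.g. the dilation result for one pixel
    if instruction == "group adding":
        return sum(pe_results)        # e.g. the convolution result for one pixel
    raise ValueError(f"unknown MALU instruction: {instruction}")
```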

2.5. Processor control unit

Like traditional RISC processors, the process in the PCU is decomposed into a three-stage instruction pipeline (Fig. 6): instruction fetching (IF), instruction decoding (ID), and instruction execution (EX). The possible overlapping of execution among the three stages has been investigated during the design, so that the performance of the processor could be improved. The processor carries out the two stages (EX and IF) concurrently when two consecutive and independent instructions are met.

Fig. 6. Instruction pipeline.

For the EX stage, two sections of instructions are included. One is to initialise the local memories in both the PE-array and the MALU. This section of instructions is responsible for mapping algorithms onto the functional units. The other section of instructions is to synchronise the processing step among all functional units in the processor.
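The effect of the IF/EX overlap on cycle count can be summarised with a toy model; the one-cycle-per-stage assumption and the bookkeeping below are ours, not figures from the paper.

```python
def pcu_cycles(num_instructions, independent_pairs):
    """Cycles for a straight-line sequence on the three-stage PCU pipeline,
    assuming one cycle per stage. Only the overlap described in the text is
    modelled: IF of instruction i+1 runs under EX of instruction i when the
    two consecutive instructions are independent."""
    serial = 3 * num_instructions          # IF + ID + EX for every instruction
    return serial - independent_pairs      # each overlapped pair hides one IF cycle

# Example: 5 instructions with all 4 consecutive pairs independent
# take 15 - 4 = 11 cycles instead of 15.
print(pcu_cycles(5, 4))
```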
3. Instruction set

Obviously, along with the architecture design of Snake, an effective instruction set should be developed as well, so that various image-processing algorithms could be implemented. Simplicity and suitability for image processing were the two major concerns during the development of the instruction set. Besides, the possible parallelism at instruction level and the particular architecture of the processor are also important factors and were taken into consideration at the design stage of the processor.

A total of 64 instructions with an 8-bit format have been developed for basic use. Apart from instructions for normal functions, there are quite a few distinctive instructions developed to take advantage of the special architecture of Snake. A short list of some of the instructions is shown in Table 1.

Table 1
A short list of instructions

Instruction mnemonic   Brief description                                   Opcode
ARD                    Neighbour read                                      00000111
RPA                    PE-array execute                                    00111011
SMED                   Select median from PE-array                         01000010
OUTR                   Output result in MALU to AGU                        01000101
AINC                   AGU address register plus 1                         00000100
CEP                    Execute operations in both PE-array and MALU,       1001100
                       simultaneously
CRP                    Both neighbour and PE-array execution               10000111

As an example, the instruction 'CEP' is an instruction for both the PE-array and the MALU. The execution of this instruction results not only in all of the PEs in the PE-array processing simultaneously, but also in the MALU processing at the same time. With the help of these instructions, such as the group instructions and the neighbour read/write instructions, the architecture of the processor is well exploited.

3.1. Programming

In Snake, each unit has its own short program and data. Each unit executes its own program on all pixels. Therefore, the first step of programming is to map the algorithms onto these units. In our approach, the mapping is done by loading instructions and data into the local memories. After the mapping procedure, the units are ready for operation.

During processing, instructions are decoded in the PCU, and synchronous signals are then sent to the correspondent units. These signals are used to control the speed of processing in the processor.
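The decoding step can be illustrated directly from Table 1; only the opcode-to-mnemonic pairs come from the paper, while the dispatch code itself is a sketch of what the PCU's ID stage does (CEP is omitted because its opcode is printed with only seven bits in the table).

```python
# Opcode-to-mnemonic lookup taken from Table 1.
OPCODES = {
    "00000111": "ARD",    # neighbour read
    "00111011": "RPA",    # PE-array execute
    "01000010": "SMED",   # select median from PE-array
    "01000101": "OUTR",   # output result in MALU to AGU
    "00000100": "AINC",   # AGU address register plus 1
    "10000111": "CRP",    # both neighbour and PE-array execution
}

def decode(opcode_bits: str) -> str:
    """Return the mnemonic for an 8-bit opcode string."""
    if opcode_bits not in OPCODES:
        raise ValueError(f"unknown opcode: {opcode_bits}")
    return OPCODES[opcode_bits]

print(decode("00000111"))   # -> ARD
```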

4. The analytical model

As most image area algorithms use a 3 × 3 mask for the basic implementation, we can discuss the analytical model for Snake based on a 3 × 3 neighbourhood.

Disregarding the times for program decoding and register initialising, the total processing time (T_A) has three main components: the data I/O time (T_IO), the computation time in the PE-array (T_PE), and the computation time in the MALU (T_MALU). These times are functions of the program length (L) and the image size (N). The total time to process a whole image can be expressed by:

T_A = T_IO + T_PE + T_MALU.    (1)

The I/O time in Eq. (1) can be expressed by:

T_IO = N × (t_in + t_out),    (2)

where N is the number of pixels in an image, t_in is the cycle time for performing a 'read neighbouring' operation, and t_out is the cycle time for performing a 'write result' operation. For Snake, because of the enhanced design of the AGU, t_in + t_out = 12 CLK in practice.

The computation time in the PE-array is expressed by:

T_PE = N × t_PE,    (3)

where N is the number of pixels in an image, and t_PE is the longest processing cycle time among the PEs in the array. Note that all PEs compute in parallel.

The computation time in the MALU is expressed by:

T_MALU = N × t_MALU,    (4)

where N is the number of pixels in an image, and t_MALU is the time to execute an algorithm on a pixel. Note that because of the special hardware inside the MALU, the number of instructions for the MALU to execute an algorithm is significantly reduced. Hence, the total time for a whole image is again expressed by:

T_A = N × (t_IO + t_MALU + t_PE).    (5)

Eq. (5) can also be expressed graphically by the pipeline shown in Fig. 7. Moreover, because of the existence of spatial parallelism in the pipeline, the four processing times (t_in, t_out, t_MALU, t_PE) overlap. So, considering this factor, a coefficient r (0 < r < 1) is introduced to adjust Eq. (5) as follows:

T_A = N × r × (t_IO + t_MALU + t_PE).    (6)

Fig. 7. The analytical model.

As shown in Eq. (6), in order to speed up the processing, two ways are available: one is to reduce the coefficient value r, and the other is to reduce the processing time for each stage. The former means developing spatial parallelism in the pipeline, while the latter means improving efficiency in every stage. Therefore, enormous efforts on both the optimisation of the architecture at component level and the exploitation of parallelism at machine instruction level were made during the design. For Snake, we achieved r = 0.4.
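As a worked illustration of Eq. (6): only r = 0.4 and t_in + t_out = 12 CLK are values reported above; the per-stage cycle counts and the clock frequency below are assumptions chosen purely to show how the model is used.

```python
# Rough use of Eq. (6): T_A = N * r * (t_IO + t_MALU + t_PE).
N = 512 * 512          # pixels per frame
r = 0.4                # overlap coefficient reported for Snake
t_IO = 12              # cycles, read-neighbouring + write-result (from the paper)
t_PE = 4               # cycles per pixel in the PE-array (assumed)
t_MALU = 4             # cycles per pixel in the MALU (assumed)
clock_hz = 50e6        # assumed clock frequency

cycles_per_frame = N * r * (t_IO + t_MALU + t_PE)
frame_time = cycles_per_frame / clock_hz
print(f"{cycles_per_frame:.0f} cycles -> {frame_time * 1e3:.1f} ms/frame "
      f"({1 / frame_time:.1f} frames/s)")
# With these assumed figures: about 2.1e6 cycles, i.e. roughly 42 ms per frame.
```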
5. Evaluation and comparison

This section presents an evaluation of the proposed processor by implementing typical image processing algorithms. For the purpose of comparison, the algorithms were also implemented on a number of CISC, RISC and DSP processors. These processors, namely the Intel 80860 (i860) RISC processor, the Pentium-200 CISC processor, the TMS320C30 DSP, and the T800 Transputer, are the most commonly used processors for real-time applications [6]. In order to compare the performance of Snake with these processors, we implemented various image-processing algorithms on Snake. However, for the purpose of comparison and performance evaluation, only two algorithms are described in this article.

5.1. Dilation algorithm

The first algorithm considered for this investigation is a dilation operation with a 3 × 3 mask over a 512 × 512 image. Dilation is a basic morphological operation and constitutes a class of algorithms devised for local feature extraction in image processing [9].

The two-dimensional dilation operation of a gray-scale image I(x, y), by using a gray-scale structuring element C(i, j), can be expressed by the following equation:

D(x, y) = maximum[I(x - i, y - j) + C(i, j)].    (7)

In our investigation, the matrix C is a 3 × 3 mask as shown below, and the test image is a picture with 512 × 512 pixels.

        1  0  2
C  =    2  1  1
        1  0  1

Therefore, the test dilation algorithm can be expressed by the following equation:

D(i, j) = Maximum{I(i - k, j - l) + C(k, l)},  0 ≤ k ≤ 2, 0 ≤ l ≤ 2,    (8)

where I(i, j) is the value at the input pixel (i, j), C(k, l) is the matrix C coefficient, and D(i, j) is the result value at pixel (i, j). The algorithm overall requires O(32 × 512²) calculations over the 512 × 512 test image.
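For reference, Eq. (8) with this mask can be implemented directly in software. The nested-loop Python below is a plain sequential version for checking results, not a model of Snake, and its border handling (skipping out-of-range taps) is our assumption since the paper does not state a border rule.

```python
import numpy as np

# 3x3 structuring element C used for the dilation test case.
C = np.array([[1, 0, 2],
              [2, 1, 1],
              [1, 0, 1]])

def dilate(image):
    """Gray-scale dilation per Eq. (8): D(i,j) = max over k,l of I(i-k, j-l) + C(k,l)."""
    h, w = image.shape
    out = np.zeros((h, w), dtype=int)
    for i in range(h):
        for j in range(w):
            best = None
            for k in range(3):
                for l in range(3):
                    ii, jj = i - k, j - l
                    if 0 <= ii < h and 0 <= jj < w:        # skip out-of-range taps
                        v = int(image[ii, jj]) + int(C[k, l])
                        best = v if best is None else max(best, v)
            out[i, j] = best
    return out

# Example: dilate(np.random.randint(0, 256, (512, 512)))
```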

5.2. A low-bound filter

The second algorithm implemented was a convolution algorithm, named the low-bound filter. In this filter, a 3 × 3 matrix is applied as:

         0  -1   0
C  =    -1   3  -1
         0  -1   0

Like the dilation operation, the matrix C is applied over the test picture (512 × 512 pixels). The filtering algorithm is expressed by the following equation:

D(i, j) = Σ I(i + k, j + l) × C(k, l),  0 ≤ k ≤ 2, 0 ≤ l ≤ 2,    (9)

where I(i, j) is the input pixel (i, j), C(k, l) is the matrix C coefficient, and D(i, j) is the resultant pixel.
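Eq. (9) can be realised in software with the same loop structure as the dilation sketch above; again this is plain sequential Python with an assumed border rule, independent of the hardware.

```python
import numpy as np

# 3x3 low-bound filter matrix C.
C = np.array([[ 0, -1,  0],
              [-1,  3, -1],
              [ 0, -1,  0]])

def low_bound_filter(image):
    """Convolution-style sum per Eq. (9): D(i,j) = sum over k,l of I(i+k, j+l) * C(k,l).
    Out-of-range taps are simply skipped (the paper does not state its border rule)."""
    h, w = image.shape
    out = np.zeros((h, w), dtype=int)
    for i in range(h):
        for j in range(w):
            acc = 0
            for k in range(3):
                for l in range(3):
                    ii, jj = i + k, j + l
                    if ii < h and jj < w:
                        acc += int(image[ii, jj]) * int(C[k, l])
            out[i, j] = acc
    return out
```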

5.3. Implementation

As shown in Fig. 8, two steps are required for the implementation of the above algorithms on Snake. The first procedure is to initialise the values of the registers in both the AGU and the PE-array. The second step focuses on the repeated execution of a set of instructions. In practice, the first step is carried out at the beginning of the implementation, while the second step is executed until the whole image data has been processed.

Fig. 8. Flow chart.

As the same test picture is used for both algorithms, the four AGU registers were initialised during the mapping process with the same set of data. This reflects the structure of the test image (512 × 512) and the starting position of the pixel (0, 0); the same AGU mapping procedure is therefore followed for both algorithms. During the process, the instruction and constant registers of every processing element in the PE-array are initialised with suitable values. The values should correspond to the algorithm involved. For example, for the dilation algorithm, the constant and instruction registers of PE0 in the PE-array were initialised with the values '1' and 'Add' respectively, while for the low-bound filter these two registers were initialised with '0' and 'multiply' respectively. In the implementation of the dilation algorithm, the instructions related to 'group maximum obtaining' were downloaded into the local memory of the MALU, while for the low-bound filter the local memory was initialised with the instruction 'group adding'.

After the above mapping process, the second step is the processing of the image. At this stage, the PCU decodes the processing instructions from the memory module, and generates the synchronous signals for the correspondent units. The nine neighbouring pixels were obtained by the neighbouring read instruction. It was followed by the instruction that resulted in both the PE-array and the MALU working simultaneously. After two management instructions, the pixel result was obtained and output by the AGU. With the help of the spatial parallelism instructions, only five instructions were involved in this stage for the two algorithms.
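Read against Table 1, one plausible reading of this per-pixel sequence is sketched below; the paper names only the neighbouring read, the combined PE-array/MALU instruction and the AGU output, so the exact ordering and the loop-control entry are assumptions.

```python
# Hypothetical per-pixel instruction sequence assembled from Table 1 mnemonics.
# Only ARD, CEP, OUTR and AINC are named in the text; "LOOP" stands in for
# whichever management/branch instruction closes the loop.
PIXEL_LOOP = [
    "ARD",   # neighbour read: nine pixels fetched into the PE-array inputs
    "CEP",   # PE-array and MALU execute simultaneously
    "OUTR",  # output the MALU result to the AGU
    "AINC",  # advance the AGU address register to the next pixel
    "LOOP",  # assumed loop-control instruction (not listed in Table 1)
]
```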

6. Results and comparison

In order to evaluate the architecture and performance of Snake, we have conducted extensive experiments and have compared the performance of Snake with other processors. Figs. 9 and 10 show the comparative performance of the processors in implementing the dilation and convolution algorithms over the same 512 × 512 test image, respectively. As can be seen, the proposed architecture is more powerful than the commonly used processors.

Fig. 9. Execution times of processors (dilation algorithm).
Fig. 10. Execution times of processors (convolution algorithm).

7. Conclusion

In this article, we have presented the architectural design and the results of a programmable image processor called Snake. An important factor in the design of Snake is its pipeline structure with a small number of processing elements. As a result of this small mesh-connected PE-array, concurrency in the processor is significantly improved. Some other parallel strategies, such as parallelism at instruction level, were also implemented in the design.

As Snake is a programmable processor, it is easy to implement various image-processing algorithms and it is flexible enough to deal with different structures of images. Therefore, the proposed processor could be a good candidate for real-time image processing applications.

References

[1] C.J. Turner, V.C. Bhavsar, Parallel implementations of convolution and moments algorithms on a multi-transputer system, Microprocessing and Microprogramming (1995) 283–290.
[2] P.A. Navaux, A.F. Cesar, Performance evaluation in image processing with GAPP array processor, Microprocessing and Microprogramming (1995) 71–82.
[3] D.D. Eggers, E. Ackerman, High speed image rotation in embedded system, Computer Vision and Image Understanding (1995) 270–277.
[4] A. Abnous, N. Bagherzadeh, Architectural design and analysis of a VLIW processor, Computers and Electrical Engineering 21 (1995) 119–142.
[5] V. Cantoni, C. Guerra, S. Levialdi, Towards an evaluation of an image processing system, in: Computer Structures for Image Processing, Academic Press, New York, 1983, pp. 43–56.
[6] M.O. Tokhi, M.A. Hossain, CISC, RISC and DSP processors in real-time signal processing and control, Microprocessors and Microsystems, 1995.
[7] M. Fathy, M.Y. Siyal, A combined edge detection and background differencing image processing approach for real-time traffic analysis, Journal of Road and Transport Research 4 (3) (1995) 114–123.
[8] M.Y. Siyal, M. Fathy, A real-time image processing approach for automatic traffic control and monitoring, IES Journal 35 (1) (1995) 50–55.
[9] M. Fathy, M.Y. Siyal, Measuring traffic movements using image processing techniques, Pattern Recognition Letters 18 (5) (1997) 493–500.

Dr M Y Siyal obtained M.Sc and Ph.D. degrees from the University of Manchester Institute of Science and Technology, England, in 1988 and 1991 respectively. From February 1991 to January 1993, he was with the Department of Electrical and Electronic Engineering, University of Newcastle upon Tyne, England. Since February 1993, he has been with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. Dr Siyal is a senior member of the IEEE, a member of the IEE and a chartered engineer. He has published over 40 refereed journal and conference papers. His research interests are real-time image processing, computer architecture, and multi-media and information technology.

Dr M Fathy received the B.Sc. degree in Electrical Engineering from Iran University of Science and Technology in 1984. He obtained his M.Sc in Process Engineering from Bradford University, England, in 1987 and his Ph.D. degree from the University of Manchester Institute of Science and Technology, England, in 1990. Currently he is working in the Department of Computer Engineering at Iran University of Science and Technology, where he teaches postgraduate students. Dr Fathy has been a consultant to various organizations and is actively conducting research in the areas of image processing, neural networks, fuzzy logic and real-time image processing applications, and has published extensively in these areas in international journals and conferences.