A Programmable Image Processor for Real-Time Image Processing Applications
Microprocessors and Microsystems 23 (1999) 35–41

M.Y. Siyal a,1,*, M. Fathy b,2

a School of EEE, Nanyang Technological University, Nanyang Avenue, Singapore 639798
b Department of Computer Engineering, Iran University of Science and Technology, Narmark, Tehran, Iran

Received 22 September 1998; received in revised form 25 October 1998; accepted 12 November 1998

* Corresponding author. Tel.: 0065 790 4464; fax: 0065 790 0415.
1 E-mail: [email protected] (M.Y. Siyal)
2 E-mail: [email protected]. ac.ir (M. Fathy)

0141-9331/99/$ - see front matter © 1999 Elsevier Science B.V. All rights reserved. PII: S0141-9331(98)00112-4

Abstract

In this article we present the architectural design and analysis of a programmable image processor, named Snake. The processor was designed with a high degree of parallelism to speed up a range of image processing operations. The data parallelism found in array processors has been incorporated into the architecture of the proposed processor. The implementation of commonly used image processing algorithms and their performance evaluation are also discussed. The performance of Snake is also compared with other types of processor architectures. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: SIMD; RISC; CISC; Array processor; Parallel processing; Image processing; Real-time

1. Introduction

Low-level image processing operations usually involve simple and repetitive operations over the entire input image [1–5]. In real-time applications, these input images must be processed at a rate of 25 frames per second (in most countries). Hence, with every frame consisting of 256 K pixels (512 × 512 resolution), tens of megabytes of data must be processed every second. To provide such a high throughput rate, parallel implementation has to be investigated [8,9].

Area processing [2], including convolution and morphological operations, is present in nearly all feature extraction algorithms and therefore plays a vital role in image processing applications. However, because of their dependency on neighbouring pixels, these operations are computationally expensive [1,7,8]. Developing suitable architectures to implement them efficiently is therefore an important task for real-time image processing applications. Theoretically, the highest throughput rate is achieved when the processor architecture and the algorithms are perfectly matched [2,5]. Keeping this in mind, the design of Snake was undertaken.

The processor was designed to implement various image area operations with a 3 × 3 mask, because area processing with a 3 × 3 mask is commonly used in many image processing applications [5,9]. This is particularly true of the image processing algorithms developed for road traffic analysis [8,9]. However, area processing with larger masks can be easily realised through several basic 3 × 3 mask operations.

2. The architecture of Snake

The whole design reflects the idea of combining the parallelism found in mesh-connected array processors with an efficient pipeline processor, so that high concurrency can be achieved.

2.1. The overall system

Fig. 1 shows the block diagram of Snake. The processor contains four functional units that are connected to each other, forming an efficient four-stage pipeline. The four hardware units are listed below:

1. Address Generation Unit (AGU): This module consists of two sub-units and is responsible for data movement between the processor and the image memory.
2. PE-array Computation Unit: It consists of nine inter-connected processing elements, forming a fully mesh-connected 3 × 3 PE-array.
3. Multi-input Arithmetic Logical Unit (MALU): Besides the execution of normal arithmetic and logical functions, several specially designed circuits are integrated. Therefore, this unit can handle up to ten inputs at a time to perform some 'group arithmetic and logical' operations within one cycle.
4. Processor Control Unit (PCU): It consists of several sub-units and is responsible for fetching and decoding instructions, and generating control signals for the corresponding components. The effort to exploit parallelism at the machine instruction level is mainly made in this unit.

Fig. 1. Block diagram of the processor.

Data flow in Snake is based on a four-stage pipeline. First, a set of neighbouring pixels is read into the processor by the AGU. Then, the PE-array takes the data, processes it in parallel, and sends the processed results to the MALU. The MALU combines these partial results to obtain the final result. Finally, the result is output by the AGU.

An instruction set, supporting both the Snake architecture and the characteristics of area image processing, was also developed. The instruction set is simple and designed with efficient pipelining and decoding in mind. Moreover, parallelism at the instruction level was also exploited; that is, some instructions take advantage of spatial parallelism by using multiple functional units to execute several operations concurrently.

2.2. Address generation unit

As shown in Fig. 2, the AGU consists of two sub-units: one for address calculation and the other for interfacing. Each sub-unit is connected to the PCU, which handles program execution. The address calculation sub-unit computes the addresses of the pixel itself and its neighbouring pixels, while the interfacing sub-unit generates interface signals to control the data movement between the processor and the memory. The most distinctive operation is the 'neighbouring read', in which nine neighbouring pixels in image memory are fetched by the AGU into the nine corresponding inputs of the PE-array.

Fig. 2. Block diagram of AGU.

During each read cycle, an 18-bit base address for three neighbouring pixels in a row is first put on the higher bits of a 24-bit address and data bus, while the three offset addresses (2 bits for each pixel) are put on the corresponding lower bits of the bus. Then, after the control signals have been sent, the contents of the three pixels can be obtained by the AGU. Finally, each obtained value is output to the appropriate input of the PE-array.

2.3. PE-array computation unit

The mesh of 3 × 3 processing elements forms this computation module. As shown in Fig. 3, the nine PEs are fully inter-connected and therefore work synchronously, in the same way as a mesh-connected array processor. Each PE in the array accesses its corresponding input from the AGU, along with inputs from its neighbouring PEs.

Fig. 3. The links in PE-array.

As shown in Fig. 4, every processing element (PE) consists of a Processing Arithmetic & Logical Unit (PALU) providing Boolean as well as arithmetic functions, instruction and constant registers, and a Logic Control Unit (LCU). Each processing element has a local memory, which is used to store instructions and data. Initially, both the instruction and constant registers in each PE are ready to execute their instructions. During processing, every PE in the array simultaneously executes the operations defined in its own instruction register. Note that the whole process in the array is well connected and is controlled by the PCU.

Fig. 4. Structure of a processing element.

2.4. Multi-input arithmetic logical unit

As shown in Fig. 5, the MALU has ten data inputs: nine from the PE-array with pre-processed data, and one from the PCU. The instructions to be run are stored in the local memory. The instructions are loaded serially by the PCU through the unit's LCU. Therefore, with every clock cycle, it can take more than one value from its inputs and produce the result. The LCU is responsible for the execution of the instructions that are stored in the local memory or sent by the PCU.

As mentioned earlier, apart from its conventional abilities for arithmetic and logical functions, some specially designed circuits are integrated to enhance its capability to handle these ten inputs and perform 'group operations'.

Fig. 5. Structure of MALU.

... instruction execution (EX). The possible overlapping of execution among the three stages was investigated during the design, so that the performance of the processor could be improved. The processor carries out two stages (EX and IF) concurrently when two consecutive and independent instructions are met.

For the EX stage, two groups of instructions are included. One initialises the local memories in both the PE-array and the MALU; these instructions are responsible for mapping algorithms onto the functional units. The other synchronises the processing steps among all functional units in the processor.

3. Instruction set

Along with the architecture of Snake, an effective instruction set had to be developed as well, so that various image processing algorithms could be implemented. Simplicity and suitability for image processing were the two major concerns during the development of the instruction set. In addition, the possible parallelism at the instruction level and the particular architecture of the processor are also important factors and were taken into consideration at the design stage of the processor.

A total of 64 instructions with an 8-bit format have been developed for basic use. Apart from instructions for normal functions, there are quite a few distinctive instructions developed to take advantage of the special architecture of Snake. A short list of some of the instructions is shown in Table 1.

As an example, 'CEP' is an instruction for both the PE-array and the MALU. Executing this instruction results not only in all of the PEs in the PE-array processing simultaneously, but also in the MALU processing at the same time. With the help of these instructions, such as group instructions and neighbour read/write instructions, the
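
As a rough illustration of the data flow described in Sections 2.1 to 2.4 (a neighbouring read by the AGU, parallel processing in the PE-array, and a group operation in the MALU), the following Python sketch models a 3 × 3 mask operation in software. It is not the authors' implementation. The function names, the choice of a multiplication in each PE, the group addition in the MALU, and the use of the PCU input as an additive offset are all assumptions made for illustration.

```python
# Illustrative software model only. The operations assigned to the PEs and
# to the MALU are assumptions, not taken from the Snake instruction set.

def neighbouring_read(image, row, col):
    """AGU stage: fetch the 3 x 3 neighbourhood centred on (row, col)."""
    return [image[row + dr][col + dc] for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


def pe_array(pixels, constants):
    """PE-array stage: each of the nine PEs combines its input pixel with the
    value held in its own constant register (assumed here: multiplication)."""
    return [p * k for p, k in zip(pixels, constants)]


def malu(partials, pcu_input=0):
    """MALU stage: a 'group' addition over nine PE results plus one PCU input."""
    return sum(partials) + pcu_input


def area_operation(image, mask, pcu_input=0):
    """Apply a 3 x 3 mask to every interior pixel; border pixels are left at 0."""
    height, width = len(image), len(image[0])
    constants = [k for mask_row in mask for k in mask_row]
    result = [[0] * width for _ in range(height)]
    for row in range(1, height - 1):
        for col in range(1, width - 1):
            pixels = neighbouring_read(image, row, col)
            result[row][col] = malu(pe_array(pixels, constants), pcu_input)
    return result


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
box_mask = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
print(area_operation(image, box_mask)[1][1])  # prints 54
```

With the all-ones mask, the example prints 54, the sum of the nine pixels around position (1, 1). In this model the three functions run one after another for each pixel, whereas the hardware described above overlaps them as pipeline stages.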