GVIP Journal, Volume 6, Issue 3, December, 2006

Implementation of Image Processing Operations Using Simultaneous Multithreading and Buffer Processing

A. K. Manjunathachari(1), K. Satya Prasad(2)
(1) G. Pulla Reddy Engineering College, Kurnool (India)
(2) JNTU College of Engineering, Kakinada (India)
[email protected].

Abstract
Typical real-time image processing applications require a huge amount of processing power and resources. Limitations arise in image processing systems due to the volume of image data to be processed, and this challenge is even more dominant when image processing applications are run in parallel. Parallel processing appears to be the only way to attain higher speed of operation in real time under resource constraints. The nature of processing in a typical image processing task ranges from large arithmetic operations to fewer, symbolic ones. Although the existing parallel computing systems provide parallelism for image processing to some extent, they fail to support image processing operations that vary at a large rate. As part of my thesis, this paper presents an implementation of image processing operations using simultaneous multithreading and a processing buffer, by bifurcating the parallel operation; the results are simulated in a standard LAN environment.

Keywords: Processing Buffer, Simultaneous multithreading, SIMD, MIMD

1. Introduction
The type of processing operations in a typical image processing task varies greatly. Generally three levels of image processing are distinguished to analyze and tackle an image processing application: low-level operations, intermediate-level operations, and high-level operations.
i. Low-level operations: Images are transformed into modified images. These operations work on whole image structures and produce an image, a vector, or a single value. The computations have a local nature; they work on single pixels in an image. Examples of low-level operations are: smoothing, convolution, histogram generation.
ii. Intermediate-level operations: Images are transformed into other data structures. These operations work on images and produce more compact data structures (e.g. a list). The computations usually do not work on a whole image but only on objects/segments (so-called areas of interest) in the image. Examples of intermediate-level operations are: region labeling, motion analysis.
iii. High-level operations: Information derived from images is transformed into results or actions. These operations work on data structures (e.g. a list) and lead to decisions in the application, so high-level operations can be characterized as symbolic processing. An example of a high-level operation is object recognition.
Image processing starts with a plain image, or a sequence of images (coming from a sensor), and, while processing, the type of operations moves from arithmetic (Floating Point Operations Per Second, FLOPS) to symbolic (Million Logic Inferences Per Second, MLIPS), and the amount of data to process is reduced until in the end some decision is made (image understanding). As may be obvious, image processing tasks require large amounts of (different types of) computations. When real-time requirements are to be met, normal (sequential) workstations are not fast enough. So more processing power is needed, and parallel processing seems to be an economical way to satisfy these real-time requirements. Besides, even when current workstations get fast enough to do the image processing tasks of today, parallel processing will offer more processing power and open new application areas to explore.
Many architectures have been proposed that try to exploit the available parallelism at different granularities. For example, pipelined processors [2, 9, 15] and multiple instruction issuing processors, such as the superscalar [1, 18] and VLIW [4, 7, 12] machines, exploit the fine-grain parallelism available at the instruction set level. In contrast, multiprocessors [8, 11, 13] exploit coarse-grain parallelism by distributing entire loop iterations to different processors. Each of these parallel architectures has significant differences in overhead, instruction scheduling constraints, memory latencies, and implementation details, making it difficult to determine which architecture is best able to exploit the available parallelism. The performance potential of multiple instruction issuing and its interaction with pipelining has been investigated by several researchers [10, 14, 16, 17]. Their work has shown that at


the basic level, pipelining and multiple instruction issuing are essentially equivalent in exploiting fine-grain parallelism. Studies using the PASM prototype have indicated that the multiprocessor organization may be outperformed by the SIMD organization [5, 6] unless special care is taken to provide efficient synchronization for the MIMD mode [6]. We extend this previous work by comparing the performance of a pipelined , a , and a shared-memory multiprocessor when executing scientific application programs.
In image processing operations the existing approaches to parallelism are constrained by the varying size of the data and the required resources. Hence a system is required for efficient control of image processing operations with variable data size. The proposed approach realizes a parallel processing architecture integrating the simultaneous multithreading (SMT) concept and the processing buffer (PB) concept for the proper control and execution of varying image processing applications. Section 2 discusses multithreading, SMT, and processing buffers. Section 3 overviews the approach to image processing using SMT and PB, with a discussion of template matching using the above technique, and finally concludes with the results and discussion.

2. Theory
Simultaneous multithreading is a technique that combines hardware multithreading with superscalar processor technology to allow multiple threads to issue instructions each cycle. Unlike other hardware multithreaded architectures (such as the Tera MTA), in which only a single hardware context (i.e., thread) is active on any given cycle, SMT permits all thread contexts to simultaneously compete for and share processor resources. Unlike conventional superscalar processors, which suffer from a lack of per-thread instruction-level parallelism, simultaneous multithreading uses multiple threads to compensate for low single-thread ILP. The performance consequence is significantly higher instruction throughput and program speedups on a variety of workloads that include commercial and web servers and scientific applications, in both multiprogramming and parallel environments.

Enhanced SMT features
To improve SMT performance for various workload mixes and provide robust quality of service, we added two features: dynamic resource balancing and adjustable thread priority.
• Reducing the thread's priority is the primary mechanism in situations where a thread uses more than a predetermined number of GCT entries.
• Inhibiting the thread's instruction decoding until the congestion clears is the primary mechanism for throttling a thread that incurs a prescribed number of L2 cache misses.
• Flushing all the thread's instructions that are waiting for dispatch and holding the thread's decoding until the congestion clears is the primary mechanism for throttling.

2.1 Processing Buffer Concept:
The idea behind distributed Processing Buffer processing is that it offers a performance improvement by reducing the processed data as well as (the option of) processing the data in parallel. The idea behind the PB is to combine the data reduction and parallel processing strategies. It is geared towards iterative image processing algorithms where only a subset of the image data is processed. The global steps in using PB processing are:
1. scan image to collect data of interest
2. put data to process in a PB
3. while PB not empty
   - process data in PB
   - put new(ly generated) data to process in PB
   endwhile

PB data structure:
A PB is defined as a data structure with two main access functions: a put() function to put data elements in the PB and a get() function to retrieve an arbitrary data element from the PB. Furthermore, there is an empty() function for checking whether the PB is empty and a clear() function for removing all data elements from the PB. Note that both get() and put() are blocking when the PB is empty or full, respectively. This definition of the access functions allows different implementation schemes for the PB data structure. For instance, on a workstation, the PB could be implemented as a linked list of data elements where elements are put in the list and fetched from the list in a LIFO (Last-In First-Out) or FIFO (First-In First-Out) manner. Yet, the user may not assume the PB behaves in a certain way, e.g. as a FIFO, and use that knowledge in his program.

    class PB
    {
    public:
        PB(unsigned int element_size);
        ~PB();
        boolean empty();
        void put(void *element);
        void get(void *element);
        void clear();
    };

Distributed PB data structure:
On a parallel system the PB data structure may be distributed over a number of processors. This distributed PB data structure is obtained by segmenting the PB into so-called partial PBs that are allocated on the processors, one on each processor. The distributed PB consists of n partial PBs, each of them uniquely mapped to a specific processor. Instead of allowing each processor to access all data elements in the PB, a processor is restricted to accessing only its own part of the PB, its partial PB, when fetching data from the distributed PB. When get() is called, the processor will only return data elements that are present in its partial PB.
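The PB interface above leaves the storage discipline open. The following single-threaded sketch is one possible realization, with several illustrative assumptions not fixed by the paper: a bounded FIFO backed by std::deque, a default capacity, and a full() helper that the declared interface does not include. It exercises the put()/get()/empty()/clear() contract together with the while-PB-not-empty processing loop of Section 2.1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <deque>
#include <vector>

// Minimal single-threaded PB sketch.  The paper's PB stores untyped
// elements of a fixed size; FIFO order and the capacity bound are
// illustrative choices only -- callers must not rely on the order.
class PB {
public:
    explicit PB(unsigned int element_size, std::size_t capacity = 1024)
        : size_(element_size), cap_(capacity) {}
    bool empty() const { return q_.empty(); }
    bool full() const { return q_.size() >= cap_; }  // hypothetical helper
    void put(const void *element) {
        assert(!full());                 // a real PB would block when full
        const char *p = static_cast<const char *>(element);
        q_.emplace_back(p, p + size_);   // copy element_size bytes in
    }
    void get(void *element) {
        assert(!empty());                // a real PB would block when empty
        std::memcpy(element, q_.front().data(), size_);
        q_.pop_front();
    }
    void clear() { q_.clear(); }
private:
    unsigned int size_;
    std::size_t cap_;
    std::deque<std::vector<char>> q_;
};

// Global steps from Section 2.1: seed the PB, then process until empty;
// processing an item may generate new work that goes back into the PB.
int process_all(PB &pb) {
    int processed = 0;
    while (!pb.empty()) {
        int pixel;
        pb.get(&pixel);
        ++processed;
        if (pixel > 128) {               // "newly generated" work item
            int neighbour = pixel - 128;
            pb.put(&neighbour);
        }
    }
    return processed;
}
```

A real distributed PB would block inside put()/get() and partition the elements across the partial PBs; the asserts here only stand in for that blocking behaviour.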


3. Processing Buffer approach using SMT for Image Processing Operations
In this approach we present a method for the bifurcation of an image processing application into three fundamental layers, which are isolated based on processor requirements and their functionality. Generally, image processing applications perform parallel operations by taking additional resource support from libraries and packages and by creating buffering for performing the image processing application (IPA). The transition of control for creating buffers and controlling the applications takes a considerable amount of transfer time, which results in slower processing. We present an approach that enhances the parallelism by adding the concept of simultaneous multithreading over the processor, to reduce the transition delay in a parallel computing image processing application. The parallel architecture for parallel image processing is shown in figure 1 below.

Fig 1. The three-layer architecture: Resource layer (support drivers, buffers), Link layer (Library, DLL, Mex), Application layer (programs, I/O).

Resource layer:
This layer keeps track of all the hardware resource requirements, such as device drivers and processing buffers, for performing multiple IPAs. This layer communicates with the application layer via the linking layer to find the requirements of the IP applications, so as to allocate processing buffers to carry out simultaneous operations.

Linking layer:
This layer provides a link between the resource layer and the application layer; it consists of DLL files and MEX files for the proper transfer of data between the resource allocation unit and the computing unit. This layer holds the defined libraries and the packages required for supporting the transactions.

Application layer:
This layer reads the input image and the dedicated functions of the IPA with the support of the upper layers. This layer evaluates the time of computation and the resource requirements for the IPA. A copy of the requirements is transferred to the resource layer for allocation of resources. This layer is the user interface where the user can pass the inputs to be processed on the image and obtain the results. The transactions in these layers are controlled by the simultaneous multithreading approach, where the instructions are latched out into multiple threads and executed concurrently. Finally, in simultaneous multithreading (SMT), as in other multithreaded implementations, the processor fetches instructions from more than one thread. What differentiates this implementation is its ability to schedule instructions for execution from all threads concurrently. With SMT, the system dynamically adjusts to the environment, allowing instructions to execute from each thread if possible, and allowing instructions from one thread to utilize all the execution units if the other thread encounters a long-latency event.

3.1 Template Matching using parallel execution
Template matching is one of the most fundamental tasks in many image processing applications. It is a simple method for locating specific objects within an image, where the template contains the object one is searching for. For each possible position in the image the template is compared with the actual image data in order to find the sub-images that match the template. To reduce the impact of possible noise and distortion in the image, a similarity or error measure is used to determine how well the template compares with the image data. A match occurs when the error measure is below a certain predefined threshold.

Algorithm for template matching:
Phase 1. First, each input image from file is scattered throughout the parallel system depending on the number of systems (CPUs).
Phase 2. Next, all templates are broadcast to the CPUs; also, for the convolution operations to perform correctly, image borders are exchanged among neighbours in the CPUs. In all cases the extent of the border in each dimension is half the size of the template minus one pixel.
Phase 3. Finally, before the error image is written out to file, it is gathered to a single unit. (Apart from these communication operations all processing units can run independently, in a fully data-parallel way.)

3.2 Parallel Image Convolution Algorithm
Image convolution is a primitive operation in image processing and computer vision [4]. It has plenty of applications across image processing and computer vision, such as image filtering, feature detection, enhancement, restoration, template matching, and so on. It is defined by

    (a[x, y] * g[x, y]) = Σ_{i,j} a[i, j] g[x − i, y − j]    (1)

Convolution is a local operation because the outcome of convolution at each pixel is just the sum of multiplications between the pixels neighbouring that point in the image and the pixels in a kernel. Convolution is a time-consuming job because the amount of computation is very high, especially when the kernel size is large, which makes parallel processing a very attractive way of implementing it, given the local property of the convolution. There are two ways of implementing the convolution. One is a direct convolution following the definition above in the spatial domain, and the other is a 2D FFT method in the Fourier domain.

3.2.1 Parallel Direct Convolution
In a conventional parallel direct convolution method [22], the image is decomposed into as many row partitions as the processor count, as shown in Figure 2.


The complexity of the algorithms is evaluated by two measures: computation time Tcomp(n, k, p) and communication time Tcomm(n, k, p), where n is the image width, k is the kernel width (without loss of generality, we assume each of the image and kernel has the same width and height), and p is the processor count. Communication between processors occurs only in the boundary regions between partitions. Thus, the communication load at each processor will be 2·n·⌈k/2⌉. The computation and communication time of the parallel direct convolution algorithm can be expressed as follows:

    Tcomp(n, k, p) = O(k²n²/p)    (2)
    Tcomm(n, k, p) = O(τnk)       (3)

where τ represents the average network latency per communication. Therefore the ratio of computation to communication is O(kn/(τp)), which tells us scaling will decrease as kernel size or image size decreases.

Fig 2. Row decomposition of the image into partitions BP1, BP2, …, BPN (partition count = number of row processors).

3.2.2 Parallel 2D FFT Method
In the 2D FFT method, the image convolution is performed in the Fourier domain [21]. Since it includes 2D FFT and 2D inverse FFT operations, it has to go through several stages, as shown in Figure 3. In this method, each 1-D FFT operation is performed without any communication on each processor. However, the transpose step becomes a major bottleneck because it requires all-to-all communication. To avoid severe network contention, it should be done carefully. A parallel matrix transpose algorithm [23] is used. Given an n×n image and p processors, the image is partitioned into blocks of n/p × n/p size. Let the ith processor have blocks X_0^i, X_1^i, …, X_{p−1}^i, and let Y_k^i denote the transposed block which the ith processor will have in the kth block position. Then the parallel matrix transpose algorithm is given as follows.

Parallel Matrix Transpose Algorithm:
    for j = 1 to p − 1
        send block X_{(i+j)%p}^i to processor (i+j)%p
        receive block Y_{(i−j)%p}^i from processor (i−j)%p
    endfor

By the above algorithm, each processor sends and receives exactly one block at each step, so that its memory port does not suffer from a traffic jam.

Fig 3. Stages of the parallel 2D FFT convolution: 1D row FFT of the image and of the kernel (on partitions BP1 … BPN), transpose, 1D column FFT, point-wise multiplication, 1D column IFFT, and 1D row IFFT.

4. Results and conclusion
The above is simulated and the results are tested in a standard 100 Mbps LAN environment consisting of P-III 1.72 GHz systems with 256 MB RAM, under Linux.
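Before turning to the results, the send/receive schedule of the parallel matrix transpose above can be checked without any message passing. The sketch below (a hypothetical helper, not the paper's implementation) computes the step-j partners (i+j) mod p and (i−j) mod p and verifies that at every step each processor sends and receives exactly one block, which is the no-contention claim made above.

```cpp
#include <utility>
#include <vector>

// For step j of the parallel matrix transpose (Section 3.2.2), processor i
// sends block X_{(i+j)%p} to processor (i+j)%p and receives block
// Y_{(i-j)%p} from processor (i-j)%p.  This helper returns the
// (send_to, recv_from) pair; no actual communication is modelled.
std::pair<int, int> partners(int i, int j, int p) {
    int send_to = (i + j) % p;
    int recv_from = ((i - j) % p + p) % p;  // keep the index non-negative
    return std::make_pair(send_to, recv_from);
}

// Check that at every step each processor sends exactly one block and is
// the target of exactly one sender (so no memory-port contention).
bool schedule_is_contention_free(int p) {
    for (int j = 1; j <= p - 1; ++j) {
        std::vector<int> recv_count(p, 0);
        for (int i = 0; i < p; ++i)
            ++recv_count[partners(i, j, p).first];  // i sends to this rank
        for (int i = 0; i < p; ++i)
            if (recv_count[i] != 1) return false;
    }
    return true;
}
```

Note the schedule is self-consistent: the processor that i receives from, (i−j) mod p, computes its own send target as ((i−j)+j) mod p = i.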


4.1 Template Matching
Results were obtained for the parallel version of the algorithm presented above for non-real-time applications. These results show that even for a large number of processing units the speedup is close to linear, and that the characteristics are identical when the same number of templates is used in the matching process, as shown in Table 1a and Table 1b.

Table 1a: Parallel execution, 1 input image
No of Systems in LAN | 1 template (s) | 5 templates (s) | 10 templates (s)
2 | 25.439 | 126.654 | 253.165
3 | 12.774 | 63.41   | 126.694
4 | 6.449  | 31.895  | 63.707

Table 1b: Sequential execution, 1 input image
No of Systems in LAN | 1 template (s) | 10 templates (s)
2 | 25.526 | 253.627
3 | 13.466 | 133.443
4 | 7.126  | 69.924

We can see that the execution time for SMT using the PB is better than for fine-grain multithreading (FGM) and coarse-grain multithreading (CGM). Fig 4 shows that, as the number of threads increases with the image size kept fixed, the execution time of SMT-based low-pass filtering is better than that of FGM-based and CGM-based low-pass filtering.

Fig 4. Execution times for SMT, FGM and CGM

Tables 2 and 3 present the number of elements in the processing buffer at the start of each image processing iteration, and the execution times for SMT, CGM and FGM.

Table 2: Entries in processing buffers
Image Name | Entries in processing buffers
Flower  | 1500
TUD     | 3800
Obscura | 200
Trui    | 9000
Cermet  | 10000

Table 3: Execution time on various image sizes
Image size | SMT (ms) | CGM (ms) | FGM (ms)
64*64   | 20  | 25  | 28
256*256 | 200 | 220 | 250
400*400 | 220 | 245 | 260

4.2 Parallel Image Convolution Algorithm
Most of the experiments are focused on discovering the difference between the parallel versions of the direct convolution method and the 2D FFT method. For the experiment, we used 64*64, 128*128, and 256*256 Lena images. In the first experiment, we investigate the scalability change as the image size changes in both methods. Figure 5 shows the difference between the two methods. As expected from the theoretical analysis, in both cases scalability decreases as image size decreases, though it is hardly noticeable in the direct convolution method. We can see in the figure that the 2D FFT method's scalability gets much worse than the direct one. It implies that the 2D FFT method's communication time dominates the total execution time more as image size decreases than the direct method's does. This happens because, in the 2D FFT method, the portion of time spent in communication (barrier and read) increases more as image size decreases, which causes the decrease in scalability.
Next we investigate the scalability change as the kernel size changes in both methods. Figure 6 shows the difference between the two methods. As also expected from the theoretical analysis, the direct convolution's scalability decreases as kernel size gets smaller, while the 2D FFT method is never affected by kernel size. We compared the actual execution time of both algorithms with different kernel and image sizes in order to determine a guideline about which method should be used in which case; in other words, how big the kernel size should be to get the benefit of the 2D FFT method. Also, more importantly, this experiment is to see whether the guideline changes as the processor count increases. Figure 7 shows the experiment results. From the figure, we can conclude that the 2D FFT method works faster than the direct convolution method when the kernel size k is larger than 11*11 in the setting where the experiments were performed, and the more important result is that this threshold hardly changes.

Conclusion
In this paper, we explored the scalabilities of two parallel applications and discovered which factors influence thread-level parallelism, and how. In the parallel image convolution, two different implementation


methods were compared. The direct convolution method has less communication load than the 2D FFT method; thus it is less vulnerable to network latency. Its scalability slightly decreases as kernel size gets smaller, but is hardly affected by image size. On the other hand, the 2D FFT method's scalability decreases as image size gets smaller, but it is never affected by kernel size. Also, its scalability is severely affected by network latency because it has a high communication load due to the matrix transpose operation. However, the scalability changes caused by image size, kernel size or network latency variation are not large enough to severely change the guideline about which method should be used. Therefore, we can still determine the guideline on uniprocessors even if both of the convolution algorithms are parallelized. We have given an assessment of the effectiveness of our architecture in providing the significant results presented; much of the efficiency of parallel execution is still in the hands of the developer.

Figure 5: Comparison of scalability with different image sizes. Note that CONV means the direct convolution method, 2DFFT the 2D-FFT method, k the kernel size (k×k), and i the image size (i×i).
Figure 6: Comparison of scalability with different kernel sizes.

Figure 7: Comparison of execution time with different kernel and image size.


References
[1] R. D. Acosta, et al., "An Instruction Issuing Approach to Enhancing Performance in Multiple Functional Unit Processors," IEEE TOC, Sep.
[2] D. W. Anderson, et al., "The IBM System/360 Model 91: Machine Philosophy and Instruction-Handling," IBM J. of Res. & Dev., Jan. 1967.
[3] E. C. Bronson, et al., "Experimental Application-Driven Architecture Analysis of an SIMD/MIMD Parallel Processing System."
[4] R. P. Colwell, et al., "A VLIW Architecture for a Trace Scheduling Compiler," IEEE TOC, Aug. 1988.
[5] S. A. Fineberg, et al., "Mixed-Mode Computing with the PASM System Prototype," Allerton Conf. on Comm., Con., and Comp., 1987, pp.
[6] S. A. Fineberg, et al., "Non-Deterministic Instruction Time Experiments on the PASM System Prototype," ICPP, Aug. 1988, pp. 444-451.
[7] J. A. Fisher, "Trace Scheduling: A Technique for Global Microcode Compaction," IEEE TOC, July 1981, pp. 478-490.
[8] A. Gottlieb, et al., "The NYU Ultracomputer - Designing a MIMD, Shared-Memory Parallel Machine," ISCA, 1982, pp. 27-42.
[9] N. P. Jouppi, "Architectural and Organizational Tradeoffs in the Design of the MultiTitan CPU," ISCA, May 1989, pp. 281-289.
[10] Yung-Lin Liu, Hau-Yang Cheng, Chung-Ta King, "High performance computing on networks of workstations through the exploitation of function parallelism," Journal of Systems Architecture 45, pp. 1307-1321, 1999.
[11] N. P. Jouppi and D. W. Wall, "Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines," ASPLOS, Apr.
[12] D. J. Kuck, et al., "Parallel Supercomputing Today and the Cedar Approach," Science, Feb.
[13] M. Lam, "Software Pipelining: An Effective Scheduling Technique for VLIW Machines," SIGPLAN '88, June 1988, pp. 318-328.
[14] G. F. Pfister, et al., "The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture," ICPP, 1985, pp. 764-771.
[15] A. R. Pleszkun and G. S. Sohi, "The Performance Potential of Multiple Functional Unit Processors," ISCA, 1988, pp. 37-44.
[16] G. Radin, "The 801 Minicomputer," IBM J. of Res. & Dev., May 1983, pp. 237-246.
[17] M. D. Smith, et al., "Limits on Multiple Instruction Issue," ASPLOS, Apr. 1989, pp. 290-302.
[18] G. S. Sohi and S. Vajapeyam, "Tradeoffs in Instruction Format Design for Horizontal Architectures," ASPLOS, Apr. 1989, pp. 15-25.
[19] M. R. Thistle and B. J. Smith, "A Processor Architecture for Horizon," Supercomputing '88.
[20] G. S. Tjaden and M. J. Flynn, "Detection and Parallel Execution of Independent Instructions."
[21] A. K. Jain, Fundamentals of Digital Image Processing, Prentice Hall, 1999.
[22] S. Yu, M. Clement, Q. Snell and B. Morse, "Parallel Algorithms for Image Convolution," International Conference on Parallel and Distributed Processing Techniques and Applications, 1998.
[23] M. Hegland, Real and Complex Fast Fourier Transforms on the Fujitsu VPP 500, TR-CS-94-07, U. of Canberra, Australia, 1994.

Author Biography:
K. Manjunathachari, Associate Prof., Department of Electronics and Communication, G. Pulla Reddy Engineering College, Kurnool, A.P., India. Pursuing a PhD in Parallel Image Processing from Jawaharlal Nehru Technological University, Hyderabad, India. His research interests include: Image Processing and Compression, Parallel Processing.

Dr. K. Satya Prasad, presently Principal and Professor in Electronics and Communication, Jawaharlal Nehru Technological University College of Engineering, Kakinada, A.P., India.
