Parallel Image Processing System on a Cluster of Personal Computers
Best Student Paper Award: First Prize

J. Barbosa, J. Tavares and A. J. Padilha
FEUP-INEB, Rua Dr. Roberto Frias, Porto (P)
e-mail: jbarbosa@fe.up.pt

Abstract. The most demanding image processing applications require real-time processing, often using special-purpose hardware. The work presented here applies cluster computing to off-line image processing, where the end user benefits from the operation of otherwise idle processors in the local LAN. The virtual parallel computer is composed of off-the-shelf personal computers connected by a low-cost network, such as a [...] Mbit/s Ethernet. The aim is to minimise the processing time of a high-level image processing package. The system developed to manage the parallel execution is described, and some results obtained for the parallelisation of high-level image processing algorithms are discussed, namely for active contour and modal analysis methods, which require the computation of the eigenvectors of a symmetric matrix.

Introduction

Image processing applications can be very computationally demanding due to the large amount of data to process, to the response time required, or to the complexity of the image processing algorithms. A wide range of general-purpose or custom hardware has been used for image processing. SIMD computers, using data parallelism, are suitable for low-level image analysis, where each processor performs a uniform set of operations based on the image data matrix in a fixed amount of time; in [...], a special-purpose SIMD computer with [...] processors was presented. Systolic arrays, which can exploit the regular and constant-time operations of an algorithm, are also an option. MIMD computers, commonly used in simulation, are suitable for high-level image processing such as pattern recognition, where each processor is assigned an independent operation. For real-time vision applications, special MIMD computers were developed, e.g. ASSET, based on
PowerPC processors for computation and on Transputers for communication. MIMD computers were characterised by exploiting a variety of structures; however, technological factors have been forcing a convergence towards systems formed by a collection of essentially complete computers connected by a communications network. The processors in these computers are the same ones used in current workstations. Therefore, the idea of forming a parallel computer from a collection of off-the-shelf computers comes naturally, and fast communication techniques were also developed for that purpose. Several cluster computing systems have been developed, e.g. the NOW project.

Footnotes: PhD grant BD [...], PRAXIS XXI. PhD grant BD [...], PRAXIS XXI.

Our aim is not to build a specific cluster of personal computers for parallel image processing, but rather to perform parallel processing on already existing group clusters, where each node is a desktop computer running the Windows operating system. These clusters are characterised by having a low-cost interconnection network, such as a [...] Mbit/s Ethernet, connecting different types of processors of variable processing capacity and amount of memory, thus forming a heterogeneous parallel virtual computer. Due to network restrictions, which do not allow simultaneous communication among several nodes, the application domain is restricted to about one or two dozen processors.

The motivation for a parallel implementation of image algorithms comes from the image and image sequence analysis needs posed by various application domains. These are becoming increasingly demanding in terms of the detail and variety of the expected analytic results, requiring the use of more sophisticated image and object models (e.g. physically-based deformable models) and of more complex algorithms, while the timing constraints are kept very stringent.

A promising approach to deal with the above requirements consists in developing parallel software to be executed in a distributed manner by the
machines available in an existing computer network, taking advantage of the well-known fact that many of the computers are often idle for long periods of time. It is quite common in many organisations that a standard network connects several general-purpose workstations and personal computers, accumulating a very substantial computing power that, through the use of appropriate managing software, could be put at the service of the more computationally demanding applications. Existing software, such as the Windows Parallel Virtual Machine (WPVM), allows building parallel virtual computers by integrating in a common processing environment a set of distinct machines (nodes) connected to the network. Although the parallel virtual computer nodes and the underlying communication network were not designed for optimised parallel operation, very significant performance gains can be attained if the parallel application software is conceived for that specific environment.

Image Algorithms and Systems

The image algorithms that have been parallelised consist of a set of low-level image processing operations, namely edge detection, distance transform, convolution mask, histogramming and thresholding, whose suitability to the cluster architecture was analysed in [...]. A set of linear algebra algorithms, which are building blocks for many high-level image processing methods, was also implemented. These algorithms are the matrix product, LU factorisation, tridiagonal reduction, symmetric QR iteration, matrix inversion and matrix correlation. In this paper, the results presented focus on high-level image processing algorithms, namely active contours and modal analysis.

Some image processing systems have been proposed to run on a cluster of personal computers. In [...], two highly demanding vision algorithms were tested, giving superlinear speedup due to memory pagination on a single workstation; the machines formed a homogeneous computer. In [...], a high-level-interface parallel image processing library is
presented, and results are reported for low-level image operations on an Ethernet network of HP workstations and on an ATM network of SGI workstations. In [...], a machine-independent methodology was proposed for homogeneous computers; results were presented separately for two SMP workstations, with two and eight processors, not requiring communication between machines. Our implementation differs from the ones mentioned above, as it considers a general bus-type heterogeneous cluster where data is distributed in order to obtain a correct load balancing, and also because the number of processors that participate in a distributed algorithm varies dynamically in order to minimise the processing time of each operation.

System Architecture

The computers that belong to the virtual machine run a process to monitor the percentage of processor time spent with the local user. Conceptually, local users have priority over the distributed application, and the computer will not be available if the mean local user time is above a minimum threshold during a specified period of time (e.g. [...] seconds).

Each algorithm or task is decomposed until indivisible operations are obtained, for which parallel code exists. When a parallel algorithm is launched, the master process schedules work to the processors of the virtual machine according to their availability, choosing a number of processors that minimises the processing time of individual operations and allowing data redistribution if the optimal grid of processors changes from operation to operation. As an example, the algorithm to extract the contour of an object can be decomposed into edge enhancement, thresholding and contour tracking operations.

Hardware Organisation and Computational Model

The hardware organisation is shown in the figure below. Each node of the virtual machine is a personal computer under the Windows NT operating system, running WPVM software to communicate. The interconnection network is an Ethernet at [...] Mbit/s. Several
computational models were proposed in order to estimate the processing time of a parallel program in a distributed-memory machine. Although they could be adapted for the cluster of personal computers, a specific and simplified model is presented below.

[Figure: Hardware organisation. A master node and four slave nodes connected to a common bus.]

The total processing time is obtained by summing the time spent in communications, $T_C$, and the parallel processing time, $T_P$. Each node of the machine is characterised by the processor capacity $S_i$, measured in Mflop/s. The network is characterised by allowing only one message to be broadcast at a given time, by the latency time $T_L$ and by the bandwidth $LB$. The time $T_C$ to send a message composed of $nb$ bytes is given by

$$T_C = K\,T_L + \frac{nb}{LB}, \qquad K = \left\lceil \frac{nb}{packetsize} \right\rceil$$

The value $K$ multiplies $T_L$ because each message is partitioned into packets of length [...] to [...] bytes ($packetsize$), and a latency time is incurred for each packet; [...] is a typical packet size.

The parallel component $T_P$ of the computational model represents the operations that can be divided over a set of $p$ processors, obtaining a speedup of $p$, i.e. operations without any sequential part:

$$T_P(n, p) = \frac{\psi(n)}{\sum_{i=1}^{p} S_i}$$

The numerator $\psi(n)$ is the cost function of the algorithm, measured in floating point operations (flop), as a function of the problem size $n$. For example, to multiply square matrices of size $n$ the cost is $\psi(n) = 2n^3$.

Software Organisation

Each operation is represented by an object containing the parallel and serial implementation of the code, since the system can schedule a sequential execution
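
As an illustration only, an operation object holding both implementations could be combined with the computational model of the previous section roughly as follows. This is a minimal sketch, not the system's actual code: it assumes the communication time is K*T_L + nb/LB with K the number of packets (rounded up), and the parallel time is the flop count divided by the summed capacities. The names `Network`, `Operation`, `comm_time`, `parallel_time`, `plan` and `run` are hypothetical, as is the simplification that the first processor already holds the data and each additional processor receives one message, sent one at a time over the bus.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class Network:
    latency_s: float      # T_L: latency per packet, in seconds
    bandwidth_Bps: float  # LB: bandwidth, in bytes per second
    packet_size: int      # packetsize, in bytes


def comm_time(nb: int, net: Network) -> float:
    """T_C = K * T_L + nb / LB, with K = ceil(nb / packetsize) packets."""
    k = math.ceil(nb / net.packet_size)
    return k * net.latency_s + nb / net.bandwidth_Bps


def parallel_time(flops: float, capacities_mflops: Sequence[float]) -> float:
    """T_P = psi(n) / sum(S_i); S_i in Mflop/s, result in seconds."""
    return flops / (sum(capacities_mflops) * 1e6)


@dataclass
class Operation:
    name: str
    cost: Callable[[int], float]      # psi(n), in floating point operations
    data_bytes: Callable[[int], int]  # bytes sent to each extra processor (assumption)
    serial: Callable[..., object]     # sequential implementation
    parallel: Callable[..., object]   # parallel implementation; first argument is p

    def plan(self, n: int, capacities: Sequence[float], net: Network) -> Tuple[int, float]:
        """Return (p, estimated time) minimising T_C + T_P; p == 1 means serial."""
        ranked = sorted(capacities, reverse=True)  # try the fastest processors first
        best_p, best_t = 1, parallel_time(self.cost(n), ranked[:1])
        for p in range(2, len(ranked) + 1):
            # Assumption: one message per additional processor, sent sequentially.
            t_c = (p - 1) * comm_time(self.data_bytes(n), net)
            t = t_c + parallel_time(self.cost(n), ranked[:p])
            if t < best_t:
                best_p, best_t = p, t
        return best_p, best_t

    def run(self, n: int, capacities: Sequence[float], net: Network, *args):
        """Dispatch to the serial or parallel implementation according to the plan."""
        p, _ = self.plan(n, capacities, net)
        return self.serial(*args) if p == 1 else self.parallel(p, *args)


if __name__ == "__main__":
    # Hypothetical values, chosen only to exercise the model.
    net = Network(latency_s=0.001, bandwidth_Bps=1e6, packet_size=1500)
    matmul = Operation(
        name="matrix product",
        cost=lambda n: 2.0 * n ** 3,  # 2n^3 flop for an n x n matrix product
        data_bytes=lambda n: 8 * n * n,  # one n x n matrix of 8-byte reals (assumption)
        serial=lambda: "serial",
        parallel=lambda p: f"parallel on {p} processors",
    )
    for n in (10, 1000):
        print(n, matmul.plan(n, [10.0, 10.0], net))
```

Under these assumptions a small problem is planned on a single processor, because the per-packet latency outweighs the gain from dividing the work, while a large one is spread over more processors. This is the situation in which keeping a serial implementation alongside the parallel one pays off.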