A Scalable Integrated Region-Based Image Retrieval System

A SCALABLE INTEGRATED REGION-BASED IMAGE RETRIEVAL SYSTEM Yanping Du James Z. Wang The Pennsylvania State University, University Park, PA 16801 ABSTRACT image-to-image matching can be performed. The overall similarity approach reduces the adverse effect of inaccurate In this paper, we present a scalable algorithm for index- segmentation, helps to clarify the semantics of a particular ing and retrieving images based on region segmentation. region, and enables a simple querying interface for region- The method uses statistical clustering on region features based image retrieval systems. Experiments have shown and IRM (Integrated Region Matching), a measure devel- that IRM is comparatively more effective and more robust oped to evaluate overall similarity between images incorpo- than many existing retrieval methods. Like other region- rates properties of all the regions in the images by a region- based systems, the SIMPLIcity system is a linear matching matching scheme. The algorithm has been implemented as a system. To perform a query, the system compares the query part of our experimental SIMPLIcity image retrieval system image with all images in the same semantic class. and tested on large-scale image databases of both general- In this paper, we present an enhancement to the SIM- purpose images and pathology slides. Experiments have PLIcity system for handling image libraries with million demonstrated that this technique maintains the accuracy of of images. The targeted applications include Web image the original system while reducing the matching time sig- retrieval and biomedical image retrieval. Region features nificantly. of images in the same semantic class are clustered auto- matically using a statistical clustering method. Features in 1. INTRODUCTION the same cluster are stored in the same file for efficient ac- cess during the matching process. IRM (Integrated Region Content-based image retrieval has been widely studied. In Matching) is used in the query matching process. Tested on the commercial domain, IBM QBIC [9] is one of the earliest large-scale image databases, the system has demonstrated developed systems. Recently, additional systems have been high accuracy and scalability. developed at IBM T.J. Watson [14], VIRAGE [3], NEC C&C Research Labs [8], Interpix (Yahoo), Excalibur, and 2. THE SIMILARITY MEASURE Scour.net. In academia, MIT Photobook [10] is one of the earliest. Berkeley Blobworld [1], Columbia VisualSEEK In this section, we describe the similarity matching process and WebSEEK [13], CMU Informedia [15], UIUC MARS [7], we developed. We briefly describe the segmentation pro- UCSB NeTra [6], UCSD, Stanford WBIIS [16], and Stan- cess and related notations in Section 2.1. The feature space ford SIMPLIcity [17]) are some of the recent systems. analysis process is described in Section 2.2. In Section 2.3, Region-based approach has recently become a popular we give details of the matching scheme. research trend. Li et al. of Stanford University recently developed SIMPLIcity (Semantics-sensitive Integrated Match- ing for Picture LIbraries) [17]. SIMPLIcity uses semantics 2.1. Region segmentation type classification and an integrated region matching (IRM) Semantically-precise image segmentation is extremely diffi- scheme to provide efficient and robust region-based image cult and is still an open problem in computer vision [12, 18]. matching [5]. The IRM measure is a similarity measure of We attempt to develop a robust matching metric that can re- images based on region representations. It incorporates the duce the adverse effect of inaccurate segmentation. The seg- properties of all the segmented regions so that information mentation process in our system is very efficient because it about an image can be fully used. With IRM, region-based is essentially a wavelet-based fast statistical clustering pro- The authors are with School of Information Sciences and Technology cess on blocks of pixels. and Department of Computer Scinece and Engineering. This work was To segment an image, we partitions the image into blocks supported in part by the National Science Foundation’s Digital Library II ¢ Ø with Ø pixels and extracts a feature vector for each block. initiative under Grant No. IIS-9817511. and an endowment from the PNC Foundation. The authors wish to thank the help of Jia Li, Gio Wiederhold, The k-means algorithm is used to cluster the feature vectors and Oscar Firschein. into several classes with every class corresponding to one region in the segmented image. We dynamically determine If the Euclidean distance is used, the k-means algorithm k =¾ k =4 k by starting with and refine if necessary to , results in hyper-planes as cluster boundaries. That is, for the d etc. The details of the segmentation process is described feature space Ê , the cluster boundaries are hyper-planes in d ½ ½ Ê in [5]. the d dimensional space . Here, the use of the Six features are used for segmentation. Three of them Euclidean distance is reasonable because we cluster the fea- ¢ Ø are the average color components in a Ø block. The other ture space with each feature point corresponding to a region. three represent energy in high frequency bands of wavelet Our region-to-region distance is a variation of the Euclidean transforms [2], that is, the square root of the second or- distance. der moment of wavelet coefficients in high frequency bands. Both the initialization process and the stopping criterion We use the well-known LUV color space, where L encodes are critical in the process. We initialize the algorithm adap- luminance, and U and V encode color information (chromi- tively by choosing the number of clusters k by gradually nance). The LUV color space has good perception correla- increasing k and stop when a criterion is met. We start with k = ¾ 4 tion properties. We chose the block size Ø to be to compro- . The k-means algorithm terminates when no more mise between the texture detail and the computation time. feature vectors are changing classes. It can be proved that Let Æ denote the total number of images in the image the k-means algorithm is guaranteed to terminate, based on Ê database. For the i-th image, denoted as i , in the database, the fact that both steps of k-means (i.e., assigning vectors to Ò we obtain a set of i feature vectors after the region seg- nearest centroids and computing cluster centroids) reduce Ò d mentation process. Each of these i -dimensional feature the average class variance. In practice, running to comple- vectors represents the dominant visual features (including tion may require a large number of iterations. The cost for ´kÒµ Ò color and texture) of a region, the shape of that region, the each iteration is Ç , for the data size . Our stopping rough location in the image, and some statistics of the fea- criterion is to stop after the average class variance is smaller tures obtained in that region. than a threshold or after the reduction of the class variance is smaller than a threshold. 2.2. Feature space analysis 2.3. Image matching The new integrated region matching scheme depends on the entire picture library. We must first process and analyze the To retrieve similar images for a query image, we first locate the clusters of the feature space to which the regions of the characteristics of the d-dimensional feature space. query image belong. Let’s assume that the centroids of the Suppose feature vectors in the d-dimensional feature space k fc ;c ; :::; c g ¾ k set of clusters are ½ . We assume the query fÜ : i =½; :::; Äg Ä are i ,where is the total number of re- È Æ Ê = fÖ ;Ö ; :::; Ö g ½ ¾ Ñ image is represented by region sets ½ , Ä = Ò gions in the picture library. Then i . i=½ Ö i The goal of the feature clustering algorithm is to parti- where i is the descriptor of region . For each region fea- Ö j ture i ,wefind such that k Ü^ ; Ü^ ; :::; Ü^ ¾ k tion the features into groups with centroids ½ such that d´Ö ;c µ= ÑiÒ d´Ö ;c µ i j i Ð ½Ðk Ä X ¾ ÑiÒ ´Ü Ü^ µ D ´k µ= i j ´Ö ;Ö µ (1) d ¾ where ½ is the region-to-region distance defined for ½j k =½ i the system. This distance can be a non-Euclidean distance. fc ;c ; :::; c g Ö Ö We create a list of clusters, denoted as Ö . ¾ Ñ is minimized. That is, the average distance between a fea- ½ The matching algorithm will further investigate only these ture vector and the group with the nearest centroid to it is ‘suspect’ clusters to answer the query. minimized. Two necessary conditions for the k groups are: With the list of ‘suspect’ clusters, we create a list of ‘sus- 1. Each feature vector is partitioned into the cluster with pect’ images. An image in the database is a ‘suspect’ image the nearest centroid to it. to the query if the image contains at least one region feature in these ‘suspect’ clusters. This step can be accomplished 2. The centroid of a cluster is the vector minimizing the by merging the cluster image IDs non-repeatedly. average distance from it to any feature vector in the To define the similarity measure between two sets of re- Ê Ê ¾ cluster. In the special case of the Euclidean distance, gions, we assume that the image ½ and image are rep- Ê = fÖ ;Ö ; :::; Ö g Ê = ½ ¾ Ñ ¾ the centroid should be the mean vector of all the fea- resented by region sets ½ and ¼ ¼ ¼ ¼ fÖ ;Ö ; :::; Ö g Ö Ö i ,where i or is the descriptor of region . ½ ¾ Ò ture vectors in the cluster. i ¼ ¼ Ö Ö d´Ö ;Ö µ i Denote the distance between region i and as , j j d These requirements of our feature grouping process are which is written as i;j in short.

Load more