Fast Nearest Neighbor Search in Medical Image Databases

Fast Nearest Neighb or Search in Medical Image Databases Christos Faloutsos Nikolaos Sidiropoulos Flip Korn Institute for Systems Research, Inst. for Systems Research Dept. of Computer Science Institute for Advanced Computer Studies Univ. of Maryland Univ. of Maryland and Dept. of Computer Science College Park, MD College Park, MD Univ. of Maryland [email protected] [email protected] College Park, MD 20740 [email protected] Eliot Siegel, M.D. and Zenon Protopapas, M.D. UM Medical School and VA Medical Center Baltimore, MD feliot,[email protected] Abstract We examine the problem of nding similar tumor shap es. Starting from a natural similarity function (the so-called `max morphological distance'), we showed how to lower-b ound it and how to search for nearest neighb ors in large collections of tumor-like shap es. Sp eci cally, we used state-of-the-art concepts from morphology, namely the `pattern sp ectrum' of a shap e, to map each shap e to a p oint in n-dimensional space. Following [19, 36], we organized the n-d p oints in an R-tree. We showed that the L (= max) norm in the n-d space lower-b ounds 1 the actual distance. This guarantees no false dismissals for range queries. In addition, we developed a nearest-neighb or algorithm that also guarantees no false dismissals. Finally, we implemented the metho d, and we tested it against a testb ed of realistic tumor shap es, using an established tumor-growth mo del of Murray Eden[15]. The exp eriments showed that our metho d is up to 27 times faster than straightforward sequential scanning. His work was partially supp orted by the National Science Foundation under Grants IRI-9205273 and IRI-8958546, with matching funds from Empress Software Inc. and Thinking Machines Inc. 1 1 Intro duction This pap er prop oses an algorithm which rapidly searches for \similar shap es". Such an algorithm would have broad applications in electronic commerce (e.g., ` nd shapes similar to a screw-driver'), photo-journalism [20], etc., but would b e particularly useful in medical imaging. During the past twenty years, the development of new mo dalities such as Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) have substantially increased the number and complexity of images presented to radiologists and other physicians. Additionally, the recent intro duction of large scale PACS (Picture Archival and Communication Systems) has resulted in the creation of large digital image databases. A typical radiology department currently generates b etween 100,000 and 10,000,000 such images p er year. A lmless imaging department such as the Baltimore VA Medical Center (VAMC) generates approximately 1.5 terabytes of image data annually. An algorithm that would b e able to search for similar shap es rapidly would have a number of useful applications in diagnostic imaging. Both \exp erts" such as radiologists and non-exp erts could use such a system for the following tasks: 1. Diagnosis/Classi cation: distinguish b etween a primary or metastatic (secondary) tumor based on shap e and degree of change in shap e over time correlating this with data ab out diagnoses and symptoms. Computer-aided diagnosis will b e esp ecially useful in increasing the reliability of detection of pathology, particularly when overlapping structures create a distraction or in other cases where limitations of the human visual system hamp er diagnosis [37]. 2. Forecasting/Time Evolution Analysis: predict the degree of aggressiveness of the pathologic pro- cess or try to distinguish a particular histology based on patterns of change in shap e. In this setting, we would like to nd tumors in the database with the similar history as the current tumor. 3. Data Mining: detect correlations among shap es, diagnoses, symptoms and demographic data, and thus form and test hyp otheses ab out the development and treatment of tumors. In all of the ab ove tasks, the central problem is similarity matching: ` nd tumors that are similar to a given pattern' (including shap e, shap e changes, and demographic patient data). We mainly fo cus on matching similar shap es. In section 6, we discuss how our approach can b e naturally extended to handle more complicated queries. Some terminology is necessary. Following [19], we distinguish b etween (a) range queries (e.g., nd shapes that are within distance from the desirable query shape) and (b) nearest neighbor queries (e.g., nd the rst k closest shapes to the query shape) An orthogonal axis of classi cation distinguishes b etween whole-matching and sub-pattern matching: In whole-matching queries, the user sp eci es an 2 S S query image and requires images of S S that are similar; in sub-pattern matching queries, the user sp eci es only a small p ortion and requires all the (arbitrary-size) images that contain a similar pattern In this work we fo cus on whole-matching, b ecause this is the stepping stone for the sub-pattern matching, and all the problems listed ab ove, as discussed in section 6. For the whole matching problem, there are two ma jor challenges: How to measure the dis-similarity/distance b etween two shap es. In the tumor application, as well as in most other shap e applications, the distance function should b e invariant to rotation and translation. Moreover, we would like a function that pays attention to details at several scales, as we explain later. Given such a distance function, how can we do b etter than sequential scanning of the whole database? This faster metho d, however, should not compromise the correctness: it should have no false dismissals; that is, it should return exactly the same resp onse set as sequential scanning would do. Next we provide solutions to the ab ove two challenges. This pap er is organized as follows: Section 2 gives the survey. Section 3 gives an intro duction to morphology and tumor-shap e mo deling. Section 4 presents our main result: the lower-bounding of the `max-morphological' distance, as well as a k -nearest neighb or algorithm, without false dismissals. Section 5 gives the exp eriments. Section 6 discusses how to use the prop osed algorithms for more general problems. Section 7 gives the conclusions. 2 Survey 2.1 Multimedia Indexing The state of the art in multimedia indexing is based on feature extraction [36, 19]. The idea is to extract n numerical features from the ob jects of interest, mapping them into p oints in n-dimensional space. Then, any multi-dimensional indexing metho d can b e used to organize, cluster and eciently search the resulting p oints. Such metho ds are traditionally called Spatial Access Methods (SAMs). A query of the form nd objects similar to the query object Q b ecomes the query nd points that are close to the query point q, and thus b ecomes a range query or nearest neighb or query. Thus, we can use the SAM to identify quickly the qualifying p oints, and, from them, the corresp onding ob jects. Following [1], we call the resulting index as the `F-index' (for `feature index'). This general approach has b een used in several settings, such as searching for similar time-sequences [1] (e.g., trying to nd quickly sto ck prices that move like MacDonalds), color images [18, 20], etc. The ma jor challenge is to nd feature extraction functions that preserve the dis-similarity/distance b etween the ob jects as much as p ossible. In [1, 19] we showed that the F-index metho d can guarantee 3 that there will not b e any false dismissals, if the actual distance is lower-bounded by the distance in feature space. Mathematically, let O and O b e two ob jects (e.g., time sequences, bitmaps of tumors, etc.) with 1 2 distance function D () (e.g., the sum of squared errors) and F (O ), F (O ) b e their feature vectors obj ect 1 2 (e.g., their rst few Fourier co ecients), with distance function D () (e.g., the Euclidean distance, f eatur e again). Then we have: Lemma 1 (Lower-Bounding) To guarantee no false dismissals for range queries, the feature extraction function F () should satisfy the fol lowing formula: D (F (O );F (O )) D (O ;O ) (1) f eatur e 1 2 obj ect 1 2 Pro of: In [19]. Thus, the search for range queries involves two steps. For a query ob ject Q with tolerance , we 1. Discard quickly those objects whose feature vectors are too far away. That is, we retrieve the objects X such that D (F (Q);F (E )) < . f eatur e 2. Apply D () to discard the false alarms (the clean-up stage). obj ect 2.2 Spatial access metho ds Since we rely on spatial access metho ds as the eventual indexing mechanism, we give a brief survey of them. These metho ds fall in the following broad classes: metho ds that transform rectangles into p oints in a higher dimensionality space [32]; metho ds that use linear quadtrees [23] [3] or, equivalently, the z -ordering [51] or other space lling curves [17] [35]; and nally, metho ds based on trees (R-tree [27], k-d-trees [6], k-d-B-trees [56], hB-trees [41], cell-trees [26], etc.) One of the most promising approaches in the last class is the R-tree [27] and its numerous variants + (Greene's variation [25], the R -tree [58], R-trees using Minimum Bounding Polygons [34], the R - tree [5], the Hilb ert R-tree [38], etc.).

Load more