Fast Nearest Neighb or Search in Medical Image
Databases
Christos Faloutsos Nikolaos Sidiropoulos Flip Korn
Institute for Systems Research, Inst. for Systems Research Dept. of Computer Science
Institute for Advanced Computer Studies Univ. of Maryland Univ. of Maryland
and Dept. of Computer Science College Park, MD College Park, MD
Univ. of Maryland [email protected] [email protected]
College Park, MD 20740
Eliot Siegel, M.D. and Zenon Protopapas, M.D.
UM Medical School and
VA Medical Center
Baltimore, MD
feliot,[email protected]
Abstract
We examine the problem of nding similar tumor shap es. Starting from a natural similarity
function (the so-called `max morphological distance'), we showed how to lower-b ound it and how to
search for nearest neighb ors in large collections of tumor-like shap es.
Sp eci cally, we used state-of-the-art concepts from morphology, namely the `pattern sp ectrum'
of a shap e, to map each shap e to a p oint in n-dimensional space. Following [19, 36], we organized
the n-d p oints in an R-tree. We showed that the L (= max) norm in the n-d space lower-b ounds
1
the actual distance. This guarantees no false dismissals for range queries. In addition, we developed
a nearest-neighb or algorithm that also guarantees no false dismissals.
Finally, we implemented the metho d, and we tested it against a testb ed of realistic tumor shap es,
using an established tumor-growth mo del of Murray Eden[15]. The exp eriments showed that our
metho d is up to 27 times faster than straightforward sequential scanning.
His work was partially supp orted by the National Science Foundation under Grants IRI-9205273 and IRI-8958546,
with matching funds from Empress Software Inc. and Thinking Machines Inc. 1
1 Intro duction
This pap er prop oses an algorithm which rapidly searches for \similar shap es". Such an algorithm
would have broad applications in electronic commerce (e.g., ` nd shapes similar to a screw-driver'),
photo-journalism [20], etc., but would b e particularly useful in medical imaging. During the past
twenty years, the development of new mo dalities such as Computed Tomography (CT) and Magnetic
Resonance Imaging (MRI) have substantially increased the number and complexity of images presented
to radiologists and other physicians. Additionally, the recent intro duction of large scale PACS (Picture
Archival and Communication Systems) has resulted in the creation of large digital image databases.
A typical radiology department currently generates b etween 100,000 and 10,000,000 such images p er
year. A lmless imaging department such as the Baltimore VA Medical Center (VAMC) generates
approximately 1.5 terabytes of image data annually.
An algorithm that would b e able to search for similar shap es rapidly would have a number of useful
applications in diagnostic imaging. Both \exp erts" such as radiologists and non-exp erts could use such
a system for the following tasks:
1. Diagnosis/Classi cation: distinguish b etween a primary or metastatic (secondary) tumor based
on shap e and degree of change in shap e over time correlating this with data ab out diagnoses
and symptoms. Computer-aided diagnosis will b e esp ecially useful in increasing the reliability of
detection of pathology, particularly when overlapping structures create a distraction or in other
cases where limitations of the human visual system hamp er diagnosis [37].
2. Forecasting/Time Evolution Analysis: predict the degree of aggressiveness of the pathologic pro-
cess or try to distinguish a particular histology based on patterns of change in shap e. In this
setting, we would like to nd tumors in the database with the similar history as the current
tumor.
3. Data Mining: detect correlations among shap es, diagnoses, symptoms and demographic data, and
thus form and test hyp otheses ab out the development and treatment of tumors.
In all of the ab ove tasks, the central problem is similarity matching: ` nd tumors that are similar
to a given pattern' (including shap e, shap e changes, and demographic patient data). We mainly fo cus
on matching similar shap es. In section 6, we discuss how our approach can b e naturally extended to
handle more complicated queries.
Some terminology is necessary. Following [19], we distinguish b etween (a) range queries (e.g., nd
shapes that are within distance from the desirable query shape) and (b) nearest neighbor queries (e.g.,
nd the rst k closest shapes to the query shape) An orthogonal axis of classi cation distinguishes
b etween whole-matching and sub-pattern matching: In whole-matching queries, the user sp eci es an 2
S S query image and requires images of S S that are similar; in sub-pattern matching queries, the
user sp eci es only a small p ortion and requires all the (arbitrary-size) images that contain a similar
pattern
In this work we fo cus on whole-matching, b ecause this is the stepping stone for the sub-pattern
matching, and all the problems listed ab ove, as discussed in section 6.
For the whole matching problem, there are two ma jor challenges:
How to measure the dis-similarity/distance b etween two shap es. In the tumor application, as well
as in most other shap e applications, the distance function should b e invariant to rotation and
translation. Moreover, we would like a function that pays attention to details at several scales, as
we explain later.
Given such a distance function, how can we do b etter than sequential scanning of the whole
database? This faster metho d, however, should not compromise the correctness: it should have
no false dismissals; that is, it should return exactly the same resp onse set as sequential scanning
would do.
Next we provide solutions to the ab ove two challenges. This pap er is organized as follows: Section 2
gives the survey. Section 3 gives an intro duction to morphology and tumor-shap e mo deling. Section 4
presents our main result: the lower-bounding of the `max-morphological' distance, as well as a k -nearest
neighb or algorithm, without false dismissals. Section 5 gives the exp eriments. Section 6 discusses how
to use the prop osed algorithms for more general problems. Section 7 gives the conclusions.
2 Survey
2.1 Multimedia Indexing
The state of the art in multimedia indexing is based on feature extraction [36, 19]. The idea is to extract
n numerical features from the ob jects of interest, mapping them into p oints in n-dimensional space.
Then, any multi-dimensional indexing metho d can b e used to organize, cluster and eciently search
the resulting p oints. Such metho ds are traditionally called Spatial Access Methods (SAMs). A query of
the form nd objects similar to the query object Q b ecomes the query nd points that are close to the
query point q, and thus b ecomes a range query or nearest neighb or query. Thus, we can use the SAM
to identify quickly the qualifying p oints, and, from them, the corresp onding ob jects. Following [1], we
call the resulting index as the `F-index' (for `feature index'). This general approach has b een used in
several settings, such as searching for similar time-sequences [1] (e.g., trying to nd quickly sto ck prices
that move like MacDonalds), color images [18, 20], etc.
The ma jor challenge is to nd feature extraction functions that preserve the dis-similarity/distance
b etween the ob jects as much as p ossible. In [1, 19] we showed that the F-index metho d can guarantee 3
that there will not b e any false dismissals, if the actual distance is lower-bounded by the distance in
feature space.
Mathematically, let O and O b e two ob jects (e.g., time sequences, bitmaps of tumors, etc.) with
1 2
distance function D () (e.g., the sum of squared errors) and F (O ), F (O ) b e their feature vectors
obj ect 1 2
(e.g., their rst few Fourier co ecients), with distance function D () (e.g., the Euclidean distance,
f eatur e
again). Then we have:
Lemma 1 (Lower-Bounding) To guarantee no false dismissals for range queries, the feature extrac-
tion function F () should satisfy the fol lowing formula:
D (F (O );F (O )) D (O ;O ) (1)
f eatur e 1 2 obj ect 1 2
Pro of: In [19].
Thus, the search for range queries involves two steps. For a query ob ject Q with tolerance , we
1. Discard quickly those objects whose feature vectors are too far away. That is,
we retrieve the objects X such that D (F (Q);F (E )) < .
f eatur e
2. Apply D () to discard the false alarms (the clean-up stage).
obj ect
2.2 Spatial access metho ds
Since we rely on spatial access metho ds as the eventual indexing mechanism, we give a brief survey of
them. These metho ds fall in the following broad classes: metho ds that transform rectangles into p oints
in a higher dimensionality space [32]; metho ds that use linear quadtrees [23] [3] or, equivalently, the
z -ordering [51] or other space lling curves [17] [35]; and nally, metho ds based on trees (R-tree [27],
k-d-trees [6], k-d-B-trees [56], hB-trees [41], cell-trees [26], etc.)
One of the most promising approaches in the last class is the R-tree [27] and its numerous variants
+
(Greene's variation [25], the R -tree [58], R-trees using Minimum Bounding Polygons [34], the R -
tree [5], the Hilb ert R-tree [38], etc.). We use R-trees, b ecause they have already b een used successfully
for high-dimensionali ty spaces (10-20 dimensions [18]); in contrast, grid- les and linear quadtrees may
su er from the `dimensionality curse'.
2.3 Tumor growth mo dels
Our target class is a collection of images of tumor-like shap es. As a preliminary testb ed, we use arti cial
data generated by a certain sto chastic mo del of simulated tumor growth. Our particular mo del is a
discrete-time version of Eden's tumor growth mo del [15], whose idea is illustrated in Figures 1 and 2.
At time t=0, only one grid-cell is `infected'; each infected grid-cell may infect its four non-diagonal
neighb ors with equal probability p at each time-tick. 4
On the basic Eden mo del, we have added the notion of East-West/North-South bias, to capture
the e ects of anisotropic growth patterns, due to anisotropies in the surrounding tissue (e.g., lesions
shap ed by their lo cation within the lung, breast, or liver.) Thus, in our mo del, an infected grid-cell has
probability p to infect its North and South neighb ors, and probability p to infect its East/West
NS EW
ones, with p not necessarily equal to p .
NS EW
6 9 9 85 47 4 6 9 6 32 1 6 8
9 9 9
Figure 1: Lattice at t = 9. The infection time of each infected cell is marked.
t = 1 t = 5 t = 10 t = 25 t = 50 t = 100
Figure 2: Initial seed (left column) and snapshots of tumor at later time steps, with p = :7
The Eden mo del is the simplest of a hierarchy of mo dels: in increasing generality, we have the
Williams-Bjerkness mo del [62], the Contact Process [31], and the Interacting Particle Systems [13, 40].
See [21, 30, 55, 10] for a more comprehensive survey on tumor growth mo dels.
2.4 Shap e Representation and Matching
Shap e representation is an interesting enough problem to have attracted many researchers and generated
a rich array of approaches [52]. There are two closely related problems: (a) how to describ e a single
shap e compactly and (b) how to measure the di erence of two shap es, so that it corresp onds to the
humanly p erceived di erence.
With resp ect to representations, the most p opular metho ds are:
representation through `landmarks': for example, in order to match two faces, information ab out
the eyes, nose, etc., are extracted manually [4] or automatically. Thus, a shap e is represented by a
set of landmarks and their attributes (area, p erimeter, relative p osition etc). Again, the distance
b etween two images is the sum of the p enalties for the di erences of the landmarks. 5
representation through numerical vectors, such as (a) samples of the `turning angle' plot [33] (that
is, the slop e of the tangent at each p oint of the p eriphery, as a function of the distance traveled on
the p eriphery from a designated starting p oint) (b) some co ecients of the 2-d Discrete Fourier
Transform (DFT), or, more recently, the (2-d) Discrete Wavelet Transform [43] or (c) the rst few
moments of inertia [20, 18]. In these cases, we typically use the (weighted) Euclidean distance of
the vectors.
representation through a simpler shap e, such as p olygonalization [24, 50, 54, 60, 39] and Mathe-
matical Morphology [63, 47, 44, 11, 7], which we shall examine in detail next.
Among them, representations based on morphology are very promising, b ecause they have two very
desirable characteristics for our applications:
they can b e easily designed to b e essentially invariant to rotation and translation (= rigid motions).
they are inherently `multi-scale', and thus they can highlight di erences at several scales, as we
explain next.
The multi-scale characteristic is imp ortant, esp ecially for tumors, b ecause the `ruggedness' of the
p eriphery of a tumor contains a lot of information ab out it [9]. Thus, given two tumor-like shap es, we
would like to examine di erences at several scales b efore we pronounce the two shap es as `similar'.
Even for general shap es, there exists substantial evidence that scale-space b ehavior is an imp ortant
and highly discriminating shap e \signature" [61, 42, 8].
3 Intro duction to Morphology
Our goal is to cho ose a distance function b etween shap es which will b e invariant to translation and
rotation, and which will `give attention' to all levels of detail. One such function is founded on ideas
from the eld of Mathematical Morphology. See [12] for a very accessible intro duction. Next, we present
the de nitions and concepts that we need for our application. Table 3 lists the symb ols and their
de nitions.
Some de nitions are in order: Consider black-and-white images in 2-d space; the `white' p oints of
an image are a subset of the 2-d address space, while the background is, by convention, black. More
2
formally, let X (the \shap e space") b e a set of compact subsets of < , and R b e the group of rigid
motions R : X 7! X .
3.1 Intro duction to Morphology
Mathematical Morphology is a rich quantitative theory of shap e, which incorp orates a multiscale com-
p onent. It has b een develop ed mainly by Matheron [48, 49], Serra [59, 14], and their collab orators. 6
Symb ol De nition
< the set of reals
< the set of non-negative reals
+
Z the set of integers
Z the set of non-negative integers
+
the op erator for dilation
the op erator for erosion
the op erator for Morphological Op ening
the op erator for Morphological Closing
jX j area of a shap e X
H
f (X ) a smo othed version of X at scale n wrt structural elt H
m
H
y the size-distribution ( cumulative pattern sp ectrum) of X wrt structural elt H
X
d(; ) the set-di erence distance b etween two shap es
d (; ) the oating shap e distance
d (; ) the max-morphological distance b etween two shap es
1
(; ) the max-granulometric distance b etween two shap es
1
a resp onse set size (number of actual hits)
N database size (number of images)
n number of features in feature space
Table 1: Symb ol table
Since the 1980's, morphology and its applications have b ecome extremely p opular.
In mathematical morphology, mappings are de ned in terms of a structural element, a \small" prim-
itive shap e (set of p oints) which interacts with the input image to transform it, and, in the pro cess,
extract useful information ab out its geometrical and top ological structure. The basic morphological
op erators are: ; ; ; (dilation, erosion, Morphological Op ening, and Closing, resp ectively), and they
are de ned as follows.
s
Let X denote the translate of shap e X by the vector h, and let H denote the symmetric of shap e
h
H with resp ect to the origin:
s
H = f h j h 2 H g (2)
s 2
De nition 1 The dilation, X H , of a shape X < by a structural element H , is de ned as
n o
[
s 2
X H = X = (x; y ) 2 < j H \ X 6= ; (3)