Volume 11 (4) 1998, pp. 413–437
Markov Random Fields and Images
Patrick Pérez
Irisa/Inria, Campus de Beaulieu,
35042 Rennes Cedex, France
e-mail: [email protected]
At the intersection of statistical physics and probability theory, Markov random
fields and Gibbs distributions have emerged in the early eighties as powerful tools
for modeling images and coping with high-dimensional inverse problems from low-
level vision. Since then, they have been used in many studies from the image
processing and computer vision community. A brief and simple introduction to
the basics of the domain is proposed.
1. Introduction and general framework
With a seminal paper by Geman and Geman in 1984 [18], powerful tools
known for long by physicists [2] and statisticians [3] were brought in a com-
prehensive and stimulating way to the knowledge of the image processing and
computer vision community. Since then, their theoretical richness, their prac-
tical versatility, and a number of fruitful connections with other domains, have
resulted in a profusion of studies. These studies deal either with the mod-
eling of images (for synthesis, recognition or compression purposes) or with
the resolution of various high-dimensional inverse problems from early vision
(e.g., restoration, deblurring, classification, segmentation, data fusion, surface
reconstruction, optical flow estimation, stereo matching, etc.; see collections of
examples in [11,30,40]).
The implicit assumption behind probabilistic approaches to image analysis
is that, for a given problem, there exists a probability distribution that can
capture to some extent the variability and the interactions of the different sets
of relevant image attributes. Consequently, one considers the variables of the
problem as random variables forming a set (or random vector) X = (X_i)_{i=1}^n
with joint probability distribution P_X.^1

^1 P_X is actually a probability mass in the case of discrete variables, and a probability density
function when the X_i's are continuously valued. In the latter case, all summations over
states or configurations should be replaced by integrals.
The first critical step toward probabilistic modeling thus obviously relies on
the choice of the multivariate distribution P_X. Since there is so far no really
generic theory for selecting a model, a tailor-made parameterized function P_X^θ
is generally chosen among standard ones, based on intuition of the desirable
properties.^2
The basic characteristic of the chosen distributions is their decomposition as a
product of factors depending on just a few variables (one or two in most cases).
Also, a given distribution involves only a few types of factors.
One has simply to specify these local "interaction" factors (which might
be complex, and might involve variables of different nature) to define, up to
some multiplicative constant, the joint distribution P_X(x_1, ..., x_n): one ends
up with a global model.
With such a setup, each variable only directly depends on a few other "neigh-
boring" variables. From a more global point of view, all variables are mutually
dependent, but only through the combination of successive local interactions.
This key notion can be formalized by considering the graph for which i and j are
neighbors if x_i and x_j appear within the same local component of the chosen
factorization. This graph turns out to be a powerful tool to account for local
and global structural properties of the model, and to predict changes in these
properties through various manipulations. From a probabilistic point of view,
this graph neatly captures Markov-type conditional independencies among the
random variables attached to its vertices.
After the specification of the model, one deals with its actual use for mod-
eling a class of problems and for solving them. At that point, as we shall see,
one of the two following things will have to be done: (1) drawing samples from
the joint distribution, or from some conditional distribution deduced from the
joint law when part of the variables are observed and thus fixed; (2) maximiz-
ing some distribution (P_X itself, or some conditional or marginal distribution
deduced from it).
The very high dimensionality of the image problems under concern usually ex-
cludes any direct method for performing both tasks. Fortunately, the local decom-
position of P_X allows one to devise suitable deterministic or stochastic
iterative algorithms, based on a common principle: at each step, just a few
variables (often a single one) are considered, all the others being "frozen".
Markovian properties then imply that the computations to be done remain
local, that is, they only involve neighboring variables.
This paper is intended to give a brief (and definitely incomplete) overview
of how Markovian models can be defined and manipulated in the prospect of
modeling and analyzing images. Starting from the formalization of Markov ran-
dom fields (MRFs) on graphs through the specification of a Gibbs distribution
(§2), the standard issues of interest are then reviewed in broad terms: sampling from
a high-dimensional Gibbs distribution (§3); learning models (at least param-
eters) from observed images (§4); using the Bayesian machinery to cope with
inverse problems, based on learned models (§5); estimating parameters with
partial observations, especially in the case of inverse problems (§6). Finally,
two modeling issues (namely the introduction of so-called auxiliary variables,
and the definition of hierarchical models), which are receiving a great deal of
attention from the community at the moment, are evoked (§7).

^2 Superscript θ denotes a parameter vector. Unless necessary, it will be dropped for nota-
tional convenience.
2. Gibbs distribution and graphical Markov properties
Let us now make more formal acquaintance with Gibbs distributions and their
Markov properties. Let X_i, i = 1, ..., n, be random variables taking values
in some discrete or continuous state space Λ, and form the random vector
X = (X_1, ..., X_n)^T with configuration set Ω = Λ^n. All sorts of state spaces
are used in practice. The most common examples are: Λ = {0, ..., 255} for 8-bit
quantized luminances; Λ = {1, ..., M} for semantic "labelings" involving M
classes; Λ = R for continuously-valued variables like luminance, depth, etc.;
Λ = {−u_max, ..., u_max} × {−v_max, ..., v_max} in matching problems involving
displacement vectors or stereo disparities for instance; Λ = R^2 in vector-field-
based problems like optical flow estimation or shape-from-shading.
As said in the introduction, P_X exhibits a factorized form:

    P_X(x) ∝ ∏_{c∈C} f_c(x_c),    (1)

where C consists of small index subsets c, the factor f_c depends only on the
variable subset x_c = {x_i, i ∈ c}, and ∏_c f_c is summable over Ω. If, in addi-
tion, the product is positive (∀x ∈ Ω, P_X(x) > 0), then it can be written in
exponential form (letting V_c = −ln f_c):

    P_X(x) = (1/Z) exp{−∑_c V_c(x_c)}.    (2)
Well known to physicists, this is the Gibbs (or Boltzmann) distribution with
interaction potential {V_c, c ∈ C}, energy U = ∑_c V_c, and partition function (a
function of the parameters) Z = ∑_{x∈Ω} exp{−U(x)}.^3 Configurations of lower energy are
the more likely, whereas high energies correspond to low probabilities.
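As a toy illustration of these definitions (not from the paper), the following Python sketch tabulates a small Gibbs distribution from assumed pairwise potentials over three binary spins; the names `potentials`, `energy`, and `prob` are ours, and the brute-force partition function is only feasible because n is tiny (the footnote's warning about the intractability of Z applies as soon as n grows).

```python
import itertools
import math

# Toy Gibbs distribution over n = 3 binary variables x_i in {-1, +1}.
# Each potential V_c acts on a small index subset c (here, pairs).
potentials = {
    (0, 1): lambda xa, xb: -1.0 * xa * xb,   # attractive pair potential
    (1, 2): lambda xa, xb: -1.0 * xa * xb,
}

def energy(x):
    """U(x) = sum_c V_c(x_c): the sum of the local interaction terms."""
    return sum(V(x[c[0]], x[c[1]]) for c, V in potentials.items())

# Partition function Z = sum over all configurations of exp(-U(x));
# exhaustive enumeration is tractable only for tiny n.
configs = list(itertools.product([-1, 1], repeat=3))
Z = sum(math.exp(-energy(x)) for x in configs)

def prob(x):
    return math.exp(-energy(x)) / Z

# Lower-energy configurations are the more likely ones:
assert prob((1, 1, 1)) > prob((1, -1, 1))
```

By construction the eight probabilities sum to one, and the aligned configuration (1, 1, 1) has lower energy, hence higher probability, than any configuration with a flipped middle spin.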
The interaction structure induced by the factorized form is conveniently
captured by a graph that statisticians refer to as the independence graph: the
independence graph associated with the factorization ∏_{c∈C} f_c is the simple
undirected graph G = [S, E] with vertex set S = {1, ..., n}, and edge set E
defined as: {i, j} ∈ E ⟺ ∃c ∈ C : {i, j} ⊂ c, i.e., i and j are neighbors if x_i
and x_j appear simultaneously within the same factor f_c. The neighborhood n(i)
of site i is then defined as n(i) = {j ∈ S : {i, j} ∈ E}.^4 As a consequence
of the definition, any subset c is either a singleton or composed of mutually
neighboring sites: C is a set of cliques for G.

^3 The formal expression of the normalizing constant Z must not veil the fact that it is unknown
and beyond reach in general, due to the intractable summation over Ω.

^4 n = {n(i), i ∈ S} is called a neighborhood system, and the neighborhood of some subset
a ⊂ S is defined as n(a) = {j ∈ S − a : n(j) ∩ a ≠ ∅}.
When variables are attached to the pixels of an image, the most common
neighborhood systems are the regular ones where a site away from the border
of the lattice has four or eight neighbors. In the first case (first-order neighbor-
hood system, as in Figure 1.a) the subsets c have at most two elements, whereas
in the second-order neighborhood system cliques can exhibit up to four sites. How-
ever, other graph structures are also used: in segmentation applications where
the image plane is partitioned, G might be the planar graph associated to the
partition (Figure 1.b); and hierarchical image models often live on (quad-)trees (Figure 1.c).
Figure 1. Three typical graphs supporting MRF-based models for image
analysis: (a) rectangular lattice with first-order neighborhood system; (b)
non-regular planar graph associated to an image partition; (c) quad-tree. For
each graph, the grey nodes are the neighbors of the white one.
The independence graph conveys the key probabilistic information through absent
edges: if i and j are not neighbors, P_X(x) can obviously be split into two parts
respectively independent from x_i and from x_j. This suffices to conclude that
the random variables X_i and X_j are independent given the others. This is the
pair-wise Markov property [29,39].
In the same fashion, given a set a ⊂ S of vertices, P_X splits into
∏_{c : c∩a≠∅} f_c × ∏_{c : c∩a=∅} f_c, where the second factor does not depend on x_a. As
a consequence P_{X_a|X_{S−a}} reduces to P_{X_a|X_{n(a)}}, with:

    P_{X_a|X_{n(a)}}(x_a|x_{n(a)}) ∝ ∏_{c : c∩a≠∅} f_c(x_c) = exp{−∑_{c : c∩a≠∅} V_c(x_c)},    (3)

with some normalizing constant Z_a(x_{n(a)}), whose computation by summing
over all possible x_a is usually tractable. This is the local Markov property.
The conditional distributions (3) constitute the key ingredients of the iterative
procedures to be presented, where a small site set a (often a singleton) is
considered at a time.
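The local Markov property can be checked numerically on a toy example. The Python sketch below (ours, not the paper's; a three-site chain with an assumed coupling β = 0.7) verifies that the conditional law of site 0 given all other sites depends only on its neighbor, site 1, and not on the distant site 2.

```python
import itertools
import math

# Chain MRF on sites 0-1-2: only factors on the pairs {0,1} and {1,2},
# so sites 0 and 2 share no factor and are not neighbors in G.
beta = 0.7  # assumed coupling, for illustration

def energy(x):
    return -beta * (x[0] * x[1] + x[1] * x[2])

def cond_site0(x0, x1, x2):
    """Exact P(X_0 = x0 | X_1 = x1, X_2 = x2) computed from the joint law."""
    num = math.exp(-energy((x0, x1, x2)))
    den = sum(math.exp(-energy((v, x1, x2))) for v in (-1, 1))
    return num / den

# Local Markov property: the conditional depends only on the neighbor x1,
# not on the distant site x2 ...
assert abs(cond_site0(1, 1, 1) - cond_site0(1, 1, -1)) < 1e-12

# ... and it matches exp(beta*x0*x1) / (2*cosh(beta*x1)), the closed form
# one obtains by cancelling the factors that do not involve x0.
closed_form = math.exp(beta * 1 * 1) / (2 * math.cosh(beta * 1))
assert abs(cond_site0(1, 1, 1) - closed_form) < 1e-12
```

The cancellation in `cond_site0` is exactly the splitting of P_X into factors meeting a and factors avoiding it: the terms not involving x_0 appear in both numerator and denominator and drop out.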
It is possible (but more involved) to prove the global Markov property, ac-
cording to which, if a vertex subset A separates two other disjoint subsets B
and C in G (i.e., all chains from i ∈ B to j ∈ C intersect A), then the random
vectors X_B and X_C are independent given X_A [29,39]. It can also be shown
that the three Markov properties are equivalent for strictly positive distribu-
tions. Conversely, if a positive distribution P_X fulfills one of these Markov
properties w.r.t. a graph G (X is then said to be an MRF on G), then P_X is a
Gibbs distribution of the form (2) relative to the same graph. This equivalence
constitutes the Hammersley-Clifford theorem [3].
To conclude this section, we introduce two very standard Markov random
fields which have been extensively used for image analysis purposes.
Example 1: Ising MRF. Introduced in the twenties in the statistical physics of
condensed matter, and studied since then with great attention, this model
deals with binary variables (Λ = {−1, 1}) that interact locally. Its simplest
formulation is given by the energy

    U(x) = −β ∑_{⟨i,j⟩} x_i x_j,
where the summation is taken over all edges ⟨i, j⟩ ∈ E of the chosen graph. The
"attractive" version (β > 0) statistically favors identity of neighbors. The
conditional distribution at site i is readily deduced:

    P_{X_i|X_{n(i)}}(x_i|x_{n(i)}) = exp{β x_i ∑_{j∈n(i)} x_j} / (2 cosh(β ∑_{j∈n(i)} x_j)).
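This conditional distribution is precisely the local ingredient that single-site iterative procedures use. As an illustrative sketch (ours, not the paper's; the lattice size N, the coupling β, and the number of sweeps are assumed for illustration), the following Python code resamples each site of a small first-order lattice from its conditional, one full sweep at a time.

```python
import math
import random

random.seed(0)
N, beta = 16, 0.6          # assumed lattice size and coupling
# Random initial configuration of +/-1 spins.
x = [[random.choice((-1, 1)) for _ in range(N)] for _ in range(N)]

def neighbors_sum(x, i, j):
    """Sum of x_j over the first-order (4-connected) neighborhood of (i, j)."""
    s = 0
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < N and 0 <= nj < N:   # free boundary conditions
            s += x[ni][nj]
    return s

def sweep(x):
    """One full sweep: resample each site from P(x_i | x_{n(i)})."""
    for i in range(N):
        for j in range(N):
            s = neighbors_sum(x, i, j)
            # P(x_i = +1 | neighbors) = exp(beta*s) / (2 cosh(beta*s))
            p_plus = math.exp(beta * s) / (2.0 * math.cosh(beta * s))
            x[i][j] = 1 if random.random() < p_plus else -1

for _ in range(50):
    sweep(x)
```

With β > 0, successive sweeps make neighboring sites increasingly likely to agree, producing large single-sign regions; β < 0 would instead favor checkerboard-like patterns.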
The Ising model is useful in detection-type problems where binary variables
are to be recovered. Some other problems (essentially segmentation and clas-
sification) require the use of symbolic (discrete) variables with more than two
possible states. For these cases the Potts model, also stemming from statistical
physics [2], provides an immediate extension of the Ising model, with energy
U(x) = −β ∑_{⟨i,j⟩} [2δ(x_i, x_j) − 1], where δ is the Kronecker delta.
Example 2: Gauss-Markov field. This is a continuously-valued random vector
(Λ = R) ruled by a multivariate Gaussian distribution

    P_X(x) = (1/√(det(2πΣ))) exp{−(1/2)(x − μ)^T Σ^{−1} (x − μ)},

with expectation vector E(X) = μ and covariance matrix Σ. The "Markovian-
ity" shows up when each variable only interacts with a few others through the