
Volume 11 (4) 1998, pp. 413–437

Markov Random Fields and Images

Patrick Perez

Irisa/Inria, Campus de Beaulieu,

35042 Rennes Cedex, France

e-mail: [email protected]

At the intersection of statistical physics and probability theory, Markov random fields and Gibbs distributions have emerged in the early eighties as powerful tools for modeling images and coping with high-dimensional inverse problems from low-level vision. Since then, they have been used in many studies from the image processing and computer vision community. A brief and simple introduction to the basics of the domain is proposed.

1. Introduction and general framework

With a seminal paper by Geman and Geman in 1984 [18], powerful tools known for long by physicists [2] and statisticians [3] were brought in a comprehensive and stimulating way to the knowledge of the image processing and computer vision community. Since then, their theoretical richness, their practical versatility, and a number of fruitful connections with other domains, have resulted in a profusion of studies. These studies deal either with the modeling of images (for synthesis, recognition or compression purposes) or with the resolution of various high-dimensional inverse problems from early vision (e.g., restoration, deblurring, classification, segmentation, data fusion, surface reconstruction, optical flow estimation, stereo matching, etc. See collections of examples in [11,30,40]).

The implicit assumption behind probabilistic approaches to image analysis is that, for a given problem, there exists a probability distribution that can capture to some extent the variability and the interactions of the different sets of relevant image attributes. Consequently, one considers the variables of the problem as random variables forming a set (or random vector) X = (X_i)_{i=1}^n with joint probability distribution P_X.¹

¹ P_X is actually a probability mass in the case of discrete variables, and a probability density function when the X_i's are continuously valued. In the latter case, all summations over states or configurations should be replaced by integrals.

The first critical step toward probabilistic modeling thus obviously relies on the choice of the multivariate distribution P_X. Since there is so far no really generic theory for selecting a model, a tailor-made parameterized function P_X^θ is generally chosen among standard ones, based on intuition of the desirable properties².

The basic characteristic of chosen distributions is their decomposition as a product of factors depending on just a few variables (one or two in most cases). Also, a given distribution involves only a few types of factors.

One has simply to specify these local "interaction" factors (which might be complex, and might involve variables of different nature) to define, up to some multiplicative constant, the joint distribution P_X(x_1, …, x_n): one ends up with a global model.

With such a setup, each variable only directly depends on a few other "neighboring" variables. From a more global point of view, all variables are mutually dependent, but only through the combination of successive local interactions. This key notion can be formalized considering the graph for which i and j are neighbors if x_i and x_j appear within a same local component of the chosen factorization. This graph turns out to be a powerful tool to account for local and global structural properties of the model, and to predict changes in these properties through various manipulations. From a probabilistic point of view, this graph neatly captures Markov-type conditional independencies among the random variables attached to its vertices.

After the specification of the model, one deals with its actual use for modeling a class of problems and for solving them. At that point, as we shall see, one of the two following things will have to be done: (1) drawing samples from the joint distribution, or from some conditional distribution deduced from the joint law when part of the variables are observed and thus fixed; (2) maximizing some distribution (P_X itself, or some conditional or marginal distribution deduced from it).

The very high dimensionality of image problems under concern usually excludes any direct method for performing both tasks. However the local decomposition of P_X fortunately allows one to devise suitable deterministic or stochastic iterative algorithms, based on a common principle: at each step, just a few variables (often a single one) are considered, all the others being "frozen". Markovian properties then imply that the computations to be done remain local, that is, they only involve neighboring variables.

This paper is intended to give a brief (and definitely incomplete) overview of how Markovian models can be defined and manipulated in the prospect of modeling and analyzing images. Starting from the formalization of Markov random fields (mrfs) on graphs through the specification of a Gibbs distribution (§2), the standard issues of interest are then grossly reviewed: sampling from a high-dimensional Gibbs distribution (§3); learning models (at least parameters) from observed images (§4); using the Bayesian machinery to cope with inverse problems, based on learned models (§5); estimating parameters with partial observations, especially in the case of inverse problems (§6). Finally two modeling issues (namely the introduction of so-called auxiliary variables, and the definition of hierarchical models), which are receiving a great deal of attention from the community at the moment, are evoked (§7).

² Superscript θ denotes a parameter vector. Unless necessary, it will be dropped for notational convenience.

2. Gibbs distribution and graphical Markov properties

Let us now make more formal acquaintance with Gibbs distributions and their Markov properties. Let X_i, i = 1, …, n, be random variables taking values in some discrete or continuous state space Λ, and form the random vector X = (X_1, …, X_n)^T with configuration set Ω = Λ^n. All sorts of state spaces are used in practice. More common examples are: Λ = {0, …, 255} for 8-bit quantized luminances; Λ = {1, …, M} for semantic "labelings" involving M classes; Λ = ℝ for continuously-valued variables like luminance, depth, etc.; Λ = {−u_max, …, u_max} × {−v_max, …, v_max} in matching problems involving displacement vectors or stereo disparities for instance; Λ = ℝ² in vector field-based problems like optical flow estimation or shape-from-shading.

As said in the introduction, P_X exhibits a factorized form:

    P_X(x) ∝ ∏_{c∈C} f_c(x_c),    (1)

where C consists of small index subsets c, the factor f_c depends only on the variable subset x_c = {x_i, i ∈ c}, and ∏_c f_c is summable over Ω. If, in addition, the product is positive (∀x ∈ Ω, P_X(x) > 0), then it can be written in exponential form (letting V_c = −ln f_c):

    P_X(x) = (1/Z) exp{−∑_{c∈C} V_c(x_c)}.    (2)

Well known from physicists, this is the Gibbs (or Boltzmann) distribution with interaction potential {V_c, c ∈ C}, energy U = ∑_c V_c, and partition function (of parameters) Z = ∑_{x∈Ω} exp{−U(x)}³. Configurations of lower energies are the more likely, whereas high energies correspond to low probabilities.

The interaction structure induced by the factorized form is conveniently captured by a graph that statisticians refer to as the independence graph: the independence graph associated with the factorization ∏_{c∈C} f_c is the simple undirected graph G = [S, E] with vertex set S = {1, …, n}, and edge set E defined as: {i, j} ∈ E ⟺ ∃c ∈ C : {i, j} ⊂ c, i.e., i and j are neighbors if x_i and x_j appear simultaneously within a same factor f_c. The neighborhood n(i) of site i is then defined as n(i) = {j ∈ S : {i, j} ∈ E}⁴. As a consequence of the definition, any subset c is either a singleton or composed of mutually neighboring sites: C is a set of cliques for G.

³ Formal expression of the normalizing constant Z must not veil the fact that it is unknown and beyond reach in general, due to the intractable summation over Ω.

⁴ n = {n(i), i ∈ S} is called a neighborhood system, and the neighborhood of some subset a ⊂ S is defined as n(a) = {j ∈ S−a : n(j) ∩ a ≠ ∅}.
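To make the definition concrete, the neighborhoods n(i) induced by a factorization can be read off directly from the index subsets c. A minimal Python sketch (the function name and data layout are ours, for illustration only):

```python
def independence_graph(n, cliques):
    """Neighborhoods n(i) induced by a factorization: i and j are
    neighbors iff they appear together in some index subset c."""
    nbrs = {i: set() for i in range(n)}
    for c in cliques:
        for i in c:
            # every other member of the subset c becomes a neighbor of i
            nbrs[i] |= set(c) - {i}
    return nbrs
```

For four pairwise factors arranged in a cycle on sites {0, 1, 2, 3}, each site ends up with exactly its two cycle neighbors.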

When variables are attached to the pixels of an image, the most common neighborhood systems are the regular ones where a site away from the border of the lattice has four or eight neighbors. In the first case (first-order neighborhood system, like in Figure 1.a) subsets c have at most two elements, whereas, in the second-order neighborhood system cliques can exhibit up to 4 sites. However, other graph structures are also used: in segmentation applications where the image plane is partitioned, G might be the planar graph associated to the partition (Figure 1.b); and hierarchical image models often live on (quad)-trees (Figure 1.c).

(a) (b) (c)

Figure 1. Three typical graphs supporting mrf-based models for image analysis: (a) rectangular lattice with first-order neighborhood system; (b) non-regular planar graph associated to an image partition; (c) quad-tree. For each graph, the grey nodes are the neighbors of the white one.

The independence graph conveys the key probabilistic information by absent edges: if i and j are not neighbors, P_X(x) can obviously be split into two parts respectively independent from x_i and from x_j. This suffices to conclude that the random variables X_i and X_j are independent given the others. It is the pair-wise Markov property [29,39].

In the same fashion, given a set a ⊂ S of vertices, P_X splits into ∏_{c : c∩a≠∅} f_c × ∏_{c : c∩a=∅} f_c, where the second factor does not depend on x_a. As a consequence P_{X_a|X_{S−a}} reduces to P_{X_a|X_{n(a)}}, with:

    P_{X_a|X_{n(a)}}(x_a|x_{n(a)}) ∝ ∏_{c : c∩a≠∅} f_c(x_c) = exp{−∑_{c : c∩a≠∅} V_c(x_c)},    (3)

with some normalizing constant Z_a(x_{n(a)}), whose computation by summing over all possible x_a is usually tractable. This is the local Markov property. The conditional distributions (3) constitute the key ingredients of iterative procedures to be presented, where a small site set a (often a singleton) is considered at a time.

It is possible (but more involved) to prove the global Markov property according to which, if a vertex subset A separates two other disjoint subsets B and C in G (i.e., all chains from i ∈ B to j ∈ C intersect A), then the random vectors X_B and X_C are independent given X_A [29,39]. It can also be shown that the three Markov properties are equivalent for strictly positive distributions. Conversely, if a positive distribution P_X fulfills one of these Markov properties w.r.t. graph G (X is then said to be a mrf on G), then P_X is a Gibbs distribution of the form (2) relative to the same graph. This equivalence constitutes the Hammersley-Clifford theorem [3].

To conclude this section, we introduce two very standard Markov random fields which have been extensively used for image analysis purposes.

Example 1: Ising mrf. Introduced in the twenties in statistical physics of condensed matter, and studied since then with great attention, this model deals with binary variables (Λ = {−1, 1}) that interact locally. Its simpler formulation is given by energy

    U(x) = −β ∑_{⟨i,j⟩} x_i x_j,

where summation is taken over all edges ⟨i,j⟩ ∈ E of the chosen graph. The "attractive" version (β > 0) statistically favors identity of neighbors. The conditional distribution at site i is readily deduced:

    P_{X_i|X_{n(i)}}(x_i|x_{n(i)}) = exp{β x_i ∑_{j∈n(i)} x_j} / [2 cosh(β ∑_{j∈n(i)} x_j)].

The Ising model is useful in detection-type problems where binary variables are to be recovered. Some other problems (essentially segmentation and classification) require the use of symbolic (discrete) variables with more than two possible states. For these cases, the Potts model, also stemming from statistical physics [2], provides an immediate extension of the Ising model, with energy U(x) = −β ∑_{⟨i,j⟩} [2δ(x_i, x_j) − 1], where δ is the Kronecker delta. □

Example 2: Gauss-Markov field. It is a continuously-valued random vector (Λ = ℝ) ruled by a multivariate Gaussian distribution

    P_X(x) = (1/√det(2πΣ)) exp{−½ (x−μ)^T Σ^{−1} (x−μ)},

with expectation vector E(X) = μ and covariance matrix Σ. The "Markovianity" shows up when each variable only interacts with a few others through the quadratic energy, that is, when the matrix A = Σ^{−1} is sparse. Sites i and j are then neighbors in G iff the corresponding entry a_ij = a_ji is non-null. Graph G is exactly the so-called structure of sparse matrix A [24]. In practice a Gauss-Markov field is often defined simply by its quadratic energy function U(x) = ½ x^T A x − x^T b = ∑_{⟨i,j⟩} a_ij x_i x_j + ∑_i (a_ii x_i/2 − b_i) x_i, with b ∈ ℝ^n and A a sparse symmetric positive definite matrix. In that case, the expectation μ, which corresponds to the most probable configuration, is the solution of the large sparse linear system Aμ = b.
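Since the expectation solves Aμ = b, it can be obtained with any linear solver. A minimal sketch using dense NumPy arrays for readability (an image-sized model would call for sparse or iterative solvers instead; the function name is ours):

```python
import numpy as np

def gmrf_mean(A, b):
    """Mean (and most probable configuration) of a Gauss-Markov field
    with energy U(x) = 1/2 x^T A x - x^T b: solve A mu = b."""
    return np.linalg.solve(A, b)

# Tiny 2-site chain: A is the (symmetric positive definite) precision matrix.
A = np.array([[2.0, -1.0],
              [-1.0, 2.0]])
b = np.array([1.0, 1.0])
mu = gmrf_mean(A, b)
```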

Note that in the Gaussian case, any conditional or marginal distribution taken from P_X is Gaussian and can be explicitly written down by using adequate block partitioning of A and b [29,37,39]. All Markovian properties can then be directly deduced from this. Site-wise conditional distributions in particular turn out to be Gaussian laws

    (X_i | X_{n(i)} = x_{n(i)}) ∼ N( (1/a_ii)(b_i − ∑_{j∈n(i)} a_ij x_j), a_ii^{−1} ).

At this point, it is worth emphasizing the difference with so-called simultaneous autoregressive models [3] defined by:

    ∀i,  a_ii X_i = b_i − ∑_{j∈n(i)} a_ij X_j + W_i,

with the W_i's being i.i.d. reduced zero-mean Gaussian variables. As a matter of fact, the resulting inverse covariance matrix of X in this case is A^T A, whose filling structure is larger than the one of A.

The nice properties of Gaussian mrfs, inherited from the quadratic form of their energy, make them the most popular models in case of continuous or "almost continuous" (i.e., |Λ| very large) variables. □

3. Sampling from Gibbs distributions

In order to visually evaluate the statistical properties of the specified model, or simply to get synthetic images, one might want to draw samples from the distribution P_X. A more important and technical reason for which this sampling can be of much help is that for all issues requiring an exhaustive visit of the configuration set Ω (intractable in practice), e.g., summation or maximization over all possible occurrences, an approximate way to proceed consists in randomly visiting Ω for long enough according to distribution P_X. These approximation methods belong to the class of Monte Carlo methods [25,35].

The dimensionality of the model makes sampling delicate (starting with the fact that the normalizing constant Z is beyond reach). One has to resort to a Markov chain procedure on Ω which allows one to sample successively from random fields whose distributions get closer and closer to the target distribution P_X. The locality of the model indeed permits the design of a chain of random vectors X^1, …, X^m, …, without knowing Z, and such that P_{X^m} tends to P_X (for some distance) as m goes to infinity, whatever the initial distribution P_{X^0} is. The crux of the method lies in the design of transition probabilities P_{X^{m+1}|X^m} based only on local conditional distributions stemming from P_X, and such that the resulting Markov chain is irreducible⁵, aperiodic⁶, and preserves the target distribution, i.e., ∑_{x′∈Ω} P_{X^{m+1}|X^m}(x|x′) P_X(x′) = P_X(x) for any configuration x ∈ Ω. The latter condition is especially met when the so-called "detailed balance" holds: P_{X^{m+1}|X^m}(x|x′) P_X(x′) = P_{X^{m+1}|X^m}(x′|x) P_X(x).

⁵ There is a non-null probability to get from any x to any x′ within a finite number of transitions: ∃m : P_{X^m|X^0}(x′|x) > 0.

⁶ The configuration set Ω cannot be split into subsets that would be visited in a periodic way: ∄d > 1 : Ω = ∪_{k=0}^{d−1} Ω_k with X^0 ∈ Ω_0 ⇒ X^m ∈ Ω_{m−d⌊m/d⌋}, ∀m.
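Detailed balance can be checked numerically on a toy model. The sketch below (our construction, not taken from the paper) enumerates all configurations of a two-site Ising model and measures the worst violation of P(x′|x)P(x) = P(x|x′)P(x′) over the two single-site Gibbs kernels:

```python
import math
from itertools import product

def detailed_balance_gap(beta):
    """Largest violation of detailed balance for the single-site Gibbs
    kernels of a 2-site Ising model with energy -beta * x0 * x1."""
    def prob(x):  # unnormalized Gibbs probability exp{beta * x0 * x1}
        return math.exp(beta * x[0] * x[1])
    def kernel(x, x_new, i):  # resample site i, the other site frozen
        if x[1 - i] != x_new[1 - i]:
            return 0.0  # a single-site move cannot change the frozen site
        s = x[1 - i]    # value of the unique neighbor of site i
        return math.exp(beta * x_new[i] * s) / (2.0 * math.cosh(beta * s))
    gap = 0.0
    configs = list(product((-1, 1), repeat=2))
    for i in (0, 1):
        for x in configs:
            for xn in configs:
                gap = max(gap, abs(kernel(x, xn, i) * prob(x)
                                   - kernel(xn, x, i) * prob(xn)))
    return gap
```

The gap is zero up to floating-point noise for any β, which is exactly why the Gibbs sampler of the next paragraph preserves the target distribution.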

Various suitable dynamics can be designed [35]. A quite popular one for image models is the Gibbs sampler [18], which directly uses conditional distributions from P_X to generate the transitions. More precisely, S is split into small pieces (often singletons), which are "visited" either at random, or according to some pre-defined schedule. If a is the concerned set of sites at step m, and x the realization of X^m, a realization of X^{m+1} is obtained by replacing x_a in x = (x_a, x_{S−a}) by x′_a, drawn according to the conditional law

    P_{X_a|X_{S−a}}(·|x_{S−a}) = P_{X_a|X_{n(a)}}(·|x_{n(a)})

(this law of a few variables can be exactly derived, unlike P_X). The chain thus sampled is clearly irreducible and aperiodic (the null transition x → x is always possible), and detailed balance is verified since

    P_{X^{m+1}|X^m}(x_{S−a}, x′_a | x_{S−a}, x_a) P_X(x_a, x_{S−a})
        = P_{X_a|X_{S−a}}(x′_a | x_{S−a}) P_X(x_a, x_{S−a})
        = P_X(x′_a, x_{S−a}) P_X(x_a, x_{S−a}) / P_{X_{S−a}}(x_{S−a})

is symmetric in x_a and x′_a.

From a practical point of view the chain thus designed is started from any configuration x^0, and run for a large number of steps. For m large enough, it has almost reached an equilibrium around P_X, and following configurations can be considered as (non-independent) samples from P_X. The decision whether equilibrium is reached is intricate, though. The design of criteria and indicators to answer this question is an active area of research in mcmc studies.

If the expectation of some function f of X has to be evaluated, ergodicity properties yield E[f(X)] = lim_{m→+∞} [f(x^0) + … + f(x^{m−1})]/m. Practically, given a large but finite realization of the chain, one gets rid of the first m_0 out-of-equilibrium samples, and computes an average over the remainder:

    E[f(X)] ≈ (1/(m_1 − m_0)) ∑_{m=m_0+1}^{m_1} f(x^m).

Example 3: sampling the Ising model. Consider the Ising model from example 1 on a rectangular lattice with first-order neighborhood system. S is visited site-wise (sets a are singletons). If i is the current site and x the current configuration, x_i is updated according to the sampling of the associated local conditional distribution: it is set to 1 with probability ∝ exp{β ∑_{j∈n(i)} x_j}, and to −1 with probability ∝ exp{−β ∑_{j∈n(i)} x_j}. Small size samples are shown in Figure 2, illustrating typical behavior of the model for increasing interaction parameter β.
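The site-wise updates of Example 3 translate directly into code. A minimal sketch, assuming a free-boundary lattice stored as Python lists (the function name and schedule are ours):

```python
import math
import random

def gibbs_sweep_ising(x, beta, rng=random):
    """One raster-order Gibbs sampler sweep over an H x W lattice of
    +/-1 spins with first-order (4-nearest-neighbor) interactions."""
    H, W = len(x), len(x[0])
    for i in range(H):
        for j in range(W):
            # sum of the (up to 4) neighbor spins, free boundaries
            s = sum(x[a][b] for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                    if 0 <= a < H and 0 <= b < W)
            # P(x_ij = +1 | neighbors) = exp(beta*s) / (2 cosh(beta*s))
            p_up = math.exp(beta * s) / (2.0 * math.cosh(beta * s))
            x[i][j] = 1 if rng.random() < p_up else -1
    return x
```

Running many sweeps with increasing β reproduces the qualitative behavior of Figure 2: independent noise at β = 0, growing aligned patches as β increases.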

This basic model can be refined. It can, for instance, be made anisotropic by weighting in a different way "horizontal" and "vertical" pairs of neighbors:

    U(x) = −β₁ ∑_{⟨i,j⟩_h} x_i x_j − β₂ ∑_{⟨i,j⟩_v} x_i x_j.

Some typical patterns drawn from this version are shown in Figure 3 for different combinations of β₁ and β₂. □

Figure 2. Samples from an Ising model on a lattice with first-order neighborhood system and increasing β = 0 (in that case the X_i's are i.i.d., and sampling is direct), 0.7, 0.9, 1.1, 1.5, and 2.

Figure 3. Samples from an anisotropic Ising model on a lattice with first-order neighborhood system and (β₁, β₂) = (5, 0.5); (5, 0.1); (1, 1); (1, 1), respectively.

Example 4: sampling from Gauss-Markov random fields. S being a rectangular lattice equipped with the second-order neighborhood system, consider the quadratic energy

    U(x) = β₁ ∑_{⟨i,j⟩_h} (x_i − x_j)² + β₂ ∑_{⟨i,j⟩_v} (x_i − x_j)² + β₃ ∑_{⟨i,j⟩_d} (x_i − x_j)² + β₄ ∑_{⟨i,j⟩_{d′}} (x_i − x_j)² + ε ∑_i x_i²,

with β₁, β₂, β₃, β₄, ε some positive parameters. The quadratic form is obviously positive definite (since the minimum is reached for the unique configuration x ≡ 0) and thus defines a Gauss-Markov random field whose A matrix can be readily assembled using a_ij = ∂²U/∂x_i∂x_j. For current site i (away from the border) and configuration x, the Gibbs sampler resorts to sampling from the Gaussian law

    (X_i | X_{n(i)} = x_{n(i)}) ∼ N( ∑_{j∈n(i)} β_ij x_j / (ε + 8β̄), (2ε + 16β̄)^{−1} ),

where β̄ = (β₁ + β₂ + β₃ + β₄)/4 and β_ij = β₁, β₂, β₃, or β₄, depending on the orientation of ⟨i,j⟩. Some typical "textures" can be obtained this way for different parameter values (Figure 4). □

Figure 4. Samples from an anisotropic Gaussian model on a lattice with second-order neighborhood system and various values of (β₁, β₂, β₃, β₄, ε).
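A single Gibbs-sampler update of an interior site, following the Gaussian conditional above, might be sketched as follows (list-of-lists grid and function name are ours; the rng hook only makes the update easy to test):

```python
import math
import random

def sample_site(x, i, j, betas, eps, rng=random):
    """Gibbs update of one interior site (i, j) of the Example-4
    Gauss-Markov field: draw from
    N( sum_j beta_ij x_j / (eps + 8*bbar), 1/(2*eps + 16*bbar) )."""
    b1, b2, b3, b4 = betas            # horizontal, vertical, two diagonals
    bbar = (b1 + b2 + b3 + b4) / 4.0
    s = (b1 * (x[i][j-1] + x[i][j+1])         # horizontal neighbors
         + b2 * (x[i-1][j] + x[i+1][j])       # vertical neighbors
         + b3 * (x[i-1][j-1] + x[i+1][j+1])   # one diagonal
         + b4 * (x[i-1][j+1] + x[i+1][j-1]))  # the other diagonal
    mean = s / (eps + 8.0 * bbar)
    std = math.sqrt(1.0 / (2.0 * eps + 16.0 * bbar))
    x[i][j] = mean + std * rng.gauss(0.0, 1.0)
    return x[i][j]
```

Sweeping this update over all interior sites, for various (β₁, β₂, β₃, β₄, ε), is how textures like those of Figure 4 can be produced.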

4. Estimating parameters

When it is possible to make available various occurrences of X, like images in Figure 2, 3 or 4, one can try to learn the parameters θ of the assumed underlying Gibbs distribution P_X^θ = Z(θ)^{−1} exp{−∑_c V_c^θ}. The distribution thus adjusted might then be used either to generate "similar" realizations through sampling, or simply as a compact information characterizing a class of patterns. Also, if different models are learned, they then might be used in competition to analyze and identify the content of an image, in the context of statistical pattern recognition.

Learning distribution parameters based on observed samples is a standard issue from statistics. This estimation is often conducted based on the likelihood of observed data. However, in the case of the Gibbs distributions under concern in image problems, the high dimensionality raises once again technical difficulties which have to be specifically addressed or circumvented.

Given a realization x^o (it could also be a set of realizations) supposed to arise from one Gibbs distribution among the family {P_X^θ, θ ∈ Θ}, the question is to estimate at best the "real" θ₀. Maximum likelihood estimation (mle) seeks parameters that maximize the occurrence probability of x^o: θ̂ = argmax_θ L(θ), with (log-)likelihood function L(θ) = ln P_X^θ(x^o). Unfortunately, the partition function Z(θ), which is here required, cannot be derived in general (apart from causal cases where it factorizes as a product of local partition functions).

A simple and popular way to tackle this problem is to look instead at site-wise conditional distributions associated with data x^o: the global likelihood as a goodness-of-fit indicator is replaced by a sum of local indicators through the so-called pseudo-likelihood function [3], L_ps(θ) = ∑_i ln P^θ_{X_i|X_{n(i)}}(x_i^o | x^o_{n(i)}). The maximum pseudo-likelihood estimate (mple) is then:

    θ̂ = argmin_{θ∈Θ} ∑_{i∈S} [ln Z_i(x^o_{n(i)}, θ) + ∑_{c∋i} V_c^θ(x_c^o)],

where the local partition functions Z_i(x^o_{n(i)}, θ) are usually accessible.

Example 5: mple of Ising model. Consider the Ising model from example 1. Denoting n(x_a) = ∑_{i∈a} x_i, the mple is obtained by maximizing

    L_ps(β) = ∑_i [β x_i^o n(x^o_{n(i)}) − ln cosh(β n(x^o_{n(i)}))].

Gradient ascent can be conducted using dL_ps(β)/dβ = ∑_i n(x^o_{n(i)}) [x_i^o − tanh(β n(x^o_{n(i)}))]. □
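The gradient ascent of Example 5 can be sketched in a few lines of Python (the learning rate, iteration count, and free-boundary lattice convention are arbitrary choices of ours):

```python
import math

def mple_ising(x, lr=0.01, iters=500):
    """Maximum pseudo-likelihood estimate of beta for an Ising field on
    an H x W lattice (first-order neighborhoods), by gradient ascent on
    dL/dbeta = sum_i n_i * (x_i - tanh(beta * n_i))."""
    H, W = len(x), len(x[0])
    sites = [(i, j) for i in range(H) for j in range(W)]
    def nsum(i, j):  # sum of neighbor spins, free boundaries
        return sum(x[a][b] for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                   if 0 <= a < H and 0 <= b < W)
    beta = 0.0
    for _ in range(iters):
        grad = sum(nsum(i, j) * (x[i][j] - math.tanh(beta * nsum(i, j)))
                   for i, j in sites)
        beta += lr * grad / len(sites)  # per-site gradient step
    return beta
```

Note that for a perfectly constant image the pseudo-likelihood has no finite maximizer (the gradient stays positive), so in practice the estimate is only meaningful on mixed configurations.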

Example 6: mple of Gaussian model. Consider the Gauss-Markov model from example 4, with all β_k's equal to β. Denoting σ² = (2ε + 16β)^{−1} and λ = 2βσ², the site-wise conditional laws are N(λ n(x_{n(i)}), σ²). Setting the pseudo-likelihood gradient to zero yields the mple:

    λ̂ = ∑_i x_i^o n(x^o_{n(i)}) / ∑_i n(x^o_{n(i)})²,   σ̂² = (1/n) ∑_i (x_i^o − λ̂ n(x^o_{n(i)}))². □
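These closed-form estimates are straightforward once the neighbor sums n(x^o_{n(i)}) are available. A minimal sketch (we pass the neighbor sums in precomputed, to keep the function independent of the lattice layout; the names are ours):

```python
def mple_gauss(x, neighbor_sums):
    """Closed-form mple for the homogeneous Gauss-Markov model of
    Example 6: site-wise conditionals N(lambda * n_i, sigma^2), where
    n_i is the sum of the neighbor values of site i."""
    # least-squares slope of x_i against n_i
    lam = (sum(xi * ni for xi, ni in zip(x, neighbor_sums))
           / sum(ni * ni for ni in neighbor_sums))
    # residual variance around the fitted conditional means
    sig2 = sum((xi - lam * ni) ** 2
               for xi, ni in zip(x, neighbor_sums)) / len(x)
    return lam, sig2
```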

In widely encountered cases where the energy U^θ depends linearly on the parameters (the family {P_X^θ, θ ∈ Θ} is then said to be exponential), i.e., U^θ = ∑_k θ_k ∑_{c∈C_k} V_c(x_c) for some partition C = ∪_k C_k of the clique set, the gradients of the likelihood and pseudo-likelihood take special forms:

    ∂L(θ)/∂θ_k = −[U^k(x^o) − E_θ(U^k(X))],

    ∂L_ps(θ)/∂θ_k = −∑_i [U_i^k(x_i^o, x^o_{n(i)}) − E_θ(U_i^k(X_i, x^o_{n(i)}) | x^o_{n(i)})],

with U^k(x) = ∑_{c∈C_k} V_c(x_c) and U_i^k(x_i, x_{n(i)}) = ∑_{c∈C_k : i∈c} V_c(x_c). In that context, the mle can be directly sought by (stochastic) gradient ascent techniques where the expectation E_θ(U^k(X)) is approximated by sampling. The resulting mcmc mle methods [16,19,41], which recent progress in mcmc techniques (like the use of importance sampling) makes more and more tractable, thus offer an alternative to mple.

Also, (good) theoretical properties of mle and mple can be thoroughly investigated in the exponential case. In particular, the asymptotic consistency, that is the desirable property that θ̂ tends to the real parameter θ₀ when data are actually drawn from some stationary⁷ distribution P_X^{θ₀} and the "size" of data increases to infinity, has been established under rather general conditions (see [12] for instance).

⁷ i.e., graph structure is regular, and off-border variables have the same conditional distributions.

In the case of stationary Gibbs distributions on regular lattices, with small finite state space Λ, another useful approach to parameter estimation has been developed in a slightly different perspective. The idea of this alternative technique is to tune parameters such that the site-wise conditional distributions arising from P_X^θ fit at best those empirically estimated [15]. One has first to determine on which information of the neighborhood configuration x_{n(i)} the local distribution P^θ_{X_i|X_{n(i)}}(·|x_{n(i)}) really depends. For instance, in an isotropic Ising model, only n(x_{n(i)}) is relevant. The neighborhood configuration set Γ = Λ^{|n(i)|} is partitioned accordingly:

    Γ = ∪_λ Γ_λ,  with P^θ_{X_i|X_{n(i)}}(·|x_{n(i)}) = p_λ^θ(·) if x_{n(i)} ∈ Γ_λ.

For each type of neighborhood configuration, the conditional distribution is then empirically estimated based on x^o as⁸

    p̂_λ(μ) = #{i ∈ S : x_i^o = μ and x^o_{n(i)} ∈ Γ_λ} / #{i ∈ S : x^o_{n(i)} ∈ Γ_λ}.

Then one tries to make p̂_λ and p_λ^θ as close as possible, for all λ. One way to proceed consists in solving the simultaneous equations

    ln [p̂_λ(μ)/p̂_λ(μ′)] = ln [p_λ^θ(μ)/p_λ^θ(μ′)],  ∀{μ, μ′}, ∀λ.

When the energy is linear w.r.t. the parameters, one ends up with an over-determined linear system of equations which is solved in the least-squares sense. The asymptotic consistency of the resulting estimator has been established under certain conditions [22].

Example 7: empirical parameter estimation of Ising model. For each possible value n of n(x_{n(i)}), the local conditional distribution is p_n(μ) = exp{βμn} / (2 cosh βn). The system to be solved is then:

    2βn = ln [p_n(+1)/p_n(−1)] = ln [#{i ∈ S : x_i^o = +1 and n(x^o_{n(i)}) = n} / #{i ∈ S : x_i^o = −1 and n(x^o_{n(i)}) = n}],  ∀n,

yielding the least-squares solution β̂ = ∑_n n g_n / (2 ∑_n n²), where g_n denotes the left hand side of the linear equation above. □
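The empirical estimator of Example 7 reduces to counting. A Python sketch (free boundaries; classes with an empty count are skipped, a practical detail the least-squares fit needs; function name and layout are ours):

```python
import math

def ising_beta_empirical(x):
    """Least-squares empirical estimate of beta for an Ising field on an
    H x W lattice: for each neighbor sum n, compare counts of +1 and -1
    sites, then solve 2*beta*n = g_n in the least-squares sense."""
    H, W = len(x), len(x[0])
    counts = {}  # n -> [#sites with x_i = +1, #sites with x_i = -1]
    for i in range(H):
        for j in range(W):
            n = sum(x[a][b] for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                    if 0 <= a < H and 0 <= b < W)
            c = counts.setdefault(n, [0, 0])
            c[0 if x[i][j] == 1 else 1] += 1
    num = den = 0.0
    for n, (plus, minus) in counts.items():
        if n == 0 or plus == 0 or minus == 0:
            continue  # g_n undefined for this class: skip it
        g = math.log(plus / minus)
        num += n * g          # accumulate n * g_n
        den += 2 * n * n      # accumulate 2 * n^2
    return num / den if den else 0.0
```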

5. Inverse problems and Bayesian estimation

A learned Gibbs distribution might be used to identify pieces of texture present in an image, provided they are pointed out in some way. However, if one wants to address the joint issue of separating and recognizing these different pieces of texture, another question has to be dealt with at the same time as recognition, namely "where are the pieces of texture to be recognized?". This classification problem is typical of so-called inverse problems: based on a luminance image, a hidden underlying partition is searched.

In inverse problems from early vision, one tries to recover a large number of unknown or hidden variables based on the knowledge of another bunch of variables: given a set of data y = (y_j)_{j=1}^q, which are either plain luminance image(s) or quantities extracted beforehand from image(s) such as discontinuities, hidden variables of interest x = (x_i)_{i=1}^n are sought (Figure 5).

⁸ The model being stationary over the lattice, empirical estimation makes sense for large enough configuration x^o.

Figure 5. Typical inverse problem: data might be a degraded image and/or discontinuities of this image; variables to be estimated might constitute a restored version of the original image, or a partition of the image into meaningful regions.

Vectors x and y can be of the same nature (two images in restoration problems) or of completely different nature (a labeling in terms of region numbers and an image in segmentation and classification problems; a vector field and a couple of images in disparity or displacement estimation). Also, components of one of the vectors can be of various natures, like restored luminance and binary indicators of discontinuities on the edge lattice in case of image restoration with preservation of abrupt changes. For sake of concision, this latter possibility is not integrated in the following notations: Λ (resp. Λ^o) will indistinctly denote the state spaces of all the x_i's (resp. y_j's).

Stated in probabilistic terms, the problem consists in inferring the "best" occurrence of the random vector X in Ω = Λ^n given a realization y ∈ (Λ^o)^q of the random vector Y. The key ingredient will be the conditional distribution P_{X|Y}(·|y), referred to as the posterior distribution. This distribution can be specified at once, or according to a two-step Bayesian modeling combining knowledge about how data could be explained from hidden variables, and expected (or at least desirable) statistical characteristics of the variables of interest [37].

Let us make more precise this standard Bayesian construction process. The first step amounts to modeling Y as a random function of X, Y = F(X, W), where W is a non-observed random vector. F captures the process yielding observed data from underlying attributes. The simplest case is met in the restoration of an image corrupted with additive noise: Y = X + W. Equivalently, one specifies in some way the conditional distribution of data given X, P^{θ_ℓ}_{Y|X}, which depends on some parameter vector θ_ℓ. This is the conditional data likelihood. With the goal of getting back to x from given y, one may simply try to invert this modeling as X = f(Y, W). The maximum likelihood estimate, which corresponds to the configuration equipping observed data with highest probability, is then obtained by setting W to its most probable occurrence (null in general). Unfortunately such a method is, in general, either intractable (F(x, w) is a many-to-one mapping for given w) or simply not sensible (e.g., for additive noise, maximum likelihood restoration provides the observed image itself as a result!). The inverted model might also happen to be far too sensitive (non-differentiable), yielding completely different estimates for slightly different input data.

A Bayesian approach allows one to fix these problems through the specification of a prior distribution P^{θ_p}_X(x) for X. Often, the prior knowledge captured by this distribution is loose and generic, merely dealing with the regularity of desired estimates (it is then related to Tikhonov regularization [38]). The prior might however be far more specific about the class of acceptable configurations (in that case Ω might even be restricted to some parametric subspace) and their respective probabilities of occurrence. Except in extreme cases where all prior knowledge has been put into the definition of a reduced configuration set equipped with a uniform prior, the prior distribution is chosen as an interacting Gibbs distribution over Ω.

from previous ingredients. Bayes' rule provides:





p

`

P P

X

Y jX





p

 

`

/ P P = P = P ;

XY

X jY

X

Y jX

P

Y

with  =( ; ). The estimation problem can now b e de ned in terms of the

p `

p osterior distribution.

A natural way to proceed is to look for the most probable configuration x given the data:

    x̂ = argmax_{x∈Ω} P_{X|Y}(x|y).

This constitutes the maximum a posteriori (map) estimator. Although it is often considered as a "brutal" estimator [32], it remains the most popular, for it is simply connected to the posterior distribution and the associated energy function. One has simply to devise the energy function (as a sum of local functions) of x and y, and the estimation goal is fixed at once, as a global minimizer in x of this energy. This means in particular that here, no probabilistic point of view is finally required. Numerous energy-based approaches to inverse problems have thus been proposed as an alternative to the statistical framework. A very active class of such deterministic approaches is based on continuous functionals, variational methods, and pdes⁹.

⁹ As a consequence, one might legitimately wonder why one should bother with a probabilistic formulation. Elements in favor of the probabilistic point of view rely upon the variety of tools it offers. As partly explained in this paper, statistical tools allow one to learn parameters, to generate "typical" instances, to infer unknown variables in different ways (not only the one using energy minimization), to assess estimation uncertainty, or to capture and combine all sorts of priors within the Bayesian machinery. On the other hand, continuous (deterministic) approaches allow one to derive theoretical properties of the models under concern, in a fashion that is beyond reach of most discrete approaches, whether they are stochastic or not. Both points of view are therefore complementary. Besides, it is common that they eventually yield similar discrete implementations.

To circumvent the crude globality of the map estimator, a more local estimator is also widely employed. It associates to each site the value of $X_i$ that is the most probable given all the data:

$$\hat{x}_i = \arg\max_{\lambda\in\Lambda} P_{X_i|Y}(\lambda|y).$$

It is referred to as the "marginal posterior mode" (mpm) estimator [32]. It relies on site-wise posterior marginals $P_{X_i|Y}$ which have to be derived, or approximated, from the global posterior distribution $P_{X|Y}$. Note that for a Gaussian posterior distribution, the map and mpm estimators coincide with the posterior expectation, whose determination amounts to solving a linear system of equations.

Both Bayesian estimators have now to be examined in the special case of the factorized distributions under concern. The prior model is captured by a Gibbs distribution $P^{\theta_p}_X(x) \propto \exp\{-\sum_{c\in\mathcal{C}^p} V^{\theta_p}_c(x_c)\}$ specified on some prior graph $[S,E^p]$ through potentials $\{V^{\theta_p}_c,\ c\in\mathcal{C}^p\}$, and a data model is often chosen of the following form:

$$P^{\theta_\ell}_{Y|X}(y|x) \propto \exp\{-\sum_{j\in R} V^{\theta_\ell}_j(x_{d_j},y_j)\},$$

where $R=\{1,\ldots,q\}$ is the data site set and $\{d_j,\ j\in R\}$ is a set of small site subsets of $S$ (Figure 6.a). Note that this data model specification should be such that the normalizing constant, if unknown, is independent of $x$ (otherwise the posterior distribution of $x$ will be incompletely defined; see footnote 10).

The resulting posterior distribution is a Gibbs distribution parameterized by $\theta$ and $y$:¹⁰

$$P^{\theta}_{X|Y}(x|y) = \frac{1}{Z(\theta,y)}\,\exp\Big\{-\underbrace{\Big[\sum_{c\in\mathcal{C}^p} V^{\theta_p}_c(x_c) + \sum_{j\in R} V^{\theta_\ell}_j(x_{d_j},y_j)\Big]}_{U(x,y)}\Big\}.$$

In the associated posterior independence graph $[S,E]$, any two sites $\{i,k\}$ can be neighbors either through a $V^{\theta_p}_c$ function (i.e., $\{i,k\}\in E^p$), or through a function $V_j$ (i.e., $\exists j\in R: \{i,k\}\subset d_j$). This second type of neighborhood relationship thus occurs between components of $X$ that both participate in the "formation" of a same component of $Y$ (Figure 6.a). The neighborhood system of the posterior model is then at least as big as the one of the prior model (see example in Figure 6.b). In site-wise measurement cases (i.e., $R\equiv S$, and, $\forall i$, $d_i=\{i\}$), the two graphs are identical.

¹⁰ Many studies actually start by the specification of some energy function $U(x,y)=\sum_c V_c(x_c)+\sum_j V_j(x_{d_j},y_j)$. Then, unless $\sum_y \exp\{-\sum_j V_j(x_{d_j},y_j)\}$ is independent of $x$, the prior deduced from $P_{XY}\propto\exp\{-U\}$ is not $P_X\propto\exp\{-\sum_c V_c\}$.

Figure 6. (a) Ingredients of the data model and the neighboring relationship they induce on $x$ components; (b) posterior neighborhood on the lattice induced by a first-order prior neighborhood system and symmetric five-site subsets $d_j$ centered at $j$.

Example 8: luminance-based classification. The luminance of the observed image being seen as continuous ($\Lambda^o=\mathbb{R}$), one tries to partition the pixel set $S$ into $M$ classes ($\Lambda=\{1,\ldots,M\}$ and $R\equiv S$) associated to previously learned means and variances $\theta_\ell=(\mu_\lambda,\sigma^2_\lambda)_{\lambda=1}^M$. A simple point-wise measurement model assuming $(Y_i\,|\,X_i=\lambda)\sim\mathcal{N}(\mu_\lambda,\sigma^2_\lambda)$ results in

$$P_{Y|X}(y|x) = \exp\Big\{-\sum_i \underbrace{\Big[\frac{(y_i-\mu_{x_i})^2}{2\sigma^2_{x_i}} + \frac{1}{2}\ln(2\pi\sigma^2_{x_i})\Big]}_{V_i(x_i,y_i)}\Big\}.$$

A standard prior is furnished by the Potts model

$$P^{\theta_p}_X(x) \propto \exp\Big\{\beta \sum_{\langle i,j\rangle} [2\delta(x_i,x_j)-1]\Big\}$$

with respect to some prior graph structure ($\theta_p=\beta$). □
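The Potts prior above is simple to evaluate; the following sketch computes its unnormalized log-probability over an explicit edge list (the function name and data layout are illustrative, not from the paper):

```python
def potts_log_prior(x, edges, beta):
    """Unnormalized log-prior of the Potts model:
    beta * sum over pairs <i,j> of (2*delta(x_i, x_j) - 1)."""
    return beta * sum(2 * (x[i] == x[j]) - 1 for i, j in edges)

# On a 3-site chain, a homogeneous labeling scores higher than a broken one:
print(potts_log_prior([0, 0, 0], [(0, 1), (1, 2)], beta=2.0))  # 4.0
print(potts_log_prior([0, 0, 1], [(0, 1), (1, 2)], beta=2.0))  # 0.0
```

The edge list plays the role of the pair cliques $\langle i,j\rangle$ of the prior graph.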

Example 9: Gaussian deblurring. The aim is to recover an image from a filtered and noisy observed version, within a continuous setting ($R\equiv S$, $\Lambda=\Lambda^o=\mathbb{R}$). The data model is $Y=BX+W$ with $B$ a sparse matrix associated with a known convolution kernel, and $W$ a Gaussian white noise with known variance $\sigma^2_W$. Using the stationary isotropic Gaussian smoothing prior from example 6, one gets the posterior model:

$$P_{X|Y}(x|y) \propto \exp\Big\{-\alpha\sum_{\langle i,j\rangle}(x_i-x_j)^2 - \varepsilon\sum_i x_i^2 - \sum_j \underbrace{\frac{(y_j-\sum_{i\in d_j} b_{ji}x_i)^2}{2\sigma^2_W}}_{V_j(x_{d_j},y_j)}\Big\},$$

where $d_j=\{i\in S: b_{ji}\neq 0\}$ corresponds to the filter support window centered at $j$. The graph structure of the posterior model is then usually larger than the prior structure (Figure 6.b). The posterior energy in matrix form is $U(x,y)=\alpha\|Dx\|^2+\varepsilon\|x\|^2+\frac{1}{2\sigma^2_W}\|y-Bx\|^2$, where $D$, with $Dx=(x_i-x_j)_{\langle i,j\rangle\in E^p}$, defines a many-to-one operator. The map estimate is then the solution of the linear system

$$\Big(\alpha D^TD + \varepsilon\,\mathrm{Id} + \frac{1}{2\sigma^2_W}B^TB\Big)x = \frac{1}{2\sigma^2_W}B^Ty. \qquad \Box$$

The dimensionality of the models under concern makes in general intractable the direct and exact derivation of both Bayesian estimators: indeed, map requires a global minimization over $\Omega$, while mpm is based on marginal computations (i.e., summing out $x_{S\setminus i}$ in $P_{X|Y}$, for each $i$). Again, iterative procedures based on local moves have to be devised.

The marginal computation required by mpm is precisely the kind of task that can be approximately performed using sampling. As described in §3, a long sequence of configurations $x^0,\ldots,x^{m_1}$ can be iteratively generated such that beyond some rank $m_0$ they can be considered as sampled from the Gibbs distribution $P_{X|Y}(\cdot|y)$. The ergodicity relation applied to the function $f(X)=\delta(X_i,\lambda)$ for some $i\in S$ and $\lambda\in\Lambda$ yields

$$P_{X_i|Y}(\lambda|y) = E[\delta(X_i,\lambda)\,|\,Y=y] \approx \frac{\#\{m\in[m_0+1,m_1]: x_i^m=\lambda\}}{m_1-m_0},$$

that is, the posterior marginal is approximated by the empirical frequency of appearance. The mpm estimator is then replaced by $\hat{x}_i = \arg\max_\lambda \#\{m\in[m_0+1,m_1]: x_i^m=\lambda\}$.
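This empirical-frequency rule is straightforward to implement once the chain of configurations is stored; a minimal sketch (configurations kept as plain label lists, names illustrative):

```python
from collections import Counter

def mpm_estimate(samples, m0):
    """Approximate the mpm estimate from mcmc samples x^{m0+1}..x^{m1}.

    `samples` is a list of configurations, each a list of labels;
    the first m0 configurations are discarded as burn-in."""
    kept = samples[m0:]
    n_sites = len(kept[0])
    estimate = []
    for i in range(n_sites):
        counts = Counter(x[i] for x in kept)   # empirical marginal at site i
        estimate.append(counts.most_common(1)[0][0])
    return estimate

# Hypothetical chain of 3-site label configurations, one burn-in sample:
chain = [[0, 1, 1], [0, 1, 0], [1, 1, 0], [0, 1, 0], [0, 2, 0]]
print(mpm_estimate(chain, m0=1))  # [0, 1, 0]
```

Each site is treated independently, which is exactly what the site-wise marginal maximization prescribes.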

Iterative search of the map estimate can be either stochastic or deterministic. In the first case, one makes use of so-called simulated annealing, which relies on a clever use of mcmc sampling techniques [18]. The idea consists in sampling from the Gibbs distribution with energy $\frac{U(x,y)}{T}$, where $T$ is a "temperature" parameter which slowly decreases to 0. The cooling schedules insuring theoretical convergence to a global minimizer¹¹ unfortunately result in impractically long procedures. Deterministic counterparts are therefore often preferred, although they usually require a sensible initialization not to get stuck in too poor local minima. With a continuous state space, all gradient descent techniques or iterative system solving methods can be used. With both discrete and continuous state spaces, the simple "iterated conditional modes" (icm) method [4] can be used: the component at the current site $i$ is updated so as to minimize the energy. In the point-wise measurement case, this yields:

$$x_i^{m+1} = \arg\min_{\lambda\in\Lambda} \sum_{c\in\mathcal{C}^p: i\in c} V_c[(\lambda, x^m_{c\setminus i})] + V_i(\lambda,y_i),$$

where $x^m$ is the current configuration.
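For the classification model of example 8, this site-wise minimization can be sketched as follows (toy 1-D chain with illustrative names and values; the local energy combines $\ln\sigma_\lambda + (y_i-\mu_\lambda)^2/(2\sigma^2_\lambda)$ with the Potts reward for agreeing neighbors):

```python
import math

def icm_sweep(x, y, neighbors, means, sigmas, beta):
    """One icm sweep: at each site pick the label minimizing
    ln(sigma) + (y_i - mu)^2 / (2 sigma^2) - 2 * beta * #(equal neighbors).
    All names are illustrative; neighbors[i] lists the neighbor indices of i."""
    for i in range(len(x)):
        def local_energy(lab):
            data = math.log(sigmas[lab]) + (y[i] - means[lab]) ** 2 / (2 * sigmas[lab] ** 2)
            prior = -2 * beta * sum(1 for j in neighbors[i] if x[j] == lab)
            return data + prior
        x[i] = min(range(len(means)), key=local_energy)
    return x

# Toy 1-D chain, two classes with means 0 and 1:
y = [0.1, 0.0, 0.9, 1.1, 1.0]
nbrs = [[1], [0, 2], [1, 3], [2, 4], [3]]
print(icm_sweep([0] * 5, y, nbrs, means=[0.0, 1.0], sigmas=[0.3, 0.3], beta=0.5))  # [0, 0, 1, 1, 1]
```

Sites are updated in place, so later sites in the sweep already see the new labels of earlier ones, as in the text.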

Example 10: using the model from example 8. The posterior distribution at site $i$ is

$$P_{X_i|X_{n(i)},Y_i}(\lambda|x_{n(i)},y_i) \propto \exp\Big\{\beta\sum_{j\in n(i)}[2\delta(\lambda,x_j)-1] - \frac{(y_i-\mu_\lambda)^2}{2\sigma^2_\lambda} - \ln\sigma_\lambda\Big\}.$$

The mpm estimate is approximated using a Gibbs sampler based on these distributions, simulated annealing also iteratively samples from these distributions but with the energy scaled by the temperature $T_m$, and icm updates the current $x_i$ by setting it to $\arg\min_\lambda [\ln\sigma_\lambda + \frac{(y_i-\mu_\lambda)^2}{2\sigma^2_\lambda} - 2\beta\sum_{j\in n(i)}\delta(\lambda,x_j)]$. □

¹¹ If $T_m \geq \frac{C}{\ln m}$ for a large enough constant $C$, then $\lim_{m\to+\infty}\Pr\{X^m \in \arg\min_\Omega U\} = 1$ [18].

Example 11: using the Gaussian model from example 9. Letting $\varepsilon=0$, the icm update at the current site $i$ picks the mean of the site-wise posterior conditional distribution:

$$x_i^{m+1} = \Big(16\alpha\sigma^2_W + \sum_j b_{ji}^2\Big)^{-1}\Big[2\alpha\sigma^2_W \sum_{j\in n(i)} x_j^m + \sum_j b_{ji}\big(y_j - \sum_{k\neq i} b_{jk}x_k^m\big)\Big].$$

This exactly corresponds to the Gauss-Seidel iteration applied to the linear system of which the map estimate is the solution. □
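For reference, the Gauss-Seidel iteration mentioned here can be sketched generically (dense row lists, illustrative values; each coordinate update is the exact minimizer of the quadratic energy in that coordinate, just as each icm step is):

```python
def gauss_seidel(A, b, x, sweeps=50):
    """In-place Gauss-Seidel sweeps for the linear system A x = b."""
    n = len(b)
    for _ in range(sweeps):
        for i in range(n):
            s = sum(A[i][k] * x[k] for k in range(n) if k != i)
            x[i] = (b[i] - s) / A[i][i]
    return x

# Small diagonally dominant system (illustrative):
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [1.0, 2.0, 3.0]
x = gauss_seidel(A, b, [0.0, 0.0, 0.0])
```

In the deblurring setting, the rows of $A$ would be the sparse rows of $\alpha D^TD + \frac{1}{2\sigma_W^2}B^TB$, so each sweep touches only the local neighborhoods and filter supports.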

6. Parameter estimation with incomplete data

At last, we deal with the estimation of parameters within inverse problem modeling. Compared with the setting from §4, the issue is dramatically more complex since only part of the variables (namely $y$) of the distribution $P^\theta_{XY}$ to be tuned are available. One talks about an incomplete data case, or partially observed model. This complicated issue arises when it comes to designing systems able to automatically adapt their underlying posterior distribution to various input data. It is for instance at the heart of unsupervised classification problems.

A pragmatic technique is to cope simultaneously with the estimation of $x$ and the estimation of $\theta$ within an alternate procedure: for a given $\theta$, infer $x$ according to the chosen Bayesian estimator; given the current estimate of $x$, learn the parameters from the pair $(x,y)$, as in the complete data case.

A more satisfactory and sounder idea consists in extending likelihood ideas from the complete data case.¹² The aim is then to maximize the data likelihood $L(\theta)=\ln P^\theta_Y(y)=\ln\sum_x P^\theta_{XY}(x,y)$, whose derivation is intractable. However, its gradient $\nabla L(\theta) = E_\theta[\nabla_\theta \ln P^\theta_{XY}(X,y)\,|\,Y=y]$ suggests an iterative process where the expectation is taken with respect to the current fit $\theta^{(k)}$ and the resulting expression is set to zero. This amounts to maximizing the conditional expectation $E_{\theta^{(k)}}[\ln P^\theta_{XY}(X,y)\,|\,Y=y]$ with respect to $\theta$. The maximizer is the new parameter fit $\theta^{(k+1)}$. The resulting procedure is the Expectation-Maximization algorithm (em), which was introduced in the different context of mixtures of laws [14]. It can be shown that the associated sequence of likelihoods $\{L(\theta^{(k)})\}$ is indeed nondecreasing. However, the application of the plain em algorithm involves twofold difficulties with high-dimensional Gibbs distributions: (1) the joint distribution $P^\theta_{XY}$ is usually known only up to its partition function $Z(\theta)$ (the indetermination usually comes from that of the prior partition function); (2) the computation of the conditional expectation is usually intractable. As in the complete data case, the first problem can be circumvented either by mcmc techniques [42] or by considering the pseudo-likelihood, i.e., replacing $P_{XY}$ by $P_{Y|X}\prod_i P_{X_i|X_{n(i)}}$. Concerning the second point, mcmc averaging allows one to approximate the expectations based on samples $x^{m_0},\ldots,x^{m_1}$ drawn from the conditional distribution $P^{\theta^{(k)}}_{X|Y}$ [10]. Note that in the incomplete data case, both mle and mple remain asymptotically consistent in general [13].

¹² For exponential families, gradient ascent techniques have in particular been extended to the partially observed case [1,42].

Distinguishing the prior parameters $\theta_p$ from the data model parameters $\theta_\ell$, mple with the em procedure and mcmc expectation approximations yields the following update:

$$\theta_p^{(k+1)} = \arg\min_{\theta_p} \frac{1}{m_1-m_0}\sum_m \sum_i \Big[\ln Z_i(\theta_p; x^m_{n(i)}) + \sum_{c\ni i} V_c^{\theta_p}(x^m_c)\Big],$$

$$\theta_\ell^{(k+1)} = \arg\min_{\theta_\ell} \frac{1}{m_1-m_0}\sum_m \sum_j V_j^{\theta_\ell}(x^m_{d_j}, y_j),$$

when $V_j = -\ln P_{Y_j|X_{d_j}}$. It has also been suggested to compute first maximum (pseudo-)likelihood estimators on each sample, and then to average the results.

The resulting update scheme is similar to the previous one, with the maximization and the summation w.r.t. samples being switched. With this method of averaging complete-data-based estimators computed on samples drawn from $P^{\theta^{(k)}}_{X|Y}$, any other estimator might be used as well. In particular, in case of a reduced finite state space $\Lambda$, the empirical estimator presented in §4 can replace the mple estimator for the prior parameters. This is the iterative conditional estimation (ice) method [34].

It must be said that parameter estimation of partially observed models remains, in a Markovian context, a very tricky issue due to the huge computational load of the techniques sketched here, and to convergence problems toward local minima of low quality. Besides, some theoretical aspects (e.g., asymptotic normality) still constitute open questions.

Example 12: ice for the classification model from example 8. For a given parameter fit $\theta^{(k)}$, Gibbs sampling using the conditional distributions

$$P^{\theta^{(k)}}_{X_i|X_{n(i)},Y_i}(\lambda|x_{n(i)},y_i) \propto \exp\Big\{\beta^{(k)}\sum_{j\in n(i)}[2\delta(\lambda,x_j)-1] - \frac{(y_i-\mu^{(k)}_\lambda)^2}{2(\sigma^{(k)}_\lambda)^2} - \ln\sigma^{(k)}_\lambda\Big\}$$

provides samples $x^0,\ldots,x^{m_1}$. mles of the data model parameters are computed from the $(x^m,y)$'s and averaged:

$$\mu_\lambda^{(k+1)} = \frac{1}{m_1-m_0}\sum_{m=m_0+1}^{m_1} \frac{\sum_{i:x_i^m=\lambda} y_i}{\#\{i: x_i^m=\lambda\}},$$

$$(\sigma^2_\lambda)^{(k+1)} = \frac{1}{m_1-m_0}\sum_{m=m_0+1}^{m_1} \frac{\sum_{i:x_i^m=\lambda} (y_i-\mu_\lambda^{(k+1)})^2}{\#\{i: x_i^m=\lambda\}}.$$

As for the prior parameter $\beta$, empirical estimators can be used for small $M$. They exploit the fact that the prior local conditional distribution $P_{X_i|X_{n(i)}}(\cdot|x_{n(i)})$ depends only on the composition $(n_1,\ldots,n_M)$ of the neighborhood, where, $\forall x_{n(i)}\in\Lambda^{n(i)}$, $n_\lambda = \#\{j\in n(i): x_j=\lambda\}$, and

$$P_{X_i|X_{n(i)}}(\lambda|x_{n(i)}) \propto \exp\{\beta(2n_\lambda - |n(i)|)\}. \qquad \Box$$
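The averaging of complete-data mles over the Gibbs samples can be sketched as follows for one class (per-sample estimates are averaged, as in the averaging variant of §6; all names and values are illustrative):

```python
def ice_gaussian_update(samples, y, label):
    """Average, over sampled labelings x^m, the per-sample MLEs of
    (mu, sigma^2) for one class (sketch of the averaging variant)."""
    mus, variances = [], []
    for x in samples:
        idx = [i for i, lab in enumerate(x) if lab == label]
        if not idx:
            continue  # class absent from this sample
        mu = sum(y[i] for i in idx) / len(idx)
        mus.append(mu)
        variances.append(sum((y[i] - mu) ** 2 for i in idx) / len(idx))
    return sum(mus) / len(mus), sum(variances) / len(variances)

samples = [[0, 0, 1, 1], [0, 1, 1, 1]]  # two hypothetical Gibbs samples
y = [0.0, 0.2, 1.0, 1.2]
mu0, var0 = ice_gaussian_update(samples, y, label=0)
```

Note that the paper's displayed update instead reuses the averaged mean $\mu_\lambda^{(k+1)}$ inside the variance estimate; the two variants coincide when the class means are stable across samples.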

7. Some research directions

Within the domain of mrfs applied to image analysis, two particular areas have been exhibiting a remarkable vitality for the past few years. Both tend to expand the modeling capabilities and the inference facilities of standard mrf-based models. The first one concerns the addition of a new set of so-called auxiliary variables to a given model $P_X$, whereas the second one deals with the definition of hierarchical models.

7.1. Auxiliary variables and augmented models

Auxiliary variable-based methods first appeared in statistical physics as a way to accelerate mcmc sampling. To this end, an augmented model $P_{XD}$ is considered, where $D$ is the set of auxiliary variables, such that (i) the marginal of $X$ arising from $P_{XD}$ coincides with the pre-defined $P_X$: $\sum_d P_{XD}(\cdot,d)=P_X(\cdot)$; (ii) sampling from $P_{XD}$ can be done in a more efficient way than from $P_X$, by alternately sampling from $P_{D|X}$ and $P_{X|D}$. The augmented model is usually specified through $P_{D|X}$. The resulting joint model $P_{XD}=P_X P_{D|X}$ then obviously verifies (i). Alternate sampling then first requires to derive $P_{X|D}$ from the joint distribution.

For the Ising and Potts models, the introduction of binary bond variables $D=\{D_{ij},\ \langle i,j\rangle\in\mathcal{C}\}$ within the Swendsen-Wang algorithm [36] has been extremely fruitful. The original prior model is augmented through the specification

$$P_{D|X} = \prod_{\langle i,j\rangle} P_{D_{ij}|X_i,X_j}, \quad\text{with}\quad P_{D_{ij}|X_i,X_j}(0|x_i,x_j) = \begin{cases} 1 & \text{if } x_i\neq x_j,\\ \exp(-2\beta) & \text{otherwise.} \end{cases}$$

$P_{D|X}$ is trivial to sample from. The nice fact is that the resulting distribution $P_{X|D}$ specifies that the sites of connected components defined on $S$ by 1-valued bonds must have the same label, and that each of these clusters can get one of the possible states from $\Lambda$ with equal probability. As a consequence, sampling from $P_{X|D}$ is very simple and introduces long-range interactions by simultaneously updating all $x_i$'s of a possibly large cluster.
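A single Swendsen-Wang sweep can be sketched with a union-find over an explicit edge list; the bond probability $1-\exp(-2\beta)$ corresponds to the Potts prior written with coupling $\beta$ as above (all names are illustrative):

```python
import math, random

def swendsen_wang_sweep(x, edges, beta, n_labels, rng=random):
    """One Swendsen-Wang sweep (sketch): sample bonds given x, then relabel
    each bond-connected cluster uniformly over the n_labels states."""
    n = len(x)
    parent = list(range(n))  # union-find structure for the clusters

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    p_bond = 1.0 - math.exp(-2.0 * beta)
    for i, j in edges:
        # d_ij = 0 whenever x_i != x_j; otherwise d_ij = 1 w.p. 1 - exp(-2 beta)
        if x[i] == x[j] and rng.random() < p_bond:
            parent[find(i)] = find(j)

    new_label = {}
    for i in range(n):
        r = find(i)
        if r not in new_label:  # one uniform label per cluster
            new_label[r] = rng.randrange(n_labels)
        x[i] = new_label[r]
    return x

random.seed(0)
x = swendsen_wang_sweep([0, 0, 1, 1], [(0, 1), (1, 2), (2, 3)], beta=50.0, n_labels=3)
# bonds never form across the 0/1 label break, so sites {0,1} and {2,3} move as blocks
```

At large $\beta$ the bonds inside homogeneous regions are almost surely on, so whole regions are flipped at once, which is exactly the long-range effect described above.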

In a different prospect, auxiliary variables have been used in order to introduce meaningful non-linearities within quadratic models. Despite their practical convenience, quadratic potentials like those in Gaussian mrfs are often far too "rigid": they discourage too drastically the deviations from mean configurations. Geman and Geman thus introduced a binary line process to allow an adaptive detection and preservation of discontinuities within their Markovian restoration model [18]. In the simplest version, the augmented smoothing prior is $P_{XD}(x,d)\propto\exp\{-\alpha\sum_{\langle i,j\rangle} d_{ij}[(x_i-x_j)^2-\mu]\}$, which favors discontinuity appearing ($d_{ij}=0$, and thus a suspension of the smoothing between $x_i$ and $x_j$) as soon as the difference $(x_i-x_j)^2$ exceeds some threshold $\mu$. More precisely, in case of map estimation (with a data model such that $P_{Y|XD}=P_{Y|X}$), it is readily established that

$$\min_{x,d}\Big\{\alpha\sum_{\langle i,j\rangle} d_{ij}[(x_i-x_j)^2-\mu] - \ln P_{Y|X}(y|x)\Big\} = \min_{x}\Big\{\alpha\sum_{\langle i,j\rangle} \min[(x_i-x_j)^2-\mu,\,0] - \ln P_{Y|X}(y|x)\Big\},$$

where the truncated quadratic potentials appear after the optimal elimination of the binary $d_{ij}$'s [7]. In view of that, the model associated with the energy on the right-hand side could as well be used, without talking about auxiliary variables. However, the binary variables capture in that case the notion of discontinuity, and the Bayesian machinery allows one to add a prior layer on them about likely and unlikely spatial contour configurations [18].

Starting from the idea of a bounded smoothing potential that appeared in the minimization rewriting above, differentiable potentials of the same type, but with improved flexibility and numerical properties, have been introduced. Stemming from statistics, where they allow the robust fitting of parametric models [27], such cost functions penalize residuals less drastically than quadratics do: they are even functions $\rho(u)$, increasing on $\mathbb{R}^+$ and with a derivative negligible at infinity compared with that of $u^2$ ($\lim_{u\to+\infty}\frac{\rho'(u)}{2u}=0$). It turns out that if $\phi(u)=\rho(\sqrt{u})$ defined on $\mathbb{R}^+$ is concave, $\rho(u)$ is the inferior envelope of a family of parabolas $\{zu^2+\psi(z),\ z\in[0,M]\}$ continuously indexed by the variable $z$, where $M=\lim_{u\to 0^+}\frac{\rho'(u)}{2u}$ [5,17]:

$$\rho(u) = \min_{z\in[0,M]} zu^2+\psi(z), \quad\text{with}\quad \arg\min_{z\in[0,M]} zu^2+\psi(z) = \frac{\rho'(u)}{2u}.$$

The function $\rho(u)=1-\exp(-u^2)$ is a common example of such a function. Note that the optimal auxiliary variable $\rho'(u)/(2u)=\phi'(u^2)$ decreases to zero as the residual $u$ goes to infinity. Defining a smoothing potential with such a function yields, from a map point of view, and concentrating only on the prior part:

$$\min_x \sum_{\langle i,j\rangle} \rho(x_i-x_j) = \min_{x,d} \sum_{\langle i,j\rangle} d_{ij}(x_i-x_j)^2 + \psi(d_{ij}).$$

A continuous line process $D$ is thus introduced. map estimation can be performed on the augmented model according to an alternate procedure: given $d$, the model is quadratic with respect to $x$; given $x$, the optimal $d$ is obtained in closed form as $\hat{d}_{ij}=\phi'[(x_i-x_j)^2]$. Where the model is Gaussian for fixed $d$, this minimization procedure amounts to iteratively reweighted least squares. Note that such non-quadratic cost functions can as well be used within the data model [5], to permit larger departures from this (usually crude) model.
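The alternate scheme can be sketched on a toy 1-D denoising energy with $\rho(u)=1-\exp(-u^2)$, for which $\phi'(u^2)=\exp(-u^2)$; the quadratic step is handled here by coordinate-descent sweeps rather than an exact solve (an illustrative sketch, not the paper's implementation):

```python
import math

def half_quadratic_denoise(y, alpha, sweeps=20):
    """Alternate minimization sketch for sum_i (x_i - y_i)^2 +
    alpha * sum_<i,i+1> rho(x_i - x_{i+1}) with rho(u) = 1 - exp(-u^2):
    given x, the optimal line variable is d = phi'((x_i - x_j)^2) = exp(-(x_i - x_j)^2);
    given d, the energy is quadratic in x (minimized here coordinate-wise)."""
    x = list(y)
    n = len(x)
    for _ in range(sweeps):
        # closed-form continuous line process given x
        d = [math.exp(-(x[i] - x[i + 1]) ** 2) for i in range(n - 1)]
        # one coordinate-descent sweep on the weighted quadratic energy
        for i in range(n):
            num, den = y[i], 1.0
            if i > 0:
                num += alpha * d[i - 1] * x[i - 1]
                den += alpha * d[i - 1]
            if i < n - 1:
                num += alpha * d[i] * x[i + 1]
                den += alpha * d[i]
            x[i] = num / den
    return x

x = half_quadratic_denoise([0.0, 0.1, 0.0, 5.0, 4.9, 5.0], alpha=1.0)
# the step between sites 2 and 3 is preserved while each plateau is smoothed
```

The edge weight $d$ collapses to (numerically) zero across the large jump, reproducing the discontinuity-preserving behavior of the binary line process in a differentiable form.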

It is only recently that a global picture of the link between robust estimators from statistics, line processes for discontinuity preservation, mean field approximation from statistical physics, the "graduated non-convexity" continuation method, and adaptive filtering stemming from anisotropic diffusion has been clearly established [5,6]. This new domain of research is now actively investigated by people with various backgrounds.

7.2. Hierarchical models

While permitting tractable single-step computations, the locality which is at the heart of Markovian modeling results in a very slow propagation of information. As a consequence, the iterative sampling and inference procedures reviewed in the previous sections may converge very slowly. This motivates the search either for improved algorithms, or for new models allowing non-iterative or more efficient manipulation.

So far, the most fruitful approaches in both cases have relied on some notion of hierarchy (see [21] for a recent review). Hierarchical models or algorithms allow one to integrate the information in a progressive and efficient way (especially in the case of multiresolution data, when images come in a hierarchy of scales). They thus provide gains both in terms of computational efficiency and of result quality, along with new modeling capabilities.

Algorithm-based hierarchical approaches are usually related to the well-known multigrid resolution techniques from numerical analysis [23] and statistical physics [20], where an increasing sequence of nested spaces $\Omega^L\subset\cdots\subset\Omega^1\subset\Omega^0=\Omega$ is explored in a number of possible ways. Configurations in the subset $\Omega^l$ are described by a reduced number of variables (the coordinates in a basis of the subspace in the linear case) according to some proper mapping $\Phi^l$, i.e., $\Omega^l=\mathrm{Im}\,\Phi^l$. Let $\tilde{\Omega}^l$ be the corresponding reduced configuration set from which $\Omega^l$ is defined. If, for instance, $\Omega^l$ is the set of configurations that are piecewise constant with respect to some partition of $S$, $x\in\Omega^l$ is associated to the reduced vector $x^l$ made up of the values attached by $x$ to each subset of the partition. As for map estimation, inference conducted in $\Omega^l$ yields:

$$\arg\max_{x\in\Omega^l} P_{X|Y}(x|y) = \Phi^l\big[\arg\min_{x^l\in\tilde{\Omega}^l} U(\Phi^l(x^l),y)\big].$$

The exploration of the different subsets is usually done in a coarse-to-fine fashion, especially in the case of discrete models [8,26]: the approximate map estimate $\hat{x}^{l+1}$ reached in $\Omega^{l+1}$ provides the initialization of the iterations in $\Omega^l$, through $(\Phi^l)^{-1}\circ\Phi^{l+1}$ (which exists due to the inclusion $\mathrm{Im}(\Phi^{l+1})\subset\mathrm{Im}(\Phi^l)$). This hierarchical technique has proven useful to accelerate deterministic minimization while improving the quality of the results [26].

Model-based hierarchical approaches, instead, aim at defining a new global hierarchical model $X=(X^0,X^1,\ldots,X^K)$ on $S=S^0\cup S^1\cup\cdots\cup S^K$, which has nothing to do with any original (spatial) model. This is usually done in a Markov chain-type causal way, specifying the "coarsest" prior $P_{X^0}$ and coarse-to-fine transition probabilities $P_{X^{k+1}|X^0\cdots X^k}=P_{X^{k+1}|X^k}$ as local factor products: $P_{X^{k+1}|X^k}=\prod_{i\in S^{k+1}} P_{X_i^{k+1}|X_{p(i)}^k}$, where each node $i\in S^{k+1}$ is bound to a parent node $p(i)\in S^k$. When $S^0$ reduces to a single site $r$, the independence graph corresponding to the Gibbs distribution $P_X = P_{X_r}\prod_{i\neq r} P_{X_i|X_{p(i)}}$ is a tree rooted at $r$. When the $X_i$'s at level $K$ are related to pixels, this hierarchical model usually lies on the nodes of a quad-tree whose leaves fit the pixel lattice [9,28,31].

Even if one is only interested in the last-level variables $X^K$, the whole structure has to be manipulated. But its peculiar tree nature allows, as in the case of Markov chains, the design of non-iterative map and mpm inference procedures made of two sweeps: a bottom-up pass propagating all the information to the root, and a top-down one which, in turn, allows one to get the optimal estimate at each node given all the data. In the case of Gaussian modeling, these procedures are identical to techniques from linear algebra for the direct factorization and solving of large sparse linear systems with a tree structure [24].
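The two-sweep idea can be sketched as a min-sum (Viterbi-like) recursion on a rooted tree, with local energy tables standing in for the negative log-probabilities (data structures and names are illustrative):

```python
def tree_map(root, children, node_cost, edge_cost, n_labels):
    """Two-sweep MAP on a rooted tree (sketch): bottom-up message passing to
    the root, then top-down readout of the optimizing labels.
    node_cost[i][a] and edge_cost[(parent, child)][a][b] are energy tables."""
    best = {}     # best[i][a]: minimal energy of i's subtree when i has label a
    argbest = {}  # argbest[(c, a)]: optimal label of child c given parent label a

    def up(i):
        cost = list(node_cost[i])
        for c in children.get(i, []):
            up(c)
            for a in range(n_labels):
                vals = [edge_cost[(i, c)][a][b] + best[c][b] for b in range(n_labels)]
                m = min(range(n_labels), key=vals.__getitem__)
                argbest[(c, a)] = m
                cost[a] += vals[m]
        best[i] = cost

    up(root)
    labels = {root: min(range(n_labels), key=best[root].__getitem__)}

    def down(i):
        for c in children.get(i, []):
            labels[c] = argbest[(c, labels[i])]
            down(c)

    down(root)
    return labels

# 3-node chain seen as a tree rooted at node 0, edges favoring equal labels:
eq = [[0.0, 1.0], [1.0, 0.0]]
labels = tree_map(0, {0: [1], 1: [2]}, [[0.0, 1.0], [0.5, 0.0], [1.0, 0.0]],
                  {(0, 1): eq, (1, 2): eq}, 2)
```

The upward pass plays the role of the bottom-up information propagation, and the downward readout delivers the exact global minimizer in a single pass, with no iteration.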

As for the sampling issue, drawing from a hierarchical prior is immediate, provided that one can sample $P_{X^0}$. Drawing from the posterior model $P_{X|Y}\propto P_X P_{Y|X}$ requires first to recover its causal factorization, i.e., to derive $P_{X_i|X_{p(i)},Y}$. This can be done in one single pass resorting to bottom-up marginalizations. Also note that the prior partition function is exactly known as a product of the normalizations of the local transition probabilities. This makes em-type procedures much lighter [9,28].

The computational and modeling appeal of these tree-based hierarchical approaches is, so far, moderated by the fact that their structure might appear quite artificial for certain types of problems or data: it is not shift-invariant with respect to the pixel lattice for instance, often resulting in visible non-stationarity in the estimates; also, the relevance of the inferred variables at the coarsest levels is not obvious (especially at the root). Much work remains ahead for extending the versatility of models based on this philosophy.

8. Last words

Within the limited space of this presentation, technical details and the most recent developments of the evoked issues were hardly touched, whereas some other issues were completely left out (e.g., approximation of multivariate distributions with causal models, advanced mcmc sampling techniques, learning of the independence structure and the potential form, etc.). A couple of recent books referenced hereafter propose more complete expositions with various emphases and gather collections of state-of-the-art applications of Gibbsian modeling, along with complete reference lists.

As a final word, it might be worth noting that mrfs do not constitute any longer an isolated class of approaches to image problems. Instead they now exhibit a growing number of stimulating connections with other active areas of computer vision research (pdes for image analysis, techniques from ai, neural networks, parallel computations, stochastic geometry, mathematical morphology, etc.), as highlighted by a number of recent "transversal" publications (e.g., [6,33]).

References

1. P.M. Almeida, B. Gidas (1993). A variational method for estimating the parameters of mrf from complete and noncomplete data. Ann. Applied Prob. 3(1), 103–136.
2. R.J. Baxter (1992). Exactly solved models in statistical mechanics. Academic Press, London.
3. J. Besag (1974). Spatial interaction and the statistical analysis of lattice systems. J. Royal Statist. Soc. B 36, 192–236.
4. J. Besag (1986). On the statistical analysis of dirty pictures. J. Royal Statist. Soc. B 48(3), 259–302.
5. M. Black, A. Rangarajan (1996). On the unification of line processes, outlier rejection, and robust statistics with applications in early vision. Int. J. Computer Vision 19(1), 75–104.
6. M. Black, G. Sapiro, D. Marimont, D. Heeger. Robust anisotropic diffusion. IEEE Trans. Image Processing. To appear.
7. A. Blake, A. Zisserman (1987). Visual reconstruction. The MIT Press, Cambridge.
8. C. Bouman, B. Liu (1991). Multiple resolution segmentation of textured images. IEEE Trans. Pattern Anal. Machine Intell. 13(2), 99–113.
9. C. Bouman, M. Shapiro (1994). A multiscale image model for Bayesian image segmentation. IEEE Trans. Image Processing 3(2), 162–177.
10. B. Chalmond (1989). An iterative Gibbsian technique for reconstruction of M-ary images. Pattern Recognition 22(6), 747–761.
11. R. Chellappa, A.K. Jain, editors (1993). Markov random fields, theory and applications. Academic Press, Boston.
12. F. Comets (1992). On consistency of a class of estimators for exponential families of Markov random fields on the lattice. Ann. Statist. 20, 455–486.
13. F. Comets, B. Gidas (1992). Parameter estimation for Gibbs distributions from partially observed data. Ann. Appl. Probab. 2, 142–170.
14. A. Dempster, N. Laird, D. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. Series B 39, 1–38. With discussion.
15. H. Derin, H. Elliot (1987). Modeling and segmentation of noisy and textured images using Gibbs random fields. IEEE Trans. Pattern Anal. Machine Intell. 9(1), 39–55.
16. X. Descombes, R. Morris, J. Zerubia, M. Berthod (1996). Estimation of Markov random field prior parameters using Markov chain Monte Carlo maximum likelihood. Technical Report 3015, Inria.
17. D. Geman, G. Reynolds (1992). Constrained restoration and the recovery of discontinuities. IEEE Trans. Pattern Anal. Machine Intell. 14(3), 367–383.
18. S. Geman, D. Geman (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell. 6(6), 721–741.
19. C. Geyer, E. Thompson (1992). Constrained Monte Carlo maximum likelihood for dependent data. J. Royal Statist. Soc. B 54(3), 657–699.
20. J. Goodman, A.D. Sokal (1989). Multigrid Monte Carlo method. Conceptual foundations. Phys. Rev. D 40(6), 2035–2071.
21. C. Graffigne, F. Heitz, P. Pérez, F. Prêteux, M. Sigelle, J. Zerubia. Hierarchical and statistical models applied to image analysis, a review. Submitted to IEEE Trans. Inf. Theory; available at ftp://gdr-isis.enst.fr/pub/publications/Rapports/GDR/it96.ps.
22. X. Guyon (1993). Champs aléatoires sur un réseau. Masson, Paris.
23. W. Hackbusch (1985). Multi-grid methods and applications. Springer-Verlag, Berlin.
24. W. Hackbusch (1994). Iterative solution of large sparse systems of equations. Springer-Verlag, New York.
25. J.M. Hammersley, D.C. Handscomb (1965). Monte Carlo methods. Methuen, London.
26. F. Heitz, P. Pérez, P. Bouthemy (1994). Multiscale minimization of global energy functions in some visual recovery problems. CVGIP: Image Understanding 59(1), 125–134.
27. P. Huber (1981). Robust statistics. John Wiley & Sons, New York.
28. J.-M. Laferté, F. Heitz, P. Pérez, E. Fabre (1995). Hierarchical statistical models for the fusion of multiresolution image data. In Proc. Int. Conf. Computer Vision, Cambridge, June.
29. S. Lauritzen (1996). Graphical models. Oxford Science Publications.
30. S.Z. Li (1995). Markov random field modeling in computer vision. Springer-Verlag, Tokyo.
31. M. Luettgen, W. Karl, A. Willsky (1994). Efficient multiscale regularization with applications to the computation of optical flow. IEEE Trans. Image Processing 3(1), 41–64.
32. J.L. Marroquin, S. Mitter, T. Poggio (1987). Probabilistic solution of ill-posed problems in computational vision. J. American Statist. Assoc. 82, 76–89.
33. D. Mumford (1995). Bayesian rationale for the variational formulation. In B. ter Haar Romeny, editor, Geometry-driven diffusion in computer vision. Kluwer Academic Publishers, Dordrecht, 135–146.
34. W. Pieczynski (1992). Statistical image segmentation. Mach. Graphics Vision 1(1/2), 261–268.
35. A.D. Sokal (1989). Monte Carlo methods in statistical mechanics: foundations and new algorithms. Cours de troisième cycle de la physique en Suisse Romande.
36. R.H. Swendsen, J.-S. Wang (1987). Non-universal critical dynamics in Monte Carlo simulations. Phys. Rev. Lett. 58, 86–88.
37. R. Szeliski (1989). Bayesian modeling of uncertainty in low-level vision. Kluwer, Boston.
38. A.N. Tikhonov, V.Y. Arsenin (1977). Solution of ill-posed problems. Winston, New York.
39. J. Whittaker (1990). Graphical models in applied multivariate statistics. Wiley, Chichester.
40. G. Winkler (1995). Image analysis, random fields and dynamic Monte Carlo methods. Springer, Berlin.
41. L. Younes (1988). Estimation and annealing for Gibbsian fields. Ann. Inst. Henri Poincaré, Probabilités et Statistiques 24(2), 269–294.
42. L. Younes (1989). Parametric inference for imperfectly observed Gibbsian fields. Prob. Th. Rel. Fields 82, 625–645.