Unsupervised Progressive Parsing of Poisson Fields Using Minimum Description Length Criteria

Robert D. Nowak*
Dept. of Electr. and Comp. Eng.
Rice University
Houston, TX, U.S.A.

Mário A. T. Figueiredo†
Instituto de Telecomunicações, and Instituto Superior Técnico
1049-001 Lisboa, Portugal

Abstract

This paper describes novel methods for estimating piecewise homogeneous Poisson fields based on minimum description length (MDL) criteria. By adopting a coding-theoretic approach, our methods are able to adapt to the observed field in an unsupervised manner. We present a parsing scheme based on fixed multiscale trees (binary, for 1D; quad, for 2D) and an adaptive recursive partitioning algorithm, both guided by MDL criteria. We show that the recursive scheme outperforms the fixed-tree approaches.

Figure 1: Illustration of the problem addressed in this paper: (a) piecewise-constant intensity function (Poisson intensities 0.05, 0.2, and 0.4); (b) scatter-plot of observed photon events; (c) intensity field parsing by the MDL-based recursive algorithm.

1 Introduction

Consider a realization from a spatial Poisson point process whose underlying intensity function is (or can be approximated as) piecewise constant (as exemplified in Figure 1(a)). This type of data arises in photon-limited imaging, particle and astronomical physics, computer traffic network analysis, and many other applications involving counting statistics. In particular, photon-limited images are formed by detecting and counting individual photon events (e.g., in gamma-ray astronomy or nuclear medicine).

From an observed realization (see Figure 1(b)), our goal is to parse, or segment, the observation space into regions of (roughly) homogeneous intensity (Figure 1(c)); i.e., regions of space in which the distribution of points is well modeled by a spatial Poisson distribution with constant intensity. In this paper, we propose two new methods for unsupervised progressive parsing of Poisson fields based on Rissanen's minimum description length (MDL) principle [1]. One of the interesting aspects of our development is that, since the Poisson data (counts) are integer-valued, we are able to derive MDL criteria without recourse to asymptotic approximations. In contrast, most applications of MDL involve Gaussian models (which are real-valued) and require asymptotic arguments. Hence, our application of MDL here is especially simple and well motivated. This work is a coding-theoretic alternative to related Bayesian estimation schemes [2, 3, 4].

* Supported by the National Science Foundation, grant no. MIP-9701692.
† Supported by the Foundation for Science and Technology (Portugal), grant Praxis-XXI T.I.T.-1580.

2 Minimum Description Length

The MDL principle addresses the following question: given a set of data generation models, which best explains the observed data? To get a handle on the notion of "best," Rissanen employed the following gedankenexperiment. Suppose that we wish to transmit the observed data $x$ to a hypothetical receiver. Given a (probabilistic) generation model for the data, say $p(x|\theta)$, the Shannon-optimal code length is $-\log p(x|\theta)$. Of course, the receiver would also need to know the model parameter $\theta$ to decode the transmission; then, if $\theta$ is a priori unknown, we also need to estimate it, code it, and transmit it. Now, consider a set of $K$ competing model classes $\{p_i(x|\theta_i)\}_{i=1}^{K}$. In each class $i$, the "best" model is the one that gives the minimum code length,

    $\hat{\theta}_i = \arg\min_{\theta_i} \{-\log p_i(x|\theta_i)\} = \arg\max_{\theta_i} p_i(x|\theta_i);$

this is simply the maximum likelihood (ML) estimate within model class $i$. But if the class is a priori unknown, the "best" overall model is the one that leads to the minimum description length: the sum of $-\log p_i(x|\hat{\theta}_i)$ with the length of the code for $\hat{\theta}_i$ itself. The fundamental aspect of MDL is that it performs model selection (which the ML criterion alone does not) by penalizing more complex model classes (those requiring longer code lengths). MDL criteria have been successfully used in several image analysis/processing problems (see references in [5]).

The delicate issue in applying MDL is in how to encode the parameter $\hat{\theta}_i$; appropriate parameter code lengths are usually based on asymptotic approximations; e.g., the well known $(1/2)\log N$, where $N$ is the amount of data, is an asymptotic code length [1]. In this paper, we are able to avoid asymptotic approximations and obtain exact code lengths.

3 MDL for Poisson Data

We introduce next two progressive (multiscale or multiresolution) approaches to coding Poisson data. These approaches use MDL model class selection criteria as the basic building block.

3.1 Binary Multiscale Tree

Assume that the observed data is a (1D) sequence of counts, $\{x_k\}_{k=0}^{N-1}$, whose length $N = 2^J$. Let $x_{J,k} \equiv x_k$, for $k = 0,\ldots,2^J-1$; for $j = J-1,\ldots,0$, let us define the multiscale analysis of the data according to:

    $x_{j,k} \equiv x_{j+1,2k} + x_{j+1,2k+1}, \quad k = 0,\ldots,2^j-1.$    (1)

The $\{x_{j,k}\}$ are Haar scaling coefficients; for more details on multiscale analyses of Poisson data, see [2, 3].

Now suppose we wish to transmit the observed data $\{x_{J,k}\}$. Adopting a predictive coding approach, operating in scale, from coarse to fine, we naturally start by transmitting the total count $x_{0,0}$; this can be coded using Elias' technique for arbitrarily large integers [1]. We then progressively transmit the data (in a coarse-to-fine fashion) by next sending $x_{1,0}$, then $x_{2,0}$, $x_{2,2}$, and so on. Note that we only need to send the scaling coefficients with even $k$ indices; the corresponding odd-indexed data are deduced from them and the previously transmitted coarser data (e.g., $x_{1,1} = x_{0,0} - x_{1,0}$). At each stage, our progressive transmission scheme takes advantage of the coarser data already sent. Formally, we are interested in the conditional probability $p(x_{j+1,2k}|x_{j,k})$. It is well known (see [2, 3]) that this probability is binomial with parameters $x_{j,k}$ and $\theta_{j,k} \equiv \lambda_{j+1,2k}/\lambda_{j,k}$, where $\lambda_{j,k}$ and $\lambda_{j+1,2k}$ are the intensities underlying the Poisson counts $x_{j,k}$ and $x_{j+1,2k}$, respectively. So,

    $p(x_{j+1,2k}\,|\,x_{j,k},\theta_{j,k}) = \mathrm{Bi}(x_{j+1,2k}\,|\,x_{j,k},\theta_{j,k}) = \binom{x_{j,k}}{x_{j+1,2k}}\,\theta_{j,k}^{x_{j+1,2k}}\,(1-\theta_{j,k})^{x_{j+1,2k+1}}.$    (2)

Now consider two available model classes. Model Class 0 assumes a homogeneous Poisson process; then, $\lambda_{j+1,2k} = \lambda_{j+1,2k+1}$ and consequently, since $\lambda_{j,k} = \lambda_{j+1,2k} + \lambda_{j+1,2k+1}$, we have $\theta_{j,k} = 1/2$. Alternatively, in Model Class 1, $\theta_{j,k}$ is a free parameter. Hence, we have two possible description lengths (recall that $x_{j,k}$ is already known by the receiver).

Model Class 0: Since $\theta_{j,k} = 1/2$, it doesn't require encoding; the description length is then simply

    $L_0 = -\log \mathrm{Bi}(x_{j+1,2k}\,|\,x_{j,k},1/2) = -\log\binom{x_{j,k}}{x_{j+1,2k}} + x_{j,k}\log 2.$    (3)

Model Class 1: In this case, the first step consists in estimating, coding, and transmitting $\theta_{j,k}$. Its ML estimate is $\hat{\theta}_{j,k} = x_{j+1,2k}/x_{j,k}$. Because $x_{j,k}$ was already transmitted, it suffices to encode and transmit $x_{j+1,2k}$; since $x_{j+1,2k} \in \{0,1,\ldots,x_{j,k}\}$, this requires $\log(x_{j,k}+1)$ bits. Surprisingly, we find that while encoding the ML estimate of the parameter, we have encoded the data $x_{j+1,2k}$ itself, and so no additional coding is needed. The resulting description length is simply

    $L_1 = \log(x_{j,k}+1) = -\log\frac{1}{x_{j,k}+1}.$    (4)

We transmit the data according to Model Class 0 if $L_0 < L_1$, and according to Model Class 1 otherwise (what we choose when $L_0 = L_1$ is irrelevant). We then define the optimal MDL parsing of the data according to the same criterion, applied at each scale and location. If $L_0 < L_1$, we set $\hat{\theta}_{j,k} = 1/2$; otherwise, $\hat{\theta}_{j,k} = x_{j+1,2k}/x_{j,k}$.

The underlying piecewise-constant intensity (scale $J$) is reconstructed/estimated from the total count $x_{0,0}$ and the sequence of splitting proportion estimates $\{\hat{\theta}_{j,k}\}$. Set $\hat{\lambda}_{0,0} \equiv x_{0,0}$. Next, with $j = 1$, $\hat{\lambda}_{1,0} = \hat{\lambda}_{0,0}\hat{\theta}_{0,0}$ and $\hat{\lambda}_{1,1} = \hat{\lambda}_{0,0}(1-\hat{\theta}_{0,0})$. Repeat this refinement process for $j = 2,\ldots,J$, to obtain $\{\hat{\lambda}_{J,k}\}$.
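In Python-style code, the node-level test in (3)-(4) and the multiplicative refinement above can be sketched as follows. This is a minimal illustrative sketch (the function names and recursive structure are ours, not from the text); code lengths are measured in bits, and the routine returns the finest-scale estimates $\{\hat{\lambda}_{J,k}\}$.

```python
import math

def L0(parent: int, left: int) -> float:
    """Eq. (3) in bits: -log2 C(parent, left) + parent * log2(2)."""
    return -math.log2(math.comb(parent, left)) + parent

def L1(parent: int) -> float:
    """Eq. (4) in bits: log2(parent + 1), the cost of coding x_{j+1,2k}."""
    return math.log2(parent + 1)

def parse_binary_tree(counts, lam=None):
    """MDL parsing on the fixed binary tree; `counts` must have length 2^J.
    Returns the finest-scale intensity estimates lambda_hat_{J,k}."""
    if lam is None:
        lam = float(sum(counts))           # lambda_hat_{0,0} = x_{0,0}
    n = len(counts)
    if n == 1:
        return [lam]
    parent, left = sum(counts), sum(counts[: n // 2])
    # MDL test at this node: theta = 1/2 (Model Class 0) vs free theta
    if parent == 0 or L0(parent, left) <= L1(parent):
        theta = 0.5                        # homogeneous: no split retained
    else:
        theta = left / parent              # ML splitting proportion
    return (parse_binary_tree(counts[: n // 2], lam * theta)
            + parse_binary_tree(counts[n // 2:], lam * (1 - theta)))
```

For a homogeneous block such as [4, 4, 4, 4] the test keeps $\theta = 1/2$ at every node and the estimate is flat; for a strongly unbalanced block such as [16, 0] the free-parameter class wins and the split is retained.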

Finally, it is worth pointing out that the same criterion to choose between Model Classes 0 and 1 can be obtained under a Bayesian model selection perspective. Let $y$ denote a sample of a binomial random variable (same as $x_{j+1,2k}$) with probability function $\mathrm{Bi}(y|n,\theta)$ and consider the problem of deciding between two hypotheses: $H_0$: $\theta = 1/2$, or $H_1$: $\theta \neq 1/2$ (otherwise totally unknown). Furthermore, with no a priori preference for $H_0$ or $H_1$ we use a prior $p(H_0) = p(H_1) = 1/2$. The models for $\theta$ under the two hypotheses are

    $p(\theta|H_0) = \delta(\theta - 1/2),$    (5)
    $p(\theta|H_1) = U(\theta\,|\,0,1),$    (6)

where $\delta(x-a)$ denotes a Dirac delta function (a point mass) at $a$ and $U(\theta|a,b)$ stands for a uniform probability density function between $a$ and $b$. Naturally, we decide for $H_1$ if $p(H_1|y) \geq p(H_0|y)$, which is equivalent to $p(y|H_1) \geq p(y|H_0)$ because $p(H_0) = p(H_1)$. The marginal likelihoods are particular cases of the binomial-Beta distribution (see [6], p. 117):

    $p(y|H_0) = \int_0^1 p(y|\theta)\,p(\theta|H_0)\,d\theta = \mathrm{Bi}(y\,|\,n,1/2)$
    $p(y|H_1) = \int_0^1 p(y|\theta)\,p(\theta|H_1)\,d\theta = \frac{1}{n+1}.$

Then, comparing $p(y|H_0)$ versus $p(y|H_1)$ is the same as comparing $L_0$ versus $L_1$, as given by (3) and (4).

0 1

i1 N 1

that fx g and fx g are sets of Pois-

k k

k =0 k =i

3.2 Adaptive Recursive Parsing

son samples of di erent intensities. Given s ,

N

One of the limitations of the multiscale approach

the individual counts are still multinomially dis-

ab ove is that the parsing is restricted to a xed binary

tributed. However, the rst i parameters are now

tree. In general, the b est lo cations for parsing may

equal to, say, =i, and the N i last ones equal

not coincide with the dyadic partition enforced bythe

to (1 )=(N i). Note that with  = i=N ,we

multiscale analysis ab ove. Hence, wenow consider an

recover Mo del Class 0. Then,

adaptive recursive approach that allows for splits at

 

s

arbitrary lo cations. Again, supp ose we wish to trans-

N

p(x ;:::;x js ;)= 

1 N N

N 1

x :::x

mit a length N sequence of counts x = fx g . Un-

1 N

k

k =0

 

s s

like in the xed tree approach, N no longer needs to

  N i

s

i

 1 

(8) 1

be a power of 2.

N 1

X

i N i

x = s

N

k

Under Mo del Class 0, we co de and transmit the N

k =0

counts assuming that they are Poisson samples with a

P

i1

x , i =1;:::;N. where s 

k i

common intensity . Alternatively,we split the data

k =0

into two (connected) comp onents with two di erent

To use this mo del to enco de the data, we rst

constantintensities; however, unlike in the xed tree

have to estimate ; its ML estimate is simply

s

i

approach describ ed in the previous subsection, weare

. Since s is known, all that needs to b =

N

s

N

allowed to lo ok for the b est lo cation to split the counts.

be enco ded is s whichinvolves a co de-length of

i

Accordingly, this alternative actually consists of N 1

log(1 + s ) (since s 2f0; 1; :::; s g). But after

N i N

mo del classes, one for each split lo cation, whichwe will

transmitting s , we can build a b etter co de for

i

P

i1

index by i 2f1; 2; :::; N 1g.Wethus have a total of N

the data, b ecause we know that x = s

k i

k =0

P

N 1

candidate classes; if they are all a priori equiprobable,

and x = s s . Sp eci cally, eachsetof

k N i

k =i

each index requires the same log N co de length which

counts is itself multinomially distributed, leading

can then b e dropp ed from any comparisons.

to a total co de length

Since this basic building blo ck will b e included in a

 

s

i

recursive/predictive, coarse-to- ne, pro cedure, weas-

+ s log i L = log (1 + s ) log

i i N

P

N 1

x :::x

1 i1

sume that the total count s = x is known to

N k

k =0

 

the receiver and need not b e enco ded. The description

s s

N i

log +(s s ) log(N i):

N i

lengths achieved are:

x :::x

i N 1

Notice that (4) is a particular case of this ex-

120 120

pression for N = 2, s = x , x = x ,

j;k 1 j +1;2k

N 100 100

x = x .

j +1;2k +1 2 80 80

60 60

Our progressive/recursive parsing (or transmission)

40 40

scheme, pro ceeds as follows. As ab ove, we start by

20 20

enco ding the total count s by using, e.g., Elias' N

100 200 300 400 500 100 200 300 400 500

technique for arbitrary integers [1]. Then, from the

(a) (b)

full data set, we compute all the L 's. If L <

i 0

minfL ;:::;L g, our criterion states that the data

N 1

1 120 120

best enco ded as a single piece, and the pro cedure

is 100 100

stops. Otherwise, there is one best partition of the

80 80

i1 N 1

data, say fx g and fx g . We then transmit

k k

=0 k =i

k 60 60

i and s and apply the criterion to the two segments

i 40 40

i1 N 1

fx g and fx g . The receiver can compute the

k k

=0 k =i

k 20 20

second partial count from s (which it already re- N

100 200 300 400 500 100 200 300 400 500

ceived) and s as s s ; i.e., when the pro cedure

i N i

(c) (d)

is applied to each of the subsegments, the resp ective

lengths and totals were already transmitted. By re-

Figure 2: Example of estimating/parsing a piece-wise

cursively rep eating this pro cedure indep endently to

constantintensity function from observed counts. (a)

the resulting sub-blo cks of data we obtain avery ef-

Intensity. (b) Counts. (c) Estimate from multiscale bi-

cient recursive scheme of re nement. The pro cess

nary tree algorithm. (d) Adaptive recursive estimate.

stops when no further splits are indicated bythecri-

terion (i.e.,wekeep splitting blo cks until L is selected

0

for each sub-blo ck). The underlying intensity eldes-
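As an illustrative sketch of the procedure (our own Python fragment, with helper names not taken from the text), the code below evaluates $L_0$ and all the $L_i$ in bits and recursively splits until no split lowers the description length:

```python
import math

LOG2E = 1.0 / math.log(2.0)

def log2_multinom(counts):
    """log2 of the multinomial coefficient s! / (x_1! ... x_n!)."""
    s = sum(counts)
    return (math.lgamma(s + 1) - sum(math.lgamma(c + 1) for c in counts)) * LOG2E

def L_no_split(x):
    """Eq. (7): -log2 of the multinomial coefficient + s_N log2 N."""
    return -log2_multinom(x) + sum(x) * math.log2(len(x))

def L_split(x, i):
    """Code length of model class i: log2(1+s_N) bits for s_i, then one
    multinomial code per segment (uniform parameters 1/i and 1/(N-i))."""
    n, s, si = len(x), sum(x), sum(x[:i])
    return (math.log2(1 + s)
            - log2_multinom(x[:i]) + si * math.log2(i)
            - log2_multinom(x[i:]) + (s - si) * math.log2(n - i))

def parse_recursive(x, offset=0):
    """Adaptive recursive MDL parsing; returns (start, end, intensity) segments."""
    n = len(x)
    best_i, best_L = None, L_no_split(x)
    for i in range(1, n):
        Li = L_split(x, i)
        if Li < best_L:
            best_i, best_L = i, Li
    if best_i is None:                     # L_0 wins: single homogeneous piece
        return [(offset, offset + n, sum(x) / n)]
    return (parse_recursive(x[:best_i], offset)
            + parse_recursive(x[best_i:], offset + best_i))
```

On the counts [10, 10, 10, 0, 0, 0] this yields the two segments (0, 3) and (3, 6) with ML intensities 10 and 0; on a homogeneous block it returns a single segment.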

The underlying intensity field estimate is piecewise constant, with the segments defined by the obtained parsing and the corresponding intensities given by the ML estimates inside each segment. Of course, this is a suboptimal scheme, because at each level we are ignoring that each segment will be further subdivided into even smaller pieces, thus achieving an even shorter code length. An optimal scheme would be computationally extremely heavy.

We conclude this section with an illustrative example. As seen in Figure 2, the recursive parsing scheme clearly outperforms the binary-tree approach when the underlying intensity is piecewise constant.

4 Parsing in Two or More Dimensions

The 1D strategies described above are easily extended to 2D (or higher) using rectangular parsing. To illustrate the minor modifications encountered in higher dimensions, we look at both methods in the 2D (image) setting.

4.1 Quadtree-based Image Parsing

In 2D, the Haar multiscale analysis is as follows. We begin with Poisson data $\{x_{k,l}\}$, $k,l = 0,\ldots,2^J-1$, and define $x_{J,k,l} \equiv x_{k,l}$ and, for $j = J-1,\ldots,0$,

    $x_{j,k,l} = x_{j+1,2k,2l} + x_{j+1,2k+1,2l} + x_{j+1,2k,2l+1} + x_{j+1,2k+1,2l+1}.$

Here, $j = J$ and $j = 0$ are the highest (finest) and lowest (coarsest) resolutions (scales), respectively. In a progressive coding scheme, at scale $j+1$ we transmit each triple $x_{j+1,2k,2l}$, $x_{j+1,2k+1,2l}$, $x_{j+1,2k,2l+1}$, since the receiver already has the corresponding total $x_{j,k,l}$. The conditional probability $p(x_{j+1,2k,2l}, x_{j+1,2k+1,2l}, x_{j+1,2k,2l+1}\,|\,x_{j,k,l})$ is multinomial (rather than binomial, as in the 1D case) with parameters $\theta_{j+1,2k,2l} = \lambda_{j+1,2k,2l}/\lambda_{j,k,l}$, $\theta_{j+1,2k+1,2l} = \lambda_{j+1,2k+1,2l}/\lambda_{j,k,l}$, and $\theta_{j+1,2k,2l+1} = \lambda_{j+1,2k,2l+1}/\lambda_{j,k,l}$, where the $\{\lambda_{j,k,l}\}$ are the intensities underlying the Poisson counts.

Analogous to the 1D case, we consider two alternative models.

Model Class 0 (no split): The $\theta$'s are set to $1/4$, requiring no encoding, and the description length is the $-\log$ of the multinomial probability.

Model Class 1 (split into four): The parameters are coded and transmitted (which, just as in the 1D case, encodes the data itself). The parameters can be transmitted progressively with $\log_2(x_{j,k,l}+1)$, $\log_2(x_{j,k,l} - x_{j+1,2k,2l} + 1)$, and $\log_2(x_{j,k,l} - x_{j+1,2k,2l} - x_{j+1,2k+1,2l} + 1)$ bits, respectively. The sum is the description length.

This MDL rule can be shown to be equivalent to a Bayesian selection criterion (see Subsection 3.1).
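The node-level quadtree test just described can be sketched as follows (again an illustrative fragment of ours, with bit-valued code lengths; the helper names are not from the text):

```python
import math

LOG2E = 1.0 / math.log(2.0)

def log2_multinom(counts):
    """log2 of the multinomial coefficient."""
    s = sum(counts)
    return (math.lgamma(s + 1) - sum(math.lgamma(c + 1) for c in counts)) * LOG2E

def quad_split(children):
    """MDL test at one quadtree node: `children` are the four child counts
    (a, b, c, d) whose sum is the parent count x_{j,k,l}. Returns True if
    Model Class 1 (split into four) gives the shorter description."""
    a, b, c, d = children
    x = a + b + c + d
    # Model Class 0 (no split): all theta's = 1/4, so the cost is the
    # -log2 multinomial probability: -log2(coeff) + x * log2(4) = 2x bits.
    L0 = -log2_multinom(children) + 2 * x
    # Model Class 1 (split): code a, b, c progressively, as in Sec. 4.1.
    L1 = (math.log2(x + 1) + math.log2(x - a + 1) + math.log2(x - a - b + 1))
    return L1 < L0
```

For four roughly equal child counts the no-split class wins; a strongly unbalanced node such as (40, 0, 0, 0) is split.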

4.2 Adaptive Recursive Image Parsing

In 2D we have more freedom in how we split the data. To maintain a manageable algorithm, we restrict the splitting to rectangular tessellations of the plane. In our recursive scheme, the MDL criterion is applied to rectangular blocks to select one of the following possibilities: a) no splitting (the rectangle is considered homogeneous); b) the rectangle is split into four sub-rectangles defined by a common vertex (the best possible such splitting is chosen); and c) the rectangle is split horizontally or vertically into two sub-rectangles (the best possible such splitting is chosen). As in the 1D case, the code lengths for these options are derived from the multinomial probabilities.

We start by applying the criterion to the full image. Every time one rectangular block (the image itself, to start) is split (into 2 or 4 sub-rectangles), the criterion is again applied to the resulting sub-regions. The parsing process stops when no further splits are indicated by the MDL criterion. The final estimate of the intensity field is piecewise flat, with the rectangular regions defined by the parsing; the corresponding intensities are the ML estimates based on the data inside each region.

Application of the quadtree and recursive parsing schemes to a natural intensity image is shown below in Figure 3; see also Figure 1. The true intensity is near piecewise constant, and because the recursive scheme is more adaptive in its selection of parsing regions, it does a better job in this case.

Figure 3: Parsing a natural image. (a) Intensity. (b) Counts, (normalized) MSE = 1.00. (c) Estimate from multiscale quadtree algorithm, MSE = 0.83. (d) Adaptive recursive estimate, MSE = 0.54.

5 Conclusions and Future Work

Our MDL multiscale tree-based parsing scheme is an alternative to the Bayesian methods of [2, 3]. Recall that we have shown that our MDL criterion is, in fact, a special case of a Bayesian approach. The MDL approach, however, has no free parameters; it is fully data-driven. Due to the predictive (coarse-to-fine) nature of the encoding/estimation scheme, we were able to write exact (non-asymptotic) expressions for the parameter code-lengths.

Our adaptive recursive method is related to the "Bayesian Blocks" procedure developed in [4] (the recursive structure is similar); however, the Bayesian selection rule used in [4] differs considerably from the MDL criterion, and only 1D data is considered there.

The 2D methods described here are based on rectangular tessellations. We could use more general refinement schemes based on polygonal region splitting. For example, in the recursive scheme, at each step we could search for the optimal (in the MDL sense) line(s) partitioning a given polygon into two smaller polygons.

References

[1] J. Rissanen, Stochastic Complexity in Statistical Inquiry. Singapore: World Scientific, 1989.

[2] K. Timmermann and R. Nowak, "Multiscale modeling and estimation of Poisson processes with application to photon-limited imaging," IEEE Trans. on Info. Theory, vol. 45, pp. 846-862, 1999.

[3] E. Kolaczyk, "Bayesian multi-scale models for Poisson processes," J. Amer. Statist. Assoc., Sept. 1999.

[4] J. Scargle, "Studies in astronomical time series analysis. V. Bayesian blocks, a new method to analyze structure in photon counting data," Astrophysical Journal, vol. 504, pp. 405-418, 1998.

[5] M. Figueiredo and J. Leitão, "Unsupervised image restoration and edge location using compound GMRFs and the MDL principle," IEEE Trans. on Image Proc., vol. 6, pp. 1089-1102, 1997.

[6] J. Bernardo and A. Smith, Bayesian Theory. Chichester, UK: J. Wiley & Sons, 1994.