
BAYESIAN SPECTRUM AND CHIRP ANALYSIS†

E. T. Jaynes

Wayman Crow Professor of Physics

Washington University, St. Louis MO 63130

Abstract: We seek optimal methods of estimating the power spectrum and chirp (frequency change) rate for the case that one has incomplete noisy data on values y(t) of a time series. The Schuster periodogram turns out to be a "sufficient statistic" for the spectrum, a generalization of it playing the same role for chirped signals. However, the optimal processing is not a linear filtering operation like the Blackman-Tukey smoothing of the periodogram, but a nonlinear operation. While suppressing noise/side lobe artifacts it achieves the same kind of improved resolution that the Burg method did for noiseless data.

CONTENTS

 1. INTRODUCTION
 2. CHIRP ANALYSIS
 3. SPECTRAL SNAPSHOTS
 4. THE BASIC REASONING FORMAT
 5. A SIMPLE BAYESIAN MODEL
 6. THE PHASELESS LIKELIHOOD FUNCTION
 7. DISCUSSION – MEANING OF THE CHIRPOGRAM
 8. POWER SPECTRUM ESTIMATES
 9. EXTENSION TO CHIRP
10. MANY SIGNALS
11. CONCLUSION
APPENDIX A. OCEANOGRAPHIC CHIRP
APPENDIX B. WHY GAUSSIAN NOISE?
APPENDIX C. CALCULATION DETAILS
REFERENCES

† A revised and updated version of a paper presented at the Third Workshop on Maximum-Entropy and Bayesian Methods, Laramie, Wyoming, August 1-4, 1983. Published in Maximum-Entropy and Bayesian Spectral Analysis and Estimation Problems, C. R. Smith and G. J. Erickson, Editors, D. Reidel Publishing Co., 1987. This is the original investigation that evolved into the Bayesian Spectrum Analysis of Bretthorst (1988). The "plain vanilla" Bayesian analysis introduced here proved to be unexpectedly powerful in applications because of its flexibility, which allows it to accommodate itself to all kinds of complicating circumstances in a way determined uniquely by the rules of probability theory, with no need for any ad hoc devices or appeal to unreliable intuition.


1. INTRODUCTION

The Maximum Entropy solution found by Burg (1967, 1975) has been shown to give the optimal spectrum estimate – by a rather basic, inescapable criterion of optimality – in one well-defined problem (Jaynes, 1982). In that problem we estimate the spectrum of a time series (y_1, …, y_N) from incomplete data consisting of a few autocovariances {R_0, …, R_m}, m < N, of the entire time series, and there is no noise.

This is the first example in spectrum analysis of an exact solution, which follows directly from first principles without ad hoc intuitive assumptions and devices. In particular, we found (Jaynes, 1982) that there was no need to assume that the time series was a realization of a "stationary Gaussian process". The Maximum Entropy principle automatically created the Gaussian form for us, out of the data. This indicated something that could not have been learned by assuming a distribution; namely, that the Gaussian distribution is the one that can be realized by Nature in more ways than can any other that agrees with the given autocovariance data. In this sense it is the 'safest' probability assignment one could make from the given information. This classic solution will go down in history as the "hydrogen atom" of spectrum analysis theory.

But a much more common problem, also considered by Burg, is the one where our data consist, not of autocovariances, but the actual values of (y_1, …, y_N), a subset of a presumably longer full time series, contaminated with noise. Experience has shown Burg's method to be very successful here also, if we first estimate m autocovariances from the data and then use them in the MAXENT calculation. The choice of m represents our judgment about the noise magnitude, values too large introducing noise artifacts, values too small losing resolution. For any m, the estimate we get would be the optimal one if (a) the estimated autocovariances were known to be the exact values; and (b) we had no other information beyond those m autocovariances.

Although the success of the method just described indicates that it is probably not far from optimal when used with good judgment about m, we have as yet no analytical theory proving this or indicating any preferred different procedure. One would think that a true optimal solution should (1) use all the information the data can give; i.e. estimate not just m < N autocovariances from the data, but find our "best" estimate of all N of them and their probable errors; (2) then make allowance for the uncertainty of these estimates by progressively de-emphasizing the unreliable ones. There should not be any sharp break as in the procedure used now, which amounts to giving full credence to all autocovariance estimates up to lag m, zero credence to all beyond m.

In Jaynes (1982) we surveyed these matters very generally and concluded that much more analytical work needs to be done before we can know how close the present partly ad hoc methods are to optimal in problems with noisy data. The following is a sequel, reporting the first stage of an attempt to understand the theoretical situation better, by a direct Bayesian analysis of the noisy data problem. In effect, we are trying to advance from the "hydrogen atom" to the "helium atom" of spectrum analysis theory.

One might think that this had been done already, in the many papers that study autoregressive (AR) models for this problem. However, as we have noted before (Jaynes, 1982), introducing an AR model is not a step toward solving a spectrum analysis problem, only a detour through an alternative way of formulating the problem. An AR connection can always be made if one wishes to do so; for any power spectrum determines a covariance function, which in turn determines a Wiener prediction filter, whose coefficients can always be interpreted as the coefficients of an AR model. Conversely, given a set of AR coefficients, we can follow this sequence backwards and construct a unique power spectrum. Therefore, to ask, "What is the power spectrum?" is entirely equivalent to asking, "What are the AR coefficients?" A reversible mathematical transformation converts one formulation into the other. But while use of an AR representation is always possible,

it may not be appropriate (just as representing the function f(x) = exp(x²) by an infinite series of Bessel functions is always possible but not always appropriate).

Indeed, learning that spectrum analysis problems can be formulated in AR terms amounts to little more than discovering the Mittag-Leffler theorem of complex variable theory (under rather general conditions an analytic function is determined by its poles and residues).

In this field there has been some contention over the relative merits of AR and other models such as the MA (moving average) one. Mathematicians never had theological disputes over the relative merits of the Mittag-Leffler expansion and the Taylor series expansion. We expect that the AR representation will be appropriate (i.e., conveniently parsimonious) when a certain response function has only a few poles, which happen to be close to the unit circle; it may be very inappropriate otherwise.

Better understanding should come from an approach that emphasizes logical economy by going directly to the question of interest. Instead of invoking an AR model at the beginning, which might bring in a lot of inappropriate and unnecessary detail, and also limits the scope of what can be done thereafter, let us start with a simpler, more flexible model that contains only the facts of data and noise, the specific quantities we want to estimate; and no other formal apparatus. If AR relations – or any other kind – are appropriate, then they ought to appear automatically, as a consequence of our analysis, rather than as arbitrary initial assumptions.

This is what did happen in Burg's problem; MAXENT based on autocovariance data led automatically to a spectrum estimator that could be expressed most concisely and beautifully in AR form, the Lagrange multipliers being convolutions of the AR coefficients: λ_n = Σ_k a_k a_{n+k}.

The first reaction of some was to dismiss the whole MAXENT principle as "nothing but AR", thereby missing the point of Burg's result. What was important was not the particular analytical form of the solution; but rather the logic and generality of his method of finding it.

The reasoning will apply equally well, generating solutions of different analytical form, in other problems far beyond what any AR model could cope with. Indeed, Burg's method of extrapolating the autocovariance beyond the data was identical in rationale, formal relations, and technique, with the means by which modern statistical mechanics predicts the course of an irreversible process from incomplete macroscopic data.

This demonstration of the power and logical unity of a way of thinking, across the gulf of what appeared to be entirely different fields, was of vastly greater, and more permanent, scientific value than merely finding the solution to one particular technical problem.

Quickly, the point was made again just as strongly by applying the same reasoning to problems of image reconstruction, of which the work of Gull and Daniell (1978) is an outstandingly concise, readable example.

I think that, 200 years from now, scholars will still be reading these works, no longer for technical enlightenment – for by then this method of reasoning will be part of the familiar cultural background of everybody – but as classics of the History of Science, which opened up a new era in how scientists think.

The actual reasoning had, in fact, been given by Boltzmann and Gibbs long before; but it required explicit, successful applications outside the field of thermodynamics before either physicists or statisticians could perceive its power or its generality.

In the study reported here we have tried to profit by these lessons in logical economy; at the beginning it was decided not to put in that fancy entropy stuff until we had done an absolutely conventional, plain vanilla Bayesian analysis – just to see what was in it, unobscured by all the details that appear in AR analyses. It turned out that so much surprising new stuff was in it that we are still exploring the plain vanilla Bayesian solution and have not yet reached the entropic phase of the theory!


The new stuff reported here includes what we think is the first derivation of the Schuster periodogram directly from the principles of probability theory, and an extension of spectrum analysis to include chirp analysis (rate of change of frequency). Before we had progressed very far it became evident that estimation of chirp is theoretically no more difficult than "stationary" spectrum estimation. Of course, this takes us beyond the domain of AR models, and could never have been found within the confines of an AR analysis.

Our calculations and results are straightforward and elementary; in a communication between experienced Bayesians it could all be reported in five pages, and the readers would understand perfectly well what we had done, why we had done it, and what the results mean for data processing. Such readers will doubtless find our style maddeningly verbose.

However, hoping that this work might also serve a tutorial function, we have scattered throughout the text and appendices many pages of detailed explanation of the reasons for what we do and the meaning of each new equation as it appears.

Also, since consideration of chirp has not figured very much in past spectrum analysis, and Bayesian analysis has not been very prominent either, the next three Sections survey briefly the history and nature of the chirp problem and review the Bayesian reasoning format. Our new calculation begins in Sec. 5.

2. CHIRP ANALYSIS

The detection and analysis of chirped signals in noise may be viewed as an extension of spectrum analysis to include a new parameter, the rate of change of frequency. We ask whether there exist principles for optimal data processing in such problems.

Chirped signals occur in many contexts: in quantum optics (Jaynes, 1973; Nikolaus & Grischkowsky, 1983), the ionospheric "whistlers" of Helliwell (1965), human speech, the sounds of birds, bats, insects, and slide trombonists; radio altimeters, frequency-modulated radar, etc. Any transient signal, propagating through a dispersive medium, emerges as a chirped signal, as is observed in optics, ultrasonics, and oceanography; and presumably also in seismology, although the writer has not seen explicit mention of it. Thus in various fields the detection and/or analysis of chirped signals in noise is a potentially useful adjunct to spectrum analysis.

Quite aside from applications, the problem is a worthy intellectual challenge. Bats navigate skillfully in the dark, avoiding obstacles by a kind of acoustical chirped radar (Griffin, 1958). It appears that they can detect echoes routinely in conditions of signal/noise ratio where our best correlation filters would be helpless. How do they do it?

At first glance it seems that a chirped signal would be harder to detect than a monochromatic one. Why, then, did bats evolve the chirp technique? Is there some as yet unrecognized property of a chirped signal that makes it actually easier to detect than a monochromatic one?

Further evidence suggesting this is provided by the "Whistle language" developed by the Canary Islanders. We understand that by a system of chirped whistles they are able to communicate fairly detailed information between a mountaintop and the village below, in conditions of range (a few miles) and wind where the human voice would be useless.

In the case of the bats, one can conjecture three possible reasons for using chirp: (A) Since natural noise is usually generated by some "stationary" process, a weak chirped signal may resemble natural noise less than does a weak monochromatic signal. (B) Prior information about the chirp rate, possessed by the bat, may be essential; it helps to know what you are looking for. (C) Our whole conceptual outlook, based on years of non-chirp thinking, may be wrong; bats may simply be asking a smarter question than we are.

Of course, for both the bats and the Canary Islanders it may be that chirped signals are not actually easier to detect, only easier to recognize and interpret in strong noise.


After noting the existing "Spectral Snapshot" method of chirp analysis, we return to our general Bayesian solution for a model that represents one possible real situation, then generalize it in various ways, on the lookout for evidence for or against these conjectures.

In this problem we are already far beyond what Box and Tukey have called the "exploratory phase" of data analysis. We already know that the bats, airplanes, and Canary Islanders are there, that they are emitting purposeful signals which it would be ludicrous to call "random", and that those signals are being corrupted by additive noise that we are unable to control or predict. Yet we do have some cogent prior information about both the signals and the noise, and so our job is not to ask "What seems to go on here?" but rather to set up a model which expresses that prior information.

3. SPECTRAL SNAPSHOTS

A method of chirp analysis used at present, because it can be implemented with existing hardware, is to do a conventional Blackman-Tukey spectrum analysis of a run of data over an interval (t ∈ T₁), then over a later interval (t ∈ T₂), and so on. Any peak that appears to move steadily in this sequence of spectral snapshots is naturally interpreted as a chirped signal (the evanescent character of the phenomenon making the adjective "spectral" seem more appropriate than "spectrum").

The method does indeed work in some cases; and impressively in the feat of oceanographers (Barber & Ursell, 1948; Munk & Snodgrass, 1957) to correlate chirped ocean waves, with periods in the 10-20 second range, with storms thousands of miles away. Yet it is evident that spectral snapshots do not extract all the relevant information from the data; at the very least, evidence contained in correlations between data segments is lost.

The more serious and fundamental shortcoming of this method is that if one tries to analyze chirped data by algorithms appropriate to detect monochromatic signals, the chirped signal of interest will be weakened – possibly disastrously – through phase cancellation (the same mathematical phenomenon that physicists call Fresnel diffraction).

For this reason, if we adhere to conventional spectrum analysis algorithms, the cutting of the data into segments analyzed separately is not a correctible approximation. As shown in Appendix A, to detect a signal of chirp rate α – i.e. a sinusoid cos(ωt + αt²) – with good sensitivity by that method, one must keep the data segments so short that αT² < 1. If T is much longer than this, a large chirped signal – far above the noise level – can still be lost in the noise through phase cancellation. Appendix A also tries to correct some currently circulating misconceptions about the history of this method.

Our conclusion is that further progress beyond the spectral snapshot method is necessary and possible – and it must consist of finding new algorithms that (a) protect against phase cancellation; (b) extract more information from the data; and (c) make more use of prior information. But nobody's intuition has yet revealed the specific algorithm for this data analysis, so we turn for guidance to probability theory.

4. THE BASIC REASONING FORMAT

The principle we need is just the product rule of probability theory, p(AB|C) = p(A|C) p(B|AC), which we note is symmetric in the propositions A and B. Therefore let

I = prior information,  H = any hypothesis to be tested,  D = data.

Then p(HD|I) = p(D|I) p(H|DI) = p(H|I) p(D|HI); or, if p(D|I) > 0 (i.e. the data set is a possible one),

$$p(H|DI) = p(H|I)\,\frac{p(D|HI)}{p(D|I)} \tag{0}$$


which is Bayes' theorem, showing how the prior probability p(H|I) of H is updated to the posterior probability p(H|DI) as a result of acquiring the new information D. Bayesian analysis consists of the repeated application of this rule.

Progress in scientific inference was held up for decades by a persistent belief – for which we can find no basis in the mathematics or the physical facts of the problem – that the equations of probability theory were only rules for calculating frequencies; and not for conducting inference. However, we now have many analyses (B. de Finetti, H. Jeffreys, R. T. Cox, A. Wald, L. J. Savage, D. V. Lindley, and others) showing from widely different viewpoints that these equations are also the uniquely "right" rules for conducting inference.

That is, it is a theorem that anyone who represents degrees of plausibility by real numbers – and then reasons in a way not reducible to these equations – is necessarily violating some very elementary qualitative desiderata of rationality (transitivity, strong domination, consistency, coherence, etc.). We are concerned simply with the logic of consistent plausible reasoning; there is no necessary connection with frequencies or random experiments (although of course, nothing forbids us to introduce such notions in a particular problem).

Put differently, sufficiently deep and careful intuitive thinking will, after all inconsistencies have been detected and removed, necessarily converge eventually to the Bayesian conclusions from the same information. Recognizing this only enables us to reach those conclusions more quickly – how much more quickly we shall see presently.

New demonstrations of the power of Bayesian inference in real problems – yielding in a few lines important results that decades of "frequentist" analysis or intuitive thinking had not found – have been appearing steadily for about twenty years, the present work providing another example.

However, before we can apply Eq. (0) quantitatively, our problem must have enough structure so that we can determine the term p(D|HI). In its dependence on D for fixed H this is the "sampling distribution"; in its dependence on H for fixed D it is the "likelihood function". In the exploratory phase of a problem such structure may not be at hand.

Fortunately, our present problem is free of this difficulty. We shall apply (0) in which H stands typically for the statement that a multidimensional parameter lies in a certain specified region of the parameter space. Deploring, but nevertheless following, the present common custom, we use the same symbol p(x|y) for a probability or a probability density; the distinction must be read from the context.

5. A SIMPLE BAYESIAN MODEL

In this section, which is possibly the first direct Bayesian analysis of the problem, attempts at conceptual innovation are out of order, and we wish only to learn the consequences of an absolutely standard kind of model. We follow slavishly the time-worn procedure of "assuming the data contaminated with additive white gaussian noise", which is time-worn just because it has so many merits. As has long been recognized, not only is it the most realistic choice one can make in most problems; the solutions will be analytically simple and not far from optimal in others.

But there is an even more cogent reason for choosing this probability assignment. In most real problems, the only prior information we have about the noise is its mean square value; often not even that. Then, as noted above, because it has maximum entropy for a given mean square noise level, the independent "white" Gaussian distribution will be the safest, most conservative one we can use; i.e. it protects us most strongly from drawing erroneous conclusions. In other words, it most honestly describes our state of knowledge. But since there are still many people who simply do not believe this in spite of all demonstrations, let us amplify the point.

The purpose of our assigning a prior distribution to the noise is to define the range of possible variations of the noise vector e = {e₁, …, e_n} that we shall make allowance for in our inference. As is well known in the literature of Information Theory, the entropy of a distribution is an asymptotic measure of the size of the basic "support set" W of that distribution; in our case the n-dimensional "volume" occupied by the reasonably probable noise vectors. The MAXENT principle tells us – as does elementary common sense – that the distribution which most honestly represents what we know, while avoiding assuming things that we do not know, is the one with the largest support set W_max permitted by our information.

Then what Bayesian inference does is that any data that lie well within W are judged to be almost certainly noise artifacts and are ignored; while data that lie well outside W are judged to be almost certainly real effects and are emphasized. There is necessarily a transition region, where the data lie near the boundary of W, where no sure conclusions can be drawn. Therefore, unless we have very specific prior information in addition to the mean square value, so that we know some particular way in which the noise departs from white Gaussian, it would be dangerous to use any other distribution in our inference. According to the MAXENT principle, to do so would necessarily be making an assumption that restricts our considerations to some arbitrary subset W ⊂ W_max, in a way not justified by our prior information about the noise.

The price we would pay for this indiscretion is that if the true noise vector happened to lie in the complementary set W′ = W_max − W, then we would be misled into interpreting what is only an artifact of the noise as a real effect. And the chance of this happening is not small, as shown by the Entropy Concentration Theorem (Jaynes, 1982). To assign a distribution with entropy only slightly smaller than the maximum may contract the volume of W by an enormous factor, often more than 10¹⁰. Then virtually every possible noise vector would lie in W′, and we would be seeing things that are not there in almost every data set. Gratuitous assumptions in assigning a noise probability distribution can be very costly.

Indeed, unless we can identify and correctly understand some specific defect in this simplest independent Gaussian model, we are hardly in a position to invent a better one.

However, these arguments apply only to the noise, because the noise is completely unknown except for its mean square value. The signal of interest is of course something about which we know a great deal in advance (just the reason why it is of interest). It would be ludicrous to think of the signal as a sample from an "independent Gaussian distribution". This would amount to throwing away all the prior information we have about its structure (functional form), thus losing our best means of finding it in the noise.

Our only seemingly drastic simplification is that for the time being we suppose it known in advance that only a single signal can be present, of the form

$$f(t) = A\cos(\omega t + \alpha t^2 + \theta) \tag{1}$$

and so the problem reduces to estimating its five parameters (A, ω, α, θ, σ) [although often we are interested only in (ω, α)]. However, this is about the most realistic assumption that could be made for the Barber-Ursell-Munk-Snodgrass oceanographic chirp problem discussed in Appendix A, where it is highly unlikely that two different signals would be present simultaneously. It is considerably more realistic than that supposition – which has actually been made in this problem – that the chirped signal is a "sample from a Gaussian random process".

In the end it will develop that for purposes of estimating the power spectrum, this assumption of a single signal is hardly an assumption at all; the resulting solution remains valid, with only a slight reinterpretation and no change in the actual algorithm, however many signals may be present. It is a restrictive assumption only when we ask more detailed questions than "What is your best estimate of the power spectrum?"

Other assumptions (constant amplitude and chirp rate, etc.) turn out also to be removable. In fact, once we understand the solution to this simplest problem, it will be evident that it can be generalized at once to optimal detection of any signal of known functional-parametric form, sampled at arbitrary times, in non-white noise.

But for the present our true signal (1) is contaminated with the aforementioned white gaussian noise e(t), so the observable data are values of the function

$$y(t) = f(t) + e(t) \tag{2}$$

In practice we shall have these data only at discrete times, which we suppose for the moment to be equally spaced at integer values of t, and over a finite interval 2T. Thus our data consist of N = 2T + 1 values

$$D = \{y(t),\; -T \le t \le T\} \tag{3}$$

and we assign the aforementioned independent gaussian joint probability distribution for the values of the noise e(t) at the corresponding times, taking each e(t) ∼ N(0, σ²), where the variance σ² is supposed known.

We have gone to some length to explain our basis for choosing this noise distribution because it is a matter of much confusion, different schools of thought holding diametrically opposite views as to whether this is or is not a restrictive assumption. In fact, this confusion is so great that the rationale of our choice still requires further discussion, continued in Appendix B.

Whatever school of thought one favors, our equations will be just the same; only our judgments of their range of validity will differ. Given any true signal f(t), the probability density that we shall obtain the data set D = {y(t_i)} is just the probability that the noise values e(t) will make up the difference:

$$p(D|A,\omega,\alpha,\theta,\sigma) = \prod_{t=-T}^{T} \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\left\{-\frac{1}{2\sigma^2}\,[y(t) - f(t)]^2\right\} \tag{4}$$

which is our sampling distribution. Conversely, given σ and the data D, the joint likelihood of the unknown parameters is

$$L(A,\omega,\alpha,\theta) \propto \exp\left\{-\frac{1}{2\sigma^2}\sum_{t=-T}^{T}\,[y(t) - A\cos(\omega t + \alpha t^2 + \theta)]^2\right\} \tag{5}$$
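To make the model concrete, here is a short Python sketch (ours, not the paper's; the parameter values are arbitrary) that generates data according to (1)-(3) and evaluates the log of the likelihood (5) at trial parameter values:

```python
import numpy as np

# Synthetic data per Eqs. (1)-(3): one chirped signal in white Gaussian noise.
T = 50
t = np.arange(-T, T + 1)                  # N = 2T + 1 integer sampling times
A, omega, alpha, theta, sigma = 1.0, 0.8, 0.003, 0.4, 1.0   # invented "true" values

rng = np.random.default_rng(1)
f = A * np.cos(omega * t + alpha * t**2 + theta)            # Eq. (1)
y = f + sigma * rng.standard_normal(t.size)                 # Eq. (2)

def log_likelihood(A_, w_, a_, th_):
    """Log of Eq. (5), up to an additive constant."""
    model = A_ * np.cos(w_ * t + a_ * t**2 + th_)
    return -0.5 * np.sum((y - model)**2) / sigma**2

# The likelihood prefers the true parameters over a detuned, unchirped alternative:
print(log_likelihood(A, omega, alpha, theta))
print(log_likelihood(A, omega + 0.05, 0.0, theta))
```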

In analysis of discrete time series, the mathematics has tended to get cluttered with minute details about "end effects" associated with the exact limits of summation. But this is only a notational problem; we can remove the clutter with no loss of precision if we adopt the convention (call it "infinite padding with zeroes" if you like):

$$y_t = y(t) = 0, \qquad |t| > T \tag{6}$$

Then all our sums of functions K(y_t) over time indices can take the form Σ_t K(y_t), understood to run over (−∞ < t < ∞); the limits of summation are then looked after automatically, but kept out of sight.

Usually, the absolute phase θ is of no interest to us and we have no prior information about it; i.e. it is a "nuisance parameter" that we want to eliminate. We may integrate it out with respect to a uniform prior probability density, getting a marginal quasi-likelihood

$$L(A,\omega,\alpha) = \frac{1}{2\pi}\int_0^{2\pi} L(A,\omega,\alpha,\theta)\, d\theta \tag{7}$$


which represents the contribution from the data to the joint marginal posterior distribution of (A, ω, α). This is the course we shall pursue in the present work.

But an important exception occurs if our ultimate objective is not to estimate the parameters (A, ω, α) of the "regular" signal f(t), but rather to estimate the particular "irregular" noise sequence e(t) that occurred during our observation period. This is the problem of seasonal adjustment, where an estimate of θ is also needed, and our data processing algorithm will be quite different.† The Bayesian theory of seasonal adjustment, to be given elsewhere, yields a new demonstration of the power of prior information to improve our estimates; we invite intuitionists to discover it without Bayesian methods.

6. THE PHASELESS LIKELIHOOD FUNCTION‡

We shall consider the exact relations later when we generalize to many signals, but for the moment we make an approximation which we believe to be generally accurate and harmless (although with obvious exceptions like ω = α = 0, or ω = π, α = 0):

$$\sum_t \cos^2(\omega t + \alpha t^2 + \theta) \simeq (2T+1)/2 = N/2 \tag{8}$$

Values of α differing by 2π are indistinguishable in this discrete sampled data; i.e. chirp aliasing, like frequency aliasing, confines us to the domain −π < α ≤ π.

With σ considered known, the joint likelihood of the four signal parameters is then

$$L(A,\omega,\alpha,\theta) = \exp\left\{-\frac{NA^2}{4\sigma^2} + \frac{A}{\sigma^2}\sum_t y_t\,\cos(\omega t + \alpha t^2 + \theta)\right\} \tag{9}$$

in which, since only the dependence on (A, ω, α, θ) matters, we may discard any factor not containing them.

The integration (7) over θ, carried out in Appendix C, yields the phaseless likelihood (or quasi-likelihood) function

$$L(A,\omega,\alpha) = \exp\left(-\frac{NA^2}{4\sigma^2}\right)\, I_0\!\left(\frac{A\sqrt{N\,C(\omega,\alpha)}}{\sigma^2}\right) \tag{10}$$

where I₀(x) = J₀(ix) is a Bessel function,⋆ and

$$C(\omega,\alpha) \equiv N^{-1}\sum_{ts} y_t\, y_s\,\cos[\omega(t-s) + \alpha(t^2 - s^2)] \tag{11}$$

† This appeared later, in Bernardo et al (1985), pp. 329-360.

‡ Given in detail in Bretthorst (1988).

⋆ Bretthorst (1988) discovered that we avoid Bessel functions by considering instead the quadrature components cos(ωt + αt²) and sin(ωt + αt²) with independent amplitudes A₁, A₂. This amounts to a slightly different treatment of the prior information about amplitudes, in which the parameters (A₁, A₂) take the place of our (A, θ). In effect, our analysis assigns equal prior probability to equal ranges of (A, θ); Bretthorst's assigns equal probability to equal areas in the (A₁, A₂) plane. But by a well-known rule of thumb, this has about the same effect on the final conclusions as would having one less data point; he shows that his final conclusions are numerically indistinguishable from the ones obtained here.


The form of (10) already provides some (at least to the writer) unexpected insight. Given any sampling distribution, a likelihood or quasi-likelihood function, in its dependence on the parameters, contains all the information in the data that is relevant for any inference about those parameters – whether it be joint or individual point estimation, interval estimation, testing any hypotheses concerning them, etc. But the only data dependence in L(A, ω, α) comes from the function C(ω, α). Therefore, C plays the role of a "sufficient statistic"; this function summarizes all the information in the data that is relevant for inference about (A, ω, α).

Because of its fundamental importance, C(ω, α) should be given a name. We shall call it the chirpogram of the data, for a reason that will appear in Eq. (13) below. It seems, then, that whatever specific question we seek to answer about a chirped signal, the first step of data analysis will be to determine the chirpogram of the data.
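Computationally (an editorial remark), the double sum (11) collapses to a single sum, C(ω, α) = N⁻¹ |Σ_t y_t exp[i(ωt + αt²)]|², since expanding the squared modulus reproduces the cosines of the phase differences. A sketch, continuing the Python example above (y, t as defined there; grid sizes arbitrary):

```python
def chirpogram(y, t, omegas, alphas):
    """C(omega, alpha) of Eq. (11) on a grid, computed as
    |sum_t y_t exp(i(omega t + alpha t^2))|^2 / N."""
    N = y.size
    C = np.empty((omegas.size, alphas.size))
    for i, w in enumerate(omegas):
        for j, a in enumerate(alphas):
            z = np.sum(y * np.exp(1j * (w * t + a * t**2)))
            C[i, j] = np.abs(z)**2 / N
    return C

omegas = np.linspace(0.0, np.pi, 200)
alphas = np.linspace(-0.01, 0.01, 81)
C = chirpogram(y, t, omegas, alphas)
i, j = np.unravel_index(C.argmax(), C.shape)
print("peak near (omega, alpha) =", omegas[i], alphas[j])   # near (0.8, 0.003)
```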

Of course, to answer a recent criticism (Tukey, 1984): in setting up a model the Bayesian – like any other theoretician – is only formulating a working hypothesis, to find out what its consequences would be. He has not thereby taken a vow of theological commitment to believe it forever in the face of all new evidence. Having got this far in our calculation, nothing in Bayesian principles forbids us to scan that chirpogram by eye, on the lookout for any unusual features (such as a peak stretched out diagonally, which suggests a change in chirp rate during the data run) that we had not anticipated when setting up our model.†

Indeed, we consider the need for such elementary precautions so obvious and trivial that it would never occur to us that anyone could fail to see it. If we do not stress this constantly in our writings, it is because we have more substantive things to say.

† Likewise, in calculating within a given problem, the orthodoxian does not question the correctness of his sampling distribution, and we do not criticize him for this. But it is for him, as for us, only a tentative working hypothesis; having finished the calculation, his unhappiness at the result may lead him to consider a different sampling distribution. The Bayesian has an equal right to do this.

7. DISCUSSION – MEANING OF THE CHIRPOGRAM

The chirpogram appears less strange if we note that when α = 0 it reduces to

$$C(\omega,0) = N^{-1}\sum_{ts} y_t y_s \cos\omega(t-s) = N^{-1}\Big|\sum_t y_t\, e^{i\omega t}\Big|^2 = \sum_t R(t)\cos\omega t \tag{13}$$

where R(t) is the data autocovariance:

$$R(t) = N^{-1}\sum_s y_s\, y_{s+t} \tag{14}$$

Thus C(ω, 0) is just the periodogram of Schuster (1897).

For nearly a Century, therefore, the calculation of C(ω, 0) has seemed, intuitively, the thing to do in analyzing a stationary power spectrum. However, the results were never satisfactory. At first one tried to interpret C(ω, 0) as an estimate of the power spectrum of the sampled signal. But the periodograms of real data appeared to the eye as too wiggly to believe, and in some problems the details of those wiggles varied erratically from one data set to another.

Intuition then suggested that some kind of smoothing of the wiggles is called for. Blackman and Tukey (1958; hereafter denoted B-T) recognized the wiggles as in part spurious side-lobes, in part beating between "outliers" in the data, and showed that in some cases one can make an estimated power spectrum with a more pleasing appearance, which one therefore feels has more truth in it, by introducing a lag window function W(t) which cuts off the contributions of large t in (13), giving the B-T spectrum estimate

$$\hat P_{BT}(\omega) = \sum_{t=-m}^{m} W(t)\, R(t)\, \cos\omega t \tag{15}$$

in which we use only autocovariances determined from the data up to some lag m, which may be a small fraction of the record length N.
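For later comparison, a sketch (ours) of the estimate (15); the text fixes no particular W(t), so a Hann lag window is assumed here. It continues the Python example above:

```python
def autocovariance(y, max_lag):
    """R(t) of Eq. (14), using the zero-padding convention (6)."""
    N = y.size
    return np.array([np.sum(y[:N - k] * y[k:]) / N for k in range(max_lag + 1)])

def bt_spectrum(y, omegas, m):
    """Blackman-Tukey estimate, Eq. (15), with an assumed Hann lag window."""
    R = autocovariance(y, m)
    lags = np.arange(m + 1)
    W = 0.5 * (1 + np.cos(np.pi * lags / m))   # tapers smoothly to 0 at t = m
    # R(-t) = R(t), so the sum over -m..m is R(0) + 2 * (sum over t > 0).
    return np.array([R[0] + 2 * np.sum(W[1:] * R[1:] * np.cos(w * lags[1:]))
                     for w in omegas])

P_bt = bt_spectrum(y, omegas, m=20)
```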

That this estimate disagrees with the data [the measured autocovariance R(t)] at every lag t for which W(t) ≠ 1 does not seem to have troubled anyone until Burg pointed it out 17 years later. He termed it a "willingness to falsify" the data, and advocated instead the Maximum Entropy estimate, which was forced by the constraints to agree with the data, the wiggles being removed by a totally different method, the "smoothest" extrapolation of R(t) beyond the data. Others, including this writer, quickly echoed Burg's argument with enthusiasm; but as we shall see presently, this was not the end of the story.

In any event, a lag window W(t) smoothly tapered to zero at t = m does reduce the unwanted wiggles – at a price. It leads to a reasonable estimate in the case of a broad, featureless spectrum; but of course, in that case the contributions from large t were small anyway, and the window had little effect. But lag window smoothing (equivalent to a linear filtering operation that convolves the periodogram with the fourier transform of the lag window function, thus smearing out the wiggles sideways) necessarily loses resolution and makes it impossible to represent sharp spectrum lines correctly.

One wonders, then, why B-T stopped at this point; for other procedures were available. To put it with a cynicism that we shall correct later: once one has been willing to falsify the data in one way, then his virtue is lost, and he should have no objection to falsifying them also in other ways. Why, then, must we use the same window function at all frequencies? Why must we process R(t) linearly? Why must we set all R(t) beyond m equal to zero, when we know that this is surely wrong? There were dozens of other ad hoc procedures, which would have corrected the failure of (15) to deal with sharp lines without venturing into the forbidden realm of Bayesianity.

But history proceeded otherwise; and now, finally, the first step of a Bayesian analysis has told us what a Century of intuitive ad hockery did not. The periodogram was introduced previously only as an intuitive spectrum estimate; but now that it has been derived from the principles of probability theory we see it in a very different light. Schuster's periodogram is indeed fundamental to spectrum analysis; but not because it is itself a satisfactory spectrum estimator, nor because any linear smoothing can convert it into one in our problem.

The importance of the periodogram lies rather in its information content; in the presence of white gaussian noise, it conveys all the information the data have to offer about the spectrum of f(t). As noted, the chirpogram has the same property in our more general problem. This makes it clear that any smoothing or other tampering with the periodogram can only hurt, destroying some of the information in the data.

It will follow from (10) [Eq. (27) below] that the proper algorithm to convert C(ω, 0) into a power spectrum estimate is a nonlinear operation much like exponentiation followed by renormalization, an approximation being:‡

$$\hat P(\omega) \simeq A \exp\left[\frac{C(\omega,0)}{\sigma^2}\right]$$

‡ In the slightly different treatment of Bretthorst (1988) this was found to be an exact relation.


This will suppress those spurious wiggles at the bottom of the periodogram as well as did the B-T linear smoothing; but it will do it by attenuation rather than smearing, and will therefore not lose any resolution. The Bayesian nonlinear processing of C(ω, 0) will also yield, when the data give evidence for them, arbitrarily sharp spectral line peaks from the top of the periodogram, that linear smoothing cannot give.

It is clear from (10) why a nonlinear processing of C is needed. The likelihood involves not just C, but C in comparison with the noise level. The procedure (15) smears out all parts of C equally, without considering where they stand relative to any noise. The Bayesian nonlinear processing takes the noise level into account; wiggles below the noise level are almost certainly artifacts of the noise and are suppressed, while peaks that rise above the noise level are believed and emphasized.

It may seem at this point surprising that intuition did not see the need for this long ago. Note, however, that Blackman and Tukey had in mind a very different problem than ours. For them the whole data y(t) were a sample from a "stochastic process" with a multivariate Gaussian distribution. From the standpoint of our present problem we might interpret the B-T work as a preliminary study of the noise spectrum before the purposeful signal was added. So for them the notion of "C in comparison with the noise level" did not exist.

To emphasize this, note that B-T considered the periodogram to have a sampling distribution that was Chi-squared with two degrees of freedom, independently of the sample size. That would not be the case in the problem we are studying unless the signal f(t) were absent.

From this observation there follows a point that has not been sufficiently stressed – or even noticed – in the literature: the B-T efforts were not directed at all toward the present problem of estimating the power spectrum of a signal f(t) from data which are contaminated with noise.

Confusion over "What is the problem?" has been rampant here. We conceded, after a theoretical study (Jaynes, 1982), that pure Maximum Entropy is not optimal for estimating the spectrum of a signal in the presence of noise; but failed to see the point just noted. Immediately, Tukey & Brillinger (1982) proceeded to stress the extreme importance of noise in real problems and the necessity of taking it into account. But they failed to note that the B-T method does not take noise into account either, much less the robustness of Maximum Entropy with respect to noise (i.e., its practical success in problems where noise is present). Finally, although one would expect Tukey to be the first to emphasize it, they did not note that a given procedure may solve more than one problem. In fact, the Maximum Entropy procedure is also the optimal solution to a Gaussian problem.

Therefore, although we agree with the need to take noise into account (as the present work demonstrates), we can hardly see that as an argument in favor of B-T methods in preference to Maximum Entropy in any problem. For we must distinguish between the B-T problem (spectrum of Gaussian noise) and the B-T procedure (15), which has not been derived from, or shown to have any logical connection at all to, that problem.⋆

⋆ This is one of the many difficulties with intuitive ad hoc procedures; not only are they invariably faulty on pragmatic grounds, whatever theoretical justification they may have lies hidden in the mind of the inventor. So they cannot be improved; only replaced.

Indeed, Burg's original derivation of the Maximum Entropy algorithm started from just that same B-T assumption that the data are Gaussian noise; and we showed (Jaynes, 1982) that pure Maximum Entropy from autocovariance data leads automatically to a Gaussian predictive distribution for future data. Thus it appears that the optimal solution to the B-T problem is not the B-T tapering procedure (15), but the Burg procedure!

But strangely enough, this enables us to take a more kindly view toward B-T methods. The procedure (15) cannot be claimed as the "best" one for any spectrum analysis problem; yet it has a place in our toolbox. As a procedure it is applicable to any data, and is dependent on no hypotheses of Gaussianity. Whatever the phenomenon, if nothing is known in advance about its spectrum, the tapering (15) is a quick and easy way to wash out the wiggles enough to let the eye get a preliminary view of the broad features of the spectrum, helpful in deciding whether a more sophisticated data analysis is called for. The most skilled precision machinist still has occasional use for a quick and dirty jackknife; and if in fact there are no sharp spectrum lines, the B-T procedure will lead to substantially the same results with less computation.

Nevertheless, any tapering clearly does falsify the data, throwing away usable information. But data contaminated with noise are themselves in part "false". The valid criticism from the standpoint of our present problem is that when the noise goes away the falsification in (15) remains; that is the case where Burg's pure Maximum Entropy solution was clearly the optimal one. But in the different problem envisaged by B-T, which led to (15), the noise cannot go away because the noise and the data are identical.

Thus from inspection of (10) we can see already some clarification of these muddy waters. The Bayesian method is going to give us, for spectrum estimation of a signal in the presence of noise, the same kind of improvement in resolution and removal of spurious features relative to the periodogram that the Maximum-Entropy formalism did in the absence of noise; and it will do this as well for chirped or non-chirped signals and for arbitrarily sharp lines. There is no meaningful comparison with B-T methods at all; for they do not address the same problem.

Realizing these things changed the direction of the present work. We started with the intention only of getting a quick, preliminary glimpse at what Bayesian theory has to say about monochromatic spectrum analysis in the presence of noise, before proceeding to entropy considerations. But as soon as the phaseless likelihood function (10) appeared, it was realized that (a) the status of B-T methods in this problem is very different from what we and others had supposed; (b) monochromatic spectrum estimation from noisy data must be radically revised by these results; and (c) given that revision, the extension to chirp is almost trivial in a Bayesian analysis (although impossible in an AR analysis).

Therefore, we now reconsider at some length the "old" problem of conventional pre-entropy spectrum estimation of a signal in noise, from this "new" viewpoint.

8. POWER SPECTRUM ESTIMATES

The terms "power spectrum" and "power spectrum estimator" can be defined in various ways. Also, we need to distinguish between the quite different goals of estimating a power spectral density, estimating the power in a spectrum line, and estimating the frequencies present.

In calling P̂(ω) an estimate of the power spectral density we mean that

$$\int_a^b \hat P(\omega)\, d\omega \tag{16}$$

is the expectation, over the joint posterior distribution for all the unknown parameters, of the energy carried by the signal f(t) (not the noise) in the frequency band a ≤ ω ≤ b, in the observation time N = 2T + 1. The true total energy carried by the signal (1) is NA²/2, and given data D we can write its expectation as

$$(N/2)\,E(A^2|D,I) = (N/2)\int_{-\pi}^{\pi} d\omega \int_0^{\infty} dA\, A^2 \int_{-\pi}^{\pi} d\alpha\;\, p(A,\omega,\alpha|D,I) \tag{17}$$

For formal reasons it is convenient to define our spectrum as extending over both positive and negative frequencies; thus (17) should be equated to the integral (16) over (−π, π). Therefore our power spectrum estimate is

$$\hat P(\omega) = (N/2)\int_0^{\infty} dA\, A^2 \int_{-\pi}^{\pi} d\alpha\;\, p(A,\omega,\alpha|D,I) \tag{18}$$

To define a power spectrum only over positive frequencies (as would be done experimentally), one should take instead P̂₊(ω) = 2P̂(ω), 0 ≤ ω < π.

In (17), (18), p(A, ω, α|D, I) is the joint posterior distribution

$$p(A,\omega,\alpha|D,I) = B\, p(A,\omega,\alpha|I)\; L(A,\omega,\alpha) \tag{19}$$

in which B is a normalization constant, I stands for prior information, and p(A, ω, α|I) is the joint prior probability density function for the three parameters.

If we have any prior information about our parameters, this is the place to put it into our equations, in the prior probability factors; if we do not, then a noninformative, flat prior will express that fact and leave the decision to the evidence of the data in the likelihood function, thus realizing R. A. Fisher's goal of "letting the data speak for themselves".

Of course, it is not required that we actually have, or believe, a piece of prior information I or data D before putting it into these equations; we may wish to find out what would be the consequences of having or not having a certain kind of prior information or data, in order to decide whether it is worth the effort to get it.

Therefore, by considering various kinds of prior information and data in (18) we can analyze a multitude of different special situations, real or hypothetical, of which we can indicate only a few in the present study.

First, let us note the form of traditional spectrum analysis, which did not contemplate the existence of chirp, that is contained in this "formalism". In many situations we know in advance that our signals are not chirped; i.e. the prior probability in (19) is concentrated at α = 0. Our equations will then be the same as if we had never introduced chirp at all; i.e. failure to note its possible existence is equivalent to asserting unwittingly the prior information: "α = 0: there is no chirp".

[We add parenthetically that much of the confusion in statistics is caused by this phenomenon. In other areas of applied mathematics, failure to notice all the possibilities means only that not all possibilities will be studied. In probability theory, failure to notice some possibilities is mathematically equivalent to making unconscious – and often very strong – assumptions about them.]

Setting α = 0 in (10) gives the joint likelihood L(A, ω). Not using prior information (i.e. using flat prior densities for A, ω), our spectrum estimator is then

$$\hat P(\omega) = \frac{(N/2)\int_0^{\infty} dA\, A^2\, L(A,\omega)}{\int_{-\pi}^{\pi} d\omega \int_0^{\infty} dA\; L(A,\omega)} \tag{20}$$

Note that this can be factored:

$$\hat P(\omega) = \left[\frac{(N/2)\int_0^{\infty} dA\, A^2\, L(A,\omega)}{\int_0^{\infty} dA\, L(A,\omega)}\right]\left[\frac{\int_0^{\infty} dA\, L(A,\omega)}{\int_{-\pi}^{\pi} d\omega \int_0^{\infty} dA\, L(A,\omega)}\right] \tag{21}$$

or,

$$\hat P(\omega) = (N/2)\,E(A^2|\omega,D,I)\;\times\; p(\omega|D,I) \tag{22}$$


In words: P̂(ω) = (conditional expectation of energy, given ω) × (posterior probability density for ω, given the data). Both factors may be of special interest, so we evaluate them separately. In Appendix C we derive

$$(N/2)\,E(A^2|\omega,D,I) = \sigma^2\left[1 + 2q + 2q\,\frac{I_1(q)}{I_0(q)}\right] \tag{23}$$

where

$$q(\omega) \equiv C(\omega,0)/2\sigma^2 \tag{24}$$

This has two limiting forms:

$$(N/2)\,E(A^2|\omega,D,I) \to \begin{cases} 2\,C(\omega,0), & C(\omega,0) \gg \sigma^2 \\ \sigma^2 + C(\omega,0), & C(\omega,0) \ll \sigma^2 \end{cases} \tag{25}$$

The second case is unlikely to arise, because it would mean that the signal and noise have, by chance, nearly cancelled each other out. But if this should happen, Bayes' theorem recognizes automatically that the reasonable estimate of signal energy is no longer 2C(ω, 0) but much greater; for now the mean square signal strength must be close to σ². This is something that intuition would be extremely unlikely to notice.

From Appendix C the second factor in (21) is

$$p(\omega|D,I) = \frac{e^{q}\, I_0(q)}{\int_{-\pi}^{\pi} e^{q}\, I_0(q)\, d\omega} \tag{26}$$

This answers the question: "Conditional on the data, what is the probability that the signal frequency lies in the range (ω, ω + dω), irrespective of its amplitude?" In many situations it is this question – and not the power spectrum density – that matters, for the signal amplitude may have been corrupted en route by all kinds of irrelevant circumstances and the frequency alone may be of interest.

Combining (23) and (26) we have the explicit Bayesian power spectrum estimate

$$\hat P(\omega) = \sigma^2\;\frac{[(1+2q)\,I_0(q) + 2q\,I_1(q)]\; e^{q}}{\int_{-\pi}^{\pi} e^{q}\, I_0(q)\, d\omega} \tag{27}$$

A graph of the nonlinear processing function

$$f(q) \equiv [(1+2q)\,I_0(q) + 2q\,I_1(q)]\; e^{q} \tag{28}$$

shows it close to the asymptotic form (8q/π)^{1/2} exp(2q) over most of the range likely to appear; in most cases we would not make a bad approximation if we replaced (27) by

$$\hat P(\omega) = \sigma^2\;\frac{4q\, e^{2q}}{\int_{-\pi}^{\pi} e^{2q}\, d\omega} \tag{29}$$
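A sketch (ours, continuing the Python example, with σ² = 1 taken as known per the model) of the nonlinear processing (27), using exponentially scaled Bessel functions so that e^q I₀(q) does not overflow; the denominator integral is approximated on the supplied one-sided frequency grid:

```python
from scipy.special import i0e, i1e   # i0e(q) = exp(-q) I0(q), i1e(q) = exp(-q) I1(q)

def bayes_spectrum(C, sigma2, domega):
    """Eq. (27): sigma^2 f(q) / integral[e^q I0(q)] d(omega), computed in
    log form, with f(q) = [(1+2q) I0(q) + 2q I1(q)] e^q."""
    q = C / (2.0 * sigma2)
    log_num = np.log((1 + 2*q) * i0e(q) + 2*q * i1e(q)) + 2*q   # log f(q)
    log_den = np.log(i0e(q)) + 2*q                              # log[e^q I0(q)]
    s = log_den.max()                                           # scale factor
    den = np.sum(np.exp(log_den - s)) * domega
    return sigma2 * np.exp(log_num - s) / den

C0 = chirpogram(y, t, omegas, np.array([0.0]))[:, 0]   # periodogram C(omega, 0)
P_hat = bayes_spectrum(C0, sigma2=1.0, domega=omegas[1] - omegas[0])
```

Plotted against the raw periodogram C0, the peak above the noise level is emphasized exponentially while the sub-noise wiggles are attenuated, as described above.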

Whenever the data give evidence for a signal well above the noise level [i.e. C(ω, 0) reaches a global maximum C_max = C(ω̂, 0) ≫ σ²], then q_max = C_max/2σ² ≫ 1 and most of the contribution to the integral in the denominator of (27) will come from the neighborhood of this greatest peak. Expanding

$$q(\omega) = q_{max} - |q''|\,(\omega - \hat\omega)^2/2 + \cdots \tag{30}$$


and using saddle-point integration we have then

$$\int_{-\pi}^{\pi} e^{q}\, I_0(q)\, d\omega \simeq 2\,(2q_{max}\,|q''|)^{-1/2}\,\exp(2q_{max}) \tag{31}$$

the factor of 2 coming from the equal peak at (−ω̂). Near the greatest peak the positive frequency estimator reduces to

$$\hat P_+(\omega) \simeq 2C_{max}\;\frac{1}{\sqrt{2\pi(\delta\omega)^2}}\;\exp\left[-\frac{(\omega - \hat\omega)^2}{2(\delta\omega)^2}\right] \tag{32}$$

where δω = |q''|^{-1/2} is the accuracy with which the frequency can be estimated. Comparing with the factorization (22) we see that the Gaussian is the posterior distribution for ω, while the first factor, the estimated total energy in the line, is:

$$\int_0^{\infty} \hat P_+(\omega)\, d\omega = 2C_{max} \tag{33}$$

in agreement with (25).

As a further check, suppose we have a pure sinusoid with very little noise:

$$y_t = A\cos\omega_0 t + e_t\,, \qquad A \gg \sigma \tag{34}$$

Then the autocovariance (14) is approximately

$$R(t) \simeq (A^2/2)\cos\omega_0 t \tag{35}$$

and so from (13)

$$C_{max} = C(\omega_0, 0) = (A^2/2)\sum_t \cos^2\omega_0 t = NA^2/4 \tag{36}$$

so 2C_max = NA²/2 is indeed the correct total energy carried by the signal in the observation time.

Likewise,

$$\sigma^2 q'' = -\sum_t t^2\, R(t)\cos\omega_0 t \simeq -\,N^3 A^2/48 \tag{37}$$

which gives the width

$$\delta\omega \simeq \frac{\sigma}{N}\,\sqrt{\frac{12}{C_{max}}} \tag{38}$$

These relations illustrate what we stressed above; in this problem the periodogram C(ω, 0) is not even qualitatively an estimate of the power spectral density. Rather, when C reaches a peak above the noise level, thus indicating the probable presence of a spectrum line, its peak value 2C_max is an estimate of the total energy in the line.
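These checks are easy to reproduce numerically; an editorial sketch (continuing in Python, with arbitrary values):

```python
# Check Eq. (36): for y_t = A cos(w0 t) with negligible noise, the periodogram
# peak is C_max ~ N A^2 / 4, so 2 C_max ~ N A^2 / 2, the total signal energy.
tt = np.arange(-200, 201)
N2 = tt.size
A2, w0 = 3.0, 1.1
yy = A2 * np.cos(w0 * tt)

C_max = np.abs(np.sum(yy * np.exp(1j * w0 * tt)))**2 / N2
print(C_max, N2 * A2**2 / 4)                    # agree up to small end effects

# Width estimate, Eq. (38): delta_omega ~ (sigma/N) sqrt(12 / C_max)
sigma0 = 0.1
print((sigma0 / N2) * np.sqrt(12.0 / C_max))
```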

9. EXTENSION TO CHIRP

Let P̂(ω, α) dω dα be the expectation of energy carried by the signal f(t) in an element dω dα of the frequency-chirp plane. From (18) the frequency-chirp density estimator which does not use any prior information about (A, ω, α) is simply

$$\hat P(\omega,\alpha) = \sigma^2\,\frac{f(q)}{\iint d\omega\, d\alpha\; e^{q}\, I_0(q)} \tag{39}$$

where f(q) is the same nonlinear processing function (28), except that now in place of (24) we have

$$q \equiv q(\omega,\alpha) = C(\omega,\alpha)/2\sigma^2 \tag{40}$$

As in (27), any peaks of C(ω, α) that rise above the noise level will be strongly emphasized, indicating high probability of a signal.

If we have indeed no prior information about the frequencies and chirp rates to be expected, but need to be ready for all contingencies, then there seems to be no way of avoiding the computation in determining C(ω, α) over the entire plane (−π < ω ≤ π, −π < α ≤ π); although by symmetry, when chirp is present we have inversion symmetry, P̂(ω, α) = P̂(−ω, −α), so only half of the (ω, α) plane needs to be searched if no prior information about the signal location is at hand.†

† Bretthorst (1988) gives such a full chirpogram, computer generated as contour lines in the (ω, α) plane, for Wolf's famous sunspot data, long recognized as having a prominent spectrum peak at about (11 years)⁻¹. But the chirpogram has its main peak off the ω-axis, indicating that a slightly chirped signal would fit Wolf's data considerably better than does any monochromatic signal.

But noting this shows how much reduction in computation can be had with suitable prior information. That bat, knowing in advance that it has emitted a signal of parameters (ω₀, α₀), and knowing also what frequency interval (determined by its flight speed v and the range of important targets) is of interest, does not need to scan the whole plane. It needs only to scan a portion of the line α ≃ α₀ extending from about ω₀(1 − v/c) to ω₀(1 + 4v/c), where c = velocity of sound, in order to cover the important contingencies (a target that is approaching is a potential collision, one dead ahead approaching at the flight speed is a stationary object, a potential landing site, one moving away rapidly is uninteresting, one moving away slowly may be a moth, a potential meal).

10. MANY SIGNALS

In the above we made what seems a strong assumption, that only one signal f(t) = A cos(ωt + αt² + θ) is present, and all our results were inferences as to where in the parameter space of (A, ω, α) that one signal might be. This is realistic in some problems (i.e. oceanographic chirp, or only one bat is in our cage, etc.), but not in most. In what way has this assumption affected our final results?

Suppose for the moment that the chirp rate α = 0. Then the power spectrum estimate P̂(ω) dω in (27) represents, as we noted, the answer to:

Question A: What is your "best" estimate (expectation) of the energy carried by the signal f(t) in the frequency band dω, in the interval of observation?

But had we asked a different question, such as:

Question B: What is your estimate of the product of the energy transports in the two non-overlapping frequency bands (a < ω < b) and (c < ω < d)?

the answer would be zero; a single fixed frequency ω cannot be in two different bands simultaneously.

If our prior information I tells us that only one frequency can b e present, then the joint p osterior

probability pE ;E jD; I  of the events

ab cd

E a< ! < b ; E c< ! < d

ab cd

is zero; one with ortho dox indo ctrination might b e tempted to say that the \ uctuations" in non{

overlapping frequency bands must b e p erfectly negatively correlated in our p osterior distribution.

Of course, in the present problem there are no such uctuations; they are creations only of the Mind

y

Bretthorst 1988 gives such a full chirp ogram, computer generated as contour lines in the !;  plane,

1

for Wolf 's famous sunsp ot data, long recognized as having a prominent sp ectrum p eak at ab out 11 years .

But the chirp ogram has its main p eak o the ! {axis, indicating that a slightly chirp ed signal would t

Wolf 's data considerably b etter than do es any mono chromatic signal.


Projection Fallacy, which presupposes that probabilities are physically real things. The greater adaptability of Bayesian analysis arises in part from our recognition that probabilities are only indications of our own incomplete information; therefore we are free to change them whenever our information changes.

Now if two frequencies can be present, it turns out that the answer to question A will be essentially the same. But the answer to question B, or any other question that involves the joint posterior probabilities in two different frequency bands, will be different; for then it is possible for power to be in two different bands simultaneously, and it is not obvious without calculation whether the correlations in energy transport in different bands are positive or negative.

If now we suppose that three signals may be present, the answers to questions A and B will not be affected; it makes a difference only when we ask a still more complicated question, involving joint probabilities for three different frequency bands; for example,

Question C: What is your estimate of the power carried in the frequency band (a < ω < b), given the powers carried in (c < ω < d) and (e < ω < f)?

and so on! In conventional spectrum estimation one asks only question A; and for that our one-signal solution (27) requires no change, however many signals may be present. Our seemingly unrealistic assumption makes a difference only when we ask more complicated questions.

In this Section we prove these statements, or at least give the general solution from which they can be proved. For the moment we leave out the chirp; it is now clear how we can add it easily to the final results. The signal to be analyzed is a superposition of n signals like (1):

$$f(t) = \sum_{m=1}^{n} A_m \cos(\omega_m t + \phi_m) \tag{41}$$

sampled at instants (t₁ ... t_N), which need not be uniformly spaced (our previous equally spaced version is recovered if we make the particular choice t_m = -T + m - 1). Using the notation

$$f_m \equiv f(t_m) \tag{42}$$

the joint likelihood for the parameters is now

$$L(A_i, \omega_i, \phi_i; \sigma) = \sigma^{-N} \exp\left\{-\frac{1}{2\sigma^2}\sum_{m=1}^{N}(y_m - f_m)^2\right\} = \sigma^{-N}\exp\left\{-\frac{N}{2\sigma^2}\left(\overline{y^2} + Q\right)\right\} \tag{43}$$

with the quadratic form

$$Q(A_i, \omega_i, \phi_i) \equiv \overline{f^2} - 2\,\overline{yf} \tag{44}$$

in which, as always, we use overbars to denote averages over the data:

$$\overline{y^2} = N^{-1}\sum_{m=1}^{N} y_m^2 \tag{45}$$

$$\overline{yf} = N^{-1}\sum_m y_m f_m \tag{46}$$

$$\overline{f^2} = N^{-1}\sum_m f_m^2 \,. \tag{47}$$
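In code, the overbar averages and the quadratic form are one-liners. A minimal sketch (our notation; the model is built directly from (41)):

    import numpy as np

    def quadratic_form_Q(y, t, A, omega, phi):
        """Q = mean(f^2) - 2*mean(y*f), Eqs. (44)-(47), for the model (41)."""
        f = sum(Aj * np.cos(wj * t + pj) for Aj, wj, pj in zip(A, omega, phi))
        return np.mean(f * f) - 2.0 * np.mean(y * f)

    # Up to a constant, the log of the likelihood (43) is then
    #   -N*log(sigma) - (N / (2*sigma**2)) * (np.mean(y*y) + Q)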


To rewrite Q as an explicit function of the parameters, define the function

$$x(\omega) \equiv N^{-1/2}\sum_{m=1}^{N} y_m \exp(i\omega t_m) \tag{48}$$

which is the "complex square root of the periodogram"; its projections

$$d_j \equiv N^{-1/2}\,\mathrm{Re}\left\{x(\omega_j)\, e^{i\phi_j}\right\} = N^{-1}\sqrt{N\, C(\omega_j, 0)}\;\cos(\theta_j + \phi_j)\,, \qquad 1 \le j \le n \tag{49}$$

where we have used Appendix C to write it in terms of the periodogram (θ_j denoting the phase of x(ω_j)); and the matrix

$$M_{jk} \equiv N^{-1}\sum_{m=1}^{N} \cos(\omega_j t_m + \phi_j)\,\cos(\omega_k t_m + \phi_k)\,, \qquad 1 \le j,k \le n\,. \tag{50}$$

Then

$$\overline{yf} = \sum_{j=1}^{n} d_j A_j \tag{51}$$

$$\overline{f^2} = \sum_{j,k=1}^{n} M_{jk}\, A_j A_k \tag{52}$$

and the quadratic form (44) is

$$Q = \sum_{jk} M_{jk}\, A_j A_k - 2 \sum_{j} d_j A_j \,. \tag{53}$$
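Computationally, (48)-(53) amount to assembling a vector d and a matrix M. A hedged numpy sketch (function and variable names are ours):

    import numpy as np

    def d_and_M(y, t, omega, phi):
        """d_j and M_jk of Eqs. (49)-(50); the sampling times t need not
        be uniformly spaced."""
        N = len(y)
        G = np.cos(np.outer(t, omega) + phi)   # column j holds cos(w_j t_m + phi_j)
        d = (y @ G) / N                        # d_j = N^{-1} sum_m y_m cos(w_j t_m + phi_j)
        M = (G.T @ G) / N                      # Eq. (50)
        return d, M                            # then Q = A @ M @ A - 2 * d @ A, Eq. (53)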

The likelihood (43) will factor in the desired way if we complete the square in Q. Define the quantities (Â₁ ... Â_n) by

$$\sum_{k=1}^{n} M_{jk}\, \hat A_k = d_j\,, \qquad 1 \le j \le n \tag{54}$$

and note that, if the inverse matrix M⁻¹ exists,

$$\sum_{j} d_j \hat A_j = \sum_{jk} M_{jk}\, \hat A_j \hat A_k = \sum_{kj} (M^{-1})_{kj}\, d_k d_j \,. \tag{55}$$

Then

$$Q = \sum_{jk} M_{jk}\,(A_j - \hat A_j)(A_k - \hat A_k) - \sum_{jk} M_{jk}\, \hat A_j \hat A_k \tag{56}$$

and the joint likelihood function splits into three factors:

$$L = L_1\, L_2\, L_3 \tag{57}$$

with


$$L_1 = \sigma^{-N} \exp\left(-N\,\overline{y^2}/2\sigma^2\right) \tag{58}$$

$$L_2 = \exp\left\{-\frac{N}{2\sigma^2}\sum_{jk} M_{jk}\,(A_j - \hat A_j)(A_k - \hat A_k)\right\} \tag{59}$$

$$L_3 = \exp\left\{+\frac{N}{2\sigma^2}\sum_{jk} M_{jk}\, \hat A_j \hat A_k\right\} \,. \tag{60}$$
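Completing the square is, numerically, a single linear solve. A sketch, assuming d and M assembled as in the earlier fragment:

    import numpy as np

    def amplitude_estimates(d, M):
        """Solve Eq. (54), M @ A_hat = d. By (55), the exponent of L3
        in (60) is then (N / (2*sigma**2)) * (d @ A_hat)."""
        A_hat = np.linalg.solve(M, d)   # assumes M nonsingular (distinct frequencies)
        return A_hat, d @ A_hat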

If σ is known, the factor L₁ may be dropped, since it will be absorbed into the normalization constant of the joint posterior distribution for the other parameters. If σ is unknown, then it becomes a "nuisance parameter" to be integrated out of the problem with respect to whatever prior probability p(σ|I) describes our prior information about it. The most commonly used case is that where we are initially completely ignorant about it, or wish to proceed as if we were, in order to see what the consequences would be. As expounded in some detail elsewhere (Jaynes, 1980) the Jeffreys prior probability assignment p(σ|I) = 1/σ is uniquely determined as the one which expresses "complete ignorance" of a scale parameter.

One of the fascinating things about Bayes' theorem is its efficiency in handling this situation. The noise level σ is highly relevant to our inferences; so if it is initially completely unknown, then it must be estimated from the data. But Bayes' theorem does this for us automatically in the process of integrating σ out of the problem. To see this, note that from (43) when we integrate out σ the quasi-likelihood for the remaining parameters becomes

$$\int_0^\infty L\, \frac{d\sigma}{\sigma} \;\propto\; \left[1 + Q/\overline{y^2}\right]^{-N/2} \,. \tag{61}$$

But when N is reasonably large (i.e. we have enough data to permit a reasonably good estimate of σ), this is nearly the same as

$$\exp\left\{-N Q/2\overline{y^2}\right\} \,. \tag{62}$$

In its dependence on {A_i, ω_i, φ_i} this is just (43) with σ² replaced by $\overline{y^2}$. Thus, in effect, if we tell Bayes' theorem: "I'm sorry, I don't know what σ² is!" it replies to us, "That's all right, don't worry. We'll just replace σ² by the best estimate of σ² that we can make from the data, namely $\overline{y^2}$."

After doing this, we shall have the same quadratic form Q as before, and its minimization will locate the same "best" estimates of the other parameters as before. The only difference is that for small N the peak of (61) will not be as sharp as that of the Gaussian (43), so we are not quite so sure of the accuracy of our estimates; but that is the only price we paid for our ignorance of σ.
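The closeness of (61) to (62) is easy to see numerically. In this sketch (all numbers arbitrary) the worst-case ratio of the two, over the region of Q where the peak lives, shrinks toward 1 as N grows:

    import numpy as np

    ybar2 = 1.3                                      # mean square of the data
    for N in (10, 100, 1000):
        Q = np.linspace(0.0, 10.0 * ybar2 / N, 200)  # where the peak lives
        student = (1.0 + Q / ybar2) ** (-N / 2.0)    # Eq. (61)
        gauss = np.exp(-N * Q / (2.0 * ybar2))       # Eq. (62)
        print("N = %4d   max ratio = %.3f" % (N, np.max(student / gauss)))

This prints ratios of about 4.6, 1.3, and 1.03: for small N the tails of (61) are fatter than Gaussian, which is just the loss of sharpness described above.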

But if N is reasonably large it hardly matters whether σ is known or unknown. Supposing for simplicity, as we did before, that σ is known, the joint posterior density for the other parameters {A_i, ω_i, φ_i} factors:

$$p(\{A_i, \omega_i, \phi_i\}|D, I) = p(\{A_i\}|\{\omega_i, \phi_i\}, D, I)\; p(\{\omega_i, \phi_i\}|D, I) \tag{63}$$

in which, to explain this compact notation more explicitly, we have the joint conditional probability for the amplitudes A_i given the frequencies and phases:


$$p(\{A_j\}|\{\omega_j, \phi_j\}, D, I) \propto \exp\left\{-\frac{N}{2\sigma^2}\sum_{jk} M_{jk}\,(A_j - \hat A_j)(A_k - \hat A_k)\right\} \tag{64}$$

and the joint marginal posterior density for the frequencies and phases:

$$p(\{\omega_j, \phi_j\}|D, I) \propto \exp\left\{+\frac{N}{2\sigma^2}\sum_{jk} M_{jk}\, \hat A_j \hat A_k\right\} \,. \tag{65}$$

Eq. (64) says that Â_j is the estimate of the amplitude A_j that we should make, given the frequencies and phases {ω_j, φ_j}; and (65) says that the most probable values of the frequencies and phases are those for which the estimated amplitudes are large. All this makes excellent common sense as soon as we see it; but nobody's intuition had seen it.

The above relations are, within the context of our model, exact and quite general. The number n of possible signals and the sampling times t_m may be chosen arbitrarily. Partly for that reason, to explore all the results that are in (63)-(65) would require far more space than we have here. We shall forego working out the interesting details of what happens to our conclusions when the sampling times are not equally spaced, what the answers to those more complicated questions like B and C are, and what happens to the matrix M in the limit when two frequencies coincide.

Somehow, as (ω_j - ω_k) → 0, it must be that A_j and A_k become increasingly confounded (indistinguishable). In the limit there is only one amplitude where two used to be. Then will the rank of M drop from n to (n - 1)? It is a recommended exercise to work this out for yourself in detail for the case n = 2, and see how as the two signals merge continuously into one there is a mixing of the eigenvectors of M much like the "level crossing" phenomenon of quantum theory.† At every stage the results make sense in a way that we do not think anybody's intuition can foresee, but which seems obvious after we contemplate what Bayes' theorem tells us.

Here we shall examine only the opposite limit, that of frequencies spaced far enough apart to be resolved, which of course requires that the number n of signals is not greater than the number N of observations. If our sampling points are equally spaced:

$$t_m = -T - 1 + m\,, \qquad 1 \le m \le N\,, \qquad 2T + 1 = N \tag{66}$$

then M_{jk} reduces to

$$M_{jk} = N^{-1}\sum_{t=-T}^{T} \cos(\omega_j t + \phi_j)\,\cos(\omega_k t + \phi_k) \,. \tag{67}$$

The diagonal elements are

$$M_{jj} = \frac{1}{2} + \frac{\sin N\omega_j}{2N \sin\omega_j}\,\cos 2\phi_j\,, \qquad 1 \le j \le n \tag{68}$$

in which the second term becomes appreciable only when ω_j → 0 or ω_j → π. If we confine our frequencies to be positive, in (0, π), then the terms in (67) with (ω_j + ω_k) will never become large, and the off-diagonal elements are

$$M_{jk} \simeq \frac{\sin Nu}{2N \sin u}\,\cos(\phi_j - \phi_k) + O(N^{-1}) \tag{69}$$

† If you are too lazy to work this out for yourself, it is done for you in Bretthorst (1988).


where u ≡ (ω_j - ω_k)/2. This becomes of order unity only when the two frequencies are too close to resolve and that merging phenomenon begins. Thus as long as the frequencies ω_j are so well separated that

$$N\,|\omega_j - \omega_k| \gg 1 \tag{70}$$

a good approximation will be simply

$$M_{jk} \simeq \tfrac{1}{2}\,\delta_{jk} \tag{71}$$

and the above relations simplify drastically.
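The approximation (71) is easy to check numerically; in this sketch (ours; the frequencies and phases are arbitrary but chosen to satisfy (70)) the computed matrix is close to diag(1/2, 1/2, 1/2), with off-diagonal elements of order 0.02 or smaller:

    import numpy as np

    N = 101
    T = (N - 1) // 2
    t = np.arange(-T, T + 1)
    omega = np.array([0.5, 0.9, 1.4])   # N*|w_j - w_k| is about 40 >> 1
    phi = np.array([0.3, 1.1, 2.0])
    G = np.cos(np.outer(t, omega) + phi)
    M = (G.T @ G) / N                   # Eq. (67)
    print(np.round(M, 3))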

The amplitude estimates reduce to, from (49),

$$\hat A_j = 2 d_j = 2N^{-1}\sqrt{N\, C(\omega_j, 0)}\;\cos(\theta_j + \phi_j) \tag{72}$$

and the joint posterior density for the frequencies and phases is

$$p(\{\omega_j, \phi_j\}|D, I) \propto \exp\left[\frac{N}{\sigma^2}\sum_j d_j^2\right] = \exp\left[\frac{1}{\sigma^2}\sum_j C(\omega_j, 0)\,\cos^2(\theta_j + \phi_j)\right] \tag{73}$$

while the joint posterior density for all the parameters is

$$p(\{A_j, \omega_j, \phi_j\}|D, I) \propto \exp\left\{\frac{1}{\sigma^2}\sum_{j=1}^{n}\left[A_j\sqrt{N\, C(\omega_j, 0)}\,\cos(\theta_j + \phi_j) - \frac{N A_j^2}{4}\right]\right\} \tag{74}$$

which is a product of independent distributions. If the prior probabilities for the phases {φ_j} were correlated, there would be an extra factor p(φ₁ ... φ_n | I) in (74) that removes this independence, and might make an appreciable difference in our conclusions.

This could arise, for example, if we knew that the entire signal is a wavelet originating in a single event some time in the past, and at that initial time all frequency components were in phase. Then integrating the phases out of (74) will transfer that phase correlation to a correlation in the {A_j}. As a result the answers to questions such as B and C above would be changed in a possibly important way. This is another interesting detail that we cannot go into here; we merely note that all this is inherent in Bayes' theorem.

But usually our prior probability distribution for the phases is independent and uniform on (0, 2π); i.e. we have no prior information about either the values of, or connections between, the different phases. Then the best inference we shall be able to make about the amplitudes and frequencies is obtained by integrating all the φ_j out of (73) and (74) independently. This will generate just the same I_0(q) Bessel functions as before; and writing q_j ≡ C(ω_j, 0)/2σ², our final results are:

$$p(\{\omega_j\}|D, I) \propto \prod_{j=1}^{n} e^{q_j}\, I_0(q_j) \tag{75}$$

$$p(\{A_j, \omega_j\}|D, I) \propto \prod_{j=1}^{n} \exp\left(-N A_j^2/4\sigma^2\right)\, I_0\!\left[A_j \sqrt{N\, C(\omega_j, 0)}\big/\sigma^2\right] \,. \tag{76}$$

But these are just products of independent distributions identical with our previous single-signal results (10), (26). We leave it as an exercise for the reader to show from this that our previous power spectrum estimate (27) will follow. Thus as long as we ask only question A above, our single-signal assumption was not a restriction after all.
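Thus, for question A, a one-dimensional frequency scan suffices even when many signals are present; each well-resolved signal contributes one copy of the same single-signal curve in (75). A sketch of that scan (our code; overflow is avoided with the exponentially scaled Bessel function):

    import numpy as np
    from scipy.special import i0e

    def log_post_omega(y, omegas, sigma):
        """log p(omega | D, I) up to an additive constant, Eq. (75),
        with q = C(omega, 0) / (2 sigma^2) on a frequency grid."""
        N = len(y)
        t = np.arange(N) - (N - 1) / 2.0
        lp = np.empty(len(omegas))
        for i, w in enumerate(omegas):
            C = abs(np.sum(y * np.exp(-1j * w * t))) ** 2 / N
            q = C / (2.0 * sigma ** 2)
            lp[i] = q + np.log(i0e(q))   # log(e^q I0(q)) without overflow
        return lp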

At this point it is clear also that if our n signals are chirped, we need only replace C(ω_j, 0) by C(ω_j, α_j) in these results, and we shall get the same answers as before to any question about frequency and chirp that involves individual signals, but not correlations between different signals.

11. CONCLUSION

Although the theory presented here is only the first step of the development that is visualized, we have thought it useful to give an extensive exposition of the Bayesian part of the theory.

No connection with AR models has yet appeared;‡ but we expect this to happen when additional prior information is put in by entropy factors. In a full theory of spectrum estimation in the presence of noise, in the limit as the noise goes to zero the solution should reduce to something like the original Burg pure Maximum Entropy solution (it will not be exactly the same, because we are assuming a different kind of data).

For understanding and appreciating Bayesian inference, no theorems giving its theoretical foundations can be quite as effective as seeing it in operation on a real problem. Every new Bayesian solution like the present one gives us a new appreciation of the power and sophistication of Bayes' theorem as the true logic of science. It seeks out every factor in the model that has any relevance to the question being asked, tells us quantitatively how relevant it is, and relentlessly exposes how crude and primitive other methods were.

We could expand the example studied here to many volumes without exhausting all the interesting and useful detail contained in the general solution, almost none of which was anticipated by sampling theory or intuition.

APPENDIX A: OCEANOGRAPHIC CHIRP

We illustrate the phase cancellation phenomenon, for the spectral snapshot method described in the text, as follows. That method essentially calculates the periodogram of a data segment {y_t : -T ≤ t ≤ T}, or possibly a Blackman-Tukey smoothing of it (the difference, affecting only the quantitative details, is not crucial for the point to be made here). The periodogram is

$$X(\omega) = N^{-1}\left|\sum_t y_t\, e^{i\omega t}\right|^2 \,. \tag{A1}$$

If y_t is a sinusoid of fixed frequency ν:

$$y_t = A\cos(\nu t + \phi) \tag{A2}$$

then the periodogram reaches its peak value at or very near the true frequency:

$$X(\nu) = N A^2/4 \,. \tag{A3}$$

But if the signal is chirped:

$$y_t = A\cos(\nu t + \alpha t^2 + \phi) \tag{A4}$$

‡ By 1994, ten years later, it still had not appeared; not because a search for it failed to find it, but because we were so overwhelmed by all the rich and useful detail in the "plain vanilla" Bayesian solution that instead the writer and his colleagues have concentrated on producing about two dozen articles and a book dealing with it. The AR-entropy connections remain tasks for the future.


then the periodogram (A1) is reduced, broadened, and distorted. Its value at the center frequency is only about

$$X(\nu) = \frac{N A^2}{4}\,\left| N^{-1} \sum_t e^{i\alpha t^2} \right|^2 \tag{A5}$$

which is always less than (A3) if α ≠ 0. As a function of α, (A5) is essentially the Fresnel diffraction pattern of N equally spaced narrow slits. To estimate the phase cancellation effect, note that when αT² < 1 the sum in (A5) may be approximated by a Fresnel integral, from whose analytic properties we may infer that when αT² ≥ 1, X(ν) will be reduced below (A3) by a factor of about

$$\frac{\pi}{4\alpha T^2}\left[1 - 2\left(\frac{2}{\pi\alpha T^2}\right)^{1/2} \cos\!\left(\alpha T^2 + \frac{\pi}{4}\right) + O\big((\alpha T^2)^{-1}\big)\right] \tag{A6}$$

It is clear from inspection of (A5) that the reduction is not severe if αT² ≤ 1, but (A6) shows that it can essentially wipe out the signal if αT² ≫ 1.
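A direct numerical check (our illustration): the exact reduction factor |N⁻¹ Σ_t exp(iαt²)|² from (A5), against the estimate (A6) with its O((αT²)⁻¹) term dropped. The agreement is rough near αT² ≈ 1 and improves as αT² grows:

    import numpy as np

    T = 500
    t = np.arange(-T, T + 1)
    N = len(t)
    for z in (1.0, 4.0, 24.0):                  # z = alpha * T^2
        alpha = z / T ** 2
        exact = abs(np.sum(np.exp(1j * alpha * t * t)) / N) ** 2
        approx = (np.pi / (4 * z)) * (1 - 2 * np.sqrt(2 / (np.pi * z))
                                      * np.cos(z + np.pi / 4))   # Eq. (A6)
        print("alpha*T^2 = %4.1f   exact %.4f   (A6) %.4f" % (z, exact, approx))

At αT² = 24, the value relevant to the Munk-Snodgrass records discussed below, both give a reduction to roughly 2-3% of the unchirped peak (A3).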

Also, in the presence of chirp the periodogram (A1) exhibits "line broadening" and "line splitting" phenomena; for some values of αT² there appear to be two or more lines of different amplitudes. For graphs demonstrating this, see Barber & Ursell (1948).

Discovery of chirped ocean waves, originating from storms thousands of miles away, was attributed by Tukey et al to Munk and Snodgrass (1957). Their feat was termed "one of the virtuoso episodes in the annals of power spectrum analysis", showing the value of alertness to small unexpected things in one's data. In Tukey (1984) it is suggested that a Bayesian wears some kind of blinders that make him incapable of seeing such things, and that discovery of the phenomenon might have been delayed by decades, or even centuries, if Munk & Snodgrass had used Bayesian, AR, or Maximum Entropy methods.

The writer's examination of the Munk-Snodgrass article has led him to a different picture of these events. The chirped signals they found were not small; in the frequency band of interest they were the most prominent feature present. The signals consisted of pressure variations measured off the California coast at a depth of about 100 meters, attributed to a storm in the Indian ocean about 9,000 miles away. The periods were of the order of 20 seconds, decreasing at a rate of about 10% per day for a few days.

They took measurements every 4 seconds, accumulating continuous records of length N = 2T = 6000 observations, or 6 2/3 hours. Thus, from their grand average measured chirp rate of about α = 1.6 × 10⁻⁷ sec⁻² we get αT² = 24, so the sum in (A5) wrapped several times around the Cornu spiral, and from (A6) phase cancellation must have reduced the apparent signal power at the center frequency to about 3% of its real value (this would have been about their best case, since nearer sources would give proportionally larger αT²).

Such a phase cancellation would not be enough to prevent them from seeing the effect altogether, but it would greatly distort the line shape, as seems to have happened. Although they state that the greater length of their data records makes possible a much higher resolution (of the order of one part in 3000) than previously achieved, their actual lines are of complicated shape, and over 100 times wider than this.

As to date of discovery, this same phenomenon had been observed by Barber & Ursell in England, as early as 1945. A decade before Munk & Snodgrass, they had correlated chirped ocean waves arriving at the Cornwall coast with storms across the Atlantic, and as far away as the Falklands. In Barber & Ursell (1948) we find an analysis of the phase cancellation effect, which led them to limit their data records to 20 minutes, avoiding the difficulty. Indeed, the broad and complicated line shapes published by Munk & Snodgrass look very much like the spectra calculated by Barber & Ursell to illustrate the disastrous effects of phase cancellation when one uses too long a record.

The theory of this physical phenomenon, relating the chirp rate α to the distance r to the source, was given in 1827 by Cauchy. In our notation, α = g/4r, where g is the acceleration of gravity. For example, in the Summer of 1945 Barber and Ursell observed a signal whose period fell from 17.4 sec to 12.9 sec in 36 hours. These data give α = 5 × 10⁻⁷ sec⁻², placing the source about 3000 miles away, which was verified by weather records. In May 1946 a signal appeared, whose period fell from 21.0 sec to 13.9 sec in 4 days, and came from a source 6000 miles away.
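The arithmetic here is easily reproduced. For the model cos(νt + αt² + φ) the instantaneous angular frequency is ν + 2αt, so α is half the measured rate of change of angular frequency; combining this with Cauchy's relation gives the range. A worked sketch of the Summer 1945 event (numbers from the text):

    import math

    def chirp_rate(P1, P2, seconds):
        """alpha = (1/2) d(omega)/dt for a period falling from P1 to P2 (sec)."""
        return 0.5 * (2 * math.pi / P2 - 2 * math.pi / P1) / seconds

    g = 9.8                                       # m/s^2
    alpha = chirp_rate(17.4, 12.9, 36 * 3600.0)   # period drop over 36 hours
    r = g / (4.0 * alpha)                         # Cauchy (1827): alpha = g/4r
    print("alpha = %.1e sec^-2,  range = %.0f miles" % (alpha, r / 1609.0))
    # prints alpha = 4.9e-07 sec^-2, range = 3133 miles, i.e. about 3000 miles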

Up to 1947 Barber and Ursell had analyzed some 40 instances like this, without using anybody's recommended methods of spectrum analysis to measure those periods. Instead they needed only a home-made analog computer, putting a picture of their data on a rotating drum and noting how it excited a resonant galvanometer as the rotation speed varied!

Munk & Snodgrass surely knew about the phase cancellation effect, and were not claiming to have discovered the phenomenon, since they make reference to Barber & Ursell. It appears to us that they did not choose their record lengths with these chirped signals in mind, simply because they had intended to study other phenomena. But the signals were so strong that they saw them anyway; not as a result of using a data analysis method appropriate to find them, but in spite of an inappropriate method.

Today, it would be interesting to re-analyze their original data by the method suggested here. If there is a constant amplitude and chirp rate across the data record, the chirpogram should reach the full maximum (A3), without amplitude degradation or broadening, at the true center frequency and chirp rate; and should therefore provide a much more sensitive and accurate data analysis procedure.

APPENDIX B: WHY GAUSSIAN NOISE?

In what L. J. Savage (1954) called the "objectivist" school of statistical thought (nowadays more often called "orthodox" or "sampling theory"), assigning a noise distribution is interpreted as asserting or hypothesizing a statement of fact; i.e. a physically real property of the noise. It is, furthermore, a widely believed "folk-theorem" that if the actual frequency distribution of the noise differs from the probability distribution that we assigned, then all sorts of terrible things will happen to us; we shall be misled into drawing all sorts of erroneous conclusions. We wish to comment on both of these beliefs.

We are aware of no real problem in which we have the detailed information that could justify such a strong interpretation of our noise distribution at the beginning of a problem; nor do we ever acquire information that could verify such an interpretation at the end of the problem.

As noted in the text, an assigned noise distribution is a joint distribution for all the errors; i.e. a probability p(e|I) assigned to the total noise vector e = (e₁ ... e_n) in an n-dimensional space. Obviously, this cannot be a statement of verifiable fact, for the experiment will generate only one noise vector. Our prior distribution p(e|I) defines rather the range of different possible noise vectors that we wish to allow for, only one of which will actually be realized.

In the problems considered here the information that would be useful in improving our spectrum estimates consists of correlations between the e_i (nonwhite noise). Such correlations cannot be described at all in terms of the frequencies of individual noise values. A correlation extending over a lag m is related to frequencies only of noise sequences of length ≥ m. Even for m = 3 the number of possible sequences of length m is usually far greater than the length of our data record; so it is quite meaningless to speak of the "frequencies" with which the different sequences of length m = 3 appear in our data, and therefore equally meaningless to ask whether our noise probability distribution correctly describes those frequencies.


As these considerations indicate, the function of p(e|I) cannot be to describe the noise, but rather to describe our state of knowledge about the noise. It is related to facts to this extent: we want to be fairly sure that we choose a set of possible vectors big enough to include the true one. This is a matter of being honest about just how much prior information we actually have; i.e. of avoiding unwarranted assumptions.

If in our ignorance we assign a noise distribution that is "wider" (for example, which supposes a greater mean-square error) than the actual noise vector, then we have only been more conservative, making allowance for a greater range of possibilities, than we needed to be. But the result is not that we shall see erroneous "effects" that are not there; rather, we shall have less discriminating power to detect small effects than we might have enjoyed, had we more accurate prior knowledge about the noise.

If we assign a noise distribution so "narrow" that the true noise vector lies far outside the set thought possible, then we have been dishonest and made unwarranted assumptions; for valid prior information could not have justified such a narrow distribution. Then as noted in the text we shall, indeed, pay the penalty of seeing things that are not there. The goal of stating, by our prior distribution, what we honestly do know, and nothing more, is the means by which we protect ourselves against this danger.

Tukey et al (1980) comment: "Trying to think of data analysis in terms of hypotheses is dangerous and misleading. Its most natural consequences are (a) hesitation to use tools that would be useful because 'we do not know that their hypotheses hold' or (b) unwarranted belief that the real world is as simple and neat as these hypotheses would suggest. Either consequence can be very costly. ... A procedure does not have hypotheses; rather there are circumstances where it does better and others where it does worse."

We are in complete agreement with this observation; and indeed would put it more strongly. While some hypotheses about the nature of the phenomenon may suggest a procedure, or even uniquely determine a procedure, the procedure itself has no hypotheses, and the same procedure may be suggested by many very different hypotheses. For example, as we noted in the text, the Blackman-Tukey window smoothing procedure was associated by them with the hypothesis that the data were a realization of a "stationary Gaussian random process". But of course nothing prevents one from applying the procedure itself to any set of data whatsoever, whether or not 'their hypotheses hold'. And indeed, there are "circumstances where it does better and others where it does worse".

But we believe also that probability theory incorporating Bayesian/Maximum Entropy principles is the proper tool, and a very powerful one, for (a) determining those circumstances for a given procedure; and (b) determining the optimal procedure, given what we know about the circumstances. This belief is supported by decades of theoretical and practical demonstrations of that power.

Clearly, while striving to avoid gratuitous assumption of information that we do not have, we ought at the same time to use all the relevant information that we actually do have; and so Tukey has also wisely advised us to think very hard about the real phenomenon being observed, so that we can recognize those special circumstances that matter and take them into account. As a general statement of policy, we could ask for nothing better; so our question is: how can we implement that policy in practice?

The original motivation for the Principle of Maximum Entropy (Jaynes, 1957) was precisely the avoidance of gratuitous hypotheses, while taking account of what is known. It appears to us that in many real problems the procedure of the Maximum Entropy Principle meets both of these requirements, and thus represents the explicit realization of Tukey's goal.


If we are so fortunate as to have additional information about the noise beyond the mean-square value supposed in the text, we can exploit this to make the signal more visible, because it reduces the measure W, or "volume", of the support set of possible noise variations that we have to allow for. For example, if we learn some respect in which the noise is not white, then it becomes in part predictable and some signals that were previously indistinguishable from the noise can now be separated.

The effectiveness of new information in thus increasing signal visibility is determined by the reduction it achieves in the entropy of the joint distribution of noise values; essentially, the logarithm of the ratio W′/W by which that measure is reduced. The Maximum Entropy formalism is the mathematical tool that enables us to locate the new contracted support set on which the likely noise vectors lie.
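As a toy illustration of this bookkeeping (ours, not from the paper): for zero-mean Gaussian noise the joint entropy is (1/2) log det(2πeΣ), so learning that the noise has correlation matrix R, at the same mean-square level, lowers the entropy by -(1/2) log det R nats, the logarithm of the volume ratio W/W′:

    import numpy as np

    def entropy_drop(rho, n):
        """Entropy reduction (nats) in going from white noise to noise with
        correlations R_ij = rho**|i-j| at the same mean-square level."""
        i = np.arange(n)
        R = rho ** np.abs(np.subtract.outer(i, i))
        sign, logdet = np.linalg.slogdet(R)
        return -0.5 * logdet

    for rho in (0.0, 0.3, 0.7):
        print("rho = %.1f   drop = %.1f nats" % (rho, entropy_drop(rho, 100)))

The stronger the correlation, the smaller the support set of plausible noise vectors, and the more visible a signal becomes against the partly predictable noise.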

It appears to us that the evidence for the superior power of Bayesian/Maximum Entropy methods over both intuition and "orthodox" methods is now so overwhelming that nobody concerned with data analysis, in any field, can afford to ignore it. In our opinion these methods, far from conflicting with the goals and principles expounded by Tukey, represent their explicit quantitative realization, which intuition could only approximate in a crude way.

APPENDIX C: DETAILS OF CALCULATIONS

Derivation of the Chirpogram. Expanding the cosine in (9) we have

$$\sum_t y_t \cos(\omega t + \alpha t^2 + \phi) = P\cos\phi - Q\sin\phi = (P^2 + Q^2)^{1/2}\cos[\phi + \tan^{-1}(Q/P)] \tag{C1}$$

where

$$P = \sum_t y_t \cos(\omega t + \alpha t^2) \tag{C2}$$

$$Q = \sum_t y_t \sin(\omega t + \alpha t^2) \,. \tag{C3}$$

But (P² + Q²) can be written as the double sum

$$P^2 + Q^2 = \sum_{ts} y_t y_s\left[\cos(\omega t + \alpha t^2)\cos(\omega s + \alpha s^2) + \sin(\omega t + \alpha t^2)\sin(\omega s + \alpha s^2)\right] = \sum_{ts} y_t y_s \cos[\omega(t - s) + \alpha(t^2 - s^2)] \,. \tag{C4}$$

Therefore, defining C(ω,α) as

$$C(\omega, \alpha) \equiv N^{-1}(P^2 + Q^2) \tag{C5}$$

and substituting (C1), (C4) into (9), the integral (7) over φ is the standard integral representation of the Bessel function:

$$I_0(x) = (2\pi)^{-1}\int_0^{2\pi} \exp(x\cos\phi)\, d\phi \tag{C6}$$

which yields the result (10) of the text.
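The double-sum identity (C4) is easy to verify numerically (our quick check, with arbitrary toy data):

    import numpy as np

    rng = np.random.default_rng(1)
    y = rng.standard_normal(64)
    t = np.arange(64.0)
    w, a = 0.7, 1e-3

    P = np.sum(y * np.cos(w * t + a * t * t))    # (C2)
    Q = np.sum(y * np.sin(w * t + a * t * t))    # (C3)
    phase = w * np.subtract.outer(t, t) + a * np.subtract.outer(t * t, t * t)
    double_sum = np.sum(np.outer(y, y) * np.cos(phase))   # (C4)
    print(P ** 2 + Q ** 2, double_sum)           # agree to rounding error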

Power Spectrum Derivations. From the integral formula

$$Z(a, b) \equiv \int_0^\infty e^{-ax^2} I_0(bx)\, dx = (\pi/4a)^{1/2}\, \exp(b^2/8a)\, I_0(b^2/8a) \tag{C7}$$

we obtain

$$-\frac{\partial Z}{\partial a} = \int_0^\infty x^2\, e^{-ax^2} I_0(bx)\, dx = \frac{1}{4}\sqrt{\frac{\pi}{a^3}}\,\left[(1 + 2q)\, I_0(q) + 2q\, I_1(q)\right] e^{q} \tag{C8}$$

where q ≡ b²/8a. In the notation of Eq. (12) we have x = A, a = N/4σ², b² = N C(ω, 0)/σ⁴; therefore q = C(ω, 0)/2σ². Thus (C7) and (C8) become

$$\int_0^\infty L(A, \omega)\, dA = \sqrt{\frac{\pi\sigma^2}{N}}\; e^{q}\, I_0(q) \tag{C9}$$

$$\int_0^\infty A^2\, L(A, \omega)\, dA = 2\sqrt{\frac{\pi\sigma^6}{N^3}}\,\left[(1 + 2q)\, I_0(q) + 2q\, I_1(q)\right] e^{q} \tag{C10}$$

from which (18), (20) of the text follow.
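These closed forms may be checked numerically (our verification of (C7) and (C8), with arbitrary a and b):

    import numpy as np
    from scipy.integrate import quad
    from scipy.special import i0, i1

    a, b = 2.0, 3.0
    q = b ** 2 / (8.0 * a)

    lhs1 = quad(lambda x: np.exp(-a * x * x) * i0(b * x), 0.0, np.inf)[0]
    rhs1 = np.sqrt(np.pi / (4.0 * a)) * np.exp(q) * i0(q)              # (C7)

    lhs2 = quad(lambda x: x * x * np.exp(-a * x * x) * i0(b * x), 0.0, np.inf)[0]
    rhs2 = (0.25 * np.sqrt(np.pi / a ** 3) * np.exp(q)
            * ((1.0 + 2.0 * q) * i0(q) + 2.0 * q * i1(q)))             # (C8)

    print(lhs1, rhs1)   # equal to quadrature accuracy
    print(lhs2, rhs2)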

REFERENCES

N. F. Barber & F. Ursell (1948), "The Generation and Propagation of Ocean Waves and Swell", Phil. Trans. Roy. Soc. London, A240, pp. 527-560.

J. M. Bernardo, et al, editors (1985), Bayesian Statistics 2, Elsevier Science Publishers, North-Holland. Proceedings of the Second Valencia International Meeting on Bayesian Statistics, Sept. 6-10, 1983.

R. B. Blackman & J. W. Tukey (1958), The Measurement of Power Spectra, Dover Publications, Inc., New York.

G. Larry Bretthorst (1988), Bayesian Spectrum Analysis and Parameter Estimation, Springer Lecture Notes in Statistics, 47. Numerous computer printouts with real and simulated data.

John Parker Burg (1967), "Maximum Entropy Spectral Analysis", in Proc. 37th Meet. Soc. Exploration Geophysicists. Reprinted in Modern Spectrum Analysis, D. Childers, Editor, IEEE Press, New York, 1978.

John Parker Burg (1975), "Maximum Entropy Spectral Analysis", Ph.D. Thesis, Stanford University.

D. R. Griffin (1958), Listening in the Dark, Yale University Press, New Haven; see also About Bats, R. H. Slaughter & D. W. Walton, Editors, SMU Press, Dallas, Texas (1970).

B. Nikolaus & D. Grischkowsky (1983), "90 Fsec Tunable Optical Pulses Obtained by Two-Stage Pulse Compression", Appl. Phys. Lett.

S. F. Gull & G. J. Daniell (1978), "Image Reconstruction from Incomplete and Noisy Data", Nature, 272, p. 686.

R. A. Helliwell (1965), Whistlers and Related Ionospheric Phenomena, Stanford University Press, Palo Alto, Calif.

E. T. Jaynes (1957), "Information Theory and Statistical Mechanics", Phys. Rev. 106, 620; 108, 171.

E. T. Jaynes (1973), "Survey of the Present Status of Neoclassical Radiation Theory", in Proceedings of the 1972 Rochester Conference on Optical Coherence, L. Mandel & E. Wolf, Editors, Pergamon Press, New York.

E. T. Jaynes (1980), "Marginalization and Prior Probabilities", in Bayesian Analysis in Econometrics and Statistics, A. Zellner, Editor, North-Holland Publishing Co. Reprinted in E. T. Jaynes, Papers on Probability, Statistics, and Statistical Physics, a reprint collection, D. Reidel, Dordrecht-Holland (1983).

E. T. Jaynes (1981), "What is the Problem?", Proceedings of the Second ASSP Workshop on Spectrum Analysis, McMaster University, S. Haykin, Editor.

E. T. Jaynes (1982), "On the Rationale of Maximum-Entropy Methods", Proc. IEEE, 70, pp. 939-952.

W. H. Munk & F. E. Snodgrass (1957), "Measurements of Southern Swell at Guadalupe Island", Deep-Sea Research, 4, pp. 272-286.

L. J. Savage (1954), The Foundations of Statistics, J. Wiley & Sons, Inc., New York.

A. Schuster (1897), "On Lunar and Solar Periodicities of Earthquakes", Proc. Roy. Soc. 61, pp. 455-465.

J. W. Tukey, P. Bloomfield, D. Brillinger, and W. S. Cleveland (1980), The Practice of Spectrum Analysis, notes on a course given in Princeton, N. J., in December 1980.

J. W. Tukey & D. Brillinger (1982), unpublished.

J. W. Tukey (1984), "Styles of Spectrum Analysis", Scripps Institution of Oceanography Reference Series 84-85, March 1984; pp. 100-103. An astonishing attack on all theoretical principles, including AR models, MAXENT, and Bayesian methods. John Tukey believed that one should not rely on any theory, but simply use his own intuition on every individual problem. But this made most Bayesian results inaccessible to him.