arXiv:1109.0442v3 [physics.data-an] 29 Dec 2011
LIGO-P1100103
AEI-2011-060

A Student-t based filter for robust signal detection

Christian Röver*
Max-Planck-Institut für Gravitationsphysik (Albert-Einstein-Institut)
and Leibniz Universität Hannover, 30167 Hannover, Germany

* [email protected]

The search for gravitational-wave signals in detector data is often hampered by the fact that many data analysis methods are based on the theory of stationary Gaussian noise, while actual measurement data frequently exhibit clear departures from these assumptions. Deriving methods from models more closely reflecting the data's actual properties promises to yield more sensitive procedures. Matched filtering is such a detection method that may be derived via a Gaussian noise model. In this paper we propose a generalized matched-filtering technique based on a Student-t distribution that is able to account for heavier-tailed noise and is robust against outliers in the data. On the technical side, it generalizes the matched filter's least-squares method to an iterative, adaptive variation. In a Monte Carlo study we show that when applied to simulated signals buried in actual interferometer noise it leads to a higher detection rate than the usual ("Gaussian") matched filter. We expect the proposed filtering method to be useful in other signal-processing contexts as well, in which the assumption of stationary Gaussian noise is violated.

PACS numbers: 02.50.-r, 04.80.Nn, 05.45.Tp, 95.75.Wx

I. INTRODUCTION

Since the existence of gravitational radiation was established as a consequence from general relativity theory, a great amount of effort has gone into the development of instruments and methods to directly detect gravitational waves [1, 2]. Gravitational waves (GWs) are notoriously weak compared to the sources of noise in today's ground-based gravitational-wave detectors, so it takes both extraordinarily sensitive instruments as well as sophisticated data analysis techniques to measure them. The output of an interferometric GW detector essentially is a time series of nonwhite noise, potentially superimposed by a GW signal — whose exact waveform is determined by several parameters. Data analysis for GW detection hence requires the filtering of time-series data for rare, weak signals that are often of known, parametrized shape. Many commonly used approaches are based on the matched filter [3].

The matched filter may be derived as a maximum-likelihood (ML) detection method in the framework of a Gaussian noise model, but more generally the procedure will actually be a ML method for a wider class of noise models. While the method works remarkably well and is able to discriminate the signal from the noise, it commonly runs into problems due to actual instrument noise exhibiting non-Gaussian or nonstationary behavior. For example, the matched filter is sensitive to loud transient noise outliers in the data, which, although often showing little similarity with the signals sought for, show up in the filter output as false alarms. A lot of effort needs to go into identifying such events.

We propose a more robust procedure that is based on a Student-t distribution for the noise, as introduced in Ref. [3]. Several motivations may be used for introducing the Student-t distribution as the noise model; most obviously, it exhibits "heavier tails" and non-spherical probability density contours, allowing the model to accommodate outliers in the noise. Alternatively, it may also be seen as incorporating imperfect prior knowledge of the noise spectrum, because the spectrum is either only estimated to limited accuracy, or varying over time. Models of this kind are commonly used for robust parameter estimation purposes; as we will show in the following, the model also exhibits a better detection performance when the assumption of stationary Gaussian noise is violated.

In Sec. II we will first derive the usual matched filter via the Whittle likelihood for Gaussian noise. In Sec. III we introduce the Student-t noise model, elaborate on the motivation for its use as well as point out the differences from the Gaussian model, and derive the analogous filtering procedure. In Sec. IV we report on a case study using real detector data and simulated signals to show that the Student-t based filter indeed yields a better detection rate. We close with some concluding remarks.

II. GAUSSIAN MATCHED-FILTERING

A. General

The matched filter may be derived in different ways, for example based on considerations of the residual sum-of-squares decomposition, without reference to a more specific noise model [4]; however, here we will concentrate on a derivation via the assumption of stationary Gaussian noise and the Whittle likelihood. This will allow us to easily generalize the usual matched-filtering method to the case of Student-t distributed noise. It is important to keep in mind that the matched filter is not necessarily connected to the assumption of Gaussian noise. When we say "Gaussian matched filter", this is meant to refer to its derivation and interpretation in the Gaussian context.
B. The Gaussian noise model

In order to implement the assumption of stationary, Gaussian noise residuals, the Whittle approximation is commonly utilized [3, 5, 6]. In the Whittle approximation, signal and noise are treated in their Fourier-domain representation. The explicit assumption being made on the noise n(t) is that its discrete Fourier transform ñ(f) is independently Gaussian distributed with zero mean and variance proportional to the power spectral density (PSD),

    Var(Re ñ(fj)) = Var(Im ñ(fj)) = N/(4∆t) S1(fj),    (1)

where fj is the jth Fourier frequency, S1(fj) is the corresponding one-sided power spectral density, and j = 0, ..., N/2 indexes the Fourier frequency bins. An explicit definition of the Fourier transform conventions used here is given in the appendix.

For some measured data d(t) one then commonly assumes a parametrized signal sθ(t) with parameter vector θ and additive Gaussian noise with a known 1-sided power spectral density S1(f):

    d(t) = sθ(t) + n(t)   ⇔   d̃(f) = s̃θ(f) + ñ(f)    (2)

(i.e., additivity holds in both time and Fourier domains). The corresponding likelihood function then is given by

    p(d|θ) ∝ exp( −1/2 Σj |d̃(fj) − s̃θ(fj)|² / (N/(4∆t) S1(fj)) )    (3)

[3].

C. Likelihood maximization

1. ML detection and the profile likelihood

If there were no unknown signal parameters to the signal model (like time-of-arrival, amplitude, phase, ...), then, according to the Neyman-Pearson lemma [7], the optimal detection statistic would be the likelihood ratio between the "signal" and "no-signal" models. Once there are unknowns in the signal model, a common approach is to use a generalized Neyman-Pearson test statistic, that is, the maximized likelihood ratio, where maximization is carried out over the unknown parameters [7]. While this is in general not an optimal detection statistic, this ad hoc approach is often efficient and effective. Such a maximum likelihood (ML) detection approach is closely related to ML parameter estimation, as either way the parameter values maximizing the likelihood will need to be derived. In case of a Gaussian noise model as in (3), maximization of the likelihood is equivalent to minimizing a weighted sum-of-squares, i.e., a weighted least-squares approach.

It should be noted that in a Bayesian reasoning framework, the detection problem would be approached via the marginal likelihood rather than the maximized likelihood [8, 9]. The marginal likelihood is the expectation of the likelihood with respect to the prior distribution, and both marginal and maximized likelihood may be equivalent for a certain choice of the prior distribution. One can show the marginal likelihood to be optimal for any particular choice of prior distribution, while the maximized likelihood in general is not (see e.g. [10, 11]). Maximization of the likelihood on the other hand is commonly much easier computationally.

As will be seen in the following, it is often convenient to divide the parameter vector into subsets, as it may be possible to analytically maximize the likelihood over some parameters for fixed values of the remaining parameters. This maximized conditional likelihood as a function of a subset of parameters is also called the profile likelihood. If a profile likelihood is given, likelihood maximization may be reduced to maximizing over the remaining lower-dimensional parameter subspace. As an example, consider a signal having three free parameters: amplitude, phase, and time of arrival. If likelihood maximization can be done analytically over amplitude and phase for any given arrival time, this results in a profile likelihood that is a function of time. The likelihood's overall maximum then may be computed via a numerical brute-force search of the profile likelihood over the time parameter.

2. Why care about linear models?

In signal processing in general, and in GW data analysis in particular, the signals of interest are commonly parametrized (among other additional parameters) in terms of an amplitude and a phase parameter. Consider for example a simple sinusoidal signal of the form

    s_{A,φ,f}(t) = A sin(2πft + φ)    (4)
                 = βs sin(2πft) + βc cos(2πft),    (5)

which instead of amplitude A and phase φ may equivalently be parametrized in terms of sine- and cosine-amplitudes βs and βc. Other examples of signal models given in terms of linear combinations are the singular value decomposition approach used e.g. in [12, 13] or the transformation of antennae pattern effects into four amplitude parameters in the derivation of the F-statistic [14]. A linear model formulation will turn out convenient in the following, as a linear (or conditionally linear) model will allow us to perform (conditional) likelihood maximization analytically.
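The equivalence of the amplitude/phase and sine/cosine parametrizations in Eqs. (4)-(5) is easy to verify numerically. The following is a small illustrative sketch (toy parameter values, assuming nothing beyond the trigonometric angle-sum identity):

```python
import numpy as np

# Toy values (illustrative only): amplitude A, phase phi, frequency f.
A, phi, f = 1.7, 0.6, 5.0
t = np.linspace(0.0, 1.0, 1000, endpoint=False)

# Amplitude/phase form, Eq. (4):
s_amp_phase = A * np.sin(2 * np.pi * f * t + phi)

# Equivalent linear form, Eq. (5), via sin(x + phi) = sin(x)cos(phi) + cos(x)sin(phi):
beta_s = A * np.cos(phi)   # sine amplitude
beta_c = A * np.sin(phi)   # cosine amplitude
s_linear = beta_s * np.sin(2 * np.pi * f * t) + beta_c * np.cos(2 * np.pi * f * t)

assert np.allclose(s_amp_phase, s_linear)

# Inverse mapping: A = sqrt(beta_s^2 + beta_c^2), phi = atan2(beta_c, beta_s).
assert np.isclose(np.hypot(beta_s, beta_c), A)
assert np.isclose(np.arctan2(beta_c, beta_s), phi)
```

The point of the reparametrization is that (βs, βc) enter the signal model linearly, so that maximization over them can be done in closed form.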

3. The linear model

Consider a linear model for the data, i.e.,

    y = Xβ + ǫ,    (6)

where y is an N-dimensional data vector, X is an (N × k) matrix, β is a k-dimensional parameter vector, and ǫ is an N-dimensional vector of error terms. The errors ǫ are assumed to be Gaussian distributed with mean zero and some covariance matrix Σ.

In the above signal-processing context, y and ǫ are the N-dimensional vectors of re-arranged real and imaginary parts of Fourier-domain data (d̃) and noise (ñ), the signal sθ is given by a linear combination of the columns of a matrix X according to the parameter vector β, and the noise covariance Σ is a diagonal matrix defined through (1). The Gaussian likelihood function is characterized by

    p(y|β) ∝ exp( −1/2 (y − Xβ)′ Σ⁻¹ (y − Xβ) ).    (7)

In the linear model, the likelihood may be maximized analytically, and the ML estimator for the unknown parameter vector β is given by

    β̂ = (X′Σ⁻¹X)⁻¹ X′Σ⁻¹ y    (8)

[8, 15]. In the models of concern here, estimation is simplified by the fact that the noise covariance Σ is a diagonal matrix (1), so that its inverse again is diagonal. In addition, here we add the common assumption that the vectors spanning the signal manifold, the columns of X, are orthogonal. A non-orthogonal basis X would complicate the procedure slightly; see e.g. [14]. Under these conditions, the pivotal quantities for ML estimation and detection are

    b_j = X′_{·,j} Σ⁻¹ y = Σ_{i=1}^N x_{i,j} y_i / σ_i²   and    (9)

    c_j = X′_{·,j} Σ⁻¹ X_{·,j} = Σ_{i=1}^N x_{i,j}² / σ_i²,    (10)

i.e., the quadratic forms, or inner products, involving the jth basis vector (jth column of X) with the data vector y, and with itself. The elements of the parameter vector's ML estimate β̂ are then given by

    β̂_j = b_j / c_j,    (11)

the maximized likelihood ratio vs. the no-signal model is given by

    log( p(y|β̂) / p(y|~0) ) = Σ_{j=1}^k b_j² / (2 c_j),    (12)

and the fitted values are given by

    ŷ = Xβ̂ = Σ_{j=1}^k β̂_j X_{·,j} = Σ_{j=1}^k (b_j / c_j) X_{·,j}.    (13)

4. The detection statistic and its distribution

We define the statistic

    H_k = Σ_{j=1}^k ( Σ_{i=1}^N x_{i,j} y_i / σ_i² )² / ( Σ_{i=1}^N x_{i,j}² / σ_i² ) = 2 log( p(y|β̂) / p(y|~0) )    (14)

(see also (12)) which, under the null hypothesis of the data y being purely noise, is χ² distributed with k degrees of freedom. Under the signal hypothesis, when a signal s_{β⋆} = Xβ⋆ is present in the data, the corresponding figure evaluated at the true parameter values β⋆,

    2 log( p(y|β⋆) / p(y|~0) ),    (15)

will be Gaussian distributed with mean ̺² and variance 4̺², where

    ̺² = Σ_{i=1}^N ( Σ_{j=1}^k β⋆_j x_{i,j} )² / σ_i² = Σ_{i=1}^N (Xβ⋆)_i² / E(ǫ_i²)    (16)
       = (Xβ⋆)′ Σ⁻¹ (Xβ⋆)    (17)

is the true signal's signal-to-noise ratio (SNR). Consequently, for a signal of given SNR ̺², the expected logarithmic likelihood ratio evaluated at the true parameters is E[ log( p(y|β⋆)/p(y|~0) ) ] = 1/2 ̺², while the likelihood ratio p(y|β⋆)/p(y|~0) follows a log-normal distribution with median exp(1/2 ̺²) and expectation E[ p(y|β⋆)/p(y|~0) ] = exp(̺²). The maximized likelihood ratio will be larger than that; the statistic H_k follows a noncentral χ²_k(̺²)-distribution with noncentrality parameter ̺², its expectation is ̺² + k, so that E[ log( p(y|β̂)/p(y|~0) ) ] = 1/2 ̺² + k/2. Note that the GW and signal-processing literature is sometimes confusing, as both ̺² and H_k, or their square roots, are commonly referred to as the SNR.

In common signal detection problems, the signal model is usually only partially linear, as suggested in Sec. II C 2, so that analytical maximization over the "linear" parameters only yields a maximized conditional likelihood, or profile likelihood. The statistic H_k then is proportional to the profile likelihood, and (since the likelihood under the "noise only" null hypothesis, p(y|~0), is a constant) constitutes a generalized Neyman-Pearson test statistic. This statistic, or its maximum over additional parameters, is commonly referred to as a detection statistic, as it is used to find the signal fitting the data best, and to determine its significance.
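As an illustration of Eqs. (9)-(14), the following sketch fits a two-column (sine/cosine) linear model by weighted least squares. All values are toy assumptions (not from the paper): a diagonal noise covariance and basis columns that are orthogonal under the Σ⁻¹ weighting.

```python
import numpy as np

rng = np.random.default_rng(1)

N = 256
t = np.arange(N)
# k = 2 basis waveforms; full-period sine/cosine columns are mutually orthogonal:
X = np.column_stack([np.sin(2 * np.pi * 8 * t / N),
                     np.cos(2 * np.pi * 8 * t / N)])
sigma2 = np.full(N, 2.0)                    # diagonal of Sigma (toy value)

beta_true = np.array([1.5, -0.8])
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2))

b = X.T @ (y / sigma2)                      # Eq. (9):  b_j = X'_j Sigma^-1 y
c = np.sum(X**2 / sigma2[:, None], axis=0)  # Eq. (10): c_j = X'_j Sigma^-1 X_j
beta_hat = b / c                            # Eq. (11), valid for orthogonal columns
H = np.sum(b**2 / c)                        # Eq. (14): detection statistic H_k

# Sanity check: for orthogonal columns this equals the general WLS estimator, Eq. (8).
beta_gen = np.linalg.solve((X.T / sigma2) @ X, b)
assert np.allclose(beta_hat, beta_gen)
```

Under the null hypothesis H would be χ²-distributed with k = 2 degrees of freedom; with the injected signal it is driven by the signal's SNR, Eqs. (16)-(17).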
The detection statistic's distributions under null and alternative hypotheses as stated above only apply for a single (conditional) likelihood maximization, i.e., for a given data set y and a given model matrix X. When maximizing the profile likelihood over additional parameters (or pieces of data), the testing problem turns into a multiple testing problem, and the statistic's distribution will be an extreme value statistic [7, 16]. Since the particular statistic H_k only comes up in the context of the Gaussian model, we will in the following be mostly referring to the more universal corresponding likelihood ratio figure p(y|β̂)/p(y|~0) = exp(1/2 H_k).

D. Common implementation and terminology

In the GW data analysis literature, likelihoods and matched filters are commonly expressed in terms of the inner product ⟨a,b⟩ of real-valued functions (signal templates or data) a and b, technically defined in terms of analytical Fourier transforms,

    ⟨a,b⟩ = ∫_{−∞}^{∞} ã(f) b̃(f)* / S1(f) df    (18)

[6, 14], which in practice is implemented (analogously to the Whittle likelihood) in terms of discrete Fourier transforms,

    ⟨a,b⟩ = 2 Σ_{j=0}^{N/2} ∆t/N ( Re(ã(fj)) Re(b̃(fj)) + Im(ã(fj)) Im(b̃(fj)) ) / S1(fj).    (19)

In terms of the linear models discussed in the previous section, this is equivalent to a quadratic form

    ~a′ Σ⁻¹ ~b    (20)

as in Eqs. (9), (10) above, where ~a and ~b again denote the re-arranged vectors of real and imaginary Fourier-domain components. Note that especially in the context of the Student-t model discussed below, expression (18) may be hard to motivate, as it is continuous in frequency, but the corresponding discrete expression (19) may readily be related to expressions derived above. In this terminology, the signal-to-noise ratio of a signal sθ (16) turns out as

    ̺² = Σ_j |s̃θ(fj)|² / (N/(4∆t) S1(fj)) = 2 ⟨sθ, sθ⟩,    (21)

the correlation of some data d with a template sθ (as in (9)) simplifies to

    Σ_j ( Re(d̃(fj)) Re(s̃θ(fj)) + Im(d̃(fj)) Im(s̃θ(fj)) ) / (N/(4∆t) S1(fj)) = 2 ⟨d, sθ⟩,    (22)

the likelihood ratio of some signal template sθ for given data d is

    log( p(d|sθ) / p(d|~0) ) = 1/2 ( Σ_j |d̃(fj)|² / (N/(4∆t) S1(fj)) − Σ_j |d̃(fj) − s̃θ(fj)|² / (N/(4∆t) S1(fj)) )    (23)
                             = 2 ⟨d, sθ⟩ − ⟨sθ, sθ⟩    (24)

[3, 6, 14], and the maximized likelihood ratio for a signal that is a linear combination of waveforms (d = Σ_j βj sj + n, see also (12)) then is

    log( p(d|β̂) / p(d|~0) ) = Σ_j ⟨d, sj⟩² / ⟨sj, sj⟩.    (25)

An implementation of a matched filter in the GW context is concisely described e.g. in [17, 18]. The signal searched for is a "chirping" binary inspiral waveform of increasing frequency and amplitude, which is characterized by five parameters, namely two mass parameters determining the phase/amplitude evolution, and amplitude, phase and arrival time. The signal waveform s for given mass parameters ϑ = (m1, m2) is (in analogy to (4)) given in terms of "sine" and "cosine" components s_{s,ϑ} and s_{c,ϑ},

    s_ϑ(t) = βs s_{s,ϑ}(t − t0) + βc s_{c,ϑ}(t − t0)    (26)

[17], where βs and βc are determined by the orbital phase and orientation of the binary system, and t0 defines the signal arrival time. The sine and cosine waveforms here constitute the signal manifold's orthogonal "basis vectors". The actual matched-filter detection statistic is defined as ρ(t0) = sqrt( Xs(t0)² + Xc(t0)² ), where

    X_{s/c}(t0) ∝ ∫ d̃(f) (s̃_{s/c,ϑ}(f))* exp(−2πif t0) / S1(|f|) df    (27)

[17], and where the exponential term does the time shifting of data and template against each other. For any given time shift t0, this filter corresponds to (the square root of) the detection statistic H_k above (14). Computing the matched filter (27) across time points t0 yields the profile likelihood, the conditional likelihood (conditional on time t0 and waveforms s_{s,ϑ}, s_{c,ϑ}) maximized over phase and amplitude. The "overall" maximum likelihood then is determined via a brute-force search over t0 and over additional alternative signal waveforms corresponding to different mass parameters ϑ. Note that the search over arrival time t0 in (27) may be efficiently implemented via another Fourier transform [18]. The matched-filtering algorithm is also described in more detail in Appendix A3.

In order to claim the detection of a signal, one needs to determine a threshold for the detection statistic (the maximized likelihood), with respect to some pre-specified false alarm rate. The detection statistic's distributions

derived in Sec. II C 4 are likely not to be of much practical relevance, due to common non-Gaussian or nonstationary features in the data. Critical values for the detection statistic instead are commonly computed using bootstrapping methods (see e.g. [19, 20]).
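The time-shift maximization via a Fourier transform can be sketched as follows: the cross-correlation of the data with the sine/cosine templates is computed for all arrival times t0 at once via FFTs, and their quadrature sum gives the detection statistic as a function of t0. This is an illustrative toy only (white noise, i.e., a flat spectrum S1 that drops out of the argmax; all waveform parameters are made-up values):

```python
import numpy as np

rng = np.random.default_rng(4)

N = 1024
n = np.arange(N)
env = np.exp(-0.5 * ((n - 32) / 8.0) ** 2)   # short Gaussian envelope
s_sin = env * np.sin(2 * np.pi * 0.1 * n)    # "sine" template
s_cos = env * np.cos(2 * np.pi * 0.1 * n)    # "cosine" template

t0_true = 600                                # injected arrival time (in samples)
signal = 0.9 * np.roll(s_sin, t0_true) + 0.5 * np.roll(s_cos, t0_true)
d = signal + rng.normal(scale=0.02, size=N)  # low noise, for a clean recovery

# X_{s/c}(t0) for all t0 at once: inverse FFT of d~(f) * conj(s~(f))
# (circular cross-correlation; the flat S1 normalization is omitted).
Xs = np.fft.ifft(np.fft.fft(d) * np.conj(np.fft.fft(s_sin))).real
Xc = np.fft.ifft(np.fft.fft(d) * np.conj(np.fft.fft(s_cos))).real
rho = np.sqrt(Xs**2 + Xc**2)                 # detection statistic vs. t0

t0_hat = int(np.argmax(rho))                 # recovered arrival time
```

Maximization over amplitude and phase is implicit in the quadrature sum, so a single pair of FFT correlations evaluates the profile likelihood over all arrival times.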



FIG. 1. Density functions of Gaussian and Student-t distributions. The left panel shows univariate densities on the logarithmic scale. The right panel shows density contours of the joint distribution of two independent Gaussian random variables in contrast with two independent Student-t distributed variables of the same location (µ) and scale (σ). The two Student-t variables have differing degrees-of-freedom; the one corresponding to the x axis has ν = 3, while the one along the y axis has ν = 10.
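The heavier tails illustrated in the left panel of Fig. 1 are easy to check numerically. The following sketch codes both densities from their standard formulas (plain Python, no assumptions beyond the textbook density expressions) and verifies that the Student-t/Gaussian density ratio grows by orders of magnitude toward the tails:

```python
import math

def t_pdf(x, nu):
    """Student-t density with nu degrees of freedom (unit scale)."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1 + x * x / nu) ** (-(nu + 1) / 2)

def norm_pdf(x):
    """Standard Gaussian density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

# Density ratio t/Gaussian at increasing distances from the center (nu = 3,
# as in Fig. 1); along this grid the ratio increases and eventually explodes.
ratios = [t_pdf(x, 3) / norm_pdf(x) for x in (0, 2, 4, 6)]

assert all(a < b for a, b in zip(ratios, ratios[1:]))
assert ratios[-1] > 100
```

Near the center the two densities are of comparable magnitude, while at 6 standard deviations the Student-t density already exceeds the Gaussian one by several orders of magnitude — the "indefinitely growing ratio" discussed in Sec. III B.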

III. THE STUDENT-T FILTER

A. The Student-t noise model

The Student-t model for time series analysis was introduced in [3] as a generalization of the commonly used Gaussian model described in the previous section. The Student-t distribution has an additional degrees-of-freedom parameter, essentially controlling the distribution's heavy-tailedness, i.e., the allowance for large outliers. The Student-t likelihood function is given by

    p(d̃|θ) ∝ Π_j ( 1 + 1/νj |d̃(fj) − s̃θ(fj)|² / (N/(4∆t) S1(fj)) )^(−(νj+2)/2)    (28)

           = exp( − Σ_j (νj+2)/2 log( 1 + 1/νj |d̃(fj) − s̃θ(fj)|² / (N/(4∆t) S1(fj)) ) )    (29)

[3]. According to this model, the residuals (Re(ñ(fj)), Im(ñ(fj))) within each Fourier frequency bin j follow a bivariate Student-t distribution [8] with location µ = ~0, scale matrix Σ = N/(4∆t) diag(S1(fj), S1(fj)), degrees-of-freedom νj > 0 and implicit dimension 2. This implies that (i) residuals in different frequency bins are independent; (ii) residuals within the same bin are uncorrelated, but dependent; and (iii) the marginal distribution of each individual residual is a Student-t distribution with scale proportional to S1(fj) and degrees-of-freedom νj. Decreasing values of the degrees-of-freedom parameters νj imply a heavier-tailed distribution, and in the limit of νj → ∞ the model again reduces to the Gaussian model.

Besides simply constituting a heavier-tailed noise model, the Student-t model arises as a generalization of the Gaussian model when the power spectral density S(fj) is treated as uncertain, where the degrees-of-freedom parameter νj denotes the (prior) precision [3]. So the model is not only applicable in contexts where the noise itself is in fact t distributed, but also in cases where it is Gaussian, but the noise spectrum is a priori only known to a certain accuracy. Alternatively, the same model would result when the noise spectrum itself was assumed to be randomly deviating from the scale parameter S1(f), according to a χ² distribution, e.g. because it is only estimated with some uncertainty, which in fact resembles the original motivation for introducing Student's t-distribution in the context of the t-test and related procedures [7, 21]. Both variability or uncertainty of the noise PSD technically lead to the same likelihood expression here [3]. In general, the interpretation of the S1(f) parameter in the contexts of the Gaussian and the Student-t model is not necessarily exactly the same. For the Gaussian model, it may be defined via the expected power S1(fj) = E( 2∆t/N |ñ(fj)|² ), while for the Student-t model this only holds in the limiting case of great certainty (ν → ∞). Within the Student-t model, the S(f) term specifies the scale of the uncertain PSD parameter, and the expected power is in fact given by E( 2∆t/N |ñ(fj)|² ) = νj/(νj−2) S1(fj).

The choice of the degrees-of-freedom parameter νj as well as the spectrum parameter S1(fj) may be approached in different ways and may for the filtering purpose eventually be considered a matter of tuning [3]. In the example in Sec. IV below, we simply kept the scale parameter S1(fj) to be the estimated noise spectrum as in the Gaussian case, and fitted a common degrees-of-freedom parameter νj = ν for all frequency bins to the empirical data.

B. Comparison to the Gaussian model

When comparing to the Gaussian distribution, first of all the Student-t distribution exhibits heavier tails, i.e., the probability for obtaining "large" values (relative to the distribution's scale) is much greater. While the density functions are very similar within the range of µ ± 2σ, where the bulk of probability is concentrated, the densities' ratio will grow indefinitely toward the distributions' tails (see Fig. 1). The degrees-of-freedom parameter ν controls the distribution's heavy-tailedness; a setting of ν = 1 yields the "pathological" Cauchy distribution, for ν > 2 the variance is finite, and in the limit of ν → ∞ it again approaches the Gaussian distribution.

Another discriminating feature is the shape of the density contours. While a Gaussian density will always have elliptical contours, the Student-t distribution is different in that its contours are rather diamond-shaped, with elongations pointing along the principal axes (see Fig. 1). This way the Student-t model does not only allow for larger outliers, but it also considers outliers more likely to occur only in individual variables rather than jointly in all variables. Note that this latter effect follows from the fact that the different frequency bins are stochastically independent and not merely uncorrelated [22, 23]. Since the two (real and imaginary) residuals within each Fourier frequency bin follow a joint, bivariate, t distribution, the density contours within bins will still be spherical—otherwise a strange phase/amplitude dependence would be implied for the Fourier-domain model. The effect of independent Student-t variables only comes to bear between frequency bins.

An important difference to note between the Gaussian and Student-t model is that the least-squares fitting that results from the Gaussian model will actually be a ML procedure for any model within the wider class of "elliptically symmetric" models for the noise residuals (including e.g. a Student-t model with merely uncorrelated, but not independent residuals) [22, 23]. The Student-t model described here hence advances into a fundamentally different class of models.

Student-t or similar models are commonly used in parameter estimation contexts as robust alternatives to the Gaussian model that are less sensitive to outliers in the data [24–27]. Such models may be motivated in a "top-down" manner by the observation that the data do not actually fit the Gaussianity assumption, or also in a "bottom-up" way by pointing out that the resulting least-squares procedures are very sensitive to occasional outliers in the data. In the spirit of the latter viewpoint, the concept of M estimation was introduced, which aims at "fixing" outlier-sensitive least-squares procedures by replacing them by more robust variants corresponding to more favorable influence functions [28, 29]. Similar approaches, namely down-weighting or ignorance of outliers in the data, have been proposed in the context of gravitational-wave detection before [30, 31], and the Student-t assumption may in fact be considered a special case of M estimation [26, 27].

Another fix that is commonly applied in GW data analysis is the χ² veto [32], which is a figure computed along with a detection statistic that is supposed to discriminate actual signals from noise bursts. Such noise events may show little similarity with the signal template, but may often, due to non-negligible correlation with the template and very large power, still seem to indicate the presence of a signal. The χ² veto then essentially checks for excess power that is inconsistent with the shape of the signals aimed for and that way will rule out such alleged detections. The consideration of excess residual power is also implicitly happening in the Student-t model. From the different likelihood formulations ((3), (29)) one can write down the corresponding likelihood ratios for some data d and a signal template sθ,

    log( p(d|θ, Gauss) / p(d|~0, Gauss) )
        = 1/2 Σ_j ( |d̃(fj)|² / (N/(4∆t) S1(fj)) − |d̃(fj) − s̃θ(fj)|² / (N/(4∆t) S1(fj)) ),    (30)

    log( p(d|θ, Student) / p(d|~0, Student) )
        = Σ_j (νj+2)/2 log( ( 1 + 1/νj |d̃(fj)|² / (N/(4∆t) S1(fj)) ) / ( 1 + 1/νj |d̃(fj) − s̃θ(fj)|² / (N/(4∆t) S1(fj)) ) ).    (31)

In both of the above cases the likelihood ratio is a function of the "data power" |d̃(fj)|² / (N/(4∆t) S1(fj)) and the "residual power" |d̃(fj) − s̃θ(fj)|² / (N/(4∆t) S1(fj)), i.e., the data's normalized sum-of-squares in each frequency bin j before and after subtracting the signal sθ. For the Gaussian case, a "data power" of 10 and a "residual power" of 1 in the jth bin would have the same effect on the likelihood ratio as if the numbers were, say, 1010 and 1001 instead; the only relevant figure is their difference. In the Student-t model, the latter case would lead to a lower likelihood ratio; here not only the amount by which the signal sθ is able to reduce the sum-of-squares is relevant, but so is its magnitude relative to the remaining residual term. The additional feature of the ML fit that is intrinsically considered in the Student-t likelihood ratio (31) is essentially the corresponding coefficient of determination (R²) [15]. As will become obvious in the following, when the actual implementation is described, the generalization to the Student-t model will on the technical side essentially replace the least-squares procedure by an adaptive version. The adaptation step again ensures that excess residual noise power will downweight the supposed significance of a signal.

C. Likelihood maximization: the EM-algorithm

While likelihood maximization in the Gaussian model boils down to least-squares fitting, the maximization step is not quite as simple for the Student-t model. However,
The Student-t likelihood may be viewed linear model coefficients (usually corresponding to ampli- as a marginal likelihood, averaging out a set of unknown tude and phase), or at a higher level, iterating over linear variance parameters ~σ2 [3]. Each of the variance param- coefficients as well as the signal arrival time parameter. It 2 eters σj then corresponds to the power spectral density is not obvious whether one implementation is more sen- at the jth Fourier frequency bin. The EM algorithm’s sitive than the other, but there definitely are differences details as applied to the present problem are derived in in the implied computational costs. Both approaches are detail in Appendix A2 below. It turns out that maxi- described and discussed in more detail in Appendix A3. mization of the Student-t likelihood may be done in an An implementation of the latter algorithm, together with iterative manner, where each iteration again requires a the analogous matched filter, is available in [35]. In case weighted least-squares fit as in the Gaussian matched fil- of a brute-force search over additional signal parameters ter. The EM algorithm requires a starting value θ0 for (i.e., a “template bank”), one could in fact consider mov- the signal parameters. Given θ0, the expression ing the EM-level yet another stage higher.

2 As a starting parameter value (θ0) for the algorithm, ˜ 1 d(fj )−s˜θ (fj ) one could, for example, use the null vector or an ini- (θ0,θ)= ν 2 2 N j 2 2∆t E − S1(fj )+ d˜(fj )−s˜θ (fj ) j 4∆t νj +2 νj +2 N 0 tial least-squares fit. As a stopping criterion, one could X terminate the algorithm once the improvement in loga- (32)  is maximized with respect to the parameter vector θ. The rithmic likelihood from the previous iteration falls below parameter value maximizing the above expression then some threshold, or when some maximum number of iter- constitutes the new θ0 value, for which the expression ations is reached. Note that—unlike for the Gaussian lin- again is maximized, and so forth. The resulting sequence ear least-squares fit—the (conditional) likelihood might of parameter values then converges to the maximum like- actually be multimodal [36], so that different starting lihood estimate θˆ [8]. values might lead to different results. It is not obvious Maximizing the above expression (32) again amounts whether this occurs frequently in practice, or rather re- to a weighted least-squares fit, exactly as in the case quires particularly rare pathological circumstances; how- of the Gaussian matched filter (see also the correspond- ever, it does not appear to pose a problem in the example ing likelihood expression (3)). The Student-t filter will below. therefore generalize the Gaussian matched filter by re- placing the least-squares procedure by an iterative, or adaptive, least-squares fit. Note that the denominator IV. FILTERING ON ACTUAL in (32) simply is a weighted average of the noise spec- DATA trum (as in (3)) and the previous iteration’s residual noise power, where the degrees-of-freedom parameter ν defines A. General the relative weighting. 
Instead of the “plain” weighted least-squares match that is done in the Gaussian filter, Besides any theoretical or heuristic arguments why a the EM-iterations adapt the weights (the denominator Student-t based filter may improve detection, the figure in (32), which in the Gaussian model was the a priori of eventual relevance is going to be the resulting improve- known, fixed noise spectrum) to the residual noise power ment in detection efficiency when applied to actual data as found in the data, and the level of adaptation is reg- — keeping in mind the additional complication and com- ulated by the degrees-of-freedom parameter ν. putational cost. In the following, we will demonstrate the The (ML) detection statistic does not follow a simple filter’s performance in a minimalistic, yet realistic toy distributional form as in the Gaussian model, but in the problem. To that end, we will set up a filter for a certain example below one can already see that both statistics kind of parametrized signal, and then test it against a still behave similarly. The generalized likelihood ratio conventional matched filter using injections of simulated statistic will, by Wilks’ theorem, in fact still approxi- signals. For the additive noise, we will use both simu- 2 mately follow a χ distribution [7, 34]. lated Gaussian noise as well as actual gravitational-wave detector instrument noise. Detection efficiency is going to be measured via the receiver operating characteristic D. The filter implementation (ROC) curve, allowing one to compare detection proba- bilities for given false alarm probabilities, or vice versa. As for the Gaussian matched filter, the aim again is In order to make the example realistic, we require a to maximize the likelihood (29), i.e., find best-fitting pa- nontrivial signal waveform to be searched for; in par- rameter values θˆ in parameter space. 
Again, it is ad- ticular the waveform should not be monochromatic, but vantageous if the signal model can (at least partly) be should instead span a wider range of Fourier frequencies. formulated as a linear model. There should be parameters to be maximized over ana- 8
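The weighted-average structure of the adapted noise spectrum in the denominator of (32) can be illustrated numerically. The following is a minimal sketch under the paper's conventions; the function and variable names (and the made-up input values) are ours, not from the paper or its code release:

```python
import numpy as np

def adapted_spectrum(S1, resid_power, nu, N, dt):
    """Denominator of Eq. (32), up to the overall N/(4*dt) factor:
    a nu-weighted average of the assumed noise spectrum S1 and the
    previous EM iteration's residual periodogram."""
    return (nu / (nu + 2.0)) * S1 + (2.0 / (nu + 2.0)) * (2.0 * dt / N) * resid_power

N, dt = 8, 0.5
S1 = np.full(4, 2.0)                    # assumed one-sided noise spectrum
resid = np.array([0.1, 4.0, 2.0, 8.0])  # |d~ - s~|^2 residual powers (made up)

# nu -> infinity: the adaptation vanishes and the Gaussian filter's
# fixed spectrum S1 is recovered
assert np.allclose(adapted_spectrum(S1, resid, 1e12, N, dt), S1)

# finite nu: the spectrum is pulled toward the residual power estimate
S_star = adapted_spectrum(S1, resid, 2.0, N, dt)
```

For ν = 2 the prior spectrum and the residual estimate receive equal weight (ν/(ν+2) = 2/(ν+2) = 1/2), matching the remark above that ν regulates the level of adaptation.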

[Figure 2: two Q-Q panels (left: Gaussian noise, right: instrument noise), plotting empirical against theoretical quantiles (90% up to 99.99%); reference models: Gaussian and Student-t (ν = 40 in the left panel, ν = 10 in the right).]

FIG. 2. Quantile-quantile plots (Q-Q plots) of the empirically found normalized residual noise power (33) versus its theoretical values assuming Gaussian and Student-t models. The marks indicate particular quantiles corresponding to powers of 10 in tail probability. The 10 largest empirical samples are shown as individual dots; the remaining quantiles are connected by a line.

The example described in the following mimics the setup of a search for binary inspiral signals in interferometric gravitational-wave detector data (see e.g. [17]). The noise data are taken from an actual detector, and, for comparison, a second data set of simulated Gaussian noise with a realistic noise spectrum is used in parallel. The "search" being performed, however, is much simplified and is intended neither to be exhaustive nor to span an astrophysically sensible parameter range.

B. The data

The data used in the following examples are either simulated Gaussian noise with a power spectral density corresponding to LIGO's initial design sensitivity [37], or real instrument noise from LIGO's Livingston interferometer, taken during LIGO's fifth science run ("S5") in late 2005 [38]. The data are considered in chunks of 8 seconds length, downsampled to a sampling rate of 1024 Hz, and windowed using a Tukey window tapering 10% of the data (5% at each end). The noise's power spectral density S1(f) is estimated essentially using Welch's method [39], by considering the empirical power in the 32 preceding data segments and taking the median as a robust estimator. The figures shown in the following are each based on 100 000 such data chunks.

The signal waveform searched for here is taken to be a binary inspiral waveform approximated to the 2.0 post-Newtonian order [40]. The same waveform family is used both for injections and in the detection stage, and it has five free parameters: chirp mass (mc), mass ratio (η), coalescence time (tc), coalescence phase (φc), and amplitude (A). The signal waveforms injected into the data were all done at the same mass parameters (mc = 4.5, η = 0.25), and the amplitude is set such that the signal's SNR (as computed based on the current PSD estimate) is ρ = 5.257, so that E[log(p(y|β⋆)/p(y|~0))] = ρ²/2 = log(10⁶) and E[p(y|β⋆)/p(y|~0)] = exp(ρ²) = 10¹² (see also Sec. IIC4). Each 8-second chunk of data is eventually analyzed twice, with and without a signal injection.

C. Setting the degrees-of-freedom parameter

In order to determine a suitable degrees-of-freedom parameter ν for the Student-t model, we considered the tail behavior of the noise. If the Gaussian (Whittle) model were accurate, then the normalized Fourier-domain noise power at the jth frequency bin,

\left|\tilde{n}(f_j)\right| \Big/ \sqrt{\tfrac{N}{4\Delta_t}\,S_1(f_j)},    (33)

being the square root of the sum of squares of two independent standard Gaussian random variables (see Sec. IIB), should follow a Rayleigh distribution. The residuals' normalization here is done, in analogy to the computations done in an actual search, via the estimated noise spectrum, as described in the previous subsection. We only consider the binned noise power here (and not the individual real and imaginary components), as this is the relevant figure entering both the Gaussian and the Student-t likelihoods ((3), (29), (30), (31)). Under the Student-t model, instead of being Rayleigh distributed, the power (33) would follow a similar, more heavy-tailed distribution. We will refer to the Student-t power's distribution as the "Student-Rayleigh" distribution here; more details on this distribution's particular form are given in Appendix A4.

We investigated the empirical distribution of actual noise residuals, for both simulated and actual instrumental data. For the simulated data this accounts for the effects of finite sample size, windowing, and PSD estimation, and for actual data it in addition gives some insight into the effects of realistic nonstationarities or non-Gaussianities in actual measurement noise.
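The median-based PSD estimate and the normalized power (33) can be sketched in a few lines. This is a simplified, self-contained illustration with made-up white noise; the names, the one-second chunk length, and the omission of windowing are our simplifications:

```python
import numpy as np

rng = np.random.default_rng(1)
dt = 1.0 / 1024.0          # sampling interval (paper: 1024 Hz)
N = 1024                   # one-second chunks for brevity (paper uses 8 s)

# 33 chunks of white Gaussian noise: 32 for the PSD estimate, 1 to normalize
chunks = rng.standard_normal((33, N))
ft = np.fft.rfft(chunks, axis=1)

# Per-chunk one-sided PSD estimates in the paper's convention,
# E|n~(f)|^2 = (N/(2*dt)) * S1(f), hence S1_hat = (2*dt/N) * |n~|^2:
per_chunk_S1 = (2.0 * dt / N) * np.abs(ft) ** 2

# Median over the 32 "preceding" chunks as a robust estimator
# (the median of the noise power is biased low relative to the mean,
# which is part of why the residuals end up slightly heavy-tailed):
S1_hat = np.median(per_chunk_S1[:32], axis=0)

# Normalized residual noise power of the remaining chunk, Eq. (33):
norm_power = np.abs(ft[32]) / np.sqrt(N / (4.0 * dt) * S1_hat)
```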

[Figure 3: two scatter panels (left: Gaussian noise, right: instrument noise) of the maximized Student-t log-likelihood ratio against the maximized Gaussian likelihood ratio, separately for noise-only data and for data with signal injections.]

FIG. 3. Detection statistics (maximized likelihood ratios) based on Gaussian and Student-t models for simulated Gaussian data (left panel) and actual interferometer noise (right panel). Injections were of SNR ρ = 5.257.

The noise samples are based on the residuals from 200 eight-second noise realizations of either simulated Gaussian noise or actual instrument noise from LIGO's Livingston interferometer. The residuals (33) are each normalized via a PSD estimate from the 32 preceding noise samples, as described in the previous section, yielding a total of 800 000 residuals. The data used here did not overlap with the data used in the following detection experiment.

Figure 2 shows quantile-quantile plots (Q-Q plots) illustrating how well the models fit the actual data. The axes indicate the theoretical (Rayleigh or Student-Rayleigh) quantiles and the empirical quantiles as found in the data. If a model fits the data well, theoretical and empirical quantiles coincide, so that the quantiles follow a straight, diagonal line. A mismatch between model and data results in a differently shaped curve; in particular, if the data are more heavy-tailed than predicted by the model, the curve will show an upward bend [41]. One can see that the actual data exhibit heavier tails in both cases, for simulated Gaussian noise as well as for instrument noise. In the case of Gaussian noise this is due to the estimation uncertainty in the noise spectrum. If we had been using the mean instead of the median to estimate the noise PSD, then the distribution of normalized noise residuals should be exactly Student-t with degrees of freedom equal to twice the number of noise samples averaged over (here, 32 × 2 = 64) [7, 21]. For the median estimation method this is only approximately true, but apparently still roughly accurate; a maximum-likelihood fit for ν suggests a value of ν ≈ 40 here. For the case of Gaussian data, the mismatch between assumed and observed quantiles is minimal anyway.

For the real interferometer noise, the discrepancy between the Gaussian model and the actual data is more dramatic; in the distribution's tails, the empirical quantiles are significantly larger than the assumed quantiles. For example, according to the Gaussian model, 99.99% of the samples should be ≤ 4.3, while empirically the 99.99% quantile lies at 8.1 for actual instrument noise (see the right panel of Fig. 2). A Student-t model seems to fit the data better, especially in the distributions' tails, although discrepancies in the extreme outliers are still large. Trying to estimate the degrees-of-freedom parameter ν from different subsets of the empirical data yields ML estimates roughly in the range from 5 to 50; in the following we simply fixed the parameter at ν = 10 for the simulations involving actual data. A value of ν > 40 would not seem to make sense here (even if the data were perfectly Gaussian), and in the simulation results below we found that detection performance depended only weakly on ν as long as it was roughly in the range 5–20. While the Student-t distribution does not fit the data perfectly, it seems to fit better than the Gaussian model. Instead of only fitting the degrees-of-freedom parameter, one could actually in addition also adapt the t distribution's scale to the data (see also Sec. IIIA, or [3]).

D. Filtering setup

For each piece of data, the likelihood ratio is maximized over phase and amplitude for given combinations of time and mass parameter values, where the evaluated time points were tc ∈ {6.50, 6.55, ..., 7.50} and the considered masses were η = 0.25, mc ∈ {3.0, 3.1, ..., 6.0}. The injected signal's parameter values always were among the grid points maximized over, so that signal/template mismatch considerations are not of concern here. On the technical side, this is implemented in a loop over template waveforms (corresponding to different mass parameters) and time points. At each mass/time combination, computation of the conditionally maximized Gaussian likelihood ratio amounts to computing an inner product / quadratic form (see Sec. IIC), while maximizing the conditional Student-t likelihood requires iterating over several such least-squares fits within the EM algorithm (see Sec. IIIC).
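The Q-Q construction underlying Fig. 2 can be illustrated for the ideal Gaussian case, where the normalized power follows a Rayleigh distribution with unit scale. A self-contained sketch; the sample sizes and names are ours:

```python
import numpy as np

rng = np.random.default_rng(2)

# Idealized normalized-power samples under exact Gaussian assumptions:
# the square root of the sum of two squared standard normals, Rayleigh(1).
samples = np.hypot(rng.standard_normal(100_000), rng.standard_normal(100_000))

def q_rayleigh(p):
    """Theoretical Rayleigh quantile function (scale sigma = 1)."""
    return np.sqrt(-2.0 * np.log1p(-p))

# Q-Q ingredients at the tail probabilities marked in Fig. 2:
probs = np.array([0.9, 0.99, 0.999, 0.9999])
emp = np.quantile(samples, probs)
theo = q_rayleigh(probs)
```

The 99.99% Rayleigh quantile of about 4.29 is consistent with the value of "4.3" quoted in the text above; empirical and theoretical quantiles nearly coincide here, which is the "straight diagonal line" behavior described for a well-fitting model.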

[Figure 4: two ROC panels (left: Gaussian noise, right: instrument noise), showing P(detection) against P(false alarm) for the Gaussian matched filter and the Student-t filter.]

FIG. 4. ROC curves for the Gaussian and the Student-t detection statistics in both data scenarios. The shaded area marks the region where any sensible detection statistic (one that is not worse than mere guessing) should lie.

The EM iterations were terminated whenever the improvement in logarithmic likelihood over the previous iteration fell below 10⁻⁶. In this example setting, this led to an average number of four EM iterations for each conditional likelihood maximization in both noise scenarios. The eventual maximized likelihood then is given by the overall maximum over the conditional maxima, and as the detection statistic we use the maximized likelihood ratio p(d|θ̂)/p(d|~0). The algorithm used was essentially the one described in Appendix A3d.

E. Simulation results

Figure 3 shows the resulting detection statistic values (maximized likelihood ratios) under the Gaussian and the Student-t models, both when a signal is injected and when the data are noise only. The signal injections here were all done at the same amplitude relative to the noise spectrum (SNR ρ = 5.257). In general, both detection statistics are very similar; the Student-t likelihood ratio tends to turn out slightly lower than the Gaussian one, in particular in the case of real interferometer noise.

The question of to what extent these differences affect the ability to discriminate signals from noise is approached by considering receiver operating characteristic (ROC) curves. ROC curves are based on the detection statistics' (here: empirical) distributions. Placing different detection thresholds on a detection statistic yields a corresponding false alarm probability (based on the distribution under the noise-only hypothesis) as well as a detection probability (based on the distribution under the particular signal hypothesis). The ROC curve illustrates these combinations over varying threshold values [42].

Figure 4 shows ROC curves for the Gaussian and the Student-t filter for both noise cases. In the case of simulated Gaussian noise, both detection statistics perform almost identically. For real instrument noise, on the other hand, the Student-t model is able to provide a significantly greater detection probability, especially at low false-alarm probabilities. A remarkable feature of the ROC curves for instrumental noise is that for very low false alarm probabilities both filters eventually perform as poorly as mere guessing. The Student-t filter is able to sustain its discriminating power down to lower false alarm rates, though. This effect is connected to the frequency of noise outliers ("glitches") in the data, leading to very large detection statistic values even in the absence of a signal.

Figure 5 shows the corresponding detection thresholds as a function of false-alarm probability. The point where the detection threshold reaches the injected signals' SNR is where the corresponding detection probability is ≈ 50%. One can see that, due to the heavy-tailed distribution of detection statistics in the case of actual instrument noise, the detection threshold necessary for low false-alarm probabilities very quickly grows beyond values that could obviously be attributed to the signal injections considered here; the rate of noise transients of "SNR" greater than the injections' SNR exceeds the false alarm rate (in a realistic search, some of these might actually be vetoed beforehand). This effect is very obvious here also because signal injections were done only at a single SNR, but it will of course persist for other SNR distributions; assuming other SNR distributions for injections will affect the detection probability, but not the detection threshold, i.e., the detection procedure itself.

The exact relative performance of both methods of course depends on the details of the particular detection problem: the kind of signal searched for, the parameter space, the noise characteristics, the data conditioning, and the tuning parameters. The ROC curves shown above are based on a particular, artificial signal population, but their general features persist in a number of additional simulations not shown here, for a range of degrees-of-freedom settings, injection SNRs, data from a different instrument, and data from a different time period.
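The empirical ROC construction described above can be sketched as follows, with made-up Gaussian toy statistics standing in for the maximized likelihood ratios; all names and numbers are ours:

```python
import numpy as np

def roc_curve(stat_noise, stat_signal, thresholds):
    """Empirical ROC: for each threshold, the false alarm probability is the
    fraction of noise-only statistics above it, and the detection probability
    the fraction of signal-injection statistics above it."""
    pfa = np.array([(stat_noise > t).mean() for t in thresholds])
    pdet = np.array([(stat_signal > t).mean() for t in thresholds])
    return pfa, pdet

rng = np.random.default_rng(3)
noise_stat = rng.standard_normal(10_000)         # noise-only statistics (toy)
signal_stat = rng.standard_normal(10_000) + 3.0  # signal-injection statistics (toy)

thr = np.linspace(-4, 8, 200)
pfa, pdet = roc_curve(noise_stat, signal_stat, thr)
```

Raising the threshold trades detection probability against false alarm probability; a useful statistic keeps its ROC curve above the diagonal pdet = pfa ("mere guessing").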

[Figure 5: detection thresholds on 2 log(max. LR) against P(false alarm), with curves for the four combinations of noise type (Gaussian/instrument) and filter (Gaussian/Student-t); a horizontal line marks the injected signals' SNR of 5.257.]

FIG. 5. Detection thresholds on the maximized likelihood ratio (the detection statistic), corresponding to certain false-alarm probabilities. These thresholds are based on the detection statistic's distribution in the absence of a signal. The horizontal line indicates the injected signals' SNR (see also Fig. 4).

V. CONCLUSIONS

We introduced a generalization of the matched filter that is commonly applied in signal detection problems. The Student-t filter is derived as a maximum-likelihood detection method based on a Student-t distribution for the noise rather than a Gaussian distribution, which would instead again yield the common matched filter. On the technical side, it generalizes a least-squares method to an adaptive variety. While a "Gaussian" matched filter is certainly appropriate when the assumption of stationary Gaussian noise with a known spectrum is met, there are several ways to motivate the Student-t filter as a robust alternative when these assumptions are violated: (i) "theoretically": the Student-t model allows for uncertainty in the PSD, heavier-tailed noise, and outliers; (ii) "heuristically": the resulting adaptive least-squares method is less outlier-sensitive; and (iii) "pragmatically": the filter may turn out more effective in practice, as in the realistic example shown above. Besides that, being a generalization of the (Gaussian) matched filter, it should generally be able to perform as well or better. The question of course is whether the gain in detection efficiency is worth the additional implementation, tuning, and computational effort. The difference in computational cost for deriving both detection statistics suggests that a combined, hierarchical search strategy may also be worth considering.

In the example shown above, the Student-t model's degrees-of-freedom parameter was treated as a single constant. In the context of gravitational-wave interferometric data, this is an oversimplification; a study of actual instrument noise shows that the Fourier-domain data's tail behavior clearly depends on the frequency [43, 44]. Accounting for this effect in an actual search by fitting individual νj parameters for different frequency ranges may yield a significant improvement. It may also make sense to specify the degrees-of-freedom parameter dependent on additional information, like e.g. the data quality category [45].

It will be interesting to study the Student-t filter's performance in a realistic search for gravitational-wave signals, in conjunction with the existing infrastructure (data quality flags, additional vetoes, etc.) and in comparison with the conventional matched filter [18, 19]. We are also investigating the use of the Student-t model in the context of Bayesian model selection [46]. Here it may again yield a more robust discriminator for actual signals against noise; on the computational side this problem is based on integration of the likelihood, rather than maximization, and we do not expect a difference in computational cost between Gaussian and Student-t models. We expect the Student-t filtering procedure to be useful also in many other signal-processing contexts, wherever robustness or uncertainty in the power spectrum is an issue.

ACKNOWLEDGMENTS

The author wishes to thank Nelson Christensen, Drew Keppel, Karsten Lübke, Renate Meyer, and Reinhard Prix for fruitful discussions at various stages of this work, and the LIGO Scientific Collaboration (LSC) for providing the data used here. The author gratefully acknowledges the support of the United States National Science Foundation for the construction and operation of the LIGO Laboratory, and the Science and Technology Facilities Council of the United Kingdom, the Max-Planck-Society, and the State of Niedersachsen/Germany for support of the construction and operation of the GEO600 detector. The author also gratefully acknowledges the support of the research by these agencies and by the Australian Research Council, the International Science Linkages program of the Commonwealth of Australia, the Council of Scientific and Industrial Research of India, the Istituto Nazionale di Fisica Nucleare of Italy, the Spanish Ministerio de Educación y Ciencia, the Conselleria d'Economia, Hisenda i Innovació of the Govern de les Illes Balears, the Royal Society, the Scottish Funding Council, the Scottish Universities Physics Alliance, the National Aeronautics and Space Administration, the Carnegie Trust, the Leverhulme Trust, the David and Lucile Packard Foundation, the Research Corporation, and the Alfred P. Sloan Foundation.

APPENDIX

1. Discrete Fourier transform

The Fourier transform convention used in this paper is specified below; it is defined for a real-valued function h of time t, sampled at N discrete time points at a sampling rate of 1/∆t, and it maps from

\{ h(t) \in \mathbb{R} : t = 0, \Delta_t, 2\Delta_t, \ldots, (N-1)\Delta_t \}    (A1)

to a function of frequency f,

\{ \tilde{h}(f) \in \mathbb{C} : f = 0, \Delta_f, 2\Delta_f, \ldots, (N-1)\Delta_f \},    (A2)

where ∆f = 1/(N∆t) and

\tilde{h}(f) = \sum_{j=0}^{N-1} h(j\Delta_t)\,\exp(-2\pi i\, j\Delta_t\, f)    (A3)

[3].
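When evaluated at the grid frequencies f = k∆f, the convention (A3) coincides with the standard unnormalized DFT, which a few lines of code can confirm (names are ours):

```python
import numpy as np

rng = np.random.default_rng(4)
N, dt = 16, 0.25
h = rng.standard_normal(N)

# Eq. (A3) evaluated at the grid frequencies f_k = k / (N * dt):
t = np.arange(N) * dt
f = np.arange(N) / (N * dt)
H = np.array([np.sum(h * np.exp(-2j * np.pi * t * fk)) for fk in f])

# exp(-2*pi*i * j*dt * k/(N*dt)) = exp(-2*pi*i * j*k/N), the plain DFT kernel:
assert np.allclose(H, np.fft.fft(h))
```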
2. Applying the EM algorithm

a. Preliminaries

The expectation-maximization (EM) algorithm is required for maximizing the Student-t likelihood; see Sec. IIIC. What is desired is the maximum of the marginal likelihood p(d|θ), which is equivalent to the maximum of the marginal density p(θ|d) when assuming a uniform prior distribution on θ. What is required in order to apply the EM algorithm are expressions involving the marginalized σj² parameters, namely, the conditional distribution P(~σ²|θ, d) and the joint density p(θ, ~σ²|d). The EM algorithm then iteratively maximizes the likelihood function by performing alternating "expectation" and "maximization" steps [8, 33].

The conditional posterior distribution P(σj²|θ, d) of the jth variance parameter σj² for given data and signal sθ is a scaled inverse χ² distribution,

\text{Inv-}\chi^2\!\left(\nu_j+2,\;\frac{\nu_j S_1(f_j) + 4\frac{\Delta_t}{N}\left|\tilde{d}(f_j)-\tilde{s}_{\theta}(f_j)\right|^2}{\nu_j+2}\right)    (A4)

[3], with probability density function

f(\sigma_j^2) \propto \left(\sigma_j^2\right)^{-\frac{\nu_j+4}{2}} \exp\!\left(-\frac{\nu_j S_1(f_j) + 4\frac{\Delta_t}{N}\left|\tilde{d}(f_j)-\tilde{s}_{\theta}(f_j)\right|^2}{2\sigma_j^2}\right)    (A5)

[3]. The conditional distribution of the data d for given ~σ² and signal parameters θ, P(d|θ, ~σ²), is Gaussian [3], and the variance parameters' prior, P(~σ²), again is Inv-χ² [3]. The joint conditional density of θ and ~σ² for given data d is given by

\log p(\theta,\vec\sigma^2\,|\,d) \propto \log\!\left( p(d\,|\,\theta,\vec\sigma^2) \times p(\theta,\vec\sigma^2) \right)    (A6)
\propto -\sum_j \left( \log(\sigma_j^2) + \frac{4\frac{\Delta_t}{N}\left|\tilde{d}(f_j)-\tilde{s}_{\theta}(f_j)\right|^2}{2\sigma_j^2} \right) - \sum_j \left( \left(1+\tfrac{\nu_j}{2}\right)\log(\sigma_j^2) + \frac{\nu_j S_1(f_j)}{2\sigma_j^2} \right)    (A7)
= -\sum_j \left( \left(2+\tfrac{\nu_j}{2}\right)\log(\sigma_j^2) + \frac{\nu_j S_1(f_j) + 4\frac{\Delta_t}{N}\left|\tilde{d}(f_j)-\tilde{s}_{\theta}(f_j)\right|^2}{2\sigma_j^2} \right)    (A8)

[3].

b. The E step

For the EM algorithm's expectation step, one needs to evaluate the conditional posterior expectation

\mathrm{E}_{P(\vec\sigma^2|\theta=\theta_0,d)}\!\left[\log p(\theta,\vec\sigma^2\,|\,d)\right] = \int \log p(\theta,\vec\sigma^2\,|\,d)\; p(\vec\sigma^2\,|\,\theta=\theta_0,d)\; \mathrm{d}\vec\sigma^2    (A9)

as a function of θ for some given θ0 [8]. Here,

\int \log p(\theta,\vec\sigma^2|d)\, p(\vec\sigma^2|\theta=\theta_0,d)\, \mathrm{d}\vec\sigma^2    (A10)
\propto \int -\sum_j \left( \left(2+\tfrac{\nu_j}{2}\right)\log(\sigma_j^2) + \frac{\nu_j S_1(f_j)+4\frac{\Delta_t}{N}|\tilde{d}(f_j)-\tilde{s}_{\theta}(f_j)|^2}{2\sigma_j^2} \right) \times \left(\sigma_j^2\right)^{-(2+\nu_j/2)} \exp\!\left(-\frac{\nu_j S_1(f_j)+4\frac{\Delta_t}{N}|\tilde{d}(f_j)-\tilde{s}_{\theta_0}(f_j)|^2}{2\sigma_j^2}\right) \mathrm{d}\vec\sigma^2    (A11)
\propto -\sum_j \int \frac{4\frac{\Delta_t}{N}|\tilde{d}(f_j)-\tilde{s}_{\theta}(f_j)|^2}{2\sigma_j^2}\, \left(\sigma_j^2\right)^{-(2+\nu_j/2)} \exp\!\left(-\frac{\nu_j S_1(f_j)+4\frac{\Delta_t}{N}|\tilde{d}(f_j)-\tilde{s}_{\theta_0}(f_j)|^2}{2\sigma_j^2}\right) \mathrm{d}\sigma_j^2,    (A12)

where the logarithmic terms, which do not depend on θ, have been dropped. The remaining integrand contains, up to its normalizing constant, the density function of the Inv-χ² distribution (A4) evaluated at θ0, so the integral is (up to a θ-independent factor) the conditional posterior expectation of 1/σj²:

\int \frac{1}{\sigma_j^2}\left(\sigma_j^2\right)^{-(2+\nu_j/2)} \exp\!\left(-\frac{\nu_j S_1(f_j)+4\frac{\Delta_t}{N}|\tilde{d}(f_j)-\tilde{s}_{\theta_0}(f_j)|^2}{2\sigma_j^2}\right) \mathrm{d}\sigma_j^2 \;\propto\; \frac{\nu_j+2}{\nu_j S_1(f_j)+4\frac{\Delta_t}{N}|\tilde{d}(f_j)-\tilde{s}_{\theta_0}(f_j)|^2},    (A13)

so that

\int \log p(\theta,\vec\sigma^2|d)\, p(\vec\sigma^2|\theta=\theta_0,d)\, \mathrm{d}\vec\sigma^2 \;\propto\; -\sum_j \frac{1}{2}\,\frac{4\frac{\Delta_t}{N}\left|\tilde{d}(f_j)-\tilde{s}_{\theta}(f_j)\right|^2}{\frac{\nu_j}{\nu_j+2} S_1(f_j) + \frac{4}{\nu_j+2}\frac{\Delta_t}{N}\left|\tilde{d}(f_j)-\tilde{s}_{\theta_0}(f_j)\right|^2}    (A14)
= -\frac{1}{2}\sum_j \frac{\left|\tilde{d}(f_j)-\tilde{s}_{\theta}(f_j)\right|^2}{\frac{N}{4\Delta_t}\left[\frac{\nu_j}{\nu_j+2} S_1(f_j) + \frac{2}{\nu_j+2}\frac{2\Delta_t}{N}\left|\tilde{d}(f_j)-\tilde{s}_{\theta_0}(f_j)\right|^2\right]} \;=:\; \mathcal{E}(\theta_0,\theta).    (A15)
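The rearrangement from (A14) into the "weighted noise spectrum" form (A15) can be verified numerically; a sketch with made-up inputs, all variable names ours:

```python
import numpy as np

rng = np.random.default_rng(5)
N, dt, nu = 64, 0.5, 10.0
S1 = rng.uniform(0.5, 2.0, N // 2)        # noise spectrum (made up)
X_theta = rng.uniform(0.0, 3.0, N // 2)   # |d~ - s~_theta|^2   (made up)
X_0 = rng.uniform(0.0, 3.0, N // 2)       # |d~ - s~_theta0|^2  (made up)

# Eq. (A14):
a14 = -np.sum(0.5 * 4.0 * dt / N * X_theta
              / (nu / (nu + 2) * S1 + 4.0 / (nu + 2) * dt / N * X_0))
# Eq. (A15), the same quantity with the denominator written as a
# nu-weighted average of S1 and the residual periodogram:
a15 = -0.5 * np.sum(X_theta / (N / (4.0 * dt)
              * (nu / (nu + 2) * S1 + 2.0 / (nu + 2) * 2.0 * dt / N * X_0)))

assert np.isclose(a14, a15)
```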

c. The M step

In the EM algorithm's maximization step, the above expectation E(θ0, θ) (A15) needs to be maximized with respect to the parameter θ. The parameter value maximizing the expectation then constitutes the next iteration's "new" θ0 value, for which the expectation again is maximized, and so forth [8]. As one can see from expression (A15), maximization of the expectation again amounts to minimizing a weighted sum of squares, as in the Gaussian matched filter described above.

3. Pseudocode matched and Student-t filters

a. Preliminaries

This section sketches actual implementations of Student-t and (Gaussian) matched filters in comparison. In the following we use essentially the same conventions as before; we consider a time series d of length N, sampled at a sampling interval of ∆t. The signal waveform here is assumed to be a linear combination of a sine and a cosine component (s_{s,θ}, s_{c,θ}); it has an associated arrival time parameter and possibly additional parameters θ (as in (26)). Additional waveform parameters (other than amplitude, phase, and time) are then commonly treated by running several matched filters corresponding to different values of θ. The generalization to the case of more than two linear signal components should be straightforward. The profile likelihood will be evaluated along a discrete grid of time points τi (i = 1,...,m), where the special case of τi = i∆t and m = N is of particular interest. The filter's output each time is a single number, the maximized (logarithmic) likelihood ratio of signal vs. no-signal models. We will be making use of the inner product / quadratic form notation ⟨a, b; S⟩ as defined in (19). Implementations of the algorithms sketched in Secs. A3c and A3e are also provided in [35].

b. The "Gaussian" matched filter: general implementation

The first algorithm (Table I) is a "naive" matched-filter implementation that maximizes the likelihood (-ratio) over a given grid of m time points (τ). The algorithm mainly consists of a loop over time points, where for each time point the (conditional) likelihood is maximized over amplitude and phase. In order to match signal and data for a certain signal arrival time, the data d are time-shifted against the signal waveforms s_{s/c}. The eventual result is the profile likelihood evaluated at the specified time points, the maximum of which then constitutes the generalized likelihood ratio detection statistic that is returned.

TABLE I. Matched filter, general implementation.
    normS = ⟨s_{s,θ}, s_{s,θ}; S1⟩
    normC = ⟨s_{c,θ}, s_{c,θ}; S1⟩
    for (i = 1,...,m) do  // loop over time points:
        for (j = 0,...,N/2) do  // time shift the data:
5:          d̃′_j = d̃_j × exp(2πi f_j τ_i)
        end for
        prodS = ⟨s_{s,θ}, d′; S1⟩
        prodC = ⟨s_{c,θ}, d′; S1⟩
        // compute log-likelihood ratio / profile likelihood:
10:     maxLLR[i] = (prodS)²/normS + (prodC)²/normC
    end for
    return max(maxLLR)

c. The "Gaussian" matched filter: efficient implementation

If the time points to be maximized over are taken to be the same as the data time series' points (τi = i∆t, i = 1,...,m = N), then the matched-filtering procedure may be implemented much more efficiently. The algorithm shown in Table II gives identical results to the previous one, but it is more efficient, as it takes advantage of a Fourier transform to essentially maximize over amplitude, phase, and time simultaneously (see also Sec. IID).

TABLE II. Matched filter, efficient implementation.
    normS = ⟨s_{s,θ}, s_{s,θ}; S1⟩
    normC = ⟨s_{c,θ}, s_{c,θ}; S1⟩
    for (j = 0,...,(N−1)) do  // correlate data and signals:
        corS[j+1] = d̃_j × s̃*_{s,θ,j} / S1(f_j)
5:      corC[j+1] = d̃_j × s̃*_{c,θ,j} / S1(f_j)
    end for
    // apply Fourier transforms:
    FTS = DFT(corS)
    FTC = DFT(corC)
10: for (i = 1,...,N) do  // profile likelihood (-ratio):
        maxLLR[i] = (2∆t/N)² ((FTS[N+1−i])²/normS + (FTC[N+1−i])²/normC)
    end for
    return max(maxLLR)
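The core trick of Table II, evaluating the template/data correlation for all time shifts at once via a Fourier transform, can be sketched as follows. This is a white-noise simplification (so the 1/S1 weighting reduces to a constant), and all names are ours:

```python
import numpy as np

rng = np.random.default_rng(6)
N = 128
s = np.zeros(N)
s[:16] = np.hanning(16) * np.sin(np.arange(16))   # toy template waveform
d = rng.standard_normal(N)                        # toy data

# All circular-shift correlations <shift(s, i), d> in one pass: by the
# correlation theorem, IDFT(DFT(d) * conj(DFT(s)))[i] = sum_n d[n] s[n - i].
corr_fft = np.fft.ifft(np.fft.fft(d) * np.conj(np.fft.fft(s))).real

# Direct evaluation for a few shifts agrees:
for i in (0, 5, 40):
    assert np.isclose(corr_fft[i], np.dot(np.roll(s, i), d))
```

As noted below for the actual filter, one would in practice restrict attention to shifts that do not wrap the template circularly around the data's end points.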

In practice, one may want to restrict the profile likelihood maximization (line 13) to the subset of sensible time shifts that do not "wrap" the signal circularly around the data's end points. Instead of a Fourier transform, one could also implement an inverse Fourier transform and would then also not need to time-reverse the result's indices (line 11).

d. The Student-t filter: general implementation

This algorithm (see Table III) again is a "general" version of the Student-t filter, analogous to the general matched filter (Sec. A3b), where the set of time points τ is not restricted. The EM algorithm here is applied at the level of each single amplitude/phase maximization, conditional on some time shift τi. The EM component requires the specification of a threshold ∆max on the improvement in logarithmic maximized likelihood ratio (e.g. 10⁻⁶), and a threshold kmax on the number of EM iterations (e.g. 100). The Student-t likelihood function

p(x, S_1, \nu) \propto \exp\!\left(-\sum_j \frac{\nu_j+2}{2}\,\log\!\left(1+\frac{1}{\nu_j}\,\frac{|\tilde{x}_j|^2}{\frac{N}{4\Delta_t} S_1(f_j)}\right)\right)

(see also (29)) only needs to be computed up to a proportionality constant here, as only the likelihood ratio is of eventual interest.

TABLE III. Student-t filter, general implementation.
    LL0 = log(p(d, S1, ν))  // log-likelihood, noise-only model
    for (i = 1,...,m) do  // loop over time points:
        for (j = 0,...,N/2) do  // time shift the data:
            d̃′_j = d̃_j × exp(2πi f_j τ_i)
5:      end for
        // EM iterations:
        k = 1; ∆LLR = 1; LLRprev = 0; S1⋆ = S1
        while (∆LLR > ∆max) and (k ≤ kmax) do
            normS = ⟨s_{s,θ}, s_{s,θ}; S1⋆⟩
10:         normC = ⟨s_{c,θ}, s_{c,θ}; S1⋆⟩
            prodS = ⟨s_{s,θ}, d′; S1⋆⟩
            prodC = ⟨s_{c,θ}, d′; S1⋆⟩
            β̂_s = prodS/normS
            β̂_c = prodC/normC
15:         n̂ = d′ − (β̂_s s_{s,θ} + β̂_c s_{c,θ})  // vector of noise residuals
            LL1 = log(p(n̂, S1, ν))  // log-likelihood, signal model
            LLR = LL1 − LL0  // log-likelihood ratio
            ∆LLR = LLR − LLRprev
            LLRprev = LLR
20:         for (j = 0,...,N/2) do  // adapt the spectrum:
                S1⋆(f_j) = (ν_j/(ν_j+2)) S1(f_j) + (2/(ν_j+2)) (2∆t/N) |ñ̂_j|²
            end for
            k = k + 1
        end while
25:     maxLLR[i] = LLR  // profile likelihood (-ratio)
    end for
    return max(maxLLR)

e. The Student-t filter: efficient implementation

The Student-t filter may also be implemented more efficiently in case the signal arrival times to maximize over are taken to be the time points of the original time series (τi = i∆t, i = 1,...,m = N, as in Sec. A3c). This implementation (Table IV) then requires one to move the level at which the EM algorithm is applied from the conditional maximization over amplitude and phase to the joint amplitude/phase/time maximization; effectively, this implementation iteratively runs several matched filters (see lines 6–16) while adapting the noise spectrum in between. It is unclear whether or how the level at which the EM algorithm is applied affects the results; as noted in Sec. IIID, the likelihood may be multimodal, and different implementations might end up with differing maximization results, but whether this actually poses a problem in practice is not obvious.

TABLE IV. Student-t filter, efficient implementation.
    LL0 = log(p(d, S1, ν))  // log-likelihood, noise-only model
    // EM iterations:
    k = 1; ∆LLR = 1; LLRprev = 0; S1⋆ = S1
    while (∆LLR > ∆max) and (k ≤ kmax) do
5:      // the "plain" matched filter:
        normS = ⟨s_{s,θ}, s_{s,θ}; S1⋆⟩
        normC = ⟨s_{c,θ}, s_{c,θ}; S1⋆⟩
        for (j = 0,...,(N−1)) do
            corS[j+1] = d̃_j × s̃*_{s,θ,j} / S1⋆(f_j)
10:         corC[j+1] = d̃_j × s̃*_{c,θ,j} / S1⋆(f_j)
        end for
        FTS = DFT(corS)
        FTC = DFT(corC)
        for (i = 1,...,N) do
15:         maxLLR[i] = (2∆t/N)² ((FTS[N+1−i])²/normS + (FTC[N+1−i])²/normC)
        end for
        // end of "plain" matched filter.
        // Determine best-fitting template, residuals, etc.:
        imax = argmax_i maxLLR[i]
20:     for (j = 0,...,N/2) do  // time shift the data:
            d̃′_j = d̃_j × exp(2πi f_j τ_imax)
        end for
        prodS = ⟨s_{s,θ}, d′; S1⋆⟩
        prodC = ⟨s_{c,θ}, d′; S1⋆⟩
25:     β̂_s = prodS/normS
        β̂_c = prodC/normC
        n̂ = d′ − (β̂_s s_{s,θ} + β̂_c s_{c,θ})  // vector of noise residuals
        LL1 = log(p(n̂, S1, ν))  // log-likelihood, signal model
        LLR = LL1 − LL0  // log-likelihood ratio
30:     ∆LLR = LLR − LLRprev
        LLRprev = LLR
        for (j = 0,...,N/2) do  // adapt the spectrum:
            S1⋆(f_j) = (ν_j/(ν_j+2)) S1(f_j) + (2/(ν_j+2)) (2∆t/N) |ñ̂_j|²
        end for
35:     k = k + 1
    end while
    return LLR
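The inner EM loop of Tables III/IV, an iterative least-squares fit with spectrum adaptation in between, can be sketched in reduced form. This is a bare sketch under the paper's conventions, fitting only a single real amplitude (phase and arrival time omitted); all names are ours:

```python
import numpy as np

def student_t_fit(dft_data, dft_signal, S1, nu, N, dt, kmax=50, tol=1e-8):
    """EM-style adaptive least-squares fit of a single real amplitude,
    mimicking the spectrum-adaptation step of Tables III/IV."""
    S = S1.copy()
    amp_prev = np.inf
    for _ in range(kmax):
        w = 1.0 / S                                  # per-bin weights
        amp = (np.sum(w * np.conj(dft_signal) * dft_data).real
               / np.sum(w * np.abs(dft_signal) ** 2))
        resid = dft_data - amp * dft_signal
        # adapt the spectrum: nu-weighted average of S1 and residual power
        S = (nu / (nu + 2)) * S1 + (2 / (nu + 2)) * (2 * dt / N) * np.abs(resid) ** 2
        if abs(amp - amp_prev) < tol:
            break
        amp_prev = amp
    return amp

rng = np.random.default_rng(7)
signal = rng.standard_normal(32) + 1j * rng.standard_normal(32)  # toy template
S1 = np.ones(32)
data = 2.0 * signal                # noise-free toy data with amplitude 2
amp = student_t_fit(data, signal, S1, nu=10.0, N=64, dt=0.5)
```

On noise-free data the iteration recovers the injected amplitude immediately; on outlier-contaminated data the adapted weights down-weight the affected bins, which is the robustness mechanism discussed in the main text.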

ν = 1 dom. Then the random vector ν = 2 ν = 5 ν = 10 X 1 A ν = 20 ν = ∞ = Y ! C/ν B !

follows a bivariate Student-pt distribution with a diagonal probability density , exactly like the real and imaginary components ofn ˜(fj )[8]. The root-mean-square figure corresponding to the power then may be written as 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0 2 4 6 8 10 A 2 B 2 σ + σ / 2 X2 + Y 2 = v2σ2 = √2σ2 D, u   C/ν   u p t (A16) where the D, being a ratio of χ2 dis- tributed random variables that are normalized by their ν = 1 respective degrees-of-freedom, follows an F (2,ν) distri- probability density ν = 2 ν = 5 bution with 2 and ν degrees of freedom [7]. ν = 10 ν = 20 ν = ∞ 1e−06 1e−04 1e−02 0 2 4 6 8 10 b. Probability density function, etc.

In the Gaussian noise model (see Sec. IIB), the FIG. 6. Probability density functions of Student-Rayleigh distributions for varying degrees of freedom ν and fixed noise power at the jth frequency bin, n˜(fj) , follows a 2 Rayleigh distribution with probability density function scale σ = 1. For ν = ∞, the distribution corresponds to the usual (“Gaussian”) Rayleigh distribution. x x2 f (x σ) = 2 exp 2 , (A17) R | σ − 2σ

 N where the scale parameter σ is given as σ = S1(fj ). tionally, this latter implementation should be much eas- 4∆t ier, though. Another difference to note is that while the The analogue Student-Rayleigh probabilityq distribu- matched filter allows us to return the profile likelihood tion in the Student-t noise model (see Sec. IIIA) is de- as a function of time (the SNR time series), only the fined through its density function Student-t filter implementation from Sec. A 3 d is able to x x2 provide this, while the more efficient implementation will fSR(x σ, ν) = 2 fF (2,ν) 2 , (A18) | σ 2σ only return the overall maximum.  where fF (2,ν)( ) is the probability density function of an F (2,ν) distribution· with 2 and ν degrees of freedom. Similarly, the cumulative distribution function and quan- tile function are given by 4. The Student-Rayleigh distribution x2 F (x σ, ν)= F 2 and (A19) SR | F (2,ν) 2σ a. Relation to the F distribution Q (p σ, ν)= 2σ2 Q  (p), (A20) SR | F (2,ν) q The noise power’s probability distribution under the where FF (2,ν)( ) and QF (2,ν)( ) are the F distribution’s Student-t model (see (33), Sec. IVC) may be related cumulative distribution· function· and . to Snedecor’s F distribution. First, real and imagi- Figure 6 illustrates probability density functions of nary parts of the jth element of the discretely Fourier- Student-Rayleigh probability distributions for varying transformed vector n follow a multivariate (bivariate) degrees of freedom ν. For ν = , the distribution corre- Student-t distribution (see Sec. IIIA). Let A and B be sponds to the usual (“Gaussian”)∞ Rayleigh distribution. independent Gaussian random variables with zero mean Note, in particular, the differing tail behavior (analogous and σ. Furthermore, let C be a to Fig. 1) that is apparent especially in the logarithmic 2 χν distributed random variable with ν degrees of free- .
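Since the first degrees-of-freedom parameter in (A18)–(A20) equals 2, the F(2,ν) distribution has elementary closed forms, f(u) = (1 + 2u/ν)^(−(ν+2)/2), F(u) = 1 − (1 + 2u/ν)^(−ν/2) and Q(p) = (ν/2)((1−p)^(−2/ν) − 1), so the Student-Rayleigh density, CDF and quantile function may be evaluated without special functions. A sketch (function names are illustrative):

```python
import math

def f_F2nu(u, nu):
    """pdf of the F(2, nu) distribution (closed form for d1 = 2)."""
    return (1.0 + 2.0 * u / nu) ** (-(nu + 2.0) / 2.0)

def student_rayleigh_pdf(x, sigma, nu):
    """Density (A18): f_SR(x|sigma,nu) = x/sigma^2 * f_F(2,nu)(x^2/(2 sigma^2))."""
    return x / sigma ** 2 * f_F2nu(x ** 2 / (2.0 * sigma ** 2), nu)

def student_rayleigh_cdf(x, sigma, nu):
    """CDF (A19), using F_F(2,nu)(u) = 1 - (1 + 2u/nu)^(-nu/2)."""
    return 1.0 - (1.0 + x ** 2 / (nu * sigma ** 2)) ** (-nu / 2.0)

def student_rayleigh_quantile(p, sigma, nu):
    """Quantile (A20), using Q_F(2,nu)(p) = (nu/2)((1-p)^(-2/nu) - 1)."""
    return math.sqrt(2.0 * sigma ** 2 * (nu / 2.0) * ((1.0 - p) ** (-2.0 / nu) - 1.0))
```

As ν → ∞, (1 + x²/(νσ²))^(−ν/2) → exp(−x²/(2σ²)), so the CDF converges to the usual Rayleigh CDF, matching the limiting behavior noted above.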

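The F-distribution construction (A16) can also be checked by simulation: draw A, B and C as above, form √(X² + Y²), and compare the empirical median against the closed-form Student-Rayleigh quantile implied by (A20). A small Monte Carlo sketch using only the standard library (names illustrative; χ²ν is built from ν squared standard normals, so ν is taken as an integer here):

```python
import math
import random

def sample_power_rms(sigma, nu, rng):
    """One draw of sqrt(X^2 + Y^2) with (X, Y) = (A, B)/sqrt(C/nu) as in
    (A16): A, B Gaussian with std. dev. sigma, C chi^2 with nu d.o.f."""
    a = rng.gauss(0.0, sigma)
    b = rng.gauss(0.0, sigma)
    c = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))
    return math.sqrt((a * a + b * b) / (c / nu))

rng = random.Random(1)
nu, sigma = 4, 1.0
draws = sorted(sample_power_rms(sigma, nu, rng) for _ in range(20000))
median_mc = draws[len(draws) // 2]
# closed-form median, Q_SR(1/2 | sigma, nu) = sigma*sqrt(nu((1/2)^(-2/nu) - 1)):
median_theory = sigma * math.sqrt(nu * (0.5 ** (-2.0 / nu) - 1.0))
```

With 20000 draws the empirical and closed-form medians should agree to within a few standard errors of the sample median.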
[1] K. S. Thorne. Gravitational radiation. In S. W. Hawking and W. Israel, editors, 300 years of gravitation, chapter 9, pages 330–358. Cambridge University Press, Cambridge, 1987.
[2] B. F. Schutz. Gravitational wave astronomy. Classical and Quantum Gravity, 16(12A):A131–A156, December 1999.
[3] C. Röver, R. Meyer, and N. Christensen. Modelling coloured residual noise in gravitational-wave signal processing. Classical and Quantum Gravity, 28(1):015010, January 2011.
[4] G. L. Turin. An introduction to matched filters. IRE Transactions on Information Theory, 6(3):311–329, June 1960.
[5] N. Choudhuri, S. Ghosal, and A. Roy. Contiguity of the Whittle measure for a Gaussian time series. Biometrika, 91(4):211–218, 2004.
[6] L. S. Finn. Detection, measurement, and gravitational radiation. Physical Review D, 46(12):5236–5249, December 1992.
[7] A. M. Mood, F. A. Graybill, and D. C. Boes. Introduction to the theory of statistics. McGraw-Hill, New York, 3rd edition, 1974.
[8] A. Gelman, J. B. Carlin, H. Stern, and D. B. Rubin. Bayesian data analysis. Chapman & Hall / CRC, Boca Raton, 1997.
[9] J. O. Berger. Statistical decision theory and Bayesian analysis. Springer-Verlag, 2nd edition, 1985.
[10] R. Prix and B. Krishnan. Targeted search for continuous gravitational waves: Bayesian versus maximum-likelihood statistics. Classical and Quantum Gravity, 26(20):204013, October 2009.
[11] A. C. Searle. Monte-Carlo and Bayesian techniques in gravitational wave burst data analysis. Arxiv preprint 0804.1161 [gr-qc], April 2008.
[12] C. Röver, M.-A. Bizouard, N. Christensen, H. Dimmelmeier, I. S. Heng, and R. Meyer. Bayesian reconstruction of gravitational wave burst signals from simulations of rotating stellar core collapse and bounce. Physical Review D, 80(10):102004, November 2009.
[13] K. Cannon, A. Chapman, C. Hanna, D. Keppel, A. C. Searle, and A. Weinstein. Singular value decomposition applied to compact binary coalescence gravitational-wave signals. Physical Review D, 82(4):044025, August 2010.
[14] P. Jaranowski, A. Królak, and B. Schutz. Data analysis of gravitational-wave signals from spinning neutron stars: The signal and its detection. Physical Review D, 58(6):063001, September 1998.
[15] J. Neter, M. H. Kutner, C. J. Nachtsheim, and W. Wasserman. Applied linear statistical models. McGraw-Hill, New York, 4th edition, 1996.
[16] C. Röver, C. Messenger, and R. Prix. Bayesian versus frequentist upper limits. Arxiv preprint 1103.2987, February 2011.
[17] B. Allen et al. Observational limit on gravitational waves from binary neutron stars in the galaxy. Physical Review Letters, 83(8):1498–1501, August 1999.
[18] B. Allen, W. G. Anderson, P. G. Brady, D. A. Brown, and J. D. E. Creighton. Findchirp: an algorithm for detection of gravitational waves from inspiraling compact binaries. Arxiv preprint gr-qc/0509116, September 2005.
[19] D. A. Brown. Using the Inspiral program to search for gravitational waves from low-mass binary inspiral. Classical and Quantum Gravity, 22(18):S1097–S1107, September 2005.
[20] M. Was et al. On the background estimation by time slides in a network of gravitational wave detectors. Classical and Quantum Gravity, 27(1):015005, January 2010.
[21] W. S. Gosset. The probable error of a mean. Biometrika, 6(1):1–25, March 1908.
[22] H. H. Kelejian and I. R. Prucha. Independent or uncorrelated disturbances in linear regression: An illustration of the difference. Economics Letters, 19(1):35–38, 1985.
[23] T. S. Breusch, J. C. Robertson, and A. H. Welsh. The emperor's new clothes: a critique of the multivariate t regression model. Statistica Neerlandica, 51(3):269–286, December 1997.
[24] K. L. Lange, R. J. A. Little, and J. M. G. Taylor. Robust statistical modeling using the t distribution. Journal of the American Statistical Association, 84(408):881–896, December 1989.
[25] J. Geweke. Bayesian treatment of the independent Student-t linear model. Journal of Applied Econometrics, 8:S19–S40, December 1993.
[26] D. R. Divgi. Robust estimation using Student's t distribution. CNA Research Memorandum CRM 90-217, Center for Naval Analyses, Alexandria, VA, USA, December 1990.
[27] J. B. McDonald and W. K. Newey. Partially adaptive estimation of regression models via the generalized t distribution. Econometric Theory, 4(3):428–457, December 1988.
[28] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel. Robust statistics: The approach based on influence functions. Wiley, New York, 1986.
[29] P. J. Huber and E. M. Ronchetti. Robust statistics. Wiley, 2nd edition, 2009.
[30] J. D. Creighton. Data analysis strategies for the detection of gravitational waves in non-Gaussian noise. Physical Review D, 60(2):021101, July 1999.
[31] B. Allen, J. D. E. Creighton, É. É. Flanagan, and J. D. Romano. Robust statistics for deterministic and stochastic gravitational waves in non-Gaussian noise: Frequentist analyses. Physical Review D, 65(12):122002, June 2002.
[32] B. Allen. χ² time-frequency discriminator for gravitational wave detection. Physical Review D, 71(6):062001, March 2005.
[33] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39(1):1–38, 1977.
[34] S. S. Wilks. The large-sample distribution of the likelihood ratio for testing composite hypotheses. The Annals of Mathematical Statistics, 9(1):60–62, March 1938.
[35] C. Röver. bspec: Bayesian spectral inference, version 1.3, 2011. R package. URL: http://cran.r-project.org/package=bspec.
[36] T. Mäkeläinen, K. Schmidt, and G. P. H. Styan. On the existence and uniqueness of the maximum likelihood estimate of a vector-valued parameter in fixed-size samples. The Annals of Statistics, 9(4):758–767, July 1981.
[37] T. Damour, B. R. Iyer, and B. S. Sathyaprakash. Comparison of search templates for gravitational waves from binary inspiral. Physical Review D, 63(4):044023, January 2001.
[38] B. P. Abbott et al. LIGO: the Laser Interferometer Gravitational-wave Observatory. Reports on Progress in Physics, 72(7):076901, July 2009.
[39] P. D. Welch. The use of Fast Fourier Transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms. IEEE Transactions on Audio and Electroacoustics, AU-15(2):70–73, June 1967.
[40] T. Tanaka and H. Tagoshi. Use of new coordinates for the template space in a hierarchical search for gravitational waves from inspiraling binaries. Physical Review D, 62(8):082001, October 2000.
[41] M. B. Wilk and R. Gnanadesikan. Probability plotting methods for the analysis of data. Biometrika, 55(1):1–17, March 1968.
[42] T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, June 2006.
[43] S. Waldman. Rayleigh distributions for H1, L1 for S6a. LIGO-Virgo collaboration internal report, November 2009.
[44] C. Röver. Degrees-of-freedom estimation in the Student-t noise model. Technical Report LIGO-T1100497, LIGO-Virgo collaboration, September 2011.
[45] J. Slutsky et al. Methods for reducing false alarms in searches for compact binary coalescences in LIGO data. Classical and Quantum Gravity, 27(16):165023, August 2010.
[46] J. Veitch and A. Vecchio. Bayesian approach to the follow-up of candidate gravitational wave signals. Physical Review D, 78(2):022001, July 2008.